Homework Assignment 6
COGS 118A: Supervised Machine Learning Algorithms
Instructions: Combine your answers to the questions below with your notebook to
create a PDF file; submit your file via Gradescope.
1 (8 points) Conceptual Questions
1. Which of the following best describes what support vectors are:
A. The decision boundary.
B. The positive and the negative planes.
C. The training samples that determine the positive and the negative planes.
D. The test samples that determine the positive and the negative planes.
2. When tuning the hyper-parameters of a model, we find the hyper-parameters that
A. minimize error on the training set
B. minimize error on the test set
C. maximize the margin of the classifier
D. minimize error on the validation set
3. Select all that apply. k-fold cross-validation is
A. Not necessary when you have huge amounts of labelled data
B. A way to do model selection while minimizing over-fitting
C. The only way to do hyper-parameter tuning
D. Subject to the bias-variance trade-off
4. When you increase k in cross-validation while keeping the dataset the same, you
A. Get a decrease in both the generalization error and the validation error
B. Get an increase in both the generalization error and the validation error
C. Get a decrease in the variability of the validation error across folds
D. Get an increase in the variability of the validation error across folds
2 (10 points) Cross-Validation
Given a training dataset $S_{\text{training}} = \{(x_i, y_i)\}$, $i = 1, \ldots, 6$, where $x_i \in \mathbb{R}$ is the feature
scalar and $y_i \in \{-1, +1\}$ is the corresponding label. The data points in the dataset
$S_{\text{training}}$ are given below:
$$(x_1, y_1) = (2, -1), \quad (x_2, y_2) = (7, -1), \quad (x_3, y_3) = (4, +1),$$
$$(x_4, y_4) = (1, -1), \quad (x_5, y_5) = (3, +1), \quad (x_6, y_6) = (6, +1).$$
Suppose you are training a linear classifier $f(x; a, b) = \mathrm{sign}(ax + b)$ with 2-fold cross-validation, where $\mathrm{sign}(z)$ is defined as:
$$\mathrm{sign}(z) = \begin{cases} 1, & z \ge 0 \\ -1, & z < 0. \end{cases}$$
• You have split the dataset $S_{\text{training}}$ into:
$$S_1 = \{(x_1, y_1), (x_2, y_2), (x_3, y_3)\}, \qquad S_2 = \{(x_4, y_4), (x_5, y_5), (x_6, y_6)\}$$
• After training the classifier $f(x; a, b)$ on $S_1$, you obtain the parameters
$a_1 = -1$, $b_1 = 5$ and then validate the classifier on $S_2$.
• After training the classifier $f(x; a, b)$ on $S_2$, you obtain the parameters
$a_2 = 2$, $b_2 = -3$ and then validate the classifier on $S_1$.
Please finish the tasks below:
1. Calculate the average training error in the 2-fold cross-validation.
Note: The average training error is defined as the mean of the error of
classifier $f(x; a_1, b_1)$ on $S_1$ and the error of classifier $f(x; a_2, b_2)$ on $S_2$.
2. Calculate the average validation error (i.e., the cross-validation error) in the
2-fold cross-validation.
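These two quantities can be sanity-checked with a short script. The sketch below is only an illustrative aid (plain Python; the variable and function names are chosen here for readability and are not part of the assignment), using the folds and fitted parameters stated above:

```python
# Folds of S_training as (x_i, y_i) pairs, as given in the problem statement.
S1 = [(2, -1), (7, -1), (4, +1)]
S2 = [(1, -1), (3, +1), (6, +1)]

def sign(z):
    # sign(z) as defined above: +1 for z >= 0, -1 otherwise.
    return 1 if z >= 0 else -1

def error_rate(a, b, fold):
    # Fraction of points in `fold` misclassified by f(x; a, b) = sign(a*x + b).
    mistakes = sum(sign(a * x + b) != y for x, y in fold)
    return mistakes / len(fold)

# Parameters obtained from each training fold (given above).
a1, b1 = -1, 5   # trained on S1
a2, b2 = 2, -3   # trained on S2

avg_training_error = (error_rate(a1, b1, S1) + error_rate(a2, b2, S2)) / 2
avg_validation_error = (error_rate(a1, b1, S2) + error_rate(a2, b2, S1)) / 2
print(avg_training_error, avg_validation_error)
```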
3 (12 points) Shattering
In this problem, consider a classifier $f(x; a, b) = \mathrm{sign}(a x^\top x - b)$, where the feature
vector $x = [x_1, x_2]^\top \in \mathbb{R}^2$ and its prediction $f(x; a, b) \in \{-1, +1\}$. Here, $a, b \in \mathbb{R}$
are the model parameters and $\mathrm{sign}(z)$ is defined as:
$$\mathrm{sign}(z) = \begin{cases} 1, & z \ge 0 \\ -1, & z < 0. \end{cases}$$
The classifier $f(x; a, b)$ performs a binary classification on an input feature vector $x$
under model parameters $a, b \in \mathbb{R}$ that can be learned.
In this question, please determine whether the classifier $f(x; a, b)$ can shatter the
two points $x_1 = [1, 1]^\top$ and $x_2 = [-2, -3]^\top$.
• If you think $f(x; a, b)$ can shatter $x_1$ and $x_2$, you need to show that for each
possible label configuration $(y_1, y_2) \in \{(+1, +1), (-1, -1), (+1, -1), (-1, +1)\}$
of $x_1$ and $x_2$, there exists a classifier $f(x; a, b)$ that classifies $x_1$ and $x_2$
correctly, and you should illustrate each classifier by the following steps:
– Draw the decision boundary of the classifier.
– Shade the area where the classifier makes the positive prediction.
You can use the four coordinate-system panels below to show your drawings
(a plotting sketch follows the panels).
• If you think the classifier $f(x; a, b)$ cannot shatter $x_1$ and $x_2$, please explain the
reason.
[Four blank coordinate-system panels for the drawings, titled $(y_1, y_2) = (+1, +1)$, $(y_1, y_2) = (-1, -1)$, $(y_1, y_2) = (+1, -1)$, and $(y_1, y_2) = (-1, +1)$.]
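If it helps to check a drawing before committing it to a panel, the decision region of $f(x; a, b)$ can also be visualized numerically. The sketch below is one possible way to do this in Python with NumPy and Matplotlib; the particular choice $a = 1$, $b = 4$ is purely illustrative and is not a suggested answer for any label configuration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative parameters only; substitute your own (a, b) for each configuration.
a, b = 1.0, 4.0

# Evaluate a * x^T x - b on a grid that covers both x1 = (1, 1) and x2 = (-2, -3).
u = np.linspace(-5, 5, 400)
X1, X2 = np.meshgrid(u, u)
score = a * (X1**2 + X2**2) - b

# Shade the region predicted +1 and draw the decision boundary (score == 0).
plt.contourf(X1, X2, (score >= 0).astype(float), levels=[0.5, 1.5], alpha=0.3)
plt.contour(X1, X2, score, levels=[0], colors="k")
plt.scatter([1, -2], [1, -3])            # the two points x1 and x2
plt.gca().set_aspect("equal")
plt.show()
```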
4 (10 points) SVM: Gradient
Given a training dataset $S_{\text{training}} = \{(x_i, y_i)\}$, $i = 1, \ldots, n$, we wish to optimize the
loss $L(w, b)$ of a linear SVM classifier:
$$L(w, b) = \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \big(1 - y_i(w^T x_i + b)\big)_+ \qquad (1)$$
where $(z)_+ = \max(0, z)$ is called the rectifier function and $C$ is a scalar constant.
The optimal weight vector $w^*$ and the bias $b^*$ used to build the SVM classifier are
defined as follows:
$$w^*, b^* = \arg\min_{w, b} L(w, b) \qquad (2)$$
In this problem, we attempt to obtain the optimal parameters $w^*$ and $b^*$ by using a
standard gradient descent algorithm.
Hint: To derive the derivative of $L(w, b)$, please consider two cases:
(a) $1 - y_i(w^T x_i + b) \ge 0$,  (b) $1 - y_i(w^T x_i + b) < 0$.
1. Derive the derivative $\dfrac{\partial L(w, b)}{\partial w}$.
2. Derive the derivative $\dfrac{\partial L(w, b)}{\partial b}$.
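Once derived, the analytic derivatives can be sanity-checked numerically before being used in a gradient descent loop. The sketch below is a minimal Python illustration: it evaluates $L(w, b)$ directly from Equation (1) and approximates $\partial L / \partial w$ with central finite differences for comparison against your analytic result. The toy data, step size, and helper names are assumptions made here for illustration, not part of the assignment:

```python
import numpy as np

def svm_loss(w, b, X, y, C):
    # L(w, b) = 0.5 * ||w||_2^2 + C * sum_i (1 - y_i (w^T x_i + b))_+
    margins = 1 - y * (X @ w + b)
    return 0.5 * np.dot(w, w) + C * np.sum(np.maximum(0.0, margins))

def numerical_grad_w(w, b, X, y, C, eps=1e-6):
    # Central finite differences in each coordinate of w.
    # Note: near the hinge kink (1 - y_i(w^T x_i + b) = 0) this may differ
    # from the one-sided (sub)gradient chosen in the analytic derivation.
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (svm_loss(w + e, b, X, y, C) - svm_loss(w - e, b, X, y, C)) / (2 * eps)
    return g

# Toy data (illustrative only).
X = np.array([[2.0, 1.0], [7.0, 3.0], [4.0, 5.0]])
y = np.array([-1.0, -1.0, 1.0])
w, b, C = np.array([0.5, -0.3]), 0.1, 1.0

print(numerical_grad_w(w, b, X, y, C))   # compare against your analytic dL/dw
# The same finite-difference idea applies to dL/db by perturbing b instead of w.
```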