EECS 445 Discussion 4: SVMs and Nonlinear Mappings
Introduction
This discussion focuses on the following topics:
1. Positive definite matrix and positive semi-definite matrix
2. Finding a Closed Form Solution for Ridge Regression
3. Support Vector Machine: Hard-margin and Soft-margin version
1 Positive definite matrix and positive semi-definite matrix
A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is said to be positive definite if $\forall \vec{z} \in \mathbb{R}^n \setminus \{\vec{0}\}$, $\vec{z}^T A \vec{z} > 0$.
A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is said to be positive semi-definite if $\forall \vec{z} \in \mathbb{R}^n$, $\vec{z}^T A \vec{z} \geq 0$.
Negative definite and negative semi-definite matrices are defined analogously.
• For a positive definite matrix, all its eigenvalues are positive.
• For a positive semi-definite matrix, all its eigenvalues are non-negative.
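These eigenvalue characterizations also give a practical way to check definiteness numerically. Below is a minimal sketch (not part of the original handout) using NumPy's `eigvalsh`, which assumes its input is symmetric; the example matrices and tolerance are made up for illustration.

```python
import numpy as np

# Illustrative sketch: classify a symmetric matrix by inspecting its
# eigenvalues, per the facts listed above.
def classify_definiteness(A, tol=1e-10):
    """Return a label for a symmetric matrix A based on its eigenvalues."""
    eigvals = np.linalg.eigvalsh(A)  # eigvalsh assumes A is symmetric
    if np.all(eigvals > tol):
        return "positive definite"
    if np.all(eigvals >= -tol):
        return "positive semi-definite"
    return "indefinite or negative (semi-)definite"

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # eigenvalues 1 and 3 -> PD
B = np.array([[1.0, 1.0], [1.0, 1.0]])     # eigenvalues 0 and 2 -> PSD
print(classify_definiteness(A))  # positive definite
print(classify_definiteness(B))  # positive semi-definite
```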
2 Finding a Closed Form Solution for Ridge Regression
Recall from last discussion, a common machine learning problem is regression. Given a training set $S_n = \{\vec{x}^{(i)}, y^{(i)}\}_{i=1}^{n}$, where $y^{(i)} \in \mathbb{R}$ and $\vec{x}^{(i)} \in \mathbb{R}^d$, we aim to learn a parameter vector $\vec{\theta}$ that minimizes the following objective function (with regularization):

$$J(\vec{\theta}) = \frac{\lambda}{2}\|\vec{\theta} - \vec{0}\|^2 + \sum_{i=1}^{n} \frac{(y^{(i)} - \vec{\theta} \cdot \vec{x}^{(i)})^2}{2} \tag{1}$$
Denote by $Y \in \mathbb{R}^{n \times 1}$ the vector whose $i$th row is the scalar $y^{(i)}$, and by $X \in \mathbb{R}^{n \times d}$ the matrix whose $i$th row is the corresponding $\vec{x}^{(i)}$. We can then rewrite $J(\vec{\theta})$ in matrix form:

$$J(\vec{\theta}) = \frac{\lambda}{2}\vec{\theta}^T\vec{\theta} + \frac{(Y - X\vec{\theta})^T (Y - X\vec{\theta})}{2} \tag{2}$$
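As a quick sanity check on Equation (2), here is a small sketch (illustrative only; the data and the particular $\vec{\theta}$ are made up) confirming that the summation form in Equation (1) and the matrix form in Equation (2) compute the same value.

```python
import numpy as np

# Minimal sketch: the summation form of J(theta) in Eq. (1) and the
# matrix form in Eq. (2) agree numerically on random data.
rng = np.random.default_rng(0)
n, d, lam = 5, 3, 0.1
X = rng.normal(size=(n, d))      # each row is an x^(i)
y = rng.normal(size=n)           # the y^(i), stacked into a vector
theta = rng.normal(size=d)

# Eq. (1): sum over individual training points
J_sum = lam / 2 * theta @ theta + sum((y[i] - theta @ X[i]) ** 2 / 2
                                      for i in range(n))

# Eq. (2): matrix form with residual vector y - X theta
resid = y - X @ theta
J_mat = lam / 2 * theta @ theta + resid @ resid / 2

print(np.isclose(J_sum, J_mat))  # True
```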
To minimize $J$, we locate where its gradient (with respect to $\vec{\theta}$) vanishes. By setting $\nabla_{\vec{\theta}} J(\vec{\theta})$ equal to zero, we can find an optimal $\vec{\theta}$ (denoted as $\vec{\theta}^*$) that results in our minimal loss. This is known as finding a closed form solution. Let's start the minimization process by deriving an expression for the gradient.
$$\begin{aligned}
\nabla_{\vec{\theta}} J(\vec{\theta}) &= \nabla_{\vec{\theta}} \left( \frac{\lambda}{2}\|\vec{\theta}\|^2 + \sum_{i=1}^{n} \frac{(y^{(i)} - \vec{\theta} \cdot \vec{x}^{(i)})^2}{2} \right) \\
&= \lambda\vec{\theta} - \sum_{i=1}^{n} (y^{(i)} - \vec{\theta} \cdot \vec{x}^{(i)})\vec{x}^{(i)} \\
&= \lambda\vec{\theta} - \sum_{i=1}^{n} y^{(i)}\vec{x}^{(i)} + \sum_{i=1}^{n} (\vec{\theta} \cdot \vec{x}^{(i)})\vec{x}^{(i)}
\end{aligned}$$
Note that we can rewrite the last term as
$$\begin{aligned}
(\vec{\theta} \cdot \vec{x}^{(i)})\vec{x}^{(i)} &= \vec{x}^{(i)}(\vec{\theta} \cdot \vec{x}^{(i)}) && \text{1. Scalar multiplication can be applied on the right or the left} \\
&= \vec{x}^{(i)}(\vec{x}^{(i)} \cdot \vec{\theta}) && \text{2. Commutativity of the inner product} \\
&= \vec{x}^{(i)}\left((\vec{x}^{(i)})^T \vec{\theta}\right) && \text{3. Matrix multiplication representation of the inner product} \\
&= \vec{x}^{(i)}(\vec{x}^{(i)})^T \vec{\theta} && \text{4. Associativity of matrix multiplication}
\end{aligned}$$
Therefore, by setting the gradient to zero, we get the closed form solution as
$$\begin{aligned}
\vec{0} = \nabla_{\vec{\theta}} J(\vec{\theta}) &= \lambda\vec{\theta} - \sum_{i=1}^{n} y^{(i)}\vec{x}^{(i)} + \left( \sum_{i=1}^{n} \vec{x}^{(i)}(\vec{x}^{(i)})^T \right)\vec{\theta} && \text{1. From the above derivations} \\
&= \lambda\vec{\theta} - \begin{bmatrix} \vec{x}^{(1)} & \cdots & \vec{x}^{(n)} \end{bmatrix} \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix} + \begin{bmatrix} \vec{x}^{(1)} & \cdots & \vec{x}^{(n)} \end{bmatrix} \begin{bmatrix} (\vec{x}^{(1)})^T \\ \vdots \\ (\vec{x}^{(n)})^T \end{bmatrix} \vec{\theta} && \text{2. Matrix form} \\
&= \lambda\vec{\theta} - X^T Y + \left( X^T X \right)\vec{\theta} && \text{3. Matrix notation} \\
&= (\lambda I + X^T X)\vec{\theta} - X^T Y
\end{aligned}$$
We can then find the optimal $\vec{\theta}^*$ that minimizes our function:
$$(\lambda I + X^T X)\vec{\theta}^* = X^T Y$$
$$\vec{\theta}^* = (\lambda I + X^T X)^{-1} X^T Y$$
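In code, this closed form is a single linear solve. The sketch below (synthetic data; the `true_theta` vector and the noise level are arbitrary choices for illustration) computes $\vec{\theta}^* = (\lambda I + X^T X)^{-1} X^T Y$ with NumPy, using `np.linalg.solve` rather than forming an explicit matrix inverse.

```python
import numpy as np

# Minimal sketch of the closed-form ridge solution derived above.
rng = np.random.default_rng(445)
n, d, lam = 50, 4, 1.0
X = rng.normal(size=(n, d))
true_theta = np.array([1.0, -2.0, 0.5, 3.0])
Y = X @ true_theta + 0.1 * rng.normal(size=n)   # noisy linear targets

# Solve (lambda I + X^T X) theta = X^T Y
theta_star = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ Y)
print(theta_star)   # roughly recovers true_theta (shrunk slightly by the penalty)
```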
Recall that regression via L2-regularized linear least squares is also known as "ridge regression".
Let's review the benefits of ridge regression. Preferring a solution with small norm has several advantages:
1. it lessens overfitting
2. it can increase accuracy
3. it makes underdetermined systems determined
Also, watch for underfitting, which occurs when we are too stubborn to learn much from our data. In
other words, the model may be too simplistic to capture the underlying distribution of the data, which can
happen if we set λ too high.
One more important advantage of this approach is that $\lambda I + X^T X$ is always invertible if $\lambda > 0$, while $X^T X$, the matrix we invert in the closed-form solution to least-squares regression without regularization, need not be. We outline a proof of this below:
Theorem 1. $\forall \lambda \in \mathbb{R}$ and $X \in \mathbb{R}^{n \times d}$, if $\lambda > 0$, then $\lambda I + X^T X$ is invertible.
We use the following fact without proof: a matrix is invertible if and only if it does not have a 0 eigenvalue.
Lemma 1. XT X is positive semi-definite (PSD).
We show that $X^T X$ is PSD based on its definition in Section 1. To see that $X^T X$ is symmetric, note that the element in the $i$th row and $j$th column of $X^T X$, $(X^T X)_{ij}$, is $\vec{c}_i^T \vec{c}_j$, where $\vec{c}_i$ is the $i$th column vector of $X$ (the $i$th row of $X^T$ is $\vec{c}_i^T$), so
$$(X^T X)_{ij} = \vec{c}_i^T \vec{c}_j = \vec{c}_i \cdot \vec{c}_j = \vec{c}_j \cdot \vec{c}_i = \vec{c}_j^T \vec{c}_i = (X^T X)_{ji}$$
Another way to show that $X^T X$ is symmetric is to simply note that $(X^T X)^T = X^T X$. Now, let $\vec{z} \in \mathbb{R}^d$.
$$\vec{z}^T X^T X \vec{z} = (X\vec{z})^T (X\vec{z}) = (X\vec{z}) \cdot (X\vec{z}) = \|X\vec{z}\|^2$$
Therefore, $\vec{z}^T X^T X \vec{z} \geq 0$ and $X^T X$ is PSD.
Lemma 2. If matrix A has eigenvalue k, then A + λI has eigenvalue k + λ.
Discussion Question 1. Prove Lemma 2.
Since $X^T X$ is PSD (Lemma 1), all of its eigenvalues are at least 0 (see Section 1). By Lemma 2, adding $\lambda I$ to $X^T X$ for $\lambda > 0$ increases every eigenvalue by $\lambda$, so they are all positive. Thus, $\lambda I + X^T X$ is invertible.
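A quick numerical illustration of this argument (made-up $X$ and $\lambda$; it is not a substitute for the proof, and it does not answer Discussion Question 1 for you):

```python
import numpy as np

# Eigenvalues of X^T X are >= 0, and adding lambda*I shifts every
# eigenvalue up by lambda, so lambda*I + X^T X is invertible.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
lam = 0.5

eig_gram = np.linalg.eigvalsh(X.T @ X)                    # all >= 0 (PSD)
eig_reg = np.linalg.eigvalsh(lam * np.eye(3) + X.T @ X)   # all > 0

print(np.allclose(eig_reg, eig_gram + lam))  # True (Lemma 2, numerically)
print(np.all(eig_reg > 0))                   # True -> invertible
```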
3 SVM: Hard-margin and Soft-margin version
In this discussion, we will review the two versions of SVM: hard-margin and soft-margin. We put an
emphasis on reinforcing our intuition behind what these equations mean. Hard-margin SVMs assume points
are linearly separable; soft-margin SVMs do not assume points are linearly separable and thus allow for
misclassifications.
3.1 Hard-Margin SVMs
We have formulated the classification problem as the problem of finding a decision boundary that separates positive and negative training data points. See the left part of Figure 1: all three decision boundaries correctly separate the training data points. Which hyperplane should we choose among these decision boundaries?
One valid answer is to maximize the margin between the data points so as to generalize better on test data. Recall that hard-margin Support Vector Machines (SVMs) maximize the distance between the decision boundary and the points that are closest to it, as in the right part of Figure 1.
Figure 1: Hard-Margin SVM Classifier
Source: Raschka, S. (2015). Python machine learning. Packt Publishing Ltd.
To get an intuition for the decision boundary that maximizes the margin, take Figure 2 as an example. Suppose we only have 2 data points and we want to maximize the distance between the decision boundary and the 2 points. We can find a unique decision boundary such that the distance from the boundary to both of the data points is maximized and equal to $d$ (Figure 2(a)). Now suppose we rotate it a little bit as in Figure 2(b); we will find that the distance from the closest data point to the boundary is $d' \leq d$. Suppose instead we shift the decision boundary as in Figure 2(c); we will find that the distance from the closest data point (the negative point in this case) becomes $d'' \leq d$.
Figure 2: Maximum Margin Separator
Example. Consider a separating hyperplane $\vec{\theta} \cdot \vec{x} + b = 2x_1 + 3x_2 - 1 = 0$ and the following data points: $\vec{x}^{(1)} = (3, 5)$, $\vec{x}^{(2)} = (-2, -1)$, $\vec{x}^{(3)} = (0, 0)$, $\vec{x}^{(4)} = (1, 0)$ with labels $y^{(1)} = +1$, $y^{(2)} = -1$, $y^{(3)} = -1$, $y^{(4)} = +1$. Then, each point is correctly classified by the separating hyperplane, and $\vec{x}^{(3)}, \vec{x}^{(4)}$ are support vectors.

| $i$ | $\vec{x}^{(i)}$ | $y^{(i)}$ | $\vec{\theta} \cdot \vec{x}^{(i)} + b$ | Classification | Support vector? |
|---|---|---|---|---|---|
| 1 | (3, 5) | +1 | $2 \cdot 3 + 3 \cdot 5 - 1 = 20$ | +1 | No |
| 2 | (-2, -1) | -1 | $2 \cdot (-2) + 3 \cdot (-1) - 1 = -8$ | -1 | No |
| 3 | (0, 0) | -1 | $2 \cdot 0 + 3 \cdot 0 - 1 = -1$ | -1 | Yes |
| 4 | (1, 0) | +1 | $2 \cdot 1 + 3 \cdot 0 - 1 = 1$ | +1 | Yes |
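The table can be reproduced with a few lines of NumPy. The sketch below (illustrative only, not part of the original handout) evaluates $\vec{\theta} \cdot \vec{x}^{(i)} + b$ for each point and flags the points with $y^{(i)}(\vec{\theta} \cdot \vec{x}^{(i)} + b) = 1$ as support vectors.

```python
import numpy as np

# Reproduce the table for the example hyperplane 2*x1 + 3*x2 - 1 = 0,
# i.e. theta = (2, 3) and b = -1.
theta, b = np.array([2.0, 3.0]), -1.0
points = np.array([[3, 5], [-2, -1], [0, 0], [1, 0]], dtype=float)
labels = np.array([+1, -1, -1, +1])

for i, (x, y) in enumerate(zip(points, labels), start=1):
    score = theta @ x + b                    # signed value of theta . x + b
    correct = np.sign(score) == y            # classified correctly?
    support = np.isclose(y * score, 1.0)     # lies on the margin boundary?
    print(f"x({i}): score={score:+.0f}, correct={correct}, support vector={support}")
```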
3.2 Optimization Formulation
The optimization formulation of the hard-margin SVM is as follows:
$$\begin{aligned}
\min_{\vec{\theta}} \quad & \frac{1}{2}\|\vec{\theta}\|^2 \\
\text{subject to} \quad & y^{(i)}(\vec{\theta} \cdot \vec{x}^{(i)}) \geq 1, \quad \forall i = 1, \ldots, n
\end{aligned} \tag{3}$$
This means that we want to minimize the $\ell_2$ norm of $\vec{\theta}$ while requiring all points to be classified correctly. In order to accomplish this, we need to ensure that the distance is maximized between our decision boundary ($\vec{\theta} \cdot \vec{x} + b = 0$) and our margin boundaries ($\vec{\theta} \cdot \vec{x} + b = \pm 1$). This margin reflects the minimum distance between any point and the decision boundary.
Consider a point $\vec{x}^{(0)}$ that is on the margin (i.e., $y^{(0)}(\vec{\theta} \cdot \vec{x}^{(0)} + b) = 1$), which we refer to as a support vector. Then, as discussed in lecture, the distance would be:
$$D(\vec{x}^{(0)}) = \frac{1}{\|\vec{\theta}\|}$$
So, to maximize the distance between points on the margin and the decision boundary, we should minimize the norm of $\vec{\theta}$.
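As a worked example (reusing the hyperplane $\vec{\theta} = (2, 3)$, $b = -1$ from the example in Section 3.1), the margin distance is
$$D(\vec{x}^{(0)}) = \frac{1}{\|\vec{\theta}\|} = \frac{1}{\sqrt{2^2 + 3^2}} = \frac{1}{\sqrt{13}} \approx 0.277,$$
which is exactly the distance from the support vectors $\vec{x}^{(3)} = (0, 0)$ and $\vec{x}^{(4)} = (1, 0)$ to the decision boundary.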
3.3 Soft-Margin SVMs
As we've seen in class this week, the traditional hard-margin SVM is a bit too constrained for most purposes. If our data set is not linearly separable, we can't use the hard-margin formulation, so we turn to the soft-margin SVM. In addition to being usable with non-separable datasets, another benefit of soft-margin SVMs is that they are more robust to outliers in our data.
The key difference between the soft-margin and hard-margin SVMs is in the constraints. With hard-margin SVMs, our margins were defined by the hard constraint:
$$y^{(i)}(\vec{\theta} \cdot \vec{x}^{(i)} + b) \geq 1$$
In the soft-margin SVM, we introduce $n$ slack variables $\xi_i$ for $i = 1, \ldots, n$ (i.e., one per data point), and rewrite our constraints in the following form:
$$y^{(i)}(\vec{\theta} \cdot \vec{x}^{(i)} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
With the introduction of these new slack variables, we can rewrite our primal formulation to include them:
$$\begin{aligned}
\min_{\vec{\theta}, \vec{\xi}} \quad & \frac{1}{2}\|\vec{\theta}\|^2 + C\sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y^{(i)}(\vec{\theta} \cdot \vec{x}^{(i)} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i = 1, \ldots, n
\end{aligned}$$
What's so special about these slack variables? By including them, we now pose a problem that allows points to have a margin of less than 1. Any point that violates the margin constraint incurs a penalty cost of $C\xi_i$. If the point is within the margin but on the correct side of the hyperplane, $\xi_i$ will lie between 0 and 1 (i.e., $0 < \xi_i \leq 1$). If the point is on the wrong side of the hyperplane, $\xi_i$ will be greater than 1.
Example. Consider the separating hyperplane $\vec{\theta} \cdot \vec{x} + b = 2x_1 + 3x_2 - 1 = 0$ from the previous example and the following new data points: $\vec{x}^{(5)} = (4, 2)$, $\vec{x}^{(6)} = (-1, -3)$, $\vec{x}^{(7)} = (0.25, 0)$, $\vec{x}^{(8)} = (0.75, 0)$ with labels $y^{(5)} = -1$, $y^{(6)} = +1$, $y^{(7)} = -1$, $y^{(8)} = +1$. Then, $\vec{x}^{(5)}, \vec{x}^{(6)}$ are not correctly classified by the separating hyperplane, and $\vec{x}^{(7)}, \vec{x}^{(8)}$ are correctly classified but lie within the margin. Therefore, none of these points satisfies the hard-margin constraint.

| $i$ | $\vec{x}^{(i)}$ | $y^{(i)}$ | $\vec{\theta} \cdot \vec{x}^{(i)} + b$ | Classification | Correct? | $\xi_i$ |
|---|---|---|---|---|---|---|
| 5 | (4, 2) | -1 | $2 \cdot 4 + 3 \cdot 2 - 1 = 13$ | +1 | No | 14 |
| 6 | (-1, -3) | +1 | $2 \cdot (-1) + 3 \cdot (-3) - 1 = -12$ | -1 | No | 13 |
| 7 | (0.25, 0) | -1 | $2 \cdot 0.25 + 3 \cdot 0 - 1 = -0.5$ | -1 | Yes | 0.5 |
| 8 | (0.75, 0) | +1 | $2 \cdot 0.75 + 3 \cdot 0 - 1 = 0.5$ | +1 | Yes | 0.5 |
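For a fixed hyperplane, each slack takes the smallest value satisfying its constraint, i.e., $\xi_i = \max(0,\, 1 - y^{(i)}(\vec{\theta} \cdot \vec{x}^{(i)} + b))$, which is how the values in the table were computed. A minimal NumPy sketch (illustrative only) reproducing them:

```python
import numpy as np

# Reproduce the slack values above for the same hyperplane:
# xi_i = max(0, 1 - y^(i) * (theta . x^(i) + b)) measures how badly each
# point violates the hard-margin constraint.
theta, b = np.array([2.0, 3.0]), -1.0
points = np.array([[4, 2], [-1, -3], [0.25, 0], [0.75, 0]])
labels = np.array([-1, +1, -1, +1])

scores = points @ theta + b                       # theta . x + b per point
slacks = np.maximum(0.0, 1.0 - labels * scores)   # xi_i
print(slacks)   # [14.  13.   0.5  0.5], matching the table
```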
By posing our problem in this way, we tell our classifier that we'd prefer a hyperplane that correctly separates all of the data, but with our soft constraints, we can tolerate misclassified examples (each one incurring a cost proportional to the amount by which it violates the margin). The hyperparameter $C$ controls the trade-off between two extremes: a high $C$ value leads to our classifier focusing on trying to classify every example correctly (it can't afford to pay the high cost of $C\xi_i$). On the other hand, a low $C$ aims to maximize the margin, with less regard to how many points are misclassified in the process. As with any hyperparameter, there is no magical $C$; as you'll see in your project, we usually find $C$ through cross-validation.
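For instance, a cross-validated search over a grid of $C$ values might look like the sketch below (synthetic data and an arbitrary candidate grid; scikit-learn's `SVC` and `GridSearchCV` are one possible toolchain, not necessarily the one used in the project):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Pick C for a linear soft-margin SVM by 5-fold cross-validation on
# synthetic, roughly linearly separable data with some label noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=200) > 0, 1, -1)

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```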