EECS 445 Discussion 5: Kernels and Soft Margin SVMs
1 More on SVMs
1.1 Dual Formulation
The main difference between the primal and the dual is that we interchange the order in which we
optimize. In other words, we swap the order of the maximum and minimum in the objective function:
$$\max_{\vec\alpha}\ \min_{\vec\theta}\ \frac{1}{2}\|\vec\theta\|^2 + \sum_{i=1}^{n} \alpha_i\left(1 - y^{(i)}\left(\vec\theta \cdot \vec x^{(i)}\right)\right) \quad \text{subject to } \alpha_i \geq 0,\ \forall i = 1, \dots, n \tag{1}$$
The dual typically gives us a lower bound on the solution of the primal. Under certain conditions (which are met in the SVM formulation), the solution of the dual is actually equal to the solution of the primal. This is incredibly helpful: instead of maximizing with respect to $\vec\alpha$ first (which makes it difficult to then solve for $\vec\theta$), we can first simplify the expression by minimizing with respect to $\vec\theta$ to find an optimal $\vec\theta^*$. We can then substitute $\vec\theta^*$ for $\vec\theta$ in our original function and maximize in terms of $\vec\alpha$.
$$0 = \nabla_{\vec\theta}\left(\frac{1}{2}\|\vec\theta\|^2 + \sum_{i=1}^{n} \alpha_i\left(1 - y^{(i)}\left(\vec\theta \cdot \vec x^{(i)}\right)\right)\right)$$

$$\vec\theta^* = \sum_{i=1}^{n} \alpha_i y^{(i)} \vec x^{(i)}$$
Finally, we substitute $\vec\theta^*$ for $\vec\theta$ in our original function:
$$\begin{aligned}
\max_{\vec\alpha}\min_{\vec\theta} J(\vec\theta) &= \max_{\vec\alpha}\min_{\vec\theta}\ \frac{1}{2}\|\vec\theta\|^2 + \sum_{i=1}^{n}\alpha_i\left(1 - y^{(i)}\,\vec\theta\cdot\vec x^{(i)}\right)\\
&= \max_{\vec\alpha}\ \frac{1}{2}\left(\sum_{i=1}^{n}\alpha_i y^{(i)}\vec x^{(i)}\right)\cdot\left(\sum_{i=1}^{n}\alpha_i y^{(i)}\vec x^{(i)}\right) + \sum_{i=1}^{n}\alpha_i - \sum_{i=1}^{n}\alpha_i y^{(i)}\sum_{j=1}^{n}\alpha_j y^{(j)}\,\vec x^{(j)}\cdot\vec x^{(i)}\\
&= \max_{\vec\alpha}\ \sum_{i=1}^{n}\alpha_i + \frac{1}{2}\left(\sum_{i=1}^{n}\alpha_i y^{(i)}\vec x^{(i)}\right)\cdot\left(\sum_{j=1}^{n}\alpha_j y^{(j)}\vec x^{(j)}\right) - \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y^{(i)}y^{(j)}\,\vec x^{(j)}\cdot\vec x^{(i)}\\
&= \max_{\vec\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y^{(i)}y^{(j)}\,\vec x^{(j)}\cdot\vec x^{(i)}
\end{aligned}$$
After we minimize with respect to $\vec\theta$, the dual looks like the following:

$$\max_{\vec\alpha}\min_{\vec\theta} J(\vec\theta) = \max_{\vec\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y^{(i)}y^{(j)}\,\vec x^{(j)}\cdot\vec x^{(i)}$$
Once we optimize our function to obtain $\vec\alpha^*$, the solution satisfies the "complementary slackness" conditions, which means the following holds for all $i$:

$$\begin{cases} y^{(i)}\,\vec\theta^*\cdot\vec x^{(i)} = 1, & \text{if } \alpha_i > 0\\ y^{(i)}\,\vec\theta^*\cdot\vec x^{(i)} > 1, & \text{if } \alpha_i = 0 \end{cases}$$
Therefore, when $\alpha_i$ is non-zero, we know that $\vec x^{(i)}$ lies on the margin (or, in the case of soft-margin SVMs, within the margin or misclassified), making it a support vector. Hence, only points with non-zero alphas contribute to the position of the decision boundary.
Discussion Question 1. What is the number of parameters we have to learn in the primal? What
about the dual? Which one is better under which circumstances?
1.2 Recovering ~θ and Classifying New Examples
We can use either an explicit mapping or the kernel trick to get each $\alpha_i$. Then, we can recover our original $\vec\theta$ using the expression in terms of the $\alpha_i$'s which we obtained during differentiation:

$$\vec\theta^* = \sum_{i=1}^{n} \alpha_i y^{(i)} \varphi(\vec x^{(i)})$$
If we can find this $\vec\theta$, we can use it directly to classify new data. But sometimes, especially when we use kernels, we cannot compute $\varphi(\vec x)$ explicitly. How do we then use our recovered $\alpha_i$'s to apply our linear classifier? Luckily, because of the above equality, we can substitute the expression for $\vec\theta^*$ and still classify new examples using only the $\alpha_i$'s that we recover:

$$h(\vec x_{\text{new}}) = \text{sign}\left(\sum_{i=1}^{n}\alpha_i y^{(i)} K(\vec x^{(i)}, \vec x_{\text{new}})\right)$$
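As a concrete sketch, the kernel-based decision rule above can be implemented directly. The training points, labels, and $\alpha$ values below are a hypothetical toy example; for a hard-margin SVM on two symmetric points, $\alpha_1 = \alpha_2 = 0.5$ satisfies the optimality conditions, so the rule reduces to the expected linear boundary.

```python
import numpy as np

def kernel_classify(x_new, X, y, alpha, kernel):
    """Classify x_new using h(x) = sign(sum_i alpha_i * y_i * K(x_i, x))."""
    scores = np.array([kernel(x_i, x_new) for x_i in X])
    return int(np.sign(np.sum(alpha * y * scores)))

# Hypothetical toy problem: two 2-D points with a linear kernel.
linear_kernel = lambda a, b: np.dot(a, b)
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])  # optimal hard-margin duals for this symmetric pair

print(kernel_classify(np.array([2.0, 3.0]), X, y, alpha, linear_kernel))   # → 1
print(kernel_classify(np.array([-0.5, 1.0]), X, y, alpha, linear_kernel))  # → -1
```

Note that the classifier never forms $\vec\theta$ explicitly; only kernel evaluations against the training points are needed.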
2 Nonlinear Mappings
Discussion Question 2. Quiet environments help me study. And good discussion notes help me study.
Will good discussion notes in a quiet environment help me study?
Here, we assumed that combined inputs produce a combined effect. Let $[x_1, x_2]$ represent the two factors. Then we might calculate their combined effect $y$ as a linear combination: $y = \theta_1 x_1 + \theta_2 x_2$. This function is linear in $\vec x$, as $\vec\theta$ is fixed.
For the first few lectures, we focused on linear functions with thresholds: a vector ~x is classified as sign(~θ ·~x).
Though the sign function is nonlinear, the part of the function that changes in training—that is, the part
that is learned—is linear: ~θ · ~x as a function of ~x.
When our data does not in fact represent a linear relationship, however, fitting a linear model to the data
may result in underfitting. In such cases, we may consider a feature map φ(~x) that maps the original input
space ~x = [x1, x2, ..., xn] to a higher dimensional space. By distorting the space in which a data set lives,
the data may become linearly separable in this new space. See Figures 1a and 1b for illustrations in which
we apply a transformation that adds a dimension and allows for linear separability.
Example:
Consider a non-linearly separable 1-dimensional data set of 3 points, $\vec x = [x_1, x_2, x_3]$, where each point $x_i$ is a scalar value. We can map these scalar values to 2-dimensional space with the feature mapping $\varphi(x_i) = [x_i, x_i^2]$. We now find that the decision boundary $z^2 + 4z + 3.5 = 0$ can correctly classify all the points.
Figure 1: Two linear classifiers pre-composed with non-linear maps. (a) A non-(linearly separable) dataset in $\mathbb{R}^1$ becomes linearly separable after embedding into $\mathbb{R}^2$; the line $h$ separates the two classes. (b) A non-(linearly separable) dataset in $\mathbb{R}^2$ becomes linearly separable after embedding into $\mathbb{R}^3$; the hyperplane $h$ separates the two classes.
Discussion Question 3. Suppose we have several data points that we want to classify, where a single point $\vec x \in \mathbb{R}^2$, $\vec x = [x_1, x_2]$. After plotting our data, we notice a pattern similar to the figure below, in which points are fully separable by a circular/elliptical decision boundary.
Our goal is to find a mapping of our data that is linearly separable in a higher dimensional space. Using the equation of a circle:

$$(x_1 - h)^2 + (x_2 - k)^2 = r^2$$

Find a $\varphi(\vec x)$ in terms of $x_1$ and $x_2$ and a corresponding $\vec\theta$ in terms of $h$, $k$, and $r$ that correctly represents this decision boundary.
3 Kernels
3.1 Motivation
A kernel is a function of two inputs that produces a measure of similarity (an inner product) of the feature mapping of each input, e.g., $K(\bar x, \bar z) = \varphi(\bar x)\cdot\varphi(\bar z)$. Kernels are useful because they can eliminate the explicit computation of expensive feature vectors. For example, if we want to represent the quadratic relationship between each pair of values in our input vector, we get the feature vector
$$\varphi(\bar x) = \begin{bmatrix} x_1x_1 \\ x_1x_2 \\ x_1x_3 \\ \vdots \\ x_2x_2 \\ x_2x_3 \\ \vdots \\ x_dx_d \end{bmatrix} \tag{2}$$
For a quadratic relationship and small d this is fairly easy to compute, but as we wish to model higher orders
of relation between data points and thousands of features, our feature vector will grow far too large. In some
scenarios—such as in SVMs—we can get around this problem by using the kernel trick. Recall the dual
formulation of the SVM without offset derived in class:
$$\max_{\bar\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y^{(i)}y^{(j)}\,\bar x^{(j)}\cdot\bar x^{(i)} \tag{3}$$
Note that the only instance of the input vectors $\bar x^{(i)}$ is in the dot product between two vectors: $\bar x^{(i)} \cdot \bar x^{(j)}$. If we are instead interested in fitting our data in a higher-dimensional feature space (i.e., after performing our non-linear mapping $\varphi(\bar x)$), this dot product becomes $\varphi(\bar x^{(i)}) \cdot \varphi(\bar x^{(j)})$. Explicitly calculating each feature vector may be computationally expensive, but the only expression we actually need is the dot product of the two feature vectors for each pair of input data points. If we can efficiently compute these values, then we are no longer required to compute each feature vector.
3.2 Kernel functions
The dot product of these high-dimensional feature vectors may be efficiently computed using a kernel
function. For example, the mapping φ(¯x) given in equation (2) corresponds to the following kernel:
$$\begin{aligned} K(\bar x, \bar z) &= (\bar x \cdot \bar z)^2 \\ &= \sum_{i=1}^{d}\sum_{j=1}^{d} x_i x_j z_i z_j \\ &= \sum_{i,j=1}^{d} (x_i x_j)(z_i z_j) \\ &= \varphi(\bar x)\cdot\varphi(\bar z) \end{aligned}$$
Discussion Question 4. What is the computational complexity of computing the kernel $K(\bar x, \bar z) = (\bar x \cdot \bar z)^2$ directly, in terms of the number of multiplications? What about $\varphi(\bar x)\cdot\varphi(\bar z)$?
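The equivalence $(\bar x \cdot \bar z)^2 = \varphi(\bar x)\cdot\varphi(\bar z)$ can be checked numerically. Here $\varphi$ stacks all $d^2$ ordered products $x_i x_j$, matching the double sum above (the random vectors are arbitrary test inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x, z = rng.standard_normal(d), rng.standard_normal(d)

def phi(v):
    """All d^2 ordered products v_i * v_j, matching the double sum above."""
    return np.outer(v, v).ravel()

direct = np.dot(x, z) ** 2        # d multiplications, then one squaring
mapped = np.dot(phi(x), phi(z))   # O(d^2) multiplications just to build each phi

print(np.isclose(direct, mapped))  # → True
```

The two quantities agree to floating-point precision, while the direct kernel evaluation avoids ever materializing the $d^2$-dimensional feature vectors.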
Depending on the problem at hand, it may be beneficial to add a constant term to the mapping as an offset.
We can use a similar kernel function:
$$K(\bar x, \bar z) = (\bar x \cdot \bar z + c)^2$$
This kernel function corresponds to the following feature mapping:
$$\varphi(\bar x) = \Big\langle \underbrace{c,\ \sqrt{2c}\,x_1,\ \dots,\ \sqrt{2c}\,x_n}_{\text{linear terms}},\ \underbrace{\sqrt{2}\,x_1x_2,\ \sqrt{2}\,x_1x_3,\ \dots,\ \sqrt{2}\,x_1x_n,\ \sqrt{2}\,x_2x_3,\ \dots,\ \sqrt{2}\,x_{n-1}x_n}_{\text{cross terms } x_ix_j,\ i\neq j},\ \underbrace{x_1^2,\ \dots,\ x_n^2}_{\text{square terms}} \Big\rangle$$
In other words, for this definition of $K(\bar x, \bar z)$ and the feature mapping $\varphi(\bar x)$ above:
K(¯x, z¯) = φ(¯x) · φ(¯z)
Discussion Question 5. Given the kernel $K(\bar x, \bar z) = 4(\bar x \cdot \bar z) + (\bar x \cdot \bar z)^3$ for $\bar x, \bar z \in \mathbb{R}^2$, find the feature mapping $\varphi$ that satisfies $K(\bar x, \bar z) = \varphi(\bar x)\cdot\varphi(\bar z)$.
Discussion Question 6. Assuming the original feature space is in $\mathbb{R}^2$, find the kernel that corresponds to the following $\varphi(\bar x) = \left[r,\ x_1\sqrt{2r},\ x_2\sqrt{2r},\ x_1x_2\sqrt{2},\ x_1^2,\ x_2^2\right]^T$.
3.3 RBF kernel
We now examine a popular kernel function that is used often in practice: the radial basis function (RBF)
kernel. The RBF kernel is defined as follows:
$$K_{\text{RBF}}(\bar x, \bar z) = \exp\left(-\gamma\,\|\bar x - \bar z\|^2\right) \tag{5}$$
We are unable to explicitly write out the feature mapping that corresponds to the RBF kernel; this is because that feature mapping sends its input vectors to an infinite-dimensional space. Below are some 2-dimensional visualizations of the RBF kernel.
Figure 2: RBF visualization
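Although its feature mapping is infinite-dimensional, the RBF kernel itself is trivial to evaluate. A minimal sketch (the $\gamma$ value and test points are arbitrary choices):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """K_RBF(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-gamma * np.dot(diff, diff))

# Identical points have maximal similarity; distant points decay toward 0.
print(rbf_kernel([1.0, 0.0], [1.0, 0.0]))  # → 1.0
print(rbf_kernel([1.0, 0.0], [0.0, 0.0]))  # → exp(-1) ≈ 0.368
```

Larger $\gamma$ makes the similarity fall off faster with distance, which is why $\gamma$ acts as an inverse length-scale hyperparameter.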
4 Receiver Operating Characteristic Curve (ROC)
Recall the definitions of sensitivity and false positive rate, which we introduced last week. In this section, we will show how we can use the tradeoff between them to define another useful performance metric, the ROC curve.
• Sensitivity/Recall/TP Rate: $\frac{TP}{TP + FN}$
• FP Rate: $\frac{FP}{FP + TN}$
The Receiver Operating Characteristic (ROC) curve shows the tradeoff between the true positive rate and false positive rate. Using the ROC curve, we obtain another performance metric for our classifiers: the Area Under the ROC Curve (AUROC), which indicates the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. A perfect AUROC score would be 1.00, while a classifier that performs no better than random would have a score of 0.50. The ROC curve can be used to select a threshold for a classifier that maximizes the true positive rate while minimizing the false positive rate.
Figure 3: AUROC Curve
To obtain the ROC curve, we first obtain the classification score of each point (note: we use not only the sign of the classification, but also its magnitude). Then, we arrange the scores in descending order. We start with a threshold below all scores, so that every point is classified as positive, and progressively raise the threshold past one score at a time until all points are classified as negative. Each time we move the threshold, we plot the TPR and FPR at that point. A good tutorial on AUROC can be found at https://www.youtube.com/watch?v=OAl6eAyPyot=513s.
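The threshold-sweeping procedure above can be sketched as follows (the scores and labels are hypothetical, and ties are ignored for simplicity):

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the threshold from above all scores downward, recording (FPR, TPR).

    labels are +1/-1; at each step, everything above the threshold is
    classified as positive.
    """
    order = np.argsort(-np.asarray(scores))  # descending by score
    y = np.asarray(labels)[order]
    P, N = np.sum(y == 1), np.sum(y == -1)
    points = [(0.0, 0.0)]                    # threshold above every score
    tp = fp = 0
    for yi in y:                             # lower the threshold past one score
        if yi == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.6, 0.3]
labels = [+1, +1, -1, +1]
print(roc_points(scores, labels))
```

Plotting these (FPR, TPR) pairs in order traces out the ROC curve, and the area under the resulting step function is the AUROC.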
Figure 4: AUROC explanation
Discussion Question 7. To get an idea of how to obtain the ROC Curve, go through the following
exercise and finish the ROC Curve. Some points on the curve have already been plotted.
i      h(x̄^(i), θ̄)    y^(i)
1          5.3          +1
3          4.0          +1
4          3.2          −1
2          0.7          +1
5, 6       0.5          +1, −1
7         −0.9          −1
10        −1.9          +1
9         −3.0          −1
8         −5.2          −1
Discussion Question 8. Why would AUROC be a good metric? What could be some limitations of it?
5 Cross-Validation
Often, our algorithms come with hyperparameters, that is, numbers that must be tuned outside of
training. For instance, the strength λ of regularization for linear regression is a hyperparameter. Usually
hyperparameter-search occurs by trial and error: we try many hyperparameter values, and for each value,
we train the main algorithm (two popular implementations are random search or grid search). Then we
stick with the value that yields the lowest loss.
To avoid overfitting, we must calculate that validation loss on a different set of data than that on
which we trained. Otherwise, we will commit the same mistake as the teacher who measures which teaching
technique works best by making the Final Exam a duplicate of the Midterm (this just tests memorization,
not general understanding). So, each step of the learning process uses a separate dataset. The most common
names for these sets are:
1. Test set
2. Validation set
3. Train set
For instance, one might randomly shuffle one’s datapoints, then allocate 10% of samples to the test set,
10% to the validation set, and 80% to the train set.
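The shuffle-then-split procedure described above can be sketched as follows (the 10/10/80 fractions follow the example in the text; the seed is an arbitrary choice for reproducibility):

```python
import numpy as np

def train_val_test_split(X, y, test_frac=0.1, val_frac=0.1, seed=0):
    """Shuffle the data, then carve off the test and validation sets."""
    n = len(X)
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

# Toy dataset of 50 examples with 2 features each.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)
train, val, test = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # → 40 5 5
```

Shuffling before splitting matters: if the data are ordered (e.g., by class), a contiguous split would produce unrepresentative sets.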
See Figure 6 for a depiction of these splits.
Figure 6: A flowchart depicting the creation of train, test, and validation sets from an original dataset.
However, a main drawback of just holding-out a single portion of the entire dataset for validation is that
we are evaluating hyperparameters based on the performance on just one validation dataset. One way to
help alleviate this problem is k-fold cross-validation. In this approach, we first randomly split the training
data into k partitions. We pick one of the partitions to use for validation and train a model on the other
k − 1 partitions. This is repeated for each of the k partitions and the hyperparameter settings are evaluated
based on the average validation performance across all k partitions. Note that this requires us to train k
different models, which can take quite a bit longer to complete. The pseudocode for cross-validation is given below. In this case there are n data points in the training set and the data are separated into k folds.
Algorithm 1 Cross Validation
Require: X [n × d], and ȳ [n × 1]
Ensure: Average score using different folds as validation and training data
  score ← 0
  Fold_1, · · · , Fold_k ← SeparatingTrainingData(X, ȳ)
  for i ← 1 to k do
    Train(Fold_1, · · · , Fold_{i−1}, Fold_{i+1}, · · · , Fold_k)
    ȳ_pred ← Predict(Fold_i)
    score ← score + Performance(ȳ_pred, Fold_i)
  end for
  return score/k
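Algorithm 1 can be sketched in Python. The fit/predict/score interface below is a hypothetical stand-in for whatever learner is being tuned (here, a trivial majority-label predictor, purely for illustration):

```python
import numpy as np

def cross_validate(X, y, k, fit, predict, score):
    """k-fold CV: average the score over each fold held out in turn."""
    n = len(X)
    folds = np.array_split(np.random.permutation(n), k)
    total = 0.0
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])     # train on the other k-1 folds
        y_pred = predict(model, X[val_idx])          # predict the held-out fold
        total += score(y_pred, y[val_idx])
    return total / k

# Hypothetical learner: predict the majority training label for everything.
fit = lambda X, y: 1 if np.sum(y == 1) >= np.sum(y == -1) else -1
predict = lambda model, X: np.full(len(X), model)
score = lambda y_pred, y_true: np.mean(y_pred == y_true)

X = np.arange(20).reshape(10, 2).astype(float)
y = np.array([1, 1, 1, 1, 1, 1, 1, -1, -1, -1])
print(cross_validate(X, y, k=5, fit=fit, predict=predict, score=score))
```

Because the folds are drawn from a random permutation, the averaged score varies slightly from run to run; fixing a seed would make it reproducible.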
6 Feature Selection
The goal of feature selection is to systematically decide which features are most relevant to the problem
at hand. Feature selection can also be used for data visualization; for example, by plotting the two or three
most salient features together or projecting a high-dimensional dataset onto a lower-dimensional space.
There are simple feature selection methods that try to discard irrelevant features, such as filter and wrapper
methods; these often involve finding correlation between performance and features or subsets of features.
However, in this course we will be focusing on embedded methods, which offer a more "elegant" solution.
6.1 Embedded Feature Selection
Some models come with a feature selection mechanism built-in. Typically, this is achieved by encouraging
sparsity in the model parameters. For example, if the vector of weights learned for a linear regression model
contains many zeros, it is probably safe to discard the corresponding features, as they have little or no impact
on the final prediction. There are several common ways to encourage sparsity in a model. Here, we will
focus on the use of regularization as a means of feature selection, specifically the $\ell_0$ and $\ell_1$ regularization terms. Recall the definition of the $\ell_p$-norm:

$$\|\bar\theta\|_p = \sqrt[p]{\sum_i |\theta_i|^p} \tag{6}$$
The $\ell_0$-norm, while not strictly a norm, represents the optimal solution to our issue of sparsity by minimizing the number of non-zero elements in a vector. As this yields a binary optimization problem (i.e., NP-hard), we tend not to use it. Instead, we often use the $\ell_1$-norm to encourage sparsity.
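To see numerically why $\ell_1$ pushes weights to exactly zero while $\ell_2$ only shrinks them, compare their shrinkage (proximal) operators on a hypothetical weight vector. This is an illustrative sketch, not part of the course material:

```python
import numpy as np

def l1_prox(theta, lam):
    """Soft-thresholding: the proximal operator of lam * ||theta||_1.

    Any coordinate with |theta_i| <= lam is set exactly to zero.
    """
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def l2_shrink(theta, lam):
    """Proximal operator of (lam/2) * ||theta||_2^2: scales, never zeroes."""
    return theta / (1.0 + lam)

theta = np.array([3.0, 0.4, -0.2, -2.5])
print(l1_prox(theta, 0.5))    # small coordinates become exactly 0
print(l2_shrink(theta, 0.5))  # every coordinate stays non-zero
```

The $\ell_1$ operator zeroes out the two small coordinates, performing feature selection; the $\ell_2$ operator merely rescales all of them, which matches the flat-near-zero vs. vanishing-near-zero gradients in the figures below.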
Discussion Question 9. How does the $\ell_1$-norm encourage sparsity? (Hint: consider the gradient functions in Figure 7 and Figure 8.)
Figure 7: The $\ell_1$-norm and its gradient
Figure 8: The $\ell_2$-norm and its gradient