EECS 445 Discussion 6: Soft Margin SVMs and Decision Trees
1 Soft Margin SVMs
As we've seen in class, the traditional hard-margin SVM is too constrained for many purposes: if our
data set is not linearly separable, the hard-margin formulation has no feasible solution, so we turn to the soft margin
SVM. In addition to handling non-separable datasets, soft margin SVMs are also more robust to outliers in our data.
The key difference between the soft margin and hard margin SVMs is in the constraints. With hard margin
SVMs, our margins were defined by the hard constraint:
$$y^{(i)}(\vec{\theta} \cdot \vec{x}^{(i)} + b) \geq 1$$
In the soft margin SVM, we introduce $n$ slack variables $\xi_i$ for $i = 1, \dots, n$ (i.e., one per data point), and
rewrite our constraints in the following form:
$$y^{(i)}(\vec{\theta} \cdot \vec{x}^{(i)} + b) \geq 1 - \xi_i, \qquad \xi_i \geq 0$$
With the introduction of this new slack variable, we can rewrite our primal formulation to include the new
variable.
$$\begin{aligned}
\min_{\vec{\theta},\, b,\, \vec{\xi}} \quad & \frac{1}{2}\|\vec{\theta}\|^2 + C\sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y^{(i)}(\vec{\theta} \cdot \vec{x}^{(i)} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i = 1, \dots, n
\end{aligned}$$
What's so special about this slack variable? By including slack variables, we allow individual points to violate the
margin constraint. Any point that does not satisfy $y^{(i)}(\vec{\theta} \cdot \vec{x}^{(i)} + b) \geq 1$ incurs a
penalty of $C\xi_i$. If the point is within the margin but on the correct side of the hyperplane, $\xi_i$ lies
between 0 and 1 (i.e., $0 < \xi_i < 1$). If the point is on the wrong side of the hyperplane, $\xi_i$ will be greater than 1.
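One way to make the "cost proportional to the violation" idea concrete (a standard reformulation, stated here without derivation): at the optimum each slack variable equals $\xi_i = \max\{0,\; 1 - y^{(i)}(\vec{\theta} \cdot \vec{x}^{(i)} + b)\}$, so the primal above is equivalent to minimizing a hinge loss plus a regularization term:
$$\min_{\vec{\theta},\, b} \;\; \frac{1}{2}\|\vec{\theta}\|^2 + C\sum_{i=1}^{n} \max\!\left(0,\; 1 - y^{(i)}(\vec{\theta} \cdot \vec{x}^{(i)} + b)\right)$$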
By posing our problem in this way, we tell our classifier that we'd prefer a hyperplane that
correctly separates all of the data, but with our soft constraints, we can tolerate violations (each
one incurring a cost proportional to the amount by which it violates the margin). The hyperparameter $C$
controls the trade-off between two extremes: a high $C$ leads our classifier to focus on classifying every
example correctly (it can't afford to pay the high cost of $C\xi_i$), while a low $C$ prioritizes maximizing
the margin, with less regard for how many points are misclassified in the process. As with any hyperparameter,
there is no magical $C$; as you'll see in your project, we usually choose $C$ through cross-validation.
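For concreteness, here is a minimal sketch of tuning $C$ by cross-validation with scikit-learn; the toy data, the grid of $C$ values, and the scoring metric are all illustrative choices, not prescribed ones:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Toy data: replace with your own feature matrix X (n x d) and labels y in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=200) > 0, 1, -1)

# Search over several orders of magnitude for C; a linear kernel matches the
# soft margin formulation above (no feature mapping).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best C:", search.best_params_["C"])
print("Cross-validated accuracy:", search.best_score_)
```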
1.1 The Dual for Soft Margin SVMs
The dual formulation for the soft margin SVM is as follows:
$$\begin{aligned}
\max_{\vec{\alpha}} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i\alpha_j y^{(i)}y^{(j)}\, \vec{x}^{(i)} \cdot \vec{x}^{(j)} \\
\text{subject to} \quad & 0 \leq \alpha_i \leq C \quad \forall i = 1, \dots, n \\
& \sum_{i=1}^{n} \alpha_i y^{(i)} = 0
\end{aligned} \tag{1}$$
As seen in Equation (1), the objective function is identical to the hard margin case; only our constraints
differ. Although $\alpha_i \geq 0$ as before, we now also require $\alpha_i \leq C$.
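To make Equation (1) concrete, here is a minimal sketch of solving the soft margin dual directly as a quadratic program with CVXPY; the toy data, the choice of $C$, and the small ridge added for numerical stability are assumptions of this example:

```python
import numpy as np
import cvxpy as cp

# Toy, nearly-but-not-quite separable data; labels y must be in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=60) > 0, 1.0, -1.0)

n = X.shape[0]
C = 1.0

G = (y[:, None] * X) @ (y[:, None] * X).T      # G[i, j] = y_i y_j (x_i . x_j)
Q = 0.5 * (G + G.T) + 1e-8 * np.eye(n)         # symmetrize + tiny ridge so Q is numerically PSD

alpha = cp.Variable(n)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
# Recover the primal parameters: theta = sum_i alpha_i y_i x_i
theta = ((a * y)[:, None] * X).sum(axis=0)
# A support vector with 0 < alpha_i < C lies exactly on the margin, giving b
# (this sketch assumes at least one such point exists)
sv = np.where((a > 1e-5) & (a < C - 1e-5))[0][0]
b = y[sv] - X[sv] @ theta
print("theta:", theta, "b:", b)
```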
2 Model Interpretability
So far in Machine Learning, we have focused on finding models that can perform well in terms of minimizing their loss function or maximizing some performance metric such as accuracy or AUROC score.
However, there are instances in which understanding how the model performs classification/regression is just
as important as having a model that performs well. With the prevalence of neural networks and models that
work in high-dimensional feature spaces, model interpretability, meaning the ability to understand how a
model works, has quickly become an important question in current Machine Learning research.
Discussion Question 1. Can you think of a Machine Learning problem in which interpretability might
be important?
Suppose we have a linear classification model $h(\bar{x}; \bar{\theta}) = \text{sign}(\bar{\theta} \cdot \bar{x})$ with $\bar{x}, \bar{\theta} \in \mathbb{R}^3$, where we are
given three features and try to predict whether a person will like a movie. Let $x_1$ be the number of
dogs in the movie, $x_2$ the number of scenes where people are littering, and $x_3$ the number of
red objects in the movie.
Discussion Question 2. Suppose we learn the optimal $\bar{\theta}$ to be $\bar{\theta}^* = [3.0, -0.5, 0.0]$. Based on $\bar{\theta}^*$, can
we explain how the model is making classifications?
Though Support Vector Machines can learn very complex datasets, it is difficult to interpret how the model
makes decisions, especially when we learn the parameters in terms of the dual.
There is no doubt that SVMs have many use cases, but as we introduce the kernel trick and
kernels like the RBF kernel (which maps each training example to $\phi(\bar{x}) \in \mathbb{R}^\infty$), what we gain in speed and
performance we lose in interpretability.
In search of interpretability, we will focus on a very interpretable model: Decision
Trees. These models learn a sequence of rules that lead the model to a final decision on how to
classify the data (see Figure 1). To learn these models, we will first introduce the necessary concepts from
information theory.
Figure 1: Example of Decision Tree (image credit to Charlie Bickerton from towardsdatascience.com)
3 Entropy
Let’s introduce a function f : X −→ Y , which maps X to a continuous, discrete, or binary Y. We can define
a decision tree as our function, and we can specify that at each level, we will split our data on the best
feature. By split, we mean that whatever data we’re considering at the node, we’re going to separate it into
two ”buckets”: one bucket satisfies a certain condition, the other doesn’t.
A better split can be thought of as giving more information, so that we're more informed about the classification.
To get a mathematical measure of information gain, we first need a mathematical measure of uncertainty,
which we call entropy. Entropy is the measure of choice for uncertainty, and for decision trees, we will use
Shannon Entropy, which describes the expected number of bits needed to encode a randomly drawn value of
y. To define Shannon Entropy, we define p(+) as the proportion of positive training examples in the training
data, and p(−) as the proportion of negative training examples. Our entropy is then:
$$H = -\left[\, p(+)\log_2 p(+) + p(-)\log_2 p(-) \,\right]$$
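As a quick sketch (the function name and example probability are just illustrative), the binary entropy above can be computed directly:

```python
import math

def binary_entropy(p_pos: float) -> float:
    """Shannon entropy (in bits) of a binary random variable with P(+) = p_pos."""
    p_neg = 1.0 - p_pos
    # By convention, 0 * log2(0) = 0, so skip zero-probability terms
    terms = [p * math.log2(p) for p in (p_pos, p_neg) if p > 0]
    return -sum(terms)

print(binary_entropy(0.25))  # a fairly predictable coin, so less than 1 bit
```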
Graphically, we can display the entropy for a binary-valued random variable as follows.
Figure 2: Entropy versus Probability
Discussion Question 3. Let's say we have a coin with probability of success $p(+) = 0.5$.
What's the entropy associated with a single coin flip? What happens if we have a biased coin, with the
probability of success now being $p(+) = 0.7$?
Discussion Question 4. For this discussion question, we will work in the reverse direction of Discussion
Question 3. Consider a biased coin where you can choose the specific bias. What bias would
maximize the entropy of the coin flip experiment?
Similar to the binary-valued figure, if we measure the output probabilities of each class (or even over a
continuous range), we get a distribution. Entropy can be thought of as a measure of how "peaky" this distribution is:
if we are more certain about the outputs, our probability distribution will be more peaked (corresponding to
higher probabilities for those outputs), and entropy will be low. If we are less certain, our distribution will
be flatter (each output will have a similar probability), and entropy will be higher.
Discussion Question 5. Suppose two pictures show two probability distributions over a
continuous range of values: one uniform and one a peaked curve. Using the "peakiness" intuition above, which of the two distributions has more entropy?
Next week, we’ll see how we can use Entropy to calculate information gain, and how to use that metric to
build strong, interpretable classifiers.
4 Conditional Entropy
Recall the Shannon entropy, a measure of uncertainty that quantifies how many “bits” are needed to
encode the distribution produced by a set of random observations. Similar to the definitions used in physical
sciences, a high entropy value corresponds to data that is more disorderly and unpredictable. In the case
of a probability distribution, this corresponds to a more equal likelihood of each possible event; uniform
distributions maximize entropy. The general form of Shannon entropy for a multi-class setting is as follows.
$$H(X) = -\sum_i p(X = x_i)\log_2 p(X = x_i) \tag{2}$$
The idea of “bits” shows up in the base of the logarithm (i.e., log base 2), and is what distinguishes
“Shannon” entropy from other forms of entropy. You should convince yourself that the binary classification
case (e.g., xi ∈ {−1, 1}) yields the familiar form discussed in lecture.
Consider this: you are an up-and-coming musical artist, Young SVM, who has just been nominated for
the Album of the Year Grammy! You want to figure out how likely you are to win the award given that you
are a rap artist and that you’ve never before won a Grammy award. We want to use entropy to measure our
uncertainty for winning the award, but we also need to factor in this additional information of genre and
award status. To do so, we need to introduce the conditional entropy.
Conditional entropy simply factors in how your uncertainty changes given that some additional event
occurred. For example, we assume that taking a machine learning course reduces the uncertainty of obtaining
a job in machine learning. We formulate this intuition mathematically as follows.
$$\begin{aligned}
H(Y|X) &= -\sum_i \sum_j p(X = x_i, Y = y_j)\log_2 p(Y = y_j \mid X = x_i) \\
&= -\sum_i p(X = x_i) \sum_j p(Y = y_j \mid X = x_i)\log_2 p(Y = y_j \mid X = x_i)
\end{aligned} \tag{3}$$
You should convince yourself that the first line makes intuitive sense: the log expression describes the
number of bits needed to encode a conditional probability, and is weighted by the probability of those two
events ever occurring together (e.g., if some xi and yj never occur together, we shouldn’t waste bits trying
to represent them).
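As a small sketch of Equation (3) in code (the function name and the 2x2 table below are purely illustrative, not the Grammy data), conditional entropy can be computed from a table of joint counts:

```python
import math
import numpy as np

def conditional_entropy(joint_counts) -> float:
    """H(Y|X) in bits, where joint_counts[i][j] is the count of (X = x_i, Y = y_j)."""
    joint_counts = np.asarray(joint_counts, dtype=float)
    total = joint_counts.sum()
    h = 0.0
    for row in joint_counts:                    # one row per value x_i of X
        row_total = row.sum()
        if row_total == 0:
            continue
        for count in row:
            if count == 0:
                continue                        # 0 * log2(0) = 0 by convention
            p_joint = count / total             # p(X = x_i, Y = y_j)
            p_y_given_x = count / row_total     # p(Y = y_j | X = x_i)
            h -= p_joint * math.log2(p_y_given_x)
    return h

# Illustrative 2x2 table (rows: values of X, columns: values of Y)
print(conditional_entropy([[8, 2],
                           [3, 7]]))
```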
Rap Pop
Previous Award 5 6
No Previous Award 3 2
Figure 3: Distribution of Grammy awards by genre and award status
So how can Young SVM go about predicting whether he or she will win a Grammy? As with all machine
learning models, we need both data and a metric to optimize. Let’s assume that we’ve recovered data from
previous years’ Grammy award winners (Figure 3), and that there were a total of 10 nominees for each
category (i.e., 10 rap artists with previous awards, 10 pop artists with no previous award, etc.).
Our optimization criterion will focus on the selection of a single feature that, when used as the conditional
in equation 3, minimizes entropy (i.e., makes us more certain that the prediction is accurate). This is also
mathematically equivalent to maximizing the information gain, which is defined as follows.
Page 5
$$\begin{aligned}
\operatorname*{argmax}_{k}\; IG(X_k, Y) &= \operatorname*{argmax}_{k}\; H(Y) - H(Y|X_k) \\
&= \operatorname*{argmax}_{k}\; -H(Y|X_k) \\
&= \operatorname*{argmin}_{k}\; H(Y|X_k)
\end{aligned} \tag{4}$$
(The second equality holds because $H(Y)$ does not depend on $k$.)
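A minimal sketch of this selection rule in code (best_split is a made-up name, and it reuses the illustrative conditional_entropy helper sketched earlier):

```python
def best_split(joint_tables):
    """Pick the feature X_k minimizing H(Y | X_k), i.e., maximizing information gain.

    joint_tables maps each feature name to its joint count table with the label Y
    (rows: values of X_k, columns: values of Y).
    """
    return min(joint_tables, key=lambda k: conditional_entropy(joint_tables[k]))
```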
Discussion Question 6. What is the entropy of winning a Grammy award, H(A), without considering
any of our features?
Discussion Question 7. Compute the conditional entropies, H(A|G) and H(A|S), where G and S are
the random variables of genre and award status, respectively.
Discussion Question 8. Which of our features will be selected by our current model as maximizing
information gain?
Now we can select a single criterion to split our data on, but it would be rather wasteful to only ever
split on one feature. How can we leverage two or more of our features to improve our predictions? Enter the
decision tree.
5 Decision Trees
Another way of describing what we have done so far is finding the root node of a decision tree. To create
this first node, we had to construct a formulation of our optimization function, $IG(X, Y)$, that represents how
our uncertainty changes due to a feature selection. We want to recursively build a tree, so for each node we
must analyze how our distribution has changed after traversal from the parent node in order to create a new
optimization function. Let $X^*$ represent the optimal feature to select as the root of the tree, and assume that
$X^*$ can take on one of $n$ possible outcomes. Once we have done this, the table that we used to determine
the best split is no longer applicable, because we are now modeling a conditional distribution generated with
respect to $X^*$. Let's assume $X^* = S$ and focus only on the child node for which $S = \text{no award}$. We find
that our new table looks like the one in Figure 4.
Rap Pop
Previous Award 0 0
No Previous Award 3 2
Figure 4: Distribution of Grammy awards by genre within population of nominees without prior awards
Discussion Question 9. Without doing any calculations, which feature will we next split on, given that
$X^* = S$ and we are at the child node in which $S = \text{no award}$? Why?
Discussion Question 10. Would it ever be appropriate to split more than once on the same feature?
Hint: consider a real-valued feature such as height, weight, etc.
We have now constructed enough of our decision tree to infer whether Young SVM is likely to receive
a Grammy! Inference in a decision tree works by analyzing the population belonging to a child node, and
assigning to the node the majority label of the population (e.g., if most of the individuals in the population
won Grammy awards, then the model predicts that nominees sharing those features will also win an award).
Since a decision tree's leaves store the class proportions of their populations, we can also opt not to discretize our prediction and instead return a
probability of winning a Grammy. Let's do both.
Discussion Question 11. According to our model, will Young SVM—a rap artist with no previous
award—win a Grammy?
Discussion Question 12. What is the probability of Young SVM winning a Grammy award, as predicted
by our model?
Notice that in our inference, we know exactly why our model made the decision it did. Decision trees are
considered the gold standard of interpretability in machine learning for exactly this reason. Also note that
it is not always the case that we will split on every feature, as we did here. If we had hundreds of features,
splitting on all of them could potentially lead to overfitting. We mitigate this effect by setting a maximum
tree depth, which also acts as a form of feature selection by not splitting on features of negligible impact.
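As a brief sketch of this in practice (the dataset and depth here are arbitrary choices for illustration), scikit-learn lets us cap the tree depth directly:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" matches the information-gain splitting rule above;
# max_depth limits how many features the tree can split on along any path.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
print("Feature importances:", tree.feature_importances_)
```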
6 Ensemble Methods: Random Forests
The idea of a “power in numbers” underlies the intuition behind ensemble methods, which aggregate the
predictions of multiple classifiers and output a prediction based on their consensus.
One subset of ensemble methods uses a technique called bootstrapping (sampling with replacement)
and is given the name bagging, for "bootstrap aggregating". By sampling with replacement, some observations/features may be repeated in each training dataset. Random forests are one such example, where
each of the constituent classifiers is a decision tree. Bagging is a special case of the model averaging approach.
By averaging across multiple models, we get a more stable model and reduce estimation error without
increasing the structural error.
One underlying assumption of ensemble methods is that the predictions of each constituent classifier are
not identical. Another way of stating this is that we want our classifiers to be decorrelated. Two standard
methods exist for decorrelating decision trees:
1. Randomly sample training data, such that each decision tree is trained on slightly different data
2. Randomly sample features, such that each decision tree must split on a slightly different feature set
We will focus on the latter method: bootstrapping features. Assume we want to train N decision trees,
each with M features. The algorithm for implementing this random forest is then as follows.
for i = 1, . . . , N do
    Randomly sample M features (i.e., columns of the data matrix X) with replacement
    Train decision tree DT_i on this random sample
end for
Output ensemble {DT_1, . . . , DT_N}
To give a concrete example, imagine we wanted to use a Random Forest to classify the MNIST dataset.
This is a dataset of 28×28 images of numbers 0-9. Here, we flatten the image such that $\bar{x} \in \mathbb{R}^{784}$. By choosing
M random features, we are selecting M pixels and attempting to classify the correct number represented by
an image via a subset of its pixels. This is a sensible thing to do; for instance, the number 0 has smaller
pixel values in the center of the image, whereas a 1 has greater values in the center and smaller values as
you deviate from the center. Inference in a random forest is then conducted by running inference on all N
decision trees and returning the majority vote.
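Here is a minimal sketch of the feature-bagging forest described above, using scikit-learn's smaller 8×8 digits dataset as a stand-in for MNIST; the number of trees, the number of sampled features, and all names are illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)          # each image flattened to 64 pixel features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
N, M = 25, 16                                # N trees, each seeing M sampled pixel columns
forest = []

for _ in range(N):
    # Randomly sample M feature columns with replacement (feature bagging)
    cols = rng.integers(0, X_train.shape[1], size=M)
    tree = DecisionTreeClassifier(criterion="entropy")
    tree.fit(X_train[:, cols], y_train)
    forest.append((cols, tree))

# Inference: run every tree on its own feature subset and take a majority vote
votes = np.stack([tree.predict(X_test[:, cols]) for cols, tree in forest])
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), axis=0, arr=votes)
print("Ensemble test accuracy:", np.mean(majority == y_test))
```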