EECS 445 Discussion 3: Enhancing Linear Models

Introduction
This discussion focuses on three topics:
1. Loss Functions
2. Convexity and Gradient Descent
3. Regularization
1 Loss Functions
Recall that Training Error is defined as follows,
E_n(\bar{\theta}) = \frac{1}{n} \sum_{i=1}^{n} [[\, y^{(i)} \neq h(\bar{x}^{(i)}; \bar{\theta}) \,]]
Intuitively, training error is the fraction of training examples for which the classifier predicts a wrong label.
On the other hand, as an approximation of training error, we define Empirical Risk based on some loss
function,
R_n(\bar{\theta}) = \frac{1}{n} \sum_{i=1}^{n} \text{loss}\left( y^{(i)} (\bar{x}^{(i)} \cdot \bar{\theta}) \right)
So far we have seen 3 loss functions, which are all suitable for a (binary) classification task:
• loss_{0−1}(z) = [[z ≤ 0]] (0-1 Loss)
• loss_h(z) = max{1 − z, 0} (Hinge Loss)
• loss_{log}(z) = log_2(1 + e^{−z}) (Logistic Loss)
where z = y(x̄ · θ̄), with label y ∈ {−1, +1}. Interpretation for z = y(x̄ · θ̄): “a numerical quantity that indicates whether or not you’re right, and to what extent you are right or wrong.”
These loss functions are plotted on the same set of axes.
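For reference, here is a minimal NumPy sketch of the three loss functions as defined above (the function names are our own):

import numpy as np

def loss_01(z):
    # 0-1 loss: 1 if z <= 0 (wrong prediction, or a point exactly on the boundary), else 0
    return np.where(z <= 0, 1.0, 0.0)

def loss_hinge(z):
    # hinge loss: max{1 - z, 0}
    return np.maximum(1.0 - z, 0.0)

def loss_logistic(z):
    # logistic loss: log2(1 + e^(-z))
    return np.log2(1.0 + np.exp(-z))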
Why do we not use 0-1 loss in practice?
• it is non-convex
• it is non-smooth (at origin)
• the derivative is zero (where differentiable)
These properties make it infeasible to implement computationally efficient algorithms for solving the optimization problem in which we seek the global minimum.
Properties of good loss functions:
• small value (or zero) for correct classification, large/increasing value for incorrect classification
• convex: the line segment between any two points on the graph of the function lies above or on the graph, which is useful for the optimization problem setup
• mostly smooth and differentiable, so the gradient is easy to compute
• the derivative is nonzero (where it exists), especially for negative values of z
Exercise
Given the following vectors and their labels:
• x̄^(1) = [0, −1]^T, y^(1) = −1
• x̄^(2) = [0, 3]^T, y^(2) = +1
• x̄^(3) = [1, 2]^T, y^(3) = −1
• x̄^(4) = [1.5, 0]^T, y^(4) = +1
Consider θ̄ = [−1, 1, 1]^T, where the first element represents an offset.
Calculate the empirical risk based on
(i) 0-1 loss;
(ii) hinge loss;
(iii) logistic loss.
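To check your answers numerically, the following sketch reuses the loss functions from the code above and assumes each x̄^(i) is augmented with a leading 1 so that the first entry of θ̄ acts as the offset:

import numpy as np  # loss_01, loss_hinge, loss_logistic defined as in the earlier sketch

# augmented feature vectors [1, x1, x2] and labels
X = np.array([[1.0, 0.0, -1.0],
              [1.0, 0.0,  3.0],
              [1.0, 1.0,  2.0],
              [1.0, 1.5,  0.0]])
y = np.array([-1.0, 1.0, -1.0, 1.0])
theta = np.array([-1.0, 1.0, 1.0])

z = y * (X @ theta)  # z^(i) = y^(i) (x̄^(i) · θ̄)
for name, loss in [("0-1", loss_01), ("hinge", loss_hinge), ("logistic", loss_logistic)]:
    print(name, np.mean(loss(z)))  # empirical risk R_n(θ̄) under each loss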
2 Convexity and Gradient Descent
Convexity and Optimization
A convex function is defined as a continuous function whose value at the midpoint of every interval in its
domain does not exceed the arithmetic mean of its values at the ends of the interval. Convex functions have
mathematical properties that make them easy to minimize or maximize. As a result, we like to use convex
functions as our loss functions.
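As a tiny numerical illustration of the midpoint property (the quadratic and the interval here are arbitrary choices):

f = lambda x: x ** 2                               # an example convex function
x1, x2 = -1.0, 3.0
midpoint_value = f((x1 + x2) / 2)                  # value at the midpoint of the interval
mean_of_ends = (f(x1) + f(x2)) / 2                 # arithmetic mean of the values at the ends
print(midpoint_value <= mean_of_ends)              # True for a convex function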
Figure 1: Convex Function in 2D
Figure 1 above presents an example of a convex function and an example of a concave function in 2D. Notice that, for the convex function, the value at the midpoint of the interval from x1 to x2 is lower than the arithmetic mean of the values at the two endpoints.
The goal of optimization is to find optima in our functions. Optima are points in the function where the
gradient is 0, or where it no longer exists. In low dimensional spaces optima can be visualized as the peaks
and valleys of the function. Minimizing the convex function through the descent of its gradient can then
be thought of as a ball “rolling down the hills” of the function that we are trying to minimize. When our
function is convex, all local optima are guaranteed to be global optima as well.
However, in this class, we will not always be dealing with the nicely-behaved family of convex functions.
When our function is not convex, we might encounter locally optimal solutions, meaning that we can no
longer guarantee that the solution is globally optimal.
Gradient descent is a first-order iterative optimization algorithm. It is useful when it is impossible, or
inefficient, to compute the exact value of the local/global minima analytically. In each iteration of the
algorithm, we take a step proportional to the negative of the gradient (or of the approximate gradient) of
the function at the current point.
GD vs SGD
A slight variation of (batch) Gradient Descent is Stochastic Gradient Descent (SGD). The pseudocode
for both algorithms is provided below.
Algorithm 1 Gradient Descent
  θ̄^(0) ← 0̄, k ← 0
  while convergence criteria not met do
    θ̄^(k+1) ← θ̄^(k) − η ∇_θ̄ R_n(θ̄) |_{θ̄ = θ̄^(k)}
    k++
  end while
  return θ̄
Algorithm 2 Stochastic Gradient Descent
  θ̄^(0) ← 0̄, k ← 0
  while convergence criteria not met do
    randomly shuffle points
    for t = 1...n do
      θ̄^(k+1) ← θ̄^(k) − η ∇_θ̄ loss(y^(t), h(x̄^(t); θ̄)) |_{θ̄ = θ̄^(k)}
      k++
    end for
  end while
  return θ̄
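As a concrete illustration of the two update rules, here is a minimal NumPy sketch for the hinge loss (the function names, the fixed iteration budget, and the choice of a zero subgradient at the hinge kink are our own; they are not dictated by the pseudocode above):

import numpy as np

def hinge_subgradient(theta, x, y):
    # subgradient of max{1 - y (theta . x), 0} with respect to theta
    return -y * x if y * (x @ theta) < 1 else np.zeros_like(theta)

def gradient_descent(X, y, eta=0.1, num_iters=100):
    # X: (n, d) matrix of (augmented) feature vectors, y: (n,) labels in {-1, +1}
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        # "true" gradient of R_n: average subgradient over all n examples
        grad = np.mean([hinge_subgradient(theta, x_i, y_i) for x_i, y_i in zip(X, y)], axis=0)
        theta = theta - eta * grad
    return theta

def stochastic_gradient_descent(X, y, eta=0.1, num_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        for i in rng.permutation(len(y)):          # randomly shuffle points
            # approximate gradient computed from a single example
            theta = theta - eta * hinge_subgradient(theta, X[i], y[i])
    return theta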
Differences
• In each iteration,
  – GD considers all examples, requires more computation, and its result is the “true” gradient
  – SGD considers only one example, requires less computation, and its result is an approximate gradient
• The empirical risk, R_n(θ̄^(k)), for
  – GD is monotonically decreasing by definition
  – SGD decreases in general, but fluctuates in the process
Figure 2 is an illustration of how the empirical risk changes as a function of the number of iterations.
Practical Notes
• SGD usually runs much faster than GD, because for each iteration of GD, the gradient is computed
using the whole dataset, whereas SGD only uses one example.
• In practice, we use mini-batch stochastic gradient descent, i.e. computing the gradient based on a subset
of the entire training set. It is faster than batch GD, while giving a more accurate approximation of
the true gradient than SGD.
• For stochastic gradient descent algorithms, it is often quite beneficial to keep track of the best solution obtained so far in addition to the current θ̄^(k). In other words, we would also keep θ̄^(i_k), where i_k = argmin_{i=1,...,k} R_n(θ̄^(i)). The empirical risk of the current solution θ̄^(k), i.e., R_n(θ̄^(k)), goes down in a noisy fashion, while R_n(θ̄^(i_k)) is monotonically decreasing (by definition!). If the algorithm is stopped at any point, we should report θ̄^(i_k) (the best solution so far) rather than θ̄^(k). Note that lim_{k→∞} R_n(θ̄^(i_k)) is not necessarily zero, as the points may not be linearly separable. A minimal code sketch of this bookkeeping appears after Figure 3 below.

Figure 2: R_n(θ̄^(k)) for GD and SGD ((a) GD, (b) SGD)
Figure 3: R_n(θ̄^(k)) and R_n(θ̄^(i_k)) for SGD ((a) SGD, (b) SGD “best so far”)
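Here is a minimal hinge-loss sketch of that bookkeeping (the function names are our own; in practice one might evaluate R_n less often, since each evaluation requires a full pass over the data):

import numpy as np

def empirical_risk(theta, X, y):
    # R_n(theta) under the hinge loss
    return np.mean(np.maximum(1.0 - y * (X @ theta), 0.0))

def sgd_with_best_so_far(X, y, eta=0.1, num_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    best_theta, best_risk = theta.copy(), empirical_risk(theta, X, y)
    for _ in range(num_epochs):
        for i in rng.permutation(len(y)):
            if y[i] * (X[i] @ theta) < 1:           # hinge subgradient step on one example
                theta = theta + eta * y[i] * X[i]
            risk = empirical_risk(theta, X, y)      # noisy over iterations
            if risk < best_risk:                    # best-so-far risk is monotone non-increasing
                best_theta, best_risk = theta.copy(), risk
    return best_theta                               # report the best solution so far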
• How do we know when the algorithm has converged? The convergence criterion, or stopping criterion, is often defined according to:
  – some max number of iterations is reached
  – when there’s only negligible change in θ̄
  – when there’s only negligible change in R_n(θ̄^(k))
• How do we set η?
  – If η is too large, the algorithm might “overshoot” the global minimum and fail to converge
  – If η is too small, it takes a long time to reach convergence
  – If the learning rate needs to be constant, we can use cross validation and select the learning rate (a hyperparameter) that performs well on average across the validation folds.
  – We can also make the step size a function of k. In our context, η_k = 1/(k+1) would ensure that our stochastic sub-gradient descent algorithm converges to the minimum of R_n(θ̄). Other choices are possible as well. In fact, any positive learning rate that satisfies

\sum_{k=0}^{\infty} \eta_k^2 < \infty \quad \text{and} \quad \sum_{k=0}^{\infty} \eta_k = \infty

would permit the algorithm to converge, though the rate may vary. A small sketch of such a schedule is given below.
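For instance, a decaying schedule can be plugged into the SGD sketch above by making η depend on the update count k (a minimal illustration of the η_k = 1/(k+1) choice mentioned above):

def eta_schedule(k):
    # eta_k = 1 / (k + 1): satisfies sum eta_k = infinity and sum eta_k^2 < infinity
    return 1.0 / (k + 1)

# inside an SGD loop, use eta_schedule(k) in place of a constant eta, e.g.
#     theta = theta - eta_schedule(k) * single_example_subgradient   # hypothetical subgradient value
#     k += 1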
Exercise
Denote x̄ = [1, x1, x2]^T and θ̄ = [b, θ1, θ2]^T. We have θ̄ · x̄ = b + θ1 x1 + θ2 x2. We will consider hinge loss for this exercise.
Given the following vectors and their labels:
• x̄^(1) = [0, −1]^T, y^(1) = −1
• x̄^(2) = [0, 3]^T, y^(2) = +1
• x̄^(3) = [1, 2]^T, y^(3) = −1
• x̄^(4) = [1.5, 0]^T, y^(4) = +1
And our initial decision boundary:
θ̄ · x̄ = −1 + x1 + x2, i.e. θ̄^(0) = [−1, 1, 1]^T
Update the parameters θ̄ with learning rate η = 1 by
(i) going through one iteration of the while-loop in GD;
(ii) going through one iteration of the while-loop in SGD, without shuffling the points.
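To check hand computations for this exercise, here is a minimal NumPy sketch of one GD step and one in-order SGD pass under the hinge loss (the choice of a zero subgradient at the kink z = 1 and the variable names are our own):

import numpy as np

# augmented points [1, x1, x2], labels, and initial parameters
X = np.array([[1.0, 0.0, -1.0],
              [1.0, 0.0,  3.0],
              [1.0, 1.0,  2.0],
              [1.0, 1.5,  0.0]])
y = np.array([-1.0, 1.0, -1.0, 1.0])
theta0 = np.array([-1.0, 1.0, 1.0])
eta = 1.0

def hinge_subgrad(theta, x, yi):
    # subgradient of max{1 - yi (theta . x), 0}; zero when yi (theta . x) >= 1
    return -yi * x if yi * (x @ theta) < 1 else np.zeros_like(theta)

# (i) one iteration of the GD while-loop: step along the average subgradient
grad = np.mean([hinge_subgrad(theta0, xi, yi) for xi, yi in zip(X, y)], axis=0)
theta_gd = theta0 - eta * grad
print("GD:", theta_gd)

# (ii) one iteration of the SGD while-loop: one pass over the points, in order
theta_sgd = theta0.copy()
for xi, yi in zip(X, y):
    theta_sgd = theta_sgd - eta * hinge_subgrad(theta_sgd, xi, yi)
print("SGD:", theta_sgd)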
3 Regularization
Nonlinear transforms, compared to linear transforms, typically increase the size of our hypothesis space and increase the power of our models to fit more complex data. This is useful in many cases, but sometimes our models are so powerful that they also start to fit the noise in the dataset; this is called overfitting.
The key concept is that we want our learning algorithms to be imaginative but not gullible. Regularization is about reducing gullibility. The key method to avoid overfitting (gullibility) is to learn from more data! We can provide more data by enlarging the training set or by telling the machine directly about the data we ourselves have seen. How do we tell it directly? Answer: Occam’s Razor.
Let’s work out a simple example: suppose we wish to do regression on some data. That is, we want to generalize a training set of output-input pairs S_n = {(y^(i), x⃗^(i))}_{i=1}^n, where y^(i) ∈ R and x⃗^(i) ∈ R^d, to a function f. Let’s only consider linear f’s: this is an assumption we are building into our model. Then f has the form f(x⃗) = θ⃗ · x⃗ for some θ⃗ ∈ R^d. We choose a way to measure error, say \sum_i (y^{(i)} - f(\vec{x}^{(i)}))^2.
But what happens if we keep θ⃗ simple (i.e., close to 0⃗)? Why is closer to zero simpler? If we were working in a 5-dimensional setting, then we could achieve a “simpler” (i.e., lower-dimensional) model by ignoring some dimensions, or by setting the weights associated with a subset of the dimensions to zero (or close to zero):

J(\vec{\theta}) = \frac{\lambda}{2} \lVert \vec{\theta} - \vec{0} \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \frac{\left( y^{(i)} - \vec{\theta} \cdot \vec{x}^{(i)} \right)^2}{2}    (1)
3.1 Minimizing the objective function J(θ⃗)
We want to minimize J(θ⃗). Let’s denote Y ∈ R^{n×1}, where each row of Y is a scalar y_i. Similarly, we have X ∈ R^{n×d}, where each row of X is a corresponding x⃗_i. We can then write J(θ⃗) more concisely:

J(\vec{\theta}) = \frac{\lambda}{2} \vec{\theta}^T \vec{\theta} + \frac{(Y - X\vec{\theta})^T (Y - X\vec{\theta})}{2}    (2)
To minimize J, we locate where its gradient (with respect to θ⃗) vanishes. By setting ∇_θ⃗ J(θ⃗) equal to zero, we can find an optimal θ⃗ (denoted θ⃗*) that results in our minimal loss. This derivation will be covered in the next lecture and discussion, but the closed-form solution is shown below:

\vec{\theta}^* = (\lambda I + X^T X)^{-1} X^T Y
This is regression via L2-regularized linear least-squares, also known as “ridge regression”.
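As a sanity check of the closed-form expression, here is a minimal NumPy sketch (the synthetic data and the value of λ are arbitrary choices for illustration):

import numpy as np

def ridge_closed_form(X, Y, lam):
    # theta* = (lambda I + X^T X)^{-1} X^T Y
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ Y)

# tiny synthetic example: n = 5 points in d = 2 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
Y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=5)
print(ridge_closed_form(X, Y, lam=0.1))   # should be close to [2, -1]

Using np.linalg.solve instead of forming the inverse explicitly computes the same θ⃗* and is numerically preferable.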
Preference towards a “simpler” solution has several advantages:
1. it lessens overfitting
2. it can increase accuracy
3. it makes underdetermined systems determined
Also, watch for underfitting, which occurs when we are too stubborn to learn much from our data. In
other words, the model may be too simplistic to capture the underlying distribution of the data, which can
happen if we set λ too high.
Another important advantage of this approach is that (λI + X^T X) is always invertible if λ > 0, while X^T X, the matrix we invert in the closed-form solution to least-squares regression without regularization, may not be.