CSC321 Winter 2018 Homework 4
Submission: You must submit your solutions as a PDF file through MarkUs¹. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.

Late Submission: MarkUs will remain open until 2 days after the deadline; until that time, you should submit through MarkUs. If you want to submit the assignment more than 2 days late, please e-mail it to csc321staff@cs.toronto.edu. The reason for this is that MarkUs won't let us collect the homeworks until the late period has ended, and we want to be able to return them to you in a timely manner.

Weekly homeworks are individual work. See the Course Information handout² for detailed policies.
1. Gradient descent. [5pts] We can get quite a bit of insight into the behavior of gradient descent by looking at how it behaves on quadratic functions. Suppose we are trying to optimize a quadratic function

       C(\theta) = \frac{a_1}{2}(\theta_1 - r_1)^2 + \cdots + \frac{a_N}{2}(\theta_N - r_N)^2,

   with each a_i > 0. We can exactly solve for the dynamics of gradient descent. In other words, we can find an exact formula for \theta_i^{(t)}, where t is the number of gradient descent updates.
   (a) [1pt] Derive the gradient descent update rule for each \theta_i with learning rate \alpha. It should have the form

           \theta_i^{(t+1)} = \cdots,

       where the right-hand side is some function of the previous value \theta_i^{(t)}, as well as r_i, a_i, and \alpha. (It's an interesting and useful fact that the different \theta_i's evolve independently, so we can analyze a single coordinate at a time.)
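   The per-coordinate update can be checked numerically. The sketch below (not part of the assignment; the values of a, r, alpha, and the starting point are arbitrary choices) runs gradient descent on a diagonal quadratic and shows each coordinate converging to its own r_i independently:

   ```python
   import numpy as np

   # Diagonal quadratic C(theta) = sum_i (a_i / 2) * (theta_i - r_i)^2.
   a = np.array([1.0, 10.0])      # curvatures a_i > 0 (illustrative)
   r = np.array([2.0, -1.0])      # per-coordinate minima r_i
   alpha = 0.05                   # learning rate
   theta = np.zeros(2)            # initial theta^(0)

   for t in range(100):
       grad = a * (theta - r)     # dC/dtheta_i = a_i (theta_i - r_i)
       theta = theta - alpha * grad

   # Each coordinate approaches its own r_i, independently of the others.
   print(theta)                   # close to [2.0, -1.0]
   ```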
   (b) [2pts] Now let's introduce the error e_i^{(t)} = \theta_i^{(t)} - r_i. Take your update rule from the previous part, and write a recurrence for the errors. It should have the form

           e_i^{(t+1)} = \cdots,

       where the right-hand side is some function of e_i^{(t)}, a_i, and \alpha.
   (c) [1pt] Solve this recurrence to obtain an explicit formula for e_i^{(t)} in terms of the initial error e_i^{(0)}. For what values of \alpha is the procedure stable (the errors decay over time), and for what values is it unstable (the errors grow over time)?
Aside: your answer will indicate that large learning rates are more unstable than small
ones, and that high curvature dimensions are more unstable than low curvature ones.
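       You can check your stability threshold empirically. The sketch below (illustrative values only) runs gradient descent on a single coordinate with curvature a for several learning rates; the threshold you derive should match where the simulated error switches from decaying to growing:

       ```python
       # Single-coordinate gradient descent on C = (a/2)(theta - r)^2.
       a, r = 4.0, 0.0
       for alpha in (0.1, 0.45, 0.6):
           theta = 1.0                   # so the initial error e^(0) = 1
           for _ in range(20):
               theta = theta - alpha * a * (theta - r)
           print(f"alpha={alpha}: |e^(20)| = {abs(theta - r):.3g}")
       ```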
   (d) [1pt] Using your answer for the previous part, write an explicit formula for the cost C(\theta^{(t)}) as a function of the initial values \theta^{(0)}. (You can write it as a summation over indices, i.e., you don't need to vectorize it.) As t \to \infty, which term comes to dominate?
¹ https://markus.teach.cs.toronto.edu/csc321-2018-01
² http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/syllabus.pdf
       Aside: you'll find that if you use the optimal \alpha, the asymptotic behavior roughly depends on the condition number

           \kappa = \frac{\max_i a_i}{\min_i a_i}.

       This supports the claim that narrow ravines are problematic for gradient descent.
   (e) [Optional (not marked), and advanced] This part is optional, but you may find it interesting. We'll make use of eigendecompositions of symmetric matrices; see MIT OpenCourseWare for a refresher:
       https://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/positive-definite-matrices-and-applications/symmetric-matrices-and-positive-definiteness/
       It turns out we've actually just analyzed the fully general quadratic case. I.e., suppose we try to minimize a cost function of the form

           C(\theta) = \frac{1}{2}(\theta - r)^\top A (\theta - r),

       where A is a symmetric positive definite matrix, i.e. a symmetric matrix with all positive eigenvalues. (This is the general form for a quadratic function which curves upwards.) Determine the gradient descent update for \theta in vectorized form. Then write a recurrence for the error vector e = \theta - r, similarly to Part (1b). It will have the form

           e^{(t+1)} = B e^{(t)},

       where B is a symmetric matrix. Determine the eigenvectors and eigenvalues of B in terms of the eigenvectors and eigenvalues of A, and use this to find an explicit form for e^{(t)} and for C(\theta^{(t)}) in terms of \theta^{(0)}. The result will be closely related to your answer from Part (1d).
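       As a numerical sanity check (illustrative values; the matrix A, r, and alpha below are arbitrary choices), the sketch below runs the vectorized update on a small symmetric positive definite quadratic and verifies that the error vector is multiplied by one fixed matrix at every step:

       ```python
       import numpy as np

       # Gradient descent on C(theta) = 1/2 (theta - r)^T A (theta - r),
       # whose gradient is A (theta - r).
       A = np.array([[3.0, 1.0],
                     [1.0, 2.0]])          # symmetric positive definite
       r = np.array([1.0, -1.0])
       alpha = 0.1
       theta = np.zeros(2)

       B = np.eye(2) - alpha * A           # candidate recurrence matrix
       e = theta - r
       for _ in range(200):
           theta = theta - alpha * (A @ (theta - r))
           e = B @ e                       # should track theta - r

       print(np.allclose(e, theta - r))    # True
       print(theta)                        # close to r
       ```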
2. Dropout. [5pts] For this question, you may wish to review the properties of expectation and variance: https://metacademy.org/graphs/concepts/expectation_and_variance

   Dropout has an interesting interpretation in the case of linear regression. Recall that the predictions are made stochastically as:

       y = \sum_j m_j w_j x_j,

   where the m_j's are all i.i.d. (independent and identically distributed) Bernoulli random variables with expectation 1/2. (I.e., they are independent for every input dimension and every data point.) We would like to minimize the cost

       E = \frac{1}{2N} \sum_{i=1}^N E\big[(y^{(i)} - t^{(i)})^2\big],    (1)

   where the expectation is with respect to the m_j^{(i)}'s.

   Now we show that this is equivalent to a regularized linear regression problem:
Now we show that this is equivalent to a regularized linear regression problem:
(a) [2pts] Find expressions for E[y] and Var[y] for a given x and w.
   (b) [1pt] Determine \tilde{w}_j as a function of w_j such that

           E[y] = \tilde{y} = \sum_j \tilde{w}_j x_j.

       Here, \tilde{y} can be thought of as (deterministic) predictions made by a different model.
   (c) [2pts] Using the model from the previous part, show that the cost E (Eqn. 1) can be written as

           E = \frac{1}{2N} \sum_{i=1}^N (\tilde{y}^{(i)} - t^{(i)})^2 + R(\tilde{w}_1, \ldots, \tilde{w}_D),

       where R is a function of the \tilde{w}_j's which does not involve an expectation. I.e., give an expression for R. (Note that R will depend on the data, so we call it a "data-dependent regularizer.")
Hint: write the cost in terms of the mean and variance formulas from part (a). For
inspiration, you may wish to refer to the derivation of the bias/variance decomposition
from Lecture 9.
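   A quick Monte Carlo sanity check can catch mistakes in parts (a) and (b). The sketch below (the weights, input, and sample count are arbitrary choices) samples many dropout masks for one input and estimates the mean and variance of y; the estimates should match the closed-form expressions you derive:

   ```python
   import numpy as np

   # Stochastic dropout predictions y = sum_j m_j w_j x_j,
   # with m_j ~ Bernoulli(1/2), independent across dimensions.
   rng = np.random.default_rng(0)
   w = np.array([1.0, -2.0, 0.5])
   x = np.array([3.0, 1.0, 2.0])

   masks = rng.binomial(1, 0.5, size=(200_000, 3))  # one mask per sample
   y = (masks * (w * x)).sum(axis=1)
   print(y.mean(), y.var())          # compare against your formulas
   ```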