$30
CSE 250B: Machine learning
Homework 3 — Generative models, exponential families, linear algebra
By the due date (midnight of Wednesday Jan 31), upload the PDF of your typewritten solutions to
gradescope. Problems 6–19 are intended as a review of basic linear algebra.
1. Handwritten digit recognition using a Gaussian generative model. Recall that you can obtain the
MNIST data from http://yann.lecun.com/exdb/mnist/index.html. In this problem, you will build
a classifier for this data by modeling each class as a multivariate (784-dimensional) Gaussian.
(a) Split the training set into two pieces – a training set of size 50000, and a separate validation set
of size 10000. Also load in the test data.
(b) Now fit a Gaussian generative model to the training data of 50000 points:
• Determine the class probabilities: what fraction π0 of the training points are digit 0, for
instance? Call these values π0, . . . , π9.
• Fit a Gaussian to each digit, by finding the mean and the covariance of the corresponding
data points. Let the Gaussian for the jth digit be Pj = N(µj , Σj ).
Using these two pieces of information, you can classify new images x using Bayes’ rule: simply
pick the digit j for which πjPj (x) is largest.
(c) One last step is needed: it is important to smooth the covariance matrices, and the usual way to
do this is to add in cI, where c is some constant and I is the identity matrix. What value of c is
right? Use the validation set to help you choose.
(d) Turn in the following details:
• A (very) short description of your procedure for choosing c.
• A graph showing the validation error for all the values of c you tried.
• Your final choice of c.
• Overall error rate on the MNIST test set.
2. Compute the entropy of each of the following random variables X.
(a) X is a coin of bias 2/3.
(b) X is the outcome of rolling a fair die.
(c) X = (X1, . . . , X10), where the Xi are independent and are each coins of bias 1/2.
(d) X = (X1, . . . , X10) where X1 is a coin of bias 1/2 and Xi = X1 + i.
3. Entropy of continuous distributions. As defined in class, a distribution P over a discrete set S has
entropy
H(P) = X
x∈S
P(x) log 1
P(x)
= EX∼P
log 1
P(X)
,
where X ∼ P means that random variable X is distributed according to P. The second term shows
how this notion can be generalized to the continuous setting: for a density p, we define the differential
entropy as
h(p) = Z
p(x) log 1
p(x)
dx.
3-1
CSE 250B Homework 3 — Generative models, exponential families, linear algebraWinter 2018
(a) Show that the differential entropy of the one-dimensional Gaussian with mean µ and variance σ
2
is 1
2
log(2πσ2
e).
(b) What is the differential entropy of the uniform distribution over [a, b]? Why is this suspicious
when b − a is small?
4. You come across a mysterious new language and find, after looking at some examples of words, that
• Each word is composed of characters ’a’ through ’z’
• Each word has a length of at most 15
• The average word length is 4
• 80% of words end with a vowel
• In 90% of words, every consonant is followed by a vowel
• The first letter of a word is ’z’ 50% of the time
You decide to use the maximum entropy approach to come up with a simple distribution for the words
in this language. What is the set S, and what features T would you choose? What is the functional
form of the maximum entropy solution?
5. Consider the family of distributions on the positive reals [0, ∞) parametrized by λ > 0, with density
p(x) = λe−λx. Show that this is an exponential family by exhibiting a suitable (S, π, T). What is the
natural parameter space Θ?
6. Find the unit vector in the same direction as x = (1, 2, 3).
7. Find all unit vectors in R
2
that are orthogonal to (1, 1).
8. How would you describe the set of all points x ∈ R
d with x · x = 25?
9. The function f(x) = 2x1 − x2 + 6x3 can be written as w · x for x ∈ R
3
. What is w?
10. For a certain pair of matrices A, B, the product AB has dimension 10 × 20. If A has 30 columns, what
are the dimensions of A and B?
11. We have n data points x
(1), . . . , x(n) ∈ R
d and we store them in a matrix X, one point per row.
(a) What is the dimension of X?
(b) What is the dimension of XXT
?
(c) What is the (i, j) entry of XXT
, simply?
12. Vector x has length 10. What is x
T xxT xxT x?
13. For x = (1, 3, 5) compute x
T x and xxT
.
14. Vectors x, y ∈ R
d both have length 2. If x
T y = 2, what is the angle between x and y?
15. The quadratic function f : R
3 → R given by
f(x) = 3x
2
1 + 2x1x2 − 4x1x3 + 6x
2
3
can be written in the form x
TMx for some symmetric matrix M. What is M?
16. Which of the following matrices is necessarily symmetric?
3-2
CSE 250B Homework 3 — Generative models, exponential families, linear algebraWinter 2018
(a) AAT
for arbitrary matrix A.
(b) AT A for arbitrary matrix A.
(c) A + AT
for arbitrary square matrix A.
(d) A − AT
for arbitrary square matrix A.
17. Let A = diag(1, 2, 3, 4, 5, 6, 7, 8).
(a) What is |A|?
(b) What is A−1
?
18. Vectors u1, . . . , ud ∈ R
d all have unit length and are orthogonal to each other. Let U be the d × d
matrix whose rows are the ui
.
(a) What is UUT
?
(b) What is U
−1
?
19. Matrix A =
1 2
3 z
is singular. What is z?
3-3