EECS 445 Discussion 9: Backpropagation, VC Dimensions, and
Unsupervised Methods

1 Understanding Backpropagation
1.1 Review: Chain Rule and Linearity of the Gradient
Backpropagation allows us to efficiently compute the gradient of the loss function $L$ with respect to each of the parameters (weights) of a neural network, so that we can perform SGD. The efficiency of the algorithm hinges on decomposing each gradient using the Chain Rule, which allows us to reuse partial derivatives.
Consider the following functions:
$$L(h) = \frac{h^2}{2}, \qquad h(y, z) = y - z, \qquad z(x, \theta) = \theta \cdot x + b$$
We could also rewrite L as a function of y, x, and θ:
$$L(y, x, \theta) = \frac{\big(h(y, z(x, \theta))\big)^2}{2} = \frac{\big(y - (\theta \cdot x + b)\big)^2}{2}$$
If we want to compute the gradient with respect to $\theta_i$, we have two choices: either compute it directly, or use the chain rule to decompose $L$ into simpler functions:
$$\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial h}\frac{\partial h}{\partial \theta_i} = \frac{\partial L}{\partial h}\frac{\partial h}{\partial z}\frac{\partial z}{\partial \theta_i} = (h)(-1)(x_i) = -(y - (\theta \cdot x + b))\,x_i$$
The benefit is that in order to compute $\frac{\partial L}{\partial \theta_j}$ for $j \neq i$, we only need to compute $\frac{\partial z}{\partial \theta_j}$:
$$\frac{\partial L}{\partial \theta_j} = \frac{\partial L}{\partial h}\frac{\partial h}{\partial z}\frac{\partial z}{\partial \theta_j} = -(y - (\theta \cdot x + b))\,x_j$$
The second useful property of the gradient that is needed to efficiently perform backpropagation is linearity.
Consider the following example:
$$L(h) = f(h) + g(h) + p(h)$$
$$\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial f}\frac{\partial f}{\partial \theta_i} + \frac{\partial L}{\partial g}\frac{\partial g}{\partial \theta_i} + \frac{\partial L}{\partial p}\frac{\partial p}{\partial \theta_i} = \frac{\partial L}{\partial f}\frac{\partial f}{\partial h}\frac{\partial h}{\partial z}\frac{\partial z}{\partial \theta_i} + \frac{\partial L}{\partial g}\frac{\partial g}{\partial h}\frac{\partial h}{\partial z}\frac{\partial z}{\partial \theta_i} + \frac{\partial L}{\partial p}\frac{\partial p}{\partial h}\frac{\partial h}{\partial z}\frac{\partial z}{\partial \theta_i}$$
We notice two things:
• The gradient with respect to $\theta_i$ is obtained by summing over the individual gradients of the terms that make up $L$.
• We are still able to reuse many partial derivatives.
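To make this concrete, here is a minimal NumPy sketch (the variable names and toy data are ours, not from the handout) that computes the full gradient for the running example by evaluating the shared factor $\frac{\partial L}{\partial h}\frac{\partial h}{\partial z}$ once and reusing it for every coordinate:

```python
import numpy as np

# Toy data for the running example L = (y - (theta . x + b))^2 / 2.
x = np.array([1.0, 2.0, 3.0])
theta = np.array([0.5, -0.2, 0.1])
b = 0.3
y = 1.0

# Forward pass through the decomposed functions.
z = theta @ x + b          # z(x, theta)
h = y - z                  # h(y, z)
L = 0.5 * h ** 2           # L(h)

# Backward pass: the shared factors are computed once...
dL_dh = h                  # dL/dh
dh_dz = -1.0               # dh/dz
shared = dL_dh * dh_dz     # dL/dz, reused for every theta_j

# ...and only dz/dtheta_j = x_j differs across coordinates.
grad_theta = shared * x    # vector of dL/dtheta_j = -(y - (theta.x + b)) x_j
grad_b = shared            # dz/db = 1

print(grad_theta, grad_b)
```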
1.2 Backprop for Multi-layer Perceptron
Consider a 3-class classification task with $\bar{x} \in \mathbb{R}^2$ and $y \in \{0, 1, 2\}$. We construct the following two-layer neural network (one hidden layer), with a sigmoid activation for the hidden layer and cross-entropy loss applied to the output layer:
[Figure: the inputs $x_1, x_2$ and a bias $+1$ feed into hidden units $h^{(2)}_1, h^{(2)}_2$ through weights $w^{(1)}_{ji}$; the hidden units and a bias $+1$ feed into outputs $z^{(3)}_1, z^{(3)}_2, z^{(3)}_3$ through weights $w^{(2)}_{kj}$.]
We can write down the forward and backward propagation of values in the following table:
Forward: $z^{(2)}_j = \sum_i w^{(1)}_{ji} x_i + b^{(1)}_j$
Backward: $\dfrac{\partial z^{(2)}_j}{\partial w^{(1)}_{ji}} = x_i$, $\quad \dfrac{\partial z^{(2)}_j}{\partial x_i} = w^{(1)}_{ji}$

Forward: $h^{(2)}_j = \sigma\big(z^{(2)}_j\big)$
Backward: $\dfrac{\partial h^{(2)}_j}{\partial z^{(2)}_j} = h^{(2)}_j \big(1 - h^{(2)}_j\big)$

Forward: $z^{(3)}_k = \sum_j w^{(2)}_{kj} h^{(2)}_j + b^{(2)}_k$
Backward: $\dfrac{\partial z^{(3)}_k}{\partial w^{(2)}_{kj}} = h^{(2)}_j$, $\quad \dfrac{\partial z^{(3)}_k}{\partial h^{(2)}_j} = w^{(2)}_{kj}$

Forward: $L(y, \bar{z}^{(3)}) = -\sum_k y_k \log\!\left(\dfrac{e^{z^{(3)}_k}}{\sum_{k'} e^{z^{(3)}_{k'}}}\right)$
Backward: $\dfrac{\partial L}{\partial z^{(3)}_k} = \dfrac{e^{z^{(3)}_k}}{\sum_{k'} e^{z^{(3)}_{k'}}} - y_k$
With these local partial derivatives worked out, we can easily write down the partial derivative of the loss with respect to each weight parameter:

$$\frac{\partial L}{\partial w^{(2)}_{11}} = \frac{\partial L}{\partial z^{(3)}_1}\frac{\partial z^{(3)}_1}{\partial w^{(2)}_{11}}$$
$$\frac{\partial L}{\partial w^{(2)}_{12}} = \frac{\partial L}{\partial z^{(3)}_1}\frac{\partial z^{(3)}_1}{\partial w^{(2)}_{12}}$$
These two partial derivatives differ only in the last term in the product.

$$\frac{\partial L}{\partial w^{(1)}_{11}} = \sum_k \frac{\partial L}{\partial z^{(3)}_k}\frac{\partial z^{(3)}_k}{\partial h^{(2)}_1}\frac{\partial h^{(2)}_1}{\partial z^{(2)}_1}\frac{\partial z^{(2)}_1}{\partial w^{(1)}_{11}}$$
$$\frac{\partial L}{\partial w^{(1)}_{12}} = \sum_k \frac{\partial L}{\partial z^{(3)}_k}\frac{\partial z^{(3)}_k}{\partial h^{(2)}_1}\frac{\partial h^{(2)}_1}{\partial z^{(2)}_1}\frac{\partial z^{(2)}_1}{\partial w^{(1)}_{12}}$$
These two partial derivatives also differ only in the last term in the product.
Backpropagation is just a fancy name for this organizational system for computing the $\frac{\partial L}{\partial w^{(\cdot)}_{\cdot}}$'s neatly and efficiently, by observing that local partial derivatives are shared. If you remember dynamic programming from EECS 281, the trick here is exactly the same.
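The table above translates almost line for line into code. The following is a minimal NumPy sketch of one forward and backward pass for this network; the random initialization and variable names are our own choices, not part of the handout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Shapes for the network in Section 1.2: 2 inputs, 2 hidden units, 3 outputs.
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # w^(1)_{ji}, b^(1)_j
W2, b2 = rng.normal(size=(3, 2)), np.zeros(3)   # w^(2)_{kj}, b^(2)_k

x = np.array([0.5, -1.0])
y = np.array([0.0, 1.0, 0.0])        # one-hot label for class 1

# Forward pass (left column of the table).
z2 = W1 @ x + b1                     # z^(2)_j
h2 = sigmoid(z2)                     # h^(2)_j
z3 = W2 @ h2 + b2                    # z^(3)_k
p = np.exp(z3) / np.sum(np.exp(z3))  # softmax probabilities
L = -np.sum(y * np.log(p))           # cross-entropy loss

# Backward pass (right column of the table).
dz3 = p - y                          # dL/dz^(3)_k
dW2 = np.outer(dz3, h2)              # dL/dw^(2)_{kj} = dz3_k * h^(2)_j
db2 = dz3
dh2 = W2.T @ dz3                     # sum_k dz3_k * w^(2)_{kj}
dz2 = dh2 * h2 * (1 - h2)            # chain through the sigmoid
dW1 = np.outer(dz2, x)               # dL/dw^(1)_{ji} = dz2_j * x_i
db1 = dz2
```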
1.3 (Optional) Backprop for Convolutional Neural Networks
While the shapes may sometimes get messy, there is almost no difference between backprop in fully connected
layers and convolutional layers. Let’s take a look at the example below.
$$X = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix} \qquad W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix} \qquad H = \begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix}$$
Assume X is our input, W is our filter, and H is the result of the convolution of the two (no activation
applied).
Discussion Question 1. Write the formula for $h_{ij}$ in terms of the $x_{ij}$'s and $w_{ij}$'s.
We want to use this information to derive a method for computing dW, defined below, whose entries
are the partial derivatives of a loss function L with respect to each filter weight wij .
Suppose we have a way of computing the partial derivatives $dh_{ij} = \frac{\partial L}{\partial h_{ij}}$, and that they have been passed to us from the next layer during backpropagation. We can collect these partial derivatives into a matrix $dH$, which represents the gradient of the loss function with respect to the convolution output.
Discussion Question 2. Write an expression for dW using dhij ’s and xij ’s in matrix form.
Hint: $\dfrac{\partial L}{\partial w_{ij}} = \dfrac{\partial L}{\partial h_{11}}\dfrac{\partial h_{11}}{\partial w_{ij}} + \dfrac{\partial L}{\partial h_{12}}\dfrac{\partial h_{12}}{\partial w_{ij}} + \dfrac{\partial L}{\partial h_{21}}\dfrac{\partial h_{21}}{\partial w_{ij}} + \dfrac{\partial L}{\partial h_{22}}\dfrac{\partial h_{22}}{\partial w_{ij}}$
Discussion Question 3. What matrix operation using X and dH does dW correspond to?
Discussion Question 4. What would be the update rule for the filter matrix W in an SGD setting?
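As a way to sanity-check answers numerically, here is a minimal NumPy sketch of this setup. It assumes the "convolution" here is the usual unflipped sliding dot product used in deep learning libraries, and the random data is made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 3))   # 3x3 input, as in the example
W = rng.normal(size=(2, 2))   # 2x2 filter

def conv_valid(X, W):
    """'Valid' convolution as a sliding dot product (no filter flipping)."""
    H = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            H[i, j] = np.sum(X[i:i+2, j:j+2] * W)
    return H

H = conv_valid(X, W)

# Pretend the next layer handed us dH = dL/dH; here it is made up.
dH = rng.normal(size=(2, 2))

# Chain rule from the hint: dL/dw_ij = sum over output positions (k, l) of
# (dL/dh_kl) * (dh_kl/dw_ij), and dh_kl/dw_ij = x_{k+i, l+j}.
dW = np.zeros_like(W)
for i in range(2):
    for j in range(2):
        dW[i, j] = np.sum(dH * X[i:i+2, j:j+2])

print(dW)
```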
2 Unsupervised Learning Overview
Thus far throughout the course, we have focused on supervised learning methods, which learn a mapping
from inputs $\{\bar{x}^{(i)}\}_{i=1}^{n}$ to some semantic labels $\{y^{(i)}\}_{i=1}^{n}$ in either a discrete or continuous domain. In practice,
these semantic labels can pose a problem: they may require arduous human annotation, and can be a source
of noise. Here, we discuss powerful methods for unsupervised learning, which are attractive given their lack
of reliance on explicit labels. We focus on two applications: clustering and matrix completion. Common to
both of these is a notion of similarity or affinity between data points. Without labels, it is this similarity
between data points that allows us to extract meaningful information from our data.
3 Clustering
The goal of a clustering algorithm is to uncover underlying structure within our data. For example, consider
a data distribution consisting of points sampled from one of three Gaussian distributions (Figure 1a). In
this case, the goal of a clustering algorithm is to assign each point to the cluster representing the Gaussian
distribution from which it was sampled.
Here, the data clearly fits into one of three distributions, so when fitting our unsupervised algorithm it is a
natural choice to search for a division of data into three clusters. However, it is not always the case that we
know the correct number of clusters. To determine a suitable value for k, the number of clusters, we define
a metric for evaluating a cluster and perform a hyperparameter search for a value of k that maximizes this
value under the constraint that we prefer “simpler” solutions with fewer clusters. Visualizing the data by creating a low-dimensional representation (via PCA or t-SNE, for instance) can also provide some intuition for finding a sensible number of clusters. Naive clustering methods such as k-means perform very well on globular clusters with similar variance (Figure 1b). Other clustering methods are needed for more complex distributions.

[Figure 1: (a) Data sampled from 3 independent Gaussians. (b) A k-means clustering on independent Gaussians.]
3.1 K-Means Clustering
The typical objective of clustering is to minimize the distance between points in the same cluster and
maximize the distance between points in different clusters. Since this optimal clustering cannot be found efficiently, k-means serves as a relatively fast, easy-to-understand approximation. The algorithm is as follows:
Given a data set with $n$ points, $\bar{x}^{(1)}, \ldots, \bar{x}^{(n)}$:
1. Randomly choose $k$ cluster means, $\bar{\mu}^{(1)}, \ldots, \bar{\mu}^{(k)}$
2. Iterate between two main steps until convergence:
(a) Assign each data point $\bar{x}^{(j)}$ to the cluster $i$ whose mean $\bar{\mu}^{(i)}$ is closest to $\bar{x}^{(j)}$
(b) Recompute each cluster mean $\bar{\mu}^{(i)}$ based on all data points now assigned to cluster $i$
K-means converges when the cluster assignments or means do not change between iterations.
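A minimal NumPy sketch of this loop might look like the following; the `kmeans` helper and the random toy data are our own, not from the handout.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means on the rows of X (a hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k initial cluster means from the data points.
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    assign = None
    for _ in range(n_iters):
        # Step 2(a): assign each point to the closest mean; np.argmin breaks
        # ties in favor of the lower cluster index, matching the tie rule above.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # converged: assignments did not change
        assign = new_assign
        # Step 2(b): recompute each cluster mean from its assigned points.
        for i in range(k):
            if np.any(assign == i):
                mu[i] = X[assign == i].mean(axis=0)
    return mu, assign

X = np.random.default_rng(1).normal(size=(30, 2))
means, labels = kmeans(X, k=3)
```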
Discussion Question 5. Consider the data plotted in the figure below, which consists of 22 points that
make up a smiley face.
K-means clustering (with $k = 3$ and using Euclidean distance) is initialized with three centroids at coordinates $\bar{z}^{(1)} = [2, 2]^T$, $\bar{z}^{(2)} = [8, 2]^T$, and $\bar{z}^{(3)} = [5, 7]^T$.
1. What are the final clusters and corresponding centroids after convergence? Ties are broken consistently, where cluster i is favored over cluster j in a tie if i < j.
2. Would we be able to cluster our smiley face into two ‘eye’ clusters and one ‘mouth’ cluster by
changing our centroid initialization? Why or why not?
Discussion Question 6. Consider two more example data sets in the following two figures:
For each figure, provide initial values for centroids (k = 2) for which k-means will not converge to
an optimal result, where the optimal result is the clustering that best captures the trend in the data. Explain why this initialization, or the data in general, causes an undesirable outcome.
3.2 Spectral Clustering
Spectral clustering is, at its core, a graph partitioning algorithm. Specifically, we consider each data point
to be a node and construct edges connecting these nodes. We define $S_{i,j}$ to be the (symmetric) similarity between points $\bar{x}^{(i)}$ and $\bar{x}^{(j)}$. One common technique for determining similarity is the cosine similarity metric defined in equation (1).
$$S_{i,j} = \frac{\bar{x}^{(i)} \cdot \bar{x}^{(j)}}{\|\bar{x}^{(i)}\|_2 \, \|\bar{x}^{(j)}\|_2} \tag{1}$$
We weight the edges of the graph with their similarity value, and prune edges for which the similarity is below some threshold value $\epsilon$ to obtain the graph adjacency matrix.
$$w_{i,j} = \begin{cases} S_{i,j}, & \text{if } S_{i,j} > \epsilon \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
The problem of clustering is then equivalent to finding a partitioning of the graph into $k$ connected components that minimizes the cost of those cuts. Define the diagonal matrix $D$ such that $D_{i,i}$ is the degree of graph node $i$ and $D_{i,j} = 0$ for $j \neq i$.
$$D_{i,i} = \sum_j w_{i,j} \tag{3}$$
We construct the unnormalized graph Laplacian L as follows.
$$L = D - W \tag{4}$$
Intuitively, we want to split our graph in a manner that has a minimal effect on the values of the graph Laplacian. Thus, we wish to perform our partition using the $k$ eigenvectors corresponding to the smallest $k$ eigenvalues of $L$. Because $L$ is symmetric and positive semi-definite, we know that these eigenvalues must be non-negative and real. Let $\bar{u}^{(i)} \in \mathbb{R}^n$ be the eigenvector corresponding to the $i$th smallest eigenvalue and define the matrix $U$ as follows.
$$U = \begin{bmatrix} | & | & & | \\ \bar{u}^{(1)} & \bar{u}^{(2)} & \cdots & \bar{u}^{(k)} \\ | & | & & | \end{bmatrix} \tag{5}$$
We can then find a clustering of our original points by applying another clustering algorithm (e.g., k-means)
to the row vectors of U. This clustering represents both our optimal graph cut approximation as well as the
clustering of data points within the input space.
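Putting equations (1) through (5) together, a minimal NumPy sketch of the pipeline up to the final k-means step might look like the following; the threshold value, the use of np.linalg.eigh, and the helper name are our own assumptions rather than part of the handout.

```python
import numpy as np

def spectral_embedding(X, k, eps=0.1):
    """Unnormalized spectral embedding following equations (1)-(5)."""
    # Equation (1): cosine similarity between all pairs of points.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    S = (X @ X.T) / (norms * norms.T)
    # Equation (2): threshold to get the weighted adjacency matrix
    # (self-loops are zeroed out here, an assumption of this sketch).
    W = np.where(S > eps, S, 0.0)
    np.fill_diagonal(W, 0.0)
    # Equations (3)-(4): degree matrix and unnormalized Laplacian.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Equation (5): eigenvectors for the k smallest eigenvalues
    # (eigh returns eigenvalues in ascending order for a symmetric matrix).
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    # The rows of U would then be clustered with k-means,
    # e.g. the sketch from Section 3.1.
    return U

X = np.random.default_rng(2).normal(size=(40, 3))
U = spectral_embedding(X, k=2)
```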
Discussion Question 7. Apply spectral clustering to the following graph. Find the clusters corresponding to k = 2 and k = 3.
[Figure: a weighted graph on the nodes 1–5; the edge weights (which appear to be 1/3, 1/5, 1/2, 1/5, and 1/4) and the edge connections are shown in the original figure.]