EECS 445 Discussion 10: Spectral Clustering, Hierarchical Clustering, and Collaborative Filtering

1 More on Spectral Clustering
Spectral clustering reformulates the clustering problem as a graph cut problem. For a weighted graph $G$, a cut is defined as a partitioning of the vertices of $G$ into two sets, $A$ and $\bar{A}$. The cost of the cut is the sum of the weights of the edges that have one endpoint in each set:
$$\text{cut}(A, \bar{A}) = \sum_{u \in A,\, v \in \bar{A}} w_{uv}$$
where $w_{uv}$ is the weight of the edge between vertices $u$ and $v$. Note that the definition of a partitioning implies that $A \cap \bar{A} = \emptyset$ and $A \cup \bar{A} = V$, where $V$ is the vertex set of $G$, i.e., with a single cut we are splitting the vertices into two groups.
We extend the definition of the cost of a cut to the cost of a partitioning into $k$ sets by defining the ratio cut:
$$\text{ratiocut}(A_1, \ldots, A_k) = \frac{1}{2} \sum_{i=1}^{k} \frac{\text{cut}(A_i, \bar{A}_i)}{|A_i|}$$
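As a quick illustration, the following is a minimal NumPy sketch that evaluates these two quantities for a given partitioning. The function names `cut_cost` and `ratio_cut` are ours, and the weight matrix is assumed to be symmetric with zero diagonal:

```python
import numpy as np

def cut_cost(W, A, A_bar):
    """cut(A, A-bar): sum of the weights of edges with one endpoint in A
    and the other in A-bar."""
    return W[np.ix_(A, A_bar)].sum()

def ratio_cut(W, clusters):
    """Ratio cut of a partitioning given as a list of index arrays A_1, ..., A_k."""
    n = W.shape[0]
    total = 0.0
    for A in clusters:
        A_bar = np.setdiff1d(np.arange(n), A)      # complement of A
        total += cut_cost(W, A, A_bar) / len(A)    # cut(A_i, A_i-bar) / |A_i|
    return 0.5 * total
```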
Spectral clustering first creates a weighted graph by having each data point be a vertex and each edge be
weighted by some similarity score between its endpoints (a higher weight should mean that the points are
more similar). It then clusters the points by looking for the partitioning that approximately minimizes the
ratio cut of the partitioning.
For example, let $k = 2$. We can define our partitioning as some vector $f \in \{-1, 1\}^n$, where a value of $1$ indicates the point is in one cluster while a value of $-1$ indicates the point is in the other cluster. Now we can compute the cost of the partitioning defined by $f$ as follows:
$$R = \frac{1}{2} \sum_{i,j} w_{ij} (f_i - f_j)^2$$
For simplicity we have omitted the normalization by the size of each set in the cost (see the original definition of ratio cut) from this formulation. Now, if we define $W$ to be our adjacency (weight) matrix and $D$ to be the degree matrix, which is zero off the diagonal and whose $i$th diagonal element is the degree of the $i$th vertex, then $L = D - W$ is called the Laplacian matrix. As it turns out, $R = f^T L f$, so we can reformulate our problem as finding
$$f^* = \arg\min_{f \in \{-1,1\}^n} f^T L f$$
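The identity $R = f^T L f$ is easy to check numerically; below is a small sketch using a hypothetical three-vertex weight matrix chosen only for illustration:

```python
import numpy as np

# Hypothetical symmetric weight matrix on three vertices (illustration only).
W = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])
D = np.diag(W.sum(axis=1))        # degree matrix
L = D - W                         # graph Laplacian

f = np.array([1., 1., -1.])       # a candidate partitioning vector

# Pairwise form: R = 1/2 * sum_{i,j} w_ij (f_i - f_j)^2
R_pairwise = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)

# Quadratic form: f^T L f
R_quadratic = f @ L @ f

assert np.isclose(R_pairwise, R_quadratic)   # both equal 16 for this example
```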
This combinatorial problem is NP-hard, so spectral clustering relaxes the formulation by allowing $f$ to take on any real value:
$$f^* = \arg\min_{f \in \mathbb{R}^n} f^T L f$$
and performs the minimization by clustering the rows of the matrix formed by the two eigenvectors of $L$ with the lowest eigenvalues. In general, to find $k$ clusters, we cluster the rows of the matrix formed by the $k$ eigenvectors of $L$ with the lowest eigenvalues.
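Putting the pieces together, here is a minimal sketch of the full procedure. The helper `spectral_clustering` is our own name, the simple k-means loop is included only to keep the sketch self-contained, and a library implementation (e.g., scikit-learn's SpectralClustering) would typically be preferred in practice:

```python
import numpy as np

def spectral_clustering(W, k, n_iter=100, seed=0):
    """Embed the vertices with the k lowest eigenvectors of L = D - W,
    then cluster the embedded rows with a plain k-means (Lloyd) loop."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    # np.linalg.eigh returns eigenvalues in ascending order for symmetric matrices,
    # so the first k columns are the eigenvectors with the lowest eigenvalues.
    _, eigvecs = np.linalg.eigh(L)
    X = eigvecs[:, :k]                       # n x k spectral embedding

    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each row to its nearest center, then recompute the centers
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```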
Discussion Question 1. Apply spectral clustering to the following graph. Find the clusters corresponding
to k = 2 and k = 3.
[Figure: a weighted graph on five vertices, labeled 1 through 5, used for Discussion Question 1.]
2 Hierarchical Clustering
In hierarchical clustering, we build a hierarchy of the points either by starting with every point in its own cluster and then repeatedly merging clusters together based on some criterion (“agglomerative clustering”) or by starting with one big cluster and then repeatedly splitting clusters up based on some criterion (“divisive clustering”). The hierarchy formed by this process can be visualized as a dendrogram. We can then use the information in the hierarchy to perform our clustering, as shown in the figure. The following are some common linkage criteria used for determining how to split or merge clusters, where $d(\bar{x}, \bar{y})$ is some distance measure between two vectors $\bar{x}$ and $\bar{y}$:
Single-linkage: $D(C_i, C_j) = \min_{\bar{x} \in C_i,\, \bar{y} \in C_j} d(\bar{x}, \bar{y})$

Complete-linkage: $D(C_i, C_j) = \max_{\bar{x} \in C_i,\, \bar{y} \in C_j} d(\bar{x}, \bar{y})$

Average-linkage: $D(C_i, C_j) = \frac{1}{|C_i|}\frac{1}{|C_j|} \sum_{\bar{x} \in C_i,\, \bar{y} \in C_j} d(\bar{x}, \bar{y})$
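The three criteria are straightforward to compute from the pairwise distances; below is a small NumPy sketch, where `linkage_distance` is our own name and Euclidean distance is assumed for $d$:

```python
import numpy as np

def linkage_distance(Ci, Cj, kind="single"):
    """Cluster-to-cluster distance D(Ci, Cj) for the three linkage criteria above.
    Ci and Cj are arrays with one point per row; d is the Euclidean distance."""
    dists = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1)  # all pairs d(x, y)
    if kind == "single":
        return dists.min()
    if kind == "complete":
        return dists.max()
    if kind == "average":
        return dists.mean()   # same as (1/|Ci|)(1/|Cj|) * sum of all pairwise distances
    raise ValueError(f"unknown linkage: {kind}")
```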
3 K Nearest Neighbors
In problems like matrix completion, we borrow data from similar rows to infer better estimates of the values of missing entries. In the example of recommender systems (e.g., Netflix), one method of finding similar users is known as K-Nearest-Neighbors (KNN prediction). The idea is to find the $k$ most “similar” users to the user in question and make predictions based on these users. This entails the following: 1) defining a metric of similarity, 2) calculating the similarity from the user in question to all other users, and 3) rating movie $i$ for the user in question based on the data from the $k$ most similar users.
First we must define some metric of similarity for two users $a$ and $b$. Following the example from lecture, let $R(a, b)$ be the set of movies rated by both users $a$ and $b$. Let the rows of matrix $Y$ represent user data and each column represent a specific movie, so that $Y_{ai}$ gives the rating of user $a$ for movie $i$. We write the average rating of user $a$ over the movies rated by both users as:
$$\tilde{Y}_{a:b} = \frac{1}{|R(a, b)|} \sum_{j \in R(a, b)} Y_{aj}.$$
Given this, we define the correlation between users $a$ and $b$ as a metric of similarity:
$$\text{sim}(a, b) := \frac{N}{D},$$
where $N$ captures how much the two users' ratings vary together,
$$N = \sum_{j \in R(a,b)} (Y_{aj} - \tilde{Y}_{a:b})(Y_{bj} - \tilde{Y}_{b:a}),$$
and $D$ captures how much the ratings vary individually,
$$D = \sqrt{\sum_{j \in R(a,b)} (Y_{aj} - \tilde{Y}_{a:b})^2 \sum_{j \in R(a,b)} (Y_{bj} - \tilde{Y}_{b:a})^2}.$$
This metric of similarity ranges in [-1, 1], where 0 represents a complete lack of linear correlation, while -1
and 1 represent a perfect linear relationship.
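As a concrete sketch, this similarity can be computed as below, assuming the ratings are stored in a NumPy array with missing entries marked as NaN (the function name `similarity` is ours):

```python
import numpy as np

def similarity(Y, a, b):
    """Correlation-style similarity sim(a, b), computed only over the
    movies in R(a, b), i.e. those rated by both users (NaN = missing)."""
    common = ~np.isnan(Y[a]) & ~np.isnan(Y[b])   # indicator of R(a, b)
    ya, yb = Y[a, common], Y[b, common]
    ya_centered = ya - ya.mean()                 # Y_aj - Ytilde_{a:b}
    yb_centered = yb - yb.mean()                 # Y_bj - Ytilde_{b:a}
    N = np.sum(ya_centered * yb_centered)
    D = np.sqrt(np.sum(ya_centered ** 2) * np.sum(yb_centered ** 2))
    return N / D
```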
Discussion Question 2. Given the following matrix $Y$, let the first row represent user $a$, the second row represent user $b$, and the third row represent user $c$. Calculate the similarity between users $a$ and $b$, and between users $a$ and $c$:
$$Y = \begin{bmatrix} 3 & 3 & ? & 6 & 8 \\ 2 & 5 & 1 & 10 & 7 \\ 5 & 1 & 3 & 3 & ? \end{bmatrix}$$
Finally, let $\text{KNN}(a, i)$ represent the $K$ most similar users to $a$ who have also rated movie $i$. Given the ratings of our $K$ “most similar” users, we predict user $a$'s rating of movie $i$ using the following equation:
$$\hat{Y}_{ai} = \bar{Y}_a + \frac{1}{\sum_{b \in \text{KNN}(a,i)} |\text{sim}(a, b)|} \sum_{b \in \text{KNN}(a,i)} \text{sim}(a, b)\,(Y_{bi} - \bar{Y}_b)$$
Note that while the expression here uses $Y_{bi} - \bar{Y}_b$ inside the summation, it is also valid to use $Y_{bi} - \tilde{Y}_{b:a}$, i.e., to compute a weighted average of how much each of the $k$ nearest neighbors' ($b$'s) ratings for movie $i$ vary from the average rating over movies rated by both $a$ and $b$, rather than how they vary from $b$'s overall average rating. This weighting scheme is, in the end, a heuristic, meaning it was not rigorously derived to be “optimal”, so there are possible variations that still make intuitive sense and are still valid.
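A sketch of the corresponding prediction step, reusing the `similarity` function defined above and again storing missing ratings as NaN (`knn_predict` is our own name):

```python
import numpy as np

def knn_predict(Y, a, i, k):
    """Predict user a's rating of movie i from the k most similar users
    who have also rated movie i."""
    # candidate neighbors: every other user with an observed rating for movie i
    candidates = [b for b in range(Y.shape[0]) if b != a and not np.isnan(Y[b, i])]
    sims = {b: similarity(Y, a, b) for b in candidates}
    neighbors = sorted(candidates, key=lambda b: sims[b], reverse=True)[:k]

    Y_a_bar = np.nanmean(Y[a])                   # user a's overall average rating
    norm = sum(abs(sims[b]) for b in neighbors)
    correction = sum(sims[b] * (Y[b, i] - np.nanmean(Y[b])) for b in neighbors)
    return Y_a_bar + correction / norm
```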
Discussion Question 3. Assume we are using the same rating matrix as before, with KNN for $K = 1$, such that $b$ is the most similar user to $a$. Give a prediction for $\hat{Y}_{a3}$.
4 Matrix Factorization
Recommendation systems were historically the result of substantial human effort. For example, the music
service Pandora used to hire experts to create curated playlists that corresponded to specific genres and
moods. Collaborative filtering both negates the need for substantial expert knowledge and improves upon
the quality of recommendations by solving a regression problem for the expected rating that a user would
give to a piece of content using only a metric of similarity between users. For instance, if user A and user
B liked very similar music and user A likes song X that B has not heard, a collaborative filtering algorithm
would conclude that user B may also like song X. This is the same technique currently used by corporations
such as Spotify and Netflix.
Here we will focus on the matrix factorization technique for collaborative filtering. Consider the case of $n$ users and $m$ possible song selections. Let $Y_{ai}$ be the observed rating that user $a$ assigned to song $i$, where $Y_{ai} \in \{-1, 0, 1\}$. We wish to construct an approximation $\hat{Y}_{ai}$ for all values of $a, i$ to satisfy the following optimization problem:
$$\min_{\hat{Y}} J(\hat{Y}) = \min_{\hat{Y}} \frac{1}{2} \sum_{\{a,i\} \in D} (Y_{ai} - \hat{Y}_{ai})^2 + \frac{\lambda}{2} \|\hat{Y}\|^2 \tag{1}$$
Here, $D$ represents the set of all observed ratings and $\lambda$ is a tunable hyperparameter. Note that without any constraint on $\hat{Y}$, we end up with the trivial solution in which $\hat{Y}_{ai} = 0$ for all unobserved values. We therefore constrain $\hat{Y}$ to be low-rank, so that for some $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{m \times d}$, where $d \leq \min(n, m)$, we have $\hat{Y} = UV^T$. Our new optimization problem is as follows:
$$\min_{U,V} J(U, V) = \min_{U,V} \frac{1}{2} \sum_{\{a,i\} \in D} \left(Y_{ai} - \bar{u}^{(a)} \cdot \bar{v}^{(i)}\right)^2 + \frac{\lambda}{2} \sum_{a=1}^{n} \|\bar{u}^{(a)}\|^2 + \frac{\lambda}{2} \sum_{i=1}^{m} \|\bar{v}^{(i)}\|^2 \tag{2}$$
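A direct translation of equation (2) into code, assuming the observed set $D$ is given as a collection of $(a, i)$ index pairs (`mf_objective` is our own name):

```python
import numpy as np

def mf_objective(Y, U, V, lam, observed):
    """Evaluate J(U, V) from equation (2); `observed` is the set D of (a, i) pairs."""
    Y_hat = U @ V.T
    squared_error = 0.5 * sum((Y[a, i] - Y_hat[a, i]) ** 2 for (a, i) in observed)
    regularizer = 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))
    return squared_error + regularizer
```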
With our cost function defined, we now consider how to construct an algorithm to minimize this cost. We will use alternating minimization, meaning that we hold one set of parameters constant while optimizing with respect to the other, and then switch. This alternation is repeated until some stopping criterion is met (e.g., after a fixed number of iterations or when the magnitude of an update is sufficiently small).
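One such alternating pass might look like the following sketch, in which each row of $U$ (with $V$ fixed) and each row of $V$ (with $U$ fixed) is given a closed-form regularized least-squares update; the name `als_step` and the representation of $D$ as a set of $(a, i)$ pairs are our own choices:

```python
import numpy as np

def als_step(Y, U, V, lam, observed):
    """One alternating-minimization pass over the objective in equation (2)."""
    n, m = Y.shape
    d = U.shape[1]
    # update each user vector u^(a) with V held fixed
    for a in range(n):
        items = [i for i in range(m) if (a, i) in observed]
        Vi = V[items]                                  # factors of the songs a has rated
        U[a] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(d), Vi.T @ Y[a, items])
    # update each item vector v^(i) with U held fixed
    for i in range(m):
        users = [a for a in range(n) if (a, i) in observed]
        Ua = U[users]
        V[i] = np.linalg.solve(Ua.T @ Ua + lam * np.eye(d), Ua.T @ Y[users, i])
    return U, V
```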
Here we will go through one step of the optimization given the following observation matrix, in which rows represent the ratings provided by a user and columns represent all users' ratings of a particular song. Note that in this case, there are no missing ratings.
$$Y = \begin{bmatrix} -1 & 1 & 1 \\ 1 & -1 & 0 \end{bmatrix}$$
Define our initial matrices $U$ and $V$ to be as follows:
$$U = \begin{bmatrix} 2 & 2 \end{bmatrix}^T \qquad V = \begin{bmatrix} -1 & 1 & -1 \end{bmatrix}^T$$
Discussion Question 4. What is $d$, the rank of $UV^T$?
Discussion Question 5. What is the value of $\frac{1}{2}\left(Y_{2,3} - \bar{u}^{(2)} \cdot \bar{v}^{(3)}\right)^2$, the error of our approximation of user 2's rating of song 3?
Discussion Question 6. Assume we hold $V$ constant. Solve for $\nabla_{\bar{u}^{(2)}} J$.
Discussion Question 7. Assume $\lambda = 1$. For what value of $\bar{u}^{(2)}$ is $J$ minimized?
Discussion Question 8. What is our new value of $\frac{1}{2}\left(Y_{2,3} - \bar{u}^{(2)} \cdot \bar{v}^{(3)}\right)^2$ after updating $\bar{u}^{(2)}$?