EECS 445 Discussion 11: Probability and Generative Models
1 Probability Review
For a complete review of the probability relevant to EECS445, see discussion 1.
1.1 Probability Density Function
In the case of a discrete random variable, we have a fixed probability value for each element in the sample space.
However, in the case of a continuous random variable, the value P(X = x) is not meaningful on its own, because there are
infinitely many possible values of X; any single point value has probability zero. More precisely, the PDF is used to
specify the probability of the random variable falling within a particular range of values, as opposed to taking on any
one value.
This probability is given by the integral of the variable's PDF over that range, that is, by the area under the density
function and above the horizontal axis. The probability density function is non-negative everywhere, and its integral
over the entire space is equal to 1.
\[
P(a \leq X \leq b) = \int_a^b f(x)\,dx
\]
where $f(x)$ is the probability density function.
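As a minimal sketch of this definition, the snippet below numerically integrates a density with SciPy; the standard normal density and the interval $[-1, 1]$ are arbitrary illustrative choices, not taken from the text.

```python
from scipy import integrate, stats

# Density f of a standard normal random variable (an arbitrary illustrative choice).
f = stats.norm(loc=0.0, scale=1.0).pdf

a, b = -1.0, 1.0
# P(a <= X <= b) is the area under f(x) between a and b.
prob, _ = integrate.quad(f, a, b)

print(prob)                                   # ~0.6827
print(stats.norm.cdf(b) - stats.norm.cdf(a))  # same value via the CDF
```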
1.2 Bernoulli Distribution
The Bernoulli distribution is the discrete probability distribution of a random variable which takes the value 1
with probability $p$ and the value 0 with probability $q = 1 - p$. That is, it is the probability distribution of any
single experiment that asks a yes-no question, and it is often used to represent a coin toss. For example, if you
have a biased coin with probability $p = 0.7$ of heads, then you would have probability $q = 1 - 0.7 = 0.3$ of
tails.
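A minimal NumPy sketch of sampling from this distribution, using the biased-coin value $p = 0.7$ from the example above; the sample size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.7  # probability of heads (value 1), as in the biased-coin example above

# A Bernoulli(p) sample is a Binomial(n=1, p) sample.
samples = rng.binomial(n=1, p=p, size=100_000)

print(samples.mean())  # close to p = 0.7
```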
1.3 Joint Distribution
Consider two random variables X and Y. A joint probability model denotes the probability of a pair of
events being drawn, one from X and one from Y. The joint probability is denoted by
\[
p(X = x, Y = y)
\]
For example, if we let X be the result of a roll of a six-sided die and Y be the result of a coin flip, then the
probability of rolling a 5 and flipping heads is denoted $p(X = 5, Y = H)$.
1.4 Conditional Distribution
We often need to know the probability of an event given that another event is known to have occurred. This is
described by a conditional distribution.
\[
p(X \mid Y) = \frac{p(X, Y)}{p(Y)}
\]
For example, let X be the sum of the results of two rolls of a six-sided die and let Y be the result of the first
roll, then the probability that X = 9 given that Y = 4 can be modeled as
\[
p(X = 9 \mid Y = 4) = \frac{p(X = 9, Y = 4)}{p(Y = 4)}
\]
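For completeness, since the first roll is 4, the sum equals 9 exactly when the second roll is 5, so
\[
p(X = 9 \mid Y = 4) = \frac{p(X = 9, Y = 4)}{p(Y = 4)} = \frac{1/36}{1/6} = \frac{1}{6}
\]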
There are a couple of related properties that we will use:
1. Product rule: $p(X, Y) = p(X \mid Y)\,p(Y) = p(Y \mid X)\,p(X)$
2. Bayes' rule: $p(X \mid Y) = \dfrac{p(Y \mid X)\,p(X)}{p(Y)}$
Note: We have considered the case of two random variables, but the same properties extend to an arbitrary
number of variables.
1.5 Independence
For a set of random variables $X_1, X_2, \ldots, X_k$, we say that they are independent of each other if the occurrence
of one does not affect the others. Using the definition of conditional probability, we can write the following.
\[
\begin{aligned}
p(X_1, X_2, \ldots, X_k) &= p(X_1 \mid X_2, \ldots, X_k)\, p(X_2, \ldots, X_k) \\
&= p(X_1)\, p(X_2, \ldots, X_k) \\
&\;\;\vdots \\
&= \prod_{i=1}^{k} p(X_i)
\end{aligned}
\]
As in section 1.3, let X be the result of a single roll of a six-sided die and Y be the result of a coin flip.
These two random variables are independent of each other (rolling the die does not affect the result of the
coin flip), so we can write
\[
p(X = 5, Y = H) = p(X = 5)\, p(Y = H)
\]
Assuming both the die and the coin are fair, then
\[
p(X = 5, Y = H) = \tfrac{1}{6} \cdot \tfrac{1}{2} = \tfrac{1}{12}
\]
In section 1.4, we had X be the sum of the results of two rolls of a six-sided die and Y be the result of the
first roll. These random variables are not independent, so $p(X = 9, Y = 5) \neq p(X = 9)\, p(Y = 5)$.
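The die-and-coin case can be checked numerically by enumerating the 12 equally likely (die, coin) outcomes; a minimal sketch, with the enumeration itself being an illustrative choice.

```python
from itertools import product
from fractions import Fraction

die = range(1, 7)      # fair six-sided die
coin = ["H", "T"]      # fair coin

outcomes = list(product(die, coin))
n = len(outcomes)      # 12 equally likely outcomes

p_joint = Fraction(sum(1 for d, c in outcomes if d == 5 and c == "H"), n)
p_die5  = Fraction(sum(1 for d, c in outcomes if d == 5), n)
p_heads = Fraction(sum(1 for d, c in outcomes if c == "H"), n)

print(p_joint, p_die5 * p_heads)  # both 1/12, as expected for independent variables
```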
1.6 Marginal Distribution
We can model the random variables individually by summing over all other variables. This is called a
marginal distribution. For discrete random variables,
\[
p(X = x) = \sum_{y \in Y} p(X = x, Y = y)
\]
For example, let X be the sum of the results of two rolls of a six-sided die and let Y be the result of the first
roll, then you can model the probability of X = 9 as follows:
\[
p(X = 9) = \sum_{y \in \{1, \ldots, 6\}} p(X = 9, Y = y)
\]
Discussion Question 1. Finish computing p(X = 9) using the marginal distribution assuming that the
die is fair.
If Y is a continuous random variable, then
\[
p(X = x) = \int_{-\infty}^{\infty} p(X = x, Y = y)\,dy
\]
For example, let X denote whether or not it is raining today and Y be the air temperature today. We can
model the probability that it is raining as follows:
\[
p(X = T) = \int_{-\infty}^{\infty} p(X = T, Y = y)\,dy
\]
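A minimal sketch of marginalization in the discrete case, using a small made-up joint table over rain and temperature categories purely for illustration.

```python
import numpy as np

# Made-up joint distribution p(X = x, Y = y) with X in {rain, no rain} and Y in {cold, mild, hot}.
joint = np.array([
    [0.20, 0.10, 0.05],   # X = rain
    [0.10, 0.25, 0.30],   # X = no rain
])
assert np.isclose(joint.sum(), 1.0)

# Marginal p(X = x): sum the joint over all values of Y.
p_x = joint.sum(axis=1)
print(p_x)   # [0.35 0.65]
```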
1.7 Gaussian Distribution
The normal distribution for a random vector $\bar{x} \in \mathbb{R}^d$ is parameterized by a mean vector $\bar{\mu} \in \mathbb{R}^d$ and covariance
matrix $\Sigma$ as:
\[
\mathcal{N}(\bar{x}; \bar{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (\bar{x} - \bar{\mu})^T \Sigma^{-1} (\bar{x} - \bar{\mu}) \right)
\]
In the case of a single random variable, this distribution has the equivalent form in terms of $\mu \in \mathbb{R}$ and $\sigma^2 \in \mathbb{R}^{+}$
and is given by:
\[
\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
\]
Note that these expressions are probability density functions and their values cannot be interpreted as the
absolute probability that the variable takes on a particular value. For example,
\[
\mathcal{N}(x = 0.08; \mu = 0.1, \sigma^2 = 0.01) = \frac{1}{\sqrt{2\pi \cdot 0.01}} \exp\!\left( -\frac{(0.08 - 0.1)^2}{2 \cdot 0.01} \right) \approx 3.91043
\]
Note that $x = 0.12$ will have the same probability density:
\[
\mathcal{N}(x = 0.12; \mu = 0.1, \sigma^2 = 0.01) = \frac{1}{\sqrt{2\pi \cdot 0.01}} \exp\!\left( -\frac{(0.12 - 0.1)^2}{2 \cdot 0.01} \right) \approx 3.91043
\]
[Figure 1: Gaussian mixture model when $k = 1$ and $k = 2$]
In fact, because the variable is continuous, the absolute probability that it takes on a certain value is actually
0. Instead, we typically look at the probability that the value falls in a certain range by integrating the
probability density function over that range. Again using N (x; µ = 0.1, σ2 = 0.01),
\[
p(0.08 < x < 0.12) = \int_{0.08}^{0.12} \mathcal{N}(x; \mu = 0.1, \sigma^2 = 0.01)\,dx \approx 0.1585
\]
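These values can be reproduced with SciPy; a minimal sketch (note that scipy.stats.norm is parameterized by the standard deviation, so scale = 0.1 corresponds to $\sigma^2 = 0.01$).

```python
from scipy import stats

mu, sigma = 0.1, 0.1          # sigma^2 = 0.01
dist = stats.norm(loc=mu, scale=sigma)

print(dist.pdf(0.08))                    # ~3.910, a density value, not a probability
print(dist.pdf(0.12))                    # same value by symmetry
print(dist.cdf(0.12) - dist.cdf(0.08))   # ~0.1585, an actual probability
```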
2 Generative Models
We often make the assumption that our input data is sampled from some distribution (i.e., $\bar{x}^{(i)} \sim p(\bar{x})$).
Thus far in the course, we have not dealt directly with this underlying distribution and instead focused on
identifying key factors of the data to place it into a category or regress its features to estimate a value. A
machine learning model that performs classification or regression is called a discriminative model. Here’s an
entirely different framework: let’s use a generative model to try to learn p(¯x) directly! Generative models
parameterize a probability distribution and learn parameter values from observed data. In other words, we
want to create a probability distribution $p_{\bar{\theta}}(\bar{x})$ as similar as possible to $p(\bar{x})$. Generative models are typically
divided into two categories: models that aim to explicitly represent the probability density of the data, and
models that implicitly represent the probability density as a black box. Models using explicit representations
of density first define a probabilistic structure and learn parameters for that structure (e.g., assume the data
is normally distributed and learn the mean and variance). Implicit representations of density make no
assumptions about structure, and learn both the structure and values in one, usually uninterpretable, real-valued
vector. Once trained, generative models allow new data points to be sampled directly from the learned
distribution: $\hat{x} \sim p_{\bar{\theta}}(\bar{x})$.
2.1 Maximum Likelihood Estimation
The goal of Maximum Likelihood Estimation is to find the parameters $\bar{\theta}$ of a probabilistic model that
maximize the likelihood of the observed data X under that model. This is a core principle used in
generative models.
\[
\begin{aligned}
\bar{\theta}^* &= \operatorname*{argmax}_{\bar{\theta}} \; p_{\bar{\theta}}(X) \\
&= \operatorname*{argmax}_{\bar{\theta}} \; \ln p_{\bar{\theta}}(X) \\
&= \operatorname*{argmin}_{\bar{\theta}} \; -\ln p_{\bar{\theta}}(X)
\end{aligned}
\]
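A minimal sketch of this principle for a univariate Gaussian: minimize the negative log-likelihood numerically and compare with the closed-form MLE (the synthetic data, seed, and optimizer choice are arbitrary assumptions for illustration).

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(445)
X = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic observed data

def neg_log_likelihood(theta):
    mu, log_sigma = theta                        # parameterize by log(sigma) so sigma stays positive
    return -np.sum(stats.norm.logpdf(X, loc=mu, scale=np.exp(log_sigma)))

result = optimize.minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print(mu_hat, sigma_hat ** 2)   # numerical MLE of the mean and variance
print(X.mean(), X.var())        # closed-form MLE: sample mean and (biased) sample variance
```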
Discussion Question 2. Let $x \sim \text{Binomial}(x; N, p)$ and $\bar{\theta} = (N, p)$, where $x$ is one observation (the
number of successes out of $N$ Bernoulli trials), $N$ is the total number of Bernoulli trials, $n$ is the number of
observations, and $p$ is the probability of success. What is the maximum likelihood estimate of $p$ given
observations $X = [x^{(1)}, \ldots, x^{(n)}]$? The binomial distribution is as follows:
\[
\text{Binomial}(x; N, p) = \binom{N}{x} p^x (1 - p)^{N - x} \tag{1}
\]
2.2 Gaussian Mixture Models
In GMMs, we assume that the underlying data distribution is a mixture of k Gaussians. Each Gaussian is
parameterized by a mean and covariance. The training of a GMM involves learning values for these means
and covariances, as well as a prior probability of assignment to each Gaussian, called a “mixing weight”. A
visualization of GMMs can be found at https://lukapopijac.github.io/gaussian-mixture-model/
We will work with the following notation, assuming $\bar{x} \in \mathbb{R}^d$:
\[
\begin{aligned}
&\gamma_1, \ldots, \gamma_k && \text{mixing weights} \\
&\bar{\mu}_1, \ldots, \bar{\mu}_k && \text{cluster means} \\
&\Sigma_1, \ldots, \Sigma_k && \text{cluster covariances}
\end{aligned} \tag{4}
\]
\[
\bar{\theta} = \{\gamma_1, \ldots, \gamma_k, \bar{\mu}_1, \ldots, \bar{\mu}_k, \Sigma_1, \ldots, \Sigma_k\} \quad \text{model parameters} \tag{5}
\]
Note that we are assuming our Gaussians are all spherical, such that $\Sigma_j = \sigma_j^2 I$ for some scalar $\sigma_j$.
After learning these values, we can generate new values from our GMM as follows:
\[
\begin{aligned}
z^{(i)} &\sim \text{Categorical}(\bar{\gamma}) && \text{cluster indicators} \\
\bar{x}^{(i)} \mid z^{(i)} &\sim \mathcal{N}(\bar{\mu}_{z^{(i)}}, \Sigma_{z^{(i)}}) && \text{base distribution}
\end{aligned} \tag{6}
\]
In other words, we first sample which Gaussian the data point should come from and then we sample from
that Gaussian to generate the data point.
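A minimal NumPy sketch of this two-step sampling procedure; the specific mixing weights, means, and spherical variances below are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component GMM in R^2 with spherical covariances.
gammas = np.array([0.3, 0.7])                # mixing weights, sum to 1
mus = np.array([[0.0, 0.0], [5.0, 5.0]])     # cluster means
sigmas = np.array([1.0, 0.5])                # sigma_j, so Sigma_j = sigma_j^2 * I

def sample_gmm(n):
    # Step 1: sample cluster indicators z^(i) ~ Categorical(gammas).
    z = rng.choice(len(gammas), size=n, p=gammas)
    # Step 2: sample x^(i) ~ N(mu_{z^(i)}, sigma_{z^(i)}^2 * I).
    x = mus[z] + sigmas[z][:, None] * rng.standard_normal((n, 2))
    return x, z

x, z = sample_gmm(1000)
print(x.shape, np.bincount(z) / len(z))   # cluster frequencies roughly [0.3, 0.7]
```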
2.2.1 Complete Data Case
We will begin by assuming that we magically know the cluster assignment $z^{(i)}$ for each data point, also called
the hidden or latent values of the generative model. When this knowledge is assumed, we say that our data
is “complete”. Without any training data, the best we can do is assume that points are distributed based
on our mixing weights, also called priors. This yields the following probability model for sampling a data
point $\bar{x}^{(i)}$:
\[
p_{\bar{\theta}}(\bar{x}^{(i)}) = \sum_{j=1}^{k} \gamma_j \, \mathcal{N}(\bar{x}^{(i)}; \bar{\mu}_j, \Sigma_j) \tag{7}
\]
In the complete data case, the conditional probability of cluster assignment can only result in 0s and 1s (i.e.,
we have a “ground truth” cluster). We can embed this prior into our model as follows.
\[
p(j \mid i) =
\begin{cases}
1, & \text{if } z^{(i)} = j \\
0, & \text{otherwise}
\end{cases} \tag{8}
\]
\[
p_{\bar{\theta}}(\bar{x}^{(i)} \mid z^{(i)}) = \sum_{j=1}^{k} p(j \mid i) \, \mathcal{N}(\bar{x}^{(i)}; \bar{\mu}_j, \Sigma_j) \tag{9}
\]
We will later relax this assumption and allow the prior to take on more complex distributions.
Discussion Question 3. Let $X = [\bar{x}^{(1)}, \ldots, \bar{x}^{(n)}]$ and $Z = [z^{(1)}, \ldots, z^{(n)}]$. Assume our data is independently sampled. Show that the following is true:
\[
\ln p_{\bar{\theta}}(X, Z) = \sum_{i=1}^{n} \sum_{j=1}^{k} p(j \mid i) \ln\!\left( \gamma_j \, \mathcal{N}(\bar{x}^{(i)}; \bar{\mu}_j, \Sigma_j) \right) \tag{10}
\]
So now that we’ve defined our probabilistic model, we need to construct a method for optimizing our model’s
parameters w.r.t. the input data $\{\bar{x}^{(i)}, z^{(i)}\}_{i=1}^{n}$. The process is similar to our previous means of optimization:
take the gradient w.r.t. the parameter of interest, set it equal to 0, and solve. But now it goes by a different
name: Maximum Likelihood Estimation (MLE).
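As a sketch of what this objective looks like in code, the helper below evaluates the complete-data log-likelihood for a spherical GMM when the assignments are known; the function name is my own, and the parameters it assumes are the made-up illustrative ones from the earlier sampling sketch.

```python
import numpy as np
from scipy import stats

def complete_data_log_likelihood(X, Z, gammas, mus, sigmas):
    """ln p_theta(X, Z) for a spherical GMM when the cluster assignments Z are known."""
    # Per-point log density under the assigned component: ln N(x^(i); mu_{z^(i)}, sigma_{z^(i)}^2 I).
    log_pdfs = np.array([
        stats.multivariate_normal.logpdf(x, mean=mus[z], cov=sigmas[z] ** 2)
        for x, z in zip(X, Z)
    ])
    # Sum over points of ln gamma_{z^(i)} + ln N(x^(i); ...), i.e. the double sum above
    # with p(j|i) equal to the 0/1 indicator of the known assignment.
    return np.sum(np.log(gammas[Z]) + log_pdfs)

# Example usage with the illustrative parameters and samples from the earlier sampling sketch:
# print(complete_data_log_likelihood(x, z, gammas, mus, sigmas))
```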
2.2.2 MLE for Complete-Data GMMs
Let’s return to our GMM setting. Recall the complete data log-likelihood formulation:
\[
\ln p_{\bar{\theta}}(X, Z) = \sum_{i=1}^{n} \sum_{j=1}^{k} p(j \mid i) \ln\!\left( \gamma_j \, \mathcal{N}(\bar{x}^{(i)}; \bar{\mu}_j, \Sigma_j) \right) \tag{12}
\]
Discussion Question 4. Find $\gamma_j^*$, the MLE estimate of the cluster prior of cluster $j$. Note that we
have the additional constraint that $\sum_{j=1}^{k} \gamma_j = 1$.
3 Review: Gradient of squared l2 norm
You may have noticed that in some of the derivations above, we wrote that
\[
\nabla \|\bar{x} - \bar{\mu}\|_2^2 = 2(\bar{x} - \bar{\mu})
\]
The proof is as follows:
According to the definition of the $\ell_2$ norm,
\[
\|\bar{x}\|_2 = \sqrt{\sum_{k=1}^{n} x_k^2}
\]
Therefore,
\[
\|\bar{x} - \bar{\mu}\|_2^2 = \sum_{k=1}^{n} (\bar{x} - \bar{\mu})_k^2 = \sum_{k=1}^{n} (\bar{x}_k - \bar{\mu}_k)^2
\]
From this, we can calculate the ith element of the gradient:
\[
\left[ \nabla \|\bar{x} - \bar{\mu}\|_2^2 \right]_i = \frac{\partial}{\partial x_i} \|\bar{x} - \bar{\mu}\|_2^2 = \frac{\partial}{\partial x_i} \sum_{k=1}^{n} (\bar{x}_k - \bar{\mu}_k)^2 = 2(x_i - \mu_i)
\]
Therefore, $\nabla \|\bar{x} - \bar{\mu}\|_2^2 = 2(\bar{x} - \bar{\mu})$.
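A quick numerical sanity check of this identity against finite differences; the specific vectors and tolerance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(5)
mu = rng.standard_normal(5)

def f(v):
    return np.sum((v - mu) ** 2)   # squared l2 norm ||v - mu||_2^2

# Analytic gradient from the derivation above.
analytic = 2 * (x - mu)

# Central finite-difference approximation of the gradient.
eps = 1e-6
numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(len(x))
])

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```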