EECS 445 Discussion 12: EM Algorithm and Bayesian Networks
1 EM Example: Movie Reviews
Recall Project 1, in which we used SVMs to classify movie reviews. We will use a simplified version of this
problem to practice using MLE and E-M.
Suppose there are two possible types of reviews: "good" and "bad." Each review contains exactly $M$ words, where each word has been classified as either a "positive" or a "negative" word. If a review is good, each word in the review has a fixed probability $\theta_G$ of being a positive word. Similarly, each word in a bad review has a probability $\theta_B$ of being a positive word.
We collect data from a series of $N$ trials in order to estimate these probabilities. Each trial $i$ consists of randomly selecting a review type (good or bad), then recording the number of positive words in the review, $x^{(i)}$. This situation is best described by the following generative probabilistic model:
$$
\begin{aligned}
\theta &= (\theta_G, \theta_B) && \text{fixed review biases} \\
\text{for each trial } i &\in \{1, \dots, N\}: \\
\quad z^{(i)} &\sim \mathrm{Uniform}\{G, B\} && \text{review type} \\
\quad x^{(i)} \mid z^{(i)}; \theta &\sim \mathrm{Binomial}(M, \theta_{z^{(i)}}) && \text{positive-word count}
\end{aligned}
$$
We collected 3 reviews, where each review had a total of 20 words (N = 3, M = 20):
1. The first review is a good review with 15 positive words.
2. The second review is a bad review with 6 positive words.
3. The third review is a bad review with 10 positive words.
The binomial probability distribution is (assuming we look at a good review):
$$
p(x^{(i)} \mid z^{(i)} = G; \theta) = \binom{M}{x^{(i)}} \, \theta_G^{\,x^{(i)}} \, (1 - \theta_G)^{M - x^{(i)}}
$$
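As a quick numerical sanity check, here is a minimal sketch (assuming SciPy is available) that evaluates this binomial likelihood for each observed count under some candidate parameters; the values of theta_G and theta_B below are arbitrary placeholders, not the estimates you are asked to derive.

```python
# Minimal sketch: evaluate p(x | z; theta) for the three observed counts.
# theta_G and theta_B are arbitrary placeholder values, not the MLE/EM answers.
from scipy.stats import binom

M = 20
counts = [15, 6, 10]          # positive-word counts from the three reviews
theta_G, theta_B = 0.7, 0.3   # candidate review biases (placeholders)

for x in counts:
    p_good = binom.pmf(x, M, theta_G)   # p(x | z = G; theta)
    p_bad = binom.pmf(x, M, theta_B)    # p(x | z = B; theta)
    print(f"x = {x:2d}:  p(x|G) = {p_good:.4f}   p(x|B) = {p_bad:.4f}")
```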
1.1 Complete Data (MLE)
Assume we knew the $z^{(i)}$'s: the type of each review in our $N$ trials.
Discussion Question 1. What is the complete-data log-likelihood?
Discussion Question 2. What are the complete-data maximum-likelihood estimates for $\theta_G$ and $\theta_B$?
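If you want to sanity-check your answer to Discussion Question 2, here is a minimal complete-data MLE sketch for the three reviews above, using the standard result that the binomial MLE is the pooled fraction of positive words per review type; treat it as a check rather than a derivation.

```python
# Complete-data MLE sketch: pool positive-word counts by known review type.
M = 20
data = [(15, "G"), (6, "B"), (10, "B")]   # (positive-word count, review type)

def mle(reviews, label):
    counts = [x for x, z in reviews if z == label]
    return sum(counts) / (M * len(counts))   # fraction of positive words

theta_G_hat = mle(data, "G")   # 15 / 20 = 0.75
theta_B_hat = mle(data, "B")   # (6 + 10) / 40 = 0.40
print(theta_G_hat, theta_B_hat)
```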
1.2 Incomplete Data (E-M)
Now assume we do not have complete data, i.e., we do not know the $z^{(i)}$'s, and use the E-M algorithm.
Discussion Question 3. In the E-step, what are the hidden posterior probabilities $p(z^{(i)} = G \mid x^{(i)}; \theta)$ and $p(z^{(i)} = B \mid x^{(i)}; \theta)$?
Discussion Question 4. What expression do we optimize in the M-step, given the hidden posterior probabilities $p(z^{(i)} = G \mid x^{(i)}; \theta)$ and $p(z^{(i)} = B \mid x^{(i)}; \theta)$? What are the updates for $\theta_G$ and $\theta_B$?
2 Bayesian Networks
Bayesian networks describe a general framework for learning probability distributions. We use a graphical
structure to express the dependencies in the probability distributions of interest. These graphs allow us to
quickly reason about the model at hand.
2.1 The Setup
We want to learn a probabilistic model that describes our data; in order to do so, we will assume that the
data we observe is generated by a set of probability distributions with parameters, and that the probability
distributions may depend on each other in some way. To better describe these dependencies, we will draw
them as a graph with the following assumption:
Fundamental Assumption of Bayesian Networks: Given its parents, a node $x_i$ is conditionally independent of all its non-descendants in the graph.
That is, given nodes $x_1, \dots, x_n$ in a graph, the probability density function over the nodes is
$$
p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{\mathrm{pa}_i}),
$$
where $x_{\mathrm{pa}_i}$ is the set of parents of node $x_i$. This set may possibly be empty.
Figure 1: An example Bayesian Network from lecture with the associated density function over the nodes
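As another quick illustration (a made-up three-node graph, separate from Figure 1): if the only edges are $x_1 \to x_2$ and $x_1 \to x_3$, then
$$
p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1),
$$
since $x_1$ has no parents and is the sole parent of both $x_2$ and $x_3$.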
2.2 Marginal and Conditional Independence
Given that we have a graph with the independence assumption above, it turns out that we can use that
graph to deduce independence properties about the distributions.
The two related properties that we care about are marginal independence and conditional independence.
For $X_1$ and $X_2$ to be marginally independent means that knowing something about $X_2$ would not change the probability distribution of $X_1$, and vice versa. Similarly, $X_1$ and $X_2$ are conditionally independent given a third node $X_3$ if, once we know $X_3$, additionally knowing $X_2$ does not change our beliefs about $X_1$.
Suppose we are interested in whether a pair of nodes $X_1, X_2$ are conditionally independent given a third node $X_3$, or marginally independent given no nodes. We perform the following procedure on the graph:
1. Keep only the “ancestral” graph of the variables of interest. In other words, keep the variables of
interest themselves in the graph, and keep all nodes that have a directed path to those nodes (ancestors).
Remove all other nodes from the graph.
2. Moralize the parents: add an undirected edge between any two nodes in the ancestral graph that share a child. If a child has more than two parents, add edges between every pair of its parents.
3. Change the remaining edges in the graph to be undirected.
4. We are now ready to read off independence properties from the graph (a programmatic sketch of this check follows the list):
• If there is no path from $X_1$ to $X_2$, then they are marginally independent (assuming those two were the only nodes of interest when doing the procedure). We write $X_1 \perp X_2$.
• If there are no paths from $X_1$ to $X_2$, or all such paths go through $X_3$, then $X_1$ and $X_2$ are conditionally independent given $X_3$. We write $X_1 \perp X_2 \mid X_3$.
Note: If we were to check whether $X_1 \perp X_2 \mid \{X_3, X_4\}$, then we would need to verify that all paths from $X_1$ to $X_2$ go through at least one of the variables in $\{X_3, X_4\}$.
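To make the procedure concrete, here is a minimal programmatic sketch that follows the four steps above; it assumes the networkx library is available, and the function name `independent` is our own.

```python
# Sketch of the ancestral-graph + moralization independence check (assumes networkx).
import networkx as nx

def independent(dag, x1, x2, given=()):
    """Return True if x1 and x2 are (conditionally) independent given `given`."""
    # 1. Keep only the ancestral graph of the variables of interest.
    keep = {x1, x2, *given}
    for v in list(keep):
        keep |= nx.ancestors(dag, v)
    ancestral = dag.subgraph(keep)
    # 2 & 3. Moralize (connect co-parents) and drop edge directions.
    moral = nx.moral_graph(ancestral)       # returns an undirected graph
    # 4. Remove the conditioning nodes; independence holds iff no path remains.
    moral.remove_nodes_from(given)
    return not nx.has_path(moral, x1, x2)
```

Assuming a `dag` built from the edges in Figure 2 (e.g., with `nx.DiGraph(...)`), a call like `independent(dag, "E", "F", given=("C", "D"))` would answer the last part of Discussion Question 6.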
Let's use an example to illustrate the concept. Consider the graph below:
Figure 2: A Bayesian Network
Discussion Question 5. What is the probability density function of this graph?
Discussion Question 6. Are E and F marginally independent? Are they conditionally independent given A? Given B? Given C? Given both C and D?
Note that for this set of questions, the final graph we arrive at after doing the above steps will be the same, because A, B, C, D are all ancestors of E and F! Thus we can use the moralized graph below to answer all of them.
(a) Ancestral Graph
(b) Moralized Graph
Figure 3: Deriving Conditional Independence Properties
Discussion Question 7. Let’s do a computational example. Recall the radio report/alarm graph from
lecture, together with its probability distribution tables.
Figure 4: Generative probability model over 4 variables
Compute the probability that a burglary happened given that the alarm went off. Does the result match your intuition? Explain.
Discussion Question 8. Suppose we learned that the radio report also occurred. How does the probability that a burglary occurred change? In other words, what is $P(B = T \mid R = T, A = T)$?
3 Plate Notation
It turns out that drawing the entirety of a graph can be time-consuming and difficult. Sometimes we want
a convenient notation for specifying graphs without having to draw the entire graph. We also need a way
to specify graphs without a fixed number of nodes. For example, how would you draw the graph with the
distribution below?
$$
P(x_1, \dots, x_N, y_1, \dots, y_N) = \prod_{i=1}^{N} P(x_i) \, P(y_i \mid x_i)
$$
Since $N$ is not a fixed number, it is impossible to draw the entire graph. You might try to use dots, like the picture on the left below. But there is a better way! We can use plate notation.
In plate notation, everything inside the square "plate" is repeated the number of times indicated in the bottom right corner of the plate. For example, in the picture on the right below, we just repeat everything in the plate $N$ times, which gives us the same graph as the one on the left. This way, we can draw graphs much more compactly.
(a) Using dots to draw a graph
(b) Using plate notation to draw the same graph
Figure 5: Comparison of graph notations
You can think of plate notation like a summation. We could either write
$$
x_1 + x_2 + \cdots + x_N
$$
or we could write
$$
\sum_{i=1}^{N} x_i.
$$
Both expressions are equivalent, but we typically prefer the summation because it is more compact.
We can have edges that cross from a node outside of a plate to a node inside a plate. When that happens, it
indicates that the node outside the plate is connected to many repeated copies of the node inside the plate,
like so:
(a) The full graph
(b) An edge crossing a plate
Figure 6: A graph drawn with and without plate notation
As you will see in your homework, you can also nest plates inside plates. To understand graphs with nested
plates, you can unravel the highest level plate by repeating everything inside the highest level plate, and
then recurse on each resulting plate.
(a) Nested plates
(b) One level of plates unraveled
Figure 7: Nested plates (left), and the highest level plate unraveled (right)
4 If Time Permits: Learning Bayesian Networks
4.1 Maximizing Log-Likelihood
Suppose we have a specified graph structure with variables (nodes) $X_1, \dots, X_m$ and a joint density:
$$
P(X_1 = x_1, \dots, X_m = x_m) = \prod_{i=1}^{m} P(X_i = x_i \mid X_{\mathrm{pa}_i} = x_{\mathrm{pa}_i})
$$
How can we learn the probabilities of the distributions underlying the graph structure?
Let's make a couple of assumptions to make our lives easier:
• Complete data: let's assume that we have a training set of $n$ observations, where each observation has a value assignment for each of the $m$ variables of interest. That is, our data consists of $S_n = \{(x_1^{(j)}, \dots, x_m^{(j)})\}_{j=1}^{n}$. Recall that this is often not the case. For example, when we learned GMMs, we observed the values drawn from the Gaussian distributions, but we did not observe the values of the categorical selection distribution (we did not know which points belonged to which clusters).
• Conditional probabilities are fully parameterized: that is, we will treat each $P(X_i = x_i \mid X_{\mathrm{pa}_i} = x_{\mathrm{pa}_i})$ as a probability table that has a learned parameter for each possible value setting $X_i = x_i$, $X_{\mathrm{pa}_i} = x_{\mathrm{pa}_i}$.
Given these assumptions, we can use maximum likelihood estimation to estimate the parameters of our
model, $\theta$. The log-likelihood is:
$$
\begin{aligned}
\ell(S_n, \theta, G) &= \log \prod_{j=1}^{n} P(X_1 = x_1^{(j)}, \dots, X_m = x_m^{(j)}; \theta) \\
&= \log \prod_{j=1}^{n} \prod_{i=1}^{m} P(X_i = x_i^{(j)} \mid X_{\mathrm{pa}_i} = x_{\mathrm{pa}_i}^{(j)}; \theta) \\
&= \sum_{j=1}^{n} \log \prod_{i=1}^{m} P(X_i = x_i^{(j)} \mid X_{\mathrm{pa}_i} = x_{\mathrm{pa}_i}^{(j)}; \theta) \\
&= \sum_{j=1}^{n} \sum_{i=1}^{m} \log P(X_i = x_i^{(j)} \mid X_{\mathrm{pa}_i} = x_{\mathrm{pa}_i}^{(j)}; \theta)
\end{aligned}
$$
The optimal probabilities specified by the optimal parameters $\hat{\theta}$ are:
$$
P(X_i = x_i \mid X_{\mathrm{pa}_i} = x_{\mathrm{pa}_i}; \hat{\theta}) = \frac{n(x_i, x_{\mathrm{pa}_i})}{\sum_{x_i'} n(x_i', x_{\mathrm{pa}_i})}
$$
where:
• $n(x_i, x_{\mathrm{pa}_i})$ is the number of observations in $S_n$ where $X_i = x_i$ and $X_{\mathrm{pa}_i} = x_{\mathrm{pa}_i}$.
Intuitively, the estimate is the fraction of times we see $X_i = x_i$ among observations where the parents are set as $X_{\mathrm{pa}_i} = x_{\mathrm{pa}_i}$.
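Here is a minimal sketch of this counting estimate, assuming each observation is stored as a dict mapping variable names to observed values; the helper name `estimate_cpt` is our own.

```python
# Sketch: estimate P(child | parents) from complete data by counting.
from collections import Counter

def estimate_cpt(observations, child, parents):
    """observations: list of dicts {variable name: observed value}.
    Returns {parent value tuple: {child value: probability}}."""
    joint = Counter()           # n(x_i, x_pa_i)
    parent_totals = Counter()   # sum over x_i' of n(x_i', x_pa_i)
    for obs in observations:
        pa_vals = tuple(obs[p] for p in parents)
        joint[(obs[child], pa_vals)] += 1
        parent_totals[pa_vals] += 1

    cpt = {}
    for (x_val, pa_vals), count in joint.items():
        cpt.setdefault(pa_vals, {})[x_val] = count / parent_totals[pa_vals]
    return cpt
```

For example, `estimate_cpt(data, "X3", ["X1", "X2"])` would return the table for $P(X_3 \mid X_1, X_2)$ under these assumptions (the variable names here are hypothetical).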
Discussion Question 9. How does the graph parameter $G$ impact the log-likelihood $\ell(S_n, \theta, G)$?
4.2 Learning the Graph Structure
How can we learn the graph structure of our model? One way to approach this problem would be to try
several graphical structures and see which one produces the best log-likelihood. However, just like with
GMMs, there is a problem here: the more connected the model is, the better we expect it to perform. Why
is that?
For any conditional probability $P(X_i \mid X_{\mathrm{pa}_i})$, our model could learn $P(X_i \mid X_{\mathrm{pa}_i}) = P(X_i \mid S)$ for some subset $S \subset X_{\mathrm{pa}_i}$. That is, for any specified parents of $X_i$, our model could simply ignore some of those parents and learn that $X_i$ is conditionally independent of them. Thus, anything a model with fewer connections can represent can also be represented by a model with more connections, and the more-connected model can represent even more. However, adding more dependencies doesn't necessarily mean we are capturing the correct trends in our data.
(a) No Connections (b) Two Connections
(c) Three Connections
Figure 8: Understanding Graph Complexity
Consider the graphs in Figure 8. Suppose that $x_1$ can take on $r_1$ possible values, $x_2$ can take on $r_2$ possible values, and $x_3$ can take on $r_3$ possible values.
Discussion Question 10. What is the factored probability distribution of each graph, and how many
parameters does each graph have?
We can see that as we add more connections, we have a more complicated model. When we have more complicated models, we should be wary when they do better on our training data, since better training performance does not guarantee better generalization!
Alright, so now we have a problem. How can we resolve it? Let's use the Bayesian Information Criterion (BIC)!
BIC measures the tradeoff between the additional likelihood we gain from adding connections in our graph
and the model complexity of our graph. It is one way in which we can regularize our model.
Suppose that each random variable $X_i$ can take on $r_i$ possible values for $i = 1, \dots, m$. How many parameters are in such a model? We need to calculate the number of learned parameters in each probability table $P(X_i \mid X_{\mathrm{pa}_i})$. For this one table, there are $(r_i - 1) \prod_{j \in \mathrm{pa}_i} r_j$ free parameters. We then sum over all such tables to get:
$$
\text{Number of parameters in a graphical model} = k = \sum_{i=1}^{m} (r_i - 1) \prod_{j \in \mathrm{pa}_i} r_j
$$
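As a quick hypothetical check of this formula (a made-up chain graph, not necessarily one of the graphs in Figure 8): for $x_1 \to x_2 \to x_3$,
$$
k = (r_1 - 1) + (r_2 - 1)\, r_1 + (r_3 - 1)\, r_2,
$$
since $x_1$ has no parents, $x_2$ has the single parent $x_1$, and $x_3$ has the single parent $x_2$.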
We can then use the BIC as normal:
$$
\mathrm{BIC}(S_n, \hat{\theta}, G) = \frac{k}{2} \log(n) - \ell(S_n, \hat{\theta}, G)
$$
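A small helper matching this definition (note that with this convention, lower BIC is better, since the score is a complexity penalty minus the log-likelihood):

```python
import math

def bic(log_likelihood, k, n):
    # BIC(S_n, theta_hat, G) = (k / 2) * log(n) - l(S_n, theta_hat, G)
    return 0.5 * k * math.log(n) - log_likelihood
```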