Probability Primer
EECS445
1 Probability Theory
Machine learning is the field concerned with designing algorithms that draw conclusions from observed data. We treat the observed data as representative of how the underlying domain behaves, so uncertainty is part and parcel of machine learning. Probability theory gives us a systematic way to reason about this uncertainty. In this section, we cover the basics of probability and statistics required for EECS 445.
1.1 Probability space
A probability space is the core object of the theory. A probability space is a tuple (Ω, F, P)
with each element described as follows:
• Sample space Ω: We represent the possible set of outcomes of a random experiment by Ω.
Every trial in the random experiment returns an element ω ∈ Ω.
• Event space F: a collection of subsets of Ω satisfying certain conditions; specifically, F
should be a σ-field. A σ-field is a collection of sets that contains Ω and is closed under
complementation and countable unions.
• Probability Measure P: a mapping P : F → R with the following properties, called the axioms
of probability:
– P(A) ≥ 0 for all A ∈ F
– P(Ω) = 1
– If A_1, A_2, . . . is a countable collection of pairwise disjoint sets in F, i.e. A_i ∩ A_j = ∅ for i ≠ j, then P(∪_i A_i) = Σ_i P(A_i)
Example: Rolling a die has Ω = {1, 2, 3, 4, 5, 6}. We can let F be the set of all subsets of
Ω. A valid probability measure for this case assigns each outcome equal probability, which gives
P(ω = i) = 1/6 for each i. By the third axiom of probability, P(ω ∈ A) = |A|/6 for any A ⊆ Ω.
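The die example can be checked by simulation. Below is a minimal sketch (not part of the original notes) that estimates P(ω ∈ A) by Monte Carlo; the function name and trial count are illustrative choices:

```python
import random

def estimate_event_probability(event, trials=100_000, seed=0):
    """Estimate P(omega in event) for a fair six-sided die by simulation."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if rng.randint(1, 6) in event)
    return hits / trials

# The event A = {2, 4, 6} ("roll an even number") has P(A) = |A|/6 = 0.5.
p_even = estimate_event_probability({2, 4, 6})
```

With enough trials, the empirical frequency approaches |A|/6 as predicted by the axioms.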
Note: A few important properties of a probability measure are noted below. Students can
attempt to derive the following properties from the axioms of probability:
1. P(A ∩ B) ≤ min(P(A), P(B))
2. P(A ∪ B) ≤ P(A) + P(B)
3. P(Ω − A) = 1 − P(A)
1.2 Random variables
We don’t always work with the raw outcomes of a random experiment. Instead, we are often
interested in some function of the observed data, for example, the mean delay of a flight or the
maximum height of kids between ages 12 and 15. For this, we define a random variable as follows:
Definition 1.1 (Random Variable). A random variable X is a function X : Ω → R. We denote a
random variable by an upper-case letter, writing X(ω) or simply X for short. The observed values
are denoted by lower-case letters such as x.
Not every function can serve as a random variable; it must satisfy a measurability condition.
However, we’ll not dwell on such requirements here.
We consider primarily two kinds of random variables:
• Discrete random variables: the random variable takes values in a finite or countable set.
For example, the number of heads in a series of coin tosses. We write the probability as
P(X = a) = P({ω : X(ω) = a}).
• Continuous random variables: X can take uncountably many values in a continuous domain, in
which case we call it a continuous random variable. For example, the height of students in EECS
445. We write the probability as P(a ≤ X ≤ b) = P({ω : a ≤ X(ω) ≤ b}).
1.3 Probability measures on random variables
We have described the two categories of random variables and what the associated probabilities
mean. For a discrete random variable, we have a fixed probability value for each element in the
sample space, and this probability measure is called a probability mass function. The probability
mass function (PMF) should satisfy the axioms of probability and their corollaries.
However, for a continuous random variable, the value P(X = x) is not meaningful (it is zero for
every x). Instead, we define a function f : R → R with the following property:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
This function f is called the probability density function (pdf) of X. Also, the probability
that the random variable X lies in (−∞, a] is
FX(a) = ∫_{−∞}^{a} f(x) dx
This function FX is called the cumulative distribution function (cdf) of X. Note that the pdf
may not exist for every continuous random variable. When the cdf of a random variable is
differentiable, the corresponding pdf exists and is given by
f(x) = dFX(x)/dx
When we say that a sample has been drawn from a distribution f(X), we denote it by x ∼ f(X).
Important: In many cases, the probability for a particular value of a random variable is denoted
p(X = x) or p(x), and its pdf or pmf is denoted p(X).
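The relationship between the pdf and the cdf can be checked numerically. The sketch below, a supplement to the notes, uses the standard normal as the example distribution: its cdf has a closed form via the error function, so we can verify that integrating the pdf reproduces FX(b) − FX(a). The helper names and the midpoint rule are illustrative choices:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """pdf of N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def normal_cdf(a, mu=0.0, sigma=1.0):
    """cdf F_X(a) in closed form via the error function."""
    return 0.5 * (1 + math.erf((a - mu) / (sigma * math.sqrt(2))))

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# P(a <= X <= b) computed by integrating the pdf should match F(b) - F(a).
lhs = integrate(normal_pdf, -1.0, 1.0)
rhs = normal_cdf(1.0) - normal_cdf(-1.0)
```

Both quantities agree, illustrating P(a ≤ X ≤ b) = FX(b) − FX(a) = ∫_a^b f(x) dx.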
1.4 Two random variables
So far, we have only looked at a single random variable and its distribution. However, we might
be interested in more than one function of a random experiment, and therefore, in this section we
consider the case of two random variables.
1.4.1 Joint and marginal distribution
Consider the two random variables to be X and Y . A joint probability model denotes the probability
of the co-occurrences of the two random variables. The joint cumulative distribution function is
now denoted by
FXY (x, y) = P(X ≤ x, Y ≤ y)
At the same time, we can model the random variables individually as well, defining the
cdfs FX and FY respectively. These functions are called marginal distribution functions and are
related to the joint model as:
FX(x) = lim_{y→∞} FXY(x, y)
FY(y) = lim_{x→∞} FXY(x, y)
In case of discrete random variables, we again have a joint probability mass function such that
fXY (x, y) = P(X = x, Y = y)
In this particular case, the marginal distribution is given as
fX(x) = Σ_y fXY(x, y)
and is called the marginal probability mass function. In the case of continuous random variables,
we can define a joint probability density function (assuming FXY is differentiable in both x and y):
fXY(x, y) = ∂²FXY(x, y) / ∂x∂y
Similar to the discrete case, we define the marginal probability density function as
fX(x) = ∫_{−∞}^{∞} fXY(x, y) dy
Note: In this case, we have P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d fXY(x, y) dy dx
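Marginalization is easy to see concretely in the discrete case. A small sketch, supplementary to the notes, representing the joint pmf of two fair dice as a dictionary (the representation and function name are illustrative):

```python
# Joint pmf of two fair dice, stored as a dict mapping (x, y) -> probability.
joint = {(x, y): 1 / 36 for x in range(1, 7) for y in range(1, 7)}

def marginal_x(joint):
    """Compute the marginal pmf f_X(x) = sum_y f_XY(x, y)."""
    fx = {}
    for (x, y), p in joint.items():
        fx[x] = fx.get(x, 0.0) + p
    return fx

fx = marginal_x(joint)  # each value should come out to 1/6
```

Summing out y recovers the uniform marginal of a single fair die, and the marginal itself sums to 1.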
1.4.2 Conditional distribution
In machine learning, we often ask for the probability of an event given that another event is
known to have happened. Specifically, for a discrete random variable pair, we use the joint
probability mass function to define a conditional probability mass function:
fX|Y(x|y) = fXY(x, y) / fY(y)
Here, the left-hand side of the equation denotes the conditional probability of X given Y. For
continuous random variables, we don’t define a probability value for each element of the sample
space; instead, we define the analogue for the pdf, i.e., the conditional probability density
function, in the same manner.
For example, consider rolling two dice; if X and Y denote the numbers on each, then the
random variable pair (X, X + Y) has a joint probability mass function. Given X + Y = 5, the
possible outcomes are 1+4, 2+3, 3+2, and 4+1. Therefore, the conditional probability P(X =
2|X + Y = 5) = (1/36)/(4/36) = 1/4.
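The dice computation above can be reproduced by brute-force enumeration of the 36 equally likely outcomes. A minimal sketch, supplementary to the notes, with illustrative function names; exact arithmetic via fractions avoids rounding:

```python
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of two fair dice.
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]

def cond_prob(event_a, event_b):
    """P(A | B) = P(A and B) / P(B), by counting equally likely outcomes."""
    b = [o for o in outcomes if event_b(o)]
    ab = [o for o in b if event_a(o)]
    return Fraction(len(ab), len(b))

# P(X = 2 | X + Y = 5) = (1/36) / (4/36) = 1/4
p = cond_prob(lambda o: o[0] == 2, lambda o: o[0] + o[1] == 5)
```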
Important Properties:
1. Product rule: P(X, Y) = P(X|Y)P(Y) = P(Y|X)P(X)
2. Bayes’ rule: P(X|Y) = P(Y|X)P(X) / P(Y). (Specific cases for discrete and continuous random
variables shown in class.)
Note: We have considered the case of two random variables, but these notions extend to more
than two random variables using analogous definitions.
1.4.3 Independence
For a set of random variables X1, X2, . . . , Xk, we say that they are independent of each other
if the occurrence of one doesn’t affect the others. Expanding the joint by the chain rule and
replacing each conditional with the corresponding marginal (which independence allows), we can
therefore write:
f(x1, x2, . . . , xk) = f(x1|x2, . . . , xk) f(x2, . . . , xk)
= f(x1) f(x2|x3, . . . , xk) f(x3, . . . , xk)
= Π_{i=1}^{k} f(xi)
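The factorization criterion gives a direct computational test for independence of discrete random variables. A small sketch, supplementary to the notes (function names and the two example joints are illustrative):

```python
def marginals(joint):
    """Recover the marginal pmfs f_X1 and f_X2 from a joint pmf over pairs."""
    f1, f2 = {}, {}
    for (a, b), p in joint.items():
        f1[a] = f1.get(a, 0.0) + p
        f2[b] = f2.get(b, 0.0) + p
    return f1, f2

def is_independent(joint, tol=1e-12):
    """X1 and X2 are independent iff f(x1, x2) = f(x1) f(x2) everywhere."""
    f1, f2 = marginals(joint)
    return all(abs(p - f1[a] * f2[b]) <= tol for (a, b), p in joint.items())

# Two independent fair coins: the joint factorizes.
indep = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
# X2 always equals X1: the marginals are uniform, but the joint does not factorize.
dep = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}
```

The second example shows that independence is a property of the joint, not of the marginals: both joints have identical uniform marginals, yet only the first factorizes.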
1.5 Expectation and covariance
The expectation of a random variable X is defined as follows for discrete and continuous random
variables:
E[X] = Σ_x x fX(x)  (discrete)
E[X] = ∫ x fX(x) dx  (continuous)
The expected value can be seen as the value observed on average. Similarly, another
quantity defined for a random variable is the variance, which is
Var[X] = E[(X − E[X])²]
Variance denotes the extent to which the values of the random variable can differ. The square root
of variance is called standard deviation and is usually denoted by σ.
Properties:
1. Linearity of expectation: If X1 and X2 are two random variables, then E[X1 +X2] = E[X1]+
E[X2]
2. Function of a random variable: If g : R → R is a function of a random variable X, then
E[g(X)] = Σ_x g(x) fX(x) if X is discrete and ∫ g(x) fX(x) dx if it is continuous. This property
is known as the law of the unconscious statistician.
3. Law of iterated expectations: If X1 and X2 are two random variables, then E[X1] = E[E[X1|X2]]
4. Conditional expectation: For two random variables X and Y, E[X|Y] is defined in the same
way by replacing fX(x) with fX|Y(x|y).
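The discrete expectation formulas, including the law of the unconscious statistician, fit in a few lines. A supplementary sketch for a fair die (the function names are illustrative):

```python
# pmf of a fair six-sided die
pmf = {x: 1 / 6 for x in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum_x g(x) f_X(x)  (law of the unconscious statistician)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(pmf)                            # E[X] = 3.5
var = expectation(pmf, lambda x: (x - mean) ** 2)  # Var[X] = E[(X - E[X])^2] = 35/12
```

Passing g = identity recovers the plain expectation; passing the squared deviation recovers the variance definition above.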
Till now, we have considered the expected value of a single random variable or a function of it.
But we can also define quantities that quantitatively capture the relation between random variables.
In particular, the covariance of two random variables X and Y is defined as
Cov[X, Y ] = E[(X − E[X])(Y − E[Y ])]
If we instead have random vectors x and y, i.e., each consists of multiple random variables
(x = (x1, x2, . . . , xk)), we define the covariance matrix as:
Cov[x, y] = E[(x − E[x])(y − E[y])^T]
The above covariance matrix Σ ∈ R^{k×k} has entries Σ_ij = Cov(xi, yj). For a random vector,
the notion of variance is replaced by the covariance matrix Cov(x, x), whose diagonal entries
are Var(xi).
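The covariance matrix can be estimated from samples by applying the definition entry by entry. A supplementary sketch, assuming a toy model x2 = x1 + noise so the off-diagonal entry is clearly positive (the model, sample count, and function names are illustrative):

```python
import random

def sample_mean(vectors):
    """Component-wise mean E[x] of a list of k-dimensional samples."""
    k, n = len(vectors[0]), len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(k)]

def covariance_matrix(vectors):
    """Empirical Cov[x, x]: Sigma_ij = average of (x_i - mu_i)(x_j - mu_j)."""
    mu = sample_mean(vectors)
    k, n = len(mu), len(vectors)
    return [[sum((v[i] - mu[i]) * (v[j] - mu[j]) for v in vectors) / n
             for j in range(k)] for i in range(k)]

rng = random.Random(0)
samples = []
for _ in range(50_000):
    x1 = rng.gauss(0, 1)
    samples.append((x1, x1 + 0.1 * rng.gauss(0, 1)))  # x2 tracks x1 closely
sigma = covariance_matrix(samples)
```

The result is symmetric, its diagonal holds the empirical variances (near 1 and 1.01 for this model), and the off-diagonal entry is near Cov(x1, x2) = 1.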
1.6 Some probability distributions
In this section we’ll look at some popular distribution functions:
1.6.1 Bernoulli Distribution
This is a distribution over a binary random variable x ∈ {0, 1}. A canonical example is a toss
of a coin with bias p, where p denotes the probability of observing 1 (say, heads):
p(x = 1) = p
Therefore, the probability distribution is:
Bernoulli(x; p) = p^x (1 − p)^{1−x}
The mean µ and variance σ² are p and p(1 − p) respectively.
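The Bernoulli mean and variance follow directly from the pmf by summing over {0, 1}. A supplementary sketch (function names and the bias p = 0.3 are illustrative):

```python
def bernoulli_pmf(x, p):
    """Bernoulli(x; p) = p^x (1 - p)^(1 - x) for x in {0, 1}."""
    return p ** x * (1 - p) ** (1 - x)

def bernoulli_moments(p):
    """Mean and variance computed directly from the pmf."""
    mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
    var = sum((x - mean) ** 2 * bernoulli_pmf(x, p) for x in (0, 1))
    return mean, var

mean, var = bernoulli_moments(0.3)  # expect mean = p = 0.3, var = p(1-p) = 0.21
```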
1.6.2 Binomial Distribution
This is the distribution over the number of successes in a series of independent experiments
with a binary-valued random variable, for example, observing x heads in N tosses of a coin with
bias p. The distribution function is given by:
Binomial(x; N, p) = (N choose x) p^x (1 − p)^{N−x}
Here, we have µ = Np and σ² = Np(1 − p).
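The binomial pmf is a one-liner with the standard library’s binomial coefficient, and its properties can be checked by summation. A supplementary sketch (N = 10 and p = 0.4 are illustrative values):

```python
import math

def binomial_pmf(x, N, p):
    """Binomial(x; N, p) = C(N, x) p^x (1 - p)^(N - x)."""
    return math.comb(N, x) * p ** x * (1 - p) ** (N - x)

N, p = 10, 0.4
# The pmf sums to 1 over x = 0..N, and its mean is N p.
total = sum(binomial_pmf(x, N, p) for x in range(N + 1))
mean = sum(x * binomial_pmf(x, N, p) for x in range(N + 1))  # expect N p = 4
```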
1.6.3 Multinomial Distribution
Multinomial distribution is a generalization of binomial distribution to more than two events. Here,
instead of having a single bias p, we have a vector [p1, p2, . . . , pk] defining the probability of observing
each of the k outcomes (Pi=k
i=1 pi = 1). And instead of observing number of outcomes for a single
event, we have a vector [x1, x2, . . . , xk] denoting number of each event. The distribution function
is given by:
Multinomial(x; N, p) = N!
x1! . . . xk!
Π
i=k
i=1p
xi
i
Here, µk = N pk and σ
2
k = N pk(1 − pk).
1.6.4 Gaussian Distribution
This is one of the most important distribution functions in machine learning and has some very good
properties. We would firstly look at multivariate Gaussian distribution also known as multivariate
normal distribution. Normal distribution for a random vector X ∈ R
d
is defined using a mean
vector µ ∈ R
d and covariance matrix Σ ∈ S
d
++ (space of symmetric positive definite matrices in
R
d×d
) as:
N (x; µ, Σ) = 1
(2π)
d/2|Σ|
1/2
exp (−
1
2
(x − µ)
T Σ
−1
(x − µ))
By symmetric positive definite, we mean that Σ is symmetric and all its eigenvalues are positive.
In the case of a single random variable, this distribution has the equivalent form in terms of
µ ∈ R and σ² ∈ R_+ and is given by:
N(x; µ, σ²) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²))
The distribution with parameters µ = 0 and σ² = 1 is called the standard normal distribution.
Some useful and interesting properties of the multivariate Gaussian distribution are:
1. The marginal and conditional distributions of a Gaussian distribution are also Gaussian.
2. Given a random variable x ∼ N(µ, Σ), the function y = Ax + b, with A ∈ R^{D×d} and
b ∈ R^D, has the distribution N(y; Aµ + b, AΣA^T). Therefore, every Gaussian random
variable y ∼ N(µ, Σ) in R^d can be expressed as a transformation y = Σ^{1/2} x + µ where
x ∼ N(0, I_d).
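The affine-transform property can be checked empirically in one dimension. A supplementary sketch using the standard library, with illustrative values a = 2 and b = 3: for y = a·x + b with x ∼ N(0, 1), the property predicts y ∼ N(b, a²):

```python
import random
import statistics

rng = random.Random(0)
a, b = 2.0, 3.0

# Draw standard normal samples x ~ N(0, 1) and apply the affine map y = a x + b.
xs = [rng.gauss(0, 1) for _ in range(100_000)]
ys = [a * x + b for x in xs]

# The affine-transform property predicts mean b = 3 and variance a^2 = 4 for y.
mean_y = statistics.fmean(ys)
var_y = statistics.pvariance(ys)
```

The sample mean and variance of y match the predicted N(3, 4) up to Monte Carlo error.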