EECS445 Discussion 1
Introduction
In this discussion, we are going to cover two main topics:
1. Linear Algebra
2. Matrix Calculus
1 Linear Algebra
1.1 Vector and Matrix
1.1.1 Vector
• $\vec{x} \in \mathbb{R}^{n \times 1}$ or $\vec{x} \in \mathbb{R}^n$ is a real-valued vector, i.e., a matrix that only has one column.
e.g.
$$\vec{x} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \in \mathbb{R}^3, \quad \vec{0} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \in \mathbb{R}^3$$
The second vector is known as the zero vector - a vector in which all entries are zero.
• Consider a vector $\vec{x} \in \mathbb{R}^n$. We define the $p$-norm $\|\vec{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$, where $p \geq 1$ and $x_i$ is the $i$-th entry of $\vec{x}$.
e.g. the 2-norm (Euclidean norm):
$$\vec{x} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad \|\vec{x}\|_2 = \sqrt{1^2 + 2^2 + 3^2}$$
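To make the norms concrete, here is a short NumPy check of the example above (an added illustration; NumPy is not part of the original notes):

```python
import numpy as np

x = np.array([1, 2, 3])

# 2-norm (Euclidean norm): sqrt(1^2 + 2^2 + 3^2)
print(np.linalg.norm(x, ord=2))   # 3.7416...

# General p-norm, e.g. p = 1 gives the sum of absolute values
print(np.linalg.norm(x, ord=1))   # 6.0
```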
• The definition of the dot product (inner product) is
$$\vec{a} \cdot \vec{b} = |\vec{a}||\vec{b}|\cos\theta$$
where $\theta$ is the angle between the two vectors. If two non-zero vectors $\vec{a}$ and $\vec{b}$ satisfy $\vec{a} \cdot \vec{b} = 0$, then they are orthogonal to each other.
e.g. If $\vec{a} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$ and $\vec{b} = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$, then $\vec{a} \cdot \vec{b} = (2 \times -1) + (1 \times 2) = 0$.
Graphically, this would be represented by
Figure 1: Orthogonal Vector Example.
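A quick NumPy check of the orthogonality example (added for illustration, not from the handout):

```python
import numpy as np

a = np.array([2, 1])
b = np.array([-1, 2])

# Dot product is zero, so the vectors are orthogonal
print(np.dot(a, b))  # 0
```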
1.1.2 Matrix
• $A \in \mathbb{R}^{n \times m}$ is a real-valued matrix with $n$ rows and $m$ columns.
• If $n = m$, then $A$ is a square matrix. Otherwise, $A$ is a rectangular matrix.
e.g. $A = \begin{bmatrix} 2 & 0 & 5 \\ 1 & 3 & 4 \end{bmatrix} \in \mathbb{R}^{2 \times 3}$ is a rectangular matrix;
$B = \begin{bmatrix} 2 & 0 \\ 1 & 3 \end{bmatrix} \in \mathbb{R}^{2 \times 2}$ is a square matrix.
• A diagonal matrix is a square matrix in which the non-diagonal elements are all zero.
e.g. $D = \begin{bmatrix} a & 0 & 0 \\ 0 & b & 0 \\ 0 & 0 & c \end{bmatrix}$ where $a, b, c \in \mathbb{R}$
• An $n \times n$ identity matrix $I_n$ is a diagonal matrix in which the diagonal elements are all one.
e.g. $I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$
• We denote $[A]_{ij} = a_{ij}$ as the entry at the $i$-th row and $j$-th column of $A$.
e.g. For the above $A$, $a_{13} = 5$.
• Consider matrices $A, B$. $A = B$ if and only if $a_{ij} = b_{ij}$ $\forall i, j$.
1.2 Transpose of a Matrix
• Consider $A \in \mathbb{R}^{n \times m}$. Then, $A^T \in \mathbb{R}^{m \times n}$ where $[A^T]_{ij} = a_{ji}$.
e.g.
$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}, \quad A^T = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}$$
Sometimes, we might write a vector using the transpose to save space.
e.g.
$$\vec{x} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}^T$$
• A matrix $A$ is symmetric if and only if $A^T = A$.
e.g.
$$A = \begin{bmatrix} 1 & 2 \\ 2 & 3 \end{bmatrix}$$
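A brief NumPy illustration of the transpose and the symmetry test (an added aside, not from the handout):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(A.T)  # 2x3 matrix: rows and columns swapped

S = np.array([[1, 2],
              [2, 3]])
# A matrix is symmetric iff it equals its transpose
print(np.array_equal(S, S.T))  # True
```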
1.3 Addition / Subtraction
• Consider $A, B \in \mathbb{R}^{n \times m}$. Then, $A \pm B \in \mathbb{R}^{n \times m}$ where $[A \pm B]_{ij} = a_{ij} \pm b_{ij}$.
e.g.
$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 3 & 2 \\ 4 & 1 \end{bmatrix}, \quad A - B = \begin{bmatrix} -2 & 0 \\ -1 & 3 \end{bmatrix}$$
• Note that commutativity and associativity hold for addition.
1.4 Scalar multiplication
• Consider a scalar $k \in \mathbb{R}$ and $A \in \mathbb{R}^{n \times m}$. Then, $kA \in \mathbb{R}^{n \times m}$ where $[kA]_{ij} = k a_{ij}$.
e.g.
$$3A = 3 \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 9 & 12 \end{bmatrix}$$
1.5 Matrix multiplication
• Consider $A \in \mathbb{R}^{n \times p}$, $B \in \mathbb{R}^{p \times m}$. Then, $AB \in \mathbb{R}^{n \times m}$ where $[AB]_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$.
e.g.
$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 1 \\ 3 \end{bmatrix}, \quad AB = \begin{bmatrix} 1 \times 1 + 2 \times 3 \\ 3 \times 1 + 4 \times 3 \end{bmatrix} = \begin{bmatrix} 7 \\ 15 \end{bmatrix}$$
e.g.
$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad \vec{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad A\vec{x} = \begin{bmatrix} x_1 + 2x_2 \\ 3x_1 + 4x_2 \end{bmatrix}$$
Note that we can write a linear system $x_1 + 2x_2 = 5$, $3x_1 + 4x_2 = 6$ in matrix form:
$$A\vec{x} = \begin{bmatrix} x_1 + 2x_2 \\ 3x_1 + 4x_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 6 \end{bmatrix}$$
• Some useful properties of matrix multiplication (illustrated in the NumPy sketch at the end of this subsection):
1. Associativity: $(AB)C = A(BC)$
2. Distributivity: $A(B + C) = AB + AC$
3. $AI_n = I_m A = A$ assuming $A \in \mathbb{R}^{m \times n}$
4. Commutativity does not hold for matrix multiplication in general, i.e., $AB$ is not necessarily equal to $BA$.
• Special case: vector-vector multiplication (dot product)
– Consider $\vec{x}, \vec{y} \in \mathbb{R}^n$. The dot product of $\vec{x}, \vec{y}$ is $\vec{x} \cdot \vec{y} = \vec{x}^T \vec{y}$.
e.g.
$$\vec{x} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad \vec{y} = \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix}, \quad \vec{x} \cdot \vec{y} = 1 \times 4 + 2 \times 5 + 3 \times 6 = 32$$
– Note that $\|\vec{x}\|_2 = \sqrt{\vec{x} \cdot \vec{x}}$.
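The following NumPy sketch (an added illustration, not part of the original notes) reproduces the product above, shows that commutativity fails in general, and computes the dot product:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[1],
              [3]])
print(A @ B)          # [[7], [15]], matching the example above

C = np.array([[0, 1],
              [1, 0]])
# Commutativity fails in general: AC and CA differ
print(np.array_equal(A @ C, C @ A))  # False

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
print(x @ y)          # 32, the dot product x^T y
```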
1.6 Inverse of a Square Matrix
• Consider $A \in \mathbb{R}^{n \times n}$. The inverse of a square matrix $A$ (provided it exists) is a unique matrix $A^{-1}$ such that
$$A^{-1}A = AA^{-1} = I$$
• $A$ is invertible if and only if $A^{-1}$ exists.
• There is a closed-form formula for finding the inverse of a $2 \times 2$ matrix. If $B = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$, then
$$B^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$
e.g.
$$A = \begin{bmatrix} -2 & 3 \\ 2 & -2 \end{bmatrix}, \quad A^{-1} = \begin{bmatrix} 1 & 1.5 \\ 1 & 1 \end{bmatrix}, \quad AA^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
e.g.
$$B = \begin{bmatrix} -2 & 3 \\ -4 & 6 \end{bmatrix}, \quad B^{-1} = \frac{1}{0} \begin{bmatrix} 6 & -3 \\ 4 & -2 \end{bmatrix}$$
Since $ad - bc = 0$, $B^{-1}$ does not exist, so $B$ is not invertible.
• Useful theorem: a square matrix A is invertible if and only if all of its eigenvalues are non-zero.
• Why the inverse? We can use the inverse to solve a linear system. As mentioned above, we can express a linear system as $A\vec{x} = \vec{b}$; we can then solve it by computing $\vec{x} = A^{-1}\vec{b}$.
• For more information, see http://cs229.stanford.edu/section/cs229-linalg.pdf, pages 11 and 12.
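A minimal NumPy sketch of solving a linear system this way (added for illustration; in practice np.linalg.solve is preferred over forming the inverse explicitly):

```python
import numpy as np

# The linear system from Section 1.5: x1 + 2*x2 = 5, 3*x1 + 4*x2 = 6
A = np.array([[1, 2],
              [3, 4]])
b = np.array([5, 6])

x = np.linalg.inv(A) @ b       # x = A^{-1} b
print(x)                       # [-4.   4.5]

# Numerically preferable: solve the system directly
print(np.linalg.solve(A, b))   # same answer
```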
1.7 Projection
1.7.1 Projection of a Point onto a Line
• The projection of a point onto a line is the point on the line closest to the point being
projected.
Figure 2: Projection example.[1]
In the 2D example in Fig. 2 above, the red point is the projection of the black point. $\vec{s}$ is a nonzero vector that defines the line. The line connecting the red point to the black point is perpendicular to the line defined by $\vec{s}$, as this is the shortest possible distance from the black point to the line. $c_p$ is some coefficient such that $c_p \vec{s}$ gives us the component of $\vec{v}$ that is parallel to $\vec{s}$, which is also the projection of $\vec{v}$ onto $\vec{s}$. $\vec{v} - c_p \vec{s}$ gives us the perpendicular component. We can solve for $c_p$ using $(\vec{v} - c_p \vec{s}) \cdot \vec{s} = 0$, since we know that $\vec{v} - c_p \vec{s}$ and $\vec{s}$ are perpendicular. We get $c_p = \frac{\vec{v} \cdot \vec{s}}{\vec{s} \cdot \vec{s}}$, which gives us:
$$\text{proj}_{[\vec{s}\,]}(\vec{v}) = \frac{\vec{v} \cdot \vec{s}}{\vec{s} \cdot \vec{s}} \vec{s}$$
This result generalizes to higher dimensions.
• The orthogonal projection of $\vec{v}$ onto the line spanned by a nonzero $\vec{s}$ can be calculated as
$$\text{proj}_{[\vec{s}\,]}(\vec{v}) = \frac{\vec{v} \cdot \vec{s}}{\vec{s} \cdot \vec{s}} \vec{s}$$
[1] https://en.wikibooks.org/wiki/Linear_Algebra/Orthogonal_Projection_Onto_a_Line
Exercise: Project the vector $\begin{bmatrix} 2 \\ 3 \end{bmatrix}$ onto the line $y = 2x$. We first must pick a direction vector (a non-zero vector parallel to the line) to project onto. For instance, $\vec{s} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$.
Figure 3: Projection Exercise
Note: The direction vector $\vec{s}$ points towards the positive side of the line: the dot product between $\vec{s}$ and a point on the same side is positive, while the dot product with a point on the opposite side is negative. Since the line defined by $\vec{s}$ stays the same regardless of the direction of $\vec{s}$, the resulting projection is the same in both cases. This means that any direction vector that is parallel to the line will result in the same projection.
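As a sanity check on the exercise (an added NumPy sketch; note it also gives away the answer, $(8/5, 16/5)$):

```python
import numpy as np

v = np.array([2.0, 3.0])
s = np.array([1.0, 2.0])    # direction vector of the line y = 2x

# proj_[s](v) = (v . s / s . s) s
proj = (v @ s) / (s @ s) * s
print(proj)                  # [1.6 3.2], i.e. (8/5, 16/5)

# The opposite direction vector yields the same projection
print((v @ -s) / (-s @ -s) * -s)  # [1.6 3.2]
```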
1.7.2 Projection of a Point onto a Plane
• Given a plane $P$:
$$ax + by + cz + d = 0,$$
the normal vector to the plane is given by $\vec{N} = \begin{bmatrix} a \\ b \\ c \end{bmatrix}$. For a point $A(x_0, y_0, z_0)$ and its projection point $A'(x, y, z)$ on plane $P$, since $\overrightarrow{AA'}$ has the same direction as $\vec{N}$, we can write
$$x = x_0 + a \cdot t, \quad y = y_0 + b \cdot t, \quad z = z_0 + c \cdot t$$
Because the projection point also needs to lie on plane $P$, we are able to find the values of $x$, $y$, $z$, and $t$.
Exercise: Find the orthogonal projection of the point $A(5, -6, 3)$ onto the plane $3x - 2y + z - 2 = 0$.
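A small NumPy sketch that carries out the recipe above for this exercise (an added illustration; note it reveals the answer, $A' = (-1, -2, 1)$):

```python
import numpy as np

n = np.array([3.0, -2.0, 1.0])    # normal vector (a, b, c)
d = -2.0                           # constant term in ax + by + cz + d = 0
A = np.array([5.0, -6.0, 3.0])     # point to project

# A' = A + t*n must satisfy n . A' + d = 0, so solve for t:
t = -(n @ A + d) / (n @ n)
A_proj = A + t * n
print(t, A_proj)                   # -2.0 [-1. -2.  1.]

# Verify the projected point lies on the plane
print(n @ A_proj + d)              # 0.0
```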
• A plane is defined by a normal vector $\vec{n} = (a, b, c)$ and a point on the plane $\vec{p}_0 = (x_0, y_0, z_0)$. The equation of the plane can also be written as $\vec{n} \cdot (\vec{p} - \vec{p}_0) = 0$, i.e., it holds for all points $\vec{p} = (x, y, z)$ such that $\vec{p} - \vec{p}_0$ is perpendicular to $\vec{n}$. The normal vector $\vec{n}$ specifies the positive direction of the plane, i.e., all points $\vec{p}$ on this side of the plane satisfy $\vec{n} \cdot (\vec{p} - \vec{p}_0) > 0$. This is because the angle between $\vec{n}$ and $\vec{p} - \vec{p}_0$ is less than 90°, and since $\vec{n} \cdot (\vec{p} - \vec{p}_0) = |\vec{n}||\vec{p} - \vec{p}_0|\cos\theta$ and $\cos\theta$ is positive when $-90° < \theta < 90°$, we know that $\vec{n} \cdot (\vec{p} - \vec{p}_0)$ is positive. Thus, for a point $\vec{p}$: if $\vec{n} \cdot (\vec{p} - \vec{p}_0) > 0$ then $\vec{p}$ is on the positive side of the plane; if $\vec{n} \cdot (\vec{p} - \vec{p}_0) < 0$ then $\vec{p}$ is on the negative side of the plane; and if $\vec{n} \cdot (\vec{p} - \vec{p}_0) = 0$ then $\vec{p}$ is on the plane.
2 Matrix Calculus
In ML, we often have to find some argument (usually a vector) that optimizes a cost function (you will see your first cost function in the coming week or two). Our cost functions are almost always multivariate (you already know how to optimize single-variable functions), so we need to resort to matrix calculus to optimize a multivariate function. In the following section, we define the gradient and the Hessian of a function, which are analogous to the first and second derivatives we learned in elementary calculus.
2.1 The Gradient
• Consider a function $f : \mathbb{R}^n \to \mathbb{R}$. The gradient of the function $f$ with respect to some input $\vec{x} \in \mathbb{R}^n$ is defined to be
$$\nabla_{\vec{x}} f(\vec{x}) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]^T$$
provided that all the partial derivatives exist.
e.g.
$$f(\vec{x}) = f(x, y) = x^2 + 2y^2, \quad \nabla_{\vec{x}} f = (2x, 4y)^T$$
e.g.
$$\nabla_{\vec{x}} (\vec{b} \cdot \vec{x}) = \vec{b}$$
• Useful properties of the gradient. Consider $f, g : \mathbb{R}^n \to \mathbb{R}$, $\vec{x} \in \mathbb{R}^n$, and $k \in \mathbb{R}$:
1. $\nabla_{\vec{x}} kf(\vec{x}) = k \nabla_{\vec{x}} f(\vec{x})$
2. $\nabla_{\vec{x}} (f(\vec{x}) \pm g(\vec{x})) = \nabla_{\vec{x}} f(\vec{x}) \pm \nabla_{\vec{x}} g(\vec{x})$
3. Product Rule: $\nabla_{\vec{x}} (f(\vec{x}) g(\vec{x})) = f(\vec{x}) \nabla_{\vec{x}} g(\vec{x}) + (\nabla_{\vec{x}} f(\vec{x})) g(\vec{x})$
• Exercise. Show that if $A \in \mathbb{R}^{n \times n}$ is symmetric and $q(\vec{x}) = \vec{x}^T A \vec{x}$, then $\nabla_{\vec{x}} q(\vec{x}) = 2A\vec{x}$. Note that $q(\vec{x})$ is known as a quadratic form. (Hint: first expand $\vec{x}^T A \vec{x}$ and then find $\frac{\partial q}{\partial x_i}$. A numerical sanity check follows below.)
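A numerical sanity check of this identity (an added sketch, not part of the handout), comparing $2A\vec{x}$ against a finite-difference approximation of the gradient at a random point:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = (M + M.T) / 2                 # a random symmetric matrix
x = rng.standard_normal(3)

q = lambda x: x @ A @ x           # the quadratic form q(x) = x^T A x

# Central-difference approximation of the gradient
eps = 1e-6
grad_fd = np.array([(q(x + eps * e) - q(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

print(np.allclose(grad_fd, 2 * A @ x, atol=1e-5))  # True
```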
2.2 The Hessian
• Consider a function $f : \mathbb{R}^n \to \mathbb{R}$. The Hessian of the function $f$ (written $\nabla^2_{\vec{x}} f(\vec{x})$ or $H$) with respect to $\vec{x} \in \mathbb{R}^n$ is an $n \times n$ matrix such that
$$[\nabla^2_{\vec{x}} f(\vec{x})]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
provided that all the second partial derivatives exist.
• If $f$ is twice continuously differentiable, then $H$ is symmetric. You can assume that all the functions encountered in this class are twice continuously differentiable, so all the Hessians in this class are symmetric.
• Example. $f(x, y) = x^2 + 2y^2$
$$\frac{\partial^2 f}{\partial x^2} = 2, \quad \frac{\partial^2 f}{\partial y^2} = 4, \quad \frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x} = 0$$
$$\nabla^2_{\vec{x}} f(\vec{x}) = \begin{bmatrix} 2 & 0 \\ 0 & 4 \end{bmatrix}$$
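As an added illustration (not in the original notes), a central-difference estimate recovers this Hessian numerically:

```python
import numpy as np

f = lambda v: v[0]**2 + 2 * v[1]**2   # f(x, y) = x^2 + 2y^2

def hessian_fd(f, v, eps=1e-4):
    """Central-difference estimate of the Hessian of f at v."""
    n = len(v)
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(v + eps*I[i] + eps*I[j]) - f(v + eps*I[i] - eps*I[j])
                       - f(v - eps*I[i] + eps*I[j]) + f(v - eps*I[i] - eps*I[j])) / (4 * eps**2)
    return H

print(np.round(hessian_fd(f, np.array([1.0, 1.0]))))  # [[2. 0.] [0. 4.]]
```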
2.3 Chain Rule
• Recall the chain rule in single-variable calculus: $\frac{d}{dx} f(g(x)) = f'(g(x)) g'(x)$. For example:
$$\frac{d}{dx} \left[ \sqrt{3x^2 - x} \right], \quad f(x) = \sqrt{x}, \quad g(x) = 3x^2 - x$$
$$f'(x) = \frac{1}{2} x^{-\frac{1}{2}}, \quad f'(g(x)) = \frac{1}{2} (3x^2 - x)^{-\frac{1}{2}}, \quad g'(x) = 6x - 1$$
$$\frac{d}{dx} \left[ \sqrt{3x^2 - x} \right] = \frac{1}{2} (3x^2 - x)^{-\frac{1}{2}} (6x - 1)$$
• The chain rule also extends to multivariable cases. Here, we look at a special case. Consider a function $g : \mathbb{R}^n \to \mathbb{R}$ and $f : \mathbb{R} \to \mathbb{R}$. By the chain rule,
$$\nabla_{\vec{x}} \left[ f \circ g(\vec{x}) \right] = f'(g(\vec{x})) \nabla_{\vec{x}} g(\vec{x})$$
e.g. $h(\vec{x}) = (1 + \vec{\theta} \cdot \vec{x})^2$. $h$ is a composite function that can be written in the form $f \circ g$ where $f(y) = y^2$ and $g(\vec{x}) = 1 + \vec{\theta} \cdot \vec{x}$:
$$\nabla_{\vec{x}} h(\vec{x}) = \nabla_{\vec{x}} \left[ f \circ g(\vec{x}) \right] = f'(g(\vec{x})) \nabla_{\vec{x}} g(\vec{x}) = 2(1 + \vec{\theta} \cdot \vec{x}) \vec{\theta}$$
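A finite-difference check of this gradient (an added sketch; the values of $\vec{\theta}$ and $\vec{x}$ are arbitrary):

```python
import numpy as np

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])

h = lambda x: (1 + theta @ x) ** 2

# Analytic gradient from the chain rule: 2 (1 + theta . x) theta
grad = 2 * (1 + theta @ x) * theta

eps = 1e-6
grad_fd = np.array([(h(x + eps*e) - h(x - eps*e)) / (2*eps) for e in np.eye(3)])
print(np.allclose(grad, grad_fd, atol=1e-4))  # True
```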
3 References
Matrix Calculus:
• http://cs229.stanford.edu/section/cs229-linalg.pdf
• http://www.cs.toronto.edu/~bonner/courses/2012s/csc338/matrix_cookbook.pdf