EECS445 Discussion 2
Introduction
This discussion will cover 2 main topics:
1. Matrix Calculus Review
2. Perceptron Algorithm
1 Matrix Calculus Review
(Optional) Show that if $A$ is symmetric and $q(\vec{x}) = \vec{x}^T A \vec{x}$, then $\nabla q(\vec{x}) = 2A\vec{x}$. Note that $q(\vec{x})$ is known as a quadratic form. (Hint: first expand $\vec{x}^T A \vec{x}$, then find $\frac{\partial q}{\partial x_i}$.)
Solution:
$$
q(\vec{x}) = \vec{x}^T A \vec{x}
= \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix}
  \begin{bmatrix} \vec{a}_1 & \vec{a}_2 & \dots & \vec{a}_n \end{bmatrix}
  \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
= \begin{bmatrix} \vec{x} \cdot \vec{a}_1 & \vec{x} \cdot \vec{a}_2 & \dots & \vec{x} \cdot \vec{a}_n \end{bmatrix}
  \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
= \sum_{i=1}^{n} (\vec{x} \cdot \vec{a}_i) x_i
= \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_j x_i
$$

where $\vec{a}_i$ denotes the $i$th column of $A$. Expanding,

$$
q(\vec{x}) = (a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n)x_1 + (a_{21}x_1 + a_{22}x_2 + \dots + a_{2n}x_n)x_2 + \dots + (a_{n1}x_1 + a_{n2}x_2 + \dots + a_{nn}x_n)x_n
$$

Differentiating with respect to $x_i$ and using the symmetry $a_{ji} = a_{ij}$,

$$
\frac{\partial q}{\partial x_i}
= \sum_{j=1}^{i-1} a_{ji}x_j + \sum_{j=i+1}^{n} a_{ji}x_j + \sum_{j=1}^{n} a_{ij}x_j + a_{ii}x_i
= \sum_{j=1}^{n} 2a_{ji}x_j
= 2\,\vec{a}_i \cdot \vec{x}
$$

Stacking the partial derivatives,

$$
\nabla q(\vec{x})
= \begin{bmatrix} \frac{\partial q}{\partial x_1} & \frac{\partial q}{\partial x_2} & \dots & \frac{\partial q}{\partial x_n} \end{bmatrix}^T
= \begin{bmatrix} 2\vec{a}_1 \cdot \vec{x} & 2\vec{a}_2 \cdot \vec{x} & \dots & 2\vec{a}_n \cdot \vec{x} \end{bmatrix}^T
= 2A\vec{x},
$$

where the last equality again uses the symmetry of $A$: since $\vec{a}_i$ is also the $i$th row of $A$, the stacked vector is exactly $2A\vec{x}$.
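A quick numerical check of this result (a minimal sketch; the matrix and point are arbitrary choices, not from the handout) compares the closed-form gradient $2A\vec{x}$ against a finite-difference approximation of $q$:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                      # symmetrize so that A = A^T
x = rng.standard_normal(4)

q = lambda v: v @ A @ v                # quadratic form q(x) = x^T A x
grad_closed = 2 * A @ x                # claimed gradient for symmetric A

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_fd = np.array([(q(x + eps * e) - q(x - eps * e)) / (2 * eps) for e in np.eye(4)])

print(np.allclose(grad_closed, grad_fd, atol=1e-5))   # expected: True
```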
Exercise
Let $\bar{x}$ be an $n$-dimensional vector and $A$ be an $n \times n$ matrix (not necessarily symmetric). $\bar{x}^T A \bar{x}$ is often called the quadratic form. Show that the gradient of $\bar{x}^T A \bar{x}$ with respect to $\bar{x}$ is $(A + A^T)\bar{x}$.
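The same kind of numerical check (again a sketch with an arbitrary matrix, and not a substitute for the analytic derivation the exercise asks for) can be used for a general, non-symmetric $A$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))        # general A, not symmetric in general
x = rng.standard_normal(4)

q = lambda v: v @ A @ v
grad_closed = (A + A.T) @ x            # the gradient claimed in the exercise

eps = 1e-6
grad_fd = np.array([(q(x + eps * e) - q(x - eps * e)) / (2 * eps) for e in np.eye(4)])
print(np.allclose(grad_closed, grad_fd, atol=1e-5))   # expected: True
```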
2 Perceptron Algorithm
Linear Separability
A collection of points is linearly separable if it is possible to find a hyperplane that fully divides the “positive” points from the “negative” points. (If a decision boundary that perfectly divides the data passes through the origin, i.e., has no offset, then we call the collection of points linearly separable through the origin.) Determining linear separability is not always easy: demonstrating it requires exhibiting such a hyperplane, or decision boundary, and finding good decision boundaries is what much of machine learning is about.
Example
In the two-dimensional case, such a hyperplane is simply a line. For each of the two sets of points,
use your intuition to determine whether the points are linearly separable.
Figure 1: Draw the line that separates positive from negative points if it exists.
Note: The second subplot represents an XOR gate.
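To make the definition concrete, here is a minimal sketch (with hypothetical coordinates, not the points plotted in Figure 1) that checks the separability condition $y^{(i)}(\bar\theta \cdot \bar{x}^{(i)}) > 0$ for a candidate boundary through the origin:

```python
import numpy as np

def separates(theta, X, y):
    """True if the boundary {x : theta . x = 0} classifies every point correctly."""
    return bool(np.all(y * (X @ theta) > 0))

# Hypothetical separable points (labels +1 in the upper-right, -1 in the lower-left).
X_sep = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y_sep = np.array([1, 1, -1, -1])
print(separates(np.array([1.0, 1.0]), X_sep, y_sep))   # True: this line works

# XOR-style labels: no single hyperplane (through the origin or not) separates them.
X_xor = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y_xor = np.array([1, 1, -1, -1])
print(separates(np.array([1.0, 1.0]), X_xor, y_xor))   # False for this (and every) candidate
```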
Finding a decision boundary $\bar\theta$ that minimizes training error $E_n(\bar\theta)$ is difficult in the general case. However, in the special case where the positive and negative examples are perfectly linearly separable, the simplest algorithm we can fall back on is the Perceptron Algorithm.
The Perceptron Algorithm is a mistake-driven algorithm: it attempts to correct each misclassification by nudging the current decision boundary’s parameters in the direction that would label the misclassified point correctly. If the data is linearly separable, the algorithm is guaranteed to converge in a finite number of steps.
An outline of the algorithm is given below.
Algorithm 1 Perceptron Algorithm
$k \leftarrow 0$, $\bar\theta^{(k)} \leftarrow \bar{0}$
while at least one point is misclassified do
    for $i = 1, \dots, n$ do
        if $y^{(i)}(\bar\theta^{(k)} \cdot \bar{x}^{(i)}) \le 0$ then
            $\bar\theta^{(k+1)} \leftarrow \bar\theta^{(k)} + y^{(i)}\bar{x}^{(i)}$
            $k \leftarrow k + 1$
        end if
    end for
end while
return $\bar\theta$
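Below is a minimal Python sketch of Algorithm 1 (the function name `perceptron_origin` and the `max_iters` guard are our own additions, not part of the course code). If the data are not linearly separable through the origin the loop would never terminate on its own, hence the guard.

```python
import numpy as np

def perceptron_origin(X, y, max_iters=1000):
    """Perceptron without offset (Algorithm 1).

    X: (n, d) array of points x^(i); y: (n,) array of labels in {-1, +1}.
    Returns theta and the number of updates triggered by each point.
    """
    n, d = X.shape
    theta = np.zeros(d)                    # theta^(0) <- 0
    mistakes = np.zeros(n, dtype=int)
    for _ in range(max_iters):
        updated = False
        for i in range(n):
            # Misclassified (or exactly on the boundary) when y^(i)(theta . x^(i)) <= 0.
            if y[i] * (theta @ X[i]) <= 0:
                theta = theta + y[i] * X[i]    # theta^(k+1) <- theta^(k) + y^(i) x^(i)
                mistakes[i] += 1
                updated = True
        if not updated:                    # no point is misclassified: converged
            break
    return theta, mistakes
```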
Exercise
1. Consider the situation where $\bar\theta^{(k)} = [2, 2]^T$ and the $i$th point is $\bar{x}^{(i)} = [1, -2]^T$, $y^{(i)} = 1$. Perform the update rule to obtain $\bar\theta^{(k+1)}$. Using the new decision boundary, is $\bar{x}^{(i)}$ classified correctly?
2. Does $\bar\theta^{(k+1)}$ always shift in the correct direction?
3. Consider the following dataset:
   • $\bar{x}^{(1)} = [1, 2]^T$, $y^{(1)} = +1$
   • $\bar{x}^{(2)} = [-1, -1]^T$, $y^{(2)} = +1$
   • $\bar{x}^{(3)} = [1, -2]^T$, $y^{(3)} = -1$
   How many times is each data point misclassified before convergence? What is the resulting decision boundary? Would the number of updates be different if we swapped the order of the points (consider the order $\bar{x}^{(3)}, \bar{x}^{(2)}, \bar{x}^{(1)}$)? A small code sketch for checking your answers follows this list.
4. In general, does the hyperplane to which the Perceptron Algorithm converges depend on the order of the points?
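To check your answers to question 3, the sketch below reuses `perceptron_origin` from the earlier code block, running it on the three points in both orderings and reporting how many updates each point triggers (only the points and labels come from the exercise; the rest is illustrative):

```python
import numpy as np

# Dataset from question 3.
X = np.array([[1.0, 2.0], [-1.0, -1.0], [1.0, -2.0]])
y = np.array([1, 1, -1])

# Original order, then the reversed order x(3), x(2), x(1).
for order in ([0, 1, 2], [2, 1, 0]):
    theta, mistakes = perceptron_origin(X[order], y[order])
    print("order", order, "-> theta =", theta, ", updates per point =", mistakes)
```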
Perceptron Algorithm with Offset
While the Perceptron Algorithm has a very simple and intuitive formulation, it only allows us to find a separating hyperplane through the origin, if one exists. Fortunately, we can expand the set of potential linear classifiers using a simple modification.
As mentioned in Discussion 1, we can specify any hyperplane in a $d$-dimensional space using the equation $\{\bar{x} \mid \bar\theta \cdot \bar{x} + b = 0\}$. Therefore, if we include a scalar offset $b$ as a model parameter, we increase the search space to include all possible hyperplanes. We must also modify the condition for correctly classifying a point: $y^{(i)}(\bar\theta^{(k)} \cdot \bar{x}^{(i)} + b) > 0$.
Notice that the statement $\bar\theta \cdot \bar{x} + b = 0$ is equivalent to saying $[\bar\theta, b] \cdot [\bar{x}, 1] = 0$ (why?). Therefore, we can augment our parameter $\bar\theta$ and all points such that $\bar{x}'^{(i)} = [\bar{x}^{(i)}, 1]$ and $\bar\theta' = [\bar\theta, b]$. The algorithm remains unchanged, since $y^{(i)}(\bar\theta' \cdot \bar{x}'^{(i)}) > 0$ is equivalent to $y^{(i)}(\bar\theta \cdot \bar{x}^{(i)} + b) > 0$. The update rule also remains unchanged: $\bar\theta'^{(k+1)} \leftarrow \bar\theta'^{(k)} + y^{(i)} \bar{x}'^{(i)}$. If $y^{(i)}(\bar\theta^{(k)} \cdot \bar{x}^{(i)} + b) \le 0$, then the update to $b$ always makes the left expression more positive, since $b^{(k+1)} = b^{(k)} + y^{(i)}$ changes the term $y^{(i)} b$ by $(y^{(i)})^2 = 1$ (write out the whole update to see why). Here is the pseudocode for the Perceptron Algorithm with offset:
Algorithm 2 Perceptron Algorithm with Offset
$k \leftarrow 0$, $\bar\theta^{(k)} \leftarrow \bar{0}$, $b^{(0)} \leftarrow 0$
while at least one point is misclassified do
    for $i = 1, \dots, n$ do
        if $y^{(i)}(\bar\theta^{(k)} \cdot \bar{x}^{(i)} + b^{(k)}) \le 0$ then
            $\bar\theta^{(k+1)} \leftarrow \bar\theta^{(k)} + y^{(i)}\bar{x}^{(i)}$
            $b^{(k+1)} \leftarrow b^{(k)} + y^{(i)}$
            $k \leftarrow k + 1$
        end if
    end for
end while
return $\bar\theta$, $b$
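Below is a minimal Python sketch of Algorithm 2 (the function name `perceptron_offset` and the `max_iters` guard are our own additions). The commented lines at the end show the equivalent augmentation trick from above: appending a constant 1 to each point and running the origin-only perceptron recovers $[\bar\theta, b]$.

```python
import numpy as np

def perceptron_offset(X, y, max_iters=1000):
    """Perceptron with offset (Algorithm 2). Labels y must be in {-1, +1}."""
    n, d = X.shape
    theta, b = np.zeros(d), 0.0
    for _ in range(max_iters):
        updated = False
        for i in range(n):
            if y[i] * (theta @ X[i] + b) <= 0:   # misclassified (or on the boundary)
                theta = theta + y[i] * X[i]
                b = b + y[i]
                updated = True
        if not updated:
            break
    return theta, b

# Equivalent formulation via augmentation: add a constant-1 feature to each point
# and run the origin-only perceptron; the last coordinate of theta' plays the role of b.
# X_aug = np.hstack([X, np.ones((len(X), 1))])
# theta_aug, _ = perceptron_origin(X_aug, y)    # theta_aug = [theta, b]
```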