CPSC 340 Assignment 3
Instructions
Rubric: {mechanics:3}
The above points are allocated for following the general homework instructions on the course homepage.
Contents
1 Vectors, Matrices, and Quadratic Functions
1.1 Basic Operations
1.2 Converting to Matrix/Vector/Norm Notation
1.3 Minimizing Quadratic Functions as Linear Systems
2 Robust Regression and Gradient Descent
2.1 Weighted Least Squares in One Dimension
2.2 Smooth Approximation to the L1-Norm
2.3 Robust Regression
3 Linear Regression and Nonlinear Bases
3.1 Adding a Bias Variable
3.2 Polynomial Basis
4 Non-Parametric Bases and Cross-Validation
4.1 Proper Training and Validation Sets
4.2 Cross-Validation
4.3 Cost of Non-Parametric Bases
5 Very-Short Answer Questions
5.1 Essentials
5.2 These ones are optional and not for marks
1 Vectors, Matrices, and Quadratic Functions
The first part of this question asks you to review basic operations on vectors and matrices. If you are
rusty on basic vector and matrix operations, see the notes on linear algebra on the course webpage. If you
have time and want a deeper refresher on linear algebra, I have heard good things about the video series
Essence of Linear Algebra at https://youtu.be/kjBOesZCoqc and the e-book Immersive Linear Algebra at
http://immersivemath.com/ila/index.html. We will continue to use linear algebra heavily throughout
the rest of the course.
1.1 Basic Operations
Rubric: {reasoning:3}
Using the definitions below,
\[
\alpha = 5, \quad
x = \begin{bmatrix} 2 \\ -3 \end{bmatrix}, \quad
y = \begin{bmatrix} 1 \\ 4 \end{bmatrix}, \quad
z = \begin{bmatrix} 2 \\ 0 \\ 1 \end{bmatrix}, \quad
A = \begin{bmatrix} 1 & 2 \\ 2 & 3 \\ 3 & -2 \end{bmatrix},
\]
evaluate the following expressions (show your work, but you may use answers from previous parts to simplify
calculations):
1. $x^T x$.
2. $\|x\|^2$.
3. $x^T(x + \alpha y)$.
4. $Ax$.
5. $z^T A x$.
6. $A^T A$.
If {α, β} are scalars, {x, y, z} are real-valued column-vectors of length d, and {A, B, C} are real-valued d × d
matrices, state whether each of the below statements is true or false in general and give a short explanation.
7. $yy^T y = \|y\|^2 y$.
8. $x^T A^T (Ay + Az) = x^T A^T A y + z^T A^T A x$.
9. $x^T (B + C) = Bx + Cx$.
10. $(A + BC)^T = A^T + C^T B^T$.
11. $(x - y)^T (x - y) = \|x\|^2 - x^T y + \|y\|^2$.
12. $(x - y)^T (x + y) = \|x\|^2 - \|y\|^2$.
Hint: check the dimensions of the result, and remember that matrix multiplication is generally not commutative.
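If you want an extra sanity check on your true/false answers, a quick numerical experiment with random data can help: a single counterexample shows a statement is false in general, while agreement across random draws is suggestive (though not a proof). Here is a small optional sketch, using an identity that is not on the list above:

# Optional: a numerical spot check (not a proof) for identities like the ones
# above, using random data. A single counterexample is enough to show a
# statement is false in general.
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))

# Example identity (not from the list above): (AB)^T = B^T A^T.
lhs = (A @ B).T
rhs = B.T @ A.T
print(np.allclose(lhs, rhs))  # prints True; agreement on random draws suggests "true in general"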
1.2 Converting to Matrix/Vector/Norm Notation
Rubric: {reasoning:2}
Using our standard supervised learning notation (X, y, w), express the following functions in terms of vectors, matrices, and norms (there should be no summations or maximums).
1. $\sum_{i=1}^n |w^T x_i - y_i|$.
2. $\max_{i \in \{1,2,\dots,n\}} |w^T x_i - y_i| + \frac{\lambda}{2} \sum_{j=1}^d w_j^2$.
3. $\sum_{i=1}^n z_i (w^T x_i - y_i)^2 + \lambda \sum_{j=1}^d |w_j|$.
You can use $Z$ to denote a diagonal matrix that has the values $z_i$ along the diagonal.
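As a reminder of the pattern being asked for, the standard example from class (which is not one of the three expressions above) rewrites the unweighted squared-error sum as a norm:
\[
\sum_{i=1}^n (w^T x_i - y_i)^2 = \|Xw - y\|^2,
\]
since the $i$-th entry of the residual vector $Xw - y$ is exactly $w^T x_i - y_i$.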
1.3 Minimizing Quadratic Functions as Linear Systems
Rubric: {reasoning:3}
Write finding a minimizer w of the functions below as a system of linear equations (using vector/matrix
notation and simplifying as much as possible). Note that all the functions below are convex so finding a w
with ∇f(w) = 0 is sufficient to minimize the functions (but show your work in getting to this point).
1. $f(w) = \frac{1}{2}\|w - v\|^2$.
2. $f(w) = \frac{1}{2}\|w\|^2 + w^T X^T y$.
3. $f(w) = \frac{1}{2}\sum_{i=1}^n z_i (w^T x_i - y_i)^2$.
Above we assume that v is a d × 1 vector.
Hint: Once you convert to vector/matrix notation, you can use the results from class to quickly compute
these quantities term-wise. As a sanity check for your derivation, make sure that your results have the right
dimensions.
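For reference, the same process applied to a generic convex quadratic (an illustration, not one of the three functions above) looks like this, assuming $A$ is symmetric positive semi-definite:
\[
f(w) = \frac{1}{2} w^T A w - b^T w, \qquad \nabla f(w) = Aw - b,
\]
so setting $\nabla f(w) = 0$ yields the linear system $Aw = b$.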
2 Robust Regression and Gradient Descent
If you run python main.py -q 2, it will load a one-dimensional regression dataset that has a non-trivial
number of ‘outlier’ data points. These points do not fit the general trend of the rest of the data, and pull
the least squares model away from the main downward trend that most data points exhibit:
[Figure: "Least Squares" fit to the outlier dataset (x from 0.0 to 1.0, y from about −5 to 25); the fitted line is pulled away from the main downward trend.]
Note: we are fitting the regression without an intercept here, just for simplicity of the homework question.
In reality one would rarely do this. But here it’s OK because the “true” line passes through the origin (by
design). In Q3.1 we’ll address this explicitly.
2.1 Weighted Least Squares in One Dimension
Rubric: {code:3}
One of the most common variations on least squares is weighted least squares. In this formulation, we have a weight $z_i$ for every training example. To fit the model, we minimize the weighted squared error,
\[
f(w) = \frac{1}{2}\sum_{i=1}^n z_i (w^T x_i - y_i)^2.
\]
In this formulation, the model focuses on making the error small for examples $i$ where $z_i$ is high. Similarly, if $z_i$ is low then the model allows a larger error.
Complete the model class, WeightedLeastSquares, that implements this model (note that Q1.3.3 asks you
to show how this formulation can be solved as a linear system). Apply this model to the data containing
outliers, setting z = 1 for the first 400 data points and z = 0.1 for the last 100 data points (which are the
outliers). Hand in your code and the updated plot.
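For concreteness, here is a minimal sketch of the kind of solver this calls for, based on the standard weighted normal equations for this objective, $(X^T Z X) w = X^T Z y$ with $Z = \mathrm{diag}(z)$. The fit/predict interface and method signatures are assumptions about the starter code, not quotes from it:

# A minimal sketch of a weighted least squares fit (assumed fit/predict
# interface; the actual starter-code signatures may differ).
import numpy as np
from numpy.linalg import solve

class WeightedLeastSquares:
    def fit(self, X, y, z):
        # Solve (X^T Z X) w = X^T Z y with Z = diag(z), without forming Z:
        # scaling the columns of X^T by the entries of z gives X^T Z.
        XTZ = X.T * z
        self.w = solve(XTZ @ X, XTZ @ y)

    def predict(self, X):
        return X @ self.w

On the outlier dataset above, z would be a length-500 vector with 1 for the first 400 examples and 0.1 for the last 100.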
2.2 Smooth Approximation to the L1-Norm
Rubric: {reasoning:3}
Unfortunately, we typically do not know the identities of the outliers. In situations where we suspect that
there are outliers, but we do not know which examples are outliers, it makes sense to use a loss function that
is more robust to outliers. In class, we discussed using the sum of absolute values objective,
\[
f(w) = \sum_{i=1}^n |w^T x_i - y_i|.
\]
This is less sensitive to outliers than least squares, but it is non-differentiable and harder to optimize.
Nevertheless, there are various smooth approximations to the absolute value function that are easy to
optimize. One possible approximation is to use the log-sum-exp approximation of the max function:¹
\[
|r| = \max\{r, -r\} \approx \log(\exp(r) + \exp(-r)).
\]
Using this approximation, we obtain an objective of the form
\[
f(w) = \sum_{i=1}^n \log\bigl(\exp(w^T x_i - y_i) + \exp(y_i - w^T x_i)\bigr),
\]
which is smooth but less sensitive to outliers than the squared error. Derive the gradient ∇f of this function
with respect to w. You should show your work but you do not have to express the final result in matrix
notation.
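As an optional numerical check (not part of what you hand in), you can see how closely $\log(\exp(r) + \exp(-r))$ tracks $|r|$; the gap is largest at $r = 0$, where it equals $\log 2 \approx 0.69$:

# Optional: compare the smooth approximation to |r| on a grid of values.
import numpy as np

r = np.linspace(-5, 5, 101)
approx = np.logaddexp(r, -r)               # log(exp(r) + exp(-r)), computed stably
print(np.max(np.abs(approx - np.abs(r))))  # ~0.693 (= log 2), attained at r = 0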
2.3 Robust Regression
Rubric: {code:2,reasoning:1}
The class LinearModelGradient is the same as LeastSquares, except that it fits the least squares model using
a gradient descent method. If you run python main.py -q 2.3 you’ll see it produces the same fit as we
obtained using the normal equations.
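For reference, a minimal sketch of the objective/gradient pair that such a gradient-descent fit would repeatedly evaluate for ordinary least squares is below; the convention of returning both the function value and the gradient from one helper (and its name) is an assumption, not a quote from the starter code.

# Sketch of the function value and gradient a gradient-based least squares
# fit would evaluate at each iteration (hypothetical helper name).
import numpy as np

def least_squares_fun_obj(w, X, y):
    r = X @ w - y                 # residual vector Xw - y
    f = 0.5 * np.sum(r ** 2)      # f(w) = (1/2) ||Xw - y||^2
    g = X.T @ r                   # gradient: X^T (Xw - y)
    return f, g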
¹ Other possibilities are the Huber loss, or $|r| \approx \sqrt{r^2 + \epsilon}$ for some small $\epsilon$.