EECS 445 Discussion 8: Neural Networks


1 Feedforward Neural Network
1.1 Intuition
In order for machine learning algorithms to work, we must have a good representation of the data. Thus, many people put a lot of effort into feature engineering: using prior knowledge about the data to manually create features that better represent the data. However, this requires a lot of manual work and expert knowledge, so we would like to automate the process if possible. For example, consider the situation in Figure 1 below. We want to construct a machine learning algorithm to detect whether an image contains a motorbike or not. We have two ways of representing the data: as raw pixels, or as features of a motorbike (positions of the handle and wheel). The raw pixel representation does not tell us much, because pixel values can vary widely even between images of motorbikes. Representing the data as motorbike features is more helpful for our task, because the positions of wheels and handles are good indications of whether the object in the image is a motorbike. This representation separates the space of motorbikes and non-motorbikes more clearly, which makes learning a decision boundary much easier. However, just handles and wheels might not be enough; we might also want features like the number of wheels, the size, the position of the exhaust pipe, and so on. This can get out of hand very quickly, because there can be many, many candidate features. Another issue is that we have to manually determine whether each particular feature is helpful for our task, which is generally very hard. We don't want to spend all our time finding useful features; we want to spend it solving the actual task! So how do we automatically find useful features for our learning algorithm? This is where neural networks come in.
Figure 1: Feature representation example: Motorbike detection
1.2 Basic Structure
Neural networks are composed of simple computational units called neurons. Figure 2 below depicts a typical neuron.
Figure 2: Simple neuron
The $x_i$'s are the inputs and the $w_i$'s are the weights for each input. The value of the neuron, $z(\bar{x})$, is computed as a linear combination of the inputs, where each input is scaled by its weight: $z(\bar{x}) = \sum_{i=1}^{n} w_i x_i$, where $n$ is the number of inputs. In order to diversify the functions that neural networks are able to represent, the value of a neuron is often transformed using a non-linear transformation $h = g(z)$. This non-linear transformation $g$ is called an activation function. Typical activation functions include the Rectified Linear Unit (ReLU) $g(z) = \max\{0, z\}$ and the sigmoid $g(z) = \frac{1}{1 + e^{-z}}$. For more examples of activation functions, please refer to Section 1.7.
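To make this concrete, here is a minimal NumPy sketch of a single neuron; the input and weight values are made-up examples, not anything from the project.

import numpy as np

# Example inputs and weights (made-up values for illustration).
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.8, -1.2])

# Pre-activation value: linear combination of the inputs, z = sum_i w_i * x_i.
z = np.dot(w, x)

# Non-linear activations applied to z.
h_relu = np.maximum(0.0, z)           # ReLU: max{0, z}
h_sigmoid = 1.0 / (1.0 + np.exp(-z))  # sigmoid: 1 / (1 + e^{-z})

print(z, h_relu, h_sigmoid)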
1.3 Multilayer Network
Now that we have our basic building block, we can combine many neurons to represent more complicated functions/models (much like with boosting). Figure 3 below shows a neural network with one hidden layer (the middle layer). Hidden layers are the layers between the input layer (first layer) and the output layer (last layer).
Figure 3: Neural network with one hidden layer
Notation: $w^{(i)}_{kj}$ is the weight parameter for the $i$-th layer, $k$-th unit, and $j$-th input.
Looks cool, but how do we know what function this neural network is learning? Let’s work it out:
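One way to write out the composed function for the one-hidden-layer network in Figure 3, using the notation above with hidden activation $g$, output activation $f$, and bias weights $w^{(\cdot)}_{\cdot 0}$ (the exact bias convention is an assumption), is:
$$h(\bar{x}; W) = f\Big( \sum_{j} w^{(2)}_{1j} \, g\Big( \sum_{i} w^{(1)}_{ji} x_i + w^{(1)}_{j0} \Big) + w^{(2)}_{10} \Big)$$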
As you can see, a neural network is just a really complicated function. Through its non-linear transformations, the network maps the input data into another representation that is easier to work with. We'll see an example of this idea in the next section.
Another important thing to mention is: what happens if we use linear transformations instead of non-linear transformations?
Discussion Question 1. Assume that our activation functions $g$ and $f$ are the linear transforms $g(z) = cz$ and $f(z) = dz$, where $c$ and $d$ are constants. Derive the final decision boundary $h(\bar{x}; W)$ again and reason why we lose a lot of expressive power with our model if we only use linear transformations.
1.4 Example
In this section, we will work through an example to see how a simple neural network can transform the input data into
a space that is easier to work with.
Discussion Question 2. The input to our model is shown below in Figure 4(a). Obviously, the data is not linearly separable, so it is impossible for any linear model to solve this problem. Let us construct a simple neural network with two hidden units $z_1$ and $z_2$:
$z^{(2)}_1 = w^{(1)}_{11} x_1 + w^{(1)}_{12} x_2 + w^{(1)}_{10} \;\rightarrow\; h^{(2)}_1 = \max\{0, z^{(2)}_1\}$
$z^{(2)}_2 = w^{(1)}_{21} x_1 + w^{(1)}_{22} x_2 + w^{(1)}_{20} \;\rightarrow\; h^{(2)}_2 = \max\{0, z^{(2)}_2\}$
The boundaries corresponding to $z_1 = 0$ and $z_2 = 0$ are shown in Figure 4(a). Each input $\bar{x}$ will be mapped to a representation $[h_1(z_1), h_2(z_2)]$. Please fill in Figure 4(b) with what the resulting data will look like after the transformation.
Figure 4: Exercise of a neural network transforming data. (a) Original data. (b) Fill in the data after transformation by the neural network.
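As a rough illustration of this kind of transformation, here is a small NumPy sketch that maps 2-D points through two ReLU hidden units. The weight values below are made-up assumptions; the actual boundaries come from Figure 4(a).

import numpy as np

# Made-up example weights for the two hidden units.
# Row j holds [w_j1, w_j2, w_j0] for z_j = w_j1*x1 + w_j2*x2 + w_j0.
W = np.array([[ 1.0,  1.0, -0.5],
              [-1.0, -1.0,  1.5]])

def transform(points):
    """Map each point (x1, x2) to [h1, h2] = [max(0, z1), max(0, z2)]."""
    x_aug = np.hstack([points, np.ones((len(points), 1))])  # append constant 1 for the bias term
    z = x_aug @ W.T                                          # pre-activations z1, z2 for every point
    return np.maximum(0.0, z)                                # ReLU

# A few example input points (again, made-up).
points = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, -1.0]])
print(transform(points))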
1.5 Backprop for Multi-layer Perceptron (If Time Permits)
Consider a 3-class classification task with $\bar{x} \in \mathbb{R}^2$ and $y \in \{0, 1, 2\}$. We construct the following two-layer neural network, with sigmoid activation for the hidden layer and cross-entropy loss applied to the output layer:
[Network diagram: inputs $x_1, x_2$ and a bias unit $+1$ feed into hidden units $h^{(2)}_1, h^{(2)}_2$ (plus a bias unit), which feed into outputs $z^{(3)}_1, z^{(3)}_2, z^{(3)}_3$. The weights from the $x_i$'s to the $h^{(2)}_j$'s are $w^{(1)}_{ji}$, and the weights from the $h^{(2)}_j$'s to the $z^{(3)}_k$'s are $w^{(2)}_{kj}$.]
We can write down the forward computations and the corresponding backward (local partial derivative) expressions as follows:
Forward: $z^{(2)}_j = \sum_i w^{(1)}_{ji} x_i + b^{(1)}_j$
Backward: $\frac{\partial z^{(2)}_j}{\partial w^{(1)}_{ji}} = x_i$, $\quad \frac{\partial z^{(2)}_j}{\partial x_i} = w^{(1)}_{ji}$

Forward: $h^{(2)}_j = \sigma(z^{(2)}_j)$
Backward: $\frac{\partial h^{(2)}_j}{\partial z^{(2)}_j} = h^{(2)}_j (1 - h^{(2)}_j)$

Forward: $z^{(3)}_k = \sum_j w^{(2)}_{kj} h^{(2)}_j + b^{(2)}_k$
Backward: $\frac{\partial z^{(3)}_k}{\partial w^{(2)}_{kj}} = h^{(2)}_j$, $\quad \frac{\partial z^{(3)}_k}{\partial h^{(2)}_j} = w^{(2)}_{kj}$

Forward: $L(\bar{y}, \bar{z}^{(3)}) = -\sum_k y_k \log\!\left( \frac{e^{z^{(3)}_k}}{\sum_{k'} e^{z^{(3)}_{k'}}} \right)$
Backward: $\frac{\partial L}{\partial z^{(3)}_k} = \frac{e^{z^{(3)}_k}}{\sum_{k'} e^{z^{(3)}_{k'}}} - y_k$
With these local partial derivatives worked out, we can easily write down the partial derivative of the loss with respect to each weight parameter:
$$\frac{\partial L}{\partial w^{(2)}_{11}} = \frac{\partial L}{\partial z^{(3)}_1} \frac{\partial z^{(3)}_1}{\partial w^{(2)}_{11}}, \qquad \frac{\partial L}{\partial w^{(2)}_{12}} = \frac{\partial L}{\partial z^{(3)}_1} \frac{\partial z^{(3)}_1}{\partial w^{(2)}_{12}}.$$
These two partial derivatives differ only in the last term in the product.
$$\frac{\partial L}{\partial w^{(1)}_{11}} = \sum_k \frac{\partial L}{\partial z^{(3)}_k} \frac{\partial z^{(3)}_k}{\partial h^{(2)}_1} \frac{\partial h^{(2)}_1}{\partial z^{(2)}_1} \frac{\partial z^{(2)}_1}{\partial w^{(1)}_{11}}, \qquad \frac{\partial L}{\partial w^{(1)}_{12}} = \sum_k \frac{\partial L}{\partial z^{(3)}_k} \frac{\partial z^{(3)}_k}{\partial h^{(2)}_1} \frac{\partial h^{(2)}_1}{\partial z^{(2)}_1} \frac{\partial z^{(2)}_1}{\partial w^{(1)}_{12}}$$
These two partial derivatives also differ only in the last term in the product.
Backpropagation is just a fancy name for this organizational system for computing the $\frac{\partial L}{\partial w^{(\cdot)}_{\cdot}}$'s neatly and efficiently, by observing that local partial derivatives are shared. If you remember dynamic programming from EECS 281, the trick is exactly that.
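As a rough sketch (not the Project 2 starter code), the forward pass and the gradients from the expressions above can be written in NumPy as follows; the input, label, and weight values are made-up.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up example: 2 inputs, 2 hidden units, 3 output classes.
rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])                          # input in R^2
y = np.array([0.0, 1.0, 0.0])                      # one-hot label (true class 1)
W1, b1 = rng.normal(0, 0.1, (2, 2)), np.zeros(2)   # W1[j, i] = w^(1)_{ji}
W2, b2 = rng.normal(0, 0.1, (3, 2)), np.zeros(3)   # W2[k, j] = w^(2)_{kj}

# Forward pass.
z2 = W1 @ x + b1                    # z^(2)_j
h2 = sigmoid(z2)                    # h^(2)_j
z3 = W2 @ h2 + b2                   # z^(3)_k (logits)
p = np.exp(z3) / np.sum(np.exp(z3)) # softmax probabilities
loss = -np.sum(y * np.log(p))       # cross-entropy loss

# Backward pass, reusing the shared local derivatives.
dz3 = p - y                         # dL/dz^(3)_k
dW2 = np.outer(dz3, h2)             # dL/dw^(2)_{kj} = dL/dz^(3)_k * h^(2)_j
db2 = dz3
dh2 = W2.T @ dz3                    # shared factor: sum_k dL/dz^(3)_k * w^(2)_{kj}
dz2 = dh2 * h2 * (1 - h2)           # chain through the sigmoid
dW1 = np.outer(dz2, x)              # dL/dw^(1)_{ji} = dL/dz^(2)_j * x_i
db1 = dz2

print(loss, dW1, dW2)

Note how the line computing dh2 is exactly the shared factor that appears in both chain-rule products above; it is computed once and reused for every weight in the first layer.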
1.6 Hyperparameters (If Time Permits)
Recall that hyperparameters are hand-tuned parameters of our model. Neural networks have a few hyperparameters,
which we will discuss in this section.
The first hyperparameter is the number of hidden layers in our network. You can always add more hidden layers between the input layer and the output layer (this is what people refer to when they say "stack more layers"). Intuitively, more layers correspond to more complicated functions, since the input goes through more non-linear transformations.
Figure 6: Neural network with two hidden layers [1]
The second hyperparameter is the number of neurons in each hidden layer. Besides adding more hidden layers, you can also add more neurons to each hidden layer (you are not limited to just four). Each hidden layer can have a different number of neurons. This also allows the neural network to learn more complicated functions, since there are more parameters. A visualization of this phenomenon is shown below in Figure 7. As you can see, the more hidden neurons the network has, the more complicated the learned decision function becomes.
Figure 7: Number of hidden neurons and the corresponding decision function learned [2]
Another hyperparameter is the choice of activation function. The activation function can differ between hidden layers, but we generally assume that all neurons within a hidden layer use the same activation function. For examples of activation functions, please refer to Section 1.7.
A final hyperparameter is how we initialize the weights in the network. One thing we should definitely avoid is initializing all the weights to 0: we want asymmetry between units so that they learn different parameters (if all weights start equal, the units in a layer compute the same value and receive the same gradient update). Typically, the weights in neural networks are initialized to small random values.
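For instance, a minimal PyTorch sketch of this kind of initialization (the layer sizes and the 0.01 standard deviation are just example choices) might look like:

import torch.nn as nn

layer = nn.Linear(in_features=256, out_features=64)

# Small random weights break the symmetry between units; zero bias is a common default.
nn.init.normal_(layer.weight, mean=0.0, std=0.01)
nn.init.zeros_(layer.bias)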
[1] Figure from http://cs231n.github.io/neural-networks-1/
[2] Figure from http://cs231n.github.io/neural-networks-1/
1.7 Activation Functions (If Time Permits)
Figure 8: Activation Functions
Here is a very nice interactive visualization of Activation Functions.
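Since the figure itself is not reproduced here, below is a small NumPy sketch of a few common activation functions; which particular functions appear in Figure 8 is an assumption.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

# Evaluate each activation on a small grid of inputs.
z = np.linspace(-3, 3, 7)
for fn in (sigmoid, tanh, relu, leaky_relu):
    print(fn.__name__, np.round(fn(z), 3))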
2 Architecture Specifications
In Project 2, we present a few different architecture specifications that describe deep neural network models. One
architecture describes a convolutional neural network, and the other describes an autoencoder. Here, we present a few
examples that may help you learn not only how to read these specifications when implementing the Project 2 networks,
but also how to describe your own architecture design for the challenge.
2.1 CNN Architecture
We now introduce a diagram that describes a convolutional neural network architecture. The convolution layers in a
CNN are often clarified by such a diagram, especially for two- or three-dimensional input data.
Figure 9: CNN architecture, from the DeepID2 network:
https://arxiv.org/abs/1406.4773
◦ In Convolutional layer 2:
– What is the size of each filter (aka kernel)?
3 × 3 (×20). Each filter is convolved with all 20 input channels.
– How many filters are there? 40
– How many model parameters are associated with the filters? How many bias parameters are in this layer?
3 × 3 × 20 × 40 weights, 40 biases
– What are the input and output dimensions? Input: 20 × 26 × 22, output: 40 × 24 × 20.
Note: We follow PyTorch's tensor shape convention of C × H × W: number of channels × height × width.
– What is the padding, and what is the stride size? No padding, unit stride.
– What would the output dimension be if we used SAME padding and stride=1? 40 × 26 × 22
– What would the output dimension be if we used SAME padding and stride=2? 40 × 13 × 11
◦ In Max-pooling layer 2:
– What is the filter size?
2 × 2. Each pooling filter is applied to every input channel independently.
– What are the input and output dimensions? input: 40 × 24 × 20, output: 40 × 12 × 10.
– What is the stride size? 2
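A hedged PyTorch sketch of these two layers (the layer names and the dummy input are illustrative; the hyperparameters are taken from the answers above) that reproduces the shape and parameter arithmetic:

import torch
import torch.nn as nn

# Convolutional layer 2: 20 input channels, 40 filters of size 3x3, no padding, stride 1.
conv2 = nn.Conv2d(in_channels=20, out_channels=40, kernel_size=3, stride=1, padding=0)
# Max-pooling layer 2: 2x2 filter, stride 2.
pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 20, 26, 22)  # dummy batch of size 1, shape C x H x W = 20 x 26 x 22
print(conv2(x).shape)           # torch.Size([1, 40, 24, 20])
print(pool2(conv2(x)).shape)    # torch.Size([1, 40, 12, 10])

# Parameter count for conv2: 3*3*20*40 weights + 40 biases = 7240.
print(sum(p.numel() for p in conv2.parameters()))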
2.2 Autoencoder Architecture
We begin by introducing an autoencoder architecture that is similar to that in Project 2.
0. Input Image
• Output: 3 × 32 × 32
1. Max Pooling
• Stride size: 2 × 2
• Filter size: 2 × 2
• Output: 3 × 16 × 16
2. Fully-Connected Hidden Layer
• Activation: ReLU
• Weight initialization: normally distributed with µ = 0.0, σ² = 0.01²
• Bias initialization: constant 0.01
• Output: 16
3. Fully-Connected Hidden Layer
• Activation: ReLU
• Weight initialization: normally distributed with µ = 0.0, σ² = 0.01²
• Bias initialization: constant 0.01
• Output: 3 × 32 × 32
◦ How many channels are in the input image?
3
◦ What is the distribution of weight initialization for the last dense layer?
Normal distribution with mean 0.0 and standard deviation 0.01
◦ What is the total number of float-valued learnable parameters?
The pooling layer does not have any learnable model parameters.
The first dense layer has 16 outputs, each with input 3 × 16 × 16 and bias 1.
The second dense layer has 3 × 32 × 32 outputs, each with input 16 and bias 1.
Layer   Weights               Biases        Total
1       –                     –             0
2       (3 × 16 × 16) × 16    16            12,304
3       16 × (3 × 32 × 32)    32 × 32 × 3   52,224
Total                                       64,528
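A rough PyTorch sketch of these layers that confirms the parameter count; the module names and the flattening/reshaping are assumptions about how this spec would be implemented, not the official solution.

import torch
import torch.nn as nn

# 1. Max pooling: 3 x 32 x 32 -> 3 x 16 x 16 (no learnable parameters).
pool = nn.MaxPool2d(kernel_size=2, stride=2)
# 2. Fully-connected hidden layer: (3*16*16) -> 16.
fc1 = nn.Linear(3 * 16 * 16, 16)
# 3. Fully-connected layer back to image size: 16 -> 3*32*32.
fc2 = nn.Linear(16, 3 * 32 * 32)

# Weight init: normal with std 0.01; bias init: constant 0.01, as in the spec above.
for layer in (fc1, fc2):
    nn.init.normal_(layer.weight, mean=0.0, std=0.01)
    nn.init.constant_(layer.bias, 0.01)

x = torch.randn(1, 3, 32, 32)
h = pool(x).flatten(start_dim=1)           # shape (1, 768)
out = torch.relu(fc2(torch.relu(fc1(h))))  # shape (1, 3072); reshape to (1, 3, 32, 32) if needed

print(sum(p.numel() for p in fc1.parameters()))  # 12304
print(sum(p.numel() for p in fc2.parameters()))  # 52224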
3 Cross Entropy Loss and Softmax Activation
Let $\bar{z}$ be the network output logits and $\hat{y}_c$ the probability that the network assigns to the input example belonging to class $c$. Let $\bar{y} \in \mathbb{R}^D$ be a one-hot vector with a one in the position of the true label and zeros otherwise (i.e., if the true label is $t$, then $y_c = 1$ if $t = c$ and $y_c = 0$ if $t \neq c$). $D$ is the number of classes.
For a multiclass classification problem, we typically apply a softmax activation at the output layer to generate a probability distribution vector. Each entry of this vector can be computed as:
$$\hat{y}_c = \frac{\exp(z_c)}{\sum_j \exp(z_j)}$$
We then compare this probability distribution $\hat{y}$ with the ground truth distribution $\bar{y}$ (a one-hot vector) by computing the cross entropy loss $L$:
$$L(\bar{y}, \hat{y}) = -\sum_c y_c \log \hat{y}_c$$
These two steps can be done at once in PyTorch using the CrossEntropyLoss class, which combines the softmax activation with the negative log likelihood loss, so we do not need to add a separate softmax activation at the output layer. This improves computational efficiency and numerical stability; you will explore the mathematical reason in Homework 3.
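A small sketch of this equivalence; the logits and label below are made-up and deliberately different from the values in Discussion Question 3.

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[1.0, -0.5, 3.0]])  # made-up logits z for one example
label = torch.tensor([0])                  # made-up true class index

# Option 1: CrossEntropyLoss applied directly to the raw logits.
loss_a = nn.CrossEntropyLoss()(logits, label)

# Option 2: explicit softmax followed by the negative log of the true-class probability.
probs = F.softmax(logits, dim=1)
loss_b = -torch.log(probs[0, label[0]])

print(loss_a.item(), loss_b.item())  # the two values match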
Discussion Question 3. Consider a neural net that outputs logits $\bar{z} = [2, 4, 0.5]$ on training point $\bar{x}^{(i)}$, with corresponding label $y^{(i)} = 2$. Compute the cross entropy loss.
4 Project Architectures (If Time Permits)
Here, we go over some justifications for the decisions that were made in designing the architectures you will be
implementing in Project 2.
• Preprocessing is used to normalize the images into a more standard format that can better be represented by a
single set of weights in the neural network architecture. If this was not done, images would have different norms
(e.g. bright vs. dark images).
• The images are resized to smaller dimensions to improve training performance by reducing model size.
• We provided you with a class-balanced dataset so that the loss function equally represents the loss across each
of the classes. Otherwise, in an imbalanced class scenario, loss can be naively minimized by choosing the most
likely class label. In that case, it is common to assign training instances different importances (weights).
• The optimizer that we recommend starting with is torch.optim.Adam, first described in the paper by Kingma and Ba: https://arxiv.org/abs/1412.6980.
This optimizer uses a stochastic approach with an adaptive learning rate to efficiently optimize the model parameters to minimize the loss. However, for simpler models, other approaches such as stochastic gradient descent (torch.optim.SGD), which we have seen in various instances in lecture, can also be used (see the sketch after this list).
• We typically initialize weights from a Gaussian distribution with mean 0.0 and variance $\frac{1}{n}$, where $n$ is the number of input nodes. This keeps the variance of the layer's output close to that of a standard Gaussian, rather than a Gaussian whose variance grows with the number of inputs.
• We use the ReLU activation function, which is popular as it mitigates the vanishing gradient problem and is fast
to compute.
Why is the ReLU function fast to compute?
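As referenced in the optimizer bullet above, here is a minimal sketch of setting up either optimizer; the model, learning rates, and training-step details are assumptions, not the Project 2 starter code.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))  # a made-up small model

# Recommended starting point: Adam with a typical default learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Alternative for simpler models: plain stochastic gradient descent.
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

criterion = nn.CrossEntropyLoss()
x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))  # a dummy mini-batch

optimizer.zero_grad()          # clear gradients from the previous step
loss = criterion(model(x), y)  # forward pass and loss
loss.backward()                # backpropagate
optimizer.step()               # update the parameters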
4.1 Convolutional Neural Network
• The convolutional layers add position-invariance through parameter sharing, since input poster images may be taken from different perspectives and at different sizes. The series of convolutional layers yields smaller and smaller outputs with a greater number of channels, condensing the image into a set of small feature maps that potentially represent different features.
• The fully-connected and output layers map the convolutional layer output into label predictions by allowing for
interactions between the convolutional filters.
4.2 Autoencoder
• An average pooling layer is used to reduce the size of the input image by a factor of 2. This makes the model more efficient to train, since subsequent layers operate on smaller inputs, and it potentially also makes the model less likely to overfit.
• Adding the extra dense (fully-connected) layers allows for greater abstractions in the network, which can decrease structural error but increase estimation error.
• The deconvolutional layer, like a convolutional layer, uses weight sharing to add position-invariance. As mentioned in the project, aspects of a poster such as a face may appear in different locations in the image but should be identified as a face regardless.