Assignment 4 - Multilayer Perceptron to Recognize Handwritten Digits [12 Marks]
What to submit: An html version of this notebook after you have run all cells (File -> Download as); if you're using Google Colab, you can export an html file by saving your file in Google Colab and then adding the following instruction to a cell: !jupyter nbconvert --to html your_file.ipynb
Name: James Le
In this assignment you will implement a Multilayer Perceptron for recognizing handwritten digits. We will use the MNIST data set to train a classifier. You will have to install tensorflow to download the data set we will use. You should be able to install it with:
pip install tensorflow
Let one of the instructors know if that does not work for you.
Warning: This assignment requires that you perform operations with matrices in Numpy (e.g., multiplication and summation of the column vectors of a matrix). You should discover by yourself how to use the operations needed to complete the assignment. Numpy's documentation can be a good resource: https://numpy.org/doc/stable/
In this assignment we will use f-string formatting to organize the data we use to train a Multilayer Perceptron. f-string formatting allows us to create strings from values stored in a variable. Consider the following example, where we create several strings using the integer values of the variable i.
We will use this f-string formatting to create the key of dictionaries in Python. If you aren't familiar with dictionaries in Python, please read this: https://realpython.com/python-dicts/
for i in range(5):
    print(f'Test{i}')
Test0
Test1
Test2
Test3
Test4
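For example, f-strings can be used to build dictionary keys such as 'W0', 'B0', 'W1', and so on; this is exactly the pattern used for self.weights and self.cache in the code below. Here is a minimal sketch (the layer sizes are made up purely for illustration):

import numpy as np

# Build a dictionary whose keys are generated with f-strings.
# The layer sizes below are made up purely for illustration.
layer_sizes = [4, 3, 2]
params = {}
for i in range(1, len(layer_sizes)):
    params[f'W{i-1}'] = np.random.randn(layer_sizes[i], layer_sizes[i-1])
    params[f'B{i-1}'] = np.zeros((layer_sizes[i], 1))
print(list(params.keys()))  # ['W0', 'B0', 'W1', 'B1']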
Multilayer Perceptron [9 Marks]
We will start by implementing the multilayer perceptron.
[0 Marks] Read the code that is provided to you. Start with the constructor, where the weights of the model are initialized. The constructor __init__ is already fully implemented for you. Pay close attention to how the dictionary self.weights is used, since you will implement something similar with the dictionary self.cache. Then read the method train, which is also fully implemented for you. It is in this method that the three methods you will need to implement are invoked. The method train first calls the forward propagation step (method forward), where the outputs of the model are computed. Then it calls the backward step (method backward), where the gradients of all weights are computed. Finally, it calls the update step (method update), where we use the gradients to update the weights of the model.
[2.5 Marks] Implement the method forward. In your implementation you should store in self.cache the Zi and Ai values computed in the forward pass (see the lecture notes for details).
[4 Marks] Implement the backward method. In your implementation you should store in self.cache the dW and dB values with the gradients of all weights of the model. You should assume the Cross Entropy loss. This means that we don't multiply the pure error in the output layer by the derivative of the logistic function while computing $\Delta^{(L-1)}$. We still need to use the derivative of the logistic function when computing the $\Delta^{(i)}$ for the units in the hidden layers (the relevant equations are summarized right after this list).
[2.5 Marks] Implement the update method. In this method you will use the values of dW and dB to update the weights of the model.
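For reference, the following is a compact summary of the quantities the three methods are expected to compute. This is a sketch assuming the $\Delta^{(i)}$, $W^{(i)}$, $A^{(i)}$, $B^{(i)}$ notation matches the lecture notes, with $m$ the number of training examples, $\sigma$ the logistic function, and $\odot$ the element-wise product:

$$
\begin{aligned}
\text{forward:}\quad & Z^{(i)} = W^{(i-1)} A^{(i-1)} + B^{(i-1)}, \qquad A^{(i)} = \sigma\!\left(Z^{(i)}\right), \qquad i = 1, \dots, L-1\\
\text{backward:}\quad & \Delta^{(L-1)} = A^{(L-1)} - Y, \qquad \Delta^{(i)} = \left(W^{(i)\,\top} \Delta^{(i+1)}\right) \odot A^{(i)} \odot \left(1 - A^{(i)}\right)\\
& dW^{(i-1)} = \tfrac{1}{m}\, \Delta^{(i)}\, {A^{(i-1)}}^{\top}, \qquad dB^{(i-1)} = \tfrac{1}{m} \textstyle\sum_{\text{columns}} \Delta^{(i)}\\
\text{update:}\quad & W^{(i-1)} \leftarrow W^{(i-1)} - \alpha\, dW^{(i-1)}, \qquad B^{(i-1)} \leftarrow B^{(i-1)} - \alpha\, dB^{(i-1)}
\end{aligned}
$$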
from keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np
class MLP:
    """
    Class implementing a Multilayer Perceptron with sigmoid activation functions, which are used
    both in the hidden layers and in the output layer of the model.
    """

    def __init__(self, dims):
        """
        When creating an MLP object we define the architecture of the model (number of layers and number
        of units in each layer). The architecture is defined by a vector with one entry per layer giving
        the number of units in that layer.

        For example, if dims = [784, 50, 20, 10], then the model has two hidden layers, with 50 and 20 units,
        respectively, and an output layer with 10 units. The input layer receives 784 values.

        Since we are performing classification of handwritten digits, the input layer (784 units) and the output
        layer (10 units) are fixed: the input layer has one value for each pixel in the image, while the output
        layer has one unit for each digit (0, 1, 2, ..., 9).

        The data of the model is stored in dictionaries where the key is given by the name of the value stored.
        In particular, we have one dictionary called self.weights to store the weights and biases of the model;
        we also have a dictionary called self.cache where we store the Z, A, and Delta matrices
        used in the backpropagation algorithm.

        The constructor also initializes the set of weights the model uses. For example, if dims = [784, 50, 20, 10],
        then the model has 3 sets of weights: one between the input layer and the first hidden layer (784 to 50);
        another between the two hidden layers (50 to 20); and a last set between the second hidden layer and the
        output layer (20 to 10). We initialize the weights with small random numbers.
        The B vectors are initialized with zeros.
        """
        self.dims = dims
        self.weights = {}
        self.L = len(dims)
        for i in range(1, len(dims)):
            self.weights[f'W{i-1}'] = np.random.randn(dims[i], dims[i-1]) * (2/np.sqrt(dims[i-1]))
            self.weights[f'B{i-1}'] = np.zeros((dims[i], 1))

    def derivative(self, A):
        """
        Derivative of the logistic function, written in terms of the activation A = sigma(Z).
        """
        return np.multiply(A, 1 - A)

    def activation_function(self, Z):
        """
        Logistic (sigmoid) function.
        """
        return 1 / (1 + np.exp(-Z))
    def forward(self, X, Y=None):
        """
        Forward pass. We initialize the self.cache dictionary with the matrix representing the input layer, denoted A0.
        The forward pass then computes Zi and Ai until reaching the output layer. The last matrix A is then returned.
        """
        # implement what is missing for the forward pass
        self.cache = {}
        self.cache['A0'] = X
        for i in range(1, self.L):
            self.cache[f'Z{i}'] = np.matmul(self.weights[f'W{i-1}'], self.cache[f'A{i-1}']) + self.weights[f'B{i-1}']
            self.cache[f'A{i}'] = self.activation_function(self.cache[f'Z{i}'])
        return self.cache[f'A{self.L-1}']
    def backward(self, Y):
        """
        This function implements the backward step of the Backprop algorithm.
        The deltas di and the gradients dW and dB are stored in self.cache with the keys di, dWi, and dBi, respectively.
        """
        # implement the backward pass
        # number of training examples (columns of Y); used to average the gradients
        m = Y.shape[1]
        self.cache[f'd{self.L-1}'] = self.cache[f'A{self.L-1}'] - Y
        for i in reversed(range(0, self.L - 1)):
            self.cache[f'dW{i}'] = (1 / m) * np.matmul(self.cache[f'd{i+1}'], self.cache[f'A{i}'].T)
            self.cache[f'dB{i}'] = (1 / m) * self.cache[f'd{i+1}'].sum(axis=1).reshape((self.dims[i + 1], 1))
            if i > 0:
                self.cache[f'd{i}'] = np.multiply(np.matmul(self.weights[f'W{i}'].T, self.cache[f'd{i+1}']), self.derivative(self.cache[f'A{i}']))
        return None

    def update(self, alpha):
        """
        This function must be called after backward is invoked.
        It uses the dWs and dBs stored in self.cache to update the weights self.weights of the model.
        """
        # implement the method for updating the weights of the model
        for i in range(0, self.L - 1):
            self.weights[f'W{i}'] = self.weights[f'W{i}'] - alpha * self.cache[f'dW{i}']
            self.weights[f'B{i}'] = self.weights[f'B{i}'] - alpha * self.cache[f'dB{i}']
        return None
    def train(self, X, Y, X_validation, Y_validation, alpha, steps):
        # create the one-hot encoding for the labels of the images
        Y_one_hot = np.zeros((10, X.shape[1]))
        for index, value in enumerate(Y):
            Y_one_hot[value][index] = 1

        # perform a number of gradient descent steps
        for i in range(0, steps):
            # compute the matrices A and store them in self.cache
            self.forward(X, Y_one_hot)
            # compute the matrices dW and dB and store them in self.cache
            self.backward(Y_one_hot)
            # use the matrices dW and dB to update the weights W and B of the model
            self.update(alpha)

            # every 100 training steps we print the accuracy of the model on the training and validation sets
            if i % 100 == 0:
                percentage_train = self.evaluate(X, Y)
                percentage_validation = self.evaluate(X_validation, Y_validation)
                print('Accuracy training set %.3f, Accuracy validation set %.3f ' % (percentage_train, percentage_validation))
    def evaluate(self, X, Y):
        """
        Receives a set of images stacked as column vectors in matrix X and their labels Y (integers from 0 to 9).
        Returns the fraction of images that were correctly classified by the model.
        """
        Y_hat = self.forward(X)
        classified_correctly = np.count_nonzero(np.argmax(Y_hat, axis=0) == Y)
        return classified_correctly / X.shape[1]
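Before training, a quick (optional) sanity check on shapes can catch indexing mistakes in forward: with made-up random data, the output should have one row per class and one column per example.

mlp_check = MLP([784, 50, 20, 10])       # same architecture as used below, data is random
X_fake = np.random.rand(784, 5)          # 5 fake "images" stacked as column vectors
print(mlp_check.forward(X_fake).shape)   # expected: (10, 5)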
Use the following code to test your implementation. We are following the scheme of stacking up input images as column vectors of X, as we did in Assignment 3 (see the lecture notes for Backpropagation for details).
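As a quick illustration of this layout (a toy sketch, not required for the assignment): each 28x28 image is flattened to 784 values and becomes one column of X, so a batch of m images yields an X of shape (784, m).

fake_batch = np.random.rand(3, 28, 28)    # 3 made-up "images" of size 28x28
X_toy = fake_batch.reshape(3, 28 * 28).T  # flatten each image, then make each one a column
print(X_toy.shape)                        # (784, 3)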
We will run backpropagation for 3000 training iterations on a data set with 20000 training images. During training we will also compute the accuracy of the model on a set of 10000 images, which we call the validation set.
Disclaimer: The outputs of the next cell won't be meaningful before you finish implementing the multilayer perceptron.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
m = 20000
val = 10000
images, labels = (x_train[0:m].reshape(m, 28*28) / 255, y_train[0:m])
images = images.T
images_validation, labels_validation = (x_train[m:m + val].reshape(val, 28*28) / 255, y_train[m:m + val])
images_validation = images_validation.T
dims = [784, 50, 20, 10]
mlp = MLP(dims)
mlp.train(images, labels, images_validation, labels_validation, 0.5, 3000)
Accuracy training set 0.101, Accuracy validation set 0.099
Accuracy training set 0.713, Accuracy validation set 0.697
Accuracy training set 0.848, Accuracy validation set 0.841
Accuracy training set 0.888, Accuracy validation set 0.879
Accuracy training set 0.905, Accuracy validation set 0.894
Accuracy training set 0.915, Accuracy validation set 0.906
Accuracy training set 0.923, Accuracy validation set 0.912
Accuracy training set 0.928, Accuracy validation set 0.918
Accuracy training set 0.933, Accuracy validation set 0.922
Accuracy training set 0.937, Accuracy validation set 0.926
Accuracy training set 0.940, Accuracy validation set 0.928
Accuracy training set 0.944, Accuracy validation set 0.931
Accuracy training set 0.947, Accuracy validation set 0.933
Accuracy training set 0.950, Accuracy validation set 0.935
Accuracy training set 0.953, Accuracy validation set 0.937
Accuracy training set 0.956, Accuracy validation set 0.939
Accuracy training set 0.959, Accuracy validation set 0.940
Accuracy training set 0.961, Accuracy validation set 0.942
Accuracy training set 0.963, Accuracy validation set 0.943
Accuracy training set 0.965, Accuracy validation set 0.944
Accuracy training set 0.967, Accuracy validation set 0.946
Accuracy training set 0.969, Accuracy validation set 0.946
Accuracy training set 0.971, Accuracy validation set 0.947
Accuracy training set 0.972, Accuracy validation set 0.948
Accuracy training set 0.973, Accuracy validation set 0.949
Accuracy training set 0.975, Accuracy validation set 0.949
Accuracy training set 0.976, Accuracy validation set 0.950
Accuracy training set 0.977, Accuracy validation set 0.950
Accuracy training set 0.978, Accuracy validation set 0.951
Accuracy training set 0.979, Accuracy validation set 0.951
Question 1 [3 Marks]
How do the results you obtained with the multilayer perceptron compare with the results you obtained with Logistic Regression using Cross Entropy loss in Assignment 3? How do you justify the difference in accuracy observed in the two experiments? (write your answer in this cell by double-clicking it)
The results here are significantly better. The reason is that neural models with hidden layers are more expressive than the models used for Linear Regression and Logistic Regression: a model with a hidden layer can learn non-linear functions of the input data. Note that the activation used here is the sigmoid function, which is non-linear. Therefore, the multilayer perceptron is a better choice than Logistic Regression with the Cross Entropy loss for this problem.