CSM148 Homework 3
1 Decision Tree [20 points]

The following table contains training examples that help predict whether a patient is likely to have a heart attack. Suppose we want to build a decision tree from the given data using information gain (entropy).

PATIENT ID | CHEST PAIN? | MALE? | SMOKES? | EXERCISES? | HEART ATTACK?
1          | yes         | yes   | no      | yes        | yes
2          | no          | no    | yes     | no         | yes
3          | yes         | yes   | yes     | no         | yes
4          | no          | yes   | no      | yes        | no
5          | yes         | no    | yes     | yes        | yes
6          | no          | yes   | yes     | yes        | no
7          | no          | no    | no      | yes        | no
8          | yes         | no    | yes     | no         | yes

(a) (3 points) What is the overall entropy of whether or not a patient is likely to have a heart attack, without considering any features?

(b) (8 points) Suppose we build the decision tree by selecting one feature to split on. What is the information gain for each of the four features, and which feature should be chosen for the first split?

(c) (3 points) To construct a full decision tree, we need to re-calculate the information gain for the remaining features and continue splitting until a stopping criterion is met. What are possible stopping criteria for this process?

(d) (3 points) Should we standardize or normalize our features?

(e) (3 points) Are decision trees robust to outliers?

2 Perceptron [15 points]

Consider training a Perceptron model $y = \operatorname{sgn}(w^\top x)$, $w \in \mathbb{R}^d$, on a dataset $D = \{(x_i, y_i)\}_{i=1}^{5}$. Both $w$ and $x_i$ are vectors of dimension $d$, and $y \in \{+1, -1\}$ is binary. Assume that the bias term is already augmented in $x_i$: $x_i = [1, x_1, \dots, x_{d-1}]$. The activation function is the sign function, where $\operatorname{sgn}(x) = 1$ for all $x > 0$ and $\operatorname{sgn}(x) = -1$ otherwise. The Perceptron algorithm is given below.

Algorithm 1: Perceptron
    Initialize $w = 0$
    for $i = 1, \dots, N$ do
        if $y_i \neq \operatorname{sgn}(w^\top x_i)$ then
            $w \leftarrow w + y_i x_i$
        end if
    end for
    return $w$

(a) (5 points) The Perceptron model is trained for 1 epoch, i.e., iterated over the entire dataset once, and makes mistakes on the following three data points: $(x_1, y_1), (x_3, y_3), (x_4, y_4)$. What will the weight vector $w$ be after this training epoch? Express $w$ using the variables $x_i$ and $y_i$, where you determine which values of $i \in \{1, 2, 3, 4, 5\}$ the algorithm uses.

(b) (5 points) Let $d = 3$ and the data points be given as follows:

i | $x_{i,1}$ | $x_{i,2}$ | $x_{i,3}$ | $y_i$
1 | 1         | 1         | 0         | +1
2 | 1         | 2         | -1        | +1
3 | 1         | 1         | -3        | -1
4 | 1         | 3         | -1        | +1
5 | 1         | 1         | -1        | +1

Following the formulation of your answer in (a), what is $w$ given the values of the data points? Express $w$ as a vector of numbers this time. Furthermore, if we iterate through the dataset again, will the model make a mistake on $x_1$ again? Write its prediction on $x_1$.

(c) (5 points) State one difference between the given Perceptron model and the logistic regression model taught in class.

3 Neural Networks [20 points]

Figure 1: Neural Network

(a) (4 points) Refer to Lecture 13 for the activation functions of neural networks. Considering a binary classification problem where $y \in \{0, 1\}$, what are possible activation choices for the hidden and output layers, respectively? Briefly explain why.

(b) (3 points) Consider a binary classification problem where $y \in \{0, 1\}$, and consider the neural network in Figure 1 with 2 inputs, 2 hidden neurons, and 1 output. Neuron 1 uses the ReLU activation and neuron 2 uses the sigmoid activation; the other layers use the linear activation function. Suppose we have an input $X_1 = 2$ and $X_2 = -3$ with label $y = 1$, and the weights are initialized as $W_{11} = 0.9$, $W_{12} = 0.4$, $W_{21} = -1.5$, $W_{22} = -0.7$, $W_{31} = -0.2$, $W_{32} = 1.6$, with the bias terms $W_{10}$, $W_{20}$, $W_{30}$ all initialized to 0. Compute the output of the network. (Round to the 2nd decimal in your final answer.)
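For concreteness, here is a minimal Python sketch of the forward pass in (b). The wiring is an assumption inferred from the problem statement ($W_{ij}$ connects input $j$ to hidden neuron $i$, and $W_{3j}$ connects hidden neuron $j$ to the output), since Figure 1 is not reproduced here; check it against the figure before relying on it.

```python
import numpy as np

# Sketch of the forward pass in 3(b). Assumption: W_ij is the weight
# from input j into hidden neuron i, and W_3j connects hidden neuron j
# to the linear output unit; verify this wiring against Figure 1.
relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X1, X2 = 2.0, -3.0
W11, W12, W21, W22, W31, W32 = 0.9, 0.4, -1.5, -0.7, -0.2, 1.6
W10 = W20 = W30 = 0.0  # bias terms

h1 = relu(W10 + W11 * X1 + W12 * X2)     # hidden neuron 1 (ReLU)
h2 = sigmoid(W20 + W21 * X1 + W22 * X2)  # hidden neuron 2 (sigmoid)
y_hat = W30 + W31 * h1 + W32 * h2        # linear output layer

print(round(y_hat, 2))
```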
(c) (4 points) We consider the binary cross-entropy loss function. What is the loss of the network on the data point given in (b)? What is $\frac{\partial L}{\partial \hat{y}}$, where $\hat{y}$ is the output of the neural network? (Hint: refer to the lecture slides for the definition of the binary cross-entropy loss function; a sketch of this computation appears at the end of the handout.)

(d) (6 points) We now consider the backward pass. Given the same initialized weights and input as in (b), write the formula for, and calculate, the derivative of the loss with respect to $W_{12}$, i.e., $\frac{\partial L}{\partial W_{12}}$.

(e) (3 points) Given the neural network in (b), how many parameters does the network have? (Hint: each weight counts as a parameter, and we also count the bias terms ($W_{10}, \dots$) as parameters.)

4 Multi-class Classification [10 points]

Figure 2: Multiclass Logistic Regression

(a) (5 points) Consider a multi-class classification problem with 5 classes and 20 features. We will first use the logistic regression model to build our binary classifier. What will be the total number of parameters when using the One-vs-Rest (OvR) strategy for the multi-class classification task with logistic regression? (Hint: refer to Lecture 11 for the multi-class logistic regression model.)

(b) (5 points) Consider a new multi-class classification problem with 3 classes. The distribution of the points is shown in Figure 2 (Square: Class 1, Circle: Class 2, Star: Class 3). Draw the linear classifiers used for classifying the three classes using (i) One-vs-Rest (OvR) and (ii) multinomial approaches. Your drawing does not have to be precise; roughly indicate the classifiers and just make sure you can show the differences.

5 Decision Boundary [10 points]

Consider the classification problems with two classes, illustrated by circles and crosses in the plots below. In each of the plots, one of the following classification methods has been used, and the resulting decision boundary is shown:

(1) (2.5 points) Decision Tree
(2) (2.5 points) Random Forest
(3) (2.5 points) Neural Network (1 hidden layer with 10 ReLU units)
(4) (2.5 points) Neural Network (1 hidden layer with 10 tanh units)

Assign each of these methods to exactly one of the following plots (in a one-to-one correspondence) by annotating each plot with the respective number, and briefly explain why you made each assignment.

[Plots (a), (b), (c), (d)]
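As referenced in 3(c), the following minimal Python sketch assumes the standard binary cross-entropy definition $L = -\left[y \log \hat{y} + (1-y)\log(1-\hat{y})\right]$; confirm it against the lecture slides the hint points to. The value of $\hat{y}$ is carried over from the forward-pass sketch after 3(b) and inherits that sketch's wiring assumption.

```python
import math

# Hedged sketch for 3(c): assumes the standard binary cross-entropy
# L = -(y*log(y_hat) + (1-y)*log(1-y_hat)); confirm against the slides.
def bce(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def dbce_dyhat(y, y_hat):
    # dL/dy_hat = -y/y_hat + (1-y)/(1-y_hat)
    return -y / y_hat + (1 - y) / (1 - y_hat)

y = 1.0
y_hat = 0.34  # output of the forward-pass sketch after 3(b); an assumption
print(bce(y, y_hat), dbce_dyhat(y, y_hat))

# For 3(d), chain dL/dy_hat through the network under the same wiring:
# dL/dW12 = dL/dy_hat * W31 * relu'(z1) * X2, where z1 is neuron 1's
# pre-activation.
```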