Stat 437 HW4
Ben Schedin (11633240)
General rule
Due by 11:59pm Pacific Standard Time, March 30, 2021. Please show your work and submit your computer code in order to get points. Providing correct answers without supporting details does not receive full credit. This HW covers

Bayes classifier
kNN classifier
Discriminant analysis
You DO NOT have to submit your HW answers using typesetting software. However, your answers must be legible for grading. Please upload your answers to the course space. This is a relatively long HW. So, please complete it very carefully.

For exercises on the Text, there might be solutions to them that are posted online. Please do not plagiarize those solutions.

Conceptual exercises: I (Bayes classifier)
This exercise is on Bayes' theorem and the Bayes classifier.

1.1
State clearly the definition of the 0-1 loss function. Can this function be used in multi-class classification problems?

1.1 Response
State clearly the definition of the 0-1 loss function.
From Lecture Notes 4, the definition of the 0-1 loss function is as follows:

Let $\hat{G}(x)$ be the estimated class label for an observation $x$ from $X$, i.e., $\hat{G}(x) = 1$ or $2$.
For $k, l \in \{1, 2\}$, the 0-1 loss $L$ is defined as:
$L(k, l) = 0$ if $k = l$, and $L(k, l) = 1$ if $k \neq l$.

In other words: if there are classification errors, then the loss will be 1, otherwise, the loss will be 0.
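A minimal R sketch (the label vectors below are hypothetical, not from the HW data) showing the 0-1 loss and, incidentally, that it works with any number of classes:

# 0-1 loss: 1 for a misclassification, 0 otherwise
zero_one_loss <- function(k, l) as.numeric(k != l)

truth <- c(1, 2, 2, 1, 3)  # true class labels (K = 3 here)
pred  <- c(1, 2, 1, 1, 2)  # estimated class labels
zero_one_loss(truth, pred)        # per-observation losses: 0 0 1 0 1
mean(zero_one_loss(truth, pred))  # average 0-1 loss = error rate: 0.4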

Can this function be used in multi-class classification problems?
Yes, the 0-1 loss function can be used in multi-class classification problems, since it only checks whether the predicted label equals the true label, regardless of how many classes there are.

1.2
Let Y be the random variable for the class label of a random vector X, such that Y∈G={1,…,K} where K≥2 is the number of classes. Let Y^ be the estimated class label for X. Given the prior Pr(Y=k)=πk, k∈G on Class k and the conditional density fk(x) of X when it comes from Class k. Provide the formula to obtain the posterior Pr(Y=k|X=x), which is essentially the Bayes theorem. What is the Bayes classifier and how does it classify a new observation x0 from X? Is the decision boundary of the Bayes classifier linear or quadratic in X? Explain (but do not have to mathematically prove) why the Bayes classifier minimizes the expected 0-1 loss. Note that a proof of the fact that the Bayes classifier minimizes the expected 0-1 loss is given in “LectureNotes4_notes.pdf”. You should not copy and paste the proof. Instead, please provide the explanation based on your understanding of the proof.

1.2 Response
Provide the formula to obtain the posterior Pr(Y=k|X=x), which is essentially the Bayes theorem.
The formula for obtaining the posterior according to equation 4.10 in An Introduction to Statistical Learning is:

$$\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

What is the Bayes classifier and how does it classify a new observation x0 from X?
The Bayes classifier is a model-based classification method that assigns a sample to the class with the highest posterior probability among the posterior probabilities of the sample belonging to each class. This is written more formally in the provided lecture notes as: $x$ from $X$ is assigned to the class $j$ for which $\Pr(G = j \mid X = x)$ is the largest among all $\Pr(G = k \mid X = x)$ for $k \in \mathcal{G} = \{1, 2\}$.

Is the decision boundary of the Bayes classifier linear or quadratic in X?
The decision boundary of the Bayes classifier is not linear in general; its shape depends on the class-conditional densities. For Gaussian class-conditional densities it is quadratic in X, reducing to a linear boundary in the special case where the class covariance matrices are equal.

Explain (but do not have to mathematically prove) why the Bayes classifier minimizes the expected 0-1 loss.
The Bayes classifier minimizes the expected 0-1 loss by maximizing the posterior probability, i.e., selecting the class with the highest posterior probability for the sample; under the 0-1 loss, the expected loss is exactly 1 minus that posterior. Equation (2.22) on p. 21 of Elements of Statistical Learning states the simplified rule as $\hat{G}(x) = \arg\min_{g \in \mathcal{G}} [1 - \Pr(g \mid X = x)]$. Therefore, picking the class with the highest posterior probability, i.e., maximizing the term $\Pr(g \mid X = x)$, minimizes the expected 0-1 loss (see the computation below).
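The one-line computation behind this (a sketch, using the definitions from 1.1 and 1.2): the conditional expected 0-1 loss of predicting label $g$ at $X = x$ is

$$\mathbb{E}[L(Y, g) \mid X = x] = \sum_{k=1}^{K} L(k, g) \Pr(Y = k \mid X = x) = 1 - \Pr(Y = g \mid X = x),$$

since $L(k, g) = 1$ for every $k \neq g$ and $L(g, g) = 0$. Choosing $g$ to maximize $\Pr(Y = g \mid X = x)$ therefore minimizes the conditional expected loss at every $x$, and hence the overall expected loss.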

1.3
If K=2 in subquestion 1.2), what is the threshold value on Pr(Y=1|X=x0) that is used by the Bayes classifier to determine the class label for x0? Suppose you use a different threshold value on Pr(Y=1|X=x0) to classify x0, is the corresponding classifier still the Bayes classifier, and is the corresponding loss function still the 0-1 loss? Explain your answer. Provide a scenario where to classify an observation a different threshold value is more sensible than the threshold value used by the Bayes classifier.

1.3 Response
If K=2 in subquestion 1.2), what is the threshold value on Pr(Y=1|X=x0) that is used by the Bayes classifier to determine the class label for x0?
Because the two posterior probabilities sum to 1, $\Pr(Y = 1 \mid X = x_0) > \Pr(Y = 2 \mid X = x_0)$ holds if and only if $\Pr(Y = 1 \mid X = x_0) > 0.5$, so the threshold value on $\Pr(Y = 1 \mid X = x_0)$ is 0.5.

Suppose you use a different threshold value on Pr(Y=1|X=x0) to classify x0, is the corresponding classifier still the Bayes classifier, and is the corresponding loss function still the 0-1 loss?
No to both. If the threshold is changed so that a sample may no longer be assigned to the class with the highest posterior probability, the classifier no longer behaves as the Bayes classifier. Likewise, the implied loss function is no longer the 0-1 loss: a threshold other than 0.5 corresponds to weighting the two types of misclassification unequally, i.e., to an asymmetric (weighted) loss function. The posterior computations are the same as for the standard Bayes classifier, but the final class assignment (decision) may differ.

Provide a scenario where to classify an observation a different threshold value is more sensible than the threshold value used by the Bayes classifier.
A situation where a different threshold value may be desirable is one where the classification task is sensitive to either false positives or false negatives. One possible scenario would be in predicting a subject’s reaction to a particular drug. If the drug has a severe negative side effect, it may be prudent to change the threshold such that the classification favors false positives (predicting a negative reaction where none will occur) rather than false negatives (predicting no reaction, when a negative reaction will occur). In this case, one would hypothetically prefer not to recommend a drug that may cause a negative side effect, rather than accidentally recommending a drug, thinking it is safe, to someone who would experience negative side effects.
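As an illustrative sketch (the posterior values and the 0.3 threshold below are hypothetical), shifting the threshold changes which kind of error the rule favors:

# Classifying with a non-0.5 threshold on Pr(Y = 1 | X = x0), where
# class 1 is the "adverse reaction" class; lowering the threshold
# favors predicting class 1, trading more false positives for fewer
# false negatives.
post1 <- c(0.10, 0.35, 0.62, 0.48)  # hypothetical posteriors for class 1

bayes_rule <- ifelse(post1 > 0.5, 1, 2)  # Bayes classifier threshold
cautious   <- ifelse(post1 > 0.3, 1, 2)  # stricter safety threshold
rbind(bayes_rule, cautious)              # decisions differ for samples 2 and 4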

1.4
If K=2 in subquestion 1.2), π1=0.6, f1(x)∼Gaussian(0,1) and f2(x)∼Gaussian(2,1) and x0=1.5. Compute Pr(Y=1|X=x0) and use the Bayes classifier to classify x0.

1.4 Response
Compute Pr(Y=1|X=x0) and use the Bayes classifier to classify x0.

Given:

K=2
π1=0.6
π2=0.4
f1(x)∼Gaussian(0,1)
f2(x)∼Gaussian(2,1)
x0=1.5
x_0 <- 1.5

# Class-conditional densities f1(x_0) and f2(x_0)
cndp1 <- dnorm(x_0, mean = 0, sd = 1)
cndp2 <- dnorm(x_0, mean = 2, sd = 1)

# Marginal density of X at x_0 (prior-weighted mixture)
fmarginal <- 0.6 * cndp1 + 0.4 * cndp2

# Joint contributions pi_k * f_k(x_0)
jp1 <- 0.6 * cndp1
jp2 <- 0.4 * cndp2

# Posteriors Pr(Y = k | X = x_0) by Bayes' theorem
postp1 <- jp1/fmarginal
postp2 <- jp2/fmarginal

posteriors <- c(postp1, postp2)
posteriors
## [1] 0.355595 0.644405
From the above, $\Pr(Y = 1 \mid X = x_0) \approx 0.36$ whereas $\Pr(Y = 2 \mid X = x_0) \approx 0.64$. The largest posterior probability is 0.64, attained by class 2, so the Bayes classifier assigns $x_0$ to class 2.

Conceptual exercises: II (k-NN classifier)
Given the training set $T$ of $n$ observations $(x_1, y_1), \ldots, (x_n, y_n)$, where $y_i$ is the class label of observation $x_i$ and $y_i \in \mathcal{G} = \{1, \ldots, K\}$ for $K \geq 2$, consider the k-NN classifier, where $k$ is the neighborhood size.

2.1
Describe how the decision boundary (such as its smoothness and shape) of k-NN classifier changes as k changes.

2.1 Response
Describe how the decision boundary (such as its smoothness and shape) of k-NN classifier changes as k changes.
The decision boundary becomes smoother and less locally detailed as $k$ increases. With a larger "neighborhood" of points taken into account, the boundary is less affected by individual nearby points and better reflects the global structure of the data rather than local noise (better approximating the whole rather than the local). A small sketch below illustrates this.
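An illustrative sketch (simulated 2-feature data, not part of the HW answer) that makes the smoothing effect visible by coloring a grid of points with the k-NN prediction for a small and a large k:

# Simulated 2-class data; k-NN predictions over a grid show the
# decision regions for k = 1 (jagged) versus k = 25 (smooth)
library(class)
library(ggplot2)
set.seed(1)

train  <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
labels <- factor(ifelse(train$x1 + train$x2 + rnorm(100, sd = 0.5) > 0, "A", "B"))

grid <- expand.grid(x1 = seq(-3, 3, length.out = 150),
    x2 = seq(-3, 3, length.out = 150))
grid$k1  <- knn(train, grid[, c("x1", "x2")], labels, k = 1)
grid$k25 <- knn(train, grid[, c("x1", "x2")], labels, k = 25)

ggplot(grid, aes(x1, x2, fill = k1)) + geom_tile(alpha = 0.4) +
    theme_bw() + ggtitle("1-NN decision regions (jagged)")
ggplot(grid, aes(x1, x2, fill = k25)) + geom_tile(alpha = 0.4) +
    theme_bw() + ggtitle("25-NN decision regions (smoother)")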

2.2
Explain why the training error of 1-NN classifier is 0. Provide an estimator of the test error of a classifier and explain why it can be used as such an estimator. Is it true that a large k leads to a k-NN classifier with smaller test error? Can the test error of a k-NN classifier be equal to the test error of the Bayes classifier? When k is large and k/n is small, what is a k-NN classifier approximately estimating?

2.2 Response
Explain why the training error of 1-NN classifier is 0.
When a 1-NN classifier is evaluated on its own training set, the nearest neighbor of each training point is the point itself (at distance 0), so the predicted label always equals the true label and the training error is 0.

Provide an estimator of the test error of a classifier and explain why it can be used as such an estimator.
The test error can be estimated by $e_m = \frac{1}{m} \sum_{i=1}^{m} L(y_{0i}, \hat{y}_{0i})$, the average 0-1 loss over the $m$ test observations. This works as an estimator because, by the law of large numbers, $e_m$ converges to the expected 0-1 loss $\rho$ as $m \to \infty$. In other words, as the number of test samples grows, the test error closely approximates the expected 0-1 loss.

Is it true that a large k leads to a k-NN classifier with smaller test error?
A large k will not necessarily lead to a smaller k-NN test error. If k is too small, the model will be over-fit to the training samples and will likely not generalize well. Conversely, if k is too large, the model will likely be too general. Therefore, the optimal value for k requires finding a balance between small and large neighborhoods.

Can the test error of a k-NN classifier be equal to the test error of the Bayes classifier?
Yes. Since k-NN essentially estimates the unknown conditional probabilities used by the Bayes classifier, its test error asymptotically approaches that of the Bayes classifier (e.g., as $k \to \infty$ with $k/n \to 0$), and can therefore equal it.

When k is large and k/n is small, what is a k-NN classifier approximately estimating?
When $k$ is large and $k/n$ is small, the k-NN classifier is effectively estimating the posterior probability that a point belongs to a given class. This is formally described in the lecture materials as $\hat{g}_j(x) \approx \Pr(Y = j \mid X = x)$.

2.3
When there are K≥2 classes, how does a k-NN classifier classify a test observation x0?

2.3 Response
When there are K≥2 classes, how does a k-NN classifier classify a test observation x0?
For the binary case $K = 2$, based on the content from the lecture materials (with the two labels coded as 1 and 0):

Let $N_k(x)$ be the neighborhood of $x$ consisting of the $k$ points $x_i$ in the training set $T$ that are closest to $x$.
Compute the average label as $\bar{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$.
Specify a threshold $\beta \in (0, 1)$ and estimate the class label of $x$ as $\hat{Y}(x) = 1$ if $\bar{Y}(x) > \beta$, and $\hat{Y}(x) = 0$ otherwise.

For general $K \geq 2$: once the model is trained, find the $k$ nearest neighbors of the new point being classified and predict the label that is most common among those neighbors (majority vote), as in the sketch below.
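A minimal R sketch of the majority vote (the neighbor labels below are hypothetical):

# Labels of the k = 5 nearest neighbors of a test point
neighbor_labels <- c("UA", "DL", "UA", "AA", "UA")

counts  <- table(neighbor_labels)              # votes per class
winners <- names(counts)[counts == max(counts)]
sample(winners, 1)  # "UA" here; sample() breaks ties at random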

2.4
When should data be standardized before applying a k-NN classifier? When standardizing data, do we standardize each observation or each feature?

2.4 Response
When should data be standardized before applying a k-NN classifier?
Data should be standardized before applying a k-NN classifier when the features are on very different scales. k-NN uses Euclidean distance as its dissimilarity measure, so without standardization, features with large numeric ranges dominate the distance and the remaining features are effectively ignored.

When standardizing data, do we standardize each observation or each feature?
When standardizing data, we standardize each feature (column), not each observation: every feature is centered by its mean and scaled by its standard deviation, as in the sketch below.
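For example, R's scale() operates on columns, which is the feature-wise standardization used in exercise 7.1 below:

# Two features on very different scales
m <- matrix(c(1, 2, 3, 100, 200, 300), ncol = 2)
scale(m)            # each column centered and scaled separately
colMeans(scale(m))  # approximately 0 for every feature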

2.5
Using your understanding of Example 3 in “LectureNotes4b_notes.pdf”, provide a step-by-step guide on how to choose an optimal k for k-NN classifier using cross-validation. You can provide such as guide in the form of “pseudocode” (see, e.g., https://en.wikipedia.org/wiki/Pseudocode for some details on pseudocode). Suppose the training set has few observations, can you still perform cross-validation in order to choose an optimal k? Explain your answer. (Hint: for the 2nd part, think about if having more observations helps better estimate test error.)

2.5 Response
Using your understanding of Example 3 in “LectureNotes4b_notes.pdf”, provide a step-by-step guide on how to choose an optimal k for k-NN classifier using cross-validation.
Start by defining a range of k values to test and a number of folds m for the cross-validation (often 5 or 10), then randomly split the training set into m folds. For each candidate k, train on m−1 folds and compute the misclassification rate on the held-out fold, cycling through all m folds, and average these errors into a cross-validated error for that k. The k producing the lowest cross-validated error is chosen as the estimated optimal k. In pseudocode:
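(Exercise 7.2 below implements essentially the same loop in R.)

# Pseudocode: choosing k by m-fold cross-validation
Input: training set T, candidate sizes k = 1, ..., kMax, folds m (e.g., 10)
Randomly partition T into m folds F_1, ..., F_m of roughly equal size
for k in 1, ..., kMax:
    for s in 1, ..., m:
        fit the k-NN classifier using T \ F_s as the training data
        err[s] <- misclassification rate of this classifier on F_s
    cvError[k] <- mean(err[1], ..., err[m])
Output: k_hat = the k with the smallest cvError[k]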

Suppose the training set has few observations, can you still perform cross-validation in order to choose an optimal k?
Cross-validation can still be performed with only a few observations; however, each held-out fold will then contain very few samples, so the fold-wise error estimates (and hence the chosen k) will be highly variable and may not reliably estimate the test error. With so little data, the selected k should be treated with caution.

Conceptual exercises: III (Discriminant analysis)
3
Exercise 2 of Section 4.7 of the Text, which starts with “It was stated in the text that classifying an observation to the class for which (4.12) is largest is equivalent to classifying an observation to the class for which (4.13) is largest. Prove that this is the case.” (Helpful information on how to prove this is contained in the lecture video on LDA and “LectureNotes5b_notes.pdf”.)

3 Response
Exercise 2 of Section 4.7 of the Text, which starts with “It was stated in the text that classifying an observation to the class for which (4.12) is largest is equivalent to classifying an observation to the class for which (4.13) is largest. Prove that this is the case.”
Problem: Prove that assigning the sample $X = x$ to the class for which
$$p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)}$$
is the largest is the same as assigning $X = x$ to the class for which
$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$
is the largest.

This holds because the denominator of $p_k(x)$ and the factor $\frac{1}{\sqrt{2\pi}\sigma}$ are the same for every class $k$, so maximizing $p_k(x)$ over $k$ is equivalent to maximizing the numerator $\pi_k \exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)$. Taking the logarithm (a strictly increasing function, so it preserves the maximizing class) and discarding the term $-\frac{x^2}{2\sigma^2}$, which likewise does not depend on $k$, leaves exactly $\delta_k(x)$. Hence the class that maximizes $p_k(x)$ is the class that maximizes $\delta_k(x)$.
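In symbols (dropping factors and terms that are constant in $k$):

$$\arg\max_k p_k(x) = \arg\max_k \log\left(\pi_k e^{-\frac{1}{2\sigma^2}(x - \mu_k)^2}\right) = \arg\max_k \left(\log \pi_k - \frac{x^2}{2\sigma^2} + \frac{x \mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}\right) = \arg\max_k \delta_k(x).$$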

4
Exercise 3 of Section 4.7 of the Text, which starts with “This problem relates to the QDA model, in which the observations within each class are drawn from a normal distribution with a class specific mean vector and a class specific covariance matrix. We consider the simple case where p = 1; i.e. there is only one feature.” (Helpful information on how to prove this is contained in the lecture video on QDA and “LectureNotes5b_notes.pdf”.)

4 Response
Exercise 3 of Section 4.7 of the Text, which starts with “This problem relates to the QDA model, in which the observations within each class are drawn from a normal distribution with a class specific mean vector and a class specific covariance matrix. We consider the simple case where p = 1; i.e. there is only one feature.”

The decision boundary of the Bayes classifier will be quadratic in this case because each class has its own variance $\sigma_k^2$, so the quadratic term in the discriminant does not cancel between classes, as sketched below.
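A sketch of why (with class-specific variance $\sigma_k^2$ and $p = 1$), the discriminant becomes

$$\delta_k(x) = -\frac{x^2}{2\sigma_k^2} + \frac{x \mu_k}{\sigma_k^2} - \frac{\mu_k^2}{2\sigma_k^2} - \log \sigma_k + \log \pi_k.$$

When $\sigma_1 \neq \sigma_2$, the leading term $-\frac{x^2}{2\sigma_k^2}$ differs between the classes, so the boundary $\{x : \delta_1(x) = \delta_2(x)\}$ is the solution set of a quadratic equation in $x$; it reduces to a linear boundary only in the special case $\sigma_1 = \sigma_2$.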

5
Exercise 5 of Section 4.7 of the Text, which starts with “We now examine the differences between LDA and QDA.” (Hint: for this question, you may also use information from Figures 4.9, 4.10 and 4.11 in the Text.)

5 Response
Exercise 5 of Section 4.7 of the Text, which starts with “We now examine the differences between LDA and QDA.”

5 (a)
Given a linear Bayes decision boundary, LDA is expected to perform better on the test set. On the training set, QDA's extra flexibility may let it fit slightly better; however, this advantage will likely not generalize to the test set, and QDA is therefore not recommended here.

5 (b)
Given a non-linear Bayes decision boundary, we would expect QDA to perform best (closest approximate the Bayes decision boundary) on both training and test sets.

5 (c)
As the sample size n increases, one would expect QDA's performance relative to LDA to improve on data that are not perfectly linearly separable, since QDA's higher variance becomes less of a concern with more data while its lower bias remains. In the case of perfectly linearly separable data, one would expect the two to perform the same, since a linear decision boundary can already separate the classes optimally regardless of sample size.

5 (d)
False, for two reasons. First, the Bayes decision boundary is considered the "gold standard" that other classification methods approximate based on available information; if the Bayes decision boundary is linear, then the most desirable decision boundary is linear. Second, QDA does not produce a linear decision boundary. It may "model" a linear boundary by producing one that is very nearly linear, but it does not produce an exactly linear boundary; for that we would use LDA. An example where QDA will certainly not outperform LDA is data whose classes are perfectly linearly separable: LDA can then fit a boundary that yields 0% error, so QDA cannot do better. This may be rare in practice, but it serves as a counterexample to an answer of "True" for this question.

6
Let Y be the random variable for the class label of a random vector X∈Rp (where p is the number of features), such that Y∈G={1,…,K} and Pr(Y=k)=πk for Class k with k∈G, where K≥2 is the number of classes. Consider the Gaussian mixture model such that the conditional density of X when it comes from Class k is fk(x)∼Gaussian(μk,Σk). Given the training set T of n observations (x1,y1),…,(xn,yn) on (X,Y), where yi is the class label of observation xi, do the following:

6.1
Provide the MLEs of πk, μk and Σk for each k∈G respectively for the case where all Σk’s are equal and for the case where not all Σk’s are equal. When p>n, is the MLE of Σk still accurate? If not, recommend a different estimator for estimating Σk and provide details on this estimator.

6.1 Response
Provide the MLEs of πk, μk and Σk for each k∈G respectively for the case where all Σk’s are equal and for the case where not all Σk’s are equal.
MLEs for $\pi_k$, $\mu_k$, $\Sigma_k$ assuming all covariance matrices are equal:
$\hat{\pi}_k = \frac{n_k}{n}$
$\hat{\mu}_k = \frac{1}{n_k} \sum_{\{i: y_i = k\}} x_i$
$\hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{\{i: y_i = k\}} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$

MLEs for πk, μk, Σk assuming not all covariance matrices are equal:
The estimates for $\pi_k$ and $\mu_k$ are the same when the covariance matrices are unequal. However, the covariance is then estimated separately within each class:
$\hat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{\{i: y_i = k\}} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$

When p>n, is the MLE of Σk still accurate? If not, recommend a different estimator for estimating Σk and provide details on this estimator.
No, when p>n the MLE of Σk will likely not be accurate. Other estimators that may be used instead are the regularized estimate and the shrinkage estimate. These estimators can be computed as follows:

Regularized Estimate: $\hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha) \hat{\Sigma}$, $\alpha \in [0, 1]$
The regularized estimate is a convex combination of the pooled covariance matrix $\hat{\Sigma}$ and class $k$'s own covariance estimate $\hat{\Sigma}_k$. When $\alpha \approx 1$, the class covariance matrices are treated as dissimilar, producing a model similar to QDA; conversely, when $\alpha \approx 0$, the covariance matrices are pooled, producing a model similar to LDA.

Shrinkage Estimate: $\hat{\Sigma}(\gamma) = \gamma \hat{\Sigma} + (1 - \gamma) \hat{\sigma}^2 I$, $\gamma \in [0, 1]$
The shrinkage estimate is another convex combination, this time of the pooled covariance matrix $\hat{\Sigma}$ and the scaled identity matrix $\hat{\sigma}^2 I$.
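A minimal R sketch of the two convex combinations (the matrices and the values of alpha and gamma below are illustrative placeholders, not estimates from real data):

# Illustrative class-k and pooled covariance estimates (2 features)
SigmaK      <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
SigmaPooled <- matrix(c(1.5, 0.3, 0.3, 1.2), 2, 2)

alpha  <- 0.5
gamma_ <- 0.5
sigma2 <- mean(diag(SigmaPooled))  # a simple scalar variance estimate

# Regularized estimate: between class-specific (QDA-like) and pooled (LDA-like)
SigmaRegularized <- alpha * SigmaK + (1 - alpha) * SigmaPooled
# Shrinkage estimate: between pooled and a scaled identity matrix
SigmaShrinkage <- gamma_ * SigmaPooled + (1 - gamma_) * sigma2 * diag(2)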

6.2
Assume p=2 and K=2 and k=1. For the density fk(x)∼Gaussian(μk,Σk), what shape do its contours take, and how does Σk control the shape of these contours? How do you check if the conditional density of X given that it comes from Class k is Gaussian?

6.2 Response
For the density fk(x)∼Gaussian(μk,Σk), what shape do its contours take, and how does Σk control the shape of these contours?
The contours of $f_k(x)$ are ellipses centered at $\mu_k$. The covariance matrix $\Sigma_k$ controls their shape: its eigenvectors give the orientations of the ellipse axes and its eigenvalues their lengths, so the contours are circles when $\Sigma_k$ is proportional to the identity and elongated, possibly tilted, ellipses otherwise. (Each class-conditional density is unimodal; it is the marginal density of $X$ under the two-class mixture that may appear bi-modal, a mixture of two "bumps," when plotted.)

How do you check if the conditional density of X given that it comes from Class k is Gaussian?
The conditional density of X for each class k can be examined by plotting 2D contour plots of feature pairs, 3D density plots, or the estimated density of each feature by class (though the last makes it difficult to assess the joint, multivariate behavior of high-dimensional data).
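One way to eyeball bivariate normality per class is a contour plot of a feature pair, faceted by class. A sketch on simulated data (the data frame and column names here are illustrative):

# Roughly elliptical, concentric contours within each panel are
# consistent with a bivariate Gaussian; irregular shapes are not.
library(ggplot2)
set.seed(1)
df <- data.frame(f1 = rnorm(300), f2 = rnorm(300),
    class = rep(c("k1", "k2", "k3"), each = 100))
ggplot(df, aes(f1, f2)) + geom_density_2d() + facet_wrap(~class) + theme_bw()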

6.3
Is it true that discriminant analysis will perform badly if the Gaussian assumption is violated? (Hint: for this question, you may also use the information provided by Figures 4.10 and 4.11 of the Text.) Let $X = (X_1, \ldots, X_p)^T$, i.e., $X_1$ up to $X_p$ are the feature variables. Can discriminant analysis be applied to observations of X when some of the $X_j$, $j = 1, \ldots, p$, are discrete variables (such as categorical variables)? Explain your answer.

6.3 Response
Is it true that discriminant analysis will perform badly if the Gaussian assumption is violated?
This depends on the severity of the violation. If the Gaussian mixture model (GMM) assumption is violated strongly, the model is unlikely to perform well, whereas if it is violated only slightly, the model may still perform fairly well. In general, the more severe the violation, the worse the expected performance.

Can discriminant analysis be applied to observations of X when some of Xj,j=1…,p is a discrete variable (such as a categorical variable)?
No. Discriminant analysis relies on (approximately) normally distributed features in order to model the class-conditional densities, and a discrete variable cannot be Gaussian. Categorical variables could in principle be encoded numerically, but the model is unlikely to perform well under such circumstances.

6.4
What is a ROC curve, and what is AUC? How is AUC used to gauge the performance of a classifier? If you apply the same classifier, say, LDA or QDA under the same Gaussian mixture model, to two data sets that are independently generated from the same data generating process, i.e., that are independently generated from (X,Y) for classification problems, and obtain two ROC curves, would the two ROC curves be quite different? Explain your answer. When there are 3 or more classes, are the codes provided in the lecture notes able to obtain ROC curves and their AUC’s for LDA and QDA?

6.4 Response
What is a ROC curve, and what is AUC?
A ROC curve visualizes the trade-off between a model's true positive rate (y-axis) and false positive rate (x-axis) across a range of classification thresholds. A model whose ROC curve lies along the $y = x$ line performs no better than chance, whereas a good model's curve hugs the upper-left corner of the plot. The AUC is the area under the ROC curve and serves as a single-number summary of the performance the curve visualizes (larger is better; useful classifiers range from 0.5 up to 1).

How is AUC used to gauge the performance of a classifier?
An AUC of 0.5 indicates that the model performs no better than chance, while an AUC of 1 indicates perfect separation of the classes. A large AUC indicates a model that attains high true positive rates at low false positive rates; values near 0.5 indicate a model with little discriminating ability.

If you apply the same classifier, say, LDA or QDA under the same Gaussian mixture model, to two data sets that are independently generated from the same data generating process, i.e., that are independently generated from (X,Y) for classification problems, and obtain two ROC curves, would the two ROC curves be quite different?
If the two data sets are independently generated from the same data generating process, the two ROC curves should be similar: both are estimates of the same underlying curve and differ only by sampling variability.

When there are 3 or more classes, are the codes provided in the lecture notes able to obtain ROC curves and their AUC’s for LDA and QDA?
No. If there are 3 or more classes, the code used to produce a ROC curve for binary classification will not work; if attempted, the following error results: Error in ROCR::prediction(test$posterior[, 2], test$Class) : Number of classes is not equal to 2. ROCR currently supports only evaluation of binary classification tasks.

6.5
Describe the key similarities and differences, respectively, between LDA and logistic regression. Provide a situation where discriminant analysis can still be sensibly applied but logistic regression is not well-defined.

6.5 Response
Describe the key similarities and differences, respectively, between LDA and logistic regression.
Similarities:

Both LDA and logistic regression are parametric classification methods.
Differences:

Logistic regression does not assume a Gaussian mixture model and is less sensitive than discriminant analysis to strongly correlated features.
Logistic regression models Pr(Y|X) directly, so the marginal distribution of X does not need to be modeled.
The MLEs of the logistic regression parameters are undefined (they diverge) if the data are perfectly linearly separable.
Provide a situation where discriminant analysis can still be sensibly applied but logistic regression is not well-defined.
A situation where the data is perfectly linearly separable is one where logistic regression will not be well-defined, but where LDA can still be utilized.

Applied exercises: I (k-NN classifier)
Please refer to the NYC flight data nycflights13 that has been discussed in the lecture notes and whose manual can be found at https://cran.r-project.org/web/packages/nycflights13/index.html. We will use flights, a tibble from nycflights13.

Please use set.seed(123) for the whole of this exercise. Randomly select from flights, for each of the 3 carriers "UA", "AA" and "DL", 500 observations of the 3 features dep_delay, arr_delay and distance. Let us try to see if we can use the 3 features to identify if an observation belongs to a specific carrier. The following tasks and questions are based on the extracted observations. Note that you need to remove rows with NA's from the extracted observations.

# Setting the seed for the chunk
set.seed(123)

# Inspecting the raw data
head(flights)
## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
# Extracting bulk data, filtering, and removing missing
# values
filtered <- flights %>% dplyr::select(carrier, dep_delay, arr_delay, 
    distance) %>% filter(carrier %in% c("UA", "AA", "DL")) %>% 
    na.omit()

# Randomly selecting 500 observations from the bulk data
fl500 <- sample_n(filtered, size = 500, replace = FALSE)

# Inspecting the pre-processed data
head(fl500)
## # A tibble: 6 x 4
##   carrier dep_delay arr_delay distance
##   <chr>       <dbl>     <dbl>    <dbl>
## 1 AA             -9       -48      733
## 2 UA             83        41     2565
## 3 UA             -7       -41     2565
## 4 AA             -4       -39     2475
## 5 UA              9        -4     1400
## 6 DL             -7       -42     2475
7.1
First, you need to standardize the features since they are on very different scales. Then randomly split the observations into a training set that contains 70% of the observations and a test set that contains the remaining observations.

7.1 Code
# Setting the seed for the chunk
set.seed(123)

# Standardizing features and splitting data and label columns
fl500Labels <- fl500 %>% dplyr::select(carrier) %>% as.data.frame()
fl500Std <- as.data.frame(scale(fl500[, 2:4]))

# Splitting the samples into train and test sets
trainingIndices <- sample(1:nrow(fl500Std), floor(0.7 * nrow(fl500Std)), 
    replace = FALSE)
testIndices <- (1:nrow(fl500Std))[-trainingIndices]

trainData <- as.data.frame(fl500Std[trainingIndices, ])
trainLabels <- as.data.frame(fl500Labels[trainingIndices, ])
colnames(trainLabels) <- "carrier"

testData <- as.data.frame(fl500Std[testIndices, ])
testLabels <- as.data.frame(fl500Labels[testIndices, ])
colnames(testLabels) <- "carrier"
7.2
Consider the observations as forming 3 classes that are determined by carrier. To the training set, apply 10 fold cross-validation to k-NN classifier with features arr_delay, dep_delay, and distance to determine the optimal k from the values {1,…,15}. Apply the optimal k-NN to the test set, provide the classification table and the overall error rate, and provide visualization of the classification results. Do you think the error rate is reasonable? Explain your answer. (Hint: you can follow the strategy provided by Example 3 in “LectureNotes4b_notes.pdf”. )

7.2 Code
# Setting chunk seed
set.seed(123)

# 10-folds CV to determine best k from 1-15
m <- 10
folds <- sample(1:m, nrow(trainData), replace = TRUE)
kMax <- 15

testErrors <- matrix(0, nrow = 2, ncol = kMax)

for (k in 1:kMax) {
    testError1 <- double(m)
    
    for (s in 1:m) {
        trainingTmp <- trainData[folds != s, ]
        testTmp <- trainData[folds == s, ]
        trainingLabsTmp <- trainLabels[folds != s, ]
        testLabsTmp <- trainLabels[folds == s, ]
        
        knntmp <- knn(trainingTmp, testTmp, trainingLabsTmp, 
            k)
        nMissedObs <- sum(1 - as.numeric(knntmp == testLabsTmp))
        tError <- nMissedObs/length(testLabsTmp)
        testError1[s] <- tError
    }
    
    testErrors[, k] <- c(mean(testError1), sd(testError1))
}

colnames(testErrors) <- paste("k=", 1:kMax, sep = "")
rownames(testErrors) <- c("mean(TestError)", "sd(TestError)")

testErrors <- as.data.frame(testErrors)
as.numeric(testErrors[1, ])
##  [1] 0.5772902 0.5775851 0.5623222 0.5594703 0.5369742 0.5313338 0.5313181
##  [8] 0.5366948 0.5329016 0.5386426 0.5388035 0.5336004 0.5449212 0.5380716
## [15] 0.5372920
hatk <- which(testErrors[1, ] == min(testErrors[1, ]))
hatk
## [1] 7
# Applying kNN
knnOut <- knn(trainData, testData, trainLabels$carrier, hatk)
missedObs <- sum(1 - as.numeric(knnOut == testLabels$carrier))
tErrorOpt <- missedObs/length(testLabels$carrier)

# Classification table and error rate
table(knnOut, testLabels$carrier)
##       
## knnOut AA DL UA
##     AA  2  2  6
##     DL 14 29 21
##     UA 17 18 41
cat("Error Rate:", tErrorOpt, "\n")
## Error Rate: 0.52
# Visualizing results
testResData <- as.data.frame(testData)
testResData$Carrier <- unlist(testLabels)
testResData$EstimatedCarrier <- unlist(knnOut)

ggplot(testResData, aes(arr_delay, distance)) + geom_point(aes(shape = EstimatedCarrier, 
    color = Carrier)) + theme_bw() + labs(title = paste0(hatk, 
    "-NN Results"))


7.2 Response
Do you think the error rate is reasonable?
No; in this case the model did not perform adequately, producing an error rate of 52%. For comparison, always predicting "UA" (the most common carrier in the test set, 68 of 150 observations per the classification table) would give an error rate of about 55%, so the classifier improves only marginally on a trivial baseline.

7.3
Note that you have standardized the features arr_delay, dep_delay, and distance. However, with the unstandardized features, you would surely know that none of them follows a Gaussian distribution no matter with respect to which class (i.e., carrier) you look at their observations, since these features have non-negative values (whereas a Gaussian distribution can generate negative values). Again, the 3 classes are determined by carrier. So, you will apply QDA based on the 3 standardized features to the training set to train the model, and then apply the trained model to the test set (that contains the 3 standardized features) to classify its observations.

7.3a
First, check if the Gaussian assumption is satisfied. For this, note that if the standardized features arr_delay, dep_delay, and distance follow a trivariate Gaussian distribution for each individual class, then any pair among the 3 standardized features follows a bivariate Gaussian distribution for each individual class.

7.3a Code
# Setting the seed for the chunk
set.seed(123)

# Inspecting the data
head(fl500Std)
##     dep_delay  arr_delay    distance
## 1 -0.57254000 -1.2198869 -0.88996180
## 2  1.90785556  0.8324748  1.54589275
## 3 -0.51861836 -1.0584652  1.54589275
## 4 -0.43773589 -1.0123447  1.42622741
## 5 -0.08724522 -0.2052362 -0.00310864
## 6 -0.51861836 -1.0815255  1.42622741
# Collecting data
fl500StdQDA <- as.data.frame(fl500Std)
fl500StdQDA$Carrier <- fl500Labels$carrier

# Melting the data to prepare for ggplot application
fl500Melt <- melt(fl500StdQDA, id = "Carrier")

# Plotting the densities of the class variables
ggplot(fl500Melt, aes(x = value, fill = Carrier)) + geom_density(alpha = 0.25) + 
    theme_bw() + labs(title = "Class Distributions", x = "Value", 
    y = "Density")


# Splitting data into training and test sets, 70-30 ratio
trainingIndices <- sample(1:nrow(fl500StdQDA), floor(0.7 * nrow(fl500StdQDA)), 
    replace = FALSE)
testIndices <- (1:nrow(fl500StdQDA))[-trainingIndices]

trainData <- fl500StdQDA[trainingIndices, ]
testData <- fl500StdQDA[testIndices, ]

# Training the model
qdaTrain <- qda(Carrier ~ ., data = trainData)
7.3a Response
Visualizing the feature densities for each of the three classes "AA", "DL" and "UA", the data do not appear to be consistent with a trivariate Gaussian distribution for each class: the estimated density curves are highly irregular, multi-modal, and have very long right tails.

7.3b
Apply the estimated (i.e., trained) QDA model to the test set, provide the estimated mixing proportion and estimated mean vector for each class, and provide the classification table. If you randomly pick an observation on the 3 standardized features, is it approximately equally likely to belong to each of the 3 carriers? (You do not need to mathematically prove your answer. However, think along this line: we are testing the equality of 3 population proportions, and each estimated population proportion is based on around 350 observations, which approximately can be done via a z-test since the central limit theorem is in effect.) How is the performance of QDA on this test set? Explain your answers.

7.3b Code
# Setting the seed for the chunk
set.seed(123)

# Testing the model
qdaTest <- predict(qdaTrain, newdata = testData)

# Estimated mixing proportions, computed from the full sample (the
# trained model's priors are stored in qdaTrain$prior)
cat("Estimated Proportion - AA:", sum(fl500$carrier == "AA")/nrow(fl500), 
    "\n")
## Estimated Proportion - AA: 0.202
cat("Estimated Proportion - DL:", sum(fl500$carrier == "DL")/nrow(fl500), 
    "\n")
## Estimated Proportion - DL: 0.348
cat("Estimated Proportion - UA:", sum(fl500$carrier == "UA")/nrow(fl500), 
    "\n")
## Estimated Proportion - UA: 0.45
# Overall mean vector of the three features (the class-wise mean
# vectors of the trained model are stored in qdaTrain$means)
mv <- colMeans(fl500[, 2:4])
mv
## dep_delay arr_delay  distance 
##    12.236     4.900  1402.338
# Classification table and error
table(testData$Carrier, qdaTest$class)
##     
##      AA DL UA
##   AA  0 22 11
##   DL  1 23 25
##   UA  0 37 31
cat("Classification Error:", 1 - mean(qdaTest$class == testData$Carrier))
## Classification Error: 0.64
7.3b Response
If you randomly pick an observation on the 3 standardized features, is it approximately equally likely to belong to each of the 3 carriers?
No, observations belonging to UA are about twice as likely as that of AA. The mixing proportions for all three carriers (classes) are:

AA: 20.2%
DL: 34.8%
UA: 45%
How is the performance of QDA on this test set?
QDA performed poorly on the test set, with an error rate of 64%. This is likely because the Gaussian mixture model assumption is strongly violated, as suggested by the density plots of each class.

7.3c
Extract observations that are for “UA” or “DL” from the training set and the test set, respectively, to form a new training set and a new subset, so that there are now 2 classes “UA” and “DL”. Apply QDA to the new training set and then apply the trained model to the new test set. Report the overall error rate on the test set, provide the ROC curve, and calculate the AUC. How is the performance of QDA on this test set? Explain your answer.

7.3c Code
# Setting the seed for the chunk
set.seed(123)

# Filtering observations in classes 'UA' and 'DL' from
# previous train/test sets
trainData2 <- trainData %>% dplyr::select(dep_delay, arr_delay, 
    distance, Carrier) %>% dplyr::filter(Carrier %in% c("UA", 
    "DL"))

testData2 <- testData %>% dplyr::select(dep_delay, arr_delay, 
    distance, Carrier) %>% dplyr::filter(Carrier %in% c("UA", 
    "DL"))

# Training the QDA model
qdaTrain2 <- qda(Carrier ~ ., data = trainData2)
qdaTest2 <- predict(qdaTrain2, newdata = testData2)

# Classification table and error rate
table(testData2$Carrier, qdaTest2$class)
##     
##      DL UA
##   DL 23 26
##   UA 37 31
cat("Classification Error:", 1 - mean(qdaTest2$class == testData2$Carrier), 
    "\n")
## Classification Error: 0.5384615
# ROC and AUC
predQDA <- ROCR::prediction(qdaTest2$posterior[, 2], testData2$Carrier)
perfROC <- ROCR::performance(predQDA, "tpr", "fpr")
plot(perfROC, avg = "threshold", spread.estimate = "boxplot", 
    colorize = TRUE, main = "ROC Curve")


AUC <- predQDA %>% ROCR::performance(measure = "auc") %>% .@y.values
cat("AUC:", AUC[[1]], "\n")
## AUC: 0.4846939
7.3c Response
How is the performance of QDA on this test set?
The error rate for this model (compared to the previous one with all three carriers/classes) was somewhat lower at 53.85%, but the model still failed to perform adequately. Based on the ROC curve and the AUC (0.48), the model performed no better than random guessing; an AUC below 0.5 is in fact slightly worse than chance.

Applied exercises: II (Discriminant analysis)
The following is on software commands:

8.1
What is the main cause of the message “Warning in lda.default(x, grouping, …): variables are collinear”? What is the main cause of the message “Error in qda.default(x, grouping, …) : some group is too small for ‘qda’”?

8.1 Response
What is the main cause of the message “Warning in lda.default(x, grouping, …): variables are collinear”?
This warning indicates that some of the predictor variables are (nearly) linearly dependent, i.e., collinear, which makes the estimated covariance matrix singular or nearly singular.

What is the main cause of the message “Error in qda.default(x, grouping, …) : some group is too small for ‘qda’”?
This error occurs when some class has too few observations to estimate its class-specific parameters; in particular, if a class has no more observations than features, its estimated covariance matrix is singular and QDA cannot be fit.

8.2
Provide details on the list that predict{MASS} returns.

8.2 Response
The list returned by predict{MASS} contains the following components: list$class contains the MAP (maximum a posteriori) classification, and list$posterior contains the posterior probabilities for each class in the form of an n×K matrix, where K is the number of classes and n is the number of observations. For LDA fits, a third component list$x contains the observations' scores on the linear discriminants.
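A quick illustration of these components using the built-in iris data (qda fits return only the class and posterior components):

library(MASS)
fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit, iris)
str(pred$class)      # factor of MAP class labels, length n
dim(pred$posterior)  # n x K matrix of posterior probabilities (150 x 3)
head(pred$x)         # LDA only: linear discriminant scores (LD1, LD2)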

8.3
The arguments gamma and lambda of rda{klaR} are usually determined by cross-validation. Can they be set manually?

8.3 Response
Yes, they can be set manually by specifying gamma= and lambda= within the rda{klaR} function call.

We will use the human cancer microarray data that were discussed in the lectures and are provided by the R library ElemStatLearn (available at https://cran.r-project.org/src/contrib/Archive/ElemStatLearn/). Pick 3 cancer types "MELANOMA", "OVARIAN" and "RENAL", and randomly select the same set of 60 genes for each cancer type. Please use set.seed(123) for the whole of this exercise. Your analysis will be based on observations for these genes and cancer types.

# Using:
# https://cran.r-project.org/src/contrib/Archive/ElemStatLearn/ElemStatLearn_2015.6.26.tar.gz

# Setting the seed for the chunk
set.seed(123)

# Inspecting the raw data
nciRaw <- as.data.frame(nci)
head(nciRaw)
##      CNS      CNS    CNS         RENAL BREAST    CNS    CNS BREAST  NSCLC
## 1  0.300 0.679961  0.940  2.800000e-01  0.485  0.310 -0.830 -0.190  0.460
## 2  1.180 1.289961 -0.040 -3.100000e-01 -0.465 -0.030  0.000 -0.870  0.000
## 3  0.550 0.169961 -0.170  6.800000e-01  0.395 -0.100  0.130 -0.450  1.150
## 4  1.140 0.379961 -0.040 -8.100000e-01  0.905 -0.460 -1.630  0.080 -1.400
## 5 -0.265 0.464961 -0.605  6.250000e-01  0.200 -0.205  0.075  0.005 -0.005
## 6 -0.070 0.579961  0.000 -1.387779e-17 -0.005 -0.540 -0.360  0.350 -0.700
##    NSCLC  RENAL  RENAL  RENAL  RENAL  RENAL  RENAL  RENAL BREAST  NSCLC  RENAL
## 1  0.760  0.270 -0.450 -0.030  0.710 -0.360 -0.210 -0.500 -1.060  0.150 -0.290
## 2  1.490  0.630 -0.060 -1.120  0.000 -1.420 -1.950 -0.520 -2.190 -0.450  0.000
## 3  0.280 -0.360  0.150 -0.050  0.160 -0.030 -0.700 -0.660 -0.130 -0.320  0.050
## 4  0.100 -1.040 -0.610  0.000 -0.770 -2.280 -1.650 -2.610  0.000 -1.610  0.730
## 5 -0.525  0.015 -0.395 -0.285  0.045  0.135 -0.075  0.225 -0.485 -0.095  0.385
## 6  0.360 -0.040  0.150 -0.250 -0.160 -0.320  0.060 -0.050 -0.430 -0.080  0.390
##   UNKNOWN OVARIAN MELANOMA PROSTATE OVARIAN OVARIAN OVARIAN OVARIAN OVARIAN
## 1  -0.200   0.430   -0.490   -0.530  -0.010   0.640  -0.480   0.140   0.640
## 2   0.740   0.500    0.330   -0.050  -0.370   0.550   0.970   0.720   0.150
## 3   0.080  -0.730    0.010   -0.230  -0.160  -0.540   0.300  -0.240  -0.170
## 4   0.760   0.600   -1.660    0.170   0.930  -1.780   0.470   0.000   0.550
## 5  -0.105  -0.635   -0.185    0.825   0.395   0.315   0.425   1.715  -0.205
## 6  -0.080  -0.430   -0.140    0.010  -0.100   0.810   0.020   0.260   0.290
##   PROSTATE NSCLC  NSCLC  NSCLC LEUKEMIA K562B-repro K562A-repro    LEUKEMIA
## 1    0.070 0.130  0.320  0.515    0.080       0.410      -0.200 -0.36998050
## 2    0.290 2.240  0.280  1.045    0.120       0.000       0.000 -1.38998000
## 3    0.070 0.640  0.360  0.000    0.060       0.210       0.060 -0.05998047
## 4    1.310 0.680 -1.880  0.000    0.400       0.180      -0.070  0.07001953
## 5    0.085 0.135  0.475  0.330    0.105      -0.255      -0.415 -0.07498047
## 6   -0.620 0.300  0.110 -0.155   -0.190      -0.110       0.020  0.04001953
##   LEUKEMIA LEUKEMIA LEUKEMIA LEUKEMIA       COLON  COLON         COLON
## 1   -0.370   -0.430   -0.380   -0.550 -0.32003900 -0.620 -4.900000e-01
## 2    0.180   -0.590   -0.550    0.000  0.08996101  0.080  4.200000e-01
## 3    0.000   -0.500   -1.710    0.100 -0.29003900  0.140 -3.400000e-01
## 4   -1.320   -1.520   -1.870   -2.390 -1.03003900  0.740  7.000000e-02
## 5   -0.825   -0.785   -0.585   -0.215  0.09496101  0.205 -2.050000e-01
## 6   -0.130    0.520    0.120   -0.620  0.05996101  0.000 -1.387779e-17
##         COLON  COLON  COLON      COLON MCF7A-repro     BREAST MCF7D-repro
## 1  0.07001953 -0.120 -0.290 -0.8100195       0.200 0.37998050   0.3100195
## 2 -0.82998050  0.000  0.030  0.0000000      -0.230 0.44998050   0.4800195
## 3 -0.59998050 -0.010 -0.310  0.2199805       0.360 0.65998050   0.9600195
## 4 -0.90998050  0.130  1.500  0.7399805       0.180 0.76998050   0.9600195
## 5  0.24501950  0.555  0.005  0.1149805      -0.315 0.05498047  -0.2149805
## 6 -0.43998050 -0.550 -0.540  0.1199805       0.410 0.54998050   0.3700195
##   BREAST       NSCLC  NSCLC  NSCLC MELANOMA BREAST      BREAST MELANOMA
## 1  0.030 -0.42998050  0.160  0.010   -0.620 -0.380  0.04998047    0.650
## 2  0.220 -0.38998050 -0.340 -1.280   -0.130  0.000 -0.72001950    0.640
## 3  0.150 -0.17998050 -0.020 -0.770    0.200 -0.060  0.41998050    0.150
## 4 -1.240  0.86001950 -1.730  0.940   -1.410  0.800  0.92998050   -1.970
## 5 -0.305  0.78501950 -0.625 -0.015    1.585 -0.115 -0.09501953   -0.065
## 6  0.050  0.04001953 -0.140  0.270    1.160  0.180  0.19998050    0.130
##   MELANOMA MELANOMA MELANOMA      MELANOMA MELANOMA
## 1   -0.030   -0.270    0.210 -5.000000e-02    0.350
## 2   -0.480    0.630   -0.620  1.400000e-01   -0.270
## 3    0.070   -0.100   -0.150 -9.000000e-02    0.020
## 4   -0.700    1.100   -1.330 -1.260000e+00   -1.230
## 5   -0.195    1.045    0.045  4.500000e-02   -0.715
## 6    0.410    0.080   -0.400 -2.710505e-20   -0.340
# Extracting data based on required cancer types
n0 <- dim(nci)[2]
p0 <- dim(nci)[1]

rSel <- sample(1:p0, size = 60, replace = FALSE)
chk <- colnames(nci) %in% c("MELANOMA", "OVARIAN", "RENAL")
cSel <- which(chk == TRUE)
ncia <- nci[rSel, cSel]
colnames(ncia) <- colnames(nci)[cSel]

# Creating the dataframe for storing the required information
# with labels in the correct format
nciData <- data.frame(t(ncia))
nciData$Class <- colnames(ncia)
rownames(nciData) <- NULL
dim(nciData)
## [1] 23 61
head(nciData)
##              X1    X2    X3         X4    X5    X6    X7    X8    X9   X10
## 1 -1.030000e+00  0.08 -0.06  0.8500195 -0.10  0.00 -0.23 -0.55  0.00  0.11
## 2  4.900000e-01 -0.30 -0.42  0.7200195  1.63 -0.80  0.08 -0.16  0.00 -0.83
## 3 -1.380000e+00 -0.94 -0.68 -0.2299805 -0.20  0.51  0.14  0.60  0.00 -0.18
## 4 -8.605855e-19  0.30 -0.35  0.1300195  0.85  0.13 -0.69  0.38 -0.17 -0.55
## 5  1.420000e+00 -0.50 -0.26  0.2900195  0.89 -0.06 -0.76  0.15  0.17  0.51
## 6  2.000000e-01 -0.27  0.03 -0.2299805  0.46 -0.53 -0.62  0.40  0.00  1.54
##         X11        X12  X13          X14        X15   X16        X17
## 1 -0.669961 -0.1449902 0.00 -0.329990200  0.2600195 -0.40 -0.7300195
## 2  0.480039  0.4050098 0.68 -0.459990200  0.7700195 -0.16 -0.6200195
## 3  1.320039 -0.2649902 0.09 -0.009990234  2.8000190  0.10  0.1499805
## 4 -0.249961 -0.1649902 0.63 -0.249990200 -0.2499805 -0.01 -0.2000195
## 5  0.570039 -0.4449902 0.41 -0.189990200  0.4900195 -0.35 -0.4000195
## 6  0.130039 -0.4449902 0.66 -0.769990200  0.6500195 -0.62 -0.1000195
##           X18   X19   X20        X21   X22    X23   X24   X25         X26   X27
## 1 -0.30498050 -0.46 -0.49 -0.5999902  0.24  0.255 -0.28  0.60  0.98500980 -0.49
## 2 -0.34498050  0.24 -0.26 -0.3099902  0.72  0.235 -0.41 -0.07 -0.36499020 -0.29
## 3  0.08501949 -0.67  0.22  0.7800098  0.12 -0.875 -0.37  0.04  1.42501000  0.68
## 4 -0.09498051  0.23  0.26  0.1200098 -0.21 -0.145 -0.43 -0.02  0.15500980  0.52
## 5 -0.27498050  1.29  0.13 -0.1899902 -0.06  0.755 -0.54 -0.36  0.01500977  0.37
## 6 -0.67498050  1.68 -0.36  0.5700098 -0.19  0.335  0.00  0.23  0.45500980  0.03
##      X28    X29   X30   X31   X32   X33    X34    X35   X36   X37         X38
## 1 -1.255 -0.025 -0.61  0.80  0.68 -0.96  2.215  0.175  0.16  0.68  0.68001950
## 2 -0.195 -0.665 -0.03  0.10 -0.50 -0.09  1.255 -0.165  0.07 -0.10 -1.15998100
## 3 -0.345  0.255 -0.01 -0.40  0.00  0.13  1.765 -0.055  0.39 -0.18  0.76001950
## 4  0.000 -0.455 -0.50 -0.62 -0.68 -0.24 -1.085  0.315 -0.24  0.18 -0.55998050
## 5 -0.635 -0.075 -0.12 -0.73 -0.50  0.17 -0.415  0.055 -0.17 -0.01  0.05001949
## 6 -0.415  0.665  0.44  0.00 -1.04  0.09 -0.535  0.045  0.62 -0.29 -0.49998050
##      X39   X40   X41   X42   X43         X44           X45     X46    X47   X48
## 1 -0.935 -0.30 -0.16 -0.16 0.105  1.18003900  4.500000e-01 -0.0925 -0.505  0.83
## 2 -0.505 -0.31 -0.11  0.10 1.955  1.30003900 -4.700000e-01  0.3375 -0.465  0.10
## 3  0.475  0.15 -0.67 -1.04 0.475 -0.77996100 -1.214306e-17 -0.6525 -0.045 -0.36
## 4  0.335 -0.15 -0.41 -0.32 1.455 -0.63996100  3.200000e-01  0.0075  0.765 -0.20
## 5  0.165 -0.48 -0.04 -0.24 0.795 -0.08996101 -2.100000e-01  0.1075 -0.715  0.03
## 6 -0.265  0.16 -0.10 -1.08 1.375  0.93003900 -1.500000e-01  0.0875  0.015 -0.55
##         X49        X50    X51         X52   X53   X54   X55   X56        X57
## 1 0.1649902 -0.4450098  0.385  0.04001949  0.18 -0.51  0.28 -0.23  0.1300195
## 2 0.0000000  0.2249902  0.305 -0.78998050 -0.72  0.18 -0.15 -0.69 -0.1799805
## 3 1.6349900 -0.1150098 -0.475 -0.40998050  0.24 -0.22 -0.08 -0.03 -0.4099805
## 4 2.4049900  0.3149902  0.485 -0.24998050 -0.30 -0.01 -0.17 -0.63  0.3200195
## 5 1.6749900  0.5649902  0.245 -0.04998051 -0.22  0.13 -0.25 -0.03 -0.2299805
## 6 2.7449900  0.8549902  0.595 -0.02998051 -0.11  0.05 -0.48  0.33  0.4600195
##     X58   X59   X60 Class
## 1  0.66 -0.03  3.50 RENAL
## 2  0.02 -0.04  0.43 RENAL
## 3  0.04  0.00 -0.85 RENAL
## 4 -0.02 -0.17 -0.12 RENAL
## 5  0.10  0.65 -1.00 RENAL
## 6  0.45  0.28  0.00 RENAL
9.1
Pick 2 features and visualize the observations using the 2 features. Do you think it is hard to classify the observations based on the amount of overlap among the 3 neighborhoods of observations in each of the 3 classes? Here "a neighborhood of observations in a class" is an "open disk that contains the observations in the class".

9.1 Code
# Setting the seed for the chunk
set.seed(123)

# Preparing the data for visualization
nciSub <- nciData %>% dplyr::select(X1, X2, Class)

# Plotting the data with two features (X1, X2)
ggplot(nciSub, aes(X1, X2)) + geom_point(aes(color = Class)) + 
    theme_bw() + ggtitle("Data by Class for Features X1, X2")


9.1 Response
Do you think it is hard to classify the observations based on the amount of overlap among the 3 neighborhoods of observations in each of the 3 classes?
It does appear that the classes will be difficult to separate, given the overlap in 2D. There is some visual separation: blue groups towards the bottom left, red towards the bottom right, and green towards the top center. That said, the plot shows the data in a 2D space, whereas the actual observations live in a 60-dimensional feature space, so it is difficult to say from a 2D visualization alone whether the classes are separable in their ambient space.

9.2
Apply LDA and report the classwise error rate for each cancer type.

9.2 Code
# Setting the seed for the chunk
set.seed(123)

# Training the LDA model
lda.fit <- lda(Class ~ ., data = nciData)
## Warning in lda.default(x, grouping, ...): variables are collinear
# Testing the model
lda.pred <- predict(lda.fit, nciData)
TrueClassLabel <- nciData$Class
LDAEstimatedClassLabel <- lda.pred$class

# Generating the 3x3 classification table and computing class-wise
# error rates
predTable <- table(LDAEstimatedClassLabel, TrueClassLabel)
predTable
##                       TrueClassLabel
## LDAEstimatedClassLabel MELANOMA OVARIAN RENAL
##               MELANOMA        6       0     0
##               OVARIAN         0       5     0
##               RENAL           2       1     9
cwDF <- as.matrix.data.frame(predTable)
correct <- diag(cwDF)
total <- colSums(cwDF)

cwError <- 1 - (correct/total)
cat("Classwise Error:", cwError, "\n")
## Classwise Error: 0.25 0.1666667 0
9.3
Use the library klaR, and apply regularized discriminant analysis (RDA) by setting the arguments gamma and lambda of rda{klaR} manually so that the resulting classwise error rate for each cancer type is zero.

9.3 Code
# Setting the seed for the chunk
set.seed(123)

# Training the RDA model
rda.fit <- rda(Class ~ ., data = nciData, gamma = 0.825, lambda = 0.37)

# Testing the model
rda.pred <- predict(rda.fit, nciData)

# Collecting performance metrics
TrueClassLabel <- nciData$Class
rDAEstimatedClassLabel <- rda.pred$class

# Computing and displaying performance metrics
predTable <- table(rDAEstimatedClassLabel, TrueClassLabel)
predTable
##                       TrueClassLabel
## rDAEstimatedClassLabel MELANOMA OVARIAN RENAL
##               MELANOMA        8       0     0
##               OVARIAN         0       6     0
##               RENAL           0       0     9
cwDF <- as.matrix.data.frame(predTable)
correct <- diag(cwDF)
total <- colSums(cwDF)

cwError <- 1 - (correct/total)
cat("Classwise Error:", cwError, "\n")
## Classwise Error: 0 0 0
9.4
Obtain the estimated covariance matrices from the RDA and visualize them using the same strategy in Example 3 in “LectureNotes5c_notes.pdf”. What can you say about the degree of dependence among these genes for each of the three cancer types? (Hint and caution: the class labels “MELANOMA”, “OVARIAN” and “RENAL” will be ordered alphabetically by R. So, you need to keep track on which estimated covariance matrix is for which class. Otherwise, you will get wrong visualization.)

9.4 Code
# Setting the seed for the chunk
set.seed(123)

# Extracting covariance matrices
hatSigma1 <- rda.fit$covariances[, , 1]
hatSigma2 <- rda.fit$covariances[, , 2]
hatSigma3 <- rda.fit$covariances[, , 3]

# Melting covariance matrices and prepping data for
# visualization
melted_hatSigma1 <- melt(hatSigma1)
melted_hatSigma2 <- melt(hatSigma2)
melted_hatSigma3 <- melt(hatSigma3)

EstSigmaAll <- rbind(melted_hatSigma1, melted_hatSigma2, melted_hatSigma3)
EstSigmaAll$Cancer <- rep(c("MELANOMA", "OVARIAN", "RENAL"), 
    each = nrow(melted_hatSigma1))
EstSigmaAll$Cancer <- factor(EstSigmaAll$Cancer)

# Plotting the data
ggplot(EstSigmaAll, aes(Var1, Var2, fill = value)) + geom_tile() + 
    scale_fill_gradient2(low = "blue", high = "red") + facet_grid(~Cancer) + 
    labs(title = "Estimated Covariance Matrices", x = "", y = "") + 
    theme(plot.title = element_text(hjust = 0.5))


9.4 Response
What can you say about the degree of dependence among these genes for each of the three cancer types?
Of the three cancer types, RENAL shows the strongest pairwise dependence between genes, and OVARIAN the weakest. For all three classes there is slight to moderate dependence among a number of gene pairs, with moderate-to-strong dependence between only a few. In general, strongly dependent gene pairs are few in number for each of the three cancer types.