---
title: "Chapter 2: Statistical Learning"
author: David Angelo Brillantes (@daveonbrill)
header-includes:
   - \usepackage{amsmath}
output: 
  html_notebook:
    highlight: zenburn
    theme: paper
    toc: true
    toc_float: false
    code_folding: hide
---

<style type="text/css">
.main-container {
  max-width: 1000px;
  margin-left: auto;
  margin-right: auto;
}
</style>

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

<p style="text-align:justify">
<b>NOTE:</b> This is part of a [collection](https://github.com/dsbrillantes/ISL-R-Notebooks) of notebooks that serve as my personal studies of the chapters from the [Introduction to Statistical Learning with Applications in R](http://faculty.marshall.usc.edu/gareth-james/ISL/). I include insights, highlights, and experiences in the implementation of the methods discussed in the book (as well as attempts at the end-of-chapter exercises whenever I'm brave enough!).
</p>

# Key Takeaways from This Chapter

- <p style="text-align:justify">Statistical learning refers to a set of approaches for estimating $f$ which represents the <i>systematic</i> information that the predictors $X$ provide about the response $Y$. The general form of the relationship (with $\epsilon$ as the <i>random error term</i>) can be written as $Y = f(X) + \epsilon$.</p>
- The two main reasons to estimate $f$ are <b>Prediction</b> and <b>Inference</b>.
- The goal is to apply a statistical learning method to the training data to estimate $f$. These methods can be classified into <b>Parametric</b> and <b>Non-Parametric</b> approaches, each with its own strengths and weaknesses, particularly in trading off accurate prediction against interpretable inference.
- Statistical learning problems can generally be classified into <b>Supervised</b> and <b>Unsupervised</b> depending on the presence of an associated response for each observed value of the predictors.
- Statistical learning problems can also be classified into <b>Regression</b> and <b>Classification</b> depending on whether the response takes on quantitative or qualitative values.
- It is necessary to know and understand many different statistical learning methods as there is <b>no single method</b> that works best on all possible data sets.
- The challenge is to find the method that optimally minimizes both Bias and Variance.

# Introduction

<p style="text-align:justify">
The title of this chapter is <b>Statistical Learning</b>. Here, the chapter gives a preliminary discussion of statistical learning and our motivations for studying it. The concepts of Prediction and Inference, and ways of assessing statistical learning methods, are also introduced and expounded upon.
</p>


## Datasets Used

Below are the datasets used in this chapter.

| Dataset | Description |
|-|-|
| Advertising | Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. |
| Income | Simulated dataset containing the income and years of education for 30 individuals. |


```{r SETUP, include = F}

# << Libraries >>
library(data.table)
library(dplyr)
library(stringr)
library(stringi)
library(ggplot2)
library(grid)
library(gridExtra)
library(plot3D)
library(ISLR)
library(mgcv)

# << Paths >>
# Link to external datasets not in the ISLR package
ISLR_datasets <- "http://faculty.marshall.usc.edu/gareth-james/ISL/"

# << Data >>
Advertising <- fread(paste0(ISLR_datasets, "Advertising.csv"), data.table = F) %>% select(-c("V1"))
Income1 <- fread(paste0(ISLR_datasets, "Income1.csv"), data.table = F)
Income2 <- fread(paste0(ISLR_datasets, "Income2.csv"), data.table = F)


```

# 2.1. What is Statistical Learning?

- <b>Input variables</b> are also called predictors, independent variables, features, or sometimes just variables.
- <b>Output variables</b> are also called response or dependent variables.
- $Y = f(X) + \epsilon$ is the general form of the relationship between $Y$ and $X = (X_1, X_2, ..., X_p)$.
- Statistical learning refers to a set of approaches for estimating $f$.

Using the <code>Advertising</code> dataset, we try to determine if there is an association between <code>sales</code> and the advertising budgets for three different media: <code>TV</code>, <code>radio</code>, and <code>newspaper</code>. We plot <code>sales</code> as a function of these input variables wherein each plot contains a <b>least squares fit</b> of <code>sales</code> to that media.

```{r Figure 2.1, message = FALSE, fig.align = "center", fig.width = 16, fig.height = 8}
# AdHoc Plotting Function
get_figs_2_1 <- function(data, column) {
  
  ggplot(data = data, aes(x = .data[[column]], y = sales)) + # .data[[column]] refers to the column by name (tidy evaluation)
    
  # Geoms
  geom_point(color = "gold", size = 3) +
  geom_smooth(method = "lm", se = F, color = "royalblue") + # Use "lm" for least squares regression
    
  # Themes
  theme_light() +
  labs(x = ifelse(column == "TV", column, str_to_title(column)),
       y = "Sales") +
  theme(text = element_text(size = 18),
        panel.grid = element_blank(), 
        panel.border = element_rect(size = 2, color = "grey50")) 
  
}
  
# Apply Function over Columns
figs_2_1 <- lapply(c("TV", "radio", "newspaper"), get_figs_2_1, data = Advertising) 
  
# Print combined graphs
figs_2_1_caption <- paste("FIGURE 2.1.",
                          "In each plot we show the simple least squares fit of sales to that variable.",
                          "Each blue line represents a simple model that can be used to predict sales.")

grid.arrange(grobs = figs_2_1, nrow = 1,
             bottom = textGrob(x = 0.04, hjust = 0, gp = gpar(fontsize = 14, font = 3), label = figs_2_1_caption))
```

<p style="text-align:justify">
The book reiterates that $f$ is, in general, unknown and must be estimated based on the observed points. Doing so with the simulated <code>Income</code> dataset, where we assume the smoothed curve comes from a localized regression fit (the authors did not specify $f$), we see in the plot below that the estimated $f$ contains errors $\epsilon$, represented by the vertical lines between the curve and the observations.
</p>

```{r Figure 2.2., message = FALSE, fig.align = "center", fig.width = 16, fig.height = 8}
# Left Plot
fig_2_2_left <- 
  
  Income1 %>%
  
    ggplot(aes(x = Education, y = Income)) +
  
    # Geoms
    geom_point(color = "gold", size = 3) +
  
    # Themes
    scale_x_continuous(breaks = c(10, 12, 14, 16, 18, 20, 22)) +
    theme_light() +
    labs(x = "Years of Education",
         y = "Income") +
    theme(text = element_text(size = 18),
          panel.grid = element_blank(),
          panel.border = element_rect(size = 2, color = "grey50"))

# Right Plot
fit_2_2_right_loess <- fitted(loess(Income ~ Education, data = Income1)) # Fit a loess curve first and get fitted values 

fig_2_2_right <-
  
  Income1 %>% 
  
    ggplot(aes(x = Education, y = Income)) +
  
    # Geoms
    geom_segment(data = Income1, size = 1,
                 aes(x = Education, y = Income, xend = Education, yend = fit_2_2_right_loess)) +
    geom_smooth(method = "loess", se = F, color = "royalblue") +
    geom_point(color = "gold", size = 3) +
  
    # Themes
    scale_x_continuous(breaks = c(10, 12, 14, 16, 18, 20, 22)) +
    theme_light() +
    labs(x = "Years of Education",
         y = "Income") +
    theme(text = element_text(size = 18),
          panel.grid = element_blank(),
          panel.border = element_rect(size = 2, color = "grey50"))

# Print Combined Graphs
figs_2_2_caption <- paste("FIGURE 2.2.",
                          "LEFT: the dots are the observed values of income and education.",
                          "RIGHT: the curve represents the TRUE relationship, while the black lines represent the error.")

grid.arrange(grobs = list(fig_2_2_left, fig_2_2_right), nrow = 1,
             bottom = textGrob(x = 0.04, hjust = 0, gp = gpar(fontsize = 14, font = 3), label = figs_2_2_caption))

```

More generally, $f$ may involve more than one input variable.

```{r Figure 2.3., message = F, fig.align = "center", fig.width = 16, fig.height = 8}
fit_2_3_loess <- loess(Income ~ Education + Seniority, data = Income2) # Fit a loess curve first

# Predict Values on Expanded XY Grid
x.pred <- seq(min(Income2$Education), max(Income2$Education), length.out = 30)
y.pred <- seq(min(Income2$Seniority), max(Income2$Seniority), length.out = 30)
xy     <- expand.grid(Education = x.pred, Seniority = y.pred)
z.pred <- matrix(predict(fit_2_3_loess, newdata = xy), nrow = 30, ncol = 30)

# 3D Plot
figs_2_3_caption <- "FIGURE 2.3. Plot of income as a function of education and seniority."

Income2 %>% 
  
  scatter3D(
    
    type = "p",
    x = Income2$Education, 
    y = Income2$Seniority, 
    z = Income2$Income,
    colvar = NA, pch = 19, col = "gold", cex = 1.75,
    phi = 25, theta = 45, expand = 0.6,
    xlab = "Years of Education", ylab = "Seniority", zlab = "Income",
    
    # Plot Regression Plane First
    panel.first = scatter3D(
                    x = Income2$Education,
                    y = Income2$Seniority,
                    z = Income2$Income,
                    colvar = NA, col = "black", add = T,
                    surf = list(x = x.pred, y = y.pred, z = z.pred, fit = predict(fit_2_3_loess),
                                facets = T, col = "skyblue", border = "royalblue", alpha = 0.45))
    ); text(x = 0, y = 0, pos = 1, offset = 20, font = 3, cex = 1.25, labels = figs_2_3_caption, xpd = T)
```

## 2.1.1. Why Estimate $f$?

The two main reasons are <b>Prediction</b> and <b>Inference</b>.

### Prediction

- Most relevant when the input variables are readily available but the output is not.
- We can predict $Y$ using $\hat{Y} = \hat{f}(X)$ where $\hat{f}(X)$ is an estimate of $f$ and $\hat{Y}$ is the predicted outcome.
- When prediction is the main focus, the exact form of $f$ is less of a concern as opposed to improving prediction accuracy.
  - The accuracy of the prediction depends on the <b>reducible</b> and <b>irreducible</b> error.
  - <b>reducible</b>: this error can be reduced by applying the most appropriate statistical learning technique to estimate $f$.
  - <b>irreducible</b>: error due to the variability in $\epsilon$ which cannot be predicted by $X$. This can be due to not including other significant predictors, and/or unmeasurable variation.
  - <p style="text-align:justify">This can be shown mathematically by splitting the expected value of the squared difference between the observed and predicted value of $Y$, or $E(Y - \hat{Y})^2$, into $\underbrace{[f(X) - \hat{f}(X)]^2}_\text{Reducible} + \underbrace{Var(\epsilon)}_\text{Irreducible}$ where the latter term represents the <b>variance</b> in the error term $\epsilon$. A small simulation of this identity is sketched right after this list.</p>
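
Below is a minimal simulation sketch of this decomposition; the data-generating process, the deliberately imperfect $\hat{f}$, and all object names are illustrative assumptions rather than anything from the book.

```{r Sketch Reducible vs Irreducible, message = FALSE}
set.seed(1)
n   <- 1e5
x   <- runif(n, 0, 10)
eps <- rnorm(n, sd = 2)              # irreducible error, Var(eps) = 4
f   <- function(x) 3 + 0.5 * x^2     # assumed "true" f
y   <- f(x) + eps

f_hat <- function(x) 3 + 5 * x       # a deliberately imperfect estimate of f

mean((y - f_hat(x))^2)                    # E(Y - Y_hat)^2
mean((f(x) - f_hat(x))^2) + var(eps)      # reducible + irreducible
```

The two printed values should agree closely, and no choice of $\hat{f}$ can push the first one below $Var(\epsilon) = 4$.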
  
### Inference

- We also estimate $f$ to derive inferences from the relationship between the input and the output specifically in how $Y$ changes as a function of $X_1, ..., X_p$.
- The accuracy of our predictions is less important than our understanding of this relationship, so $\hat{f}$ cannot be treated as a black box; its exact form must be <b>known</b>.
- Some insights that are useful are:
  - Identifying <i>important</i> predictors that are often a <b>subset</b> of the entire list of possible variables
  - Determining the <b>positive/negative relationship</b> between the predictor and the response
  - Assessing whether the <b>assumption</b> of a linear form for $f$ is adequate, since the true relationship is often more complicated
  
<p style="text-align:justify">
It is worth noting that different situations will call for either of the two, or both; it depends on the <b>questions</b> that we want to answer, knowing that some may offer more value than others. Either way, different statistical learning methods may be more or less appropriate. Linear models, for example, allow for simple and interpretable inferences, but are not the most accurate in predictions. In contrast, methods that account for non-linearity can be very accurate, but may not be the most interpretable choice for inference. This is an <b>important mindset</b> that must be kept while studying the methods in the book.
</p>

## 2.1.2. How Do We Estimate $f$?

<p style="text-align:justify">
Many of the methods discussed in the book share certain characteristics, as will be discussed in this section. The set of $n$ different data points is called the <b>training data</b> in that these observations are used to train the selected method on how to estimate $f$. If $x_{ij}$ represents the value of the $j$th predictor for observation $i$, and $y_i$ represents the response for the $i$th observation, then our training data consists of $\{(x_1,y_1), (x_2,y_2), ..., (x_n,y_n)\}$ with $x_i = (x_{i1}, x_{i2}, x_{i3}, ..., x_{ip})^{T}$.
</p>

<b>Goal:</b> apply a Statistical Learning method to the training data to estimate $f$. These methods can be characterized as either <b>parametric</b> or <b>non-parametric</b>.

### Parametric Methods

This is a two-step, model-based approach that reduces the problem of estimating $f$ down to estimating a set of parameters. 

#### Pros
- This is much easier than fitting an entirely arbitrary function.

#### Cons
- The model chosen will usually not follow the true form of $f$.
- If the fit is too far from the true form, the estimate will be poor.
- While this can be alleviated by using more flexible models with more parameters to estimate, doing so can lead to <b>overfitting</b> wherein the model follows the errors too closely.

#### Steps
1. We first make an assumption about the form of $f$. A linear model for example assumes that $f$ is linear in $X$.
2. We then need a procedure that uses the training data to <i>fit</i> or <i>train</i> the model. For the linear model, this means estimating the parameters $\beta_0, \beta_1, \beta_2, ..., \beta_p$. The most common method to fit this model is through <b>ordinary least squares</b> (check [this](http://econ.ucsb.edu/~doug/240a/The%20Discovery%20of%20Statistical%20Regression.htm#:~:text=Legendre%20was%20the%20first%20to,use%20of%20least%20squares%20regression.) out for a brief history lesson on OLS!).

<p style="text-align:justify">
Applying a linear fit to the <code>Income</code> dataset, wherein we fit $income \approx \beta_0 + \beta_1 \times education + \beta_2 \times seniority$, we see that the fit does a reasonable job of capturing the positive relationship between the predictors and the response, but fails to capture the curvature present in the true $f$ as seen in Figure 2.3.
</p>

```{r Figure 2.4. & Figure 2.5., message = FALSE, fig.align = "center", fig.width = 16, fig.height = 8}
par(mfrow = c(1, 2))

# --- Linear Fit
fit_2_4_lm <- lm(Income ~ Education + Seniority, data = Income2)

# Predict Values on Expanded XY Grid
z.pred_lm <- matrix(predict(fit_2_4_lm, newdata = xy), nrow = 30, ncol = 30)

# 3D Plot of Linear Model
figs_2_4_caption <- "FIGURE 2.4. Linear model fit by OLS to the Income dataset."

Income2 %>% 
  
  scatter3D(
    
    type = "p",
    x = Income2$Education, 
    y = Income2$Seniority, 
    z = Income2$Income,
    colvar = NA, pch = 19, col = "gold", cex = 1.75,
    phi = 25, theta = 45, expand = 0.6,
    xlab = "Years of Education", ylab = "Seniority", zlab = "Income",
    
    # Plot Regression Plane First
    panel.first = scatter3D(
                    x = Income2$Education,
                    y = Income2$Seniority,
                    z = Income2$Income,
                    colvar = NA, col = "black", add = T,
                    surf = list(x = x.pred, y = y.pred, z = z.pred_lm, fit = predict(fit_2_4_lm),
                                facets = T, col = "violet", border = "purple", alpha = 0.45))
    ); text(x = 0, y = 0, pos = 1, offset = 20, font = 3, cex = 1.25, labels = figs_2_4_caption, xpd = T)

# --- Smooth (!) Thin-Plate Spline Fit
Income2_X_scaled <- scale(Income2[c("Education", "Seniority")]) # Scale X first
Income3 <- data.frame(Income2_X_scaled, Income = Income2$Income)

fit_2_5_tp <- gam(Income ~ s(Education, Seniority, bs = "tp"), data = Income3)

# Predict Values on Expanded XY Grid
x.pred_tp <- seq(min(Income3$Education), max(Income3$Education), length.out = 30)
y.pred_tp <- seq(min(Income3$Seniority), max(Income3$Seniority), length.out = 30)
xy_tp     <- expand.grid(Education = x.pred_tp, Seniority = y.pred_tp)
z.pred_tp <- matrix(predict(fit_2_5_tp, newdata = xy_tp), nrow = 30, ncol = 30)

# Transform the Prediction Grid Back to the Original Scale
# (these equal x.pred and y.pred, which are used to draw the surface below)
scaled_center <- attr(Income2_X_scaled, "scaled:center")
scaled_scale  <- attr(Income2_X_scaled, "scaled:scale")
x.pred_orig <- x.pred_tp * scaled_scale[1] + scaled_center[1]
y.pred_orig <- y.pred_tp * scaled_scale[2] + scaled_center[2]

# 3D Plot of Thin-Plate Spline Regression Model
figs_2_5_caption <- "FIGURE 2.5. Smooth thin-plate spline fit."

Income3 %>% 
  
  scatter3D(
    
    type = "p",
    x = Income2$Education,
    y = Income2$Seniority,
    z = Income2$Income,
    colvar = NA, pch = 19, col = "gold", cex = 1.75,
    phi = 25, theta = 45, expand = 0.6,
    xlab = "Years of Education", ylab = "Seniority", zlab = "Income",
    
    # Plot Regression Plane First
    panel.first = scatter3D(
                    x = Income2$Education, 
                    y = Income2$Seniority, 
                    z = Income2$Income,
                    colvar = NA, col = "black", add = T,
                    surf = list(x = x.pred, y = y.pred, z = z.pred_tp, fit = predict(fit_2_5_tp),
                                facets = T, col = "violet", border = "purple", alpha = 0.45))
    ); text(x = 0, y = 0, pos = 1, offset = 20, font = 3, cex = 1.25, labels = figs_2_5_caption, xpd = T)
```

### Non-Parametric Methods

<p style="text-align:justify">
These methods do not make assumptions about the form of $f$, but instead aim to get an estimate of $f$ that gets as close to the data points as possible without "being too rough or wiggly," which I take to mean not overfitting the training data.
</p>

#### Pros
- By not having strict assumptions about $f$, these methods allow for a greater range of shapes while accurately fitting $f$.

#### Cons
- A large number of observations is needed to accurately estimate $f$, far more than a parametric approach requires.

<p style="text-align:justify">
An example is given by Figure 2.5, wherein a <i>thin-plate spline</i> is used to fit $f$ without a pre-specified model. Note, however, that it does so such that the estimated fit is close to the data, but is still <b>smooth</b>. Comparing this to the true $f$ in Figure 2.3, we see that the estimate is indeed accurate. Another characteristic of a non-parametric method such as fitting a thin-plate spline is the need to choose a level of smoothness. The plot below, for example, has a lower level of smoothing and thus results in a <b>rough</b> fit that follows the observed data perfectly. In doing so, we have overfitted on the data, and we can expect inaccurate estimates on new data that is not part of the original training data. Thus, another factor to consider is how we will choose the correct amount of smoothing.
</p>

```{r Figure 2.6., message = FALSE, fig.align = "center", fig.width = 16, fig.height = 8}
# --- Rough (!) Thin-Plate Spline Fit
Income2_X_scaled <- scale(Income2[c("Education", "Seniority")]) # Scale X first
Income3 <- data.frame(Income2_X_scaled, Income = Income2$Income)

# - k = 30 since there are exactly 30 unique points in our data, thereby disabling low-rank approximation
# - sp = 0 to disable penalization
# - bs = "tp" to use the thin-plate regression spline basis

fit_2_6_tp <- gam(Income ~ s(Education, Seniority, k = 30, sp = 0, bs = "tp"), data = Income3)

# Predict Values on Expanded XY Grid
z.pred_rtp <- matrix(predict(fit_2_6_tp, newdata = xy_tp), nrow = 30, ncol = 30)

# 3D Plot of Thin-Plate Spline Regression Model
figs_2_6_caption <- "FIGURE 2.6. Rough thin-plate spline fit with zero errors."

Income3 %>% 
  
  scatter3D(
    
    type = "p",
    x = Income2$Education,
    y = Income2$Seniority,
    z = Income2$Income,
    colvar = NA, pch = 19, col = "gold", cex = 1.75,
    phi = 25, theta = 45, expand = 0.6,
    xlab = "Years of Education", ylab = "Seniority", zlab = "Income",
    
    # Plot Regression Plane First
    panel.first = scatter3D(
                    x = Income2$Education, 
                    y = Income2$Seniority, 
                    z = Income2$Income,
                    colvar = NA, col = "black", add = T,
                    surf = list(x = x.pred, y = y.pred, z = z.pred_rtp, fit = predict(fit_2_6_tp),
                                facets = T, col = "violet", border = "purple", alpha = 0.45))
    ); text(x = 0, y = 0, pos = 1, offset = 20, font = 3, cex = 1.25, labels = figs_2_6_caption, xpd = T)

```
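
To make the choice of smoothing level concrete, here is a hedged sketch that reuses the <code>Income3</code> data frame built above and compares the training MSE of thin-plate spline fits across a small, arbitrary grid of smoothing parameters <code>sp</code>.

```{r Sketch Smoothing Levels, message = FALSE}
# sp = 0 interpolates the training data; larger values approach the smooth fit of Figure 2.5
sapply(c(0, 0.1, 10, 1000), function(spar) {
  fit <- gam(Income ~ s(Education, Seniority, k = 30, sp = spar, bs = "tp"), data = Income3)
  c(sp = spar, training_MSE = mean(residuals(fit)^2))
})

# When sp is left unspecified, mgcv chooses it automatically (GCV by default)
gam(Income ~ s(Education, Seniority, k = 30, bs = "tp"), data = Income3)$sp
```

The training MSE alone always favours the roughest fit, so choosing <code>sp</code> well requires a criterion such as GCV or held-out data, which is exactly the model-assessment question taken up in Section 2.2.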

## 2.1.3. The Trade-Off Between Prediction Accuracy and Model Interpretability

<p style="text-align:justify">
As we saw in some of the examples so far, some statistical learning methods are less flexible/more restrictive than others (e.g. OLS regression). There are reasons for choosing such methods as opposed to the flexible approaches (e.g. thin-plate spline regression):
</p>
- If Inference is our main focus, then these restrictive methods are preferable in that it is easy to understand and interpret the relationship between $X$ and $Y$.
- More flexible methods often lead to complex estimates that make this assessment of individual relationships between each predictor and the response very difficult.
- Even when prediction is the sole interest, these restrictive approaches can sometimes outperform the most flexible method available because the latter may overfit the training data.

## 2.1.4. Supervised Versus Unsupervised Learning

<p style="text-align:justify">
Statistical learning problems can be categorized into <b>supervised</b> and <b>unsupervised</b>. The main difference between the two is that, in the former, each observation has an associated response, and that relationship is modeled with either prediction or inference in mind, whereas in the latter, we have realized values of $x_i$ for each observation $i = 1, ..., n$, but no response. There is no response variable to <i>supervise</i> the method that we are using. One such method is <b>Cluster Analysis</b>, which seeks to group observations into distinct clusters based on the values of the input variables; this may translate, for example, to segmenting customers based on spending habits and demographic factors.
<br>
<br>
Some problems can also call for <b>semi-supervised</b> learning methods wherein a subset of the data contains associated response values while another subset does not. An example of this scenario is when measuring the response is much more expensive than readily being able to measure the predictors. Thus, classifying problems between supervised and unsupervised may not always be clear from the get-go.
</p>

## 2.1.5. Regression versus Classification Problems

<p style="text-align:justify">
Problems with a <b>quantitative</b> response, i.e. one taking numerical values such as age and income, are often referred to as <i>regression</i> problems. Those that deal with a <b>qualitative</b> response, wherein the variable takes a value in one of $K$ different classes/categories (e.g. gender, presence of a disease, product brand), are referred to as <i>classification</i> problems. Whether the predictor variables are of either type is generally less important. This does not mean, however, that less attention should be given to the available features, especially when they are effectively engineered and coded.
</p>

# 2.2. Assessing Model Accuracy

It is necessary to know and understand many different statistical learning methods as there is <b>no single method</b> that works best on all possible data sets.

## 2.2.1. Measuring the Quality of Fit

<p style="text-align:justify">
Evaluating the performance of statistical learning methods requires us to quantify how close the predictions are to the observed values. A very commonly used measure is the <b>Mean Squared Error</b> or MSE.
</p>

$$MSE = \frac{1}{n} \displaystyle\sum_{i = 1}^{n} (y_i - \hat{f}(x_i))^2$$

- Used in the regression setting.
- $\hat{f}(x_i)$ is the predicted value for the <i>i</i>th observation (also $\hat{y}_i$).
- A smaller value denotes predictions that are closer to the true value.
- The MSE computed from the training data used to fit the model is more accurately called the <b>training MSE</b>.
- However, we are more interested in the <b>test MSE</b>, or one that measures prediction accuracy when the method is applied to unseen test data.
- We try to select the model with the lowest test MSE.

For unseen test data $(x_0, y_0)$ we compute the average squared prediction error.

$$Ave(y_0 - \hat{f}(x_0))^2$$

<p style="text-align:justify">
One way to implement the minimization of the test MSE is to set aside data that was not used to train the model to serve as the test data, and to evaluate the methods on that test data. When test data is not available, however, simply relying on the training MSE is not ideal either, since a method can have a very large test MSE despite having a small training MSE. Furthermore, the training MSE is <b>monotonically decreasing</b> as the flexibility of the method increases; relying on the training MSE is likely to push us toward the most flexible/complex method, which can have a test MSE just as bad as even an OLS model (consider when the true $f$ is linear). This is illustrated when the test MSE is plotted as a function of model flexibility, where a <b>U-shaped</b> curve is observed.
<br>
<br>
When the model results in a <b><span style="color:red">small</span></b> training MSE but a <b><span style="color:green">large</span></b> test MSE, then we can say that the model has <b>overfitted</b> on the training data in that it has followed the observed training data too closely, picking up patterns that don't exist in the test data. Another indicator is when a less flexible method would have resulted in a smaller test MSE. Note that we also almost always expect the training MSE to be smaller than the test MSE because "most statistical learning methods either directly or indirectly seek to minimize the training MSE."
</p>
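
The U-shape is easy to reproduce on simulated data. The sketch below, in which the data-generating process and all names are illustrative assumptions, treats polynomial degree as the measure of flexibility and reports training and test MSE for each fit.

```{r Sketch Training vs Test MSE, message = FALSE}
set.seed(1)
f <- function(x) sin(2 * x)                      # assumed true f
x <- runif(200, 0, 3)
y <- f(x) + rnorm(200, sd = 0.3)
train <- 1:100                                   # half for training, half for testing

mse_by_degree <- t(sapply(c(1, 2, 3, 5, 10, 15), function(d) {
  fit  <- lm(y ~ poly(x, d), data = data.frame(x, y)[train, ])
  yhat <- predict(fit, newdata = data.frame(x))
  c(degree    = d,
    train_MSE = mean((y[train]  - yhat[train])^2),
    test_MSE  = mean((y[-train] - yhat[-train])^2))
}))
round(mse_by_degree, 4)
```

The training MSE keeps falling as the degree grows, while the test MSE bottoms out near the flexibility of the true $f$ and then climbs again.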

## 2.2.2. The Bias-Variance Trade-Off

The expected test MSE can always be decomposed into three fundamental terms (for a given value of $x_0$): the <b>variance</b> of $\hat{f}(x_0)$, the <b>squared bias</b> of $\hat{f}(x_0)$, and the <b>variance of the error term</b> $\epsilon$, which is also the irreducible error. As the first two terms are nonnegative, the expected test MSE can never fall below $Var(\epsilon)$.

\begin{align}
MSE & = E(y_0 - \hat{f}(x_0))^2 \\
& = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)
\end{align}

Thus, the optimal statistical learning method must have both the variance and bias be low.

- <b>"Expected Test MSE"</b>: the average test MSE obtained from repeatedly estimating $f$ using a number of training sets, and testing each $\hat{f}$ at $x_0$.
- <b>"Overall Expected Test MSE"</b>: obtained by averaging the former over ALL possible values of $x_0$ in the test set.
- <b>Variance</b>: the amount of variation in the estimated $\hat{f}$ over different training sets. Ideally, $\hat{f}$ should not vary much. High variance in a method can result in large changes in $\hat{f}$ even with small changes in the training set. Generally, more flexible methods have higher variance.
- <b>Bias</b>: the error from approximating a complex phenomenon with a simpler model; also the inaccuracy of the prediction estimates. Generally, more flexible methods have lower bias. 

<p style="text-align:justify">
It is important to note that the rate of change of both the Bias and the Variance as flexibility increases will also depend on the data sets themselves. As the two quantities determine whether the test MSE increases/decreases, the amount of flexibility corresponding to the optimal test MSE differs across data sets. There is a point, however, where increasing the flexibility only slightly decreases the Bias but causes the Variance to go up, consequently increasing the MSE overall. Thus, there is a <b>trade-off</b> between the two, and the challenge is to find the method that optimally minimizes both.
</p>
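
As a sanity check on this decomposition, the sketch below estimates each term by Monte Carlo at a single point $x_0$; the data-generating process, the chosen polynomial degree, and all names are illustrative assumptions.

```{r Sketch Bias-Variance Decomposition, message = FALSE}
set.seed(1)
f      <- function(x) sin(2 * x)        # assumed true f
sigma  <- 0.3                           # sd of the irreducible error
x0     <- 1.5                           # point at which the decomposition is evaluated
degree <- 3                             # flexibility of the fitted method

# Repeatedly re-estimate f on fresh training sets and predict at x0
fhat_x0 <- replicate(2000, {
  x <- runif(50, 0, 3)
  y <- f(x) + rnorm(50, sd = sigma)
  predict(lm(y ~ poly(x, degree)), newdata = data.frame(x = x0))
})

y0 <- f(x0) + rnorm(2000, sd = sigma)   # fresh test responses at x0

c(expected_test_MSE = mean((y0 - fhat_x0)^2),
  variance          = var(fhat_x0),
  bias_squared      = (mean(fhat_x0) - f(x0))^2,
  irreducible       = sigma^2)
```

The first entry should be close to the sum of the other three; increasing <code>degree</code> shifts the balance away from bias and toward variance.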

## 2.2.3. The Classification Setting

<p style="text-align:justify">
The aforementioned concepts also apply to the classification setting with changes due to $y_i$ being qualitative instead of quantitative when we estimate $f$ over the training data. The most common approach to quantify the accuracy of the estimated $f$ is the <b>training error rate</b> which computes the proportion of incorrect classifications after applying $\hat{f}$ to the training data.
</p>

$$
\frac{1}{n} \displaystyle\sum_{i = 1}^{n} I(y_i \neq \hat{y}_i)
\\
\text{where}\quad I(y_i \neq \hat{y}_i) = 
        \begin{cases}
          0 & \text{if $y_i = \hat{y}_i$, or correct classification} \\
          1 & \text{if $y_i \neq \hat{y}_i$, or wrong classification}
        \end{cases}
$$

<p style="text-align:justify">
The <b>test error rate</b>, which results from applying the classifier to the test data $(x_0, y_0)$, is given by the measure below, where $\hat{y}_0$ is the predicted class label. We generally choose the model with the lowest test error rate.
</p>

$$Ave(I(y_0 \neq \hat{y}_0))$$
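
As a quick illustration on simulated data (all names are illustrative, and an off-the-shelf logistic regression stands in for an arbitrary classifier), both error rates are simply averages of the indicator above over the corresponding observations.

```{r Sketch Error Rates, message = FALSE}
set.seed(1)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 + df$x2 + rnorm(n) > 0, "B", "A"))

train <- 1:100
fit   <- glm(y ~ x1 + x2, family = binomial, data = df[train, ])

# glm models Pr(y = "B"), the second factor level; classify to "B" when it exceeds 0.5
prob_B <- predict(fit, newdata = df, type = "response")
y_hat  <- factor(ifelse(prob_B > 0.5, "B", "A"), levels = levels(df$y))

mean(df$y[train]  != y_hat[train])    # training error rate
mean(df$y[-train] != y_hat[-train])   # test error rate
```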

### The Bayes Classifier

<p style="text-align:justify">
The above test error rate can be shown (see Section 2.4 of the <i>Elements of Statistical Learning</i>) to be minimized, on average, by classifying each observation to its most probable class, i.e. the class $j$ for which the <b>conditional probability</b> below, given the test observation's predictor vector $x_0$, is <b>largest</b>. This is referred to as the <b>Bayes Classifier</b>:
</p>

$$Pr(Y = j\ |\ X = x_0)$$
<p style="text-align:justify">
In a two-class setting, this classifier corresponds to predicting <i>class one</i> when $Pr(Y = 1\ |\ X = x_0) > 0.5$, and <i>class two</i> otherwise. If plotted, the line drawn through the points where the conditional probability is exactly 50% is called the <b>Bayes Decision Boundary</b>. Additionally, the Bayes Classifier produces the <b>Bayes Error Rate</b>, which is the lowest test error rate achievable by any classifier; since $Y$ is random given $X$, this rate is generally above zero, and it is analogous to the irreducible error. It is given below, where the expectation averages the probability over all possible values of $X$:
</p>

$$1 - E\left(\underset{j}{\max}\ Pr(Y = j\ |\ X)\right)$$
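
When we get to choose the data-generating process, both quantities can be computed directly. In the sketch below, the logistic form of the conditional probability is an arbitrary assumption; the Bayes error rate is approximated by Monte Carlo.

```{r Sketch Bayes Classifier, message = FALSE}
set.seed(1)
x <- runif(1e5, -3, 3)                # draws from the marginal distribution of X
p <- plogis(2 * x)                    # assumed true Pr(Y = 1 | X = x)

bayes_class <- ifelse(p > 0.5, 1, 2)  # classify each point to its most probable class
                                      # (the Bayes decision boundary here is x = 0)
bayes_error <- mean(pmin(p, 1 - p))   # approximates 1 - E[max_j Pr(Y = j | X)]
bayes_error
```

No classifier trained on data from this process can achieve a lower expected test error rate than <code>bayes_error</code>.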

### K-Nearest Neighbors

<p style="text-align:justify">
When the conditional distribution of $Y$ given $X$ is not known, as with real-world data, the Bayes Classifier cannot be computed. <b>K-Nearest Neighbors</b> (KNN) is an approach that estimates this conditional distribution and classifies a test observation to the class with the highest estimated probability. This is done by the following steps, sketched from scratch in the code after the list:
</p>
1. Choosing the $K$ points in the training data closest to the test observation $x_0$, a set represented by $N_0$
2. Estimating the conditional probability for class $j$, or the number of points with response values equal to $j$ divided by $K$:<br><center>$Pr(Y = j\ |\ X = x_0) = \frac{1}{K}\displaystyle\sum_{i\ \in\ N_0}I(y_i = j)$</center>
3. Classifying the observation to the class with the largest probability
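
Here is a minimal from-scratch sketch of those three steps; the helper <code>knn_predict()</code> and the simulated data are purely illustrative (in practice, <code>knn()</code> from the <code>class</code> package does this job).

```{r Sketch KNN From Scratch, message = FALSE}
knn_predict <- function(X_train, y_train, x0, K = 3) {
  # Step 1: the K training points closest to x0 form the neighbourhood N0
  dists <- sqrt(rowSums(sweep(X_train, 2, x0)^2))
  N0    <- order(dists)[1:K]
  # Step 2: estimate Pr(Y = j | X = x0) by the class proportions within N0
  probs <- table(y_train[N0]) / K
  # Step 3: classify x0 to the class with the largest estimated probability
  names(probs)[which.max(probs)]
}

set.seed(1)
X <- matrix(rnorm(100), ncol = 2)
y <- ifelse(X[, 1] + X[, 2] > 0, "orange", "blue")
knn_predict(X, y, x0 = c(0.5, 0.5), K = 3)
```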

The KNN classifier can perform surprisingly close to the Bayes Classifier, but the choice of $K$ affects the flexibility of the model.

- When $K$ = 1, the classifier has low bias and high variance.
- As $K$ increases, it becomes less flexible and produces a decision boundary that is close to linear; i.e. it has high bias and low variance.

<p style="text-align:justify">
Similar to the regression setting, the training error rate decreases as the flexibility increases. The test error rate also similarly exhibits a U-shaped curve where the model eventually overfits.
</p>
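
A hedged sketch of that behaviour, using <code>knn()</code> from the <code>class</code> package on simulated data (the data-generating process and the grid of $K$ values are arbitrary choices):

```{r Sketch KNN Flexibility, message = FALSE}
library(class)

set.seed(1)
n <- 400
X <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(ifelse(X[, 1]^2 + X[, 2] + rnorm(n, sd = 0.5) > 0, "A", "B"))
train <- 1:200

err <- sapply(c(1, 5, 10, 25, 50, 100), function(K) {
  yhat_train <- knn(train = X[train, ], test = X[train, ],  cl = y[train], k = K)
  yhat_test  <- knn(train = X[train, ], test = X[-train, ], cl = y[train], k = K)
  c(K     = K,
    train = mean(yhat_train != y[train]),
    test  = mean(yhat_test  != y[-train]))
})
round(err, 3)
```

The training error rate is (near) zero at $K = 1$ and grows with $K$, while the test error rate typically falls at first and then rises again once the decision boundary becomes too rigid.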
