
§4 Cross-validation and the Bootstrap

  1. Training Error versus Test Error
    1. More on prediction-error estimates
  2. Validation-set approach
    1. The validation process
    2. Drawbacks of validation set approach
  3. $K$-fold Cross-validation
    1. The details
    2. A nice special case!
    3. Other issues with Cross-validation
    4. Cross-Validation for Classification Problems
  4. The Bootstrap
    1. A simple example
    2. Now back to the real world
    3. The bootstrap in general
    4. Other uses of the bootstrap
    5. Can the bootstrap estimate prediction error?
    6. Removing the overlap

Training Error versus Test Error

  • Recall the distinction between the test error and the training error:
  • The test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method.
  • In contrast, the training error can be easily calculated by applying the statistical learning method to the observations used in its training.
  • But the training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter.

More on prediction-error estimates

  • Best solution: a large designated test set. Often not available.
  • Some methods make a mathematical adjustment to the training error rate in order to estimate the test error rate. These include the $C_p$ statistic, AIC, and BIC. They are discussed elsewhere in this course.
  • Here we instead consider a class of methods that estimate the test error by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held out observations.

Validation-set approach

  • Here we randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.
  • The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.
  • The resulting validation-set error provides an estimate of the test error. This is typically assessed using MSE in the case of a quantitative response and misclassification rate in the case of a qualitative (discrete) response.

The validation process

  • The observations are randomly split into two halves: one half is used as the training set and the other as the validation (hold-out) set.
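
A minimal sketch of this split-fit-evaluate cycle, using scikit-learn on simulated data (the data set, the 50/50 split, and the linear model are illustrative choices, not taken from the text):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Simulated data standing in for a real data set
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)

# Randomly split the samples into a training half and a validation (hold-out) half
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Fit on the training set, then estimate the test error on the validation set
model = LinearRegression().fit(X_tr, y_tr)
val_mse = np.mean((y_val - model.predict(X_val)) ** 2)   # validation-set MSE
print(val_mse)
```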

Drawbacks of validation set approach

  • The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
  • In the validation approach, only a subset of the observations — those that are included in the training set rather than in the validation set — are used to fit the model.
  • This suggests that the validation set error may tend to overestimate the test error for the model fit on the entire data set.

$K$-fold Cross-validation

  • Widely used approach for estimating test error.
  • Estimates can be used to select best model, and to give an idea of the test error of the final chosen model.
  • The idea is to randomly divide the data into $K$ equal-sized parts. We leave out part $k$, fit the model to the other $K-1$ parts (combined), and then obtain predictions for the left-out $k$th part.
  • This is done in turn for each part $k=1,2,\ldots,K$, and then the results are combined.

The details

  • Let the $K$ parts be $C_1, C_2, \ldots, C_K$, where $C_k$ denotes the indices of the observations in part $k$. There are $n_k$ observations in part $k$: if $n$ is a multiple of $K$, then $n_k = n/K$.

  • Compute

    $$\mathrm{CV}_{(K)}=\sum_{k=1}^{K} \frac{n_{k}}{n} \mathrm{MSE}_{k}$$

    where $\operatorname{MSE}_{k}=\sum_{i \in C_{k}}\left(y_{i}-\hat{y}_{i}\right)^{2} / n_{k}$, and $\hat{y}_{i}$ is the fit for observation $i$, obtained from the data with part $k$ removed.

  • Setting $K=n$ yields $n$-fold or leave-one-out cross-validation (LOOCV).
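
A minimal sketch of the computation above, assuming a generic model given as `fit` and `predict` callables and NumPy arrays `X`, `y` (all names here are illustrative):

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, K=10, seed=0):
    """K-fold cross-validation estimate of test MSE.

    fit(X_train, y_train) -> model; predict(model, X_new) -> predictions.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                 # random assignment of observations
    folds = np.array_split(idx, K)           # roughly equal-sized parts C_1, ..., C_K
    cv = 0.0
    for C_k in folds:
        train = np.setdiff1d(idx, C_k)       # the other K-1 parts, combined
        model = fit(X[train], y[train])
        y_hat = predict(model, X[C_k])
        mse_k = np.mean((y[C_k] - y_hat) ** 2)
        cv += (len(C_k) / n) * mse_k         # weight MSE_k by n_k / n
    return cv
```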

A nice special case!

  • With least-squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit! The following formula holds:

    $$\mathrm{CV}_{(n)}=\frac{1}{n} \sum_{i=1}^{n}\left(\frac{y_{i}-\hat{y}_{i}}{1-h_{i}}\right)^{2}$$

    where $\hat{y}_{i}$ is the $i$th fitted value from the original least squares fit, and $h_{i}$ is the leverage (the $i$th diagonal element of the “hat” matrix; see the book for details). This is like the ordinary MSE, except that the $i$th residual is divided by $1-h_{i}$. A small numerical check appears after this list.

  • LOOCV is sometimes useful, but it typically doesn’t shake up the data enough: the $n$ training sets are nearly identical, so the estimates from each fold are highly correlated and hence their average can have high variance.

  • A better choice is $K=5$ or $K=10$.
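
A small numerical check of the shortcut formula on simulated simple-linear-regression data (an illustrative setup, not from the text): $\mathrm{CV}_{(n)}$ computed from the leverages of a single fit should match LOOCV done explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)        # simulated regression data
X = np.column_stack([np.ones(n), x])          # design matrix with an intercept

# One least-squares fit: fitted values and leverages h_i from the hat matrix
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
cv_shortcut = np.mean(((y - y_hat) / (1 - h)) ** 2)

# Explicit leave-one-out cross-validation for comparison (n separate fits)
errs = []
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    errs.append((y[i] - X[i] @ b) ** 2)
cv_explicit = np.mean(errs)

print(cv_shortcut, cv_explicit)               # the two values agree up to rounding
```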

Other issues with Cross-validation

  • Since each training set is only $(K-1)/K$ as big as the original training set, the estimates of prediction error will typically be biased upward.
  • This bias is minimized when $K=n$ (LOOCV), but this estimate has high variance, as noted earlier.
  • $K=5$ or $K=10$ provides a good compromise for this bias-variance tradeoff.

Cross-Validation for Classification Problems

  • We divide the data into $K$ roughly equal-sized parts $C_1, C_2, \ldots, C_K$, where $C_k$ denotes the indices of the observations in part $k$. There are $n_k$ observations in part $k$: if $n$ is a multiple of $K$, then $n_k = n/K$.

  • Compute

    $$\mathrm{CV}_{K}=\sum_{k=1}^{K} \frac{n_{k}}{n} \operatorname{Err}_{k}$$

    where $\operatorname{Err}_{k}=\sum_{i \in C_{k}} I\left(y_{i} \neq \hat{y}_{i}\right) / n_{k}$ is the misclassification rate in part $k$.

  • The estimated standard deviation of $\mathrm{CV}_{K}$ is

    $$\widehat{\mathrm{SE}}\left(\mathrm{CV}_{K}\right)=\sqrt{\sum_{k=1}^{K}\left(\operatorname{Err}_{k}-\overline{\operatorname{Err}_{k}}\right)^{2} /(K-1)}$$
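
Given the per-fold error rates, both formulas are one-liners. A sketch with an illustrative vector of fold error rates (the numbers below are made up for demonstration):

```python
import numpy as np

# Illustrative per-fold misclassification rates Err_1, ..., Err_K for K = 10 folds
err_k = np.array([0.22, 0.18, 0.25, 0.20, 0.15, 0.23, 0.19, 0.21, 0.17, 0.24])
n_k = np.full(10, 50)                              # fold sizes: here n_k = 50, so n = 500
K = len(err_k)

cv_K = np.sum(n_k / n_k.sum() * err_k)             # CV_K: weighted average of Err_k
se_cv = np.sqrt(np.sum((err_k - err_k.mean()) ** 2) / (K - 1))   # estimated SD of CV_K
print(cv_K, se_cv)
```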

The Bootstrap

  • The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.
  • For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient.

A simple example

  • Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of $X$ and $Y$, respectively, where $X$ and $Y$ are random quantities.

  • We will invest a fraction $\alpha$ of our money in $X$, and will invest the remaining $1-\alpha$ in $Y$.

  • We wish to choose $\alpha$ to minimize the total risk, or variance, of our investment. In other words, we want to minimize $\operatorname{Var}(\alpha X+(1-\alpha) Y)$.

  • One can show that the value that minimizes the risk is given by

    $$\alpha=\frac{\sigma_{Y}^{2}-\sigma_{X Y}}{\sigma_{X}^{2}+\sigma_{Y}^{2}-2 \sigma_{X Y}}$$

    where $\sigma_{X}^{2}=\operatorname{Var}(X)$, $\sigma_{Y}^{2}=\operatorname{Var}(Y)$, and $\sigma_{X Y}=\operatorname{Cov}(X, Y)$.

  • But the values of $\sigma_{X}^{2}$, $\sigma_{Y}^{2}$, and $\sigma_{X Y}$ are unknown.

  • We can compute estimates for these quantities, $\hat{\sigma}_{X}^{2}$, $\hat{\sigma}_{Y}^{2}$, and $\hat{\sigma}_{X Y}$, using a data set that contains measurements for $X$ and $Y$.

  • We can then estimate the value of $\alpha$ that minimizes the variance of our investment using

    $$\hat{\alpha}=\frac{\hat{\sigma}_{Y}^{2}-\hat{\sigma}_{X Y}}{\hat{\sigma}_{X}^{2}+\hat{\sigma}_{Y}^{2}-2 \hat{\sigma}_{X Y}}$$

  • To estimate the standard deviation of $\hat{\alpha}$, we repeated the process of simulating 100 paired observations of $X$ and $Y$ and estimating $\alpha$ 1,000 times.

  • We thereby obtained 1,000 estimates for $\alpha$, which we can call $\hat{\alpha}_{1}, \hat{\alpha}_{2}, \ldots, \hat{\alpha}_{1000}$.

  • For these simulations the parameters were set to $\sigma_{X}^{2}=1$, $\sigma_{Y}^{2}=1.25$, and $\sigma_{X Y}=0.5$, so we know that the true value of $\alpha$ is $(1.25-0.5)/(1+1.25-2 \cdot 0.5)=0.6$.

  • The mean over all 1,000 estimates for $\alpha$ is

    $$\bar{\alpha}=\frac{1}{1000} \sum_{r=1}^{1000} \hat{\alpha}_{r}=0.5996$$

    very close to $\alpha=0.6$, and the standard deviation of the estimates is

    $$\sqrt{\frac{1}{1000-1} \sum_{r=1}^{1000}\left(\hat{\alpha}_{r}-\bar{\alpha}\right)^{2}}=0.083$$

  • This gives us a very good idea of the accuracy of $\hat{\alpha}$: $\mathrm{SE}(\hat{\alpha}) \approx 0.083$.

  • So, roughly speaking, for a random sample from the population we would expect $\hat{\alpha}$ to differ from $\alpha$ by approximately $0.08$, on average.
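
A minimal sketch of this simulation, assuming $X$ and $Y$ are drawn from a bivariate normal distribution with the stated variances and covariance (the distributional choice and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.5],
                [0.5, 1.25]])                 # Var(X)=1, Var(Y)=1.25, Cov(X,Y)=0.5

def alpha_hat(x, y):
    """Plug-in estimate of the variance-minimizing allocation alpha."""
    sx2, sy2 = np.var(x, ddof=1), np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    return (sy2 - sxy) / (sx2 + sy2 - 2 * sxy)

# Simulate 100 paired observations and estimate alpha; repeat 1,000 times
estimates = np.empty(1000)
for r in range(1000):
    sample = rng.multivariate_normal([0.0, 0.0], cov, size=100)
    estimates[r] = alpha_hat(sample[:, 0], sample[:, 1])

print(estimates.mean())                       # close to the true alpha = 0.6
print(estimates.std(ddof=1))                  # roughly 0.08, the SD of alpha-hat
```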

Now back to the real world

  • The procedure outlined above cannot be applied, because for real data we cannot generate new samples from the original population.

  • However, the bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets, so that we can estimate the variability of our estimate without generating additional samples.
    • Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement.

  • Each of these “bootstrap data sets” is created by sampling with replacement, and is the same size as our original dataset. As a result some observations may appear more than once in a given bootstrap data set and some not at all.

  • Denoting the first bootstrap data set by $Z^{*1}$, we use $Z^{*1}$ to produce a new bootstrap estimate for $\alpha$, which we call $\hat{\alpha}^{*1}$.

  • This procedure is repeated $B$ times for some large value of $B$ (say 100 or 1000), in order to produce $B$ different bootstrap data sets, $Z^{*1}, Z^{*2}, \ldots, Z^{*B}$, and $B$ corresponding $\alpha$ estimates, $\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, \ldots, \hat{\alpha}^{*B}$.

  • We estimate the standard error of these bootstrap estimates using the formula

    $$\operatorname{SE}_{B}(\hat{\alpha})=\sqrt{\frac{1}{B-1} \sum_{r=1}^{B}\left(\hat{\alpha}^{* r}-\overline{\hat{\alpha}}^{*}\right)^{2}}$$

    where $\overline{\hat{\alpha}}^{*}$ is the average of the $B$ bootstrap estimates.

  • This serves as an estimate of the standard error of $\hat{\alpha}$ estimated from the original data set. For this example, $\operatorname{SE}_{B}(\hat{\alpha})=0.087$.
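
A minimal sketch of this bootstrap loop, written for a generic statistic (the function and array names are illustrative; `alpha_stat` mirrors the `alpha_hat` helper in the simulation sketch above):

```python
import numpy as np

def bootstrap_se(data, statistic, B=1000, seed=0):
    """Bootstrap estimate of the standard error of statistic(data).

    data: array with one row per observation; statistic: callable on such an array.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.array([
        statistic(data[rng.integers(0, n, size=n)])   # resample n rows with replacement
        for _ in range(B)
    ])
    return estimates.std(ddof=1)                       # SE_B: SD of the B bootstrap estimates

def alpha_stat(xy):
    """alpha-hat computed from an (n, 2) array of paired (X, Y) observations."""
    x, y = xy[:, 0], xy[:, 1]
    sx2, sy2 = np.var(x, ddof=1), np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    return (sy2 - sxy) / (sx2 + sy2 - 2 * sxy)

# Usage on a single observed data set `xy` of 100 paired returns:
#   se_b = bootstrap_se(xy, alpha_stat, B=1000)
```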

The bootstrap in general

  • In more complex data situations, figuring out the appropriate way to generate bootstrap samples can require some thought.
  • For example, if the data form a time series, we can’t simply sample the observations with replacement (why not?).
  • We can instead create blocks of consecutive observations and sample those with replacement; pasting the sampled blocks together gives a bootstrap data set.
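
A minimal sketch of this block idea for a univariate series (the block length and names are illustrative; real block-bootstrap implementations take more care, for example in choosing the block length):

```python
import numpy as np

def block_bootstrap_sample(series, block_len=20, seed=0):
    """One bootstrap replicate of a time series, built by pasting resampled blocks."""
    rng = np.random.default_rng(seed)
    n = len(series)
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)   # random block start points
    blocks = [series[s:s + block_len] for s in starts]           # consecutive-observation blocks
    return np.concatenate(blocks)[:n]                            # paste together, trim to length n
```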

Other uses of the bootstrap

  • Primarily used to obtain standard errors of an estimate.
  • Also provides approximate confidence intervals for a population parameter.
  • Taking, say, the 2.5% and 97.5% quantiles of the $B$ bootstrap estimates gives an approximate 95% interval; an interval formed this way is called a Bootstrap Percentile confidence interval. It is the simplest method (among many approaches) for obtaining a confidence interval from the bootstrap.
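
A sketch of the percentile interval, assuming a vector of $B$ bootstrap estimates is available (here an illustrative placeholder array rather than real bootstrap output):

```python
import numpy as np

# Placeholder for the B bootstrap estimates alpha-hat*1, ..., alpha-hat*B
boot_estimates = np.random.default_rng(0).normal(0.6, 0.087, size=1000)

# 95% bootstrap percentile confidence interval: quantiles of the bootstrap distribution
lo, hi = np.quantile(boot_estimates, [0.025, 0.975])
print(lo, hi)
```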

Can the bootstrap estimate prediction error?

  • In cross-validation, each of the $K$ validation folds is distinct from the other $K-1$ folds used for training: there is no overlap. This is crucial for its success.
  • To estimate prediction error using the bootstrap, we could think about using each bootstrap dataset as our training sample, and the original sample as our validation sample.
  • But each bootstrap sample has significant overlap with the original data: about two-thirds of the original data points appear in each bootstrap sample (see the short calculation after this list).
  • This will cause the bootstrap to seriously underestimate the true prediction error.
  • The other way around - with original sample = training sample, bootstrap dataset = validation sample - is worse!
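
The two-thirds figure comes from a short calculation: each of the $n$ draws misses a given observation $i$ with probability $1-1/n$, so

    $$P\left(\text{observation } i \in Z^{*}\right)=1-\left(1-\frac{1}{n}\right)^{n} \approx 1-e^{-1} \approx 0.632$$

for large $n$.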

Removing the overlap

  • Can partly fix this problem by only using predictions for those observations that did not (by chance) occur in the current bootstrap sample.
  • But the method gets complicated, and in the end, cross-validation provides a simpler, more attractive approach for estimating prediction error.

— Jul 15, 2022

§4 Cross-validation and the Bootstrap by Lu Meng is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at About.