
§1 Statistical Learning

  1. The regression function $f(x)$
  2. How to estimate $f$
  3. Parametric and structured models
  4. Assessing Model Accuracy
  5. Bias-Variance Trade-off
  6. Classification Problems
  7. Classification: some details

The regression function $f(x)$

  • This ideal $f(x)=E(Y \mid X=x)$ is called the regression function.

  • $f(x)=E(Y \mid X=x)$ is the function that minimizes $E\left[(Y-g(X))^{2} \mid X=x\right]$ over all functions $g$ at all points $X=x$.

  • $\epsilon=Y-f(x)$ is the irreducible error.

  • For any estimate $\hat{f}(x)$ of $f(x)$, we have

    $$E\left[(Y-\hat{f}(X))^{2} \mid X=x\right]=\underbrace{\left[f(x)-\hat{f}(x)\right]^{2}}_{\text{Reducible}}+\underbrace{\operatorname{Var}(\epsilon)}_{\text{Irreducible}}$$
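
A minimal simulation of this identity at a single point $x$, assuming (hypothetically) a true regression function $f(x)=\sin(x)$, noise variance $\operatorname{Var}(\epsilon)=0.25$, and a deliberately biased estimate $\hat{f}$:

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.sin          # assumed true regression function
sigma2 = 0.25       # assumed irreducible noise variance Var(eps)
x = 1.0             # the point we condition on

def fhat(t):
    return np.sin(t) + 0.3          # a deliberately biased estimate of f

# Draws of Y | X = x, then a Monte Carlo estimate of E[(Y - fhat(X))^2 | X = x]
y = f(x) + rng.normal(0.0, np.sqrt(sigma2), size=200_000)
mc = np.mean((y - fhat(x)) ** 2)

# Reducible + irreducible parts from the identity above
theory = (f(x) - fhat(x)) ** 2 + sigma2
print(f"simulated: {mc:.4f}   reducible + irreducible: {theory:.4f}")
```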

How to estimate $f$

  • Relax the definition and let

    $$\hat{f}(x)=\operatorname{Ave}(Y \mid X \in \mathcal{N}(x))$$

    where $\mathcal{N}(x)$ is some neighborhood of $x$.

  • Nearest neighbor methods can be lousy when $p$ is large.

    Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
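
A minimal sketch of the neighborhood average $\hat{f}(x)=\operatorname{Ave}(Y \mid X \in \mathcal{N}(x))$ with a single predictor, taking $\mathcal{N}(x)$ to be the $k$ nearest training points (the simulated data and $k=15$ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, 200)                     # simulated predictor
y_train = np.sin(x_train) + rng.normal(0, 0.3, 200)   # simulated response

def nn_average(x, x_train, y_train, k=15):
    """Average the responses of the k training points closest to x."""
    idx = np.argsort(np.abs(x_train - x))[:k]
    return y_train[idx].mean()

print(nn_average(5.0, x_train, y_train))   # estimate of f(5.0) = sin(5.0)
```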

Parametric and structured models

  • The linear model is an important example of a parametric model:

    $$f_{L}(X)=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}$$

    • A linear model is specified in terms of $p+1$ parameters $\beta_{0}, \beta_{1}, \ldots, \beta_{p}$.
    • We estimate the parameters by fitting the model to training data.
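
A minimal sketch of this fit using ordinary least squares in NumPy, with $p=2$ simulated predictors (the true coefficients below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -3.0])       # assumed beta_0, beta_1, beta_2
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.5, n)

X1 = np.column_stack([np.ones(n), X])        # prepend the intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)                              # should be close to beta_true
```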

Assessing Model Accuracy

  • Suppose we fit a model $\hat{f}(x)$ to some training data $\mathrm{Tr}=\left\{x_{i}, y_{i}\right\}_{1}^{N}$, and we wish to see how well it performs.

    We could compute the average squared prediction error over $\mathrm{Tr}$:

    $$\operatorname{MSE}_{\mathrm{Tr}}=\operatorname{Ave}_{i \in \mathrm{Tr}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2}$$

  • This may be biased toward more overfit models.

    Instead we should, if possible, compute it using fresh test data $\mathrm{Te}=\left\{x_{i}, y_{i}\right\}_{1}^{M}$:

    $$\operatorname{MSE}_{\mathrm{Te}}=\operatorname{Ave}_{i \in \mathrm{Te}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2}$$
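
A minimal sketch on simulated data (an assumed sinusoidal model) contrasting $\operatorname{MSE}_{\mathrm{Tr}}$ and $\operatorname{MSE}_{\mathrm{Te}}$ for polynomial fits of increasing flexibility; the training error keeps falling while the test error eventually rises:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    x = rng.uniform(-2, 2, n)
    return x, np.sin(x) + rng.normal(0, 0.3, n)

x_tr, y_tr = simulate(50)    # training data Tr
x_te, y_te = simulate(50)    # fresh test data Te

for degree in (1, 3, 10):
    coefs = np.polyfit(x_tr, y_tr, degree)
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}:  MSE_Tr = {mse_tr:.3f}  MSE_Te = {mse_te:.3f}")
```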

Bias-Variance Trade-off

  • Suppose we have fit a model $\hat{f}(x)$ to some training data $\mathrm{Tr}$, and let $\left(x_{0}, y_{0}\right)$ be a test observation drawn from the population. If the true model is $Y=f(X)+\epsilon$ (with $f(x)=E(Y \mid X=x)$), then

    $$E\left(y_{0}-\hat{f}\left(x_{0}\right)\right)^{2}=\operatorname{Var}\left(\hat{f}\left(x_{0}\right)\right)+\left[\operatorname{Bias}\left(\hat{f}\left(x_{0}\right)\right)\right]^{2}+\operatorname{Var}(\epsilon)$$

    The expectation averages over the variability of $y_{0}$ as well as the variability in $\mathrm{Tr}$. Note that $\operatorname{Bias}\left(\hat{f}\left(x_{0}\right)\right)=E\left[\hat{f}\left(x_{0}\right)\right]-f\left(x_{0}\right)$.
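
A Monte Carlo check of this decomposition at a single test point $x_0$, using a nearest-neighbor fit; the true model, noise level, and neighborhood size below are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
f, sigma = np.sin, 0.3                        # assumed true f and noise sd
x0, k, n, reps = 1.0, 10, 100, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(-2, 2, n)                 # a fresh training set Tr
    y = f(x) + rng.normal(0, sigma, n)
    idx = np.argsort(np.abs(x - x0))[:k]      # k nearest neighbors of x0
    preds[r] = y[idx].mean()                  # fhat(x0) fit on this Tr

y0 = f(x0) + rng.normal(0, sigma, reps)       # independent test responses at x0
lhs = np.mean((y0 - preds) ** 2)              # E(y0 - fhat(x0))^2
rhs = preds.var() + (preds.mean() - f(x0)) ** 2 + sigma ** 2
print(f"expected test error: {lhs:.4f}   var + bias^2 + Var(eps): {rhs:.4f}")
```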

Classification Problems

  • Here the response variable $Y$ is qualitative. Our goals are to:

    • Build a classifier $C(X)$ that assigns a class label from $\mathcal{C}$ to a future unlabeled observation $X$.
    • Assess the uncertainty in each classification.
    • Understand the roles of the different predictors among $X=\left(X_{1}, X_{2}, \ldots, X_{p}\right)$.
  • Suppose the $K$ elements in $\mathcal{C}$ are numbered $1,2, \ldots, K$. Let

    $$p_{k}(x)=\operatorname{Pr}(Y=k \mid X=x), \quad k=1,2, \ldots, K$$

    These are the conditional class probabilities at $x$. Then the Bayes optimal classifier at $x$ (sketched in code after this list) is

    $$C(x)=j \text{ if } p_{j}(x)=\max \left\{p_{1}(x), p_{2}(x), \ldots, p_{K}(x)\right\}$$

  • Nearest-neighbor averaging can be used as before.

    It also breaks down as the dimension grows. However, the impact on $\hat{C}(x)$ is less than on $\hat{p}_{k}(x)$, $k=1, \ldots, K$.
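
A minimal sketch of the Bayes rule for a hypothetical two-class problem in which the class-conditional distributions of $X$ are Gaussian, so the $p_k(x)$ can be computed exactly:

```python
import numpy as np

priors = np.array([0.5, 0.5])     # assumed Pr(Y = 1), Pr(Y = 2)
means = np.array([-1.0, 1.0])     # assumed class-conditional means of X

def class_probs(x):
    """p_k(x) = Pr(Y = k | X = x) via Bayes' theorem, k = 1, 2."""
    dens = np.exp(-0.5 * (x - means) ** 2) / np.sqrt(2 * np.pi)  # N(mean, 1)
    post = dens * priors
    return post / post.sum()

def bayes_classifier(x):
    """C(x) = argmax_k p_k(x); classes are numbered 1 and 2."""
    return int(np.argmax(class_probs(x))) + 1

print(class_probs(0.3))        # conditional class probabilities at x = 0.3
print(bayes_classifier(0.3))   # Bayes-optimal label at x = 0.3
```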

Classification: some details

  • Typically we measure the performance of $\hat{C}(x)$ using the misclassification error rate (computed in the sketch at the end of this section):

    $$\operatorname{Err}_{\mathrm{Te}}=\operatorname{Ave}_{i \in \mathrm{Te}} I\left[y_{i} \neq \hat{C}\left(x_{i}\right)\right]$$

  • The Bayes classifier (using the true $p_{k}(x)$) has the smallest error (in the population).

  • K-nearest neighbors (KNN) classifier: Given a positive integer $K$ and a test observation $x_{0}$, KNN first identifies a set of $K$ points in the training data that are closest to $x_{0}$, denoted by $\mathcal{N}_{0}$. It then estimates the conditional probability for class $j$ by

    $$\operatorname{Pr}\left(Y=j \mid X=x_{0}\right)=\frac{1}{K} \sum_{i \in \mathcal{N}_{0}} I\left(y_{i}=j\right)$$

    Finally, assign $x_{0}$ to the class $j$ with the largest probability.
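
A minimal sketch of the KNN classifier together with the test misclassification rate $\operatorname{Err}_{\mathrm{Te}}$ from above, on simulated two-class data (the data-generating model and $K=5$ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Two classes in two dimensions, centered at (-1, -1) and (1, 1)."""
    y = rng.integers(1, 3, n)                                   # labels 1 and 2
    x = rng.normal(np.where(y == 1, -1.0, 1.0), 1.0, (2, n)).T  # n x 2 features
    return x, y

x_tr, y_tr = simulate(200)
x_te, y_te = simulate(200)

def knn_predict(x0, x_train, y_train, k=5):
    """Majority vote among the k training points nearest to x0."""
    idx = np.argsort(np.linalg.norm(x_train - x0, axis=1))[:k]
    return np.bincount(y_train[idx]).argmax()

pred = np.array([knn_predict(x0, x_tr, y_tr) for x0 in x_te])
err_te = np.mean(pred != y_te)                # Err_Te over the test set
print(f"test misclassification rate: {err_te:.3f}")
```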

— Jul 15, 2022

§1 Statistical Learning by Lu Meng is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at About.