
§1 Statistical Learning

  1. The regression function $f(x)$
  2. How to estimate $f$
  3. Parametric and structured models
  4. Assessing Model Accuracy
  5. Bias-Variance Trade-off
  6. Classification Problems
  7. Classification: some details

The regression function $f(x)$

  • This ideal $f(x)=E(Y \mid X=x)$ is called the regression function.

  • $f(x)=E(Y \mid X=x)$ is the function that minimizes $E\left[(Y-g(X))^{2} \mid X=x\right]$ over all functions $g$ at all points $X=x$.

  • $\epsilon=Y-f(x)$ is the irreducible error.

  • For any estimate $\hat{f}(x)$ of $f(x)$, we have

    $$E\left[(Y-\hat{f}(X))^{2} \mid X=x\right]=\underbrace{\left[f(x)-\hat{f}(x)\right]^{2}}_{\text{Reducible}}+\underbrace{\operatorname{Var}(\epsilon)}_{\text{Irreducible}}$$
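
A minimal simulation of this identity at a single point $x$, assuming (hypothetically) a true regression function $f(x)=\sin(x)$, noise variance $\operatorname{Var}(\epsilon)=0.25$, and a deliberately biased estimate $\hat{f}$:

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.sin          # assumed true regression function
sigma2 = 0.25       # assumed irreducible noise variance Var(eps)
x = 1.0             # the point we condition on

def fhat(t):
    return np.sin(t) + 0.3          # a deliberately biased estimate of f

# Draws of Y | X = x, then a Monte Carlo estimate of E[(Y - fhat(X))^2 | X = x]
y = f(x) + rng.normal(0.0, np.sqrt(sigma2), size=200_000)
mc = np.mean((y - fhat(x)) ** 2)

# Reducible + irreducible parts from the identity above
theory = (f(x) - fhat(x)) ** 2 + sigma2
print(f"simulated: {mc:.4f}   reducible + irreducible: {theory:.4f}")
```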

How to estimate $f$

  • Relax the definition and let

    $$\hat{f}(x)=\operatorname{Ave}(Y \mid X \in \mathcal{N}(x))$$

    where $\mathcal{N}(x)$ is some neighborhood of $x$.

  • Nearest neighbor methods can be lousy when $p$ is large.

    Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
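
A minimal sketch of the neighborhood average $\hat{f}(x)=\operatorname{Ave}(Y \mid X \in \mathcal{N}(x))$ with a single predictor, taking $\mathcal{N}(x)$ to be the $k$ nearest training points (the simulated data and $k=15$ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, 200)                     # simulated predictor
y_train = np.sin(x_train) + rng.normal(0, 0.3, 200)   # simulated response

def nn_average(x, x_train, y_train, k=15):
    """Average the responses of the k training points closest to x."""
    idx = np.argsort(np.abs(x_train - x))[:k]
    return y_train[idx].mean()

print(nn_average(5.0, x_train, y_train))   # estimate of f(5.0) = sin(5.0)
```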

Parametric and structured models

  • The linear model is an important example of a parametric model:

    $$f_{L}(X)=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}$$

    • A linear model is specified in terms of $p+1$ parameters $\beta_{0}, \beta_{1}, \ldots, \beta_{p}$.
    • We estimate the parameters by fitting the model to training data.
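
A minimal sketch of this fit using ordinary least squares in NumPy, with $p=2$ simulated predictors (the true coefficients below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -3.0])       # assumed beta_0, beta_1, beta_2
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.5, n)

X1 = np.column_stack([np.ones(n), X])        # prepend the intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)                              # should be close to beta_true
```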

Assessing Model Accuracy

  • Suppose we fit a model $\hat{f}(x)$ to some training data $\mathrm{Tr}=\left\{x_{i}, y_{i}\right\}_{1}^{N}$, and we wish to see how well it performs.

    We could compute the average squared prediction error over $\mathrm{Tr}$:

    $$\operatorname{MSE}_{\mathrm{Tr}}=\operatorname{Ave}_{i \in \mathrm{Tr}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2}$$

  • This may be biased toward more overfit models.

    Instead we should, if possible, compute it using fresh test data $\mathrm{Te}=\left\{x_{i}, y_{i}\right\}_{1}^{M}$:

    $$\operatorname{MSE}_{\mathrm{Te}}=\operatorname{Ave}_{i \in \mathrm{Te}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2}$$
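
A minimal sketch on simulated data (an assumed sinusoidal model) contrasting $\operatorname{MSE}_{\mathrm{Tr}}$ and $\operatorname{MSE}_{\mathrm{Te}}$ for polynomial fits of increasing flexibility; the training error keeps falling while the test error eventually rises:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    x = rng.uniform(-2, 2, n)
    return x, np.sin(x) + rng.normal(0, 0.3, n)

x_tr, y_tr = simulate(50)    # training data Tr
x_te, y_te = simulate(50)    # fresh test data Te

for degree in (1, 3, 10):
    coefs = np.polyfit(x_tr, y_tr, degree)
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}:  MSE_Tr = {mse_tr:.3f}  MSE_Te = {mse_te:.3f}")
```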

Bias-Variance Trade-off

  • Suppose we have fit a model $\hat{f}(x)$ to some training data $\mathrm{Tr}$, and let $\left(x_{0}, y_{0}\right)$ be a test observation drawn from the population. If the true model is $Y=f(X)+\epsilon$ (with $f(x)=E(Y \mid X=x)$), then

    $$E\left(y_{0}-\hat{f}\left(x_{0}\right)\right)^{2}=\operatorname{Var}\left(\hat{f}\left(x_{0}\right)\right)+\left[\operatorname{Bias}\left(\hat{f}\left(x_{0}\right)\right)\right]^{2}+\operatorname{Var}(\epsilon)$$

    The expectation averages over the variability of $y_{0}$ as well as the variability in $\mathrm{Tr}$. Note that $\operatorname{Bias}\left(\hat{f}\left(x_{0}\right)\right)=E\left[\hat{f}\left(x_{0}\right)\right]-f\left(x_{0}\right)$.
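
A Monte Carlo check of this decomposition at a single test point $x_0$, using a nearest-neighbor fit; the true model, noise level, and neighborhood size below are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
f, sigma = np.sin, 0.3                        # assumed true f and noise sd
x0, k, n, reps = 1.0, 10, 100, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(-2, 2, n)                 # a fresh training set Tr
    y = f(x) + rng.normal(0, sigma, n)
    idx = np.argsort(np.abs(x - x0))[:k]      # k nearest neighbors of x0
    preds[r] = y[idx].mean()                  # fhat(x0) fit on this Tr

y0 = f(x0) + rng.normal(0, sigma, reps)       # independent test responses at x0
lhs = np.mean((y0 - preds) ** 2)              # E(y0 - fhat(x0))^2
rhs = preds.var() + (preds.mean() - f(x0)) ** 2 + sigma ** 2
print(f"expected test error: {lhs:.4f}   var + bias^2 + Var(eps): {rhs:.4f}")
```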

Classification Problems

  • Here the response variable $Y$ is qualitative. Our goals are to:

    • Build a classifier $C(X)$ that assigns a class label from $\mathcal{C}$ to a future unlabeled observation $X$.
    • Assess the uncertainty in each classification.
    • Understand the roles of the different predictors among $X=\left(X_{1}, X_{2}, \ldots, X_{p}\right)$.
  • Suppose the $K$ elements in $\mathcal{C}$ are numbered $1,2, \ldots, K$. Let

    $$p_{k}(x)=\operatorname{Pr}(Y=k \mid X=x), \quad k=1,2, \ldots, K$$

    These are the conditional class probabilities at $x$. Then the Bayes optimal classifier at $x$ (sketched in code after this list) is

    $$C(x)=j \text{ if } p_{j}(x)=\max \left\{p_{1}(x), p_{2}(x), \ldots, p_{K}(x)\right\}$$

  • Nearest-neighbor averaging can be used as before.

    It also breaks down as the dimension grows. However, the impact on $\hat{C}(x)$ is less than on $\hat{p}_{k}(x)$, $k=1, \ldots, K$.
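
A minimal sketch of the Bayes rule for a hypothetical two-class problem in which the class-conditional distributions of $X$ are Gaussian, so the $p_k(x)$ can be computed exactly:

```python
import numpy as np

priors = np.array([0.5, 0.5])     # assumed Pr(Y = 1), Pr(Y = 2)
means = np.array([-1.0, 1.0])     # assumed class-conditional means of X

def class_probs(x):
    """p_k(x) = Pr(Y = k | X = x) via Bayes' theorem, k = 1, 2."""
    dens = np.exp(-0.5 * (x - means) ** 2) / np.sqrt(2 * np.pi)  # N(mean, 1)
    post = dens * priors
    return post / post.sum()

def bayes_classifier(x):
    """C(x) = argmax_k p_k(x); classes are numbered 1 and 2."""
    return int(np.argmax(class_probs(x))) + 1

print(class_probs(0.3))        # conditional class probabilities at x = 0.3
print(bayes_classifier(0.3))   # Bayes-optimal label at x = 0.3
```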

Classification: some details

  • Typically we measure the performance of $\hat{C}(x)$ using the misclassification error rate (computed in the sketch at the end of this section):

    $$\operatorname{Err}_{\mathrm{Te}}=\operatorname{Ave}_{i \in \mathrm{Te}} I\left[y_{i} \neq \hat{C}\left(x_{i}\right)\right]$$

  • The Bayes classifier (using the true $p_{k}(x)$) has the smallest error (in the population).

  • K-nearest neighbors (KNN) classifier: Given a positive integer $K$ and a test observation $x_{0}$, KNN first identifies a set of $K$ points in the training data that are closest to $x_{0}$, denoted by $\mathcal{N}_{0}$. It then estimates the conditional probability for class $j$ by

    $$\operatorname{Pr}\left(Y=j \mid X=x_{0}\right)=\frac{1}{K} \sum_{i \in \mathcal{N}_{0}} I\left(y_{i}=j\right)$$

    Finally, assign $x_{0}$ to the class $j$ with the largest probability.
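
A minimal sketch of the KNN classifier together with the test misclassification rate $\operatorname{Err}_{\mathrm{Te}}$ from above, on simulated two-class data (the data-generating model and $K=5$ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Two classes in two dimensions, centered at (-1, -1) and (1, 1)."""
    y = rng.integers(1, 3, n)                                   # labels 1 and 2
    x = rng.normal(np.where(y == 1, -1.0, 1.0), 1.0, (2, n)).T  # n x 2 features
    return x, y

x_tr, y_tr = simulate(200)
x_te, y_te = simulate(200)

def knn_predict(x0, x_train, y_train, k=5):
    """Majority vote among the k training points nearest to x0."""
    idx = np.argsort(np.linalg.norm(x_train - x0, axis=1))[:k]
    return np.bincount(y_train[idx]).argmax()

pred = np.array([knn_predict(x0, x_tr, y_tr) for x0 in x_te])
err_te = np.mean(pred != y_te)                # Err_Te over the test set
print(f"test misclassification rate: {err_te:.3f}")
```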

— Jul 15, 2022

§1 Statistical Learning by Lu Meng is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at About.