§3 Classification

  1. Classification
  2. Linear Regression
  3. Logistic Regression
    1. Maximum Likelihood
    2. Confounding
    3. Case-control sampling and logistic regression
    4. Diminishing returns in unbalanced binary data
    5. Logistic regression with more than two classes
  4. Discriminant Analysis
    1. Bayes theorem for classification
    2. Classify to the highest density
    3. Why discriminant analysis?
    4. Linear Discriminant Analysis when p = 1
      1. Discriminant functions
      2. Estimating the parameters
    5. Linear Discriminant Analysis when p > 1
      1. Illustration: p = 2 and K = 3 classes
    6. From \delta_k(x) to probabilities
    7. Types of errors
      1. Varying the threshold
    8. Other forms of Discriminant Analysis
      1. Quadratic Discriminant Analysis
      2. Naïve Bayes
    9. Logistic Regression versus LDA
  5. Summary

Classification

  • Qualitative variables take values in an unordered set \mathcal{C}.
  • Given a feature vector X and a qualitative response Y taking values in the set \mathcal{C}, the classification task is to build a function C(X) that takes as input the feature vector X and predicts a value for Y; i.e. C(X) \in \mathcal{C}.
  • Often we are more interested in estimating the probabilities that X belongs to each category in \mathcal{C}.

Linear Regression

  • Suppose for the Default classification task that we code

    Y=\left\{\begin{array}{ll} 0 & \text{if No} \\ 1 & \text{if Yes.} \end{array}\right.

    Can we simply perform a linear regression of Y on X and classify as Yes if \hat{Y} > 0.5?

    • In this case of a binary outcome, linear regression does a good job as a classifier, and is equivalent to linear discriminant analysis which we discuss later.
    • Since in the population E(Y \mid X=x)=\operatorname{Pr}(Y=1 \mid X=x), we might think that regression is perfect for this task.
    • However, linear regression might produce probabilities less than zero or bigger than one. Logistic regression is more appropriate.
  • Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms.

    Y=\left\{\begin{array}{ll} 1 & \text{if stroke} \\ 2 & \text{if drug overdose;} \\ 3 & \text{if epileptic seizure.} \end{array}\right.

    This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure.

    Linear regression is not appropriate here.

    Multiclass Logistic Regression or Discriminant Analysis are more appropriate.

Logistic Regression

  • Let’s write p(X)=\operatorname{Pr}(Y=1 \mid X) for short and consider using balance to predict default. Logistic regression uses the form

    p(X)=\frac{e^{\beta_{0}+\beta_{1} X}}{1+e^{\beta_{0}+\beta_{1} X}}

    (e \approx 2.71828 is a mathematical constant, Euler’s number.)

    It is easy to see that no matter what values \beta_{0}, \beta_{1} or X take, p(X) will have values between 0 and 1.

    A bit of rearrangement gives

    \log \left(\frac{p(X)}{1-p(X)}\right)=\beta_{0}+\beta_{1} X

    This monotone transformation is called the **log odds** or **logit** transformation of p(X).

  • Logistic regression ensures that our estimate for p(X) lies between 0 and 1.

Maximum Likelihood

  • We use maximum likelihood to estimate the parameters.

    \ell\left(\beta_{0}, \beta\right)=\prod_{i: y_{i}=1} p\left(x_{i}\right) \prod_{i: y_{i}=0}\left(1-p\left(x_{i}\right)\right)

    This likelihood gives the probability of the observed zeros and ones in the data. We pick \beta_{0} and \beta_{1} to maximize the likelihood of the observed data.

    Most statistical packages can fit linear logistic regression models by maximum likelihood. In R we use the glm function.
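
As a minimal sketch, the single-predictor fit above might look as follows in R, assuming the Default data frame from the ISLR package (which contains the default, student, balance and income variables used in these notes):

```r
# Sketch: logistic regression of default on balance, fit by maximum likelihood.
library(ISLR)    # provides the Default data (an assumption about the data source)

fit <- glm(default ~ balance, data = Default, family = binomial)
summary(fit)     # estimates of beta_0 and beta_1 with standard errors

# Fitted probabilities p(X) = Pr(default = Yes | balance)
p_hat <- predict(fit, type = "response")
```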

Confounding

stat3612-ch3-1.png

  • Students tend to have higher balances than non-students, so their marginal default rate is higher than for non-students.
  • But for each level of balance, students default less than non-students.
  • Multiple logistic regression can tease this out.

Case-control sampling and logistic regression

  • In the South African heart-disease data, there are 160 cases and 302 controls, so \tilde{\pi}=0.35 of the sample are cases. Yet the prevalence of MI (myocardial infarction) in this region is \pi=0.05.

  • With case-control samples, we can estimate the regression parameters \beta_{j} accurately (if our model is correct); the constant term \beta_{0} is incorrect.

  • We can correct the estimated intercept by a simple transformation (sketched in code after this list):

    \hat{\beta}_{0}^{*}=\hat{\beta}_{0}+\log \frac{\pi}{1-\pi}-\log \frac{\tilde{\pi}}{1-\tilde{\pi}}

  • Often cases are rare and we take them all; up to five times that number of controls is sufficient.
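
A small numerical sketch of the intercept correction above, using the prevalences quoted for the South African data (the helper name is just illustrative):

```r
# Intercept correction for case-control sampling (see the formula above).
pi_pop    <- 0.05   # population prevalence of MI
pi_sample <- 0.35   # proportion of cases in the case-control sample

adjust_intercept <- function(beta0_hat) {
  beta0_hat + log(pi_pop / (1 - pi_pop)) - log(pi_sample / (1 - pi_sample))
}

adjust_intercept(0)   # the shift alone is about -2.33
```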

Diminishing returns in unbalanced binary data

stat3612-ch3-2.png

  • Sampling more controls than cases reduces the variance of the parameter estimates. But after a ratio of about 5 to 1 the variance reduction flattens out.

Logistic regression with more than two classes

  • So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes. One version (used in the R package glmnet) has the symmetric form

    \operatorname{Pr}(Y=k \mid X)=\frac{e^{\beta_{0 k}+\beta_{1 k} X_{1}+\ldots+\beta_{p k} X_{p}}}{\sum_{\ell=1}^{K} e^{\beta_{0 \ell}+\beta_{1 \ell} X_{1}+\ldots+\beta_{p \ell} X_{p}}}

    Here there is a linear function for each class.
    (Some cancellation is possible, and only K-1 linear functions are needed, as in 2-class logistic regression.)

    Multiclass logistic regression is also referred to as multinomial regression.
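
As a sketch, an unpenalized multinomial fit can be obtained with nnet::multinom (glmnet, mentioned above, fits the symmetric form with a penalty); the data frame er and its columns are hypothetical stand-ins for the emergency-room example:

```r
# Sketch: multinomial (multiclass) logistic regression.
# 'er' is a hypothetical data frame with a 3-level factor 'diagnosis'
# (stroke / overdose / seizure) and numeric symptom measurements.
library(nnet)

fit <- multinom(diagnosis ~ symptom1 + symptom2, data = er)
head(predict(fit, type = "probs"))   # estimated Pr(Y = k | X) for each class
```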

Discriminant Analysis

  • Here the approach is to model the distribution of X in each of the classes separately, and then use Bayes theorem to flip things around and obtain \operatorname{Pr}(Y \mid X).
  • When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis.
  • However, this approach is quite general, and other distributions can be used as well. We will focus on normal distributions.

Bayes theorem for classification

  • Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes theorem:

    \operatorname{Pr}(Y=k \mid X=x)=\frac{\operatorname{Pr}(X=x \mid Y=k) \cdot \operatorname{Pr}(Y=k)}{\operatorname{Pr}(X=x)}

    One writes this slightly differently for discriminant analysis:

    \operatorname{Pr}(Y=k \mid X=x)=\frac{\pi_{k} f_{k}(x)}{\sum_{l=1}^{K} \pi_{l} f_{l}(x)}

    where

    • f_{k}(x)=\operatorname{Pr}(X=x \mid Y=k) is the density for X in class k. Here we will use normal densities for these, separately in each class.
    • \pi_{k}=\operatorname{Pr}(Y=k) is the marginal or prior probability for class k.

Classify to the highest density

stat3612-ch3-3.png

  • We classify a new point according to which density is highest.
  • When the priors are different, we take them into account as well, and compare \pi_{k} f_{k}(x). On the right, we favor the pink class: the decision boundary has shifted to the left.

Why discriminant analysis?

  • When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
  • If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
  • Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data.

Linear Discriminant Analysis when p = 1

  • The Gaussian density has the form

    f_{k}(x)=\frac{1}{\sqrt{2 \pi} \sigma_{k}} e^{-\frac{1}{2}\left(\frac{x-\mu_{k}}{\sigma_{k}}\right)^{2}}

    Here \mu_{k} is the mean and \sigma_{k}^{2} the variance (in class k). We will assume that all the \sigma_{k}=\sigma are the same.

    Plugging this into Bayes formula, we get a rather complex expression for p_{k}(x)=\operatorname{Pr}(Y=k \mid X=x):

    p_{k}(x)=\frac{\pi_{k} \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2}\left(\frac{x-\mu_{k}}{\sigma}\right)^{2}}}{\sum_{l=1}^{K} \pi_{l} \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2}\left(\frac{x-\mu_{l}}{\sigma}\right)^{2}}}

    Happily, there are simplifications and cancellations.

Discriminant functions

  • To classify at the value X=x, we need to see which of the p_{k}(x) is largest. Taking logs, and discarding terms that do not depend on k, we see that this is equivalent to assigning x to the class with the largest discriminant score:

    \delta_{k}(x)=x \cdot \frac{\mu_{k}}{\sigma^{2}}-\frac{\mu_{k}^{2}}{2 \sigma^{2}}+\log \left(\pi_{k}\right)

    Note that \delta_{k}(x) is a linear function of x.

    If there are K=2 classes and \pi_{1}=\pi_{2}=0.5, then one can see that the decision boundary is at

    x=\frac{\mu_{1}+\mu_{2}}{2}

    Typically we don’t know these parameters; we just have the training data. In that case we simply estimate the parameters and plug them into the rule.

Estimating the parameters

\begin{aligned}\hat{\pi}_{k} &=\frac{n_{k}}{n} \\ \hat{\mu}_{k} &=\frac{1}{n_{k}} \sum_{i: y_{i}=k} x_{i} \\ \hat{\sigma}^{2} &=\frac{1}{n-K} \sum_{k=1}^{K} \sum_{i: y_{i}=k}\left(x_{i}-\hat{\mu}_{k}\right)^{2} \\ &=\sum_{k=1}^{K} \frac{n_{k}-1}{n-K} \cdot \hat{\sigma}_{k}^{2}\end{aligned}

where \hat{\sigma}_{k}^{2}=\frac{1}{n_{k}-1} \sum_{i: y_{i}=k}\left(x_{i}-\hat{\mu}_{k}\right)^{2} is the usual formula for the estimated variance in the k-th class.
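
A short sketch of these plug-in estimates for a single predictor, written directly from the formulas above (x is a hypothetical numeric vector and y the corresponding class labels):

```r
# Plug-in estimates for one-dimensional LDA, following the formulas above.
lda_estimates <- function(x, y) {
  n  <- length(x)
  nk <- tapply(x, y, length)        # class counts n_k
  K  <- length(nk)
  list(
    pi     = nk / n,                # prior proportions pi_k
    mu     = tapply(x, y, mean),    # class means mu_k
    # pooled variance: weighted average of the within-class variances
    sigma2 = sum((nk - 1) * tapply(x, y, var)) / (n - K)
  )
}
```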

Linear Discriminant Analysis when p > 1

stat3612-ch3-4.png

  • Density:

    f(x)=\frac{1}{(2 \pi)^{p / 2}|\boldsymbol{\Sigma}|^{1 / 2}} e^{-\frac{1}{2}(x-\mu)^{T} \boldsymbol{\Sigma}^{-1}(x-\mu)}

  • Discriminant function:

    \delta_{k}(x)=x^{T} \boldsymbol{\Sigma}^{-1} \mu_{k}-\frac{1}{2} \mu_{k}^{T} \boldsymbol{\Sigma}^{-1} \mu_{k}+\log \pi_{k}

    Despite its complex form, \delta_{k}(x)=c_{k 0}+c_{k 1} x_{1}+c_{k 2} x_{2}+\ldots+c_{k p} x_{p} is a linear function.
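
In practice the estimation and classification are usually left to a library; a sketch with MASS::lda on the Default data (assuming the ISLR package, as before):

```r
# Sketch: linear discriminant analysis with two predictors.
library(ISLR)
library(MASS)

lda_fit <- lda(default ~ balance + student, data = Default)
lda_fit   # estimated priors pi_k, group means mu_k, and discriminant coefficients
```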

Illustration: p = 2 and K = 3 classes

Here \pi_{1}=\pi_{2}=\pi_{3}=1/3.

  • The dashed lines are known as the Bayes decision boundaries. Were they known, they would yield the fewest misclassification errors, among all possible classifiers.

From \delta_k(x) to probabilities

  • Once we have estimates \hat{\delta}_{k}(x), we can turn these into estimates for class probabilities:

    \widehat{\operatorname{Pr}}(Y=k \mid X=x)=\frac{e^{\hat{\delta}_{k}(x)}}{\sum_{l=1}^{K} e^{\hat{\delta}_{l}(x)}}

    So classifying to the largest \hat{\delta}_{k}(x) amounts to classifying to the class for which \widehat{\operatorname{Pr}}(Y=k \mid X=x) is largest.

    When K=2, we classify to class 2 if \widehat{\operatorname{Pr}}(Y=2 \mid X=x) \geq 0.5, and otherwise to class 1.
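
With the MASS::lda fit sketched earlier, these estimated posterior probabilities and the implied classifications are returned by predict():

```r
# Posterior class probabilities from the LDA fit (assumes 'lda_fit' from above).
pred <- predict(lda_fit, Default)
head(pred$posterior)   # estimated Pr(Y = k | X = x) for each class
head(pred$class)       # class with the largest posterior (the 0.5 rule when K = 2)
```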

Types of errors

  • False positive rate: The fraction of negative examples that are classified as positive.

  • False negative rate: The fraction of positive examples that are classified as negative.

  • We can produce a confusion table of predicted versus true default status by classifying to class Yes if

    \widehat{\operatorname{Pr}}(\mathrm{Default} = \mathrm{Yes} \mid \mathrm{Balance}, \mathrm{Student}) \geq 0.5

    We can change the two error rates by instead classifying to Yes whenever

    \widehat{\operatorname{Pr}}(\mathrm{Default} = \mathrm{Yes} \mid \mathrm{Balance}, \mathrm{Student}) \geq \text{threshold}

    and varying the threshold over [0,1].

Varying the threshold

stat3612-ch3-6.png

  • In order to reduce the false negative rate, we may want to reduce the threshold to 0.1 or less.

stat3612-ch3-7.png

  • The ROC plot displays both simultaneously.

    Sometimes we use the AUC or area under the curve to summarize the overall performance. Higher AUC is good.
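
A sketch of varying the threshold and summarizing with an ROC curve, assuming the pROC package and the pred object from the LDA sketch above:

```r
# Vary the classification threshold and summarize with an ROC curve / AUC.
library(pROC)

p_yes <- pred$posterior[, "Yes"]        # posterior probability of default

# Confusion table at a lower threshold (0.1 instead of the default 0.5)
table(Predicted = ifelse(p_yes >= 0.1, "Yes", "No"),
      Truth     = Default$default)

roc_obj <- roc(Default$default, p_yes)  # ROC over all thresholds
auc(roc_obj)                            # area under the curve
```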

Other forms of Discriminant Analysis

\operatorname{Pr}(Y=k \mid X=x)=\frac{\pi_{k} f_{k}(x)}{\sum_{l=1}^{K} \pi_{l} f_{l}(x)}

When the f_{k}(x) are Gaussian densities with the same covariance matrix \boldsymbol{\Sigma} in each class, this leads to linear discriminant analysis. By altering the forms for f_{k}(x), we get different classifiers.

  • With Gaussians but different \boldsymbol{\Sigma}_{k} in each class, we get quadratic discriminant analysis.
  • With f_{k}(x)=\prod_{j=1}^{p} f_{jk}(x_{j}) (a conditional independence model) in each class we get naïve Bayes. For Gaussian densities this means the \boldsymbol{\Sigma}_{k} are diagonal.
  • Many other forms are possible, obtained by proposing specific density models for f_{k}(x), including nonparametric approaches.

Quadratic Discriminant Analysis

stat3612-ch3-8.png

\delta_{k}(x)=-\frac{1}{2}\left(x-\mu_{k}\right)^{T} \boldsymbol{\Sigma}_{k}^{-1}\left(x-\mu_{k}\right)-\frac{1}{2} \log \left|\boldsymbol{\Sigma}_{k}\right|+\log \pi_{k}

Because the \boldsymbol{\Sigma}_{k} are different, the quadratic terms matter.
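
A sketch with MASS::qda, here using two quantitative predictors so that each class gets its own full covariance matrix:

```r
# Sketch: quadratic discriminant analysis (a separate Sigma_k per class).
library(ISLR)
library(MASS)

qda_fit <- qda(default ~ balance + income, data = Default)
head(predict(qda_fit, Default)$posterior)   # posterior class probabilities
```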

Naïve Bayes

  • Assumes features are independent in each class.

    Useful when p is large, and so multivariate methods like QDA and even LDA break down.

    • Gaussian naïve Bayes assumes each \boldsymbol{\Sigma}_{k} is diagonal:

      \delta_{k}(x) \propto \log \left[\pi_{k} \prod_{j=1}^{p} f_{k j}\left(x_{j}\right)\right]=-\frac{1}{2} \sum_{j=1}^{p} \frac{\left(x_{j}-\mu_{k j}\right)^{2}}{\sigma_{k j}^{2}}+\log \pi_{k}

    • Naïve Bayes can also be used for mixed feature vectors (qualitative and quantitative). If X_{j} is qualitative, replace f_{kj}(x_{j}) with a probability mass function (histogram) over the discrete categories.

  • Despite strong assumptions, naive Bayes often produces good classification results.
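
A sketch using the naiveBayes function from the e1071 package (one of several implementations), which handles the mixed quantitative/qualitative case described above:

```r
# Sketch: naive Bayes with Gaussian densities for the numeric features and
# frequency tables for the qualitative feature 'student'.
library(ISLR)
library(e1071)

nb_fit <- naiveBayes(default ~ balance + income + student, data = Default)
head(predict(nb_fit, Default, type = "raw"))   # estimated class probabilities
```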

Logistic Regression versus LDA

  • For a two-class problem, one can show that for LDA

    \log \left(\frac{p_{1}(x)}{1-p_{1}(x)}\right)=\log \left(\frac{p_{1}(x)}{p_{2}(x)}\right)=c_{0}+c_{1} x_{1}+\ldots+c_{p} x_{p}

    So it has the same form as logistic regression.

    The difference is in how the parameters are estimated.

    • Logistic regression uses the conditional likelihood based on \operatorname{Pr}(Y \mid X) (known as discriminative learning).
    • LDA uses the full likelihood based on \operatorname{Pr}(X, Y) (known as generative learning).
    • Despite these differences, in practice the results are often very similar.
  • Footnote: logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.
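
For instance, a sketch on the Default data adds a squared term directly in the glm formula:

```r
# Sketch: logistic regression with an explicit quadratic term in balance,
# giving a quadratic decision boundary in that predictor (cf. QDA).
fit_quad <- glm(default ~ balance + I(balance^2),
                data = Default, family = binomial)
summary(fit_quad)
```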

Summary

  • Logistic regression is very popular for classification, especially when K=2.
  • LDA is useful when n is small, or the classes are well separated, and Gaussian assumptions are reasonable. Also when K>2.
  • Naïve Bayes is useful when p is very large.

— Jul 15, 2022

Creative Commons License
§3 Classification by Lu Meng is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at About.