§3 Classification

  1. Classification
  2. Linear Regression
  3. Logistic Regression
    1. Maximum Likelihood
    2. Confounding
    3. Case-control sampling and logistic regression
    4. Diminishing returns in unbalanced binary data
    5. Logistic regression with more than two classes
  4. Discriminant Analysis
    1. Bayes theorem for classification
    2. Classify to the highest density
    3. Why discriminant analysis?
    4. Linear Discriminant Analysis when p = 1
      1. Discriminant functions
      2. Estimating the parameters
    5. Linear Discriminant Analysis when p > 1
      1. Illustration: p = 2 and K = 3 classes
    6. From \delta_k(x) to probabilities
    7. Types of errors
      1. Varying the threshold
    8. Other forms of Discriminant Analysis
      1. Quadratic Discriminant Analysis
      2. Naïve Bayes
    9. Logistic Regression versus LDA
  5. Summary

Classification

  • Qualitative variables take values in an unordered set \mathcal{C}.
  • Given a feature vector X and a qualitative response Y taking values in the set \mathcal{C}, the classification task is to build a function C(X) that takes as input the feature vector X and predicts a value for Y; i.e. C(X) \in \mathcal{C}.
  • Often we are more interested in estimating the probabilities that X belongs to each category in \mathcal{C}.

Linear Regression

  • Suppose for the Default classification task that we code

    Y=\left\{\begin{array}{ll} 0 & \text{if No} \\ 1 & \text{if Yes.} \end{array}\right.

    Can we simply perform a linear regression of Y on X and classify as Yes if \hat{Y} > 0.5?

    • In this case of a binary outcome, linear regression does a good job as a classifier, and is equivalent to linear discriminant analysis which we discuss later.
    • Since in the population E(Y \mid X=x)=\operatorname{Pr}(Y=1 \mid X=x), we might think that regression is perfect for this task.
    • However, linear regression might produce probabilities less than zero or bigger than one. Logistic regression is more appropriate.
  • Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms.

    Y=\left\{\begin{array}{ll} 1 & \text{if stroke} \\ 2 & \text{if drug overdose;} \\ 3 & \text{if epileptic seizure.} \end{array}\right.

    This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure.

    Linear regression is not appropriate here.

    Multiclass Logistic Regression or Discriminant Analysis are more appropriate.

Logistic Regression

  • Let’s write p(X)=\operatorname{Pr}(Y=1 \mid X) for short and consider using balance to predict default. Logistic regression uses the form

    p(X)=\frac{e^{\beta_{0}+\beta_{1} X}}{1+e^{\beta_{0}+\beta_{1} X}}

    (e \approx 2.71828 is a mathematical constant, Euler’s number.)

    It is easy to see that no matter what values \beta_{0}, \beta_{1} or X take, p(X) will have values between 0 and 1.

    A bit of rearrangement gives

    \log \left(\frac{p(X)}{1-p(X)}\right)=\beta_{0}+\beta_{1} X

    This monotone transformation is called the **log odds** or **logit** transformation of p(X).

  • Logistic regression ensures that our estimate for p(X) lies between 0 and 1.

Maximum Likelihood

  • We use maximum likelihood to estimate the parameters.

    \ell\left(\beta_{0}, \beta\right)=\prod_{i: y_{i}=1} p\left(x_{i}\right) \prod_{i: y_{i}=0}\left(1-p\left(x_{i}\right)\right)

    This likelihood gives the probability of the observed zeros and ones in the data. We pick \beta_{0} and \beta_{1} to maximize the likelihood of the observed data.

    Most statistical packages can fit linear logistic regression models by maximum likelihood. In R we use the glm function.
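
As a minimal sketch, the single-predictor fit above might look as follows in R, assuming the Default data frame from the ISLR package (which contains the default, student, balance and income variables used in these notes):

```r
# Sketch: logistic regression of default on balance, fit by maximum likelihood.
library(ISLR)    # provides the Default data (an assumption about the data source)

fit <- glm(default ~ balance, data = Default, family = binomial)
summary(fit)     # estimates of beta_0 and beta_1 with standard errors

# Fitted probabilities p(X) = Pr(default = Yes | balance)
p_hat <- predict(fit, type = "response")
```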

Confounding

stat3612-ch3-1.png

  • Students tend to have higher balances than non-students, so their marginal default rate is higher than for non-students.
  • But for each level of balance, students default less than non-students.
  • Multiple logistic regression can tease this out.

Case-control sampling and logistic regression

  • In the South African heart-disease data, there are 160 cases and 302 controls, so \tilde{\pi}=0.35 of the sample are cases. Yet the prevalence of MI (myocardial infarction) in this region is \pi=0.05.

  • With case-control samples, we can estimate the regression parameters \beta_{j} accurately (if our model is correct); the constant term \beta_{0} is incorrect.

  • We can correct the estimated intercept by a simple transformation (sketched in code after this list):

    \hat{\beta}_{0}^{*}=\hat{\beta}_{0}+\log \frac{\pi}{1-\pi}-\log \frac{\tilde{\pi}}{1-\tilde{\pi}}

  • Often cases are rare and we take them all; up to five times that number of controls is sufficient.
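
A small numerical sketch of the intercept correction above, using the prevalences quoted for the South African data (the helper name is just illustrative):

```r
# Intercept correction for case-control sampling (see the formula above).
pi_pop    <- 0.05   # population prevalence of MI
pi_sample <- 0.35   # proportion of cases in the case-control sample

adjust_intercept <- function(beta0_hat) {
  beta0_hat + log(pi_pop / (1 - pi_pop)) - log(pi_sample / (1 - pi_sample))
}

adjust_intercept(0)   # the shift alone is about -2.33
```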

Diminishing returns in unbalanced binary data

stat3612-ch3-2.png

  • Sampling more controls than cases reduces the variance of the parameter estimates. But after a ratio of about 5 to 1 the variance reduction flattens out.

Logistic regression with more than two classes

  • So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes. One version (used in the R package glmnet) has the symmetric form

    \operatorname{Pr}(Y=k \mid X)=\frac{e^{\beta_{0 k}+\beta_{1 k} X_{1}+\ldots+\beta_{p k} X_{p}}}{\sum_{\ell=1}^{K} e^{\beta_{0 \ell}+\beta_{1 \ell} X_{1}+\ldots+\beta_{p \ell} X_{p}}}

    Here there is a linear function for each class.
    (Some cancellation is possible, and only K-1 linear functions are needed, as in 2-class logistic regression.)

    Multiclass logistic regression is also referred to as multinomial regression.
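
As a sketch, an unpenalized multinomial fit can be obtained with nnet::multinom (glmnet, mentioned above, fits the symmetric form with a penalty); the data frame er and its columns are hypothetical stand-ins for the emergency-room example:

```r
# Sketch: multinomial (multiclass) logistic regression.
# 'er' is a hypothetical data frame with a 3-level factor 'diagnosis'
# (stroke / overdose / seizure) and numeric symptom measurements.
library(nnet)

fit <- multinom(diagnosis ~ symptom1 + symptom2, data = er)
head(predict(fit, type = "probs"))   # estimated Pr(Y = k | X) for each class
```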

Discriminant Analysis

  • Here the approach is to model the distribution of X in each of the classes separately, and then use Bayes theorem to flip things around and obtain \operatorname{Pr}(Y \mid X).
  • When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis.
  • However, this approach is quite general, and other distributions can be used as well. We will focus on normal distributions.

Bayes theorem for classification

  • Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes theorem:

    \operatorname{Pr}(Y=k \mid X=x)=\frac{\operatorname{Pr}(X=x \mid Y=k) \cdot \operatorname{Pr}(Y=k)}{\operatorname{Pr}(X=x)}

    One writes this slightly differently for discriminant analysis:

    \operatorname{Pr}(Y=k \mid X=x)=\frac{\pi_{k} f_{k}(x)}{\sum_{l=1}^{K} \pi_{l} f_{l}(x)}

    where

    • f_{k}(x)=\operatorname{Pr}(X=x \mid Y=k) is the density for X in class k. Here we will use normal densities for these, separately in each class.
    • \pi_{k}=\operatorname{Pr}(Y=k) is the marginal or prior probability for class k.

Classify to the highest density

stat3612-ch3-3.png

  • We classify a new point according to which density is highest.
  • When the priors are different, we take them into account as well, and compare \pi_{k} f_{k}(x). On the right, we favor the pink class: the decision boundary has shifted to the left.

Why discriminant analysis?

  • When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
  • If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
  • Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data.

Linear Discriminant Analysis when p = 1

  • The Gaussian density has the form

    f_{k}(x)=\frac{1}{\sqrt{2 \pi} \sigma_{k}} e^{-\frac{1}{2}\left(\frac{x-\mu_{k}}{\sigma_{k}}\right)^{2}}

    Here \mu_{k} is the mean and \sigma_{k}^{2} the variance (in class k). We will assume that all the \sigma_{k}=\sigma are the same.

    Plugging this into Bayes formula, we get a rather complex expression for p_{k}(x)=\operatorname{Pr}(Y=k \mid X=x):

    p_{k}(x)=\frac{\pi_{k} \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2}\left(\frac{x-\mu_{k}}{\sigma}\right)^{2}}}{\sum_{l=1}^{K} \pi_{l} \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2}\left(\frac{x-\mu_{l}}{\sigma}\right)^{2}}}

    Happily, there are simplifications and cancellations.

Discriminant functions

  • To classify at the value X=x, we need to see which of the p_{k}(x) is largest. Taking logs, and discarding terms that do not depend on k, we see that this is equivalent to assigning x to the class with the largest discriminant score:

    \delta_{k}(x)=x \cdot \frac{\mu_{k}}{\sigma^{2}}-\frac{\mu_{k}^{2}}{2 \sigma^{2}}+\log \left(\pi_{k}\right)

    Note that \delta_{k}(x) is a linear function of x.

    If there are K=2 classes and \pi_{1}=\pi_{2}=0.5, then one can see that the decision boundary is at

    x=\frac{\mu_{1}+\mu_{2}}{2}

    Typically we don’t know these parameters; we just have the training data. In that case we simply estimate the parameters and plug them into the rule.

Estimating the parameters

\begin{aligned}\hat{\pi}_{k} &=\frac{n_{k}}{n} \\ \hat{\mu}_{k} &=\frac{1}{n_{k}} \sum_{i: y_{i}=k} x_{i} \\ \hat{\sigma}^{2} &=\frac{1}{n-K} \sum_{k=1}^{K} \sum_{i: y_{i}=k}\left(x_{i}-\hat{\mu}_{k}\right)^{2} \\ &=\sum_{k=1}^{K} \frac{n_{k}-1}{n-K} \cdot \hat{\sigma}_{k}^{2}\end{aligned}

where \hat{\sigma}_{k}^{2}=\frac{1}{n_{k}-1} \sum_{i: y_{i}=k}\left(x_{i}-\hat{\mu}_{k}\right)^{2} is the usual formula for the estimated variance in the k-th class.
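
A short sketch of these plug-in estimates for a single predictor, written directly from the formulas above (x is a hypothetical numeric vector and y the corresponding class labels):

```r
# Plug-in estimates for one-dimensional LDA, following the formulas above.
lda_estimates <- function(x, y) {
  n  <- length(x)
  nk <- tapply(x, y, length)        # class counts n_k
  K  <- length(nk)
  list(
    pi     = nk / n,                # prior proportions pi_k
    mu     = tapply(x, y, mean),    # class means mu_k
    # pooled variance: weighted average of the within-class variances
    sigma2 = sum((nk - 1) * tapply(x, y, var)) / (n - K)
  )
}
```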

Linear Discriminant Analysis when p > 1

stat3612-ch3-4.png

  • Density:

    f(x)=\frac{1}{(2 \pi)^{p / 2}|\boldsymbol{\Sigma}|^{1 / 2}} e^{-\frac{1}{2}(x-\mu)^{T} \boldsymbol{\Sigma}^{-1}(x-\mu)}

  • Discriminant function:

    \delta_{k}(x)=x^{T} \boldsymbol{\Sigma}^{-1} \mu_{k}-\frac{1}{2} \mu_{k}^{T} \boldsymbol{\Sigma}^{-1} \mu_{k}+\log \pi_{k}

    Despite its complex form, \delta_{k}(x)=c_{k 0}+c_{k 1} x_{1}+c_{k 2} x_{2}+\ldots+c_{k p} x_{p} is a linear function.
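
In practice the estimation and classification are usually left to a library; a sketch with MASS::lda on the Default data (assuming the ISLR package, as before):

```r
# Sketch: linear discriminant analysis with two predictors.
library(ISLR)
library(MASS)

lda_fit <- lda(default ~ balance + student, data = Default)
lda_fit   # estimated priors pi_k, group means mu_k, and discriminant coefficients
```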

Illustration: p = 2 and K = 3 classes

Here \pi_{1}=\pi_{2}=\pi_{3}=1/3.

  • The dashed lines are known as the Bayes decision boundaries. Were they known, they would yield the fewest misclassification errors, among all possible classifiers.

From \delta_k(x) to probabilities

  • Once we have estimates \hat{\delta}_{k}(x), we can turn these into estimates for class probabilities:

    \widehat{\operatorname{Pr}}(Y=k \mid X=x)=\frac{e^{\hat{\delta}_{k}(x)}}{\sum_{l=1}^{K} e^{\hat{\delta}_{l}(x)}}

    So classifying to the largest \hat{\delta}_{k}(x) amounts to classifying to the class for which \widehat{\operatorname{Pr}}(Y=k \mid X=x) is largest.

    When K=2, we classify to class 2 if \widehat{\operatorname{Pr}}(Y=2 \mid X=x) \geq 0.5, and otherwise to class 1.
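
With the MASS::lda fit sketched earlier, these estimated posterior probabilities and the implied classifications are returned by predict():

```r
# Posterior class probabilities from the LDA fit (assumes 'lda_fit' from above).
pred <- predict(lda_fit, Default)
head(pred$posterior)   # estimated Pr(Y = k | X = x) for each class
head(pred$class)       # class with the largest posterior (the 0.5 rule when K = 2)
```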

Types of errors

  • False positive rate: The fraction of negative examples that are classified as positive.

  • False negative rate: The fraction of positive examples that are classified as negative.

  • We can produce a confusion table of predicted versus true default status by classifying to class Yes if

    \widehat{\operatorname{Pr}}(\mathrm{Default} = \mathrm{Yes} \mid \mathrm{Balance}, \mathrm{Student}) \geq 0.5

    We can change the two error rates by instead classifying to Yes whenever

    \widehat{\operatorname{Pr}}(\mathrm{Default} = \mathrm{Yes} \mid \mathrm{Balance}, \mathrm{Student}) \geq \text{threshold}

    and varying the threshold over [0,1].

Varying the threshold

stat3612-ch3-6.png

  • In order to reduce the false negative rate, we may want to reduce the threshold to 0.1 or less.

stat3612-ch3-7.png

  • The ROC plot displays both simultaneously.

    Sometimes we use the AUC or area under the curve to summarize the overall performance. Higher AUC is good.
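
A sketch of varying the threshold and summarizing with an ROC curve, assuming the pROC package and the pred object from the LDA sketch above:

```r
# Vary the classification threshold and summarize with an ROC curve / AUC.
library(pROC)

p_yes <- pred$posterior[, "Yes"]        # posterior probability of default

# Confusion table at a lower threshold (0.1 instead of the default 0.5)
table(Predicted = ifelse(p_yes >= 0.1, "Yes", "No"),
      Truth     = Default$default)

roc_obj <- roc(Default$default, p_yes)  # ROC over all thresholds
auc(roc_obj)                            # area under the curve
```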

Other forms of Discriminant Analysis

\operatorname{Pr}(Y=k \mid X=x)=\frac{\pi_{k} f_{k}(x)}{\sum_{l=1}^{K} \pi_{l} f_{l}(x)}

When the f_{k}(x) are Gaussian densities with the same covariance matrix \boldsymbol{\Sigma} in each class, this leads to linear discriminant analysis. By altering the forms for f_{k}(x), we get different classifiers.

  • With Gaussians but different \boldsymbol{\Sigma}_{k} in each class, we get quadratic discriminant analysis.
  • With f_{k}(x)=\prod_{j=1}^{p} f_{jk}(x_{j}) (a conditional independence model) in each class we get naïve Bayes. For Gaussian densities this means the \boldsymbol{\Sigma}_{k} are diagonal.
  • Many other forms are possible, obtained by proposing specific density models for f_{k}(x), including nonparametric approaches.

Quadratic Discriminant Analysis

stat3612-ch3-8.png

\delta_{k}(x)=-\frac{1}{2}\left(x-\mu_{k}\right)^{T} \boldsymbol{\Sigma}_{k}^{-1}\left(x-\mu_{k}\right)-\frac{1}{2} \log \left|\boldsymbol{\Sigma}_{k}\right|+\log \pi_{k}

Because the \boldsymbol{\Sigma}_{k} are different, the quadratic terms matter.
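
A sketch with MASS::qda, here using two quantitative predictors so that each class gets its own full covariance matrix:

```r
# Sketch: quadratic discriminant analysis (a separate Sigma_k per class).
library(ISLR)
library(MASS)

qda_fit <- qda(default ~ balance + income, data = Default)
head(predict(qda_fit, Default)$posterior)   # posterior class probabilities
```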

Naïve Bayes

  • Assumes features are independent in each class.

    Useful when p is large, and so multivariate methods like QDA and even LDA break down.

    • Gaussian naïve Bayes assumes each \boldsymbol{\Sigma}_{k} is diagonal:

      \delta_{k}(x) \propto \log \left[\pi_{k} \prod_{j=1}^{p} f_{k j}\left(x_{j}\right)\right]=-\frac{1}{2} \sum_{j=1}^{p} \frac{\left(x_{j}-\mu_{k j}\right)^{2}}{\sigma_{k j}^{2}}+\log \pi_{k}

    • Naïve Bayes can also be used for mixed feature vectors (qualitative and quantitative). If X_{j} is qualitative, replace f_{kj}(x_{j}) with a probability mass function (histogram) over the discrete categories.

  • Despite strong assumptions, naive Bayes often produces good classification results.
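
A sketch using the naiveBayes function from the e1071 package (one of several implementations), which handles the mixed quantitative/qualitative case described above:

```r
# Sketch: naive Bayes with Gaussian densities for the numeric features and
# frequency tables for the qualitative feature 'student'.
library(ISLR)
library(e1071)

nb_fit <- naiveBayes(default ~ balance + income + student, data = Default)
head(predict(nb_fit, Default, type = "raw"))   # estimated class probabilities
```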

Logistic Regression versus LDA

  • For a two-class problem, one can show that for LDA

    \log \left(\frac{p_{1}(x)}{1-p_{1}(x)}\right)=\log \left(\frac{p_{1}(x)}{p_{2}(x)}\right)=c_{0}+c_{1} x_{1}+\ldots+c_{p} x_{p}

    So it has the same form as logistic regression.

    The difference is in how the parameters are estimated.

    • Logistic regression uses the conditional likelihood based on \operatorname{Pr}(Y \mid X) (known as discriminative learning).
    • LDA uses the full likelihood based on \operatorname{Pr}(X, Y) (known as generative learning).
    • Despite these differences, in practice the results are often very similar.
  • Footnote: logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.
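
For instance, a sketch on the Default data adds a squared term directly in the glm formula:

```r
# Sketch: logistic regression with an explicit quadratic term in balance,
# giving a quadratic decision boundary in that predictor (cf. QDA).
fit_quad <- glm(default ~ balance + I(balance^2),
                data = Default, family = binomial)
summary(fit_quad)
```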

Summary

  • Logistic regression is very popular for classification, especially when K=2.
  • LDA is useful when n is small, or the classes are well separated, and Gaussian assumptions are reasonable. Also when K>2.
  • Naïve Bayes is useful when p is very large.

— Jul 15, 2022

Creative Commons License
§3 Classification by Lu Meng is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at About.