Qualitative variables take values in an unordered set C.
Given a feature vector X and a qualitative response Y taking values in the set C, the classification task is to build a function C(X) that takes as input the feature vector X and predicts its value for Y; i.e. C(X) ∈ C.
Often we are more interested in estimating the probabilities that X belongs to each category in C.
Linear Regression
Suppose for the Default classification task that we code
Y = \begin{cases} 0 & \text{if No} \\ 1 & \text{if Yes.} \end{cases}
Can we simply perform a linear regression of Y on X and classify as Yes if Ŷ > 0.5?
In this case of a binary outcome, linear regression does a good job as a classifier, and is equivalent to linear discriminant analysis which we discuss later.
Since in the population E(Y∣X=x)=Pr(Y=1∣X=x) , we might think that regression is perfect for this task.
However, linear regression might produce probabilities less than zero or bigger than one. Logistic regression is more appropriate.
Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms.
Y = \begin{cases} 1 & \text{if stroke;} \\ 2 & \text{if drug overdose;} \\ 3 & \text{if epileptic seizure.} \end{cases}
This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure.
Linear regression is not appropriate here.
Multiclass Logistic Regression or Discriminant Analysis are more appropriate.
Logistic Regression
Let’s write p(X)=Pr(Y=1∣X) for short and consider using balance to predict default. Logistic regression uses the form
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
(e ≈ 2.71828 is a mathematical constant, Euler's number.)
It is easy to see that no matter what values β0,β1 or X take, p(X) will have values between 0 and 1.
A bit of rearrangement gives
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X
This monotone transformation is called the log odds or logit transformation of p(X).
Logistic regression ensures that our estimate for p(X) lies between 0 and 1.
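As a quick sanity check of these two formulas, the short R snippet below plugs illustrative values of β0, β1 and X into the logistic form and then recovers the linear predictor from the log odds. The coefficient values are made up purely for illustration.

```r
# Numeric check of the logistic form and its logit inverse
# (beta0, beta1 and X are illustrative values, not fitted estimates).
beta0 <- -10.65
beta1 <- 0.0055
X     <- 2000
p <- exp(beta0 + beta1 * X) / (1 + exp(beta0 + beta1 * X))
p                    # lies between 0 and 1
log(p / (1 - p))     # recovers beta0 + beta1 * X = 0.35
```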
Maximum Likelihood
We use maximum likelihood to estimate the parameters.
\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i: y_i = 0} (1 - p(x_i))
This likelihood gives the probability of the observed zeros and ones in the data. We pick β0 and β1 to maximize the likelihood of the observed data.
Most statistical packages can fit linear logistic regression models by maximum likelihood. In R we use the glm function.
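A minimal sketch of such a fit, assuming the Default data set from the ISLR2 package is available:

```r
# Sketch: logistic regression of default on balance by maximum likelihood,
# assuming the Default data set from the ISLR2 package.
library(ISLR2)

fit <- glm(default ~ balance, data = Default, family = binomial)
summary(fit)   # estimates of beta_0 and beta_1 with standard errors

# Estimated probabilities p(X) at two balance values
predict(fit, newdata = data.frame(balance = c(1000, 2000)), type = "response")
```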
Confounding
Students tend to have higher balances than non-students, so their marginal default rate is higher than for non-students.
But for each level of balance, students default less than non-students.
Multiple logistic regression can tease this out.
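A hedged sketch of such a fit, continuing the ISLR2::Default example assumed above; once balance is in the model, the student coefficient becomes negative.

```r
# Sketch: multiple logistic regression with both balance and student
# (continues the ISLR2::Default example above).
fit2 <- glm(default ~ balance + student, data = Default, family = binomial)
coef(summary(fit2))   # with balance held fixed, the student effect is negative
```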
Case-control sampling and logistic regression
In the South African heart disease data there are 160 cases and 302 controls, so π~ = 0.35 of the sample are cases. Yet the prevalence of MI in this region is π = 0.05.
With case-control samples, we can estimate the regression parameters βj accurately (if our model is correct); the constant term β0 is incorrect.
We can correct the estimated intercept by a simple transformation
\hat\beta_0^* = \hat\beta_0 + \log\frac{\pi}{1 - \pi} - \log\frac{\tilde\pi}{1 - \tilde\pi}
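A small worked example of this correction, using the case/control counts quoted above and a hypothetical uncorrected intercept:

```r
# Worked example of the intercept correction for case-control sampling.
pi_tilde <- 160 / (160 + 302)    # sample proportion of cases, about 0.35
pi_true  <- 0.05                 # population prevalence of MI
beta0_hat <- -4.0                # hypothetical uncorrected intercept
beta0_star <- beta0_hat +
  log(pi_true / (1 - pi_true)) - log(pi_tilde / (1 - pi_tilde))
beta0_star                       # corrected intercept
```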
Often cases are rare and we take them all; up to five times that number of controls is sufficient.
Diminishing returns in unbalanced binary data
Sampling more controls than cases reduces the variance of the parameter estimates. But after a ratio of about 5 to 1 the variance reduction flattens out.
Logistic regression with more than two classes
So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes. One version (used in the R package glmnet) has the symmetric form

\Pr(Y = k \mid X = x) = \frac{e^{\beta_{0k} + \beta_{1k} x_1 + \dots + \beta_{pk} x_p}}{\sum_{l=1}^K e^{\beta_{0l} + \beta_{1l} x_1 + \dots + \beta_{pl} x_p}}

Here there is a linear function for each class.
(Some cancellation is possible, and only K−1 linear functions are needed as in 2-class logistic regression.)
Multiclass logistic regression is also referred to as multinomial regression.
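A minimal sketch of the glmnet call for this multiclass form, on simulated data; the feature matrix x and three-class response y below are made up purely to illustrate the interface.

```r
# Sketch: multinomial (multiclass) logistic regression with glmnet
# on simulated data; x and y are illustrative only.
library(glmnet)
set.seed(1)
x <- matrix(rnorm(300 * 2), ncol = 2)
y <- factor(sample(c("stroke", "overdose", "seizure"), 300, replace = TRUE))

fit <- cv.glmnet(x, y, family = "multinomial")
predict(fit, newx = x[1:5, ], s = "lambda.min", type = "class")
```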
Discriminant Analysis
Here the approach is to model the distribution of X in each of the classes separately, and then use Bayes theorem to flip things around and obtain Pr(Y∣X).
When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis.
However, this approach is quite general, and other distributions can be used as well. We will focus on normal distributions.
Bayes theorem for classification
Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes theorem:
\Pr(Y = k \mid X = x) = \frac{\Pr(X = x \mid Y = k) \cdot \Pr(Y = k)}{\Pr(X = x)}
One writes this slightly differently for discriminant analysis:
\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}
where
fk(x)=Pr(X=x∣Y=k) is the density for X in class k. Here we will use normal densities for these, separately in each class.
πk=Pr(Y=k) is the marginal or prior probability for class k.
Classify to the highest density
We classify a new point according to which density is highest.
When the priors are different, we take them into account as well, and compare πkfk(x). On the right, we favor the pink class - the decision boundary has shifted to the left.
Why discriminant analysis?
When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data.
Linear Discriminant Analysis when p=1
The Gaussian density has the form
f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma_k}\right)^2}
Here μk is the mean, and σk2 the variance (in class k ). We will assume that all the σk=σ are the same.
Plugging this into Bayes formula, we get a rather complex expression for p_k(x) = \Pr(Y = k \mid X = x):

p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2\sigma^2}(x - \mu_k)^2}}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2\sigma^2}(x - \mu_l)^2}}
Happily, there are simplifications and cancellations.
Discriminant functions
To classify at the value X=x , we need to see which of the pk(x) is largest. Taking logs, and discarding terms that do not depend on k , we see that this is equivalent to assigning x to the class with the largest discriminant score:
\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)
Note that δk(x) is a linear function of x.
If there are K=2 classes and π1=π2=0.5 , then one can see that the decision boundary is at
x = \frac{\mu_1 + \mu_2}{2}
Typically we don’t know these parameters; we just have the training data. In that case we simply estimate the parameters and plug them into the rule:

\hat\pi_k = \frac{n_k}{n}, \qquad \hat\mu_k = \frac{1}{n_k}\sum_{i: y_i = k} x_i, \qquad \hat\sigma^2 = \sum_{k=1}^K \frac{n_k - 1}{n - K}\,\hat\sigma_k^2,

where \hat\sigma_k^2 = \frac{1}{n_k - 1}\sum_{i: y_i = k}(x_i - \hat\mu_k)^2 is the usual formula for the estimated variance in the kth class.
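The sketch below carries out these estimates and the discriminant rule by hand for two simulated classes; the class means and the new point x_new are arbitrary illustrative choices.

```r
# Hand computation of the p = 1 LDA rule on simulated two-class data.
set.seed(1)
x1 <- rnorm(50, mean = -1.25)    # class 1 observations
x2 <- rnorm(50, mean =  1.25)    # class 2 observations
n1 <- length(x1); n2 <- length(x2)

mu1 <- mean(x1); mu2 <- mean(x2)                                     # mu_k hat
sigma2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)  # pooled variance
pi1 <- n1 / (n1 + n2); pi2 <- n2 / (n1 + n2)                         # pi_k hat

delta <- function(x, mu, pi_k) x * mu / sigma2 - mu^2 / (2 * sigma2) + log(pi_k)
x_new <- 0.3
which.max(c(delta(x_new, mu1, pi1), delta(x_new, mu2, pi2)))  # predicted class
```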
Linear Discriminant Analysis when p>1
Density:
f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}
Discriminant function:
\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k
Despite its complex form, \delta_k(x) = c_{k0} + c_{k1} x_1 + c_{k2} x_2 + \dots + c_{kp} x_p is a linear function.
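In practice one rarely codes this by hand; a hedged sketch with MASS::lda, again assuming the ISLR2 Default data:

```r
# Sketch: linear discriminant analysis with MASS::lda on ISLR2::Default.
library(MASS)
library(ISLR2)

lda_fit <- lda(default ~ balance + student, data = Default)
lda_fit$prior                                # estimated pi_k
lda_fit$means                                # estimated mu_k in each class
head(predict(lda_fit, Default)$posterior)    # Pr(Y = k | X = x)
```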
Illustration: p=2 and K=3 classes
Here π1=π2=π3=1/3.
The dashed lines are known as the Bayes decision boundaries. Were they known, they would yield the fewest misclassification errors, among all possible classifiers.
From δk(x) to probabilities
Once we have estimates δ^k(x) , we can turn these into estimates for class probabilities:
\Pr(Y = k \mid X = x) = \frac{e^{\hat\delta_k(x)}}{\sum_{l=1}^K e^{\hat\delta_l(x)}}
So classifying to the largest δ^k(x) amounts to classifying to the class for which Pr(Y=k∣X=x) is largest.
When K = 2, we classify to class 2 if Pr(Y = 2 ∣ X = x) ≥ 0.5, and to class 1 otherwise.
Types of errors
False positive rate: The fraction of negative examples that are classified as positive.
False negative rate: The fraction of positive examples that are classified as negative.
We produced this table by classifying to class Yes if
Pr(Default=Yes∣Balance,Student)≥0.5
We can change the two error rates by changing the threshold from 0.5 to some other value in [0,1] :
Pr(Default=Yes∣Balance,Student)≥threshold
and vary threshold.
Varying the threshold
In order to reduce the false negative rate, we may want to reduce the threshold to 0.1 or less.
The ROC plot displays both simultaneously.
Sometimes we use the AUC or area under the curve to summarize the overall performance. Higher AUC is good.
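A hedged sketch of lowering the threshold and computing the ROC curve and AUC with the pROC package, reusing the fit2 logistic model and Default data assumed earlier:

```r
# Sketch: change the threshold and summarize with ROC/AUC
# (assumes fit2 and ISLR2::Default from the earlier sketches, plus pROC).
library(pROC)

probs <- predict(fit2, type = "response")
table(Predicted = ifelse(probs >= 0.1, "Yes", "No"),
      Actual    = Default$default)            # error counts at threshold 0.1

roc_obj <- roc(Default$default, probs)
auc(roc_obj)                                  # area under the ROC curve
```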
Other forms of Discriminant Analysis
\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}
When fk(x) are Gaussian densities, with the same covariance matrix Σ in each class, this leads to linear discriminant analysis. By altering the forms for fk(x) , we get different classifiers.
With Gaussians but different Σk in each class, we get quadratic discriminant analysis.
With fk(x)=∏j=1pfjk(xj) (conditional independence model) in each class we get naïve Bayes. For Gaussian this means the Σk are diagonal.
Many other forms, by proposing specific density models for fk(x) , including nonparametric approaches.
Quadratic Discriminant Analysis
\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k
Because the Σk are different, the quadratic terms matter.
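A brief sketch with MASS::qda, using two quantitative predictors from the Default data assumed above:

```r
# Sketch: quadratic discriminant analysis, which estimates a separate
# covariance matrix Sigma_k per class (ISLR2::Default assumed).
library(MASS)
library(ISLR2)

qda_fit <- qda(default ~ balance + income, data = Default)
head(predict(qda_fit, Default)$class)    # predicted classes
```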
Naïve Bayes
Assumes features are independent in each class.
Useful when p is large, and so multivariate methods like QDA and even LDA break down.
Gaussian naïve Bayes assumes each Σk is diagonal:

\delta_k(x) \propto \log\left[\pi_k \prod_{j=1}^p f_{kj}(x_j)\right] = -\frac{1}{2}\sum_{j=1}^p \left[\frac{(x_j - \mu_{kj})^2}{\sigma_{kj}^2} + \log \sigma_{kj}^2\right] + \log \pi_k
Naïve Bayes can also be used for mixed feature vectors (qualitative and quantitative). If Xj is qualitative, replace fkj(xj) with a probability mass function (histogram) over the discrete categories.
Despite strong assumptions, naive Bayes often produces good classification results.
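A minimal sketch with the naiveBayes function from the e1071 package, mixing a quantitative and a qualitative feature from the Default data assumed above:

```r
# Sketch: naive Bayes with e1071::naiveBayes on ISLR2::Default,
# mixing a quantitative (balance) and a qualitative (student) feature.
library(e1071)
library(ISLR2)

nb_fit <- naiveBayes(default ~ balance + student, data = Default)
nb_fit$tables                                 # per-class models for each feature
head(predict(nb_fit, Default, type = "raw"))  # posterior class probabilities
```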
Logistic Regression versus LDA
For a two-class problem, one can show that for LDA