PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated.
Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization.
The first principal component of a set of features $X_1, X_2, \dots, X_p$ is the normalized linear combination of the features
$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p$$
that has the largest variance. By normalized, we mean that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.
We refer to the elements $\phi_{11}, \dots, \phi_{p1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector, $\phi_1 = (\phi_{11}\ \phi_{21}\ \cdots\ \phi_{p1})^T$.
We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.
Suppose we have an $n \times p$ data set $\mathbf{X}$. Since we are only interested in variance, we assume that each of the variables in $\mathbf{X}$ has been centered to have mean zero (that is, the column means of $\mathbf{X}$ are zero).
We then look for the linear combination of the sample feature values of the form
$$z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + \cdots + \phi_{p1}x_{ip}$$
for $i = 1, \dots, n$ that has largest sample variance, subject to the constraint that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.
Since each of the $x_{ij}$ has mean zero, then so does $z_{i1}$ (for any values of $\phi_{j1}$). Hence the sample variance of the $z_{i1}$ can be written as $\frac{1}{n}\sum_{i=1}^{n} z_{i1}^2$.
Plugging this in, the first principal component loading vector solves the optimization problem
$$\underset{\phi_{11},\dots,\phi_{p1}}{\text{maximize}}\ \frac{1}{n}\sum_{i=1}^{n}\Bigl(\sum_{j=1}^{p}\phi_{j1}x_{ij}\Bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^2 = 1.$$
This problem can be solved via a singular value decomposition of the matrix $\mathbf{X}$, a standard technique in linear algebra.
We refer to $Z_1$ as the first principal component, with realized values $z_{11}, \dots, z_{n1}$.
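As a quick illustration (not from the text), here is a minimal NumPy sketch that computes the first principal component of a small synthetic data set via the SVD; the data set, the variable names, and the use of NumPy are assumptions made for the example.

```python
import numpy as np

# Toy data set (an assumption for this sketch): n = 100 observations, p = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)                 # center each column to have mean zero

# SVD of the centered data matrix; the rows of Vt are the loading vectors
U, s, Vt = np.linalg.svd(X, full_matrices=False)

phi1 = Vt[0]                           # first principal component loading vector
z1 = X @ phi1                          # realized scores z_{11}, ..., z_{n1}

print(np.sum(phi1**2))                 # ~1.0: the normalization constraint holds
print(np.mean(z1**2))                  # sample variance (1/n) * sum_i z_{i1}^2
```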
The second principal component is the linear combination of $X_1, \dots, X_p$ that has maximal variance among all linear combinations that are uncorrelated with $Z_1$.
The second principal component scores $z_{12}, z_{22}, \dots, z_{n2}$ take the form
$$z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + \cdots + \phi_{p2}x_{ip},$$
where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12}, \phi_{22}, \dots, \phi_{p2}$.
It turns out that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction $\phi_2$ to be orthogonal (perpendicular) to the direction $\phi_1$. And so on.
The principal component directions $\phi_1, \phi_2, \phi_3, \dots$ are the ordered sequence of right singular vectors of the matrix $\mathbf{X}$, and the variances of the components are $\frac{1}{n}$ times the squares of the singular values. There are at most $\min(n-1, p)$ principal components.
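Continuing the sketch above (again on assumed synthetic data), the following checks numerically that the loading vectors returned by the SVD are mutually orthogonal and that the component variances equal $\frac{1}{n}$ times the squared singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                               # column means zero

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Phi = Vt.T                                           # columns are phi_1, ..., phi_p
Z = X @ Phi                                          # scores for all components

# Loading vectors are mutually orthogonal (and each has unit length)
print(np.allclose(Phi.T @ Phi, np.eye(p)))           # True

# Component variances equal (1/n) times the squared singular values
print(np.allclose(Z.var(axis=0, ddof=0), s**2 / n))  # True
```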
To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one.
The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as
$$\sum_{j=1}^{p} \operatorname{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n}\sum_{i=1}^{n} x_{ij}^2,$$
and the variance explained by the $m$th principal component is
$$\operatorname{Var}(Z_m) = \frac{1}{n}\sum_{i=1}^{n} z_{im}^2.$$
It can be shown that $\sum_{j=1}^{p} \operatorname{Var}(X_j) = \sum_{m=1}^{M} \operatorname{Var}(Z_m)$, with $M = \min(n-1, p)$.
Therefore, the PVE of the $m$th principal component is given by the positive quantity between 0 and 1
$$\text{PVE}_m = \frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p}\sum_{i=1}^{n} x_{ij}^2}.$$
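Here is a short sketch of how the PVE could be computed from the scores, again on an assumed synthetic data set.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 5))
X = X - X.mean(axis=0)                 # center each variable

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                           # scores z_{im} for every component m

total_ss = np.sum(X**2)                # sum_j sum_i x_ij^2 (n times the total variance)
pve = np.sum(Z**2, axis=0) / total_ss  # PVE of each component: the 1/n factors cancel
print(pve)                             # each entry lies between 0 and 1
print(pve.sum())                       # the PVEs sum to 1
print(np.cumsum(pve))                  # cumulative PVE
```

The cumulative PVE printed at the end is one common way to judge how many components to keep, which is the question taken up next.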
If we use principal components as a summary of our data, how many components are sufficient? There is no single accepted answer; in practice we typically examine the cumulative PVE or a scree plot of the PVE and look for an elbow.
We now turn from dimension reduction to clustering. In $K$-means clustering, we seek to partition the $n$ observations into $K$ distinct, non-overlapping clusters. Let $C_1, \dots, C_K$ denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:
1. $C_1 \cup C_2 \cup \cdots \cup C_K = \{1, \dots, n\}$; each observation belongs to at least one of the $K$ clusters.
2. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$; the clusters are non-overlapping.
For instance, if the $i$th observation is in the $k$th cluster, then $i \in C_k$.
The idea behind $K$-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible.
The within-cluster variation for cluster $C_k$ is a measure $W(C_k)$ of the amount by which the observations within a cluster differ from each other.
Hence we want to solve the problem
$$\underset{C_1,\dots,C_K}{\text{minimize}}\ \sum_{k=1}^{K} W(C_k).$$
In words, this formula says that we want to partition the observations into clusters such that the total within-cluster variation, summed over all clusters, is as small as possible.
Typically we use Euclidean distance,
$$W(C_k) = \frac{1}{|C_k|}\sum_{i, i' \in C_k}\sum_{j=1}^{p}(x_{ij} - x_{i'j})^2,$$
where $|C_k|$ denotes the number of observations in the $k$th cluster.
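As an illustration, here is one way the within-cluster variation above might be computed in NumPy; the helper name `within_cluster_variation`, the toy data, and the chosen cluster indices are hypothetical.

```python
import numpy as np

def within_cluster_variation(X, idx):
    """W(C_k) for the cluster whose observation indices are in idx:
    (1/|C_k|) times the sum of squared Euclidean distances over all
    pairs of observations in the cluster."""
    Xk = X[idx]
    diffs = Xk[:, None, :] - Xk[None, :, :]   # all pairwise feature differences
    return np.sum(diffs**2) / len(idx)

# Toy example: a hypothetical cluster containing observations 0, 3, and 7
rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))
print(within_cluster_variation(X, [0, 3, 7]))
```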
Combining these, the optimization problem that defines $K$-means clustering is
$$\underset{C_1,\dots,C_K}{\text{minimize}}\ \sum_{k=1}^{K}\frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^{p}(x_{ij} - x_{i'j})^2.$$
The standard algorithm first assigns each observation to one of the $K$ clusters at random, and then iterates until the cluster assignments stop changing: compute each cluster's centroid, then reassign every observation to the cluster whose centroid is closest. This algorithm is guaranteed to decrease the value of the objective at each step. Note that
$$\frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^{p}(x_{ij} - x_{i'j})^2 = 2\sum_{i \in C_k}\sum_{j=1}^{p}(x_{ij} - \bar{x}_{kj})^2,$$
where $\bar{x}_{kj} = \frac{1}{|C_k|}\sum_{i \in C_k} x_{ij}$ is the mean for feature $j$ in cluster $C_k$; however, the algorithm is not guaranteed to give the global minimum.
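Below is a rough sketch of the algorithm just described, written as a small NumPy function; the function name `kmeans`, the toy two-cluster data, and the iteration cap are assumptions, and the sketch assumes no cluster ever becomes empty.

```python
import numpy as np

def kmeans(X, K, n_iter=50, seed=0):
    """Plain K-means sketch: random initial assignment, then alternate
    between computing centroids and reassigning each observation to the
    nearest centroid.  Assumes no cluster ever becomes empty."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=X.shape[0])             # random initial clusters
    for _ in range(n_iter):
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)                  # closest centroid
        if np.array_equal(new_labels, labels):             # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

# Toy two-cluster data (an assumption for this sketch)
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])
labels, centroids = kmeans(X, K=2)
```

Because the result depends on the random initial assignment, the algorithm finds a local rather than global optimum, so it is common to run it from several random starts and keep the solution with the smallest objective.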
Hierarchical clustering takes a different, agglomerative approach. In words: start with each observation in its own cluster; at each step, fuse the two clusters that are least dissimilar; repeat until all observations belong to a single cluster, yielding a dendrogram. The dissimilarity between two clusters of observations is determined by the choice of linkage:
Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.
Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length $p$) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
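For completeness, here is a small sketch of agglomerative clustering under each of the four linkages, using SciPy's `scipy.cluster.hierarchy` module on a synthetic data set; the data and the choice to cut the tree into two clusters are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic two-group data (an assumption for this sketch)
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(20, 2)), rng.normal(4, 1, size=(20, 2))])

# Agglomerative clustering under each of the four linkages discussed above
for method in ["complete", "single", "average", "centroid"]:
    Z = linkage(X, method=method)                     # the (n-1) x 4 merge history (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes
```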