
§6 Unsupervised Learning

  1. The Goals of Unsupervised Learning
    1. Another advantage
  • Principal Components Analysis
    1. Computation of Principal Components
    2. Geometry of PCA
    3. Further principal components
    4. PCA finds the hyperplane closest to the observations
    5. Scaling of the variables matters
    6. Proportion of Variance Explained
    7. How many principal components should we use?
  • Clustering
    1. PCA vs Clustering
    2. Two clustering methods
    3. $K$-means clustering
      1. How to define within-cluster variation?
      2. $K$-Means Clustering Algorithm
      3. Properties of the Algorithm
    4. Hierarchical Clustering
      1. Hierarchical Clustering Algorithm
      2. Types of Linkage
      3. Choice of Dissimilarity Measure
    5. Practical issues
  • Conclusions
    The Goals of Unsupervised Learning

    • The goal is to discover interesting things about the measurements: is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations?
    • We discuss two methods:
      • principal components analysis, a tool used for data visualization or data pre-processing before supervised techniques are applied, and
      • clustering, a broad class of methods for discovering unknown subgroups in data.

    Another advantage

    • It is often easier to obtain unlabeled data - from a lab instrument or a computer - than labeled data, which can require human intervention.

    Principal Components Analysis

    • PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated.

    • Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization.

    • The first principal component of a set of features $X_1, X_2, \ldots, X_p$ is the normalized linear combination of the features

      $$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \ldots + \phi_{p1} X_p$$

      that has the largest variance. By normalized, we mean that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.

    • We refer to the elements $\phi_{11}, \ldots, \phi_{p1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector, $\phi_1 = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})^T$.

    • We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.

    Computation of Principal Components

    • Suppose we have an $n \times p$ data set $\mathbf{X}$. Since we are only interested in variance, we assume that each of the variables in $\mathbf{X}$ has been centered to have mean zero (that is, the column means of $\mathbf{X}$ are zero).

    • We then look for the linear combination of the sample feature values of the form

      $$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \ldots + \phi_{p1} x_{ip}$$

      for $i = 1, \ldots, n$ that has largest sample variance, subject to the constraint that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.

    • Since each of the $x_{ij}$ has mean zero, then so does $z_{i1}$ (for any values of $\phi_{j1}$). Hence the sample variance of the $z_{i1}$ can be written as $\frac{1}{n} \sum_{i=1}^{n} z_{i1}^2$.

    • Plugging in the expression for $z_{i1}$, the first principal component loading vector solves the optimization problem

      $$\underset{\phi_{11}, \ldots, \phi_{p1}}{\operatorname{maximize}} \; \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1.$$

    • This problem can be solved via a singular value decomposition of the matrix $\mathbf{X}$, a standard technique in linear algebra (see the sketch below).

    • We refer to $Z_1$ as the first principal component, with realized values $z_{11}, \ldots, z_{n1}$.
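
    As a quick illustration, here is a minimal NumPy sketch (not part of the notes; the data are simulated purely for illustration) that computes the first principal component via the SVD and checks the variance identity.

    ```python
    # A sketch with simulated data: compute the first principal component of a
    # centered matrix X via the SVD and check the variance identity.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))              # hypothetical data: n = 100 observations, p = 5 features
    X = X - X.mean(axis=0)                     # center each column to have mean zero

    # SVD: X = U diag(d) V^T; the rows of Vt (columns of V) are the loading vectors
    U, d, Vt = np.linalg.svd(X, full_matrices=False)

    phi1 = Vt[0]                               # first principal component loading vector
    z1 = X @ phi1                              # scores z_{11}, ..., z_{n1}

    print(np.sum(phi1 ** 2))                   # ~1.0: the normalization constraint holds
    print(z1.var(ddof=0), d[0] ** 2 / len(X))  # sample variance of the scores equals d_1^2 / n
    ```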

    Geometry of PCA

    • The loading vector $\phi_1$ with elements $\phi_{11}, \phi_{21}, \ldots, \phi_{p1}$ defines a direction in feature space along which the data vary the most.
    • If we project the $n$ data points $x_1, \ldots, x_n$ onto this direction, the projected values are the principal component scores $z_{11}, \ldots, z_{n1}$ themselves.

    Further principal components

    • The second principal component is the linear combination of $X_1, \ldots, X_p$ that has maximal variance among all linear combinations that are uncorrelated with $Z_1$.

    • The second principal component scores $z_{12}, z_{22}, \ldots, z_{n2}$ take the form

      $$z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \ldots + \phi_{p2} x_{ip},$$

      where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12}, \phi_{22}, \ldots, \phi_{p2}$.

    • It turns out that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction $\phi_2$ to be orthogonal (perpendicular) to the direction $\phi_1$. And so on.

    • The principal component directions $\phi_1, \phi_2, \phi_3, \ldots$ are the ordered sequence of right singular vectors of the matrix $\mathbf{X}$, and the variances of the components are $\frac{1}{n}$ times the squares of the singular values. There are at most $\min(n-1, p)$ principal components.
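
    Continuing the hypothetical sketch above (reusing `X`, `d`, and `Vt`), these properties can be checked numerically: the loading vectors are orthonormal, each component variance equals $\frac{1}{n}$ times the squared singular value, and the scores are mutually uncorrelated.

    ```python
    # Continuing the sketch above (X, d, Vt): all principal components at once.
    Phi = Vt.T                                  # columns are the loading vectors phi_1, ..., phi_p
    Z = X @ Phi                                 # score matrix: column m holds z_{1m}, ..., z_{nm}

    print(np.allclose(Phi.T @ Phi, np.eye(X.shape[1])))         # loading vectors are orthonormal
    print(np.allclose(Z.var(axis=0, ddof=0), d ** 2 / len(X)))  # Var(Z_m) = d_m^2 / n
    print(np.allclose(np.corrcoef(Z.T), np.eye(X.shape[1]), atol=1e-8))  # scores mutually uncorrelated
    ```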

    PCA finds the hyperplane closest to the observations

    • The first principal component loading vector has a very special property: it defines the line in $p$-dimensional space that is closest to the $n$ observations (using average squared Euclidean distance as a measure of closeness).
    • The notion of principal components as the dimensions that are closest to the $n$ observations extends beyond just the first principal component.
    • For instance, the first two principal components of a data set span the plane that is closest to the $n$ observations, in terms of average squared Euclidean distance.
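
    A small numerical check of this closeness property, again with the simulated `X`, `Z`, `Phi`, and `rng` from the sketches above: the plane spanned by the first two loading vectors gives a smaller average squared distance than a randomly chosen two-dimensional subspace.

    ```python
    # Average squared distance from the observations to the plane spanned by the
    # first two loading vectors, versus a random two-dimensional subspace.
    M = 2
    X_hat = Z[:, :M] @ Phi[:, :M].T            # projection onto the first-two-PC plane
    pca_dist = np.mean(np.sum((X - X_hat) ** 2, axis=1))

    B, _ = np.linalg.qr(rng.normal(size=(X.shape[1], M)))   # random orthonormal basis
    rand_dist = np.mean(np.sum((X - X @ B @ B.T) ** 2, axis=1))
    print(pca_dist, rand_dist)                 # the PCA plane gives the smaller value
    ```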

    Scaling of the variables matters

    • If the variables are in different units, scaling each to have standard deviation equal to one is recommended.
    • If they are in the same units, you might or might not scale the variables.
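
    A hedged illustration of why scaling matters, reusing the simulated `X` from above: if one column is expressed in much smaller units (so its numbers are much larger), it dominates the first loading vector unless the columns are standardized.

    ```python
    # Rescale one column of the simulated X to mimic a change of units, then
    # compare the first loading vector with and without standardizing the columns.
    Xu = X.copy()
    Xu[:, 0] *= 1000                           # the same feature expressed in much smaller units
    Xu = Xu - Xu.mean(axis=0)

    _, _, Vt_raw = np.linalg.svd(Xu, full_matrices=False)
    _, _, Vt_std = np.linalg.svd(Xu / Xu.std(axis=0), full_matrices=False)
    print(np.abs(Vt_raw[0]).round(3))          # dominated by the rescaled column
    print(np.abs(Vt_std[0]).round(3))          # more balanced after scaling to unit standard deviation
    ```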

    Proportion of Variance Explained

    • To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one.

    • The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as

      $$\sum_{j=1}^{p} \operatorname{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2,$$

      and the variance explained by the $m$th principal component is

      $$\operatorname{Var}(Z_m) = \frac{1}{n} \sum_{i=1}^{n} z_{im}^2.$$

    • It can be shown that $\sum_{j=1}^{p} \operatorname{Var}(X_j) = \sum_{m=1}^{M} \operatorname{Var}(Z_m)$, with $M = \min(n-1, p)$.

    • Therefore, the PVE of the $m$th principal component is given by the positive quantity between 0 and 1

      $$\frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}.$$
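
    With the simulated `X` and singular values `d` from the earlier sketch, the PVEs can be computed directly:

    ```python
    # PVE of each component from the quantities computed earlier.
    total_var = np.sum(X ** 2) / len(X)        # sum_j Var(X_j) for the centered X
    pve = (d ** 2 / len(X)) / total_var        # Var(Z_m) divided by the total variance
    print(pve.round(3), pve.sum())             # each PVE lies in [0, 1]; together they sum to 1
    ```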

    How many principal components should we use?

    If we use principal components as a summary of our data, how many components are sufficient?

    • No simple answer to this question, as cross-validation is not available for this purpose.
      • Why not?
      • When could we use cross-validation to select the number of components?
    • A “scree plot” of the PVE against the component number can be used as a guide: we look for an “elbow” (see the sketch below).
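
    A minimal plotting sketch, assuming matplotlib and the `pve` array computed above:

    ```python
    # Scree plot: PVE and cumulative PVE against the component number.
    import matplotlib.pyplot as plt

    m = np.arange(1, len(pve) + 1)
    plt.plot(m, pve, marker="o", label="PVE")
    plt.plot(m, np.cumsum(pve), marker="o", label="Cumulative PVE")
    plt.xlabel("Principal component")
    plt.ylabel("Proportion of variance explained")
    plt.legend()
    plt.show()
    ```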

    Clustering

    • Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set.
    • We seek a partition of the data into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other.
    • To make this concrete, we must define what it means for two or more observations to be similar or different.
    • Indeed, this is often a domain-specific consideration that must be made based on knowledge of the data being studied.

    PCA vs Clustering

    • PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance.
    • Clustering looks for homogeneous subgroups among the observations.

    Two clustering methods

    • In **$K$-means clustering**, we seek to partition the observations into a pre-specified number of clusters.
    • In hierarchical clustering, we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a dendrogram, that allows us to view at once the clusterings obtained for each possible number of clusters, from $1$ to $n$.

    $K$-means clustering

    • Let $C_1, \ldots, C_K$ denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:

      1. $C_1 \cup C_2 \cup \ldots \cup C_K = \{1, \ldots, n\}$. In other words, each observation belongs to at least one of the $K$ clusters.
      2. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.

      For instance, if the $i$th observation is in the $k$th cluster, then $i \in C_k$.

    • The idea behind $K$-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible.

    • The within-cluster variation for cluster $C_k$ is a measure $\mathrm{WCV}(C_k)$ of the amount by which the observations within a cluster differ from each other.

    • Hence we want to solve the problem

      $$\underset{C_1, \ldots, C_K}{\operatorname{minimize}} \left\{ \sum_{k=1}^{K} \mathrm{WCV}(C_k) \right\}.$$

    • In words, this formula says that we want to partition the observations into $K$ clusters such that the total within-cluster variation, summed over all $K$ clusters, is as small as possible.

    How to define within-cluster variation?

    • Typically we use Euclidean distance:

      $$\mathrm{WCV}(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2,$$

      where $|C_k|$ denotes the number of observations in the $k$th cluster.

    • The optimization problem that defines $K$-means clustering is therefore

      $$\underset{C_1, \ldots, C_K}{\operatorname{minimize}} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\}.$$
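
    A minimal sketch of this objective in NumPy (illustrative only, with simulated data and hypothetical helper names `wcv` and `total_wcv`):

    ```python
    # Within-cluster variation and the K-means objective for a given assignment.
    import numpy as np

    def wcv(Xk):
        """WCV(C_k): (1/|C_k|) times the sum over all pairs (i, i') in C_k of
        squared Euclidean distances between observations."""
        if len(Xk) == 0:
            return 0.0
        diffs = Xk[:, None, :] - Xk[None, :, :]        # all pairwise differences
        return np.sum(diffs ** 2) / len(Xk)

    def total_wcv(X, labels, K):
        """The K-means objective: WCV summed over the K clusters."""
        return sum(wcv(X[labels == k]) for k in range(K))

    rng_c = np.random.default_rng(1)
    Xc = rng_c.normal(size=(60, 2))                    # hypothetical data to cluster
    labels0 = rng_c.integers(0, 3, size=len(Xc))       # an arbitrary assignment into K = 3 clusters
    print(total_wcv(Xc, labels0, K=3))
    ```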

    $K$-Means Clustering Algorithm

    1. Randomly assign a number, from $1$ to $K$, to each of the observations. These serve as initial cluster assignments for the observations.
    2. Iterate until the cluster assignments stop changing:
      1. For each of the $K$ clusters, compute the cluster centroid. The $k$th cluster centroid is the vector of the $p$ feature means for the observations in the $k$th cluster.
      2. Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
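
    A short implementation sketch of these two steps (Lloyd's algorithm), reusing the simulated `Xc` and the helper `total_wcv` from the previous block; empty clusters are not handled here.

    ```python
    # Lloyd's algorithm: steps 1, 2(a), and 2(b) above, stopping when the
    # assignments no longer change. Empty clusters are not handled in this sketch.
    def kmeans(X, K, rng):
        labels = rng.integers(0, K, size=len(X))                 # step 1: random initial assignment
        while True:
            centroids = np.array([X[labels == k].mean(axis=0)    # step 2(a): per-cluster feature means
                                  for k in range(K)])
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            new_labels = dists.argmin(axis=1)                     # step 2(b): nearest centroid
            if np.array_equal(new_labels, labels):
                return labels, centroids
            labels = new_labels

    labels, centroids = kmeans(Xc, K=3, rng=np.random.default_rng(2))
    print(total_wcv(Xc, labels, K=3))          # no larger than the objective at the random start
    ```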

    Properties of the Algorithm

    • This algorithm is guaranteed to decrease the value of the objective at each step. Note that

      $$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2,$$

      where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean for feature $j$ in cluster $C_k$.

    • However, it is not guaranteed to give the global minimum.
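
    A quick numerical check of this identity, using `wcv`, `Xc`, and the `labels` found above:

    ```python
    # The pairwise form of WCV equals twice the sum of squared deviations from the centroid.
    Xk = Xc[labels == 0]                                      # one of the clusters found above
    centroid_form = 2 * np.sum((Xk - Xk.mean(axis=0)) ** 2)
    print(np.isclose(wcv(Xk), centroid_form))                 # True: the two forms agree
    ```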

    Hierarchical Clustering

    • $K$-means clustering requires us to pre-specify the number of clusters $K$. This can be a disadvantage (later we discuss strategies for choosing $K$).
    • Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of $K$.
    • In this section, we describe bottom-up or agglomerative clustering. This is the most common type of hierarchical clustering, and refers to the fact that a dendrogram is built starting from the leaves and combining clusters up to the trunk.

    Hierarchical Clustering Algorithm

    The approach in words:

    • Start with each point in its own cluster.
    • Identify the closest two clusters and merge them.
    • Repeat.
    • Stop when all points are in a single cluster.

    (Figure: stat3612-ch6-1.png)
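
    A hedged sketch of agglomerative clustering with SciPy (simulated data and illustrative names): build the merge history bottom-up, draw the dendrogram, and cut it at a chosen number of clusters.

    ```python
    # Agglomerative clustering of simulated two-dimensional data with SciPy.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

    rng = np.random.default_rng(3)
    Xh = np.vstack([rng.normal(loc=c, scale=0.4, size=(20, 2))   # three loose groups
                    for c in [(0, 0), (3, 0), (0, 3)]])

    Zh = linkage(Xh, method="complete", metric="euclidean")      # merge history, one row per fusion
    dendrogram(Zh)                                               # draw the dendrogram
    plt.show()

    labels_h = fcluster(Zh, t=3, criterion="maxclust")           # cut the tree into 3 clusters
    ```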

    Types of Linkage

    • Complete

      Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

    • Single

      Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.

    • Average

      Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

    • Centroid

      Dissimilarity between the centroid for cluster A (a mean vector of length $p$) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
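
    Continuing the SciPy sketch above, the four linkage types correspond to the `method` argument of `linkage`; a quick monotonicity check on the merge heights illustrates the possible inversions under centroid linkage.

    ```python
    # The same data under the four linkage types; a monotonicity check on the
    # merge heights hints at possible inversions under centroid linkage.
    for method in ("complete", "single", "average", "centroid"):
        Zm = linkage(Xh, method=method)                 # Euclidean dissimilarity throughout
        heights = Zm[:, 2]                              # merge heights from bottom to top
        print(method, bool(np.all(np.diff(heights) >= 0)))
    # Complete, single, and average linkage give monotone heights; centroid
    # linkage may print False here, i.e. an inversion in the dendrogram.
    ```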

    Choice of Dissimilarity Measure

    • So far we have used Euclidean distance.
    • An alternative is correlation-based distance which considers two observations to be similar if their features are highly correlated.
    • This is an unusual use of correlation, which is normally computed between variables; here it is computed between the observation profiles for each pair of observations.
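
    A sketch of correlation-based dissimilarity with SciPy's `pdist`, reusing `rng` and `linkage` from above; the wide data matrix is simulated, since correlation between observation profiles is only meaningful with several features.

    ```python
    # Correlation-based dissimilarity between observation profiles via pdist.
    from scipy.spatial.distance import pdist, squareform

    X_wide = rng.normal(size=(30, 10))                  # simulated: 30 observations, 10 features
    d_corr = pdist(X_wide, metric="correlation")        # 1 - correlation for each pair of observations
    D_corr = squareform(d_corr)                         # as a 30 x 30 dissimilarity matrix
    Z_corr = linkage(d_corr, method="average")          # hierarchical clustering on this dissimilarity
    ```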

    Practical issues

    • Should the observations or features first be standardized in some way? For instance, maybe the variables should be centered to have mean zero and scaled to have standard deviation one.
    • In the case of hierarchical clustering,
      • What dissimilarity measure should be used?
      • What type of linkage should be used?
    • How many clusters should we choose (in both $K$-means and hierarchical clustering)? This is a difficult problem, and there is no agreed-upon method.

    Conclusions

    • Unsupervised learning is important for understanding the variation and grouping structure of a set of unlabeled data, and can be a useful pre-processor for supervised learning.
    • It is intrinsically more difficult than supervised learning because there is no gold standard (like an outcome variable) and no single objective (like test set accuracy).
      • It is an active field of research, with many recently developed tools such as self-organizing maps, independent component analysis and spectral clustering.

    — Jul 15, 2022


    §6 Unsupervised Learning by Lu Meng is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at About.