I’m currently in the process of writing the background chapter on Machine Learning for my dissertation. In the context of doing that, I took a closer look at a widely used feature extraction technique called “Principal Component Analysis” (PCA). It can be described either on the intuitive level (“it finds orthogonal directions of greatest variance in the data”) or on the mathematical level (“it computes an eigenvalue decomposition of the data’s covariance matrix”). What I found most difficult to understand was how these two descriptions are related to each other. This is essentially what I want to share in today’s blog post.
Dimensionality Reduction in General
But first some context about machine learning and dimensionality reduction. In machine learning, a data set is usually given by a matrix X with n rows and m columns. Each row corresponds to one example and each column corresponds to one feature (i.e., a measurable property of an example, e.g., the width of a leaf, the relative sugar content of a yogurt, or the maximum speed of a car). If we deal with a classification task, each example is furthermore annotated with its target class (e.g., maple leaf, fruit yogurt, or sports car). The task of a machine learning algorithm is now to find a rule associating the feature values of a given example with its target class. This classification rule is typically not known a priori – otherwise we could just write down the rule and be done with it.
If the machine learning algorithm is supposed to find a useful rule, one needs to ensure that it has access to all relevant features. For example, the maximum speed might be an informative feature when deciding whether a given car is a sports car or not, while the number of channels its radio can receive is probably not. If one is not sure about the classification rule, one may thus be tempted to just list every possible feature one can think of and let the machine learning algorithm figure out which ones are relevant.
Unfortunately, this is where the curse of dimensionality comes into play: If we have too many features and only a limited number of examples, the machine learning algorithm may have trouble learning anything useful. With m features that can take t different values each, the number of possible combinations of feature values is t^m. If we assume that the machine learning algorithm needs to see the most important combinations at least once, we need a lot of examples. In fact, the number of required examples grows exponentially with the number of features – which is bad, since we typically cannot collect examples on such a large scale.
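To get a feeling for how fast t^m grows, here is a tiny Python sketch, assuming a hypothetical t = 10 distinct values per feature:

```python
# Number of possible feature-value combinations for m features with
# t distinct values each (t = 10 is just an illustrative assumption).
t = 10
for m in (2, 5, 10, 20):
    print(f"m = {m:2d} features -> t^m = {t**m:,} combinations")
```

Already at 20 features we would need on the order of 10^20 examples just to see every combination once – clearly infeasible.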
In order to circumvent this problem, we can apply dimensionality reduction techniques which drastically reduce the number of features. In feature selection, one aims at selecting a small subset of the given features without changing them, while in feature extraction, one defines a small set of new features based on the existing ones. PCA is one of the most important feature extraction algorithms; it defines each new feature as a linear combination of the existing features. Dimensionality reduction is not only helpful for improving the performance of a machine learning algorithm, but it can also help to better understand the data set.
Finding the Direction with Largest Variance
I’ll try to keep the math as simple as possible here, following roughly the explanation by Bro and Smilde [1]. So let’s start by looking for a new feature T (with upper case indicating that we’re talking about a random variable) with a vector t of feature values, one entry per example (with lower case indicating that we’re talking about actual values). We assume that this novel feature is a linear combination of the existing features F1, …, Fm, whose values f1, …, fm correspond to the columns of our data matrix X. We can write this as t = w1 · f1 + … + wm · fm = Xw with a weight vector w. Since the scaling of t does not matter, we simply require w to be of length 1.
We want to find the feature T with the largest possible variance. The variance Var(T) of a random variable T is defined as the expected squared difference of T from its expected value: Var(T) = E[(E[T] – T)²]
One important assumption of PCA is that the features have been normalized, i.e., that their expected value is zero and that their variance equals one. The former requirement tells us that the expected value of the new feature T is also zero (since T is just a weighted sum of the existing features). Since E[T] = 0, we get that Var(T) = E[(0 – T)²] = E[T²].
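In practice, this normalization is simply a per-column standardization of the data matrix. A minimal sketch in Python/NumPy, using a hypothetical data matrix X with 100 examples and 4 features (filled with random numbers purely for illustration):

```python
import numpy as np

# Hypothetical data matrix: 100 examples, 4 features (random for illustration).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

# Normalize each feature (column): zero mean, unit variance.
X = (X - X.mean(axis=0)) / X.std(axis=0)

print(X.mean(axis=0))  # all entries are (numerically) zero
print(X.var(axis=0))   # all entries are one
```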
Since E[T²] can be approximated by the average of the squared entries of t over the data set, i.e., by t’t / n (where ’ is used to denote the transposition operator), and since n is fixed, we can maximize Var(T) by maximizing t’t. Since t = Xw, we need to maximize w’X’Xw by choosing an appropriate w.
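The following sketch (again with a random, normalized X) just checks numerically that the sample variance of the new feature t = Xw is indeed w’X’Xw divided by n, for an arbitrary unit-length weight vector w:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # normalized as assumed above

w = rng.normal(size=3)
w = w / np.linalg.norm(w)                  # unit-length weight vector

t = X @ w                                  # values of the new feature T
n = X.shape[0]
print(np.var(t))                           # sample variance of T
print((t @ t) / n)                         # t't / n gives the same value
print((w @ X.T @ X @ w) / n)               # ... which equals w'X'Xw / n
```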
Computing an Eigenvalue Decomposition
But how can we find this optimal weight vector w? This is where the eigenvalue decomposition enters the stage. In general, the eigenvalue decomposition of a symmetric n x n matrix A is given by the formula A = Q Λ Q’, where Λ is an n x n diagonal matrix of eigenvalues and Q is an n x n matrix whose columns qi correspond to the eigenvectors associated with the eigenvalues λi.
But what does that mean? Well, the formula for the eigenvalue decomposition can be reformulated as AQ = QΛ (which uses the fact that Q’Q = I, i.e., that the eigenvectors are of unit length and orthogonal to each other). Reading this column by column gives Aqi = λi qi: for each eigenvector qi it does not matter whether we multiply it with its eigenvalue λi or with the whole matrix A. In other words, multiplying qi with A is equivalent to scaling qi with a factor of λi.
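Here is a small NumPy sketch of this property, using a random symmetric matrix A for illustration (np.linalg.eigh is the standard routine for symmetric matrices and returns the eigenvalues in ascending order):

```python
import numpy as np

# Random symmetric matrix, purely for illustration.
rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
A = B + B.T

eigenvalues, Q = np.linalg.eigh(A)   # columns of Q are the eigenvectors q_i

# A Q = Q Λ: multiplying an eigenvector by A just scales it by its eigenvalue.
print(np.allclose(A @ Q, Q @ np.diag(eigenvalues)))        # True
print(np.allclose(A @ Q[:, 0], eigenvalues[0] * Q[:, 0]))  # True

# Q'Q = I: the eigenvectors are of unit length and orthogonal to each other.
print(np.allclose(Q.T @ Q, np.eye(4)))                     # True
```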
So how can this help us in maximizing w’X’Xw? Note that X’X is a symmetric m x m matrix, so the decomposition from above applies. Since any unit-length vector w can be written as a combination of the orthonormal eigenvectors of X’X, the term w’X’Xw becomes a weighted average of the eigenvalues – and this average is largest when all the weight is put on the largest eigenvalue. The vector w which maximizes w’X’Xw is therefore simply the eigenvector qi of X’X which is associated with the largest eigenvalue λi. In this case, we get that w’X’Xw = w’(λi w) = λi · w’w = λi (since w is of unit length). Luckily, computing the eigenvalue decomposition is a standard problem in linear algebra for which efficient algorithms are readily available.
Thus, by using the eigenvalue decomposition of X’X, we can find a novel feature corresponding to the direction of largest variance in the data. Moreover, if we look for the direction with the second largest variance (among all directions orthogonal to the first one), we can simply choose the eigenvector qj of X’X which is associated with the second largest eigenvalue λj. A nice bonus is that the eigenvectors qi and qj are guaranteed to be orthogonal to each other. This also means that the resulting features T1 and T2 are uncorrelated with each other, which is another nice property of PCA.
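Putting the pieces together, here is a minimal sketch that extracts the first two PCA directions from a (random, normalized) data matrix X via the eigenvalue decomposition of X’X and checks the properties mentioned above:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # normalized data matrix

eigenvalues, Q = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
order = np.argsort(eigenvalues)[::-1]      # re-sort in descending order
eigenvalues, Q = eigenvalues[order], Q[:, order]

w1, w2 = Q[:, 0], Q[:, 1]    # weight vectors of the first two novel features
t1, t2 = X @ w1, X @ w2      # their feature values for every example

print(np.var(t1) >= np.var(t2))    # True: T1 captures more variance than T2
print(np.isclose(w1 @ w2, 0.0))    # True: the weight vectors are orthogonal
print(np.isclose(np.corrcoef(t1, t2)[0, 1], 0.0, atol=1e-8))  # True: T1, T2 uncorrelated
```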
Geometric Interpretation
In order to supplement the mathematical idea with some visualization, let’s take a look at a geometric interpretation of PCA.
Figure 1 shows an example of a two-dimensional feature space along with some data points and the two weight vectors identified by PCA. As we can see, the two weight vectors are orthogonal to each other, with the first weight vector capturing the direction of largest variance in the data. The feature value of a given data point for one of the novel features can be computed by projecting this data point onto the respective weight vector. Moreover, since the weight vectors are orthogonal to each other, the transformation from the original features to the novel features can be interpreted as a rotation of the feature space such that the axes align with the directions of greatest variance.
If we have m features, a PCA can return up to m novel features, since X’X is of size m x m and therefore has m eigenvectors. Keeping all of them is, however, not helpful if we want to reduce the dimensionality of our feature space. Since the novel features are sorted by the amount of variance they capture, one can simply pick the first k < m novel features in order to define the reduced feature space.
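As a final sketch, here is this dimensionality reduction step: we keep only the k eigenvectors with the largest eigenvalues and project the data onto them (again with a random, normalized X and a hypothetical target dimension of k = 3):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0)

eigenvalues, Q = np.linalg.eigh(X.T @ X)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, Q = eigenvalues[order], Q[:, order]

k = 3
W = Q[:, :k]         # m x k matrix containing the first k weight vectors
X_reduced = X @ W    # n x k matrix of novel feature values

# Fraction of the total variance captured by the k novel features.
print(eigenvalues[:k].sum() / eigenvalues.sum())
```

In practice one would of course rely on a library implementation such as sklearn.decomposition.PCA, which, for data that has already been centered, should yield the same projection up to the sign of the weight vectors.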
PCA and Conceptual Spaces
How can we relate PCA to conceptual spaces? In his book, Gärdenfors [2, Section 6.5] makes the following argument concerning the origin of the quality dimensions based on perceptual input (emphasis in original):
The prime problem is that the information received by the receptors is too rich and too unstructured. What is needed is some way of transforming and organizing the input into a mode that can be handled on the conceptual or symbolic level. This basically involves finding a more economic form of representation: going from the subconceptual to the conceptual level usually involves a reduction of the number of dimensions that are represented […].
Thus, dimensionality reduction is needed to go from raw perceptual input (e.g., the pixels of an image) to the quality dimensions of a conceptual space. Since PCA is one of the standard dimensionality reduction tools in machine learning, it seems to be a good fit at first sight. However, it has two major drawbacks: Firstly, the extracted directions of greatest variance do not necessarily have a clear meaning. This is in conflict with the assumption that the quality dimensions of a conceptual space are interpretable. Secondly, the transformation used by PCA is linear and might thus not be powerful enough for going from raw perceptual input to quality dimensions in semantic similarity spaces. For instance, computing a linear combination of pixel values may not be sufficient for capturing global shape structure.
Therefore, more complex dimensionality reduction techniques are needed for grounding conceptual spaces in perceptual data. This is essentially the approach I follow with my research involving artificial neural networks. Nevertheless, PCA is a very important feature extraction method, and it certainly doesn’t hurt to know a little bit about its underlying mechanisms.
References
[1] Bro, R. & Smilde, A. K.: “Principal Component Analysis.” Anal. Methods, The Royal Society of Chemistry, 2014, 6, 2812–2831.
[2] Gärdenfors, P.: “Conceptual Spaces: The Geometry of Thought.” MIT Press, 2000.