Welcome to yet another exciting series of blogs. In this series, we are going to talk about dimensionality reduction techniques. Our thinking throughout this series should be oriented towards optimization, and every technique discussed here will keep that in mind. This section does involve a bit of math, but to keep things inclusive, we shall keep it simple and straightforward, discussing only the concepts and how to perceive them.
Well, just to recapitulate from the last article, a dimension is nothing but an input variable or feature of a dataset. Suppose you build an employee dataset and incorporate as many as 15 features of each employee; the dataset could then be said to have 15 dimensions. And as we have also discussed, processing a large number of dimensions is not always favorable in terms of space and time. It can dramatically impact the performance of machine learning algorithms, a problem generally referred to as the "Curse of Dimensionality". "Large" is relative: it depends on the data, the computational resources you have, the deployment environment of the model, and much more. Thus, for large datasets, we looked at dimensionality reduction, i.e., reducing the number of features or variables to the set necessary for optimal usage.
Now, there is no free lunch, at least in computing! Dimensionality reduction doesn't come without tradeoffs: there is a good chance you will lose some of the information in the data. And there is no absolute measure of how much; it depends on the data and the implementation. So if you are fine with trading some loss of information for lower computational complexity, then sure, dimensionality reduction can help you out in some situations.
In this tutorial, we are going to start with one such technique, known as Principal Component Analysis. We shall look at its definition, the steps involved in it, and its benefits. So without any further ado, let's get the ball rolling...
Principal Component Analysis (PCA) is one such technique: we first build our own set of dimensions from the original dataset, do some analysis on them, and then translate our dataset to those new dimensions without losing much of the information in the data. Or at least, how much you lose is your choice. It enables us to create and use a reduced set of variables, called principal components. A reduced set is much easier to analyze and interpret. Studying a data set that requires the estimation of roughly 500 parameters may be difficult, but if we could reduce these to 5, it would certainly make our day.
There are many tutorials out there on the internet, as this is one of the most common techniques used for reducing complexity. I won't suggest reading all of them, as they tend either to get too involved in terms of complexity or too banal as you get along. But this is a good-to-read article.
The principal components of a collection of points in a real p-space are a sequence of p direction vectors, where the i-th vector is the direction of a line that best fits the data while being orthogonal to the first i − 1 vectors. Here, a best-fitting line is defined as one that minimizes the average squared distance from the points to the line. These directions constitute an orthonormal basis in which the individual dimensions of the data are linearly uncorrelated. Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.
Here is an outline of the important steps involved in implementing PCA:
- Standardizing the data
- Computing the Covariance Matrix
- Computing the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
- Extracting the Feature Vector
- Recasting the data along the principal component axes
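To make these steps concrete, here is a minimal end-to-end sketch in NumPy. The toy numbers and variable names are my own, purely for illustration; each step is unpacked in the sections that follow:

```python
import numpy as np

# Toy data: 6 observations x 3 features (hypothetical numbers)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.3],
              [2.3, 2.7, 0.6]])

# Standardize each feature (zero mean, unit variance)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# Eigendecomposition (eigh suits symmetric matrices; returns ascending order)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]           # sort by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Feature vector: keep the top-2 principal components
feature_vector = eigvecs[:, :2]

# Recast the data along the principal component axes
X_pca = X_std @ feature_vector
print(X_pca.shape)  # (6, 2)
```

In practice you would likely reach for a library implementation such as scikit-learn's PCA, but walking through the steps by hand is what builds the intuition.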
Let us take a brief look at each of these steps to get a clearer understanding of the topic without getting too mathematical.
When you have a large data set, there are many columns and a large number of records, and each variable fluctuates within a particular range. For example, the speed of a car cannot be negative and cannot cross a particular value either; it has to stay within a particular limit, e.g., 0-200 km/h. But the temperature of the engine fluctuates in another range, e.g., 0-50 °C. Similarly, there could be many other variables. In order to judge them with respect to each other, we must bring them to a common scale. For example, subtract from each value the mean of that feature and divide it by the standard deviation (a measure of the dispersion of a dataset relative to its mean).
Once the standardization is done, all the variables will be transformed to the same scale.
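Using the car example above, a quick sketch of this standardization (the telemetry numbers are hypothetical, chosen only to show the two different ranges):

```python
import numpy as np

# Hypothetical car telemetry: speed (km/h) and engine temperature (°C)
speed = np.array([40.0, 90.0, 120.0, 180.0, 60.0])
temp = np.array([10.0, 25.0, 35.0, 48.0, 15.0])

data = np.column_stack([speed, temp])

# Z-score standardization: subtract the mean, divide by the std deviation
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# Both columns now have mean ~0 and standard deviation 1,
# so speed and temperature can be compared on equal footing
print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```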
Have you ever had the experience of someone asking your birth year and then immediately asking your age? I guess so. Do both these questions provide equally significant information, or is one redundant in context? This is the kind of redundancy the covariance matrix helps us uncover within the data. It tries to figure out how closely the values are related to each other and establishes a relation between them; in other words, it tries to figure out the redundancy in the data. For p-dimensional data, you will have a
p*p covariance matrix. Also, the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower triangular portions are equal.
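The birth-year/age example can be made concrete: two perfectly redundant features show up as a ±1 off-diagonal entry in the covariance matrix of the standardized data. The numbers below are my own toy values:

```python
import numpy as np

# Toy example: birth year and age are perfectly redundant
year = np.array([1990.0, 1985.0, 2000.0, 1995.0, 1980.0])
age = 2023 - year  # age is fully determined by birth year

data = np.column_stack([year, age])
# Standardize (ddof=1 to match np.cov's default normalization)
std_data = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# p features -> p x p symmetric covariance matrix
cov = np.cov(std_data, rowvar=False)
print(cov)
# The off-diagonal entry is -1: the two features carry the same information
```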
Computing the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
I know this is where things have to get mathematical, but we will try to get around that by just understanding the concept rather than working out the math behind it.
The first thing to understand here is that we decompose the matrix. Why do we do it, you ask? We do it to break it into its constituent elements, which tell us more about what's significant in it. There are many approaches to doing this, but perhaps the most popular is the eigendecomposition of the covariance matrix, which yields eigenvectors and eigenvalues.
Now that we have the constituent vectors along with their transformation values, mathematical operations can be performed on them to squeeze maximum information out of the given data as linear combinations that form the principal components. These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components. So the idea is: 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on.
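A small sketch of how the eigenvalues quantify this "squeezing": each eigenvalue measures the variance captured along its eigenvector, so their ratios tell us how much information each component holds. The synthetic data below (two correlated features plus one independent one) is my own construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-feature data: the first two features are highly correlated
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)),
               base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)               # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # make it descending

# Each eigenvalue is the variance along its eigenvector; normalize to ratios
explained = eigvals / eigvals.sum()
print(explained)  # the first component dominates: two features are redundant
```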
Organizing information in principal components this way allows you to reduce dimensionality without losing much information, by discarding the components with low information and keeping the remaining components as your new variables. To put all this simply, just think of principal components as new axes that provide the best angle from which to see and evaluate the data, so that the differences between the observations are more visible.
Look at the image below; it comes up everywhere when you look up PCA on the internet.
As there are as many principal components as there are variables in the data, principal components are constructed in such a manner that the first principal component accounts for the largest possible variance in the data set. For example, assuming the scatter plot of our data set is as shown below, can we guess the first principal component? Yes, it's approximately the line that matches the purple marks, because it goes through the origin and it's the line on which the projection of the points (red dots) is the most spread out. Or, mathematically speaking, it's the line that maximizes the variance (the average of the squared distances from the projected points (red dots) to the origin). And we keep doing this until we have covered all the principal components in the data.
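We can check this "most spread out" property numerically: projecting the data onto the first principal component gives at least as much variance as projecting onto any other direction. The stretched 2-D point cloud below is my own toy example:

```python
import numpy as np

rng = np.random.default_rng(1)

# 2-D toy cloud, stretched along the x-axis, then centered
X = rng.normal(size=(300, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, np.argmax(eigvals)]  # first principal component direction

# Variance of projections onto the first PC vs. several other directions
var_pc1 = np.var(X @ pc1)
for angle in np.linspace(0, np.pi, 8, endpoint=False):
    d = np.array([np.cos(angle), np.sin(angle)])  # unit direction vector
    assert np.var(X @ d) <= var_pc1 + 1e-9        # no direction beats PC1
print(var_pc1)
```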
Now we have all the principal components arranged in descending order of significance, thanks to the eigenvalues, and we can select as many of them as we like to reduce the complexity of the data. In this step, what we do is choose whether to keep all these components or discard those of lesser significance (those with low eigenvalues), and form with the remaining ones a matrix of vectors that we call the feature vector.
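One common way to decide how many components to keep is an explained-variance threshold; the 95% cutoff and the eigenvalues below are my own illustrative choices, and the identity matrix stands in for real eigenvectors:

```python
import numpy as np

# Suppose these are the eigenvalues, already sorted in descending order
eigvals = np.array([4.2, 2.1, 0.4, 0.2, 0.1])
eigvecs = np.eye(5)  # placeholder eigenvectors, one per column (illustrative)

# Keep the smallest number of components explaining >= 95% of the variance
cum_ratio = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(cum_ratio, 0.95) + 1)

feature_vector = eigvecs[:, :k]  # p x k matrix of retained eigenvectors
print(k, feature_vector.shape)
```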
Until now, we were performing analysis on the data; we did not actually perform any kind of transformation on it. We now use the feature vector that we have deduced to recast the data along the principal components. This can be done by multiplying the transpose of the original (standardized) data set by the transpose of the feature vector. Here is a more mathematical version of it: FinalDataSet = FeatureVectorᵀ × StandardizedDataSetᵀ.
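A small sketch of this final recasting step (the toy numbers and the placeholder eigenvectors are mine). The transpose form matches the description above; the equivalent `X @ W` form is what you will usually see in code:

```python
import numpy as np

# Standardized data: 4 observations x 3 features (toy numbers)
X_std = np.array([[ 1.2, -0.5,  0.3],
                  [-0.7,  1.1, -0.9],
                  [ 0.1, -1.0,  1.4],
                  [-0.6,  0.4, -0.8]])

# Feature vector: top-2 eigenvectors as columns (illustrative values)
feature_vector = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [0.0, 0.0]])

# FinalData = FeatureVector^T x StandardizedData^T, transposed back to rows
final = (feature_vector.T @ X_std.T).T

# Equivalent, and more common in code: project with X_std @ feature_vector
assert np.allclose(final, X_std @ feature_vector)
print(final.shape)  # (4, 2)
```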
This is how the data is transformed and the complexity in the data is reduced using Principal Component Analysis.
I hope this was helpful and put things down in a simple way. Please feel free to reach out to me on Twitter @AashishLChaubey in case you need more clarity or have any suggestions.
Until next time...