In machine learning classification problems, the final classification often depends on a large number of factors. These factors are variables called features. The more features there are, the harder it becomes to visualize the training set and work with it. Sometimes, most of these features are correlated, and hence redundant.
Some algorithms, such as KNN and decision trees, do not perform well on data with a very large number of features. Reducing the number of features can help these algorithms perform better. We also cannot visualize data in more than three dimensions, so reducing the data to 2D or 3D lets us plot it and observe patterns more clearly.
Dimensionality reduction also reduces multicollinearity by removing redundant features. Let's say we have N variables in the dataset and we reduce them to K variables, where K << N.
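To make the idea of redundant features concrete, here is a minimal sketch in Python with NumPy. The synthetic data, the 0.95 correlation threshold, and the helper name `drop_correlated` are all assumptions chosen for illustration; they do not come from any library. The sketch drops one feature from every highly correlated pair, going from N = 4 to K = 2 variables.

```python
import numpy as np

def drop_correlated(X, threshold=0.95):
    """Hypothetical helper: return the indices of the columns to keep after
    dropping one feature from each pair whose absolute Pearson correlation
    exceeds `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        # Keep column j only if it is not highly correlated with a kept column.
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([
    a,                                      # feature 0
    2.0 * a + 0.01 * rng.normal(size=200),  # feature 1: near copy of feature 0
    b,                                      # feature 2: independent of feature 0
    -b + 0.01 * rng.normal(size=200),       # feature 3: near negative copy of feature 2
])

kept = drop_correlated(X)
X_reduced = X[:, kept]
print(kept, X_reduced.shape)  # → [0, 2] (200, 2)
```

Which feature of a correlated pair survives depends only on column order here. In practice you would likely pick the one that is easier to interpret or cheaper to collect.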
There are two components of dimensionality reduction:
- Feature selection: Here we try to find a subset of the original set of variables, or features, that can be used to model the problem. It usually involves three approaches: filter, wrapper, and embedded methods.
- Feature extraction: This reduces the data from a high-dimensional space to a lower-dimensional space, i.e. a space with fewer dimensions.
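The difference between the two components can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the data are synthetic, the selection criterion (keep the k columns with the largest variance) is just one simple choice, and the extraction step uses PCA computed through an SVD as one example of an extraction method.

```python
import numpy as np

rng = np.random.default_rng(42)
# Five synthetic features with very different scales (assumed for illustration).
X = rng.normal(size=(100, 5)) * np.array([5.0, 0.1, 3.0, 0.2, 1.0])
k = 2

# Feature selection: keep a subset of the ORIGINAL columns.
# Assumed criterion: the k columns with the largest variance.
variances = X.var(axis=0)
selected = np.sort(np.argsort(variances)[-k:])
X_selected = X[:, selected]

# Feature extraction: build k NEW features as linear combinations of all
# columns. Here: PCA via the SVD of the mean-centered data.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_extracted = X_centered @ Vt[:k].T

print(selected.tolist())                    # indices of the original columns kept
print(X_selected.shape, X_extracted.shape)  # both have k columns
```

The selected columns are still the original, interpretable variables. The extracted columns are new axes that mix all five inputs, which usually preserves more variance but is harder to interpret.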
Advantages of Dimensionality Reduction
- It helps in data compression, and hence reduces storage space.
- It reduces computation time.
- It helps remove noise, which can improve the performance of models.
- It also helps remove redundant features, if any.
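The compression point can be illustrated with a small NumPy sketch. It assumes synthetic data in which 10 observed features are driven by only 2 hidden factors plus a little noise, and it uses PCA (via SVD) as the reduction method; the sizes and noise level are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features, k = 500, 10, 2

# Synthetic data: 2 hidden factors mixed into 10 features, plus small noise.
latent = rng.normal(size=(n_samples, k))
mixing = rng.normal(size=(k, n_features))
X = latent @ mixing + 0.05 * rng.normal(size=(n_samples, n_features))

# Compress: keep only the first k principal components.
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:k]                   # shape (k, n_features)
scores = (X - mean) @ components.T    # shape (n_samples, k)

# Decompress: map the k-dimensional scores back to the original space.
X_restored = scores @ components + mean

stored_before = X.size
stored_after = scores.size + components.size + mean.size
relative_error = np.linalg.norm(X - X_restored) / np.linalg.norm(X)

print(stored_before, stored_after)  # → 5000 1030
print(relative_error)               # small but not zero
```

Storing the scores, the components, and the mean takes roughly a fifth of the original numbers in this setup. The reconstruction is close to the original but not exact, because the part of the data outside the first k components (here, mostly noise) is discarded.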
Disadvantages of Dimensionality Reduction
- It may lead to some loss of information.
- PCA (principal component analysis) tends to find only linear correlations between variables, which is sometimes undesirable.
- PCA fails in cases where the mean and covariance are not enough to describe the dataset.
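The last two points can be seen in a tiny example. The sketch below assumes a deliberately simple dataset in which the second feature is an exact but nonlinear function of the first (y = x²). The data lie on a one-dimensional curve, yet the covariance matrix, which is all PCA looks at, shows no relationship between the two features.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # symmetric around zero
y = x ** 2                        # fully determined by x, but not linearly

# Pearson correlation measures only linear dependence.
corr = np.corrcoef(x, y)[0, 1]

# PCA works on the covariance matrix of the centered data.
X = np.column_stack([x, y])
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(x) - 1)
eigvals = np.linalg.eigvalsh(cov)  # variances along the principal axes

print(abs(corr) < 1e-10)  # → True: no linear correlation detected
print(eigvals)            # both variances are clearly non-zero
```

Because both eigenvalues are well above zero, PCA reports two informative directions and finds nothing to remove, even though one feature is completely redundant given the other. Nonlinear methods are aimed at exactly this kind of structure.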
Different Dimensionality Reduction Techniques
Linear Dimensionality Reduction Methods
- Factor Analysis
Nonlinear Dimensionality Reduction Methods
- Spectral Embedding
Other techniques include:
- Autoencoders
- Missing value ratio
- Low variance filter
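The last two filters are simple enough to sketch directly in NumPy. The toy matrix and both thresholds (`max_missing_ratio`, `min_variance`) are assumptions for illustration; suitable values depend on the dataset, and variance thresholds in particular only make sense when features are on comparable scales.

```python
import numpy as np

# Toy data: rows are samples, columns are features; np.nan marks a missing value.
X = np.array([
    [1.0, 5.0, np.nan, 0.2],
    [2.0, 5.0, np.nan, 0.9],
    [3.0, 5.0, 7.0,    0.4],
    [4.0, 5.0, np.nan, 0.7],
    [5.0, 5.0, np.nan, 0.1],
])

max_missing_ratio = 0.5  # assumed threshold: drop columns with > 50% missing
min_variance = 1e-3      # assumed threshold: drop (near-)constant columns

# Missing value ratio: fraction of missing entries per column.
missing_ratio = np.isnan(X).mean(axis=0)
keep_missing = missing_ratio <= max_missing_ratio

# Low variance filter: variance per column, ignoring missing entries.
variances = np.nanvar(X, axis=0)
keep_variance = variances > min_variance

keep = keep_missing & keep_variance
X_reduced = X[:, keep]

print(missing_ratio.tolist())  # → [0.0, 0.0, 0.8, 0.0]
print(keep.tolist())           # → [True, False, False, True]
```

Column 1 is constant, so it carries no information, and column 2 is missing in 80% of the rows, so both are dropped. Columns 0 and 3 remain.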