My Journey with Multimodal Data Preprocessing and Truncated SVD
In one of our projects, we had a multimodal dataset containing over 1,500 features from which to build a machine learning model. By multimodality, I mean it contained a combination of different feature types, including categorical and text features.
To handle this dataset, I followed a standard preprocessing strategy in which the original features were transformed into an even larger set of features. A crucial part of working with these additional features was finding a way to identify the most important ones.
Of course, before modeling we analyze the data to keep only the most informative samples and features. But in this project, we were still dealing with the curse of dimensionality.
For example, among these features there were numerous categorical variables, which I converted to numeric values using one-hot encoding.
Furthermore, there were some text features in this dataset. For these kinds of features, the TfidfVectorizer came to the rescue! This technique tries to identify the most important tokens in a text by weighing how often each token appears in a document against how common it is across all the documents.
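A minimal sketch of this step (the documents below are made up; the real ones were the project's text columns):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny toy documents standing in for the real text features.
docs = [
    "dimensionality reduction for sparse data",
    "sparse data needs careful preprocessing",
    "truncated svd handles sparse matrices",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # a scipy sparse matrix: one row per document

print(X.shape)  # (3, number of distinct tokens)
```

Note that the output is already a sparse matrix: each document touches only a handful of the full vocabulary, which is exactly why the featurized data matrix ends up so sparse.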
Our machine learning pipeline consists of a featurization step followed by modeling. After the featurization step, we were faced with an enormous sparse data matrix. In a sparse matrix, most of the cells are zero and only a few contain non-zero values. Using this kind of data matrix can cause computational overhead and slow down the modeling process.
The first idea was to use the well-known PCA algorithm as a dimensionality reduction technique. When I attempted to apply PCA, I encountered an error indicating that the algorithm could not be used with a sparse matrix. But why? The reason is that PCA starts by centering the data, i.e., subtracting each column's mean. Subtracting a non-zero mean from a column that is mostly zeros turns almost every cell non-zero, so the sparse matrix would effectively become dense and no longer fit in memory.
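To make the cost of centering concrete, here is a small sketch; the shape and density are made up for illustration, not our actual data:

```python
from scipy import sparse

# A mostly-zero matrix standing in for the post-featurization data:
# only ~1% of its cells are non-zero.
X = sparse.random(1000, 500, density=0.01, random_state=0, format="csr")

# PCA would first subtract each column's mean. Doing that to a sparse
# matrix fills in nearly every zero cell with a (small) non-zero value.
dense = X.toarray()
centered = dense - dense.mean(axis=0)

print(f"non-zero before centering: {X.nnz / (1000 * 500):.2%}")
print(f"non-zero after centering:  {(centered != 0).mean():.2%}")
```

The fraction of non-zero cells jumps from about 1% to nearly 100%, which is why a sparse-aware method is needed instead.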
Consequently, I started exploring Truncated SVD as an alternative method. In the next section, I have tried to sum up everything I learned about this technique in comparison to PCA.
Truncated SVD (Singular Value Decomposition) and PCA (Principal Component Analysis) are both linear algebra techniques that can be used to reduce the dimensionality of high-dimensional data, while retaining the most important information.
As I mentioned before, I was dealing with a large dataset that, even after the featurization step, was still large enough to push me to look for an alternative way to deal with it!
The main differences between Truncated SVD and PCA that I found out about are:

- PCA aims to find the directions (principal components) that explain the maximum amount of variance in the data, while Truncated SVD aims to factorize a matrix into lower-rank matrices.
- PCA is typically applied to a covariance matrix, while Truncated SVD can be applied directly to the data matrix without computing the covariance matrix.
- PCA provides the principal components, which are linear combinations of the original variables, while Truncated SVD provides the singular vectors, which are also linear combinations of the original variables.
- In PCA, the number of principal components to keep is typically chosen based on the percentage of variance explained or by setting a fixed number of components. In Truncated SVD, the number of singular vectors to keep is typically chosen based on the rank of the matrix or a fixed number of components.
- Truncated SVD is typically faster than PCA for large datasets, as it only computes a subset of the singular vectors and values.
As I first described, our dataset was large and sparse. This was very important in our case, because we used a pay-as-you-go Azure Compute instance to run the experiments, so it was crucial to save computation time.
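As a tiny taste of what this looked like in practice (a minimal sketch with a random matrix and made-up sizes, not our production pipeline), Truncated SVD can be fit directly on the sparse matrix:

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# A stand-in for the post-featurization matrix: large, ~99% zeros.
X = sparse.random(2000, 1500, density=0.01, random_state=42, format="csr")

# TruncatedSVD accepts the sparse matrix directly: no centering,
# no densification, just the top n_components singular vectors.
svd = TruncatedSVD(n_components=50, random_state=42)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)  # 1,500 features reduced to 50 components
print(svd.explained_variance_ratio_.sum())  # variance retained
```

The reduced matrix is dense, but with 50 columns instead of 1,500 it is cheap to feed into the downstream model.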
Truncated SVD and PCA are both useful techniques for reducing the dimensionality of high-dimensional data. The choice of which technique to use depends on the specific requirements of the problem at hand. In our case, with a large sparse data matrix, we needed to choose Truncated SVD.
In my next post, I will show a simple code example for using this technique!