*How to choose and plot the most important features from an overwhelming pool of features, using Truncated SVD!*

## What was the problem?

As I explained in my previous post, in one of our projects at MelkRadar I had to tackle a large dataset with more than 1,500 features! It was really overwhelming to identify the most important features, especially after the feature transformations.

This problem was the beginning of my journey through feature selection and feature extraction techniques. I looked into two well-known techniques: Truncated SVD and PCA. In the end, I concluded that Truncated SVD was the better solution for handling our large, sparse dataset.

## Why was Truncated SVD a better fit than PCA for my problem?

### Truncated SVD and PCA:

Truncated SVD (Singular Value Decomposition) and PCA (Principal Component Analysis) are both linear algebra techniques. They are dimensionality reduction methods used to project high-dimensional datasets into a lower-dimensional representation while retaining the most important information.

### Truncated SVD vs. PCA:

The main difference between Truncated SVD and PCA lies in how they handle data. PCA centers the data (subtracts each feature's mean) before decomposing it, which destroys sparsity, so it effectively requires a dense matrix. Truncated SVD skips the centering step and can therefore operate directly on sparse matrices.

Truncated SVD is often preferred for text data (e.g., sparse TF-IDF matrices, where it is known as latent semantic analysis), while PCA is commonly used for dense numerical data.

Truncated SVD is typically faster than PCA for large datasets, as it only computes a subset of the singular vectors and values.
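To make the sparse-input difference concrete, here is a minimal sketch (my own example, not from the original project) using `scipy.sparse` and scikit-learn's `TruncatedSVD`: the sparse matrix goes into `fit_transform` without ever being densified, which is exactly what PCA's centering step prevents.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# A sparse 1000 x 1500 matrix with only 1% non-zero entries
X_sparse = sparse_random(1000, 1500, density=0.01, random_state=0)

svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X_sparse)  # accepts the sparse matrix directly

print(X_reduced.shape)  # (1000, 10)
```

Passing the same sparse matrix to `PCA.fit_transform` would raise an error asking for a dense array.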

## How does Truncated SVD identify the most important features?

Truncated SVD does **not directly** identify the most important features in a dataset. Its primary goal is to reduce the dimensionality of the data while retaining the most important information. However, it is possible to indirectly identify the most important features by **examining the singular vectors** obtained from the truncated SVD.

In truncated SVD, the singular vectors are the linear combinations of the original features that explain the most variance in the data. Therefore, the features with the largest (absolute) coefficients in the singular vectors can be considered the most important features in the dataset.
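As a small illustration of this idea (a sketch of mine, with made-up data), scikit-learn's `TruncatedSVD` exposes the singular vectors as `components_`, where each row is one singular vector and each column holds that vector's coefficient for one original feature:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 20))  # 100 samples, 20 features

svd = TruncatedSVD(n_components=3, random_state=0)
svd.fit(X)

# components_ has shape (n_components, n_features)
print(svd.components_.shape)  # (3, 20)

# Index of the feature with the largest absolute coefficient
# in the first singular vector
top_feature = int(np.abs(svd.components_[0]).argmax())
print(top_feature)
```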

## Now let's do some coding...

To identify the most important features using truncated SVD, one could perform the following steps:

### Step 1:

To keep it simple, I generate a random data matrix to simulate a dataset that I had in my real project.

```
import numpy as np

# Generate a random data matrix X of size (m x n)
m, n = 1000, 1500
X = np.random.randn(m, n)
```

### Step 2:

Now compute the SVD of X and keep only the top `k` singular vectors to obtain the low-rank (truncated) approximation.

```
k = 10  # number of singular vectors to keep
U, S, Vt = np.linalg.svd(X)
X_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  # rank-k approximation of X
```

`U`: the left singular vectors matrix. It is an m×m matrix whose columns are the eigenvectors of XXᵀ.

`S`: the singular values. NumPy returns them as a 1-D array of length min(m, n), sorted in descending order (not as an m×n diagonal matrix).

`Vt`: the right singular vectors matrix (transposed). It is an n×n matrix whose rows are the eigenvectors of XᵀX.

Note that **U** captures the relationships among the **rows** of X (the samples), while **Vt** captures the relationships among the **columns** of X (the features).
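As a quick sanity check on those shapes (my own snippet, separate from the main workflow), this is what `np.linalg.svd` actually returns with the default `full_matrices=True`:

```python
import numpy as np

# A small 6 x 4 matrix keeps the shapes easy to read
X = np.random.randn(6, 4)
U, S, Vt = np.linalg.svd(X)

# U is m x m, S is a 1-D array of length min(m, n), Vt is n x n
print(U.shape, S.shape, Vt.shape)  # (6, 6) (4,) (4, 4)
```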

### Step 3:

To evaluate the model, we have to calculate the relative approximation error like this:

```
approx_error = np.linalg.norm(X - X_approx) / np.linalg.norm(X)
print(f'Relative approximation error: {approx_error:.2f}')
```

### Step 4:

Select the first `k` right singular vectors (the first `k` rows of `Vt`). Each row holds one coefficient per original feature, and the features with the highest absolute coefficients matter most.

```
Vk = Vt[:k, :]  # shape (k, n): one row per singular vector
```

### Step 5:

Compute the feature importance scores as the sum of the absolute values of each feature's coefficients across the first `k` singular vectors.

```
feature_importance = np.abs(Vk).sum(axis=0)  # shape (n,): one score per feature
```

### Step 6:

Sort the feature importance scores in descending order and rank the features to identify the most important ones in the dataset.

```
# Hypothetical names for the synthetic features
feature_names = [f'feature_{i}' for i in range(n)]
sorted_idx = np.argsort(feature_importance)[::-1]
# Save the names of the top 10 highest-scoring features in a list
top_features = [feature_names[i] for i in sorted_idx[:10]]
```

### Step 7:

Create a bar plot of the top 10 most important features:

```
import matplotlib.pyplot as plt

plt.barh(range(10), feature_importance[sorted_idx[:10]])
plt.yticks(range(10), top_features)
plt.gca().invert_yaxis()  # put the highest-scoring feature at the top
plt.xlabel('Importance Score')
plt.title('Most Important Features in Truncated SVD')
plt.show()
```

## Note that...

This approach assumes that the features with the largest coefficients in the singular vectors are also the most important features in the dataset. This may not always be the case: a genuinely important feature can receive a lower score if it is correlated with other features, since the singular vectors spread the shared variance across the correlated group.

Additionally, the choice of the truncation parameter `k` can strongly affect the results, so it is important to choose an appropriate truncation level based on the problem at hand.
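One common heuristic for picking `k` (a sketch I am adding here, not part of the original workflow) is to choose the smallest `k` whose cumulative explained variance crosses a threshold, e.g. 80%, using scikit-learn's `explained_variance_ratio_`:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))

# Fit with a generous number of components, then inspect the spectrum
svd = TruncatedSVD(n_components=40, random_state=0)
svd.fit(X)

cumvar = np.cumsum(svd.explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.80) + 1)  # smallest k reaching 80%
print(f'Chosen k = {k}')
```

For real data with correlated features, the curve rises much faster than on this random matrix, so a small `k` often suffices.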
