There are three broad categories of methods for Feature Selection.

- Filter Methods
- Embedded Methods
- Wrapper Methods

## Filter Methods

These methods are based on the **intrinsic** (natural) properties of the features themselves. We don't use any classifier or model at this point.

### Univariate Statistics

If the variance of a feature is large, i.e. its data points are very spread out, the feature is useful for distinguishing between different training examples: it is easier to come up with boundaries separating the data points when there is variance. The larger the variance, the better, so we can simply remove the features with low variance. Since only a single feature is involved at a time, this is known as a **UNIVARIATE Statistic**. Another term we often use is **Information Gain**: how much a feature contributes to distinguishing different data points.

#### 1. Using a Simple Threshold

The advantage of using variance is that it's really fast to compute. The major disadvantage is that it doesn't take into account the relationships among features.

```
# Separating features from the Target Variable
features = dataset.loc[:, dataset.columns != 'Label'].astype('float64')
labels = dataset['Label']
```

```
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Flag features whose variance is below the threshold.
# (Note: the threshold is an absolute variance value, not a percentage.)
var_constant = 0.05
var_thr = VarianceThreshold(threshold=var_constant)
var_thr.fit(features)
variance_stat = var_thr.get_support()  # True for features that pass the threshold
n_low = (~variance_stat).sum()
print(f"{n_low} out of {len(features.columns)} features have variance below {var_constant}.")
print("Following are the features with low variance:")
print(features.columns[~variance_stat])
```
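To make the behaviour concrete, here is a minimal, self-contained sketch on a tiny made-up DataFrame (the column names and values are hypothetical): a nearly constant column gets filtered out while a spread-out column survives.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical): one nearly-constant feature, one spread-out feature
toy = pd.DataFrame({
    "almost_constant": [1.0, 1.0, 1.0, 1.0, 1.01],
    "spread_out": [0.1, 2.3, 5.7, 9.2, 12.4],
})

selector = VarianceThreshold(threshold=0.05)
selector.fit(toy)

# Keep only the columns whose variance exceeds the threshold
kept = toy.columns[selector.get_support()]
print(list(kept))  # ['spread_out']
```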

### Bivariate Statistics

If we involve more than one feature in the computation, then it's nothing but Bivariate Statistics.

#### 1. Pairwise Correlation

When two features are highly correlated, one of them is redundant, and we can probably remove it from the dataset without losing too much information.

```
import matplotlib.pyplot as plt
import seaborn as sns

feature_corr_matrix = features.corr()
plt.figure(figsize=(100, 100))  # (width, height); scale down for fewer features
sns.heatmap(feature_corr_matrix, annot=True, cmap="RdYlGn")
```
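Beyond eyeballing the heatmap, we can drop redundant features programmatically. Here is a minimal sketch on toy data (the column names and the 0.9 cutoff are assumptions, not from the original): we scan only the upper triangle of the correlation matrix so each pair is checked once, and drop one feature from each highly correlated pair.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "b" is almost a copy of "a", "c" is unrelated
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 2.0, 3.1, 4.0, 5.1],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = df.corr().abs()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)  # ['b']
```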

#### 2. Correlation with target Variable

If we have a feature that is highly correlated with the target variable, then it is a good feature to use, especially in the case of Linear Regression.

```
def corelationHeatMap(col_name):
    corr_matrix = dataset.corr()
    plt.figure(figsize=(20, 60))  # (width, height)
    index = corr_matrix.columns.get_loc(col_name)
    sns.heatmap(corr_matrix.iloc[:, [index]], annot=True, cmap="RdYlGn")
```

Pass in the name of your Target variable to the above function.

```
corelationHeatMap("Label")
```

#### 3. Using Anova

We know that the Standard Deviation tells us how spread out the data is, in other words how much our data points deviate from the mean on average, and Variance is simply the square of the Standard Deviation.

ANOVA (Analysis of Variance) builds on variance to help us find the association between variables.

When we use ANOVA we end up with a value known as the F Ratio or F Statistic. This tells us how confidently we can say there is an association between the variables. The Null Hypothesis says there is no association between the variables, and the Alternate Hypothesis says there is one. Just as with the p-value: if the p-value is less than the significance level (equivalently, if the F Ratio is larger than the critical value), we reject the Null Hypothesis and accept the Alternate Hypothesis.

When can we use the ANOVA test? If our features are Numerical and the Target Variable is Categorical, we can use the ANOVA F-test (there is an analogous F-test when the Target Variable is Numerical).

How does ANOVA work for Feature Selection?

The F Ratio is calculated for each feature against the Target variable. We select the features with the highest F Ratio/Score as they are the most important.
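The approach above maps directly onto scikit-learn's `SelectKBest` with the `f_classif` score function (the ANOVA F-test for a categorical target). A minimal sketch on synthetic data; the dataset sizes here are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 5 features, of which only 2 are actually informative
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=0, random_state=42)

# Keep the 2 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(X_new.shape)        # (200, 2)
print(selector.scores_)   # one F-score per original feature
```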

## Embedded Methods

These methods actually involve the model: feature selection happens as part of optimising the model itself. For example, in a **Decision Tree**, each time we split a node we compare all the features and select the one with the maximum information gain. So the decision tree is effectively selecting features while growing: at every split it picks the feature that yields the most information gain. Usually the features used higher up in the tree are the most important ones, as they had the maximum information gain.

This is just one of many examples. We will look at each of them in great detail in upcoming blogs.
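The decision-tree example can be sketched in a few lines: scikit-learn exposes the importance the tree assigned to each feature via `feature_importances_` (the Iris dataset here is just a convenient stand-in, not the blog's dataset).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Importances sum to 1; features used higher up in the tree
# tend to receive the larger scores
for name, score in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {score:.3f}")
```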

## Wrapper Methods

They are based on our main objective. For instance, we may be interested in improving prediction performance, prediction time, or maybe training time, so the best features may differ depending on our objective.

What we basically do is fit our model on different subsets of features and compare the model's performance on each subset against our main objective. This helps us select the best features.

The Wrapper Method is really expensive, as computing the result for each subset of features takes a lot of time compared to Univariate Statistics such as variance. So it's quite computationally expensive, but it's also very effective, because it deals directly with the intended result.
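One concrete wrapper method is greedy forward selection, available in scikit-learn as `SequentialFeatureSelector` (the estimator and dataset below are illustrative choices, not prescribed by the text): it repeatedly refits the model on candidate subsets and keeps the feature whose addition gives the best cross-validated score, which is exactly why wrapper methods are expensive.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward selection: start with no features, repeatedly add the one
# whose subset scores best under cross-validation (many model refits!)
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=2, direction="forward")
sfs.fit(X, y)

print(sfs.get_support())  # boolean mask over the 4 iris features
```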
