DEV Community



Embedded Methods for Feature Selection

L1 Regularized Logistic Regression

Let's have a brief overview of Regularization.

Regularization helps with the problem of overfitting the model to the training dataset: instead of just decreasing the loss function, we also penalize the model's complexity.

There are different forms of regularization:

  1. L1 Regularization.
  2. L2 Regularization.

We will discuss them in detail in future blogs.

Using L1 (LASSO) Regularization

L1 regularization can also be used as a method of feature selection.

*L1 norm penalty:* λ · Σ_{j=1}^{m} |w_j|

We will discuss the loss-minimizing part in some other blog. For this discussion, the interesting part is the L1 norm: it plays the main role in feature selection. The L1 norm is a measure of how big the weights are. Here, m is the number of features in the dataset, and Σ|w_j| is the sum of the absolute values of all the weights. You can think of lambda as a scaling factor; it's a hyperparameter we have to tune when we use it in practice. The L1 norm is the penalty added to the loss function: the greater the model's complexity, the greater the penalty, and vice versa.

Our final loss function is

Loss = (logistic loss) + λ · Σ_{j=1}^{m} |w_j|

Our goal is to minimize the overall loss function, but the L1 norm adds a large positive number to it. So if we want to minimize the loss, we also need to minimize this penalty term, which is only possible with small weights (a less complex model). In other words, our goal is to find weights that are not only good for predictions but also as small as possible, so the overall loss function stays low.
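As a quick sketch of the combined loss, here is a plain-NumPy version (the function name and the toy numbers are my own, purely for illustration):

```python
import numpy as np

def l1_logistic_loss(y_true, y_prob, weights, lam):
    eps = 1e-12  # avoid log(0)
    # Standard logistic (cross-entropy) loss
    ce = -np.mean(y_true * np.log(y_prob + eps)
                  + (1 - y_true) * np.log(1 - y_prob + eps))
    # L1 penalty: lambda times the absolute sum of the weights
    penalty = lam * np.sum(np.abs(weights))
    return ce + penalty

w = np.array([0.5, -2.0, 0.0])
y = np.array([1, 0, 1])
p = np.array([0.8, 0.2, 0.9])
base = l1_logistic_loss(y, p, w, lam=0.0)       # plain loss, no penalty
regularized = l1_logistic_loss(y, p, w, lam=0.1)  # adds 0.1 * (0.5 + 2.0)
print(base, regularized)
```

Note that the zero weight contributes nothing to the penalty; only nonzero weights are punished.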

If the lambda term is large, the trade-off between minimizing the penalty term and the global loss function typically settles at a point where one of the weights is zero, and usually more than one.
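You can see this sparsity effect directly in scikit-learn. In its `LogisticRegression`, `C` is the *inverse* of lambda, so smaller `C` means a stronger penalty. A sketch, using scikit-learn's built-in breast-cancer dataset (my choice, just for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # L1 penalties assume comparable feature scales

zero_counts = {}
for C in (1.0, 0.1, 0.01):  # smaller C = larger lambda = stronger penalty
    coef = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y).coef_[0]
    zero_counts[C] = int(np.sum(coef == 0))
    print(f"C={C}: {zero_counts[C]} of {coef.size} weights are zero")
```

As the penalty grows, more and more weights are driven exactly to zero.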

How can we use it for feature selection?

As we know, the greater a feature's weight, the more important/valuable that feature is. So we can remove the features with zero (or very small) weights, or equivalently select the features with the largest weights.
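A sketch of doing that selection automatically with scikit-learn's `SelectFromModel` (the dataset, `C` value, and threshold are my own assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Fit an L1-penalized model; features whose weight is (near) zero get dropped
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model, threshold=1e-5).fit(X, y)

X_selected = selector.transform(X)
print(f"kept {X_selected.shape[1]} of {X.shape[1]} features")
```

`X_selected` can then be fed to any downstream model, not just logistic regression.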

Most of this info is derived from this video.

Using Decision Trees & Random Forest

In logistic regression, we use all of the features unless we apply L1 regularization, which zeroes some of them out. In a decision tree, however, feature selection is done implicitly: at each split, the tree picks the feature that reduces the entropy the most (this reduction is known as information gain), with the goal of driving the entropy to 0. There can be other split criteria as well, such as Gini impurity, or impurity measures in general.

Decision trees perform feature selection implicitly.
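A sketch of reading those implicit importances off a fitted random forest (the dataset and hyperparameters are my own choices for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Each importance is the impurity (e.g. Gini) reduction a feature achieves,
# averaged over all trees in the forest; the importances sum to 1.
ranking = np.argsort(forest.feature_importances_)[::-1]
for i in ranking[:5]:
    print(data.feature_names[i], round(forest.feature_importances_[i], 3))
```

Just like with L1 weights, you can keep the top-ranked features and discard the rest.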
