Younes Charfaoui

Hands-on with Feature Selection Techniques: Embedded Methods

This article is part 4 of a series centered on hands-on approaches to feature selection techniques. If you’ve missed any of the other posts, I’d recommend checking them out:

Hands-on with Feature Selection Techniques: An Introduction.
Hands-on with Feature Selection Techniques: Filter Methods.
Hands-on with Feature Selection Techniques: Wrapper Methods.
Hands-on with Feature Selection Techniques: Embedded Methods.
Hands-on with Feature Selection Techniques: Hybrid Methods.
Hands-on with Feature Selection Techniques: More Advanced Methods.

Welcome back! In part 4 of our series, we’ll provide an overview of embedded methods for feature selection.

In the previous article, we looked at wrapper methods, which integrate a machine learning algorithm into the feature selection process.
These methods provide a good way to ensure that the selected features are the best ones for a specific machine learning model.

We concluded that wrapper methods yield better-performing feature subsets, but they also cost us a lot of computation time and resources.

But what if we could include the feature selection process in ML model training itself? That could lead us to even better features for that model, in a shorter amount of time. This is where embedded methods come into play.

Embedded Methods: Definition

Embedded methods complete the feature selection process within the construction of the machine learning algorithm itself. In other words, they perform feature selection during the model training, which is why we call them embedded methods.

A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification/regression at the same time.

Embedded Methods: Advantages

Embedded methods address the main issues we encountered with filter and wrapper methods by combining their advantages. Here’s how:

  • They take feature interactions into account, like wrapper methods do.
  • They are much faster than wrapper methods, approaching the speed of filter methods.
  • They are generally more accurate than filter methods.
  • They find the feature subset for the algorithm being trained.
  • They tend to be less prone to overfitting than wrapper methods.

For a complete guide of Feature Selection & Feature Engineering in one book, you can check this link.

Embedded Methods: Process

Embedded methods generally work as follows:

  • First, these methods train a machine learning model.
  • They then derive feature importance from this model, which is a measure of how much each feature contributes when making a prediction.
  • Finally, they remove non-important features using the derived feature importance.
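The three steps above can be sketched with scikit-learn as follows. This is only a minimal illustration: the synthetic dataset, the random forest, and the "mean importance" cutoff are assumptions chosen for the example, not something prescribed by this article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration). With shuffle=False,
# columns 0-2 are informative and columns 3-7 are pure noise.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# Step 1: train a machine learning model.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Step 2: derive feature importance from the trained model.
importances = model.feature_importances_

# Step 3: remove non-important features (here: below the mean importance).
keep = importances >= importances.mean()
X_reduced = X[:, keep]

print("importances:", np.round(importances, 3))
print("kept feature indices:", np.flatnonzero(keep))
```

The cutoff is a design choice: a stricter threshold keeps fewer features, and the right value usually has to be validated against model performance.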

The Methods

In this article, we’ll explore a few specific methods that use embedded feature selection: regularization and tree-based methods.

Using Regularization

Regularization in machine learning adds a penalty to the parameters of a model to reduce its freedom. In a linear model, this penalty is applied to the coefficients that multiply each of the features. The goal is to avoid overfitting, make the model more robust to noise, and improve its generalization.

There are three main types of regularization for linear models:

  • lasso regression or L1 regularization
  • ridge regression or L2 regularization
  • elastic nets or L1/L2 regularization

Let’s look at each in a bit more detail:

  • L1 regularization shrinks some of the coefficients to exactly zero. The corresponding features are multiplied by zero when estimating the target, so they add nothing to the final prediction and can be removed.

  • L2 regularization, on the other hand, doesn’t set coefficients to zero; it only shrinks them toward zero. That’s why we use L1 rather than L2 for feature selection.

  • L1/L2 regularization is a combination of L1 and L2. It incorporates both penalties, and therefore we can end up with features with zero as a coefficient, similar to L1.
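To see why L1 produces exact zeros while L2 does not, it helps to look at a simplified special case. The sketch below assumes an orthonormal design matrix, where both penalties have closed-form solutions: L1 subtracts a fixed amount from each coefficient (soft-thresholding), while L2 rescales it. The coefficient values and the penalty strength `lam` are made up for illustration.

```python
def lasso_shrink(beta, lam):
    """L1 solution for one coefficient under an orthonormal design.

    Minimizes 0.5 * (b - beta)**2 + lam * abs(b): soft-thresholding.
    """
    magnitude = max(abs(beta) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if beta > 0 else -magnitude


def ridge_shrink(beta, lam):
    """L2 solution for one coefficient under an orthonormal design.

    Minimizes 0.5 * (b - beta)**2 + 0.5 * lam * b**2: proportional shrinkage.
    """
    return beta / (1.0 + lam)


# Hypothetical unpenalized (least-squares) coefficients.
ols_coefs = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

lasso_coefs = [lasso_shrink(b, lam) for b in ols_coefs]
ridge_coefs = [ridge_shrink(b, lam) for b in ols_coefs]

print("L1:", lasso_coefs)  # small coefficients become exactly 0.0
print("L2:", ridge_coefs)  # every coefficient shrinks but stays non-zero
```

Outside this special case there is no such simple formula, but the qualitative behavior is the same: the L1 penalty can zero out coefficients, while the L2 penalty only shrinks them.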

Here is a code snippet to work with:

# Lasso for Regression tasks, and Logistic Regression for Classification tasks.
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel

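# using logistic regression with penalty l1.
# the default lbfgs solver doesn't support l1, so we use liblinear.
selection = SelectFromModel(LogisticRegression(C=1, penalty='l1', solver='liblinear'))
selection.fit(x_train, y_train)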

# see the selected features.
selected_features = x_train.columns[(selection.get_support())]

# see the deleted features.
removed_features = x_train.columns[(selection.estimator_.coef_ == 0).ravel().tolist()]
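The snippet above imports `Lasso` but only demonstrates the classification case. For a regression task, a comparable sketch could look like the following. The synthetic data, the `alpha` value, and the scaling step are assumptions for illustration; in practice `alpha` is usually tuned, for example with cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption): 3 informative features out of 10.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# The L1 penalty is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Fit a lasso model and keep the features with non-negligible coefficients.
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X_scaled, y)
X_selected = selector.transform(X_scaled)

kept = np.flatnonzero(selector.get_support())
dropped = np.flatnonzero(selector.estimator_.coef_ == 0)

print("kept feature indices:", kept)
print("dropped feature indices:", dropped)
```

A larger `alpha` means a stronger penalty and therefore fewer selected features.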

Tree-based Feature Importance

Tree-based algorithms and models (e.g., random forest) are well-established algorithms that not only offer good predictive performance but can also provide us with what we call feature importance as a way to select features.

Feature importance

Feature importance tells us which variables are more important in making accurate predictions on the target variable/class. In other words, it identifies which features are the most used by the machine learning algorithm in order to predict the target.

Random forests provide us with feature importance using two straightforward methods: mean decrease in impurity and mean decrease in accuracy.
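In scikit-learn, the `feature_importances_` attribute of a random forest is the impurity-based measure. Mean decrease in accuracy is commonly estimated by permuting one feature at a time and measuring how much the score drops, which scikit-learn exposes as `permutation_importance`. The sketch below compares the two on synthetic data; the dataset and hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption). With shuffle=False,
# columns 0-2 are informative and columns 3-5 are noise.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Mean decrease in impurity: computed from the training process itself.
mdi = forest.feature_importances_

# Mean decrease in accuracy: shuffle each feature on held-out data
# and record the average drop in the model's score.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
mda = result.importances_mean

for i, (a, b) in enumerate(zip(mdi, mda)):
    print(f"feature {i}: impurity-based={a:.3f}  permutation-based={b:.3f}")
```

The permutation-based measure is computed on held-out data here, so it reflects how much the model actually relies on each feature when generalizing.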

How it works

A random forest is no more than a group of decision trees. Each tree is built on a random sample of the observations, and only a random subset of the features is considered at each split, so an individual tree doesn’t see all the observations or consider every feature at every split.

Furthermore, every node in a decision tree is a condition on one feature. These nodes are designed to split the dataset into two sets, so that observations with similar target values end up in the same set.

Thus, the importance of each feature is derived by how “pure” each of the sets is.

The measure used to choose the optimal condition is known as impurity. For classification, it’s typically either the Gini impurity or entropy (information gain); for regression trees, it’s the variance.

Thus, when training a tree, the importance of a feature is calculated as the total decrease in node impurity from the splits on that feature, with each decrease weighted by the proportion of samples reaching the node. In a forest, this value is averaged over all trees. The higher the value, the more important the feature.
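To make the idea of a weighted impurity decrease concrete, here is a small worked example using the Gini impurity on a single node. The labels and the two candidate splits are made up for illustration, and the helper names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples at the node."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)


# A root node holding 6 samples from two classes.
parent = ["a", "a", "a", "b", "b", "b"]

# A split that separates the classes perfectly.
good = weighted_impurity_decrease(
    parent, ["a", "a", "a"], ["b", "b", "b"], n_total=6
)

# A split that barely separates the classes.
poor = weighted_impurity_decrease(
    parent, ["a", "a", "b"], ["a", "b", "b"], n_total=6
)

print(f"good split: {good:.4f}")
print(f"poor split: {poor:.4f}")
```

A feature that produces splits like the first one accumulates a large total decrease and therefore ranks as more important than a feature that only produces splits like the second.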

Here’s a code sample that puts this into practice with a random forest in just a few lines:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# create the random forest with your hyperparameters.
model = RandomForestClassifier(n_estimators=340)

# fit the model to start training.
model.fit(x_train, y_train)

# get the importance of the resulting features.
importances = model.feature_importances_

# create a data frame for visualization.
final_df = pd.DataFrame({"Features": x_train.columns, "Importances":importances})

# sort in ascending order for better visualization.
final_df = final_df.sort_values('Importances')

# plot the feature importances in bars.
final_df.plot.bar(x='Features', y='Importances')

You can use any other tree-based algorithm the same way we did here. Gradient boosting algorithms (like XGBoost, CatBoost, and many more) are among the most popular choices, since they typically offer strong predictive performance and also expose feature importances.
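As a rough sketch of swapping in another tree-based model, the example below uses scikit-learn's own `GradientBoostingClassifier` as a stand-in for libraries like XGBoost or CatBoost, combined with `SelectFromModel`. The synthetic data and the `"median"` threshold are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (an assumption): 4 informative features out of 10.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# Train the boosted model and keep the features whose importance
# is at or above the median importance.
selector = SelectFromModel(
    GradientBoostingClassifier(random_state=0), threshold="median"
).fit(X, y)
X_selected = selector.transform(X)

print("importances:", selector.estimator_.feature_importances_.round(3))
print("selected shape:", X_selected.shape)
```

Any estimator that exposes `feature_importances_` or `coef_` after fitting can be plugged into `SelectFromModel` in the same way.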


Using embedded methods can be a very straightforward approach for selecting good features for machine learning models, especially if you’re going to use the same model used for the feature selection process itself.

To see the full example of this article check out this GitHub repository.

More methods are on the way. We’ll continue this journey in the next part of this series, which will explore hybrid methods for selecting features.
