Phylis Jepchumba

A Beginner's Guide to Feature Selection in Machine Learning: Techniques and Tips

Introduction to Feature Selection

As a data scientist, one of the crucial steps in the machine learning process is selecting the most relevant and important features to train our models on, because using too many irrelevant or redundant features can lead to overfitting and poor model performance. In this blog, we will explore various techniques and approaches for selecting the best features for your machine learning models.

What is Feature Selection?

Feature selection, also known as variable selection or attribute selection, is the process of identifying and selecting a subset of relevant features from a larger set of features for use in a machine learning model. It is often done to improve the model's performance and reduce its complexity.

Need for Feature Selection

  • Reduce overfitting
    Overfitting occurs when a model is too complex for the data it is trained on. Selecting the right features keeps the model simpler and reduces the risk of overfitting.

  • Improve model accuracy
    Reducing the input variables to the most critical ones and eliminating irrelevant features increases the predictive power of the algorithm.

  • Simplify models for easier interpretation
    Discarding the less significant variables makes the remaining ones easier to interpret.

  • Increase training speed
    Fewer features mean lower algorithm complexity, so the algorithm trains faster.

Feature Selection Techniques

There are various feature selection techniques, including:

  1. Univariate Feature Selection
  2. Multivariate Feature Selection

Univariate Feature Selection

Description

  • It involves manually checking every feature and evaluating its importance against the target.
  • It is useful when dealing with a small number of features.

Techniques

  • Personal judgement (not recommended unless you are a domain expert)
  • Computing variance (or standard deviation)
  • Computing correlation (e.g. Pearson), the most recommended approach

Implementing Univariate Feature Selection in Python

#Import libraries
import pandas as pd
import numpy as np

Create a dummy data set and load it into a pandas DataFrame:

# Creating a dummy data set with the following attributes:
# size of house (sq. m), number of bedrooms, number of parking slots, monthly rent (USD)
data = {'Size': [90, 97, 82, 39, 120],
        'Bedrooms': [2, 2, 3, 1, 4],
        'Parking': [2, 2, 3, 1, 3],
        'Rent': [90, 100, 80, 40, 120]}

# Create a DataFrame
df = pd.DataFrame(data)
df

Output

   Size  Bedrooms  Parking  Rent
0    90         2        2    90
1    97         2        2  100
2    82         3        3   80
3    39         1        1   40
4   120         4        3  120

Select the target variable and feature matrix:

# Choosing target variable and feature matrix
X = df.drop("Rent", axis=1)   # Feature Matrix
y = df["Rent"]                # Target Variable
X

Output

   Size  Bedrooms  Parking
0    90         2        2
1    97         2        2
2    82         3        3
3    39         1        1
4   120         4        3

Set a variance threshold:

# Target Variable: Rent
# Feature Matrix: Size, Bedrooms, Parking
# Minimum Variance Threshold: 5

# Compute the row-wise variance of each feature paired with the target
var1 = df[["Rent", "Size"]].var(axis=1)
var2 = df[["Rent", "Bedrooms"]].var(axis=1)
var3 = df[["Rent", "Parking"]].var(axis=1)

print(var1)
print("-----")
print(var2)
print("-----")
print(var3)

  • In the example above, we use a dummy data set with the attributes {Size, Bedrooms, Parking, Rent}. {Rent} is chosen as the target variable, and {Size, Bedrooms, Parking} form the feature matrix.
  • Each feature in the feature matrix is then compared against the target variable to check its importance.
  • Variance is computed for each comparison. The minimum variance threshold used in this example is 5; that is, if the variance of a comparison is less than 5, the affected samples have almost the same values. This implies that these samples will not generate any meaningful information when applied to a predictive model.
  • In this example, we see that the feature {Size}, when compared to the target variable {Rent}, yields variances of less than 5 for all the samples. This feature can be ignored because it will not bring any additional predictive power to the model. A sketch of the recommended correlation technique follows this list.
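
The Pearson correlation technique recommended in the techniques list above can be applied to the same data. Here is a minimal sketch, assuming the dummy df built earlier in this post: each feature is ranked by the absolute value of its correlation with the target, and features with near-zero correlation are candidates for removal.

# Pearson correlation of every feature with the target, ranked by magnitude
correlations = df.corr(method="pearson")["Rent"].drop("Rent")
print(correlations.abs().sort_values(ascending=False))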

Multivariate Feature Selection

Multivariate feature selection involves identifying and selecting the most important features in a dataset based on their ability to explain the variance in the target variable. It is especially useful when dealing with large datasets, as it helps to reduce the complexity of the data and improve the efficiency of the modeling process.

There are different approaches to multivariate feature selection, each suited to different machine learning contexts:

Filter methods

These methods use statistical measures to evaluate the relevance of each feature to the target variable and select the most relevant ones. Examples include the chi-squared test, information gain, and the correlation coefficient. These methods can be applied in any machine learning context where there is a target variable to predict.
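
As an illustration of a filter method, the sketch below uses scikit-learn's mutual_info_regression (one way to estimate information gain for a continuous target) to score features; the synthetic data and variable names here are assumptions made for this example, not part of the original post.

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic data: 100 samples, 4 features; only the first two drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Keep the two features with the highest mutual information scores
selector = SelectKBest(mutual_info_regression, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask, e.g. [ True  True False False]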

Wrapper methods

These methods use a machine learning model as a "wrapper" to evaluate the importance of each feature. The model is trained on a subset of features and its performance is evaluated; the subset is then modified and the process is repeated until the best-performing subset is found. Examples include recursive feature elimination, backward selection, and forward selection. These methods are more computationally expensive and are best suited to smaller datasets.
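
Here is a minimal recursive feature elimination sketch; the synthetic regression problem and parameter choices below are illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic problem where only 3 of 8 features are informative
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=0.1, random_state=42)

# Repeatedly refit the model, dropping the weakest feature each round,
# until only 3 features remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 marks a selected feature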

Embedded methods

These methods use the structure of the machine learning model itself to identify the most important features. For example, in a decision tree model, the features most used for making decisions are considered the most important. Examples include lasso regression, Elastic Net regularization, ridge regularization, random forests, gradient boosting machines, and decision trees. These methods can be applied in any machine learning context where the model has a built-in notion of feature importance.
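
As a brief sketch of an embedded method (again on assumed synthetic data), lasso regression's L1 penalty shrinks the coefficients of unhelpful features to exactly zero, so the surviving non-zero coefficients act as the selected feature set.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data where only 3 of 8 features matter
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=0.1, random_state=0)
X = StandardScaler().fit_transform(X)

# The L1 penalty drives the coefficients of irrelevant features to zero
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)                       # zeros mark discarded features
print((lasso.coef_ != 0).sum(), "features kept")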

Linear models

In this method, the relationship between each feature and the dependent variable is evaluated using a linear model. The features with the largest absolute coefficients in the model are selected.
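
A minimal sketch of this approach, assuming standardized synthetic features so that the coefficient magnitudes are directly comparable:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       noise=0.1, random_state=1)

# Standardize first so the coefficient magnitudes are comparable
X = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X, y)

# Larger absolute coefficients indicate more influential features
print(np.argsort(np.abs(model.coef_))[::-1])  # feature indices, best first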

Example

We will implement multivariate feature selection using the California Housing dataset as follows.

First, import the necessary libraries:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

Load in the California Housing dataset:

data = pd.read_csv("housing.csv")

# split the data into features (X) and target (y)
# note: common copies of housing.csv include a categorical
# ocean_proximity column and missing total_bedrooms values, which
# would break the numeric pipeline below, so we drop the column
# (if present) and impute the medians
X = data.drop(columns=["median_house_value", "ocean_proximity"], errors="ignore")
X = X.fillna(X.median())
y = data["median_house_value"]

Split the data into training and test sets, and scale the features using StandardScaler:

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# scale the features using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Use the SelectKBest function to select the top 5 features based on their F-values from the f_regression function:

# select the top 5 features using SelectKBest
selector = SelectKBest(f_regression, k=5)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
# print the selected features
print(X.columns[selector.get_support()])

Finally, we can fit a Linear Regression model using the selected features and evaluate its performance:

# fit a Linear Regression model using the selected features
model = LinearRegression()
model.fit(X_train_selected, y_train)
# evaluate the model on the test set
print("Test set R2 score: {:.2f}".format(model.score(X_test_selected, y_test))

Conclusion

Feature selection is an important step in the machine learning process, as it helps to improve the performance of a model and reduce the complexity of the data. By carefully selecting the most relevant and informative features, we can build more accurate and efficient models that are better able to make predictions on new data. While there are various methods available for feature selection, it is important to consider the specific needs of your dataset and choose the method that best fits your goals. Ultimately, the right feature selection technique can greatly enhance the performance of your machine learning model and lead to more successful results.
