Standardizing the Data Using StandardScaler in ML

#machinelearning #dataprocessing #python #programming

Many machine learning algorithms perform better when the numerical input features are on a consistent scale. To achieve this, the data is rescaled before training.

Standardization and normalization are both widely used techniques for rescaling data before feeding it into machine learning models.

In this article, you will learn how to use the StandardScaler class to scale the input data.

What is Standardization?

Before diving into the fundamentals of the StandardScaler class, you need to understand the standardization of the data.

Standardization is a data preparation method that rescales each input feature by first centering it (subtracting the feature's mean from each value) and then dividing by the feature's standard deviation. The resulting feature has a mean of 0 and a standard deviation of 1.

The formula for standardization can be written as follows:

standardized_val = ( input_value - mean ) / standard_deviation

Assume a feature has a mean of 10.4 and a standard deviation of 4. To standardize the value 15.9, substitute these values into the equation:

standardized_val = ( 15.9 - 10.4 ) / 4
standardized_val = ( 5.5 ) / 4
standardized_val = 1.375
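The same arithmetic can be checked in a few lines of Python. This is a minimal sketch using the stated mean of 10.4 and standard deviation of 4; the helper name standardize is made up for illustration and is not part of any library.

```python
def standardize(value, mean, std):
    """Return the standardized value (z-score) for a single number."""
    return (value - mean) / std

# Values from the worked example: mean 10.4, standard deviation 4
z = standardize(15.9, 10.4, 4)
print(round(z, 3))  # prints 1.375
```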

The StandardScaler stands out as a widely used tool for implementing data standardization.

What is StandardScaler?

The StandardScaler class provided by scikit-learn standardizes each input feature so that it has a mean of approximately 0 and a standard deviation of approximately 1.

It puts all features on a comparable scale, making the data suitable for modeling and ensuring that no single feature disproportionately influences the algorithm due to differences in scale.

Why Bother Using it?

You have already seen the idea behind StandardScaler. To summarize, these are the main reasons to use it:

It can improve the performance of machine learning models.
It keeps features on a consistent scale.
It helps algorithms that are negatively affected by differences in feature scale, such as distance-based models like K-nearest neighbors.
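To make the scale problem concrete, here is a small sketch with illustrative values (chosen for demonstration, not taken from a real dataset). It measures how much each feature contributes to the squared Euclidean distance between two samples, before and after standardizing by hand with NumPy. The helper feature_share is hypothetical.

```python
import numpy as np

# Five illustrative samples: feature 0 is in the tens, feature 1 is around 1
X = np.array([[12.0, 0.007],
              [45.0, 1.5],
              [75.0, 2.005],
              [7.0, 0.8],
              [15.0, 0.045]])

def feature_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Contribution of each feature to the distance between the first two samples
raw_share = feature_share(X[0], X[1])

# Standardize each column: subtract its mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_share = feature_share(X_std[0], X_std[1])

print("Raw share per feature:   ", raw_share.round(3))
print("Scaled share per feature:", scaled_share.round(3))
```

On the raw values, the large-scale feature accounts for almost the entire distance. After standardization, both features contribute in comparable amounts, which is the behavior a distance-based model needs.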

How to Use StandardScaler?

First, import the StandardScaler class from the sklearn.preprocessing module. Next, create an instance by calling StandardScaler(). Finally, call the instance's fit_transform method on the input data.

# Import the required libraries
import numpy as np
from sklearn.preprocessing import StandardScaler

# Creating a 2D array
arr = np.asarray([[12, 0.007],
                 [45, 1.5],
                 [75, 2.005],
                 [7, 0.8],
                 [15, 0.045]])

print("Original Array: \n", arr)

# Instance of StandardScaler class
scaler = StandardScaler()

# Fitting and then transforming the input data
arr_scaled = scaler.fit_transform(arr)
print("Scaled Array: \n", arr_scaled)

An instance of the StandardScaler class is created and stored in the variable scaler. This instance will be used to standardize the data.

The fit_transform method of the StandardScaler object (scaler) is called with the original data arr as the input.

The fit_transform method computes the mean and standard deviation of each feature (column) in arr and then uses them to standardize the data.

Here's the original array and the standardized version of the original array.

Original Array: 
 [[1.200e+01 7.000e-03]
 [4.500e+01 1.500e+00]
 [7.500e+01 2.005e+00]
 [7.000e+00 8.000e-01]
 [1.500e+01 4.500e-02]]
Scaled Array: 
 [[-0.72905466 -1.09507083]
 [ 0.55066894  0.79634605]
 [ 1.71405403  1.43610862]
 [-0.92295217 -0.09045356]
 [-0.61271615 -1.04693028]]
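As a sanity check, the scaled values can be reproduced by hand with NumPy alone. The sketch below applies the standardization formula column by column. It uses the population standard deviation (NumPy's default, ddof=0), which is what StandardScaler uses as well.

```python
import numpy as np

arr = np.asarray([[12, 0.007],
                  [45, 1.5],
                  [75, 2.005],
                  [7, 0.8],
                  [15, 0.045]])

# Per-column statistics (axis=0 means "down the rows", one value per feature)
col_mean = arr.mean(axis=0)
col_std = arr.std(axis=0)  # population standard deviation (ddof=0)

# Apply (value - mean) / standard_deviation to every column
manual = (arr - col_mean) / col_std

print("Manually scaled array: \n", manual)
print("Column means after scaling:", manual.mean(axis=0).round(6))
print("Column stds after scaling: ", manual.std(axis=0).round(6))
```

After scaling, each column should have a mean of about 0 and a standard deviation of about 1, up to floating-point rounding.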

Does Standardization Affect the Accuracy of the Model?

In this section, you'll see how the model's performance is affected after applying standardization to features of the dataset.

Let's see how the model will perform on the raw dataset without standardizing the feature variables.

# Evaluate KNN on the breast cancer dataset
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from numpy import mean

# load dataset
df = datasets.load_breast_cancer()
X = df.data
y = df.target

# Instantiating the model
model = KNeighborsClassifier()

# Evaluating the model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10, n_jobs=-1)

# Model's average score
print(f'Accuracy: {mean(scores):.2f}')

The breast cancer dataset is loaded with datasets.load_breast_cancer(), and its features (df.data) and target (df.target) are stored in the X and y variables.

The K-nearest neighbors classifier (KNN) model is instantiated using the KNeighborsClassifier class and stored inside the model variable.

The cross_val_score function evaluates the KNN model's performance. It receives the model, the features (X), and the target (y), and scoring='accuracy' sets accuracy as the evaluation metric.

The dataset is split into 10 folds (cv=10), so the model is trained and tested 10 times, each time holding out a different fold for testing. Here, n_jobs=-1 uses all available CPU cores to speed up cross-validation.

Finally, the average of the accuracy scores (mean(scores)) is printed.

Accuracy: 0.93

Without standardizing the dataset's feature variables, the average accuracy score is 93%.
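To see what the cross-validation call does under the hood, it can be unrolled into an explicit loop. This is a sketch, not the library's internal code: it relies on the fact that, for a classifier and an integer cv, scikit-learn uses stratified folds (StratifiedKFold without shuffling), so the per-fold scores are expected to match those from cross_val_score.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

data = datasets.load_breast_cancer()
X, y = data.data, data.target

# Split into 10 stratified folds; each fold is used once as the test set
cv = StratifiedKFold(n_splits=10)
fold_scores = []
for train_idx, test_idx in cv.split(X, y):
    model = KNeighborsClassifier()
    model.fit(X[train_idx], y[train_idx])
    # score() returns accuracy for classifiers
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Reference scores from the one-line helper
reference = cross_val_score(KNeighborsClassifier(), X, y, scoring='accuracy', cv=10)

print("Number of folds:", len(fold_scores))
print(f"Mean accuracy (manual loop): {np.mean(fold_scores):.2f}")
print("Matches cross_val_score:", np.allclose(fold_scores, reference))
```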

Using StandardScaler for Applying Standardization

# Evaluate KNN on the standardized breast cancer dataset
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from numpy import mean

# loading dataset and configuring features and target variables
df = datasets.load_breast_cancer()
X = df.data
y = df.target

# Standardizing features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Instantiating model
model = KNeighborsClassifier()

# Evaluating the model
scores = cross_val_score(model, X_scaled, y, scoring='accuracy', cv=10, n_jobs=-1)

# Model's average score
print(f'Accuracy: {mean(scores):.2f}')

The dataset's features undergo scaling with the StandardScaler(), and the resulting scaled dataset is stored in the X_scaled variable.

Next, this scaled dataset is used as input for the cross_val_score function to compute and subsequently display the accuracy.

Accuracy: 0.97

The accuracy score has increased from 93% to 97%, a gain of four percentage points.

Standardizing the features with StandardScaler has improved the performance of the KNN model on this dataset.
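One caveat is worth noting. The code above fits the scaler on the full dataset before cross-validation, so the samples in each held-out fold also influence the mean and standard deviation used for scaling. A common way to avoid this is to wrap the scaler and the model in a pipeline, so the scaler is refit on the training folds only in every split. The sketch below uses make_pipeline from sklearn.pipeline. It is one reasonable setup rather than the only one, and the resulting accuracy may differ slightly from the figure above.

```python
from numpy import mean
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = datasets.load_breast_cancer()
X, y = data.data, data.target

# The pipeline fits StandardScaler on the training folds only,
# then applies the same scaling to the held-out fold
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Pass the raw features; scaling happens inside each cross-validation split
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=10)

print(f'Accuracy: {mean(scores):.2f}')
```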

Conclusion

StandardScaler standardizes the input data so that all features are on a comparable scale, which matters for machine learning algorithms, especially those that are sensitive to differences in feature scales.

Standardization transforms the data such that the mean of each feature becomes zero (centered at zero), and the standard deviation becomes one.

Let's recall what you've learned: