## Feature Scaling

Feature scaling is the process of bringing all the features, or independent variables (the variables other than the target variable), onto almost the same scale so that each feature carries comparable weight.

### Example:

Consider a dataset that contains a dependent variable (Purchased) and 3 independent variables (Country, Age, and Salary). The variables are clearly not on the same scale: Age ranges from 27 to 50, while Salary ranges from 48 K to 83 K. The range of Salary is much wider than the range of Age. This causes problems in our models, since many machine learning models, such as k-means clustering and nearest neighbor classification, are based on the Euclidean distance.
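To make the problem concrete, here is a small sketch with three hypothetical rows (the values are invented, chosen only to fall inside the Age and Salary ranges quoted above). It shows how little Age contributes to a raw Euclidean distance.

```python
import math

# Hypothetical (age, salary) rows chosen inside the ranges quoted above.
a = (27, 48_000)
b = (50, 49_000)   # very different age, similar salary
c = (28, 83_000)   # similar age, very different salary

def euclidean(p, q):
    """Plain Euclidean distance between two equal-length points."""
    return math.dist(p, q)

d_ab = euclidean(a, b)
d_ac = euclidean(a, c)

# Salary differences swamp age differences in the raw distances.
print(f"a-b: {d_ab:.1f}, a-c: {d_ac:.1f}")

# Share of the squared a-b distance that comes from Age alone.
age_share = (a[0] - b[0]) ** 2 / d_ab ** 2
print(f"age share of squared a-b distance: {age_share:.4%}")
```

Even for the pair that differs by 23 years in Age, almost all of the distance comes from the Salary column.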

## Methods for Feature Scaling

There are several methods of feature scaling:

- Standardization (Z-score Normalization)
- Max-Min Normalization (Min-Max Scaling)
- Standard Deviation Method
- Range Method

### 1. Standardization

Standardization rescales each feature so that its mean is 0 and its standard deviation is 1. Unlike min-max scaling, it does not force the values into a fixed range such as 0-1.

A distribution with mean 0 and standard deviation 1 has the same location and spread as the **Standard Normal Distribution**, although standardization by itself does not change the shape of the data. This is also known as **Z-Score Normalization**.

Well, the idea is **simple**. Variables that are measured on different scales do not contribute equally to the model fitting and the learned function, and might end up creating a bias. To deal with this potential problem, feature-wise standardization (μ=0, σ=1) is usually applied prior to model fitting.

Standardization comes into the picture when the features of the input data set have large differences between their ranges, or simply when they are measured in different units (e.g., pounds, meters, miles, etc.).

These differences in the ranges of the initial features cause trouble for many machine learning models. For example, for models that are based on distance computation, if one of the features has a broad range of values, the distance will be governed by this particular feature.

To illustrate this with an example: say we have a 2-dimensional data set with two features, Height in meters and Weight in pounds, that range from 1 to 2 meters and from 10 to 200 pounds respectively. No matter which distance-based model you fit on this data set, the Weight feature will dominate the Height feature and contribute more to the distance computation, just because it has bigger values than Height. So, to prevent this problem, the solution is to transform the features to comparable scales using standardization.

The following formula is used to standardize each value of a feature:

$$z = \frac{x - \mu}{\sigma}$$

Here, μ is the mean and σ is the standard deviation of the feature.

#### Python Implementation

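```python
from sklearn.preprocessing import StandardScaler

# `data` is a 2-D array-like of shape (n_samples, n_features)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)  # each column now has mean 0 and std 1
```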
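To see what the scaler does to a single column, here is a minimal sketch of the z-score computation, z = (x - mean) / std, in plain Python. The ages are hypothetical, and the population standard deviation is used because that is the convention `StandardScaler` follows.

```python
import statistics

# Hypothetical ages, used only for illustration.
ages = [27, 30, 35, 38, 40, 44, 48, 50]

mu = statistics.fmean(ages)
sigma = statistics.pstdev(ages)  # population standard deviation (ddof=0)

# z = (x - mu) / sigma for every value of the feature
ages_standardized = [(x - mu) / sigma for x in ages]

new_mean = statistics.fmean(ages_standardized)
new_std = statistics.pstdev(ages_standardized)
print(f"mean={new_mean:.6f}, std={new_std:.6f}")
```

Up to floating-point rounding, the transformed column has mean 0 and standard deviation 1, while its values are not confined to any fixed interval.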

### 2. Max-Min Normalization

It is also known as Min-Max Scaling. Elsewhere it is also simply called **Scaling**.

However, in most of the places I came across, it is simply known as **Normalization**.

I know this is confusing. Lol! But this is how I understand it.

It is defined as

"A technique in which values are shifted and rescaled so that they end up ranging between 0 and 1."

Here's the formula:

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

Here, Xmax and Xmin are the maximum and the minimum values of the feature respectively.

#### Python Implementation

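```python
from sklearn.preprocessing import MinMaxScaler

# `data` is a 2-D array-like of shape (n_samples, n_features)
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)  # each column now lies in [0, 1]
```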
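As a rough sketch of the same computation by hand, the snippet below applies (X - Xmin) / (Xmax - Xmin) to one column. The salary values are hypothetical.

```python
# Hypothetical salaries (in thousands), used only for illustration.
salaries = [48, 52, 54, 61, 67, 72, 79, 83]

x_min, x_max = min(salaries), max(salaries)

# X' = (X - X_min) / (X_max - X_min)
salaries_normalized = [(x - x_min) / (x_max - x_min) for x in salaries]

# The smallest value maps to 0 and the largest to 1.
print(salaries_normalized[0], salaries_normalized[-1])
```

Note that this hand-written version would divide by zero for a constant column, so a real pipeline needs to handle that case.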

### 3. Robust Scaling

Use the `RobustScaler`, which scales the features using **statistics that are robust to outliers**. This scaler removes the **median** and **scales** the data according to the **quantile** **range** (defaults to **IQR**: Interquartile Range). *The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).*

Scaling using the median and quantiles consists of subtracting the median from all the observations and then dividing by the interquartile difference.

The interquartile difference is the difference between the 75th and 25th percentiles:

```
IQR = 75th percentile - 25th percentile
```

The equation to calculate scaled values:

```
X_scaled = (X - X.median) / IQR
```

#### Python Implementation

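```python
from sklearn.preprocessing import RobustScaler

# `data` is a 2-D array-like of shape (n_samples, n_features)
scaler = RobustScaler()
data_scaled = scaler.fit_transform(data)  # each column: (x - median) / IQR
```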
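The snippet below is a minimal sketch of the median/IQR computation by hand on a hypothetical column with one extreme outlier. The `method="inclusive"` quartile convention is a choice made here for illustration; other quartile conventions give slightly different values.

```python
import statistics

# Hypothetical salaries (in thousands) with one extreme outlier at the end.
salaries = [48, 52, 54, 61, 67, 72, 79, 83, 900]

median = statistics.median(salaries)
# Quartiles with linear interpolation between data points ("inclusive" method).
q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1

# X_scaled = (X - median) / IQR
salaries_robust = [(x - median) / iqr for x in salaries]

print([round(v, 2) for v in salaries_robust])
```

The median and the IQR barely notice the outlier, so the bulk of the values stays in a compact range around 0 while the outlier remains far away from them.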

## What are the main questions when it comes to feature scaling?

We mainly need to answer two questions.

- Do we need to do feature scaling?
- If yes, which method of feature scaling should we use: Standardization, Normalization, etc.?

## When should we use feature scaling?

1- Gradient Descent Based Algorithms.

- Linear Regression
- Logistic Regression
- Neural Networks
- etc.

2- Distance Based Algorithms

- KNN
- K-means
- SVM

## When to perform standardization?

As seen above, for distance based models, standardization is performed to prevent features with wider ranges from dominating the distance metric. But the reason we standardize data is not the same for all machine learning models, and differs from one model to another.

So before which ML models and methods do you have to standardize your data, and why?

### 1- BEFORE PCA:

In Principal Component Analysis, features with high variances/wide ranges get more weight than those with low variance, and consequently they end up illegitimately dominating the first principal components (the components with maximum variance). I use the word "illegitimately" here because the only reason these features have high variances compared to the others is that they were measured on different scales.

Standardization can prevent this by giving the same weightage to all features.
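The effect can be sketched on synthetic data. The two features below are hypothetical, generated independently with the Height and Weight ranges used earlier, and the snippet assumes scikit-learn's `PCA` and `StandardScaler`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical, independent features on very different scales.
height_m = rng.uniform(1.0, 2.0, size=200)
weight_lb = rng.uniform(10.0, 200.0, size=200)
X = np.column_stack([height_m, weight_lb])

# Without standardization, the first component is almost entirely Weight.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# After standardization, both features contribute on comparable terms.
X_std = StandardScaler().fit_transform(X)
std_ratio = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print("raw:", raw_ratio.round(4))
print("standardized:", std_ratio.round(4))
```

On the raw data the first component explains nearly all the variance only because Weight is measured in larger numbers; after standardization the variance is split far more evenly.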

### 2- BEFORE CLUSTERING:

Clustering models such as k-means are distance-based algorithms: to measure similarities between observations and form clusters, they use a distance metric. So features with wide ranges will have a bigger influence on the clustering. Therefore, standardization is required before building a clustering model.

### 3- BEFORE KNN:

k-nearest neighbors is a distance-based classifier that classifies new observations based on similarity measures (e.g., distance metrics) with labeled observations of the training set. Standardization makes all variables contribute equally to the similarity measures.

### 4- BEFORE SVM

Support Vector Machine tries to maximize the distance between the separating plane and the support vectors. If one feature has very large values, it will dominate over other features when calculating the distance. So Standardization gives all features the same influence on the distance metric.

### 5- BEFORE MEASURING VARIABLE IMPORTANCE IN REGRESSION MODELS

You can measure variable importance in regression analysis, by fitting a regression model using the **standardized** independent variables and comparing the absolute value of their standardized coefficients. But, if the independent variables are not standardized, comparing their coefficients becomes meaningless.

This is also known as measuring feature importance.
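Here is a minimal sketch of the idea on synthetic data. Everything in it is hypothetical: the target is constructed so that Age and Salary matter equally once their units are removed, and `ols_coefs` is a small helper defined here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: both features matter equally once scale is removed.
age = rng.normal(40, 8, size=n)              # years
salary = rng.normal(65_000, 10_000, size=n)  # currency units
noise = rng.normal(0, 0.1, size=n)
y = 2.0 * (age - 40) / 8 + 2.0 * (salary - 65_000) / 10_000 + noise

def ols_coefs(X, y):
    """Least-squares slopes for y ~ intercept + X (intercept dropped)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

X = np.column_stack([age, salary])
raw = ols_coefs(X, y)

# Standardize each column, then refit.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std = ols_coefs(X_std, y)

print("raw coefficients:         ", raw)
print("standardized coefficients:", std)
```

The raw coefficients differ by orders of magnitude purely because of units, while the standardized coefficients come out close to each other, which matches how the data was built.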

### 6- BEFORE LASSO AND RIDGE REGRESSION

LASSO and Ridge regressions place a penalty on the magnitude of the coefficients associated with each variable, and the scale of the variables affects how much penalty is applied to their coefficients: variables with large variance get small coefficients, which are therefore penalized less. Therefore, standardization is required before fitting both regressions.
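A small sketch can show why the penalty depends on units. It uses the closed-form ridge slope for a single centered feature without intercept, slope = Σxy / (Σx² + λ); the data and the helper names `ridge_slope` and `shrinkage` are hypothetical.

```python
def ridge_slope(x, y, lam):
    """Closed-form ridge slope for a single centered feature, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

def shrinkage(x, y, lam):
    """Ridge slope divided by the unpenalized least-squares slope."""
    return ridge_slope(x, y, lam) / ridge_slope(x, y, 0.0)

# Hypothetical centered feature in metres, and the same feature in millimetres.
x_m = [-0.4, -0.2, 0.0, 0.2, 0.4]
x_mm = [v * 1000 for v in x_m]
y = [-2.1, -0.9, 0.1, 1.0, 1.9]

lam = 1.0
shrink_m = shrinkage(x_m, y, lam)    # strongly shrunk
shrink_mm = shrinkage(x_mm, y, lam)  # almost untouched
print(shrink_m, shrink_mm)
```

The same information expressed in millimetres is barely penalized, while in metres the slope is shrunk to less than a third of its least-squares value, so the choice of units silently changes the model.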

## When is standardization not needed?

### LOGISTIC REGRESSION AND TREE BASED MODELS

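Tree based algorithms such as Decision Tree, Random Forest and gradient boosting are not sensitive to the magnitude of variables, because their splits depend only on the ordering of the values within each feature. The predictions of an unregularized Logistic Regression are likewise unaffected by rescaling the features, although scaling still helps gradient-based solvers converge and it matters once a penalty is added. So standardization is not strictly needed before fitting these kinds of models.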
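The insensitivity of trees can be checked with a quick sketch. The data and the labeling rule below are hypothetical, and the snippet assumes scikit-learn's `DecisionTreeClassifier`; it compares the predictions of a tree trained on raw features with one trained on standardized features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: Age in years, Salary in currency units.
X = np.column_stack([rng.uniform(27, 50, 300), rng.uniform(48_000, 83_000, 300)])
y = (X[:, 0] + X[:, 1] / 2_000 > 70).astype(int)

X_train, X_test = X[:200], X[200:]
y_train = y[:200]

scaler = StandardScaler().fit(X_train)

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
tree_std = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
    scaler.transform(X_train), y_train
)

pred_raw = tree_raw.predict(X_test)
pred_std = tree_std.predict(scaler.transform(X_test))
agreement = float(np.mean(pred_raw == pred_std))
print(f"agreement between raw and standardized trees: {agreement:.2%}")
```

The two trees should agree on essentially every test point, since standardization only shifts and stretches each feature without changing the order of its values.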

## When to do Normalization?

- Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful in algorithms that do not assume any distribution of the data like K-Nearest Neighbours and Neural Networks.
- However, at the end of the day, the choice of using normalization or standardization will depend on your problem and the machine learning algorithm you are using.
- There is no hard and fast rule to tell you when to normalize or standardize your data. You can always start by fitting your model to raw, normalized, and standardized data and compare the performance for the best results.

## Difference between normalization and standardization

- Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful in algorithms that do not assume any distribution of the data like K-Nearest Neighbours and Neural Networks.
- Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian distribution. However, this does not necessarily have to be true. Also, unlike normalization, standardization does not have a bounding range. So, even if you have outliers in your data, they are not squeezed into a fixed interval; they remain outliers after standardization, although they do influence the mean and standard deviation used for scaling.
- It is good practice to fit the scaler on the training data and then use it to transform the testing data. This avoids any data leakage during the model testing process. Also, scaling the target values is generally not required.
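The last point can be sketched as follows. The data is synthetic and hypothetical; the snippet assumes scikit-learn's `train_test_split` and `StandardScaler`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features (Age, Salary) and a random binary target.
X = np.column_stack([rng.uniform(27, 50, 100), rng.uniform(48_000, 83_000, 100)])
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_scaled = scaler.transform(X_test)        # reuse them; do not refit on the test set

print("means learned from the training set:", scaler.mean_)
```

Because the test set is transformed with the training mean and standard deviation, its scaled columns will generally not have exactly mean 0 and standard deviation 1, and that is expected.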

## Visualizing unscaled, normalized and standardized data

### After Normalization

[Figure: the data after normalization]

### After Standardization

[Figure: the data after standardization]

### How are outliers dealt with in Standardization vs. Normalization?

For **Standardized data**, outliers exist just as they exist in the **Original data**. In contrast to standardization, in **Normalized data** the cost of having a bounded range is that we end up with smaller standard deviations, which can suppress the effect of outliers. However, **Normalization** is still sensitive to outliers, since the minimum and maximum used for scaling are themselves set by the extreme values.
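As a rough sketch of how one extreme value plays out under both transformations, the snippet below scales a hypothetical salary column (in thousands) with a single outlier, using the min-max and z-score formulas in plain Python.

```python
import statistics

# Hypothetical salaries (in thousands) with one extreme outlier at the end.
salaries = [48, 52, 54, 61, 67, 72, 79, 83, 900]

lo, hi = min(salaries), max(salaries)
normalized = [(x - lo) / (hi - lo) for x in salaries]

mu, sigma = statistics.fmean(salaries), statistics.pstdev(salaries)
standardized = [(x - mu) / sigma for x in salaries]

norm_std = statistics.pstdev(normalized)
stand_std = statistics.pstdev(standardized)
top_regular = max(normalized[:-1])  # largest normalized value that is not the outlier
outlier_z = standardized[-1]

print("normalized std:  ", round(norm_std, 3))
print("standardized std:", round(stand_std, 3))
print("largest normalized non-outlier:", round(top_regular, 3))
print("outlier z-score:", round(outlier_z, 3))
```

The normalized column has a much smaller standard deviation, and all regular values are compressed close to 0 because the outlier defines the maximum. In the standardized column the outlier keeps a large z-score and stays clearly separated.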

## Points worth noting

- We can see that the **Normalized data** have different means. As the **MEAN** changes, so does the standard deviation. However, the **Standardized data** have the same **MEAN**.
- **Normalized data** have a fixed range, i.e. between 0 and 1. However, the range for **Standardized data** varies.
