Attributes with different scales are common in Machine Learning projects. For example, a medical record dataset can include columns for weight, height, and blood pressure. These attributes have different units of measure and vary over different ranges, making their comparison difficult.
In these cases, we can apply a process called scaling to make this comparison easier. In this process, we change the original data while keeping the relative distance between the data points, so that we preserve the attribute's distribution.
Normalization
In normalization, we scale an attribute so that all data points fit in the interval between 0.0 and 1.0. We express the normalization process using the formula:

$$ X^{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} $$

where:

- $X^{norm}$: is the new scaled attribute
- $X$: is a column vector representing our attribute
- $X_{min}$: is a column vector where all elements are the minimum value of the attribute
- $X_{max}$: is a column vector where all elements are the maximum value of the attribute

For each attribute value, we subtract the attribute's minimum and then divide the result by the difference between its maximum and its minimum. The mean and standard deviation will be scaled as well, but the transformation keeps the shape of the data distribution.
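The process described above can be sketched with a toy NumPy array (the values here are hypothetical, not from the dataset used in the project):

```python
import numpy as np

# Toy attribute with an arbitrary scale (hypothetical values).
x = np.array([10.0, 20.0, 25.0, 40.0])

# Min-max normalization: subtract the minimum, divide by the range,
# mapping every value into the interval [0.0, 1.0].
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm)  # [0.         0.33333333 0.5        1.        ]
```

Note that the relative spacing between the points is unchanged: 25.0 sits halfway between the minimum and the maximum, and so does its scaled value 0.5.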
Standardization
In standardization, the attribute is transformed to have a mean equal to 0 (zero) and a standard deviation equal to 1 (one). The following formula is applied:

$$ X^{std} = \frac{X - \mu}{\sigma} $$

where:
- $X^{std}$: is the new scaled attribute
- $X$: is a column vector representing our attribute
- $\mu$: is a column vector where all elements are the mean value of the attribute
- $\sigma$: is the standard deviation of the attribute
In standardization, there are no lower and upper limits for the new data values, but all of them are now expressed as distances from the mean, measured in units of the standard deviation.
Example
In the project Scaling attributes with normalization and standardization, we use vectorized operations to apply normalization and standardization to scale the attributes of the House Data Pricing dataset.
For example, we used the following code to apply normalization to the LotFrontage attribute:
df_norm['LotFrontage'] = (df_float['LotFrontage'] - df_float['LotFrontage'].min()) / (df_float['LotFrontage'].max() - df_float['LotFrontage'].min())
We also applied standardization to the same attribute and saved the result in a separate DataFrame:
df_std['LotFrontage'] = (df_float['LotFrontage'] - df_float['LotFrontage'].mean()) / df_float['LotFrontage'].std()
After scaling the attributes, we created linear regression models for each dataset and compared the results.
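The comparison can be sketched with scikit-learn on synthetic data (an assumption for illustration; the original project may have built its models differently). Because min-max scaling is an affine transformation of the feature, the linear regression R² score comes out the same on raw and normalized data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: one attribute and a noisy linear target.
X = rng.uniform(20.0, 100.0, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0.0, 5.0, size=200)

# Min-max normalize the attribute.
X_norm = (X - X.min()) / (X.max() - X.min())

# Fit the same model on the raw and on the normalized attribute.
score_raw = LinearRegression().fit(X, y).score(X, y)
score_norm = LinearRegression().fit(X_norm, y).score(X_norm, y)

print(np.isclose(score_raw, score_norm))  # True
```

The model simply absorbs the scaling into its coefficient and intercept, which is why scaling alone does not change the fit quality of a plain linear regression.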
Conclusion
After comparing the scores of the models, we concluded that scaling the attributes does not, by itself, improve the linear regression models. Check the complete example in the link below:
Top comments (5)
You should have done this
$$ X^{std} = \frac{X - \mu}{\sigma} $$
for a one line equation.
and
$ X^{std} = \frac{X - \mu}{\sigma} $
for an inline equation.

Probably dev.to doesn't support proper Markdown and LaTeX syntax.
Nice work. You can apply other kinds of transformations to your data as well.
If x is your dataset:
x = log(x)
x = sqrt(x)
That’s curious
Thanks @lukaszahradnik!