Attributes with different scales are common in Machine Learning projects. For example, a medical record dataset can include columns for weight, height, and blood pressure. These attributes have different units of measure and vary over different ranges, making their comparison difficult.
In these cases, we can apply a process called scaling to make this comparison easier. In this process, we change the original data while keeping the relative distance between the data points, so that we preserve the attribute's distribution.
Normalization
In normalization, we scale an attribute so that all data points fit in the interval between 0.0 and 1.0. We express the normalization process using the formula:

$$ X^{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} $$

where:

- $X^{norm}$: is the new scaled attribute
- $X$: is a column vector representing our attribute
- $X_{min}$: is a column vector where all elements are the minimum value of the attribute
- $X_{max}$: is a column vector where all elements are the maximum value of the attribute

For each attribute value, we subtract the attribute's minimum and then divide the result by the difference between its maximum and its minimum. The mean and standard deviation will be scaled as well, but the transformation keeps the shape of the data distribution.
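The process described above can be sketched with a toy NumPy array (the values here are hypothetical, not from the dataset used in the project):

```python
import numpy as np

# Toy attribute with an arbitrary scale (hypothetical values).
x = np.array([10.0, 20.0, 25.0, 40.0])

# Min-max normalization: subtract the minimum, divide by the range,
# mapping every value into the interval [0.0, 1.0].
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm)  # [0.         0.33333333 0.5        1.        ]
```

Note that the relative spacing between the points is unchanged: 25.0 sits halfway between the minimum and the maximum, and so does its scaled value 0.5.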
Standardization
In standardization, the attribute is transformed to have a mean equal to 0 (zero) and a standard deviation equal to 1 (one). The following formula is applied:

$$ X^{std} = \frac{X - \mu}{\sigma} $$

where:
- $X^{std}$: is the new scaled attribute
- $X$: is a column vector representing our attribute
- $\mu$: is a column vector where all elements are the mean value of the attribute
- $\sigma$: is the standard deviation of the attribute
In standardization, there are no lower and upper limits for the new data values, but all of them are now expressed as distances from the mean, measured in units of the standard deviation.
Example
In the project Scaling attributes with normalization and standardization, we use vectorized operations to apply normalization and standardization to scale the attributes of the House Data Pricing dataset.
For example, we used the following code to apply normalization to the LotFrontage attribute:
df_norm['LotFrontage'] = (df_float['LotFrontage'] - df_float['LotFrontage'].min()) / (df_float['LotFrontage'].max() - df_float['LotFrontage'].min())
We also applied standardization to the same attribute and saved the result in a separate DataFrame:
df_std['LotFrontage'] = (df_float['LotFrontage'] - df_float['LotFrontage'].mean()) / df_float['LotFrontage'].std()
After scaling the attributes, we created linear regression models for each dataset and compared the results.
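The comparison can be sketched with scikit-learn on synthetic data (an assumption for illustration; the original project may have built its models differently). Because min-max scaling is an affine transformation of the feature, the linear regression R² score comes out the same on raw and normalized data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: one attribute and a noisy linear target.
X = rng.uniform(20.0, 100.0, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0.0, 5.0, size=200)

# Min-max normalize the attribute.
X_norm = (X - X.min()) / (X.max() - X.min())

# Fit the same model on the raw and on the normalized attribute.
score_raw = LinearRegression().fit(X, y).score(X, y)
score_norm = LinearRegression().fit(X_norm, y).score(X_norm, y)

print(np.isclose(score_raw, score_norm))  # True
```

The model simply absorbs the scaling into its coefficient and intercept, which is why scaling alone does not change the fit quality of a plain linear regression.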
Conclusion
After comparing the scores of the models, we concluded that scaling the attributes does not, by itself, improve the linear regression models. Check the complete example in the link below:
Top comments (5)
You should have done this
$$ X^{std} = \frac{X - \mu}{\sigma} $$
for a one line equation.
and
$ X^{std} = \frac{X - \mu}{\sigma} $
for an inline equation.

Probably dev.to doesn't support proper Markdown and LaTeX syntax.
Nice work. You can apply other kinds of transformations to your data as well.
If x is your dataset:
x = log(x)
x = sqrt(x)
That’s curious
Thanks @lukaszahradnik!