Intro
I recently completed a project for my data science boot camp and learned how transforming numerical data helps in building a regression model. In this post, I'd like to focus on how transforming numerical data improves the validation of the model, rather than on what the model looks like or how it can be used to predict a value.
I will briefly go over the data source and data structure and give a quick explanation of the methods used in this post. I will also include some Python code and mathematical formulas to aid understanding.
Data
I am using data from the project I completed. It is real-life data containing house sales records from King County, Washington, including house prices and multiple house features. Below is the list of variables used in the analysis.
Dependent Variable
House Price
Independent Variables
Numerical
Living space in square feet
Lot size in square feet
Year built
The number of floors*
* A separate explanation of why I defined it as numerical is at the end of the post.
Categorical
Binaries
Waterfront
View presence
Renovation condition
Basement presence
Multi-categorical
Maintenance condition
House grade
Methods
This section is just to give you an idea of what I did to produce the results. If these methods look familiar to you or don't interest you, feel free to skip to the results section. Glancing at the results before reading this section may also make it easier to see why I am posting this.
Assumptions
There are several ways to validate the model, and I'd like to go over the four major assumptions used to validate it: linearity, normality, homoscedasticity, and multicollinearity. I chose this approach because it can be explained visually, and visualization is more helpful in explaining the concepts than lots of words and numbers.
1. Linearity
It is important to check the linearity assumption in linear regression analysis. Since no polynomial transformation has been applied in this analysis, the predicted house price (dependent variable) will be compared directly to the actual house price.
Below is the Python code I used. The purpose of sharing the code is to give some idea of how the graph is created.
# split whole data to training and test data
from sklearn.model_selection import train_test_split
X = independent_variables
y = house_price_column
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# find the model fit
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
# Calculate predicted price using test data
y_pred = model.predict(X_test)
# Graphing part
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# reference line where the predicted price equals the actual price
perfect_line = np.arange(y_test.min(), y_test.max())
ax.plot(perfect_line, perfect_line, linestyle="--", color="orange", label="Perfect Fit")
ax.scatter(y_test, y_pred, alpha=0.5)
ax.set_xlabel("Actual Price")
ax.set_ylabel("Predicted Price")
ax.legend();
2. Normality
The normality assumption is related to the normality of model residuals. This is checked using a QQ plot.
import scipy.stats as stats
import statsmodels.api as sm
residuals = y_test - y_pred
sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True);
3. Homoscedasticity
The assumption of homoscedasticity checks the residuals against the predicted values and sees whether the values are dispersed without any pattern. This assumption is also related to the residuals.
fig, ax = plt.subplots()
residuals = y_test - y_pred
ax.scatter(y_pred, residuals, alpha=0.5)
# horizontal reference line at zero residual
ax.plot(y_pred, [0 for i in range(len(y_pred))])
ax.set_xlabel("Predicted Value")
ax.set_ylabel("Actual - Predicted Value");
4. Multicollinearity
The multicollinearity check looks at dependence between the independent variables. Ideally, the independent variables should be as independent of one another as possible.
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
# compute the VIF of each independent variable in the training set
vif = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
pd.Series(vif, index=X_train.columns, name="Variance Inflation Factor")
Transformations of numerical variables
Log transformation
This part is simple. All values in the numerical columns are natural-logged.
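Below is a minimal sketch of that step. It assumes the data lives in a DataFrame called df and uses the numerical column names from the VIF tables later in the post; adjust both to your own data.
import numpy as np
# natural-log the numerical columns (df and the column names are assumptions)
num_cols = ["sqft_living", "sqft_lot", "yr_built", "floors"]
df[num_cols] = np.log(df[num_cols])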
Normalization
For each value in a numerical variable, the mean of that variable is subtracted from the value, and the result is divided by the variable's standard deviation: z = (x − mean) / (standard deviation).
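Continuing the sketch above (same assumed df and column names), the standardization looks like this:
# standardize each numerical column: subtract its mean, divide by its standard deviation
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()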
Results
Here is the fun part. You can just relax and see how graphs and scores change.
The Raw Data - no transformation
1. Linearity
I see several outliers. Some linearity is observed only on the left side.
2. Normality
Only 1/3 of the dots are on the red line.
3. Homoscedasticity
4. Multicollinearity
Only scores below 5 are accepted. About half of the scores are not acceptable.
sqft_living 8483.406359
sqft_lot 1.200729
floors 14.106084
waterfront 1.085728
view 1.344477
yr_built 72.300842
is_renovated 1.157421
has_basement 2.175980
condition_Fair 1.038721
condition_Good 1.668386
condition_Very Good 1.295097
grade_11 Excellent 1.530655
grade_6 Low Average 5.129509
grade_7 Average 14.142031
grade_8 Good 8.261598
grade_9 Better 3.446987
interaction 8460.117213
Log Transformation
1. Linearity
It shows much better linearity. The dots have a slightly lower slope than the perfect-fit line.
2. Normality
There is some kurtosis, but the majority of the dots are on the red line.
3. Homoscedasticity
This looks better, too. A slight pattern is observed on the right side.
4. Multicollinearity
Several scores are still too high.
sqft_living 471370.972327
sqft_lot 155.772190
floors 4.052275
yr_built 922.928871
waterfront 1.086052
view 1.337069
is_renovated 1.146855
has_basement 2.438983
condition_Fair 1.042784
condition_Good 1.668688
condition_Very Good 1.283740
grade_11 Excellent 1.468962
grade_6 Low Average 5.221791
grade_7 Average 12.895007
grade_8 Good 7.519577
grade_9 Better 3.355969
interaction 469074.416388
Log Transformation and Normalization
1. Linearity
The slope is slightly better and closer to the perfect line.
2. Normality
I don't see much difference from the previous graph.
3. Homoscedasticity
I don't see much difference from the previous graph.
4. Multicollinearity
All of the scores are now acceptable. This is a huge difference!
sqft_living 3.001670
sqft_lot 1.552016
floors 2.046914
yr_built 1.758294
waterfront 1.086293
view 1.313341
is_renovated 1.148279
has_basement 2.441147
condition_Fair 1.042169
condition_Good 1.647135
condition_Very Good 1.281906
grade_11 Excellent 1.278034
grade_6 Low Average 1.939542
grade_7 Average 2.077564
grade_8 Good 1.609822
grade_9 Better 1.440610
interaction 1.374175
Conclusion
The transformations helped keep (i.e. not reject) the four assumptions. The visualizations seem clear enough to show what was improving at each step.
Extra
The number of floors
This is somewhat off the main topic of this post, but the decision can be crucial to the overall regression analysis. Let me begin with the value counts of the floor information.
1.0 10673
2.0 8235
1.5 1910
3.0 611
2.5 161
3.5 7
The left column shows that the floor counts range from 1 through 3.5. It might make more sense for the model to treat this variable as categorical. However, what if one would like to predict the price of a house that has 4 floors? This problem can be solved if the information is treated as numerical, as the sketch below illustrates. I think this is a matter of the goal of the analysis.
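As a rough sketch of that idea, reusing model and X_test from the earlier code and ignoring, for simplicity, any log or normalization steps applied to the features: with floors kept numerical, the fitted model can extrapolate to a floor count it never saw, while a one-hot categorical encoding would have no column for it. The new_house row here is purely illustrative.
# take one test row purely for illustration and set its floor count to 4
new_house = X_test.iloc[[0]].copy()
new_house["floors"] = 4.0
# a numerical floors column lets the linear model extrapolate to this unseen value
predicted_price = model.predict(new_house)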