Jinhoon Chung


# Magic of Transformation in Linear Regression

## Intro

I recently completed a project for my data science boot camp and learned how transforming numerical data helps in finding a regression model. In this post, I want to focus on how transforming the numerical data improves the model's validity, rather than on what the model looks like or how it can be used to predict a value.

I will briefly go over the data source and structure, give a quick explanation of the methods used in this post, and include some Python code and mathematical formulas to aid understanding.

## Data

I am using the data from the project I completed. It is real-life data on house sales in King County, Washington, and includes house prices and multiple house features. Below is the list of variables used in the analysis.

### Dependent Variable

House price

### Independent Variables

#### Numerical

Living space in square feet
Lot size in square feet
Year built
The number of floors*

* A separate explanation of why I defined it as numerical is at the end of the post.

#### Categorical

##### Binaries

Waterfront
View presence
Renovation condition
Basement presence

##### Multi-categorical

Maintenance condition

## Methods

This section gives you an idea of what I did to produce the results. If these methods look familiar to you or don't interest you, feel free to skip to the results section. In fact, glancing at the results before reading this section might make it easier to see why I am writing this post.

### Assumptions

There are several ways to validate a linear regression model; here I go over the four major assumptions: linearity, normality, homoscedasticity, and no multicollinearity. I chose this approach because each assumption can be checked visually, and visualization explains the concepts better than walls of words and numbers.

#### 1. Linearity

It is important to check the linearity assumption in linear regression analysis. Since no polynomial transformation has been applied in this analysis, the predicted house price (from the model) is simply plotted against the actual house price; the model fits well when the points follow the diagonal.

Below is the Python code I used. I am sharing it to give some idea of how the graph is created.

``````# split the data into training and test sets
from sklearn.model_selection import train_test_split

X = independent_variables
y = house_price_column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# fit the model
from sklearn.linear_model import LinearRegression
model = LinearRegression()

model.fit(X_train, y_train)

# calculate predicted prices on the test data
y_pred = model.predict(X_test)

# graphing part
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# dashed diagonal where predicted price equals actual price
perfect_line = np.arange(y_test.min(), y_test.max())
ax.plot(perfect_line, perfect_line, linestyle="--", color="orange", label="Perfect Fit")
ax.scatter(y_test, y_pred, alpha=0.5)
ax.set_xlabel("Actual Price")
ax.set_ylabel("Predicted Price")
ax.legend();
``````

#### 2. Normality

The normality assumption is related to the normality of model residuals. This is checked using a QQ plot.

``````import scipy.stats as stats
import statsmodels.api as sm

residuals = y_test - y_pred
sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True);
``````

#### 3. Homoscedasticity

The assumption of homoscedasticity checks the residuals against the predicted values to see whether the residuals are dispersed without any pattern. Like normality, this assumption concerns the model residuals.

``````fig, ax = plt.subplots()

residuals = y_test - y_pred

ax.scatter(y_pred, residuals, alpha=0.5)
ax.axhline(0)  # reference line at zero residual
ax.set_xlabel("Predicted Value")
ax.set_ylabel("Actual - Predicted Value");
``````

#### 4. Multicollinearity

The multicollinearity check looks at dependency among the independent variables: ideally, the independent variables should be as independent of one another as possible. A common measure is the variance inflation factor (VIF).

``````from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

vif = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
pd.Series(vif, index=X_train.columns, name="Variance Inflation Factor")
``````

### Transformations of numerical variables

#### Log transformation

This part is simple: every value in the numerical columns is replaced by its natural logarithm.

#### Normalization

Each value in a numerical variable is normalized by subtracting the mean of the variable and then dividing the result by the standard deviation of the variable:

z = (x - mean) / standard deviation

## Results

Here is the fun part. You can just relax and see how graphs and scores change.

### The Raw Data - no transformation

#### 1. Linearity

I see several outliers. Some linearity is observed only on the left side.

#### 2. Normality

Only about a third of the dots fall on the red line.

#### 3. Homoscedasticity

A clear pattern is observed.

#### 4. Multicollinearity

Only scores below 5 are acceptable, and about half of the scores exceed that threshold.

``````sqft_living            8483.406359
sqft_lot                  1.200729
floors                   14.106084
waterfront                1.085728
view                      1.344477
yr_built                 72.300842
is_renovated              1.157421
has_basement              2.175980
condition_Fair            1.038721
condition_Good            1.668386
condition_Very Good       1.295097
interaction            8460.117213
``````

### Log Transformation

#### 1. Linearity

It shows much better linearity, though the cloud of dots has a slightly lower slope than the perfect-fit line.

#### 2. Normality

There is some kurtosis, but the majority of the dots are on the red line.

#### 3. Homoscedasticity

This looks better, too. A slight pattern is observed on the right side.

#### 4. Multicollinearity

Several scores are still too high.

``````sqft_living            471370.972327
sqft_lot                  155.772190
floors                      4.052275
yr_built                  922.928871
waterfront                  1.086052
view                        1.337069
is_renovated                1.146855
has_basement                2.438983
condition_Fair              1.042784
condition_Good              1.668688
condition_Very Good         1.283740
interaction            469074.416388
``````

### Log Transformation and Normalization

#### 1. Linearity

The slope is slightly better and closer to the perfect line.

#### 2. Normality

I don't see much difference from the previous graph.

#### 3. Homoscedasticity

I don't see much difference from the previous graph.

#### 4. Multicollinearity

All of the scores are now acceptable. This is a huge difference!

``````sqft_living            3.001670
sqft_lot               1.552016
floors                 2.046914
yr_built               1.758294
waterfront             1.086293
view                   1.313341
is_renovated           1.148279
has_basement           2.441147
condition_Fair         1.042169
condition_Good         1.647135
condition_Very Good    1.281906
interaction            1.374175
``````

## Conclusion

The transformations helped satisfy (i.e., not reject) the four assumptions, and the visualizations make it easy to see what improved at each step.

## Extra

### The number of floors

This is somewhat outside the main topic of this post, but the decision can be crucial to the overall regression analysis. Let me begin with the value counts of the floor variable.

``````1.0    10673
2.0     8235
1.5     1910
3.0      611
2.5      161
3.5        7
``````

The left column shows that the floor counts range from 1 through 3.5. With only six distinct values, it might make more sense to treat this variable as categorical. However, what if one wants to predict the price of a house with 4 floors? That is only possible if the variable is treated as numerical, because a numerical feature lets the model extrapolate beyond the observed values. Ultimately, this is a matter of the goal of the analysis.