Intro
I recently completed a project for my data science boot camp and learned how transforming numerical data helps in building a regression model. In this post, I'd like to focus on how transforming numerical data improves the validation of the model, rather than on what the model looks like or how it can be used to predict a value.
I will briefly go over the data source and data structure and give a quick explanation of the methods used in this post. I will also include some Python code and mathematical formulas to aid understanding.
Data
I am using data from the project I completed. It is real-life data containing house sales records from King County, Washington, including house prices and multiple house features. Below is the list of variables used in the analysis.
Dependent Variable
House Price
Independent Variables
Numerical
Living space in square feet
Lot size in square feet
Year built
The number of floors*
* A separate explanation of why I defined it as numerical is at the end of the post.
Categorical
Binaries
Waterfront
View presence
Renovation condition
Basement presence
Multi-categorical
Maintenance condition
House grade
Methods
This section is just to give you an idea of what I did to produce the results. If these methods look familiar to you or don't interest you, feel free to skip to the results section. Glancing at the results before reading this section may also make it easier to see why I am posting this.
Assumptions
There are several ways to validate the model, and I'd like to go over the four major assumptions used to validate it: linearity, normality, homoscedasticity, and multicollinearity. I chose this approach because it can be explained visually, and visualization is more helpful in explaining the concepts than lots of words and numbers.
1. Linearity
It is important to check the linearity assumption in linear regression analysis. Since no polynomial transformation has been applied in this analysis, the predicted house price (dependent variable) will be compared directly to the actual house price.
Below is the Python code I used. The purpose of sharing the code is to give some idea of how the graph is created.
# split whole data to training and test data
from sklearn.model_selection import train_test_split
X = independent_variables
y = house_price_column
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# find the model fit
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
# Calculate predicted price using test data
y_pred = model.predict(X_test)
# Graphing part
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# reference line where the predicted price equals the actual price
perfect_line = np.arange(y_test.min(), y_test.max())
ax.plot(perfect_line, perfect_line, linestyle="--", color="orange", label="Perfect Fit")
ax.scatter(y_test, y_pred, alpha=0.5)
ax.set_xlabel("Actual Price")
ax.set_ylabel("Predicted Price")
ax.legend();
2. Normality
The normality assumption is related to the normality of model residuals. This is checked using a QQ plot.
import scipy.stats as stats
import statsmodels.api as sm
residuals = y_test - y_pred
sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True);
3. Homoscedasticity
The assumption of homoscedasticity checks the residuals against the predicted values and sees whether the values are dispersed without any pattern. This assumption is also related to the residuals.
fig, ax = plt.subplots()
residuals = y_test - y_pred
ax.scatter(y_pred, residuals, alpha=0.5)
# horizontal reference line at zero residual
ax.plot(y_pred, [0 for i in range(len(y_pred))])
ax.set_xlabel("Predicted Value")
ax.set_ylabel("Actual - Predicted Value");
4. Multicollinearity
The multicollinearity check looks at dependence between the independent variables. Ideally, the independent variables should be as independent of one another as possible.
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
# compute the VIF of each independent variable in the training set
vif = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
pd.Series(vif, index=X_train.columns, name="Variance Inflation Factor")
Transformations of numerical variables
Log transformation
This part is simple. All values in the numerical columns are natural-logged.
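Below is a minimal sketch of that step. It assumes the data lives in a DataFrame called df and uses the numerical column names from the VIF tables later in the post; adjust both to your own data.
import numpy as np
# natural-log the numerical columns (df and the column names are assumptions)
num_cols = ["sqft_living", "sqft_lot", "yr_built", "floors"]
df[num_cols] = np.log(df[num_cols])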
Normalization
For each value in a numerical variable, the mean of that variable is subtracted from the value, and the result is divided by the variable's standard deviation: z = (x − mean) / (standard deviation).
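Continuing the sketch above (same assumed df and column names), the standardization looks like this:
# standardize each numerical column: subtract its mean, divide by its standard deviation
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()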
Results
Here is the fun part. You can just relax and see how graphs and scores change.
The Raw Data - no transformation
1. Linearity
I see several outliers. Some linearity is observed only on the left side.
2. Normality
Only 1/3 of the dots are on the red line.
3. Homoscedasticity
4. Multicollinearity
Only scores below 5 are accepted. About half of the scores are not acceptable.
sqft_living 8483.406359
sqft_lot 1.200729
floors 14.106084
waterfront 1.085728
view 1.344477
yr_built 72.300842
is_renovated 1.157421
has_basement 2.175980
condition_Fair 1.038721
condition_Good 1.668386
condition_Very Good 1.295097
grade_11 Excellent 1.530655
grade_6 Low Average 5.129509
grade_7 Average 14.142031
grade_8 Good 8.261598
grade_9 Better 3.446987
interaction 8460.117213
Log Transformation
1. Linearity
It shows much better linearity. The dots have a slightly lower slope than the perfect-fit line.
2. Normality
There is some kurtosis, but the majority of the dots are on the red line.
3. Homoscedasticity
This looks better, too. A slight pattern is observed on the right side.
4. Multicollinearity
Several scores are still too high.
sqft_living 471370.972327
sqft_lot 155.772190
floors 4.052275
yr_built 922.928871
waterfront 1.086052
view 1.337069
is_renovated 1.146855
has_basement 2.438983
condition_Fair 1.042784
condition_Good 1.668688
condition_Very Good 1.283740
grade_11 Excellent 1.468962
grade_6 Low Average 5.221791
grade_7 Average 12.895007
grade_8 Good 7.519577
grade_9 Better 3.355969
interaction 469074.416388
Log Transformation and Normalization
1. Linearity
The slope is slightly better and closer to the perfect line.
2. Normality
I don't see much difference from the previous graph.
3. Homoscedasticity
I don't see much difference from the previous graph.
4. Multicollinearity
All of the scores are now acceptable. This is a huge difference!
sqft_living 3.001670
sqft_lot 1.552016
floors 2.046914
yr_built 1.758294
waterfront 1.086293
view 1.313341
is_renovated 1.148279
has_basement 2.441147
condition_Fair 1.042169
condition_Good 1.647135
condition_Very Good 1.281906
grade_11 Excellent 1.278034
grade_6 Low Average 1.939542
grade_7 Average 2.077564
grade_8 Good 1.609822
grade_9 Better 1.440610
interaction 1.374175
Conclusion
The transformations helped keep (i.e. not reject) the four assumptions. The visualizations seem clear enough to show what was improving at each step.
Extra
The number of floors
This is somewhat off the main topic of this post, but the decision can be crucial to the overall regression analysis. Let me begin with the value counts of the floor information.
1.0 10673
2.0 8235
1.5 1910
3.0 611
2.5 161
3.5 7
The left column shows that the floor counts range from 1 through 3.5. It might make more sense for the model to treat this variable as categorical. However, what if one would like to predict the price of a house that has 4 floors? This problem can be solved if the information is treated as numerical, as the sketch below illustrates. I think this is a matter of the goal of the analysis.
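As a rough sketch of that idea, reusing model and X_test from the earlier code and ignoring, for simplicity, any log or normalization steps applied to the features: with floors kept numerical, the fitted model can extrapolate to a floor count it never saw, while a one-hot categorical encoding would have no column for it. The new_house row here is purely illustrative.
# take one test row purely for illustration and set its floor count to 4
new_house = X_test.iloc[[0]].copy()
new_house["floors"] = 4.0
# a numerical floors column lets the linear model extrapolate to this unseen value
predicted_price = model.predict(new_house)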