
Regression Modeling

Make sure to read this blog before reading this one.

What is a regression model and what kind of problems is it used for?

A regression model predicts continuous values. For example, regression models make predictions that answer questions like the following:

  • What is the value of a house in California?
  • What is the probability that a user will click on this ad?

They can't be used to predict discrete values. For example:

  • Is a given email message spam or not spam?
  • Is this an image of a dog, a cat, or a hamster?

Such problems are known as classification problems. We will see which Machine Learning algorithms we can use for classification problems in future blogs.

The dataset used in this blog can be found here.

Linear Regression Model

In a simple Linear Regression Model, we have only one Dependent Variable (Target Variable) and one Independent Variable (Feature).

Step-1

Loading Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import linear_model

We are using the Scikit-Learn library to build our model. By default, this model uses the Ordinary Least Squares method.
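To build some intuition for what Ordinary Least Squares does, here is a minimal sketch using NumPy's least-squares solver on a few made-up (area, price) pairs; the values are purely illustrative, not taken from the blog's dataset. OLS finds the intercept and slope that minimize the sum of squared errors.

# Sketch of Ordinary Least Squares with NumPy (illustrative values only).
# A column of 1s is prepended so the solver fits an intercept as well as a slope.
areas = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
prices = np.array([300000, 340000, 410000, 450000, 500000], dtype=float)
A = np.column_stack([np.ones_like(areas), areas])
coefs, *_ = np.linalg.lstsq(A, prices, rcond=None)
print(coefs)  # [intercept, slope] minimizing the sum of squared errors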

Step-2

Loading Dataset.

df = pd.read_csv('homeprices.csv')

Step-3

We know that in order for linear regression to work, the data we are fitting our model to needs to be linear. Let's draw a scatter plot to find out whether it's linear or not.

plt.xlabel('Area')
plt.ylabel('Price')
plt.scatter(df['area'],df['price'],marker='+', color='red')

(Scatter plot of area vs. price)

We can see that the plotted points indicate the data is almost linear and can be fit by a straight line.

Step-4

Building our Linear Regression model.

linear_reg_model = linear_model.LinearRegression()
linear_reg_model.fit(df[['area']],df['price'])

You may wonder why we are using df[['area']] instead of df['area']. The thing is that df['area'] returns a pandas Series (1-dimensional), while fit expects a 2-dimensional input such as a DataFrame. That's why we use df[['area']], which returns a DataFrame instead of a Series.
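A quick sketch to see the difference yourself:

print(type(df['area']), df['area'].shape)      # pandas Series, shape (n,)
print(type(df[['area']]), df[['area']].shape)  # pandas DataFrame, shape (n, 1)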

Step-5

Testing our model on a single value

linear_reg_model.predict([[3300]])

Output

array([628715.75342466])

Our model is making predictions as intended.

Step-6

Checking the calculated regression coefficient and Y-intercept.

print(linear_reg_model.coef_)
print(linear_reg_model.intercept_)

Output

[135.78767123]
180616.43835616432

For the equation y = mx + b, linear_reg_model.coef_ is the m (slope) and linear_reg_model.intercept_ is the b (Y-intercept).
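As a quick sanity check, we can plug the learned m and b back into y = mx + b for the 3300 example and reproduce the earlier prediction:

# Reproducing predict([[3300]]) manually with y = m*x + b
m = linear_reg_model.coef_[0]
b = linear_reg_model.intercept_
print(m * 3300 + b)  # ≈ 628715.75, matching the model's prediction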

Step-7

Making predictions on test data

test_data = pd.read_csv('areas.csv')
preds = linear_reg_model.predict(test_data)
test_data['price'] = preds
plt.scatter(test_data['area'],test_data['price'],color='red')
plt.plot(test_data['area'],preds,color='green')

Here, the plot method draws a straight line while the scatter method draws the points on the plot. We are trying to get an idea of how well our regression line fits the actual data points.

(Predicted prices plotted against area, with the fitted regression line)

Here, we don't have the actual price values. For the sake of visualization, we are treating the predicted values as both the actual and the predicted values. That's why all the points lie exactly on the line, which would not be the case otherwise.

Multivariate Regression Model

When we have more than one independent variable (feature), we can't fit the data with a straight line. With more independent variables, the dimensionality of our problem increases, so we use a plane or hyperplane to fit our dataset. The equation generalizes from y = mx + b to y = m₁x₁ + m₂x₂ + … + mₙxₙ + b.

Step-1

Import libraries and Reading data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
USAhousing = pd.read_csv('USA_Housing.csv')

Step-2

We want to see whether our data is linear, whether there is any collinearity or noise, and what its distribution looks like.

sns.pairplot(USAhousing)

Remember! It may take quite some time for larger datasets. Just be patient.

(Pairplot of all the USAhousing variables)

We can clearly see that, other than with Price, none of the features show collinearity. There isn't much noise here either.

Let's now see the distribution of the dependent variable, Price.

sns.distplot(USAhousing['Price'])  # deprecated in newer seaborn; use sns.histplot(USAhousing['Price'], kde=True)

(Distribution plot of Price: a roughly bell-shaped curve)

We can clearly see that it's normally distributed, just like we wanted. So we can use a regression model here.

sns.heatmap(USAhousing.corr(numeric_only=True))  # numeric_only=True excludes non-numeric columns (required on newer pandas)

(Heatmap of the correlation matrix)

This is another way of seeing the correlation between all the variables. We can see that no feature has a strong correlation with any variable other than the target, which confirms the absence of collinearity. The diagonal shows strong correlation because every variable has 100% correlation with itself.

Step-3

Separating Features and Target Variable.

X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms','Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']

Step-4

Train Test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

test_size=0.4 indicates that 40% of the data will be test data, while the remaining 60% will be training data.
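To confirm the split, we can check the shapes of the resulting sets; assuming the dataset has 5000 rows, this should give 3000 training rows and 2000 test rows:

print(X_train.shape, X_test.shape)  # e.g. (3000, 5) (2000, 5)
print(y_train.shape, y_test.shape)  # e.g. (3000,) (2000,)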

Step-5

Creating and training the model.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)

Step-6

Model Evaluation

# print the intercept
print(lm.intercept_)
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
print(coeff_df)

(Output: the model intercept followed by a table of coefficients, interpreted below)

This is how we interpret the coefficients (a quick numeric check follows the list):

  • Holding all other features fixed, a 1 unit increase in Avg. Area Income is associated with an increase of $21.52.
  • Holding all other features fixed, a 1 unit increase in Avg. Area House Age is associated with an increase of $164883.28.
  • Holding all other features fixed, a 1 unit increase in Avg. Area Number of Rooms is associated with an increase of $122368.67.
  • Holding all other features fixed, a 1 unit increase in Avg. Area Number of Bedrooms is associated with an increase of $2233.80.
  • Holding all other features fixed, a 1 unit increase in Area Population is associated with an increase of $15.15.
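Because the model is linear, we can verify this interpretation directly: increasing a single feature by one unit changes the prediction by exactly that feature's coefficient. A minimal sketch:

# Bump one feature by 1 unit and compare predictions.
row = X_test.iloc[[0]].copy()    # a single test row, kept as a DataFrame
bumped = row.copy()
bumped['Avg. Area Income'] += 1  # 1 unit increase in income
print(lm.predict(bumped)[0] - lm.predict(row)[0])  # ≈ 21.52, the income coefficient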

Step-7

Prediction from our model.

predictions = lm.predict(X_test)
plt.scatter(y_test,predictions)

(Scatter plot of actual test prices vs. predicted prices)

sns.distplot((y_test-predictions),bins=50);

(Distribution plot of the residuals: roughly normal)

If your residuals have a NORMAL distribution, it's a good indication that your choice of a regression model was correct. Otherwise, you may want to go back and see whether another model is better suited to this problem.
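Besides eyeballing the plot, a rough numeric check on the residuals is easy, since y_test - predictions is a pandas Series:

residuals = y_test - predictions
print(residuals.mean())  # should be close to 0 relative to the price scale
print(residuals.skew())  # close to 0 for a roughly symmetric, normal-looking distribution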

Step-8

Regression Evaluation Metrics

Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

MAE = (1/n) Σ |yᵢ − ŷᵢ|

Mean Squared Error (MSE) is the mean of the squared errors:

MSE = (1/n) Σ (yᵢ − ŷᵢ)²

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

RMSE = √( (1/n) Σ (yᵢ − ŷᵢ)² )

Comparing these metrics:

  • MAE is the easiest to understand, because it's the average error.
  • MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world. As we know, larger errors have much larger squares, and squaring also makes every error positive.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.

from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

Output

MAE: 82288.2225191
MSE: 10460958907.2
RMSE: 102278.829223

The smaller the errors (loss), the better the model's performance.
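To connect these numbers back to the formulas above, here is a minimal sketch computing the same three metrics by hand with NumPy; it should match the sklearn output.

errors = y_test - predictions
print('MAE:', np.mean(np.abs(errors)))
print('MSE:', np.mean(errors ** 2))
print('RMSE:', np.sqrt(np.mean(errors ** 2)))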
