
Regression Modeling

Make sure to read the previous blog in this series before reading this one.

What is a regression model, and what kind of problems is it used for?

A regression model predicts continuous values. For example, regression models make predictions that answer questions like the following:

  • What is the value of a house in California?
  • What is the probability that a user will click on this ad?

They can't be used to predict discrete values. For example:

  • Is a given email message spam or not spam?
  • Is this an image of a dog, a cat, or a hamster?

Such problems are known as classification problems. We will see which machine learning algorithms we can use for classification problems in future blogs.

The dataset used in this blog can be found here.

Linear Regression Model

In a linear regression model, we have only one dependent variable (target variable) and one independent variable (feature).

Step-1

Loading Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import linear_model

We are using the Scikit-Learn library to build our model. By default, LinearRegression uses the Ordinary Least Squares method, which finds the line that minimizes the sum of squared differences between the actual and predicted values.

Step-2

Loading Dataset.

df = pd.read_csv('homeprices.csv')

Step-3

We know that for linear regression to work, the data we are fitting must have a roughly linear relationship. Let's draw a scatter plot to check whether it is linear.

plt.xlabel('Area')
plt.ylabel('Price')
plt.scatter(df['area'],df['price'],marker='+', color='red')

[Scatter plot of area vs. price]

We can see that the points indicate an almost linear relationship, so the data can be fit by a straight line.

Step-4

Creating and fitting our linear regression model.

linear_reg_model = linear_model.LinearRegression()
linear_reg_model.fit(df[['area']],df['price'])

You may wonder why we are using df[['area']] instead of df['area']. The thing is, df['area'] returns a pandas Series, while fit expects a two-dimensional input such as a DataFrame for the features. df[['area']] returns a DataFrame instead of a Series, which is why we use it.
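
A quick way to see the difference for yourself (a minimal check, not part of the original post):

print(type(df['area']))    # <class 'pandas.core.series.Series'> -- 1-D
print(type(df[['area']]))  # <class 'pandas.core.frame.DataFrame'> -- 2-D

# A 1-D array reshaped to (n_samples, 1) would work just as well:
# linear_reg_model.fit(df['area'].values.reshape(-1, 1), df['price'])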

Step-5

Testing our model on a single value

linear_reg_model.predict([[3300]])

Output

array([628715.75342466])

Our model is making predictions as intended.

Step-6

Checking the calculated regression coefficient and y-intercept.

print(linear_reg_model.coef_)
print(linear_reg_model.intercept_)

Output

[135.78767123]
180616.43835616432

For the equation y = mx + b, linear_reg_model.coef_ gives the slope m and linear_reg_model.intercept_ gives the y-intercept b.
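
As a quick sanity check, we can reproduce the prediction from Step-5 by hand from these two values:

m = linear_reg_model.coef_[0]
b = linear_reg_model.intercept_
print(m * 3300 + b)  # approx. 628715.75, matching predict([[3300]]) above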

Step-7

Making predictions on the test data.

test_data = pd.read_csv('areas.csv')
preds = linear_reg_model.predict(test_data)
test_data['price'] = preds
plt.scatter(test_data['area'],test_data['price'],color='red')
plt.plot(test_data['area'],preds,color='green')

Here, the plot method draws the straight regression line while the scatter method draws the individual points. We are trying to get an idea of how well our regression line fits the actual data points.

[Predicted prices plotted against area, with the fitted regression line]

Here, we don't have the actual price values, so for the sake of visualization we are treating the predicted values as both the actual and the predicted values. That's why all the points lie exactly on the line, which would not be the case otherwise.

Multivariate Regression Model

When we have more than one independent variable (feature), the dimensionality of our problem increases, and the data can no longer be fit by a straight line. Instead, we fit a plane (for two features) or a hyperplane (for three or more) to our dataset.
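
In equation form (spelling out what the paragraph above describes; this is the standard formulation rather than a quote from the original post), the straight line y = mx + b generalizes to one coefficient per feature:

y = b + m1*x1 + m2*x2 + ... + mn*xn

where each mi is the coefficient learned for feature xi and b is still the intercept.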

Step-1

Importing libraries and reading the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
USAhousing = pd.read_csv('USA_Housing.csv')

Step-2

We want to see whether our data is linear, whether there is any collinearity or noise, and what its distribution looks like.

sns.pairplot(USAhousing)

Remember! This may take quite some time for larger datasets. Just be patient.

[Pairplot of every pair of variables in the USA housing dataset]

We can see that, other than with Price, none of the features show collinearity with each other. There isn't any noticeable noise either.

Let's now see the distribution of the dependent variable, Price.

sns.distplot(USAhousing['Price'])
# Note: distplot is deprecated in recent seaborn releases;
# sns.histplot(USAhousing['Price'], kde=True) is the modern equivalent.

[Distribution plot of Price: roughly bell-shaped]

We can clearly see that it's approximately normally distributed, just like we wanted. So, we can use a regression model here.

sns.heatmap(USAhousing.corr())
# Note: on recent pandas versions, corr() raises an error if the frame has
# non-numeric columns (such as an address field); use corr(numeric_only=True) there.

[Heatmap of the correlation matrix]

This is another way of seeing the correlation between all the variables. We can see that no variable other than the target has a meaningful correlation with any other variable, which confirms the absence of collinearity. The diagonal shows a strong correlation because every variable has 100% correlation with itself.
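
As a small optional tweak (not in the original post), seaborn can also print the numeric correlation values inside each cell via the annot parameter:

sns.heatmap(USAhousing.corr(numeric_only=True), annot=True, cmap='coolwarm')
# annot=True writes each correlation coefficient inside its cell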

Step-3

Separating the features and the target variable.

X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms','Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']

Step-4

Train Test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

test_size=0.4 indicates that 40% of the rows will be test data, while the remaining 100 - 40 = 60% will be training data.
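
We can verify the split sizes directly (a quick check; the commonly distributed USA_Housing.csv has 5,000 rows, which would give the shapes in the comment):

print(X_train.shape, X_test.shape)  # e.g. (3000, 5) (2000, 5) for a 60/40 split of 5,000 rows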

Step-5

Creating and training the model.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)

Step-6

Model Evaluation

# print the intercept
print(lm.intercept_)
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
print(coeff_df)

[Printed intercept, followed by the coefficient table:]

                              Coefficient
Avg. Area Income                    21.52
Avg. Area House Age             164883.28
Avg. Area Number of Rooms       122368.67
Avg. Area Number of Bedrooms      2233.80
Area Population                     15.15

This is how we interpret these coefficients (a hand-check of a single prediction follows the list):

  • Holding all other features fixed, a 1 unit increase in Avg. Area Income is associated with an increase of $21.52 in Price.
  • Holding all other features fixed, a 1 unit increase in Avg. Area House Age is associated with an increase of $164883.28 in Price.
  • Holding all other features fixed, a 1 unit increase in Avg. Area Number of Rooms is associated with an increase of $122368.67 in Price.
  • Holding all other features fixed, a 1 unit increase in Avg. Area Number of Bedrooms is associated with an increase of $2233.80 in Price.
  • Holding all other features fixed, a 1 unit increase in Area Population is associated with an increase of $15.15 in Price.
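
As promised, we can reproduce any single model prediction by hand from the intercept and these coefficients (a minimal sketch; row index 0 is just an example):

first_row = X_test.iloc[0]
manual_pred = lm.intercept_ + np.dot(lm.coef_, first_row)
print(manual_pred, lm.predict(X_test.iloc[[0]])[0])  # the two values match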

Step-7

Making predictions with our model.

predictions = lm.predict(X_test)
plt.scatter(y_test,predictions)

[Scatter plot of actual (y_test) vs. predicted prices: points close to a straight diagonal line indicate accurate predictions]

sns.distplot((y_test-predictions), bins=50)
# As above, on recent seaborn use sns.histplot(y_test-predictions, bins=50, kde=True)

[Distribution of the residuals: roughly bell-shaped]

If your residuals have a normal distribution, it's a good indication that choosing a regression model was correct. Otherwise, you may want to go back and see whether another model is better suited to this problem.
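
A complementary way to check residual normality (not in the original post, but a common companion to the histogram) is a Q-Q plot from SciPy:

from scipy import stats

residuals = y_test - predictions
stats.probplot(residuals, plot=plt)  # points hugging the reference line => roughly normal
plt.show()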

Step-8

Regression Evaluation Metrics

Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|

Mean Squared Error (MSE) is the mean of the squared errors:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

RMSE = √( (1/n) Σᵢ (yᵢ − ŷᵢ)² )

Comparing these metrics:

  • MAE is the easiest to understand, because it's the average error.
  • MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world: larger errors have much larger squares, and squaring also makes every error positive.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.
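
These formulas translate directly into NumPy, which should give the same numbers as sklearn's metrics below (a minimal sketch for illustration):

errors = y_test - predictions
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse)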

from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

Output

MAE: 82288.2225191
MSE: 10460958907.2
RMSE: 102278.829223

The smaller the error (loss), the better the model's performance.
