Nitin Kendre

Posted on May 31, 2023

Mastering Multiple Linear Regression: A Step-by-Step Implementation Guide with Python Code Examples

#python #machinelearning #datascience #productivity

Introduction :

Multiple Linear Regression is a statistical model used to find the relationship between a dependent variable and multiple independent variables. It helps us understand how different variables contribute to an outcome or prediction. In this article we will see how to implement it in Python, from data preparation to model evaluation.

1. Understanding Multiple Linear Regression :

Simple linear regression has only one independent variable and one dependent variable. Multiple Linear Regression extends this idea: it allows any number of independent variables.

The general equation for Multiple Linear Regression is as follows -

y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ɛ

where,

y is the dependent variable.
β₀ is the intercept (the value of y when all independent variables are zero).
β₁, β₂, ..., βₚ are the coefficients.
x₁, x₂, ..., xₚ are the independent variables.
ɛ represents the error term.
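
To make the equation concrete, here is a minimal sketch that evaluates it for a single observation. The intercept, coefficients, and feature values are hypothetical numbers chosen only for illustration, and the error term is left out.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the equation.
beta_0 = 2.0                           # intercept β₀
betas = np.array([0.5, -1.0, 3.0])     # coefficients β₁, β₂, β₃
x_obs = np.array([4.0, 2.0, 1.0])      # one observation x₁, x₂, x₃

# y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ (error term omitted)
y_hat = beta_0 + np.dot(betas, x_obs)
print(y_hat)  # 5.0
```

The dot product sums each coefficient multiplied by its variable, which is exactly the β₁x₁ + β₂x₂ + ... part of the equation.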

2. Data Preparation :

Data preparation is a fundamental step for any machine learning model. Before data is fed to the model it should be clean, with no missing values, and all values should be numeric.

Below are some code examples -

## Importing Libraries
import numpy as np
import pandas as pd

## Loading Data
m_data = pd.read_csv("50_Startups.csv")

## Checking for missing values
m_data.isnull().sum()

## Creating dependent and independent variables.
x = m_data.iloc[:, :-1].values
y = m_data.iloc[:, -1].values
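
If the missing-value check reports any gaps, they need to be handled before training. The sketch below shows two common options on a small toy data frame. The column names and values are hypothetical and are not taken from 50_Startups.csv.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real dataset; names and values are hypothetical.
toy = pd.DataFrame({
    "spend": [100.0, np.nan, 300.0, 400.0],
    "city": ["A", "B", None, "A"],
    "profit": [10.0, 20.0, 30.0, 40.0],
})

# Count missing values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill a numeric column with its median instead of dropping rows.
filled = toy.copy()
filled["spend"] = filled["spend"].fillna(filled["spend"].median())
```

Dropping rows is simpler but loses data, while filling keeps every row at the cost of introducing estimated values. Which one fits depends on how much data is missing.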

Encoding Categorical Variables :

Categorical values must be encoded as numbers, because the model does not accept categorical values such as strings or characters. In this article we will use one-hot encoding.

Refer to this link for more information on the one-hot encoder.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode the column at index 3 and pass the other columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
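
To see what one-hot encoding produces, here is a small sketch that applies OneHotEncoder to a single toy column. The category values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical column with hypothetical values.
states = np.array([["New York"], ["California"], ["Florida"], ["California"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for printing.
encoded = encoder.fit_transform(states).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(encoded)                 # one column per category, a single 1 per row
```

Each row has exactly one 1, placed in the column of its category, so the model receives numbers without any artificial ordering between the categories.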

Splitting Dataset into training and test set :

Splitting the dataset into a training set and a test set lets us train the model on one part and evaluate it on data it has not seen. This step is important if you do not have a separate test dataset.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=21)
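
As a quick check of what the split does, the sketch below runs train_test_split on synthetic arrays with 50 rows. The arrays are made up for illustration and only the resulting shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 rows, 3 features. Row i of X_demo starts with 3 * i.
X_demo = np.arange(50 * 3).reshape(50, 3)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21
)

print(X_tr.shape, X_te.shape)  # (40, 3) (10, 3)
```

With test_size=0.2, 20% of the rows go to the test set. The rows are shuffled, but each feature row stays paired with its own target value, and a fixed random_state makes the split reproducible.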

For more data preparation tools, refer to this link.

3. Model Training and Evaluation :

In this step we will train our multiple linear regression model on the training set and evaluate it on the test set.

Model Training -

from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
mlr.fit(x_train,y_train)
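
To see what fitting actually learns, here is a sketch on synthetic data where the true intercept and coefficients are known in advance. The data is generated for illustration and has no noise, so the model should recover the values almost exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 2))

# Known relationship: y = 4 + 2*x1 - 3*x2 (no error term)
y_syn = 4.0 + 2.0 * X_syn[:, 0] - 3.0 * X_syn[:, 1]

model = LinearRegression()
model.fit(X_syn, y_syn)

print(model.intercept_)  # close to 4
print(model.coef_)       # close to [2, -3]
```

After fitting, intercept_ holds the estimate of β₀ and coef_ holds the estimates of β₁, ..., βₚ, in the same order as the columns of the training data.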

Predicting -

# Predict from the test features, not from the test targets
y_pred = mlr.predict(x_test)

Model Evaluation -

We will use mean squared error (MSE) to evaluate our model.

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test,y_pred)
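
MSE is the average of the squared differences between the actual and predicted values. The sketch below computes it by hand with NumPy on a few made-up numbers, which is the same formula that mean_squared_error applies.

```python
import numpy as np

# Hypothetical actual and predicted values.
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 6.0])

errors = y_true_demo - y_pred_demo   # [1, 0, -2, 1]
mse_demo = np.mean(errors ** 2)      # (1 + 0 + 4 + 1) / 4
rmse_demo = np.sqrt(mse_demo)        # same units as y

print(mse_demo)  # 1.5
```

Because the errors are squared, MSE is expressed in squared units of the target. Taking the square root (RMSE) brings it back to the units of y, which is often easier to read.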

To learn more about Mean Squared Error (MSE), refer to this link.

4. Interpreting Coefficients :

The coefficients in the multiple linear regression equation describe the relationship between each independent variable and the dependent variable. A positive coefficient indicates a positive relationship between the two variables, while a negative coefficient indicates a negative relationship.

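Below is Python code that builds a data frame containing each feature name and the coefficient of that feature. m_data.columns also includes the target column and does not reflect the one-hot encoded columns, so the names are taken from the fitted ColumnTransformer instead.

# get_feature_names_out() needs scikit-learn 1.0 or newer
coefficients = pd.DataFrame({"Variable": ct.get_feature_names_out(), "Coefficient": mlr.coef_})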
print(coefficients)
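
One way to read a coefficient is as the change in the prediction when that variable increases by one unit and the others stay fixed. The sketch below checks this on synthetic data. The data and the true coefficients (5 and -2) are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_int = rng.normal(size=(200, 2))
# True relationship: y = 1 + 5*x1 - 2*x2 plus a little noise
y_int = 1.0 + 5.0 * X_int[:, 0] - 2.0 * X_int[:, 1] + rng.normal(scale=0.1, size=200)

model_int = LinearRegression().fit(X_int, y_int)

# Two inputs that differ only in the first variable, by exactly one unit.
base = np.array([[1.0, 1.0]])
bumped = np.array([[2.0, 1.0]])
delta = model_int.predict(bumped)[0] - model_int.predict(base)[0]

print(model_int.coef_)  # first value positive, second negative
print(delta)            # equals the first coefficient
```

The first coefficient is positive, so the prediction rises with x1, and the second is negative, so the prediction falls as x2 grows.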

5. Model Evaluation using plots :

We can use various diagnostic plots to evaluate the performance of the model or to diagnose issues.

Below is the python code for plot -

import matplotlib.pyplot as plt

# residuals vs Predicted values
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='-')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Values")
plt.show()

There are different diagnostic plots for models. Refer to the links in the References section for more information.

Conclusion :

We have learned how to implement Multiple Linear Regression in Python, including data preparation, model training, and model evaluation.

By following these steps and using the code examples provided, you can implement Multiple Linear Regression in your own projects.

References :

sklearn: Linear Regression Documentation - Link.
sklearn: Mean Squared Error Documentation - Link.
Seaborn Documentation for Various Diagnostic plots - Link.
Matplotlib Documentation for Visualization - Link.
