DEV Community

Nitin Kendre

Mastering Multiple Linear Regression: A Step-by-Step Implementation Guide with Python Code Examples

Introduction :

Multiple linear regression is a statistical model used to find the relationship between a dependent variable and multiple independent variables. It helps us understand how different variables contribute to an outcome or prediction. In this article, we will see how to implement it in Python, from data preparation to model evaluation.

1. Understanding Multiple Linear Regression :

Simple linear regression has only one independent variable and one dependent variable. Multiple linear regression extends this to any number of independent variables.

The general equation for multiple linear regression is as follows -

y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ɛ

where,

  • y is the dependent variable
  • β₀ is the intercept (the value of y when all independent variables are zero)
  • β₁, β₂, ..., βₚ are the coefficients.
  • x₁, x₂, ..., xₚ are the independent variables.
  • ɛ represents error terms.
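
As a small worked example of the equation above, the sketch below evaluates it for a single observation with three independent variables. The coefficient and input values are hypothetical and chosen only for illustration, and the error term is left out.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta_0 = 2.0                       # intercept
beta = np.array([0.5, -1.0, 3.0])  # coefficients for x1, x2, x3

# One observation with three independent variables
x_obs = np.array([4.0, 2.0, 1.0])

# y = b0 + b1*x1 + b2*x2 + b3*x3 (error term omitted)
y_hat = beta_0 + np.dot(beta, x_obs)
print(y_hat)  # 5.0
```

Here 2.0 + (0.5 × 4.0) + (-1.0 × 2.0) + (3.0 × 1.0) = 5.0, which is the prediction a fitted model would return for this row if it had learned these coefficients.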

2. Data Preparation :

Data preparation is a fundamental step for any machine learning model. Before the data is fed to the model, it should be clean, free of missing values, and entirely numeric.

Below are some code examples -

## Importing Libraries
import numpy as np
import pandas as pd
## Loading Data
m_data = pd.read_csv("50_Startups.csv")
## Checking for missing values
m_data.isnull().sum()
## Creating dependent and independent variables.
x = m_data.iloc[:, :-1].values
y = m_data.iloc[:, -1].values
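
The check for missing values only reports them. If it finds any, they need to be handled before training. The sketch below shows two common options on a small made-up data frame. The column names `spend` and `state` are hypothetical and are not taken from the dataset used in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "spend": [10.0, np.nan, 30.0, 40.0],
    "state": ["A", "B", None, "A"],
})

print(toy.isnull().sum())  # count of missing values per column

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: fill numeric gaps with the median and categorical gaps with the mode
filled = toy.copy()
filled["spend"] = filled["spend"].fillna(filled["spend"].median())
filled["state"] = filled["state"].fillna(filled["state"].mode()[0])
print(filled)
```

Dropping rows is simplest but loses data. Filling keeps every row, at the cost of introducing estimated values.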

Encoding Categorical Variables :

Categorical values must be encoded as numbers, because the model does not accept non-numeric values such as strings or characters. In this article we will use one-hot encoding, applied to the column at index 3 of x.

Refer to the link for more information on the one-hot encoder.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(),[3])],remainder='passthrough')
x = np.array(ct.fit_transform(x))
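
To see what the encoder produces, here is a minimal sketch on a made-up array with one numeric column and one categorical column. The data and the names `toy_x`, `toy_ct`, and `toy_encoded` are hypothetical. The transformer setup mirrors the one above, except that the categorical column sits at index 1.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: column 0 is numeric, column 1 is categorical
toy_x = np.array([[1.0, "A"], [2.0, "B"], [3.0, "A"]], dtype=object)

toy_ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],
    remainder="passthrough",
)
toy_encoded = np.array(toy_ct.fit_transform(toy_x))

# The one-hot columns come first, followed by the passed-through numeric column
print(toy_encoded)
```

Each category becomes its own 0/1 column, and the encoded columns are placed before the remaining columns. This changes both the number and the order of features compared with the original data.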

Splitting Dataset into training and test set :

Splitting the dataset into a training set and a test set lets us train the model on one part and evaluate it on data it has not seen. This step is important if you do not have a separate test dataset.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=21)
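
The effect of `test_size` and `random_state` is easy to verify on made-up data. The sketch below uses a hypothetical 50-row array: `test_size=0.2` holds out 10 rows, and a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 50 rows, 3 features
toy_x = np.arange(150).reshape(50, 3)
toy_y = np.arange(50)

x_tr, x_te, y_tr, y_te = train_test_split(toy_x, toy_y, test_size=0.2, random_state=21)
print(x_tr.shape, x_te.shape)  # (40, 3) (10, 3)

# The same random_state gives the same split on every run
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(toy_x, toy_y, test_size=0.2, random_state=21)
print(np.array_equal(y_te, y_te2))  # True
```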

For more data preparation tools, refer to this link.

3. Model Training and Evaluation :

In this step we will train our multiple linear regression model on the training set and evaluate it on the test set.

Model Training -

from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
mlr.fit(x_train,y_train)

Predicting -

y_pred = mlr.predict(x_test)

Model Evaluation -

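We will use mean squared error (MSE) to evaluate our model. MSE is the average of the squared differences between actual and predicted values, so lower is better.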

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test,y_pred)
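
To make the metric concrete, the sketch below computes MSE by hand on four made-up values and compares it with scikit-learn's result. The arrays are hypothetical. The root of the MSE (RMSE) is also shown, since it is in the same units as the target and is often easier to read.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

# Squared errors are 1, 0, 1, 4, so the mean is 6 / 4 = 1.5
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse_manual)

print(mse_manual, mse_sklearn)  # 1.5 1.5
print(rmse)
```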

To learn more about mean squared error (MSE), refer to this link.

4. Interpreting Coefficients :

The coefficients in the multiple linear regression equation describe the relationship between each independent variable and the dependent variable. Each coefficient is the expected change in y for a one-unit increase in that variable while the other variables are held constant. A positive coefficient indicates a positive relationship, and a negative coefficient indicates a negative relationship.

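The Python code below builds a data frame that pairs each feature name with its fitted coefficient. Because one-hot encoding expanded the categorical column, the names are taken from the fitted column transformer rather than from the original data columns.

# get_feature_names_out() is available on ColumnTransformer in scikit-learn 1.0 and later
coefficients = pd.DataFrame({"Variable": ct.get_feature_names_out(), "Coefficient": mlr.coef_})
print(coefficients)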
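
One way to build intuition for reading coefficients is to fit a model on synthetic data where the true coefficients are known. The sketch below is an illustration under that assumption. The data, the seed, and the names `X`, `y_syn`, and `model` are made up and unrelated to the dataset used in this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# True relationship: y = 1.0 + 2.0*x1 - 3.0*x2 + small noise
y_syn = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y_syn)

# The fitted values should be close to 1.0 (intercept) and [2.0, -3.0] (coefficients)
print(model.intercept_, model.coef_)
```

The positive coefficient on the first variable and the negative coefficient on the second match the signs used to generate the data, which is the positive and negative relationship described above.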

5. Model Evaluation using plots :

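We can use various diagnostic plots to evaluate the performance of the model or to diagnose issues. In a residuals vs. predicted values plot, points scattered randomly around the zero line suggest a reasonable fit.

Below is the Python code for the plot -

import matplotlib.pyplot as plt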

# residuals vs Predicted values
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='-')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Values")
plt.show()
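
A quick numeric check can accompany the plot. For ordinary least squares with an intercept, the residuals on the training data average to zero, so a training residual mean far from zero would point to a mistake in the pipeline. The sketch below shows this on synthetic data, where all names and values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))

# Hypothetical linear relationship with noise
y_syn = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y_syn)
train_residuals = y_syn - model.predict(X)

# Mean is zero up to floating-point error; the spread reflects the noise level
print(train_residuals.mean())
print(train_residuals.std())
```

On the test set the residual mean will not be exactly zero, but it should still be small relative to the scale of the target.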

There are different diagnostic plots for models. You can refer to the links below for more information.

Conclusion :

We have learned how to implement multiple linear regression using Python, along with data preparation, model evaluation, and coefficient interpretation.

By following these steps and using the code examples provided, you can implement multiple linear regression in your own projects.

References :

  • sklearn: Linear Regression Documentation - Link
  • sklearn: Mean Squared Error Documentation - Link
  • Seaborn Documentation for Various Diagnostic Plots - Link
  • Matplotlib Documentation for Visualization - Link
