DEV Community

Cover image for Multiple Linear Regression
Erick Badillo
Erick Badillo

Posted on

Multiple Linear Regression

Hello, this is my second post regarding Linear Regression. These posts aim to share the fundamentals of the programming and math concepts behind the basics of ML algorithms.

You can go post by post and follow the sequence, or you can read only what you are interested in. I hope you find interesting this amazing topic and get a good understanding of the basics behind ML algorithms.

I will visit Linear Regression, Classification, Clustering, and some concepts of feature engineering, regularization, overfitting, and underfitting. Therefore, I hope you enjoy reading and please let me know your thoughts.

In the previous post , I wrote about how Linear Regression is implemented from scratch using:

  • Cost function: Mean Squared Error (MSE)
  • Optimization function: Gradient descent
  • Python implementation from scratch using loop statements
  • One feature to predict the label

In this post, I use:

  • Cost function: Mean Squared Error (MSE)
  • Optimization function: Gradient descent
  • Python implementation from scratch using Numpy
  • Multiple features to predict the label
  • R-squared metric to measure our model performance

Let's get started

Linear regression assumes a linear relationship between the label and features. This means that we believe the relationship can be expressed as a straight-line equation:

y=f(x)=wx+by = f(x) = wx + b

Considering we have more than one feature and more than one test example, we can generalize the above equation as a prediction hypothesis to compute the vector y^\hat{y} .
y^=wX+b\overrightarrow{\hat{y}} = \overrightarrow{w}\cdot X+b

Another way to write the above equation would be:
[y1^ym^]=[w1wn][x11x1nxm1xmn]+[b] \begin{bmatrix}\hat{y_1}\\\vdots\\\hat{y_m}\end{bmatrix} = \begin{bmatrix}w_1\\\vdots\\w_n\end{bmatrix}\cdot \begin{bmatrix}x_{11} & \cdots & x_{1n} \\\vdots & \ddots & \vdots \\x_{m1} & \cdots & x_{mn}\end{bmatrix} + \begin{bmatrix}\vdots\\b\\\vdots\end{bmatrix}

Now we have the hypothesis to predict the label from multiple features. Let's have fun writing some code and doing some math.

1. Data generation

We will use a pre-built library to create an ad-hoc dataset for our exercise. Using this library is very helpful because in this way we can skip the feature selection, and feature engineering processes.

Nevertheless, whenever you want to compute predictions with your own data take into consideration that you will need to run into the feature selection, and feature engineering processes to ensure your model will perform well with unknown data.

Libraries import

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
Enter fullscreen mode Exit fullscreen mode

Data generation

X, y = make_regression(n_samples = 200, n_features = 3,
random_state = 42, noise = 50)
y = y.reshape(200, 1)
Enter fullscreen mode Exit fullscreen mode

2. Data visualization

Features histogram

sns.histplot(X[:, 0], kde=True, label='Feature 1')
sns.histplot(X[:, 1], kde=True, label='Feature 2')
sns.histplot(X[:, 2], kde=True, label='Feature 3')
plt.xlabel('Feature Values')
Enter fullscreen mode Exit fullscreen mode

Features histogram

3. Training/Test sets creation

X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
Enter fullscreen mode Exit fullscreen mode

4. Model implementation

Cost function

J(w,b)=12mi=1m(fw,b(x(i))y(i))2 \begin{equation}J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}(f_{w,b}(x^{(i)})-y_{(i)})^{2}\end{equation}

def compute_loss(X, y, w, b, m):
    y_hat =, w) + b
    loss = (1 / (2* m)) * np.sum((y_hat - y)**2)
    return loss
Enter fullscreen mode Exit fullscreen mode

The compute gradients function

δJ(w,b)δw=1mi=1m(fw,b(x(i))y(i))x(i) \frac{\delta J(w,b)}{\delta w} = \frac{1}{m}\sum_{i=1}^{m}(f_{w,b}(x^{(i)})-y^{(i)})x^{(i)}
δJ(w,b)δb=1mi=1m(fw,b(x(i))y(i)) \frac{\delta J(w,b)}{\delta b} = \frac{1}{m}\sum_{i=1}^{m}(f_{w,b}(x^{(i)})-y^{(i)})

def compute_gradients(X, y, w, b, m):
    y_hat =, w) + b
    error = y_hat - y
    dj_dw = (1 / m) * np.sum(error * X, axis = 0)
    dj_db = (1 / m) * np.sum(error)        
    return dj_dw, dj_db
Enter fullscreen mode Exit fullscreen mode

The optimization function (gradient descent)

w=wαδJ(w,b)δw w = w-\alpha\frac{\delta J(w,b)}{\delta w}
b=bαδJ(w,b)δb b = b-\alpha\frac{\delta J(w,b)}{\delta b}

def compute_optimization(learning_rate, w, b, dj_dw, dj_db):
    w_new = w - (learning_rate * dj_dw)
    b_new = b - (learning_rate * dj_db)
    return w_new, b_new
Enter fullscreen mode Exit fullscreen mode

The fit(training) function

def fit(epochs, learning_rate, X, y, w, b, debug):                
    m, n = X.shape

    losses = list() #tracking purposes

    for epoch in range(epochs):

        loss = compute_loss(X, y, w, b, m)        

        dj_dw, dj_db = compute_gradients(X, y, w, b, m)            
        w_new, b_new = compute_optimization(learning_rate, 
w, b, dj_dw, dj_db)

        w = w_new
        b = b_new

        # print
        if (debug and (epoch % 50) == 0):
            print(f"epoch: {epoch}, loss: {loss}")            

    return w, b, losses
Enter fullscreen mode Exit fullscreen mode

Training the model

epochs = 500
learning_rate = 0.01

m, n = X_train.shape

w = np.random.rand(n,1)
b = np.random.rand(1)

w, b, loss = fit(epochs, learning_rate, X_train, y_train, w, b, True)
w, b
Enter fullscreen mode Exit fullscreen mode

5. Prediction and model evaluation

Prediction function

def predict(X):
    return, w[0]) + b
Enter fullscreen mode Exit fullscreen mode

Prediction and model evaluation using R-squared

R-squared ( R2R^2 is a metric that ranges from 0 to 1 and is sometimes expressed as a percentage between 0% and 100%. It represents the proportion of variance in the dependent variable (target) that is explained by the regression model. In other words, it indicates how much variability in target values is captured by the model.

R-squared interpretation

R2=0,R^2 = 0, The model explains no variability in the data and is essentially useless for making predictions.

R2=1,R^2 = 1, The model explains all the variability in the data and makes perfect predictions. This is rare in practice and could be a sign of overfitting.

0<R2<1,0 < R^2 < 1, The better the model fits the data. A higher value indicates that the model can explain a greater proportion of the variability in the target values.

y_pred = predict(X_test)
score = r2_score(y_test, y_pred)
Enter fullscreen mode Exit fullscreen mode

Final comments

In this journey through multiple linear regression, we embarked on a hands-on exploration of the fundamental concepts, implementation from scratch using Python and NumPy, and performance evaluation using the Mean Squared Error (MSE) metric and the R-squared ( R2R^2 ) score. Achieving an R-squared value of 0.707 demonstrates that our model captures a substantial portion of the data's variability, indicating a meaningful relationship between our predictor variables and the target. This journey not only allowed us to gain insights into the inner workings of regression modeling but also showcased the power of building predictive models from the ground up. Armed with these skills, we're well-equipped to tackle more complex regression problems and fine-tune our models for even better performance.


[1] A. Géron, "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow," O'Reilly Media, Inc., Oct. 2022.

[2] I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning," The MIT Press.

Top comments (0)