Linear Regression

A journey of a thousand miles begins with a single step. My journey began this way, and I believe your ML journey is going to start today, with linear regression, so let's begin... :)

Two Takeaways

  1. Mathematical Explanation behind linear regression
  2. Python Implementation Using Scikit-Learn

What is Regression and why is this algorithm called Linear Regression?

Regression is a statistical method that helps us find the relationship between independent and dependent variables.
To make it simple: you have two columns, "EXPERIENCE" and "KNOWLEDGE", and the relationship is how your knowledge increases with experience; modelling this relationship is called regression.

Linear Regression

The linearity assumption in linear regression means the model is linear in its parameters (i.e. the coefficients of the variables) and may or may not be linear in the variables themselves. For example, y = θ₁ + θ₂x² is still linear regression, because the model is linear in θ₁ and θ₂ even though it is not linear in x.

(Figure: a perfectly linear graph, showing the straight-line relationship that linear regression assumes)

Mathematical Understanding of Linear Regression

  1. Hypothesis Function
  2. Loss and cost functions
  3. Optimisation

Hypothesis Function

Any machine learning algorithm is, at its core, a hypothesis function: think of it as a function that you use to get an output from a set of inputs. This hypothesis function is optimised to work for our dataset, and this process of optimisation is what we call model training.

y = θ₁ + θ₂x

The above equation is the hypothesis function for linear regression, and yes, you have probably seen it somewhere before: it is similar to the equation of a straight line, y = m*x + c.
It is a single-variable function, i.e. its data contains just 2 columns in total, the input column and the output column.

x ==> Input/Independent Variable
y ==> Output/Dependent Variable
θ₁, θ₂ ==> Constants

We basically try to predict y by giving x as the input and assigning optimised values to θ₁ and θ₂ (NOTE: these constants are also called weights, and in the rest of this post θ₁ and θ₂ are referred to as weights).
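To make this concrete, here is a minimal sketch of the single-variable hypothesis function in Python (the weight values below are arbitrary placeholders, not optimised values):

import numpy as np

def hypothesis(x, theta1, theta2):
    # h(x) = theta1 + theta2 * x, i.e. a straight line
    return theta1 + theta2 * x

x = np.array([1.0, 2.0, 3.0])
print(hypothesis(x, theta1=0.5, theta2=2.0))   # [2.5 4.5 6.5]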

Multi-Variable Hypothesis Function

When we have many columns (i.e. many input variables), the hypothesis function looks like this:

y = θ₁ + θ₂x₁ + θ₃x₂ + ... + θₙ₊₁xₙ
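A hedged NumPy sketch of the multi-variable case, treating the weights as a vector (the numbers are made up for illustration):

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])          # two rows, two input columns (x1, x2)
intercept = 1.0                     # θ₁ in the notation above
theta = np.array([2.0, 0.5])        # θ₂, θ₃: one weight per input column
y_pred = intercept + X @ theta      # y = θ₁ + θ₂*x1 + θ₃*x2 for every row
print(y_pred)                       # [4. 9.]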

Loss and Cost Functions

Loss and cost functions are not exactly the same; there is a subtle difference between them.

When we measure the deviation between the actual and predicted value of a single data point, it is called loss; when we cumulatively measure the deviation over the entire dataset, it is called cost. The corresponding functions are called the loss and cost functions.
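A tiny illustration of the difference, using squared error as the loss and its mean over the dataset as the cost (the numbers are made up):

import numpy as np

y_actual = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])

loss_first_point = (y_actual[0] - y_pred[0]) ** 2   # loss: a single data point
cost = np.mean((y_actual - y_pred) ** 2)            # cost: the whole dataset
print(loss_first_point, cost)                       # 0.25 0.5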

Some common loss and cost functions for linear regression are:

(Figure: common loss and cost functions for linear regression)

Mean Squared Error (MSE)

The cost function that we are going to use is Mean Squared Error (MSE), which can be used to optimise the hypothesis function of linear regression.

MSE = (1/n) * Σ (yᵢ - ŷᵢ)²

where n is the number of data points, yᵢ is the actual value and ŷᵢ is the predicted value.

Basically, for each and every data point, the difference between the predicted value and the actual value is found and squared, and these squared differences are then averaged over the whole dataset.

So far we have predicted with random weights (θ₁, θ₂); now a process called training is used to optimise those weights.
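To see why training matters, here is a quick sketch comparing the MSE cost under random weights versus well-chosen weights (the data and weight values are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])            # generated by y = 2x + 1

def mse(theta1, theta2):
    y_pred = theta1 + theta2 * x         # hypothesis function
    return np.mean((y - y_pred) ** 2)

print(mse(0.0, 0.5))                     # random weights -> large cost
print(mse(1.0, 2.0))                     # well-chosen weights -> zero cost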

Training the model

Training could be defined as the process of optimising the hypothesis function by optimising its weights.
This optimisation is achieved using gradient descent.

(Figure: gradient descent moving down the error curve)
The curve above is called the error curve, and it is the output of the mean squared error (MSE) plotted against the weights.
Gradient descent is the process where we update the weights of the hypothesis function, either by small values or large values depending on the learning rate, so that we move downhill along this curve.
On reaching a particular threshold error value, gradient descent stops, and we are left with optimised weights that best suit our input dataset.
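A minimal sketch of gradient descent for the single-variable case (the learning rate, epoch count, starting weights and data below are arbitrary illustrative choices, not values from this post):

import numpy as np

def gradient_descent(x, y, lr=0.01, epochs=1000):
    theta1, theta2 = 0.0, 0.0                # start from arbitrary weights
    n = len(x)
    for _ in range(epochs):
        y_pred = theta1 + theta2 * x         # hypothesis function
        error = y_pred - y
        # partial derivatives of the MSE with respect to each weight
        grad1 = (2 / n) * error.sum()
        grad2 = (2 / n) * (error * x).sum()
        theta1 -= lr * grad1                 # step downhill on the error curve
        theta2 -= lr * grad2
    return theta1, theta2

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                            # made-up data with known slope and intercept
print(gradient_descent(x, y))                # approaches (1.0, 2.0)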

(Figure: the best-fit line, a red line drawn through the data points)
The red line is called the best-fit line, and it represents the final output after the linear regression model has completed training.

Python Implementation - Linear Regression Using Scikit-Learn

Importing the required Python libraries

# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

We are going to use the diabetes dataset to understand linear regression; this dataset comes along with the scikit-learn library.

# We are taking an inbuilt dataset present in sklearn 
diabetes = datasets.load_diabetes()

Dataset description

The scikit-learn diabetes dataset contains ten baseline variables for 442 diabetes patients, and the target is a quantitative measure of disease progression one year after baseline.

Columns present in the dataset

diabetes.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

Storing the training data and target data separately in two different variables

# Extracting the training data and target 
X = diabetes.data
Y = diabetes.target
print(X.shape, Y.shape)

Output: (442, 10) (442,)

Importing the linear regression model from the scikit-learn library; the model is implemented as a class.

from sklearn.linear_model import LinearRegression
le = LinearRegression()
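The snippets below reference train_x, train_y, test_x and test_y, which are never created in the post; a minimal sketch to produce them with scikit-learn's train_test_split (the 80/20 split and the random_state are assumptions, not values from the original):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=42)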

Training process

le.fit(train_x, train_y)

Yes, the whole training process is done in one single line of code, thanks to scikit-learn.
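Once fitted, the optimised weights can be inspected on the model object; intercept_ plays the role of θ₁ and coef_ holds the remaining weights, one per input column:

print(le.intercept_)   # the intercept (θ₁)
print(le.coef_)        # one weight per input column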

Making Predictions

y_pred = le.predict(test_x)

The predictions on the test dataset are stored in y_pred.

Let's print the results by converting them into a pandas dataframe.

result = pd.DataFrame({'Actual': test_y, 'Model Prediction' : y_pred})
print(result.head(20))

(Output: a dataframe with the first 20 actual values alongside the model predictions)

Visualisation

Let's take a small subset, i.e. 20 data points, of our predictions and compare it with the actual output using the matplotlib library.

sample_result = result.head(20)
sample_result.plot(y=["Actual", "Model Prediction"],
        kind="line", figsize=(10, 7))

(Output: a line plot comparing the actual values with the model predictions for the 20 data points)

Variance Score

print('Variance score: {}'.format(le.score(test_x, test_y)))

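For scikit-learn regressors, score returns the coefficient of determination R²:

R² = 1 - (Σ (yᵢ - ŷᵢ)²) / (Σ (yᵢ - ȳ)²)

where ȳ is the mean of the actual values; the closer the score is to 1, the more of the variance in the target the model explains.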

That's all about linear regression, thank you for your patience :))
