Linear Regression

A journey of a thousand miles begins with a single step. My journey began this way, and I believe your M.L journey is going to start today, with linear regression, so let's begin... :)

Two Takeaways

  1. Mathematical Explanation behind linear regression
  2. Python Implementation Using Scikit-Learn

What is Regression, and why is this algorithm called Linear Regression?

Regression is a statistical method that helps us find the relationship between independent and dependent variables.
To make it simple: say you have two columns, "EXPERIENCE" and "KNOWLEDGE", and you want to know how knowledge increases with experience. Modelling that relationship is regression.
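A toy illustration of such a relationship in code (the numbers below are made up purely for demonstration):

# Toy data: years of experience vs. a made-up knowledge score
import numpy as np

experience = np.array([1, 2, 3, 4, 5])
knowledge = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# A correlation close to 1 suggests knowledge rises steadily with experience
print(np.corrcoef(experience, knowledge)[0, 1])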

Linear Regression

The linearity assumption in linear regression means the model is linear in its parameters (i.e. the coefficients of the variables) and may or may not be linear in the variables themselves. For example, y = theta1 + theta2*x² is still a linear regression model, because it is linear in theta1 and theta2.

(Image: a perfectly linear graph of y against x)

Mathematical Understanding of Linear Regression

  1. Hypothesis Function
  2. Loss and cost function
  3. Optimisation

Hypothesis Function

Any machine learning algorithm is, at its heart, a hypothesis function. Imagine this to be a function that you use to get an output from a set of inputs; this hypothesis function is optimised to work for our dataset, and this process of optimisation is what we call model training.

y = theta1 + theta2 * x

The above equation is the hypothesis function for linear regression, and yes, you may have seen it somewhere before: it has the same form as the equation of a straight line, y = m*x + c.
It is a single-variable function, i.e. the dataset for it contains just 2 columns in total: the input column and the output column.

x ==> Input/Independent Variable
y ==> Output/Dependent Variable
theta1, theta2 ==> Constants

We basically try to predict y by giving x as the input and assigning optimised values to theta1 and theta2 (NOTE: these constants are also called weights, and from here on theta1 and theta2 will be referred to as weights).
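A minimal sketch of this hypothesis function in plain Python (the weight values below are arbitrary, purely for illustration):

# Single-variable hypothesis: y = theta1 + theta2 * x
def hypothesis(x, theta1, theta2):
    # theta1 is the constant (intercept), theta2 scales the input
    return theta1 + theta2 * x

print(hypothesis(5, theta1=2.0, theta2=0.5))   # 2.0 + 0.5*5 = 4.5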

Multi-Variable Hypothesis Function

y = theta1 + theta2*x1 + theta3*x2 + ... + theta(n+1)*xn

When we have many input columns (x1, x2, ..., xn), the hypothesis function takes this form: one weight per column, plus the constant term.
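With many columns, the same idea becomes a dot product between a weight vector and a feature vector; here is a small sketch with made-up numbers:

# Multi-variable hypothesis as a dot product
import numpy as np

def hypothesis(x, weights, bias):
    # x and weights have one entry per input column; bias is the constant term
    return bias + np.dot(weights, x)

x = np.array([1.0, 2.0, 3.0])            # one data point with 3 features
weights = np.array([0.4, 0.1, 0.2])      # arbitrary illustrative weights
print(hypothesis(x, weights, bias=1.0))  # 1.0 + 0.4 + 0.2 + 0.6 = 2.2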

Loss and Cost Functions

Loss and cost functions are not exactly the same; there is a subtle difference between them.

When we measure the deviation between the actual and the predicted value of a single data point, it is called loss, and when we cumulatively measure the deviation over the entire dataset, it is called cost. The corresponding functions are called the loss and cost functions.

Some common loss and cost functions for linear regression are Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).

Mean Squared Error (MSE)

The cost function that we are going to use is Mean Squared Error (MSE), which can be used to optimise the hypothesis function of linear regression.

MSE = (1/n) * Σ (y_actual - y_predicted)²

Basically, the difference between the predicted value and the actual value is found and squared; this is done for each and every data point, and the squared differences are averaged.
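The same computation in a few lines of Python (the values are made up to keep the arithmetic easy to follow):

# MSE: average of squared differences between actual and predicted values
import numpy as np

def mse(y_actual, y_pred):
    return np.mean((y_actual - y_pred) ** 2)

y_actual = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])
print(mse(y_actual, y_pred))   # (0.25 + 0.25 + 1.0) / 3 = 0.5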

So far we have predicted with random weights (theta1, theta2); next, a process called training is used to optimise them.

Training the model

Training could be defined as the process of optimising the hypothesis function by optimising its weights.
This optimisation is achieved using gradient descent.

(Image: gradient descent along the error curve)
The above curve is called the error curve, and it is a plot of the mean squared error (MSE) against the weight values.
Gradient descent is the process of repeatedly updating the weights of the hypothesis function, by small or large steps depending on the learning rate, in the direction that reduces the error.
On reaching a particular threshold error value, gradient descent stops, and we are left with optimised weights that best suit our input dataset.
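Here is a bare-bones sketch of gradient descent for the single-variable case, assuming the MSE cost from above (the data, learning rate, and iteration count are arbitrary illustrative choices):

# Bare-bones gradient descent for y = theta1 + theta2*x with MSE cost
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # generated from y = 1 + 2x

theta1, theta2 = 0.0, 0.0            # initial (unoptimised) weights
lr = 0.05                            # learning rate

for _ in range(2000):
    error = (theta1 + theta2 * x) - y
    # Partial derivatives of MSE with respect to each weight
    grad1 = 2 * np.mean(error)
    grad2 = 2 * np.mean(error * x)
    theta1 -= lr * grad1             # step against the gradient
    theta2 -= lr * grad2

print(theta1, theta2)                # converges towards 1.0 and 2.0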

(Image: the best-fit line through the data points)
The red line is called the best-fit line, and it represents the final output after the linear regression model has completed training.

Python Implementation - Linear Regression Using Scikit-Learn

Importing the required Python libraries

# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

We are going to use the diabetes dataset to understand linear regression, and this dataset comes along with the scikit-learn library.

# We are taking an inbuilt dataset present in sklearn 
diabetes = datasets.load_diabetes()

Dataset description

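You can print the description straight from the loaded object; DESCR is a standard attribute of scikit-learn's bundled datasets:

# Print the dataset's built-in description
print(diabetes.DESCR)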

Columns present in the dataset

diabetes.feature_names

Output: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

Storing the training data and target data separately in two different variables

# Extracting the training data and target
X = diabetes.data
Y = diabetes.target
print(X.shape, Y.shape)

Output: (442, 10) (442,)

Importing the linear regression model from the scikit-learn library; the model is provided as a class.

from sklearn.linear_model import LinearRegression
le = LinearRegression()
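The model is trained on one portion of the data and evaluated on the rest, so let's split the dataset into training and test sets first (the 80/20 ratio and fixed random_state below are arbitrary choices):

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=42)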

Training process

le.fit(train_x, train_y)

Yes, the whole training process is done in one single line of code, thanks to scikit-learn.
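If you're curious what the optimised weights look like, the fitted model exposes them through its intercept_ and coef_ attributes:

# The optimised weights learned during training
print(le.intercept_)   # the constant term
print(le.coef_)        # one weight per input column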

Making Predictions

y_pred = le.predict(test_x)

The predictions on the test dataset are stored in y_pred.
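As a quick sanity check, we can compute the MSE cost from the earlier section on these test predictions using scikit-learn's built-in metric:

from sklearn.metrics import mean_squared_error

# Mean squared error between actual and predicted test values
print(mean_squared_error(test_y, y_pred))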

Let's print the results by converting them into a pandas DataFrame.

result = pd.DataFrame({'Actual': test_y, 'Model Prediction' : y_pred})
print(result.head(20))

(Image: DataFrame of actual vs model-predicted values)

Visualisation

Let's take a small subset, i.e. 20 data points of our predictions, and compare them with the actual output using the matplotlib library.

sample_result = result.head(20)
sample_result.plot(y=["Actual", "Model Prediction"],
        kind="line", figsize=(10, 7))

(Image: line plot comparing actual and predicted values)

Variance Score

print('Variance score: {}'.format(le.score(test_x, test_y)))

For a regression model, score returns the R² score (coefficient of determination), where 1.0 means a perfect fit.

That's all about linear regression, thank you for your patience :))
