seboo1995
Simple Linear Regression Explained with Example

Linear regression is a very simple supervised learning approach used to predict a quantitative variable. The model can seem dull compared to more advanced, mathematically heavy models, but it is very powerful and still widely used.
Moreover, linear regression is the best starting point for understanding more complex ML algorithms.
To understand this simple but very effective model better, I will use an example, and for that we need data. I will use data about cars and their prices.
The example model will relate engine horsepower to price: how useful is the car's horsepower alone for predicting its price?

You might ask yourself when to use linear regression and which questions it can answer, so here are some questions that could be answered with linear regression:

  • Is there a relationship between manufacturing year and price?

  • How strong is the relationship between the number of cylinders in the engine and the price? Are faster cars more expensive?

  • Which feature contributes more to changes in the price?

  • How accurately can we estimate the effect of each feature on the price?

  • Is the relationship linear?

Simple Linear Regression

Simple Linear Regression is used to predict a quantitative response Y on the basis of a single predictor X. It assumes that there is approximately a linear relationship between X and Y.
Mathematically

Y ≈ β0 + β1·X

or

price = β0 + β1 * horsepower

β0 and β1 are called the coefficients or parameters.

Estimating the coefficients

In practice, β0 and β1 are unknown, so we use the data we have to estimate them.

So we have two variables, price and horsepower, and we want to model price = β0 + β1 * horsepower.

Because we have only two variables, we can map this to a 2-D space and visualize it.

[Scatter plot of car price against engine horsepower]

So, in a less abstract way, we must fit a line through the data (if we visualize a two-variable linear equation, it will always be a line). Now the main question arises: how do we fit the line? To find the optimal line, we use a method called Gradient Descent, which gives us a way to get the best-fit line with the least error. Let's explain Gradient Descent at a very high level (if you are a visual learner, I suggest this video made by Josh Starmer).

First, we initialize the two βs as random numbers, which gives us a random line.

Then we take the distance between the line and each of the points; these distances are called errors.

After that, we need to sum the errors. However, there will be negative and positive errors, and if we simply sum them they will cancel each other out. To overcome this, we square each error before summing; the result is called the Residual Sum of Squares (RSS). In mathematical terms, each squared error is simply (predicted_price − real_price)², where the predicted price is obtained using the current coefficients.
Now the error is simply a function of two unknowns (β0 and β1), and we want to minimize this function, i.e. minimize the error by choosing the optimal βs.
Without going into calculus (trust me, you don't want to go there :) ), we can take derivatives that tell us in which direction to change the βs.
After some iterations we find the optimal line. This is only to show what is happening under the hood of linear regression; all these steps are already implemented by professionals (in sklearn), and thanks to them we do not have to worry about that.
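The steps above can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation, not the sklearn one; the function name, learning rate, and the rescaled toy data are my own choices (raw horsepower/price values would need a much smaller learning rate to converge):

```python
import numpy as np

def fit_line_gd(x, y, lr=0.1, iters=5000):
    """Fit y ≈ b0 + b1*x by gradient descent on the RSS."""
    rng = np.random.default_rng(0)
    b0, b1 = rng.normal(size=2)            # step 1: random coefficients -> a random line
    for _ in range(iters):
        errors = (b0 + b1 * x) - y         # step 2: signed errors of the current line
        # step 3: RSS = sum(errors**2); its partial derivatives w.r.t. b0 and b1
        # tell us in which direction to move each coefficient
        b0 -= lr * 2 * errors.mean()
        b1 -= lr * 2 * (errors * x).mean()
    return b0, b1

# Toy data generated from a known line, y = 2 + 3x
x = np.linspace(0, 1, 50)
y = 2 + 3 * x
b0, b1 = fit_line_gd(x, y)
```

After enough iterations, b0 and b1 land very close to the true values 2 and 3.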

After we have the coefficients β0 and β1, we need to remember that they are only estimates. Now you must be wondering why this guy wrote that in bold. Well, it is very important: we are estimating the real model because we do not have data from all cars (in statistics, this is called the population). If we had data from every car in the world, then these would not be estimates; they would be the real thing.
Therefore, we need a way to see whether those coefficients are 'good' enough.

Assessing the model

In order to make a comparison we will have to somehow represent the real model, so we can write something like this:

Y = f(X) + e

e is the mean-zero random error and if the function f is linear then

Y = β0 + β1*X + e

  • β0 is the expected value of Y when X = 0, or in our case the price of a car with 0 horsepower (in some cases β0 makes sense, but in others, such as ours, it doesn't)

  • β1 is the expected change in the price for a one-unit increase in horsepower, on average.

The reason we do not know the real coefficients is that we do not have the population data.
Let's look at an example:
Suppose you want to know the mean height of all people on Earth, but this is impossible (or way too expensive), so we only have a sample of 100 people. If we measure the heights of those 100 people and take the average, we can assume the sample mean should be close to the population mean (unless you measure the heights of 100 basketball players :), in which case the result would be skewed to the right).
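A quick simulation makes this concrete. The population below is synthetic (heights drawn from a normal distribution with an assumed mean of 170 cm), but it shows how close a sample of 100 typically lands:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic "population": one million heights in cm (assumed distribution)
population = rng.normal(170, 10, size=1_000_000)

# We can only afford to measure 100 people
sample = rng.choice(population, size=100)

# The sample mean should be close to the population mean
gap = abs(sample.mean() - population.mean())
```

With a population standard deviation of 10 cm and n = 100, the sample mean is typically within a centimeter or two of the true mean.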

The same thing happens with the LR coefficients: we do not know them, but if the sampling is good, we can assume our estimates are close to the real coefficients.

OK, we established that we cannot know the real values, but how close are we to the real thing?

In general we can answer this question by computing the standard error of the estimate

Var(μ̂) = SE(μ̂)² = σ² / n

where σ is the standard deviation of each of the observations of Y. Roughly speaking, the SE tells us the average amount by which this estimate differs from the actual value.

Now if you look closely, the SE is getting smaller if the number of observations is getting larger.
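This shrinking is easy to check numerically. A tiny sketch, assuming an observation standard deviation of 10 (a made-up value for illustration):

```python
import math

sigma = 10.0  # standard deviation of the individual observations (assumed)

def se_of_mean(n):
    """Standard error of the sample mean: SE = sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# More observations -> smaller standard error of the estimate
se_small = se_of_mean(25)      # 10 / 5  = 2.0
se_large = se_of_mean(2500)    # 10 / 50 = 0.2
```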

SE(β̂0)² = σ² [ 1/n + x̄² / Σᵢ(xᵢ − x̄)² ],   SE(β̂1)² = σ² / Σᵢ(xᵢ − x̄)²

These are the formulas to calculate the SEs of the coefficients, but for them to be strictly valid we have to assume that the errors are uncorrelated and have constant variance.

SE(β̂1) is smaller when the xᵢ are more spread out; intuitively, more spread gives us more leverage to estimate the slope.
In general σ² is not known, but it can be estimated from the data.

The main reason to have the SE is to compute confidence intervals. The most common confidence level is 95%. Take the estimate for the coefficient and add and subtract 2·SE:
β̂1 ± 2 · SE(β̂1)

All these coefficients and standard errors are computed by Python packages, so we don't have to compute them manually.
The CI tells us where the real coefficient lies, based on our estimate, with 95% confidence.
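The ±2·SE rule is just one line of arithmetic. A quick sketch with hypothetical numbers (an estimate of 401 with an SE of 5.12, similar to the model fitted later in this post):

```python
coef, se = 401.0, 5.12            # hypothetical coefficient estimate and its SE

# Approximate 95% confidence interval: estimate ± 2 * SE
lower = coef - 2 * se             # 390.76
upper = coef + 2 * se             # 411.24
```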

SE can also be used to perform hypothesis tests on the coefficients.

The most common and useful hypothesis test is whether there is a relationship between X and Y.
H0: There is no relationship between X and Y
Ha: There is some relationship between X and Y

Or in more mathematical terms, we test whether the coefficient could be 0 or not:
H0: β1 = 0   versus   Ha: β1 ≠ 0

Why is this important for us? If the coefficient is 0, the formula reduces to Y = β0 + e and the feature is not associated with Y. In other words, X does not contribute to Y, so there is no relationship. To test this hypothesis, we need to see whether β̂1 is far enough away from zero. But how far is far enough? (Because β̂1 is only an estimate, it could be 0.6 while the true value lies anywhere from −0.2 to 1.2; so what is the reality, does X decrease Y or increase it?)
This depends on the accuracy of β̂1, that is, on SE(β̂1). In practice we compute a t-statistic given by
t = (β̂1 − 0) / SE(β̂1)

which measures the number of standard errors that β̂1 is away from zero. Roughly speaking, if |t| is larger than 2, the estimate is more than two standard errors from zero.
But working with probabilities is more convenient, so we compute the probability of observing any value equal to |t| or larger, assuming β1 = 0; this is what is called the p-value.
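As a sketch, the t-statistic and an approximate two-sided p-value can be computed by hand. The numbers are hypothetical (the same 401 and 5.12 used above), and I use the standard-normal approximation via `math.erfc` rather than the exact t-distribution:

```python
import math

coef, se = 401.0, 5.12                 # hypothetical estimate and standard error

t = coef / se                          # ≈ 78 standard errors away from zero
# Normal approximation to the two-sided p-value:
# P(|Z| >= |t|) = erfc(|t| / sqrt(2))
p = math.erfc(abs(t) / math.sqrt(2))
```

With |t| this large, the p-value underflows to essentially zero, i.e. it is extremely unlikely that the true coefficient is 0.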

Enough theory; let's see the model where the feature is engine horsepower and the target is the price. This model was built with the Python package statsmodels.

[statsmodels OLS regression summary for price ~ engine_hp]

const is β0, and in our case it is what the price of a car with 0 horsepower would be. Here it is about negative 60,000 USD, so if you wished to buy a 0-horsepower car, they would give you the car plus an additional 60,000 USD. This does not make sense, so we will discard this coefficient.

engine_hp is β1, and its coefficient is 401. So we can say that with a one-unit increase in horsepower, the price of the car increases by about 401 USD on average. This means the faster the car, the more expensive it is.

The SE for engine_hp is relatively small, only 5.12, which tells us that 401 USD should be close to the real coefficient.

The t-statistic is much bigger than 2, which tells us that zero is far away from the real coefficient, and the p-value tells us the probability that the real value could be 0; in this case it is 0.000 (not actually 0, but 0 to the displayed decimal places).

The last two columns give the bounds of the 95% confidence interval (the 2.5th and 97.5th percentiles).

It can be interpreted as: we can be 95% confident that the real value of the coefficient is between 391 and 411.

Assessing the Accuracy of the model

The quality of the linear regression fit is assessed with the RSE (residual standard error) and the R² statistic.

Residual standard error

The residual standard error is an estimate of the standard deviation of the error terms.
RSE = √( RSS / (n − 2) )
But the RSE is in the same units as the target, so on its own we may not know whether a given value is good or not.
In the model above, the RSE is 53,017 USD.
Therefore we will use another statistic to assess the model, called R².
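The RSE formula is easy to compute by hand. A minimal sketch with made-up prices and predictions (four points only, for illustration):

```python
import numpy as np

# Hypothetical real and predicted prices (USD) from a fitted line
y = np.array([10_000.0, 15_000.0, 22_000.0, 30_000.0])
pred = np.array([11_000.0, 14_000.0, 23_000.0, 29_000.0])

rss = ((y - pred) ** 2).sum()      # residual sum of squares
n = len(y)
rse = np.sqrt(rss / (n - 2))       # residual standard error, in USD
```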

R2 statistic

The R² statistic tells us the proportion of variance explained by the model, and it is always between 0 and 1.
R² = 1 − RSS / TSS, where TSS = Σ(yᵢ − ȳ)² is the total sum of squares

In our case the R² is 0.43, meaning 43% of the variance is explained by this model.

Moreover, the R² statistic is a measure of the linear relationship between X and Y: the squared correlation coefficient and R² are the same. This holds only for simple linear regression; with more variables in the model, this equivalence no longer applies.
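We can verify the R² = corr² identity on synthetic data. The data-generating line and noise level here are arbitrary choices for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 2 + 3 * x + rng.normal(size=500)   # synthetic linear data with noise

# Fit the simple regression (np.polyfit returns [slope, intercept])
b1, b0 = np.polyfit(x, y, 1)

# R^2 = 1 - RSS / TSS
resid = y - (b0 + b1 * x)
r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

# For simple linear regression this equals the squared correlation
corr2 = np.corrcoef(x, y)[0, 1] ** 2
```

The two quantities agree to numerical precision; with more than one predictor, this equivalence no longer holds.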
