To measure the goodness of fit of the line in a linear regression analysis, we can calculate the coefficient of determination (R squared). R squared measures how much of the variation in the target our model explains: it is the proportion of variance explained by the linear model relative to a baseline model (in this case, simply the mean value of the target).
We can illustrate it with a simple example. Suppose we have a target, e.g. the number of sales of an item over 5 days, and we have fitted a line to it. Now we want to check how good the fit is, i.e. how well we perform compared to a naive prediction: the mean value of the sales.
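Written out as a formula, the definition reads as follows. The symbols are notation assumed here for illustration: $y_i$ are the observed targets, $\hat{y}_i$ the model's predictions, and $\bar{y}$ the mean of the targets.

```latex
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
```

The numerator is the squared error of the fitted line and the denominator is the squared error of the mean baseline, so a model that does no better than the mean scores 0 and a perfect fit scores 1.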
from sklearn import linear_model
from sklearn.metrics import r2_score
import numpy as np
import matplotlib.pyplot as plt

# given target
y_true = [5, 10, 11, 16, 19]

# baseline: predict the mean for every day
y_mean = [np.mean(y_true) for i in range(len(y_true))]

# fit a line
X = [1, 2, 3, 4, 5]
X = np.asarray(X).reshape(-1, 1)
Y = y_true
model = linear_model.LinearRegression()
model.fit(X, Y)
print(model.intercept_)
print(model.coef_)
model_intercept = 2
model_coef = 3.4
# regression line
y_pred = [model_coef * x + model_intercept for x in range(1, len(y_true) + 1)]
# calculate R squared using the formula
# (both quantities are sums of squared errors, i.e. variances up to a constant factor)
var_mean = sum((y_true[i] - y_mean[i]) ** 2 for i in range(len(y_true)))
var_pred = sum((y_true[i] - y_pred[i]) ** 2 for i in range(len(y_true)))
r2 = (var_mean - var_pred) / var_mean
print(r2)
# calculate R squared using scikit-learn
print(r2_score(y_true, y_pred))
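# calculate using the Pearson correlation between targets and predictions
# (index [0, 1] selects the off-diagonal entry of the 2x2 correlation matrix)
correlation = np.corrcoef(y_true, y_pred)[0, 1]
print(correlation ** 2)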
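The squared-correlation shortcut matches R squared here because the predictions come from a least-squares line with an intercept. As a minimal sketch of what happens otherwise, the snippet below shifts the fitted predictions by a constant. The shifted predictions `y_shifted` and the helper names `r_squared` and `pearson_squared` are hypothetical, introduced only for illustration.

```python
# Sketch: squared Pearson correlation vs. R squared for biased predictions.
y_true = [5, 10, 11, 16, 19]
y_fit = [3.4 * x + 2 for x in range(1, 6)]   # least-squares line from the example
y_shifted = [p + 5 for p in y_fit]           # hypothetical: every prediction is 5 too high


def r_squared(actual, predicted):
    """1 - (squared error of predictions) / (squared error of the mean baseline)."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


def pearson_squared(a, b):
    """Squared Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov ** 2 / (var_a * var_b)


print(r_squared(y_true, y_fit), pearson_squared(y_true, y_fit))          # both about 0.973
print(r_squared(y_true, y_shifted), pearson_squared(y_true, y_shifted))  # about -0.079 vs. 0.973
```

Correlation ignores a constant offset, so it stays at about 0.973 for the shifted predictions, while R squared turns negative because those predictions are worse than the mean baseline.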
In this example, the regression line explains about 97% of the variance in the target, i.e. its squared error is about 97% lower than that of simply predicting the mean value.
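The 97% figure can be checked by hand. With the fitted line 3.4x + 2, the predictions are 5.4, 8.8, 12.2, 15.6 and 19.0, and the mean of the target is 12.2. The labels $SS_{\text{res}}$ and $SS_{\text{tot}}$ are introduced here for illustration: they denote the squared error of the line and of the mean baseline, respectively.

```latex
\begin{aligned}
SS_{\text{res}} &= (-0.4)^2 + 1.2^2 + (-1.2)^2 + 0.4^2 + 0^2 = 3.2 \\
SS_{\text{tot}} &= (-7.2)^2 + (-2.2)^2 + (-1.2)^2 + 3.8^2 + 6.8^2 = 118.8 \\
R^2 &= 1 - \frac{3.2}{118.8} \approx 0.973
\end{aligned}
```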