Assumptions of Linear Regression
Because linear regression is a parametric method, it is restrictive: it delivers poor results on data sets that do not fulfil its assumptions.
The assumptions are:
- Linear Relationship
- No correlation of error terms
- Constant Variance of error terms
- No correlation among independent variables
- Normally distributed error terms
Now let's understand what these terms actually mean.
Linear or Additive Model: If you fit a linear model to a non-linear, non-additive data set, the regression algorithm will fail to capture the trend mathematically, resulting in an inefficient model. It will also produce erroneous predictions on unseen data.
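A quick way to see this is to fit a straight line to data that is clearly curved and look at the residuals. The sketch below is only an illustration: the data (y = x squared) and the helper names fit_line and residuals are made up for this example, and it uses plain Python rather than any particular regression library.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed value minus the value predicted by the fitted line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data with a quadratic (non-linear) trend.
xs = list(range(11))          # 0, 1, ..., 10
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
res = residuals(xs, ys, intercept, slope)

# A systematic sign pattern in the residuals (positive at both ends,
# negative in the middle) signals that a straight line misses the trend.
for x, e in zip(xs, res):
    print(x, round(e, 2))
```

If the relationship were really linear, the residuals would scatter around zero with no visible pattern. Here they form a U shape, which is the usual hint to try a transformation or an extra term.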
No correlation of error terms: The presence of correlation in the error terms drastically reduces the model's accuracy. This usually occurs in time series models, where the next instant depends on the previous instant. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard errors, so confidence and prediction intervals come out narrower than they should be.
- This is also known as autocorrelation.
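One common diagnostic for first-order autocorrelation is the Durbin-Watson statistic, which compares successive residuals. The sketch below computes it by hand; the two residual series are invented for illustration, and the function name durbin_watson is our own choice for this example.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals. It ranges from 0 to 4."""
    num = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residuals that drift slowly: neighbours look alike,
# which is what positive autocorrelation looks like.
drifting = [-3, -2, -1, 1, 2, 3]

# Hypothetical residuals that flip sign at every step:
# negative autocorrelation.
alternating = [1, -1, 1, -1, 1, -1]

dw_drifting = durbin_watson(drifting)
dw_alternating = durbin_watson(alternating)

print(round(dw_drifting, 3))     # well below 2
print(round(dw_alternating, 3))  # well above 2
```

As a rule of thumb, values near 2 suggest little first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.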
No correlation among independent variables: This assumption is violated when the independent variables are found to be moderately or highly correlated. In a model with correlated variables, it becomes difficult to find out which variable is actually contributing to the prediction of the response variable.
- It also leads to an increase in the standard errors of the coefficients.
- This is also called multicollinearity.
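Correlation between independent variables is often measured with the variance inflation factor (VIF), defined as 1 / (1 - R^2), where R^2 comes from regressing one independent variable on the others. The sketch below covers only the simplest case of two independent variables, where that R^2 equals the squared Pearson correlation. The data and the names pearson_r and vif_two_predictors are assumptions made for this example.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(a, b):
    """VIF when the model has exactly two independent variables:
    1 / (1 - r^2), with r the correlation between them."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical independent variables.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]   # roughly 2 * x1, so highly correlated
x3 = [1, -1, 0, 0, -1, 1]               # constructed to be uncorrelated with x1

vif_correlated = vif_two_predictors(x1, x2)
vif_uncorrelated = vif_two_predictors(x1, x3)

print(round(vif_correlated, 1))    # very large
print(round(vif_uncorrelated, 1))  # 1.0 means no inflation
```

A VIF of 1 means the variable is uncorrelated with the other one. A frequently quoted rule of thumb treats values above 5 or 10 as a sign of problematic multicollinearity, though the exact cut-off is a judgment call.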
Constant variance of error terms: Non-constant variance in the error terms violates this assumption.
- Generally, non-constant variance arises in the presence of outliers or extreme leverage values. These values get too much weight and thereby disproportionately influence the model's performance. When this phenomenon occurs, the confidence interval for out-of-sample prediction tends to be unrealistically wide or narrow.
- This is also known as heteroskedasticity.
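A crude way to check for heteroskedasticity is to sort the residuals by a predictor, split them into a lower and an upper half, and compare the spread in each half. The sketch below does this in the spirit of the Goldfeld-Quandt test, but it is not that test itself (the real test fits a separate regression on each subsample and uses an F distribution). The data and the name spread_ratio are invented for illustration.

```python
def spread_ratio(xs, residuals):
    """Mean squared residual in the upper half of x divided by the
    mean squared residual in the lower half of x."""
    pairs = sorted(zip(xs, residuals))       # order residuals by predictor value
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]
    high = [e for _, e in pairs[-half:]]

    def mean_sq(values):
        return sum(e * e for e in values) / len(values)

    return mean_sq(high) / mean_sq(low)


xs = [1, 2, 3, 4, 5, 6, 7, 8]

# Hypothetical residuals that fan out as x grows (non-constant variance).
fanning = [0.1, -0.1, 0.2, -0.2, 1.0, -1.0, 2.0, -2.0]

# Hypothetical residuals with the same spread everywhere (constant variance).
steady = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]

ratio_fanning = spread_ratio(xs, fanning)
ratio_steady = spread_ratio(xs, steady)

print(round(ratio_fanning, 1))  # far from 1
print(round(ratio_steady, 1))   # close to 1
```

A ratio far from 1 suggests the error variance changes with the predictor. In practice a residuals-versus-fitted plot, which would show a funnel shape here, is the usual first check.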
Normal Distribution: If the error terms are not normally distributed, confidence intervals may become too wide or too narrow.
- Once the confidence intervals become unstable, it is difficult to judge how reliable the coefficients estimated by least squares are.
- A non-normal distribution suggests that there are a few unusual data points which must be studied closely to build a better model.
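One way to quantify departure from normality is the Jarque-Bera statistic, which combines the skewness and kurtosis of the residuals. The sketch below computes it by hand; the residual values and function names are assumptions made for this example, and with so few points the result is only indicative.

```python
def central_moment(values, k):
    """k-th central moment (divides by n, not n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def jarque_bera(values):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4),
    where S is the skewness and K is the kurtosis."""
    n = len(values)
    m2 = central_moment(values, 2)
    skewness = central_moment(values, 3) / m2 ** 1.5
    kurtosis = central_moment(values, 4) / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


# Hypothetical residuals that are symmetric around zero.
symmetric = [-2, -1, -1, 0, 0, 0, 1, 1, 2]

# Hypothetical residuals with one unusual data point.
with_outlier = [-1, -1, -1, 0, 0, 0, 1, 1, 10]

jb_symmetric = jarque_bera(symmetric)
jb_outlier = jarque_bera(with_outlier)

print(round(jb_symmetric, 3))
print(round(jb_outlier, 3))
```

For large samples the statistic is compared against a chi-square distribution with 2 degrees of freedom, whose 5% critical value is about 5.99. The single outlier pushes the statistic well past that value, which matches the point above: non-normal errors often trace back to a few unusual data points.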
These are the assumptions your data should satisfy to get a better linear model.