Jordan Gamba

Multicollinearity and how to solve it

**Multicollinearity**
It refers to a situation in multiple regression analysis where two or more independent variables are highly correlated, making it difficult to isolate the individual effect of each variable on the dependent variable. This correlation among the independent variables can distort the estimated regression coefficients and undermine the interpretability and reliability of the model.

In linear regression, the objective is to minimize the residual (error).

Multicollinearity occurs when two or more independent variables begin to explain each other. Due to multicollinearity, the coefficient estimates vary more, and as a result the model becomes unstable.
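This instability is easy to see in a quick simulation. The sketch below (plain NumPy; the data and variable names are made up for the example) builds a predictor `x2` that nearly duplicates `x1`, then fits the same model on two bootstrap resamples. The individual coefficients swing between fits, while their sum stays near the true effect, because only the combined direction is well determined:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # x2 nearly duplicates x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)

# Fit y ~ x1 + x2 on two bootstrap resamples and compare coefficients
coefs = []
for _ in range(2):
    idx = rng.integers(0, n, size=n)
    X = np.column_stack([np.ones(n), x1[idx], x2[idx]])
    beta, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
    coefs.append(beta[1:])  # keep only the two slope estimates

print("bootstrap coefficients:", coefs)
```

Running this, each slope individually is unreliable, but `coefs[i][0] + coefs[i][1]` stays close to the true coefficient of 3 in both resamples.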

Dealing with Multicollinearity:

  1. Remove one or more of the highly correlated variables from the model, using the Variance Inflation Factor (VIF) to identify them.
  2. Combine correlated variables or use composite variables.
  3. Use regularization techniques (e.g., Ridge regression or Lasso regression) that can handle multicollinearity to some extent.
  4. Principal Component Analysis (PCA) can be applied to transform variables and create uncorrelated components.
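As an illustration of option 3, here is a minimal ridge regression sketch (closed-form solution in plain NumPy; the simulated data and the `alpha` value are assumptions for the example). Even with two almost perfectly collinear predictors, the penalized solve stays numerically well behaved and splits the shared effect between them:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # almost a perfect copy of x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)

# Centre the data so the intercept can be dropped
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)
yc = y - y.mean()

# Ridge closed form: beta = (X'X + alpha * I)^-1 X'y
alpha = 1.0
beta = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ yc)
print("ridge coefficients:", beta, "sum:", beta.sum())
```

Ordinary least squares on these two columns would give wildly unstable coefficients; the ridge penalty keeps both finite, with their sum close to the true effect of 3.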

We will use the Variance Inflation Factor.

Variance Inflation Factor (VIF) measures how much the variance of an estimated regression coefficient is inflated when the predictors are correlated.
We use R-squared (the coefficient of determination) to evaluate the performance of a regression model.

Let's say we have a Linear Regression equation of:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

We can therefore place one of the independent variables as the target variable and use the other independent variables to predict it, as shown in the example below:

x₁ = α₀ + α₂x₂ + α₃x₃ + ... + αₙxₙ

We use the following formula to calculate the VIF:

VIF = 1 / (1 − R²)

where R² comes from regressing that predictor on the other predictors.

The variable with the highest VIF above our threshold is eliminated first.
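To make the formula concrete, this sketch (plain NumPy; the simulated data is an assumption for the example) runs the auxiliary regression of one predictor on the others to get R², then applies VIF = 1 / (1 − R²):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (X must include an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    tss = ((y - y.mean()) ** 2).sum()
    return 1 - (resid @ resid) / tss

rng = np.random.default_rng(42)
n = 500
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 2 * x2 + 0.1 * rng.normal(size=n)  # x1 is almost a linear function of x2

# Auxiliary regression: predict x1 from the other independent variables
X_others = np.column_stack([np.ones(n), x2, x3])
r2 = r_squared(X_others, x1)
vif_x1 = 1 / (1 - r2)
print(f"R^2 = {r2:.3f}, VIF = {vif_x1:.1f}")
```

Because `x1` is almost fully explained by `x2`, the auxiliary R² is close to 1 and the VIF is far above any common threshold, flagging `x1` (or `x2`) for removal.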

NB: VIF-based feature elimination is done recursively, i.e. one variable at a time, until no remaining variable has a VIF above the threshold (commonly 5 or 10).
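The recursive procedure can be sketched as follows (a plain NumPy illustration, not a library API; `vif`, `drop_high_vif`, and the demo data are assumptions for the example):

```python
import numpy as np

def vif(X):
    """VIF of each column of X (no intercept column), via VIF = 1 / (1 - R^2)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        # Auxiliary regression of column j on all the other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        rss = ((X[:, j] - others @ beta) ** 2).sum()
        tss = ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out[j] = tss / rss  # equals 1 / (1 - R^2); blows up under perfect collinearity
    return out

def drop_high_vif(X, names, threshold=5.0):
    """Recursively drop the worst column until every VIF is below the threshold."""
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names

rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(size=n)
x2 = 2 * x1 + 0.1 * rng.normal(size=n)  # x2 is highly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

X_kept, kept = drop_high_vif(X, ["x1", "x2", "x3"], threshold=5.0)
print("kept:", kept)
```

Only one of the collinear pair `x1`/`x2` is dropped; after that removal every remaining VIF falls below the threshold and the loop stops, which is exactly why elimination must be done one variable at a time.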

Conclusion
It's important to detect and address multicollinearity to ensure the reliability of regression results and the meaningful interpretation of coefficients. Ignoring multicollinearity can lead to misleading conclusions and hinder the usefulness of the regression model.

I would highly recommend visiting the link below 👇 to learn more about multicollinearity and VIF:
https://www.youtube.com/watch?v=8YC73paDntY
