Validating Linear Regression Assumptions: A Comprehensive Approach to Multivariate Normality

#python #ai #machinelearning #datascience

Multiple linear regression analysis is predicated on several fundamental assumptions that ensure the validity and reliability of its results. Understanding and verifying these assumptions is crucial for accurate model interpretation and prediction.

Multivariate Normality: The analysis assumes that the residuals (the differences between observed and predicted values) are normally distributed. This assumption can be assessed by examining histograms or Q-Q plots of the residuals, or through statistical tests such as the Kolmogorov-Smirnov test.

I'll walk you through how I built a linear model on my selected dataset and then validated the assumptions of linear regression. There are five assumptions, but here I focus on one: the normal distribution of errors. This assumption states that the errors (or residuals) produced by the model should follow a normal (Gaussian) distribution.

Let's build the model!
First, I import all the libraries I need and read in my cleaned dataset.

A quick description of the libraries I imported:
pandas as pd for data manipulation, statsmodels.api as sm for statistical modeling, matplotlib.pyplot as plt for data visualization, and scipy.stats as stats for statistical functions and tests.

Fitting The Model - An OLS Model

After defining the dependent and independent variables and adding an intercept term, I instantiated and fitted the model. I then split the data for evaluation, made predictions, and calculated the residuals, assigning them to the variable "residuals". Residuals represent the differences between the observed values and the values predicted by the model.

With the residuals in hand, which are what I want to work with, I started with a visual inspection.

I used matplotlib.pyplot (as plt) to generate a histogram of the residuals. By plotting their distribution, we can visually assess whether they follow a normal (Gaussian) distribution, which is the normality assumption being checked here. The histogram is divided into bins to display the frequency of different ranges of residuals. The x-axis represents the residuals, the y-axis represents the frequency of occurrence, and the title provides an overview of the plot. The grid lines assist in visual interpretation.
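A sketch of such a histogram, assuming placeholder residuals drawn from a normal distribution. The bin count, figure size, and labels are my own choices, not necessarily those used for the original plot:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Placeholder residuals (assumption); use the residuals from your fitted model.
rng = np.random.default_rng(0)
residuals = rng.normal(size=200)

fig, ax = plt.subplots(figsize=(8, 5))

# Each bin counts how many residuals fall into that range.
counts, bin_edges, _ = ax.hist(residuals, bins=30, edgecolor="black")

ax.set_xlabel("Residuals")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of Residuals")
ax.grid(True)
```

In an interactive session, call `plt.show()` afterwards, or use `fig.savefig(...)` to write the plot to a file.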

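After that, I created a Quantile-Quantile (Q-Q) plot. A Q-Q plot compares the quantiles of the residuals with those of a theoretical normal distribution; points that lie close to a straight line indicate approximate normality.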
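One way to draw a Q-Q plot is `scipy.stats.probplot`. This is a sketch with placeholder residuals, and the choice of `probplot` is an assumption on my part (`statsmodels` offers `sm.qqplot` as an alternative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Placeholder residuals (assumption); use the residuals from your fitted model.
rng = np.random.default_rng(0)
residuals = rng.normal(size=200)

fig, ax = plt.subplots(figsize=(6, 6))

# probplot returns the theoretical quantiles, the sorted sample values,
# and a least-squares line fitted through the points.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm", plot=ax
)

ax.set_title("Q-Q Plot of Residuals")
ax.grid(True)

print(f"Correlation of the Q-Q points with the fitted line: {r:.4f}")
```

The closer the points follow the fitted line (and the closer `r` is to 1), the more consistent the residuals are with a normal distribution.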

Finally, I complemented the plots with the statistical tests mentioned earlier. Alongside the Kolmogorov-Smirnov test, I ran the Shapiro-Wilk and Anderson-Darling tests.
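A sketch of the two p-value-based tests with `scipy.stats`, using placeholder residuals. Standardizing the residuals before the Kolmogorov-Smirnov test is my assumption: `kstest(..., "norm")` compares against a standard normal (mean 0, standard deviation 1), so residuals on a different scale would be rejected for their scale alone. The Anderson-Darling test is left out here because, by default, SciPy reports it through critical values rather than a p-value.

```python
import numpy as np
from scipy import stats

# Placeholder residuals (assumption) with a standard deviation far from 1.
rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=3.0, size=200)

# Shapiro-Wilk: the null hypothesis is that the data are normally distributed.
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Kolmogorov-Smirnov against the standard normal, after standardizing.
standardized = (residuals - residuals.mean()) / residuals.std(ddof=1)
ks_stat, ks_p = stats.kstest(standardized, "norm")

print(f"Shapiro-Wilk:       statistic={shapiro_stat:.4f}, p-value={shapiro_p:.4f}")
print(f"Kolmogorov-Smirnov: statistic={ks_stat:.4f}, p-value={ks_p:.4f}")
```

Because the mean and standard deviation are estimated from the same data, the Kolmogorov-Smirnov p-value is only approximate in this setup, so it is best read alongside the other checks.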

Interpreting Results

Histogram and Q-Q Plot Observations:
The histogram displays a slight left skew, indicating a deviation from a perfectly symmetrical distribution. Similarly, the points in the Q-Q plot deviate from a perfect straight line, suggesting departures from normality in the distribution of residuals.

Statistical Test Results:
Shapiro-Wilk Test:
The p-value (0.0677) from the Shapiro-Wilk test is marginally above the conventional significance level of 0.05, implying that we fail to reject the null hypothesis of normality. However, this result should be interpreted cautiously, considering its proximity to the significance threshold.
Kolmogorov-Smirnov Test:
The very low p-value (9.11e-07) from the Kolmogorov-Smirnov test indicates a significant departure from normality.
Anderson-Darling Test:
The test statistic (0.6919) falls below the critical value at the 5% significance level, suggesting no significant departure from normality according to the Anderson-Darling test.
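The Anderson-Darling comparison can be sketched as follows, assuming placeholder residuals and the default behaviour of `scipy.stats.anderson`, which returns a statistic together with critical values at fixed significance levels:

```python
import numpy as np
from scipy import stats

# Placeholder residuals (assumption); use the residuals from your fitted model.
rng = np.random.default_rng(2)
residuals = rng.normal(size=200)

result = stats.anderson(residuals, dist="norm")

# Pick the critical value listed for the 5% significance level.
levels = np.asarray(result.significance_level)
idx = int(np.argmin(np.abs(levels - 5.0)))
critical_5 = float(result.critical_values[idx])

# Normality is rejected at 5% only if the statistic exceeds the critical value.
reject_at_5 = bool(result.statistic > critical_5)

print(f"A-D statistic: {result.statistic:.4f}")
print(f"Critical value at 5%: {critical_5:.4f}")
print(f"Reject normality at 5%: {reject_at_5}")
```

A statistic below the critical value, as reported above, means the test finds no significant departure from normality at that level.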

Understanding the Outcome

The combination of visual inspection and statistical tests gives a somewhat mixed picture. The Shapiro-Wilk test fails to reject normality, though only marginally, and the Anderson-Darling test agrees, showing no significant departure from normality at the 5% significance level. The Kolmogorov-Smirnov test, by contrast, strongly rejects normality. Given this disagreement, together with the slight skew observed in the histogram and the deviations in the Q-Q plot, caution is warranted in interpreting the results.

Report Findings

Findings:
The analysis suggests that the assumption of normality in the residuals may not hold perfectly. This could imply that the regression model might not fully capture the underlying data distribution.
Limitations:
It's important to acknowledge several limitations in this analysis. Firstly, the interpretation of normality tests can be influenced by sample size, and the dataset under consideration may have unique characteristics not fully captured by standard statistical tests. Additionally, while visual inspection is informative, it is subjective and may vary depending on individual interpretation. Lastly, the conducted tests depend on the chosen significance level and assume that the observations are independent.

Thank you for reviewing this validation process. If you have any questions or would like to discuss further, please feel free to do so in the comments.
