Covariance, Correlation, and Collinearity

By Abzal Seitkaziyev (xsabzal)

Variance and Covariance

When we measure the spread of the distribution of some random variable X, we calculate variance and standard deviation, as:

[Image: formulas for the variance and standard deviation of X]
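In standard textbook notation, and assuming the population form with \mu_X = E[X] as the mean of X (the sample version averages over n - 1 observations instead of taking an expectation), the two quantities can be written as:

```latex
\mathrm{Var}(X) = \sigma_X^2 = \mathbb{E}\left[(X - \mu_X)^2\right],
\qquad
\sigma_X = \sqrt{\mathrm{Var}(X)}
```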

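The joint variability of two variables X and Y is called covariance. To find the covariance of X and Y, we use the same approach as above, replacing the squared deviation of X with the product of the deviations of X and Y: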

[Image: formula for the covariance of X and Y]
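Under the same assumptions (population form, with \mu_X and \mu_Y denoting the means of X and Y), the covariance is usually written as follows. Setting Y = X recovers the variance of X:

```latex
\mathrm{Cov}(X, Y) = \mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right],
\qquad
\mathrm{Cov}(X, X) = \mathrm{Var}(X)
```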

The covariance of X and Y can be negative, positive, or zero. Because covariance is not normalized, its magnitude depends on the units of X and Y, so it only indicates the direction of the linear trend between the two variables, not its strength.
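To make the sign and the unit dependence concrete, here is a minimal sketch. The helper `covariance` and the toy arrays are hypothetical and chosen only for illustration; the helper uses the sample form (dividing by n - 1), which matches the default of `np.cov`.

```python
import numpy as np

def covariance(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1))

# Hypothetical toy data: y_up rises with x, y_down falls with x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0
y_down = -2.0 * x + 1.0

cov_up = covariance(x, y_up)      # positive: the variables move together
cov_down = covariance(x, y_down)  # negative: they move in opposite directions

# Rescaling x (for example metres to centimetres) rescales the covariance too,
# so its magnitude alone does not measure the strength of the trend.
cov_up_rescaled = covariance(100.0 * x, y_up)

print(cov_up, cov_down, cov_up_rescaled)
```

The relationship between `x` and `y_up` is equally tight in both cases, yet the covariance grows by a factor of 100 once `x` is rescaled.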

Correlation and Collinearity

To measure the strength of the trend, we need to normalize the covariance. Dividing the covariance by the product of the standard deviations of X and Y gives the correlation coefficient (Pearson's correlation coefficient), defined below:

[Image: formula for Pearson's correlation coefficient]
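In the usual notation, assuming \sigma_X and \sigma_Y are the standard deviations of X and Y and both are non-zero, Pearson's coefficient reads:

```latex
\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}
```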

Thus, correlation coefficient values are between -1 and +1.
To classify the strength of the correlation, the following ranges are commonly used:

[Image: table of correlation-strength ranges]

The sign indicates the direction of the trend: positive when the two variables increase together, negative when one decreases as the other increases.
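As a rough illustration of the normalization, the sketch below computes Pearson's coefficient directly from the covariance and the standard deviations. The helper `pearson_r` and the toy data are hypothetical; in practice `np.corrcoef` or `DataFrame.corr` would normally be used instead.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: sample covariance divided by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
    # ddof=1 so the standard deviations use the same n - 1 divisor as the covariance
    return float(cov_xy / (x.std(ddof=1) * y.std(ddof=1)))

# Hypothetical toy data with a noisy upward trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.5, 1.9, 3.2, 3.8, 5.1, 5.7])

r = pearson_r(x, y)
r_rescaled = pearson_r(100.0 * x, y)  # changing units leaves r unchanged
r_flipped = pearson_r(x, -y)          # reversing the trend flips the sign

print(r, r_rescaled, r_flipped)
```

Unlike the covariance, the coefficient stays the same when a variable is rescaled, which is what makes its magnitude comparable across feature pairs.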

When two features are strongly linearly correlated with each other, they are called collinear. When a feature is strongly correlated with several other features, or with a linear combination of them, this is multicollinearity. Depending on the goal of the analysis, one can consider dropping strongly correlated features. To work with collinear features, we can also use variance inflation factors (VIF) to detect them and Principal Component Analysis (PCA) to transform them into uncorrelated components.
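As a minimal sketch of the VIF idea, assuming the usual definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features: the function name `variance_inflation_factors` and the synthetic features `a`, `b`, `c` are hypothetical and not taken from the dataset used later.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # add an intercept column so the regression is not forced through the origin
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Hypothetical synthetic features: b is almost a copy of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)

vif = variance_inflation_factors(np.column_stack([a, b, c]))
print(vif)
```

The two near-duplicate features get large VIF values while the independent one stays close to 1. A common rule of thumb treats values above roughly 5 to 10 as a sign of multicollinearity. statsmodels also ships a `variance_inflation_factor` function in `statsmodels.stats.outliers_influence` for the same purpose.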

Application

Here I will use a London bike sharing dataset to play with covariance and correlation.

# https://www.kaggle.com/hmavrodiev/london-bike-sharing-dataset?select=london_merged.csv
import pandas as pd

df = pd.read_csv('data/london_merged.csv')
# keep every column from the third onward (drops the first two columns)
data = df.iloc[:, 2:].copy()
data.head()

[Image: output of data.head()]

1) Let's check the covariance of the features:

# covariance matrix of all features
data.cov()

[Image: covariance matrix output]

2) Correlation of the features:

# correlation matrix (Pearson by default)
data.corr()

[Image: correlation matrix output]

# flag feature pairs whose absolute correlation exceeds 0.70
abs(data.corr()) > 0.70

[Image: boolean matrix of strongly correlated feature pairs]

Not surprisingly, we can see that temperatures t1 and t2 are strongly and positively correlated.
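Reading a boolean matrix by eye gets tedious as the number of features grows. One possible way to list the strongly correlated pairs directly is sketched below. The helper `strongly_correlated_pairs` and the stand-in frame `demo` are hypothetical, and the data is synthetic, not the bike sharing dataset.

```python
import numpy as np
import pandas as pd

def strongly_correlated_pairs(frame, threshold=0.70):
    """Return (feature_a, feature_b, r) for each pair with |r| above the threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only: skip self and duplicates
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Hypothetical stand-in data, not the bike sharing dataset:
# "feels_like" tracks "temp" closely, "wind" is unrelated.
rng = np.random.default_rng(42)
temp = rng.normal(15.0, 5.0, size=300)
demo = pd.DataFrame({
    "temp": temp,
    "feels_like": temp - 2.0 + rng.normal(0.0, 1.0, size=300),
    "wind": rng.normal(10.0, 3.0, size=300),
})

pairs = strongly_correlated_pairs(demo)
print(pairs)
```

Called as `strongly_correlated_pairs(data)` on the frame from this post, the same helper would report each pair once instead of twice, since it scans only the upper triangle of the correlation matrix.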

3) Applying PCA, we can plot the explained variance against the number of principal components:

[Image: explained variance vs. number of principal components]
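The numbers behind such a plot can be computed in several ways. The following is a minimal numpy sketch, not necessarily the method used for the figure: it standardizes the features and derives the explained variance ratio from the singular values. The function name `explained_variance_ratio` and the synthetic matrix `X` are hypothetical.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component, largest first."""
    X = np.asarray(X, dtype=float)
    # standardize so that features measured on different scales contribute equally
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # singular values of the standardized data give the component variances
    singular_values = np.linalg.svd(Z, compute_uv=False)
    component_variance = singular_values ** 2 / (len(Z) - 1)
    return component_variance / component_variance.sum()

# Hypothetical synthetic features: the first two are nearly collinear.
rng = np.random.default_rng(1)
a = rng.normal(size=400)
X = np.column_stack([a, a + 0.1 * rng.normal(size=400), rng.normal(size=400)])

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)
print(ratio, cumulative)
```

Because two of the three features carry almost the same information, the first two components account for nearly all of the variance and the last one contributes very little. scikit-learn's `PCA` exposes the same quantity as `explained_variance_ratio_` after fitting.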
