
Kaira Kelvin.


Correlation Matrix tutorial (Using Pandas)

Correlation measures the extent to which different variables are interdependent, i.e. the statistical relationship between two variables.

In data science and machine learning, a correlation matrix helps in understanding the relationships between variables: it shows how the variables in a dataset vary with one another.
Each data point in the dataset is an observation, and the features are the properties or attributes of those observations.
Correlation is a statistical indicator that quantifies the degree to which two variables change in relation to each other.

It indicates the strength and direction of the linear relationship between two variables.
The correlation coefficient is denoted by "r" and it ranges from -1 to 1.

Types of correlations

Negative Correlation.

  • If r = -1, there is a perfect negative correlation (as one variable increases, the other decreases).


Zero Correlation.

  • If r = 0, there is no linear correlation between the two variables, and values close to 0 indicate a weak or negligible one. The two variables do not appear to be linearly related in any way.


Positive Correlation.

  • If r = 1, there is a perfect positive correlation (as one variable increases, the other increases as well).
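The three cases above can be reproduced with a few hand-made values. This is a minimal sketch: the numbers are invented for illustration, and `Series.corr` (Pearson by default) is used to compute r.

```python
import pandas as pd

# Illustrative values only, chosen so the three cases come out cleanly.
x = pd.Series([1, 2, 3, 4, 5], dtype=float)

r_pos = x.corr(2 * x + 1)     # y rises with x along a straight line
r_neg = x.corr(-3 * x + 10)   # y falls as x rises along a straight line
r_zero = x.corr(pd.Series([2, 1, 3, 1, 2], dtype=float))  # no linear trend

print(f"positive: {r_pos:.2f}")
print(f"negative: {r_neg:.2f}")
print(f"zero:     {r_zero:.2f}")
```

Real data rarely lands exactly on -1, 0 or 1; these are the limiting cases.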


Non-linear Correlation (also known as curvilinear correlation)

A non-linear correlation exists when the variables are related, but the relationship does not follow a straight line.
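A small sketch of why this matters: below, y is completely determined by x, yet the Pearson coefficient is approximately zero because the relationship is a symmetric curve rather than a line. The data is synthetic and chosen only to make the point.

```python
import numpy as np

# Synthetic example: y depends entirely on x, but not linearly.
x = np.linspace(-3, 3, 61)   # symmetric around 0
y = x ** 2                   # perfect quadratic relationship

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(f"Pearson r for y = x^2: {r:.4f}")
```

A value of r near 0 therefore rules out a linear relationship, not every kind of relationship.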


Popular methods used to find the correlation coefficients

1.Pearson’s product-moment correlation coefficient (PPMCC)
It measures the linear relationship between two variables that have been measured on interval or ratio scales.
Conceptually, it draws a line of best fit through the data of the two variables, and r indicates how closely the data points lie to that line.
The formula is:
r = [n(∑xy) – (∑x)(∑y)] / √( [n∑x² – (∑x)²] [n∑y² – (∑y)²] )
where:
n is the number of data points
∑xy is the sum of the products of corresponding values of x and y
∑x is the sum of all the values of x
∑y is the sum of all the values of y
∑x² is the sum of the squares of all the values of x
∑y² is the sum of the squares of all the values of y.
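As a sanity check, the formula can be evaluated term by term and compared with NumPy's built-in `np.corrcoef`. The five data points are made up for illustration.

```python
import numpy as np

# Made-up sample data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

# Numerator: n(sum xy) - (sum x)(sum y)
numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)

# Denominator: sqrt([n sum x^2 - (sum x)^2] * [n sum y^2 - (sum y)^2])
denominator = np.sqrt(
    (n * np.sum(x ** 2) - np.sum(x) ** 2)
    * (n * np.sum(y ** 2) - np.sum(y) ** 2)
)

r_formula = numerator / denominator
r_numpy = np.corrcoef(x, y)[0, 1]  # library result for comparison

print(f"formula: {r_formula:.4f}, numpy: {r_numpy:.4f}")
```

For this sample both values come out at about 0.77.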

More on the PPMCC is available at the link below:
https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php

Salient points about PPMCC

Image description

Image description

2.Spearman’s rank correlation coefficient.
Spearman’s rank correlation coefficient, named after Charles Spearman and denoted by the Greek letter ρ (rho), measures the strength of a monotonic relationship between two variables.
A monotonic function is one that either never increases or never decreases as its independent variable changes.
Spearman's coefficient is applied when your data does not meet the requirements for Pearson's coefficient, for example when the data is skewed or the relationship is monotonic but not linear.
The diagram below illustrates this:

Image description
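One way to see the difference in code: Spearman's ρ is equivalent to Pearson's r computed on the ranks of the values. The sketch below uses a synthetic exponential curve, which is monotonic but far from linear.

```python
import numpy as np
import pandas as pd

# Synthetic data: y always increases with x, but not along a line.
x = pd.Series(np.arange(1, 11), dtype=float)
y = np.exp(x)

pearson_r = x.corr(y)                    # sensitive to the curvature
spearman_rho = x.rank().corr(y.rank())   # Pearson on ranks = Spearman

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Spearman reports a perfect monotonic relationship, while Pearson is noticeably lower because the points do not lie on a straight line.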

Correlation Matrix.

A correlation matrix is a table of the correlations between pairs of variables in a dataset. Each cell holds the correlation between two specific variables.
Computing and analyzing the correlation matrix is also an important pre-processing step in machine learning pipelines when dimensionality reduction is desired on high-dimensional data.

Interpreting the correlation matrix.

  • Strong correlations, indicated by values close to 1 or -1, suggest a robust connection, while weak correlations, near 0, imply a less pronounced association. Identifying these degrees of correlation helps in understanding the intensity of interactions within the dataset, which supports targeted analysis and decision-making.

  • Positive correlations (values > 0) signify that as one variable increases, the other tends to increase as well. Conversely, negative correlations (values < 0) imply an inverse relationship—when one variable increases, the other tends to decrease. Investigating these directional associations provides insights into how variables influence each other, crucial for formulating informed hypotheses and predictions.

Facilitating Analysis and Decision-Making:
By identifying the degrees of correlation, analysts can gauge the intensity of interactions between variables within the dataset.
Understanding the strength of correlations aids in targeted analysis, allowing analysts to focus on relationships that have a more substantial impact on the dataset.

How to create a correlation matrix in Python?

A correlation matrix can be created using either of the following two libraries:

1.NumPy Library.

2.Pandas Library.

Here we will create a correlation matrix using Pandas.
1.Creating a correlation matrix using the Pandas library.

Pandas has built-in functionality for analyzing and interpreting the relationships between variables.
To create a correlation matrix, call the corr() method on a DataFrame.
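A minimal sketch of that call is shown here. The column names and values are hypothetical and exist only so the example is self-contained.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 70, 74, 80],
    "hours_gaming": [9, 7, 8, 4, 3, 1],
})

# corr() returns a DataFrame with one row and one column per numeric feature.
correlation_mat = df.corr()
print(correlation_mat.round(2))
```

The result is itself a DataFrame, so a single coefficient can be read with `correlation_mat.loc["hours_studied", "exam_score"]`.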

How to visualize a correlation matrix in Python?

Two popular libraries for data visualization are Matplotlib and Seaborn.

Let's visualize the matrix using Seaborn.

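import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to be an already-loaded pandas DataFrame with numeric columns
df_small = df.iloc[:, :6]  # all rows, but only the first 6 columns
correlation_mat = df_small.corr()  # pairwise correlations between the columns
sns.heatmap(correlation_mat, annot=True)  # draw the matrix as a heatmap
plt.show()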

OUTPUT.

Image description
The Pandas DataFrame.corr() method computes the matrix. By default it computes Pearson's correlation coefficient.
The parameter annot=True displays the value of the correlation coefficient in each cell.

Interpreting the correlation matrix

Image description

Ten points to keep in mind when working with correlation matrices.

1. Each cell in the grid represents the value of the correlation coefficient between two variables.

2. The value at position (a, b) represents the correlation coefficient between the features at row a and column b. This is equal to the value at position (b, a).

3. It is a square matrix: each row represents a variable, and the columns represent the same variables as the rows, hence the number of rows = number of columns.

4. It is a symmetric matrix. This makes sense because the correlation between a and b is the same as that between b and a.

5. All diagonal elements are 1. Since diagonal elements represent the correlation of each variable with itself, they are always equal to 1.

6. A large positive value (near 1.0) indicates a strong positive correlation, i.e., if the value of one of the variables increases, the value of the other variable increases as well.

7. A large negative value (near -1.0) indicates a strong negative correlation, i.e., the value of one variable decreases as the other increases, and vice versa.

8. A value near 0 (whether positive or negative) indicates little or no linear correlation between the two variables. This does not by itself prove that the variables are independent, since a non-linear relationship can still exist.

9. Each cell in the above matrix is also represented by a shade of a color. Here darker shades indicate smaller values while brighter shades correspond to larger values (near 1).
The color bar on the right side of the plot shows this scale.

10. The axis ticks denote the feature each row and column represents.
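Several of the structural points above (square, symmetric, unit diagonal, values within [-1, 1]) can be checked programmatically. The sketch below runs on random data with arbitrary column names, so the coefficients themselves are meaningless; only the properties matter.

```python
import numpy as np
import pandas as pd

# Random data with arbitrary column names, used only to test the properties.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
corr = df.corr()
values = corr.to_numpy()

is_square = values.shape[0] == values.shape[1]          # point 3
is_symmetric = np.allclose(values, values.T)            # points 2 and 4
diagonal_is_one = np.allclose(np.diag(values), 1.0)     # point 5
in_range = bool(np.all(np.abs(values) <= 1.0 + 1e-12))  # r lies in [-1, 1]

print(is_square, is_symmetric, diagonal_is_one, in_range)
```

The small tolerance in the range check allows for floating-point rounding.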

Finally, you can export the correlation matrix to an image using the plt.savefig() method. Call it before plt.show(), because the figure may be empty once the plot window has been closed.
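A rough sketch of the export step follows. To keep it free of extra dependencies, the heatmap is drawn with plain Matplotlib (`imshow`) rather than Seaborn, and the same `plt.savefig()` call should work after `sns.heatmap(...)`. The data, the file name and the temporary output directory are assumptions made for this example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 5, 9], "c": [9, 7, 4, 1]})
corr = df.corr()

fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image)  # color scale next to the plot

# Assumed output location: a PNG in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "correlation_matrix.png")
plt.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)
```

The file extension passed to `savefig` determines the output format, so a name ending in `.pdf` or `.svg` would produce a vector file instead.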
Selecting feature pairs whose correlation coefficient falls in a particular range.

  1. Choose pairs with negative correlation from the sorted pairs.
# sorted_pairs is assumed to be a pandas Series of pairwise correlation values, sorted in ascending order
negative_pairs = sorted_pairs[sorted_pairs < 0]
print(negative_pairs)
  2. Select strongly correlated pairs (magnitude greater than 0.5).
strong_pairs = sorted_pairs[abs(sorted_pairs) > 0.5]
print(strong_pairs)
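The snippets above use sorted_pairs without showing how it is built. One plausible construction, offered here as an assumption rather than the author's exact method, is to flatten the correlation matrix with `unstack()` and sort the resulting Series. The data is hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 5, 4, 7, 9],
    "c": [9, 7, 8, 4, 3, 1],
    "d": [3, 1, 4, 1, 5, 2],
})
correlation_mat = df.corr()

# Flatten the matrix into a Series indexed by (column, row) label pairs.
corr_pairs = correlation_mat.unstack()

# Drop self-pairs such as (a, a), whose correlation is always 1.
first = corr_pairs.index.get_level_values(0)
second = corr_pairs.index.get_level_values(1)
corr_pairs = corr_pairs[first != second]

sorted_pairs = corr_pairs.sort_values()  # ascending order

negative_pairs = sorted_pairs[sorted_pairs < 0]
strong_pairs = sorted_pairs[abs(sorted_pairs) > 0.5]
print(negative_pairs)
print(strong_pairs)
```

Because the matrix is symmetric, every pair appears twice, once as (a, b) and once as (b, a).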
