We often hear that "correlation is not causation" or "correlation does not imply causation". But what does that mean? Let's unpack the phrase with some statistics and examples, because the two terms are frequently misunderstood and used interchangeably.
Correlation is a statistical measure of how strongly a pair of variables are linearly related. Measuring this relationship gives a number between -1 and 1, called the correlation coefficient. The sign shows the direction: a positive value means the two variables tend to rise and fall together, and a negative value means one tends to fall as the other rises. The closer the value is to -1 or 1, the stronger the relationship. A value of 0 means there is no linear correlation between the two variables.
Example: the correlation between ice cream sales and sunglasses sold.
Sales of ice cream increase as sales of sunglasses increase.
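To make the coefficient concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The helper name `pearson_r` and the weekly sales figures are assumptions invented purely for illustration; they are not real data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical weekly sales figures, invented for illustration only.
ice_cream_sales = [120, 135, 160, 180, 210, 240, 260]
sunglass_sales = [30, 34, 41, 44, 55, 61, 64]

r = pearson_r(ice_cream_sales, sunglass_sales)
print(f"r = {r:.3f}")
```

With these made-up numbers the coefficient comes out close to +1, because both series rise together. Nothing in the calculation says whether one of them causes the other.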
Causation adds a cause to the relationship. If two variables are causally related, a change in one variable brings about a change in the other. In other words, the two variables are in a cause-and-effect relationship: one event makes the other occur.
Example: when the wind blows faster, a windmill produces more power.
Here, the faster wind is the cause: it makes the windmill rotate faster, and the effect is more power.
Now that we understand what correlation and causation are, let's see why "correlation does not imply causation", using a famous example.
The hypothesis says that as ice cream sales rise and fall, the number of homicides rises and falls with them. But is ice cream consumption really causing these deaths?
The answer is "No". If two things are correlated, that doesn't mean that one causes the other.
When two things appear tied together, they may be bound by causality or merely by correlation, and in many cases the link is just a coincidence. Correlation is what we fall back on when we cannot see under the hood. The less information we have, the more we are forced to rely on correlations. The more information we have, the better we can see the actual causal relationships.
Generally, we collect samples to test a hypothesis. What if our sample is not large enough, or is biased? What if there is a hidden factor (a confounding variable) that we did not record in our dataset? Any of these can distort the correlation we measure and the causal conclusion we draw from it. In this example, weather is the hidden factor.
Let's consider a sunny summer day. It is a lovely day to go outside, and people are enjoying the beach. Because the day is hot, people buy more ice cream. Because more people are out, more of them are also exposed to accidents and violence. The weather is causing a rise in both ice cream sales and homicides.
So, there is no causal relationship between ice cream sales and the homicide rate. Sunny weather is the hidden factor that makes the two move together: ice cream sales and the homicide rate each have a causal relationship with the weather, not with each other.
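The hidden-factor argument can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: every number is simulated, the coefficients linking temperature to ice cream sales and to homicides are invented, and the two outcomes are generated so that they depend on temperature but never on each other. To "account for" the weather it uses a technique the text above does not mention: fit a straight line of each outcome on temperature and correlate what is left over (a partial correlation). The helper names `pearson_r` and `residuals` are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 1000

# Simulated days. All coefficients below are invented for illustration.
temperature = [rng.uniform(0, 35) for _ in range(n)]
# Both outcomes depend on temperature plus independent noise, not on each other.
ice_cream = [50 + 10 * t + rng.gauss(0, 40) for t in temperature]
homicides = [5 + 0.3 * t + rng.gauss(0, 2) for t in temperature]

raw_r = pearson_r(ice_cream, homicides)
partial_r = pearson_r(residuals(ice_cream, temperature),
                      residuals(homicides, temperature))

print(f"raw correlation:                  {raw_r:.2f}")
print(f"after accounting for temperature: {partial_r:.2f}")
```

In this setup the raw correlation is strongly positive even though neither variable influences the other, and it drops to roughly zero once temperature is removed from both. That is the pattern to expect when a hidden factor is doing all the work.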
When we see a correlation between two variables, we should not jump to a conclusion too early. Always take time to explore whether an underlying factor is influencing both. If such a factor is present, check whether the two variables are still correlated once it is accounted for, and only then conclude. You can consider the questions below before concluding.
- Do we have enough samples?
- Did we consider the seasonality?
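The first question matters because small samples produce strong-looking correlations by chance alone. The sketch below illustrates this under simple assumptions: it draws two completely independent random variables many times and counts how often the sample correlation exceeds 0.5 in absolute value. The function name `share_of_strong_correlations`, the 0.5 threshold, and the sample sizes are arbitrary choices made for this example.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def share_of_strong_correlations(sample_size, trials, rng, threshold=0.5):
    """Fraction of trials in which two independent variables show |r| > threshold."""
    strong = 0
    for _ in range(trials):
        # xs and ys are drawn independently, so their true correlation is 0.
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]
        if abs(pearson_r(xs, ys)) > threshold:
            strong += 1
    return strong / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
small = share_of_strong_correlations(sample_size=5, trials=1000, rng=rng)
large = share_of_strong_correlations(sample_size=200, trials=1000, rng=rng)

print(f"share of |r| > 0.5 with   5 samples: {small:.1%}")
print(f"share of |r| > 0.5 with 200 samples: {large:.1%}")
```

With only 5 samples, a sizeable share of the trials shows a correlation that looks strong even though the variables are unrelated. With 200 samples this almost never happens.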
I hope this post clears up your doubts!
Thanks for reading!!