DEV Community

Kiran U Kamath

Posted on • Originally published at blog.learnwithdata.me

Variables and Correlation

What is a variable?

A variable is something that varies, as opposed to a constant.

For example, the current temperature is a value of the variable temperature: it might be 27, 30, 19, or 29 on different days. The freezing point of water, by contrast, is a constant: it's always the same.

Anything that varies about a thing, event or person can be a variable.
Variables are distributed in some way. One of the most common patterns is the normal distribution.

In this kind of distribution, the mean is the most common value, and it sits in the middle. As you get further and further away from the mean, cases become rarer and rarer.

Consider the distribution of cooking ability as an example.
Think of somebody who's not such a good cook, maybe your friend. Then think of somebody who's a great cook, maybe your grandma. And then think of somebody who's at the mean, the average cook. There are lots of average cooks compared to people who cook as well as good old Grandma or as poorly as your friend. Cases become fewer and fewer as you get further from the mean.

An important point about the normal distribution is that it can be described in terms of standard deviations from the mean. The standard deviation is almost like the average deviation from the mean, but not quite.

The average American male is a little less than 5'10" tall. The average deviation is a little less than 3", and so is the standard deviation.
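To see how the two kinds of deviation differ, here is a minimal Python sketch. The heights are made-up numbers chosen only for illustration, not real survey data. The average deviation is the mean distance from the mean, while the standard deviation squares the distances first, averages them, and then takes the square root.

```python
from statistics import mean, pstdev

# Hypothetical heights in inches (made-up numbers, for illustration only).
heights = [66, 68, 69, 70, 70, 71, 72, 74]

m = mean(heights)

# Average (mean absolute) deviation: the mean distance from the mean.
avg_dev = mean(abs(h - m) for h in heights)

# Standard deviation: square root of the mean squared distance from the mean.
std_dev = pstdev(heights)

print(f"mean = {m}")
print(f"average deviation = {avg_dev:.2f}")
print(f"standard deviation = {std_dev:.2f}")
```

Squaring gives extra weight to cases far from the mean, so the standard deviation always comes out at least as large as the average deviation.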

  • Standard deviation fact number one is that about 68% of all cases (male heights, for example) lie between -1 standard deviation and +1 standard deviation. So slightly more than two-thirds of all American males are between 5'7" and 6'1".
  • Standard deviation fact number two is that about 84% of all cases lie between the bottom of the distribution and +1 standard deviation. So about 84% of American males are shorter than 6'1", and about 16% are taller than 6'1".
  • Standard deviation fact number three is that about 95% of all cases lie between -2 standard deviations and +2 standard deviations. So about 95% of all American males are taller than 5'4" and shorter than 6'4".
  • Standard deviation fact number four is that you can convert standard deviations to percentiles. The mean is always at the 50th percentile. +1 standard deviation is always at about the 84th percentile: 84% of cases are below it and 16% of cases are above it.
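These percentages can be checked with Python's standard library. The sketch below assumes a perfectly normal distribution, which real data only approximates, and uses `statistics.NormalDist` (Python 3.8+). The exact shares work out to roughly 68.3%, 84.1%, and 95.4%.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0, sigma=1)

within_1sd = z.cdf(1) - z.cdf(-1)   # share of cases within -1 SD and +1 SD
below_plus_1sd = z.cdf(1)           # share of cases below +1 SD
within_2sd = z.cdf(2) - z.cdf(-2)   # share of cases within -2 SD and +2 SD

print(f"within 1 SD of the mean: {within_1sd:.1%}")
print(f"below +1 SD:             {below_plus_1sd:.1%}")
print(f"within 2 SD of the mean: {within_2sd:.1%}")
```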

For example, imagine you've designed a new way of teaching algebra. Kids taught by the old method average 72 on the exam and kids taught by the new method average 78. Is that a big deal or not?

It depends entirely on the standard deviation. The old-method mean is 72, so if the standard deviation is 6, a score of 78 is exactly one standard deviation above the mean. That's a big gain, because it takes the average kid from the 50th percentile to about the 84th percentile, which is no joke.

On the other hand, assume that the standard deviation is 18. Then the gain is only one-third of a standard deviation, which is the equivalent of going from the 50th percentile to roughly the 63rd percentile. That's not such a big deal, and you might want to consider whether the new method has added costs if that's all the gain you're getting.
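The algebra example can be worked through in a few lines of Python. This is a sketch that assumes exam scores are normally distributed; the function name `percentile_after_gain` is made up for this illustration.

```python
from statistics import NormalDist

def percentile_after_gain(old_mean, new_mean, std_dev):
    """Percentile, within the old-method distribution, of a score equal to new_mean.

    Assumes old-method scores follow a normal distribution.
    """
    old_scores = NormalDist(mu=old_mean, sigma=std_dev)
    return 100 * old_scores.cdf(new_mean)

small_sd = percentile_after_gain(72, 78, 6)    # gain of one full standard deviation
large_sd = percentile_after_gain(72, 78, 18)   # gain of one-third of a standard deviation

print(f"SD = 6:  {small_sd:.0f}th percentile")
print(f"SD = 18: {large_sd:.0f}rd percentile")
```

The same six-point gain lands at about the 84th percentile when the standard deviation is 6, but only in the low 60s when it is 18.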

What is correlation?

Correlation measures the association between variables.

To give an example, let's return to cooking ability, which we saw earlier, and relate it to age. We already know two points here.

Your friend is young and not a very good cook, and Grandma is old and a great cook. If you were to collect additional data, you would find a tendency for people who are not such great cooks to be youngish and for people who are better than average to be relatively older. That tendency is the correlation.

Correlations range between minus one and plus one.

A correlation of -1 is a perfect negative correlation: the higher you go on the x variable, the lower you go on the y variable.

At the other end, a correlation of +1 is a perfect positive correlation: the higher you go on the x variable, the higher you go on the y variable.

A correlation of -1 is just as strong as a correlation of +1; they simply run in opposite directions.
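The two extremes can be reproduced with a small Python sketch. The text above does not name a particular coefficient, so the use of Pearson's correlation here is an assumption, and the data is invented to fall exactly on a straight line.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of paired deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # y goes up as x goes up
falling = [10, 8, 6, 4, 2]   # y goes down as x goes up

print(pearson(x, rising))    # perfect positive correlation
print(pearson(x, falling))   # perfect negative correlation
```

Real data rarely sits on a perfect line, so observed correlations usually fall somewhere between the two extremes.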

There are two basic ways of looking at correlations.

Rank-order correlation

A rank-order correlation is a correlation between two variables whose values are ranks.

When variables are measured at least on ordinal scales, units of observation (e.g., individuals, nations, organizations, values) can be ranked.

A ranking is an ordering of units of observations with respect to an attribute of interest.

For example, nations can be ranked with respect to their quality of life, their freedom, etc. A rank is the position of a unit of observation (e.g., nation) in the ranking. Units of observation with higher ranks show the attribute of interest to a higher degree.

If one is interested in the association between two rankings (e.g., quality of life and freedom of nations), rank-order correlations can be calculated.
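As a rough sketch, a rank-order correlation can be computed by converting each variable to ranks and comparing the ranks pair by pair. The version below uses Spearman's formula for data without ties, which is an assumption, since the text does not name a specific coefficient. The nation scores are invented for illustration.

```python
def ranks(values):
    """Rank values from 1 (lowest) to n (highest). Assumes there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman rank-order correlation for untied data."""
    n = len(xs)
    # Sum of squared differences between each unit's two ranks.
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical scores for five nations (made-up numbers).
quality_of_life = [7.1, 5.4, 8.3, 6.0, 4.2]
freedom = [6.5, 5.0, 9.0, 7.2, 3.1]

print(ranks(quality_of_life))
print(ranks(freedom))
print(spearman(quality_of_life, freedom))
```

Only the ordering matters here: changing a score without changing its rank leaves the result untouched.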

Correlations are the way we assess reliability of measures.

There are two ways to define reliability. One is the degree to which a measure of a particular variable gives the same value across occasions. The other is the degree to which a measure correlates with itself.

For example, you can compute the correlation between measures of height taken on different occasions, and you would expect that correlation to be very high.

Correlations are also the way we assess the validity of measures.
Validity is the degree to which a measure captures what it is supposed to measure.
There are two very important points about the relationship between validity and reliability.

  • The first point is that there can be no validity if there is no reliability.

    Suppose your measure gives you a more or less random score every time, so that you get a high score on one measurement and a low score on another, your friend gets a high score on one and a low score on the other, and there's no relationship at all between occasions. Then that measure can't have any validity.

    There has to be some stability, some degree of getting the same answer twice before you can have any validity for that measure at all.

  • A second point is that reliability implies very little about validity. Now, if reliability is zero, there can't be any validity. But at the other extreme, reliability can be absolutely perfect, but there may be no validity.
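A deliberately silly example makes the second point concrete. Suppose someone tried to measure cooking ability by shoe size. All numbers below are invented, and Pearson's correlation is used as an assumed way of putting numbers on reliability (the measure against itself on a second occasion) and validity (the measure against the thing it is supposed to capture).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data for six people (made-up numbers).
shoe_size_day1 = [7, 8, 9, 10, 11, 12]
shoe_size_day2 = [7, 8, 9, 10, 11, 12]   # same answer on a second occasion
cooking_score = [5, 3, 8, 8, 3, 5]       # what we actually wanted to measure

reliability = pearson(shoe_size_day1, shoe_size_day2)
validity = pearson(shoe_size_day1, cooking_score)

print(f"reliability: {reliability:.2f}")
print(f"validity:    {validity:.2f}")
```

Shoe size gives the same answer every time, so it is perfectly reliable, yet in this made-up data it tells you nothing about who can cook.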

Credit: Thanks to Richard E. Nisbett for the Mindware: Critical Thinking for the Information Age course on Coursera.
