Artik Blue

Posted on Apr 8, 2020

Measures of spread in data

#datascience #beginners

If we have some data and want to get a first look at it, we can start with the properties we saw in the previous post, such as the mean, the median or the mode. But sometimes those are not enough: to see the bigger picture, we may also want to know how the data is distributed.

For example, let us think about a university class. We are the professor and we have a bunch of students. We want to evaluate how the whole class is doing, so we start by extracting a measure of center such as the mean. We discover that the mean of our class is a bit under 6 (we are using a 0-10 scale). But our class may be very large, and the mean alone doesn't tell us whether we have a bunch of students who are not even passing plus some very smart ones who are getting straight 10s, or a whole lot of average students whose marks are close to 6 or 7. Of course, we can use other measures such as the median or the mode to get more information. If you think about it, though, with decimal marks the mode will return something weird, since almost every value appears only once. We could group the marks into categories (zero, fail, barely pass, good...), but that might be considered cheating.

One thing we could do in this case is to start by plotting a histogram of our students' marks (to make it clearer, I adjusted all of the marks to integers instead of decimals):

That histogram can be generated with Python using pyplot in the following way:

import matplotlib.pyplot as plt

x = [1]*1 + [2]*2 + [3]*3 + [4]*4 + [5]*5 + [6]*6 + [7]*5 + [8]*4 + [9]*3 + [10]*2
# take your time and change the bins param to see what it does!
plt.hist(x, bins=len(x))
plt.show()

What we see in the graph is that most of the students get marks close to 6 and few of them perform very well or very badly, even though there is a slight tendency in the class to perform well rather than badly.

In fact, we can contrast that with the measures of central tendency we already know, such as the mean and the median!
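To see what the `bins` parameter controls without opening a plot window, here is a minimal sketch that counts the marks with `np.histogram` and prints them as a text histogram. The bin edges (0.5, 1.5, ..., 10.5) are my own assumption, not taken from the original plot; they are picked so that every integer mark gets its own bin.

```python
import numpy as np

x = [1]*1 + [2]*2 + [3]*3 + [4]*4 + [5]*5 + [6]*6 + [7]*5 + [8]*4 + [9]*3 + [10]*2

# Assumed bin edges 0.5, 1.5, ..., 10.5: each integer mark sits in the middle of its own bin
edges = np.arange(0.5, 11.5, 1)
counts, _ = np.histogram(x, bins=edges)

# One row per mark, one '#' per student
for mark, count in zip(range(1, 11), counts):
    print(f"{mark:2d} | {'#' * int(count)}")
```

The same `edges` array can be passed to `plt.hist(x, bins=edges)` if you prefer the bars to line up with the integer marks.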

import numpy as np

x = [1]*1 + [2]*2 + [3]*3 + [4]*4 + [5]*5 + [6]*6 + [7]*5 + [8]*4 + [9]*3 + [10]*2

print("mean:")
print(np.mean(x))
# >>> 5.857142857142857
print("median:")
print(np.median(x))
# >>> 6.0

If we recall the histogram we just plotted, we now see that the median coincides with its center. It's also easy to see that the point associated with the highest bar is the mode and that, because we have more points close to ten than points close to zero, the mean is close to six rather than, for example, close to four.

Another way to analyze the spread of our data is the boxplot.
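The mode is only read off the plot here, so as a small sketch it can also be computed directly. I am using the standard library `statistics` module for this, which is my own choice and not part of the original snippets.

```python
import statistics

x = [1]*1 + [2]*2 + [3]*3 + [4]*4 + [5]*5 + [6]*6 + [7]*5 + [8]*4 + [9]*3 + [10]*2

mode = statistics.mode(x)      # most frequent mark (the highest bar)
median = statistics.median(x)  # middle mark once the data is sorted
mean = statistics.mean(x)      # arithmetic average

print("mode:", mode)
print("median:", median)
print("mean:", mean)
```

For this data the mode and the median land on the same mark, and the mean sits slightly below them.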

That can be generated with:

import matplotlib.pyplot as plt

x = [1]*1 + [2]*2 + [3]*3 + [4]*4 + [5]*5 + [6]*6 + [7]*5 + [8]*4 + [9]*3 + [10]*2

# the box spans Q1 to Q3 and the line inside it marks the median
plt.boxplot(x)
plt.show()

In this case, the boxplot shows where the middle of the data lies: the box covers the interquartile range, which holds the middle 50% of the data, and the line inside it marks the median.

But wait, wait, interquartile range? What is a quartile?

For a better explanation, let's calculate them with the scipy.stats package!

from scipy.stats.mstats import mquantiles

x = [1]*1 + [2]*2 + [3]*3 + [4]*4 + [5]*5 + [6]*6 + [7]*5 + [8]*4 + [9]*3 + [10]*2

# by default, mquantiles returns the 25%, 50% and 75% quantiles
print(mquantiles(x))
# >>> [4.  6.  7.8]

So 4 is the first quartile (Q1), 6 is Q2 (the median) and 7.8 is Q3.

That means that roughly 25% of the students scored 4 or less, 50% of them scored 6 or less and 75% scored 7.8 or less (of course, 100% of them scored 10 or less).
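Because the marks are integers with many ties, those percentages are only approximate for this sample. As a quick sketch, we can count the actual share of students at or below each quartile value. The three values are hard-coded from the `mquantiles` output above, and the `shares` dictionary is just a helper name I made up for the example.

```python
import numpy as np

x = np.array([1]*1 + [2]*2 + [3]*3 + [4]*4 + [5]*5 + [6]*6 + [7]*5 + [8]*4 + [9]*3 + [10]*2)

# Quartile values copied from the mquantiles output (not recomputed here)
quartiles = {"Q1": 4.0, "Q2": 6.0, "Q3": 7.8}

shares = {}
for name, value in quartiles.items():
    # the mean of a boolean array is the fraction of True entries
    shares[name] = float(np.mean(x <= value))
    print(f"{name}: {shares[name]:.1%} of the students scored {value} or less")
```

With so many repeated marks the shares come out somewhat different from the textbook 25/50/75, which is expected for small samples of discrete data.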

The interquartile range, or IQR, is calculated by subtracting the first quartile from the third. In this case it is 7.8 - 4 = 3.8. It tells us how far apart the first and third quartiles are, so it indicates how spread out the middle 50% of our data is.

Another interesting feature we can see here, although a bit less relevant in this particular case, is the range: the difference between the largest and the smallest value. Here it is 9, as the marks in our data go from 1 to 10.
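As a hedged sketch, the IQR can also be computed with plain NumPy. Keep in mind that there are several conventions for interpolating quantiles: `np.percentile` with its default linear method gives a slightly different Q3 than the `mquantiles` defaults used in this post, so the IQR below is not exactly 3.8.

```python
import numpy as np

x = [1]*1 + [2]*2 + [3]*3 + [4]*4 + [5]*5 + [6]*6 + [7]*5 + [8]*4 + [9]*3 + [10]*2

# First and third quartiles with NumPy's default (linear) interpolation
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

print("Q1:", q1)
print("Q3:", q3)
print("IQR:", iqr)
```

Both results are valid; what matters is using the same method consistently when comparing datasets.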
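A tiny sketch of the range computation, using only built-ins. Note the difference between the width of the grading scale (0 to 10) and the range of the marks actually observed in this sample, which is simply the largest mark minus the smallest one.

```python
x = [1]*1 + [2]*2 + [3]*3 + [4]*4 + [5]*5 + [6]*6 + [7]*5 + [8]*4 + [9]*3 + [10]*2

lowest = min(x)
highest = max(x)
data_range = highest - lowest  # range = largest value minus smallest value

print("lowest mark:", lowest)
print("highest mark:", highest)
print("range:", data_range)
```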

Sometimes we may work with multiple datasets and want to make many decisions automatically, so plotting a histogram or a boxplot for each of them may be very tedious. In that case the quartiles may be useful, but three numbers are still tedious to deal with compared to a single one!

To deal with that situation we have a couple of measures that come in very handy: the variance and the standard deviation. They tell us how far the data points are from the mean.

To better understand them, let's do it with Python!

import numpy as np

x = [1]*1 + [2]*2 + [3]*3 + [4]*4 + [5]*5 + [6]*6 + [7]*5 + [8]*4 + [9]*3 + [10]*2

print("Variance:")
print(np.var(x))
# >>> 5.26530612244898
print("Standard deviation:")
print(np.std(x))
# >>> 2.294625486315573

So those measures tell us how far the typical student's mark is from the mean. The variance is the average of the squared differences from the mean (we square them so that negative and positive differences don't cancel out), which means it is measured in squared units. The standard deviation is just the square root of the variance, which brings it back to the same units as the marks.

As I want to keep this practical, I encourage you to read more about it on your own, so you can learn the theory: https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch12/5214891-eng.htm

If your std is low, most of the data is grouped together near the mean, so the mean is a strong statistic for understanding the data. If the std is very high, our data is widely spread along the axis.
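To make the definition concrete, here is a minimal sketch that computes the variance and the standard deviation step by step and compares them with NumPy's results. It follows the population formula (dividing by N), which is what `np.var` and `np.std` do by default; passing `ddof=1` would give the sample version instead.

```python
import numpy as np

x = np.array([1]*1 + [2]*2 + [3]*3 + [4]*4 + [5]*5 + [6]*6 + [7]*5 + [8]*4 + [9]*3 + [10]*2)

mean = x.mean()
deviations = x - mean                # how far each mark is from the mean (can be negative)
variance = np.mean(deviations ** 2)  # average of the squared deviations
std = np.sqrt(variance)              # back to the original units (marks)

print("variance by hand:", variance)
print("std by hand:", std)
print("matches numpy:", np.isclose(variance, np.var(x)), np.isclose(std, np.std(x)))
```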
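A small sketch to illustrate this: two made-up classes (hypothetical data, not from the post) that share the same mean but have very different standard deviations.

```python
import numpy as np

# Hypothetical marks, invented for this example only
class_a = np.array([5, 6, 6, 6, 6, 6, 6, 7])    # everyone close to the mean
class_b = np.array([2, 2, 3, 6, 6, 9, 10, 10])  # marks spread over the whole scale

mean_a, std_a = class_a.mean(), class_a.std()
mean_b, std_b = class_b.mean(), class_b.std()

print(f"class A: mean={mean_a}, std={std_a:.2f}")
print(f"class B: mean={mean_b}, std={std_b:.2f}")
```

The mean alone cannot tell these two classes apart, while the std makes the difference obvious.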

Let us now look at the histogram one more time:

If we look closer at it... don't we see a pattern? It is like a wave! Well yes, in this case it is clear that our data follows a pattern, and one of the most important ones: the normal distribution (well, a quasi-normal distribution in this case, as it is not 100% symmetric). Read more about it here! https://statisticsbyjim.com/basics/normal-distribution/

The normal distribution has many interesting properties: it is symmetric, and its mean, median and mode are equal (ours are only close to each other, as we have a quasi-normal distribution here). One of my favorite properties is that in a normal distribution about 68% of the data falls within +/- 1 standard deviation of the mean.

So if we know that our data follows this distribution and we know both the mean and the std, we can infer a lot of things very easily.
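We can check how well the 68% rule holds for the marks from this post. This is only a rough sanity check on a small, roughly bell-shaped sample, not a proof that the data is normal.

```python
import numpy as np

x = np.array([1]*1 + [2]*2 + [3]*3 + [4]*4 + [5]*5 + [6]*6 + [7]*5 + [8]*4 + [9]*3 + [10]*2)

mean, std = x.mean(), x.std()

# True for every mark that lies within one standard deviation of the mean
within_one_std = np.abs(x - mean) <= std
share = within_one_std.mean()

print(f"interval: [{mean - std:.2f}, {mean + std:.2f}]")
print(f"{int(within_one_std.sum())} of {len(x)} marks ({share:.1%}) fall inside it")
```

For this data the share comes out very close to the theoretical 68%.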

Other kinds of shapes exist out there as well, such as left- and right-skewed distributions, where most of the values pile up on one side and a long tail stretches out on the other. For example, think of the ages at which people are in their best athletic shape: most of them are relatively young, and the count drops off as age increases. Another example of a skewed distribution could be income: most people earn relatively little, while a few earn a lot more.
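One rough way to put a number on skew is the moment-based skewness coefficient (third central moment divided by the variance raised to 3/2): negative values suggest a longer left tail, positive values a longer right tail. The sketch below applies it to the marks from this post; the formula is one common definition among several, so take it as an illustration.

```python
import numpy as np

x = np.array([1]*1 + [2]*2 + [3]*3 + [4]*4 + [5]*5 + [6]*6 + [7]*5 + [8]*4 + [9]*3 + [10]*2, dtype=float)

mean = x.mean()
m2 = np.mean((x - mean) ** 2)  # second central moment (the variance)
m3 = np.mean((x - mean) ** 3)  # third central moment
skewness = m3 / m2 ** 1.5

print("skewness:", round(skewness, 3))
print("mean < median:", mean < np.median(x))
```

The result is slightly negative, which matches the fact that the mean sits a little below the median: our class is almost symmetric, with a marginally longer tail towards the low marks.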
