Descriptive stats + percentiles in numpy and scipy.stats

To get the measures of central tendency in a pandas df, we can use the built in functions to calculate mean, median, mode:

import pandas as pd
import numpy as np


# Load the data
df = pd.read_csv("data.csv")

df.mean()
df.median()
df.mode()

To measure dispersion, we can use built-in functions to calculate std. deviation, variance, interquartile range, and skewness.

A low std. deviation means the data tends to be closer bunched around the mean, and vice versa if the std. deviation is high. The iqr is the difference between the 75th and 25th percentile. To calculate this, scipy.stats is used. Skew refers to how symmetric a distribution is about its' mean. A perfectly symmetric distribution would have equivalent mean, median, and mode.

from scipy.stats import iqr

df.std()
iqr(df['column1'])
df.skew()

from scipy import stats

stats.percentileofscore([1, 2, 3, 4], 3)
>> 75.0

The result of the percentileofscore function is the percentage of values within a distribution that are equal to or below the target. In this case, [1, 2, 3] are <= to 3, so 3/4 are below.

numpy.percentile is actually not the inverse of stats.percentileofscore. numpy.percentile takes in a parameter q to return the q-th percentile in an array of elements. The function sorts the original array of elements, and computes the difference between the max and minimum element. Once that range is calculated, the percentile is computed by finding the nearest two neighbors q/100 away from the minimum. A list of input functions can be used to control the numerical method applied to interpolate the two nearest neighbors. The default method is linear interpolation, taking the average of the nearest two neighbors.

Example:

arr = [0,1,2,3,4,5,6,7,8,9,10]
print("50th percentile of arr : ",
       np.percentile(arr, 50))
print("25th percentile of arr : ",
       np.percentile(arr, 25))
print("75th percentile of arr : ",
       np.percentile(arr, 75))

>>> 50th percentile of arr :  5
>>> 25th percentile of arr :  2.5
>>> 75th percentile of arr :  7.5

Now, using scipy.stats, we can compute the percentile at which a particular value is within a distribution of values. In this example, we are trying to see the percentile score for cur within the non-null values in the column ep_30.

non_nan = features[~features['ep_30'].isnull()]['ep_30']
cur = features['ep_30'][-1]

print(f'''Cur is at the {round(stats.percentileofscore(non_nan, cur, kind='mean'), 2)}th percentile of the distribution.''')

>>> This is at the 7.27th percentile of the distribution.