DEV Community

Dr. Azad Rasul
Dr. Azad Rasul

Posted on • Updated on

15- Use Python for Statistics

In this tutorial, you learn how to perform in Python descriptive statistics (i.e min and max), central tendency (i.e. mean and median), variability (i.e. variance and standard deviation), summary, correlation (i.e. covariance and correlation coefficient), and statistical tests (i.e. t-test). We use statistical powerful libraries such as scipy.stats function in SciPy library, scikit-learn, statsmodels, and other libraries.

Import libraries:

import numpy as np
import pandas as pd
import statistics
import scipy.stats
from statsmodels.formula.api import ols
from scipy import stats
Enter fullscreen mode Exit fullscreen mode

Descriptive statistics

x = [11, 13, 8, 16, 23, 24, 30, 16, 9, 18] # create a list
x = np.array(x) # convert x to numpy array
Enter fullscreen mode Exit fullscreen mode

Min

print(x.min()) 
# 8
Enter fullscreen mode Exit fullscreen mode

Max

print(x.max())  
# 30
Enter fullscreen mode Exit fullscreen mode

Mean

print(x.mean())  
# 16.8
Enter fullscreen mode Exit fullscreen mode

Var

print(x.var())  
# 45.36
Enter fullscreen mode Exit fullscreen mode

Sum

print(x.sum())
# 168
Enter fullscreen mode Exit fullscreen mode

Median

print(np.median(x))
# 16
Enter fullscreen mode Exit fullscreen mode

Measures of central tendency:

To measure the central or middle values of datasets we can use: mean, weighted mean, geometric mean, harmonic mean, median, mode.

Mean

print(x.mean())
# 16.8
# or 
mean = sum(x) / len(x)
mean
# 16.8
Enter fullscreen mode Exit fullscreen mode

Harmonic mean

hmean = len(x) / sum(1 / item for item in x)
hmean
# 14.225214542387663
Enter fullscreen mode Exit fullscreen mode

Weighted mean

y = [1, 3, 5, 7, 9, 10, 12, 14, 15, 16]
wmean = sum(y[i] * x[i] for i in range(len(x))) / sum(y)
wmean
# 18.0
Enter fullscreen mode Exit fullscreen mode

Geometric mean

scipy.stats.gmean(x)
# 15.47120474800591
Enter fullscreen mode Exit fullscreen mode

Median

statistics.median(x)
# 16.0
Enter fullscreen mode Exit fullscreen mode

Mode

statistics.mode(x)
# 16
Enter fullscreen mode Exit fullscreen mode

Measures of variability

It calculates the spread of data points such as variance, standard deviation, skewness, percentiles, ranges.

Variance

statistics.variance(x)
# 50
Enter fullscreen mode Exit fullscreen mode

Standard deviation

statistics.stdev(x)
# 11.099549540409287
Enter fullscreen mode Exit fullscreen mode

Skewness

scipy.stats.skew(x, bias=False)
# 0.5846768190780606
Enter fullscreen mode Exit fullscreen mode

Percentiles

np.percentile(x, 5) # the 5th percentile
# 8.45
np.percentile(x, 95) # the 95th percentile
# 27.299999999999994
np.percentile(x, [25, 50, 75]) #25th, 50th and 75th percentile
# array([11.5 , 16.  , 21.75])
Enter fullscreen mode Exit fullscreen mode

Range (maximum - minimum)

np.ptp(x)
# 22
max(x)-min(x)
# 22
Enter fullscreen mode Exit fullscreen mode

Summary of descriptive statistics

scipy.stats.describe(x, bias=False) # correcting the skewness and kurtosis for statistical bias
# DescribeResult(nobs=10, minmax=(8, 30), mean=16.8, variance=50.400000000000006, skewness=0.5846768190780606, kurtosis=-0.42418400340735873)

x = pd.Series([22, 13, 14, 55, 10])
x.describe() # Using pandas function
"""
count     5.000000
mean     22.800000
std      18.539148
min      10.000000
25%      13.000000
50%      14.000000
75%      22.000000
max      55.000000
dtype: float64
"""
Enter fullscreen mode Exit fullscreen mode

Measures of correlation

(Tip: Correlation implies association, not causation.)

Correlation coefficient and covariance can measure the correlation between variables.

Covariance

# create x and y list
x = [11, 13, 8, 16, 23, 24, 30, 16, 9, 18]
y = [1, 3, 5, 7, 9, 10, 12, 14, 15, 16]

cov_matrix = np.cov(x, y)
cov_matrix
cov_xy = cov_matrix[0, 1] # or [1, 0]
cov_xy
# 12.266666666666667
Enter fullscreen mode Exit fullscreen mode

Correlation coefficient

Pearson correlation coefficient (r) is a measurement of the correlation between data.

r, p = scipy.stats.pearsonr(x, y) # using pearsonr stats function of SciPy library 
r
# 0.3348799480266361
corr_matrix = np.corrcoef(x, y) # using corrcoef from NumPy
corr_matrix
r = corr_matrix[0, 1]
r
# 0.33487994802663607

lg = scipy.stats.linregress(x, y) # extract r from linregress of SciPy library.
# LinregressResult(slope=0.24338624338624337, intercept=5.111111111111111, rvalue=0.33487994802663607, pvalue=0.3442215013372624, stderr=0.24212131035281)
r = lg.rvalue
r
# 0.33487994802663607
Enter fullscreen mode Exit fullscreen mode

We can calculate it from the formula:

std_x=statistics.stdev(x)
std_y=statistics.stdev(y)
cov_xy = 12.26666666666
r = cov_xy / (std_x * std_y)
r
# 0.33487994802645404
Enter fullscreen mode Exit fullscreen mode

Studentโ€™s t-test

stats.ttest_1samp(x, 0)   # one sample
# Ttest_1sampResult(statistic=7.483314773547883, pvalue=3.75801166724644e-05)
stats.ttest_ind(x, y)   # two sample
#Ttest_indResult(statistic=2.738454755684762, pvalue=0.01350066550703028)
Enter fullscreen mode Exit fullscreen mode

Ordinary least squares (OLS)

df = pd.DataFrame({'x': x, 'y': y}) # create data frame

model = ols("x ~ y + 1", df).fit()
print(model.summary())

"""
 OLS Regression Results                            
=========================================================================
Dep. Variable:                      x   R-squared:                       0.112
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     1.010
Date:                Sun, 04 Jul 2021   Prob (F-statistic):              0.344
Time:                        06:07:07   Log-Likelihood:                -32.668
No. Observations:                  10   AIC:                             69.34
Df Residuals:                       8   BIC:                             69.94
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=========================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     12.5609      4.777      2.630      0.030       1.546      23.576
y              0.4608      0.458      1.005      0.344      -0.596       1.518
=========================================================================
Omnibus:                        0.271   Durbin-Watson:                   1.243
Prob(Omnibus):                  0.873   Jarque-Bera (JB):                0.313
Skew:                           0.290   Prob(JB):                        0.855
Kurtosis:                       2.357   Cond. No.                         22.3
=========================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
Enter fullscreen mode Exit fullscreen mode

Multiple Regression (multiple factors)

x = [11, 13, 8, 16, 23, 24, 30, 16, 9, 18]
y = [1, 3, 5, 7, 9, 10, 12, 14, 15, 16]
z = [2, 4, 6, 8, 10, 11, 13, 15, 17, 19]
df = pd.DataFrame({'x': x, 'y': y, 'z':z})
model = ols('x ~ y + z', df).fit()
print(model.summary())

"""
 OLS Regression Results                           
=========================================================================Dep. Variable:                      x   R-squared:                       0.291
Model:                            OLS   Adj. R-squared:                  0.089
Method:                 Least Squares   F-statistic:                     1.439
Date:                Sun, 04 Jul 2021   Prob (F-statistic):              0.300
Time:                        06:23:49   Log-Likelihood:                -31.541
No. Observations:                  10   AIC:                             69.08
Df Residuals:                       7   BIC:                             69.99
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=========================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     15.7079      5.139      3.057      0.018       3.556      27.859
y              6.5891      4.627      1.424      0.197      -4.351      17.530
z             -5.6693      4.261     -1.331      0.225     -15.745       4.406
=========================================================================
Omnibus:                        0.842   Durbin-Watson:                   1.785
Prob(Omnibus):                  0.656   Jarque-Bera (JB):                0.626
Skew:                          -0.206   Prob(JB):                        0.731
Kurtosis:                       1.845   Cond. No.                         50.9
"""
Enter fullscreen mode Exit fullscreen mode

Analysis of variance (ANOVA)

x = [11, 13, 8, 16, 23, 24, 30, 16, 9, 18]
y = [1, 3, 5, 7, 9, 10, 12, 14, 15, 16]
b = [2, 4, 6, 8, 10, 11, 13, 15, 17, 19]
df = pandas.DataFrame({'x': x, 'y': y, 'b':b})

fvalue, pvalue = stats.f_oneway(df['x'], df['y'], df['b'])
print(fvalue, pvalue)
# 4.572409552116431 0.019488978036786876
Enter fullscreen mode Exit fullscreen mode

Plotting a univariate regression

import seaborn as sns
sns.lmplot(y='y', x='x', data=df)
Enter fullscreen mode Exit fullscreen mode

image.png

If you like the content, please SUBSCRIBE to my channel for the future content.

To get full video tutorial and certificate, please, enroll in the course through this link: https://www.udemy.com/course/python-for-researchers/?referralCode=886CCF5C552567F1C4E7

Top comments (0)