Aaron Harris for Kite

Originally published at kite.com

Statistical Modeling with Python: How-to & Top Libraries

This post covers some of the essential statistical modeling frameworks and methods for Python, which can help us do statistical modeling and probabilistic computation.

  • Introduction: Why Python for data science
  • Why these frameworks are necessary
  • Start with NumPy
  • Matplotlib and Seaborn for visualization
  • Using Seaborn and Matplotlib
  • SciPy for inferential statistics
  • Statsmodels for advanced modeling
  • Scikit-learn for statistical learning
  • Conclusion

Why these frameworks are necessary

While Python is most popular for data wrangling, visualization, general machine learning, deep learning and the associated linear algebra (tensor and matrix operations), and web integration, its statistical modeling abilities are far less advertised. A large percentage of data scientists still use specialized statistical languages and tools such as R, MATLAB, or SAS rather than Python for their modeling and analysis.

While each of these alternatives offers its own blend of features and power for statistical analysis, it’s useful for an up-and-coming data scientist to know the Python frameworks and methods that can be used for routine descriptive and inferential statistics.

The biggest motivation for learning about these frameworks is that statistical inference and probabilistic modeling are the bread and butter of a data scientist’s daily work. Using Python-based tools for these tasks also makes it possible to build a powerful end-to-end data science pipeline (a complete flow extending from data acquisition to final business decision generation) in a single programming language.

If you use different languages and tools for different tasks, the workflow can quickly become fragmented. For example:

  • Doing web scraping and database access with SQL commands and Python libraries such as BeautifulSoup and SQLAlchemy
  • Cleaning up and preparing your data tables with Pandas, but then switching to R or SPSS to perform statistical tests and compute confidence intervals
  • Using ggplot2 to create visualizations, and then using a standalone LaTeX editor to type up the final analytics report

Switching between multiple programming frameworks like this makes the process cumbersome and error-prone.

What if you could do statistical modeling, analysis, and visualization all inside a core Python platform? Let’s see what frameworks and methods exist for accomplishing such tasks.

Start with NumPy

NumPy is the de facto standard for numerical computation in Python and serves as a foundation for more advanced data science and machine learning libraries such as TensorFlow or Scikit-learn. For numeric processing, NumPy is much faster than native Python code because its methods are vectorized and many of its core routines are written in C and exposed through the CPython C API.
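To illustrate what vectorization means in practice, here is a minimal sketch that computes the same quantity (the mean of squared values) with a pure-Python loop and with a single NumPy expression. The data and the two helper function names are hypothetical and chosen only for illustration; actual timings depend on your machine, so none are quoted here.

```python
import timeit

import numpy as np

# Hypothetical data: 200,000 evenly spaced values, chosen only for illustration
values = np.linspace(0.0, 1.0, 200_000)


def mean_of_squares_loop(data):
    """Pure-Python loop: one interpreter step per element."""
    total = 0.0
    for x in data:
        total += x * x
    return total / len(data)


def mean_of_squares_vectorized(data):
    """Vectorized NumPy: the element-wise loop runs in compiled code."""
    return float(np.mean(data * data))


loop_result = mean_of_squares_loop(values)
vectorized_result = mean_of_squares_vectorized(values)

# Both approaches agree up to floating-point rounding
print('Results match:', np.isclose(loop_result, vectorized_result))

# Time a single run of each version; the numbers vary from machine to machine
loop_time = timeit.timeit(lambda: mean_of_squares_loop(values), number=1)
vectorized_time = timeit.timeit(lambda: mean_of_squares_vectorized(values), number=1)
print(f'Loop: {loop_time:.4f} s, vectorized: {vectorized_time:.4f} s')
```

On typical hardware the vectorized version should finish far sooner, because the per-element work never passes through the Python interpreter.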

Although the majority of NumPy-related discussions focus on its linear algebra routines, it also offers a decent set of statistical functions for computing basic descriptive statistics and for generating random variables from various discrete and continuous distributions.

For example, let’s create a NumPy array from a simple Python list and compute basic descriptive statistics like mean, median, standard deviation, quantiles, etc.

The code for this article can be found in Kite’s GitHub repository.

import numpy as np

# Define a Python list
a_list = [2, 4, -1, 5.5, 3.5, -2, 5, 4, 6.5, 7.5]

# Convert the list into a NumPy array
an_array = np.array(a_list)

# Compute and print various statistics
print('Mean:', an_array.mean())
print('Median:', np.median(an_array))
print('Range (Max - min):', np.ptp(an_array))
print('Standard deviation:', an_array.std())
print('80th percentile:', np.percentile(an_array, 80))
print('0.2-quantile:', np.quantile(an_array, 0.2))

The results are as follows:

Mean: 3.5
Median: 4.0
Range (Max - min): 9.5
Standard deviation: 2.9068883707497264
80th percentile: 5.699999999999999
0.2-quantile: 1.4000000000000001
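The other capability mentioned earlier, generating random variables from discrete and continuous distributions, can be sketched in the same way. This is an illustrative example rather than code from the original article: the seed, sample size, and distribution parameters are arbitrary assumptions. It also shows that `.std()` uses the population formula (`ddof=0`) by default, which is what the standard deviation printed above reflects; pass `ddof=1` for the sample estimate.

```python
import numpy as np

# Seeded generator so the draws are reproducible; the seed value is arbitrary
rng = np.random.default_rng(seed=42)

# Continuous: 1,000 draws from a normal distribution (mean 3.5, std 2.9)
normal_draws = rng.normal(loc=3.5, scale=2.9, size=1000)

# Discrete: 1,000 draws each from Binomial(n=10, p=0.3) and Poisson(lam=4)
binomial_draws = rng.binomial(n=10, p=0.3, size=1000)
poisson_draws = rng.poisson(lam=4.0, size=1000)

# Descriptive statistics of the simulated samples
print('Normal sample mean:', normal_draws.mean())
print('Population std (ddof=0):', normal_draws.std())
print('Sample std (ddof=1):', normal_draws.std(ddof=1))
print('Binomial sample mean:', binomial_draws.mean())
print('Poisson sample mean:', poisson_draws.mean())
```

With 1,000 draws, the sample means should land close to the theoretical values (3.5 for the normal, 3 for the binomial, and 4 for the Poisson), though the exact figures depend on the seed.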

To read more about NumPy, Matplotlib, Seaborn, and Statsmodels, check out the full article by Tirtha Sarkar.

Tirtha Sarkar is a semiconductor technologist, data science author, and author of the pydbgen, MLR, and doepy packages. He holds a Ph.D. in Electrical Engineering and an M.S. in Data Analytics.
