In this post I will list some very valuable libraries for people who intend to use Python for data science, Machine Learning, and Statistics. These libraries are very extensive and are developed by a big number of experts around the world and together, the libraries, make Python a very powerful tool for data analysis.
I really recommend that you install, and use, Anaconda the scientific Python distribution. This will give you loads of Python libraries installed. In the example below we will use one very handy library; Pandas.
The convention is to load pandas as pd and then we can use the methods and classes very easily. For instance, we can write pd.read_csv(‘datafile.csv’) to load a CSV file to a dataframe object.
In this section I will list some of the most essential Python libraries when it comes to data science.
NumPy is an extensive library for data storage and calculations. This library contains data structures, algorithms, and other things that are used to handle numerical data in Python. Furthermore, NumPy includes methods for arrays (lists) that are more efficient than Python's built-in methods. This makes NumPy faster than Python's standard methods. NumPy also contains features that can be used to load data to Python, as well as export data from Python.
If you are migrating from MATLAB, for instance, you will like NumPy (see here)
Pandas is the most powerful library for data manipulation. Pandas contains a wide range of data import and export functions, as well as for indexing and manipulating data. This library is inevitable for those who use the Python for data science. Pandas also includes sophisticated methods for data structures. The most used data structure in pandas is dataframe (series of columns and rows) and series (a 1-dimensional array).
Pandas are extremely effective for reshaping, merging, splitting, aggregating, and selecting (subsetting) data. In fact, the absolute majority of the code in a data science project usually consists of data wrangling, which are the steps required to prepare data so that analyses can be performed. Having a coherent library for all data wrangling is, of course, advantageous.
Unlike the statistical programming environment konwn as R, there is no built-in variant of dataframes in Python. Dataframes are central to basically all data analysis. A dataframe is a table of columns and rows. Here's a very nice Pandas Dataframe tutorial I wrote, aimed at the beginner.
Matplotlib is used to visualize data. Although matplotlib is quite easy to use and you have a lot of control over your plots I would recommend using Seaborn.
Seaborn is Python package for data visualization that is based on matplotlib. This package gives us a high-level interface for drawing beautiful and informative visualizations. It's possible to draw bar plots, histograms, scatter plots, and many other nice plots.
Here's an example using NumPy to generate some data and plotting it using Seaborn:
import numpy as np import seaborn as sns # Generate some normally distributed data dat = np.random.normal(0.0, 0.2, 1000) # Create a histogram using seaborn sns.distplot(dat)
SciPy includes features for advanced calculations.
scikit-learn is a huge library of data analysis features. In scikit-learning there are classification models (e.g., Support Vector Machines, random forest), regression analysis (linear regression, ridge regression, lasso), cluster analysis (e.g, k-means clustering), data reduction methods (e.g., Principal Component Analysis,, feature selection), model tuning and selection (with features like grid search, cross validation, etc), pre-processing of data among many other things.