Data scientists come across a multitude of problems in many different fields. We deal with design issues when generating plots, whether to spot a relationship for ourselves or to present it to a third party. We perform data cleaning, organising, interpolation and analysis, sometimes even engineering. We work with APIs to acquire the information we need. And, amongst many other things, we use statistics to analyse and interpret our data. Although all of these things could be done the long and winding way, there are people who create tools and libraries to make the work easier. Today I’ll talk about one in particular: SciPy.
SciPy, pronounced Sigh Pie, is an open source Python library containing a collection of mathematical algorithms designed to make our lives easier. Basically, you don't have to write big and complicated scientific formulas yourself: SciPy has you covered. As of now, the library contains fifteen sub-packages that can be imported independently and serve different purposes. They are the following:
- cluster: Clustering algorithms, useful in information theory, target detection, communications, compression, and other areas
- constants: A number of mathematical and physical constants, plus conversions between units of measurement
- fft: Fast Fourier Transforms, Discrete Sine and Cosine Transforms, Fast Hankel Transform, along with helper functions and backend control
- integrate: Integration and ordinary differential equation solvers
- interpolate: Sub-package for objects used in interpolation
- io: Modules for reading and writing different types of files
- linalg: Linear algebra modules
- ndimage: Functions for multidimensional image processing
- odr: Orthogonal distance regression
- optimize: Functions for minimizing objective functions
- signal: Signal processing functions
- sparse: 2-D sparse matrix package for numeric data
- spatial: Spatial algorithms and data structures
- special: A substantial number of special mathematical functions, including Airy, Bessel, beta, hypergeometric, Mathieu and Kelvin functions
- stats: Statistical functions for frequency statistics, correlation functions, statistical tests, masked statistics and many others
The best way to use these sub-packages is to import them separately, for example:
>>> from scipy import stats
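To give a feel for how independently imported sub-packages are used, here is a minimal sketch that touches four of them. The sample data and the quadratic function are made up purely for illustration, and the snippet assumes that NumPy and SciPy are installed.

```python
import numpy as np
from scipy import constants, integrate, optimize, stats

# constants: physical constants in SI units (speed of light in m/s)
speed_of_light = constants.c

# integrate: numerically integrate sin(x) from 0 to pi (the exact answer is 2)
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# optimize: find the minimum of (x - 3)^2, which sits at x = 3
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)

# stats: Pearson correlation between two illustrative, perfectly linear samples
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1
r, p_value = stats.pearsonr(x, y)

print(speed_of_light, area, result.x, r)
```

Each sub-package is imported by name, so you only pull in the parts of the library you actually need.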
Version 0.1 was released back in 2001, and version 1.0.0 only arrived in 2017. The project is now at version 1.7.3 and receives periodic updates. The code is written by scientists, for scientists, giving us a set of easy-to-use tools. One of its creators, Travis Oliphant, also created NumPy, which merged the Numeric and Numarray packages. With a growing number of extension modules, and the rising need for a more complete environment for scientific and technical computing, in 2001 Travis joined forces with Eric Jones and Pearu Peterson to create version 0.1. SciPy runs on top of the numeric array data structure provided by NumPy. From there, the project only grew. Version 1.0.0 had a total of 121 contributors. It is currently distributed under the BSD license, and its development is supported by an open community. Their GitHub repository has information on how to contribute to SciPy and on the plans for its future.
The library's applications are remarkably varied, from high school education to powering field-changing research, such as the work of the 2017 Physics Nobel Prize winners, awarded “for decisive contributions to the LIGO detector and the observation of gravitational waves”. Part of it can be seen here, and this is a GitHub repository where you can find a Jupyter Notebook and see some of the code in action.
NumPy - Though SciPy is built on top of NumPy and has access to all of its features, NumPy alone can be a better choice when you only need basic array operations. Python is a powerful and flexible language, but it is not always the fastest. NumPy's core is written in C, which makes its execution faster.
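As a rough illustration of how the two libraries relate, the sketch below solves the same small linear system with NumPy alone and with SciPy's linalg sub-package. The matrix and vector are made-up values; the point is only that both operate on the same NumPy arrays.

```python
import numpy as np
from scipy import linalg

# An illustrative 2x2 system: 3x + y = 9 and x + 2y = 8
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

# NumPy alone is enough for a basic solve
x_numpy = np.linalg.solve(A, b)

# SciPy works on the same arrays and exposes the LU factorisation,
# which can be reused for several right-hand sides
lu, piv = linalg.lu_factor(A)
x_scipy = linalg.lu_solve((lu, piv), b)

print(x_numpy, x_scipy)
```

For a one-off solve like this, NumPy is all you need; SciPy becomes useful once you want the extra routines built around the same arrays.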
MATLAB - This is a different programming language altogether: instead of being object-oriented like Python, it is array-oriented. This makes it an easy and productive environment for scientists doing mathematical and technical computing, and ideal for matrix manipulation. It was not, though, designed as a general-purpose programming language, which makes it clunky when dealing with problems that demand more flexibility.
TensorFlow - Another open source library for numerical computing, focused on Machine Learning and Artificial Intelligence. TensorFlow is really fast, since its core is written in a combination of C++, Python and CUDA. The trade-off is that TensorFlow is considered harder to use and to debug.
These are just a few of the dozens of libraries, tools and languages that share some of SciPy's capabilities. It's hard to make an overall comparison between them because they are all designed to do different things that only sometimes overlap. Here, for instance, you can find a comparison of several numerical-analysis software packages (including SciPy), whereas here you can find a comparison of different statistical packages (which also includes SciPy). Almost all of them can do something better or faster, or are more compatible with a certain program, but they also all have a downside compared to using Python and SciPy. You have to decide for yourself what is best for the projects you want to do.
One of the major advantages of using Python and its tools and libraries is that Python is consistently amongst the most commonly used languages. Its simple syntax and versatility make it a common entry point for beginners, so the user base grows every year. Since Python is open source, like many of its packages (SciPy included), there's a constantly increasing number of programmers working on its improvement. That makes for more and better tools, more usage and, let's not forget, better documentation. On the SciPy documentation page you can find extensive information on how to use its tools, organised by sub-package.
With Python being the programming language with the most topics created on Stack Overflow, there's also a lot of content being created by the general public. There are tutorials made by schools, individuals, and paid websites. I'll list a few free ones here:
One of my favourites, though, would have to be the Real Python website. On SciPy alone, they have in-depth information, with very descriptive and didactic posts on:
What I'm trying to say is, if you're trying to learn SciPy, you're probably not gonna run out of resources. And if you do, just remember, you can always ask for .help().
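Strictly speaking, help is Python's built-in function rather than a method, so in an interactive session you would type help(stats.pearsonr). The sketch below reads the same docstring as a string instead, using the standard library's inspect module; stats.pearsonr is just an arbitrary example function.

```python
import inspect
from scipy import stats

# In an interactive session you would simply call: help(stats.pearsonr)
# inspect.getdoc returns the same docstring as a plain, cleaned-up string.
doc = inspect.getdoc(stats.pearsonr)

# The first line of a docstring is normally a one-sentence summary
summary = doc.splitlines()[0]
print(summary)
```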