Most scientists start programming in a procedural style. I certainly did. Procedural programming comes naturally to scientists because it reads like a precise protocol for an experiment. Do this. Then do that.
Functional programming, by contrast, reduces everything to

```
output = function(inputs)
```

Strange that we rarely start there, because if you think about it, everything in data analysis is a function. Data cleaning maps messy data to tidy data. A statistical estimator maps a sample to a real number. A visualization maps data to a colorful bitmap. For data analysis, we almost exclusively write code that requires no user interaction and would be well suited to the functional paradigm.
The conventional definition of functional programming is "no side effects": you compute outputs from inputs alone. You cannot rely on any other information, and you cannot pass on any other information. This tight discipline is very useful for science, because it makes it easier to reason about correctness. For example, the ordinary least squares estimator of a multivariate regression,

$$\hat{\beta} = (X'X)^{-1} X'Y,$$

is a mathematical function which you can characterize using pencil and paper. The Julia equivalent,
```julia
function OLS(X, Y)
    return inv(X' * X) * X' * Y
end
```
works independently of whatever you have done elsewhere in the code. (By the way, `X \ Y` is a better way to write this in Julia: it solves the least squares problem without explicitly inverting `X'X`.)
Moreover, it is easier to automate computations as a chain of functions. If `f(X, Y)` is the estimator of the multivariate coefficients and `g(b, X)` is a prediction rule, then `g(f(X, Y), X)` gives the fitted values of your machine learning model. Relying on pure functions makes the data science process more reproducible.
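To make this concrete, here is a minimal Python sketch of the same chain. The names `f` and `g` follow the text; the toy data are made up for illustration:

```python
import numpy as np

def f(X, Y):
    # OLS estimator: a pure function from the data (X, Y) to coefficients
    return np.linalg.solve(X.T @ X, X.T @ Y)

def g(b, X):
    # prediction rule: a pure function of coefficients and regressors
    return X @ b

# toy data lying exactly on the line y = 1 + x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y = np.array([1.0, 2.0, 3.0])

# the fitted model is just the composition of the two functions
fitted = g(f(X, Y), X)
```

Because `f` and `g` depend only on their arguments, the composition behaves the same no matter what the rest of the script has done.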
You can chain small tools in a Unix-like shell via the pipe operator. The tool reads from STDIN and writes to STDOUT and (hopefully) does not touch anything else in between. As a data scientist, you can focus on implementing the function correctly, instead of worrying about how you get the data and who does what with it. This is why I am a big fan of "data science from the command line."
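In that spirit, a command-line tool can be just a pure function with the I/O pushed to the edges. A minimal Python sketch (the filter itself, `keep_positive`, is my own made-up example):

```python
import sys

def keep_positive(lines):
    # pure function: maps input lines to output lines, touches nothing else
    return [line for line in lines if float(line) > 0]

if __name__ == "__main__":
    # all reading and writing happens here, at the boundary
    sys.stdout.writelines(keep_positive(sys.stdin.readlines()))
```

You can test `keep_positive` in isolation, and the script still plugs into a shell pipeline like any other Unix tool.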
An even better example is `%>%` piping in R. (Julia has a similar pipe operator.) As I understand from my R colleagues, most idiomatic code now uses this syntax:

```r
x %>% log() %>% diff() %>% exp() %>% round(1)
```
At some level, even scripting languages such as Stata can be thought of in terms of chains of functions. A strict limitation of Stata is that you can only carry out computations on a single dataframe at a time. This limitation has huge benefits, though: you can write functional code that maps one state of your dataframe to the next. For example,
```stata
generate y = log(x)
replace y = 0 if x < 0
```
is a chain of two functions. Easy to read, easy to debug. It does the same as the pandas code

```python
df['y'] = np.log(df['x'])
df.loc[df['x'] < 0, 'y'] = 0
```
Er, what? This reads as more complicated because of the vastly wider state we have to control. Which log function do we want to use? Which dataframe are we selecting from? Which dataframe are we changing?
Notebooks and other REPLs are anything but side-effect free, and Joel Spolsky hates them with a passion. When you move up and down between cells, saving all kinds of variables in your workspace, you confuse yourself about what is an input to your current computation. I sometimes play around in IPython notebooks, but I always feel guilty.
Jenny Bryan of RStudio and the tidyverse also has something to say about side effects.
- Implement a pipe operator in Python. I know it's hard, but can we just have the tidyverse for Python?
- Write purely functional Stata code. Separate input/output, and even model estimation and graphing, from pure data manipulation code.
- Explore data science libraries for real functional languages. I know, SQL is functional, but it is very hard to read.
- More generally, keep an eye out for side effects. Do I need this global parameter? Do I need to write this to disk? Aim to write functions that are as pure as possible.
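On the first point, a pipe can at least be imitated in plain Python by overloading an operator. This is a toy sketch of my own, not an existing library:

```python
import math

class Pipe:
    """Wrap a value so that >> applies functions left to right."""

    def __init__(self, value):
        self.value = value

    def __rshift__(self, func):
        # Pipe(x) >> f  evaluates to  Pipe(f(x))
        return Pipe(func(self.value))

# reads left to right, much like x %>% sqrt() %>% round() in R
result = (Pipe(100) >> math.sqrt >> round).value
```

Each step is still a pure function of the previous value, so the chain keeps the reproducibility argument intact.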