I like to think that you can remain productive over 40. Make is 43 this year and it is still my tool of choice to automate my data cleaning or data analysis. It is versatile and beautifully simple. (At first.) Yet, in a recent survey, we found that fewer than 5 percent of data-savvy economists use Make regularly.
Most build systems are meant to, well, build things. Compile code in Java, C, and the like. Make is supposed to do that, too, and most tutorials and StackOverflow questions will feature examples about how to build C code.
But at its most basic, Make is indeed beautifully simple. I create a text file called Makefile in my folder with the following content.
clean_data.csv: raw_data.csv data_cleaner.py
	python data_cleaner.py
Then I say make in the shell and Make creates clean_data.csv.
In other words, I need to specify
target: source
	recipe
and Make will run the recipe for me.
This information is something I want to note for my documentation anyway. What does my script need and what does it produce? I might as well put it in a Makefile.
This way, I can link up a chain of data work,
visualization.pdf: clean_data.csv visualize.py
	python visualize.py

clean_data.csv: raw_data.csv data_cleaner.py
	python data_cleaner.py
When I enter make in the shell, I get my visualization.pdf recreated right from raw data.
Order matters here. Typing make without any arguments recreates the first target found in the file called Makefile. I can also type make clean_data.csv if I want to recreate a specific target.
Suppose I don't like the color in my graph and decide to edit visualize.py. But data cleaning takes a lot of time! If clean_data.csv is already up to date (relative to the time stamps of raw_data.csv and data_cleaner.py), Make will skip that step and only redo the visualization recipe.
You don't have to rerun everything. Lazy is good. One more reason why you want to write modular code.
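To make the time stamp rule concrete, here is a minimal Python sketch of the check, assuming the usual rule that a target is rebuilt when it is missing or older than one of its sources. It is an illustration, not Make's actual implementation, and the names needs_rebuild and demo are made up for this post. The demo fakes the time stamps of the files from the example in a temporary folder.

```python
import os
import tempfile


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any source file."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def demo():
    """Recreate the example: visualize.py was edited, the cleaning inputs were not."""
    with tempfile.TemporaryDirectory() as folder:
        def touch(name, mtime):
            path = os.path.join(folder, name)
            with open(path, "w") as handle:
                handle.write("")
            os.utime(path, (mtime, mtime))  # set access and modification time
            return path

        raw = touch("raw_data.csv", 1000)
        cleaner = touch("data_cleaner.py", 1000)
        clean = touch("clean_data.csv", 2000)
        figure = touch("visualization.pdf", 3000)
        script = touch("visualize.py", 4000)  # edited after the last build

        return (
            needs_rebuild(clean, [raw, cleaner]),
            needs_rebuild(figure, [clean, script]),
        )


if __name__ == "__main__":
    print(demo())
```

The demo returns (False, True): the cleaning step is skipped and only the visualization is redone.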
As soon as you feel the power of your first few simple Makefiles, you will crave more. Can I do this? Can I do that? The answer is yes, you can, but it will take a lot of searching on StackOverflow.
One feature I use regularly is automatic variables. If I don't want to hard code file names into my neat Python script (you'll see shortly why), I can pass the names of target and source as variables.
clean_data.csv: raw_data.csv data_cleaner.py
	python data_cleaner.py < $< > $@
This pipes raw_data.csv (the variable $< refers to the first source file) to the STDIN of data_cleaner.py and saves the output on STDOUT to clean_data.csv (the variable $@ denotes the target).
Why these symbols? Don't ask me. They don't look pretty but they get the job done.
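For this to work, the script has to read from STDIN and write to STDOUT instead of opening files by name. A minimal sketch of what such a data_cleaner.py could look like is below. The cleaning logic (trimming whitespace, dropping empty rows) is purely hypothetical, just a stand-in for whatever your cleaning actually does.

```python
import csv
import sys


def clean(source, sink):
    """Copy CSV rows from source to sink, trimming whitespace and dropping empty rows."""
    writer = csv.writer(sink, lineterminator="\n")
    for row in csv.reader(source):
        cells = [cell.strip() for cell in row]
        if any(cells):  # skip rows where every cell is blank
            writer.writerow(cells)


if __name__ == "__main__":
    # The recipe redirects raw_data.csv to STDIN and STDOUT to clean_data.csv.
    clean(sys.stdin, sys.stdout)
```

Because the script never mentions a file name, the same code can serve any target and source pair that Make hands to it.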
I can also use functions like
clean_data.csv: input/complicated-path/raw_data.csv data_cleaner.py
	python data_cleaner.py $(basename $(notdir $@))
and many more.
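If the names of these functions look confusing (Make's basename is not the shell's basename), a rough Python equivalent may help. This is a sketch for ordinary single file names only. The real Make functions work on whitespace-separated lists of names and may treat edge cases such as dotfiles differently.

```python
import os.path


def notdir(path):
    """Roughly mimic Make's $(notdir ...): drop the directory part of a path."""
    return os.path.basename(path)


def basename(path):
    """Roughly mimic Make's $(basename ...): drop the last suffix, keep any directory."""
    return os.path.splitext(path)[0]


if __name__ == "__main__":
    # $@ is the target, so the recipe above passes "clean_data" to the script.
    print(basename(notdir("clean_data.csv")))
    # Applied to the source instead, the long directory would be stripped.
    print(basename(notdir("input/complicated-path/raw_data.csv")))
```

The two calls print clean_data and raw_data.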
And now for the best part. Make can execute my jobs in parallel. On a nicely equipped AWS server, I gladly launch make -j60 to run up to 60 jobs at once. Make serves as a job scheduler. Because it knows what depends on what, I will not run into a race condition, provided every dependency is declared in the Makefile.
- Knock, knock.
- Race condition.
- Who's there?
Parallel execution doesn't help if I have a linear chain of recipes as above. But if I can split my dependency graph into parallel branches, they will be executed in the correct order.
So suppose my data is split into two (or many more). The following code would allow for parallel execution of the data cleaning recipe.
visualization.pdf: merged_data.csv visualize.py
	python visualize.py

merged_data.csv: clean_data_1.csv clean_data_2.csv merge_data.py
	python merge_data.py

clean_data_%.csv: raw_data_%.csv data_cleaner.py
	python data_cleaner.py < $< > $@
I have used the pattern matching character % to match both clean_data_1.csv and clean_data_2.csv.
Invoking Make with make -j2 will start two processes to clean the data. When both have finished, the merge data recipe runs, then the visualization. (These will be single threaded.)
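To see why the order comes out right, it may help to look at the dependency graph itself. The sketch below uses Python's standard graphlib module to group the targets of this example into waves that could run side by side. It only illustrates the idea of scheduling from a dependency graph and is not how Make is implemented. The function name parallel_waves is my own, and for brevity the graph lists only files that have a recipe.

```python
from graphlib import TopologicalSorter

# Each target maps to the targets it depends on (only files that have a recipe).
DEPENDENCIES = {
    "visualization.pdf": {"merged_data.csv"},
    "merged_data.csv": {"clean_data_1.csv", "clean_data_2.csv"},
    "clean_data_1.csv": set(),
    "clean_data_2.csv": set(),
}


def parallel_waves(dependencies):
    """Group targets into waves; targets within a wave do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything that can run right now
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave has finished
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(parallel_waves(DEPENDENCIES), start=1):
        print(number, wave)
```

The first wave holds both cleaning targets, the second the merge, the third the visualization. Only the first wave has anything to gain from a second process.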
I regularly use parallel execution to do Monte Carlo simulations or draw bootstrap samples. Even if I have 500 parallel tasks and only 40 processors, make -j40 will patiently grind away at those tasks. And if I kill my jobs to let someone run Matlab for the weekend (why would they do that?), I can simply restart on Monday with only 460 tasks to go.
- Simple real-world Makefile with variables and for loops.
- Not-so simple Makefile with variables, for loops, functions and pattern matching.
Those who still don't like Make?
$< $@ them.
Originally posted on