First, let's clarify what I mean by optimizing software, which kind of optimum we can aim for, and what the general process looks like.
The high-level process is more or less the following:

1. Decide what you want to optimize for.
2. Pick a target metric.
3. Pick a set of common use-cases.
4. Figure out which parts of the codebase impact the selected metric the most.
5. Set up benchmarks to measure how your changes impact the chosen metric.
6. Change the code and go back to 3 until you are happy.
The first thing to decide is what you want to optimize for.
- Most commonly you optimize for speed: making the same operation produce the same result, but in less time.
- Sometimes you care more about how many resources the operation requires and you want to keep that to a minimum.
Depending on your optimization strategy you may gain speed while also reducing resource usage, or you may have to trade off between the two, or between one kind of resource and another (e.g. CPU and memory).
To keep the discussion simple we'll focus on single-run execution time and memory usage.
Once you have decided what to optimize for, you need to select a set of representative use-cases: they will be profiled to figure out what to speed up first, and then used to measure the improvements through benchmarks.
There are many parallels with the debugging activity:
- The benchmarks are no different from integration and unit tests, and usually test harnesses can double as benchmark harnesses or the other way round.
- The resource profilers are quite similar to fault detectors.

NOTE: valgrind offers a large toolchain of resource profilers, fault detectors and even more.

You have the same code-coverage problems that you have when you are crafting regression and conformance tests.
You need to come up with something that:
- represents the common usage model well
- is reproducible
- executes in an acceptable time, since you want to run it often: potentially as part of your CI, definitely as part of your release workflow
- is easy to instrument, so you can extract different metrics with minimal manual intervention
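To make the requirements above concrete, here is a minimal sketch of a reproducible timing harness: a fixed, deterministic input, several runs, and the best time reported to reduce noise. The `workload` function is a hypothetical stand-in for whatever hot path you care about; a real setup would use criterion or similar instead.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the operation you want to measure.
fn workload(input: &[u64]) -> u64 {
    input.iter().fold(0u64, |acc, x| acc.wrapping_add(x * x))
}

// Run the closure `runs` times and keep the best (least noisy) time.
fn bench<F: Fn() -> u64>(runs: usize, f: F) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        // black_box keeps the optimizer from deleting the work.
        black_box(f());
        best = best.min(start.elapsed());
    }
    best
}

fn main() {
    // A deterministic input keeps the run reproducible across revisions.
    let input: Vec<u64> = (0..100_000).collect();
    let best = bench(20, || workload(&input));
    println!("best of 20 runs: {best:?}");
}
```

Because the input is fixed, two revisions of the code can be compared on exactly the same data, which is the reproducibility property the list above asks for.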
Once you have selected your common use-cases, you can profile them first, figure out which hot paths would give you the largest gain, and craft benchmarks both for the overall execution and for the routines you want to optimize.
To profile your binary there is a large number of tools that come in handy.
Some may give you information over multiple metrics (e.g. dtrace or bpftrace) or a specific one (e.g. memory-profiler or uftrace); some work on unchanged binaries, while others require them to be compiled with specific settings (e.g. debug information or mcount markers).
You want enough information to come up with about 5 candidates for optimization.
Depending on the tool you may simply pick the top of a list, or inspect a flamegraph or other interactive representations.
Some tools provide the information in multiple formats and really good tools provide means to compare profiling traces of different runs.
The main metric you should care about is the cumulative amount of time spent on each function per run.
My favorite tools to get this information are:
- cargo flamegraph: produces pretty interactive graphs, using perf record or dtrace under the hood to get the timing information.
- perf record + perf report: its tabular representation is usually all you need to start, if you are on Linux.
- uftrace: faster than perf and has built-in exporters for multiple formats.
- cargo instruments: the most straightforward way to get profiling information on macOS.
For memory, you want to consider at least two pieces of information: the allocation count and the peak memory usage.
In many cases the peak memory usage (and the maximum resident set size) is one of the main limiting factors if you plan to have concurrent runs.
Keeping the allocation count low is important since you might end up fragmenting the memory, and in general allocations in a hot path also hurt the execution speed.
My favorite tools to get this information are:
- malt: its web UI is really straightforward. (Linux-only)
- memory-profiler: faster at collecting information, even if its web UI isn't as nice to navigate in my opinion. (Linux-only again)
- On macOS, cargo instruments has you covered.
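As a quick in-process cross-check of the allocation count, independent of the tools above, you can wrap the system allocator with a counter. This is a minimal sketch (the `CountingAlloc` name and the counting policy are my own illustration, not part of any of the listed tools):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Global counter bumped on every allocation.
static ALLOCS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical wrapper around the system allocator that counts allocations.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn allocation_count() -> usize {
    ALLOCS.load(Ordering::Relaxed)
}

fn main() {
    let before = allocation_count();
    // Building the Vec performs at least one heap allocation.
    let squares: Vec<u64> = (0..1024u64).map(|x| x * x).collect();
    let after = allocation_count();
    println!("allocations for building the Vec: {}", after - before);
    assert!(after > before);
    drop(squares);
}
```

Sampling the counter before and after a hot path gives you a cheap regression check on allocation count that can run inside your normal test suite.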
Once you have your candidates from the profiling phase, the best next step is to write a set of micro-benchmarks for each of them, and to have a whole-application benchmark ready as well: sometimes speeding up a single function has unexpected side-effects, and it is better to catch the pathological cases early.
Tools that can measure execution speed include:

- criterion
- hyperfine
- perf report
- getrusage (gnu time)
NOTE: anything that can measure speed can be repurposed to measure throughput, even if it does not do so explicitly.
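The note above is just arithmetic: once you have the elapsed time of a run and the amount of data it processed, throughput falls out directly. A small worked example (the function name is mine):

```rust
use std::time::Duration;

// Derive throughput from a speed measurement: bytes processed
// divided by elapsed seconds, reported in MiB/s.
fn throughput_mib_s(bytes: u64, elapsed: Duration) -> f64 {
    (bytes as f64 / (1024.0 * 1024.0)) / elapsed.as_secs_f64()
}

fn main() {
    // 10 MiB processed in 2 seconds → 5 MiB/s.
    let t = throughput_mib_s(10 * 1024 * 1024, Duration::from_secs(2));
    println!("{t} MiB/s");
}
```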
Half of the problem is extracting the benchmark data; the other half is storing it properly so you can compare results across revisions.
- critcmp makes it easier to store and compare different criterion-based benchmark runs.
- hyperfine can export to multiple formats and ships with some analysis scripts that may help you.
Once you have the benchmarks ready you can start optimizing the code.
This post is an introduction to the topic, with a list of the tools I like to use and what I consider best practices when dealing with this kind of task.
In the next posts I'll discuss the actual strategies for doing the work and which crates may make your life much easier, depending on what you are trying to achieve.