At Linkurious, we build Linkurious Enterprise, a Web platform that leverages the power of graphs and graph visualizations to help companies and governments around the globe fight financial crime.
One of the main features of Linkurious Enterprise is a user-friendly graph visualization interface aimed at non-technical users.
Ogma has been designed to work with state-of-the-art algorithms to provide the best performance in the field of network visualizations, from a first-class WebGL rendering engine, to the adoption of WebWorkers to improve the interactivity of the library on long running tasks and finally with top-class graph layout algorithms implementations.
Since the first announcement, WebAssembly promised great performances – comparable to native ones – with very little effort from the developer himself other than developing the source code into a native performance language to get the best results on the Web.
After some time and many more announcements on the WebAssembly side, we decided to give it a try and run a thorough benchmark before jumping on the (performant) WASM bandwagon.
The perfect candidate for this kind of investigation are graph layouts: they are CPU intensive, crunching numbers over and over until a solution converges from it.
Our investigation focused first on finding a candidate to benchmark a typical graph layout algorithm, which can be easily ported to different languages using similar structures.
The choice landed on the n-body algorithm: this algorithm is often the baseline of many force-directed layout algorithms and the most expensive part in the layout pipeline. Solving this specific part of the pipeline would provide great value at the overall force-directed algorithms Ogma implements.
As Max De Marzi said on his blog last summer in 2019:
There are lies, damned lies, and benchmarks.
Building a fair benchmark is often not possible because it’s hard to reproduce real world scenarios: creating the right environment for a complex system to perform is always incredible hard because it’s easy to control external factors in a laboratory benchmarking, while in the real life many things concur to the final “perceived” performance.
In our case our benchmark will focus on a single, well defined, task: the n-body algorithm.
It is a clear and well-known defined algorithm used to benchmark languages by reputable organizations.
As any fair benchmark comparison, there are some rules we defined for the different languages:
- The code structure should be similar for the different implementations
- No multi-process, multi-thread concurrency allowed.
- No SIMD allowed
- Only stable versions of the compilers. No nightly, beta, alpha, pre-alpha versions allowed.
- Use only the latest versions of the compilers for each source language.
Once defined the rules, it is possible to move to the algorithm implementation. But first, it’s necessary to decide which other languages will be used for the benchmark:
WASM is a compiled language, even if declared as “human readable” assembly code, it’s not a (mentally) sane choice for us to write plain WASM code. Therefore we ran a survey for the benchmark and we picked the following candidates:
On each implementation, we kept the number of points at 1000 and ran the algorithm with different numbers of iterations. For each run, we measured how long it took to perform the computations.
The setup of the benchmark was the following:
- NodeJS v. 12.9.1
Chrome Version 79.0.3945.130 (Official Build) (64-bit)
clang version 10.0.0 - C language version
emcc 1.39.6 - Emscripten gcc/clang-like replacement + linker
AssemblyScript v. 0.9.0
Macbook Pro 2017 Retina
Intel Dual Core i5 2,3 GHz, 8GB DDR3 with 256GB SSD
Not top of the class machine for a benchmark, but we’re testing a WASM build which is going to be run in a browser context, which usually does not have access to all the cores and RAM memory anyway.
To put some spice on the benchmark, we produced several versions of each implementation: a version where each Point in the n-body system has a 64 bit numeric coordinate representation, and another version with a 32 bit representation.
Another note to consider is probably the “double” Rust implementation: originally in the benchmark a “raw” Rust “unsafe” implementation was written without using any particular toolchain for WASM. Later on, an additional “safe” Rust implementation was developed to leverage the “wasm-pack” toolchain, which promised easier JS integration and better memory management in WASM.
To crunch the numbers, 2 main environments have been tested: Node.js and a browser environment (Chrome).
Both benchmarks run in a “warm” scenario: the Garbage Collector has not been reset before each benchmark suite. From our experiments running the GC after each suite had no particular effects on the numbers.
The AssemblyScript source was used to build the following artifact:
- The JS baseline implementation
- The AssemblyScript WASM module
- The AssemblyScript asm.js module1
Crunching the numbers in Node.js shows the following scenario:
And then run the same suite in the browser:
The first thing we noted was how the AssemblyScript “asm.js” performs slower than other builds. This chart did not make it clear enough how well or bad other languages were doing compared to the JS implementation, so we created the following charts to clarify:
There is a distinction here between 32 and 64 bit, which may lead to the idea that JS numbers can have both representation: numbers in JS - our baseline - are always at 64 bits, but for the compilers to WASM it may make some difference.
In particular, it makes a huge difference for the AssemblyScript asm.js build at 32 bit. The 32 bits build has a big performance drop compared to the JS baseline, and compared to the 64 bit build.
It is hard to see how the other languages are performing compared to JS, as AssemblyScript is dominating the chart, therefore an extract of the charts was created without AssemblyScript:
The different numeric representation seems to affect other languages too, but with different outcomes: C becomes slower when using 32 bit (float) numbers compared to the 64 bits (double), while Rust becomes consistently faster with 32 bit (f32) numbers than with 64 bit (f64) alternative.
At this point one, question may come to mind: since all tested WASM builds are quite close to the JS implemented code, would it be possible that the native implementations are slower themselves and the WASM builds just mirror that?
Native versions of the implementations were always faster than its JS counterpart.
What has been observed is that the WASM builds perform slower than their native counterpart, from a 20% up to a 50% performance penalty - performed on a reduced benchmark version with 1000 iterations:
In the measurements above, the native measures are counting also the bootstrap time, while on the WASM measurement that time has been taken out.
This may sound like a win for Rust, but is actually a very small gain compared to the efforts required.
Learning new languages is always a good thing, but it should be for the right reason: performances are many times the “wrong” reason, as they are more affected by whole design decisions rather than compiler or micro-benchmark optimizations.
1: Interesting to note how the “AssemblyScript asm.js module” was not actually fully asm.js compliant. We tried to add the “use asm” comment on top of the module, but the browser refused the optimization. Later on, we discovered how the binaryen compiler we used does not actually target full asm.js compliance, but rather some sort of efficient JS version of WASM. ↑