DEV Community

Performance Comparison, Rust vs Crystal with Redis

Jamie Gaskins on June 26, 2020

You often hear about how fast languages like Rust and Go are. People port all kinds of things to Rust to make them faster. It's common to hear abou...
Markus Westerlind

Having done most of the optimization of the Rust Redis library, I can say that this use case of sending an enormous pipeline of commands is not something I have optimized for. Most of the time I send a single command or just a few, and it is the round-trip time that matters most, which is of course dominated by I/O. Disregarding that, however, the main overhead isn't in message encoding but in the bookkeeping needed to send concurrent commands on the same connection, so optimizing out the remaining encoding overhead hasn't been a priority.

Still, it would be interesting to see how you achieve these impressive results (in case there is something to steal ;) ). But the post contains neither the Crystal Redis library nor the benchmark setup itself :( .

Jamie Gaskins

> Having done most of the optimization of the Rust Redis library, I can say that this use case of sending an enormous pipeline of commands is not something I have optimized for.

Awesome, great to meet you! 🙂

Honestly, I only optimized the Crystal client for heap allocations — I tried to avoid them whenever feasible.

> Still, it would be interesting to see how you achieve these impressive results (in case there is something to steal ;) ). But the post contains neither the Crystal Redis library nor the benchmark setup itself :( .

Because redis::pipe() doesn't take the connection as an argument, it looks like it's acting as a buffer and sending all the pipelined commands afterward with the query method. The convention in Crystal is instead to use I/O streams directly so we don't have to realloc a buffer every time we need to expand it. Instead, the stream has a static buffer. And then after the block is complete, I flush the buffer one more time before reading the results back off the socket.
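To make the streaming approach concrete, here is a minimal sketch in Rust using only the standard library. It is not code from redis-rs or from the Crystal client: `write_command` and `pipeline_sets` are hypothetical names, the `SET foo <n>` commands are made up for illustration, and any `Write` implementation (here a `Vec<u8>`) stands in for the socket. Each command is encoded straight into a fixed-capacity `BufWriter`, which writes to the underlying stream whenever it fills, and a final `flush` pushes out whatever is left.

```rust
use std::io::{self, BufWriter, Write};

/// Encodes one command in RESP (the Redis wire format) directly into `out`.
fn write_command<W: Write>(out: &mut W, args: &[&str]) -> io::Result<()> {
    // Array header: number of arguments.
    write!(out, "*{}\r\n", args.len())?;
    for arg in args {
        // Each argument is a bulk string: byte length, then the bytes.
        write!(out, "${}\r\n{}\r\n", arg.len(), arg)?;
    }
    Ok(())
}

/// Hypothetical pipeline helper: streams `count` SET commands through a
/// fixed-size buffer instead of growing one large buffer first.
fn pipeline_sets<W: Write>(socket: &mut W, count: usize) -> io::Result<()> {
    // 8 KiB is an arbitrary capacity chosen for this sketch.
    let mut out = BufWriter::with_capacity(8 * 1024, socket);
    for i in 0..count {
        let value = i.to_string();
        write_command(&mut out, &["SET", "foo", value.as_str()])?;
    }
    // Flush once more so the tail of the pipeline reaches the socket
    // before the caller starts reading replies.
    out.flush()
}
```

Calling `pipeline_sets(&mut buffer, 2)` with `buffer` being an empty `Vec<u8>` leaves two RESP-encoded `SET` commands in it; with a real `TcpStream` the same calls would put those bytes on the wire as the buffer fills.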

I just published the code on GitHub so you can have a look. The pipeline implementation is here — you can see that it just wraps a connection and overrides run (which I believe is the equivalent of cmd in the Rust client).

The benchmark setup is in the code within the article. I tried using Cargo's bench command but it told me it was going to take 6800 seconds to complete all of its iterations, so I was like "uhh, nope" and just measured the time it took to run once instead. 😂 I also (elsewhere in the comments) looped over it to get more than a single sample on a warmed-up connection. It reduced the impact of latency even more on both clients since only the first run had to deal with TCP handshake and slow start.
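The looping approach described above can be sketched roughly like this in Rust. This is an illustration only, not the benchmark code from the article: `sample` and `warm_median` are hypothetical helpers, and the closure passed to `sample` stands in for one full pipeline run. The first sample is dropped because it is the one that pays for the TCP handshake and slow start.

```rust
use std::time::{Duration, Instant};

/// Runs `work` `runs` times and records one wall-clock sample per run.
fn sample<F: FnMut()>(runs: usize, mut work: F) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect()
}

/// Median of the samples after discarding the first (cold) run.
/// Returns `None` when there are no warm samples.
fn warm_median(samples: &[Duration]) -> Option<Duration> {
    let mut warm: Vec<Duration> = samples.iter().skip(1).copied().collect();
    warm.sort();
    warm.get(warm.len() / 2).copied()
}
```

This only measures wall-clock time, so it says nothing about how much of each run was spent on the CPU versus waiting on the socket.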

If Redis pipelines in the Rust client aren't optimized, I'd be happy to try something that is. I really only used pipelines because the problem with benchmarking anything with I/O is that latency, even to localhost, takes the vast majority of the time, so a benchmark has to run for several minutes to get a meaningful amount of CPU time to compare, especially since the UNIX time command only reports CPU time at 10-millisecond granularity.

Does Rust have anything that measures CPU time internally using something like getrusage for fine-grained measurements?

Markus Westerlind

> Because redis::pipe() doesn't take the connection as an argument, it looks like it's acting as a buffer and sending all the pipelined commands afterward with the query method. The convention in Crystal is instead to use I/O streams directly so we don't have to realloc a buffer every time we need to expand it. Instead, the stream has a static buffer. And then after the block is complete, I flush the buffer one more time before reading the results back off the socket.

I figured as much! redis-rs could do that as well, at least in the synchronous API. The async API can't, however, since it may receive concurrent requests and must make sure that each request is written in its entirety without interleaving.

Since I only use the async implementation, I have to accept the buffering in pipe or cmd (at least I haven't come up with a way to skip the allocations for the buffer), so changing the API for the synchronous implementation isn't on my radar.

> I just published the code on GitHub so you can have a look. The pipeline implementation is here; you can see that it just wraps a connection and overrides run (which I believe is the equivalent of cmd in the Rust client).

Thanks! Another thing that helps Crystal here is that, since commands are written immediately, the Redis server starts processing them immediately, which gives much better end-to-end timing. The async implementation is capable of the same thing by simply issuing individual commands; however, it naturally has more overhead, as each request and response is passed through a channel (which allows requests to be made concurrently from multiple threads).
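As a rough illustration of why a shared connection ends up buffering each request, here is a sketch in Rust using standard-library threads and `std::sync::mpsc` rather than async tasks. It is not how redis-rs is implemented: `encode` and `run_shared_connection` are hypothetical names, the `GET` keys are invented, and a `Vec<u8>` stands in for the socket. Every request is encoded into its own buffer first (the allocation mentioned above), sent through a channel, and written in one piece by the single thread that owns the connection, so requests from different senders never interleave.

```rust
use std::io::Write;
use std::sync::mpsc;
use std::thread;

/// Encodes one command in RESP into its own buffer so it can be
/// handed to the writer as a single, complete unit.
fn encode(args: &[&str]) -> Vec<u8> {
    let mut buf = Vec::new();
    // Writing into a Vec<u8> cannot fail, so unwrap is fine here.
    write!(buf, "*{}\r\n", args.len()).unwrap();
    for arg in args {
        write!(buf, "${}\r\n{}\r\n", arg.len(), arg).unwrap();
    }
    buf
}

/// Hypothetical shared connection: `clients` threads each send
/// `per_client` requests; one writer thread owns the "socket".
fn run_shared_connection(clients: usize, per_client: usize) -> Vec<u8> {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // The only place that touches the socket (a Vec<u8> stand-in).
    let writer = thread::spawn(move || {
        let mut socket: Vec<u8> = Vec::new();
        for request in rx {
            // One whole request per write: no interleaving possible.
            socket.write_all(&request).unwrap();
        }
        socket
    });

    let handles: Vec<_> = (0..clients)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                for n in 0..per_client {
                    let key = format!("client{}:{}", id, n);
                    tx.send(encode(&["GET", key.as_str()])).unwrap();
                }
            })
        })
        .collect();

    // Drop the original sender so the writer loop ends once all clients finish.
    drop(tx);
    for handle in handles {
        handle.join().unwrap();
    }
    writer.join().unwrap()
}
```

The order in which requests from different threads arrive is not deterministic, but each one shows up in the output as an unbroken RESP command. The per-request buffer and the channel hop are the kind of overhead a single-threaded streaming pipeline avoids.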

> If Redis pipelines in the Rust client aren't optimized, I'd be happy to try something that is. I really only used pipelines because the problem with benchmarking anything with I/O is that latency, even to localhost, takes the vast majority of the time, so a benchmark has to run for several minutes to get a meaningful amount of CPU time to compare, especially since the UNIX time command only reports CPU time at 10-millisecond granularity.

For raw throughput, the pipeline as you use it is still the best way in redis-rs; it just isn't something I have optimized for since, as you say, I/O is such a huge overhead (and more so for smaller pipelines).

> Does Rust have anything that measures CPU time internally using something like getrusage for fine-grained measurements?

Not really, though you can of course call any C library (there might be Rust bindings already, I guess). I usually don't look at CPU time; I just use github.com/bheisler/criterion.rs to get good timings for comparison, and perf + github.com/KDAB/hotspot/releases to track down where that CPU time goes.

Jamie Gaskins

> The async API can't, however, since it may receive concurrent requests and must make sure that each request is written in its entirety without interleaving.

Ah, okay. I'd been seeing a bunch of stuff on Twitter about how Rust has been favoring async I/O and I saw what looks like some Python-style aio in redis-rs. So all this makes a whole lot more sense to me now. Thank you for clarifying some of this stuff!

I'm gonna keep checking some other libraries in Rust and Go so I can get a better picture of the performance landscape among the 3 languages. This was just one chapter of that story and I really appreciate you being a part of it. 🙂

Felix Terkhorn

Looking forward to seeing the Crystal client open sourced. Please post again when you get that far!

The peer review can be a big help to the community -- thanks for taking the time to write this up!

Jamie Gaskins
Felix Terkhorn

That's awesome, thank you!

Carlos Donderis

Doesn’t comparing a heavily optimized Redis client for Crystal with an average one in Rust defeat the purpose of the benchmark?
I love Crystal, but I’m not sure how accurate this test might be.

Rafael Silveira

I agree!

Jamie Gaskins

The idea that the Rust client, with commits from 63 people and 1500 GitHub stars, has not had any optimizations applied seems a bit presumptuous.

Jamie Gaskins

If it helps ease your mind, I ran the benchmark code against the other Crystal Redis client I linked in the article, which is not optimized for heap allocations the way mine is. The only differences in the benchmark code are s/::Connection// and s/pipeline/pipelined/. The code is otherwise identical. Here is the result:

$ time bin/bench_redis
00:00:00.354746110
bin/bench_redis  0.17s user 0.03s system 53% cpu 0.372 total

Only about 8% slower overall and 42% slower at the CPU (200ms vs 140ms) for the "average" Crystal Redis client. If Rust and Crystal were actually closer in performance, this is the sort of difference I would have expected. I actually expected Rust to be within ±30%, but I was off by an entire order of magnitude.

opensas

I'm really surprised at this. I honestly thought that Rust was hard to beat on performance, and that a language like Crystal, which definitely looks like scripting, couldn't beat Rust. I hope you continue with these articles. I think Crystal is a hidden gem; it's not (yet) getting the exposure it deserves.
BTW, it's great to see you interacting with the people who created the Rust library and exchanging wisdom. Open source rocks!!!

Pierre-Adrien Buisson

Even if Crystal code looks like scripting (it's heavily inspired by Ruby), it's still a compiled language, so it makes sense that its performance is waaaaaay better than what you can expect from scripting languages :)

Tamás Szelei

Great comparison. What happens if you enable LTO in Rust?

Jamie Gaskins

I'm not sure what that is. Is that the same as the --release flag?

Aidiakapi

Add:

[profile.release]
lto = true

In Cargo.toml to enable it. It allows more optimizations between crates at the cost of longer compile time. Though it's unlikely to give a 2x improvement.

Tamás Szelei

Sorry, I just saw your answer after I had typed practically the same thing. I agree that it's unlikely to give a 2x speedup.

Tamás Szelei

LTO stands for link-time optimization, which is a great feature of LLVM (thus, rustc). You can enable it in your Cargo.toml:

[profile.release]
lto = true

The above will make --release builds use "fat" LTO, meaning all dependencies and the project itself are link-time optimized together (you could set it to "thin" instead, which still optimizes across crates but builds considerably faster, usually for a somewhat smaller gain).

Another option to go even further is PGO (profile-guided optimization), but that is a bit more involved and I haven't tried it with Rust. Here is some documentation if you are interested: doc.rust-lang.org/rustc/profile-gu...

Combining both can go pretty far in optimizing performance.

Nicolas Hervé

Maybe a long running process will show a decrease in Crystal performance due to the garbage collector.

Ary Borenszweig

The GC is already running in the benchmark here. My guess is that Crystal actually allocates less memory than Rust for some reason (maybe the Rust client isn't well optimized).

Jamie Gaskins

I wrapped the Redis pipeline code to run it 10x in the same process, so it runs 4 million commands in 10 pipelines, but it didn't make any significant changes to how long it took for either app. The results are below, but to summarize: it makes the CPU-time ratio 3.13 CPU seconds for Rust vs 1 CPU second for Crystal, tipping the scales even more in Crystal's favor.

$ time target/release/examples/redis_app
553
543
534
538
540
529
524
532
528
536
target/release/examples/redis_app  2.82s user 0.31s system 49% cpu 6.279 total
$ time bin/bench_redis
00:00:00.324824666
00:00:00.316971548
00:00:00.316958306
00:00:00.316820546
00:00:00.316534462
00:00:00.318077670
00:00:00.322508686
00:00:00.322692871
00:00:00.321157067
00:00:00.315688903
bin/bench_redis  0.94s user 0.06s system 30% cpu 3.298 total
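
For anyone who wants to check the ratio quoted above, the arithmetic can be written out as a small Rust sketch. The `Timing` type and its methods are hypothetical; the numbers are copied from the two summary lines printed by `time` in the output above. CPU time is taken as user plus system time.

```rust
/// One summary line of `time` output, in seconds.
struct Timing {
    user: f64,
    system: f64,
    wall: f64,
}

impl Timing {
    /// Total CPU time: user + system.
    fn cpu(&self) -> f64 {
        self.user + self.system
    }

    /// Fraction of wall-clock time spent on the CPU.
    fn utilization(&self) -> f64 {
        self.cpu() / self.wall
    }
}

// Values copied from the `time` output above.
const RUST: Timing = Timing { user: 2.82, system: 0.31, wall: 6.279 };
const CRYSTAL: Timing = Timing { user: 0.94, system: 0.06, wall: 3.298 };

/// How many times more CPU time `a` used than `b`.
fn cpu_ratio(a: &Timing, b: &Timing) -> f64 {
    a.cpu() / b.cpu()
}

/// How many times longer `a` took than `b` in wall-clock terms.
fn wall_ratio(a: &Timing, b: &Timing) -> f64 {
    a.wall / b.wall
}
```

With these inputs, `cpu_ratio(&RUST, &CRYSTAL)` comes out at about 3.13, while `wall_ratio(&RUST, &CRYSTAL)` is only about 1.9, which is consistent with both processes spending much of their wall-clock time waiting on I/O rather than computing.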