Jamie Gaskins

Posted on Jun 26, 2020 • Edited on Jun 27, 2020

Performance Comparison, Rust vs Crystal with Redis

#performance #rust #crystal

You often hear about how fast languages like Rust and Go are. People port all kinds of things to Rust to make them faster. It's common to hear about a company porting a Ruby microservice to Go or writing native extensions for a dynamic language in Rust for extra performance.

Crystal also compiles your apps into blazing-fast native code, so today I decided to try comparing Rust and Crystal side-by-side in talking to a Redis database.

The Benchmark

I wanted something realistic, and most benchmarks I could find were things like Mandelbrot and digits of π. They're CPU-intensive, absolutely, but they're nothing like the workload a typical web app has.

The benchmark I went with was to connect to a Redis database and run a bunch of pipelined commands. Pipelining means we send all of the commands before reading any of their results. Because we're not waiting for the result after sending each command, this drastically reduces the impact that latency has on the benchmark. For example, instead of this sequence:

Send command
Read result
Send command
Read result
Send command
Read result

What we do instead is this:

Send command
Send command
Send command
Read result
Read result
Read result

This way we pay the latency cost once between the last send and the first read instead of 3 times.
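To put rough numbers on that, here is a minimal sketch of the latency cost in Rust. It assumes a simplified model where every round trip costs the same fixed amount; `latency_cost` and the 100µs round-trip time are made up for illustration, not measured.

```rust
use std::time::Duration;

/// Time spent waiting on round trips for `commands` commands.
/// Hypothetical model: every round trip costs the same fixed `rtt`.
fn latency_cost(commands: u32, rtt: Duration, pipelined: bool) -> Duration {
    // Without pipelining we wait for a reply after every command.
    // With pipelining we wait once, between the last send and the first read.
    let round_trips = if pipelined { commands.min(1) } else { commands };
    rtt * round_trips
}

fn main() {
    let rtt = Duration::from_micros(100); // assumed localhost round trip
    println!("sequential: {:?}", latency_cost(3, rtt, false));
    println!("pipelined:  {:?}", latency_cost(3, rtt, true));
}
```

With 400k commands instead of 3, the sequential cost grows linearly while the pipelined cost stays at a single round trip, which is why the pipeline keeps latency from dominating the measurement.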

For our benchmark, we're going to run a mix of common Redis operations:

Set a key
Get a key that exists
Get a key that does not exist
Increment the value for a key

We do each of these 100k times. The more work we do in this pipeline, the less effect latency has and the more effective the benchmark is. The reason we run a mix of commands has less to do with what Redis does with them (we're not benchmarking Redis) than with what Redis returns for them. The SET and GET commands return strings, which require heap allocations, and a GET for a key that does not exist returns nil. INCR returns an integer, which can live on the stack (no malloc / free needed), though the implementation might parse the integer from an intermediate string, which could involve an allocation.
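To make the difference between those replies concrete, here is a rough sketch of how a client might represent and parse them. The `Reply` enum and `parse_reply` function are hypothetical names for illustration, not the API of either client in this post. The wire formats are the standard Redis protocol (RESP2) ones.

```rust
/// The reply shapes this benchmark sees (a subset of RESP2).
#[derive(Debug, PartialEq)]
enum Reply {
    Simple(String),        // "+OK\r\n" from SET: owns heap memory
    Bulk(Option<Vec<u8>>), // "$3\r\nbar\r\n" from GET, or "$-1\r\n" for a missing key
    Integer(i64),          // ":1\r\n" from INCR: a plain machine integer
}

/// Parse one complete reply from `buf`. Returns None on anything unexpected.
fn parse_reply(buf: &[u8]) -> Option<Reply> {
    // The first line is a one-byte type marker followed by a header.
    let line_end = buf.windows(2).position(|w| w == b"\r\n")?;
    let (kind, header) = buf[..line_end].split_first()?;
    let header = std::str::from_utf8(header).ok()?;
    match *kind {
        b'+' => Some(Reply::Simple(header.to_string())),
        b':' => header.parse::<i64>().ok().map(Reply::Integer),
        b'$' => {
            let len = header.parse::<i64>().ok()?;
            if len < 0 {
                return Some(Reply::Bulk(None)); // nil: nothing to copy
            }
            let start = line_end + 2;
            let body = buf.get(start..start + len as usize)?;
            Some(Reply::Bulk(Some(body.to_vec()))) // copying the body allocates
        }
        _ => None,
    }
}

fn main() {
    println!("{:?}", parse_reply(b"+OK\r\n"));
    println!("{:?}", parse_reply(b"$3\r\nbar\r\n"));
    println!("{:?}", parse_reply(b"$-1\r\n"));
    println!("{:?}", parse_reply(b":1\r\n"));
}
```

Only the string variants need to own heap memory. The integer and nil cases can be returned without allocating, although this sketch still validates the header as UTF-8 first.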

First we'll look at the code in each language, then the results.

Rust

We're using the redis-rs Rust crate for this app. We construct a Redis pipeline with redis::pipe(), fill it with data, and then send that data to the connection.

use std::time::Instant;

fn main() {
    const ITERATIONS: usize = 100_000;
    let client = redis::Client::open("redis://127.0.0.1:6379").unwrap();
    let mut con = client.get_connection().unwrap();

    // Start the clock only after the connection is open.
    let start = Instant::now();
    let mut pipe = redis::pipe();

    // SET a key, then GET a key that exists. ignore() discards each reply.
    pipe.del("foo").ignore();
    for _ in 0..ITERATIONS { pipe.set("foo", "bar").ignore(); }
    for _ in 0..ITERATIONS { pipe.get("foo").ignore(); }

    // INCR a key, starting from a clean slate.
    pipe.del("foo").ignore();
    for _ in 0..ITERATIONS { pipe.incr("foo", 1).ignore(); }

    // GET a key that does not exist.
    pipe.del("foo").ignore();
    for _ in 0..ITERATIONS { pipe.get("foo").ignore(); }

    // Send the whole pipeline and read back every reply.
    let () = pipe.query(&mut con).unwrap();

    println!("{}", start.elapsed().as_millis());
}
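For a sense of what "filling a pipeline with data" can mean on the wire, here is a rough sketch of encoding commands into one growable buffer. This is an illustration of the general idea under my own assumptions, not redis-rs's actual implementation, and `encode_command` is a hypothetical helper. The byte format itself is the standard Redis protocol.

```rust
use std::io::Write;

/// Append one command to `buf` in RESP wire format:
/// an array header, then each argument as a length-prefixed bulk string.
fn encode_command(buf: &mut Vec<u8>, args: &[&str]) {
    // Writing into a Vec<u8> cannot fail, so unwrap() is safe here.
    write!(buf, "*{}\r\n", args.len()).unwrap();
    for arg in args {
        write!(buf, "${}\r\n", arg.len()).unwrap();
        buf.extend_from_slice(arg.as_bytes());
        buf.extend_from_slice(b"\r\n");
    }
}

fn main() {
    let mut buf = Vec::new();
    encode_command(&mut buf, &["SET", "foo", "bar"]);
    encode_command(&mut buf, &["GET", "foo"]);
    // One contiguous buffer now holds both commands, ready for a single write.
    print!("{}", String::from_utf8_lossy(&buf));
}
```

A buffer like this has to grow as commands are added, so with hundreds of thousands of commands it gets reallocated a number of times before anything reaches the socket.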

Crystal

require "../src/redis"

redis = Redis::Connection.new

start = Time.monotonic

iterations = 100_000
redis.pipeline do |redis|
  redis.del "foo"
  iterations.times { redis.set "foo", "bar" }
  iterations.times { redis.get "foo" }

  redis.del "foo"
  iterations.times { redis.incr "foo" }

  redis.del "foo"
  iterations.times { redis.get "foo" }
end

pp Time.monotonic - start

Note that this isn't the more common Crystal Redis shard. This is a Redis client I wrote that is heavily tuned to reduce heap allocations and stay lightweight while supporting as much of Redis as I needed. You can find the code on GitHub.

The Results

$ cargo run --release --example redis_app
    Finished release [optimized] target(s) in 0.30s
     Running `target/release/examples/redis_app`
568

It took our Rust app 568 milliseconds to build the pipeline, send 400k commands, and receive all their results. (The timer starts after the connection is opened, so connecting isn't included.)

$ crystal run --release bench/bench_redis.cr
00:00:00.328368151

Our Crystal app took just 328 milliseconds to run the same commands. That means the Rust app took 73% more time to perform the exact same work as the Crystal app.

The Caveat

The hard part about benchmarking anything that connects to a server is that the server may actually be your bottleneck. With databases especially, it's easy to get stuck waiting on I/O. In our example apps, the Redis server was indeed capping out at 100% CPU but neither app was, which is why we stop at 400k commands — going beyond that wasn't actually providing any useful information.

So how can we find just the time our app spent in the CPU and ignore all the time we spent waiting on the server? Turns out the UNIX time command tells us exactly this. Instead of cargo run and crystal run, we'll compile our programs and run them directly through time:

$ cargo build --release --example redis_app
    Finished release [optimized] target(s) in 0.26s
$ time target/release/examples/redis_app
563
target/release/examples/redis_app  0.28s user 0.04s system 48% cpu 0.656 total

Our Rust app used the CPU for 320ms (280ms in userland and 40ms in system calls).

$ crystal build --release bench/bench_redis.cr -o bin/bench_redis
$ time bin/bench_redis
00:00:00.327064055
bin/bench_redis  0.12s user 0.02s system 41% cpu 0.341 total

Our Crystal app used the CPU for 140ms (120ms in userland and 20ms in system calls). That means our Crystal app was 2.29x as fast on the CPU!

It was also interesting to see that both of these programs spent over half of their runtime waiting on Redis! As someone who has worked mostly in Ruby for 16 years, being able to saturate a Redis server with a single client is hilarious to me.
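The arithmetic behind those numbers is simple enough to check in a few lines. This is just a sketch that plugs in the figures from the time output above; the `Timing` struct is a name I made up for it.

```rust
/// One line of `time` output, in seconds.
struct Timing {
    user: f64,   // time spent in userland
    system: f64, // time spent in system calls
    wall: f64,   // total elapsed time
}

impl Timing {
    /// Total time on the CPU.
    fn cpu(&self) -> f64 {
        self.user + self.system
    }

    /// Fraction of the elapsed time the process was not on the CPU.
    fn waiting(&self) -> f64 {
        1.0 - self.cpu() / self.wall
    }
}

fn main() {
    let rust = Timing { user: 0.28, system: 0.04, wall: 0.656 };
    let crystal = Timing { user: 0.12, system: 0.02, wall: 0.341 };

    println!("CPU ratio: {:.2}x", rust.cpu() / crystal.cpu()); // 2.29x
    println!("Rust waiting: {:.0}%", rust.waiting() * 100.0); // 51%
    println!("Crystal waiting: {:.0}%", crystal.waiting() * 100.0); // 59%
}
```

Keep in mind that time only reports CPU time in 10ms steps here, so these ratios are approximate.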

The End

The purpose of this post was not to say that Rust is slow. Rust is very fast. The idea was to see if Rust was really the performance trailblazer we all thought it was, and it turns out Crystal performs just as well, if not way better, in cases like this.

One thing that strikes me is that you never hear people talk about using Rust and Go for how nice they are to read and write the way you hear people talk about Ruby. It's always about the performance. But somehow we don't hear people talking as much about Crystal for the same reasons. I wonder if it's because it resembles Ruby that people don't take it seriously. Rust and Go have curly braces everywhere, so they're fast, right? 😄

Anyway, if you use Ruby or Python for their expressiveness and Rust or Go for their performance, it might be worth writing a part of your app in Crystal to get both.

Top comments (21)

Markus Westerlind • Jun 27 '20

Having done most of the optimization of the rust redis library I can say that this use case of sending an enormous pipeline of commands is not something I have optimized for. Most of the time I send a single command or just a few, and it is the round-trip time that matters most, which is of course dominated by I/O. Even disregarding that, the main overhead isn't in message encoding but rather in the bookkeeping needed to send concurrent commands on the same connection, so optimizing out the remaining encoding overhead hasn't been a priority.

Still, it would be interesting to see how you achieve these impressive results (in case there is something to steal ;) ). But the post contains neither the Crystal Redis library nor the benchmark setup itself :( .

Jamie Gaskins • Jun 27 '20

Having done most of the optimization of the rust redis library I can say that this use case of sending an enormous pipeline of commands is not something I have optimized for.

Awesome, great to meet you! 🙂

Honestly, I only optimized the Crystal client for heap allocations — I tried to avoid them whenever feasible.

Still, it would be interesting to see how you achieve these impressive results (in case there is something to steal ;) ). But the post contains neither the Crystal Redis library nor the benchmark setup itself :( .

Because redis::pipe() doesn't take the connection as an argument, it looks like it's acting as a buffer and sending all the pipelined commands afterward with the query method. The convention in Crystal is instead to use I/O streams directly so we don't have to realloc a buffer every time we need to expand it. Instead, the stream has a static buffer. And then after the block is complete, I flush the buffer one more time before reading the results back off the socket.
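As a rough illustration of that streaming approach, here is a sketch in Rust rather than Crystal. It is not the code of either client: `stream_sets` is a hypothetical function, the 8192-byte capacity is an arbitrary choice, and a Vec stands in for the socket.

```rust
use std::io::{BufWriter, Write};

/// Stream `n` SET commands through a fixed-capacity buffer into `sink`.
/// `sink` stands in for the socket; anything that implements Write works.
fn stream_sets<W: Write>(sink: &mut W, n: usize) -> std::io::Result<()> {
    // The buffer is allocated once. When it fills up, BufWriter writes it
    // to the sink and reuses the same memory instead of growing.
    let mut out = BufWriter::with_capacity(8192, sink);
    for _ in 0..n {
        out.write_all(b"*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n")?;
    }
    // Push out whatever is still buffered before reading any replies.
    out.flush()
}

fn main() {
    let mut sink = Vec::new(); // stand-in for a TCP socket
    stream_sets(&mut sink, 1_000).expect("writing to a Vec cannot fail");
    println!("{} bytes written", sink.len());
}
```

With a real socket as the sink, the early commands are already on their way to the server while the later ones are still being encoded.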

I just published the code on GitHub so you can have a look. The pipeline implementation is here — you can see that it just wraps a connection and overrides run (which I believe is the equivalent of cmd in the Rust client).

The benchmark setup is in the code within the article. I tried using Cargo's bench command but it told me it was going to take 6800 seconds to complete all of its iterations, so I was like "uhh, nope" and just measured the time it took to run once instead. 😂 I also (elsewhere in the comments) looped over it to get more than a single sample on a warmed-up connection. It reduced the impact of latency even more on both clients since only the first run had to deal with TCP handshake and slow start.

If Redis pipelines in the Rust client aren't optimized, I'd be happy to try something that is. I really only used them because the problem with benchmarking anything with I/O is that latency, even to localhost, takes the vast majority of the time, so a benchmark has to run for several minutes to get a meaningful amount of CPU time to compare, especially since the UNIX time command only has 10-millisecond granularity at the CPU.

Does Rust have anything that measures CPU time internally using something like getrusage for fine-grained measurements?

Markus Westerlind • Jun 27 '20

Because redis::pipe() doesn't take the connection as an argument, it looks like it's acting as a buffer and sending all the pipelined commands afterward with the query method. The convention in Crystal is instead to use I/O streams directly so we don't have to realloc a buffer every time we need to expand it. Instead, the stream has a static buffer. And then after the block is complete, I flush the buffer one more time before reading the results back off the socket.

I figured as much! redis-rs could do that as well, at least in the synchronous API. The async API can't however since it may receive concurrent requests and it must make sure that each request is written in its entirety without interleaving.

Since I only use the async implementation I have to accept the buffering in pipe or cmd (at least I haven't come up with a way to skip the allocations for the buffer) so changing the API for the synchronous implementation isn't on my radar.

I just published the code on GitHub so you can have a look. The pipeline implementation is here — you can see that it just wraps a connection and overrides run (which I believe is the equivalent of cmd in the Rust client).

Thanks! Another thing that helps Crystal here is that, since commands are written immediately, the Redis server starts processing them immediately, which gives a much better end-to-end timing. The async implementation is capable of the same thing by simply issuing individual commands; however, it naturally has more overhead, as each request and response is passed through a channel (which allows requests to be made concurrently from multiple threads).

If Redis pipelines in the Rust client aren't optimized, I'd be happy to try something that is. I really only used them because the problem with benchmarking anything with I/O is that latency, even to localhost, takes the vast majority of the time, so a benchmark has to run for several minutes to get a meaningful amount of CPU time to compare, especially since the UNIX time command only has 10-millisecond granularity at the CPU.

For raw throughput, the pipeline as used here is still the best way in redis-rs. It just isn't something I have optimized for since, as you say, I/O is such a huge overhead (and more so for smaller pipelines).

Does Rust have anything that measures CPU time internally using something like getrusage for fine-grained measurements?

Not really, though you can of course call any C library (there might be Rust bindings already, I guess). I usually don't look at CPU time. I just use github.com/bheisler/criterion.rs to get good timings for comparison, and perf + github.com/KDAB/hotspot/releases to track down where that CPU time goes.

Jamie Gaskins • Jun 28 '20

The async API can't however since it may receive concurrent requests and it must make sure that each request is written in its entirety without interleaving.

Ah, okay. I'd been seeing a bunch of stuff on Twitter about how Rust has been favoring async I/O and I saw what looks like some Python-style aio in redis-rs. So all this makes a whole lot more sense to me now. Thank you for clarifying some of this stuff!

I'm gonna keep checking some other libraries in Rust and Go so I can get a better picture of the performance landscape among the 3 languages. This was just one chapter of that story and I really appreciate you being a part of it. 🙂

Felix Terkhorn • Jun 27 '20

Looking forward to seeing the Crystal client open sourced. Please post again when you get that far!

The peer review can be a big help to the community -- thanks for taking the time to write this up!

Jamie Gaskins • Jun 27 '20

Just published! 🙂

Felix Terkhorn • Jun 27 '20

That's awesome, thank you!

Carlos Donderis • Jun 27 '20

Doesn’t comparing a heavily optimized Redis client for Crystal with an average one in Rust defeat the purpose of the benchmark?
I love Crystal, but I’m not sure how accurate this test might be.

Rafael Silveira • Jun 27 '20

I agree!

Jamie Gaskins • Jun 27 '20

The idea that the Rust client, with commits from 63 people and 1500 GitHub stars, has not had any optimizations applied seems a bit presumptuous.

Jamie Gaskins • Jun 27 '20

If it helps ease your mind, I ran the benchmark code against the other Crystal Redis client I linked in the article, which is not optimized for heap allocations the way mine is. The only differences in the benchmark code are s/::Connection// and s/pipeline/pipelined/. The code is otherwise identical. Here is the result:

$ time bin/bench_redis
00:00:00.354746110
bin/bench_redis  0.17s user 0.03s system 53% cpu 0.372 total

Only about 8% slower overall and 43% slower at the CPU (200ms vs 140ms) for the "average" Crystal Redis client. If Rust and Crystal were actually closer in performance, this is the sort of difference I would have expected. I actually expected Rust to be within ±30%, but I was off by an entire order of magnitude.

opensas • Jul 1 '20

I'm really surprised at this. I honestly thought that Rust was hard to beat on performance, and that code like Crystal's, which definitely looks like scripting, couldn't beat Rust. I hope you continue with these articles. I think Crystal is a hidden gem; it's not (yet) getting the exposure it deserves.
BTW, it's great to see you interacting with the people who created the Rust library and exchanging wisdom. Open source rocks!!!

Pierre-Adrien Buisson • Mar 21 '22

Even if Crystal code looks like scripting (it's heavily inspired by Ruby), it's still a compiled language, so it makes sense that its performance is waaaaaay better than what you can expect from scripting languages :)

Tamás Szelei • Jun 27 '20

Great comparison. What happens if you enable LTO in rust?

Jamie Gaskins • Jun 27 '20

I'm not sure what that is. Is that the same as the --release flag?

Tamás Szelei • Jun 28 '20

LTO stands for link-time optimization, which is a great feature of LLVM (thus, rustc). You can enable it in your Cargo.toml:

[profile.release]
lto = true

The above will make --release builds use "fat" LTO, meaning all dependencies and the project itself are link-time optimized together (you could set it to "thin" instead, which is a faster variant of LTO that usually achieves similar gains).

Another option to go even further is PGO, but that is a bit more involved and I haven't tried it with rust. Here is some documentation if you are interested: doc.rust-lang.org/rustc/profile-gu...

Combining both can go pretty far in optimizing performance.

Aidiakapi • Jun 28 '20

Add:

[profile.release]
lto = true

In Cargo.toml to enable it. It allows more optimizations between crates at the cost of longer compile time. Though it's unlikely to give a 2x improvement.

Tamás Szelei • Jun 28 '20

Sorry, I just saw your answer after I had typed practically the same thing. I agree that it's unlikely to give a 2x speedup.

Nicolas Hervé • Jun 26 '20

Maybe a long running process will show a decrease in Crystal performance due to the garbage collector.

Jamie Gaskins • Jun 26 '20

I wrapped the Redis pipeline code to run it 10x in the same process, so it runs 4 million commands in 10 pipelines, but it didn't make any significant changes to how long it took for either app. The results are below, but to summarize: it makes the CPU-time ratio 3.13 CPU seconds for Rust vs 1 CPU second for Crystal, tipping the scales even more in Crystal's favor.

$ time target/release/examples/redis_app
553
543
534
538
540
529
524
532
528
536
target/release/examples/redis_app  2.82s user 0.31s system 49% cpu 6.279 total

$ time bin/bench_redis
00:00:00.324824666
00:00:00.316971548
00:00:00.316958306
00:00:00.316820546
00:00:00.316534462
00:00:00.318077670
00:00:00.322508686
00:00:00.322692871
00:00:00.321157067
00:00:00.315688903
bin/bench_redis  0.94s user 0.06s system 30% cpu 3.298 total

Ary Borenszweig • Jun 26 '20

The GC is already running in the benchmark here. My guess is that Crystal actually allocates less memory than Rust for some reason (maybe the Rust client isn't well optimized).
