You often hear about how fast languages like Rust and Go are. People port all kinds of things to Rust to make them faster. It's common to hear about a company porting a Ruby microservice to Go or writing native extensions for a dynamic language in Rust for extra performance.
Crystal also compiles your apps into blazing-fast native code, so today I decided to try comparing Rust and Crystal side-by-side in talking to a Redis database.
The Benchmark
I wanted something realistic, and most benchmarks I could find were things like Mandelbrot and digits of π. They're CPU-intensive, absolutely, but they're nothing like the workload a typical web app has.
The benchmark I went with was to connect to a Redis database and run a bunch of pipelined commands. Pipelining means we're sending all of the commands before reading any of their results. Because we're not waiting for the result after sending each command, this drastically reduces the impact that latency has on the benchmark. For example, instead of this sequence:
- Send command
- Read result
- Send command
- Read result
- Send command
- Read result
What we do instead is this:
- Send command
- Send command
- Send command
- Read result
- Read result
- Read result
This way we pay the latency cost once between the last send and the first read instead of 3 times.
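To make that concrete, here is a minimal Rust sketch of what a pipeline looks like on the wire, assuming Redis's RESP framing (each command is an array of bulk strings). The function names `encode_command` and `build_pipeline` are made up for illustration and are not part of redis-rs or the Crystal client; a real client would write this buffer to a socket and only then start reading replies.

```rust
/// Encode one command as a RESP array of bulk strings,
/// e.g. ["GET", "foo"] becomes "*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n".
fn encode_command(args: &[&str], out: &mut Vec<u8>) {
    out.extend_from_slice(format!("*{}\r\n", args.len()).as_bytes());
    for arg in args {
        out.extend_from_slice(format!("${}\r\n", arg.len()).as_bytes());
        out.extend_from_slice(arg.as_bytes());
        out.extend_from_slice(b"\r\n");
    }
}

/// Build a pipeline: every command is encoded before any reply is read.
fn build_pipeline(commands: &[Vec<&str>]) -> Vec<u8> {
    let mut buf = Vec::new();
    for cmd in commands {
        encode_command(cmd, &mut buf);
    }
    buf
}
```

Calling `build_pipeline` with `SET foo bar`, `GET foo`, and `INCR counter` yields one contiguous byte buffer, so the three commands go out together rather than each waiting on the previous reply.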
For our benchmark, we're going to run a mix of common Redis operations:
- Set a key
- Get a key that exists
- Get a key that does not exist
- Increment the value for a key
We do each of these 100k times. The more work we do in this pipeline, the less effect latency has and the more effective the benchmark is. The reason we run a mix of commands isn't so much about what Redis does with them (we're not benchmarking Redis), but what Redis returns for them. The `SET` and `GET` commands in Redis return strings, which require heap allocations. `INCR` returns an integer, which is usually allocated on the stack (no `malloc`/`free` needed) and doesn't necessarily require a heap allocation (though the implementation might parse the integer from an intermediate string, which could involve an allocation).
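The difference between those reply types can be sketched in Rust as follows. This is a simplified, hypothetical parser for a single complete RESP reply, not the code from redis-rs or the Crystal client; the `Reply` enum and both function names are assumptions for illustration. String replies end up in a heap-allocated `String`, while an integer reply is parsed digit by digit into an `i64` with no intermediate string.

```rust
/// The reply shapes this benchmark sees (a subset of RESP).
#[derive(Debug, PartialEq)]
enum Reply {
    Simple(String), // "+OK\r\n"        heap-allocated String
    Bulk(String),   // "$3\r\nbar\r\n"  heap-allocated String
    Nil,            // "$-1\r\n"        missing key, nothing to allocate
    Integer(i64),   // ":42\r\n"        plain i64, no heap allocation
}

/// Parse a decimal integer straight from bytes, without building a String.
fn parse_int(bytes: &[u8]) -> Option<i64> {
    let (negative, digits) = match bytes.first() {
        Some(b'-') => (true, &bytes[1..]),
        _ => (false, bytes),
    };
    if digits.is_empty() {
        return None;
    }
    let mut value: i64 = 0;
    for &b in digits {
        if !b.is_ascii_digit() {
            return None;
        }
        value = value.checked_mul(10)?.checked_add((b - b'0') as i64)?;
    }
    Some(if negative { -value } else { value })
}

/// Parse one complete reply. The trailing CRLF of a bulk string is not checked.
fn parse_reply(input: &[u8]) -> Option<Reply> {
    // The first line holds the type byte and a header, terminated by CRLF.
    let line_end = input.windows(2).position(|w| w == b"\r\n")?;
    let (kind, header) = input[..line_end].split_first()?;
    let rest = &input[line_end + 2..];
    match *kind {
        b'+' => Some(Reply::Simple(String::from_utf8(header.to_vec()).ok()?)),
        b':' => Some(Reply::Integer(parse_int(header)?)),
        b'$' => {
            let len = parse_int(header)?;
            if len < 0 {
                return Some(Reply::Nil);
            }
            let body = rest.get(..len as usize)?;
            Some(Reply::Bulk(String::from_utf8(body.to_vec()).ok()?))
        }
        _ => None,
    }
}
```

In this sketch only the `Simple` and `Bulk` arms copy bytes into a new allocation; the `Integer` and `Nil` arms return values that fit in registers or on the stack.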
First we'll look at the code in each language, then the results.
Rust
We're using the `redis-rs` Rust crate for this app. We construct a Redis pipeline with `redis::pipe()`, fill it with data, and then send that data to the connection.
use redis::{self};
use std::time::Instant;

fn main() {
    const ITERATIONS: usize = 100_000;

    let client = redis::Client::open("redis://127.0.0.1:6379").unwrap();
    let mut con = client.get_connection().unwrap();
    let start = Instant::now();

    // Queue every command locally; nothing is sent until `query` below.
    let mut pipe = redis::pipe();
    pipe.del("foo").ignore();
    for _ in 0..ITERATIONS { pipe.set("foo", "bar").ignore(); }
    for _ in 0..ITERATIONS { pipe.get("foo").ignore(); } // key exists
    pipe.del("foo").ignore();
    for _ in 0..ITERATIONS { pipe.incr("foo", 1).ignore(); }
    pipe.del("foo").ignore();
    for _ in 0..ITERATIONS { pipe.get("foo").ignore(); } // key does not exist

    // Send the whole pipeline and read back all of the replies.
    let () = pipe.query(&mut con).unwrap();
    println!("{}", start.elapsed().as_millis());
}
Crystal
require "../src/redis"

redis = Redis::Connection.new
start = Time.monotonic
iterations = 100_000

redis.pipeline do |redis|
  redis.del "foo"
  iterations.times { redis.set "foo", "bar" }
  iterations.times { redis.get "foo" }
  redis.del "foo"
  iterations.times { redis.incr "foo" }
  redis.del "foo"
  iterations.times { redis.get "foo" }
end

pp Time.monotonic - start
Note that this isn't the more common Crystal Redis shard. This is a Redis client I wrote that is significantly tuned to reduce heap allocations and remain light while supporting as much of Redis as I needed. You can find the code on GitHub.
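As described in the comments below, the Crystal client writes commands straight to a buffered I/O stream with a fixed-size buffer rather than growing an in-memory buffer. As a rough illustration of that idea in Rust (this is not the author's Crystal code, and not how redis-rs works), the sketch below streams commands through `std::io::BufWriter`. The helper names and the 8 KiB capacity are assumptions chosen for the example.

```rust
use std::io::{self, BufWriter, Write};

/// Write one RESP-encoded command directly to any writer.
fn write_command<W: Write>(w: &mut W, args: &[&str]) -> io::Result<()> {
    write!(w, "*{}\r\n", args.len())?;
    for arg in args {
        write!(w, "${}\r\n{}\r\n", arg.len(), arg)?;
    }
    Ok(())
}

/// Stream `iterations` SET commands through a fixed-size buffer.
/// `BufWriter` hands its contents to the underlying writer whenever the
/// buffer fills, so the buffer itself never has to grow.
fn stream_pipeline<W: Write>(sink: W, iterations: usize) -> io::Result<W> {
    let mut writer = BufWriter::with_capacity(8 * 1024, sink);
    for _ in 0..iterations {
        write_command(&mut writer, &["SET", "foo", "bar"])?;
    }
    writer.flush()?; // push out whatever is still buffered
    writer.into_inner().map_err(|e| e.into_error())
}
```

In a real client the sink would be a `TcpStream`, which also means the server can start working on early commands while later ones are still being written. For experimenting without a server, a `Vec<u8>` works as the sink.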
The Results
$ cargo run --release --example redis_app
Finished release [optimized] target(s) in 0.30s
Running `target/release/examples/redis_app`
568
It took our Rust app 568 milliseconds to connect to Redis, send 400k commands, and receive all their results.
$ crystal run --release bench/bench_redis.cr
00:00:00.328368151
Our Crystal app took just 328 milliseconds to run the same commands. That means the Rust app took 73% more time to perform the exact same work as the Crystal app.
The Caveat
The hard part about benchmarking anything that connects to a server is that the server may actually be your bottleneck. With databases especially, it's easy to get stuck waiting on I/O. In our example apps, the Redis server was indeed capping out at 100% CPU but neither app was, which is why we stop at 400k commands — going beyond that wasn't actually providing any useful information.
So how can we find just the time our app spent on the CPU and ignore all the time we spent waiting on the server? Turns out the UNIX `time` command tells us exactly this. Instead of `cargo run` and `crystal run`, we'll compile our programs and run them directly through `time`:
$ cargo build --release --example redis_app
Finished release [optimized] target(s) in 0.26s
$ time target/release/examples/redis_app
563
target/release/examples/redis_app 0.28s user 0.04s system 48% cpu 0.656 total
Our Rust app used the CPU for 320ms (280ms in userland and 40ms in system calls).
$ crystal build --release bench/bench_redis.cr -o bin/bench_redis
$ time bin/bench_redis
00:00:00.327064055
bin/bench_redis 0.12s user 0.02s system 41% cpu 0.341 total
Our Crystal app used the CPU for 140ms (120ms in userland and 20ms in system calls). That means our Crystal app was 2.29x as fast on the CPU!
Also, it was interesting seeing both of these programs were waiting on Redis for over half of their runtime! As someone that has worked mostly in Ruby for 16 years, being able to saturate a Redis server with a single client is hilarious to me.
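As a quick sanity check on the numbers above, this small Rust sketch redoes the arithmetic from the two `time` outputs. The `Run` struct and its method names are illustrative only; the figures are the ones reported above.

```rust
/// Timings reported by `time`, in milliseconds.
struct Run {
    user_ms: u32,
    system_ms: u32,
    total_ms: u32,
}

impl Run {
    /// CPU time is user time plus system time.
    fn cpu_ms(&self) -> u32 {
        self.user_ms + self.system_ms
    }

    /// Fraction of wall-clock time the process was not on the CPU.
    fn waiting_fraction(&self) -> f64 {
        1.0 - self.cpu_ms() as f64 / self.total_ms as f64
    }
}

const RUST: Run = Run { user_ms: 280, system_ms: 40, total_ms: 656 };
const CRYSTAL: Run = Run { user_ms: 120, system_ms: 20, total_ms: 341 };

/// How many times more CPU time the Rust run used than the Crystal run.
fn cpu_ratio() -> f64 {
    RUST.cpu_ms() as f64 / CRYSTAL.cpu_ms() as f64
}
```

With these inputs the CPU ratio comes out at roughly 2.29, and the waiting fraction is a little over half for both programs.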
The End
The purpose of this post was not to say that Rust is slow. Rust is very fast. The idea was to see if Rust was really the performance trailblazer we all thought it was and it turns out Crystal has just as good, if not way better, performance for cases like this.
One thing that strikes me is that you never hear people talk about using Rust and Go for how nice they are to read and write the way you hear people talk about Ruby. It's always about the performance. But somehow we don't hear people talking as much about Crystal for the same reasons. I wonder if it's because it resembles Ruby that people don't take it seriously. Rust and Go have curly braces everywhere, so they're fast, right? 😄
Anyway, if you use Ruby or Python for their expressiveness and Rust or Go for their performance, it might be worth writing a part of your app in Crystal to get both.
Top comments (21)
Having done most of the optimization of the Rust Redis library, I can say that this use case of sending an enormous pipeline of commands is not something I have optimized for. Most of the time I send a single command or just a few, and it is the round-trip time that matters most, which is of course dominated by I/O. However, disregarding that, the main overhead isn't in message encoding but rather in the bookkeeping needed to send concurrent commands on the same connection, so optimizing out the remaining overhead of encoding hasn't been a priority.
Still, it would be interesting to see how you achieve these impressive results (in case there is something to steal ;) ). But the post contains neither the Crystal Redis library nor the benchmark setup itself :( .
Awesome, great to meet you! 🙂
Honestly, I only optimized the Crystal client for heap allocations — I tried to avoid them whenever feasible.
Because `redis::pipe()` doesn't take the connection as an argument, it looks like it's acting as a buffer and sending all the pipelined commands afterward with the `query` method. The convention in Crystal is instead to use I/O streams directly so we don't have to realloc a buffer every time we need to expand it. Instead, the stream has a static buffer. And then after the block is complete, I flush the buffer one more time before reading the results back off the socket.
I just published the code on GitHub so you can have a look. The pipeline implementation is here; you can see that it just wraps a connection and overrides `run` (which I believe is the equivalent of `cmd` in the Rust client).
The benchmark setup is in the code within the article. I tried using Cargo's `bench` command but it told me it was going to take 6800 seconds to complete all of its iterations, so I was like "uhh, nope" and just measured the time it took to run once instead. 😂 I also (elsewhere in the comments) looped over it to get more than a single sample on a warmed-up connection. It reduced the impact of latency even more on both clients since only the first run had to deal with TCP handshake and slow start.
If Redis pipelines in the Rust client aren't optimized, I'd be happy to try something that is. I really only used it because the trouble with benchmarking anything with I/O is that latency, even to `localhost`, takes the vast majority of the time, so a benchmark has to run for several minutes to get a meaningful amount of CPU time to compare, especially since the UNIX `time` command only has 10-millisecond granularity at the CPU.
Does Rust have anything that measures CPU time internally using something like getrusage for fine-grained measurements?
I figured as much! `redis-rs` could do that as well, at least in the synchronous API. The async API can't, however, since it may receive concurrent requests and it must make sure that each request is written in its entirety without interleaving.
Since I only use the async implementation I have to accept the buffering in `pipe` or `cmd` (at least I haven't come up with a way to skip the allocations for the buffer), so changing the API for the synchronous implementation isn't on my radar.
Thanks! Another thing that helps Crystal here is that since commands are written immediately, the Redis server will start processing the commands immediately, which gives a much better end-to-end timing. The async implementation is capable of the same thing by simply issuing individual commands; however, it naturally has more overhead as each request and response is passed through a channel (which allows requests to be done concurrently from multiple threads).
For raw throughput, the pipeline as you use it is still the best way in redis-rs. It just isn't something I have optimized for since, as you say, I/O is such a huge overhead (and more so for smaller pipelines).
Not really, though you can of course call any C library (there might be Rust bindings already, I guess). I usually don't look at CPU time; I just use github.com/bheisler/criterion.rs to get good timings for comparison and `perf` + github.com/KDAB/hotspot/releases to track down where that CPU time goes.
Ah, okay. I'd been seeing a bunch of stuff on Twitter about how Rust has been favoring async I/O and I saw what looks like some Python-style `aio` in `redis-rs`. So all this makes a whole lot more sense to me now. Thank you for clarifying some of this stuff!
I'm gonna keep checking some other libraries in Rust and Go so I can get a better picture of the performance landscape among the 3 languages. This was just one chapter of that story and I really appreciate you being a part of it. 🙂
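On the CPU-time question raised in this thread: the Rust standard library has no direct `getrusage` wrapper, but on Linux a process can read its own CPU usage from `/proc/self/stat` using only `std`. The sketch below is a Linux-only illustration with a hypothetical helper name, `cpu_ticks`. It reports clock ticks (commonly 100 per second, which is an assumption about the system), so it is no finer-grained than `time`, but it can bracket just the pipeline instead of the whole process. For finer resolution one would call `getrusage` or `clock_gettime` through C bindings.

```rust
use std::fs;

/// CPU time used by this process so far, as (user, system) clock ticks.
/// Linux-only: reads /proc/self/stat and returns None if that is
/// unavailable or cannot be parsed.
fn cpu_ticks() -> Option<(u64, u64)> {
    let stat = fs::read_to_string("/proc/self/stat").ok()?;
    // The command name (field 2) is wrapped in parentheses and may contain
    // spaces, so start counting fields after the last ')'.
    let after_comm = &stat[stat.rfind(')')? + 1..];
    let mut fields = after_comm.split_whitespace();
    // `after_comm` begins at field 3 (state); utime is field 14, stime is 15.
    let utime: u64 = fields.nth(11)?.parse().ok()?;
    let stime: u64 = fields.next()?.parse().ok()?;
    Some((utime, stime))
}
```

Calling `cpu_ticks()` before and after the pipeline and subtracting gives the user and system ticks spent on that section alone.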
Looking forward to seeing the Crystal client open sourced. Please post again when you get that far!
The peer review can be a big help to the community -- thanks for taking the time to write this up!
Just published! 🙂
That's awesome, thank you!
Doesn’t comparing a heavily optimized Redis client for Crystal with an average one in Rust defeat the purpose of the benchmark?
I love Crystal, but I’m not sure how accurate this test might be
I agree!
The idea that the Rust client, with commits from 63 people and 1500 GitHub stars, has not had any optimizations applied seems a bit presumptuous.
If it helps ease your mind, I ran the benchmark code against the other Crystal Redis client I linked in the article, which is not optimized for heap allocations the way mine is. The only differences in the benchmark code are `s/::Connection//` and `s/pipeline/pipelined/`. The code is otherwise identical. Here is the result:
Only about 8% slower overall and 42% slower at the CPU (200ms vs 140ms) for the "average" Crystal Redis client. If Rust and Crystal were actually closer in performance, this is the sort of difference in performance I expected. I actually expected Rust to be within ±30%, but I was off by an entire order of magnitude.
I'm really surprised at this. I honestly thought that Rust was hard to beat on performance, and that a language like Crystal, which definitely looks like a scripting language, couldn't beat Rust. I hope you continue with these articles. I think Crystal is a hidden gem; it's not (yet) getting the exposure it deserves.
BTW, it's great to see you interacting with the people that created the Rust library and exchanging wisdom. Open source rocks!!!
Even if Crystal code looks like scripting (it's heavily inspired by Ruby), it's still a compiled language, so it makes sense that its performance is waaaaaay better than what you can expect from scripting languages :)
Maybe a long running process will show a decrease in Crystal performance due to the garbage collector.
I wrapped the Redis pipeline code to run it 10x in the same process, so it runs 4 million commands in 10 pipelines, but it didn't make any significant changes to how long it took for either app. The results are below, but to summarize: it makes the CPU-time ratio 3.13 CPU seconds for Rust vs 1 CPU second for Crystal, tipping the scales even more in Crystal's favor.
The GC is already running in the benchmark here. My guess is that Crystal actually allocates less memory than Rust for some reason (maybe the Rust client isn't well optimized).
Great comparison. What happens if you enable LTO in rust?
I'm not sure what that is. Is that the same as the `--release` flag?
You enable it with a setting in Cargo.toml. It allows more optimizations between crates at the cost of longer compile time, though it's unlikely to give a 2x improvement.
Sorry, I just saw your answer after typing practically the same thing. I agree that it's unlikely to give a 2x speedup.
LTO stands for link-time optimization, which is a great feature of LLVM (thus, rustc). You can enable it in your Cargo.toml:
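A minimal sketch of such a setting, assuming Cargo's standard release profile table (the exact snippet from the original comment is not reproduced here):

```toml
# Cargo.toml
[profile.release]
lto = "fat"
```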
The above will make `--release` builds use "fat" LTO, meaning all dependencies and the project itself are link-time optimized (you could set it to "thin", a faster variant that usually gets most of the benefit of "fat" at a lower compile-time cost).
Another option to go even further is PGO, but that is a bit more involved and I haven't tried it with Rust. Here is some documentation if you are interested: doc.rust-lang.org/rustc/profile-gu...
Combining both can go pretty far in optimizing performance.