DEV Community

rhymes

An example of why performance matters (with Python and Rust)

Long intro

My first real post on dev.to, in September 2017, was the following:

I was trying to extract information from around 60 GB of CSV files corresponding to 139 million events. I started with Python to see how it behaved. The experiment was sparked by my frustration with Redshift and by my wish to play with TrailDB, a library for querying series of events. My tests were unscientific but, after switching to Go (by copying and pasting code, because I didn't really know the language back then), I was able to set up the DB 2.6 times faster than in Python and to query the data 2.54 times faster.

The topic of speed and performance is probably dear to everyone on this website, even if speed and performance are relative to context (see the concept of "fast enough"). You can see this topic permeate conversations on dev.to about the slowness of the web, the memory footprint of browsers and desktop apps, and other related topics. A couple of nice examples with long and interesting discussions attached:

by @quii

by @tux0r

Why you're here reading this

Nobody would argue against cost-effective speed improvements, and this brings me to the gist of this post. An article titled "Parsing logs 230x faster with Rust" by André Arko (lead developer of Ruby's Bundler) caught my attention.

I've been aware of Rust's speed since... well, that and its advantages around memory management are what everyone talks about when they talk about Rust :-D

I've since switched to two Rust-based tools that I use every day on the command line: bat instead of cat and especially ripgrep instead of grep and ack. The speed improvement is noticeable with the naked eye (thanks @dmfay for the tip)!

Back to the article. Arko wanted to query Bundler's treasure trove of 500 GB of logs per day to extract useful information about the community. Each log file contains millions of events in JSON (BTW: use structured logs if you can, JSON or key-value, you'll thank me later). Currently those files are sitting compressed in an S3 bucket for a few dollars per month.
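To make the aside about structured logs concrete, here is a minimal Python sketch of why JSON lines are easy to mine: each line parses on its own, so aggregating a field takes only a few lines. The field names (`user_agent`, `status`) and the sample events are made up for illustration; they are not Bundler's actual log schema.

```python
import json
from collections import Counter


def count_field(lines, field):
    """Count the values of `field` across JSON-lines log entries."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(event, dict) and field in event:
            counts[event[field]] += 1
    return counts


# Hypothetical sample events; the field names are illustrative only.
sample = [
    '{"user_agent": "bundler/1.16.2", "status": 200}',
    '{"user_agent": "bundler/1.16.2", "status": 304}',
    '{"user_agent": "bundler/1.15.0", "status": 200}',
    "not json at all",
]
print(count_field(sample, "user_agent"))
```

With unstructured text logs the same question would need a hand-written regular expression per log format, which is the pain the aside is warning about.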

Hosted logging solutions were too expensive so he tried to see if he could cook something up.

The first attempt was in Ruby and it took an insane 16 hours for a day's worth of data. Nope.

The second attempt was in Python using AWS Glue and the full power of Amazon's servers. He went down to 3 hours, with an average of 36 minutes per log file (out of 500), using 100 parallel workers for 1000 dollars per month. Nope.

The third attempt was in Rust. He initially went down to 3 minutes per file, then to 60 seconds per file. After fiddling with it more and receiving feedback from readers, he managed to parse a single file in 8 seconds (!!).

The fourth attempt was in Rust again and he used parallelization. It was 3.3x faster than the sequential attempt. That's how he got to the 230x multiplication factor in the title.
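This summary doesn't go into how the work was split up. One common approach, sketched below in Python rather than Rust, is to treat each log file as an independent task and fan the files out over a pool of workers. Everything here is an assumption made for illustration: the function names, the gzip-compressed JSON-lines layout, and the tiny sample files are hypothetical, and this is not Arko's implementation.

```python
import gzip
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_lines(path):
    """Count the non-empty lines in one gzip-compressed log file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def count_lines_parallel(paths, workers=4):
    """Fan the files out over a pool of workers and sum the per-file counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_lines, paths))


def write_sample_logs(directory, files=3, events_per_file=5):
    """Create small hypothetical compressed logs to exercise the sketch."""
    paths = []
    for index in range(files):
        path = os.path.join(directory, f"log-{index}.json.gz")
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            for event in range(events_per_file):
                handle.write('{"event": %d}\n' % event)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as directory:
    total = count_lines_parallel(write_sample_logs(directory))
    print(total)
```

Because the files don't depend on each other, no coordination is needed beyond summing the results. In CPython, a `ProcessPoolExecutor` is usually the better fit for CPU-bound parsing because of the global interpreter lock; a thread pool keeps this sketch simple.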

A few notes on the comparisons

If you read closely you'll notice the following:

  • the first attempt (in Ruby) probably shouldn't be mentioned in the post because it collects less data than the others (and we don't know how much less)
  • the early Rust version, at 60 seconds per file, amounts to 8.33 hours if run sequentially over the 500 files, more than 30 times faster than the experiment with Python and Glue at 36 minutes per file
  • the last "sequential" experiment in Rust amounts to a little more than 1 hour for the entire set of 500 GB which is a huge speedup
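The arithmetic behind these notes can be checked with a few lines of Python. This is a back-of-the-envelope sketch that uses only the rounded figures quoted in this post (500 files, 36 minutes, 60 seconds and 8 seconds per file), so it won't exactly reproduce the ratios measured in the original article. The constant and function names are mine.

```python
FILES = 500  # log files per day, as quoted in this post

# Per-file processing times in seconds, as quoted in this post.
PYTHON_GLUE = 36 * 60  # 36 minutes
RUST_EARLY = 60        # 60 seconds
RUST_TUNED = 8         # 8 seconds


def sequential_hours(seconds_per_file, files=FILES):
    """Total hours needed to process every file one after another."""
    return seconds_per_file * files / 3600


print(f"Python on Glue: {sequential_hours(PYTHON_GLUE):.2f} h")  # 300.00 h
print(f"Early Rust:     {sequential_hours(RUST_EARLY):.2f} h")   # 8.33 h
print(f"Tuned Rust:     {sequential_hours(RUST_TUNED):.2f} h")   # 1.11 h

# Per-file speedups relative to Python on Glue.
print(f"Early Rust: {PYTHON_GLUE / RUST_EARLY:.0f}x")  # 36x
print(f"Tuned Rust: {PYTHON_GLUE / RUST_TUNED:.0f}x")  # 270x
```

Note that the comparison is sequential against sequential: the 3 hours quoted for Glue already include 100 parallel workers, while 300 hours is what the same per-file time would cost on a single worker.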

Deploy time

The last thing André Arko talks about is how he managed to deploy the Rust script so that it can work on the production logs stored on AWS. This part made me laugh:

I discovered rust-aws-lambda, a crate that lets your Rust program run on AWS Lambda by pretending to be a Go binary

Another wonder of distributing an app as a binary :D

On AWS Lambda the Rust version ran 78 times faster than the initial Python attempt, not bad!

He did some calculations and it was safely in the free tier for AWS Lambda.

So he went from 1000$ a month to 0 a month, by rewriting a script with Rust.

I checked the repository of the script and people are already suggesting ways to make it even faster 😂

Stuff to think about if you made it this far

  • Performance can save you a lot of money
  • Knowing (or being willing to learn) more than one language is a good idea
  • Rust is definitely worth looking at for this kind of parsing
  • Sometimes better is better than good enough

Top comments (15)

Massimo Artizzu

So he went from 1000$ a month to 0 a month, by rewriting a script with Rust.

This phrase was used when this article was tweeted. And I thought: "Wait, does it mean he was fired?!" 🤣

Guney Ozsan

I just logged in to like this XD

rhymes

hahaha like the whole concept of automating yourself out of your job :D

Alain Van Hout

Excellent write-up :) It's nice to see exploratory programming leading to such cost-effective solutions.

I do have a small quibble with the title, which you already touch upon in your second paragraph: it should be 'performance can matter'. Because under the more general statement also fall things like pointlessly dense one-liners and writing scripts to save 10 seconds of typing per day.

rhymes

Because under the more general statement also fall things like pointlessly dense one-liners and writing scripts to save 10 seconds of typing per day.

Ah ah true :-) Saving 10 seconds is probably a stretch too far in the performance category but I see what you mean.

I have to be honest: I'm not totally against dense one-liners if they actually increase performance (after careful benchmarking), provided that they are well documented and hopefully isolated from the rest of the code.

Alain Van Hout

That's where 'fast enough' comes in: if it takes you a minute to read the documentation, another minute or two to mentally parse the code and 15 minutes of carefully checking your changes, and perhaps 2 days debugging to notice what is actually wrong, then what point is a 2 ms speedup for something that likely has no gain from that speedup? (Which is far too common an occurrence)

Or more concisely: too many people have wasted too many hours due to someone who felt a need to be clever.

There are of course plenty of situations where performance is really needed. And there it's worth investing the time to do benchmarking and properly document the code. But most of the time, I'd be grateful if code were maintainable rather than clever.

rhymes

I totally agree with you, only a comment:

That's where 'fast enough' comes in: if it takes you a minute to read the documentation, another minute or two to mentally parse the code and 15 minutes of carefully checking your changes, and perhaps 2 days debugging to notice what is actually wrong, then what point is a 2 ms speedup for something that likely has no gain from that speedup? (Which is far too common an occurrence)

Well yes, if the speedup of the one-liner is 2 ms then no, it's definitely not worth it. The only upside in your case is that you might have gained a better knowledge of the system, but that's due to the 2 days of debugging, not due to the one-liner :D

Alain Van Hout

Indeed :-)

Dustin King

This reminds me of a talk where, if I recall, a Python reporting process was sped up by first improving the algorithm, then compiling with Cython:

Rishabh Gupta

Maybe this experiment could benefit from being compiled with Cython 🤔

rhymes

Maybe, it really depends on the code

smuschel

One thing that would fit your 'stuff to think about...' list nicely is 'always choose the right tool for the job'. But then again I think everybody knows that one by now

rhymes

One thing that would fit your 'stuff to think about...' list nicely is 'always choose the right tool for the job'

You're right

But then again I think everybody knows that one by now

I wouldn't bet on it, we're lazy animals of habit after all

Anton

Rust's bat is sick though.

rhymes

Yeah! Rust FTW I guess