Paul Biggar for Darklang

Posted on Jul 28, 2020 • Edited on Oct 6, 2020 • Originally published at blog.darklang.com

Evolving Dark's tracing system

#devjournal #architecture #tdd #cloud

One of the things that makes Dark truly unique is what we call "Trace-driven development". The best way to write an HTTP handler in Dark is to start by making a request to the non-existent handler, then:

using the 404s list to create the handler

using the actual trace value to see the output of your code as you type it.

We use the trace system a lot, and it's pretty great. It acts as a sort of omniscient debugger: you don't need to start it, you can go back in time easily, you don't need print statements. You can even see the control-flow of your application.

Like most things in Dark today, the trace system was built using the simplest, most obvious implementation possible. As we've grown quite considerably since then, we need to ensure that traces continue to scale well, which they currently do not.

This post covers what I've discovered isn't working, along with some ramblings about what the next generation should look like.

Cleanup

Dark stores basically every request that is made to it. And it stores it in the database. While this data is important, it isn't the same level of importance as user data. Storing useful and voluminous data in the same DB as a much lower volume of extremely precious data is not a great idea.

To avoid the DB blowing up in size (and price) we go through the DB and garbage collect it pretty much continuously. We keep the last 10 requests, and also keep any requests that were made in the last week.
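The retention rule above can be sketched in a few lines. This is only an illustration of the rule as described, not Dark's actual garbage collector (which is mostly SQL). The function name, the `(trace_id, timestamp)` shape, and the assumption that the "last 10" is counted per handler are all mine.

```python
from datetime import datetime, timedelta, timezone

KEEP_LATEST = 10                 # "the last 10 requests"
KEEP_WINDOW = timedelta(days=7)  # "made in the last week"


def traces_to_delete(traces, now):
    """Return the IDs of traces that the retention rule allows us to drop.

    `traces` is a list of (trace_id, timestamp) pairs for ONE handler
    (per-handler scoping is an assumption). A trace survives if it is among
    the newest KEEP_LATEST or is younger than KEEP_WINDOW.
    """
    newest_first = sorted(traces, key=lambda t: t[1], reverse=True)
    cutoff = now - KEEP_WINDOW
    return [
        trace_id
        for rank, (trace_id, ts) in enumerate(newest_first)
        if rank >= KEEP_LATEST and ts < cutoff
    ]


now = datetime(2020, 7, 28, tzinfo=timezone.utc)
# 15 traces, one per day: trace 0 is from today, trace 14 is two weeks old.
traces = [(i, now - timedelta(days=i)) for i in range(15)]
print(traces_to_delete(traces, now))  # [10, 11, 12, 13, 14]
```

Note that a trace has to fail both tests before it can be deleted, which is part of why the equivalent SQL is fiddly.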

We have struggled to make this not be incredibly buggy. The logic is tricky, and mostly written in SQL whose performance is iffy and which hides quite a few footguns. As a result, the queries that delete data are slow (interestingly, this garbage collector accounts for the majority of the load on our database) and also take quite a few locks (though I'm systematically working through this in a recent PR).

It can also be hard to identify what data to delete. When we started, we didn't know how we wanted traces to work, and so went with an implementation that stored a trace using the path of the URL requested. This worked well initially, especially as it allowed for easily transitioning a 404 (essentially, a trace with no owner) to a new handler, but had weird behaviour when you changed a handler's route (losing all its traces!). Alas, URLs also support wildcards, and so this meant that in order to find out whether a trace should be deleted, we basically had to recreate the entire routing business logic in the DB.

My thinking here is to associate the trace with the actual handler it hits. That way we're not recreating the business logic, but we'd need a separate 404 storage (although this is probably simpler in the long run). It also changes the behaviour when you "rename" a handler, which you sometimes do early in development; the new behaviour would be to keep the existing traces, which honestly is a much more user-friendly behaviour.
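Here is a minimal sketch of that idea, under assumptions of my own: `Handler`, `TraceStore`, and their fields are hypothetical names, and traces are reduced to plain IDs. It shows traces keyed by a stable handler identity rather than by URL path, with 404s kept in separate storage.

```python
from dataclasses import dataclass, field


@dataclass
class Handler:
    handler_id: int   # stable identity (hypothetical field name)
    route: str        # may contain wildcards, e.g. "/users/:id"


@dataclass
class TraceStore:
    by_handler: dict = field(default_factory=dict)  # handler_id -> [trace_id]
    unmatched: list = field(default_factory=list)   # the separate 404 storage

    def record(self, trace_id, handler):
        """Store a trace under the handler it hit, or as a 404 if none did."""
        if handler is None:
            self.unmatched.append(trace_id)
        else:
            self.by_handler.setdefault(handler.handler_id, []).append(trace_id)

    def traces_for(self, handler):
        return self.by_handler.get(handler.handler_id, [])


store = TraceStore()
users = Handler(handler_id=1, route="/users/:id")
store.record("trace-a", users)   # a request that hit the handler
store.record("trace-b", None)    # a request that hit nothing: a 404

users.route = "/people/:id"      # "renaming" the handler
print(store.traces_for(users))   # ['trace-a']
print(store.unmatched)           # ['trace-b']
```

Because the lookup never touches the route, deciding which traces belong to a handler no longer requires replaying wildcard matching.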

Storage

One of the problems is that we're storing the data in a DB. This sort of log data, which is mostly immutable, should be stored somewhere more appropriate, like S3 (we use Google Cloud, so Cloud Storage in our case). This was also a pattern from the early days of CircleCI - we started by saving build logs in the DB, before moving them to S3.

That would also allow us to send traces to the client without going through the server, which has operational problems of its own. This solves a big problem for customers with larger traces, which can time out when loading from our server. Since Dark is basically unusable without traces (you can't use autocomplete well without them, for instance), solving this is pretty important.

The other upside of this is that rather than running a GC process to clear up the DB (which doesn't even do a great job, as the DB will continue to hold onto the space), using something like S3 would allow us to have lifecycle policies to automatically clean up this data.

One of the problems here is that traces aren't quite immutable. You can -- by intention -- change the contents of a trace. While the initial input is immutable, you can re-run a handler using the same inputs, which currently overwrites the same trace (users have found this dumb, so losing this behaviour is probably an improvement).

You can also run a function you just wrote, adding it to the trace. This behaviour actually is good - it's a key part of Trace-driven development that you start with a partial trace based on your inputs, and then start to build it up as you write code.

My current thinking is to add the concept of a trace "patch". If you run something on top of the trace, we store the "patch" in the DB and resolve/combine the "base trace" and its patches in the client.
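As a rough sketch of how resolving might work: the trace and its patches are modelled here as dictionaries from an expression ID to the recorded value. That representation is an assumption for illustration only, and Python stands in for the client code.

```python
def resolve_trace(base, patches):
    """Combine an immutable base trace with its patches, oldest first.

    A later patch wins over earlier data recorded for the same ID.
    """
    resolved = dict(base)          # copy, so the base trace is never mutated
    for patch in patches:
        resolved.update(patch)
    return resolved


base = {"request": {"path": "/hello"}, "expr-1": "hello"}
patches = [
    {"expr-2": "HELLO"},   # ran a newly written function on top of the trace
    {"expr-2": "Hello"},   # edited the function and ran it again
]
resolved = resolve_trace(base, patches)
print(resolved["expr-2"])  # Hello
print("expr-2" in base)    # False
```

The base trace stays untouched, so it can live in immutable storage while only the small patches need a mutable store.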

Expiration

The GC process isn't a great feature. While it would be much better if it didn't hit the DB at all, it would be even better if it didn't exist. Cloud Storage/S3 have expiration policies, which can automatically delete data without having to go through an expensive GC process.
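For illustration, an age-based expiration policy can be generated like this. The JSON shape follows the lifecycle configuration format that Cloud Storage documents, as I understand it; the helper name is hypothetical, and the 7-day age is simply borrowed from our current one-week retention.

```python
import json


def delete_after_days(days):
    """Build a lifecycle configuration that deletes objects once they are
    `days` days old."""
    return {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {"age": days},
            }
        ]
    }


print(json.dumps(delete_after_days(7), indent=2))
```

A rule like this is purely age-based: it knows nothing about how many traces a handler has.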

One issue would be that we don't want the latest ten traces (or some number) to expire. I haven't fully thought this one through, but it seems doable.

You can sign up for Dark here, and check out our progress in these features in our contributor Slack or by watching our GitHub repo. Comment here or on Twitter.

Top comments (6)

Chase Granberry • Jul 28 '20

First, I love this concept. Second ... you should use BigQuery!!! I think I read somewhere you're already on GCP. BigQuery was built for this. Storage pricing is basically the same as S3 and in some cases automatically a lot less. Use standard SQL for querying. You can actually update records if you like. You can query the streaming buffer for free (probably mostly what your users will need). You could even build it so that people could provide their own GCP credentials and store traces for however long they like (in this case they'd get charged for storage and queries). Partitioned tables can have a TTL so old data is auto pruned. Plus, you get Google Data Studio for free. I built Logflare on top of BigQuery initially for all these reasons and I've had zero regrets so far. The downside is that queries are pretty much never sub-second but they are very rarely above 5 seconds. If you'd like to play with this exact setup check out Logflare.

Paul Biggar Darklang • Aug 3 '20

That's super interesting, thanks! I'll have to think about that as I start to work on it. Subsecond is important, but I could put a cache in between so that things won't feel all that slow. And querying capability would actually be super useful.

Thanks!

Raunak Ramakrishnan • Jul 29 '20 • Edited

Have you explored saving the request data in Apache Kafka? As you mentioned in the post, the request data is mostly immutable, which seems like a good use case for Kafka. Kafka allows creating topics with retention that can be a combination of size and time, e.g. a maximum of 7 days or 10 GB, whichever gets hit first. You can also use KSQL for querying the produced data and creating stuff like constantly updated materialized views on the latest requests.

On the flip side, it will mean at least having some JVM-based software (Kafka and ZooKeeper) in the stack. Also not sure if OCaml has a mature Kafka client library.

Paul Biggar Darklang • Aug 3 '20

That's a really interesting approach, seems like it would solve all of the problems I'm looking at. Thanks for the suggestion!

Richey Ryan • Jul 28 '20 • Edited

More of a question for my own sake but why wouldn't you store this kind of data in some no sql data store separate from your primary user data store?

It would allow for updates and you could still facilitate whatever eviction policy suited you.

Also, would using a time-based eviction method help? So if you rename a handler and it ultimately becomes defunct, you then evict it 24 hours later (or whatever makes sense), whereas an active handler that lives on will still accumulate logs. It may mean that a user has to build up some logs again if they left a handler over the weekend, but ultimately it favours things that are in active use. Obviously very active debugging sessions might create an awful lot of logs, so you might need to compensate with a last X approach too.

Paul Biggar Darklang • Aug 3 '20

Yeah, when I said S3 I really meant "some no sql data store separate from the primary user data store". Some folks have suggested Kafka and BigQuery, which have some very nice properties around this.