We’ve all seen bugs in production. Sometimes we can reproduce them, sometimes we can’t.
When we can reproduce them, then we fix it quickly. The we do some regression tests to see if we broke anything else. And then we’re done.
When we can’t reproduce it, that’s a totally different story.
How do we know that we can’t reproduce it?
This sounds simple, doesn’t it? We tried to use the system, and the problem hasn’t occurred. It is possible that we just couldn’t reproduce it yet? Is it possible that we’re only one step away from reproducing it?
Let’s move from philosophical questions to pragmatic ones. Is it a waste of time tying to reproduce a NullPointerException? Is it a bigger waste of time blindly adding some code? How do we know it?
Usually we have a process about reproducing. This process is usually implicit, nobody writes it down. My process looks something like this:
- Read the bug report
- Quickly check the exception trace, hoping that it reveals an obvious problem
- Try to reproduce it based on what’s written on the bug report
- Analyze the logs and the related source code to figure out what could’ve happened
- Try to reproduce it again
- In case of multi-threaded programming, tweak the threads and locks to reproduce an odd situation
At every step I might go back to read the bug report and the logs. After a day or two I can either reproduce the bug, or I give up the search.
When I give it up, then I do either or both of the two things:
- Reproduce and fix the issue with a unit-test.
- Add more logging.
If we reproduce a problem with a unit test, then we’re likely to fix only a symptom. Say for instance, if we add another null-check, then we still won’t know why a certain method receives invalid data. Soon we’ll see other NullPointerException
-s.
When we add logging, then we’ll use lots of resources like CPU, Disk, Memory. Operators won’t allow it most of the time. They usually agree that it’s crucial to fix the bug. Yes, they want to help you figure out the root cause. No, sorry, they can’t let you add logging.
Luckily, there are a few approaches to get debug data from production effectively.
Logging fix 1: Sampling
One approach is sampling. We don’t have to log every user session. We can select every 1000th user session and log all of their data. Alternatively, we can use a random generator to select every 1000th user session and log that.
This is useful only if the bug occurs quite frequently. Just keep in mind, we need to sample user sessions, not log events. So we’ll get detailed information on what went wrong.
You might make the log sampling configurable. So if production operators finds that our logging still uses too many resources during peek time, they might turn it off for a couple of hours.
An alternative of log sampling is A/B testing. If our architecture allows, we might turn on debug logging on a fraction of our servers.
Logging fix 2: Logging a session in memory
Sometimes we can log debug data in memory. If an error occurs in that user session, then we can write it to a file. If not, then we simply release the memory we’ve allocated.
This is useful if the error occurs infrequently. In this case we have to do something to catch the bug as soon as it happens. However, production operation won’t necessarily be happy about it. It depends on the memory needs of our debug logging.
We can combine this kind of logging with log sampling. Say for instance, we log debug data for every 10th user session, and write it to file when an error occurs.
Logging fix 3: In-memory ring buffer, by Kofa.
István Kovács explained a third approach: an in-memory ring buffer. Every log goes into this in-memory ring buffer. This buffer has a configurable size. Also, it’s a ring buffer, so the most recent logs overwrite the oldest lines.
The logs that usually go to a file, they still go to a file. When an error occurs, the whole ring buffer gets written to a file.
This is really cool because the memory overhead is limited. If you set the size of the ring buffer right, then you’ll have all the debug data you need. On the other hand, you might spend some time to find the right size for this ring buffer.
What’s your approach?
Originally appeared at tamasrev.
Top comments (2)
Nice to know my process of trying to reproduce bugs is same, but for logs we use debug in production cause we don't have problems yet, but your aprouch sounds pretty good and we should follow it soon, thanks for sharing!
I think the problem with debugging production is that you often don't have the luxury of making any changes to the production code if you're only hunting for bugs. Like, it's one thing to push a new bugfix version live, but nobody wants to be pushing a new version of production code just to add a verbose log line or two, especially when they might be in the wrong place!
One option I've discovered for some languages (basically, it doesn't work on true compiled code, only interpreted or bitcode languages), is Rookout.com which kind of lets you inspect the code execution line-by-line. It's a bit like breakpoints except they don't stop the code.
I like methods like this because it's more similar to how I debug during development, by inspecting the code and the variables rather than relying on logs and stuff.