Last week's #devdiscuss discussion was on the topic of debugging. This is something I have a lot of opinions and advice on, so I replied with a quick list of my suggestions for better debugging. It's worth recording those in blog post form as well.
Note: Ben McCormick pointed me to a very similar post he wrote a few years back. It's excellent, and you should read it too: The Debugging Toolbox
Reproducing the problem is key. If you can't reproduce the issue, it's harder to figure out what's going on and iterate on fixes. More importantly, if you can't repro, you probably can't prove that a "fix" actually works.
Error messages and stack traces are important. They usually give you a good starting point for where the problem is and why it's happening. Read the messages. Google them if needed. @swyx quoted me in a Dev post last year on this topic: How to Google Your Errors.
Know your tools. Most debuggers work the same no matter what IDE / DevTools you're using. Understand breakpoints, watches, stepping. Use more advanced features like conditional breakpoints, or "logpoints" to just log a message instead of stopping.
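If your debugger doesn't offer those features, you can approximate them in the code itself. Here's a minimal sketch in Python; the function, logger name, and "negative amount" condition are all made up for illustration.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


def total_amount(orders):
    """Sum order amounts, pausing only on suspicious input."""
    total = 0
    for index, order in enumerate(orders):
        # "Logpoint": record the state and keep running.
        logger.debug("order %d: amount=%r", index, order["amount"])
        # "Conditional breakpoint": stop only when the condition holds.
        if order["amount"] < 0:
            breakpoint()  # opens pdb by default; PYTHONBREAKPOINT=0 disables it
        total += order["amount"]
    return total
```

The upside of doing this in an actual debugger instead is that you don't have to edit the code (and remember to remove the edits afterwards).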
Sometimes an IDE debugger won't cut it, especially if it's a remote system or multi-threaded environment. Good ol' console/print debugging is still important. It helps to have lots of existing logging with different levels that you can enable via config files.
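As a rough sketch of what config-driven log levels can look like with Python's standard logging module: the logger names (myapp.auth, myapp.proxy) are hypothetical, and the config dict is inline here only to keep the example self-contained. In a real service it would more likely be loaded from a file.

```python
import logging
import logging.config

# Hypothetical config; in practice this might be loaded from a JSON or YAML file.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "plain"},
    },
    "loggers": {
        # Turn on verbose output for one subsystem only.
        "myapp.auth": {"level": "DEBUG"},
    },
    "root": {"level": "WARNING", "handlers": ["console"]},
}

logging.config.dictConfig(LOGGING_CONFIG)

auth_log = logging.getLogger("myapp.auth")
proxy_log = logging.getLogger("myapp.proxy")

auth_log.debug("token lookup started")  # emitted: myapp.auth is set to DEBUG
proxy_log.debug("forwarding request")   # suppressed: inherits WARNING from root
proxy_log.warning("upstream slow")      # emitted
```

Changing one level in the config is then enough to get detailed output from the area you suspect, without drowning in logs from everything else.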
Go in with a hypothesis. Understand how the code should behave first. Look at the actual behavior. You can probably make educated guesses for where things are diverging. Focus on those areas. Don't randomly tweak. Change one thing at a time and compare results. Narrow down possible causes based on that.
Experience helps, but don't be afraid to dive in if you're new, because that's how you get experience :) Even if an issue looks like it's really complex, take some time to break it down into smaller pieces. Dave Ceddia and Gosha Arinich both wrote good advice on the topic of breaking down problems from the perspective of building a new project and learning new topics, but the same concept applies to debugging.
Persistence is good, and it takes time to build up mental context on an issue, but sometimes you gotta take a break. Plenty of times I've left work, gone home, relaxed, and figured out what the issue was overnight or in the morning.
On that note, I'll recap a couple of my best debugging stories.
A few years ago, I had a Python service that was doing a bunch of number crunching, so I used Cython to speed up the math-intensive portions of the code. Cython is a tool that can compile plain Python code to C so that it runs faster. If you add extra type declarations, it can generate more efficient C code, because it knows what each variable type is.
Cython has a few ways you can declare types. You can rename your .py files to .pyx and add types directly in the code; you can use decorators to declare the types; or you can add separate .pxd files that declare the types. I set it up using the side-by-side file approach, which lets you leave your original Python files alone. For this to work, the .pxd file with the types has to match the exact name of the original .py file. So, if I had FileA.py, I'd also have FileA.pxd.
I added type definitions for the Python functions and variables and referenced several C math functions from <math.h>. This service primarily runs on Linux, but I do most of my development on Windows. I did the initial changes on my local Windows machine, tested the code, and it seemed to be working great.
Unfortunately, when I went to run this on Linux, it compiled fine, but started throwing some weird exceptions buried deep down in the code. I hadn't actually written the math portion of this service myself, so I wasn't familiar with the actual logic. I traced the problem pretty far down into the code, and finally concluded that a particular function was trying to call the cos() trigonometry function, but the call was failing because cos didn't seem to exist. This didn't make any sense, because I was explicitly importing the C version of cos from <math.h> if using Cython, and importing the normal Python function from the math module otherwise.
It took me about 6 hours of tracing, investigating, and steadily growing confusion before I finally figured out what was going on: there was a single letter casing mismatch between a .py file and its corresponding .pxd file. One had a capital 'O', the other had a lower-case 'o'. This worked fine on Windows, because Windows file systems are case-insensitive (so filea.txt is considered the same as FileA.txt). But Linux file systems are case-sensitive, so two names that differ only in letter case refer to different files. Because of this, the Cython compiler never actually found the .pxd file to match the original .py file, didn't get the right imports, and thus stuff eventually exploded. The fix was to change the capital 'O' to a lower-case 'o', and it worked.
Definitely a pretty high hours-of-work to number-of-characters-changed ratio :)
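In hindsight, a small check could have caught this on Windows before the code ever reached Linux. This is only a sketch, not something I had at the time: find_case_mismatches is a hypothetical helper that looks through one directory for .pxd files whose names match a .py file only when letter case is ignored.

```python
from pathlib import Path


def find_case_mismatches(directory):
    """Return (py_name, pxd_name) pairs that differ only by letter case."""
    names = [p.name for p in Path(directory).iterdir()]
    py_stems = {Path(n).stem for n in names if n.endswith(".py")}
    mismatches = []
    for name in names:
        if not name.endswith(".pxd"):
            continue
        stem = Path(name).stem
        if stem in py_stems:
            continue  # exact match, nothing to report
        for py_stem in py_stems:
            if py_stem.lower() == stem.lower():
                mismatches.append((py_stem + ".py", name))
    return sorted(mismatches)
```

Directory listings return names as they are stored, even on a case-insensitive file system, so a check like this should behave the same on Windows and Linux.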
Recently, we got reports that one of our Python services appeared to be locking up entirely. This particular service acts as a reverse proxy, forwarding HTTP requests to the rest of our system, and also handling authentication requests.
The main symptom was that none of our app would load in the browser. Oddly, there were no obvious errors in the logs, but that was because the logs for the service stopped entirely at about the time the symptoms occurred.
I was stumped by the issue for a while, but I had a couple guesses. My hypothesis was that one of the external calls this service was making was never getting a response, and thus blocking the event loop from continuing.
We cranked up the log levels for this service, restarted things, and waited for it to happen again. When it did, I took a look at the logs. Sure enough, the last thing the service was trying to do was make an external request to an LDAP server to query some user data.
I did some testing and confirmed that all of the external requests were happening on the main thread, same as the event loop. I checked this by modifying the logging config to print out the name of the thread where each log statement occurred. I moved all those external requests to happen in background threads instead, and also fixed the timeout settings on the calls so that they wouldn't block indefinitely.
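Here's a rough illustration of both pieces: logging the thread name, and pushing a blocking call onto a background thread with a timeout. It uses asyncio purely as an assumption for the sake of a runnable example (it needs Python 3.9+ for asyncio.to_thread), and query_directory is a hypothetical stand-in for the real LDAP call.

```python
import asyncio
import logging
import time

# %(threadName)s shows which thread emitted each log record.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s [%(threadName)s] %(levelname)s %(message)s",
)
log = logging.getLogger("proxy")  # hypothetical logger name


def query_directory(username, delay=0.05):
    """Stand-in for a blocking external call (e.g. an LDAP lookup)."""
    log.debug("querying directory for %s", username)
    time.sleep(delay)  # simulates network latency
    return {"user": username}


async def lookup_user(username, timeout=1.0, delay=0.05):
    """Run the blocking call off the event-loop thread, bounded by a timeout."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(query_directory, username, delay), timeout
        )
    except asyncio.TimeoutError:
        log.warning("directory lookup for %s timed out", username)
        return None
```

Calling asyncio.run(lookup_user("alice")) logs the lookup from a worker thread rather than MainThread. One caveat: the timeout here only stops the event loop from waiting. The worker thread itself keeps running until the blocking call returns, so the client library's own timeout settings still matter.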
The changes resolved the issue. We later found out that the issues did actually correspond to times when the LDAP server had been experiencing 100% CPU usage, thus confirming the external cause.
The moral of the story is that knowing how the underlying technologies behave allowed me to make an educated guess for the likely cause of the problem.
Debugging is a combination of gathering information, problem solving, and knowing how to use the tools you have available. Articles and courses can teach you a lot of things, but debugging is something that I feel really requires practice and experience.
So, the next time you see your application spewing errors into the logs, or something just "doesn't work", DON'T PANIC! You can figure this out :)