Mark Erikson

Posted on Jan 19, 2019 • Originally published at blog.isquaredsoftware.com

Debugging Tips and Tales

#debugging

Originally posted on my blog at https://blog.isquaredsoftware.com

Last week's #devdiscuss discussion was on the topic of debugging. This is something I have a lot of opinions and advice on, so I replied with a quick list of my suggestions for better debugging. It's worth recording those in blog post form as well.

Note: Ben McCormick pointed me to a very similar post he wrote a few years back. It's excellent, and you should read it too: The Debugging Toolbox

Mark's Debugging Advice

1) Reproduce the Issue

Reproducing the problem is key. If you can't reproduce the issue, it's harder to figure out what's going on and iterate on fixes. More importantly, if you can't repro, you probably can't prove that a "fix" actually works.

2) Errors Provide Useful Information

Error messages and stack traces are important. They usually give you a good starting point for where the problem is and why it's happening. Read the messages. Google them if needed. @swyx quoted me in a Dev post last year on this topic: How to Google Your Errors.

3) Learn to Use Debuggers

Know your tools. Most debuggers work the same way no matter which IDE or DevTools you're using. Understand breakpoints, watches, and stepping. Use more advanced features like conditional breakpoints, or "logpoints" (called "tracepoints" in some tools) to just log a message instead of stopping.
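
If your debugger doesn't make conditional breakpoints easy, you can get the same effect in code. Here's a minimal Python sketch of the idea; `process_orders`, the order dicts, and `suspect_id` are all made up for illustration.

```python
def process_orders(orders, suspect_id=None):
    """Hypothetical loop where only one item out of many misbehaves."""
    total = 0
    for order in orders:
        # Conditional breakpoint in code form: pause only for the suspect item.
        if suspect_id is not None and order["id"] == suspect_id:
            breakpoint()  # opens pdb by default; PYTHONBREAKPOINT=0 disables it
        total += order["amount"]
    return total
```

Calling `process_orders(orders, suspect_id=42)` would drop into the debugger only when the order with id 42 comes through, instead of stopping on every iteration.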

4) Print Logging is Important

Sometimes an IDE debugger won't cut it, especially if it's a remote system or multi-threaded environment. Good ol' console/print debugging is still important. It helps to have lots of existing logging with different levels that you can enable via config files.
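
As a rough illustration of config-driven log levels, here's a sketch using Python's standard `logging.config.dictConfig`. The logger names (`myservice.auth`, `myservice.proxy`) are hypothetical, and in a real service the dict would more likely be loaded from a JSON or YAML file than written inline.

```python
import logging
import logging.config

# Hypothetical config; normally loaded from a file so that levels can be
# changed without touching the code.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "plain"},
    },
    "loggers": {
        # Turn up detail only for the subsystem under investigation.
        "myservice.auth": {"level": "DEBUG"},
    },
    "root": {"level": "WARNING", "handlers": ["console"]},
}

logging.config.dictConfig(LOGGING_CONFIG)

auth_log = logging.getLogger("myservice.auth")
proxy_log = logging.getLogger("myservice.proxy")

auth_log.debug("looking up user %s", "alice")  # emitted: this logger is at DEBUG
proxy_log.debug("forwarding request")          # suppressed: inherits WARNING from root
```

The debug calls stay in the code permanently. Whether they produce output is decided entirely by the config, which is what makes this useful on a remote system.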

5) Debug With a Plan

Go in with a hypothesis. Understand how the code should behave first. Look at the actual behavior. You can probably make educated guesses for where things are diverging. Focus on those areas. Don't randomly tweak. Change one thing at a time and compare results. Narrow down possible causes based on that.

6) Don't Be Afraid

Experience helps, but don't be afraid to dive in if you're new, because that's how you get experience :) Even if an issue looks like it's really complex, take some time to break it down into smaller pieces. Dave Ceddia and Gosha Arinich both wrote good advice on the topic of breaking down problems from the perspective of building a new project and learning new topics, but the same concept applies to debugging.

7) Know When To Keep Going, and When To Stop

Persistence is good, and it takes time to build up mental context on an issue, but sometimes you gotta take a break. Plenty of times I've left work, gone home, relaxed, and figured out what the issue was overnight or in the morning.

On that note, I'll recap a couple of my best debugging stories.

Tales of Debugging

The Case of the Wrong Case

A few years ago, I had a Python service that was doing a bunch of number crunching, so I used Cython to speed up the math-intensive portions of the code. Cython is a tool that can compile plain Python code to C so that it runs faster. If you add extra type declarations, it can generate more efficient C code, because it knows what each variable type is.

Cython has a few ways you can declare types. You can rename your .py files to .pyx and add types directly in the code; you can use decorators to declare the types; or you can add separate .pxd files that declare the types. I set it up using the side-by-side file approach, which lets you leave your original Python files alone. For this to work, the .pxd file with the types has to match the exact name of the original .py file. So, if I had FileA.py, I'd also have FileA.pxd.

I added type definitions for the Python functions and variables and referenced several C math functions from <math.h>. This service primarily runs on Linux, but I do most of my development on Windows. I did the initial changes on my local Windows machine, tested the code, and it seemed to be working great.

Unfortunately, when I went to run this on Linux, it compiled fine, but started throwing some weird exceptions buried deep down in the code. I hadn't actually written the math portion of this service myself, so I wasn't familiar with the actual logic. I traced the problem pretty far down into the code, and finally concluded that a particular function was trying to call the cos() trigonometry function, but the call was failing because cos didn't seem to exist. This didn't make any sense, because I was explicitly importing the C version of cos from <math.h> if using Cython, and importing the normal Python function from the math module otherwise.

Took me about 6 hours of tracing, investigating, and steadily growing confusion before I finally figured out what was going on: there was a single letter casing mismatch between a .py file and its corresponding .pxd file. One had a capital 'O', the other had a lower-case 'o'. This worked fine on Windows, because Windows file systems are case-insensitive (so filea.txt is considered the same as FileA.txt). But Linux file systems are case-sensitive, so two names that differ only in letter case refer to different files. Because of this, the Cython compiler never actually found the .pxd file to match the original .py file, didn't get the right imports, and thus stuff eventually exploded. The fix was to change the capital 'O' to a lower-case 'o', and it worked.
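
A check like the following could have caught this much sooner. It's only a sketch, not something from the original project: the function names and example file names are hypothetical. It flags any .py file whose .pxd companion matches only when letter case is ignored.

```python
from pathlib import Path


def find_case_mismatches(filenames):
    """Return (py_name, pxd_name) pairs that match only when case is ignored."""
    names = set(filenames)
    pxd_by_lower = {n.lower(): n for n in names if n.endswith(".pxd")}
    mismatches = []
    for name in sorted(names):
        if not name.endswith(".py"):
            continue
        expected = name[:-3] + ".pxd"
        if expected in names:
            continue  # an exactly matching .pxd exists, nothing to report
        actual = pxd_by_lower.get(expected.lower())
        if actual is not None:
            mismatches.append((name, actual))
    return mismatches


def check_directory(directory):
    """Run the check on the real file names in one directory."""
    return find_case_mismatches(p.name for p in Path(directory).iterdir())


# Hypothetical file names: one correct pair and one pair that differs by case.
example = ["Solver.py", "Solver.pxd", "geometry.py", "geOmetry.pxd"]
print(find_case_mismatches(example))
```

Because directory listings report names with their stored case, a check like this should behave the same on a case-insensitive file system, so it can run on the development machine before the code ever reaches Linux.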

Definitely a pretty high hours-of-work to number-of-characters-changed ratio :)

A Twisted Tale

Recently, we got reports that one of our Python services appeared to be locking up entirely. This particular service acts as a reverse proxy, forwarding HTTP requests to the rest of our system, and also handling authentication requests.

The main symptom was that none of our app would load in the browser. Oddly, there were no obvious errors in the logs, but that was because the logs for the service stopped entirely at about the time the symptoms occurred.

This Python service is built on the Twisted network framework. Twisted uses an "event loop", similar to how JavaScript and Node.js work. It has a queue of events, and for each event, runs the associated code to completion. This is great if you have lots of IO that happens in the background. But it also means that if a particular piece of code takes a long time to run, it can block the entire event loop and keep any other code from running.
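
To make the blocking behavior concrete, here's a small sketch. It uses Python's built-in `asyncio` event loop as a stand-in for Twisted's, since the run-to-completion idea is the same. The `heartbeat` and `blocking_handler` names are made up for the example.

```python
import asyncio
import time


async def heartbeat(gaps, ticks=20, interval=0.01):
    """Record the time between consecutive wake-ups of a periodic task."""
    last = time.monotonic()
    for _ in range(ticks):
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def blocking_handler():
    """A handler that never yields, so the loop cannot run anything else."""
    await asyncio.sleep(0.03)  # let the heartbeat tick a few times first
    time.sleep(0.2)            # stand-in for a synchronous network call


async def main():
    gaps = []
    await asyncio.gather(heartbeat(gaps), blocking_handler())
    return max(gaps)


worst_gap = asyncio.run(main())
print(f"longest heartbeat gap: {worst_gap:.3f}s")
```

The heartbeat is supposed to wake up every 10 ms, but its longest gap ends up at least as long as the 0.2 second synchronous call, because nothing else can run while that call holds the loop. With a call that never returns, the gap never ends, which matches logs that simply stop.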

I was stumped by the issue for a while, but I had a couple guesses. My hypothesis was that one of the external calls this service was making was never getting a response, and thus blocking the event loop from continuing.

We cranked up the log levels for this service, restarted things, and waited for it to happen again. When it did, I took a look at the logs. Sure enough, the last thing the service was trying to do was make an external request to an LDAP server to query some user data.

I did some testing and confirmed that all of the external requests were happening on the main thread, same as the event loop (which I did by modifying the logging config to print out the name of the thread where the log statement occurred). I moved all those external requests to happen in background threads instead, and also fixed the timeout settings on the calls so that they don't block indefinitely.
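
Here's a standard-library sketch of both ideas: putting the thread name in every log line, and running a slow lookup on a worker thread with a bounded wait. It is not the actual service code. `query_directory` just sleeps to simulate a slow LDAP query, and all the names are hypothetical.

```python
import concurrent.futures
import logging
import threading
import time

# %(threadName)s shows which thread emitted each log line.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(threadName)s] %(message)s",
)
log = logging.getLogger("proxy")  # hypothetical logger name


def query_directory(username, delay):
    """Stand-in for a slow external lookup (e.g. an LDAP query)."""
    log.info("querying user %s", username)
    time.sleep(delay)
    return threading.current_thread().name


def lookup_with_timeout(pool, username, delay, timeout):
    """Run the lookup on a worker thread and stop waiting after `timeout`."""
    future = pool.submit(query_directory, username, delay)
    try:
        # The caller waits at most `timeout` seconds. An event-loop framework
        # would attach a callback here instead of waiting at all.
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        log.warning("lookup for %s timed out after %.2fs", username, timeout)
        return None


with concurrent.futures.ThreadPoolExecutor(
    max_workers=2, thread_name_prefix="lookup"
) as pool:
    fast = lookup_with_timeout(pool, "alice", delay=0.01, timeout=1.0)
    slow = lookup_with_timeout(pool, "bob", delay=0.3, timeout=0.05)
```

The "querying user" lines carry a `lookup_` thread name rather than `MainThread`, which is the kind of evidence the logging change provides. In Twisted itself, `deferToThread` is a common way to hand blocking work to a thread pool.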

The changes resolved the issue. We later found out that the issues did actually correspond to times when the LDAP server had been experiencing 100% CPU usage, thus confirming the external cause.

The moral of the story is that knowing how the underlying technologies behave allowed me to make an educated guess for the likely cause of the problem.

Final Thoughts

Debugging is a combination of gathering information, problem solving, and knowing how to use the tools you have available. Articles and courses can teach you a lot of things, but debugging is something that I feel really requires practice and experience.

So, the next time you see your application spewing errors into the logs, or something just "doesn't work", DON'T PANIC! You can figure this out :)

Top comments (5)

Erebos Manannán • Jan 20 '19

One of the most important things that is still missing from this imo is "know when and how to ask for help".

A lot of people think they're alone with the problem they're debugging, so it becomes a much bigger source of stress than it needs to be. However, quite a lot of people have the comfort of working in a team and being able to ask teammates for help, and there's Stack Overflow, GitHub issues, IRC channels, and Slack for many projects, if you can first identify roughly what you're having an issue with.

Once you've spent actual effort trying to figure out what's wrong, and you have an idea of what is going on but don't yet know how to solve it, that's usually a good time to start asking for help. If you're at the stage where literally all you know is "it doesn't work", and you're not even sure what "it" or "doesn't work" means, then it's too early.

Once you've figured out you need help, the next hurdle for people really is "how to ask good questions". Most people are really bad at this, because they just don't practice at all. They never ask for help, they never visit forums, or other discussion mediums to ask for help, nor ask for help from their colleagues.

Explaining what is wrong to another person takes effort, and you can't just throw them a disconnected snippet of what YOU have been looking at, or the last line of a traceback saying "this is broken". You should at least explain what you've been trying, what kind of issue you're having with it, how you've tried to debug it, and what kind of errors you're seeing.

E.g. a bad "explanation" of your problem (that I encounter depressingly often) is like:

I'm getting "TypeError: object of type 'NoneType' has no len()", what's wrong?

Whereas a better one would be more akin to:

I've been trying to run the database migration task for my local environment, but it's giving me errors. I checked and it seems the configuration is correct, and the database server is running, but for some reason I'm getting this really odd error that seems to have something to do with the database connection based on the traceback.

Here's the full traceback:

```
  File "/src/tasks.py", line 15, in <module>
    from app import services, models
  File "/src/app/services.py", line 12, in <module>
    import common.utils as utils
  File "/lib/python/common/utils.py", line 264, in <module>
    db = get_db(_ARANGODB_DB)
  File "/lib/python/common/utils.py", line 225, in get_db
    _ARANGODB_PORT, CustomHTTPClient(_ARANGODB_PROTOCOL))
  File "/lib/python/common/arangodb_client.py", line 26, in __init__
    self._session.mount(protocol, adapter)
  File "/.venv/auth-ALEBwC_u/lib/python3.6/site-packages/requests/sessions.py", line 744, in mount
    keys_to_move = [k for k in self.adapters if len(k) < len(prefix)]
  File "/.venv/auth-ALEBwC_u/lib/python3.6/site-packages/requests/sessions.py", line 744, in <listcomp>
    keys_to_move = [k for k in self.adapters if len(k) < len(prefix)]
TypeError: object of type 'NoneType' has no len()
```

You should also understand netiquette if you are asking people on forums, Slack, or similar. Use a kind tone, write full coherent sentences instead of using the Enter key for punctuation, and try to make sure you're posting in the right place.

In the case of Slack, you probably shouldn't be posting to #general; most projects have #troubleshooting or #help channels instead. On Stack Overflow, try to keep your questions clear, to the point, and appropriately tagged. With both IRC and Slack, you should realize that these are not 24/7 paid support. They're people with other things to do in their lives, so you might have to wait a long while (often hours) and maybe even repeat or rephrase your question a few times before you get a reply.

Mark Erikson • Jan 20 '19

Yep, good advice!

There's plenty of "how to ask good questions" articles out there, but unfortunately folks that are likely to ask "bad" questions are also unlikely to be searching for info on how to ask better questions :)

Mark Erikson • Jan 20 '19

Haha, nice :) Want to share it?
