Debugging is one of the hardest things you have to do as a dev. It pushes you to understand your creations at a whole new level. Your friendly Ruby debugging tools like pry and byebug are wonderful when you are trying to track down a bug locally or in your tests. But what happens when the bug is in production? What if the bug only occurs when thousands of concurrent jobs are running? How do you handle debugging those scenarios?
Recently, my team and I encountered a problem with our Sidekiq workers that took us more than 4 days to solve. While we were debugging the issue we tried relentlessly to replicate it in our local environment but were unsuccessful. Here is the story about how the bug hunting unfolded and the strategies we used to help us debug the issue safely in production.
The 🐛
We use Sidekiq Enterprise to process all of our background jobs. One of the main features of Sidekiq Enterprise that we use is the rolling restart feature. This allows us to have long-running jobs on Sidekiq that we don't have to worry about killing when a deploy goes out and reloads the workers. The reload command issues a USR2 signal which tells the workers to shut down once they are done working on whatever job they have.
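To make that concrete, the deploy step that triggers this boils down to something like the following sketch (the pidfile path here is hypothetical, and Sidekiq Enterprise ships its own tooling for issuing the signal):

    # Send USR2 to the Sidekiq process so it finishes its in-flight jobs
    # and then shuts down on its own schedule.
    pid = File.read("/var/run/sidekiq.pid").to_i # hypothetical pidfile location
    Process.kill("USR2", pid)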
We have been using this functionality for almost a year now and never had any issues with it until last week. Last week we deployed a new job that had some problems of its own and raised a lot of errors. While that job was breaking and raising all of these errors, we started seeing Sidekiq threads get stuck when we tried to reload them. When we looked at the Sidekiq processes in a terminal after a deploy, we could see the stuck threads sitting there.
The threads were not working on anything, and yet they still were not dying. Given we deploy many times a day, we couldn't afford to have all of these stuck threads hanging around.
We immediately tried to recreate the problem locally. For development, we use Vagrant VMs on our laptops. They come equipped with the entire application, workers and all. We can even kick our Vagrants into production mode, which will run almost exactly like a production server. (I will explain the "almost" later 😉)
We first tried to recreate the problem in development mode but had no luck. I then decided to kick my Vagrant into production mode to see if that would work. Still no dice. At this point, our only option was to debug in production. I can already feel the judgy stares coming through the interwebs at me.
Is debugging in production ideal? No. But sometimes that is what you have to do to figure out a problem. The key is to do it in a safe way that ensures your users are not going to be affected. Throughout this debugging story, I will share the tools and strategies I use when I am debugging in production that have proved to be incredibly valuable.
1. Limit the scope of where you are debugging
First and foremost, when you are debugging in production you want to limit the scope of where you are doing your testing and debugging.
In our case, we have multiple production environments, some of which are less used than others. The first thing we did was confirm we could replicate the issue on one of the smaller production environments.
If you only have one production environment, consider scoping your testing to a single server or even a single user. The more you scope down what or where you will be testing the less risk there will be of an interruption to your users.
With an environment to test in, we started to dig into the problem. Given we had never seen this behavior before, we initially assumed that there was a problem with the new job. After running some process traces on the stuck processes and Googling the results, we found that one potential issue could be that the new job code wasn't thread-safe. Sidekiq uses multiple threads to process jobs, and if the job code isn't thread-safe, threads can get stuck. We went through and made sure the new job was thread-safe. Then we checked that all the external requests the new job was making weren't getting stuck somehow. We made sure the new job code was pristine, but we were still seeing stuck threads when it errored.
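For illustration, this is the flavor of problem we were auditing for, shown as a contrived sketch rather than our actual job: class-level mutable state that every Sidekiq thread in the process shares.

    class ImportWorker # hypothetical example, not our real job
      include Sidekiq::Worker

      # Shared by every thread in the Sidekiq process; concurrent jobs can
      # interleave their writes and corrupt it.
      @@seen_ids = []

      def perform(record_id)
        @@seen_ids << record_id # Array appends are not synchronized, so this is not thread-safe
      end
    end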
With no smoking gun in the job itself, we decided to test whether the new job code was actually to blame at all. This leads me to my second piece of advice.
2. Localize the problem by eliminating possible causes
There are a lot of things going on when we process jobs with Sidekiq. In order to figure out exactly what the problem was, we had to start eliminating possible problematic code. In a development environment or when you are working on a test, this can be as easy as commenting out the code. In production, you have to get a little more creative which leads me to my next piece of advice:
3. Don't be afraid to write and commit code for debugging purposes
To see if the new job code was at fault we created a dummy Sidekiq job called StandardErrorWorker that literally did nothing but raise an error.
    class StandardErrorWorker
      include Sidekiq::Worker

      def perform
        raise StandardError
      end
    end
If this simple job caused the same Sidekiq stuck threads then we could eliminate the possibility that the problem was with the new job code.
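Reproducing the error storm was then just a matter of flooding the queue with it, along these lines (the count here is arbitrary):

    # Enqueue a pile of jobs that will all raise and retry.
    10_000.times { StandardErrorWorker.perform_async }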
I know some of you are shocked right now. Commit temporary code, blasphemy! But sometimes you have to do what you have to do. I believe it is totally acceptable to write temporary code for debugging purposes. The key is to make sure you clean it up when you are done. You can ensure you do this by adding a TODO, using a tool like the todo_or_die gem, or by simply making a note or ticket to remind yourself to remove it later. Think of it like camping ⛺️
Leave no traces behind!
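If you go the todo_or_die route, the reminder could look something like this sketch (assuming the gem's TodoOrDie(message, by:) helper; the date is made up):

    require "date"
    require "todo_or_die"

    # Starts complaining loudly once the due date passes, so the temporary
    # worker can't quietly live on forever.
    TodoOrDie("Remove StandardErrorWorker once the stuck-thread bug is fixed", by: Date.new(2020, 1, 31))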
With the dummy job deployed, we fired up our test scenario on our less-used production environment and, lo and behold, the threads got stuck! This was a big turning point for us because it showed us that our new job wasn't the problem; it was something else.
Before I move on I want to point something out. This investigation into the new job took a couple of days to do. I feel like people do these write-ups and jump from one scenario to the next in the blink of an eye and it makes it seem like figuring it out was a piece of cake. News flash, it doesn't happen like that in real life! In real life, you have to bang your head against a wall for at least a day to draw these kinds of conclusions.
It took us a couple of days of trying other things before we even considered that it might not be the new job code. The new job went out when this issue started so how could it not be related to its code?! Even as we were setting up the dummy worker we were all pretty sure it would not have a problem. In the end, we were very surprised at the result and that gives me my fourth piece of advice:
4. Test every scenario, no matter how unlikely
Even when there is a scenario that you highly doubt, test it anyway! Worst case, it will confirm what you already know. Best case, it will tell you that you are completely wrong and put you on a new path. In our case, it was the latter.
Ok, now moving on. Once we determined it wasn't the new job I started taking a really close look at Sidekiq. My initial thought was that we were failing so fast that maybe it was a timing issue. Maybe something was getting skipped or missed in the shutdown path. It was officially time to bust out the debug logging.
In our production code, I added this to our sidekiq.rb initializer file to turn on Sidekiq debug logging in the small production environment that we were testing on.
    if ENV['env_name'] == "small_prod"
      Sidekiq::Logging.logger.level = Logger::DEBUG
    end
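(Side note: Sidekiq::Logging was removed in Sidekiq 6, so on newer versions the equivalent would presumably go through the plain Sidekiq.logger instead:)

    if ENV['env_name'] == "small_prod"
      Sidekiq.logger.level = Logger::DEBUG
    end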
With logging turned on I kicked off our error job, issued some reload commands, and got some workers stuck. What I found in the logs was inconclusive. Sometimes the Sidekiq shutdown flow triggered properly but the thread still got stuck. Other times it looked like the shutdown flow never even started. It was a dead end and not really helpful.
At this point, I turned to Mike Perham, the founder of Sidekiq, for some help. This leads me to yet another piece of advice which applies whether you are testing in production or locally.
5. Never be afraid to ask for help from your coworkers or the community.
I will admit, I probably always wait a little too long to ask for help because I want to make sure I didn't miss anything obvious. Mike pointed me to the helpful stuck-processes troubleshooting section in the Sidekiq wiki. Not going to lie, I felt a little like an idiot for having completely missed that during my previous Google escapades. But rather than dwell on it, I dug into the wiki suggestions.
I hooked up GDB on our server and went to town on the stuck processes. Unfortunately, that was a dead end. Nothing of value came back from dumping the thread data; I still had no visibility into what the stuck threads were doing. When I get into these situations where I am stuck, I always go back to the logs to try to find something that I missed, and that is my next piece of advice.
6. Read, reread, and read the logs again.
When you are debugging locally or in a testing environment you can easily throw in breakpoints to inspect what is going on along the way. Since you can't put breakpoints in your production code, you have to rely on logs to show you what is happening. I went back and combed through the logs, and something struck me immediately: our Honeybadger errors.
Honeybadger is the service we use for reporting application errors. We have the Honeybadger gem included in our application to help us send those application errors to the Honeybadger service. I noticed in the logs that not only were we sending errors to Honeybadger successfully, but we also had many that were failing. Those that were failing were being throttled. I started to wonder if throttling errors within the Honeybadger code was making the processes stick.
Time to apply some more problem localization and try to eliminate another possible problem component. Testing whether Honeybadger was the issue was easy. I turned off Honeybadger reporting for the StandardErrorWorker and replicated the failure scenario again, only this time, everything worked flawlessly! I was able to reload Sidekiq 5 times while thousands of jobs were failing and retrying and none of the threads came close to sticking. I now knew the problem was with Honeybadger, but I had no idea why or what in Honeybadger was the issue.
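How you turn reporting off will depend on your setup. One hedged sketch, assuming Honeybadger's exceptions.ignore configuration option, is below; note that ignoring StandardError silences almost every report, which is only reasonable because we scoped it to the small test environment.

    # config/initializers/honeybadger.rb -- a sketch; verify the option names
    # against the docs for your version of the gem.
    Honeybadger.configure do |config|
      if ENV['env_name'] == "small_prod"
        config.exceptions.ignore += [StandardError] # drop reports from the dummy worker while we test
      end
    end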
I immediately jumped into the Honeybadger source code to check it out. While in there I could see the throttling at play and decided to try and turn it off to see if that would fix our issue. Often when I am in a situation like this, rather than try to fully understand what is happening I will first figure out a test to validate if the code I'm looking at is the problem. Once again, problem localization for the win!
If I validate that the code I am looking at is the issue, then I will dig more into understanding why it works the way it does. To turn the throttling off I got hacky and monkey patched two of the Honeybadger methods I had found, which leads me to my next piece of advice:
7. Monkey patching is your friend when debugging
When you suspect a gem might be the problem don't be afraid to monkey patch it to test your theory. If it does turn out to be the problem then you can take some time to actually issue a PR against the gem itself.
First I patched the handle_response method to not add any throttling.
    module Honeybadger
      class Worker
        def handle_response(msg, response)
          debug { sprintf('worker response code=%s message=%s', response.code, response.message.to_s.dump) }
          case response.code
          when 429, 503
            warn { sprintf('Error report failed: project is sending too many errors. id=%s code=%s throttle=1.25 interval=%s', msg.id, response.code, throttle_interval) }
            # add_throttle(1.25)
            warn { sprintf('monkey patch code working') }
          when 402
            ...
          end
        end
      end
    end
Next, I patched the work method to remove the throttling there as well.
    module Honeybadger
      class Worker
        def work(msg)
          send_now(msg)
          # sleep(throttle_interval)
        rescue StandardError => e
          error {
            msg = "Error in worker thread class=%s message=%s\n\t%s"
            sprintf(msg, e.class, e.message.dump, Array(e.backtrace).join("\n\t"))
          }
          sleep(1)
        end
      end
    end
With the patches in place, I ran the same tests I had before. I really, really thought this was going to be it... but it wasn't 😩. The workers were still getting stuck. I went back to the Honeybadger source code and continued to poke around. That is when I noticed the Queue object. It piqued my interest because a queue means lining items up to process, and if a lot of items pile up, that could slow things down.
I noticed first that the queue had a max_size and wondered what that was. Turns out it was a configuration setting and it defaulted to 1000.
WHOA, that seemed like it could be an issue when paired with the throttling. While looking at that configuration setting, I noticed the send_data_at_exit setting. I had never heard of it before, so I read up on it. It defaulted to true, and this was its description:
Finish sending enqueued exceptions before allowing a program to exit.
BINGO! Honeybadger would not allow a program to exit until it had sent all of its enqueued data. Time to test another theory! I set the send_data_at_exit configuration variable to false on the production environment I was using. This meant that when I told the threads to exit, they would disregard anything built up in the Honeybadger queue and exit immediately. I kicked off the failure scenario, reloaded the workers, and ZERO threads got stuck!
My team who knew I had been fighting this for days all looked at me when I shouted and the first thing they said was, "Shit you solved it didn't you?" It was the best feeling in the world!
Looking back on the whole process I have a couple more thoughts and advice to share when you are debugging complicated issues such as this.
8. Don't forget about your gems and plugins as possible causes for problems.
Gems that work 99.99% of the time are easy to forget about. We have never had an issue with Honeybadger so it never occurred to us that it might be the problem. It is also easy to assume that you wrote the buggy code because people who write gems are perfect, right? Wrong! Gems and plugins have bugs too! It may not even be a bug, it may be that the gem is not optimized for your use case.
The Fix
I know some of you are dying to know how we solved the issue. We had two options for fixing the stuck threads: decrease the size of the Honeybadger queue, or set send_data_at_exit to false. We decided to go with the second option. Our Sidekiq failures are also logged in an Elasticsearch logging cluster, so even if the failures don't make it all the way to Honeybadger, we still have a record of them. Plus, when we experience a large number of failures they are usually all the same type, so having some missing from Honeybadger is not likely to be a problem when it comes to visibility.
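For reference, the two options look roughly like this in a Ruby initializer (a sketch; Honeybadger can also be configured through honeybadger.yml or environment variables, so treat the exact keys as something to verify against the docs for your gem version):

    Honeybadger.configure do |config|
      config.send_data_at_exit = false # our choice: exit without draining the notice queue
      # config.max_queue_size = 100    # the alternative: cap how many notices can pile up
    end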
With our code fixed and everything running smoothly in our application, it was finally time to sit back and relax, right?
Just because the code is fixed doesn't mean we are done. There are still two more things you should do after a production debugging adventure. The first is:
9. Make the error reproducible locally
The main reason for testing in production is that you can't recreate the problem locally. In an effort to avoid having to debug in production again, look for ways you can improve or update your local environment to better mimic production and make the kinds of errors you ran into reproducible.
Earlier I mentioned that we use Vagrant VMs locally and that we can kick them into production mode. The production mode came about from another time we had a problem we were forced to debug in production. Since then it has been invaluable in helping us solve production issues. In this case, the reason it didn't work was that Honeybadger was not enabled. We have since changed that so our Vagrants in production mode better mimic our actual production environment.
Last but not least:
10. Share your story!
If you go on a long drawn out debugging adventure that leads to some interesting insights, share it! You might end up saving someone else who runs into the same problem a whole lot of time.
At the very least, if your problem was related to a gem or plugin, reach out to the maintainer or open a GitHub issue to let them know. They will definitely want to know if there is a bug, and they probably also want to know if there is a way they could improve their docs. Following my debugging adventure, I chatted with Mike (founder of Sidekiq) and Josh (co-founder of Honeybadger) about possible ways they could improve their docs and wikis to help others avoid my missteps. They were more than gracious and implemented some of my advice.
The next time you run into a production bug use some of this advice to help you debug it safely and successfully! Now go forth and debug!
Top comments (16)
Thank you, Molly! I learned some valuable lessons from your post! :D
Two thoughts...
1) the suspense in reading this almost killed me 😀
2) I've been bitten by similar issues with Sentry and am starting to think enabling real error reporting in development might be a good idea all the time.
Right?! Honeybadger has filters and ways to ignore tags and projects, so it's kind of like, why not?!
In the past I’ve disabled production gems in development entirely, and ran into similar issues. That’s the main reason in “development mode” the honeybadger gem behaves as close to production as possible—it executes all code paths up to notifying the server (which it mocks). That was a super intentional decision; you don’t want code running in production that isn’t running in development/test. I also recommend testing exception reporting in your test suite for similar reasons. docs.honeybadger.io/lib/ruby/getti...
Unfortunately that doesn’t usually cover production states which result from heavy load or server interactions, so having a production-like dev environment for that could be a nice extra precaution. It’s still not guaranteed to find issues like this before they happen, but it could help with quicker debugging/troubleshooting.
Load testing in general is also a good idea, and may help surface some issues faster. I like the idea of your StandardErrorWorker, Molly. Putting your application into exceptional states (and other failed states) under load could be a useful practice to help spot issues across the entire monitoring stack. Maybe I’ll try adding that worker to our own Rails app. ;)
Great article! Engaging and well-written!!
Molly, I think debugging on production doesn't sound right at all.
What we do is have commits first pass through our staging env (which totally mimics our production).
Then we try things there (automated and manually), and if everything seems right we hit the button to deploy on production too.
We also use that staging env for debugging like once we had a weird issue with Celery and AWS SQS, took us a bit of time but we never debugged on production.
Thanks for the thoughtful feedback!
We are hoping to get a staging environment up and running once we have containerized our application. Hopefully, that will cover some of the differences that still exist between our vagrants and production.
Wonderfully written! Thank you for sharing 🙏
My DevOps brain says: why is there no staging environment that can mirror the production environment, or at least mirror the Sidekiq queue, for debugging?
We have CodeServer (VS Code) packaged as a CloudFormation template, so we can launch it and attach to an environment for live debugging in a case like this.
github.com/cdr/code-server
Another thought would be the nature of the tasks. I have a feeling that Fargate could time out a job just as Lambda would time out a function. I bet Sidekiq could be powered by Lambdas or Fargate, and in the case of a timeout it would just requeue the job, but having it isolated would keep non-thread-safe data from hanging the instance.
github.com/hoshinotsuyoshi/serverl...
Just some thoughts
Our Vagrant setups, for the most part, cover all of the testing cases we need, which is why we currently don't have a staging environment. We used to have one, but it was so rarely used that it would get out of date, and then on the off chance someone wanted to use it, it usually took them a day to get it back in working order. Because of this and the cost, we got rid of it a few years ago. We are currently working on containerizing our application, and once that is done we hope to be able to spin up a staging environment with the push of a button.
Thanks for sharing those resources!
Debugging in production is the mother of all problems. I wouldn't suggest it; at best, use the logging level to log more information, but don't debug against production or connect to it from your localhost.
thanks for sharing!!! great article!
Blimey, what a rollercoaster that was! The story is brilliant, and it's also one big subtle flex 💪 and I absolutely love it.
Uhh debugging on production title triggered my PTSD. I had quite the experience with production debugging 😅 nasty little thing.
😬Never fun, but sometimes it's necessary!