I wanted to write this post because the most common question I get when I issue a pull request for an optimization is, "How the hell did you find that?!" My answer 90% of the time is, "I read the logs."
I think reading logs is a lost art these days. We have so many awesome monitoring tools that allow us to keep tabs on our applications that it's easy to forget about a simple resource like logs. Monitoring tools are great for getting a big picture look at your application. However, when it comes to the nitty gritty about how your code is interacting with your datastores and third party services, nothing beats reading the logs.
The next time you are looking to make some code optimizations consider doing this. Go into your development or staging environment and set your logging to debug EVERYWHERE!
Set it for your framework.
pry(main)> Rails.logger.level = 0
Set it for your gems.
pry(main)> Search.connection.transport.logger = Logger.new(STDOUT)
Set it for every one of your related services.
$ redis-cli monitor > commands-redis-2018-10-01.txt
Once your logging configurations are set, let loose on your application. Load webpages, run background workers, even jump into a console and run some commands by hand. When you are done, look at the logs that are produced. Those logs are going to tell you A LOT about how your code is interacting with all of your datastores and third party services. I would bet that some of it, is not interacting how you would expect. I cannot stress enough how valuable something as simple as reading logs can be when it comes to making optimizations in an application.
Finding a piece of code to optimize sometimes is as simple as reading the logs. Other times, it can be a little more involved. Here is one example of how my team used logs to find and fix a piece of code.
The first indication that there was a problem was a Redis High Traffic Alert we got from our monitoring service.
Usually when a piece of code is underperforming in our application the first thing we see is some sort of alert like a high traffic alert, slow query alert, timeout alert, etc. After receiving this alert, the next thing we did was look at the Redis traffic graphs during the timeframe when the alert occurred.
We can see from this graph the problem piece of code is running for a few hours early in the morning. Now that we know the WHERE and WHEN, the next step is to figure out the WHAT. This is where service monitoring comes in handy. Use your monitoring tools to figure out what was running at the time. Using our monitoring service we were able to figure out that during that time window we were running a lot of one specific type of Resque worker.
Once we pinned down the traffic to that one worker, the final piece of the puzzle was figuring what that worker was doing to cause the load. This is where the reading the logs comes in!
Since we knew what worker boxes were causing the issue we jumped onto one of them and fired up redis-cli.
$ redis-cli monitor
We let that run for 30 seconds while the workers were doing work and this is a snippet of what we got.
# REDIS LOG FORMAT # timestamp [redis_db ip_address:PID] command argument 1544451182.095873 [15 18.104.22.168:15288] "exists" "worker-01-shard-1" 1544451182.095888 [15 22.214.171.124:63462] "exists" "worker-02-shard-3" 1544451182.095941 [15 126.96.36.199:55676] "exists" "worker-05-shard-3" 1544451182.095952 [15 188.8.131.52:55596] "exists" "worker-02-shard-3" 1544451182.095983 [15 184.108.40.206:55682] "exists" "worker-09-shard-2" 1544451182.096225 [0 220.127.116.11:19286] "llen" "batching:indexing-worker:medium_priority" 1544451182.096260 [15 18.104.22.168:55564] "exists" "worker-03-shard-3" 1544451182.096280 [15 22.214.171.124:18378] "exists" "worker-02-shard-3" 1544451182.096309 [15 126.96.36.199:55594] "exists" "worker-02-shard-2" 1544451182.096439 [15 188.8.131.52:14774] "exists" "worker-03-shard-2" 1544451182.096502 [15 184.108.40.206:12098] "exists" "worker-03-shard-3" 1544451182.096672 [15 220.127.116.11:32618] "exists" "worker-03-shard-3"
To our surprise, we found that we were making a ton of EXISTS requests from a specific set of worker boxes. We then traced the EXISTS requests back to a gem that we were using to throttle our Resque workers. We were pretty sure those requests were the ones overwhelming Redis. The last thing left to do was confirm it. We removed the throttling code, and sure enough, that eliminated the Redis traffic.
The code examples I show use Ruby, but honestly, the concept of reading logs can apply to any language. The next time you find yourself faced with slow code or a bug you can't find, consider reading your logs. Best case, they help you solve your problem. Worst case, you get a better understanding about how your code is working.