We have just spent the last 24 hours fixing a nasty bug in production.
But first, here's a tricky question for you: for how long do you think this code will run?
    // I'm trying to be language agnostic here
    for (i = 0; i < 1000000; i++)
        thread.sleep(1);
It would seem that the answer is obvious - a million milliseconds, which is about 17 minutes.
The correct answer is - more than 4 hours.
Wait, what?! Bear with me ;-)
The nasty production bug story
We have a background worker on the backend - a huge while loop that runs through an array of millions of elements and does all kinds of in-memory checks and manipulations on them.
But we don’t want the CPU to get stuck at 100% during this loop and choke the server, do we? We want the server to stay alive and kicking.
So what does the average Joe programmer do? That's right - Joe adds a short pause into the loop and goes home.
If there are any game developers reading this - I can already hear you giggling and reaching for the popcorn.
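Here's the thing: "pauses", AKA Sleep(), are based on the operating system's timers. With default settings the resolution of these timers is often 12-15 ms (about 15.6 ms on a typical Windows machine). On such a system you cannot pause for 1 millisecond - you wait for the next timer tick, which can be 15 ms away.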
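Don't take my word for it, measure it on your own box. Here is a minimal Python sketch (the `measure_sleep` name and the sample count are made up for illustration) that asks for a 1 ms pause many times and reports what it actually got on average. The result depends heavily on the OS, its timer settings and the runtime version, so treat the number as a property of your machine, not a constant.

```python
import time


def measure_sleep(requested_s=0.001, samples=50):
    """Return the average wall-clock duration of time.sleep(requested_s)."""
    start = time.perf_counter()
    for _ in range(samples):
        time.sleep(requested_s)
    return (time.perf_counter() - start) / samples


if __name__ == "__main__":
    avg = measure_sleep()
    print(f"asked for 1.000 ms, got {avg * 1000:.3f} ms on average")
```

On a machine with a coarse timer you should see something much closer to 15 ms than to 1 ms. On a system with high-resolution sleeps the overshoot will be far smaller.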
So on a large array with a million elements, we get 15ms * 1000000 / 1000 / 60 / 60 = 4.17 hours - more than four hours.
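The same back-of-the-envelope arithmetic as a tiny Python sketch, in case you want to plug in your own numbers. The function name `loop_duration_hours` is just an illustration. It only counts the time spent pausing and ignores the actual work done per element.

```python
def loop_duration_hours(iterations, pause_ms):
    """Total time spent sleeping, in hours, if every iteration pauses for pause_ms."""
    return iterations * pause_ms / 1000 / 60 / 60


# What we expected: 1 ms per element, shown in minutes
print(round(loop_duration_hours(1_000_000, 1) * 60, 1))  # 16.7
# What we got: ~15 ms per element, shown in hours
print(round(loop_duration_hours(1_000_000, 15), 2))  # 4.17
```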
And coming back to work in the morning, our Joe-the-programmer sees what? That his loop is still running from yesterday. The job that used to take 7 minutes (although it kept the CPU at 100%) now takes half a day to finish. In a "relaxed" mode, though.
Everything is broken and customers start filing "huh?!" tickets.
And the gamedevs reading this laugh viciously.
Because in their game development world this happens all the time. A loop like this is called a "busy loop" or a "tight loop", and you can't rely on timers to pace it.
But how do we throttle properly?
1. Use multimedia timers or timers from OpenGL/DirectX (overkill)
2. Throttle every N-th step, not every step (inelegant and stinks)
3. Dump "pauses" completely and use the magic Thread.Yield instruction - which is a "polite" way to share resources and tell the operating system "hey, I'm still busy, but if you really need this, slow me down and let other threads do the work" (and this is the best way)
The 100% CPU load won't go away, but now it's not an issue - everything is fast and responsive.
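To make options 2 and 3 concrete, here is a rough Python sketch. It is not our production code: `process_all`, `check` and `yield_every` are hypothetical names, and it uses the zero-length `time.sleep(0)` as the "yield", on the assumption that your runtime treats it as "let others run" instead of "wait for the next timer tick". With `yield_every=1` you get option 3 (yield on every step), and a bigger value gives you option 2 (every N-th step).

```python
import time


def process_all(items, check, yield_every=1):
    """Run check() on every item, offering the CPU to others every `yield_every` items."""
    results = []
    for index, item in enumerate(items, start=1):
        results.append(check(item))
        if index % yield_every == 0:
            # Zero-length sleep: no timed wait is requested, we only
            # give other threads a chance to run before continuing.
            time.sleep(0)
    return results


if __name__ == "__main__":
    doubled = process_all(range(1_000_000), lambda x: x * 2, yield_every=1000)
    print(len(doubled))
```

The key difference from the original bug is that nothing here asks for a timed pause, so the loop never waits for a timer tick it doesn't need.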
Thread.Yield is available in many languages:
Go: runtime.Gosched (I think...)
Visual Basic: DoEvents (kidding! ...although not really)
Python: time.sleep(0) (on Windows it's time.sleep(0.0001), don't ask me why... because Python...)
(by the way, it's not just Python - quite a few system libraries are smart enough to translate sleep(0) into a yield, including .NET, POSIX and WinAPI)
etc. Google your favorite language.
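If you'd rather not depend on how a particular runtime treats a zero-length sleep, you can wrap the choice in a small helper. The sketch below is in Python, and the `yield_cpu` name is made up. It calls `os.sched_yield()` where the platform provides it (some Unix systems) and falls back to `time.sleep(0)` everywhere else.

```python
import os
import time


def yield_cpu():
    """Voluntarily give up the CPU without asking for a timed pause."""
    if hasattr(os, "sched_yield"):
        # Available on some Unix platforms: explicit scheduler yield.
        os.sched_yield()
    else:
        # Fallback: zero-length sleep, assumed to act as a yield.
        time.sleep(0)


if __name__ == "__main__":
    for _ in range(1000):
        yield_cpu()
    print("done")
```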
The moral of the story, I guess, is that even trivial things get super tricky at scale. Things that used to be nice and simple when we just launched our little SaaS now get complicated when we have thousands of companies using our stuff. But that's a nice problem to have, I guess.