Discussion on: Explain Required Downtime Like I'm Five


So, you want to make a sandwich before the cartoons start.

You need to get the peanut butter and jelly and the bread, and find a knife. You thought you had it all ready, but the silverware drawer is empty and you have to wait until the dishwasher is done.

Or a slice of bread slips out of your hand and lands jelly-side down on the kitchen floor and you have to throw it out and start over.

Or your brother asks for one too, but wants the other bread, the creamy peanut butter and the crusts cut off.

Or you expect that any of those things could happen, so you leave enough time to get a clean knife, clean up after yourself, and make a second or third sandwich well before Itchy and Scratchy start fighting.

The hours-long maintenance windows I see in my life go to 1) interactive network-backed mobile games, where you want the servers unavailable while the games are uploaded to and accepted by two separate app stores, or 2) big changes to large compute clusters where jobs can take weeks on fast, GPU-laden high-memory nodes, because science. With the latter, it's Puppet or CFEngine changing several thousand machines, and because downtime is so rare, you often get "while we're down, we'll change out the switches/cooling/power".

In either case, I'm sure much of the work can't be done in parallel. You can't spread the jelly and wash the knife at the same time, for example.
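
To put rough numbers on that, here's a minimal sketch in Python. The task names, durations, and dependencies are all made up for illustration (there's one knife, so the jelly has to wait for the peanut butter). It compares doing everything one step at a time with doing as much as possible at once.

```python
from functools import lru_cache

# Hypothetical sandwich tasks: minutes each one takes (invented numbers).
DURATION = {
    "wash_knife": 5,
    "get_bread": 1,
    "spread_peanut_butter": 2,
    "spread_jelly": 2,
    "clean_up": 3,
}

# What has to be finished before each task can start (also assumed).
DEPENDS_ON = {
    "wash_knife": [],
    "get_bread": [],
    "spread_peanut_butter": ["wash_knife", "get_bread"],
    "spread_jelly": ["spread_peanut_butter"],  # same knife, so it waits
    "clean_up": ["spread_jelly"],
}


@lru_cache(maxsize=None)
def earliest_finish(task):
    """Earliest minute a task can finish if independent tasks overlap."""
    start = max((earliest_finish(dep) for dep in DEPENDS_ON[task]), default=0)
    return start + DURATION[task]


def shortest_window():
    """Length of the longest dependency chain: the floor on the window."""
    return max(earliest_finish(task) for task in DURATION)


def serial_total():
    """Time needed if nothing at all overlaps."""
    return sum(DURATION.values())


if __name__ == "__main__":
    print("one step at a time:", serial_total(), "minutes")
    print("as parallel as possible:", shortest_window(), "minutes")
```

With these invented numbers, only fetching the bread overlaps with anything, so extra hands barely shorten the window. The long chain of steps that each wait on the previous one sets the minimum.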