Graceful Shutdown Is A Lie

Sam Thorogood ・1 min read

If you're writing backend code, you should always assume that your service is going to crash. 💥

This was something I learned when I was designing reliable code that was performing (effectively) a transaction using Megastore. This service needed to be live, always: our failure cases were that it crashed, or we were upgrading the binary.

This is fairly well understood for web servers, where e.g. a Node server is usually shut down only by Ctrl-C, or a Go server can only "fail":

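func main() {
  // fooHandler is assumed to be an http.Handler defined elsewhere.
  http.Handle("/foo", fooHandler)
  // ListenAndServe only ever returns a non-nil error, so the only exit is a failure.
  log.Fatal(http.ListenAndServe(":8080", nil))
}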

But it's less clear what a crash means while performing some operation, like database or file operations. When can your server crash awkwardly?

  • when you're writing a file to disk over a previous file: for a time, neither the old contents nor the complete new contents exist on disk

    • ... solve by writing a temporary file and replacing the previous one
  • updating an aggregate count (e.g., when you insert a row, add one to the 'total'): if the second operation fails, the total no longer matches the rows

    • ... solve with database transactions
  • marking an email as being sent: what if the database fails after the email has already gone out?

    • ... sending a second email is probably fine ¯\_(ツ)_/¯

In all these cases, it pays to code defensively. This isn't a long post, but just something to think about: at what point could the power 🔌 be unplugged from your provider? What's the risk?
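
For the email case in the list, one way to think about the risk is to pick which failure you'd rather have. The sketch below sends first and marks second, so a crash in between can only cause a duplicate, not a lost email. Everything here (`Outbox`, `deliver`, `flakyOutbox`) is a made-up name for illustration, and the "database" is just an in-memory map that fails once on purpose.

```go
package main

import (
	"errors"
	"fmt"
)

// Outbox is a hypothetical store that remembers which emails were sent.
type Outbox interface {
	IsSent(id string) (bool, error)
	MarkSent(id string) error
}

// deliver sends the email first and records it second. If we crash between
// the two steps, a retry sends a duplicate, but no email is ever lost.
func deliver(o Outbox, id string, send func(id string) error) error {
	sent, err := o.IsSent(id)
	if err != nil {
		return err
	}
	if sent {
		return nil // already recorded, nothing to do
	}
	if err := send(id); err != nil {
		return err
	}
	return o.MarkSent(id)
}

// flakyOutbox is an in-memory stand-in whose first MarkSent call fails,
// simulating the database going away right after the email went out.
type flakyOutbox struct {
	sent   map[string]bool
	failed bool
}

func (f *flakyOutbox) IsSent(id string) (bool, error) { return f.sent[id], nil }

func (f *flakyOutbox) MarkSent(id string) error {
	if !f.failed {
		f.failed = true
		return errors.New("database unavailable")
	}
	f.sent[id] = true
	return nil
}

func main() {
	box := &flakyOutbox{sent: map[string]bool{}}
	sends := 0
	send := func(id string) error { sends++; return nil }

	// Retry until the send has been recorded.
	for deliver(box, "welcome-42", send) != nil {
	}
	fmt.Println("emails sent:", sends) // prints "emails sent: 2"
}
```

Flipping the order (mark first, send second) trades the duplicate for a possibly missing email, which is usually the worse outcome.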

Thanks!

Discussion

Stefan Dorresteijn

Writing your code assuming that it'll crash is a pretty decent standard either way. When I was writing Elixir, that was the standard and it created far fewer awkward situations, because we never just assumed something was done. It was always checked, or assumed to have gone wrong. Sure, we sometimes sent one notification too many when our notification service lagged out and ended up sending the notification anyway, but nothing was lost!

johnfound

Well, we have had a huge problem with various software running on Windows on computers that can, in some circumstances, be powered off instantly. In these cases, even Windows itself was regularly broken, forcing us to restore the whole computer from a backup image.

The solution I use for such computers and software is very simple: Linux with a journaling file system such as Ext4, and wide use of the SQLite database, which is very, very robust against crashes and power interruptions.

Now we have systems that can be shut down instantly by simply powering them off, and that have kept working for years without any information loss or damage to files or the OS.