I would argue that the way to deal with failover in a cloud environment like Azure is to not rely on it entirely. Sure, they have excellent SLAs, but there will be downtime.
The way I would deal with it would be to have a redundant installation of your application's stack (from the app down to the database) on a number of servers in a data centre, have them replicate the live database at a regular interval, and put a load balancer up as the entry point.
The load balancer points at both the cloud and data centre versions of the stack, with the cloud version marked as the priority.
That way, when the cloud version goes down (because it will), you can fail over to the data centre version.
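The priority routing could be as simple as a health check: try the cloud stack first, and only route to the data centre when the check fails. Here's a minimal sketch of that idea in Python — the URLs are hypothetical placeholders, and a real setup would let the load balancer itself do this (most support marking a backend as a backup):

```python
import urllib.request
import urllib.error

# Hypothetical health-check endpoints -- substitute your own URLs.
PRIMARY = "https://app.cloud.example.com/health"   # the cloud stack (priority)
FALLBACK = "https://app.dc.example.com/health"     # the data-centre stack

def is_healthy(url, timeout=2):
    """Return True if the stack behind `url` answers its health check with 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def pick_backend(primary=PRIMARY, fallback=FALLBACK, check=is_healthy):
    """Prefer the cloud stack; fail over to the data centre when it is down."""
    return primary if check(primary) else fallback
```

The `check` parameter is injectable so you can test the failover logic without hitting the network.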
And you wouldn't even need top-of-the-line servers. You could display a message telling users to expect a slight degradation in page speed because some services are taking longer to respond than usual.
The hard part would be keeping everything in sync. Before the failover, you could have an app that keeps the data centre databases in sync with the cloud version, and after the failover you could have the cloud version auto-sync back from the data centre version.
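The sync app could do one-way incremental replication: on each run, copy across only the rows the other side hasn't seen yet. A toy sketch using SQLite, purely to show the shape of it — the `events` table and the assumption of a monotonically increasing `id` are mine, and a real deployment would lean on the database's own replication or log shipping instead:

```python
import sqlite3

def replicate_new_rows(src, dst, table="events"):
    """One-way incremental copy: push rows the destination hasn't seen yet.
    Assumes a monotonically increasing integer `id` column (an assumption --
    real setups would use the database engine's native replication)."""
    last = dst.execute(f"SELECT COALESCE(MAX(id), 0) FROM {table}").fetchone()[0]
    rows = src.execute(
        f"SELECT id, payload FROM {table} WHERE id > ?", (last,)
    ).fetchall()
    dst.executemany(f"INSERT INTO {table} (id, payload) VALUES (?, ?)", rows)
    dst.commit()
    return len(rows)  # how many rows were replicated this run
```

Run it on a schedule in the cloud-to-data-centre direction, then flip `src` and `dst` after a failover to sync back the other way.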
You'd have to publish all code changes to more places, and you'd have to figure out the load balancing, but it would keep your services running when the cloud goes down.
I like this, and it's similar to what I was thinking about when the outage happened.
I guess the thing I need to let go of is trying to replicate the existing stack like-for-like. So where we use Functions in Azure, we may need to look at standard web APIs with similar functionality.
Like you say, the databases are a challenge too, but we could do some replication.