So I am in something of a complicated relationship with Azure.
I like that (in general) it makes my life easier.
I like that hooking up continuous integration is so incredibly easy.
I like that managing deployment slots and setting up new ones is logical and can be done quickly (albeit with something of a deployment wait); and I like that you can configure instances that will scale up or down depending on the demands that are made on their resources.
I don't like how long everything seems to take to update/deploy/propagate.
I don't like that the UI seems to have been built by about 200 people in simultaneous development so that sometimes things happen automagically and sometimes you have to hit 8 different confirm buttons before it registers that yes, you really do want to do that.
I don't like trying to troubleshoot performance issues when there are so many different places for logs/analytics/insights.
And I don't like that occasionally their idea of an error message is an unhappy cloud.
Recently, I was trying to get to the bottom of some rather frustrating performance issues on our Azure cloud catalog.
The symptoms included:
- intermittent downtime
- slow app restarts
- laggy front-end performance
One .NET Core app in particular was very sickly and would classically take 7-8 minutes to find its wee feet again when restarted. Bafflingly, it was also one of our simplest, smallest, lowest-traffic apps. So what gives?
Cue a montage (although in reality it was more an increasingly frustrating, ever-decreasing spiral) of trawling through spiky graph after spiky graph in Application Insights, downloading memory dumps, clicking hopefully through every log folder in blob storage, and tentatively poking through various routes on the "Diagnose and Solve Problems" dashboard, which wants to "chat" to you. Endearing.
I started using phrases like "possible thread starvation" when colleagues asked how I was getting on, and spent enough time reading about startup configuration in .NET Core that I could troubleshoot app bootstrapping at 50 paces, and yet I still felt no closer to a solution.
Although, that's not strictly true. I knew a little more about why things were happening...
- we have ~7 production sites sitting within one App Service Plan, and this plan scaled up and down on a schedule (7am and 10pm), as well as when resources were under pressure or released outwith this period
- when the plan scaled, the app service instances within it were either spun up or wound down for the sites, and it was this period that made the poor wee .NET Core app the most unhappy
- the .NET Core app was the one Pingdom kept pulling up for downtime issues, but actually all of the apps had a bit of a wobble during the restarts (they were just sitting under different alert criteria, doh!)
With this information, I could at least narrow my conversation with Google from the abstract and teenage-angst flavoured "but why?" to a more concrete "managing azure app restarts" and "configuring multiple instances of net core apps". This was small but hopeful progress.
Further investigation and coding montages led me to a set of guidance that I will lay out here for future reference, and for anyone else trying to nurse sickly Azure Web Apps back to health:
First, and the biggest win for me: the AlwaysOn setting on the Application Settings tab. For those not familiar:
> When Always On is enabled on a site, Windows Azure will automatically ping your Web Site regularly to ensure that the Web Site is always active and in a warm/running state. This is useful to ensure that a site is always responsive (and that the app domain or worker process has not paged out due to lack of external HTTP requests).

(Extracted from Scott Guthrie's blog.)
Sounds sensible, eh? And it is - on production sites. But, and here is the small hole we'd dug for ourselves having been lent a shovel by Microsoft, it is not a slot-specific option, so - to avoid production sites idling by accident after a staging swap - we had the AlwaysOn option always on. On every slot. On every environment. For every project.
That means that every time our 7 production sites scaled up, we'd get (e.g.) 2 instances of each, and both of these would get restarted, warmed up, and then pinged to ensure they were AlwaysOn. So far, so good. But then all of the staging and dev slots would also be pinged and forced to start up, and the sheer volume of I/O totally destroyed the performance of, well, pretty much everything, giving the perceived downtime. Why does an Azure Web App suffer so much with this? That's a different kettle of fish.
There's no nice way of managing this for us at the moment - if you handle slot swaps with a script, I imagine you can toggle the AlwaysOn option post-swappage. We've just had to add it as a manual check at the end of a deployment. It's not the end of the world, but it's certainly a little irritating nuance to be aware of!
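If you do script your swaps, a sketch of that post-swap toggle using the Azure CLI might look like the below. The resource group, app, and slot names here are placeholders, not our actual setup:

```shell
# Placeholder names - substitute your own resource group, app, and slots.
RESOURCE_GROUP="my-resource-group"
APP_NAME="my-web-app"

# Swap the staging slot into production.
az webapp deployment slot swap \
  --resource-group "$RESOURCE_GROUP" \
  --name "$APP_NAME" \
  --slot staging \
  --target-slot production

# Then switch Always On back off on the (now) staging slot, so the
# scale-up ping storm doesn't warm every non-production instance too.
az webapp config set \
  --resource-group "$RESOURCE_GROUP" \
  --name "$APP_NAME" \
  --slot staging \
  --always-on false
```

Running this at the end of a deployment pipeline would replace the manual check entirely.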
Other, smaller wins included: moving from IMemoryCache to IDistributedCache on the .NET Core app (to minimise I/O storage writing, and to let us take future advantage of load balancing), and ensuring that the HTTPS Only flag is set to true so that the app initializer isn't bounced around anywhere silly on startup.
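For the record, that HTTPS Only flag can also be set from the Azure CLI rather than the portal - handy if you want it baked into a provisioning script. Again, the names below are placeholders:

```shell
# Placeholder names - substitute your own resource group and app.
# Forces all HTTP traffic to redirect to HTTPS at the platform level,
# so the app initializer isn't bounced through an HTTP hop on startup.
az webapp update \
  --resource-group "my-resource-group" \
  --name "my-web-app" \
  --https-only true
```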