Strangely, this challenge proved to be the most straightforward bit.
The customer imposed Azure as the cloud provider, so that was that.
Our requirements for this:
- shared (not dedicated) application infrastructure. To control costs, we don't want to dedicate infrastructure to customers. A full-pipeline production customer may have 5 base environments, each with some number of running instances. We don't want to automatically add a 32 GB VM every time they need another instance when there may be unused resources on an existing one. We also don't want to manually provision smaller ones or have a gazillion different VM pools.
- easy way for developers to cough up a new environment without micromanaging routes
- a customer will have the following environment levels: dev (automatically or manually deployed)/auto (for automated testing)/test (our acceptance and some manual testing)/accept (customer acceptance)/prod
- each given environment could be scaled up to a number of running instances, automatically or manually
On our side, we looked at traditional deployment pipelines that would take the code and script a delivery process on a VM which would be part of the Azure equivalent of AWS' autoscaling groups. Operationally, that would mean maintaining routing lists at the external load balancer level.
However, this would mean that the load balancer would have to route a given domain or path rule to a given VM (or to all VMs in a group), so we would have to provision and configure a different local proxy if we wanted to have multiple environments on a VM.
For example, our load balancer would need to route, say, *.customer2.com etc. But where? We don't know on which VM a running instance may be. We could label them, but then when scaling happens, we need to make sure an instance only has the proper labels to service a given customer. Also, we don't have different load balancers per customer.
The existing system was sort-of configured like this, except that the local proxy was a single Apache instance that also handled the PHP interpretation. Multi-tenant done properly (with shared infrastructure) would mean dedicated webservers which could be restarted individually, with the common routing done at proxy level.
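To make "common routing done at proxy level" a bit more concrete, here is a minimal Python sketch of the idea: hosts are mapped to a logical backend (an environment), never to a specific VM. The route table, the `customer2-dev` and `customer2-prod` backend names, and the `resolve_backend` helper are all hypothetical and only illustrate the shape of the problem, not what we actually ran.

```python
from fnmatch import fnmatch

# Hypothetical routing table: host pattern -> logical backend (an environment,
# not a VM). Every name below is made up for illustration.
ROUTES = [
    ("dev.customer2.com", "customer2-dev"),
    ("*.customer2.com", "customer2-prod"),
]


def resolve_backend(host, routes=ROUTES):
    """Return the logical backend for a host, or None if nothing matches.

    Rules are checked in order, so more specific patterns must come first.
    """
    host = host.lower()
    for pattern, backend in routes:
        if fnmatch(host, pattern):
            return backend
    return None
```

The point of the indirection is that the table never mentions a machine: where the instances of `customer2-prod` actually run, and how many there are, is somebody else's problem.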
Too complicated to do manually ...
But fortunately most of us were versed in the art of containers and we managed to cook up a Dockerized development environment in a couple of days. It was a no-brainer then to decide to use Kubernetes in Azure.
The system went like this:
- Azure AKS with nginx-ingress and a couple of static IPs (both outgoing and incoming)
- configmaps would hold the per-customer configuration
- a build would create and push a container to a registry
- a daemon inside the AKS cluster itself would poll the registry and deploy new builds automatically to QA environments
- HPAs (Horizontal Pod Autoscalers) would enable some basic autoscaling based on memory/CPU usage; later we would add more interesting rules.
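Two of the moving parts above can be sketched in a few lines of Python. This is an illustration under assumptions, not our actual daemon: `builds_to_deploy` and its arguments are hypothetical, and it only shows the decision the polling daemon has to make (which automatically deployed environments lag behind the registry). `desired_replicas` follows the ratio rule documented for the Kubernetes HPA, leaving out details such as tolerance and stabilization windows; the clamping to `max_replicas` is a simplification.

```python
import math


def builds_to_deploy(latest_in_registry, deployed, auto_envs):
    """Return {environment: tag} for environments that lag behind the registry.

    latest_in_registry: {image: newest tag} as reported by the registry
    deployed:           {environment: (image, tag)} currently running
    auto_envs:          environments that may be updated without a human
    """
    plan = {}
    for env, (image, tag) in deployed.items():
        newest = latest_in_registry.get(image)
        if env in auto_envs and newest is not None and newest != tag:
            plan[env] = newest
    return plan


def desired_replicas(current_replicas, current_metric, target_metric, max_replicas):
    """Replica count from the HPA ratio rule:
    ceil(current_replicas * current_metric / target_metric),
    clamped here to the range [1, max_replicas].
    """
    wanted = math.ceil(current_replicas * current_metric / target_metric)
    return max(1, min(max_replicas, wanted))
```

For instance, two instances averaging 90% CPU against a 60% target would be scaled to three, while only the environments listed in `auto_envs` are ever touched by the daemon.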
Changes done to the application:
- make it stateless (this was very time consuming): since containers are disposable, the application must not write files in local paths (or even shared paths, if multiple instances are expected to run when scaled up) which are needed later (for example: file uploads).
- logging to stdout: AKS collects stdout/stderr from containers, so the application should not write logs to files, but directly to output. Fortunately, there's
- use Azure for customer uploads: there's a thing called Flysystem which provides a filesystem abstraction that allows seamless access between the local filesystem (like copying from local tmp) and various cloud storage systems.
- a developer would need to copy/adjust a deployment/configmap/service and ultimately ingress, usually by editing out the relevant labels
- we ended up scripting with yq (a CLI YAML find/replace tool) and later on packaging with helm
- much later the configmaps were encrypted with sops and Azure KM and kept in the codebase.
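The copy-and-adjust step for a new environment is essentially a find/replace over a manifest. The sketch below shows that idea in Python rather than with yq; the trimmed Service template, the `ENV_NAME` placeholder and the `render` helper are hypothetical and only stand in for the kind of substitution we scripted.

```python
# Hypothetical, heavily trimmed Service manifest used as a template.
SERVICE_TEMPLATE = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "ENV_NAME", "labels": {"app": "ENV_NAME"}},
    "spec": {
        "selector": {"app": "ENV_NAME"},
        "ports": [{"port": 80, "targetPort": 8080}],
    },
}


def render(node, replacements):
    """Return a copy of a manifest with placeholder strings replaced."""
    if isinstance(node, dict):
        return {key: render(value, replacements) for key, value in node.items()}
    if isinstance(node, list):
        return [render(item, replacements) for item in node]
    if isinstance(node, str):
        for placeholder, value in replacements.items():
            node = node.replace(placeholder, value)
        return node
    return node  # numbers, booleans and None are kept as they are


# A new environment is the template with its labels filled in.
service = render(SERVICE_TEMPLATE, {"ENV_NAME": "customer2-test"})
```

Because `render` builds a new structure instead of editing in place, the template can be reused for every environment, which is roughly what packaging with helm later gave us in a more robust form.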
Phew! This was by far the fastest bit. Two days to make enough changes to create a local docker-compose setup, two more for the initial setup in Azure ... but quite some time to make the application stateless. Uploads were a fairly quick thing to do, but for some time afterwards we would keep discovering unexpected places where the application relied on locally produced files. Of course, frequently used things were quickly discovered and fixed, but more obscure features came back with a vengeance (then again, obscure features were always a pain since they never found a place in test suites).
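The statelessness work boils down to one rule: application code talks to a storage interface, never to a local path. A minimal sketch of that pattern, in Python rather than the application's own language, might look like the following. The `Storage` protocol, both implementations and `save_upload` are hypothetical; `InMemoryStorage` merely stands in for a remote object store and makes no cloud SDK calls.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class Storage(Protocol):
    """What the application is allowed to know about file storage."""

    def write(self, key: str, data: bytes) -> None: ...

    def read(self, key: str) -> bytes: ...


class LocalStorage:
    """Files under a root directory; fine for a single developer machine."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def write(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def read(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemoryStorage:
    """Stand-in for a remote object store shared by all instances."""

    def __init__(self) -> None:
        self._objects = {}

    def write(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def read(self, key: str) -> bytes:
        return self._objects[key]


def save_upload(storage: Storage, customer: str, filename: str, data: bytes) -> str:
    """Store an upload and return the key to remember, never a local path."""
    key = f"{customer}/uploads/{filename}"
    storage.write(key, data)
    return key
```

Every place that bypassed such an interface and wrote straight to disk was one of the "unexpected places" we kept finding later.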
Onwards, to glory!