DevOps is a challenging beast when you decide to move from a few EC2 instances and an RDS instance in AWS to a Kubernetes cluster, even a managed one, so it’s safe to say we’ve learned a few lessons in the process, this one included. Towards the end of 2018, the Open Banking SaaS we’d been developing for a while started to take off and we made a few sales to a couple of big banks. We knew the hacked-together bash scripts and manually provisioned EC2 instances weren’t going to cut it, so we decided to embrace Kubernetes. I’d say I am reasonably proficient with Docker and had used PaaS offerings such as Cloud Foundry, so in my naivety I thought: how hard could this Kubernetes stuff be? Very soon into the process, we realised how much we still had to learn and how steep the learning curve would be.
Fast forward some time and we’re managing multiple clusters, provisioning them automatically from our CI/CD pipelines, and things are going well. We do continuous deployment into our “master” cluster, which is essentially the latest and greatest version of our SaaS offering. We tear down and rebuild this environment nightly at 2 am to ensure we can reliably provision a cluster and that no manual tweaks are what is keeping it stable. We notice that our functional tests fail every morning, yet the problem seems to resolve itself by the time we all start work. We start by looking at the logs from the deployment, but they give nothing away, so we add more logging and debugging. We then reach for the common DNS tools, dig and nslookup.
    $ nslookup service.bank.master.forgerock.financial
    Server:   127.0.0.53
    Address:  127.0.0.53#53

    Non-authoritative answer:
    *** Can't find service.bank.master.forgerock.financial: No answer

    $ dig service.bank.master.forgerock.financial

    ; <<>> DiG 9.11.3-1ubuntu1.9-Ubuntu <<>> service.bank.master.forgerock.financial
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44619
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 65494
    ;; QUESTION SECTION:
    ;service.bank.master.forgerock.financial. IN A

    ;; Query time: 49 msec
    ;; SERVER: 127.0.0.53#53(127.0.0.53)
    ;; WHEN: Tue Oct 22 15:27:16 BST 2019
    ;; MSG SIZE rcvd: 68
They show that DNS isn’t resolving the A records for some of our domains. An A record is essentially the mapping between a domain name and the IPv4 address it should resolve to. We’ve created a wildcard A record for our subdomain, so it should be a catch-all, right…?
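One detail in the dig output is worth spelling out: the status is NOERROR but the answer count is 0. That combination (often called NODATA) means the resolver believes the name exists but has no record of the requested type, which is different from NXDOMAIN, where the name does not exist at all. The sketch below is only an illustration of how to read those two header fields; the function name `classify_dig_output` is made up for this example and is not part of dig or any library.

```python
import re


def classify_dig_output(text: str) -> str:
    """Classify dig output as ANSWER, NODATA, NXDOMAIN or OTHER.

    Hypothetical helper: it only inspects the 'status:' and 'ANSWER:'
    fields that dig prints in its header lines.
    """
    status = re.search(r"status: (\w+)", text)
    answers = re.search(r"ANSWER: (\d+)", text)
    if status is None or answers is None:
        return "OTHER"
    if status.group(1) == "NXDOMAIN":
        # The name itself does not exist.
        return "NXDOMAIN"
    if status.group(1) == "NOERROR":
        # The name exists; zero answers means no record of the queried type.
        return "ANSWER" if int(answers.group(1)) > 0 else "NODATA"
    return "OTHER"


# Header lines copied from the dig output above.
sample = (
    ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44619\n"
    ";; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1\n"
)
print(classify_dig_output(sample))
```

For the header above this classifies the response as NODATA, which is a hint that something at or below that name exists in the zone, even though no A record is returned.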
This is where cert-manager comes in. If you’re not familiar with it, cert-manager is a service that pairs nicely with the TLS certificate authority Let’s Encrypt, which is great when you want simple, automated TLS certificates that renew themselves. For Let’s Encrypt to issue a TLS certificate for your domain, it must first verify that you control the domain. There are two ways cert-manager can do this:
- Host an HTTP endpoint that serves a string Let’s Encrypt knows about (the HTTP-01 challenge)
- Add a DNS TXT record containing a string Let’s Encrypt knows about (the DNS-01 challenge)
We do the latter, so we give cert-manager a restricted service account that lets it dynamically add a DNS TXT record and prove to Let’s Encrypt that we own the domain. A TXT record is simply a record that can hold an arbitrary string.
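To make the TXT record less abstract: in the ACME protocol that Let’s Encrypt speaks (RFC 8555), the DNS challenge record lives at `_acme-challenge.<domain>`, and its value is the unpadded base64url encoding of a SHA-256 digest over the challenge token and the account key thumbprint. The sketch below shows that calculation only. The function name, the token and the thumbprint are placeholders invented for the example; this is not cert-manager’s code, and a real client receives the token from the ACME server.

```python
import base64
import hashlib
from typing import Tuple


def dns01_txt_record(domain: str, token: str, account_thumbprint: str) -> Tuple[str, str]:
    """Return (record name, record value) for an ACME DNS-01 challenge.

    Hypothetical helper; token and account_thumbprint are illustrative
    inputs that a real ACME client would obtain from the server and
    derive from its account key.
    """
    # Key authorization is "<token>.<thumbprint of the account key>".
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url without the trailing '=' padding.
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"_acme-challenge.{domain}", value


# Placeholder inputs, not real challenge data.
name, value = dns01_txt_record("service.example.com", "example-token", "example-thumbprint")
print(name)
print(value)
```

The point to take away is the record name: the TXT record is created underneath the domain being validated, so every certificate request adds a new name to the zone.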
Back to our problem. We’re trying to diagnose why some of our domains resolve and others don’t when we come across an issue on the cert-manager GitHub repo. It describes a situation where adding a TXT record on a subdomain covered by a wildcard stops the wildcard A record from being returned for that subdomain. We quickly dismiss it at first: surely this cannot be true, or we would have heard of it before. My colleague presses the idea further, though, and we also come across a 2011 blog post on the same topic. So we go to work trying to reproduce the issue manually: we add a TXT record on a subdomain and poll it using dnschecker.org, and slowly, one DNS server at a time, the DNS A record disappears. We remove the TXT record and the A record comes back to life. We add an explicit A record on the subdomain and the issue with the TXT record no longer manifests. The solution to this problem was to have an explicit A record per subdomain.
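This behaviour is consistent with the standard wildcard rules (RFC 4592): a wildcard is only used to synthesise an answer when the queried name does not exist in the zone, and a name counts as existing as soon as it, or any name beneath it, owns a record of any type. The toy model below sketches that rule. It is a simplification and not how any particular DNS server is implemented; the `example.com` names and the `192.0.2.10` address are placeholders, and it covers both a TXT record on the subdomain itself and one on a name beneath it.

```python
from typing import Dict, List, Optional

# Toy zone: owner name -> record type -> values.
Zone = Dict[str, Dict[str, List[str]]]


def name_exists(zone: Zone, name: str) -> bool:
    """A name exists if it owns records or has any descendant that does."""
    return any(owner == name or owner.endswith("." + name) for owner in zone)


def resolve(zone: Zone, qname: str, qtype: str) -> Optional[List[str]]:
    """Return the records, [] for NODATA, or None for NXDOMAIN."""
    if name_exists(zone, qname):
        # The name exists, so the wildcard is never consulted.
        return zone.get(qname, {}).get(qtype, [])
    labels = qname.split(".")
    for i in range(1, len(labels)):
        encloser = ".".join(labels[i:])
        if name_exists(zone, encloser):
            # Closest existing ancestor found: only its wildcard may apply.
            wildcard = zone.get("*." + encloser)
            return None if wildcard is None else wildcard.get(qtype, [])
    return None


zone: Zone = {"*.example.com": {"A": ["192.0.2.10"]}}
print(resolve(zone, "service.example.com", "A"))  # wildcard answers

# A TXT record at or below the name makes the name exist.
zone["_acme-challenge.service.example.com"] = {"TXT": ["challenge"]}
print(resolve(zone, "service.example.com", "A"))  # now NODATA

# An explicit A record restores resolution regardless of the TXT record.
zone["service.example.com"] = {"A": ["192.0.2.10"]}
print(resolve(zone, "service.example.com", "A"))
```

In this model the wildcard stops answering the moment the TXT record appears, and an explicit A record per subdomain sidesteps the problem, which matches what we observed.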