DEV Community

Adam

Solving Cert-Manager and Azure Application Gateway Integration for AKS

Hey all! I recently went through a bit of a journey getting cert-manager to play nicely with Azure Application Gateway and AKS (Azure Kubernetes Service). It was a learning experience, to say the least, and I wanted to share what I discovered in hopes that it might help others facing the same challenges.

The Problem

I was setting up an AKS cluster and wanted to use cert-manager to automate Let's Encrypt TLS certificate issuance for my services. I'm using Azure Application Gateway Ingress Controller (AGIC) to manage ingress to my cluster.

Everything was going smoothly until I tried to obtain certificates using the HTTP-01 challenge. I kept running into 404 and 502 errors when the Let's Encrypt validation servers tried to access the challenge URL.

Symptoms

  • 404 Not Found when accessing http://mydomain.com/.well-known/acme-challenge/*.
  • 502 Bad Gateway errors from the Application Gateway.
  • The health probes in Application Gateway were failing, marking the backend pool as unhealthy.

The Culprit

After digging into logs and configurations, I realized that Azure Application Gateway wasn't correctly routing the Let's Encrypt HTTP-01 challenge requests to cert-manager's solver pod. Here's what was happening:

  1. Cert-manager creates temporary pods and services to respond to the HTTP-01 challenge.
  2. AGIC wasn't updating the Application Gateway configuration quickly enough to route traffic to these temporary resources.
  3. The default health probes in Application Gateway were failing because the solver pods return 404 for any path except the specific challenge URL, and by default a probe only counts status codes 200-399 as healthy.

The Solution

To fix this, I needed to:

  1. Create a dedicated service that consistently routes traffic to any cert-manager solver pod.
  2. Create a dedicated ingress that directs the challenge path to this service.
  3. Adjust the health probe settings in Application Gateway to consider 404 responses as healthy.

Step 1: Create the Service

I created a service named acme-challenge-service that selects any pod with the label acme.cert-manager.io/http01-solver=true, which cert-manager adds to its solver pods.

apiVersion: v1
kind: Service
metadata:
  name: acme-challenge-service
  namespace: hitc5  # Replace with your actual namespace
spec:
  selector:
    acme.cert-manager.io/http01-solver: "true"
  ports:
    - protocol: TCP
      port: 8089  # Port exposed by the Service
      targetPort: 8089  # Port the solver pods are listening on 
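
A quick way to sanity-check the selector is to look at which pods it matches and whether the Service has any endpoints. The sketch below wraps two standard kubectl queries in helper functions; the function names are made up for illustration, and the namespace defaults to the hitc5 value used in this post.

```shell
# Namespace from the manifest above; override with NAMESPACE=... if yours differs.
NAMESPACE="${NAMESPACE:-hitc5}"

# List the solver pods that the Service selector would match.
list_solver_pods() {
  kubectl get pods -n "$NAMESPACE" \
    -l acme.cert-manager.io/http01-solver=true -o wide
}

# Show the pod IPs currently backing the Service. An empty list simply means
# no solver pod is running right now, which is normal when no challenge is pending.
list_service_endpoints() {
  kubectl get endpoints acme-challenge-service -n "$NAMESPACE"
}

# Usage (requires cluster access):
#   list_solver_pods
#   list_service_endpoints
```

Because solver pods only exist while a challenge is in flight, these checks are most useful right after requesting a certificate.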

Step 2: Create the Dedicated Ingress

Next, I created an ingress resource specifically for the challenge path.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: acme-challenge-ingress
  namespace: hitc5
  annotations:
    appgw.ingress.kubernetes.io/backend-protocol: "http"
    appgw.ingress.kubernetes.io/request-timeout: "300"
    appgw.ingress.kubernetes.io/use-private-ip: "false"
    appgw.ingress.kubernetes.io/request-logging-enabled: "true"
    appgw.ingress.kubernetes.io/diagnostic-settings: "true"
    appgw.ingress.kubernetes.io/health-probe-status-codes: "200-499"
spec:
  ingressClassName: azure-application-gateway
  rules:
    - host: mc.10kvtech.co.uk  # Replace with your domain
      http:
        paths:
          - path: "/.well-known/acme-challenge/*"
            pathType: Prefix
            backend:
              service:
                name: acme-challenge-service
                port:
                  number: 8089

Key points:

  • The health-probe-status-codes annotation widens the range of status codes that Application Gateway treats as healthy to 200-499, so the solver's 404 responses no longer mark the backend as unhealthy.
  • The ingress routes the challenge path to the acme-challenge-service.

Step 3: Update Application Gateway Health Probes

By adjusting the health probe settings through the ingress annotations, the Application Gateway now considers the backend healthy even if it receives a 404 status code. This is crucial because the cert-manager solver pods return 404 for any path other than the exact challenge URL.
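
To make the effect of the range concrete, here is a small shell sketch of the matching rule. It assumes Application Gateway's default healthy range of 200-399 and treats the annotation value as a single low-high range. The helper name status_is_healthy is hypothetical and is not part of AGIC or Azure; it only models the comparison.

```shell
# Hypothetical helper: succeeds when an HTTP status code falls inside an
# inclusive "low-high" range such as "200-399".
status_is_healthy() {
  code="$1"
  range="$2"
  low="${range%%-*}"   # text before the first "-"
  high="${range##*-}"  # text after the last "-"
  [ "$code" -ge "$low" ] && [ "$code" -le "$high" ]
}

# A solver pod answers 404 on the probe path. Compare the default range
# with the widened range from the annotation.
for range in "200-399" "200-499"; do
  if status_is_healthy 404 "$range"; then
    echo "range $range: 404 is healthy"
  else
    echo "range $range: 404 is unhealthy"
  fi
done
```

With the default range the 404 counts as a failed probe, which is what took the backend pool out of rotation and produced the 502s.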

Testing the Setup

After applying these configurations, I wanted to ensure everything was working correctly.

Checking the Certificate Status

To verify that the certificate was successfully issued, I used the following commands:

# Check certificate status and watch for changes
kubectl get certificate -n hitc5 -w
# Describe certificate for detailed status
kubectl describe certificate mc-10kvtech-tls -n hitc5

Sample Output:

Name:         mc-10kvtech-tls
Namespace:    hitc5
...
Status:
  Conditions:
    Type:     Ready
    Status:   True
    Reason:   Issued
    Message:  Certificate issued successfully
...

When the Status shows True and the Reason is Issued, it means the certificate was successfully obtained and is ready for use.
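
If you'd rather not watch the output by hand, kubectl wait can block until the Ready condition turns true. This is a minimal sketch: the wrapper function name and the 300s default timeout are my own choices for illustration, not something cert-manager prescribes.

```shell
# Block until the Certificate reports Ready=True, or fail after a timeout.
# Arguments: certificate name, namespace, timeout (defaults to 300s).
wait_for_certificate() {
  cert_name="$1"
  namespace="$2"
  timeout="${3:-300s}"
  kubectl wait --for=condition=Ready "certificate/$cert_name" \
    -n "$namespace" --timeout="$timeout"
}

# Usage (requires cluster access):
#   wait_for_certificate mc-10kvtech-tls hitc5 && echo "certificate is ready"
```

The command exits non-zero if the timeout passes first, which makes it handy in deployment scripts.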

Checking Ingress Resources

To see all the ingresses and their routing, especially to confirm that the ACME challenge ingress was created correctly:

# Check all ingresses and their routing
kubectl get ingress -n hitc5 -o yaml

# Check specifically for the ACME solver ingress
kubectl get ingress -n hitc5 -l acme.cert-manager.io/http01-solver=true -o yaml

Checking the ACME Challenge Status

To monitor the status of the ACME challenges:

kubectl get challenges -n hitc5
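
For a more compact view, a jsonpath query can print just the state and reason of each challenge. This sketch assumes the Challenge resources expose status.state and status.reason, as current cert-manager versions do; the function name is hypothetical.

```shell
# Print one line per Challenge: name, state, and the reason cert-manager recorded.
# Argument: namespace.
list_challenge_states() {
  namespace="$1"
  kubectl get challenges -n "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\t"}{.status.reason}{"\n"}{end}'
}

# Usage (requires cluster access):
#   list_challenge_states hitc5
```

The reason column is usually the fastest pointer to what went wrong, since it records the error from the last self-check or validation attempt.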

Checking Cert-Manager Logs for Challenge Failures

If there are issues, it's helpful to look at the cert-manager logs:

# Check cert-manager logs for challenge failures
kubectl logs -n cert-manager -l app=cert-manager
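
One thing to watch for: when kubectl logs is used with a label selector, it only shows the last 10 lines per pod unless you pass --tail. The sketch below asks for more lines and filters them down to challenge-related entries. The function names, the 500-line default, and the grep pattern are my own choices for illustration.

```shell
# Keep only log lines that mention ACME challenges (case-insensitive).
filter_challenge_logs() {
  grep -iE 'challenge|http01|acme'
}

# Fetch recent controller logs. With a label selector, kubectl logs defaults
# to the last 10 lines per pod, so request a larger tail explicitly.
recent_cert_manager_logs() {
  lines="${1:-500}"
  kubectl logs -n cert-manager -l app=cert-manager --tail="$lines"
}

# Usage (requires cluster access):
#   recent_cert_manager_logs 1000 | filter_challenge_logs
```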

Testing the Challenge URL Before the Fix

Before applying the fix, accessing the challenge URL resulted in a 502 Bad Gateway error:

curl -v http://mc.10kvtech.co.uk/.well-known/acme-challenge/test

Output:

*   Trying :80...
* Connected to mc.10kvtech.co.uk () port 80 (#0)
> GET /.well-known/acme-challenge/test HTTP/1.1
> Host: mc.10kvtech.co.uk
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 502 Bad Gateway
< Server: Microsoft-Azure-Application-Gateway/v2
< Date: Mon, 02 Dec 2024 15:49:23 GMT
< Content-Type: text/html
< Content-Length: 183
< Connection: keep-alive
<
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>Microsoft-Azure-Application-Gateway/v2</center>
</body>
</html>
* Connection #0 to host mc.10kvtech.co.uk left intact

This happened because the Application Gateway marked the backend as unhealthy due to failing health probes.

Testing the Challenge URL After the Fix

After applying the fix and ensuring the backend was healthy, I tested the challenge URL with the actual challenge token:

curl -v http://mc.10kvtech.co.uk/.well-known/acme-challenge/<challenge-token>

Output:

*   Trying :80...
* Connected to mc.10kvtech.co.uk () port 80 (#0)
> GET /.well-known/acme-challenge/<challenge-token> HTTP/1.1
> Host: mc.10kvtech.co.uk
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: text/plain
< Date: Mon, 02 Dec 2024 16:00:00 GMT
< Content-Length: 87
<
<challenge-token>.<account-key-thumbprint>
* Connection #0 to host mc.10kvtech.co.uk left intact

I received the expected challenge response, indicating that the request was correctly routed to the solver pod.
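
The same check can be scripted instead of copying the token by hand. This is a rough sketch that reads the domain, token, and expected response from the Challenge resource (its spec.dnsName, spec.token, and spec.key fields) and compares the expected value with what the public URL serves. The function name is hypothetical, and it only works while the challenge is still pending.

```shell
# Compare what the public URL serves with what cert-manager expects.
# Arguments: Challenge resource name, namespace.
check_challenge_response() {
  challenge="$1"
  namespace="$2"
  host=$(kubectl get challenge "$challenge" -n "$namespace" -o jsonpath='{.spec.dnsName}')
  token=$(kubectl get challenge "$challenge" -n "$namespace" -o jsonpath='{.spec.token}')
  expected=$(kubectl get challenge "$challenge" -n "$namespace" -o jsonpath='{.spec.key}')
  # Fetch the response body through the Application Gateway, as Let's Encrypt would.
  actual=$(curl -s "http://$host/.well-known/acme-challenge/$token")
  if [ -n "$expected" ] && [ "$actual" = "$expected" ]; then
    echo "OK: $host serves the expected key authorization"
  else
    echo "MISMATCH: $host did not serve the expected key authorization"
    return 1
  fi
}

# Usage (requires cluster access; take the name from `kubectl get challenges`):
#   check_challenge_response <challenge-name> hitc5
```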

Verifying the Backend Health

In the Azure Portal, I checked the Application Gateway's backend health:

  • Backend Pool: pool-hitc5-acme-challenge-service-8089-bp-8089
  • Status: Healthy

This confirmed that the health probes were passing, and the Application Gateway considered the backend pool healthy.
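
The same information should also be available from the Azure CLI, which is easier to script than the portal. The sketch below wraps az network application-gateway show-backend-health; the function name is made up, the resource group and gateway name are placeholders you need to supply, and the --query expression reflects my understanding of the response layout, so adjust it if your output differs.

```shell
# Print each backend server's address and probe health for an Application Gateway.
# Arguments: resource group, gateway name (both are placeholders you must supply).
show_backend_health() {
  resource_group="$1"
  gateway_name="$2"
  az network application-gateway show-backend-health \
    --resource-group "$resource_group" \
    --name "$gateway_name" \
    --query 'backendAddressPools[].backendHttpSettingsCollection[].servers[].{address:address, health:health}' \
    --output table
}

# Usage (requires `az login`; both names below are hypothetical):
#   show_backend_health my-resource-group my-app-gateway
```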

Lessons Learned

  • Understanding Health Probes: Application Gateway health probes can be customized via annotations. Knowing how to adjust these settings is essential when a backend answers the probe with a status code outside the default healthy range, as the solver pods do with 404.

  • Cert-Manager Solver Pods are Ephemeral: They come and go quickly, so creating a service that selects them based on labels ensures consistent routing.

  • AGIC Sync Intervals: The Application Gateway Ingress Controller might not sync changes instantly. Patience (or adjusting the sync interval) can help.

  • Detailed Logs are Your Friend: Checking logs from cert-manager, AGIC, and the Application Gateway was crucial in pinpointing where things were going wrong.

Conclusion

Integrating cert-manager with Azure Application Gateway on AKS isn't entirely "out-of-the-box," but with a bit of tweaking, it's entirely possible. I hope this write-up helps anyone else facing similar challenges.

If you're struggling with this setup, try creating dedicated services and ingresses for your challenge paths and adjust your health probes accordingly. Don't forget to check the status of your certificates and backend health to ensure everything is functioning as expected.

Feel free to reach out if you have questions or run into issues. I know how tricky it can be!

Happy coding!
