You Wanted Wildcard Certs. Now You Have DNS Problems.
HTTP01 challenges are easy. Point your ingress at the solver, Let's Encrypt hits the endpoint, done. Five minutes of YAML and you're shipping TLS. But then someone on your team decides you need a wildcard certificate. Maybe you're running multi-tenant infrastructure, maybe you have 40 subdomains and managing individual certs sounds miserable. Reasonable decision.
So you look at the cert-manager docs for DNS01 and think: okay, I just need a DNS provider, some credentials, and a ClusterIssuer. How hard can it be?
Very.
The Credentials Problem Nobody Warns You About
Every DNS01 solver needs API access to your DNS provider. That means credentials. And not just any credentials; credentials with write access to your DNS zones. In a Kubernetes cluster. Stored in a Secret. That cert-manager's service account needs to read.
If you're on Route53, you need an IAM user or role with route53:ChangeResourceRecordSets and route53:GetChange permissions. Sounds straightforward until you realize the permission boundary. You need access to the specific hosted zone ID, not just "all of Route53." Most people start with a wildcard policy because the docs show it that way, then spend months wondering why their security team keeps filing tickets.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "route53:GetChange",
      "Resource": "arn:aws:route53:::change/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "route53:ChangeResourceRecordSets",
        "route53:ListResourceRecordSets"
      ],
      "Resource": "arn:aws:route53:::hostedzone/Z2ABCDEF123456"
    }
  ]
}
```
On GCP it's even more fun. You create a service account, give it dns.admin on the specific zone, export a JSON key, stuff it in a Kubernetes Secret, and then reference that Secret in your ClusterIssuer. The JSON key never rotates automatically. You just have a static credential sitting in your cluster forever. Workload Identity Federation fixes this, but setting that up for cert-manager specifically? Budget an afternoon.
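For reference, a CloudDNS solver wired to a static-key Secret looks roughly like this. This is a sketch; the project ID and Secret name are placeholders, and it assumes the exported JSON key was stored in the cert-manager namespace under the key `key.json`:

```yaml
# Assumes the key was created with something like:
#   kubectl create secret generic clouddns-key \
#     --namespace cert-manager --from-file=key.json=./key.json
solvers:
- dns01:
    cloudDNS:
      project: my-gcp-project        # placeholder GCP project ID
      serviceAccountSecretRef:
        name: clouddns-key           # Secret in the cert-manager namespace
        key: key.json
```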
When IRSA and Workload Identity Actually Work
The "right" way to do credentials on AWS is IRSA (IAM Roles for Service Accounts). On GKE, it's Workload Identity. Both let you bind a Kubernetes service account to a cloud IAM role without static keys. cert-manager supports both. In theory.
In practice, IRSA requires your cluster's OIDC provider to be configured correctly, and the trust relationship on the IAM role needs the exact service account namespace and name (the cert-manager namespace and cert-manager service account by default; if you installed with Helm and changed the name, good luck remembering that six months later). And if you're running cert-manager in a different namespace for some reason, none of the Stack Overflow answers will match your setup.
The annotation has to go on the ServiceAccount, not the Deployment:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cert-manager
  namespace: cert-manager
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/cert-manager-dns01
```
Miss that annotation and cert-manager will silently fall back to the node's instance profile. Which probably doesn't have Route53 access. The error you get? Something vague about "failed to determine hosted zone" that sends you chasing DNS configuration issues for hours when the actual problem is IAM.
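If you install cert-manager with the official Helm chart, you can set the annotation through values instead of patching the ServiceAccount by hand. The role ARN here is the same placeholder as above:

```yaml
# values.yaml for the cert-manager Helm chart
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/cert-manager-dns01
```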
Propagation Delays Will Ruin Your Day
Here's something the quickstart guides gloss over completely. DNS propagation takes time. When cert-manager creates the _acme-challenge TXT record, that record needs to be visible to Let's Encrypt's validation servers before they check it. If your DNS provider takes 60 seconds to propagate and Let's Encrypt checks at second 30, the challenge fails.
The solver itself is configured in the dns01 section of your issuer. A typical Route53 setup looks like this (note that propagation checking is controlled on the controller, not here; more on that below):
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
    - dns01:
        route53:
          region: eu-west-1
          hostedZoneID: Z2ABCDEF123456
      selector:
        dnsZones:
        - "example.com"
```
But the default check nameserver is whatever the cluster's DNS resolver returns. In many clusters, that's CoreDNS, which caches aggressively. So cert-manager asks "is the TXT record there?" and CoreDNS says "nope" because it cached the negative response 45 seconds ago. Meanwhile the record exists just fine on the authoritative nameserver.
Fix it by telling cert-manager to check the authoritative nameservers directly. You can set this globally with the --dns01-recursive-nameservers flag on the cert-manager controller:
```yaml
# In your Helm values
extraArgs:
  - --dns01-recursive-nameservers-only
  - --dns01-recursive-nameservers=8.8.8.8:53,1.1.1.1:53
```
Two flags. That's it. They would've saved me about four hours the first time around. They're not in the quickstart; they're buried in the CLI reference docs.
Cross-Account DNS Is Where Things Get Political
Enterprise setups love splitting DNS into a separate AWS account. Production workloads in Account A, Route53 zones in Account B. Makes sense from an organizational standpoint. Makes cert-manager configuration significantly more complex.
You need a role in Account B that trusts a role in Account A. cert-manager in Account A assumes the Account B role to create the TXT record. The ClusterIssuer needs the role field set to the cross-account ARN. And the trust policy on Account B's role needs to allow sts:AssumeRole from Account A's cert-manager role.
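In ClusterIssuer terms, that chain looks roughly like this. The account ID, role name, and zone ID are placeholders:

```yaml
solvers:
- dns01:
    route53:
      region: eu-west-1
      hostedZoneID: Z3DNSACCOUNT99   # zone lives in Account B
      # cert-manager (running in Account A, authenticated via IRSA)
      # assumes this Account B role to write the TXT record
      role: arn:aws:iam::222222222222:role/cross-account-dns-editor
```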
When this breaks (and it will, probably at 2 AM on a Friday), the error message says "access denied" with zero indication of which hop in the chain failed. Was it the IRSA binding? The cross-account trust? The zone permissions? You get to figure that out yourself.
CloudTrail helps, if you remember to check both accounts.
The Webhook Solver Escape Hatch
Not on AWS or GCP? Using Cloudflare, DigitalOcean, or some niche DNS provider? cert-manager doesn't ship with native solvers for all of them. You need a webhook solver, which is essentially a separate deployment that cert-manager calls via an internal webhook to manage DNS records.
The community maintains these. Quality varies wildly. Some are well-tested, documented, have Helm charts. Others are a single Go file that last saw a commit eight months ago. Before you pick one, check:
- When was the last release?
- Does it support the cert-manager API version you're running?
- Is there a Helm chart, or are you writing Deployment manifests by hand?
- Does it handle cleanup? Some webhook solvers create TXT records but never delete them.
That last point matters more than you'd think. Let's Encrypt has rate limits. If your zone fills up with stale _acme-challenge records from failed attempts, some DNS providers start throttling API calls. Cloudflare will return 429s if you're creating and deleting records too aggressively, which causes cert-manager to retry, which causes more API calls. Fun loop.
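Once you've picked one, wiring it up follows the same pattern regardless of provider: a groupName and solverName that match the webhook's deployment, plus a provider-specific config block. The names below are placeholders; the shape of config depends entirely on which webhook you chose:

```yaml
solvers:
- dns01:
    webhook:
      groupName: acme.example.com    # must match the webhook deployment's group
      solverName: my-dns-provider    # placeholder solver name
      config:                        # provider-specific; varies per webhook
        apiKeySecretRef:
          name: dns-provider-api-key
          key: api-key
```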
Testing Without Burning Your Rate Limits
Always, always, always start with Let's Encrypt staging. The staging server has much higher rate limits and issues certificates signed by a fake CA. They won't work in browsers, but they prove your DNS01 pipeline is functional end-to-end.
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
    - dns01:
        route53:
          region: eu-west-1
```
Switch to production only after staging works three times in a row. Not once. Three times. Because DNS01 failures are often intermittent, timing-dependent, and they love appearing only during the renewal window at 2 AM when nobody's watching kubectl get challenges.
And set up monitoring for certificate expiry. cert-manager exports Prometheus metrics out of the box. certmanager_certificate_ready_status and certmanager_certificate_expiration_timestamp_seconds are the two you care about. Alert on anything expiring within 14 days.
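A minimal Prometheus alerting rule for the expiry metric might look like this; the threshold, labels, and rule-group name are all choices, not requirements:

```yaml
groups:
- name: cert-manager-expiry
  rules:
  - alert: CertificateExpiringSoon
    # Fires when any certificate expires within 14 days
    expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires within 14 days"
```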
What I Wish Someone Had Told Me
DNS01 in cert-manager works. It works well, even. But the gap between "follow the quickstart" and "run this in production" is enormous. Scoped IAM permissions, IRSA/Workload Identity bindings, propagation check configuration, cross-account trust chains, webhook solver maintenance. None of this is optional in a real environment, and almost none of it appears in the first three pages of search results.
Get the credentials right first. Set the recursive nameserver flags. Test on staging until you're bored. Then and only then, point it at production.
Your on-call rotation will thank you.