The certificate that never arrives
You deployed cert-manager. You wrote the Issuer manifest. You annotated your Ingress. And then... nothing. The certificate resource says False under READY, and it's been that way for 45 minutes.
This is one of those Kubernetes problems that looks simple on the surface but has about twelve different failure modes hiding underneath. I've spent more hours than I'd like to admit staring at cert-manager logs trying to figure out which one I hit this time.
Start with the obvious: kubectl describe
Before you go digging through controller logs, just describe the Certificate resource.
```shell
kubectl describe certificate my-app-tls -n production
```
Look at the Events section at the bottom. Nine times out of ten, there's a message there telling you exactly what went wrong. People skip this step constantly. They jump straight to the cert-manager pod logs, which are verbose and full of noise from every certificate in the cluster. The describe output is scoped to your specific cert.
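If you want the same signal in a scriptable form, the Ready condition on the Certificate carries the failure message too. A quick sketch, assuming the same certificate name and namespace as above:

```shell
# Print just the Ready condition's status and message for one Certificate
kubectl get certificate my-app-tls -n production \
  -o jsonpath='{range .status.conditions[?(@.type=="Ready")]}{.status}: {.message}{"\n"}{end}'
```

Handy in scripts or watch loops where a full describe is too noisy.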
If you see Issuing certificate as Secret does not exist followed by nothing else, the CertificateRequest probably failed. Check that next:
```shell
kubectl get certificaterequest -n production
kubectl describe certificaterequest my-app-tls-xxxxx -n production
```
The ClusterIssuer vs Issuer confusion
This one catches people constantly. You created a ClusterIssuer but your Certificate references kind: Issuer. Or vice versa. There's no prominent "hey, that issuer doesn't exist in this namespace" error; the failure is easy to miss, and the Certificate just sits there not ready.
```yaml
# Your Certificate says this:
spec:
  issuerRef:
    name: letsencrypt-prod
    kind: Issuer  # but you created a ClusterIssuer

# Fix: match what you actually deployed
spec:
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer  # this is probably what you want
```
Quick rule of thumb: if you want one issuer for the whole cluster (most setups), use ClusterIssuer. If you need per-namespace isolation for multi-tenant scenarios, use Issuer. But pick one and be consistent.
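A quick way to see which kind actually exists in your cluster:

```shell
# ClusterIssuers are cluster-scoped, Issuers are namespaced -- check both
kubectl get clusterissuer
kubectl get issuer --all-namespaces
# The READY column also confirms whether the ACME account registered
```

If the name your Certificate references doesn't show up in either list, you've found the problem.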
DNS01 challenges: the silent killer
HTTP01 validation is straightforward. cert-manager creates a temporary Ingress, Let's Encrypt hits it, done. DNS01 is where things get ugly.
The most common failure? IAM permissions. Your cert-manager service account needs permission to create TXT records in your DNS zone. On AWS, that means a policy allowing route53:ChangeResourceRecordSets and route53:GetChange on the specific hosted zone. On Google Cloud DNS, it's dns.changes.create and dns.resourceRecordSets.*.
But here's what the docs won't emphasize enough: the GetChange permission on Route53. Without it, cert-manager can create the TXT record but can't verify that the change propagated. So it looks like it just... waits. In reality the challenge sits in a pending state, times out after about 20 minutes, and the next attempt repeats the cycle.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "route53:GetChange",
        "route53:ChangeResourceRecordSets",
        "route53:ListResourceRecordSets"
      ],
      "Resource": [
        "arn:aws:route53:::hostedzone/Z04XXXXXXXXXX",
        "arn:aws:route53:::change/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "route53:ListHostedZonesByName",
      "Resource": "*"
    }
  ]
}
```
Notice the separate resource for change/*. Miss that and you'll be debugging for hours.
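Attaching that policy is its own chapter. On EKS the common pattern is IRSA: annotate cert-manager's ServiceAccount with a role that carries the policy. A sketch, assuming a hypothetical role ARN and the default cert-manager namespace:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cert-manager
  namespace: cert-manager
  annotations:
    # hypothetical role ARN -- substitute the role that carries the Route53 policy
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/cert-manager-dns01
```

On GKE the equivalent is Workload Identity; either way, the point is the same: the pod's identity, not your identity, needs the DNS permissions.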
Rate limits will ruin your weekend
Let's Encrypt has rate limits. 50 certificates per registered domain per week. Sounds generous until you're iterating on a broken config and accidentally requesting the same cert 50 times on a Friday afternoon.
Always, always test against the staging endpoint first.
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ops@yourcompany.com
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
    - http01:
        ingress:
          class: nginx
```
Staging certs aren't trusted by browsers, but that's the point. You're testing the issuance flow, not the cert itself. Once staging works, swap the server URL to production and you're done.
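Before pointing a Certificate at it, confirm the staging issuer actually registered an ACME account. The issuer's own Ready condition tells you:

```shell
# READY should read True once the ACME account is registered;
# if not, the describe events usually name the problem
kubectl get clusterissuer letsencrypt-staging
kubectl describe clusterissuer letsencrypt-staging | tail -n 5
```

An issuer that never goes Ready (bad email, unreachable ACME server, malformed privateKeySecretRef) will fail every certificate downstream, so check this first.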
If you already hit rate limits, there's no way to reset them. You wait. The duplicate certificate limit is 5 per week, which is even more restrictive. To see how close you are, search for your domain on crt.sh: it indexes Certificate Transparency logs, which record every certificate Let's Encrypt has issued for you.
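crt.sh also has a JSON endpoint, so you can count recent issuances from the terminal. A rough sketch, assuming jq and GNU date are available (note each issuance typically appears twice, once as a precertificate, and use `%.yourdomain.com` as the query to include subdomains):

```shell
DOMAIN=example.com   # substitute your registered domain
curl -s "https://crt.sh/?q=${DOMAIN}&output=json" \
  | jq --arg cutoff "$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S)" \
      '[.[] | select(.entry_timestamp > $cutoff)] | length'
```

crt.sh itself can be slow or throttle you, so treat the number as an estimate, not an authoritative rate-limit counter.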
The ingress-class annotation trap
cert-manager needs to know which ingress controller handles the HTTP01 challenge. If you have multiple ingress controllers (say, nginx for public traffic and traefik for internal) and you don't specify the class, the temporary challenge ingress may be claimed by the wrong controller, or by no controller at all. Either way, Let's Encrypt's validation request never reaches cert-manager's solver, and the challenge hangs.
```yaml
solvers:
- http01:
    ingress:
      class: nginx  # match your actual ingress controller
      # OR, on newer cert-manager releases, use this instead
      # (the two fields are mutually exclusive):
      # ingressClassName: nginx
```
Traefik users have an extra gotcha. Traefik's IngressRoute CRD doesn't work with cert-manager's HTTP01 solver out of the box. You either need to use the standard Ingress resource for cert-manager challenges or switch to DNS01 validation entirely. Most Traefik shops I've worked with end up on DNS01 because it's less fragile in that ecosystem.
Renewals that silently break
Everything worked fine three months ago. The cert was issued, traffic is encrypted, life is good. Then the renewal window opens (by default, 30 days before a 90-day cert expires) and suddenly things fail.
Common causes: someone rotated the cloud credentials that cert-manager uses. The DNS zone was migrated to a different provider. A network policy was added that blocks egress from the cert-manager namespace. The ingress controller was upgraded and the class name changed.
Set up monitoring. Seriously. A Prometheus alert on certmanager_certificate_expiration_timestamp_seconds with a 21-day threshold gives you three weeks to fix whatever broke. That's enough time even if you're on vacation when the alert fires.
```yaml
# PrometheusRule for cert-manager
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-alerts
spec:
  groups:
  - name: cert-manager
    rules:
    - alert: CertificateExpiringSoon
      expr: |
        certmanager_certificate_expiration_timestamp_seconds - time()
          < 21 * 24 * 3600
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Certificate {{ $labels.name }} expires in less than 21 days"
```
The order resource: your debugging friend
When a Certificate creates a CertificateRequest, cert-manager creates an Order (for ACME issuers), which creates Challenges. This chain is where the actual work happens, and it's where most failures surface.
```shell
# Follow the chain
kubectl get order -n production
kubectl describe order my-app-tls-xxxxx-yyyyy -n production

# Then check challenges
kubectl get challenge -n production
kubectl describe challenge my-app-tls-xxxxx-yyyyy-zzzzz -n production
```
The Challenge resource will tell you if the ACME server rejected the validation, if DNS propagation failed, or if the HTTP01 endpoint returned the wrong content. This is the most specific error you'll get, and it's buried three levels deep in the resource hierarchy. Most guides don't mention this.
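To avoid describing challenges one at a time, you can pull the state and reason for every challenge in the cluster into one table:

```shell
# .status.reason carries the ACME server's actual error text
kubectl get challenge -A \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATE:.status.state,REASON:.status.reason'
```

When multiple certificates are failing at once, this usually makes it obvious whether it's one shared cause (DNS credentials, egress policy) or several unrelated ones.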
Version mismatches after upgrades
cert-manager CRDs are versioned. If you upgrade cert-manager but forget to upgrade the CRDs (common with Helm when you don't use --set installCRDs=true), you get bizarre behavior. Resources appear to save correctly but the controller ignores them because the stored version doesn't match what it expects.
```shell
# Always upgrade CRDs before or with the controller
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.crds.yaml

# Then upgrade the deployment
helm upgrade cert-manager jetstack/cert-manager --namespace cert-manager --version v1.14.0
```
And run cmctl check api after upgrades. It validates that the API server can talk to the cert-manager webhook, which is another common failure point. The webhook handles validation and conversion of cert-manager resources. If it's down, you can't create or modify any cert-manager objects.
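cmctl has a second subcommand worth knowing here: alongside the webhook check, it can walk the whole resource chain for a certificate in one shot. Assuming the example certificate name from earlier:

```shell
# Validate that the API server can reach the cert-manager webhook
cmctl check api

# Summarize the Certificate -> CertificateRequest -> Order -> Challenge chain
cmctl status certificate my-app-tls -n production
```

It's the same chain you'd follow by hand with kubectl describe, just collated for you.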
The practical checklist
When a certificate isn't issuing, go through this in order:
- Is the Certificate resource created? `kubectl get cert -A`
- What does `kubectl describe cert` say in Events?
- Is the CertificateRequest created, and what's its status?
- Is the Order created? What state is it in?
- Are there Challenge resources? What do they report?
- Can cert-manager reach the ACME server? Check egress network policies.
- Are cloud credentials (for DNS01) still valid?
- Did you hit rate limits? Check crt.sh.
Following that chain from Certificate down to Challenge catches 95% of issues. The remaining 5% are weird edge cases involving custom webhooks or broken cluster DNS, and at that point you're reading controller logs anyway.
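The checklist translates almost mechanically into a loop. A rough sketch you could drop into a debugging session, assuming your current kubectl context points at the right cluster:

```shell
#!/usr/bin/env sh
# Walk the cert-manager issuance chain for one namespace, top to bottom
NS="${NAMESPACE:-production}"
for kind in certificate certificaterequest order challenge; do
  echo "== ${kind} (${NS}) =="
  kubectl get "${kind}" -n "${NS}" 2>/dev/null || echo "(none found, or CRD missing)"
done
```

Run it once before you touch anything; the first resource in the chain that's missing or not ready is where to start digging.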