Renewal worked for 14 months. Then it didn't.
You set up certbot, tested it, watched the first renewal succeed, and moved on with your life. Smart. Except the ACME protocol has about a dozen edge cases that only surface after months of quiet operation, and most of them produce errors that look nothing like what actually went wrong.
The problem isn't your cron job. The problem is what happens between your ACME client and the CA when conditions change, and they always change.
CAA records: the DNS field nobody remembers
Certification Authority Authorization records tell CAs whether they're allowed to issue certificates for your domain. Most people set them up once, if at all, and forget they exist. Then someone on your team switches from Let's Encrypt to a commercial CA, or the other way around, and renewals start failing with cryptic authorization errors.
The really fun part? CAA checking became mandatory for all CAs in September 2017, but the error messages still vary wildly between providers. Let's Encrypt gives you a reasonably clear "CAA record for domain prevents issuance" message. Some commercial CAs just say "authorization failed" and leave you to figure out why.
# check what your CAA records actually say
dig CAA example.com +short
# you probably want something like this
# 0 issue "letsencrypt.org"
# 0 issuewild "letsencrypt.org"
# forgot the issuewild entry? wildcard renewals
# will fail while regular ones keep working.
# ask me how I know.
And here's what catches people: CAA records are checked at issuance time, not at order creation time. Your ACME client can complete the entire challenge process successfully, get the authorization, and then fail at the very last step because someone added a restrictive CAA record between your previous issuance and this renewal.
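A cheap defense is to check CAA yourself before renewal runs, so a restrictive record fails loudly in your own logs instead of cryptically at issuance. A minimal sketch, assuming Let's Encrypt as your CA; caa_allows is a hypothetical helper and the grep-based match is deliberately rough:

```shell
#!/bin/sh
# Decide whether a set of CAA records permits a given CA.
# Returns 0 (allowed) or 1 (blocked). Rough string match only.
caa_allows() {
    records="$1"; ca="$2"
    [ -z "$records" ] && return 0  # no CAA record at all: any CA may issue
    printf '%s\n' "$records" | grep -q "issue.*\"$ca\""
}

# demo with canned records; a real check would feed in
# the output of: dig CAA example.com +short
caa_allows '0 issue "letsencrypt.org"' "letsencrypt.org" && echo "allowed"
caa_allows '0 issue "digicert.com"'    "letsencrypt.org" || echo "blocked"
```

One caveat: CAA is inherited from parent zones, so a thorough check walks up the name toward the root; this sketch only inspects the records you hand it.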
Account key rotation is a thing, and you're probably not doing it
Every ACME client registers with the CA using an account key. Certbot stores it in /etc/letsencrypt/accounts/. Most people never think about this key again. But it's a private key sitting on your server, sometimes for years, and if it gets compromised, someone can issue certificates for your domains.
RFC 8555 supports account key rollover: you can rotate your ACME account key without losing your account or existing authorizations. In practice, almost nobody does this. Certbot has had a --update-registration flag for ages, but that only updates contact details; actual key rollover support has been shaky across clients.
The bigger issue is what happens when you lose the account key entirely. Migrating servers, reimaging a box, container volumes getting wiped. Your certs are still valid but your ACME client can't renew them because the account is gone. You have to register a new account, and depending on your rate limit situation with Let's Encrypt, that might mean waiting.
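The boring insurance policy is to back up the account directory the same way you back up any other secret. A sketch using certbot's default path; backup_accounts is a hypothetical helper and the destination directory is an assumption:

```shell
#!/bin/sh
# Snapshot the ACME account directory so a reimaged box can keep
# its registration instead of re-registering from scratch.
backup_accounts() {
    src="$1"; dst="$2"
    mkdir -p "$dst"
    tar czf "$dst/acme-accounts-$(date +%Y%m%d).tar.gz" \
        -C "$(dirname "$src")" "$(basename "$src")"
}

# certbot's default location; the destination is up to you
[ -d /etc/letsencrypt/accounts ] && \
    backup_accounts /etc/letsencrypt/accounts /var/backups/letsencrypt || true
```

Restore it before the first renewal on the new box and your client picks up where it left off.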
Rate limits that bite at the worst possible moment
Let's Encrypt has a 50 certificates per registered domain per week limit. Sounds generous until you're managing a platform with customer subdomains, or until a failed deployment causes your automation to retry issuance in a loop.
I've seen a staging deployment gone wrong burn through an entire week's rate limit in about 40 minutes. Someone had pointed their staging certbot at the production ACME endpoint instead of the staging one. By the time anyone noticed, the counter was maxed and production renewals for that domain were blocked for 7 days.
# always, always use staging for testing
# production: https://acme-v02.api.letsencrypt.org/directory
# staging: https://acme-staging-v02.api.letsencrypt.org/directory
# certbot makes this easy: --dry-run talks to the
# staging endpoint automatically
certbot certonly --dry-run -d example.com
# but custom ACME clients? you'd be surprised
# how many hardcode the production URL
There's also the duplicate certificate limit, which is 5 identical certificates per week. "Identical" means the exact same set of hostnames. So if your automation requests the same cert over and over because it doesn't properly detect that one already exists, you hit this wall fast.
The failed validation limit is nastier: 5 failures per account, per hostname, per hour. If your DNS propagation is slow and challenges keep failing, you can lock yourself out of validation entirely for that hour. And your retry logic, if it has any, just makes it worse.
The HTTP-01 challenge and reverse proxies
HTTP-01 challenges require the CA to reach http://yourdomain/.well-known/acme-challenge/${token} on port 80. Simple. Except when you have a reverse proxy, a CDN, a load balancer, or basically any modern infrastructure sitting in front of your server.
Nginx is the classic trap. Your certbot runs and places the challenge file, but nginx is configured to redirect all HTTP to HTTPS, or it's proxying to a different backend, or the location block for /.well-known/ doesn't exist. The challenge fails and the error says "connection refused" or "invalid response" which tells you almost nothing about what's actually wrong.
# nginx config that breaks HTTP-01 silently:
server {
    listen 80;
    server_name example.com;
    return 301 https://$host$request_uri;  # redirects EVERYTHING
}

# nginx config that works:
server {
    listen 80;
    server_name example.com;

    location /.well-known/acme-challenge/ {
        root /var/www/certbot;  # let certbot serve challenges
    }

    location / {
        return 301 https://$host$request_uri;
    }
}
Cloudflare users get a special version of this headache. If you have "Always Use HTTPS" turned on in Cloudflare, HTTP-01 challenges will never reach your origin. You need DNS-01 challenges instead, which means giving your ACME client API access to your DNS provider. That's a whole different set of credentials to manage and rotate.
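For the Cloudflare case specifically, a DNS-01 setup with certbot's dns-cloudflare plugin looks roughly like this; the plugin must be installed separately, and the token and paths are placeholders:

```shell
# /root/.secrets/cloudflare.ini (chmod 600), token is a placeholder:
# dns_cloudflare_api_token = your-api-token-here

# requires the certbot-dns-cloudflare plugin
certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /root/.secrets/cloudflare.ini \
  -d example.com -d '*.example.com'
```

Scope that API token to DNS edits on the one zone if your provider allows it; a full-account token sitting on a web server is a worse liability than the ACME account key.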
DNS-01 propagation timing is basically gambling
DNS-01 challenges should be straightforward: create a TXT record, wait for it to propagate, tell the CA to check. But DNS propagation isn't instant and it isn't consistent. Your authoritative nameserver might update in 2 seconds. The CA's resolver might cache the old response for another 60.
Most ACME clients have a propagation wait time. Certbot's dns plugins default to 10 seconds for some providers. That's fine for Route 53 which is fast. It's not fine for registrars with slow DNS infrastructure where propagation regularly takes 60-120 seconds.
And TTL values on your existing TXT records matter. If you previously had an ACME challenge TXT record with a 3600 second TTL, some resolvers might cache that for the full hour even after you've updated it with a new challenge value. I've seen renewals fail consistently at a 90 day cadence because the TTL on the _acme-challenge record was set high during initial setup and nobody changed it.
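Rather than gambling on a fixed sleep, you can poll until the record is actually visible before telling the CA to validate. A sketch; the resolver choice (1.1.1.1) and the 5-second poll step are assumptions, and lookup_txt is a hypothetical helper:

```shell
#!/bin/sh
# Poll a public resolver for the challenge TXT record instead of
# sleeping a fixed interval and hoping propagation finished.
lookup_txt() {
    dig +short TXT "$1" @1.1.1.1
}

wait_for_txt() {
    name="$1"; expected="$2"; timeout="${3:-120}"
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        lookup_txt "$name" | grep -q "$expected" && return 0
        sleep 5
        waited=$((waited + 5))
    done
    return 1
}

# usage (token value is hypothetical):
# wait_for_txt "_acme-challenge.example.com" "tokenvalue" 120 || exit 1
```

Seeing the record at one public resolver doesn't guarantee the CA's resolver sees it too, but it turns "wait 10 seconds and pray" into "wait until at least somebody can see it."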
What happens when your ACME client updates itself
Certbot updates through your package manager. acme.sh pulls from GitHub. lego updates through Go modules or binary releases. And sometimes these updates break things.
Certbot 2.0 dropped support for some older authentication plugins. acme.sh changed its default CA from Let's Encrypt to ZeroSSL at one point, which surprised a lot of people when their renewal suddenly went to a different provider. These aren't bugs; they're intentional changes that happen to break your automation because you weren't paying attention to your ACME client's release notes. And nobody reads release notes for their certificate renewal tool.
The fix is boring: pin your ACME client version in production. Update deliberately in staging first. Test the actual renewal flow, not just --dry-run, because dry-run skips some of the steps that break in practice.
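What pinning looks like depends on how the client was installed; these are sketches for apt and pip installs, and the version number is illustrative:

```shell
# apt installs: hold the package so unattended upgrades skip it
apt-mark hold certbot
apt-mark showhold   # verify the hold took

# pip installs: pin an exact, tested version (number is an example)
pip install 'certbot==2.9.0'
```

A held package still shows up in upgrade listings, so you'll see that an update exists without it being applied behind your back.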
Monitoring the renewal, not just the certificate
Most monitoring checks certificate expiry from the outside. That's good but it's reactive. By the time your monitor sees the cert is expiring in 7 days, you've already missed two or three failed renewal attempts and you're now debugging under time pressure.
Better approach: monitor the renewal process itself. Check that your ACME client's last run succeeded. Parse its logs. Alert on failures immediately, not when the cert is about to expire.
# quick and dirty: check certbot's log for recent failures
grep -c "Failed" /var/log/letsencrypt/letsencrypt.log
# slightly better: check the cert's actual renewal date
openssl x509 -enddate -noout -in /etc/letsencrypt/live/example.com/cert.pem
# notAfter=Jun 27 04:20:00 2026 GMT
# certs are valid 90 days and certbot renews with 30 days
# remaining, so notAfter should always be 30-90 days out.
# less than 30 days away? renewals have been failing.
Some teams set up a dead man's switch: the renewal script pings a monitoring endpoint on success. If the ping doesn't arrive within the expected window, you get alerted. Healthchecks.io, Cronitor, or even a simple webhook to Slack. The point is to know about failures when they happen, not when the cert is 3 days from expiry and you're scrambling.
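With certbot, the success ping can ride on --deploy-hook, which runs only after a certificate is actually renewed and deployed; the healthchecks.io UUID here is a placeholder:

```shell
# ping the heartbeat only when a cert was actually renewed
certbot renew --deploy-hook \
  'curl -fsS --retry 3 https://hc-ping.com/your-uuid-here > /dev/null'
```

Because the hook only fires on success, a missed ping means either the renewal failed or nothing was due; set the monitor's expected window wider than your renewal cadence so the second case doesn't page you.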
The actual takeaway
ACME-based renewal automation is excellent when it works. The protocol is solid, the tooling is mature, and for straightforward setups it just runs. But "straightforward" is doing a lot of heavy lifting in that sentence. The moment you add CDNs, multiple domains, custom infrastructure, or team members who touch DNS records without telling anyone, you're operating in territory where silent failures accumulate until they become loud ones.
Back up your account keys. Pin your client versions. Monitor the renewal process, not just the certificate. Test with staging before production, every single time. And read the error messages carefully, because ACME errors are often technically accurate but misleading about root cause.