Certbot is installed. The cron job is set. You're done, right?
Every tutorial ends the same way. Install certbot, run it once, add a cron entry, walk away. And it works. For months, sometimes years. Then one Tuesday morning your monitoring lights up because the cert expired and nobody noticed until customers started getting browser warnings.
The renewal itself wasn't the hard part. Knowing it stopped working was.
Why "set and forget" is a lie
Let's Encrypt certificates last 90 days. Certbot tries to renew at 60 days. That gives you a 30-day window where renewal can fail and you still have a valid cert. Sounds generous. But here's the thing: if your cron job fails silently on day 60, and again on day 67, and again on day 74, you won't know until day 90 when the cert actually expires. Three failures, zero alerts.
I've seen this play out at a mid-size SaaS company running 40+ domains. Their certbot setup had been working perfectly for 14 months. Then their web server config changed during a migration, the webroot validation path moved, and certbot started failing every single run. The cron job dutifully executed, certbot dutifully failed, and the logs sat there unread for 28 days.
The failure modes nobody warns you about
Certbot failing is actually the easy case. At least it exits non-zero and writes to a log. The real killers are subtler.
DNS propagation timing. If you use DNS-01 challenges with an API-based provider, your automation creates a TXT record and then asks Let's Encrypt to validate. But DNS propagation isn't instant. Some providers take 30 seconds, others take 5 minutes. If your script doesn't wait long enough, validation fails intermittently. Works fine on Monday, fails on Thursday because your DNS provider's API was a bit slow that day.
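If you control the hook script, you can make it wait for the record to actually appear instead of guessing at a sleep interval. A minimal sketch, assuming `dig` is available; `wait_for_txt` is a hypothetical helper you'd wire into your DNS-01 hook, and the record name, token, attempt count, and delay are all placeholders:

```shell
# Hypothetical guard, not built into certbot: poll a public resolver
# until the challenge TXT record is visible before asking the CA to
# validate. Parameters: record name, expected token, attempts, delay.
wait_for_txt() {
    local record="$1" expected="$2" attempts="${3:-30}" delay="${4:-10}"
    local i
    for i in $(seq 1 "${attempts}"); do
        if dig +short TXT "${record}" @1.1.1.1 | grep -q "${expected}"; then
            return 0  # record visible; safe to trigger validation
        fi
        sleep "${delay}"
    done
    return 1  # never appeared within attempts * delay seconds
}
```

Call it as `wait_for_txt "_acme-challenge.example.com" "$TOKEN" 30 10 || exit 1` before triggering validation, and the intermittent Thursday failures go away.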
Rate limits you forgot about. Let's Encrypt has a 50 certificates per registered domain per week limit. That sounds like plenty until you realize subdomains count against the parent domain. A staging environment that renews aggressively can eat into production's quota. I've seen a CI pipeline that ran certbot on every deploy burn through the entire weekly limit by Wednesday.
Port 80 is blocked. HTTP-01 challenges need port 80. Your firewall rules changed six months after you set up certbot. Maybe a security audit locked things down. Maybe someone added a WAF that intercepts the .well-known/acme-challenge path. The challenge fails, certbot retries, same result.
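You can catch this class of failure before certbot does. A hedged sketch: probe a nonexistent file under the challenge path and check what comes back. `probe_acme_path` is a hypothetical helper and the host is a placeholder:

```shell
# Hypothetical probe: request a nonexistent file under the ACME path.
# A plain 404 means the web server answered on port 80 and nothing
# intercepted the path; 000 means unreachable; 403 is often a WAF.
probe_acme_path() {
    local host="$1"
    local code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
        "http://${host}/.well-known/acme-challenge/probe")
    [ "${code}" = "404" ]
}
```

Run it from outside your network on a schedule; if the firewall or WAF change lands, you hear about it before the next renewal attempt does.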
Disk full. Seriously. Certbot stores backups of old certificates in /etc/letsencrypt/archive/. On a small VPS with a 20GB disk running for two years, those archives and logs pile up. Renewal fails because there's no space to write the new cert. The error message mentions disk space, buried in a log file nobody reads.
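A cheap guard is to check free space before certbot ever runs, so the failure is loud and specific instead of buried. A sketch assuming GNU coreutils (`df --output`); the path and the 90% threshold are assumptions to tune:

```shell
# Hypothetical pre-renewal guard: refuse to proceed if the filesystem
# holding /etc/letsencrypt is nearly full. Threshold is an assumption.
check_disk() {
    local path="${1:-/etc/letsencrypt}" max_pct="${2:-90}"
    local used
    used=$(df --output=pcent "${path}" 2>/dev/null | tail -1 | tr -dc '0-9')
    [ -n "${used}" ] && [ "${used}" -lt "${max_pct}" ]
}
```

In the renewal cron: `check_disk /etc/letsencrypt 90 || echo "disk nearly full, renewal will fail" >&2` — paired with cron mail going somewhere people actually read, or the Slack webhook pattern below.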
Testing renewal without waiting 60 days
Certbot has a --dry-run flag. Use it. But don't just run it once after setup; run it on a schedule.
# Add this SEPARATE cron entry
0 6 * * 1 certbot renew --dry-run > /dev/null 2>&1 || curl -s -o /dev/null "https://hooks.slack.com/services/YOUR/WEBHOOK/URL" -d "{\"text\":\"Certificate dry-run failed on $(hostname)\"}"
That runs every Monday morning. If the dry-run fails, you get a Slack message. You have six days to fix it before the next real renewal attempt. Not perfect, but it catches 80% of the silent failures.
Some teams go further and use the Let's Encrypt staging environment for weekly test issuances. Staging has much higher rate limits and issues real (untrusted) certificates. If staging issuance works, production issuance will almost certainly work too.
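A sketch of what that weekly staging test might look like. The staging directory URL is Let's Encrypt's published one; the domain, webroot path, and `staging-canary` cert name are placeholders:

```shell
# Weekly test issuance against Let's Encrypt staging (much higher rate limits)
certbot certonly --webroot -w /var/www/html -d test.example.com \
    --server https://acme-staging-v02.api.letsencrypt.org/directory \
    --cert-name staging-canary --non-interactive
```

Alert on a non-zero exit, same as the dry-run cron above.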
Beyond certbot: what mature setups look like
Once you're managing more than a handful of domains, raw certbot cron jobs stop being practical. Not because certbot is bad, but because the operational overhead of monitoring 30 separate cron jobs on 12 different servers gets ridiculous fast.
Centralized issuance. Tools like acme.sh or Caddy handle renewal internally and expose hooks for notifications. Caddy is particularly interesting because it manages certificates automatically with zero configuration. You point a domain at it and it just handles everything, including OCSP stapling. No cron job to break.
In Kubernetes, cert-manager is the standard answer. It watches Certificate resources, issues via ACME or internal CAs, stores the result in Secrets, and renews before expiry. The renewal loop is built into the controller, not bolted on via cron. When it fails, it surfaces as a Kubernetes event you can alert on:
kubectl get events --field-selector reason=Failed --namespace cert-manager --sort-by='.lastTimestamp'
# Or watch Certificate resources directly
kubectl get certificates -A -o wide
# READY column shows True/False for ready state
But even cert-manager isn't foolproof. I've seen clusters where the cert-manager pod kept crashing due to memory limits and nobody noticed for weeks because the existing certs were still valid. The certificates eventually expired, and suddenly five services went red at once.
The notification problem
Most renewal failures are silent. That's the core issue. Certbot logs to a file. Cron sends email to root. When was the last time anyone checked root's mail on a production server?
You need active notification, not passive logging. Options that actually work:
Deploy hooks with HTTP pings. Certbot supports --deploy-hook which runs after a successful renewal. Flip the logic: instead of alerting on failure, alert on the absence of success. Use a dead man's switch service like Healthchecks.io or Dead Man's Snitch. If the ping doesn't arrive within the expected window, something broke.
# In your renewal config or cron
certbot renew --deploy-hook "curl -fsS https://hc-ping.com/YOUR-UUID-HERE"
# Healthchecks.io alerts you if the ping stops arriving
# Set the expected period to match your renewal schedule
External certificate monitoring. This is the belt-and-suspenders approach. Regardless of how you handle renewal internally, have an external service check your actual live certificates. If the cert has less than 14 days remaining, alert. If it's expired, page someone.
This catches everything: failed renewals, renewed certs that weren't deployed, load balancers serving stale certs from cache, CDN edge nodes with old certificates. The external check doesn't care about your internal process. It checks what the user actually sees.
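A minimal version of that external check, run from a box outside your infrastructure. It assumes GNU `date -d`; `days_left` is a hypothetical helper and the host is whatever your users actually connect to:

```shell
# Hypothetical external check: how many days until the served cert expires?
# Checks the cert the TLS handshake actually returns, not files on disk.
days_left() {
    local host="$1" port="${2:-443}"
    local not_after
    not_after=$(echo | openssl s_client -connect "${host}:${port}" \
        -servername "${host}" 2>/dev/null \
        | openssl x509 -noout -enddate | cut -d= -f2)
    echo $(( ($(date -d "${not_after}" +%s) - $(date +%s)) / 86400 ))
}
```

Then `[ "$(days_left example.com)" -lt 14 ] && page_someone` in a cron on a machine that is not part of the infrastructure it's watching.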
The reload trap
Here's one that bites people more than you'd expect. Certbot renews the certificate. The new files are on disk. But the web server is still serving the old cert from memory.
Nginx doesn't automatically pick up new certificates. Neither does Apache without a reload. Certbot's --post-hook can handle this:
certbot renew --post-hook "systemctl reload nginx"
Simple. But if you're running behind a load balancer, you need to reload on every backend server. If you're using a reverse proxy that caches TLS sessions, stale sessions might persist even after reload. And if your setup involves HAProxy with a PEM bundle, you need to concatenate the cert and key into a single file before reloading, which means your post-hook is now a script with its own failure modes.
#!/bin/bash
# post-renewal hook for HAProxy
set -u
DOMAIN="example.com"
CERT_DIR="/etc/letsencrypt/live/${DOMAIN}"
DEST="/etc/haproxy/certs/${DOMAIN}.pem"
umask 077  # the bundle contains the private key; keep it unreadable to others
cat "${CERT_DIR}/fullchain.pem" "${CERT_DIR}/privkey.pem" > "${DEST}"
systemctl reload haproxy
# Verify the reload actually worked
sleep 2
if ! systemctl is-active --quiet haproxy; then
    echo "HAProxy failed to reload after cert renewal" | mail -s "CERT RENEWAL: HAProxy down" ops@example.com
fi
That verification step at the end? Almost nobody adds it. And then HAProxy sits there stopped because the PEM file had a permissions issue, and the reload failed, and the post-hook exited without checking.
What a resilient setup actually requires
After watching renewal automation fail in creative ways across dozens of environments, the pattern that works comes down to three layers. Not one. Three.
Layer 1: Automated renewal. Certbot, acme.sh, cert-manager, Caddy, whatever. This is the easy part and everyone has it.
Layer 2: Internal verification. After renewal, verify the new cert is actually being served. A quick openssl s_client check against localhost confirms the reload worked and the cert is valid:
# Quick post-renewal verification
echo | openssl s_client -connect localhost:443 -servername example.com 2>/dev/null | openssl x509 -noout -dates | grep notAfter
Layer 3: External monitoring. Something completely outside your infrastructure checks the certificate as seen by users. This is your safety net. If layers 1 and 2 both fail, layer 3 catches it before your users do.
Skip any one of these layers and you're rolling the dice. Maybe you'll be fine for years. Maybe you'll get that 3 AM page next month.
Stop treating certs as a solved problem
Certificate renewal automation is one of those things that feels solved after the initial setup. And that's exactly why it breaks. Nobody monitors what they consider solved. Build the monitoring first, the automation second. Your future self, the one not getting paged at 3 AM, will appreciate it.