The alert that nobody read
Somewhere around certificate number 47, the Slack channel went quiet. Not because the alerts stopped. They didn't. The team just stopped caring.
This happens more often than anyone admits. A company sets up certificate monitoring, configures it to fire warnings at 30 days, 14 days, 7 days, 3 days, 1 day before expiry. Textbook setup. And for three months it works great. Then the alert volume creeps up. Renewals that are already automated still trigger warnings. Internal certs for dev environments nobody uses anymore keep pinging. Wildcard certs that cover 40 subdomains generate 40 nearly identical notifications.
By month six, the monitoring channel is muted.
By month nine, something expires in production.
Why "more alerts" is the wrong answer
The instinct after an outage is always the same: add more monitoring. Lower the thresholds. Alert earlier. Alert more people. CC the VP of Engineering.
This makes things worse. Every single time.
Alert fatigue isn't a discipline problem. You can't fix it by telling people to "pay more attention." The human brain physically cannot maintain vigilance over a high-volume, low-signal notification stream. There's actual research on this from aviation and healthcare, two fields where alert fatigue literally kills people. The certificate monitoring world has the same dynamics at a smaller scale.
The real fix is fewer, better alerts. Which sounds obvious but requires you to actually think about what deserves a notification and what doesn't.
Tiered alerting that respects human attention
Not every certificate deserves the same monitoring treatment. A production load balancer cert and a developer's local mTLS cert are not the same thing, so stop alerting on them identically.
A structure that works in practice:
# tier-1: production, customer-facing
#   → PagerDuty/OpsGenie, escalation after 15 min
#   → alert at 14 days, 7 days, 3 days
# tier-2: internal services, staging
#   → Slack channel (not muted), email digest
#   → alert at 7 days, 3 days
# tier-3: dev environments, test certs
#   → weekly digest email, nothing more
#   → alert at 3 days only

# Example config (Prometheus alerting rule)
groups:
  - name: cert_expiry_tier1
    rules:
      - alert: CertExpiryProduction
        expr: cert_days_remaining{tier="production"} < 14
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Production cert expiring: {{ $labels.domain }}"
Three tiers. Three different notification paths. The people getting paged at 2 AM only get paged for things that actually matter at 2 AM.
The inventory problem nobody wants to solve
You can't monitor what you don't know about. And most organizations have no idea how many certificates they actually have.
Seriously. Ask your team right now. The number they give you will be wrong. Probably low by a factor of three or more.
Certificates hide everywhere. Load balancers, CDN configs, internal microservices doing mTLS, IoT devices, that Jenkins server someone set up in 2019 and forgot about, email signing certs, code signing certs. Every team provisions their own, usually with no central tracking.
Before you can build monitoring that works, you need a discovery phase. Active scanning is the minimum:
# Scan your known networks for TLS endpoints
# nmap is ugly but effective
nmap -p 443,8443,8080,4443 --script ssl-cert 192.168.0.0/16 -oX cert_scan.xml
# Or use a purpose-built tool:
# certspotter watches CT logs for domains listed in a watchlist file
certspotter -watchlist watchlist.txt
# For cloud environments, query the APIs directly
aws acm list-certificates --query 'CertificateSummaryList[*].[DomainName,NotAfter]' --output table
# GCP equivalent
gcloud certificate-manager certificates list --format="table(name,sanDnsnames,expireTime)"
Run these scans weekly. Compare results. Certificates appear and disappear constantly, and the delta between scans tells you what your team is doing without telling you.
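Computing that delta is a set comparison over certificate fingerprints. A minimal sketch, assuming each weekly snapshot has already been reduced to a map of fingerprint to endpoints (the sample fingerprints and endpoints are illustrative, not output of the scans above):

```typescript
// Compare two weekly inventory snapshots keyed by certificate fingerprint.
type Inventory = Map<string, string[]>; // fingerprint -> endpoints seen with it

function inventoryDelta(previous: Inventory, current: Inventory) {
  const appeared = Array.from(current.keys()).filter(fp => !previous.has(fp));
  const vanished = Array.from(previous.keys()).filter(fp => !current.has(fp));
  return { appeared, vanished };
}

const lastWeek: Inventory = new Map([
  ["aa:11", ["lb-prod-1:443"]],
  ["bb:22", ["jenkins:8443"]],
]);
const thisWeek: Inventory = new Map([
  ["aa:11", ["lb-prod-1:443"]],
  ["cc:33", ["new-service:443"]], // provisioned outside the standard process?
]);

const delta = inventoryDelta(lastWeek, thisWeek);
console.log(delta.appeared); // new fingerprints to investigate
console.log(delta.vanished); // certs that were rotated or decommissioned
```

The "appeared" list feeds directly into the inventory-drift metric discussed later: new fingerprints that nobody provisioned through the standard process are the ones worth chasing down.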
Deduplication is half the battle
One wildcard certificate can protect 50 services. If your monitoring checks each service endpoint independently, you get 50 alerts for one certificate. That's not useful information. That's noise.
Good monitoring deduplicates by certificate fingerprint, not by endpoint. You want to know "this certificate is expiring" once, with a list of everywhere it's deployed attached as context. Not 50 separate incidents that your on-call engineer has to mentally correlate at 3 AM.
// Group alerts by cert fingerprint, not hostname
interface CertAlert {
  fingerprint: string;   // SHA-256 of the cert
  commonName: string;
  expiresAt: Date;
  deployedOn: string[];  // all endpoints using this cert
  tier: 1 | 2 | 3;
  lastNotified: Date | null;
}

function daysBetween(from: Date, to: Date): number {
  return (to.getTime() - from.getTime()) / 86_400_000; // ms per day
}

function shouldAlert(cert: CertAlert, now: Date): boolean {
  const daysLeft = daysBetween(now, cert.expiresAt);
  const thresholds: Record<1 | 2 | 3, number[]> = {
    1: [14, 7, 3, 1],
    2: [7, 3],
    3: [3],
  };
  const applicableThresholds = thresholds[cert.tier];
  // Only alert if we just crossed a threshold,
  // not every single day
  return applicableThresholds.some(t =>
    daysLeft <= t && daysLeft > t - 1 &&
    (!cert.lastNotified || daysBetween(cert.lastNotified, now) >= 1)
  );
}
The deployedOn array is the key piece. When someone gets an alert, they immediately know the blast radius. One cert expiring that's only on a staging box? Low urgency. Same cert also on three production load balancers? Different conversation entirely.
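Building that deployedOn array means collapsing per-endpoint scan results into one record per fingerprint before any alerting logic runs. A sketch under the assumption that your scanner emits one record per endpoint (the EndpointCert shape is hypothetical; adapt it to your scanner's output):

```typescript
// Collapse per-endpoint scan results into one record per certificate.
interface EndpointCert {
  endpoint: string;
  fingerprint: string; // SHA-256 of the cert
  commonName: string;
  expiresAt: Date;
}

interface GroupedCert {
  fingerprint: string;
  commonName: string;
  expiresAt: Date;
  deployedOn: string[];
}

function groupByFingerprint(results: EndpointCert[]): GroupedCert[] {
  const byFp = new Map<string, GroupedCert>();
  for (const r of results) {
    const existing = byFp.get(r.fingerprint);
    if (existing) {
      existing.deployedOn.push(r.endpoint); // same cert, another deployment
    } else {
      byFp.set(r.fingerprint, {
        fingerprint: r.fingerprint,
        commonName: r.commonName,
        expiresAt: r.expiresAt,
        deployedOn: [r.endpoint],
      });
    }
  }
  return Array.from(byFp.values());
}
```

Feed 50 endpoint results that share one wildcard cert through this and you get a single record with 50 entries in deployedOn, instead of 50 separate alerts.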
Automated renewals still need monitoring (but differently)
Teams using Let's Encrypt with certbot or cert-manager often think they're done. Renewal is automated, so monitoring is unnecessary. Right?
Wrong. Automation fails silently all the time.
DNS provider changes their API. Rate limits get hit during a burst of renewals. The HTTP-01 challenge fails because someone changed the ingress config. The ACME account key rotates and nobody updated the secret. cert-manager pods get evicted during a node drain and don't come back.
For automated renewals, you want to monitor the renewal process, not just the expiry date. Track when the last successful renewal happened and alert if it's overdue:
# If cert was issued more than 60 days ago AND
# expires in less than 20 days, renewal is probably stuck
- alert: RenewalStalled
  expr: |
    (time() - cert_issue_timestamp) > (60 * 86400)
    and
    cert_days_remaining < 20
  for: 6h
  labels:
    severity: warning
  annotations:
    summary: "Automated renewal may be stuck for {{ $labels.domain }}"
    description: "Certificate was issued {{ $value | humanizeDuration }} ago and expires soon"
The for: 6h clause matters. Renewal processes retry. Give them time. But if nothing has changed in 6 hours and the clock is ticking, someone needs to look at it.
The notification channel matters more than you think
Slack messages get lost. Email gets filtered. PagerDuty gets expensive.
What actually works is matching the notification channel to the urgency and the team's workflow. Some teams live in Slack. Others live in Jira. Some still use email heavily. Forcing everyone through one channel is a recipe for missed alerts.
A pattern that's worked well for mid-size teams: critical alerts go to PagerDuty with auto-escalation. Warnings go to a dedicated Slack channel that's required to have zero unresolved items at the end of each week. Informational stuff goes into a weekly digest email that someone reviews during a 15-minute "certificate hygiene" slot on Monday mornings.
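The routing itself can be a simple tier-to-channel table. A sketch where the channel names are placeholders for your actual PagerDuty, Slack, and email integrations:

```typescript
// Route an alert to a notification channel based on its tier.
// Channel names are placeholders for real integrations.
type Channel = "pagerduty" | "slack-certs" | "weekly-digest";

const channelByTier: Record<1 | 2 | 3, Channel> = {
  1: "pagerduty",     // critical: page with auto-escalation
  2: "slack-certs",   // warning: dedicated channel, cleared weekly
  3: "weekly-digest", // informational: Monday hygiene review
};

function routeAlert(tier: 1 | 2 | 3): Channel {
  return channelByTier[tier];
}
```

Keeping the mapping in one table makes the policy auditable: anyone can read off, in five lines, who gets woken up for what.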
That weekly review meeting sounds bureaucratic. But teams that do it catch problems before they become incidents. Teams that rely purely on real-time alerts eventually mute them all.
Metrics to actually track
Beyond "days until expiry," there are metrics that tell you whether your monitoring setup is healthy:
Alert-to-action ratio. How many alerts result in someone actually doing something? If it's below 20%, your monitoring is generating noise. Tune it.
Mean time from alert to resolution. Getting slower over time? That's alert fatigue setting in, even if nobody admits it.
Certificate inventory drift. How many new certs appeared since last scan that weren't provisioned through your standard process? High numbers mean shadow IT is creating unmonitored certificates.
Renewal success rate. What percentage of automated renewals succeed on first attempt? Anything below 95% means your automation has reliability problems.
Track these monthly. They're leading indicators. By the time a certificate actually expires in production, you've already failed at four or five earlier checkpoints.
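The first two metrics fall out of an alert log with a handful of fields. A sketch, assuming a hypothetical AlertRecord shape; adapt the fields to whatever your alerting system actually stores:

```typescript
// Compute monitoring-health metrics from a log of past alerts.
interface AlertRecord {
  firedAt: Date;
  actedOn: boolean;       // did anyone do something in response?
  resolvedAt: Date | null;
}

// Fraction of alerts that led to action. Below ~0.2 means noise.
function alertToActionRatio(alerts: AlertRecord[]): number {
  if (alerts.length === 0) return 1;
  return alerts.filter(a => a.actedOn).length / alerts.length;
}

// Mean time from firing to resolution, in hours, over resolved alerts.
function meanTimeToResolutionHours(alerts: AlertRecord[]): number {
  const resolved = alerts.filter(a => a.resolvedAt !== null);
  if (resolved.length === 0) return 0;
  const totalMs = resolved.reduce(
    (sum, a) => sum + (a.resolvedAt!.getTime() - a.firedAt.getTime()), 0);
  return totalMs / resolved.length / 3_600_000; // ms -> hours
}
```

Run these over a rolling 30-day window and plot the trend; the direction matters more than any single month's number.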
Start with the cleanup
If you're reading this because your monitoring is already a mess, here's the uncomfortable first step: delete most of your alerts. Seriously. Turn them off. Start from zero.
Identify your tier-1 certificates. The ones that would wake up the CEO if they expired. Set up monitoring for those only. Get that working well, with low noise and fast response times. Then expand to tier-2. Then tier-3.
Building monitoring incrementally, starting from what matters most, beats trying to monitor everything at once and drowning in alerts. Your team has a finite amount of attention. Spend it on what counts.