
Rotating mTLS Client Certificates Without Taking Down Your Services

Client certificate rotation in mTLS is where most teams panic. Overlapping validity windows, trust bundle management, and the rollout order that actually works.

CertGuard Team · 7 min read

The rotation nobody planned for

Setting up mTLS is the easy part. Every tutorial walks you through generating a CA, issuing client certs, configuring your services, done. What none of them mention is what happens 90 days later when those client certs expire and you need to swap them out across 30 services without a single failed handshake.

Most teams discover this the hard way. A cert expires, service A can't talk to service B, alerts fire, someone regenerates certs manually, pushes them out, restarts everything. Outage window: somewhere between 20 minutes and 3 hours depending on how many services are involved and how panicked the on-call engineer is.

There's a better way. But it requires thinking about rotation from day one.

Why server cert rotation is simple and client cert rotation isn't

Server certificate rotation is relatively straightforward. You update the cert on the server; clients connect, validate against the CA, and as long as the new cert is signed by the same CA, everything works. The client doesn't care that the specific certificate changed.

Client certificates flip this relationship. Now the server needs to trust the client's cert. And if you're doing it properly, each service has its own unique client certificate. So when you rotate, you're not updating one thing; you're updating potentially dozens of client identities while the servers that verify them need to accept both old and new certs during the transition.

If your server only trusts certs from one CA and you're issuing new certs from the same CA, you might think you're fine. Sometimes you are. But if you've pinned specific certificates (not just the CA), or if you're rotating the intermediate CA itself, or if different services have different trust stores that get updated at different times, well. That's where things get interesting.
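The pinned-certificate case is easy to check up front. A self-contained sketch (the cert names are illustrative stand-ins) that compares the SPKI fingerprints of an old and new client cert, the value HPKP-style pinning matches on; if the new cert uses a new keypair, every pinned fingerprint has to be updated before step 2 of the rollout:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Illustrative certs: same subject, but the new cert uses a NEW keypair
openssl req -x509 -newkey rsa:2048 -nodes -keyout old-key.pem \
  -out old-client.pem -subj "/CN=service-a" -days 30
openssl req -x509 -newkey rsa:2048 -nodes -keyout new-key.pem \
  -out new-client.pem -subj "/CN=service-a" -days 30

# SPKI pin = base64(sha256(DER public key))
spki_pin() {
  openssl x509 -in "$1" -pubkey -noout \
    | openssl pkey -pubin -outform der \
    | openssl dgst -sha256 -binary | openssl enc -base64
}

if [ "$(spki_pin old-client.pem)" = "$(spki_pin new-client.pem)" ]; then
  echo "same key: existing pins keep working"
else
  echo "new key: add the new pin to server allowlists first"
fi
```

If you reuse the keypair and only reissue the certificate, SPKI pins survive rotation; pins on the full certificate hash break either way.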

The overlap window is everything

The core concept behind zero-downtime rotation is simple: the new certificate must be valid and trusted before the old certificate expires. You need an overlap window where both certificates work.

How long should this window be? Depends on how fast you can roll out changes across your infrastructure. If you can push a new cert and restart all services in 5 minutes, a 24-hour overlap is plenty. If you're dealing with mobile apps that need app store updates, you might need 30 days or more. For most server-to-server mTLS setups, 7 days works well.

# Generate the new client cert with a 7-day overlap
# Old cert expires: April 15
# Run this on April 8 — openssl x509 sets notBefore to the signing time
# New cert expires: July 7 (90 days later)

openssl x509 -req -in new-client.csr \
  -CA intermediate-ca.pem \
  -CAkey intermediate-ca-key.pem \
  -CAcreateserial \
  -out new-client.pem \
  -days 90 \
  -extfile client-ext.cnf

# Verify the dates look right before doing anything else
openssl x509 -in new-client.pem -noout -dates
# notBefore=Apr  8 00:00:00 2026 GMT
# notAfter=Jul  7 00:00:00 2026 GMT

The rollout order matters more than you think

Here's where most teams mess up. They generate the new client cert and immediately deploy it to the client service. The client starts presenting the new cert. But the server hasn't been updated to trust it yet. Handshake failure. Outage.

The correct order is:

Step 1: Update trust stores on all servers first. If you're using CA-based validation, this might mean adding a new CA or intermediate to the trust bundle. If you're pinning specific certs, add the new client cert's fingerprint to the allowed list. At this point, servers accept both old AND new client certs. Nothing breaks because clients are still presenting the old cert.

Step 2: Roll out new client certificates. Now update the clients. They start presenting the new cert. Servers already trust it. No interruption.

Step 3: Clean up old trust entries. After you've confirmed everything works with the new certs, remove the old cert fingerprints from server trust stores. Optional but good hygiene.
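One way to keep the order honest is a gate between step 1 and step 2: refuse to roll out the new client cert until every server's deployed trust bundle validates it. A self-contained sketch of that gate, where the CA, the client cert, and the per-server bundle files are all illustrative stand-ins (in real life you would fetch each bundle from the server that serves it):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Stand-in CA and new client cert
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca-key.pem \
  -out ca.pem -subj "/CN=Internal CA" -days 30
openssl req -newkey rsa:2048 -nodes -keyout new-client-key.pem \
  -out new-client.csr -subj "/CN=service-a"
openssl x509 -req -in new-client.csr -CA ca.pem -CAkey ca-key.pem \
  -CAcreateserial -out new-client.pem -days 7

# Pretend these were fetched from each server after step 1
cp ca.pem bundle-server-1.pem
cp ca.pem bundle-server-2.pem

# Gate: only proceed to step 2 if EVERY server's bundle trusts the new cert
for bundle in bundle-*.pem; do
  openssl verify -CAfile "$bundle" new-client.pem >/dev/null \
    || { echo "ABORT: $bundle does not trust the new cert yet"; exit 1; }
done
echo "all servers trust the new cert; safe to roll out"
```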

Sounds obvious written out like this. In practice, with 15 teams deploying independently and no central coordination, steps 1 and 2 happen simultaneously or in the wrong order about half the time.

Dual-cert loading: the trick that saves you

Some runtimes and proxies support loading multiple client certificates simultaneously. Envoy does this natively. Nginx can be configured for it with some effort. Your application code probably can too, depending on the TLS library.

The idea: load both old and new client certs into the client's TLS context. The TLS handshake will negotiate which one to use based on what the server requests. During the overlap window, either cert works. After the old one expires, only the new one gets used.

# Envoy config: dual client certificates
# Both certs are loaded, Envoy picks the valid one
clusters:
  - name: backend_service
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        common_tls_context:
          tls_certificates:
            # Current cert (expires April 15)
            - certificate_chain:
                filename: /certs/client-current.pem
              private_key:
                filename: /certs/client-current-key.pem
            # New cert (valid from April 8)
            - certificate_chain:
                filename: /certs/client-new.pem
              private_key:
                filename: /certs/client-new-key.pem

This is cleaner than swapping certs atomically because there's no moment where the wrong cert is loaded. Both are available throughout the transition.
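Before deploying a dual-cert config, it's worth confirming the two validity windows actually intersect. A sketch using throwaway self-signed certs in place of the real pair (assumes GNU date; the date-parsing flags differ on BSD/macOS):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Two illustrative certs standing in for current and new
openssl req -x509 -newkey rsa:2048 -nodes -keyout k1.pem \
  -out client-current.pem -subj "/CN=service-a" -days 7
openssl req -x509 -newkey rsa:2048 -nodes -keyout k2.pem \
  -out client-new.pem -subj "/CN=service-a" -days 90

# Overlap holds if the new cert's notBefore precedes the current cert's notAfter
cur_end=$(date -d "$(openssl x509 -in client-current.pem -noout -enddate | cut -d= -f2)" +%s)
new_start=$(date -d "$(openssl x509 -in client-new.pem -noout -startdate | cut -d= -f2)" +%s)

if [ "$new_start" -lt "$cur_end" ]; then
  echo "overlap ok: windows intersect"
else
  echo "NO overlap: rotation will drop traffic"
fi
```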

Automating the whole thing

Manual rotation doesn't scale past about five services. You need automation, but the kind matters.

If you're on Kubernetes, cert-manager handles most of this. It issues certs, tracks expiration, and renews them automatically. The renewal happens before expiry (configurable, defaults to 2/3 through the cert's lifetime), giving you that overlap window for free. Pair it with trust-manager to distribute CA bundles across namespaces.

# cert-manager Certificate resource for mTLS client cert
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: service-a-client-cert
  namespace: production
spec:
  secretName: service-a-mtls
  duration: 2160h    # 90 days
  renewBefore: 720h  # renew 30 days before expiry
  isCA: false
  privateKey:
    algorithm: ECDSA
    size: 256
  usages:
    - client auth
  issuerRef:
    name: internal-ca-issuer
    kind: ClusterIssuer

Outside Kubernetes, you'll need to build something. A rotation service that watches cert expiration dates, generates new certs 7-14 days before expiry, pushes them to a secrets manager (Vault, AWS Secrets Manager, whatever), and triggers service reloads. Not glamorous work, but it prevents 3 AM pages.
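A minimal sketch of the watcher half of that rotation service, using a throwaway self-signed cert in place of the real one (assumes GNU date; the renewal step is left as comments because it depends entirely on your CA and secrets manager):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Illustrative 30-day cert standing in for the real client cert
openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem \
  -out client.pem -subj "/CN=service-a" -days 30

RENEW_DAYS=14
exp=$(openssl x509 -in client.pem -noout -enddate | cut -d= -f2)
exp_epoch=$(date -d "$exp" +%s)
days_left=$(( (exp_epoch - $(date +%s)) / 86400 ))

if [ "$days_left" -lt "$RENEW_DAYS" ]; then
  echo "renew now: $days_left days left"
  # here you'd request a new cert (Vault, your CA API, ...) and reload:
  #   vault write pki_int/issue/client common_name=service-a ...
  #   systemctl reload service-a
else
  echo "ok: $days_left days left"
fi
```

Run it from cron or a systemd timer; the 14-day threshold gives the same kind of overlap window cert-manager's renewBefore provides.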

The intermediate CA rotation problem

Nobody talks about this one enough.

Rotating leaf certificates is manageable. Rotating the intermediate CA that signs those leaf certificates is a whole different beast. Every service that validates client certs has the old intermediate in its trust store. You issue new client certs signed by the new intermediate. Now you need to update trust stores everywhere before any of those new certs get presented.

The window here is much larger. You might need the old and new intermediate to coexist for weeks or months. During that time, some services have client certs signed by the old intermediate, others have certs signed by the new one, and every server needs to trust both.

Cross-signing helps. Have the new intermediate CA cross-signed by the old root (or old intermediate). Clients presenting certs from the new intermediate include the cross-signed chain, and servers that only know about the old CA can still validate them through the cross-sign path.
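Cross-signing is easy to demonstrate end to end with openssl. The sketch below (all names illustrative) creates an old root, cross-signs a new intermediate with it, issues a leaf client cert from that intermediate, and then verifies the leaf using only the old root as the trust anchor:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Old root that servers already trust
openssl req -x509 -newkey rsa:2048 -nodes -keyout old-root-key.pem \
  -out old-root.pem -subj "/CN=Old Root" -days 365

# New intermediate keypair, CROSS-SIGNED by the old root
openssl req -newkey rsa:2048 -nodes -keyout int-key.pem \
  -out int.csr -subj "/CN=New Intermediate"
printf 'basicConstraints=critical,CA:TRUE\nkeyUsage=keyCertSign\n' > ca-ext.cnf
openssl x509 -req -in int.csr -CA old-root.pem -CAkey old-root-key.pem \
  -CAcreateserial -out int-cross.pem -days 180 -extfile ca-ext.cnf

# Leaf client cert signed by the new intermediate's key
openssl req -newkey rsa:2048 -nodes -keyout leaf-key.pem \
  -out leaf.csr -subj "/CN=service-a"
openssl x509 -req -in leaf.csr -CA int-cross.pem -CAkey int-key.pem \
  -CAcreateserial -out leaf.pem -days 30

# A server that only trusts the OLD root still validates the leaf,
# because the cross-signed intermediate chains back to it
openssl verify -CAfile old-root.pem -untrusted int-cross.pem leaf.pem
```

In practice the client sends leaf + cross-signed intermediate as its chain, so servers never need to know the new CA exists until their trust stores catch up.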

If this sounds complicated, it is. It's also why many organizations skip intermediate rotation entirely and just keep using the same intermediate for years. Which works until that intermediate gets compromised, and then you're doing an emergency rotation with zero preparation. Pick your poison.

Monitoring rotation health

You can automate everything and still get burned if you're not watching the right metrics. Here's what to track:

Days until expiry per service. Not just "is anything expiring soon" but a breakdown per service, per cert type (client vs. server). Alert at 14 days, page at 3 days.

TLS handshake failures. A spike in handshake failures during a rotation window means something went wrong with the rollout order. You should see zero increase if the rotation is working correctly.

Certificate serial numbers in use. Log which cert serial each client presents. During rotation, you should see a gradual shift from old serial to new serial. If the old serial is still showing up after the rotation window closes, some service didn't pick up the new cert.

# Quick check: which server cert serial is service-a presenting?
# (client cert serials have to be logged server-side during the handshake)
echo | openssl s_client -connect service-a:8443 \
  -cert my-client.pem -key my-client-key.pem 2>/dev/null | \
  openssl x509 -noout -serial

# Compare against expected new serial
# serial=3A4B5C6D7E8F...

What actually works at scale

After watching teams struggle with this across dozens of organizations, the pattern that works is boring but effective. Short-lived certificates, 24 to 72 hours, with automated renewal running constantly. SPIFFE/SPIRE does this well. So does Istio's built-in certificate management if you're already running a service mesh.

With 24-hour certs, "rotation" isn't an event anymore. It's just normal operations. Every service gets a fresh cert every few hours. The overlap window is built in because the old cert is still valid when the new one arrives. Nobody pages anyone because there's nothing to page about.

The tradeoff is complexity in the issuance infrastructure. Your CA needs to handle a high volume of cert requests. SPIRE handles this fine. Rolling your own probably won't. And if the CA goes down, you have at most 24 hours before services start losing their ability to communicate. So you need HA for the CA too.

Long-lived certs with periodic rotation or short-lived certs with continuous renewal: both work. Mixing them (some services on 90-day certs, others on 24-hour certs) creates confusion and is where incidents happen. Pick one strategy and apply it consistently.