Everyone says they do mTLS. Most don't do it right.
Mutual TLS has become one of those checkbox items in security audits. "Do you use mTLS between services?" Yes. Check. Move on. But when you actually dig into how teams have implemented it, you find certificate validation disabled in half the services, wildcard client certs shared across 40 microservices, and nobody rotating anything because last time someone tried, three services went down.
Sound familiar? You're not alone.
Regular TLS vs. mTLS, quickly
Regular TLS is one-sided: the client verifies the server's certificate, but the server doesn't care who the client is. Your browser does this every time you visit a website. mTLS adds the reverse: the server also demands a certificate from the client and validates it. Both sides prove their identity.
The concept is dead simple. Two certificates, two validations, done. But getting it to work reliably across a fleet of services that deploy independently, scale horizontally, and occasionally get rebuilt from scratch at 2 AM by a CI pipeline? That's where things get interesting.
The "just use a service mesh" trap
Istio, Linkerd, Consul Connect. They all promise transparent mTLS. And they deliver, mostly. The sidecar proxy handles certificate issuance, rotation, and the TLS handshake itself. Your application code never touches a certificate. Sounds perfect.
Until you need to debug why Service A can't talk to Service B at 3 AM and the only error you get is "connection reset by peer." The sidecar abstracts away the TLS layer so completely that when something breaks, you're flying blind. I've watched a team spend four hours on what turned out to be an expired root CA in the mesh's trust bundle, something that would have taken ten minutes to find with a direct openssl s_client call.
Service meshes are fine. Use them if they fit your architecture. But understand what's happening underneath, because when the abstraction leaks (and it will), you need to know how mTLS actually works.
Setting up mTLS without a mesh
Sometimes you just need two services to authenticate each other. No mesh, no orchestrator, just plain TLS with client certificates. Here's what that looks like in Node.js:
const tls = require('tls');
const fs = require('fs');

const server = tls.createServer({
  key: fs.readFileSync('/certs/server-key.pem'),
  cert: fs.readFileSync('/certs/server-cert.pem'),
  ca: fs.readFileSync('/certs/client-ca.pem'), // trust anchor for client certs
  requestCert: true,       // ask the client for a certificate
  rejectUnauthorized: true // actually enforce it (you'd be surprised how many skip this)
}, (socket) => {
  const clientCert = socket.getPeerCertificate();
  console.log('Client CN:', clientCert.subject.CN);
  // maybe check the CN against an allowlist here
  socket.write('authenticated\n');
});

server.listen(8443);
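That allowlist check in the comment might look like this. A sketch only; the service names are made up, and in real systems the allowlist usually comes from config rather than a hardcoded set:

```javascript
// Hypothetical CN allowlist for the connection handler above.
const ALLOWED_CNS = new Set(['payment-service', 'billing-service']); // assumed names

function isAllowedClient(socket) {
  const cert = socket.getPeerCertificate();
  // getPeerCertificate() returns an empty object when no cert was presented
  if (!cert || !cert.subject || !cert.subject.CN) return false;
  return ALLOWED_CNS.has(cert.subject.CN);
}
```

Inside the handler, drop unwanted peers early: if (!isAllowedClient(socket)) { socket.destroy(); return; }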
And the client side:
const socket = tls.connect(8443, 'server.internal', {
  key: fs.readFileSync('/certs/client-key.pem'),
  cert: fs.readFileSync('/certs/client-cert.pem'),
  ca: fs.readFileSync('/certs/server-ca.pem'),
  // don't set rejectUnauthorized to false. just don't.
}, () => {
  if (!socket.authorized) {
    console.error('Server cert validation failed:', socket.authorizationError);
    return;
  }
  console.log('Connected with mTLS');
});
See that rejectUnauthorized: true on the server? It defaults to true in Node, but I've seen production configs where someone explicitly set it to false "for testing" and never changed it back. At that point you're requesting client certs but not actually validating them. Congratulations, you have the overhead of mTLS with none of the security.
Certificate management is where it actually falls apart
The TLS handshake is the easy part. Certificate lifecycle management is where mTLS deployments go to die.
Consider what you need to manage: every service needs its own key pair and certificate. Those certificates need to be signed by a CA that the other services trust. They need to be rotated before they expire. And when you rotate them, you need to do it without downtime, which means supporting overlapping validity periods and trusting both old and new CAs during the transition.
For services running in Kubernetes, cert-manager with a private CA issuer handles most of this:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: payment-service-mtls
spec:
  secretName: payment-service-tls
  duration: 720h # 30 days
  renewBefore: 168h # renew 7 days before expiry
  isCA: false
  privateKey:
    algorithm: ECDSA
    size: 256
  usages:
    - client auth
    - server auth
  dnsNames:
    - payment-service.prod.svc.cluster.local
  issuerRef:
    name: internal-ca
    kind: ClusterIssuer
That renewBefore: 168h gives you a week of overlap. Plenty of time. But only if your services actually pick up the new certificate without a restart. Lots of frameworks cache TLS contexts aggressively. Go's standard library, for instance, loads certificates once at startup by default. If you're not using something like tls.Config's GetCertificate callback to reload dynamically, your service will happily keep using the old cert until it expires and connections start failing.
The one-CA-to-rule-them-all mistake
Teams love simplicity. One internal CA signs everything. Server certs, client certs, all the services. Quick to set up, easy to understand. And a nightmare when that CA gets compromised or needs rotation.
Better approach: use intermediate CAs. Have a root CA that lives offline (seriously, offline, on a USB drive in a safe if you have to), and create intermediates for different purposes. One for server certificates, one for client certificates. Maybe separate ones per environment. When you need to rotate, you rotate an intermediate without touching the root.
# generate an intermediate CA for client certs
openssl req -new -key intermediate-key.pem -out intermediate.csr -subj "/CN=Internal Client CA/O=YourCompany/OU=Platform"
# sign it with the root (which should be on an airgapped machine, ideally)
openssl x509 -req -in intermediate.csr -CA root-ca.pem -CAkey root-ca-key.pem -CAcreateserial -days 1095 -extfile intermediate-ext.cnf -out intermediate-ca.pem
Three years on that intermediate. Some people go shorter. Depends on your threat model and how much pain certificate rotation causes you. If rotation is automated and tested, go shorter. If it's manual and terrifying, fix that first.
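End to end, the two-tier setup looks something like this. A sketch with a throwaway self-signed root for illustration (a real root key stays on the airgapped machine), which also covers the steps the commands above assume have already happened, like generating the intermediate key and the extension file:

```shell
set -e
# throwaway self-signed root (10 years) -- for illustration only
openssl req -x509 -newkey rsa:2048 -nodes -keyout root-ca-key.pem \
  -out root-ca.pem -days 3650 -subj "/CN=Internal Root CA/O=YourCompany"
# intermediate key + CSR, as in the commands above
openssl req -new -newkey rsa:2048 -nodes -keyout intermediate-key.pem \
  -out intermediate.csr -subj "/CN=Internal Client CA/O=YourCompany/OU=Platform"
# minimal extension file: a CA that can't mint further CAs below it
printf 'basicConstraints=critical,CA:TRUE,pathlen:0\nkeyUsage=critical,keyCertSign,cRLSign\n' > intermediate-ext.cnf
# sign the intermediate with the root
openssl x509 -req -in intermediate.csr -CA root-ca.pem -CAkey root-ca-key.pem \
  -CAcreateserial -days 1095 -extfile intermediate-ext.cnf -out intermediate-ca.pem
# sanity check: the chain should verify back to the root
openssl verify -CAfile root-ca.pem intermediate-ca.pem
```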
Debugging mTLS when it breaks
mTLS failures are uniquely frustrating because the error messages are terrible. "certificate unknown." "bad certificate." That's it. No context, no hint about which side failed validation or why.
Your best friend:
# test from the client's perspective
openssl s_client -connect service.internal:8443 -cert client-cert.pem -key client-key.pem -CAfile server-ca.pem -state -debug 2>&1 | head -50
Watch for the CertificateRequest message from the server. If you don't see one, the server isn't asking for client certs at all. If you see it but the handshake still fails, check that the client cert was signed by a CA the server trusts. Sounds obvious, but when you have three different CAs across staging and production, it's easy to mix them up.
Common gotchas, ranked by how often they burn people:
- Wrong CA bundle. The server's "ca" option doesn't include the CA that signed the client cert. Or vice versa. Check both sides.
- Certificate chain incomplete. If you're using intermediate CAs, the client needs to send its cert AND the intermediate. Not just the leaf cert.
- Clock skew. Containers with wrong system time. Certificate says "not before: 2026-03-09" and the container thinks it's March 8th. Kubernetes pods inherit the host clock, usually, but not always.
- Key usage extensions. The client cert needs the clientAuth extended key usage. If it only has serverAuth, the server will reject it during the handshake. OpenSSL won't tell you this clearly.
- SNI mismatch. Less common with mTLS between internal services, but if you're terminating at a load balancer, the SNI value needs to match what the server expects.
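For the key-usage gotcha specifically, openssl can print the EKU directly. A demo against a throwaway self-signed cert so it's self-contained; point the second command at your real client cert instead:

```shell
# mint a throwaway cert that only carries serverAuth (the broken case)
openssl req -x509 -newkey rsa:2048 -nodes -keyout demo-key.pem -out demo-cert.pem \
  -days 1 -subj "/CN=demo" -addext "extendedKeyUsage=serverAuth"
# inspect the EKU -- a usable client cert must list "TLS Web Client Authentication" here
openssl x509 -in demo-cert.pem -noout -ext extendedKeyUsage
```

If that second command errors out or shows only "TLS Web Server Authentication", you've found your handshake failure.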
Should you even use mTLS?
Hot take: for most internal service-to-service communication, mTLS is overkill. If your services run in a private VPC with proper network segmentation, the threat model that mTLS addresses (untrusted network, impersonation of services) may not be your biggest risk. Misconfigured S3 buckets and leaked API keys cause more breaches than man-in-the-middle attacks on internal networks.
But. Compliance frameworks want it. SOC 2 auditors love seeing it. And if you're in healthcare or fintech, you probably don't have a choice. Just go in with realistic expectations: mTLS adds operational complexity, and that complexity has a cost. Budget for it. Automate everything you can. Test certificate rotation regularly, not just when something breaks.
And whatever you do, don't set rejectUnauthorized to false in production. Please.