SSL

Debugging TLS Handshake Failures Without Losing Your Mind

A practical walkthrough of the most frustrating TLS handshake errors, what actually causes them, and how to fix them fast in production.

CertGuard Team · 7 min read

That moment when curl just says "handshake failure"

You deploy a new service. Everything looks fine. Then someone on the team pings you: "Hey, the API is returning some SSL error." You check, and all you get is a cryptic one-liner about a handshake failure. No details. No hints. Just vibes.

I've spent more hours than I'd like to admit staring at TLS errors that tell you absolutely nothing useful. The good news? After a while, you start recognizing patterns. Most handshake failures come down to maybe five or six root causes, and once you know what to look for, you can diagnose them in minutes instead of hours.

First move: get the actual error

Browser error pages are useless for debugging. Chrome's ERR_SSL_PROTOCOL_ERROR could mean twenty different things. Don't even bother trying to diagnose from the browser alone.

OpenSSL is your friend here:

openssl s_client -connect yoursite.com:443 -servername yoursite.com

That -servername flag matters more than you think. Without it, you're not sending SNI, and if the server hosts multiple domains on one IP (which, let's be honest, most do), you'll get the wrong certificate back or no certificate at all. I've seen teams spend an entire afternoon debugging what turned out to be a missing SNI flag in their test command.
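Modern TLS libraries try to protect you from this class of mistake. Python's ssl module, for example, won't even let you skip SNI while hostname verification is on. A minimal sketch:

```python
import socket
import ssl

# With hostname checking enabled (the default for create_default_context),
# wrap_socket() refuses to proceed without a server name -- so you can't
# accidentally send a ClientHello with no SNI.
ctx = ssl.create_default_context()
try:
    ctx.wrap_socket(socket.socket(), server_hostname=None)
except ValueError as e:
    print(e)  # check_hostname requires server_hostname
```

That's the behavior you want from a strict client: fail loudly at the call site instead of mysteriously getting the wrong certificate back.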

For more detail, add -debug or -msg to dump the raw handshake messages:

openssl s_client -connect yoursite.com:443 -servername yoursite.com -msg 2>&1 | head -50

ERR_SSL_VERSION_OR_CIPHER_MISMATCH

This one shows up constantly, and the name is misleading because it's almost never about the SSL version. Nine times out of ten, it's a cipher suite mismatch.

What's happening: the client sends a list of cipher suites it supports, the server tries to pick one from that list, and if there's no overlap, the handshake dies. Simple as that.
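You can inspect your own side of that negotiation without touching the server. In Python's ssl module, for instance, a default context will tell you exactly which suites it would offer in its ClientHello (the exact list depends on your local OpenSSL build and security policy):

```python
import ssl

# The cipher suites this client would advertise in its ClientHello.
# If none of these appear in the server's configured list, the
# handshake dies with exactly the error described above.
ctx = ssl.create_default_context()
offered = [c["name"] for c in ctx.get_ciphers()]
print(len(offered), "suites offered, first:", offered[0])
```

Comparing that output against the server side (see the nmap check below) tells you immediately whether there's any overlap.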

Common scenarios where this blows up:

  • You just hardened your Nginx config to only allow TLS 1.3 ciphers, but your monitoring tool or load balancer health check still speaks TLS 1.2
  • An old Java 8 client that doesn't support ECDHE or AES-GCM trying to connect to a modern server
  • Someone copy-pasted a "secure" Nginx SSL config from a blog post without checking what their clients actually support

Quick check:

# See what the server accepts
nmap --script ssl-enum-ciphers -p 443 yoursite.com

# Or test a specific TLS version
openssl s_client -connect yoursite.com:443 -tls1_2
openssl s_client -connect yoursite.com:443 -tls1_3

The fix is usually adding back a cipher suite or two. Yeah, I know, security hardening guides tell you to strip everything down to three ciphers. But if your Java 8 clients can't connect, that hardening isn't helping anyone. Be pragmatic.
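The same version-pinning trick works from client code, which is handy for reproducing what an older client sees. In Python, for example, capping a context at TLS 1.2 is the programmatic equivalent of `openssl s_client -tls1_2`:

```python
import ssl

# Client context pinned to TLS 1.2 only. Wrapping a socket with this
# context against a TLS-1.3-only server fails with ssl.SSLError at
# handshake time -- the same failure your legacy clients are hitting.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
ctx.maximum_version = ssl.TLSVersion.TLSv1_2
```

Drop this into a small script pointed at your server and you have a repeatable legacy-client simulator, no Java 8 machine required.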

The certificate chain problem nobody checks

Here's one that drives me crazy. Your certificate is valid. Your private key matches. The domain is correct. But clients still reject it. Why?

Incomplete certificate chain.

Your server needs to send not just the leaf certificate but also the intermediate certificates. Browsers are forgiving about this; Chrome and Firefox will often fetch missing intermediates on their own. But curl won't. Java won't. Python's requests library won't. And that health check running in your Kubernetes pod definitely won't.

# Check the chain
openssl s_client -connect yoursite.com:443 -servername yoursite.com 2>/dev/null | grep -A2 "Certificate chain"

# Verify the full chain
openssl verify -CAfile /etc/ssl/certs/ca-certificates.crt -untrusted intermediate.pem server.pem
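A quick offline sanity check is to count how many PEM blocks the bundle your server config points at actually contains; a file with a single certificate is almost certainly just the leaf. A throwaway helper (the commented path is hypothetical, use whatever your ssl_certificate directive references):

```python
def pem_cert_count(bundle_text: str) -> int:
    """Number of certificates in a PEM bundle. A full chain for a
    typical public CA has at least two: leaf + intermediate."""
    return bundle_text.count("-----BEGIN CERTIFICATE-----")

# with open("/etc/nginx/ssl/fullchain.pem") as f:  # hypothetical path
#     print(pem_cert_count(f.read()))
```

If this prints 1 for a publicly trusted cert, you've found your incomplete chain before any strict client did.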

I once spent three hours helping a team debug why their microservice-to-microservice calls were failing with "certificate verify failed." Everything worked in the browser. Everything worked with curl from a developer laptop. But the Go service kept rejecting it. Turned out the Nginx config had ssl_certificate pointing to just the leaf cert, not the full chain bundle. The browser was silently fetching the intermediate from the CA's AIA endpoint. The Go HTTP client, rightfully, wasn't.

Lesson: always test with something strict, not just a browser.

When the certificate doesn't match the hostname

Sounds obvious, right? But the hostname matching rules in X.509 are weirder than you'd expect.

A wildcard cert for *.example.com covers api.example.com but NOT api.staging.example.com. Wildcards only match one level. This trips people up all the time, especially when they start adding subdomains for environments.
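The one-level rule is simple enough to encode, which makes it easy to sanity-check your own assumptions. Here's a simplified sketch of the matching logic (real validators follow RFC 6125 and handle edge cases this doesn't, like partial-label wildcards):

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Simplified X.509 wildcard matching: '*' covers exactly one
    DNS label, never several."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False  # '*' can't absorb extra labels
    return all(p == "*" or p == h for p, h in zip(p_labels, h_labels))

print(wildcard_matches("*.example.com", "api.example.com"))          # True
print(wildcard_matches("*.example.com", "api.staging.example.com"))  # False
```

The label-count check on line 7 is the whole story: `api.staging.example.com` has four labels, the pattern has three, so no match, no matter what the wildcard says.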

Also, the common name (CN) field? Basically deprecated for hostname matching. Modern TLS libraries check the Subject Alternative Name (SAN) extension. If your cert has the right CN but no SAN entries, newer versions of Go, Python, and even OpenSSL will reject it.

# Check what names the cert actually covers
openssl s_client -connect yoursite.com:443 -servername yoursite.com 2>/dev/null | \
  openssl x509 -noout -text | grep -A1 "Subject Alternative Name"

And no, you can't put a wildcard in the SAN for a public CA-issued cert at multiple levels. Some internal CAs let you get away with *.*.example.com but public CAs won't sign that. If you need coverage for deeply nested subdomains, you need multiple SANs or multiple certs.

The sneaky proxy problem

This one's subtle. Everything works locally. Everything works in staging. Production? Handshake failure. What changed?

A TLS-terminating proxy or CDN that you forgot about. Cloudflare, AWS ALB, a corporate forward proxy, whatever. The client isn't talking to your server; it's talking to something in between, and that something has its own certificate, its own cipher preferences, and its own ideas about what TLS version to speak.

Debugging this:

# Check who's actually responding
openssl s_client -connect yoursite.com:443 -servername yoursite.com 2>/dev/null | \
  openssl x509 -noout -issuer -subject

# If the issuer is "Cloudflare Inc" or "Amazon" and you expected your own cert...
# you've found your problem

A client had this exact situation last year. They'd renewed their certificate, uploaded it to their origin server, tested it, everything green. But users were still seeing the old cert, because Cloudflare terminates TLS at the edge and was still serving its cached edge certificate, not the renewed one from the origin. They needed to purge and re-issue on Cloudflare's side. Took them two days to figure out because they kept looking at the wrong server.

CERTIFICATE_VERIFY_FAILED in containers

If you run anything in Docker, you've hit this. The container's CA certificate store is either missing, outdated, or doesn't include the root CA you need.

Alpine-based images are the usual suspect:

# In your Dockerfile
RUN apk add --no-cache ca-certificates && update-ca-certificates

Distroless images are worse because you can't just apt-get your way out of it. You need to copy the CA bundle in during the build. And if you're using a private CA for internal services, you need to add that CA's root certificate to the trust store explicitly.
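For distroless, the usual workaround is a multi-stage build: produce the CA bundle in a stage that has a package manager, then copy just the file into the final image. A sketch (image tags and stage names here are illustrative):

```dockerfile
# Stage with a package manager, used only to produce the CA bundle
FROM debian:bookworm-slim AS certs
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates

# Final distroless image: copy the bundle to the path TLS libraries expect
FROM gcr.io/distroless/static-debian12
COPY --from=certs /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt
```

The same COPY line is where you'd append your internal CA's root if the service talks to a private PKI.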

# For a custom CA in Alpine
COPY my-internal-ca.crt /usr/local/share/ca-certificates/
RUN update-ca-certificates

Python has its own special flavor of this problem. The requests library uses certifi for its CA bundle, which is completely separate from the system CA store. So even if your OS trusts the cert, Python might not. You can point it at the system store with:

export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt

Or in code:

import requests
requests.get("https://internal-api.example.com", verify="/path/to/ca-bundle.crt")
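If you'd rather keep the system trust store and just add your internal root on top, the stdlib approach is to build an SSLContext and hand it to whatever HTTP client accepts one (the CA path in the comment is hypothetical):

```python
import ssl

# Start from the defaults -- which load the system store on most
# platforms -- then layer one internal root on top, instead of
# replacing the entire bundle the way REQUESTS_CA_BUNDLE does.
ctx = ssl.create_default_context()
# ctx.load_verify_locations("/usr/local/share/ca-certificates/my-internal-ca.crt")
```

urllib.request and most async clients take a context directly; with requests you'd mount it via a custom HTTPAdapter.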

Node.js has the same issue. By default, it uses a compiled-in CA bundle. For custom CAs:

export NODE_EXTRA_CA_CERTS=/path/to/my-ca.crt

The "it was working yesterday" checklist

When something that was working suddenly breaks, skip the deep debugging and check these first. In order of likelihood:

  1. Certificate expired. Check with openssl x509 -noout -dates. Let's Encrypt certs last 90 days. If auto-renewal failed silently, this is your answer.
  2. CA root or intermediate rotated. Let's Encrypt switched from DST Root CA X3 to ISRG Root X1 and it broke a shocking number of things. CAs do rotate, and old clients with stale trust stores get left behind.
  3. Someone changed the server config. Check git blame on your Nginx/Apache config. Someone "optimizing" TLS settings is a classic.
  4. DNS changed. If DNS now points to a different IP, you might be hitting a different server that doesn't have the right cert. Basic, but I've seen it.
  5. Client updated. A new version of Chrome, curl, or your HTTP library tightened its requirements. Happened when Chrome started enforcing Certificate Transparency; certs without SCTs suddenly stopped working.
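For item 1, you don't even need openssl on the box: Python's ssl module can parse the notAfter timestamp that getpeercert() returns, so an expiry check is a few lines:

```python
import ssl
import time

def days_left(not_after: str) -> float:
    """Days until expiry, given a notAfter string in the format
    getpeercert() returns, e.g. 'Jun  1 12:00:00 2026 GMT'."""
    return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400

print(round(days_left("Jan  1 00:00:00 2030 GMT")))
```

Wire that into a cron job or health endpoint with an alert threshold around 14 days and "auto-renewal failed silently" stops being a 3 AM discovery.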

Stop guessing, start capturing

When nothing else works, packet capture. I know, it sounds like overkill. It's not. A 10-second tcpdump will tell you exactly where the handshake fails.

sudo tcpdump -i eth0 -w /tmp/tls-debug.pcap 'port 443' &
# reproduce the error
kill %1
# open in Wireshark, filter: tls.handshake

Look at the ClientHello, the ServerHello (if there is one), and where the connection drops. If the server sends back a fatal alert, the alert description code tells you exactly what went wrong. Alert 40 is handshake failure. Alert 42 is bad certificate. Alert 48 is unknown CA. These codes are defined in RFC 5246 (and carried forward for TLS 1.3 in RFC 8446), and they're actually useful, unlike browser error pages.

TLS debugging isn't glamorous work. But getting good at it means you're the person on the team who can actually fix the 3 AM outage instead of just staring at it. And honestly, once you've seen enough of these failures, most of them become obvious within a couple minutes. The trick is building that pattern recognition, and not trusting browsers to tell you the whole story.