TLS 1.3 Session Resumption: Why Your CDN Cares More Than You

Session Tickets Are Dead. Long Live PSK.

TLS 1.2 had two ways to resume sessions: session IDs (server-stored state) and session tickets (an encrypted blob the client holds). Both kinda sucked for different reasons. Session IDs don't scale because every edge server needs to share state. Session tickets undermine forward secrecy because the key that encrypts them has to stick around.

TLS 1.3 dropped session ID resumption entirely and folded tickets into a single PSK (pre-shared key) mechanism. The server still sends a ticket, but it now identifies a key instead of being the whole resumption story. Sounds similar, but the mechanics shifted in ways that matter if you're serving real traffic.

What Actually Changed

In TLS 1.2, a session ticket was just an encrypted session state blob. The server encrypted it, sent it to the client, and when the client came back with that ticket, the server decrypted it and skipped the full handshake. Saved one round trip. Great for mobile connections where latency kills you.

Problem? That encryption key (the ticket key) had to persist. If someone grabbed your ticket key and recorded your traffic, they could decrypt past sessions. Bye bye forward secrecy.

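TLS 1.3's PSK mode works differently. After the initial handshake, both sides derive a "resumption master secret" from the session, and the server sends a ticket whose nonce turns that secret into a PSK. When the client reconnects, it uses that PSK to authenticate a new session. As long as the resumed handshake also runs a fresh (EC)DHE exchange (the psk_dhe_ke mode), each resumption creates fresh keys and forward secrecy survives. With PSK-only key exchange (psk_ke) it does not.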
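
To make the derivation concrete, here is a minimal sketch of how a resumption PSK comes out of the resumption master secret. RFC 8446 defines it as HKDF-Expand-Label(resumption_master_secret, "resumption", ticket_nonce, Hash.length). The function names are my own, and the sketch assumes a SHA-256 cipher suite; real TLS libraries do this internally, so treat it as an illustration and not something to deploy.

```javascript
const crypto = require('node:crypto');

// HKDF-Expand (RFC 5869) over HMAC-SHA256.
function hkdfExpand(prk, info, length) {
  const blocks = [];
  let previous = Buffer.alloc(0);
  for (let counter = 1; Buffer.concat(blocks).length < length; counter++) {
    previous = crypto
      .createHmac('sha256', prk)
      .update(Buffer.concat([previous, info, Buffer.from([counter])]))
      .digest();
    blocks.push(previous);
  }
  return Buffer.concat(blocks).subarray(0, length);
}

// HKDF-Expand-Label (RFC 8446, section 7.1): the info field is a
// length-prefixed structure with a "tls13 " label prefix.
function hkdfExpandLabel(secret, label, context, length) {
  const fullLabel = Buffer.from('tls13 ' + label, 'ascii');
  const info = Buffer.concat([
    Buffer.from([length >> 8, length & 0xff]), // uint16 output length
    Buffer.from([fullLabel.length]), fullLabel, // opaque label<7..255>
    Buffer.from([context.length]), context,     // opaque context<0..255>
  ]);
  return hkdfExpand(secret, info, length);
}

// One PSK per ticket: same master secret, different ticket nonce.
function deriveResumptionPsk(resumptionMasterSecret, ticketNonce) {
  return hkdfExpandLabel(resumptionMasterSecret, 'resumption', ticketNonce, 32);
}

// Usage with a placeholder secret (a real one comes out of the handshake).
const placeholderSecret = Buffer.alloc(32, 0x01);
const pskForTicket0 = deriveResumptionPsk(placeholderSecret, Buffer.from([0]));
const pskForTicket1 = deriveResumptionPsk(placeholderSecret, Buffer.from([1]));
console.log(pskForTicket0.equals(pskForTicket1)); // false
```

The nonce is why a server can hand out several tickets from one handshake without any two of them sharing a PSK.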

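But here's the catch: PSK resumption in TLS 1.3 doesn't actually eliminate round trips in the standard mode. A full TLS 1.3 handshake is already one round trip, and a resumed handshake is also one round trip, just with pre-shared auth instead of certificates. What you save is the certificate bytes and the signature work, not a network round trip. The latency win is marginal unless you enable 0-RTT, which opens a whole different can of worms (replay attacks, mostly).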
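
To put numbers on that, here is a small back-of-the-envelope sketch. It counts only network round trips before the client can send its first request (one for TCP, plus the TLS handshake) and ignores CPU time and certificate size. The mode names and the function are made up for illustration.

```javascript
// Round trips the TLS handshake needs before application data can flow.
const TLS_HANDSHAKE_RTTS = new Map([
  ['tls1.2-full', 2],
  ['tls1.2-resumed', 1],
  ['tls1.3-full', 1],
  ['tls1.3-resumed', 1],
  ['tls1.3-0rtt', 0],
]);

// Time until the first request can leave the client, in milliseconds.
function timeToFirstRequestMs(mode, rttMs, tcpRtts = 1) {
  if (!TLS_HANDSHAKE_RTTS.has(mode)) {
    throw new Error(`unknown mode: ${mode}`);
  }
  return (tcpRtts + TLS_HANDSHAKE_RTTS.get(mode)) * rttMs;
}

// On an 80 ms cellular link:
console.log(timeToFirstRequestMs('tls1.2-full', 80));    // 240
console.log(timeToFirstRequestMs('tls1.2-resumed', 80)); // 160
console.log(timeToFirstRequestMs('tls1.3-full', 80));    // 160
console.log(timeToFirstRequestMs('tls1.3-resumed', 80)); // 160
console.log(timeToFirstRequestMs('tls1.3-0rtt', 80));    // 80
```

In this model a resumed TLS 1.3 connection lands exactly where a fresh one does. Only 0-RTT moves the number.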

Why CDNs Love This (And You Might Not)

Cloudflare and Fastly pushed hard for TLS 1.3 resumption improvements because at edge scale, every millisecond compounds. If you're serving 10 million requests per second, shaving 20ms off handshakes is massive. They also have the infrastructure to rotate resumption secrets safely across globally distributed PoPs.

For everyone else? The calculus is murkier.

I've seen internal services disable resumption entirely because the operational complexity outweighs the latency benefit. When your median connection lifespan is 30 seconds and you're in the same datacenter, resumption just adds state management headaches. Clients hold onto PSKs that expire. Servers rotate secrets and invalidate half the client cache. Debugging becomes a nightmare because every failed resumption looks like a random TLS alert.

    # nginx doesn't make this obvious
    ssl_session_timeout 1h;            # how long PSKs are valid
    ssl_session_cache shared:SSL:50m;  # server-side cache
    ssl_session_tickets off;           # kill TLS 1.2 tickets entirely

    # in TLS 1.3 you get PSK resumption anyway
    # but controlling the timeout is critical

The Mobile Problem

Mobile clients love session resumption because cellular handshakes are slow. Radio state transitions, tower handoffs, spotty connectivity... every round trip hurts. So mobile OSes aggressively cache PSKs and try to resume whenever possible.

Except mobile clients also get their IP addresses reassigned constantly. Your server sees what looks like a new client from a different IP trying to use a cached PSK. Depending on your config, that might trigger rate limiting or fraud detection. I've seen payment APIs reject resumed sessions from different geolocations because it looked like credential stuffing.

You can't just allowlist resumed sessions either, because attackers absolutely will steal PSKs if they can. The security model assumes the PSK itself is secret, but it lives in the client's memory and session cache, and mobile devices get rooted and apps get instrumented all the time. If your threat model includes nation-state attackers or competitive intelligence, PSK resumption is a target.

When 0-RTT Makes Sense (Rarely)

0-RTT lets the client send application data in the first flight, before the handshake finishes. Latency goes to nearly zero for resumed connections. It's also the single most dangerous TLS 1.3 feature in production.

The problem is replay. An attacker can capture a 0-RTT request and replay it. The server can't distinguish replays from legitimate retries because early data is protected by keys derived from the PSK alone, before the server has contributed anything fresh to the handshake. For idempotent GETs, maybe acceptable. For anything that mutates state? Absolute disaster.

Most implementations disable 0-RTT by default. Cloudflare, when you turn it on, only accepts early data for a narrow class of requests it treats as safe (originally GET requests without query parameters). AWS ALBs don't support it at all. If you're considering enabling it, make sure your application layer can handle duplicate requests:

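    // application-level replay protection (Express-style middleware)
    const recentRequests = new Set();  // use a shared store with a TTL in production

    function guardEarlyData(req, res, next) {
      // only allow 0-RTT for safe methods; assumes the TLS terminator
      // forwards the Early-Data: 1 header described in RFC 8470
      if (req.header('Early-Data') === '1' && req.method !== 'GET') {
        return res.sendStatus(425);  // Too Early: client retries after the handshake
      }

      // check a unique ID in the request
      const id = req.header('X-Request-ID');
      if (id && recentRequests.has(id)) {
        return res.sendStatus(429);  // rate limit replays
      }
      if (id) recentRequests.add(id);
      return next();
    }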
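
The recentRequests check only works if that store forgets old IDs, otherwise it grows forever. Here is a minimal sketch of a time-bounded version. The ReplayCache name and its API are hypothetical, it lives in a single process, and it assumes the window is at least as long as the period during which your server accepts early data for a given ticket. Across multiple edge servers you would need a shared store instead.

```javascript
class ReplayCache {
  // windowMs: how long an ID is remembered.
  // now: injectable clock, so the behavior can be tested without waiting.
  constructor(windowMs, now = Date.now) {
    this.windowMs = windowMs;
    this.now = now;
    this.seen = new Map(); // request ID -> expiry timestamp (ms)
  }

  // Returns true the first time an ID shows up inside the window,
  // false when the same ID is presented again before it expires.
  checkAndRecord(id) {
    const t = this.now();

    // Map iterates in insertion order and every entry has the same TTL,
    // so expired entries are always at the front.
    for (const [key, expiry] of this.seen) {
      if (expiry > t) break;
      this.seen.delete(key);
    }

    if (this.seen.has(id)) return false;
    this.seen.set(id, t + this.windowMs);
    return true;
  }
}

// Usage: remember request IDs for ten seconds.
const replayCache = new ReplayCache(10_000);
console.log(replayCache.checkAndRecord('req-1')); // true
console.log(replayCache.checkAndRecord('req-1')); // false
```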

Rotation Hell

Resumption secrets need to rotate. In TLS 1.2, ticket keys would typically rotate every 24 hours. In TLS 1.3, the pressure is to go shorter still because PSKs can be used for 0-RTT, and RFC 8446 caps ticket lifetime at seven days no matter what you configure.

But rotation breaks active sessions. A client with a PSK from yesterday can't resume if you've already rotated the secret. So you need overlapping validity windows. Keep the old secret around for some grace period while issuing new PSKs with the new secret.
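
A minimal sketch of that overlap, assuming a single process and a hypothetical TicketKeyRing class (real servers get this from their TLS library or a key distribution service). New tickets are always issued under the newest key, while older keys stay available for lookups until the rotation period plus a grace period has passed.

```javascript
const crypto = require('node:crypto');

class TicketKeyRing {
  constructor(rotateEveryMs, graceMs, now = Date.now) {
    this.rotateEveryMs = rotateEveryMs;
    this.graceMs = graceMs;
    this.now = now; // injectable clock for testing
    this.keys = []; // newest first: { name, secret, createdAt }
    this.rotate();
  }

  // Put a fresh random key at the front of the ring.
  rotate() {
    this.keys.unshift({
      name: crypto.randomBytes(16).toString('hex'),
      secret: crypto.randomBytes(32),
      createdAt: this.now(),
    });
  }

  // Key used to protect newly issued tickets. Rotates first if the
  // current key has reached the end of its issuing period.
  issuingKey() {
    this.prune();
    if (this.now() - this.keys[0].createdAt >= this.rotateEveryMs) {
      this.rotate();
    }
    return this.keys[0];
  }

  // Key used to accept a ticket the client presents, or null if that
  // key has aged out and the client must do a full handshake.
  findKey(name) {
    this.prune();
    return this.keys.find((key) => key.name === name) ?? null;
  }

  // Drop keys older than the issuing period plus grace; always keep the newest.
  prune() {
    const cutoff = this.now() - (this.rotateEveryMs + this.graceMs);
    this.keys = this.keys.filter(
      (key, index) => index === 0 || key.createdAt > cutoff
    );
  }
}

// Usage: rotate daily, keep old keys one extra hour.
const ring = new TicketKeyRing(24 * 60 * 60 * 1000, 60 * 60 * 1000);
const current = ring.issuingKey();
console.log(ring.findKey(current.name) === current); // true
```

The grace period should be at least as long as the ticket lifetime you advertise, otherwise clients will present tickets that no key can open.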

This is fine in theory. In practice, it means your TLS library needs to manage multiple concurrent secrets, your monitoring needs to track which secret generation each failure correlates with, and your deployment process can't just roll new secrets instantly.

At a previous job, we had a runbook for resumption secret rotation that was longer than the runbook for rotating the actual TLS certificates. The certificates were automated. The resumption secrets? Manual because we kept breaking sessions during deploys.

What To Actually Do

For public-facing services with lots of mobile clients: Enable PSK resumption with conservative timeouts (30-60 minutes max). Disable 0-RTT unless you have a very specific use case and robust replay protection. Monitor resumption rates and failed attempts separately from other TLS errors.
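
If you terminate TLS in Node, one way to get the resumption rate is to count handshakes per protocol on the server's secureConnection event. This is a sketch with hypothetical helper names. It relies on tls.TLSSocket's isSessionReused() and getProtocol(), and it only sees handshakes that succeeded: a resumption attempt that quietly fell back to a full handshake is counted as fresh, so tracking failed attempts needs hooks from your TLS library.

```javascript
// Per-protocol counters, e.g. 'TLSv1.3' -> { fresh: 10, resumed: 90 }.
function createResumptionStats() {
  return new Map();
}

// `socket` is a tls.TLSSocket, or anything exposing the same two methods.
function recordHandshake(stats, socket) {
  const protocol = socket.getProtocol() || 'unknown';
  const entry = stats.get(protocol) || { fresh: 0, resumed: 0 };
  if (socket.isSessionReused()) {
    entry.resumed += 1;
  } else {
    entry.fresh += 1;
  }
  stats.set(protocol, entry);
  return entry;
}

// Fraction of handshakes that were resumed, 0 when nothing was recorded.
function resumptionRate(entry) {
  const total = entry.fresh + entry.resumed;
  return total === 0 ? 0 : entry.resumed / total;
}

// Wiring sketch (needs real key and cert options, so it is not run here):
//   const tls = require('node:tls');
//   const stats = createResumptionStats();
//   const server = tls.createServer(options);
//   server.on('secureConnection', (socket) => recordHandshake(stats, socket));
```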

For internal service meshes: Consider disabling resumption entirely if your latency budget allows it. The operational complexity usually isn't worth the 10-20ms you save. Focus on connection pooling and HTTP/2 multiplexing instead.
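
On the client side of an internal service, that can be as small as one shared agent. A sketch for Node, where the pool sizes are placeholder numbers and the hostname in the usage comment is made up. keepAlive reuses established connections so most requests never handshake at all, and maxCachedSessions: 0 turns off the agent's client-side TLS session cache if you decide resumption isn't worth it.

```javascript
const https = require('node:https');

// One agent shared by every outbound call from this service.
const pooledAgent = new https.Agent({
  keepAlive: true,      // keep idle connections open for reuse
  maxSockets: 32,       // cap on concurrent connections per host
  maxFreeSockets: 8,    // idle connections kept around per host
  maxCachedSessions: 0, // skip client-side TLS session caching
});

// Usage (hypothetical internal host):
//   https.get('https://billing.internal.example/health', { agent: pooledAgent }, (res) => {
//     res.resume();
//   });
```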

For edge/CDN scenarios: You're probably already doing this right because your vendor handles it. But verify that resumption secrets rotate across your global PoPs in a coordinated way. A botched rotation can tank resumption hit rates globally.

And if you ever see TLS alerts that only happen for resumed sessions, check your server's session cache size first. Running out of cache space will silently break resumption and the errors are indistinguishable from network issues.

    # quick check: is resumption actually helping?
    # compare handshake latency for fresh vs resumed

    openssl s_client -connect example.com:443 -tls1_3 -sess_out /tmp/session.pem
    # the output should report "New, TLSv1.3"; note the handshake time

    openssl s_client -connect example.com:443 -tls1_3 -sess_in /tmp/session.pem
    # this one should report "Reused, TLSv1.3"
    # if this isn't noticeably faster, resumption is broken or pointless

TLS 1.3 session resumption is better designed than TLS 1.2's approach. Doesn't mean it's automatically worth the complexity for your specific situation. Measure, don't assume.