Your infrastructure wasn't built for certificate challenges
ACME domain validation rests on assumptions that were true in 2015: your server has a stable IP, a filesystem you can write to, and it stays running long enough for a CA to call back. Those assumptions aged poorly.
Modern infrastructure actively sabotages the validation process. Lambda functions spin down between requests. Kubernetes pods get rescheduled mid-challenge. Edge workers run in isolates with no disk access. And yet, every single one of these environments needs TLS certificates.
I've watched teams spend weeks debugging certificate issuance in environments where the fundamental problem was architectural. The challenge mechanism doesn't fit. So you adapt. Or you lose sleep.
Serverless: where HTTP-01 goes to die
Try running an HTTP-01 challenge on AWS Lambda behind API Gateway. Go ahead. The ACME client needs to write a token file to /.well-known/acme-challenge/ and keep it accessible until the CA fetches it. But there's no persistent filesystem. Your function might not even be warm when the CA comes knocking, and API Gateway's routing might not pass through the challenge path at all without explicit configuration.
The workaround most people land on: stick the challenge token in DynamoDB or S3, then write a small handler that serves it. Something like this:
// Lambda handler for ACME HTTP-01 challenges
// store tokens in DynamoDB beforehand
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  const token = event.pathParameters.token;
  const item = await dynamo.get({
    TableName: 'acme-challenges',
    Key: { token }
  }).promise();
  if (!item.Item) {
    return { statusCode: 404, body: 'nope' };
  }
  return {
    statusCode: 200,
    headers: { 'content-type': 'text/plain' },
    body: item.Item.keyAuth
  };
};
It works. But now you're maintaining infrastructure specifically to get certificates for your infrastructure. That recursive dependency should tell you something: HTTP-01 is the wrong tool here. Switch to DNS-01 and save yourself the Lambda-DynamoDB-API-Gateway dance.
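If you do make that switch, the serverless-specific machinery disappears and the only moving part left is one DNS API call. A minimal sketch of the request body you'd hand to `route53.changeResourceRecordSets()` in the AWS SDK (the zone ID and domain are placeholders):

```javascript
// Build the Route53 change batch that publishes a DNS-01 challenge record.
// Pure function: the returned object is what you'd pass to
// route53.changeResourceRecordSets() in the AWS SDK.
function acmeTxtChange(hostedZoneId, domain, txtValue) {
  return {
    HostedZoneId: hostedZoneId,
    ChangeBatch: {
      Changes: [{
        Action: 'UPSERT',
        ResourceRecordSet: {
          Name: `_acme-challenge.${domain}`,
          Type: 'TXT',
          TTL: 60, // low TTL so retries and cleanup propagate quickly
          // Route53 requires TXT values wrapped in double quotes
          ResourceRecords: [{ Value: `"${txtValue}"` }]
        }
      }]
    }
  };
}
```

Issue the same change with `Action: 'DELETE'` once the order is valid; stale challenge records are a common source of confusion in later renewals.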
Containers and the ephemeral problem
Kubernetes introduced a specific flavor of pain. Your pod requests a certificate, starts the ACME challenge, and then Kubernetes reschedules it to a different node. The challenge token? Gone with the old pod. The CA comes to validate, hits a fresh container that knows nothing about any pending challenge. Validation fails. Cert-manager retries. Sometimes the retry works. Sometimes you hit Let's Encrypt rate limits first.
Cert-manager solves most of this by decoupling the challenge lifecycle from individual pods. It runs its own solver pods or manages DNS records through provider APIs. But teams running custom ACME implementations inside containers still get bitten.
The pattern that actually holds up: never let application containers handle their own certificate validation. Externalize it completely.
# cert-manager with DNS-01 via Route53
# this runs outside your app pods entirely
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
    - dns01:
        route53:
          region: eu-west-1
          # uses IRSA, no static credentials
          # your app pods never touch this
One team I worked with had each microservice managing its own certs. Thirty-seven services, each running certbot in a sidecar container. The cluster's DNS was getting hammered with TXT record updates, and they kept triggering Let's Encrypt's duplicate certificate limit. Consolidating to a single cert-manager instance with wildcard certs cut their certificate-related incidents to zero overnight.
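Consolidation like that usually means one wildcard Certificate per zone instead of per-service issuance. Against the letsencrypt-prod ClusterIssuer above, that's a single resource (names and namespace here are illustrative):

```yaml
# One wildcard cert covering every service under the zone,
# issued through the shared ClusterIssuer
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: ingress-system
spec:
  secretName: wildcard-example-com-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - "*.example.com"
  - example.com
```

Wildcard names require DNS-01, which the issuer is already doing anyway.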
Edge and CDN: someone else's computer, someone else's rules
Cloudflare Workers, Deno Deploy, Vercel Edge Functions. You don't control the server. You barely control the routing. How do you validate domain ownership when you can't serve arbitrary files or modify DNS programmatically?
Short answer: you usually don't. The platform does it for you. Vercel handles certificate provisioning automatically when you add a custom domain. Cloudflare issues certs through their own CA partnership. You point your DNS at them and certificates just appear.
The trap is thinking you still need to manage this yourself. I've seen developers deploy a separate VPS just to run certbot for a domain that's fully hosted on Cloudflare. Cloudflare was already issuing and rotating certificates for that domain. The VPS certbot was generating certs that nothing used.
But there's a real problem hiding here. When the platform handles your certs, you lose visibility. Certificate transparency logs become your only window into what's actually issued for your domains. And if the platform has an incident, like the Cloudflare certificate delays in January 2024, you find out when your users see browser warnings, not before.
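Given that CT logs are the only window you have, monitoring them is worth automating. A sketch against crt.sh's public JSON endpoint; the dedup logic is a simplification, and the fetch function is injected so the sketch stays testable without network access:

```javascript
// List distinct (issuer, common name) pairs logged for a domain via crt.sh.
// fetchJson is injected; in production, pass a wrapper around fetch()
// that calls res.json().
async function loggedCerts(domain, fetchJson) {
  const url = `https://crt.sh/?q=${encodeURIComponent('%.' + domain)}&output=json`;
  const entries = await fetchJson(url);
  // CT logs contain precerts and duplicate entries; collapse by issuer + name
  const seen = new Set();
  for (const e of entries) {
    seen.add(`${e.issuer_name} | ${e.common_name}`);
  }
  return [...seen];
}
```

Diff the output against yesterday's run; an issuer you don't recognize is the signal that matters.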
IoT and devices with no DNS access
This is where it gets genuinely difficult. An embedded device in a factory needs TLS. It has network access but can't modify DNS records. It doesn't run a web server that a CA can reach from the public internet. HTTP-01 and DNS-01 are both off the table.
Three approaches that people actually use in production:
Device identity certificates from a private CA. Skip public CAs entirely. Run your own CA (HashiCorp Vault, step-ca, EJBCA) and issue certificates through an enrollment protocol like EST or SCEP. The device authenticates to your CA using a bootstrap credential, gets a cert, done. No domain validation needed because you're not validating public domain ownership; you're asserting device identity within your own PKI.
# Issue a device cert with step-ca
# the device runs this on first boot
step ca certificate "sensor-42.factory.internal" \
  device.crt device.key \
  --provisioner "device-bootstrap" \
  --provisioner-password-file /etc/step/password \
  --not-after 720h
ACME device attestation. An IETF draft (draft-acme-device-attest) defines a mechanism where device identity attestation gets tied into the ACME flow. Still early. Google and a few others are pushing it. Not widely deployed outside of managed device fleets yet.
Centralized issuance with distribution. A management server handles all the ACME challenges centrally, then pushes certificates to devices over a secure channel. Works well for fleets of hundreds of devices. Falls apart when devices are offline for extended periods and miss their renewal window.
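That renewal-window failure is easier to reason about when renewal is scheduled as a fraction of certificate lifetime rather than at a fixed date: renewing at two-thirds of lifetime leaves the remaining third as an offline buffer. A sketch of the arithmetic:

```javascript
// Schedule renewal at a fraction of the cert's lifetime (2/3 here, which
// is also where cert-manager's default renewBefore lands); whatever is
// left after that point is the buffer for devices that are offline.
function renewalTime(notBefore, notAfter, fraction = 2 / 3) {
  const lifetimeMs = notAfter.getTime() - notBefore.getTime();
  return new Date(Math.round(notBefore.getTime() + lifetimeMs * fraction));
}
```

With a 90-day cert this schedules renewal at day 60, so a device can miss 30 days of check-ins and still renew before expiry.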
The DNS-01 escape hatch (and its limits)
Most of the environments above push you toward DNS-01 validation. Makes sense. You don't need the target server to be reachable. You just need API access to your DNS provider. But DNS-01 has its own failure modes that get amplified at scale.
Propagation delays are the obvious one. You create a TXT record, the CA queries for it, gets a cached NXDOMAIN from a previous lookup, validation fails. Setting low TTLs on your ACME challenge records helps, but some resolvers ignore TTL hints below their minimum (looking at you, certain ISP resolvers that cache for 300 seconds no matter what).
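One practical mitigation: verify the record yourself before telling the CA to validate, instead of firing the challenge blind. A sketch with the resolver injected so it can be exercised without network I/O; in production you'd pass `require('dns').promises.resolveTxt`, ideally via a `dns.promises.Resolver` pointed at the zone's authoritative nameservers rather than a caching resolver:

```javascript
// Poll for the challenge TXT record before asking the CA to validate.
// resolveTxt is injected; production code passes the real DNS resolver.
async function waitForTxt(name, expected, resolveTxt, { attempts = 10, delayMs = 2000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      // resolveTxt yields each record as an array of string chunks
      const records = (await resolveTxt(name)).map(chunks => chunks.join(''));
      if (records.includes(expected)) return true;
    } catch (err) {
      // NXDOMAIN or no data yet -- keep polling
    }
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  return false;
}
```

This doesn't guarantee the CA's resolvers see the record, but it catches the most common failure (triggering validation before the provider has published the change at all).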
Then there's the permissions problem. To do DNS-01 validation, your ACME client needs write access to your DNS zone. That's a powerful permission. A compromised ACME client can now modify arbitrary DNS records, not just challenge tokens. Scoping permissions down to only TXT records on _acme-challenge subdomains is possible with some providers but not all. AWS Route53 lets you do it with IAM policies. GoDaddy's API? Good luck.
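On Route53, that scoping looks roughly like the following: allow ChangeResourceRecordSets on one hosted zone, constrained by the record-name and record-type IAM condition keys Route53 provides for this purpose (the zone ID and record name here are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AcmeChallengeTxtOnly",
    "Effect": "Allow",
    "Action": "route53:ChangeResourceRecordSets",
    "Resource": "arn:aws:route53:::hostedzone/Z0123456789ABCDEF",
    "Condition": {
      "ForAllValues:StringEquals": {
        "route53:ChangeResourceRecordSetsRecordTypes": ["TXT"],
        "route53:ChangeResourceRecordSetsNormalizedRecordNames": [
          "_acme-challenge.app.example.com"
        ]
      }
    }
  }]
}
```

Most ACME clients also need route53:GetChange to poll propagation and route53:ListHostedZonesByName to find the zone; both are read-only and can be granted broadly.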
CNAME delegation is the cleanest mitigation. Point _acme-challenge.yourdomain.com as a CNAME to a record in a dedicated validation zone. Your ACME client only needs write access to that zone, which contains nothing except challenge records. If it gets compromised, the blast radius is limited to certificate issuance, not your production DNS.
; Production zone - read only for ACME client
_acme-challenge.app.example.com. CNAME _acme-challenge.app.example.com.acme-validation.example.net.
; Validation zone - ACME client has write access here only
; nothing else lives in this zone
_acme-challenge.app.example.com.acme-validation.example.net. TXT "challenge-token-here"
What actually works: match the method to the environment
There's no universal answer. But after watching dozens of teams struggle with this, patterns emerge.
Serverless and managed platforms: let the platform handle it. Monitor with CT logs. Don't fight it.
Kubernetes: cert-manager with DNS-01. Wildcard certs where possible to reduce the number of issuances. IRSA or workload identity for cloud DNS access, never static keys in secrets.
IoT and constrained devices: private CA. Public domain validation doesn't make sense for internal device identity. Step-ca or Vault with short-lived certs and automated rotation.
Multi-cloud or hybrid: DNS-01 with CNAME delegation to a single validation zone. Keeps the ACME machinery centralized regardless of where workloads run.
The mistake most teams make is picking one validation method and applying it everywhere. Your Kubernetes cluster and your IoT fleet have completely different trust models. Treating them the same creates complexity where simplicity was possible. Match the method to the environment. Accept that you'll run multiple approaches in parallel. That's fine. That's reality.