Support identity-based mutual TLS #249

Description

@conradbzura

Wool already provides transport encryption and certificate-authority-based peer authentication through worker-specific credentials (WorkerCredentials) and a mutual flag for mutual TLS. That model assumes a static, name-stable topology with manually provisioned, long-lived certificates: every worker is reachable at a known address, and that address is fixed for the lifetime of its certificate.

Modern service orchestrators violate every part of that assumption. On platforms such as Kubernetes and Amazon ECS/Fargate, a worker's network address is assigned dynamically when its task or pod starts, is not known ahead of time, and changes across restarts and rescheduling. Credentials are issued and rotated by the platform (e.g., via secret-store rotation, certificate-manager renewals, or short-lived workload certificates) on a schedule independent of the worker process lifetime. When a credential is misconfigured, operators need to tell that failure apart from an empty pool.

This issue keeps the existing certificate-authority trust model and worker-specific credentials exactly as they are, and closes the three gaps that block real deployments today:

  1. Verifying a discovered worker against a stable logical identity instead of against its dynamically assigned address.
  2. Adopting rotated credentials without restarting the worker or the pool.
  3. Surfacing authentication and handshake failures as a distinct, diagnosable condition.

In scope: the three behaviors above, expressed over the existing WorkerCredentials and mutual-TLS model and the existing discovery and dispatch paths.

Out of scope, deferred to dependent follow-up work: a first-class workload-identity model with SPIFFE / URI-based verification, which builds directly on the identity verification and credential rotation introduced here; and transport-agnostic token authentication, per-identity authorization of routines, authentication of the discovery plane itself, and secure-by-default guardrails, which build on that identity model in turn. Nothing in this issue should foreclose those directions.

Motivation

Integrating Wool's mutual TLS into a cloud-hosted application surfaced three blockers, each of which independently prevents enforced mutual TLS from being used in production on a dynamic-address platform.

A static certificate cannot match a dynamic address. When workers are discovered by address — the common case for orchestrator-assigned networking — the dispatching client verifies the worker's server certificate against the exact address it dialed. Because that address is not known until the task is already running, no certificate provisioned ahead of time can carry a matching identity, and the handshake fails verification. The only workarounds are to mint a fresh certificate per task (operationally heavy and costly) or to abandon enforced mutual TLS altogether.

Long-lived workers cannot follow credential rotation. Because credentials are read once and then fixed for the lifetime of the worker and pool, a long-running fleet cannot pick up rotated certificates, keys, or certificate-authority bundles without a full process restart. This pushes deployments toward long-lived certificates — itself a weaker posture — and makes short-lived, automatically rotated credentials effectively unusable.

Credential failures are invisible. When a client cannot complete the handshake with the workers it discovers (e.g., because of a wrong certificate authority, a plaintext client against encrypted workers, an address or identity mismatch, or expired or incompatible certificates), the dispatcher today observes only a generic "no workers available" outcome after its retry budget is exhausted. A fleet-wide credential misconfiguration is therefore indistinguishable from an empty pool, which makes mutual-TLS problems extremely difficult to diagnose in production.

Expected outcome

Identity-based verification of discovered workers

  • A worker provisioned with a single static certificate that carries a stable logical identity, rather than any particular network address, can be discovered at a dynamically assigned address. A credentialed client completes the mutual TLS handshake with it successfully, verifying the worker against that stable identity rather than against the dialed address.
  • The same worker certificate continues to validate after the worker is reassigned to a different address, with no change to the certificate and no per-task certificate issuance.
  • Enforced mutual TLS works end to end over address-based discovery: a complete dispatch succeeds against a worker reached at an arbitrary, previously unknown address.
  • A worker that presents a certificate not matching the expected identity, or not signed by the trusted certificate authority, is still rejected. Identity-based verification strengthens the existing guarantee rather than relaxing it; weakening or disabling server-identity verification is not an acceptable way to achieve this outcome.
  • The expected worker identity is configurable by the operator wherever credentials are configured, and a single configured identity applies across every worker in the pool regardless of the address at which each worker is reached.
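
The verification behavior described above can be illustrated with Python's standard `ssl` module. This is a minimal sketch, not Wool's implementation: the helper names `make_client_context` and `connect_by_identity` and the `expected_identity` parameter are hypothetical. It relies on the fact that `SSLContext.wrap_socket` checks the peer certificate against `server_hostname`, which need not be the address that was dialed.

```python
import socket
import ssl
from typing import Optional


def make_client_context(
    ca_file: Optional[str] = None,
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a client context that verifies both the chain and the peer name."""
    # create_default_context enables CERT_REQUIRED and check_hostname.
    # With ca_file=None the system trust store is used; a private CA
    # bundle would normally be passed here.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    if cert_file is not None:
        # Present a client certificate so the worker can authenticate us.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def connect_by_identity(
    ctx: ssl.SSLContext,
    address: str,
    port: int,
    expected_identity: str,
    timeout: float = 5.0,
) -> ssl.SSLSocket:
    """Dial a dynamically assigned address, verify a stable identity."""
    raw = socket.create_connection((address, port), timeout=timeout)
    try:
        # server_hostname drives SNI and certificate name matching, so the
        # certificate is checked against expected_identity, not `address`.
        return ctx.wrap_socket(raw, server_hostname=expected_identity)
    except Exception:
        raw.close()
        raise
```

Under these assumptions, a call such as `connect_by_identity(ctx, "10.0.3.17", 50051, "wool-worker.internal")` (both values are placeholders) would succeed only if the certificate chains to the trusted authority and names `wool-worker.internal`. Hostname checking stays enabled throughout, which matches the requirement that verification is not weakened.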

Credential rotation without restart

  • After the underlying credential artifact (the worker certificate, private key, or certificate-authority bundle) is rotated out of band, a long-running worker and a long-running pool begin using the new material for subsequent connections and handshakes without restarting the process.
  • Dispatches in flight at the moment of rotation are not abruptly torn down by the rotation itself; new material is adopted at the natural boundary of new connections rather than by interrupting existing work.
  • A deployment can run entirely on short-lived, automatically renewed credentials without being forced to provision long-lived certificates to keep the fleet reachable.
  • When a rotated credential is itself invalid or incompatible, the resulting failures are reported through the diagnosable channel described below rather than being silently swallowed.
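
One way to picture the rotation behavior above is a small holder that rebuilds its credential object whenever the files on disk change, and hands the current object to each new connection. This is a sketch under stated assumptions: `ReloadingCredentials` and its `build` callback are hypothetical names, and change detection by modification time and size is just one possible strategy.

```python
import os
import threading
from typing import Callable, Generic, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")
Fingerprint = Tuple[Tuple[int, int], ...]


class ReloadingCredentials(Generic[T]):
    """Rebuild a credential object when any watched file changes."""

    def __init__(self, paths: Sequence[str], build: Callable[[], T]) -> None:
        self._paths = tuple(paths)
        self._build = build
        self._lock = threading.Lock()
        self._fingerprint: Optional[Fingerprint] = None
        self._current: Optional[T] = None

    def _stat(self) -> Fingerprint:
        # (mtime_ns, size) per file is a cheap signal that a file was replaced.
        stats = [os.stat(p) for p in self._paths]
        return tuple((s.st_mtime_ns, s.st_size) for s in stats)

    def get(self) -> T:
        """Return credentials for a NEW connection, reloading if rotated."""
        with self._lock:
            fingerprint = self._stat()
            if fingerprint != self._fingerprint:
                # A build failure propagates to the caller rather than being
                # swallowed, so an invalid rotated credential stays visible.
                self._current = self._build()
                self._fingerprint = fingerprint
            return self._current  # type: ignore[return-value]
```

Callers would invoke `get()` only when opening a new connection. Connections that are already established keep the object they were given, so in-flight work is not interrupted by a rotation. In practice `build` might construct an `ssl.SSLContext` from the certificate, key, and CA bundle paths.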

Diagnosable authentication and handshake failures

  • When a client cannot complete the secure handshake with the workers it discovers, it receives a distinct, typed signal identifying the failure as an authentication or handshake problem, separate from, and not collapsed into, the "no workers available" condition that means discovery found nothing.
  • An operator can distinguish "no workers are present" from "workers are present, but every one of them refused, or was refused by, my credentials", both from the error surfaced to the dispatcher and from the pool's observability.
  • Common, concrete misconfigurations each produce a clear, distinguishable signal rather than a generic timeout: a plaintext client against encrypted workers, a client that trusts the wrong certificate authority, a worker identity that does not match what the client expects, and expired or otherwise rejected certificates.
  • Per-worker handshake-rejection information — counts and/or events — is observable through the pool, so an operator can see how many discovered workers refused the client's credentials, and why.
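
The distinction drawn above could be modeled roughly as follows. Every name here (`RejectionReason`, `NoWorkersAvailableError`, `WorkerAuthenticationError`, `classify`, `raise_for_outcome`) is hypothetical and only illustrates the shape of a typed signal. The numeric codes are assumed to follow OpenSSL's `X509_V_ERR_*` values as exposed through `ssl.SSLCertVerificationError.verify_code`.

```python
import enum
from collections import Counter
from typing import Dict, Mapping


class RejectionReason(enum.Enum):
    UNTRUSTED_CA = "untrusted_ca"
    IDENTITY_MISMATCH = "identity_mismatch"
    CERTIFICATE_EXPIRED = "certificate_expired"
    CERTIFICATE_REJECTED = "certificate_rejected"
    # Detecting a plaintext/TLS mismatch is transport-specific and is
    # not sketched in classify() below.
    PLAINTEXT_MISMATCH = "plaintext_mismatch"
    UNKNOWN = "unknown"


# Assumption: OpenSSL X509_V_ERR_* verification codes.
_VERIFY_CODE_REASONS: Dict[int, RejectionReason] = {
    10: RejectionReason.CERTIFICATE_EXPIRED,  # certificate has expired
    18: RejectionReason.UNTRUSTED_CA,         # self-signed certificate
    19: RejectionReason.UNTRUSTED_CA,         # self-signed cert in chain
    20: RejectionReason.UNTRUSTED_CA,         # unable to get local issuer
    62: RejectionReason.IDENTITY_MISMATCH,    # hostname mismatch
    64: RejectionReason.IDENTITY_MISMATCH,    # IP address mismatch
}


def classify(exc: BaseException) -> RejectionReason:
    """Map a handshake exception to a coarse, distinguishable reason."""
    code = getattr(exc, "verify_code", None)
    if code is None:
        return RejectionReason.UNKNOWN
    return _VERIFY_CODE_REASONS.get(code, RejectionReason.CERTIFICATE_REJECTED)


class NoWorkersAvailableError(Exception):
    """Discovery found nothing to dispatch to."""


class WorkerAuthenticationError(Exception):
    """Workers were discovered, but every handshake was rejected."""

    def __init__(self, rejections: Mapping[str, RejectionReason]) -> None:
        self.rejections = dict(rejections)          # per-worker reason
        self.counts = Counter(rejections.values())  # per-reason totals
        summary = ", ".join(f"{r.value}={n}" for r, n in self.counts.items())
        super().__init__(f"all {len(self.rejections)} workers rejected: {summary}")


def raise_for_outcome(
    discovered: int, rejections: Mapping[str, RejectionReason]
) -> None:
    """Raise the error matching a dispatch attempt; return if any worker is usable."""
    if discovered == 0:
        raise NoWorkersAvailableError("discovery found no workers")
    if len(rejections) >= discovered:
        raise WorkerAuthenticationError(rejections)
```

With this shape, an empty pool and a fleet-wide credential problem raise different exception types, and `counts` gives the per-reason totals that a pool could expose for observability.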

Compatibility and posture (cross-cutting)

  • Existing deployments are unaffected: workers and pools that use static-address mutual TLS, one-way TLS, or plaintext today continue to behave exactly as before, with no required configuration changes.
  • The secure-by-default posture is preserved — mutual TLS remains the default when credentials are supplied — and none of these changes weaken an existing guarantee or make an insecure configuration easier to reach by accident.
  • The new behaviors are opt-in and compose with the existing WorkerPool configuration modes (default, ephemeral, durable, and hybrid) and the existing discovery backends, without requiring any particular orchestrator.

Labels: feature (New feature or capability)
