Skip to content

Support identity-based mutual TLS — Closes #249#250

Draft
conradbzura wants to merge 38 commits into
wool-labs:mainfrom
conradbzura:249-identity-based-mtls
Draft

Support identity-based mutual TLS — Closes #249#250
conradbzura wants to merge 38 commits into
wool-labs:mainfrom
conradbzura:249-identity-based-mtls

Conversation

@conradbzura

@conradbzura conradbzura commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Close the three gaps that block Wool's mutual TLS on dynamic-address orchestrators (Kubernetes, ECS/Fargate), while keeping the existing certificate-authority trust model and WorkerCredentials exactly as they are. The change is additive and opt-in: a bare WorkerCredentials is coerced into an identity-free static provider everywhere, so static-address mTLS, one-way TLS, and plaintext deployments behave byte-for-byte as before, and no protobuf change is required.

The approach introduces a CredentialsProviderLike seam — the unit the runtime consults for current credential material — and threads it through the Worker Subsystem (client and server) and the Load Balancing layer:

  • Identity-based verification keys client channels on a content fingerprint and adds gRPC's ssl_target_name_override, so a worker reached at a dynamically assigned address is verified against a stable logical identity (its certificate SAN) rather than the dialed address. Full chain and SAN verification are preserved — verification is strengthened, not relaxed.
  • Rotation without restart re-reads file-backed material on change: the fingerprinted client pool yields fresh channels for rotated material while in-flight dispatches finish on their existing channel, and the worker server adopts new material per connection via grpc.dynamic_ssl_server_credentials.
  • Diagnosable failures classify a failed handshake as a typed HandshakeError (with a reason enum). A handshake-failing worker is skipped without eviction — so a worker mid-rotation self-heals on a later dispatch — and each rejection is emitted as a structured log carrying the classified reason; a drained dispatch raises the plain NoWorkersAvailable. A typed aggregate of the per-worker failures is deferred to a follow-up (see Review remediation).

One trade-off worth noting: rotation replaces cert/key/CA bytes but not the mutual-TLS mode, which grpc.dynamic_ssl_server_credentials fixes at construction. A second: the handshake classifier keys control flow on the gRPC status code plus the presence of any TLS/cert token in the error text — a deliberately broad gate that degrades a tokenless TLS failure to plain transient-skip rather than risking a false eviction. A handshake failure is skipped without eviction, so a worker mid-rotation self-heals on a later dispatch rather than being removed until re-discovery.

Closes #249

Proposed changes

Credential provider abstraction (worker/auth.py, Worker Subsystem)

Add CredentialsProviderLike (a resolve() -> CredentialsSnapshot protocol with a reloadable flag) and the fingerprinted CredentialsSnapshot, with one public implementation — FileCredentialsProvider (re-reads PEM files on change, validates before caching, returns the last-good snapshot on a transient or malformed read, and is lock-guarded for the off-loop server fetcher) — plus an internal _StaticCredentialsProvider that a bare WorkerCredentials coerces into. WorkerCredentials.provider_from_files(..., identity=, reload=) is the ergonomic entry point; _coerce_provider is the single seam that normalizes a bare WorkerCredentials into a provider. A blank identity normalizes to None (address-based path). CredentialsContext is widened to carry either form. The public surface is deliberately small: pass a CredentialsProviderLike (typically a FileCredentialsProvider, or your own custom provider), or a WorkerCredentials for the trivial static case.

Identity verification and fingerprint-keyed pool (worker/connection.py, Worker Subsystem)

Re-key the client channel pool on (target, credential fingerprint + identity, options) via _CredentialKey so unchanged material reuses a pooled channel and rotated material yields a fresh one. _channel_factory adds ssl_target_name_override only when an identity is configured; WorkerConnection holds a provider and resolves it per dispatch. Self-dispatch over the loopback UDS keeps its insecure key.

Diagnosable handshake failures (worker/connection.py + loadbalancer/, Worker Subsystem + Load Balancing)

Add HandshakeError(RpcError) with a typed Reason, classified structurally from the gRPC status code and error text (including the ssl_target_name_override hostname-verification mismatch → IDENTITY_MISMATCH and CA-verification failures → CERT_VERIFY). RoundRobinLoadBalancer skips a handshake-failing worker without evicting it (advancing to the next candidate and logging a structured warning that carries the classified reason), so a rotated worker recovers on a later dispatch. A drained dispatch raises the plain NoWorkersAvailable. HandshakeError.details is redacted to a fixed message rather than gRPC's verbose debug blob.

Provider wiring, client and server (worker/proxy.py, pool.py, local.py, process.py, base.py)

WorkerProxy, WorkerPool, LocalWorker, WorkerProcess, and the factory protocols accept either a WorkerCredentials or a provider. The proxy resolves its provider from the argument or the ambient context and forwards it to each connection; WorkerProcess serves rotating credentials via a per-connection fetcher for a reloadable provider and the unchanged static path otherwise; LocalWorker's stop RPC applies the identity override. The insecure self-dispatch UDS socket is confined to a per-worker 0700 directory under a short base ($XDG_RUNTIME_DIR or /tmp, within the AF_UNIX path limit).

Public API and docs (__init__.py, READMEs)

Export CredentialsProviderLike, CredentialsSnapshot, FileCredentialsProvider, and HandshakeError (the static wrapper stays internal as _StaticCredentialsProvider). Document identity verification, rotation, the diagnosable handshake errors, the self-dispatch socket, and the discovery-plane trust boundary in the top-level and worker READMEs.

Test cases

# Test Suite Given When Then Coverage Target
1 TestCredentialsSnapshot Credential material and an identity A snapshot is built Its fingerprint is stable for identical material and changes for any byte, identity, or mutual-flag difference Fingerprint determinism
2 TestStaticCredentialsProvider A static provider over fixed material resolve() is called repeatedly It returns the same snapshot, reports reloadable False, and survives a pickle roundtrip Back-compatible provider
3 TestFileCredentialsProvider A file provider over PEM files The files are unchanged, rotated, fail to re-read, or are replaced with malformed PEM It reuses, re-reads, or keeps the last-good snapshot respectively; an in-place same-size rewrite is still detected; concurrent resolves stay consistent Rotation source robustness
4 TestWorkerCredentials PEM paths with reload and identity (incl. blank) provider_from_files is called It returns a static or reloading provider; a blank identity normalizes to None Provider entry point
5 TestHandshakeError A status code, details, and reason A HandshakeError is constructed and pickled It is a non-transient RpcError exposing the reason, and survives a serialization roundtrip Error taxonomy + wire-safety
6 TestWorkerConnection A stub raising UNAUTHENTICATED or UNAVAILABLE with TLS evidence (incl. a verbatim hostname-verification failure) A task is dispatched It raises HandshakeError with the matching reason (IDENTITY_MISMATCH for the hostname case); a plain UNAVAILABLE stays transient; details are redacted Handshake classification + drift canary
7 TestWorkerConnection A provider with or without an identity A task is dispatched The secure channel carries ssl_target_name_override only when an identity is set Identity override
8 TestWorkerConnection A reloading provider, dispatched then rotated A second task is dispatched; a primed stream is iterated after rotation A new channel is built for the rotated fingerprint while an in-flight stream finishes on its original channel Fingerprint-keyed pool + in-flight survives rotation
9 TestWorkerConnection A secure connection self-dispatching over UDS A task is dispatched It routes over the insecure loopback and never builds a secure channel Loopback safety
10 TestWorkerProxy A provider supplied directly or via the credential context The proxy starts It admits only secure workers and forwards the provider to each connection Proxy provider wiring
11 TestRoundRobinLoadBalancer A pool whose workers all fail the handshake A task is dispatched It raises NoWorkersAvailable with the workers left in the pool (skipped, not evicted) Skip-without-evict on drain
12 TestRoundRobinLoadBalancer One handshake-failing worker and one healthy worker (and, separately, a surviving transient worker) A task is dispatched The failing worker is skipped (not evicted) and the dispatch succeeds on the healthy one; with a transient survivor the result is a plain NoWorkersAvailable Skip-without-evict + over-claim guard
13 TestRoundRobinLoadBalancer A worker that fails the handshake once then its connection recovers A second dispatch runs It dispatches successfully without re-discovery, and each rejection is logged Recoverability + rejection observability
14 TestWorkerProcess A reloading versus a static provider The worker server starts It builds dynamic server credentials for the former and the static path for the latter; the self-dispatch socket lives in a 0700 directory within the path limit Server-side rotation + UDS hardening
15 TestLocalWorker A worker whose provider carries an identity stop() is called The stop channel carries the identity override Stop-RPC identity
16 test_identity_mtls (integration) A real worker with an identity certificate at an ephemeral address A routine is dispatched with a matching identity; with an untrusted CA; with a mismatched identity; and after rotating the CA on disk The matching dispatch succeeds; the untrusted CA and the mismatched identity each drain to NoWorkersAvailable with the classified reason (CERT_VERIFY / IDENTITY_MISMATCH) logged; and a rotated CA is adopted without restarting the worker End-to-end identity, diagnosability, and rotation

Review remediation (post-review hardening)

A 20-agent application-security panel reviewed this PR (0 blocking; security core verified sound). The findings are remediated in the commits from fix(loadbalancer): recover handshake-failed workers … through test+docs(mtls): …:

  • Load balancer recoverability — handshake failures skip-without-evict (a rotated worker self-heals); a drained dispatch raises the plain NoWorkersAvailable, with each rejection logged.
  • Diagnostics — dropped the in-memory rejection ledger (unbounded, keyed by per-restart uid) in favour of per-rejection structured logs carrying the classified reason; redacted gRPC's debug blob from HandshakeError.details. The dedicated AllWorkersUnauthenticated signal was removed in favour of the plain NoWorkersAvailable (its name over-claimed, and it only covered the all-or-nothing case); a typed aggregate of the per-worker failures — likely an ExceptionGroup-based NoWorkersAvailable variant — is deferred to a follow-up issue.
  • Rotation robustness — validate-before-cache (a malformed PEM never overwrites good material), log on last-good fallback, detect in-place/same-size rewrites and inode swaps (st_ino + st_ctime_ns), and a threading.Lock around the off-loop resolve.
  • Contractsreloadable is now a protocol member; a blank identity normalizes to None.
  • UDS hardening — the insecure self-dispatch socket is confined to a per-worker 0700 directory under a short base directory (the canonical fix for the AF_UNIX path-length limit; man 7 unix), not a world-readable temp path.
  • Diagnosability end-to-end — an ssl_target_name_override mismatch now classifies as IDENTITY_MISMATCH (the real AioRpcError says "Hostname Verification Check failed", not the C-core log's "no match found for server name"); a drift canary pins the gRPC strings.
  • Docs — documented the discovery-plane trust boundary: WorkerMetadata (incl. the secure flag) is self-advertised over an unauthenticated, forgeable plane, so the security filter is a compatibility gate, not a trust boundary; trust rests on the mTLS handshake.

New e2e coverage: real rotation-without-restart, negative-identity rejection, wire-safety pickle round-trips, in-flight-survives-rotation, and the classifier drift canary. Full suite: 1368 passed, 98.10% coverage.

…tion

Introduce CredentialProviderLike — the seam through which the runtime
obtains current credential material — plus a fingerprinted
CredentialSnapshot and two implementations:

- StaticCredentialProvider wraps fixed WorkerCredentials (the
  back-compatible default; a bare WorkerCredentials is wrapped in one
  with no identity, preserving address-based verification exactly).
- FileCredentialProvider re-reads PEM files as they rotate, returning
  the last-good snapshot on a transient read failure so an in-flight
  rotation never tears the fleet down.

CredentialSnapshot.fingerprint is a content hash over the CA/key/cert
bytes, mutual flag, and expected identity — the stable key under which
client channels will be pooled (added in a later commit) so unchanged
material reuses a channel while rotated material yields a fresh one.

WorkerCredentials.provider_from_files(..., identity=, reload=) is the
ergonomic entry point. WorkerCredentials itself is untouched. No
behavior change yet — wiring follows.

Refs wool-labs#249
A failed secure handshake with a discovered worker (wrong CA, identity
mismatch, expired/rejected cert, plaintext-vs-encrypted) is today
indistinguishable from a transient hiccup: it surfaces as UNAVAILABLE,
the load balancer skips the worker forever, and the pool looks empty.

Introduce HandshakeError(RpcError) with a typed Reason enum and a
structural classifier wired into WorkerConnection.dispatch:

- UNAUTHENTICATED is always a handshake failure (peer rejected the
  client certificate).
- UNAVAILABLE is promoted to a HandshakeError only when the error text
  carries positive TLS evidence (broad ssl/tls/handshake/certificate
  tokens); a plain unreachable worker stays TransientRpcError, so there
  is no regression and no false eviction.

The reason (CERT_VERIFY, IDENTITY_MISMATCH, PEER_UNAUTHENTICATED,
PLAINTEXT_VS_ENCRYPTED, TLS_HANDSHAKE) is best-effort diagnostic
metadata; the load-bearing signal is the type. As a non-transient
RpcError it is evicted by the existing load-balancer arm — a later
commit records it so the pool can surface a distinct drained-on-auth
signal. Identity mismatch is distinguished from a CA rejection by the
gRPC error text alone, so the classifier needs no configured identity.

Refs wool-labs#249
Rework the client channel pool so it keys on credential *content*, not on
a gRPC credentials object's identity, and verify discovered workers
against a configurable logical identity instead of the dialed address.

- _CredentialKey makes the pool key (target, fingerprint+identity,
  options): unchanged material reuses the pooled channel; rotated
  material resolves to a new fingerprint and a fresh channel on the next
  dispatch, while in-flight dispatches finish on the old channel.
- WorkerConnection now holds a CredentialProviderLike and resolves it on
  every dispatch, so rotated material is adopted without restart.
- _channel_factory adds grpc.ssl_target_name_override when an identity is
  configured, verifying the server cert SAN against the identity rather
  than the address — full chain + SAN verification preserved.
- Self-dispatch over the loopback UDS keeps its insecure None-credential
  key even for a secure provider: a worker never does TLS against itself.

The proxy sentinel wraps its resolved WorkerCredentials in a
StaticCredentialProvider for now; the fuller provider/context wiring
follows. A bare WorkerCredentials with no identity wraps to an identity-
free static provider, so static-address mTLS, one-way TLS, and plaintext
are byte-for-byte unchanged.

Refs wool-labs#249
Widen WorkerProxy's credentials parameter to accept either a
WorkerCredentials or a CredentialProviderLike, and store the resolved
provider so the worker sentinel forwards it to each WorkerConnection
(which resolves it per dispatch for rotation).

- auth._coerce_provider normalizes a bare WorkerCredentials into an
  identity-free StaticCredentialProvider — the single seam pools,
  proxies, and worker processes use to accept either form.
- CredentialContext now carries either a WorkerCredentials or a provider,
  so a provider bound in the ambient context is resolved by the proxy
  exactly as credentials were before.
- The security filter keys on provider presence, unchanged in behavior:
  a credentialed proxy still admits only secure workers.

The reduce tuple still omits credentials (re-resolved from context on the
receiving side), and bare-WorkerCredentials callers are unaffected.

Refs wool-labs#249
…pool

When every discovered worker fails the secure handshake, the dispatcher
previously drained to a generic NoWorkersAvailable — indistinguishable
from a genuinely empty pool. Make that condition diagnosable:

- AllWorkersUnauthenticated(NoWorkersAvailable) is raised when the pool
  drains because workers refused (or were refused by) the client's
  credentials, carrying the per-worker HandshakeErrors. Subclassing
  NoWorkersAvailable keeps every existing 'except NoWorkersAvailable'
  caller working.
- RoundRobinLoadBalancer catches HandshakeError before the generic
  RpcError arm: it evicts the worker (a HandshakeError is an RpcError) and
  records it, then raises AllWorkersUnauthenticated instead of the bare
  NoWorkersAvailable when any rejection occurred.
- LoadBalancerContext keeps a per-worker HandshakeRejection ledger
  (counts + last reason), surfaced through WorkerProxy.rejections so an
  operator can see how many workers refused and why. The ledger is an
  optional capability — not added to the LoadBalancerContextLike protocol
  — so custom contexts that predate it still satisfy the protocol and the
  balancer skips recording via getattr.

Refs wool-labs#249
Wire the credential-provider seam through the worker server side so a
long-running worker adopts rotated material without a restart.

- WorkerProcess._server_credentials builds grpc.dynamic_ssl_server_
  credentials for a reloadable provider: its fetcher re-resolves the
  provider on each new connection, so rotated cert/key/CA bytes are
  adopted at the natural boundary of new connections while established
  RPCs continue. A static provider keeps the unchanged
  WorkerCredentials.server_credentials path, so the static-mTLS posture
  is byte-for-byte preserved. The mTLS mode is fixed from the initial
  material — rotation replaces bytes, not the handshake mode.
- Providers carry into the worker subprocess via multiprocessing, so the
  credential context now propagates a provider (resolved by nested
  WorkerProxies) rather than a bare WorkerCredentials.
- LocalWorker's stop RPC resolves the current snapshot and applies the
  identity override so a worker reached only by its logical identity can
  still be stopped.
- WorkerPool, LocalWorker, WorkerProcess, and the WorkerFactory/
  BoundWorkerFactory protocols accept either a WorkerCredentials or a
  provider; a bare WorkerCredentials is coerced to an identity-free
  static provider, leaving existing deployments unchanged.

Refs wool-labs#249
Add end-to-end integration coverage and finalize docs/exports for the
identity-based mTLS feature.

- test_identity_mtls.py spawns real workers over ephemeral loopback
  addresses: one whose certificate's only SAN is a logical identity (a
  credentialed client completes the handshake by verifying against that
  identity, not the dialed address), and one served by an untrusted CA (a
  mismatched client surfaces AllWorkersUnauthenticated with a typed
  handshake reason). The second case validates the classifier against a
  real BoringSSL 'certificate signature failure'.
- The existing secure-pool pairwise integration tests pass unchanged
  through the new provider-coercion path, confirming static-address mTLS,
  one-way TLS, and plaintext deployments are unaffected.
- Document identity verification, rotation, and the diagnosable handshake
  errors in both READMEs, and register the new public symbols in the
  public-API completeness test.

Closes wool-labs#249
@conradbzura conradbzura self-assigned this Jun 17, 2026
…rsUnauthenticated

Review finding (panel consensus): the load balancer hard-evicted a worker
on HandshakeError, and raised AllWorkersUnauthenticated whenever any
handshake rejection occurred. Two problems: an evicted worker could not be
re-admitted by in-place credential rotation (discovery emits worker-updated
for a still-running worker, which never re-admits), and the typed signal
over-claimed while a non-rejecting transient worker remained.

Treat HandshakeError as skip-without-eviction (like a transient error)
while logging each rejection, so a worker stays in the pool and recovers on
the next dispatch once its credentials rotate. Raise AllWorkersUnauthenticated
only when a dispatch cycle drained with at least one handshake rejection and
no non-handshake failure (a surviving transient worker or a non-handshake
eviction yields the plain NoWorkersAvailable). Per-rejection observability is
the warning log; historical aggregation is left to metrics/tracing rather
than retained in memory.

Refs wool-labs#249
…tails

Review decision: handshake-rejection history does not belong in an
in-memory pool ledger — that was unbounded (keyed by per-restart worker
uid, a slow leak on dynamic fleets) and a parallel observability mechanism.
Per-rejection structured logs are the event stream (consumed by
log/metrics tooling), and AllWorkersUnauthenticated.rejections remains the
bounded per-dispatch programmatic signal.

- Remove HandshakeRejection, LoadBalancerContext's rejection ledger
  (record_rejection / rejections), and WorkerProxy.rejections, plus their
  exports. Dissolves the unbounded-growth finding and the two-divergent-
  rejections-shapes finding in one move.
- Redact gRPC's debug_error_string from HandshakeError.details: on the
  empty-details fallback use a fixed message instead of str(error), so an
  internal peer address / C-core source paths no longer ride into logs or
  across the wire. Token classification still reads the verbose blob
  transiently; it is never stored on the surfaced exception.

Remove the now-stale ledger tests; add a details-redaction test.

Refs wool-labs#249
…ture, lock

Address review findings on FileCredentialProvider's rotation path:

- Validate material before caching: parse the CA/cert/key PEM (cryptography)
  and keep the prior good snapshot on failure, so a non-atomic rotation that
  leaves a readable-but-truncated PEM no longer overwrites good material and
  fail opaquely at the handshake later.
- Log a warning on the last-good fallback (a module logger), so a stuck
  rotation — unreadable or malformed replacement — is diagnosable instead of
  silent.
- Strengthen change detection: include st_ino and st_ctime_ns alongside
  (mtime, size), so an inode swap and a same-size in-place rewrite (which an
  os.utime mtime reset would otherwise hide) are both detected.
- Guard resolve()'s read-compare-write with a threading.Lock — the server's
  per-connection fetcher consults it off the asyncio loop on a gRPC C-core
  thread. Excluded from __getstate__; recreated via __setstate__.

Tests: validate-before-cache, fallback logging, same-size-reset-mtime
detection, and a concurrent-rotation thread-safety stress test. Two prior
tests that used dummy PEM bytes now use real certificates.

Refs wool-labs#249
- Promote reloadable to the CredentialProviderLike protocol so the
  rotate-without-restart contract is explicit. A conformant custom provider
  now declares it; the worker still reads it via getattr with a False
  default for robustness. _coerce_provider branches on WorkerCredentials
  (wrap) vs else (pass through as a provider), so a resolve()-only custom
  provider keeps working despite the larger runtime_checkable protocol.
- Normalize an empty/whitespace-only identity to None at the provider
  boundary (both providers and provider_from_files), so a blank identity
  takes the address-based path instead of emitting an empty
  ssl_target_name_override that fails verification opaquely.

Tests: reloadable values for both providers; blank-identity normalization
across the static provider, file provider, and provider_from_files.

Refs wool-labs#249
…; doc fixes

- Bind the insecure self-dispatch Unix-domain socket inside a per-worker
  0700 tempfile.mkdtemp directory (cleaned up on shutdown) instead of a
  world-readable /tmp path, so a co-located process under another uid
  cannot reach the unauthenticated dispatch service. The socket filename
  is short and fixed ('dispatch.sock') because the directory is already
  unique — a per-worker UUID in the name would breach the AF_UNIX path
  limit (~104 bytes on macOS) once nested under the temp directory.
- _server_credentials resolves the provider once at startup (for the
  initial config and the fixed mutual-TLS mode); the fetcher re-resolves
  per connection for rotation.
- Derive the handshake classifier's 'secure' flag from the resolved pool
  key's credential slot, not provider presence, so the self-dispatch UDS
  path (insecure) is classified correctly.
- Correct the false 'forcefully terminated' LocalWorker._stop docstring
  and the stale '(target, credentials, options)' pool-key line; document
  the UDS local-trust assumption.

Test: the self-dispatch socket lives in a 0700 directory.

Refs wool-labs#249
Follow-up to the UDS hardening: binding an over-long Unix-domain-socket
path (struct sockaddr_un.sun_path is 104 bytes on macOS/BSD) wedges worker
startup — which the 0700-directory nesting made reachable under a deeply
nested $TMPDIR. A filesystem socket path can't abbreviate its $TMPDIR
prefix the way discovery's flat shared-memory names do via _short_hash, so
guard the length and degrade to TCP-only self-dispatch (which already works
as the fallback) instead of risking a hang.

Test: an over-long temp path binds no Unix port.

Refs wool-labs#249
Supersede the TCP-downgrade guard with the canonical fix for the AF_UNIX
path-length limit: don't create the socket in a deep directory. macOS's
per-user $TMPDIR (/var/folders/.../T) is deep enough that nesting a 0700
sub-directory under it overflowed the ~104-byte sockaddr_un.sun_path cap
(108 on Linux, as little as 92 on some platforms) and wedged worker
startup. Bind under a short base instead — $XDG_RUNTIME_DIR (the
per-user runtime dir on Linux, already a 0700 tmpfs) where set, else /tmp —
keeping the path well under the portable budget, so there is no reason to
fall back to TCP. The per-worker 0700 directory (uid isolation) is retained.

Refs: man 7 unix (sun_path 108); 'don't create a Unix domain socket in a
deep directory'. Refs wool-labs#249
The handshake classifier matched 'no match found for server name' — which
only appears in gRPC's C-core *logs*. The actual AioRpcError for an
ssl_target_name_override mismatch is UNAVAILABLE with text 'Custom
verification check failed with error: UNAUTHENTICATED: Hostname
Verification Check failed.', carrying none of the gate tokens, so an
identity mismatch fell through to plain NoWorkersAvailable — defeating the
diagnosability the feature exists to provide.

- Add 'hostname verification' to the identity tokens, so an SAN/identity
  mismatch classifies as IDENTITY_MISMATCH.
- Add 'verification check failed' to the handshake gate, so any gRPC
  custom-verification failure is recognized as a handshake failure (worst
  case TLS_HANDSHAKE) rather than mistaken for plain unreachability.

Captured the verbatim strings from grpc.aio against a real worker.

e2e: a real credential rotation is adopted without restarting the worker
(IT-01), and a certificate that does not match the configured identity is
rejected with IDENTITY_MISMATCH (IT-02); the untrusted-CA e2e now asserts
the specific CERT_VERIFY reason.

Refs wool-labs#249
…discovery note

- Pickle/cloudpickle round-trips for HandshakeError and
  AllWorkersUnauthenticated — both cross the wire to a calling process, so
  their reason/code/details and nested rejection mapping must survive.
- In-flight dispatch survives rotation: a primed stream finishes on its
  original channel when the credential fingerprint rotates mid-flight;
  rotation is adopted at the next connection, never by tearing down work.
- Classifier drift canary: the verbatim gRPC hostname-verification string
  maps to IDENTITY_MISMATCH, guarding the token list against gRPC upgrades.
- Document the discovery-plane trust boundary in both READMEs: metadata
  (incl. the secure flag) is self-advertised over an unauthenticated,
  forgeable plane; the security filter is a compatibility gate, not a trust
  boundary; trust rests on the mTLS handshake. Authenticating discovery is
  future work.

Refs wool-labs#249
CI pyright failed: the rotation validator imported cryptography at runtime,
but cryptography is only a test dependency — wool does not depend on it at
runtime (grpcio bundles BoringSSL). It resolved locally (dev env) but not
in CI's runtime-only environment.

Validate through stdlib ssl instead of adding a heavyweight crypto runtime
dependency for a validation nicety: load the cert/key pair and CA bundle
into an SSLContext, which parses the PEM with the same OpenSSL machinery
the transport uses and additionally checks the key matches the certificate.
ssl.SSLError is an OSError, so the existing last-good fallback already
covers it.

Also make the thread-safety test rotate atomically (write-temp + os.replace),
as a real rotator does — stdlib ssl is stricter than the previous parser and
(correctly) rejects a torn read from a non-atomic concurrent write.

Refs wool-labs#249
Tidy the credentials family before it ships (all of these symbols are new
in this PR — only WorkerCredentials is released):

- Pluralize to match the WorkerCredentials anchor and gRPC's own
  *Credentials types: CredentialProviderLike -> CredentialsProviderLike,
  CredentialSnapshot -> CredentialsSnapshot, CredentialContext ->
  CredentialsContext, FileCredentialProvider -> FileCredentialsProvider,
  and the internal _CredentialKey -> _CredentialsKey.
- Privatize the static wrapper: StaticCredentialProvider ->
  _StaticCredentialsProvider. It is pure coercion glue — what a bare
  WorkerCredentials becomes — never a type a user reaches for. The public
  story is now just 'pass a CredentialsProviderLike, or pass WorkerCredentials
  for the trivial case'; provider_from_files remains the factory for a
  fixed-material-plus-identity provider.
- Keep WorkerCredentials as-is: it is a worker attribute (parallels
  WorkerMetadata) and fits wool's pervasive Worker* convention; the
  credential *infrastructure* around it stays role-named Credentials*, like
  DiscoveryLike / LoadBalancerLike.

The negative-identity e2e now builds its client via the public
provider_from_files instead of the (now private) static provider.

Refs wool-labs#249
…ction logs

A drained dispatch — including one where every worker tried failed the
secure handshake — now raises the plain NoWorkersAvailable. The dedicated
AllWorkersUnauthenticated signal (unreleased) is removed: its name
over-claimed ('Unauthenticated' when the per-worker reasons span
CERT_VERIFY / IDENTITY_MISMATCH / PLAINTEXT_VS_ENCRYPTED), and it only
covered the all-or-nothing case while a mixed handshake+transient drain was
left opaque.

What stays is the part that matters: HandshakeError still skips-without-
eviction (a worker mid-rotation self-heals on a later dispatch), and each
rejection is logged as a structured warning carrying the classified reason
— that log is the diagnosability surface.

Surfacing the per-worker failures programmatically (an 'all workers failed'
aggregate that also covers mixed drains) is deferred to a future generic
mechanism — likely an ExceptionGroup-based NoWorkersAvailable variant that
accumulates each worker's RpcError — rather than a misleading typed name.

- roundrobin: drop the rejections ledger + had_non_handshake_failure flag;
  the terminal raise collapses to NoWorkersAvailable.
- Integration tests assert the classified reason via the load balancer's
  logs (caplog) instead of the removed exception's rejections mapping.

Refs wool-labs#249
- Add a RoundRobinLoadBalancer pickle round-trip test covering __reduce__.
  It was an untested gap; removing AllWorkersUnauthenticated shrank the
  coverage denominator and tipped the Unit gate to 97.99% on Python 3.11
  (the floor is 98%). Now 98.04%.
- Relocate the handshake recoverability/observability contract to
  HandshakeError's docstring — the entity that owns the behavior — per the
  style guide's one-home-per-claim rule; the load balancer keeps a one-line
  pointer. The old HandshakeError docstring was doubly stale: it still
  described worker *eviction* and the removed AllWorkersUnauthenticated
  'distinct signal'.
- NoWorkersAvailable's docstring points to HandshakeError rather than
  restating the skip-without-evict contract, and drops the future-work note
  (changelog archaeology). Converted :class: role markup to bare backticks
  in the edited docstrings.

Refs wool-labs#249
Apply the documentation remediations from the style-guide review of the
identity-mTLS docs:

- Fix the error-classification table cell for HandshakeError, which read
  "Evict worker, record rejection" -- contradicting the code (skip without
  eviction) and naming the removed rejection ledger.
- Convert the role-prefix cross-references the PR introduced in new
  docstrings (:class:/:func:/:meth:/:attr:) to bare single backticks.
- Give the new i.e./e.g. abbreviations their required trailing comma.
- Stop the READMEs from restating the HandshakeError contract field by
  field; point to the class docstring and keep only cross-cutting
  narrative. Reduce the comments and the NoWorkersAvailable docstring that
  echoed that contract to pointers.
- Drop forward-looking "future work" framing from the docs.
- Fix a stale Discovery reference (the public protocol is DiscoveryLike)
  and show the full provider_from_files signature (mutual=) in the examples.

Documentation and comments only; no behavior change.
…Provider

Collapse the credential-provider family to one public type,
WorkerCredentialsProvider(fetch, *, identity=None, reloadable=False): a
fetch callback supplies the current WorkerCredentials and the provider
stamps identity + content fingerprint. All source-specific logic (change
detection, return-cached-when-unchanged, validation, last-good fallback,
locking) lives in the fetch, keeping the provider a thin pass-through.

- Remove internal _StaticCredentialsProvider; the constant case is now
  WorkerCredentialsProvider(lambda: creds, reloadable=False) — resolved
  eagerly, callback dropped on pickle so a lambda need not be picklable.
- Replace public FileCredentialsProvider with the internal
  _FileCredentialsReader fetch functor (stat-gate + validate + last-good
  fallback + lock), wired behind WorkerCredentials.provider_from_files.
- Export WorkerCredentialsProvider; drop FileCredentialsProvider.

A reloadable fetch is re-invoked in the worker subprocess and per
handshake, so it must be importable and thread-safe; the file functor is
both. API is unreleased (wool-labs#249), so no compatibility shim. Consumers are
unchanged (resolve()/reloadable/_coerce_provider).
Narrow the credential-provider surface to a single thin mechanism,
WorkerCredentialsProvider(fetch, *, identity=None, reloadable=False),
where the fetch owns any reload strategy. Wool ships the seam, not the
policy — easy simple things (static creds), possible hard things (custom
reloading).

- Add WorkerCredentials.as_provider(*, identity=None) for the fixed case;
  _coerce_provider uses it. Drop WorkerCredentials.provider_from_files.
- Remove the internal _FileCredentialsReader (stat-gate / validate /
  last-good fallback) and _validate_material. File rotation is now a user
  fetch, e.g. WorkerCredentialsProvider(functools.partial(
  WorkerCredentials.from_files, ca, key, cert), reloadable=True), with the
  caller owning caching / fallback / TTL.
- Remove the CredentialsProviderLike protocol: with one concrete provider
  and the fetch callback as the extension seam it is redundant. Consumers
  retype to WorkerCredentialsProvider; custom providers stay possible via
  duck-typed resolve()/reloadable.
- Widen WorkerCredentials.from_files path params to str | os.PathLike[str].

Public surface shrinks to WorkerCredentials, WorkerCredentialsProvider,
CredentialsSnapshot. API unreleased (wool-labs#249); no compatibility shim.
…shot

Align the snapshot type with the Worker* credential family
(WorkerCredentials, WorkerCredentialsProvider). Mechanical rename, no
behavior change. API unreleased (wool-labs#249).
A snapshot was WorkerCredentials + identity + a precomputed fingerprint,
but WorkerCredentials is already a frozen, hashable, value-equal dataclass
— so the fingerprint is redundant with its own hashing and the identity is
just another field. Collapse the type:

- WorkerCredentials gains an identity field (normalized in __post_init__)
  and a fingerprint property (the former snapshot digest, kept for
  diagnostics and logging).
- WorkerCredentialsProvider.resolve() now returns WorkerCredentials,
  stamping the provider's identity (when configured) onto the fetched
  material; otherwise the credentials keep their own.
- The client channel pool keys on the WorkerCredentials value directly;
  _CredentialsKey and _compute_fingerprint's identity arg are removed.
- Drop WorkerCredentialsSnapshot from the public API.

Public surface: WorkerCredentials, WorkerCredentialsProvider. No behavior
change to channel pooling or rotation. API unreleased (wool-labs#249).
@conradbzura conradbzura force-pushed the 249-identity-based-mtls branch from 2b796e9 to e3443b7 Compare June 19, 2026 01:40
Use "i.e.," for the subject-alternative-name aside and wrap to the line
limit. Docstring-only.
@conradbzura conradbzura force-pushed the 249-identity-based-mtls branch 2 times, most recently from 19c5d0b to 70e50a2 Compare June 19, 2026 03:07
@conradbzura conradbzura force-pushed the 249-identity-based-mtls branch from 70e50a2 to 2fa1a09 Compare June 19, 2026 13:15
…tory

Align the callback parameter with the codebase's Factory convention and
the stdlib default_factory (a zero-argument callable that produces a
value) — the right mental model here — rather than fetch. Avoids the
property-accessor connotation of getter. Sync callable, no behavior
change; gRPC's per-handshake 'fetcher' terminology is left intact.
@conradbzura conradbzura force-pushed the 249-identity-based-mtls branch from 2fa1a09 to b7afa60 Compare June 19, 2026 13:15
The channel pool keys on the WorkerCredentials value directly — a frozen
dataclass is hashable and value-equal — so the content fingerprint lost
its only load-bearing role. Remove the public `fingerprint` property and
the `_compute_fingerprint` helper, retarget the tests that used it as a
value-equality proxy to assert WorkerCredentials equality directly, and
fix the docstrings and README that still described fingerprint-keyed
pooling.
State the WorkerCredentialsProvider resolution hot path: a reloadable
factory is consulted on every client dispatch and on every server-side
TLS handshake, so it must be importable, picklable, and safe to call
concurrently.

Convert the _apply and __getstate__ inline comments into method
docstrings and order those members so the pickle dunder precedes the
public methods. Mark WorkerCredentials with the public comment, and drop
a stray reference to the removed fingerprint left in the resolve
docstring.
Inline the trivial _coerce_provider helper at its three call sites
(LocalWorker, WorkerProcess, WorkerProxy) and drop the cross-module
private function, which the IDE flagged as unused.

Replace the CredentialsContext class with two module-level functions:
credentials_scope (a context manager that binds the ambient credentials)
and current_credentials (the reader). The contextmanager form drops the
manual __enter__/__exit__/_token machinery and its exit-without-enter
guard, so the obsolete guard test is removed.
WorkerProcess binds its self-dispatch socket under /tmp when
XDG_RUNTIME_DIR points to a path that is not a directory. Exercise that
fallback so the branch is covered.
Name the channel pool key alias for what it keys, a channel, rather
than the pool it feeds. Cast the untyped factory key to _ChannelKey so
the unpacked credentials and options are statically typed, which in turn
requires annotating grpc_options to admit the string-valued
ssl_target_name_override entry. Fold the identity-verification rationale
from an inline comment into the _channel_factory docstring.
credentials_scope and current_credentials carried a "Not part of the
public API" line that their absence from the package __all__ already
settles, leaving a second place to keep that fact in sync.
Drop the identity, certificate, and protocol-mismatch reason-token
tables and the CERT_VERIFY and IDENTITY_MISMATCH reasons they fed. Those
phrase tables drifted across gRPC and BoringSSL versions while only
refining advisory diagnostics, never the load-bearing HandshakeError
type or the worker-health routing.

Keep the broad gate that promotes an ambiguous UNAVAILABLE to a
handshake failure, since it still separates a TLS failure from a
genuinely down worker. Take the reason structurally instead:
UNAUTHENTICATED is PEER_UNAUTHENTICATED, a gated secure failure is
TLS_HANDSHAKE, and a gated insecure one is PLAINTEXT_VS_ENCRYPTED.
Wrong-CA, wrong-identity, and expired-certificate failures now all
surface as TLS_HANDSHAKE; the finer cause moves to gRPC's own tracing.

Update the unit and integration tests to assert the collapsed reasons.
The reason enum and the reason attribute were diagnostic-only: nothing
branched on a member, and the sole consumer was a warning log. The load
balancer now logs the gRPC code and details directly, and the classifier
keeps the broad UNAVAILABLE gate as a boolean rather than producing a
flavor — UNAUTHENTICATED and a TLS-evidenced UNAVAILABLE both surface the
same HandshakeError.
The value-or-provider normalization was copied verbatim into three
constructors; fold it into one classmethod, coerce, and call it from
LocalWorker, WorkerProcess, WorkerProxy, and WorkerConnection. Also move
the module-private _normalize_identity helper below both classes so the
classes are contiguous, and have resolve branch on the reloadable flag
rather than infer it from the cached sentinel.
Rename WorkerConnection's public constructor parameter from provider to
credentials and widen it to accept a WorkerCredentials value or a
WorkerCredentialsProvider, coercing a bare value internally. This makes
the credential vocabulary uniform with WorkerPool, LocalWorker, and
WorkerProxy, which already accept the union, and keeps the easy case, a
bare value, reachable at this entry point too.
Address round-1 review documentation findings: convert two carried-over
Sphinx role prefixes to bare cross-references in connection.py, drop a
redundant etc. and de-duplicate the picklability rationale in the
WorkerCredentialsProvider docstrings, and correct a test docstring that
described HandshakeError as evicted by the load balancer when it is
skipped without eviction.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support identity-based mutual TLS

1 participant