Support identity-based mutual TLS — Closes #249#250
Draft
conradbzura wants to merge 38 commits into
Draft
Conversation
…tion Introduce CredentialProviderLike — the seam through which the runtime obtains current credential material — plus a fingerprinted CredentialSnapshot and two implementations: - StaticCredentialProvider wraps fixed WorkerCredentials (the back-compatible default; a bare WorkerCredentials is wrapped in one with no identity, preserving address-based verification exactly). - FileCredentialProvider re-reads PEM files as they rotate, returning the last-good snapshot on a transient read failure so an in-flight rotation never tears the fleet down. CredentialSnapshot.fingerprint is a content hash over the CA/key/cert bytes, mutual flag, and expected identity — the stable key under which client channels will be pooled (added in a later commit) so unchanged material reuses a channel while rotated material yields a fresh one. WorkerCredentials.provider_from_files(..., identity=, reload=) is the ergonomic entry point. WorkerCredentials itself is untouched. No behavior change yet — wiring follows. Refs wool-labs#249
A failed secure handshake with a discovered worker (wrong CA, identity mismatch, expired/rejected cert, plaintext-vs-encrypted) is today indistinguishable from a transient hiccup: it surfaces as UNAVAILABLE, the load balancer skips the worker forever, and the pool looks empty. Introduce HandshakeError(RpcError) with a typed Reason enum and a structural classifier wired into WorkerConnection.dispatch: - UNAUTHENTICATED is always a handshake failure (peer rejected the client certificate). - UNAVAILABLE is promoted to a HandshakeError only when the error text carries positive TLS evidence (broad ssl/tls/handshake/certificate tokens); a plain unreachable worker stays TransientRpcError, so there is no regression and no false eviction. The reason (CERT_VERIFY, IDENTITY_MISMATCH, PEER_UNAUTHENTICATED, PLAINTEXT_VS_ENCRYPTED, TLS_HANDSHAKE) is best-effort diagnostic metadata; the load-bearing signal is the type. As a non-transient RpcError it is evicted by the existing load-balancer arm — a later commit records it so the pool can surface a distinct drained-on-auth signal. Identity mismatch is distinguished from a CA rejection by the gRPC error text alone, so the classifier needs no configured identity. Refs wool-labs#249
Rework the client channel pool so it keys on credential *content*, not on a gRPC credentials object's identity, and verify discovered workers against a configurable logical identity instead of the dialed address. - _CredentialKey makes the pool key (target, fingerprint+identity, options): unchanged material reuses the pooled channel; rotated material resolves to a new fingerprint and a fresh channel on the next dispatch, while in-flight dispatches finish on the old channel. - WorkerConnection now holds a CredentialProviderLike and resolves it on every dispatch, so rotated material is adopted without restart. - _channel_factory adds grpc.ssl_target_name_override when an identity is configured, verifying the server cert SAN against the identity rather than the address — full chain + SAN verification preserved. - Self-dispatch over the loopback UDS keeps its insecure None-credential key even for a secure provider: a worker never does TLS against itself. The proxy sentinel wraps its resolved WorkerCredentials in a StaticCredentialProvider for now; the fuller provider/context wiring follows. A bare WorkerCredentials with no identity wraps to an identity- free static provider, so static-address mTLS, one-way TLS, and plaintext are byte-for-byte unchanged. Refs wool-labs#249
Widen WorkerProxy's credentials parameter to accept either a WorkerCredentials or a CredentialProviderLike, and store the resolved provider so the worker sentinel forwards it to each WorkerConnection (which resolves it per dispatch for rotation). - auth._coerce_provider normalizes a bare WorkerCredentials into an identity-free StaticCredentialProvider — the single seam pools, proxies, and worker processes use to accept either form. - CredentialContext now carries either a WorkerCredentials or a provider, so a provider bound in the ambient context is resolved by the proxy exactly as credentials were before. - The security filter keys on provider presence, unchanged in behavior: a credentialed proxy still admits only secure workers. The reduce tuple still omits credentials (re-resolved from context on the receiving side), and bare-WorkerCredentials callers are unaffected. Refs wool-labs#249
…pool When every discovered worker fails the secure handshake, the dispatcher previously drained to a generic NoWorkersAvailable — indistinguishable from a genuinely empty pool. Make that condition diagnosable: - AllWorkersUnauthenticated(NoWorkersAvailable) is raised when the pool drains because workers refused (or were refused by) the client's credentials, carrying the per-worker HandshakeErrors. Subclassing NoWorkersAvailable keeps every existing 'except NoWorkersAvailable' caller working. - RoundRobinLoadBalancer catches HandshakeError before the generic RpcError arm: it evicts the worker (a HandshakeError is an RpcError) and records it, then raises AllWorkersUnauthenticated instead of the bare NoWorkersAvailable when any rejection occurred. - LoadBalancerContext keeps a per-worker HandshakeRejection ledger (counts + last reason), surfaced through WorkerProxy.rejections so an operator can see how many workers refused and why. The ledger is an optional capability — not added to the LoadBalancerContextLike protocol — so custom contexts that predate it still satisfy the protocol and the balancer skips recording via getattr. Refs wool-labs#249
Wire the credential-provider seam through the worker server side so a long-running worker adopts rotated material without a restart. - WorkerProcess._server_credentials builds grpc.dynamic_ssl_server_ credentials for a reloadable provider: its fetcher re-resolves the provider on each new connection, so rotated cert/key/CA bytes are adopted at the natural boundary of new connections while established RPCs continue. A static provider keeps the unchanged WorkerCredentials.server_credentials path, so the static-mTLS posture is byte-for-byte preserved. The mTLS mode is fixed from the initial material — rotation replaces bytes, not the handshake mode. - Providers carry into the worker subprocess via multiprocessing, so the credential context now propagates a provider (resolved by nested WorkerProxies) rather than a bare WorkerCredentials. - LocalWorker's stop RPC resolves the current snapshot and applies the identity override so a worker reached only by its logical identity can still be stopped. - WorkerPool, LocalWorker, WorkerProcess, and the WorkerFactory/ BoundWorkerFactory protocols accept either a WorkerCredentials or a provider; a bare WorkerCredentials is coerced to an identity-free static provider, leaving existing deployments unchanged. Refs wool-labs#249
Add end-to-end integration coverage and finalize docs/exports for the identity-based mTLS feature. - test_identity_mtls.py spawns real workers over ephemeral loopback addresses: one whose certificate's only SAN is a logical identity (a credentialed client completes the handshake by verifying against that identity, not the dialed address), and one served by an untrusted CA (a mismatched client surfaces AllWorkersUnauthenticated with a typed handshake reason). The second case validates the classifier against a real BoringSSL 'certificate signature failure'. - The existing secure-pool pairwise integration tests pass unchanged through the new provider-coercion path, confirming static-address mTLS, one-way TLS, and plaintext deployments are unaffected. - Document identity verification, rotation, and the diagnosable handshake errors in both READMEs, and register the new public symbols in the public-API completeness test. Closes wool-labs#249
…rsUnauthenticated Review finding (panel consensus): the load balancer hard-evicted a worker on HandshakeError, and raised AllWorkersUnauthenticated whenever any handshake rejection occurred. Two problems: an evicted worker could not be re-admitted by in-place credential rotation (discovery emits worker-updated for a still-running worker, which never re-admits), and the typed signal over-claimed while a non-rejecting transient worker remained. Treat HandshakeError as skip-without-eviction (like a transient error) while logging each rejection, so a worker stays in the pool and recovers on the next dispatch once its credentials rotate. Raise AllWorkersUnauthenticated only when a dispatch cycle drained with at least one handshake rejection and no non-handshake failure (a surviving transient worker or a non-handshake eviction yields the plain NoWorkersAvailable). Per-rejection observability is the warning log; historical aggregation is left to metrics/tracing rather than retained in memory. Refs wool-labs#249
…tails Review decision: handshake-rejection history does not belong in an in-memory pool ledger — that was unbounded (keyed by per-restart worker uid, a slow leak on dynamic fleets) and a parallel observability mechanism. Per-rejection structured logs are the event stream (consumed by log/metrics tooling), and AllWorkersUnauthenticated.rejections remains the bounded per-dispatch programmatic signal. - Remove HandshakeRejection, LoadBalancerContext's rejection ledger (record_rejection / rejections), and WorkerProxy.rejections, plus their exports. Dissolves the unbounded-growth finding and the two-divergent- rejections-shapes finding in one move. - Redact gRPC's debug_error_string from HandshakeError.details: on the empty-details fallback use a fixed message instead of str(error), so an internal peer address / C-core source paths no longer ride into logs or across the wire. Token classification still reads the verbose blob transiently; it is never stored on the surfaced exception. Remove the now-stale ledger tests; add a details-redaction test. Refs wool-labs#249
…ture, lock Address review findings on FileCredentialProvider's rotation path: - Validate material before caching: parse the CA/cert/key PEM (cryptography) and keep the prior good snapshot on failure, so a non-atomic rotation that leaves a readable-but-truncated PEM no longer overwrites good material and fail opaquely at the handshake later. - Log a warning on the last-good fallback (a module logger), so a stuck rotation — unreadable or malformed replacement — is diagnosable instead of silent. - Strengthen change detection: include st_ino and st_ctime_ns alongside (mtime, size), so an inode swap and a same-size in-place rewrite (which an os.utime mtime reset would otherwise hide) are both detected. - Guard resolve()'s read-compare-write with a threading.Lock — the server's per-connection fetcher consults it off the asyncio loop on a gRPC C-core thread. Excluded from __getstate__; recreated via __setstate__. Tests: validate-before-cache, fallback logging, same-size-reset-mtime detection, and a concurrent-rotation thread-safety stress test. Two prior tests that used dummy PEM bytes now use real certificates. Refs wool-labs#249
- Promote reloadable to the CredentialProviderLike protocol so the rotate-without-restart contract is explicit. A conformant custom provider now declares it; the worker still reads it via getattr with a False default for robustness. _coerce_provider branches on WorkerCredentials (wrap) vs else (pass through as a provider), so a resolve()-only custom provider keeps working despite the larger runtime_checkable protocol. - Normalize an empty/whitespace-only identity to None at the provider boundary (both providers and provider_from_files), so a blank identity takes the address-based path instead of emitting an empty ssl_target_name_override that fails verification opaquely. Tests: reloadable values for both providers; blank-identity normalization across the static provider, file provider, and provider_from_files. Refs wool-labs#249
…; doc fixes
- Bind the insecure self-dispatch Unix-domain socket inside a per-worker
0700 tempfile.mkdtemp directory (cleaned up on shutdown) instead of a
world-readable /tmp path, so a co-located process under another uid
cannot reach the unauthenticated dispatch service. The socket filename
is short and fixed ('dispatch.sock') because the directory is already
unique — a per-worker UUID in the name would breach the AF_UNIX path
limit (~104 bytes on macOS) once nested under the temp directory.
- _server_credentials resolves the provider once at startup (for the
initial config and the fixed mutual-TLS mode); the fetcher re-resolves
per connection for rotation.
- Derive the handshake classifier's 'secure' flag from the resolved pool
key's credential slot, not provider presence, so the self-dispatch UDS
path (insecure) is classified correctly.
- Correct the false 'forcefully terminated' LocalWorker._stop docstring
and the stale '(target, credentials, options)' pool-key line; document
the UDS local-trust assumption.
Test: the self-dispatch socket lives in a 0700 directory.
Refs wool-labs#249
Follow-up to the UDS hardening: binding an over-long Unix-domain-socket path (struct sockaddr_un.sun_path is 104 bytes on macOS/BSD) wedges worker startup — which the 0700-directory nesting made reachable under a deeply nested $TMPDIR. A filesystem socket path can't abbreviate its $TMPDIR prefix the way discovery's flat shared-memory names do via _short_hash, so guard the length and degrade to TCP-only self-dispatch (which already works as the fallback) instead of risking a hang. Test: an over-long temp path binds no Unix port. Refs wool-labs#249
Supersede the TCP-downgrade guard with the canonical fix for the AF_UNIX path-length limit: don't create the socket in a deep directory. macOS's per-user $TMPDIR (/var/folders/.../T) is deep enough that nesting a 0700 sub-directory under it overflowed the ~104-byte sockaddr_un.sun_path cap (108 on Linux, as little as 92 on some platforms) and wedged worker startup. Bind under a short base instead — $XDG_RUNTIME_DIR (the per-user runtime dir on Linux, already a 0700 tmpfs) where set, else /tmp — keeping the path well under the portable budget, so there is no reason to fall back to TCP. The per-worker 0700 directory (uid isolation) is retained. Refs: man 7 unix (sun_path 108); 'don't create a Unix domain socket in a deep directory'. Refs wool-labs#249
The handshake classifier matched 'no match found for server name' — which only appears in gRPC's C-core *logs*. The actual AioRpcError for an ssl_target_name_override mismatch is UNAVAILABLE with text 'Custom verification check failed with error: UNAUTHENTICATED: Hostname Verification Check failed.', carrying none of the gate tokens, so an identity mismatch fell through to plain NoWorkersAvailable — defeating the diagnosability the feature exists to provide. - Add 'hostname verification' to the identity tokens, so an SAN/identity mismatch classifies as IDENTITY_MISMATCH. - Add 'verification check failed' to the handshake gate, so any gRPC custom-verification failure is recognized as a handshake failure (worst case TLS_HANDSHAKE) rather than mistaken for plain unreachability. Captured the verbatim strings from grpc.aio against a real worker. e2e: a real credential rotation is adopted without restarting the worker (IT-01), and a certificate that does not match the configured identity is rejected with IDENTITY_MISMATCH (IT-02); the untrusted-CA e2e now asserts the specific CERT_VERIFY reason. Refs wool-labs#249
…discovery note - Pickle/cloudpickle round-trips for HandshakeError and AllWorkersUnauthenticated — both cross the wire to a calling process, so their reason/code/details and nested rejection mapping must survive. - In-flight dispatch survives rotation: a primed stream finishes on its original channel when the credential fingerprint rotates mid-flight; rotation is adopted at the next connection, never by tearing down work. - Classifier drift canary: the verbatim gRPC hostname-verification string maps to IDENTITY_MISMATCH, guarding the token list against gRPC upgrades. - Document the discovery-plane trust boundary in both READMEs: metadata (incl. the secure flag) is self-advertised over an unauthenticated, forgeable plane; the security filter is a compatibility gate, not a trust boundary; trust rests on the mTLS handshake. Authenticating discovery is future work. Refs wool-labs#249
CI pyright failed: the rotation validator imported cryptography at runtime, but cryptography is only a test dependency — wool does not depend on it at runtime (grpcio bundles BoringSSL). It resolved locally (dev env) but not in CI's runtime-only environment. Validate through stdlib ssl instead of adding a heavyweight crypto runtime dependency for a validation nicety: load the cert/key pair and CA bundle into an SSLContext, which parses the PEM with the same OpenSSL machinery the transport uses and additionally checks the key matches the certificate. ssl.SSLError is an OSError, so the existing last-good fallback already covers it. Also make the thread-safety test rotate atomically (write-temp + os.replace), as a real rotator does — stdlib ssl is stricter than the previous parser and (correctly) rejects a torn read from a non-atomic concurrent write. Refs wool-labs#249
Tidy the credentials family before it ships (all of these symbols are new in this PR — only WorkerCredentials is released): - Pluralize to match the WorkerCredentials anchor and gRPC's own *Credentials types: CredentialProviderLike -> CredentialsProviderLike, CredentialSnapshot -> CredentialsSnapshot, CredentialContext -> CredentialsContext, FileCredentialProvider -> FileCredentialsProvider, and the internal _CredentialKey -> _CredentialsKey. - Privatize the static wrapper: StaticCredentialProvider -> _StaticCredentialsProvider. It is pure coercion glue — what a bare WorkerCredentials becomes — never a type a user reaches for. The public story is now just 'pass a CredentialsProviderLike, or pass WorkerCredentials for the trivial case'; provider_from_files remains the factory for a fixed-material-plus-identity provider. - Keep WorkerCredentials as-is: it is a worker attribute (parallels WorkerMetadata) and fits wool's pervasive Worker* convention; the credential *infrastructure* around it stays role-named Credentials*, like DiscoveryLike / LoadBalancerLike. The negative-identity e2e now builds its client via the public provider_from_files instead of the (now private) static provider. Refs wool-labs#249
…ction logs
A drained dispatch — including one where every worker tried failed the
secure handshake — now raises the plain NoWorkersAvailable. The dedicated
AllWorkersUnauthenticated signal (unreleased) is removed: its name
over-claimed ('Unauthenticated' when the per-worker reasons span
CERT_VERIFY / IDENTITY_MISMATCH / PLAINTEXT_VS_ENCRYPTED), and it only
covered the all-or-nothing case while a mixed handshake+transient drain was
left opaque.
What stays is the part that matters: HandshakeError still skips-without-
eviction (a worker mid-rotation self-heals on a later dispatch), and each
rejection is logged as a structured warning carrying the classified reason
— that log is the diagnosability surface.
Surfacing the per-worker failures programmatically (an 'all workers failed'
aggregate that also covers mixed drains) is deferred to a future generic
mechanism — likely an ExceptionGroup-based NoWorkersAvailable variant that
accumulates each worker's RpcError — rather than a misleading typed name.
- roundrobin: drop the rejections ledger + had_non_handshake_failure flag;
the terminal raise collapses to NoWorkersAvailable.
- Integration tests assert the classified reason via the load balancer's
logs (caplog) instead of the removed exception's rejections mapping.
Refs wool-labs#249
- Add a RoundRobinLoadBalancer pickle round-trip test covering __reduce__. It was an untested gap; removing AllWorkersUnauthenticated shrank the coverage denominator and tipped the Unit gate to 97.99% on Python 3.11 (the floor is 98%). Now 98.04%. - Relocate the handshake recoverability/observability contract to HandshakeError's docstring — the entity that owns the behavior — per the style guide's one-home-per-claim rule; the load balancer keeps a one-line pointer. The old HandshakeError docstring was doubly stale: it still described worker *eviction* and the removed AllWorkersUnauthenticated 'distinct signal'. - NoWorkersAvailable's docstring points to HandshakeError rather than restating the skip-without-evict contract, and drops the future-work note (changelog archaeology). Converted :class: role markup to bare backticks in the edited docstrings. Refs wool-labs#249
Apply the documentation remediations from the style-guide review of the identity-mTLS docs: - Fix the error-classification table cell for HandshakeError, which read "Evict worker, record rejection" -- contradicting the code (skip without eviction) and naming the removed rejection ledger. - Convert the role-prefix cross-references the PR introduced in new docstrings (:class:/:func:/:meth:/:attr:) to bare single backticks. - Give the new i.e./e.g. abbreviations their required trailing comma. - Stop the READMEs from restating the HandshakeError contract field by field; point to the class docstring and keep only cross-cutting narrative. Reduce the comments and the NoWorkersAvailable docstring that echoed that contract to pointers. - Drop forward-looking "future work" framing from the docs. - Fix a stale Discovery reference (the public protocol is DiscoveryLike) and show the full provider_from_files signature (mutual=) in the examples. Documentation and comments only; no behavior change.
…Provider Collapse the credential-provider family to one public type, WorkerCredentialsProvider(fetch, *, identity=None, reloadable=False): a fetch callback supplies the current WorkerCredentials and the provider stamps identity + content fingerprint. All source-specific logic (change detection, return-cached-when-unchanged, validation, last-good fallback, locking) lives in the fetch, keeping the provider a thin pass-through. - Remove internal _StaticCredentialsProvider; the constant case is now WorkerCredentialsProvider(lambda: creds, reloadable=False) — resolved eagerly, callback dropped on pickle so a lambda need not be picklable. - Replace public FileCredentialsProvider with the internal _FileCredentialsReader fetch functor (stat-gate + validate + last-good fallback + lock), wired behind WorkerCredentials.provider_from_files. - Export WorkerCredentialsProvider; drop FileCredentialsProvider. A reloadable fetch is re-invoked in the worker subprocess and per handshake, so it must be importable and thread-safe; the file functor is both. API is unreleased (wool-labs#249), so no compatibility shim. Consumers are unchanged (resolve()/reloadable/_coerce_provider).
Narrow the credential-provider surface to a single thin mechanism, WorkerCredentialsProvider(fetch, *, identity=None, reloadable=False), where the fetch owns any reload strategy. Wool ships the seam, not the policy — easy simple things (static creds), possible hard things (custom reloading). - Add WorkerCredentials.as_provider(*, identity=None) for the fixed case; _coerce_provider uses it. Drop WorkerCredentials.provider_from_files. - Remove the internal _FileCredentialsReader (stat-gate / validate / last-good fallback) and _validate_material. File rotation is now a user fetch, e.g. WorkerCredentialsProvider(functools.partial( WorkerCredentials.from_files, ca, key, cert), reloadable=True), with the caller owning caching / fallback / TTL. - Remove the CredentialsProviderLike protocol: with one concrete provider and the fetch callback as the extension seam it is redundant. Consumers retype to WorkerCredentialsProvider; custom providers stay possible via duck-typed resolve()/reloadable. - Widen WorkerCredentials.from_files path params to str | os.PathLike[str]. Public surface shrinks to WorkerCredentials, WorkerCredentialsProvider, CredentialsSnapshot. API unreleased (wool-labs#249); no compatibility shim.
…shot Align the snapshot type with the Worker* credential family (WorkerCredentials, WorkerCredentialsProvider). Mechanical rename, no behavior change. API unreleased (wool-labs#249).
A snapshot was WorkerCredentials + identity + a precomputed fingerprint, but WorkerCredentials is already a frozen, hashable, value-equal dataclass — so the fingerprint is redundant with its own hashing and the identity is just another field. Collapse the type: - WorkerCredentials gains an identity field (normalized in __post_init__) and a fingerprint property (the former snapshot digest, kept for diagnostics and logging). - WorkerCredentialsProvider.resolve() now returns WorkerCredentials, stamping the provider's identity (when configured) onto the fetched material; otherwise the credentials keep their own. - The client channel pool keys on the WorkerCredentials value directly; _CredentialsKey and _compute_fingerprint's identity arg are removed. - Drop WorkerCredentialsSnapshot from the public API. Public surface: WorkerCredentials, WorkerCredentialsProvider. No behavior change to channel pooling or rotation. API unreleased (wool-labs#249).
2b796e9 to
e3443b7
Compare
Use "i.e.," for the subject-alternative-name aside and wrap to the line limit. Docstring-only.
19c5d0b to
70e50a2
Compare
70e50a2 to
2fa1a09
Compare
…tory Align the callback parameter with the codebase's Factory convention and the stdlib default_factory (a zero-argument callable that produces a value) — the right mental model here — rather than fetch. Avoids the property-accessor connotation of getter. Sync callable, no behavior change; gRPC's per-handshake 'fetcher' terminology is left intact.
2fa1a09 to
b7afa60
Compare
The channel pool keys on the WorkerCredentials value directly — a frozen dataclass is hashable and value-equal — so the content fingerprint lost its only load-bearing role. Remove the public `fingerprint` property and the `_compute_fingerprint` helper, retarget the tests that used it as a value-equality proxy to assert WorkerCredentials equality directly, and fix the docstrings and README that still described fingerprint-keyed pooling.
State the WorkerCredentialsProvider resolution hot path: a reloadable factory is consulted on every client dispatch and on every server-side TLS handshake, so it must be importable, picklable, and safe to call concurrently. Convert the _apply and __getstate__ inline comments into method docstrings and order those members so the pickle dunder precedes the public methods. Mark WorkerCredentials with the public comment, and drop a stray reference to the removed fingerprint left in the resolve docstring.
Inline the trivial _coerce_provider helper at its three call sites (LocalWorker, WorkerProcess, WorkerProxy) and drop the cross-module private function, which the IDE flagged as unused. Replace the CredentialsContext class with two module-level functions: credentials_scope (a context manager that binds the ambient credentials) and current_credentials (the reader). The contextmanager form drops the manual __enter__/__exit__/_token machinery and its exit-without-enter guard, so the obsolete guard test is removed.
WorkerProcess binds its self-dispatch socket under /tmp when XDG_RUNTIME_DIR points to a path that is not a directory. Exercise that fallback so the branch is covered.
Name the channel pool key alias for what it keys, a channel, rather than the pool it feeds. Cast the untyped factory key to _ChannelKey so the unpacked credentials and options are statically typed, which in turn requires annotating grpc_options to admit the string-valued ssl_target_name_override entry. Fold the identity-verification rationale from an inline comment into the _channel_factory docstring.
credentials_scope and current_credentials carried a "Not part of the public API" line that their absence from the package __all__ already settles, leaving a second place to keep that fact in sync.
Drop the identity, certificate, and protocol-mismatch reason-token tables and the CERT_VERIFY and IDENTITY_MISMATCH reasons they fed. Those phrase tables drifted across gRPC and BoringSSL versions while only refining advisory diagnostics, never the load-bearing HandshakeError type or the worker-health routing. Keep the broad gate that promotes an ambiguous UNAVAILABLE to a handshake failure, since it still separates a TLS failure from a genuinely down worker. Take the reason structurally instead: UNAUTHENTICATED is PEER_UNAUTHENTICATED, a gated secure failure is TLS_HANDSHAKE, and a gated insecure one is PLAINTEXT_VS_ENCRYPTED. Wrong-CA, wrong-identity, and expired-certificate failures now all surface as TLS_HANDSHAKE; the finer cause moves to gRPC's own tracing. Update the unit and integration tests to assert the collapsed reasons.
The reason enum and the reason attribute were diagnostic-only: nothing branched on a member, and the sole consumer was a warning log. The load balancer now logs the gRPC code and details directly, and the classifier keeps the broad UNAVAILABLE gate as a boolean rather than producing a flavor — UNAUTHENTICATED and a TLS-evidenced UNAVAILABLE both surface the same HandshakeError.
The value-or-provider normalization was copied verbatim into three constructors; fold it into one classmethod, coerce, and call it from LocalWorker, WorkerProcess, WorkerProxy, and WorkerConnection. Also move the module-private _normalize_identity helper below both classes so the classes are contiguous, and have resolve branch on the reloadable flag rather than infer it from the cached sentinel.
Rename WorkerConnection's public constructor parameter from provider to credentials and widen it to accept a WorkerCredentials value or a WorkerCredentialsProvider, coercing a bare value internally. This makes the credential vocabulary uniform with WorkerPool, LocalWorker, and WorkerProxy, which already accept the union, and keeps the easy case, a bare value, reachable at this entry point too.
Address round-1 review documentation findings: convert two carried-over Sphinx role prefixes to bare cross-references in connection.py, drop a redundant etc. and de-duplicate the picklability rationale in the WorkerCredentialsProvider docstrings, and correct a test docstring that described HandshakeError as evicted by the load balancer when it is skipped without eviction.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Close the three gaps that block Wool's mutual TLS on dynamic-address orchestrators (Kubernetes, ECS/Fargate), while keeping the existing certificate-authority trust model and
WorkerCredentialsexactly as they are. The change is additive and opt-in: a bareWorkerCredentialsis coerced into an identity-free static provider everywhere, so static-address mTLS, one-way TLS, and plaintext deployments behave byte-for-byte as before, and no protobuf change is required.The approach introduces a
CredentialsProviderLikeseam — the unit the runtime consults for current credential material — and threads it through the Worker Subsystem (client and server) and the Load Balancing layer:ssl_target_name_override, so a worker reached at a dynamically assigned address is verified against a stable logical identity (its certificate SAN) rather than the dialed address. Full chain and SAN verification are preserved — verification is strengthened, not relaxed.grpc.dynamic_ssl_server_credentials.HandshakeError(with areasonenum). A handshake-failing worker is skipped without eviction — so a worker mid-rotation self-heals on a later dispatch — and each rejection is emitted as a structured log carrying the classified reason; a drained dispatch raises the plainNoWorkersAvailable. A typed aggregate of the per-worker failures is deferred to a follow-up (see Review remediation).One trade-off worth noting: rotation replaces cert/key/CA bytes but not the mutual-TLS mode, which
grpc.dynamic_ssl_server_credentialsfixes at construction. A second: the handshake classifier keys control flow on the gRPC status code plus the presence of any TLS/cert token in the error text — a deliberately broad gate that degrades a tokenless TLS failure to plain transient-skip rather than risking a false eviction. A handshake failure is skipped without eviction, so a worker mid-rotation self-heals on a later dispatch rather than being removed until re-discovery.Closes #249
Proposed changes
Credential provider abstraction (
worker/auth.py, Worker Subsystem)Add
CredentialsProviderLike(aresolve() -> CredentialsSnapshotprotocol with areloadableflag) and the fingerprintedCredentialsSnapshot, with one public implementation —FileCredentialsProvider(re-reads PEM files on change, validates before caching, returns the last-good snapshot on a transient or malformed read, and is lock-guarded for the off-loop server fetcher) — plus an internal_StaticCredentialsProviderthat a bareWorkerCredentialscoerces into.WorkerCredentials.provider_from_files(..., identity=, reload=)is the ergonomic entry point;_coerce_provideris the single seam that normalizes a bareWorkerCredentialsinto a provider. A blank identity normalizes toNone(address-based path).CredentialsContextis widened to carry either form. The public surface is deliberately small: pass aCredentialsProviderLike(typically aFileCredentialsProvider, or your own custom provider), or aWorkerCredentialsfor the trivial static case.Identity verification and fingerprint-keyed pool (
worker/connection.py, Worker Subsystem)Re-key the client channel pool on
(target, credential fingerprint + identity, options)via_CredentialKeyso unchanged material reuses a pooled channel and rotated material yields a fresh one._channel_factoryaddsssl_target_name_overrideonly when an identity is configured;WorkerConnectionholds a provider and resolves it per dispatch. Self-dispatch over the loopback UDS keeps its insecure key.Diagnosable handshake failures (
worker/connection.py+loadbalancer/, Worker Subsystem + Load Balancing)Add
HandshakeError(RpcError)with a typedReason, classified structurally from the gRPC status code and error text (including thessl_target_name_overridehostname-verification mismatch →IDENTITY_MISMATCHand CA-verification failures →CERT_VERIFY).RoundRobinLoadBalancerskips a handshake-failing worker without evicting it (advancing to the next candidate and logging a structured warning that carries the classified reason), so a rotated worker recovers on a later dispatch. A drained dispatch raises the plainNoWorkersAvailable.HandshakeError.detailsis redacted to a fixed message rather than gRPC's verbose debug blob.Provider wiring, client and server (
worker/proxy.py,pool.py,local.py,process.py,base.py)WorkerProxy,WorkerPool,LocalWorker,WorkerProcess, and the factory protocols accept either aWorkerCredentialsor a provider. The proxy resolves its provider from the argument or the ambient context and forwards it to each connection;WorkerProcessserves rotating credentials via a per-connection fetcher for a reloadable provider and the unchanged static path otherwise;LocalWorker's stop RPC applies the identity override. The insecure self-dispatch UDS socket is confined to a per-worker0700directory under a short base ($XDG_RUNTIME_DIRor/tmp, within theAF_UNIXpath limit).Public API and docs (
__init__.py, READMEs)Export
CredentialsProviderLike,CredentialsSnapshot,FileCredentialsProvider, andHandshakeError(the static wrapper stays internal as_StaticCredentialsProvider). Document identity verification, rotation, the diagnosable handshake errors, the self-dispatch socket, and the discovery-plane trust boundary in the top-level and worker READMEs.Test cases
TestCredentialsSnapshotTestStaticCredentialsProviderresolve()is called repeatedlyreloadableFalse, and survives a pickle roundtripTestFileCredentialsProviderTestWorkerCredentialsreloadandidentity(incl. blank)provider_from_filesis calledNoneTestHandshakeErrorHandshakeErroris constructed and pickledRpcErrorexposing the reason, and survives a serialization roundtripTestWorkerConnectionUNAUTHENTICATEDorUNAVAILABLEwith TLS evidence (incl. a verbatim hostname-verification failure)HandshakeErrorwith the matching reason (IDENTITY_MISMATCHfor the hostname case); a plainUNAVAILABLEstays transient; details are redactedTestWorkerConnectionssl_target_name_overrideonly when an identity is setTestWorkerConnectionTestWorkerConnectionTestWorkerProxyTestRoundRobinLoadBalancerNoWorkersAvailablewith the workers left in the pool (skipped, not evicted)TestRoundRobinLoadBalancerNoWorkersAvailableTestRoundRobinLoadBalancerTestWorkerProcess0700directory within the path limitTestLocalWorkerstop()is calledtest_identity_mtls(integration)NoWorkersAvailablewith the classified reason (CERT_VERIFY/IDENTITY_MISMATCH) logged; and a rotated CA is adopted without restarting the workerReview remediation (post-review hardening)
A 20-agent application-security panel reviewed this PR (0 blocking; security core verified sound). The findings are remediated in the commits from
fix(loadbalancer): recover handshake-failed workers …throughtest+docs(mtls): …:NoWorkersAvailable, with each rejection logged.HandshakeError.details. The dedicatedAllWorkersUnauthenticatedsignal was removed in favour of the plainNoWorkersAvailable(its name over-claimed, and it only covered the all-or-nothing case); a typed aggregate of the per-worker failures — likely anExceptionGroup-basedNoWorkersAvailablevariant — is deferred to a follow-up issue.st_ino+st_ctime_ns), and athreading.Lockaround the off-loop resolve.reloadableis now a protocol member; a blank identity normalizes toNone.0700directory under a short base directory (the canonical fix for theAF_UNIXpath-length limit;man 7 unix), not a world-readable temp path.ssl_target_name_overridemismatch now classifies asIDENTITY_MISMATCH(the realAioRpcErrorsays "Hostname Verification Check failed", not the C-core log's "no match found for server name"); a drift canary pins the gRPC strings.WorkerMetadata(incl. thesecureflag) is self-advertised over an unauthenticated, forgeable plane, so the security filter is a compatibility gate, not a trust boundary; trust rests on the mTLS handshake.New e2e coverage: real rotation-without-restart, negative-identity rejection, wire-safety pickle round-trips, in-flight-survives-rotation, and the classifier drift canary. Full suite: 1368 passed, 98.10% coverage.