✨ feat(upstream): remote Docker TCP+TLS upstreams with active/passive failover #104
Merged
Adds the internal/upstream package (Endpoint, EndpointSpec, Resolver), the single dial seam that replaces the hardcoded single-unix-socket assumption, and the config schema for it (upstream.endpoints[], upstream.failover, per-endpoint TLS). Consumers are not yet wired through it; that lands next.
- ✨ feat(upstream): Endpoint + client-TLS-in-dialer, ordered failover Resolver, DOCKER_* env spec
- ✨ feat(config): upstream.endpoints/failover schema + file-free ValidateSpec, register request_timeout default
- 🔧 config(reload): upstream.endpoints/failover are reload-immutable; request_timeout stays mutable
Builds on the upstream foundation (6ac99df) by threading the shared *upstream.Resolver through the whole request stack so failover is coherent across the proxy, hijack, and side-channel inspects, then ships the docs, website, and examples for the feature.
- 🔄 refactor(proxy): route reverse proxy, exec/attach hijack, ownership, visibility, client-ACL, and the filter exec-start inspector through the one shared resolver via *WithRoundTripper/*WithDialer constructors; legacy single-socket constructors stay as backward-compat wrappers
- 🔄 refactor(serve): build the resolver once in newServeRuntime, reuse it across hot reloads, start its health loop in the serve lifecycle, and show the resolved endpoint label in the banner/startup log
- ✨ feat(upstream): CheckReachable boots a failover set when ≥1 endpoint answers and fails fast when all are dark; legacy single socket keeps the precise not-found/permission fail-fast check
- 🐛 fix(upstream): don't demote the active endpoint on request-scoped errors (client cancel / request_timeout); only real reachability failures fail over
- 🐛 fix(upstream): DOCKER_TLS_VERIFY with no DOCKER_CERT_PATH now builds a valid system-roots TLS endpoint instead of being rejected as plain TCP
- 🐛 fix(upstream): gate demote's async re-probe to one goroutine per endpoint and bind it to the resolver lifetime; serialize setHealth's swap-and-notify
- 🐛 fix(config): reject upstream.failover.health_interval "0s" (ambiguous with the default) and point operators at negative-to-disable / omit-for-default
- 🐛 fix(cli): validate header shows configured endpoints, not the unused socket
- 📝 docs: new Remote Upstreams & Failover guide, configuration + env-var reference, README comparison/roadmap, and a runnable compose example
- 🧪 test: cover resolver pool tunings, CheckReachable, system-roots TLS, request-scoped no-demote, endpoints/failover immutability, and failover env vars
What
Adds remote Docker daemon upstreams over TCP+mTLS with active/passive failover, matching CetusGuard's remote-backend capability and going beyond comparable tools by adding health-checked failover. The local unix socket stays the default; nothing changes for existing single-socket deployments.
A single shared `*upstream.Resolver` is the one dial seam. It implements `http.RoundTripper` (reverse proxy + HTTP side channels) and `DialContext` (raw-conn hijack), and every production path routes through the same instance so failover is coherent across the proxy, exec/attach hijack, and all inspect side channels (ownership, visibility, client-ACL, exec-start inspector). TLS is negotiated inside the dialer, so it works transparently even though `ReverseProxy` rewrites the scheme to `http`.
Scope decision
Remote + failover, not cross-daemon fan-out. All endpoints must address the same logical daemon/swarm — container IDs, exec sessions, and owner labels are daemon-local, so fan-out would break hijack sessions and owner-isolation. This is active/passive redundancy (swarm VIP + managers, an HA pair), which is the correct product category.
Highlights
- Request-scoped errors (client cancel / `request_timeout`) do not demote, so a long `docker build` can't flap the primary.
- `upstream.socket` still works; `upstream.endpoints` takes precedence; `DOCKER_HOST`/`DOCKER_TLS_VERIFY`/`DOCKER_CERT_PATH` are auto-detected when no endpoints are set (including the system-roots TLS case).
- `endpoints` + `failover` are reload-immutable (bound to the long-lived resolver built once at startup); `request_timeout` stays mutable. The resolver is reused across reloads: no per-reload rebuild, no goroutine leak.
- `insecure_allow_plain_tcp` and `insecure_skip_tls_verify`, both explicit endpoint-level acknowledgments.
Verification
Ran a 4-dimension multi-agent verification workflow (wiring coherence, resolver logic, config/reload, docs accuracy) with each finding adversarially verified. Fixed all 13 confirmed findings, including: demote-on-cancel flap (high), `DOCKER_TLS_VERIFY` system-roots gap (high), the demote goroutine-storm guard + lifecycle binding (medium), serialized `setHealth`, the `validate` header showing endpoints, `health_interval: "0s"` rejection, and missing immutability/env tests.
Tests / checks
All green locally and in the pre-push gate: `go build`, full `go test`, `-race` on touched packages, `golangci-lint` (0 issues), `govulncheck` (clean), `go-fuzz`, biome, knip, and the 84 TS tests. New Go tests cover resolver pool tunings, `CheckReachable`, system-roots TLS, request-scoped no-demote, and endpoints/failover immutability + env vars.
Docs / examples
- `docs/content/docs/multi-host.mdx`: Remote Upstreams & Failover guide
- `configuration.mdx` + env-var reference, README comparison/roadmap updates
- `examples/compose/multi-host/` stack