✨ feat(upstream): remote Docker TCP+TLS upstreams with active/passive failover #104

Merged: scttbnsn merged 2 commits into dev/v1.4 from feat/multi-host-upstreams on Jun 16, 2026
@scttbnsn

What

Adds remote Docker daemon upstreams over TCP+mTLS with active/passive failover. This matches CetusGuard's remote-backend capability and goes further than comparable proxies by adding health-checked failover. The local unix socket stays the default; nothing changes for existing single-socket deployments.

A single shared *upstream.Resolver is the one dial seam. It implements http.RoundTripper (reverse proxy + HTTP side channels) and DialContext (raw-conn hijack), and every production path routes through the same instance so failover is coherent across the proxy, exec/attach hijack, and all inspect side channels (ownership, visibility, client-ACL, exec-start inspector). TLS is negotiated inside the dialer, so it works transparently even though ReverseProxy rewrites the scheme to http.

Scope decision

Remote + failover, not cross-daemon fan-out. All endpoints must address the same logical daemon/swarm — container IDs, exec sessions, and owner labels are daemon-local, so fan-out would break hijack sessions and owner-isolation. This is active/passive redundancy (swarm VIP + managers, an HA pair), which is the correct product category.

Highlights

  • Failover: first healthy endpoint wins; a real reachability failure demotes it (no in-flight retry — Docker writes aren't idempotent). Request-scoped errors (client cancel, request_timeout) do not demote, so a long docker build can't flap the primary.
  • Backward compat: legacy upstream.socket still works; upstream.endpoints takes precedence; DOCKER_HOST/DOCKER_TLS_VERIFY/DOCKER_CERT_PATH auto-detected when no endpoints are set (including the system-roots TLS case).
  • Reload safety: endpoints + failover are reload-immutable (bound to the long-lived resolver built once at startup); request_timeout stays mutable. The resolver is reused across reloads — no per-reload rebuild, no goroutine leak.
  • Insecure opt-ins: insecure_allow_plain_tcp and insecure_skip_tls_verify, both explicit endpoint-level acknowledgments.

Verification

Ran a 4-dimension multi-agent verification workflow (wiring coherence, resolver logic, config/reload, docs accuracy) with each finding adversarially verified. Fixed all 13 confirmed findings, including: demote-on-cancel flap (high), DOCKER_TLS_VERIFY system-roots gap (high), the demote goroutine-storm guard + lifecycle binding (medium), serialized setHealth, the validate header showing endpoints, health_interval: "0s" rejection, and missing immutability/env tests.

Tests / checks

All green locally and in the pre-push gate: go build, full go test, -race on touched packages, golangci-lint (0 issues), govulncheck (clean), go-fuzz, biome, knip, and the 84 TS tests. New Go tests cover resolver pool tunings, CheckReachable, system-roots TLS, request-scoped no-demote, and endpoints/failover immutability + env vars.

Docs / examples

  • New docs/content/docs/multi-host.mdx — Remote Upstreams & Failover guide
  • configuration.mdx + env-var reference, README comparison/roadmap updates
  • Runnable examples/compose/multi-host/ stack

scttbnsn added 2 commits June 15, 2026 22:35
Adds the internal/upstream package (Endpoint, EndpointSpec, Resolver) — the
single dial seam that replaces the hardcoded single-unix-socket assumption — and
the config schema for it (upstream.endpoints[], upstream.failover, per-endpoint
TLS). Consumers are not yet wired through it; that lands next.

- ✨ feat(upstream): Endpoint + client-TLS-in-dialer, ordered failover Resolver, DOCKER_* env spec
- ✨ feat(config): upstream.endpoints/failover schema + file-free ValidateSpec, register request_timeout default
- 🔧 config(reload): upstream.endpoints/failover are reload-immutable; request_timeout stays mutable

Builds on the upstream foundation (6ac99df) by threading the shared
*upstream.Resolver through the whole request stack so failover is coherent
across the proxy, hijack, and side-channel inspects, then ships the docs,
website, and examples for the feature.

- 🔄 refactor(proxy): route reverse proxy, exec/attach hijack, ownership,
  visibility, client-ACL, and the filter exec-start inspector through the one
  shared resolver via *WithRoundTripper/*WithDialer constructors; legacy
  single-socket constructors stay as backward-compat wrappers
- 🔄 refactor(serve): build the resolver once in newServeRuntime, reuse it
  across hot reloads, start its health loop in the serve lifecycle, and show
  the resolved endpoint label in the banner/startup log
- ✨ feat(upstream): CheckReachable boots a failover set when ≥1 endpoint
  answers and fails fast when all are dark; legacy single socket keeps the
  precise not-found/permission fail-fast check
- 🐛 fix(upstream): don't demote the active endpoint on request-scoped errors
  (client cancel / request_timeout) — only real reachability failures fail over
- 🐛 fix(upstream): DOCKER_TLS_VERIFY with no DOCKER_CERT_PATH now builds a
  valid system-roots TLS endpoint instead of being rejected as plain TCP
- 🐛 fix(upstream): gate demote's async re-probe to one goroutine per endpoint
  and bind it to the resolver lifetime; serialize setHealth's swap-and-notify
- 🐛 fix(config): reject upstream.failover.health_interval "0s" (ambiguous with
  the default) and point operators at negative-to-disable / omit-for-default
- 🐛 fix(cli): validate header shows configured endpoints, not the unused socket
- 📝 docs: new Remote Upstreams & Failover guide, configuration + env-var
  reference, README comparison/roadmap, and a runnable compose example
- 🧪 test: cover resolver pool tunings, CheckReachable, system-roots TLS,
  request-scoped no-demote, endpoints/failover immutability, and failover env vars
@vercel

vercel Bot commented Jun 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| sockguard-website | Ready | Preview, Comment | Jun 16, 2026 3:42am |

scttbnsn merged commit 7587340 into dev/v1.4 on Jun 16, 2026
37 checks passed
scttbnsn deleted the feat/multi-host-upstreams branch on June 16, 2026 03:47