l4proxy: add dynamic upstreams via DNS (SRV and A/AAAA) #429

Open: tannevaled wants to merge 2 commits into mholt:master from tannevaled:feat/dynamic-srv-upstreams

@tannevaled commented Jun 2, 2026

What

Adds dynamic upstreams to the layer4 proxy so the backend set can be discovered at runtime instead of being restated in config, with two DNS sources:

  • layer4.proxy.upstreams.srv — resolves SRV records (service/proto/name).
  • layer4.proxy.upstreams.a — resolves A/AAAA records for a name, using a configured port (fits clusters where every member shares a port, e.g. a Postgres cluster on 5432 behind one name).
proxy {
	dynamic srv { service postgres; proto tcp; name db.internal; refresh 30s }
}
proxy {
	dynamic a { name db.internal; port 5432; refresh 30s }
}

Caddyfile syntax: dynamic <source> { … }. Results are cached per name and refreshed periodically; each source accepts the refresh, grace_period, and dial_network options. When dynamic upstreams are configured, the static upstream list may be empty. Discovered peers come from the shared peer pool, so passive health-check state and connection counts persist across refreshes.
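
To illustrate the per-name caching described above, here is a minimal Go sketch, not the PR's actual code. Every name in it (lookupSRV, srvCache, demo) is hypothetical. It also assumes that grace_period means "keep serving stale results for this long past the refresh interval when a lookup fails", which may differ from the real semantics. The lookup function and the clock are injected so the behavior can be exercised without a network.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strconv"
	"strings"
	"sync"
	"time"
)

// lookupSRV is an injectable stand-in for a resolver's SRV lookup (hypothetical name).
type lookupSRV func(service, proto, name string) ([]*net.SRV, error)

type entry struct {
	addrs   []string
	fetched time.Time
}

// srvCache caches resolved host:port addresses per service/proto/name key.
type srvCache struct {
	mu             sync.Mutex
	lookup         lookupSRV
	refresh, grace time.Duration
	now            func() time.Time // injectable clock
	entries        map[string]entry
}

func (c *srvCache) get(service, proto, name string) ([]string, error) {
	c.mu.Lock() // held across the lookup for simplicity
	defer c.mu.Unlock()
	key := service + "/" + proto + "/" + name
	e, cached := c.entries[key]
	age := c.now().Sub(e.fetched)
	if cached && age < c.refresh {
		return e.addrs, nil // still fresh: no DNS query
	}
	recs, err := c.lookup(service, proto, name)
	if err != nil {
		if cached && age < c.refresh+c.grace {
			return e.addrs, nil // lookup failed: serve stale results within the grace window
		}
		return nil, err
	}
	addrs := make([]string, 0, len(recs))
	for _, r := range recs {
		host := strings.TrimSuffix(r.Target, ".") // drop the trailing root dot
		addrs = append(addrs, net.JoinHostPort(host, strconv.Itoa(int(r.Port))))
	}
	c.entries[key] = entry{addrs: addrs, fetched: c.now()}
	return addrs, nil
}

// demo drives the cache with a stubbed lookup and a fake clock.
func demo() (calls int, first, stale string, finalErr error) {
	clock := time.Unix(0, 0)
	fail := false
	c := &srvCache{
		refresh: 30 * time.Second, grace: 10 * time.Second,
		now:     func() time.Time { return clock },
		entries: map[string]entry{},
	}
	c.lookup = func(_, _, _ string) ([]*net.SRV, error) {
		calls++
		if fail {
			return nil, errors.New("stub: lookup failed")
		}
		return []*net.SRV{{Target: "db1.internal.", Port: 5432}}, nil
	}
	a, _ := c.get("postgres", "tcp", "db.internal") // first call: one lookup
	c.get("postgres", "tcp", "db.internal")         // repeated call: served from cache
	first = a[0]
	fail = true
	clock = clock.Add(35 * time.Second) // past refresh, inside grace
	b, _ := c.get("postgres", "tcp", "db.internal")
	stale = b[0]
	clock = clock.Add(10 * time.Second) // past refresh + grace
	_, finalErr = c.get("postgres", "tcp", "db.internal")
	return
}

func main() {
	calls, first, stale, err := demo()
	fmt.Println(calls, first, stale, err)
}
```

Holding the mutex across the lookup keeps the sketch short; a real implementation would likely avoid blocking other names on one slow query.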

UpstreamSource.GetUpstreams takes the connection's *caddy.Replacer rather than the connection itself, keeping discovery decoupled from a live connection (and pollable by other callers).
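
A rough sketch of what that decoupling could look like follows. It is an illustration only: Replacer is a local stand-in for *caddy.Replacer (only its ReplaceAll method is mirrored), and the Upstream type, the GetUpstreams return type, aSource, and mapReplacer are all assumptions rather than the PR's real definitions.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

// Replacer is a local stand-in for *caddy.Replacer so the sketch has no
// external dependencies; it mirrors only the ReplaceAll method.
type Replacer interface {
	ReplaceAll(input, empty string) string
}

// Upstream and the GetUpstreams return type are assumptions for illustration.
type Upstream struct{ Dial string }

type UpstreamSource interface {
	GetUpstreams(repl Replacer) ([]Upstream, error)
}

// mapReplacer expands {key} placeholders from a map (hypothetical helper).
type mapReplacer map[string]string

func (m mapReplacer) ReplaceAll(input, empty string) string {
	for k, v := range m {
		input = strings.ReplaceAll(input, "{"+k+"}", v)
	}
	return input
}

// aSource resolves a name to IPs and pairs each IP with a fixed port.
type aSource struct {
	Name, Port string
	lookupIP   func(host string) ([]net.IP, error) // injectable for tests
}

func (s aSource) GetUpstreams(repl Replacer) ([]Upstream, error) {
	name := repl.ReplaceAll(s.Name, "") // expand placeholders before resolving
	ips, err := s.lookupIP(name)
	if err != nil {
		return nil, err
	}
	ups := make([]Upstream, 0, len(ips))
	for _, ip := range ips {
		ups = append(ups, Upstream{Dial: net.JoinHostPort(ip.String(), s.Port)})
	}
	return ups, nil
}

// demo polls the source without any connection in scope.
func demo() ([]Upstream, error) {
	var src UpstreamSource = aSource{
		Name: "{env.DB_NAME}", Port: "5432",
		lookupIP: func(host string) ([]net.IP, error) {
			if host != "db.internal" {
				return nil, errors.New("stub: unknown host " + host)
			}
			return []net.IP{net.ParseIP("10.0.0.1"), net.ParseIP("fd00::1")}, nil
		},
	}
	return src.GetUpstreams(mapReplacer{"env.DB_NAME": "db.internal"})
}

func main() {
	ups, err := demo()
	fmt.Println(ups, err)
}
```

Because the source only needs a replacer, a background poller or a health checker could call it the same way a connection handler does.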

Why

So the L4 config doesn't have to hard-code endpoints DNS already publishes — the common service-discovery case (Consul DNS, Kubernetes headless services, etc.).

Scope / limitations

  • Active health checks still run only on statically configured upstreams, same as the HTTP reverse_proxy's dynamic upstreams. Passive health checks and connection counting apply to discovered upstreams.
  • Mirrors caddyhttp/reverseproxy's dynamic srv/a design for consistency.

Tests

upstreams_test.go covers SRV and A discovery (record → upstream), caching (one lookup for repeated calls), lookup-error handling, SRV expandedAddr, and Caddyfile parsing for both sources (happy path, missing or unknown source, and bad option). DNS is stubbed via injectable lookups, so no network is needed. go test ./modules/l4proxy/ passes, and gofmt, go vet, and golangci-lint are clean.

Add an UpstreamSource mechanism so the backend set can be discovered at
runtime instead of being listed statically, with two DNS sources:

  - layer4.proxy.upstreams.srv: resolves SRV records (service/proto/name).
  - layer4.proxy.upstreams.a:   resolves A/AAAA records for a name, using a
    configured port (fits clusters where all members share a port).

Caddyfile: dynamic <source> { ... }. Results are cached per name and
refreshed (refresh / grace_period / dial_network). When dynamic upstreams
are configured the static list may be empty. Discovered peers are drawn
from the shared peer pool, so passive health and connection counts persist
across refreshes.

UpstreamSource.GetUpstreams takes the connection's *caddy.Replacer rather
than the connection itself, keeping discovery decoupled from a live
connection.

Mirrors caddyhttp/reverseproxy's dynamic srv/a sources. Note: active health
checks still run only on statically-configured upstreams (same limitation
as the HTTP reverse_proxy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@francislavoie (Collaborator)

For both this and some of your other PRs, there seems to be a lot of duplication/copying of existing code from the HTTP proxy. I think there should be some consideration on how we can deduplicate to reduce the maintenance burden.

@tannevaled changed the title from "l4proxy: add dynamic upstreams via DNS SRV discovery" to "l4proxy: add dynamic upstreams via DNS (SRV and A/AAAA)" on Jun 2, 2026

@tannevaled (Author)

Thanks, that's a fair point and worth getting right.

You're correct that the duplication is concentrated in two places: the dynamic DNS upstreams here (#429), which mirror caddyhttp/reverseproxy's dynamic srv/a, and the HTTP active-health-check surface in #423/#426. The reason I copied rather than reused is that those pieces in core are coupled to *http.Request / reverseproxy.Upstream and aren't exported, so there was nothing transport-neutral to import from a separate repo.

If you're open to it, I think the clean fix is to factor the transport-neutral parts into reusable helpers in core, then make these caddy-l4 PRs thin adapters:

  1. DNS discovery — a helper that resolves SRV/A (with the existing caching/refresh/grace logic) and returns neutral targets (host:port + weight), independent of the Upstream type. reverseproxy keeps building its *Upstream, and caddy-l4 builds its own from the same targets.
  2. HTTP health probe — a helper that performs the GET and applies the status/body match given an address, so both proxies share the probe logic.
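
As a concrete illustration of item 2, a shared probe helper might look roughly like the sketch below. This is not an existing Caddy API: httpProbe, its fields, and check are hypothetical names, and the demo uses a local httptest server rather than a real upstream.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"regexp"
	"time"
)

// httpProbe describes one health probe; all names here are hypothetical.
type httpProbe struct {
	Path         string
	ExpectStatus int
	ExpectBody   *regexp.Regexp // nil means the body is not checked
	Timeout      time.Duration
}

// check performs a GET against addr (host:port) and applies the status and
// body match. It knows nothing about either proxy's upstream type.
func (p httpProbe) check(addr string) error {
	client := &http.Client{Timeout: p.Timeout}
	resp, err := client.Get("http://" + addr + p.Path)
	if err != nil {
		return fmt.Errorf("probe request: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != p.ExpectStatus {
		return fmt.Errorf("status %d, want %d", resp.StatusCode, p.ExpectStatus)
	}
	if p.ExpectBody != nil {
		body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20)) // cap the read at 1 MiB
		if err != nil {
			return fmt.Errorf("read body: %w", err)
		}
		if !p.ExpectBody.Match(body) {
			return fmt.Errorf("body does not match %q", p.ExpectBody.String())
		}
	}
	return nil
}

// demo probes a throwaway local server on a healthy and an unhealthy path.
func demo() (healthy, unhealthy error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/healthz" {
			fmt.Fprint(w, "ok")
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	}))
	defer srv.Close()
	addr := srv.Listener.Addr().String()
	p := httpProbe{
		Path: "/healthz", ExpectStatus: http.StatusOK,
		ExpectBody: regexp.MustCompile(`^ok$`), Timeout: 2 * time.Second,
	}
	healthy = p.check(addr)
	p.Path = "/down"
	unhealthy = p.check(addr)
	return
}

func main() {
	healthy, unhealthy := demo()
	fmt.Println("healthy:", healthy)
	fmt.Println("unhealthy:", unhealthy)
}
```

Taking a plain address keeps the helper independent of *http.Request and reverseproxy.Upstream, which is the coupling that currently blocks reuse.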

I'm happy to open a companion PR against caddyserver/caddy exporting those, and then slim #429 / #423 / #426 down to adapters once it lands. Would you accept that direction, or would you rather keep dynamic upstreams out of caddy-l4 for now?

For context on the rest: the other PRs in the series (#425 close-on-unhealthy, #427 rise/fall, #428 weighted LB, #430 active checks on dynamic upstreams) are layer4-native rather than copies, and the observability/timeout additions I'm about to push are too — but if you spot specific spots there you'd like factored out, point me at them and I'll fold them into the same de-duplication pass.

@tannevaled (Author)

To make the de-duplication concrete rather than hypothetical, I opened a draft RFC on core: caddyserver/caddy#7790.

It extracts the SRV resolution + caching into a transport-neutral dynamicupstreams package and refactors reverse_proxy's SRVUpstreams to use it (behavior unchanged, reverseproxy/upstreams.go −72 lines). If that direction looks right, I'll extend it to A/AAAA and then rebase this PR (#429) to consume the shared package and drop the copy — same for the HTTP health-check logic in #423/#426.
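
For readers who want a picture of the "neutral targets, thin adapters" idea, here is a small sketch under stated assumptions. It does not reproduce the RFC's code: Target, targetsFromSRV, httpUpstream, l4Upstream, and the two adapter functions are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// Target is a transport-neutral discovery result (hypothetical type).
type Target struct {
	Addr   string // host:port
	Weight uint16
}

// targetsFromSRV converts SRV records into neutral targets.
func targetsFromSRV(recs []*net.SRV) []Target {
	out := make([]Target, 0, len(recs))
	for _, r := range recs {
		host := strings.TrimSuffix(r.Target, ".") // drop the trailing root dot
		out = append(out, Target{
			Addr:   net.JoinHostPort(host, strconv.Itoa(int(r.Port))),
			Weight: r.Weight,
		})
	}
	return out
}

// Stand-ins for each proxy's own upstream type (not the real structs).
type httpUpstream struct{ Dial string }

type l4Upstream struct {
	Dial   []string
	Weight int
}

// toHTTP is the adapter the HTTP proxy would own.
func toHTTP(ts []Target) []httpUpstream {
	out := make([]httpUpstream, 0, len(ts))
	for _, t := range ts {
		out = append(out, httpUpstream{Dial: t.Addr})
	}
	return out
}

// toL4 is the adapter the layer4 proxy would own.
func toL4(ts []Target) []l4Upstream {
	out := make([]l4Upstream, 0, len(ts))
	for _, t := range ts {
		out = append(out, l4Upstream{Dial: []string{t.Addr}, Weight: int(t.Weight)})
	}
	return out
}

// demo builds both upstream flavors from the same discovery result.
func demo() ([]httpUpstream, []l4Upstream) {
	recs := []*net.SRV{
		{Target: "db1.internal.", Port: 5432, Weight: 10},
		{Target: "db2.internal.", Port: 5433, Weight: 20},
	}
	ts := targetsFromSRV(recs)
	return toHTTP(ts), toL4(ts)
}

func main() {
	h, l := demo()
	fmt.Println(h)
	fmt.Println(l)
}
```

Only targetsFromSRV (plus the caching around it) would need to live in the shared package; each adapter stays a few lines in its own repository.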

Happy to adjust naming/placement or scope based on your preference.

@tannevaled (Author)

Also — thanks for taking the time to look at these, and apologies for the burst of PRs arriving all at once; I realize that's a lot to land on a maintainer's plate.

There's genuinely no urgency on any of them. If it's easier for you, I'm happy to consolidate them, sequence them in whatever order suits your priorities, or close any that aren't a good fit — just say the word and I'll adjust.

#7790 is intended to be the de-duplication step you asked about, so that's probably the most useful place to start; the rest can wait until the direction there is settled. Thanks again for the project and the feedback.

- Document dynamic_upstreams (dynamic srv / dynamic a) in docs/handlers/proxy.md.
- Add a caddyfile_adapt integration test for `dynamic srv`.
- Add tests covering the grace-period path, cache bounding, and
  newDynamicUpstream's invalid-address error. The per-record "skip invalid
  target" branch is defensive and unreachable for well-formed DNS (SRV/A
  always yield a numeric port), so it is intentionally left uncovered.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@notarun commented Jun 13, 2026

Would love to see this get merged. The ability to add your own dynamic upstream source via a plugin would be awesome. I maintain caddy-nomad-sd, which discovers upstreams in a Nomad cluster. If the right set of APIs is exposed from caddy-l4, it could possibly be extended to support L4 as well.

Happy to help test or contribute.
