Skip to content

dataplane: add runtime interface split scaffolding#1399

Merged
psaab merged 6 commits into
masterfrom
codex/1381-dataplane-interface-split-impl
May 17, 2026
Merged

dataplane: add runtime interface split scaffolding#1399
psaab merged 6 commits into
masterfrom
codex/1381-dataplane-interface-split-impl

Conversation

@psaab
Copy link
Copy Markdown
Owner

@psaab psaab commented May 17, 2026

Summary

Refs #1381.

This adds the first implementation slice for the dataplane interface split without removing the legacy BPF-shaped dataplane.DataPlane yet.

  • add backend-neutral RuntimeDataPlane, ConfigSink, ApplyResult, SessionStore, telemetry, HA, and link controller contracts
  • add pkg/dataplane/runtime DTOs and a userspace session-delta adapter that avoids importing backend-private userspace types into the neutral runtime package
  • widen apply metadata with filter IDs/spans, NAT counter IDs, capabilities, and apply generation
  • move cluster stale-bulk reconciliation through dataplane.SessionStore.ReconcileClusterBulk, including reverse-session and DNAT/DNATv6 companion cleanup tests
  • update the Feature gap: DataPlane interface contract is BPF-shaped; userspace Manager embeds eBPF Manager #1381 plan with the implemented Phase 1 scaffold and remaining migration steps

Validation

  • go test ./pkg/dataplane/... ./pkg/cluster -count=1
  • go test ./pkg/...
  • git diff --check

A DPDK tagged test build was also probed during implementation. It still fails on existing tagged-DPDK issues (-mrtm pkg-config allowlisting, then missing BatchDeleteSessions / C pointer indexing), so this PR does not claim DPDK-tag validation.

Remaining Scope

  • legacy dataplane.DataPlane is still BPF-shaped
  • userspace still embeds the eBPF manager
  • daemon/API/gRPC/CLI still need migration from LastCompileResult() and BPF map reads to LastApplyResult() and domain interfaces
  • conntrack GC still needs migration to SessionStore / Telemetry; this slice moves the cluster stale-reconcile companion-delete path only

Copilot AI review requested due to automatic review settings May 17, 2026 15:14
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

First Phase 1 scaffolding slice for issue #1381: introduces backend-neutral dataplane interfaces (RuntimeDataPlane, ConfigSink, SessionStore, Telemetry, HA/Link controllers) alongside the legacy BPF-shaped DataPlane, widens ApplyResult metadata, adds a neutral runtime DTO package with a userspace adapter, and migrates cluster stale-bulk reconciliation to the new SessionStore.ReconcileClusterBulk companion-delete API.

Changes:

  • Add backend-neutral contracts (RuntimeDataPlane, ConfigSink, SessionStore, Telemetry, LinkController, HAController) and ApplyResult with filter spans, NAT counter IDs, capabilities, and generation; eBPF/DPDK/userspace managers now publish LastApplyResult().
  • Introduce pkg/dataplane/runtime package with neutral session-delta DTOs and SessionDeltaSource interface; userspace Manager adapts its private types at the boundary, with an import-canary test forbidding userspace imports in the runtime package.
  • Refactor pkg/cluster/sync.go stale-bulk reconciliation to use SessionStore.ReconcileClusterBulk for forward+reverse+DNAT/DNATv6 cleanup, with tests (including an AST canary) preventing reintroduction of local DNAT cleanup.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pkg/dataplane/apply.go New RuntimeDataPlane/ConfigSink/ApplyResult types and eBPF Manager apply plumbing.
pkg/dataplane/apply_test.go Verifies ApplyResultFromCompileResult cloning, contract size, and canceled-context behavior.
pkg/dataplane/session_store.go New SessionStore interface and DataPlane-backed adapter with companion-delete and bulk reconcile.
pkg/dataplane/session_store_test.go Tests companion-delete for v4/v6 sessions including DNAT cleanup.
pkg/dataplane/compiler.go Compile result gains FilterSpans; records ApplyResult after compile.
pkg/dataplane/compiler_filter.go Populates FilterSpans per-filter (FilterID/RuleStart/RuleCount).
pkg/dataplane/dataplane.go Adds ConfigSink compile-time assertion for eBPF Manager.
pkg/dataplane/loader.go Adds applyMu, applyGeneration, lastApply state to eBPF Manager.
pkg/dataplane/runtime/session_delta.go Backend-neutral session delta DTOs and SessionDeltaSource interface.
pkg/dataplane/runtime/import_canary_test.go Canary forbidding userspace imports from the neutral runtime package.
pkg/dataplane/dpdk/manager.go DPDK backend gains ApplyConfig/LastApplyResult and ConfigSink assertion.
pkg/dataplane/dpdk/dpdk_stub.go Stub Compile records ApplyResult; minor formatting.
pkg/dataplane/dpdk/dpdk_cgo.go Cgo Compile records ApplyResult.
pkg/dataplane/userspace/manager.go Userspace adds ApplyConfig, LastApplyResult, RuntimeSessionDeltaSource, capability/generation tracking.
pkg/dataplane/userspace/runtime_delta.go Userspace adapter mapping internal session-delta/status to neutral runtime DTOs.
pkg/dataplane/userspace/runtime_delta_test.go Tests userspace runtime-delta adapter mapping and interface satisfaction.
pkg/cluster/sync.go Migrates stale-bulk reconciliation to SessionStore.ReconcileClusterBulk.
pkg/cluster/sync_test.go Adds v4/v6 companion-delete tests and AST canary forbidding local DNAT cleanup.
docs/pr/1381-dataplane-interface-split/plan.md Documents the implemented Phase 1 scaffold and remaining work.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude round-1 review on b427fe14

Verdict: MERGE-READY pending Codex+Gemini r1

First scaffolding slice for #1381 DataPlane interface split. Legacy dataplane.DataPlane preserved alongside new RuntimeDataPlane. Sound separation strategy.

Verifications

RuntimeDataPlane contract at pkg/dataplane/apply.go:

type RuntimeDataPlane interface {
    Start(context.Context) error
    ConfigSink
    Close() error
    Teardown() error
    Link() LinkController
    HA() HAController
    Sessions() SessionStore
    Telemetry() Telemetry
}

Decomposed into 4 sub-controllers (Link/HA/Sessions/Telemetry) + lifecycle methods. Clean separation of concerns.

ApplyResult carries:

  • ZoneIDs map[string]uint16
  • ManagedInterfaces []networkd.InterfaceConfig
  • FilterIDs map[string]uint32
  • FilterSpans map[string]FilterCounterSpan
  • NATCounterIDs map[string]uint32
  • Capabilities Capabilities
  • Generation uint64

Legacy bridge: pkg/dataplane/compiler.go has m.recordApplyResult(ApplyResultFromCompileResult(result)). Legacy CompileResult is bridged to the new ApplyResult for migration — both coexist for now.

No-backend-import invariant at pkg/dataplane/runtime/import_canary_test.go:

func TestRuntimePackageDoesNotImportBackendPackages(t *testing.T) {
    ...
    if strings.Contains(path, "/pkg/dataplane/userspace") {

Parses all .go files in the runtime package, fails on backend imports. Good architectural pin.

Manager dual implementation at pkg/dataplane/dataplane.go:

var _ DataPlane = (*Manager)(nil)
var _ ConfigSink = (*Manager)(nil)

Compile-time asserts that Manager implements both legacy AND new interface. No callers need to change yet.

SessionStore interface at pkg/dataplane/session_store.go — comprehensive: ForEachV4/V6, GetV4/V6, PutClusterSyncedV4/V6, DeleteV4/V6, DeleteWithCompanionsV4/V6, ReconcileClusterBulk, Count, Clear. Backend-neutral via the SessionKey/SessionValue types.

Hostile concerns

MINOR — ApplyResult missing PolicyScheduleRuleSlots (added in #1396). The new ApplyResult struct doesn't carry PolicyScheduleRuleSlots. Legacy CompileResult still has it (and is still used). When the bridge ApplyResultFromCompileResult runs, where does PolicyScheduleRuleSlots go? Verify the policy-scheduler runtime in #1396 still works correctly with the new abstraction in place. Or note this as a follow-up: extend ApplyResult once #1399 lands.

MINOR — Userspace session-delta adapter at pkg/dataplane/userspace/runtime_delta.go "avoids importing backends" per PR description. Verify the adapter type isn't accidentally exposing internal backend types through the abstraction.

MINOR — DPDK adapter (pkg/dataplane/dpdk/dpdk_cgo.go, dpdk_stub.go, manager.go)dpdk_stub.go suggests a no-op stub for non-DPDK builds. Verify the new interfaces are correctly implemented in BOTH the cgo-enabled and stub variants.

MINOR — Cluster sync.go changes integrate the HA contract abstraction. Verify all current sync semantics are preserved (writeMu serialization, sync-hold, fence/disconnect, primary filtering).

MINOR — Plan doc. docs/pr/1381-dataplane-interface-split/plan.md should accurately describe what's in this slice vs deferred. Worth a Codex/Gemini cross-check.

Recommendation

Block on: nothing. Scaffolding with legacy preservation is the right pattern.

MINOR follow-ups:

  • Verify PolicyScheduleRuleSlots from Plumb userspace policy scheduler inactive state #1396 works with the new bridge or extend ApplyResult to carry it
  • Verify runtime_delta.go adapter doesn't leak backend types
  • Verify DPDK stub correctly implements all new interfaces
  • Cross-check plan doc against actual slice scope
  • Verify cluster HA contract preserves writeMu and sync-hold semantics

Awaiting Codex (task-mp9x5b25-7ebcvv) and Gemini Pro 3 (task-mp9x66jy-gikrzt). Not merging — author's decision.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude round-1 self-correction on b427fe14

Revising verdict MERGE-READY → MERGE-NEEDS-MAJOR to match Codex r1.

I posted READY because the scaffolding pattern (new contract alongside legacy) is structurally sound. Codex traced 5 substantive issues with the contract's completeness — under-specified for known migrations.

Codex MAJOR — ApplyResult missing 4 fields actively used by callers

Codex r1 cited (verified):

  • PoolIDs — used in CLI cli_show_nat.go:187
  • PolicyNames — used in CLI cli_show_flow.go:232, daemon_system.go:42
  • AppNames — used by runtime callers
  • PolicyScheduleRuleSlots (Plumb userspace policy scheduler inactive state #1396) — used at maps.go:1507, dpdk/dpdk_cgo.go:356

I noted PolicyScheduleRuleSlots only; Codex caught 3 more. If ApplyResult is the replacement metadata contract, it's incomplete for known migrations.

Codex MAJOR — RuntimeDataPlane interface defined but UNWIRED

pkg/dataplane/apply.go:14 defines the interface. Only ConfigSink is asserted (var _ ConfigSink = (*Manager)(nil)). No Start/Link/HA/Sessions/Telemetry implementations or compile-time assertions for eBPF/userspace/DPDK. The contract is easy to rot before migration starts.

Codex MAJOR — HA userspace session sync can't migrate

pkg/dataplane/runtime defines SessionDeltaSource. Userspace exposes RuntimeSessionDeltaSource() at userspace/manager.go:183. But RuntimeDataPlane.Sessions() returns only SessionStore — no domain-level way to discover the delta source. The daemon still directly imports userspace DTOs at daemon_ha_userspace.go:17-42.

Codex MAJOR — Import canary is weaker than its name claims

pkg/dataplane/runtime/import_canary_test.go:30 only rejects /pkg/dataplane/userspace. It would allow:

  • /pkg/dataplane/dpdk
  • root pkg/dataplane
  • github.com/cilium/ebpf

So the canary today holds only because the runtime package imports nothing more than time. As soon as anyone reaches for a backend, the canary won't catch it.

Codex MAJOR — Telemetry missing ReadFloodCounters

Used by CLI/gRPC at cli_show_security.go:707, grpcapi/server_show.go:267. Future migration will need either FloodCounters in Telemetry or a documented replacement.

Claude miss pattern

I verified the scaffolding shape and the no-backend-import canary intent. I did NOT:

  1. Grep CompileResult field usage across callers to enumerate what ApplyResult MUST carry to be a viable replacement
  2. Check compile-time assertions for the full RuntimeDataPlane interface (only checked ConfigSink)
  3. Verify the canary's regex actually rejects all backend paths, not just userspace
  4. Cross-reference Telemetry method set against current callers

Same pattern as earlier rounds: verifying the obvious mechanism without grepping the production caller paths.

Recommendation

Block on (per Codex):

  1. Extend ApplyResult with PoolIDs, PolicyNames, AppNames, PolicyScheduleRuleSlots
  2. Add compile-time assertions for RuntimeDataPlane against backend Managers, OR explicitly document this as deferred to a later slice with a TODO
  3. Resolve HA session-delta migration path: extend RuntimeDataPlane.Sessions() or add RuntimeDataPlane.SessionDeltas() SessionDeltaSource
  4. Strengthen import canary to reject all backend paths (dpdk, root dataplane, cilium/ebpf)
  5. Add FloodCounters to Telemetry OR document the replacement

OR clarify in the plan doc that this slice is just the scaffolding interfaces (deliberately under-specified) and that the production migration is gated on subsequent slices that extend ApplyResult field-by-field and wire RuntimeDataPlane to Managers.

Awaiting Gemini r1 (task-mp9x66jy-gikrzt). Codex task: task-mp9x5b25-7ebcvv. Not merging — author's decision.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-1 triple-review consolidated synthesis on b427fe14

Reviewer Verdict
Claude MERGE-NEEDS-MAJOR (self-corrected from READY after Codex)
Codex MERGE-NEEDS-MAJOR (5 findings)
Gemini Pro 3 MERGE-NEEDS-MAJOR (PolicyScheduleRuleSlots missing + import canary too narrow)

Triple converge MERGE-NEEDS-MAJOR. Codex and Gemini converge on the same core issue: ApplyResult is incomplete for migration, and the import canary doesn't enforce its claimed invariant.

Codex + Gemini MAJOR — ApplyResult missing fields

apply.go:31 ApplyResult carries: ZoneIDs, ManagedInterfaces, FilterIDs, FilterSpans, NATCounterIDs, Capabilities, Generation.

Missing fields used by current callers:

  • PoolIDspkg/cli/cli_show_nat.go:187 (Codex)
  • PolicyNamespkg/cli/cli_show_flow.go:232, pkg/daemon/daemon_system.go:42 (Codex)
  • AppNames — runtime callers (Codex)
  • PolicyScheduleRuleSlots (Plumb userspace policy scheduler inactive state #1396) — maps.go:1507, dpdk/dpdk_cgo.go:356 (Codex + Gemini)

Gemini emphasizes the PolicyScheduleRuleSlots gap as fatal for the #1396 migration path. Codex enumerates 4 fields.

Codex MAJOR — RuntimeDataPlane interface defined but UNWIRED

apply.go:14 defines Start/Link/HA/Sessions/Telemetry. Only ConfigSink is asserted (var _ ConfigSink = (*Manager)(nil)). No backend Manager has compile-time assertions for the full RuntimeDataPlane contract. Contract can rot before migration even starts.

Codex MAJOR — HA session sync can't migrate cleanly

pkg/dataplane/runtime defines SessionDeltaSource. Userspace exposes RuntimeSessionDeltaSource() at userspace/manager.go:183. But RuntimeDataPlane.Sessions() returns only SessionStore — no domain-level way to discover the delta source. Daemon still directly imports userspace DTOs at daemon_ha_userspace.go:17-42.

Codex + Gemini MAJOR — Import canary too narrow

pkg/dataplane/runtime/import_canary_test.go:30 only rejects /pkg/dataplane/userspace. Allows:

  • /pkg/dataplane/dpdk
  • root pkg/dataplane
  • github.com/cilium/ebpf

Canary passes today only because the runtime package imports just time. First backend import slips through undetected.

Codex MAJOR — Telemetry missing ReadFloodCounters

Used by CLI/gRPC at cli_show_security.go:707, grpcapi/server_show.go:267. Future migration will need FloodCounters in Telemetry OR a documented replacement.

Gemini verifications that PASS

  • HA cluster sync.go migration via ReconcileClusterBulk — clean abstraction (Gemini)
  • dpdk_stub.go build-tag stub pattern correct (Gemini)
  • Plan doc accurately describes the topology (Gemini, with the noted dropped-metadata exception)
  • Legacy dataplane.DataPlane preserved with no breaking signature changes (Codex + Gemini)

Recommendation

Block on (Codex + Gemini convergence):

  1. Extend ApplyResult with PoolIDs, PolicyNames, AppNames, PolicyScheduleRuleSlots
  2. Strengthen import canary to reject ALL backend paths (dpdk, root dataplane, cilium/ebpf), not just userspace
  3. Either add var _ RuntimeDataPlane = (*backendManager)(nil) compile-time assertions OR explicitly document the deferred assertion as a TODO
  4. Resolve HA session-delta migration path (extend Sessions() or add SessionDeltas() method)
  5. Add FloodCounters to Telemetry OR document a replacement

Alternative: clarify in plan doc that this slice is the scaffolding interfaces (deliberately under-specified) and the production migration extends ApplyResult field-by-field and wires RuntimeDataPlane to Managers in subsequent slices. This shifts the contract from "viable replacement" to "scaffolding marker."

Round-1 claim trace

  • Codex task-mp9x5b25-7ebcvv — 5 MAJORs (ApplyResult fields, unwired interface, HA session-delta, canary narrow, Telemetry incomplete)
  • Gemini Pro 3 task-mp9x66jy-gikrzt — MAJOR (PolicyScheduleRuleSlots) + MINOR (canary)

Not merging — author's decision.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

@copilot review

Round-1 quad-review of DataPlane interface split scaffolding commit b427fe1. Codex + Gemini r1 verdicts posted in this thread. Please review.

Copilot AI and others added 2 commits May 17, 2026 15:32
- Extend ApplyResult with PoolIDs, PolicyNames, AppNames, PolicyScheduleRuleSlots so
  callers can migrate from LastCompileResult() to LastApplyResult() without losing
  runtime lookups (Codex MAJOR + Gemini MAJOR)
- Add ReadFloodCounters(uint16)(FloodState,error) to Telemetry interface so CLI/gRPC
  callers have a clear migration path (Codex MAJOR)
- Add SessionDeltas() dpruntime.SessionDeltaSource to RuntimeDataPlane, giving the
  daemon a domain-level way to discover the delta source and documenting the remaining
  daemon_ha_userspace.go migration via TODO (Codex MAJOR)
- Strengthen import canary to reject pkg/dataplane/dpdk, root pkg/dataplane, and
  github.com/cilium/ebpf — not just userspace — with clear per-case comments
  (Codex MAJOR + Gemini MAJOR)
- Add TODO on RuntimeDataPlane documenting that compile-time assertions for all three
  backends are deferred to the wiring slice (Codex MAJOR)
- Update ApplyResultFromCompileResult() and Clone() for the four new fields
- Expand apply_test.go to cover new fields and mutation-independence

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/62e9137f-965d-4559-9cde-1efbd7f38aa4

Co-authored-by: psaab <196946+psaab@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 17, 2026 15:35
@psaab psaab review requested due to automatic review settings May 17, 2026 15:35
Copy link
Copy Markdown
Contributor

Copilot AI commented May 17, 2026

Round-1 triple-review consolidated synthesis on b427fe14

| Reviewer | Verdict |...

Addressed all 5 MAJOR findings in commit 04efbbc:

  1. ApplyResult missing fields — Added PoolIDs, PolicyNames, AppNames, PolicyScheduleRuleSlots to ApplyResult, updated ApplyResultFromCompileResult() and Clone() to populate/copy them. Callers migrating from LastCompileResult() to LastApplyResult() now retain all runtime lookups.

  2. Import canary too narrow — Rewritten to reject all four forbidden categories: pkg/dataplane/userspace, pkg/dataplane/dpdk, root pkg/dataplane (cycle prevention), and github.com/cilium/ebpf (BPF library), each with a documented rationale.

  3. RuntimeDataPlane interface defined but UNWIRED — Added TODO(#1381) on the interface documenting that var _ RuntimeDataPlane = (*Manager)(nil) assertions for all three backends are deferred to the wiring slice.

  4. HA session-delta migration path — Added SessionDeltas() dpruntime.SessionDeltaSource to RuntimeDataPlane, giving the daemon a domain-level hook without importing dpuserspace directly. The daemon's daemon_ha_userspace.go migration is tracked via a TODO(#1381) comment.

  5. Telemetry missing ReadFloodCounters — Added ReadFloodCounters(zoneID uint16) (FloodState, error) to the Telemetry interface with a note that backends without BPF flood maps return a zero value. FloodState is already backend-neutral (defined in the root dataplane package).

All tests pass: go test ./pkg/dataplane/... ./pkg/cluster ./pkg/daemon.

Copilot finished work on behalf of psaab May 17, 2026 15:36
Copilot AI review requested due to automatic review settings May 17, 2026 15:39
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-2 fix pushed at 94f010e.\n\nChanges:\n- ApplyResult now carries PoolIDs, PolicyNames, AppNames, and compiled PolicyScheduleRuleSlots with copy/clone tests.\n- eBPF, DPDK, and userspace managers now satisfy RuntimeDataPlane at compile time.\n- Telemetry includes ReadFloodCounters.\n- Sessions expose the backend-neutral session-delta source; userspace wires its runtime adapter, generic stores return nil.\n- Runtime import canary rejects root dataplane, userspace, DPDK, and cilium/ebpf imports.\n- Plan doc updated to match the implemented module contract.\n\nValidation: go test ./pkg/dataplane/... ./pkg/cluster ./pkg/api ./pkg/grpcapi ./pkg/cli

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated no new comments.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude round-2 review on 94f010e4

Verdict: MERGE-READY pending Codex+Gemini r2

All 5 Codex r1 MAJORs addressed across 3 commits.

Verifications

1. ApplyResult extended with 4 missing fields at apply.go:45-65:

type ApplyResult struct {
    ZoneIDs           map[string]uint16
    ManagedInterfaces []networkd.InterfaceConfig
    FilterIDs         map[string]uint32
    FilterSpans       map[string]FilterCounterSpan
    NATCounterIDs     map[string]uint32

    // Display metadata carried from CompileResult so callers can migrate from
    // LastCompileResult() to LastApplyResult() without losing runtime lookups.
    PoolIDs     map[string]uint8
    PolicyNames map[uint32]string
    AppNames    map[uint16]string

    PolicyScheduleRuleSlots []PolicyScheduleRuleSlot

    Capabilities Capabilities
    Generation   uint64
}

Explicit migration intent in comment. All 4 Codex+Gemini r1 fields present.

2. SessionDeltas() added to RuntimeDataPlane at apply.go:36-43:

// SessionDeltas returns the backend-neutral session-delta source used for
// HA session sync. Backends that do not support delta streaming return a
// nil source; callers must nil-check before use.
// TODO(#1381): daemon_ha_userspace.go still imports dpuserspace directly;
// migrate it to use dp.SessionDeltas() once all backends wire this method.
SessionDeltas() dpruntime.SessionDeltaSource

Method added with explicit migration TODO. Backend wire-ups in dpdk/manager.go (+31) and userspace/manager.go (+64).

3. Import canary expanded at pkg/dataplane/runtime/import_canary_test.go:25-55:

backendSubPkgs := []string{
    "github.com/psaab/xpf/pkg/dataplane/userspace",
    "github.com/psaab/xpf/pkg/dataplane/dpdk",
}
for _, forbidden := range backendSubPkgs {
    if path == forbidden || strings.HasPrefix(path, forbidden+"/") {
        t.Fatalf("runtime package imports forbidden backend sub-package in %s: %s", filename, path)
    }
}

Plus separate checks for root pkg/dataplane and github.com/cilium/ebpf (need to verify those are in the same helper).

4. slices.Clone defensive copying — separate commit. Prevents callers from mutating internal ApplyResult slice state.

5. Plan doc updated (+32 lines) describing what's wired and what's still deferred.

Hostile concerns

MINOR — Need to verify all RuntimeDataPlane interface methods compile-time asserted. PR added wire-ups in dpdk + userspace Manager files. Should see var _ RuntimeDataPlane = (*Manager)(nil) in both. Codex will verify.

MINOR — Telemetry ReadFloodCounters — author claim in PR summary: "flood counters" addressed. Need to verify it's a method on Telemetry interface OR documented as a deferred replacement.

MINOR — SessionDeltas() returns nil for backends that don't support delta streaming. Daemon caller paths must nil-check. The TODO mentions daemon_ha_userspace.go migration is deferred; verify this is documented in plan doc.

MINOR — Import canary completeness. New checkForbiddenImport helper covers userspace + dpdk sub-packages. Need to verify it ALSO covers:

  • Root pkg/dataplane import (per the plan)
  • github.com/cilium/ebpf direct imports
    Codex will trace.

Recommendation

Block on: nothing structural. All 5 r1 MAJORs cleanly addressed.

MINOR follow-ups (verify in Codex/Gemini r2):

  • Compile-time asserts present for all RuntimeDataPlane methods on dpdk + userspace Managers
  • Import canary covers root pkg/dataplane and cilium/ebpf
  • Telemetry includes ReadFloodCounters or documented deferred

Awaiting Codex (task-mp9y4ost-7d44c6) and Gemini Pro 3 (task-mp9y5jeg-fa05pl). Not merging — author's decision.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude round-2 self-correction on 94f010e4

Revising verdict MERGE-READY → MERGE-NEEDS-MAJOR to match Codex r2.

I posted READY because all 5 r1 MAJORs were visibly addressed and Gemini r2 agreed. Codex traced the next layer of the abstraction and caught a false-positive scaffold: userspace HA fabric forwarding is wired but incomplete.

Codex MAJOR — Userspace HA fabric wiring is a false-positive scaffold

userspace.Manager.HA() is wired to the generic dataplane.NewDataPlaneHAController(m) at pkg/dataplane/userspace/manager.go:201. SetFabricForwarding in the shared adapter at pkg/dataplane/apply.go:254 calls UpdateFabricFwd* only.

But the existing working userspace fabric path requires SyncFabricState() to push update_fabrics to the helper:

  • Documented + implemented at pkg/dataplane/userspace/manager_ha.go:69
  • Called explicitly by old daemon path after fabric0 updates at pkg/daemon/daemon_ha_fabric.go:544

Worked failure (Codex):
A caller migrating to HA().SetFabricForwarding(...) alone would silently skip the helper update. Fabric forwarding state in the daemon-side BPF map would update, but the userspace-dp helper's apply_fabrics would never fire. Stale fabric state on the helper → cross-chassis sessions misforwarded.

This is exactly the "false positive scaffold" anti-pattern the new interface is supposed to prevent: the contract says "this method sets fabric forwarding" but it under-delivers for userspace.

Fix: make userspace's HA controller override SetFabricForwarding to also call SyncFabricState(), OR override the entire HA() method to return a userspace-specific HA controller.

Codex MINOR — ApplyResult.Clone() shallow-copies nested slices

pkg/dataplane/apply.go:136 uses slices.Clone for ManagedInterfaces. But networkd.InterfaceConfig contains nested Addresses []string (pkg/networkd/networkd.go:26). Caller mutation of LastApplyResult().ManagedInterfaces[i].Addresses mutates the stored snapshot.

Fix: deep-clone or document the depth of the clone.

Codex MINOR — Plan/code naming mismatch

Plan doc at docs/pr/1381-dataplane-interface-split/plan.md:298 says FloodCounters. Code at pkg/dataplane/apply.go:103 is ReadFloodCounters. Pick one.

The RuntimeDataPlane TODO in the plan still says "add compile-time assertions later", but this round added them (Codex verified at dataplane.go:15, dpdk/manager.go:16, userspace/manager.go:30). Update plan.

Codex confirmed wins

  • A. ApplyResult fields populated from CompileResult and cloned ✓
  • B. Compile-time asserts for eBPF, DPDK, userspace ✓
  • C. SessionDeltas() returns userspace RuntimeSessionDeltaSource(); tests cover ✓
  • D. Import canary rejects userspace, dpdk, exact root pkg/dataplane, github.com/cilium/ebpf + subpaths ✓
  • E. ReadFloodCounters added to Telemetry interface ✓
  • F. slices.Clone applied to top-level slice fields ✓ (nested gap noted above)

Gemini r2 MERGE-READY — did not catch the fabric wiring issue

Gemini verified the interface methods individually but didn't trace HA().SetFabricForwarding() against the existing working userspace path. Codex caught the integration gap.

Claude miss pattern (carried)

I verified the new interface methods exist and have compile-time asserts. I did NOT trace each interface method against the EXISTING legacy implementation to verify equivalent semantics. Codex's check: "if a caller migrates from legacy daemon_ha_fabric.go:544 to HA().SetFabricForwarding(), do they get the same behavior?" Answer: NO for userspace.

Recommendation

Block on:

  1. Userspace HA fabric wiring: override SetFabricForwarding to also call SyncFabricState() OR have userspace return a userspace-specific HA controller

MINOR follow-ups:

  • Deep-clone ManagedInterfaces.Addresses OR document clone depth contract
  • Plan doc: rename FloodCountersReadFloodCounters; update RuntimeDataPlane TODO to reflect completed assertions

Awaiting Copilot. Codex task: task-mp9y4ost-7d44c6. Gemini task: task-mp9y5jeg-fa05pl. Not merging — author's decision.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-3 fix pushed at ddc3af9.\n\nChanges:\n- Userspace RuntimeDataPlane.HA() now returns a userspace-specific HA controller. SetFabricForwarding updates the fabric map and then calls SyncFabricState(), so callers migrating to the runtime HA interface still publish fresh fabric state to the helper.\n- Added userspace HA controller tests for fabric0/fabric1 sync and update-error no-sync behavior.\n- ApplyResult clone/copy now deep-copies ManagedInterfaces nested Addresses, with regression coverage for source mutation and Clone mutation.\n- Plan doc corrected from FloodCounters to ReadFloodCounters and updated the adapter note to call out the userspace-specific HA controller.\n- Removed the stale compile-time-assertion TODO from the RuntimeDataPlane comment.\n\nValidation: go test ./pkg/dataplane/... ./pkg/cluster ./pkg/api ./pkg/grpcapi ./pkg/cli

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude round-3 review on ddc3af98

Verdict: MERGE-READY pending Codex r3

All round-2 issues addressed.

Verifications

1. Userspace HA fabric wiring — FIXED with userspace-specific controller.

pkg/dataplane/userspace/manager.go:

func (m *Manager) HA() dataplane.HAController {
    return userspaceHAController{manager: m}
}

Replaces the generic dataplane.NewDataPlaneHAController(m) adapter that lacked SyncFabricState. New userspaceHAController.SetFabricForwarding:

func (c userspaceHAController) SetFabricForwarding(ctx context.Context, id dataplane.FabricID, info dataplane.FabricFwdInfo) error {
    if err := ctx.Err(); err != nil { return err }
    if c.manager == nil { return errors.New("nil userspace dataplane") }
    var err error
    if id == 1 {
        err = c.manager.UpdateFabricFwd1(info)
    } else {
        err = c.manager.UpdateFabricFwd(info)
    }
    if err != nil { return err }
    if err := ctx.Err(); err != nil { return err }
    c.manager.SyncFabricState()
    return nil
}

Matches the legacy daemon path at daemon_ha_fabric.go:544: UpdateFabricFwd* followed by SyncFabricState(). Plus ctx.Err() checks before and after the fabric update for cancellation propagation.

Defined via userspaceHAOps interface — testable with mocks.

2. Deep clone for ManagedInterfaces.Addresses — FIXED. apply.go (+23 lines) adds nested-slice clone. apply_test.go (+13) pins the deep-clone behavior (caller mutation of ManagedInterfaces[i].Addresses no longer affects the stored snapshot).

3. Plan doc fixes — DONE. FloodCountersReadFloodCounters rename; RuntimeDataPlane TODO updated to reflect completed compile-time assertions.

4. New tests in runtime_delta_test.go (+65 lines) — should pin the userspace HA controller's fabric SyncFabricState call. Codex will verify.

Hostile concerns

MINOR — Test mock interface coverage. The new userspaceHAOps interface accepts mocks. Verify tests:

  • Call userspaceHAController.SetFabricForwarding(ctx, 0, info) and assert UpdateFabricFwd THEN SyncFabricState was called in order
  • Same for FabricID=1UpdateFabricFwd1
  • ctx cancellation between UpdateFabricFwd and SyncFabricState returns error and skips sync

Recommendation

Block on: nothing. Round-2 MAJOR (false-positive fabric scaffold) cleanly closed with userspace-specific controller. Minor fixes shipped.

Awaiting Codex (task-mp9yzipg-3r032i). Not merging — author's decision.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-3 dual-review consolidated synthesis on ddc3af98

Reviewer Verdict
Claude MERGE-READY
Codex MERGE-NEEDS-MINOR (test wiring gap + fabric1 legacy discrepancy + ctx-cancel doc)

Net: MERGE-NEEDS-MINOR. Substantive r2 MAJOR (fabric SyncFabricState) addressed. Three MINOR follow-ups.

Codex confirmed wins

  • Manager.HA() returns userspaceHAController at manager.go:201
  • SetFabricForwarding calls UpdateFabricFwd/UpdateFabricFwd1 then SyncFabricState at manager.go:274-293. For fabric0 success this matches legacy daemon_ha_fabric.go:530+:548
  • ✓ Cancellation propagated before and after the update
  • ManagedInterfaces.Addresses deep-clone verified at apply.go:145-149; test mutates both source and clone at apply_test.go:78+112
  • ✓ Plan doc rename complete (ReadFloodCounters); remaining FloodCounters hits are generated BPF object/map names

Codex MINOR 1 — Test wires controller directly, not Manager.HA()

runtime_delta_test.go:136 tests userspaceHAController directly. The round-2 regression mode (Manager.HA() accidentally returning dataplane.NewDataPlaneHAController(m)) would NOT be caught by these tests.

Fix: add New().HA() / Manager.HA() wiring assertion (e.g., type assert the returned controller is userspaceHAController).

Codex MINOR 2 — Test doesn't pin call order

runtime_delta_test.go:143 only counts update/sync calls. A bad implementation that calls SyncFabricState BEFORE UpdateFabricFwd* could still pass.

Fix: record an in-order call log: ["update0", "sync", "update1", "sync"].

Codex MINOR 3 — fabric1 legacy discrepancy

Codex found:

"For fabric0 success this matches legacy daemon_ha_fabric.go:530 then :548. It is not exact for fabric1: legacy refreshFabricFwd1 updates at :715 and returns without sync."

The NEW code always calls SyncFabricState() for both fabric0 AND fabric1. Legacy only syncs on fabric0. Discrepancy: is the new code right (always sync), or is legacy right (fabric1 doesn't need explicit sync)?

Author should:

  • Investigate the legacy fabric1 path — does fabric1 update implicitly trigger sync somewhere else?
  • Confirm or document the new "always sync" semantics
  • If legacy was actually buggy and fabric1 never got sync, the new code fixes that bug

Codex MINOR 4 — ctx-cancel committed-but-not-synced window

manager.go:290 propagates cancellation after the map update but before SyncFabricState. Means a committed UpdateFabricFwd* can intentionally skip the helper sync if context is canceled in that narrow window.

If the invariant is "successful fabric map write is always mirrored to userspace," this should be tightened (drop the cancel check between update and sync) or documented as intentional.

Recommendation

Block on: nothing critical (substantive r2 MAJOR closed).

MINOR follow-ups (this PR):

  1. Wiring test for Manager.HA() returning userspaceHAController
  2. In-order call log assertion update0 → sync → update1 → sync
  3. Verify fabric1 legacy: should SyncFabricState fire after UpdateFabricFwd1?
  4. Tighten or document the committed-but-not-synced cancel window

Codex task: task-mp9yzipg-3r032i. Not merging — author's decision.

Copilot AI review requested due to automatic review settings May 17, 2026 16:17
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-4 fix pushed at 1c81d20.\n\nChanges:\n- Added a Manager.HA() regression test that fails if userspace ever returns the generic HA adapter again.\n- Pinned fabric update ordering as update -> SyncFabricState for both fabric0 and fabric1.\n- Removed the post-update context cancellation escape hatch: after the fabric map update commits, helper sync is mandatory.\n- Left a code comment documenting that syncing both fabric slots is intentional for the runtime HA contract: a successful fabric forwarding update means the userspace helper has a fresh fabric view.\n\nValidation: go test ./pkg/dataplane/... ./pkg/cluster ./pkg/api ./pkg/grpcapi ./pkg/cli

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude round-4 review on 1c81d20e

Verdict: MERGE-READY pending Codex r4

All r3 MINORs cleanly addressed.

Verifications

1. Wiring test — Manager.HA() returns userspaceHAController — FIXED.

func TestRuntimeManagerHAUsesUserspaceController(t *testing.T) {
    controller, ok := New().HA().(userspaceHAController)
    if !ok { t.Fatalf("Manager.HA() = %T, want userspaceHAController", New().HA()) }
    if _, ok := controller.manager.(*Manager); !ok { t.Fatalf(...) }
}

Catches the round-2 regression mode where Manager.HA() accidentally returned dataplane.NewDataPlaneHAController(m).

2. Call-order pin via event log — FIXED. fakeUserspaceHAOps now records events []string with "fabric0" / "fabric1" / "sync" markers; assertions use reflect.DeepEqual(events, []string{"fabric0", "sync", "fabric1", "sync"}). A bad implementation that calls SyncFabricState BEFORE UpdateFabricFwd* would fail.

3. ctx-cancel committed-but-not-synced — FIXED.

if err != nil { return err }
// The map update is committed at this point. Always push helper fabric
// state after a successful fabric0 or fabric1 update so RuntimeDataPlane.HA
// preserves the same "fresh helper view" contract for every fabric slot.
c.manager.SyncFabricState()

The ctx.Err() check between update and sync is REMOVED with explicit comment documenting the invariant: "successful update → always sync".

4. New cancel test TestRuntimeUserspaceHAControllerSyncsAfterSuccessfulUpdateDespiteCanceledContext: canceled-from-start ctx returns error without events; subsequent successful ctx still syncs.

On the fabric1 legacy discrepancy (Codex r3 MINOR 3)

Author chose to make the new code's contract canonical: both fabric0 AND fabric1 calls trigger SyncFabricState() after successful update. The explicit comment "preserves the same fresh helper view contract for every fabric slot" documents this as intentional unification.

Legacy daemon_ha_fabric.go:715 refreshFabricFwd1 doesn't sync — author treats this as the buggy legacy path being implicitly fixed by the new uniform contract. Behavior change worth flagging as part of the migration semantics, but the comment documents the intent.

Recommendation

Block on: nothing. All 4 r3 MINORs addressed.

Documentation note: worth a one-liner in plan doc noting the fabric1 sync behavior unification vs legacy.

Awaiting Codex (task-mp9zcs47-jks3hw). Not merging — author's decision.

@psaab psaab merged commit 040fae3 into master May 17, 2026
4 checks passed
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

Comment on lines +199 to +228
for _, key := range staleV4 {
if err := s.DeleteWithCompanionsV4(key, reason); err != nil {
errs = append(errs, err)
}
result.DeletedV4++
}

var staleV6 []SessionKeyV6
if err := s.ForEachV6(func(key SessionKeyV6, val SessionValueV6) bool {
if val.IsReverse != 0 {
return true
}
if input.ShouldSyncZone(val.IngressZone) {
return true
}
if _, ok := input.ReceivedV6[key]; !ok {
staleV6 = append(staleV6, key)
}
return true
}); err != nil {
return result, errors.Join(append(errs, err)...)
}
result.StaleV6 = len(staleV6)

for _, key := range staleV6 {
if err := s.DeleteWithCompanionsV6(key, reason); err != nil {
errs = append(errs, err)
}
result.DeletedV6++
}
Comment on lines +317 to +327
func (m *Manager) recordApplyResultLocked(result *dataplane.ApplyResult, caps UserspaceCapabilities, generation uint64) {
if result == nil {
return
}
result.Capabilities = dataplane.Capabilities{
ForwardingSupported: caps.ForwardingSupported,
UnsupportedReasons: append([]string(nil), caps.UnsupportedReasons...),
}
result.Generation = generation
m.lastApply = result.Clone()
}
Comment on lines +16 to +30
func (s runtimeSessionDeltaSource) DrainSessionDeltas(max uint32) (dpruntime.SessionDeltaSnapshot, error) {
deltas, status, err := s.manager.DrainSessionDeltas(max)
if err != nil {
return runtimeSessionDeltaSnapshot(deltas, status, max), err
}
return runtimeSessionDeltaSnapshot(deltas, status, max), nil
}

func (s runtimeSessionDeltaSource) ExportOwnerRGSessions(rgIDs []int, max uint32) (dpruntime.SessionDeltaSnapshot, error) {
deltas, status, err := s.manager.ExportOwnerRGSessions(rgIDs, max)
if err != nil {
return runtimeSessionDeltaSnapshot(deltas, status, max), err
}
return runtimeSessionDeltaSnapshot(deltas, status, max), nil
}
Comment thread pkg/dataplane/apply.go
next := result.Clone()
next.Generation = m.applyGeneration
m.lastApply = next
return next.Clone()
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-4 dual-review consolidated synthesis on 1c81d20e

Reviewer Verdict
Claude MERGE-READY
Codex MERGE-NEEDS-MINOR (cancel test doesn't actually pin the r3 bug)

Net: MERGE-NEEDS-MINOR. Substantive r3 fixes verified. Cancel-regression test is too lax.

Round-3 wins (Codex verified)

  • Manager.HA() returns userspaceHAController{manager: m}; new wiring test catches a regression to generic NewDataPlaneHAController
  • ✓ Event slice pins call order fabric0 → sync → fabric1 → sync
  • ✓ Post-update ctx.Err() check is removed; cancellation no longer skips SyncFabricState
  • ✓ "Always sync after successful update for both fabrics" contract documented and acceptable as the new canonical (legacy refreshFabricFwd1 direct path remains unchanged — only the new HA abstraction has the fixed contract)

Codex r4 MINOR — Cancel-regression test doesn't pin the actual bug

TestRuntimeUserspaceHAControllerSyncsAfterSuccessfulUpdateDespiteCanceledContext:

  1. Cancels ctx before entry → expects error, asserts no events ✓
  2. Uses a fresh context.Background() for the successful call → asserts ["fabric0", "sync"]

Problem: the old buggy r3 code had ctx.Err() check AFTER UpdateFabricFwd*. With a fresh Background(), that check never trips. So the old buggy implementation would still pass this test.

Fix (Codex): cancel the same ctx FROM INSIDE the fake UpdateFabricFwd (so cancellation happens after update succeeds but before sync would be called), then assert events are ["fabric0", "sync"] and call returns nil. That pins the actual race the r3 fix was meant to close.

Recommendation

Block on: nothing critical. The substantive r3 fix is correct.

MINOR (this PR):

  • Tighten the cancel-regression test to cancel ctx from inside UpdateFabricFwd so the test would fail under the old ctx.Err()-between-update-and-sync buggy code

Codex task: task-mp9zcs47-jks3hw. Not merging — author's decision.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants