dataplane: add runtime interface split scaffolding #1399
Conversation
Pull request overview
First Phase 1 scaffolding slice for issue #1381: introduces backend-neutral dataplane interfaces (RuntimeDataPlane, ConfigSink, SessionStore, Telemetry, HA/Link controllers) alongside the legacy BPF-shaped DataPlane, widens ApplyResult metadata, adds a neutral runtime DTO package with a userspace adapter, and migrates cluster stale-bulk reconciliation to the new SessionStore.ReconcileClusterBulk companion-delete API.
Changes:
- Add backend-neutral contracts (RuntimeDataPlane, ConfigSink, SessionStore, Telemetry, LinkController, HAController) and ApplyResult with filter spans, NAT counter IDs, capabilities, and generation; eBPF/DPDK/userspace managers now publish LastApplyResult().
- Introduce pkg/dataplane/runtime package with neutral session-delta DTOs and SessionDeltaSource interface; userspace Manager adapts its private types at the boundary, with an import-canary test forbidding userspace imports in the runtime package.
- Refactor pkg/cluster/sync.go stale-bulk reconciliation to use SessionStore.ReconcileClusterBulk for forward+reverse+DNAT/DNATv6 cleanup, with tests (including an AST canary) preventing reintroduction of local DNAT cleanup.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/dataplane/apply.go | New RuntimeDataPlane/ConfigSink/ApplyResult types and eBPF Manager apply plumbing. |
| pkg/dataplane/apply_test.go | Verifies ApplyResultFromCompileResult cloning, contract size, and canceled-context behavior. |
| pkg/dataplane/session_store.go | New SessionStore interface and DataPlane-backed adapter with companion-delete and bulk reconcile. |
| pkg/dataplane/session_store_test.go | Tests companion-delete for v4/v6 sessions including DNAT cleanup. |
| pkg/dataplane/compiler.go | Compile result gains FilterSpans; records ApplyResult after compile. |
| pkg/dataplane/compiler_filter.go | Populates FilterSpans per-filter (FilterID/RuleStart/RuleCount). |
| pkg/dataplane/dataplane.go | Adds ConfigSink compile-time assertion for eBPF Manager. |
| pkg/dataplane/loader.go | Adds applyMu, applyGeneration, lastApply state to eBPF Manager. |
| pkg/dataplane/runtime/session_delta.go | Backend-neutral session delta DTOs and SessionDeltaSource interface. |
| pkg/dataplane/runtime/import_canary_test.go | Canary forbidding userspace imports from the neutral runtime package. |
| pkg/dataplane/dpdk/manager.go | DPDK backend gains ApplyConfig/LastApplyResult and ConfigSink assertion. |
| pkg/dataplane/dpdk/dpdk_stub.go | Stub Compile records ApplyResult; minor formatting. |
| pkg/dataplane/dpdk/dpdk_cgo.go | Cgo Compile records ApplyResult. |
| pkg/dataplane/userspace/manager.go | Userspace adds ApplyConfig, LastApplyResult, RuntimeSessionDeltaSource, capability/generation tracking. |
| pkg/dataplane/userspace/runtime_delta.go | Userspace adapter mapping internal session-delta/status to neutral runtime DTOs. |
| pkg/dataplane/userspace/runtime_delta_test.go | Tests userspace runtime-delta adapter mapping and interface satisfaction. |
| pkg/cluster/sync.go | Migrates stale-bulk reconciliation to SessionStore.ReconcileClusterBulk. |
| pkg/cluster/sync_test.go | Adds v4/v6 companion-delete tests and AST canary forbidding local DNAT cleanup. |
| docs/pr/1381-dataplane-interface-split/plan.md | Documents the implemented Phase 1 scaffold and remaining work. |
Claude round-1 review on
Claude round-1 self-correction on
Round-1 triple-review consolidated synthesis on
| Reviewer | Verdict |
|---|---|
| Claude | MERGE-NEEDS-MAJOR (self-corrected from READY after Codex) |
| Codex | MERGE-NEEDS-MAJOR (5 findings) |
| Gemini Pro 3 | MERGE-NEEDS-MAJOR (PolicyScheduleRuleSlots missing + import canary too narrow) |
All three reviewers converge on MERGE-NEEDS-MAJOR. Codex and Gemini agree on the same core issue: ApplyResult is incomplete for migration, and the import canary doesn't enforce its claimed invariant.
Codex + Gemini MAJOR — ApplyResult missing fields
apply.go:31 ApplyResult carries: ZoneIDs, ManagedInterfaces, FilterIDs, FilterSpans, NATCounterIDs, Capabilities, Generation.
Missing fields used by current callers:
- PoolIDs: pkg/cli/cli_show_nat.go:187 (Codex)
- PolicyNames: pkg/cli/cli_show_flow.go:232, pkg/daemon/daemon_system.go:42 (Codex)
- AppNames: runtime callers (Codex)
- PolicyScheduleRuleSlots (Plumb userspace policy scheduler inactive state #1396): maps.go:1507, dpdk/dpdk_cgo.go:356 (Codex + Gemini)
Gemini emphasizes the PolicyScheduleRuleSlots gap as fatal for the #1396 migration path. Codex enumerates 4 fields.
Codex MAJOR — RuntimeDataPlane interface defined but UNWIRED
apply.go:14 defines Start/Link/HA/Sessions/Telemetry. Only ConfigSink is asserted (var _ ConfigSink = (*Manager)(nil)). No backend Manager has compile-time assertions for the full RuntimeDataPlane contract. Contract can rot before migration even starts.
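A minimal sketch of the compile-time assertion pattern this finding asks for. The interface method sets and the Manager type below are simplified stand-ins assumed for illustration, not the PR's actual definitions.

```go
package main

import "fmt"

// ConfigSink and RuntimeDataPlane are simplified stand-ins; the real
// contracts in the PR have different (larger) method sets.
type ConfigSink interface {
	ApplyConfig(cfg string) error
}

type RuntimeDataPlane interface {
	ConfigSink
	Start() error
}

// Manager is a hypothetical backend manager.
type Manager struct{ started bool }

func (m *Manager) ApplyConfig(cfg string) error { return nil }

func (m *Manager) Start() error {
	m.started = true
	return nil
}

// Compile-time assertions: the build breaks as soon as *Manager stops
// satisfying either interface, so the contract cannot drift silently.
var (
	_ ConfigSink       = (*Manager)(nil)
	_ RuntimeDataPlane = (*Manager)(nil)
)

func main() {
	var dp RuntimeDataPlane = &Manager{}
	fmt.Println(dp.Start() == nil)
}
```

The assertions cost nothing at runtime: the blank identifier discards the value, and only the type check remains.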
Codex MAJOR — HA session sync can't migrate cleanly
pkg/dataplane/runtime defines SessionDeltaSource. Userspace exposes RuntimeSessionDeltaSource() at userspace/manager.go:183. But RuntimeDataPlane.Sessions() returns only SessionStore — no domain-level way to discover the delta source. Daemon still directly imports userspace DTOs at daemon_ha_userspace.go:17-42.
Codex + Gemini MAJOR — Import canary too narrow
pkg/dataplane/runtime/import_canary_test.go:30 only rejects /pkg/dataplane/userspace. Allows:
- /pkg/dataplane/dpdk
- root pkg/dataplane
- github.com/cilium/ebpf
Canary passes today only because the runtime package imports just time. First backend import slips through undetected.
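One way a broader canary could look, sketched with the standard go/parser package. The helper names, the matching rules, and the example.com/fw module path are assumptions for illustration; the PR's actual test may be structured differently.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// isForbidden reports whether an import path points at a backend package.
// The rules mirror the paths named in the review; the module prefix is
// deliberately not hard-coded.
func isForbidden(path string) bool {
	switch {
	case strings.HasSuffix(path, "/pkg/dataplane"): // root dataplane package
		return true
	case strings.Contains(path, "/pkg/dataplane/userspace"):
		return true
	case strings.Contains(path, "/pkg/dataplane/dpdk"):
		return true
	case path == "github.com/cilium/ebpf" || strings.HasPrefix(path, "github.com/cilium/ebpf/"):
		return true
	}
	return false
}

// findForbiddenImports parses only the import block of src and returns
// every import path that isForbidden rejects.
func findForbiddenImports(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "canary.go", src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var bad []string
	for _, spec := range file.Imports {
		path, err := strconv.Unquote(spec.Path.Value)
		if err != nil {
			return nil, err
		}
		if isForbidden(path) {
			bad = append(bad, path)
		}
	}
	return bad, nil
}

func main() {
	src := "package runtime\n\nimport (\n\t\"time\"\n\t\"example.com/fw/pkg/dataplane/dpdk\"\n)\n"
	bad, err := findForbiddenImports(src)
	fmt.Println(bad, err)
}
```

A real canary would walk every non-test file in the package directory and fail the test when the returned slice is non-empty.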
Codex MAJOR — Telemetry missing ReadFloodCounters
Used by CLI/gRPC at cli_show_security.go:707, grpcapi/server_show.go:267. Future migration will need FloodCounters in Telemetry OR a documented replacement.
Gemini verifications that PASS
- HA cluster sync.go migration via ReconcileClusterBulk: clean abstraction (Gemini)
- dpdk_stub.go build-tag stub pattern correct (Gemini)
- Plan doc accurately describes the topology (Gemini, with the noted dropped-metadata exception)
- Legacy dataplane.DataPlane preserved with no breaking signature changes (Codex + Gemini)
Recommendation
Block on (Codex + Gemini convergence):
- Extend ApplyResult with PoolIDs, PolicyNames, AppNames, PolicyScheduleRuleSlots
- Strengthen import canary to reject ALL backend paths (dpdk, root dataplane, cilium/ebpf), not just userspace
- Either add var _ RuntimeDataPlane = (*backendManager)(nil) compile-time assertions OR explicitly document the deferred assertion as a TODO
- Resolve HA session-delta migration path (extend Sessions() or add SessionDeltas() method)
- Add FloodCounters to Telemetry OR document a replacement
Alternative: clarify in plan doc that this slice is the scaffolding interfaces (deliberately under-specified) and the production migration extends ApplyResult field-by-field and wires RuntimeDataPlane to Managers in subsequent slices. This shifts the contract from "viable replacement" to "scaffolding marker."
Round-1 claim trace
- Codex task-mp9x5b25-7ebcvv: 5 MAJORs (ApplyResult fields, unwired interface, HA session-delta, canary narrow, Telemetry incomplete)
- Gemini Pro 3 task-mp9x66jy-gikrzt: MAJOR (PolicyScheduleRuleSlots) + MINOR (canary)
Not merging — author's decision.
- Extend ApplyResult with PoolIDs, PolicyNames, AppNames, PolicyScheduleRuleSlots so callers can migrate from LastCompileResult() to LastApplyResult() without losing runtime lookups (Codex MAJOR + Gemini MAJOR)
- Add ReadFloodCounters(uint16) (FloodState, error) to Telemetry interface so CLI/gRPC callers have a clear migration path (Codex MAJOR)
- Add SessionDeltas() dpruntime.SessionDeltaSource to RuntimeDataPlane, giving the daemon a domain-level way to discover the delta source and documenting the remaining daemon_ha_userspace.go migration via TODO (Codex MAJOR)
- Strengthen import canary to reject pkg/dataplane/dpdk, root pkg/dataplane, and github.com/cilium/ebpf, not just userspace, with clear per-case comments (Codex MAJOR + Gemini MAJOR)
- Add TODO on RuntimeDataPlane documenting that compile-time assertions for all three backends are deferred to the wiring slice (Codex MAJOR)
- Update ApplyResultFromCompileResult() and Clone() for the four new fields
- Expand apply_test.go to cover new fields and mutation-independence
Agent-Logs-Url: https://github.com/psaab/xpf/sessions/62e9137f-965d-4559-9cde-1efbd7f38aa4
Co-authored-by: psaab <196946+psaab@users.noreply.github.com>
Addressed all 5 MAJOR findings in commit
All tests pass:
Round-2 fix pushed at 94f010e.

Changes:
- ApplyResult now carries PoolIDs, PolicyNames, AppNames, and compiled PolicyScheduleRuleSlots with copy/clone tests.
- eBPF, DPDK, and userspace managers now satisfy RuntimeDataPlane at compile time.
- Telemetry includes ReadFloodCounters.
- Sessions expose the backend-neutral session-delta source; userspace wires its runtime adapter, generic stores return nil.
- Runtime import canary rejects root dataplane, userspace, DPDK, and cilium/ebpf imports.
- Plan doc updated to match the implemented module contract.

Validation: go test ./pkg/dataplane/... ./pkg/cluster ./pkg/api ./pkg/grpcapi ./pkg/cli
Claude round-2 review on
Claude round-2 self-correction on
Round-3 fix pushed at ddc3af9.

Changes:
- Userspace RuntimeDataPlane.HA() now returns a userspace-specific HA controller. SetFabricForwarding updates the fabric map and then calls SyncFabricState(), so callers migrating to the runtime HA interface still publish fresh fabric state to the helper.
- Added userspace HA controller tests for fabric0/fabric1 sync and update-error no-sync behavior.
- ApplyResult clone/copy now deep-copies ManagedInterfaces nested Addresses, with regression coverage for source mutation and Clone mutation.
- Plan doc corrected from FloodCounters to ReadFloodCounters and updated the adapter note to call out the userspace-specific HA controller.
- Removed the stale compile-time-assertion TODO from the RuntimeDataPlane comment.

Validation: go test ./pkg/dataplane/... ./pkg/cluster ./pkg/api ./pkg/grpcapi ./pkg/cli
Claude round-3 review on
Round-3 dual-review consolidated synthesis on
| Reviewer | Verdict |
|---|---|
| Claude | MERGE-READY |
| Codex | MERGE-NEEDS-MINOR (test wiring gap + fabric1 legacy discrepancy + ctx-cancel doc) |
Net: MERGE-NEEDS-MINOR. Substantive r2 MAJOR (fabric SyncFabricState) addressed. Four MINOR follow-ups.
Codex confirmed wins
- ✓ Manager.HA() returns userspaceHAController at manager.go:201
- ✓ SetFabricForwarding calls UpdateFabricFwd/UpdateFabricFwd1 then SyncFabricState at manager.go:274-293. For fabric0 success this matches legacy daemon_ha_fabric.go:530 + :548
- ✓ Cancellation propagated before and after the update
- ✓ ManagedInterfaces.Addresses deep-clone verified at apply.go:145-149; test mutates both source and clone at apply_test.go:78 + 112
- ✓ Plan doc rename complete (ReadFloodCounters); remaining FloodCounters hits are generated BPF object/map names
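The deep-clone point above can be illustrated with a small sketch. The struct layout here is an assumption (the real ApplyResult carries many more fields); it only shows why nested slices and maps need their own copies for mutation independence.

```go
package main

import "fmt"

// ManagedInterface and ApplyResult are reduced, hypothetical shapes.
type ManagedInterface struct {
	Name      string
	Addresses []string
}

type ApplyResult struct {
	PoolIDs           map[string]uint8
	ManagedInterfaces []ManagedInterface
	Generation        uint64
}

// Clone returns a copy that shares no mutable memory with the receiver.
func (r *ApplyResult) Clone() *ApplyResult {
	if r == nil {
		return nil
	}
	out := &ApplyResult{Generation: r.Generation}
	if r.PoolIDs != nil {
		out.PoolIDs = make(map[string]uint8, len(r.PoolIDs))
		for k, v := range r.PoolIDs {
			out.PoolIDs[k] = v
		}
	}
	if r.ManagedInterfaces != nil {
		out.ManagedInterfaces = make([]ManagedInterface, len(r.ManagedInterfaces))
		for i, mi := range r.ManagedInterfaces {
			// Copying the struct alone would still alias Addresses.
			mi.Addresses = append([]string(nil), mi.Addresses...)
			out.ManagedInterfaces[i] = mi
		}
	}
	return out
}

func main() {
	src := &ApplyResult{
		PoolIDs:           map[string]uint8{"pool-a": 1},
		ManagedInterfaces: []ManagedInterface{{Name: "ge-0/0/0", Addresses: []string{"10.0.0.1/24"}}},
	}
	cp := src.Clone()
	cp.ManagedInterfaces[0].Addresses[0] = "mutated"
	cp.PoolIDs["pool-a"] = 9
	fmt.Println(src.ManagedInterfaces[0].Addresses[0], src.PoolIDs["pool-a"])
}
```

A regression test in this style mutates both directions (source after clone, clone after clone) and checks that the other side is unchanged.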
Codex MINOR 1 — Test wires controller directly, not Manager.HA()
runtime_delta_test.go:136 tests userspaceHAController directly. The round-2 regression mode (Manager.HA() accidentally returning dataplane.NewDataPlaneHAController(m)) would NOT be caught by these tests.
Fix: add New().HA() / Manager.HA() wiring assertion (e.g., type assert the returned controller is userspaceHAController).
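A sketch of the wiring assertion described in this fix, using a Go type assertion on the returned controller. All types below are simplified stand-ins assumed for illustration; only the names userspaceHAController and Manager.HA() come from the review.

```go
package main

import "fmt"

// HAController stands in for the runtime HA interface (method set assumed).
type HAController interface {
	SetFabricForwarding(slot int) error
}

// genericHAController models the backend-agnostic adapter.
type genericHAController struct{}

func (genericHAController) SetFabricForwarding(int) error { return nil }

// userspaceHAController models the userspace-specific controller.
type userspaceHAController struct{ manager *Manager }

func (userspaceHAController) SetFabricForwarding(int) error { return nil }

type Manager struct{}

// HA returns the userspace-specific controller bound to this manager.
func (m *Manager) HA() HAController { return userspaceHAController{manager: m} }

// isUserspaceHA reports whether c is the userspace controller wired to m.
func isUserspaceHA(c HAController, m *Manager) bool {
	uc, ok := c.(userspaceHAController)
	return ok && uc.manager == m
}

func main() {
	m := &Manager{}
	fmt.Println(isUserspaceHA(m.HA(), m))                // true
	fmt.Println(isUserspaceHA(genericHAController{}, m)) // false
}
```

Testing through Manager.HA() rather than constructing the controller directly is what makes the check sensitive to a regression in the wiring itself.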
Codex MINOR 2 — Test doesn't pin call order
runtime_delta_test.go:143 only counts update/sync calls. A bad implementation that calls SyncFabricState BEFORE UpdateFabricFwd* could still pass.
Fix: record an in-order call log: ["update0", "sync", "update1", "sync"].
Codex MINOR 3 — fabric1 legacy discrepancy
Codex found:
"For fabric0 success this matches legacy daemon_ha_fabric.go:530 then :548. It is not exact for fabric1: legacy refreshFabricFwd1 updates at :715 and returns without sync."
The NEW code always calls SyncFabricState() for both fabric0 AND fabric1. Legacy only syncs on fabric0. Discrepancy: is the new code right (always sync), or is legacy right (fabric1 doesn't need explicit sync)?
Author should:
- Investigate the legacy fabric1 path — does fabric1 update implicitly trigger sync somewhere else?
- Confirm or document the new "always sync" semantics
- If legacy was actually buggy and fabric1 never got sync, the new code fixes that bug
Codex MINOR 4 — ctx-cancel committed-but-not-synced window
manager.go:290 propagates cancellation after the map update but before SyncFabricState. This means a committed UpdateFabricFwd* can skip the helper sync if the context is canceled in that narrow window.
If the invariant is "successful fabric map write is always mirrored to userspace," this should be tightened (drop the cancel check between update and sync) or documented as intentional.
Recommendation
Block on: nothing critical (substantive r2 MAJOR closed).
MINOR follow-ups (this PR):
- Wiring test for Manager.HA() returning userspaceHAController
- In-order call log assertion update0 → sync → update1 → sync
- Verify fabric1 legacy: should SyncFabricState fire after UpdateFabricFwd1?
- Tighten or document the committed-but-not-synced cancel window
Codex task: task-mp9yzipg-3r032i. Not merging — author's decision.
Round-4 fix pushed at 1c81d20.

Changes:
- Added a Manager.HA() regression test that fails if userspace ever returns the generic HA adapter again.
- Pinned fabric update ordering as update -> SyncFabricState for both fabric0 and fabric1.
- Removed the post-update context cancellation escape hatch: after the fabric map update commits, helper sync is mandatory.
- Left a code comment documenting that syncing both fabric slots is intentional for the runtime HA contract: a successful fabric forwarding update means the userspace helper has a fresh fabric view.

Validation: go test ./pkg/dataplane/... ./pkg/cluster ./pkg/api ./pkg/grpcapi ./pkg/cli
Claude round-4 review on
    for _, key := range staleV4 {
        if err := s.DeleteWithCompanionsV4(key, reason); err != nil {
            errs = append(errs, err)
        }
        result.DeletedV4++
    }

    var staleV6 []SessionKeyV6
    if err := s.ForEachV6(func(key SessionKeyV6, val SessionValueV6) bool {
        if val.IsReverse != 0 {
            return true
        }
        if input.ShouldSyncZone(val.IngressZone) {
            return true
        }
        if _, ok := input.ReceivedV6[key]; !ok {
            staleV6 = append(staleV6, key)
        }
        return true
    }); err != nil {
        return result, errors.Join(append(errs, err)...)
    }
    result.StaleV6 = len(staleV6)

    for _, key := range staleV6 {
        if err := s.DeleteWithCompanionsV6(key, reason); err != nil {
            errs = append(errs, err)
        }
        result.DeletedV6++
    }

    func (m *Manager) recordApplyResultLocked(result *dataplane.ApplyResult, caps UserspaceCapabilities, generation uint64) {
        if result == nil {
            return
        }
        result.Capabilities = dataplane.Capabilities{
            ForwardingSupported: caps.ForwardingSupported,
            UnsupportedReasons:  append([]string(nil), caps.UnsupportedReasons...),
        }
        result.Generation = generation
        m.lastApply = result.Clone()
    }

    func (s runtimeSessionDeltaSource) DrainSessionDeltas(max uint32) (dpruntime.SessionDeltaSnapshot, error) {
        deltas, status, err := s.manager.DrainSessionDeltas(max)
        if err != nil {
            return runtimeSessionDeltaSnapshot(deltas, status, max), err
        }
        return runtimeSessionDeltaSnapshot(deltas, status, max), nil
    }

    func (s runtimeSessionDeltaSource) ExportOwnerRGSessions(rgIDs []int, max uint32) (dpruntime.SessionDeltaSnapshot, error) {
        deltas, status, err := s.manager.ExportOwnerRGSessions(rgIDs, max)
        if err != nil {
            return runtimeSessionDeltaSnapshot(deltas, status, max), err
        }
        return runtimeSessionDeltaSnapshot(deltas, status, max), nil
    }

    next := result.Clone()
    next.Generation = m.applyGeneration
    m.lastApply = next
    return next.Clone()

Round-4 dual-review consolidated synthesis on
| Reviewer | Verdict |
|---|---|
| Claude | MERGE-READY |
| Codex | MERGE-NEEDS-MINOR (cancel test doesn't actually pin the r3 bug) |
Net: MERGE-NEEDS-MINOR. Substantive r3 fixes verified. Cancel-regression test is too lax.
Round-3 wins (Codex verified)
- ✓ Manager.HA() returns userspaceHAController{manager: m}; new wiring test catches a regression to generic NewDataPlaneHAController
- ✓ Event slice pins call order fabric0 → sync → fabric1 → sync
- ✓ Post-update ctx.Err() check is removed; cancellation no longer skips SyncFabricState
- ✓ "Always sync after successful update for both fabrics" contract documented and acceptable as the new canonical (legacy refreshFabricFwd1 direct path remains unchanged; only the new HA abstraction has the fixed contract)
Codex r4 MINOR — Cancel-regression test doesn't pin the actual bug
TestRuntimeUserspaceHAControllerSyncsAfterSuccessfulUpdateDespiteCanceledContext:
- Cancels ctx before entry → expects error, asserts no events ✓
- Uses a fresh context.Background() for the successful call → asserts ["fabric0", "sync"]
Problem: the old buggy r3 code had ctx.Err() check AFTER UpdateFabricFwd*. With a fresh Background(), that check never trips. So the old buggy implementation would still pass this test.
Fix (Codex): cancel the same ctx FROM INSIDE the fake UpdateFabricFwd (so cancellation happens after update succeeds but before sync would be called), then assert events are ["fabric0", "sync"] and call returns nil. That pins the actual race the r3 fix was meant to close.
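A sketch of the tightened test idea: the fake cancels the caller's context from inside the update call, so the cancellation lands exactly between update and sync. The fake, both controller variants, and the run helper are hypothetical illustrations, not the PR's code.

```go
package main

import (
	"context"
	"fmt"
)

// fakeFabric records events and runs onUpdate inside the update call.
type fakeFabric struct {
	events   []string
	onUpdate func()
}

func (f *fakeFabric) UpdateFabricFwd() error {
	f.events = append(f.events, "fabric0")
	if f.onUpdate != nil {
		f.onUpdate() // simulate cancellation arriving right after the commit
	}
	return nil
}

func (f *fakeFabric) SyncFabricState() error {
	f.events = append(f.events, "sync")
	return nil
}

// setFabricFixed checks ctx only before the update; sync is then mandatory.
func setFabricFixed(ctx context.Context, f *fakeFabric) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	if err := f.UpdateFabricFwd(); err != nil {
		return err
	}
	return f.SyncFabricState()
}

// setFabricBuggy models the old behavior: a second ctx check can skip sync.
func setFabricBuggy(ctx context.Context, f *fakeFabric) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	if err := f.UpdateFabricFwd(); err != nil {
		return err
	}
	if err := ctx.Err(); err != nil {
		return err
	}
	return f.SyncFabricState()
}

// run executes impl with a context that the fake cancels mid-call.
func run(impl func(context.Context, *fakeFabric) error) ([]string, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	f := &fakeFabric{onUpdate: cancel}
	err := impl(ctx, f)
	return f.events, err
}

func main() {
	events, err := run(setFabricFixed)
	fmt.Println(events, err) // [fabric0 sync] <nil>
	events, err = run(setFabricBuggy)
	fmt.Println(events, err) // [fabric0] context canceled
}
```

With a fresh context.Background() both variants behave identically, which is why only the in-call cancellation distinguishes the fixed code from the old one.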
Recommendation
Block on: nothing critical. The substantive r3 fix is correct.
MINOR (this PR):
- Tighten the cancel-regression test to cancel ctx from inside UpdateFabricFwd so the test would fail under the old ctx.Err()-between-update-and-sync buggy code
Codex task: task-mp9zcs47-jks3hw. Not merging — author's decision.
Summary
Refs #1381.
This adds the first implementation slice for the dataplane interface split without removing the legacy BPF-shaped dataplane.DataPlane yet.
- RuntimeDataPlane, ConfigSink, ApplyResult, SessionStore, telemetry, HA, and link controller contracts
- pkg/dataplane/runtime DTOs and a userspace session-delta adapter that avoids importing backend-private userspace types into the neutral runtime package
- dataplane.SessionStore.ReconcileClusterBulk, including reverse-session and DNAT/DNATv6 companion cleanup tests
Validation
- go test ./pkg/dataplane/... ./pkg/cluster -count=1
- go test ./pkg/...
- git diff --check
A DPDK tagged test build was also probed during implementation. It still fails on existing tagged-DPDK issues (-mrtm pkg-config allowlisting, then missing BatchDeleteSessions / C pointer indexing), so this PR does not claim DPDK-tag validation.
Remaining Scope
- dataplane.DataPlane is still BPF-shaped
- LastCompileResult() and BPF map reads to LastApplyResult() and domain interfaces
- SessionStore/Telemetry; this slice moves the cluster stale-reconcile companion-delete path only