policy: preserve scheduler rule counters by psaab · Pull Request #1407 · psaab/xpf

psaab · 2026-05-17T18:47:08Z

Summary

Refs #1378.

Narrows the remaining userspace policy-scheduler gap after #1396:

makes missing policy scheduler references commit-time errors for zone-pair and global policies
preserves policy rule hit counters across scheduler-driven snapshot rebuilds by keying counters by stable rule ID
adds Rust coverage showing hit counters survive active/inactive scheduler toggles
updates the #1378 plan ("Feature gap: time-based policy schedulers (Junos schedulers { ... }) not propagated to userspace-dp") and the feature-gap docs to leave only integration/HA validation as the remaining scheduler retirement evidence

Validation

go test ./pkg/config ./pkg/dataplane/userspace
cargo test policy:: -- --nocapture
git diff --check

psaab · 2026-05-17T18:49:59Z

@copilot review

Copilot

Pull request overview

Tightens the userspace policy-scheduler retirement work from #1396: missing scheduler references now reject commit instead of warning, and Rust policy hit counters survive snapshot rebuilds across scheduler active/inactive flips by being keyed by stable rule id in a process-wide registry.

Changes:

Promote missing zone-pair and global policy scheduler references from ValidateConfig warnings to a strict commit-time error in compileExpanded, with scope-qualified error messages and updated tests.
Move PolicyRule.hit_count to an Arc<AtomicU64> looked up from a global, rule-id-keyed registry with size-threshold-driven pruning, so rebuilds reuse counters; added a Rust test covering active→inactive→active toggling.
Update #1378 plan and docs/userspace-dataplane-gaps.md to record the 2026-05-17 closeout and narrow the remaining blocker to integration/HA validation.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

File	Description
pkg/config/compiler.go	Wires the strict scheduler validator into `compileExpanded` and adds a `scope` prefix for global policies.
pkg/config/parser_ast_test.go	Renames and rewrites the two scheduler-reference tests to assert commit-time errors instead of warnings.
userspace-dp/src/policy.rs	Introduces a global `OnceLock<Mutex<FxHashMap>>` registry of `Arc<AtomicU64>` hit counters keyed by rule_id, with prune threshold; `parse_policy_state` now reuses counters across rebuilds.
userspace-dp/src/policy_tests.rs	Adds `scheduled_allow_snapshot` helper and `hit_counters_survive_scheduler_snapshot_rebuild` covering active/inactive/active.
docs/pr/1373-retire-ebpf-dataplane/plan-1378-policy-schedulers.md	Documents the closeout slice and validation commands.
docs/userspace-dataplane-gaps.md	Narrows remaining #1378 scope to integration/HA failover evidence.

Comments suppressed due to low confidence (2)

userspace-dp/src/policy.rs:131

prune_policy_counter_registry is invoked from parse_policy_state while the previous PolicyState is typically still alive in the caller (it's only dropped after the new snapshot replaces it). At prune time, every entry from the old snapshot therefore has Arc::strong_count > 1 and is retained, even for rules removed by the new snapshot. In practice the registry only shrinks one rebuild cycle after a rule disappears, so the threshold is more lenient than it looks and a rapid churn of rule_ids can still push the map well past POLICY_COUNTER_REGISTRY_PRUNE_THRESHOLD for an extended period. Consider basing retention purely on active_rule_ids membership (since not-present rules are never re-looked-up by policy_rule_hit_counter again and the lingering Arc just delays GC).

fn prune_policy_counter_registry(active_rule_ids: &FxHashSet<String>) {
    if let Ok(mut counters) = policy_counter_registry().lock() {
        if counters.len() <= POLICY_COUNTER_REGISTRY_PRUNE_THRESHOLD {
            return;
        }
        counters.retain(|rule_id, counter| {
            active_rule_ids.contains(rule_id) || Arc::strong_count(counter) > 1
        });
    }
}
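
The retention concern above can be sketched in isolation. This is a minimal illustration, not the PR's code: it uses std `HashMap` in place of `FxHashMap`, omits the mutex and the size threshold, and `registry_with` is a hypothetical helper. It contrasts the quoted `strong_count`-based retention with retention based purely on membership in the new snapshot's rule set.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::atomic::AtomicU64;
use std::sync::Arc;

/// Stand-in for the registry map (std HashMap instead of FxHashMap so the
/// sketch has no external dependencies).
type Registry = HashMap<String, Arc<AtomicU64>>;

/// Retention rule quoted above: an entry survives if it is active OR if
/// anything else (for example the previous snapshot) still holds its Arc.
fn prune_with_strong_count(counters: &mut Registry, active: &HashSet<String>) {
    counters.retain(|rule_id, counter| {
        active.contains(rule_id) || Arc::strong_count(counter) > 1
    });
}

/// Suggested alternative: retention depends only on membership in the new
/// snapshot's rule set, regardless of who else holds the Arc.
fn prune_by_membership(counters: &mut Registry, active: &HashSet<String>) {
    counters.retain(|rule_id, _| active.contains(rule_id));
}

/// Hypothetical helper: build a registry with one zeroed counter per rule ID.
fn registry_with(rule_ids: &[&str]) -> Registry {
    rule_ids
        .iter()
        .map(|id| (id.to_string(), Arc::new(AtomicU64::new(0))))
        .collect()
}
```

With an old snapshot still holding a clone of a removed rule's `Arc`, the first function keeps the entry and the second drops it; the old snapshot's clone stays valid until that snapshot is dropped.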

userspace-dp/src/policy.rs:144

Inconsistent lock-poisoning handling between the two registry accessors: prune_policy_counter_registry silently no-ops on a poisoned mutex (if let Ok(mut counters) = ...lock()), while policy_rule_hit_counter panics on the same condition (.expect("policy counter registry poisoned")). Either both should propagate (panic) so a poisoned registry is surfaced immediately, or both should recover so a panic on one snapshot build does not bring down later ones. The current asymmetry means a poisoning incident produces hard panics on hit-counter lookup but quietly skipped pruning.

fn prune_policy_counter_registry(active_rule_ids: &FxHashSet<String>) {
    if let Ok(mut counters) = policy_counter_registry().lock() {
        if counters.len() <= POLICY_COUNTER_REGISTRY_PRUNE_THRESHOLD {
            return;
        }
        counters.retain(|rule_id, counter| {
            active_rule_ids.contains(rule_id) || Arc::strong_count(counter) > 1
        });
    }
}

fn policy_rule_hit_counter(rule_id: &str) -> Arc<AtomicU64> {
    let mut counters = policy_counter_registry()
        .lock()
        .expect("policy counter registry poisoned");
    if let Some(counter) = counters.get(rule_id) {
        return counter.clone();
    }

    let counter = Arc::new(AtomicU64::new(0));
    counters.insert(rule_id.to_string(), counter.clone());
    counter
}
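
One way to remove the asymmetry is to route both accessors through a single locking helper. The sketch below picks the "recover" option; `CounterRegistry`, `lock`, `prune`, and `len` are hypothetical names, std `HashMap` stands in for `FxHashMap`, and the size threshold is omitted. Choosing "propagate" instead would mean replacing the helper body with `.expect(...)`.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, Mutex, MutexGuard, PoisonError};

type CounterMap = HashMap<String, Arc<AtomicU64>>;

#[derive(Default)]
struct CounterRegistry {
    counters: Mutex<CounterMap>,
}

impl CounterRegistry {
    /// Single lock path used by every accessor, so a poisoned mutex is
    /// handled the same way everywhere. Here the guard is recovered from the
    /// PoisonError instead of panicking or silently skipping work.
    fn lock(&self) -> MutexGuard<'_, CounterMap> {
        self.counters.lock().unwrap_or_else(PoisonError::into_inner)
    }

    /// Returns the existing counter for the rule or inserts a zeroed one.
    fn hit_counter(&self, rule_id: &str) -> Arc<AtomicU64> {
        self.lock().entry(rule_id.to_string()).or_default().clone()
    }

    /// Drops every counter whose rule ID is not in the active set.
    fn prune(&self, active: &HashSet<String>) {
        self.lock().retain(|rule_id, _| active.contains(rule_id));
    }

    fn len(&self) -> usize {
        self.lock().len()
    }
}
```

Recovering is only reasonable if the map cannot be left half-updated by a panic; for a map of independent counters that seems plausible, but it is an assumption about the surrounding code.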

+	if err := validatePolicySchedulerReferencesStrict(cfg); err != nil {
+		return nil, err
+	}


+static POLICY_COUNTERS: OnceLock<Mutex<PolicyCounterRegistry>> = OnceLock::new();
+
+fn policy_counter_registry() -> &'static Mutex<PolicyCounterRegistry> {
+    POLICY_COUNTERS.get_or_init(|| Mutex::new(FxHashMap::default()))
+}
+
+fn prune_policy_counter_registry(active_rule_ids: &FxHashSet<String>) {
+    if let Ok(mut counters) = policy_counter_registry().lock() {
+        if counters.len() <= POLICY_COUNTER_REGISTRY_PRUNE_THRESHOLD {
+            return;
+        }
+        counters.retain(|rule_id, counter| {
+            active_rule_ids.contains(rule_id) || Arc::strong_count(counter) > 1
+        });
+    }
+}
+
+fn policy_rule_hit_counter(rule_id: &str) -> Arc<AtomicU64> {
+    let mut counters = policy_counter_registry()
+        .lock()
+        .expect("policy counter registry poisoned");
+    if let Some(counter) = counters.get(rule_id) {
+        return counter.clone();
+    }
+
+    let counter = Arc::new(AtomicU64::new(0));
+    counters.insert(rule_id.to_string(), counter.clone());
+    counter
+}


Copilot · 2026-05-17T18:52:06Z

@copilot review

I completed an adversarial pass over commit ed069a5 (compiler scheduler-reference checks + userspace policy counter persistence) and re-ran targeted validation (go test ./pkg/config ./pkg/dataplane/userspace). I did not find additional correctness/performance issues requiring follow-up changes in this PR.

psaab · 2026-05-17T19:50:37Z

Round-1 Codex review on `ed069a50`

Codex verdict: MERGE-NEEDS-MAJOR

MAJOR 1 — Counter preservation is not end-to-end

Rust now increments an internal Arc<AtomicU64> at userspace-dp/src/policy.rs:390, :396. But:

Existing counter API is ReadPolicyCounters(policyID uint32) CounterValue (pkg/dataplane/dataplane.go:241)
userspace.Manager still embeds/delegates to inner eBPF manager (pkg/dataplane/userspace/manager.go:62, :155)
CLI/API paths still read numeric BPF counters (pkg/cli/cli_show_security.go:48)
No userspace override, no control-socket export, no clear path
Rust counter is packets-only; CounterValue is packets+bytes

This is a private helper counter, not preserved operator-visible policy counters. Operator-visible counters still go to the eBPF path, which on userspace dataplanes won't reflect actual policy hits.

MAJOR 2 — Counter lifetime wrong for delete/re-add

prune_policy_counter_registry returns without pruning until registry exceeds 16_384 entries (userspace-dp/src/policy.rs:122). policy_rule_hit_counter blindly reuses any stored ID (:133).

Worked failure:

Policy P with stable RuleID=42 → counter created, hits=100
Operator removes P from config → snapshot rebuilds without P
Counter registry NOT pruned (stays under 16K threshold)
Operator re-adds P with the same stable RuleID=42 (idempotent generation)
policy_rule_hit_counter(42) returns the same counter → stale hits resurrect

Contradicts plan invariant at plan-1378-policy-schedulers.md:75: "counters survive while the identity remains present" (key word: REMAINS — not "and resumes after re-add").
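
The worked failure can be replayed with a small stand-in for the threshold-gated registry. This is a simplified sketch: `ThresholdRegistry` and `replay_delete_and_readd` are hypothetical names, the mutex and the `strong_count` check are left out, and only the 16_384 cap is taken from the review.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Mirrors the 16_384 cap cited in the review.
const PRUNE_THRESHOLD: usize = 16_384;

#[derive(Default)]
struct ThresholdRegistry {
    counters: HashMap<String, Arc<AtomicU64>>,
}

impl ThresholdRegistry {
    /// Reuses whatever is stored under the ID, or inserts a zeroed counter.
    fn hit_counter(&mut self, rule_id: &str) -> Arc<AtomicU64> {
        self.counters.entry(rule_id.to_string()).or_default().clone()
    }

    /// Prunes only once the map has grown past the cap.
    fn prune(&mut self, active: &HashSet<String>) {
        if self.counters.len() <= PRUNE_THRESHOLD {
            return;
        }
        self.counters.retain(|id, _| active.contains(id));
    }
}

/// Replays the steps above and returns the hit count that the re-added rule
/// starts with.
fn replay_delete_and_readd() -> u64 {
    let mut registry = ThresholdRegistry::default();
    // Policy P (stable ID 42) gets a counter and accumulates 100 hits.
    registry.hit_counter("42").fetch_add(100, Ordering::Relaxed);
    // P is removed; the rebuild prunes with an empty active set, but the
    // registry is far below the cap, so nothing is dropped.
    registry.prune(&HashSet::new());
    // P is re-added with the same stable ID and inherits the old counter.
    registry.hit_counter("42").load(Ordering::Relaxed)
}
```

Under these assumptions the replay returns 100 rather than 0, which is the resurrection the review describes.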

MINOR — Tests/docs overstate coverage

Rust counter test calls parse_policy_state directly and reads a private field; it doesn't exercise the full path: Go snapshot JSON → Rust deserialize → helper runtime → public counter read
Plan lists policy::snapshot_publish_applies_inactive_bits_atomically at plan-1378-policy-schedulers.md:111 but that test doesn't exist

Wins (Codex confirms)

✓ Strict missing-scheduler rejection wired into CompileConfig
✓ Snapshot fields + Rust serde defaults preserve old snapshots
✓ No new per-packet locks/allocations (new mutex/string work is on snapshot rebuild, not packet eval)
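
The split between rebuild-time and per-packet work noted in the last point can be sketched as follows. The names `CounterStore`, `Rule`, `build_snapshot`, and `record_hit` are hypothetical, and std `HashMap` stands in for `FxHashMap`; the point is only that the mutex and string allocation happen once per rule per rebuild, while packet evaluation touches just an atomic.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct CounterStore {
    counters: Mutex<HashMap<String, Arc<AtomicU64>>>,
}

impl CounterStore {
    /// Rebuild-time lookup: takes the mutex and allocates the key string.
    fn counter_for(&self, rule_id: &str) -> Arc<AtomicU64> {
        let mut counters = self.counters.lock().expect("counter store poisoned");
        counters.entry(rule_id.to_string()).or_default().clone()
    }
}

struct Rule {
    id: String,
    hit_count: Arc<AtomicU64>,
}

/// Snapshot rebuild: resolve every rule's counter once, up front.
fn build_snapshot(store: &CounterStore, rule_ids: &[&str]) -> Vec<Rule> {
    rule_ids
        .iter()
        .map(|&id| Rule {
            id: id.to_string(),
            hit_count: store.counter_for(id),
        })
        .collect()
}

/// Packet path: one relaxed atomic add, no mutex and no allocation.
fn record_hit(rule: &Rule) {
    rule.hit_count.fetch_add(1, Ordering::Relaxed);
}
```

Because both snapshots resolve the same `Arc`, hits recorded against the old snapshot remain visible through the new one.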

Recommendation

Block on:

Expose the userspace policy counter through the existing ReadPolicyCounters API (or add a userspace-specific override that the CLI/gRPC layer consults first)
Prune counter registry when the corresponding rule is removed from the latest snapshot (not just at 16K cap)
Convert packets-only counter to packets+bytes to match CounterValue schema
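
For the third item, a packets+bytes counter could look roughly like the sketch below. `PolicyRuleCounter` and `reset` are names that appear in later diff hunks in this thread; `record`, `snapshot`, and the Rust-side `CounterValue` struct are hypothetical and only illustrate the shape.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Plain-value view matching a packets+bytes schema such as CounterValue.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
struct CounterValue {
    packets: u64,
    bytes: u64,
}

#[derive(Default)]
struct PolicyRuleCounter {
    packets: AtomicU64,
    bytes: AtomicU64,
}

impl PolicyRuleCounter {
    /// Counts one packet and, when the length is known, its bytes.
    fn record(&self, packet_len: u64) {
        self.packets.fetch_add(1, Ordering::Relaxed);
        if packet_len != 0 {
            self.bytes.fetch_add(packet_len, Ordering::Relaxed);
        }
    }

    /// Zeroes both fields in place, so existing Arc holders see the reset.
    fn reset(&self) {
        self.packets.store(0, Ordering::Relaxed);
        self.bytes.store(0, Ordering::Relaxed);
    }

    /// The two loads are independent, so a concurrent record() may be
    /// visible in one field but not yet in the other.
    fn snapshot(&self) -> CounterValue {
        CounterValue {
            packets: self.packets.load(Ordering::Relaxed),
            bytes: self.bytes.load(Ordering::Relaxed),
        }
    }
}
```

Two separate atomics keep the packet path lock-free at the cost of a reader occasionally seeing packets and bytes from slightly different instants, which is usually acceptable for hit counters.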

Codex task: task-mpa6l165-b0mc0i. Not merging — author's decision.

psaab · 2026-05-17T20:13:49Z

@copilot review

Copilot · 2026-05-17T20:17:54Z

Round-1 Codex review on ed069a50

Codex verdict: MERGE-NEEDS-MAJOR...

Addressed. 2b54bda wires userspace policy counters into ReadPolicyCounters (packets+bytes) and moves counter lifetime to PolicyCounterStore with reconcile_rules so delete/re-add does not resurrect stale values; tests cover both continuity and reset-on-removal behavior. I also fixed the plan’s stale test reference in 6303db8 (and path cleanup in 8ac9e4b).

@copilot review

Completed another adversarial pass on current HEAD (8ac9e4b, including 2b54bda) and validated the userspace counter/export lifecycle path. Targeted Go suites pass; Rust cargo test ... policy:: remains blocked in this runner by missing libelf headers/pkg-config.

psaab · 2026-05-17T20:30:16Z

Round-2 Codex review on `8ac9e4b8`

Codex verdict: MERGE-NEEDS-MINOR (substantive r1 MAJORs fully addressed)

r1 MAJOR resolution (Codex verified)

✓ Counter lifetime is now per-coordinator via PolicyCounterStore (not the earlier process-global registry); the Coordinator owns the store across snapshot rebuilds
✓ Counters track packets AND bytes
✓ Attached to stable rule IDs
✓ Exposed via Rust ProcessStatus.PolicyRuleCounters
✓ Go consumes through pkg/dataplane/userspace/policycounters.go with sparse-policy-ID mapping + clear support
✓ Delete/re-add semantics explicit: absent rules pruned, re-added rule starts from zero (test verifies)
✓ Stale plan-doc test reference corrected

MINOR — Production reporting surfaces lack regression tests

Codex: changed user-facing paths still untested:

pkg/api/handlers.go:218
pkg/grpcapi/server_show_zones.go:65
pkg/api/metrics.go:1498

These are where sparse policy IDs and helper-backed counters can regress silently.

Suggested test: seed userspace helper PolicyRuleCounters → assert reported policy counters match production IDs via API/gRPC/Prometheus path, including a global policy.

MINOR — Plan doc carryover

docs/pr/1373-retire-ebpf-dataplane/plan-1378-policy-schedulers.md:119 still records outstanding end-to-end scheduler/HA validation. Fine if this PR is the counter-preservation slice only, not evidence of complete production validation.

Codex task: task-mpa7x5sd-i1zs4c. Not merging — author's decision.

psaab · 2026-05-17T21:15:00Z

@copilot review

Copilot

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.

+impl PolicyCounterStore {
+    pub(crate) fn reconcile_rules(&self, rules: &[PolicyRuleSnapshot]) {
+        let active_rule_ids: FxHashSet<String> = rules.iter().map(stable_policy_rule_id).collect();
+        if let Ok(mut counters) = self.counters.lock() {
+            counters.retain(|rule_id, _| active_rule_ids.contains(rule_id));
+        }
+    }
+
+    pub(crate) fn clear(&self) {
+        if let Ok(counters) = self.counters.lock() {
+            for counter in counters.values() {
+                counter.reset();
+            }
+        }
+    }
+
+    fn rule_hit_counter(&self, rule_id: &str) -> Arc<PolicyRuleCounter> {
+        let mut counters = self.counters.lock().expect("policy counter store poisoned");
+        if let Some(counter) = counters.get(rule_id) {
+            return counter.clone();
+        }
+
+        let counter = Arc::new(PolicyRuleCounter::default());
+        counters.insert(rule_id.to_string(), counter.clone());
+        counter
+    }
+}


 			if s.dp != nil && s.dp.IsLoaded() {
+				policyID := policySetID*dataplane.MaxRulesPerPolicy + uint32(len(pi.Rules))
 				if ctrs, err := s.dp.ReadPolicyCounters(policyID); err == nil {
 					pr.HitPackets = ctrs.Packets
 					pr.HitBytes = ctrs.Bytes
 				}
 			}


-			policyID++
 		}
+		policySetID++
 	}


+func policyRuleIDForCounter(cfg *config.Config, policyID uint32) string {
+	if cfg == nil {
+		return ""
+	}
+	policySetID := policyID / dataplane.MaxRulesPerPolicy
+	ruleIndex := policyID % dataplane.MaxRulesPerPolicy
+
+	var currentSet uint32
+	for _, zpp := range cfg.Security.Policies {
+		if zpp == nil {
+			continue
+		}
+		if currentSet == policySetID {
+			if int(ruleIndex) >= len(zpp.Policies) || zpp.Policies[ruleIndex] == nil {
+				return ""
+			}
+			return stablePolicyRuleID(zpp.FromZone, zpp.ToZone, zpp.Policies[ruleIndex].Name)
+		}
+		currentSet++
+	}
+	if currentSet == policySetID {
+		if int(ruleIndex) >= len(cfg.Security.GlobalPolicies) || cfg.Security.GlobalPolicies[ruleIndex] == nil {
+			return ""
+		}
+		return stablePolicyRuleID("junos-global", "junos-global", cfg.Security.GlobalPolicies[ruleIndex].Name)
+	}
+	return ""
+}
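
The ID arithmetic in the hunks above (set ID times `MaxRulesPerPolicy` plus rule index, then divide and modulo to decode) is shown here as a language-neutral sketch written in Rust. The constant's value is an assumption for illustration only, since the real `dataplane.MaxRulesPerPolicy` is not shown in this thread, and both function names are hypothetical.

```rust
/// Assumed value for illustration only; the real
/// dataplane.MaxRulesPerPolicy is not shown in this thread.
const MAX_RULES_PER_POLICY: u32 = 256;

/// Encodes (policy set, rule index) into a counter ID. Returns None when the
/// rule index does not fit in one set's slot range or the ID would overflow.
fn policy_counter_id(policy_set_id: u32, rule_index: u32) -> Option<u32> {
    if rule_index >= MAX_RULES_PER_POLICY {
        return None;
    }
    policy_set_id
        .checked_mul(MAX_RULES_PER_POLICY)?
        .checked_add(rule_index)
}

/// Decodes a counter ID back into (policy set, rule index).
fn split_policy_counter_id(policy_id: u32) -> (u32, u32) {
    (
        policy_id / MAX_RULES_PER_POLICY,
        policy_id % MAX_RULES_PER_POLICY,
    )
}
```

The IDs are sparse because every policy set reserves a full block of slots whether or not it has that many rules, which is why the reporting paths must compute the ID rather than increment a running counter.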


Copilot · 2026-05-17T21:23:05Z

Round-2 Codex review on 8ac9e4b8

Codex verdict: MERGE-NEEDS-MINOR (substantive r1 MAJORs fully addressed)...

Implemented. e1a6680 adds regression tests for helper-backed policy counters on gRPC and Prometheus paths (including global policy IDs), and ad14d5b tightens the metrics mapping/hardening follow-up. Screenshot: N/A (no UI changes).

@copilot review

Completed an adversarial re-review on current HEAD (ad14d5b, includes e1a6680) and reran targeted validation: go test ./pkg/api ./pkg/grpcapi ./pkg/dataplane/userspace ./pkg/config passed; no additional correctness/perf issues found in the updated policy counter reporting paths. Screenshot: N/A (no UI changes).

psaab · 2026-05-17T21:37:19Z

Claude r3 review on `ad14d5bc`

Verdict: MERGE-NEEDS-MINOR (pending Codex/Gemini)

r2 MINOR closure

r2 raised: missing API/gRPC regression tests for scheduler-rule-counter preservation. r3 commit message: "test: tighten policy metrics regression coverage".

Added 527 LOC of tests across:

pkg/api/metrics_test.go +114 LOC
pkg/api/policy_counters_test.go +142 LOC (NEW file)
pkg/dataplane/userspace/manager_test.go +65 LOC
pkg/grpcapi/server_show_zones_test.go +173 LOC (NEW file)

These look like real production-path coverage tests, not isolated helper tests.

Concern raised by reading the metrics.go diff

The r3 diff also includes a behavior change in pkg/api/metrics.go:

for i, rule := range zpp.Policies {
-   if rule == nil {
-       continue
-   }
    policyID := policyCounterID(policySetID, i)
    ctrs, err := dp.ReadPolicyCounters(policyID)
    ...
    ch <- prometheus.MustNewConstMetric(c.policyHitsTotal, ...,
        float64(ctrs.Packets), fromZone, toZone, rule.Name)   // <-- rule.Name deref
}

The removed nil check guarded rule.Name dereference on line 1517 (PR branch numbering). Policies []*Policy (per pkg/config/types.go:1178) is a pointer slice — nil entries are type-permitted.

Inconsistency: The GlobalPolicies loop immediately below in the same function STILL has if rule == nil { continue }. Either:

(a) The invariant "Policies never contains nil" holds for both, and the GlobalPolicies nil check is also dead code → keep removal but also remove from GlobalPolicies for consistency.
(b) The invariant doesn't hold, and removing the per-zone-pair check introduced a nil-deref → put it back.

Worth a one-line justification or symmetric removal.

Verification

git show ad14d5bc:pkg/api/metrics.go | sed -n '1498,1532p' — confirmed asymmetric nil-check removal.
grep "Policies\s*\[\]" pkg/config/types.go — confirmed Policies []*ZonePairPolicies and inner Policies []*Policy are pointer slices.
New tests look production-path-shaped, but I have not run them.

Recommendation

Minor: either restore the nil check at line ~1508 OR remove the symmetric one at GlobalPolicies + add a comment explaining the invariant.

Awaiting Codex (task-mpaah2ui-xq0njo) + Gemini Pro 3 (task-mpaahf6n-fdtge8). Not merging — author's decision.

psaab · 2026-05-17T21:39:54Z

Round-3 triple-review synthesis on `ad14d5bc`

Reviewer	Verdict
Claude	MERGE-NEEDS-MINOR (asymmetric nil-check)
Codex	MERGE-NEEDS-MINOR (mocked DPs in Go tests)
Gemini Pro 3	MERGE-READY

Codex MINOR — Go tests mock past the risk surface

"The new API/gRPC tests still bypass the production counter-preservation bridge. pkg/api/policy_counters_test.go:13 and pkg/grpcapi/server_show_zones_test.go:13 install fake dataplanes whose ReadPolicyCounters is just a map lookup. That tests handler ID arithmetic, but not the real path: Rust PolicyCounterStore → helper policy_rule_counters status → Go Manager.lastStatus → userspace.Manager.ReadPolicyCounters → API/gRPC."

"pkg/dataplane/userspace/manager_test.go:88 is adjacent coverage. It manually swaps lastSnapshot while leaving lastStatus stale, then expects the old counter to reappear after re-add. It does not exercise Compile, UpdatePolicyScheduleState, apply_snapshot, helper status refresh, or the scheduler active/inactive preservation transition."

Gemini counterpoint

Rust-side test hit_counters_survive_scheduler_snapshot_rebuild in userspace-dp/src/policy_tests.rs DOES exercise the real PolicyCounterStore with full reconcile/eval/snapshot-rebuild flow. The Go tests are intentionally boundary tests for the HTTP/gRPC marshaling layer.

My read

Both right at different layers. Rust side has real production-path coverage (the actual counter survival logic). Go side has marshaling-only coverage with fakes. The gap Codex points to (Compile → UpdatePolicyScheduleState → apply_snapshot end-to-end) is a Go-side integration test that nobody wrote. Worth adding but doesn't undermine the Rust correctness coverage.

metrics.go behavior change

Both Claude and Codex flagged the if rule == nil { continue } removal in the per-zone-pair loop. Gemini explains: zpp.Policies is config-layer-guaranteed non-nil (defensive code removal); GlobalPolicies keeps the nil check because globals CAN be nil per compilation rules. Codex also flags that this is a behavior change beyond test tightening (the LastCompileResult()==nil early-return was also removed).

If zpp.Policies invariant holds, the asymmetric nil-check removal is correct cleanup. Worth a one-line code comment explaining the invariant so the asymmetry doesn't look like an oversight.

Recommendation

Block on: add comment to collectPolicyCounters documenting the zpp.Policies non-nil invariant (Claude+Codex MINOR).
Defer: add a Go-side end-to-end integration test that exercises Compile → status helper → ReadPolicyCounters without fakes (Codex MINOR — follow-up issue).

Codex task: task-mpaah2ui-xq0njo. Gemini task: task-mpaahf6n-fdtge8. Not merging — author's decision.

Copilot

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.

+impl PolicyCounterStore {
+    pub(crate) fn reconcile_rules(&self, rules: &[PolicyRuleSnapshot]) {
+        let active_rule_ids: FxHashSet<String> = rules.iter().map(stable_policy_rule_id).collect();
+        if let Ok(mut counters) = self.counters.lock() {
+            counters.retain(|rule_id, _| active_rule_ids.contains(rule_id));
+        }
+    }
+
+    pub(crate) fn clear(&self) {
+        if let Ok(counters) = self.counters.lock() {
+            for counter in counters.values() {
+                counter.reset();
+            }
+        }
+    }
+
+    fn rule_hit_counter(&self, rule_id: &str) -> Arc<PolicyRuleCounter> {
+        let mut counters = self.counters.lock().expect("policy counter store poisoned");
+        if let Some(counter) = counters.get(rule_id) {
+            return counter.clone();
+        }
+
+        let counter = Arc::new(PolicyRuleCounter::default());
+        counters.insert(rule_id.to_string(), counter.clone());
+        counter
+    }
+}


+	var currentSet uint32
+	for _, zpp := range cfg.Security.Policies {
+		if zpp == nil {
+			continue
+		}
+		if currentSet == policySetID {
+			if int(ruleIndex) >= len(zpp.Policies) || zpp.Policies[ruleIndex] == nil {
+				return ""
+			}
+			return stablePolicyRuleID(zpp.FromZone, zpp.ToZone, zpp.Policies[ruleIndex].Name)
+		}
+		currentSet++
+	}


+	total.Packets += counter.Packets
+	total.Bytes += counter.Bytes
+	return total, nil


+        self.packets.fetch_add(1, Ordering::Relaxed);
+        if packet_len != 0 {
+            self.bytes.fetch_add(packet_len, Ordering::Relaxed);
+        }


psaab · 2026-05-18T02:41:23Z

Claude r6 review on `62ece646`

Verdict: MERGE-READY (pending Codex/Gemini)

r5 follow-up closed: Read-path IPC test added

The new TestReadPolicyCountersUsesStatusIPCPolicyRuleCounters (manager_test.go +91) mirrors the r5 TestClearPolicyCountersUsesHelperIPCAndRecordsStatus pattern:

Real Unix domain socket listener (not m.inner = nil)
Real requestLocked IPC call with Type: "status"
Helper sends back ProcessStatus with PolicyRuleCounters: [{RuleID, Packets: 23, Bytes: 2300}]
Test then calls m.ReadPolicyCounters(0) and verifies the helper's counters reached the public read path

Verification

git show 62ece646:pkg/dataplane/userspace/manager_test.go — confirmed: nonzero counters set in helper response, ReadPolicyCounters called after IPC roundtrip.
Mirrors the r5 Clear-IPC test structure (Unix socket, ControlRequest decode, ControlResponse encode).
m.recordHelperStatusLocked(&status) is called after the IPC, simulating the production status-refresh path.

Recommendation

MERGE-READY on the r5 Read-path MINOR. The Clear + Read IPC tests now both pin counter behavior across real helper roundtrip.

Awaiting Codex (task-mpalim0e-kct9hw) + Gemini Pro 3 (task-mpaliy4j-48o1rw). Not merging — author's decision.

psaab · 2026-05-18T02:48:32Z

Round-6 triple-review synthesis on `62ece646`

Reviewer	Verdict
Claude	MERGE-READY
Codex	MERGE-READY (narrowly)
Gemini Pro 3	MERGE-NEEDS-MINOR

r5 Read-path IPC test — CLOSED

Both reviewers confirm real Unix-socket IPC roundtrip exercises the production manager path:

m := New() (m.inner non-nil)
Real net.Listen("unix", controlSock) fake helper
requestLocked(ControlRequest{Type: "status"}) → real JSON encode/decode
Helper returns PolicyRuleCounters: [{RuleID, Packets: 23, Bytes: 2300}]
recordHelperStatusLocked ingests
ReadPolicyCounters(0) returns the helper's counters

Codex MINOR caveat (C — "mostly yes")

"It does not seed an explicit same-rule stale zero entry, which would be a sharper fixture, but the intended stale-cache-through-status-refresh path is covered."

The starting cache is empty rather than "preloaded with stale zero counters for the same RuleID". The test still proves the IPC-fresh-counter is what ReadPolicyCounters returns; what it doesn't pin is "fresh helper overwrites a pre-existing stale entry". Codex calls this a non-blocking sharpness nit.

Gemini MINOR (B+C)

Same framing as Codex on point C. Adds point B: "no helper-restart simulation or multi-roundtrip". Both are sharpness nits, not coverage holes for the original r3+r4 concern.

Recommendation

MERGE-READY. The Clear + Read IPC tests now both pin counter behavior across real helper roundtrip. The "stale-cache overlay" + "multi-roundtrip" cases are reasonable follow-ups for an issue, not blockers for this PR. The original r3+r4 production-path coverage gap is closed.

Codex task: task-mpalim0e-kct9hw. Gemini task: task-mpaliy4j-48o1rw. Not merging — author's decision.

psaab · 2026-05-18T02:58:48Z

Follow-up fix on `a7f374bc`

Closed the remaining read-path IPC test-hardening nit. The policy counter IPC test now seeds stale same-rule zero state, performs two real helper status roundtrips, and asserts ReadPolicyCounters tracks the fresh helper data across refreshes instead of accidentally passing from cached state.

Validation:

go test ./pkg/dataplane/userspace
git diff --check

psaab · 2026-05-18T03:04:24Z

Claude r7 review on `a7f374bc`

Verdict: MERGE-READY (pending Codex/Gemini)

r6 sharpness MINOR closure

r6 reviewers asked for a stale-cache overlay + multi-roundtrip helper coverage. r7 refactor delivers both:

responses := []PolicyRuleCounterStatus{
    {RuleID: wantRuleID, Packets: 23, Bytes: 2300},
    {RuleID: wantRuleID, Packets: 31, Bytes: 3100},
}
reqCh := make(chan ControlRequest, len(responses))
done := make(chan struct{}, len(responses))
go func() {
    for _, counter := range responses {
        conn, err := ln.Accept()
        ...
        _ = json.NewEncoder(conn).Encode(ControlResponse{
            OK: true,
            Status: &ProcessStatus{
                PID:                4321,
                PolicyRuleCounters: []PolicyRuleCounterStatus{counter},
            },
        })
        done <- struct{}{}
    }
}()

Same RuleID, two distinct counter values, two real Unix-socket IPC roundtrips. Test should assert ReadPolicyCounters returns the LATEST helper response after each roundtrip.

Verification

git show a7f374bc:pkg/dataplane/userspace/manager_test.go confirms two-response loop with channel buffer sized for 2.
The accept loop is sequential — each ln.Accept() waits for the next client connect, so ordering is deterministic (no race).
Real m.proc + m.cfg.ControlSocket path, m.inner non-nil.

Recommendation

MERGE-READY on the r6 MINOR. The IPC test now covers:

Real helper IPC (m.inner != nil) — closed in r6
Counter value reflects latest helper sync — closed in r7
Multi-roundtrip refresh — closed in r7
Same-RuleID counter update (not just append) — closed in r7

Awaiting Codex (task-mpambkr5-pyxiwv) + Gemini Pro 3 (task-mpambwfs-2q77ug). Not merging — author's decision.

psaab · 2026-05-18T03:09:38Z

Round-7 triple-review synthesis on `a7f374bc`

Reviewer	Verdict
Claude	MERGE-READY
Codex	MERGE-READY
Gemini Pro 3	MERGE-READY

All three converge. r6 sharpness MINOR is closed.

What r7 confirmed

Codex (no findings):

"A. Yes. refreshStatus() is called twice, and the fake helper serves two sequential status responses: 23/2300, then 31/3100... B. Yes. The second ReadPolicyCounters(0) explicitly expects Packets: 31, Bytes: 3100, so it would fail if the first helper value 23/2300 stayed cached. C. Yes. m.lastStatus is preloaded before either IPC with PolicyRuleCounters containing the same wantRuleID; packets/bytes are zero by Go default. D. Yes. Both refreshes call m.requestLocked(ControlRequest{Type: "status"}, &status), which goes through requestDetailedLocked and net.DialTimeout("unix", m.cfg.ControlSocket, ...). This is not the m.inner=nil cache-only shortcut. E. Safe."

Gemini Pro 3 (quote-line verification of all four checks A-D, all PASS).

Coverage closure

The IPC test now pins:

Real helper IPC (m.inner != nil) — closed in r6
Counter value reflects latest helper sync — closed in r7
Multi-roundtrip refresh (two sequential IPCs) — closed in r7
Same-RuleID counter update with stale-cache overlay (m.lastStatus preloaded with zero) — closed in r7
No deadlock / race risk (buffered channels, sequential accept loop) — verified

Recommendation

Merge-ready. Four-round IPC test arc concluded.

Codex task: task-mpambkr5-pyxiwv. Gemini task: task-mpambwfs-2q77ug. Not merging — author's decision.

Copilot

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

pkg/dataplane/userspace/policycounters.go:87

ReadPolicyCounters adds the legacy inner counter and the helper counter (total.Packets += counter.Packets, same for bytes). This is correct only as long as exactly one of the two backends is actually reporting non-zero values; if a deployment ever has both the legacy eBPF dataplane loaded and the userspace helper reporting for the same rule, hit counters will be double-counted. Consider either documenting/asserting the mutual-exclusion invariant, or preferring helper values over inner ones when a helper rule-id match is found (rather than summing).

	var total dataplane.CounterValue
	var innerErr error
	if m.inner != nil {
		total, innerErr = m.inner.ReadPolicyCounters(policyID)
	}

	m.mu.Lock()
	defer m.mu.Unlock()

	cfg := (*config.Config)(nil)
	if m.lastSnapshot != nil {
		cfg = m.lastSnapshot.Config
	}
	ruleID := policyRuleIDForCounter(cfg, policyID)
	if ruleID == "" {
		if innerErr != nil {
			return dataplane.CounterValue{}, innerErr
		}
		return total, nil
	}
	counter, ok := buildPolicyRuleCounterIndex(&m.lastStatus)[ruleID]
	if !ok {
		if innerErr != nil {
			return dataplane.CounterValue{}, innerErr
		}
		return total, nil
	}
	total.Packets += counter.Packets
	total.Bytes += counter.Bytes
	return total, nil
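
The two options in the comment above can be compared with a small value-only sketch. It models each backend's reading as an `Option` and ignores the error handling in the quoted Go code; `read_summed`, `read_prefer_helper`, and the Rust `CounterValue` struct are hypothetical names.

```rust
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
struct CounterValue {
    packets: u64,
    bytes: u64,
}

/// Behavior quoted above: start from the inner (legacy) value and add the
/// helper value when a rule-ID match exists. Wrapping adds mirror Go's
/// unsigned integer overflow behavior.
fn read_summed(inner: Option<CounterValue>, helper: Option<CounterValue>) -> CounterValue {
    let mut total = inner.unwrap_or_default();
    if let Some(h) = helper {
        total.packets = total.packets.wrapping_add(h.packets);
        total.bytes = total.bytes.wrapping_add(h.bytes);
    }
    total
}

/// Suggested alternative: the helper value wins when present, and the inner
/// value is only a fallback, so the two backends are never double-counted.
fn read_prefer_helper(
    inner: Option<CounterValue>,
    helper: Option<CounterValue>,
) -> CounterValue {
    helper.or(inner).unwrap_or_default()
}
```

If both backends ever report for the same rule, summing yields the combined value while the preferring variant returns only the helper's numbers; which is right depends on whether the mutual-exclusion invariant actually holds in deployment.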

+	var currentSet uint32
+	for _, zpp := range cfg.Security.Policies {
+		if zpp == nil {
+			continue
+		}
+		if currentSet == policySetID {
+			if int(ruleIndex) >= len(zpp.Policies) || zpp.Policies[ruleIndex] == nil {
+				return ""
+			}
+			return stablePolicyRuleID(zpp.FromZone, zpp.ToZone, zpp.Policies[ruleIndex].Name)
+		}
+		currentSet++
+	}


 			if s.dp != nil && s.dp.IsLoaded() {
+				policyID := policySetID*dataplane.MaxRulesPerPolicy + uint32(len(pi.Rules))
 				if ctrs, err := s.dp.ReadPolicyCounters(policyID); err == nil {
 					pr.HitPackets = ctrs.Packets
 					pr.HitBytes = ctrs.Bytes
 				}
 			}


+) -> PolicyAction {
+    evaluate_policy_with_len(
+        state, from_id, to_id, src_ip, dst_ip, protocol, src_port, dst_port, 0,
+    )
+}


Preserves scheduler rule counters (#1407) in policy.

Expose userspace scheduler policy counters
docs: correct #1378 scheduler test inventory
log: use relative cargo manifest path
test: cover scheduler policy counters in API
test: cover userspace policy counters in metrics and grpc global path
fix: harden policy metrics counter ID mapping

Touches pkg/dataplane/userspace, pkg/grpcapi/server_show_zones_test.go, userspace-dp/src/policy_tests.rs, and pkg/api/policy_counters_test.go across documentation, tests, configuration schema, userspace dataplane, dataplane integration, and operator/API output. The largest file deltas are pkg/dataplane/userspace/manager_test.go, pkg/grpcapi/server_show_zones_test.go, and userspace-dp/src/policy_tests.rs. The diff is 1405 additions and 85 deletions across 25 files.

Copilot AI review requested due to automatic review settings May 17, 2026 18:47

Copilot started reviewing on behalf of psaab May 17, 2026 18:47 View session

Copilot started work on behalf of psaab May 17, 2026 18:50 View session

Copilot AI reviewed May 17, 2026

Copilot finished work on behalf of psaab May 17, 2026 18:52

Copilot started work on behalf of psaab May 17, 2026 20:13

Copilot AI review requested due to automatic review settings May 17, 2026 20:16

psaab review requested due to automatic review settings May 17, 2026 20:16

Copilot AI review requested due to automatic review settings May 17, 2026 20:17

psaab review requested due to automatic review settings May 17, 2026 20:17

Copilot finished work on behalf of psaab May 17, 2026 20:18

Copilot AI review requested due to automatic review settings May 17, 2026 21:14

Copilot started work on behalf of psaab May 17, 2026 21:15

Copilot started reviewing on behalf of psaab May 17, 2026 21:15

Copilot AI reviewed May 17, 2026

Copilot AI review requested due to automatic review settings May 17, 2026 21:20

psaab review requested due to automatic review settings May 17, 2026 21:20

Copilot AI review requested due to automatic review settings May 17, 2026 21:21

psaab review requested due to automatic review settings May 17, 2026 21:21

Copilot finished work on behalf of psaab May 17, 2026 21:23

Copilot AI reviewed May 18, 2026

psaab and others added 13 commits May 17, 2026 20:20

policy: preserve scheduler rule counters

bebec04

Expose userspace scheduler policy counters

a2e6fe0

docs: correct #1378 scheduler test inventory

1558f86

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/44910551-6d29-4da9-916f-072abba2e573 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

log: use relative cargo manifest path

b70b0d3

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/44910551-6d29-4da9-916f-072abba2e573 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

test: cover scheduler policy counters in API

d54f0af

test: cover userspace policy counters in metrics and grpc global path

6722806

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/9158ee1c-37c2-4067-95a5-20a292e2b448 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

fix: harden policy metrics counter ID mapping

1360ffb

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/9158ee1c-37c2-4067-95a5-20a292e2b448 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

test: tighten policy metrics regression coverage

4372003

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/9158ee1c-37c2-4067-95a5-20a292e2b448 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

test: cover userspace scheduled policy counters

fb13fb0

docs: clarify policy counter nil-invariant in metrics loop

c294bf2

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/109b8dbc-29e5-406f-b482-8b7f62214c86 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

test: cover helper policy counter clear IPC

b2e0dbc

test(userspace): cover policy counter read IPC path

779030c

test(userspace): cover stale policy counter IPC refresh

f17e714

Copilot AI review requested due to automatic review settings May 18, 2026 03:26

psaab force-pushed the codex/1403-1378-scheduler-closeout branch from a7f374b to f17e714 Compare May 18, 2026 03:26

Copilot started reviewing on behalf of psaab May 18, 2026 03:27

psaab merged commit afab2ab into master May 18, 2026
2 checks passed

Copilot AI reviewed May 18, 2026

psaab mentioned this pull request May 18, 2026

filter: reuse three-color policer runtimes on refresh #1421

Merged

Conversation

psaab commented May 17, 2026

Summary

Validation

psaab commented May 17, 2026

Copilot AI left a comment

Pull request overview

Reviewed changes

Copilot AI commented May 17, 2026

psaab commented May 17, 2026

Round-1 Codex review on ed069a50

MAJOR 1 — Counter preservation is not end-to-end

MAJOR 2 — Counter lifetime wrong for delete/re-add

MINOR — Tests/docs overstate coverage

Wins (Codex confirms)

Recommendation

psaab commented May 17, 2026

Copilot AI commented May 17, 2026

Round-1 Codex review on ed069a50

psaab commented May 17, 2026

Round-2 Codex review on 8ac9e4b8

r1 MAJOR resolution (Codex verified)

MINOR — Production reporting surfaces lack regression tests

MINOR — Plan doc carryover

psaab commented May 17, 2026

Copilot AI left a comment

Pull request overview

Copilot AI commented May 17, 2026

Round-2 Codex review on 8ac9e4b8

psaab commented May 17, 2026

Claude r3 review on ad14d5bc

r2 MINOR closure

Concern raised by reading the metrics.go diff

Verification

Recommendation

psaab commented May 17, 2026

Round-3 triple-review synthesis on ad14d5bc

Codex MINOR — Go tests mock past the risk surface

Gemini counterpoint

My read

metrics.go behavior change

Recommendation

Copilot AI left a comment

Pull request overview

psaab commented May 18, 2026

Claude r6 review on 62ece646

r5 follow-up closed: Read-path IPC test added

Verification

Recommendation

psaab commented May 18, 2026

Round-6 triple-review synthesis on 62ece646

r5 Read-path IPC test — CLOSED

Codex MINOR caveat (C — "mostly yes")

Gemini MINOR (B+C)

Recommendation

psaab commented May 18, 2026

Follow-up fix on a7f374bc

psaab commented May 18, 2026

Claude r7 review on `a7f374bc`

Round-7 triple-review synthesis on `a7f374bc`