policy: preserve scheduler rule counters#1407

Merged
psaab merged 13 commits into master from codex/1403-1378-scheduler-closeout on May 18, 2026

Owner

@psaab psaab commented May 17, 2026

Summary

Refs #1378.

Narrows the remaining userspace policy-scheduler gap after #1396:

Validation

  • go test ./pkg/config ./pkg/dataplane/userspace
  • cargo test policy:: -- --nocapture
  • git diff --check

Copilot AI review requested due to automatic review settings May 17, 2026 18:47
Owner Author

psaab commented May 17, 2026

@copilot review

Contributor

Copilot AI left a comment

Pull request overview

Tightens the userspace policy-scheduler retirement work from #1396: missing scheduler references now reject commit instead of warning, and Rust policy hit counters survive snapshot rebuilds across scheduler active/inactive flips by being keyed by stable rule id in a process-wide registry.

Changes:

  • Promote missing zone-pair and global policy scheduler references from ValidateConfig warnings to a strict commit-time error in compileExpanded, with scope-qualified error messages and updated tests.
  • Move PolicyRule.hit_count to an Arc<AtomicU64> looked up from a global, rule-id-keyed registry with size-threshold-driven pruning, so rebuilds reuse counters; added a Rust test covering active→inactive→active toggling.
  • Update #1378 plan and docs/userspace-dataplane-gaps.md to record the 2026-05-17 closeout and narrow the remaining blocker to integration/HA validation.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/config/compiler.go | Wires the strict scheduler validator into compileExpanded and adds a scope prefix for global policies. |
| pkg/config/parser_ast_test.go | Renames and rewrites the two scheduler-reference tests to assert commit-time errors instead of warnings. |
| userspace-dp/src/policy.rs | Introduces a global OnceLock<Mutex<FxHashMap>> registry of Arc<AtomicU64> hit counters keyed by rule_id, with prune threshold; parse_policy_state now reuses counters across rebuilds. |
| userspace-dp/src/policy_tests.rs | Adds scheduled_allow_snapshot helper and hit_counters_survive_scheduler_snapshot_rebuild covering active/inactive/active. |
| docs/pr/1373-retire-ebpf-dataplane/plan-1378-policy-schedulers.md | Documents the closeout slice and validation commands. |
| docs/userspace-dataplane-gaps.md | Narrows remaining #1378 scope to integration/HA failover evidence. |
Comments suppressed due to low confidence (2)

userspace-dp/src/policy.rs:131

  • prune_policy_counter_registry is invoked from parse_policy_state while the previous PolicyState is typically still alive in the caller (it's only dropped after the new snapshot replaces it). At prune time, every entry from the old snapshot therefore has Arc::strong_count > 1 and is retained, even for rules removed by the new snapshot. In practice the registry only shrinks one rebuild cycle after a rule disappears, so the threshold is more lenient than it looks and a rapid churn of rule_ids can still push the map well past POLICY_COUNTER_REGISTRY_PRUNE_THRESHOLD for an extended period. Consider basing retention purely on active_rule_ids membership (since not-present rules are never re-looked-up by policy_rule_hit_counter again and the lingering Arc just delays GC).
fn prune_policy_counter_registry(active_rule_ids: &FxHashSet<String>) {
    if let Ok(mut counters) = policy_counter_registry().lock() {
        if counters.len() <= POLICY_COUNTER_REGISTRY_PRUNE_THRESHOLD {
            return;
        }
        counters.retain(|rule_id, counter| {
            active_rule_ids.contains(rule_id) || Arc::strong_count(counter) > 1
        });
    }
}
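The retention lag described in this comment can be reproduced in isolation. What follows is a minimal sketch, assuming a simplified registry: std `HashMap`/`HashSet` stand in for `FxHashMap`/`FxHashSet`, and the mutex and size threshold are left out. The names `prune_with_strong_count`, `prune_by_membership`, and `registry_with_live_snapshot` are hypothetical and do not come from the PR.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::atomic::AtomicU64;
use std::sync::Arc;

type Registry = HashMap<String, Arc<AtomicU64>>;

/// Retention rule from the reviewed code: keep active rules and any counter
/// that is still referenced from outside the registry.
fn prune_with_strong_count(registry: &mut Registry, active: &HashSet<String>) {
    registry.retain(|rule_id, counter| {
        active.contains(rule_id) || Arc::strong_count(counter) > 1
    });
}

/// Alternative suggested in the comment: keep only rules that are present
/// in the new snapshot.
fn prune_by_membership(registry: &mut Registry, active: &HashSet<String>) {
    registry.retain(|rule_id, _| active.contains(rule_id));
}

/// Builds a registry holding one counter and returns the extra handle that
/// an old, still-alive snapshot would hold (hypothetical test fixture).
fn registry_with_live_snapshot(rule_id: &str) -> (Registry, Arc<AtomicU64>) {
    let counter = Arc::new(AtomicU64::new(0));
    let mut registry = Registry::new();
    registry.insert(rule_id.to_string(), Arc::clone(&counter));
    (registry, counter)
}
```

While the old snapshot's handle is alive, `prune_with_strong_count` keeps the entry for a rule that is no longer active; only a prune that runs after the handle is dropped removes it. `prune_by_membership` removes it on the first pass.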

userspace-dp/src/policy.rs:144

  • Inconsistent lock-poisoning handling between the two registry accessors: prune_policy_counter_registry silently no-ops on a poisoned mutex (if let Ok(mut counters) = ...lock()), while policy_rule_hit_counter panics on the same condition (.expect("policy counter registry poisoned")). Either both should propagate (panic) so a poisoned registry is surfaced immediately, or both should recover so a panic on one snapshot build does not bring down later ones. The current asymmetry means a poisoning incident produces hard panics on hit-counter lookup but quietly skipped pruning.
fn prune_policy_counter_registry(active_rule_ids: &FxHashSet<String>) {
    if let Ok(mut counters) = policy_counter_registry().lock() {
        if counters.len() <= POLICY_COUNTER_REGISTRY_PRUNE_THRESHOLD {
            return;
        }
        counters.retain(|rule_id, counter| {
            active_rule_ids.contains(rule_id) || Arc::strong_count(counter) > 1
        });
    }
}

fn policy_rule_hit_counter(rule_id: &str) -> Arc<AtomicU64> {
    let mut counters = policy_counter_registry()
        .lock()
        .expect("policy counter registry poisoned");
    if let Some(counter) = counters.get(rule_id) {
        return counter.clone();
    }

    let counter = Arc::new(AtomicU64::new(0));
    counters.insert(rule_id.to_string(), counter.clone());
    counter
}
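One way to remove the asymmetry is to route every accessor through a single lock helper, so the poisoning policy is decided in one place. The sketch below assumes the "recover" option and uses std `HashMap` in place of `FxHashMap`; `lock_counters`, `hit_counter`, and `prune` are hypothetical names, and nothing here is taken from the PR's final code.

```rust
use std::collections::HashMap;
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, Mutex, MutexGuard, PoisonError};

type Counters = HashMap<String, Arc<AtomicU64>>;

/// Single lock helper so every accessor treats poisoning the same way.
/// A poisoned lock is recovered here by taking the guard out of the error.
fn lock_counters(registry: &Mutex<Counters>) -> MutexGuard<'_, Counters> {
    registry.lock().unwrap_or_else(PoisonError::into_inner)
}

/// Returns the existing counter for a rule, or inserts a zeroed one.
fn hit_counter(registry: &Mutex<Counters>, rule_id: &str) -> Arc<AtomicU64> {
    let mut counters = lock_counters(registry);
    counters.entry(rule_id.to_string()).or_default().clone()
}

/// Drops every entry for which `keep` returns false.
fn prune(registry: &Mutex<Counters>, keep: impl Fn(&str) -> bool) {
    lock_counters(registry).retain(|rule_id, _| keep(rule_id.as_str()));
}
```

Choosing the "propagate" option instead would mean replacing the body of `lock_counters` with an `expect(...)`; either way both call sites would then behave identically after a poisoning incident.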

Comment thread pkg/config/compiler.go
Comment on lines +224 to +226
if err := validatePolicySchedulerReferencesStrict(cfg); err != nil {
	return nil, err
}
Comment thread userspace-dp/src/policy.rs Outdated
Comment on lines +116 to +144
static POLICY_COUNTERS: OnceLock<Mutex<PolicyCounterRegistry>> = OnceLock::new();

fn policy_counter_registry() -> &'static Mutex<PolicyCounterRegistry> {
    POLICY_COUNTERS.get_or_init(|| Mutex::new(FxHashMap::default()))
}

fn prune_policy_counter_registry(active_rule_ids: &FxHashSet<String>) {
    if let Ok(mut counters) = policy_counter_registry().lock() {
        if counters.len() <= POLICY_COUNTER_REGISTRY_PRUNE_THRESHOLD {
            return;
        }
        counters.retain(|rule_id, counter| {
            active_rule_ids.contains(rule_id) || Arc::strong_count(counter) > 1
        });
    }
}

fn policy_rule_hit_counter(rule_id: &str) -> Arc<AtomicU64> {
    let mut counters = policy_counter_registry()
        .lock()
        .expect("policy counter registry poisoned");
    if let Some(counter) = counters.get(rule_id) {
        return counter.clone();
    }

    let counter = Arc::new(AtomicU64::new(0));
    counters.insert(rule_id.to_string(), counter.clone());
    counter
}
Contributor

Copilot AI commented May 17, 2026

@copilot review

I completed an adversarial pass over commit ed069a5 (compiler scheduler-reference checks + userspace policy counter persistence) and re-ran targeted validation (go test ./pkg/config ./pkg/dataplane/userspace). I did not find additional correctness/performance issues requiring follow-up changes in this PR.

Copilot finished work on behalf of psaab May 17, 2026 18:52
Owner Author

psaab commented May 17, 2026

Round-1 Codex review on ed069a50

Codex verdict: MERGE-NEEDS-MAJOR

MAJOR 1 — Counter preservation is not end-to-end

Rust now increments an internal Arc<AtomicU64> at userspace-dp/src/policy.rs:390, :396. But:

  • Existing counter API is ReadPolicyCounters(policyID uint32) CounterValue (pkg/dataplane/dataplane.go:241)
  • userspace.Manager still embeds/delegates to inner eBPF manager (pkg/dataplane/userspace/manager.go:62, :155)
  • CLI/API paths still read numeric BPF counters (pkg/cli/cli_show_security.go:48)
  • No userspace override, no control-socket export, no clear path
  • Rust counter is packets-only; CounterValue is packets+bytes

This is a private helper counter, not preserved operator-visible policy counters. Operator-visible counters still go to the eBPF path, which on userspace dataplanes won't reflect actual policy hits.

MAJOR 2 — Counter lifetime wrong for delete/re-add

prune_policy_counter_registry returns without pruning until registry exceeds 16_384 entries (userspace-dp/src/policy.rs:122). policy_rule_hit_counter blindly reuses any stored ID (:133).

Worked failure:

  1. Policy P with stable RuleID=42 → counter created, hits=100
  2. Operator removes P from config → snapshot rebuilds without P
  3. Counter registry NOT pruned (stays under 16K threshold)
  4. Operator re-adds P with the same stable RuleID=42 (idempotent generation)
  5. policy_rule_hit_counter(42) returns the same counter → stale hits resurrect

Contradicts plan invariant at plan-1378-policy-schedulers.md:75: "counters survive while the identity remains present" (key word: REMAINS — not "and resumes after re-add").
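The invariant quoted from the plan can be made concrete with a small model of the counter store. This is a sketch under stated assumptions, not the PR's implementation: `CounterStore`, `reconcile`, `counter`, and `rebuild` are hypothetical names, rule IDs are plain strings, and std `HashMap` replaces `FxHashMap`. It prunes on every rebuild rather than at a size threshold.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct CounterStore {
    counters: Mutex<HashMap<String, Arc<AtomicU64>>>,
}

impl CounterStore {
    /// Drops every counter whose rule is absent from the new snapshot.
    fn reconcile(&self, active_rule_ids: &[&str]) {
        let active: HashSet<&str> = active_rule_ids.iter().copied().collect();
        let mut counters = self.counters.lock().expect("counter store poisoned");
        counters.retain(|rule_id, _| active.contains(rule_id.as_str()));
    }

    /// Returns the existing counter for a rule, or a fresh zeroed one.
    fn counter(&self, rule_id: &str) -> Arc<AtomicU64> {
        let mut counters = self.counters.lock().expect("counter store poisoned");
        counters.entry(rule_id.to_string()).or_default().clone()
    }
}

/// Models a snapshot rebuild: prune first, then hand out counters for the
/// rules that are present in the new snapshot.
fn rebuild(store: &CounterStore, active_rule_ids: &[&str]) -> Vec<Arc<AtomicU64>> {
    store.reconcile(active_rule_ids);
    active_rule_ids.iter().map(|id| store.counter(id)).collect()
}

/// Convenience reader used only to inspect a counter in examples.
fn hits(counter: &AtomicU64) -> u64 {
    counter.load(Ordering::Relaxed)
}
```

In this model a rule that stays present across rebuilds keeps its count, while a rule that is removed and later re-added under the same ID starts again from zero, which is the behaviour the worked failure says is missing.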

MINOR — Tests/docs overstate coverage

  • Rust counter test calls parse_policy_state directly and reads a private field; doesn't go Go snapshot JSON → Rust deserialize → helper runtime → public counter read
  • Plan lists policy::snapshot_publish_applies_inactive_bits_atomically at plan-1378-policy-schedulers.md:111 but that test doesn't exist

Wins (Codex confirms)

  • ✓ Strict missing-scheduler rejection wired into CompileConfig
  • ✓ Snapshot fields + Rust serde defaults preserve old snapshots
  • ✓ No new per-packet locks/allocations (new mutex/string work is on snapshot rebuild, not packet eval)

Recommendation

Block on:

  1. Expose the userspace policy counter through the existing ReadPolicyCounters API (or add a userspace-specific override that the CLI/gRPC layer consults first)
  2. Prune counter registry when the corresponding rule is removed from the latest snapshot (not just at 16K cap)
  3. Convert packets-only counter to packets+bytes to match CounterValue schema
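For the third item, a packets+bytes counter could look roughly like the sketch below. It is illustrative only: `RuleCounter` and its methods are hypothetical names, and the tuple returned by `read` merely stands in for the packets/bytes pair that `CounterValue` is described as carrying.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Per-rule counter tracking both packets and bytes (hypothetical type).
#[derive(Default)]
struct RuleCounter {
    packets: AtomicU64,
    bytes: AtomicU64,
}

impl RuleCounter {
    /// Records one matched packet of `packet_len` bytes.
    fn record(&self, packet_len: u64) {
        self.packets.fetch_add(1, Ordering::Relaxed);
        self.bytes.fetch_add(packet_len, Ordering::Relaxed);
    }

    /// Reads (packets, bytes). The two loads are independent, so a reader
    /// racing with `record` may see the pair from slightly different moments.
    fn read(&self) -> (u64, u64) {
        (
            self.packets.load(Ordering::Relaxed),
            self.bytes.load(Ordering::Relaxed),
        )
    }

    /// Zeroes both fields, as a "clear counters" operation would.
    fn reset(&self) {
        self.packets.store(0, Ordering::Relaxed);
        self.bytes.store(0, Ordering::Relaxed);
    }
}
```

Relaxed ordering is chosen here on the assumption that these are statistics counters with no ordering dependency on other memory; that assumption would need checking against the real evaluation path.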

Codex task: task-mpa6l165-b0mc0i. Not merging — author's decision.

Owner Author

psaab commented May 17, 2026

@copilot review

Copilot AI review requested due to automatic review settings May 17, 2026 20:16
@psaab psaab review requested due to automatic review settings May 17, 2026 20:16
Copilot AI review requested due to automatic review settings May 17, 2026 20:17
@psaab psaab review requested due to automatic review settings May 17, 2026 20:17
Contributor

Copilot AI commented May 17, 2026

Round-1 Codex review on ed069a50

Codex verdict: MERGE-NEEDS-MAJOR...

Addressed. 2b54bda wires userspace policy counters into ReadPolicyCounters (packets+bytes) and moves counter lifetime to PolicyCounterStore with reconcile_rules so delete/re-add does not resurrect stale values; tests cover both continuity and reset-on-removal behavior. I also fixed the plan’s stale test reference in 6303db8 (and path cleanup in 8ac9e4b).

@copilot review

Completed another adversarial pass on current HEAD (8ac9e4b, including 2b54bda) and validated the userspace counter/export lifecycle path. Targeted Go suites pass; Rust cargo test ... policy:: remains blocked in this runner by missing libelf headers/pkg-config.

Copilot finished work on behalf of psaab May 17, 2026 20:18
Owner Author

psaab commented May 17, 2026

Round-2 Codex review on 8ac9e4b8

Codex verdict: MERGE-NEEDS-MINOR (substantive r1 MAJORs fully addressed)

r1 MAJOR resolution (Codex verified)

  • ✓ Counter lifetime now per-coordinator via PolicyCounterStore (not earlier global process registry); Coordinator owns store across snapshot rebuilds
  • ✓ Counters track packets AND bytes
  • ✓ Attached to stable rule IDs
  • ✓ Exposed via Rust ProcessStatus.PolicyRuleCounters
  • ✓ Go consumes through pkg/dataplane/userspace/policycounters.go with sparse-policy-ID mapping + clear support
  • ✓ Delete/re-add semantics explicit: absent rules pruned, re-added rule starts from zero (test verifies)
  • ✓ Stale plan-doc test reference corrected

MINOR — Production reporting surfaces lack regression tests

Codex: changed user-facing paths still untested:

  • pkg/api/handlers.go:218
  • pkg/grpcapi/server_show_zones.go:65
  • pkg/api/metrics.go:1498

These are where sparse policy IDs and helper-backed counters can regress silently.

Suggested test: seed userspace helper PolicyRuleCounters → assert reported policy counters match production IDs via API/gRPC/Prometheus path, including a global policy.

MINOR — Plan doc carryover

docs/pr/1373-retire-ebpf-dataplane/plan-1378-policy-schedulers.md:119 still records outstanding end-to-end scheduler/HA validation. Fine if this PR is the counter-preservation slice only, not evidence of complete production validation.

Codex task: task-mpa7x5sd-i1zs4c. Not merging — author's decision.

Copilot AI review requested due to automatic review settings May 17, 2026 21:14
Owner Author

psaab commented May 17, 2026

@copilot review

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.

Comment on lines +147 to +173
impl PolicyCounterStore {
    pub(crate) fn reconcile_rules(&self, rules: &[PolicyRuleSnapshot]) {
        let active_rule_ids: FxHashSet<String> = rules.iter().map(stable_policy_rule_id).collect();
        if let Ok(mut counters) = self.counters.lock() {
            counters.retain(|rule_id, _| active_rule_ids.contains(rule_id));
        }
    }

    pub(crate) fn clear(&self) {
        if let Ok(counters) = self.counters.lock() {
            for counter in counters.values() {
                counter.reset();
            }
        }
    }

    fn rule_hit_counter(&self, rule_id: &str) -> Arc<PolicyRuleCounter> {
        let mut counters = self.counters.lock().expect("policy counter store poisoned");
        if let Some(counter) = counters.get(rule_id) {
            return counter.clone();
        }

        let counter = Arc::new(PolicyRuleCounter::default());
        counters.insert(rule_id.to_string(), counter.clone());
        counter
    }
}
Comment thread pkg/api/handlers.go
Comment on lines 245 to 251
if s.dp != nil && s.dp.IsLoaded() {
	policyID := policySetID*dataplane.MaxRulesPerPolicy + uint32(len(pi.Rules))
	if ctrs, err := s.dp.ReadPolicyCounters(policyID); err == nil {
		pr.HitPackets = ctrs.Packets
		pr.HitBytes = ctrs.Bytes
	}
}
Comment thread pkg/api/metrics.go
policyID++
}
policySetID++
}
Comment on lines +24 to +51
func policyRuleIDForCounter(cfg *config.Config, policyID uint32) string {
	if cfg == nil {
		return ""
	}
	policySetID := policyID / dataplane.MaxRulesPerPolicy
	ruleIndex := policyID % dataplane.MaxRulesPerPolicy

	var currentSet uint32
	for _, zpp := range cfg.Security.Policies {
		if zpp == nil {
			continue
		}
		if currentSet == policySetID {
			if int(ruleIndex) >= len(zpp.Policies) || zpp.Policies[ruleIndex] == nil {
				return ""
			}
			return stablePolicyRuleID(zpp.FromZone, zpp.ToZone, zpp.Policies[ruleIndex].Name)
		}
		currentSet++
	}
	if currentSet == policySetID {
		if int(ruleIndex) >= len(cfg.Security.GlobalPolicies) || cfg.Security.GlobalPolicies[ruleIndex] == nil {
			return ""
		}
		return stablePolicyRuleID("junos-global", "junos-global", cfg.Security.GlobalPolicies[ruleIndex].Name)
	}
	return ""
}
Copilot AI review requested due to automatic review settings May 17, 2026 21:20
@psaab psaab review requested due to automatic review settings May 17, 2026 21:20
Copilot AI review requested due to automatic review settings May 17, 2026 21:21
@psaab psaab review requested due to automatic review settings May 17, 2026 21:21
Contributor

Copilot AI commented May 17, 2026

Round-2 Codex review on 8ac9e4b8

Codex verdict: MERGE-NEEDS-MINOR (substantive r1 MAJORs fully addressed)...

Implemented. e1a6680 adds regression tests for helper-backed policy counters on gRPC and Prometheus paths (including global policy IDs), and ad14d5b tightens the metrics mapping/hardening follow-up. Screenshot: N/A (no UI changes).

@copilot review

Completed an adversarial re-review on current HEAD (ad14d5b, includes e1a6680) and reran targeted validation: go test ./pkg/api ./pkg/grpcapi ./pkg/dataplane/userspace ./pkg/config passed; no additional correctness/perf issues found in the updated policy counter reporting paths. Screenshot: N/A (no UI changes).

Copilot finished work on behalf of psaab May 17, 2026 21:23
Owner Author

psaab commented May 17, 2026

Claude r3 review on ad14d5bc

Verdict: MERGE-NEEDS-MINOR (pending Codex/Gemini)

r2 MINOR closure

r2 raised: missing API/gRPC regression tests for scheduler-rule-counter preservation. r3 commit message: "test: tighten policy metrics regression coverage".

Added 527 LOC of tests across:

  • pkg/api/metrics_test.go +114 LOC
  • pkg/api/policy_counters_test.go +142 LOC (NEW file)
  • pkg/dataplane/userspace/manager_test.go +65 LOC
  • pkg/grpcapi/server_show_zones_test.go +173 LOC (NEW file)

These look like real production-path coverage tests, not isolated helper tests.

Concern raised by reading the metrics.go diff

The r3 diff also includes a behavior change in pkg/api/metrics.go:

for i, rule := range zpp.Policies {
-   if rule == nil {
-       continue
-   }
    policyID := policyCounterID(policySetID, i)
    ctrs, err := dp.ReadPolicyCounters(policyID)
    ...
    ch <- prometheus.MustNewConstMetric(c.policyHitsTotal, ...,
        float64(ctrs.Packets), fromZone, toZone, rule.Name)   // <-- rule.Name deref
}

The removed nil check guarded rule.Name dereference on line 1517 (PR branch numbering). Policies []*Policy (per pkg/config/types.go:1178) is a pointer slice — nil entries are type-permitted.

Inconsistency: The GlobalPolicies loop immediately below in the same function STILL has if rule == nil { continue }. Either:

  • (a) The invariant "Policies never contains nil" holds for both, and the GlobalPolicies nil check is also dead code → keep removal but also remove from GlobalPolicies for consistency.
  • (b) The invariant doesn't hold, and removing the per-zone-pair check introduced a nil-deref → put it back.

Worth a one-line justification or symmetric removal.

Verification

  • git show ad14d5bc:pkg/api/metrics.go | sed -n '1498,1532p' — confirmed asymmetric nil-check removal.
  • grep "Policies\s*\[\]" pkg/config/types.go — confirmed Policies []*ZonePairPolicies and inner Policies []*Policy are pointer slices.
  • New tests look production-path-shaped, but I have not run them.

Recommendation

Minor: either restore the nil check at line ~1508 OR remove the symmetric one at GlobalPolicies + add a comment explaining the invariant.

Awaiting Codex (task-mpaah2ui-xq0njo) + Gemini Pro 3 (task-mpaahf6n-fdtge8). Not merging — author's decision.

Owner Author

psaab commented May 17, 2026

Round-3 triple-review synthesis on ad14d5bc

| Reviewer | Verdict |
| --- | --- |
| Claude | MERGE-NEEDS-MINOR (asymmetric nil-check) |
| Codex | MERGE-NEEDS-MINOR (mocked DPs in Go tests) |
| Gemini Pro 3 | MERGE-READY |

Codex MINOR — Go tests mock past the risk surface

"The new API/gRPC tests still bypass the production counter-preservation bridge. pkg/api/policy_counters_test.go:13 and pkg/grpcapi/server_show_zones_test.go:13 install fake dataplanes whose ReadPolicyCounters is just a map lookup. That tests handler ID arithmetic, but not the real path: Rust PolicyCounterStore → helper policy_rule_counters status → Go Manager.lastStatus → userspace.Manager.ReadPolicyCounters → API/gRPC."

"pkg/dataplane/userspace/manager_test.go:88 is adjacent coverage. It manually swaps lastSnapshot while leaving lastStatus stale, then expects the old counter to reappear after re-add. It does not exercise Compile, UpdatePolicyScheduleState, apply_snapshot, helper status refresh, or the scheduler active/inactive preservation transition."

Gemini counterpoint

Rust-side test hit_counters_survive_scheduler_snapshot_rebuild in userspace-dp/src/policy_tests.rs DOES exercise the real PolicyCounterStore with full reconcile/eval/snapshot-rebuild flow. The Go tests are intentionally boundary tests for the HTTP/gRPC marshaling layer.

My read

Both are right at different layers. The Rust side has real production-path coverage (the actual counter survival logic). The Go side has marshaling-only coverage with fakes. The gap Codex points to (Compile → UpdatePolicyScheduleState → apply_snapshot end-to-end) is a Go-side integration test that nobody wrote. Worth adding, but it doesn't undermine the Rust correctness coverage.

metrics.go behavior change

Both Claude and Codex flagged the if rule == nil { continue } removal in the per-zone-pair loop. Gemini explains: zpp.Policies is config-layer-guaranteed non-nil (defensive code removal); GlobalPolicies keeps the nil check because globals CAN be nil per compilation rules. Codex also flags that this is a behavior change beyond test tightening (the LastCompileResult()==nil early return was also removed).

If zpp.Policies invariant holds, the asymmetric nil-check removal is correct cleanup. Worth a one-line code comment explaining the invariant so the asymmetry doesn't look like an oversight.

Recommendation

Block on: add comment to collectPolicyCounters documenting the zpp.Policies non-nil invariant (Claude+Codex MINOR).
Defer: add a Go-side end-to-end integration test that exercises Compile → status helper → ReadPolicyCounters without fakes (Codex MINOR — follow-up issue).

Codex task: task-mpaah2ui-xq0njo. Gemini task: task-mpaahf6n-fdtge8. Not merging — author's decision.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.

Comment on lines +147 to +173
impl PolicyCounterStore {
    pub(crate) fn reconcile_rules(&self, rules: &[PolicyRuleSnapshot]) {
        let active_rule_ids: FxHashSet<String> = rules.iter().map(stable_policy_rule_id).collect();
        if let Ok(mut counters) = self.counters.lock() {
            counters.retain(|rule_id, _| active_rule_ids.contains(rule_id));
        }
    }

    pub(crate) fn clear(&self) {
        if let Ok(counters) = self.counters.lock() {
            for counter in counters.values() {
                counter.reset();
            }
        }
    }

    fn rule_hit_counter(&self, rule_id: &str) -> Arc<PolicyRuleCounter> {
        let mut counters = self.counters.lock().expect("policy counter store poisoned");
        if let Some(counter) = counters.get(rule_id) {
            return counter.clone();
        }

        let counter = Arc::new(PolicyRuleCounter::default());
        counters.insert(rule_id.to_string(), counter.clone());
        counter
    }
}
Comment on lines +31 to +43
var currentSet uint32
for _, zpp := range cfg.Security.Policies {
	if zpp == nil {
		continue
	}
	if currentSet == policySetID {
		if int(ruleIndex) >= len(zpp.Policies) || zpp.Policies[ruleIndex] == nil {
			return ""
		}
		return stablePolicyRuleID(zpp.FromZone, zpp.ToZone, zpp.Policies[ruleIndex].Name)
	}
	currentSet++
}
Comment on lines +85 to +87
total.Packets += counter.Packets
total.Bytes += counter.Bytes
return total, nil
Comment on lines +120 to +123
self.packets.fetch_add(1, Ordering::Relaxed);
if packet_len != 0 {
    self.bytes.fetch_add(packet_len, Ordering::Relaxed);
}
Owner Author

psaab commented May 18, 2026

Claude r6 review on 62ece646

Verdict: MERGE-READY (pending Codex/Gemini)

r5 follow-up closed: Read-path IPC test added

The new TestReadPolicyCountersUsesStatusIPCPolicyRuleCounters (manager_test.go +91) mirrors the r5 TestClearPolicyCountersUsesHelperIPCAndRecordsStatus pattern:

  • Real Unix domain socket listener (not m.inner = nil)
  • Real requestLocked IPC call with Type: "status"
  • Helper sends back ProcessStatus with PolicyRuleCounters: [{RuleID, Packets: 23, Bytes: 2300}]
  • Test then calls m.ReadPolicyCounters(0) and verifies the helper's counters reached the public read path

Verification

  • git show 62ece646:pkg/dataplane/userspace/manager_test.go — confirmed: nonzero counters set in helper response, ReadPolicyCounters called after IPC roundtrip.
  • Mirrors the r5 Clear-IPC test structure (Unix socket, ControlRequest decode, ControlResponse encode).
  • m.recordHelperStatusLocked(&status) is called after the IPC, simulating the production status-refresh path.

Recommendation

MERGE-READY on the r5 Read-path MINOR. The Clear + Read IPC tests now both pin counter behavior across real helper roundtrip.

Awaiting Codex (task-mpalim0e-kct9hw) + Gemini Pro 3 (task-mpaliy4j-48o1rw). Not merging — author's decision.

Owner Author

psaab commented May 18, 2026

Round-6 triple-review synthesis on 62ece646

| Reviewer | Verdict |
| --- | --- |
| Claude | MERGE-READY |
| Codex | MERGE-READY (narrowly) |
| Gemini Pro 3 | MERGE-NEEDS-MINOR |

r5 Read-path IPC test — CLOSED

Both reviewers confirm real Unix-socket IPC roundtrip exercises the production manager path:

  • m := New() (m.inner non-nil)
  • Real net.Listen("unix", controlSock) fake helper
  • requestLocked(ControlRequest{Type: "status"}) → real JSON encode/decode
  • Helper returns PolicyRuleCounters: [{RuleID, Packets: 23, Bytes: 2300}]
  • recordHelperStatusLocked ingests
  • ReadPolicyCounters(0) returns the helper's counters

Codex MINOR caveat (C — "mostly yes")

"It does not seed an explicit same-rule stale zero entry, which would be a sharper fixture, but the intended stale-cache-through-status-refresh path is covered."

The starting cache is empty rather than "preloaded with stale zero counters for the same RuleID". The test still proves the IPC-fresh-counter is what ReadPolicyCounters returns; what it doesn't pin is "fresh helper overwrites a pre-existing stale entry". Codex calls this a non-blocking sharpness nit.

Gemini MINOR (B+C)

Same framing as Codex on point C. Adds point B: "no helper-restart simulation or multi-roundtrip". Both are sharpness nits, not coverage holes for the original r3+r4 concern.

Recommendation

MERGE-READY. The Clear + Read IPC tests now both pin counter behavior across real helper roundtrip. The "stale-cache overlay" + "multi-roundtrip" cases are reasonable follow-ups for an issue, not blockers for this PR. The original r3+r4 production-path coverage gap is closed.

Codex task: task-mpalim0e-kct9hw. Gemini task: task-mpaliy4j-48o1rw. Not merging — author's decision.

Owner Author

psaab commented May 18, 2026

Follow-up fix on a7f374bc

Closed the remaining read-path IPC test-hardening nit. The policy counter IPC test now seeds stale same-rule zero state, performs two real helper status roundtrips, and asserts ReadPolicyCounters tracks the fresh helper data across refreshes instead of accidentally passing from cached state.

Validation:

  • go test ./pkg/dataplane/userspace
  • git diff --check

Owner Author

psaab commented May 18, 2026

Claude r7 review on a7f374bc

Verdict: MERGE-READY (pending Codex/Gemini)

r6 sharpness MINOR closure

r6 reviewers asked for a stale-cache overlay + multi-roundtrip helper coverage. r7 refactor delivers both:

responses := []PolicyRuleCounterStatus{
    {RuleID: wantRuleID, Packets: 23, Bytes: 2300},
    {RuleID: wantRuleID, Packets: 31, Bytes: 3100},
}
reqCh := make(chan ControlRequest, len(responses))
done := make(chan struct{}, len(responses))
go func() {
    for _, counter := range responses {
        conn, err := ln.Accept()
        ...
        _ = json.NewEncoder(conn).Encode(ControlResponse{
            OK: true,
            Status: &ProcessStatus{
                PID:                4321,
                PolicyRuleCounters: []PolicyRuleCounterStatus{counter},
            },
        })
        done <- struct{}{}
    }
}()

Same RuleID, two distinct counter values, two real Unix-socket IPC roundtrips. Test should assert ReadPolicyCounters returns the LATEST helper response after each roundtrip.

Verification

  • git show a7f374bc:pkg/dataplane/userspace/manager_test.go confirms two-response loop with channel buffer sized for 2.
  • accept loop is sequential — each ln.Accept() waits for the next client connect, so ordering is deterministic (no race).
  • Real m.proc + m.cfg.ControlSocket path, m.inner non-nil.

Recommendation

MERGE-READY on the r6 MINOR. The IPC test now covers:

  1. Real helper IPC (m.inner != nil) — closed in r6
  2. Counter value reflects latest helper sync — closed in r7
  3. Multi-roundtrip refresh — closed in r7
  4. Same-RuleID counter update (not just append) — closed in r7

Awaiting Codex (task-mpambkr5-pyxiwv) + Gemini Pro 3 (task-mpambwfs-2q77ug). Not merging — author's decision.

Owner Author

psaab commented May 18, 2026

Round-7 triple-review synthesis on a7f374bc

| Reviewer | Verdict |
| --- | --- |
| Claude | MERGE-READY |
| Codex | MERGE-READY |
| Gemini Pro 3 | MERGE-READY |

All three converge. r6 sharpness MINOR is closed.

What r7 confirmed

Codex (no findings):

"A. Yes. refreshStatus() is called twice, and the fake helper serves two sequential status responses: 23/2300, then 31/3100... B. Yes. The second ReadPolicyCounters(0) explicitly expects Packets: 31, Bytes: 3100, so it would fail if the first helper value 23/2300 stayed cached. C. Yes. m.lastStatus is preloaded before either IPC with PolicyRuleCounters containing the same wantRuleID; packets/bytes are zero by Go default. D. Yes. Both refreshes call m.requestLocked(ControlRequest{Type: "status"}, &status), which goes through requestDetailedLocked and net.DialTimeout("unix", m.cfg.ControlSocket, ...). This is not the m.inner=nil cache-only shortcut. E. Safe."

Gemini Pro 3 (quote-line verification of all four checks A-D, all PASS).

Coverage closure

The IPC test now pins:

  1. Real helper IPC (m.inner != nil) — closed in r6
  2. Counter value reflects latest helper sync — closed in r7
  3. Multi-roundtrip refresh (two sequential IPCs) — closed in r7
  4. Same-RuleID counter update with stale-cache overlay (m.lastStatus preloaded with zero) — closed in r7
  5. No deadlock / race risk (buffered channels, sequential accept loop) — verified

Recommendation

Merge-ready. Four-round IPC test arc concluded.

Codex task: task-mpambkr5-pyxiwv. Gemini task: task-mpambwfs-2q77ug. Not merging — author's decision.

psaab and others added 13 commits May 17, 2026 20:20
Agent-Logs-Url: https://github.com/psaab/xpf/sessions/44910551-6d29-4da9-916f-072abba2e573

Co-authored-by: psaab <196946+psaab@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 18, 2026 03:26
@psaab psaab force-pushed the codex/1403-1378-scheduler-closeout branch from a7f374b to f17e714 Compare May 18, 2026 03:26
@psaab psaab merged commit afab2ab into master May 18, 2026
2 checks passed
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

pkg/dataplane/userspace/policycounters.go:87

  • ReadPolicyCounters adds the legacy inner counter and the helper counter (total.Packets += counter.Packets, same for bytes). This is correct only as long as exactly one of the two backends is actually reporting non-zero values; if a deployment ever has both the legacy eBPF dataplane loaded and the userspace helper reporting for the same rule, hit counters will be double-counted. Consider either documenting/asserting the mutual-exclusion invariant, or preferring helper values over inner ones when a helper rule-id match is found (rather than summing).
	var total dataplane.CounterValue
	var innerErr error
	if m.inner != nil {
		total, innerErr = m.inner.ReadPolicyCounters(policyID)
	}

	m.mu.Lock()
	defer m.mu.Unlock()

	cfg := (*config.Config)(nil)
	if m.lastSnapshot != nil {
		cfg = m.lastSnapshot.Config
	}
	ruleID := policyRuleIDForCounter(cfg, policyID)
	if ruleID == "" {
		if innerErr != nil {
			return dataplane.CounterValue{}, innerErr
		}
		return total, nil
	}
	counter, ok := buildPolicyRuleCounterIndex(&m.lastStatus)[ruleID]
	if !ok {
		if innerErr != nil {
			return dataplane.CounterValue{}, innerErr
		}
		return total, nil
	}
	total.Packets += counter.Packets
	total.Bytes += counter.Bytes
	return total, nil
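The double-counting risk in this comment comes down to how two sources are merged. The reviewed function is Go; the sketch below restates only the merge step in Rust, with hypothetical names (`CounterValue` as a plain struct, `merge_by_sum`, `merge_prefer_helper`) and without any of the locking, config lookup, or error handling in the excerpt.

```rust
/// Plain packets/bytes pair used for illustration.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct CounterValue {
    packets: u64,
    bytes: u64,
}

/// Behaviour described in the comment: when the helper has a match for the
/// rule, its value is added to whatever the inner backend reported.
fn merge_by_sum(inner: CounterValue, helper: Option<CounterValue>) -> CounterValue {
    match helper {
        Some(h) => CounterValue {
            packets: inner.packets + h.packets,
            bytes: inner.bytes + h.bytes,
        },
        None => inner,
    }
}

/// Alternative raised in the comment: a helper match replaces the inner
/// value instead of being added to it.
fn merge_prefer_helper(inner: CounterValue, helper: Option<CounterValue>) -> CounterValue {
    helper.unwrap_or(inner)
}
```

When only one backend reports non-zero values the two strategies agree; they diverge only when both backends count the same traffic, which is the case the comment asks to rule out or document.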

Comment on lines +31 to +43
var currentSet uint32
for _, zpp := range cfg.Security.Policies {
	if zpp == nil {
		continue
	}
	if currentSet == policySetID {
		if int(ruleIndex) >= len(zpp.Policies) || zpp.Policies[ruleIndex] == nil {
			return ""
		}
		return stablePolicyRuleID(zpp.FromZone, zpp.ToZone, zpp.Policies[ruleIndex].Name)
	}
	currentSet++
}
Comment thread pkg/api/handlers.go
Comment on lines 245 to 251
if s.dp != nil && s.dp.IsLoaded() {
	policyID := policySetID*dataplane.MaxRulesPerPolicy + uint32(len(pi.Rules))
	if ctrs, err := s.dp.ReadPolicyCounters(policyID); err == nil {
		pr.HitPackets = ctrs.Packets
		pr.HitBytes = ctrs.Bytes
	}
}
Comment on lines +380 to +384
) -> PolicyAction {
    evaluate_policy_with_len(
        state, from_id, to_id, src_ip, dst_ip, protocol, src_port, dst_port, 0,
    )
}
psaab added a commit that referenced this pull request May 18, 2026
Preserves scheduler rule counters (#1407) in policy.

- Expose userspace scheduler policy counters
- docs: correct #1378 scheduler test inventory
- log: use relative cargo manifest path
- test: cover scheduler policy counters in API
- test: cover userspace policy counters in metrics and grpc global path
- fix: harden policy metrics counter ID mapping

Touches pkg/dataplane/userspace, pkg/grpcapi/server_show_zones_test.go,
userspace-dp/src/policy_tests.rs, and pkg/api/policy_counters_test.go
across documentation, tests, configuration schema, userspace dataplane,
dataplane integration, and operator/API output. The largest file deltas
are pkg/dataplane/userspace/manager_test.go,
pkg/grpcapi/server_show_zones_test.go, and
userspace-dp/src/policy_tests.rs. The diff is 1405 additions and 85
deletions across 25 files.