Add userspace policy scheduler evidence validator by psaab · Pull Request #1416 · psaab/xpf

psaab · 2026-05-18T04:03:52Z

Summary

add a deterministic validator for Feature gap: time-based policy schedulers (Junos schedulers { ... }) not propagated to userspace-dp #1378 userspace scheduler active/rebuild/inactive/failover artifacts
pin the userspace apply path so scheduler state seeds compile without legacy policy-map updates
narrow Feature gap: time-based policy schedulers (Junos schedulers { ... }) not propagated to userspace-dp #1378 docs to the remaining live HA artifact capture gate

Tests

python3 test/incus/policy_scheduler_validate_test.py
go test ./pkg/daemon -run 'TestPolicyScheduler|TestApplyConfig.*Scheduler|TestApplyConfig.*Protocol'
go test ./pkg/config -run 'Test.*PolicyScheduler.*ReferenceRejectsCommit'
go test ./pkg/dataplane/userspace -run 'Test(BuildPolicySnapshotsRoundTripsSchedulerInactiveAndRuleID|UpdatePolicyScheduleState)'
cargo test --manifest-path userspace-dp/Cargo.toml policy:: -- --nocapture
git diff --check

Copilot

Pull request overview

Adds a userspace policy-scheduler evidence validator and updates scheduler retirement documentation to narrow #1378 to live HA artifact capture, while adding daemon test coverage around scheduler state seeding for userspace applies.

Changes:

Adds test/incus/policy_scheduler_validate.py plus unit tests for validating active/rebuild/inactive/failover scheduler artifacts.
Adds daemon scheduler apply test scaffolding for userspace scheduler state seeding.
Updates #1378/#1373 documentation and _Log.md to reflect the remaining evidence gate.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`test/incus/policy_scheduler_validate.py`	New deterministic validator for userspace scheduler evidence artifacts.
`test/incus/policy_scheduler_validate_test.py`	Unit tests for validator pass/fail scenarios.
`pkg/daemon/policy_scheduler_apply_test.go`	Adds userspace dataplane test double and scheduler apply-path test.
`docs/userspace-dataplane-gaps.md`	Narrows #1378 status to live HA evidence capture.
`docs/pr/1373-retire-ebpf-dataplane/README.md`	Updates #1378 gap status summary.
`docs/pr/1373-retire-ebpf-dataplane/plan-1378-policy-schedulers.md`	Documents closeout evidence harness and validation commands.
`_Log.md`	Records the policy-scheduler closeout slice action and validation.

+	activeState := d.policySchedulerActiveStateForApplyLocked(cfg, testPolicySchedulerApplyNow())
+	d.seedPolicySchedulerActiveStateLocked(activeState)
+	compileResult, err := dp.Compile(cfg)
+	if err != nil {
+		t.Fatalf("Compile: %v", err)
+	}
+	d.publishInitialPolicySchedulerStateLocked(cfg, activeState, compileResult)


+        return int(value)
+    try:
+        return int(value)
+    except (TypeError, ValueError):
+        return default


psaab · 2026-05-18T04:14:52Z

Claude r1 review on `3369e5aa`

Verdict: MERGE-READY (pending Codex/Gemini)

Continues #1378 scheduler retirement work. Adds a python validator (test/incus/policy_scheduler_validate.py — 100 LOC) + meta-test for the validator + daemon test + docs refresh. Total 523 LOC, mostly test infrastructure.

Hostile concerns

A. What invariants does the python validator actually pin? Codex needs to walk the assertion logic and verify it's not a tautology.
B. Designed for CI cluster or local incus? Make target / harness integration?
C. policy_scheduler_apply_test.go (Go-side) — does it cover the Apply-path that #1407 IPC tests don't?
D. False-negative risk: validator passes when scheduler state is actually wrong because of an underspecified assertion?
E. Does the validator exercise live HA failover (the #1378 outstanding gap) or just snapshot-only validation?

Recommendation

MERGE-READY structurally (test infra adds coverage; doesn't change production logic). Awaiting Codex (task-mpaotpjs-qd644o) + Gemini Pro 3 (task-mpaotxjx-mg8syt) verification of validator scope.

Not merging — author's decision.

psaab · 2026-05-18T04:22:38Z

Round-1 triple-review synthesis on `3369e5aa`

Reviewer	Verdict
Claude	MERGE-READY
Codex	MERGE-NEEDS-MAJOR (validator too lenient + Apply test bypasses real path)
Gemini Pro 3	MERGE-READY (with MINOR feedback)

Codex MAJORs

MAJOR 1: Apply test bypasses the real Apply path:

"pkg/daemon/policy_scheduler_apply_test.go:176 does not actually exercise applyConfigLocked. It manually calls the expected helper sequence, so a regression in the real Apply path at daemon_apply.go:441-459 could still pass."

MAJOR 2: Validator skips entry_programs check:

"test/incus/policy_scheduler_validate.py:74-83 treats dataplane_mode and entry_programs as optional. Raw Rust helper status does not include those daemon-enriched fields. Artifacts can pass without proving the attached XDP program is xdp_userspace_prog or that the daemon has not fallen back to eBPF."

Gemini independently flagged the same issue:

"If the status JSON omits entry_programs entirely or provides an empty dictionary, and programs evaluates to False. The check is silently skipped, yielding a pass without actually verifying xdp_userspace attachment."

MINOR: Failover validation artifact-only:

"Required failover-status.json only checks that the counter advanced. Does not prove RG ownership changed, node identity changed, HA state is active, or status came from the new owner."

Gemini also found

Test bug: test_fails_when_missing_scheduler_commit_succeeds uses "commit complete" which fails the FIRST regex (references undefined scheduler) before reaching the success-blocker assertion. So the second regex's branch is never exercised.

Harness false-negative: "If the external test harness (Incus) simply fails to send any traffic during the inactive phase, the counters will naturally remain equal, and the validator will pass a potentially fail-open scheduler."

Recommendation

Block on (Codex MAJORs):

policy_scheduler_apply_test.go: drive through applyConfigLocked rather than calling the helper sequence directly. Without this, a regression in the real Apply path passes.
Make entry_programs MANDATORY in the validator — raise ValidationFailure if missing or empty. Optional today = silent pass on eBPF fallback.

Strongly consider:
3. Add HA ownership/node-identity check to failover-status.json validation (Codex MINOR).
4. Fix the test_fails_when_missing_scheduler_commit_succeeds payload to exercise both regex branches (Gemini MINOR).

Document harness contract:
5. The validator pre-supposes the Incus harness sent traffic during inactive. Add an INACTIVE_TRAFFIC_ATTEMPTS field that the harness records, and have the validator assert it's > 0 (Gemini MINOR).

Codex task: task-mpaotpjs-qd644o. Gemini task: task-mpaotxjx-mg8syt. Not merging — author's decision.

psaab · 2026-05-18T04:45:03Z

Claude r2 review on `66775b21`

Verdict: MERGE-READY (pending Codex/Gemini)

r1 MAJOR closures verified

MAJOR 2 (entry_programs validation) — CLOSED. policy_scheduler_validate.py:

programs = status.get("entry_programs")
if not isinstance(programs, dict) or not programs:
    raise ValidationFailure(f"{label}: entry_programs must be a non-empty object")
if not any("xdp_userspace" in str(name) for name in programs.values()):
    raise ValidationFailure(...)

Was if isinstance(programs, dict) and programs: (silent pass on missing). Now if not isinstance(...) or not programs: raise. Mandatory check.

MAJOR 1 (apply test bypassed applyConfigLocked) — CLOSED via +63/-19 refactor. Need Codex to verify the test entry point now drives applyConfigLocked rather than calling helpers directly. The diff size suggests a substantive refactor of the test plumbing.

Validator test fix (Gemini r1 MINOR) — policy_scheduler_validate_test.py +14 likely fixes the commit complete regex-shadowing bug. Need Codex to verify.

Recommendation

MERGE-READY structurally. Awaiting Codex (task-mpapy16s-jaiuf5) + Gemini Pro 3 (task-mpapy9lw-a1c5yj) for verification of the applyConfigLocked test path.

Not merging — author's decision.

psaab · 2026-05-18T04:50:59Z

Round-2 triple-review synthesis on `66775b21`

Reviewer	Verdict
Claude	MERGE-READY
Codex	MERGE-NEEDS-MINOR (failover identity unverified, shadow bug not fixed)
Gemini Pro 3	MERGE-NEEDS-MAJOR (her own r1 shadow bug not fixed)

Two r1 MAJORs closed

Codex+Gemini both verify:

apply_test.go now drives applyConfigLocked end-to-end (was bypassed in r1)
entry_programs validator check is now mandatory (was optional in r1)

Codex: "TestApplyConfigSeedsUserspacePolicySchedulerStateBeforeCompile now enters through d.applyConfigLocked(cfg) at policy_scheduler_apply_test.go:182. The production path computes scheduler state, seeds it, then calls Compile."

Gemini: "The test now directly drives d.applyConfigLocked() instead of artificially seeding internal state."

Gemini MAJOR — her r1 test shadow bug NOT fixed

Gemini: "The +14 line addition to test/incus/policy_scheduler_validate_test.py consists EXCLUSIVELY of the new test_fails_when_entry_programs_missing test. The shadowing bug remains unfixed because test_fails_when_missing_scheduler_commit_succeeds was completely untouched."

"Because the test injects missing_text='commit complete', it fails the FIRST regex (as it lacks the rejection string) and raises the 'strict rejection' error. This completely shadows the actual 'commit complete' check on the next line. The test still incorrectly asserts 'strict rejection' and falsely passes on the wrong code path."

Codex confirms independently:

"Gemini's test bug remains: test_fails_when_missing_scheduler_commit_succeeds writes only 'commit complete' and expects 'strict rejection' at policy_scheduler_validate_test.py:129-138, so it still trips the first missing-rejection regex and never exercises the later successful-commit guard at policy_scheduler_validate.py:174-176."

This is a real test bug that Gemini correctly flagged in r1 — the author missed it.

Codex MINORs

Failover ownership / node identity not validated by the python validator (artifact-trust only)
entry_programs absent-vs-empty: the new test only covers ABSENT case, not empty dict (though same code path)
Apply test doesn't pin counter survival (counter survival is pinned in Rust policy_tests.rs, so OK)

Recommendation

Block on (Gemini MAJOR): fix test_fails_when_missing_scheduler_commit_succeeds — payload must contain BOTH the strict rejection string AND "commit complete" so the second regex actually fires. Otherwise this test branch never exercises the success-blocker logic and the validator could let a missing-scheduler commit pass.

Fixed example:

_write_artifacts(root, missing_text="references undefined scheduler ... commit complete")

Then the test should EXPECT ValidationFailure("missing scheduler commit artifact looks like a successful commit") (the SECOND regex's message), not the first.

Defer:

Failover ownership/identity validation (Codex MINOR)
Empty-dict entry_programs case (covered by same code path as absent)

Codex task: task-mpapy16s-jaiuf5. Gemini task: task-mpapy9lw-a1c5yj. Not merging — author's decision.

Make entry_programs mandatory in policy-scheduler validation and drive the daemon regression through applyConfigLocked. This closes the optional-validator bypass and keeps the apply test on the production path. Close the validator-test shadow bug by making the missing-scheduler success test include both the strict rejection text and commit-complete text, so it reaches the successful-commit guard instead of failing the earlier rejection-presence check. Add explicit empty-entry_programs coverage for the mandatory runtime gate.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

+        required = rebuild_counter.packets + min_failover_packet_delta
+        if failover_counter.packets < required:
+            raise ValidationFailure(
+                "failover: policy counter did not advance on the new userspace owner "
+                f"({failover_counter.packets} < {required})"
+            )


psaab · 2026-05-18T05:13:49Z

Claude r4 review on `f794be92`

Verdict: MERGE-READY (pending Codex/Gemini)

Gemini r1+r2 shadow-bug MAJOR — CLOSED (verified)

The test payload + expected error message both fixed:

def test_fails_when_missing_scheduler_commit_succeeds(self) -> None:
    ...
    _write_artifacts(
        root,
        missing_text=(
            'policy "scheduled-allow" references undefined scheduler "missing"\n'
            "commit complete"
        ),
    )

    with self.assertRaisesRegex(
        policy_validate.ValidationFailure,
        "looks like a successful commit",   # ← was "strict rejection"
    ):
        policy_validate.validate_artifacts(root, rule_id=RULE_ID)

The payload NOW contains BOTH:

references undefined scheduler "missing" — passes the FIRST regex (no longer trips the rejection-missing error)
commit complete — trips the SECOND regex (success-blocker guard)

The expected error is now "looks like a successful commit" — the second regex's message. The success-blocker code path is finally exercised. Gemini r1 shadow-bug catch is closed.

Bonus: empty-vs-absent entry_programs coverage

r4 also adds test_fails_when_entry_programs_empty distinct from test_fails_when_entry_programs_missing — both expect the same entry_programs must be a non-empty object error. Closes Codex r2 MINOR (B point about {} not separately tested).

What's still open (per Codex r2)

Failover ownership / node identity validation deferred (Codex r2 MINOR)
Apply test counter survival not pinned in Go (covered by Rust policy_tests.rs at the other layer — defensible)

Recommendation

MERGE-READY. The two-round saga (Gemini r1 shadow bug → r2 unfixed → r4 fixed) concludes. The success-blocker guard is now actually tested. Defer the remaining failover-identity MINOR.

Awaiting Codex (task-mpaqzv0t-grpp1n) + Gemini Pro 3 (task-mpar02bb-qry2lt). Not merging — author's decision.

psaab · 2026-05-18T05:17:48Z

Round-4 triple-review synthesis on `f794be92`

Reviewer	Verdict
Claude	MERGE-READY
Codex	MERGE-NEEDS-MINOR (failover ownership identity still deferred)
Gemini Pro 3	MERGE-READY

Gemini r1 shadow-bug MAJOR — VERIFIED CLOSED

Both Codex and Gemini independently confirmed the fix:

Codex:

"A: Yes. test_fails_when_missing_scheduler_commit_succeeds now writes both the strict rejection string and commit complete in the same payload."

"B: Yes. The expected failure now matches 'looks like a successful commit', so it targets the second validator guard, not the first."

"I also sanity-checked the validator in-memory from f794be92: the paired rejection/success payload raises missing scheduler commit artifact looks like a successful commit."

Gemini quoted the fix:

def test_fails_when_missing_scheduler_commit_succeeds(self) -> None:
    ...
    _write_artifacts(
        root,
        missing_text=(
            'policy "scheduled-allow" references undefined scheduler "missing"\n'
            "commit complete"
        ),
    )
    with self.assertRaisesRegex(
        policy_validate.ValidationFailure,
        "looks like a successful commit",
    ):

Shadow bug is properly closed: payload trips the SECOND regex; expected error message matches the SECOND guard's wording.

Empty entry_programs case — VERIFIED ADDED

Codex+Gemini confirmed both test_fails_when_entry_programs_missing and test_fails_when_entry_programs_empty are now present and both expect the same entry_programs must be a non-empty object error.

Codex MINOR (carries forward from r2)

"r4 still does not address the failover ownership identity concern. validate_artifacts() only checks that failover-status.json has a counter at least rebuild + delta; it does not prove the status came from the new owner/node."

Deferrable — this is a deeper integration check that needs harness-side machinery to record the node identity.

Recommendation

MERGE-READY. Gemini's r1 shadow-bug catch is structurally closed; the success-blocker guard is finally exercised. Defer the failover ownership identity MINOR for a follow-up issue.

Codex task: task-mpaqzv0t-grpp1n. Gemini task: task-mpar02bb-qry2lt. Not merging — author's decision.

* userspace: add policy scheduler evidence validator * scheduler: validate entry programs through real apply path Make entry_programs mandatory in policy-scheduler validation and drive the daemon regression through applyConfigLocked. This closes the optional-validator bypass and keeps the apply test on the production path. Close the validator-test shadow bug by making the missing-scheduler success test include both the strict rejection text and commit-complete text, so it reaches the successful-commit guard instead of failing the earlier rejection-presence check. Add explicit empty-entry_programs coverage for the mandatory runtime gate.

userspace: add policy scheduler evidence validator

3369e5a

Copilot AI review requested due to automatic review settings May 18, 2026 04:03

Copilot started reviewing on behalf of psaab May 18, 2026 04:04 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

psaab force-pushed the codex/next-1378-scheduler-validation branch from 66775b2 to 08a81cb Compare May 18, 2026 04:45

Copilot AI review requested due to automatic review settings May 18, 2026 05:09

psaab force-pushed the codex/next-1378-scheduler-validation branch from 08a81cb to f794be9 Compare May 18, 2026 05:09

Copilot started reviewing on behalf of psaab May 18, 2026 05:10 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

psaab merged commit 8dcf866 into master May 18, 2026
4 checks passed

psaab mentioned this pull request May 18, 2026

Retire the eBPF dataplane; userspace AF_XDP dataplane becomes the default and only path #1373

Closed

This was referenced May 18, 2026

filter: reuse three-color policer runtimes on refresh #1421

Merged

Add #1378 live policy scheduler evidence #1426

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add userspace policy scheduler evidence validator#1416

Add userspace policy scheduler evidence validator#1416
psaab merged 2 commits into
masterfrom
codex/next-1378-scheduler-validation

psaab commented May 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

psaab commented May 18, 2026

Uh oh!

psaab commented May 18, 2026

Uh oh!

psaab commented May 18, 2026

Uh oh!

psaab commented May 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

psaab commented May 18, 2026

Uh oh!

psaab commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

psaab commented May 18, 2026

Summary

Tests

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

psaab commented May 18, 2026

Claude r1 review on 3369e5aa

Hostile concerns

Recommendation

Uh oh!

psaab commented May 18, 2026

Round-1 triple-review synthesis on 3369e5aa

Codex MAJORs

Gemini also found

Recommendation

Uh oh!

psaab commented May 18, 2026

Claude r2 review on 66775b21

r1 MAJOR closures verified

Recommendation

Uh oh!

psaab commented May 18, 2026

Round-2 triple-review synthesis on 66775b21

Two r1 MAJORs closed

Gemini MAJOR — her r1 test shadow bug NOT fixed

Codex MINORs

Recommendation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

psaab commented May 18, 2026

Claude r4 review on f794be92

Gemini r1+r2 shadow-bug MAJOR — CLOSED (verified)

Bonus: empty-vs-absent entry_programs coverage

What's still open (per Codex r2)

Recommendation

Uh oh!

psaab commented May 18, 2026

Round-4 triple-review synthesis on f794be92

Gemini r1 shadow-bug MAJOR — VERIFIED CLOSED

Empty entry_programs case — VERIFIED ADDED

Codex MINOR (carries forward from r2)

Recommendation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Claude r1 review on `3369e5aa`

Round-1 triple-review synthesis on `3369e5aa`

Claude r2 review on `66775b21`

Round-2 triple-review synthesis on `66775b21`

Claude r4 review on `f794be92`

Round-4 triple-review synthesis on `f794be92`