Add userspace policy scheduler evidence validator #1416
Pull request overview
Adds a userspace policy-scheduler evidence validator and updates scheduler retirement documentation to narrow #1378 to live HA artifact capture, while adding daemon test coverage around scheduler state seeding for userspace applies.
Changes:
- Adds `test/incus/policy_scheduler_validate.py` plus unit tests for validating active/rebuild/inactive/failover scheduler artifacts.
- Adds daemon scheduler apply test scaffolding for userspace scheduler state seeding.
- Updates #1378/#1373 documentation and `_Log.md` to reflect the remaining evidence gate.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `test/incus/policy_scheduler_validate.py` | New deterministic validator for userspace scheduler evidence artifacts. |
| `test/incus/policy_scheduler_validate_test.py` | Unit tests for validator pass/fail scenarios. |
| `pkg/daemon/policy_scheduler_apply_test.go` | Adds userspace dataplane test double and scheduler apply-path test. |
| `docs/userspace-dataplane-gaps.md` | Narrows #1378 status to live HA evidence capture. |
| `docs/pr/1373-retire-ebpf-dataplane/README.md` | Updates #1378 gap status summary. |
| `docs/pr/1373-retire-ebpf-dataplane/plan-1378-policy-schedulers.md` | Documents closeout evidence harness and validation commands. |
| `_Log.md` | Records the policy-scheduler closeout slice action and validation. |

    activeState := d.policySchedulerActiveStateForApplyLocked(cfg, testPolicySchedulerApplyNow())
    d.seedPolicySchedulerActiveStateLocked(activeState)
    compileResult, err := dp.Compile(cfg)
    if err != nil {
        t.Fatalf("Compile: %v", err)
    }
    d.publishInitialPolicySchedulerStateLocked(cfg, activeState, compileResult)

    return int(value)

    try:
        return int(value)
    except (TypeError, ValueError):
        return default

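The snippet above appears to guard an integer coercion against malformed artifact values. A minimal, self-contained sketch of such a helper follows. The name `as_int` and the default of `0` are assumptions for illustration, not names taken from the PR.

```python
from typing import Any


def as_int(value: Any, default: int = 0) -> int:
    """Coerce value to int, returning default when conversion is impossible.

    Hypothetical helper: the name and default are illustrative only.
    """
    try:
        return int(value)
    except (TypeError, ValueError):
        # TypeError covers None and containers; ValueError covers strings like "n/a".
        return default
```

With this shape, a counter field that is absent or non-numeric in a status artifact falls back to `default` instead of aborting the validator with a traceback.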
Claude r1 review on
Round-1 triple-review synthesis on
| Reviewer | Verdict |
|---|---|
| Claude | MERGE-READY |
| Codex | MERGE-NEEDS-MAJOR (validator too lenient + Apply test bypasses real path) |
| Gemini Pro 3 | MERGE-READY (with MINOR feedback) |
Codex MAJORs
MAJOR 1: Apply test bypasses the real Apply path:
"`pkg/daemon/policy_scheduler_apply_test.go:176` does not actually exercise `applyConfigLocked`. It manually calls the expected helper sequence, so a regression in the real Apply path at `daemon_apply.go:441-459` could still pass."
MAJOR 2: Validator skips entry_programs check:
"`test/incus/policy_scheduler_validate.py:74-83` treats `dataplane_mode` and `entry_programs` as optional. Raw Rust helper status does not include those daemon-enriched fields. Artifacts can pass without proving the attached XDP program is `xdp_userspace_prog` or that the daemon has not fallen back to eBPF."
Gemini independently flagged the same issue:
"If the status JSON omits `entry_programs` entirely or provides an empty dictionary, `and programs` evaluates to False. The check is silently skipped, yielding a pass without actually verifying `xdp_userspace` attachment."
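The short-circuit both reviewers describe can be sketched as follows. This is an illustration under assumptions, not the validator's actual code: the function names are hypothetical, and `entry_programs` is assumed to map an interface name to the attached program name.

```python
from typing import Any, Mapping


class ValidationFailure(Exception):
    """Raised when an artifact does not prove the expected state."""


EXPECTED_PROGRAM = "xdp_userspace_prog"  # program name quoted in the review


def check_entry_programs_lenient(status: Mapping[str, Any]) -> None:
    # Hypothetical reconstruction of the lenient pattern under review.
    programs = status.get("entry_programs")
    # A missing or empty value makes `programs and ...` falsy, so nothing is checked.
    if programs and EXPECTED_PROGRAM not in programs.values():
        raise ValidationFailure("unexpected entry program")


def check_entry_programs_strict(status: Mapping[str, Any]) -> None:
    # Hypothetical mandatory variant: absence is itself a failure.
    programs = status.get("entry_programs")
    if not isinstance(programs, dict) or not programs:
        raise ValidationFailure("entry_programs must be a non-empty object")
    if EXPECTED_PROGRAM not in programs.values():
        raise ValidationFailure("unexpected entry program")
```

Calling the lenient variant with `{}` or `{"entry_programs": {}}` returns without error, which is the silent pass on eBPF fallback that the review objects to. The strict variant rejects both inputs.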
MINOR: Failover validation artifact-only:
"Required `failover-status.json` only checks that the counter advanced. Does not prove RG ownership changed, node identity changed, HA state is active, or status came from the new owner."
Gemini also found
Test bug: test_fails_when_missing_scheduler_commit_succeeds uses "commit complete" which fails the FIRST regex (references undefined scheduler) before reaching the success-blocker assertion. So the second regex's branch is never exercised.
Harness false-negative: "If the external test harness (Incus) simply fails to send any traffic during the inactive phase, the counters will naturally remain equal, and the validator will pass a potentially fail-open scheduler."
Recommendation
Block on (Codex MAJORs):
1. `policy_scheduler_apply_test.go`: drive through `applyConfigLocked` rather than calling the helper sequence directly. Without this, a regression in the real Apply path passes.
2. Make `entry_programs` MANDATORY in the validator: raise `ValidationFailure` if missing or empty. Optional today = silent pass on eBPF fallback.
Strongly consider:
3. Add HA ownership/node-identity check to failover-status.json validation (Codex MINOR).
4. Fix the test_fails_when_missing_scheduler_commit_succeeds payload to exercise both regex branches (Gemini MINOR).
Document harness contract:
5. The validator pre-supposes the Incus harness sent traffic during inactive. Add an INACTIVE_TRAFFIC_ATTEMPTS field that the harness records, and have the validator assert it's > 0 (Gemini MINOR).
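A rough sketch of the harness contract in item 5, assuming the harness writes its `INACTIVE_TRAFFIC_ATTEMPTS` value into a metadata mapping the validator can read. The function name, parameter names, and error wording are hypothetical.

```python
from typing import Any, Mapping


class ValidationFailure(Exception):
    """Raised when an artifact does not prove the expected state."""


def check_inactive_phase(
    before_packets: int,
    after_packets: int,
    harness: Mapping[str, Any],
) -> None:
    # Hypothetical field recorded by the harness, as proposed in the review.
    attempts = harness.get("INACTIVE_TRAFFIC_ATTEMPTS", 0)
    if isinstance(attempts, bool) or not isinstance(attempts, int) or attempts <= 0:
        # Equal counters prove nothing if no traffic was ever sent.
        raise ValidationFailure("inactive: harness recorded no traffic attempts")
    if after_packets != before_packets:
        raise ValidationFailure(
            f"inactive: policy counter advanced ({before_packets} -> {after_packets})"
        )
```

Checking the attempt count first means a harness that sent nothing is reported as a harness problem rather than passing as an inactive scheduler.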
Codex task: task-mpaotpjs-qd644o. Gemini task: task-mpaotxjx-mg8syt. Not merging — author's decision.
Claude r2 review on
66775b2 to 08a81cb
Round-2 triple-review synthesis on
| Reviewer | Verdict |
|---|---|
| Claude | MERGE-READY |
| Codex | MERGE-NEEDS-MINOR (failover identity unverified, shadow bug not fixed) |
| Gemini Pro 3 | MERGE-NEEDS-MAJOR (her own r1 shadow bug not fixed) |
Two r1 MAJORs closed
Codex+Gemini both verify:
- `apply_test.go` now drives `applyConfigLocked` end-to-end (was bypassed in r1)
- `entry_programs` validator check is now mandatory (was optional in r1)
Codex: "`TestApplyConfigSeedsUserspacePolicySchedulerStateBeforeCompile` now enters through `d.applyConfigLocked(cfg)` at `policy_scheduler_apply_test.go:182`. The production path computes scheduler state, seeds it, then calls `Compile`."
Gemini: "The test now directly drives `d.applyConfigLocked()` instead of artificially seeding internal state."
Gemini MAJOR — her r1 test shadow bug NOT fixed
Gemini: "The +14 line addition to `test/incus/policy_scheduler_validate_test.py` consists EXCLUSIVELY of the new `test_fails_when_entry_programs_missing` test. The shadowing bug remains unfixed because `test_fails_when_missing_scheduler_commit_succeeds` was completely untouched."
"Because the test injects `missing_text='commit complete'`, it fails the FIRST regex (as it lacks the rejection string) and raises the `'strict rejection'` error. This completely shadows the actual `'commit complete'` check on the next line. The test still incorrectly asserts `'strict rejection'` and falsely passes on the wrong code path."
Codex confirms independently:
"Gemini's test bug remains: `test_fails_when_missing_scheduler_commit_succeeds` writes only `'commit complete'` and expects `'strict rejection'` at `policy_scheduler_validate_test.py:129-138`, so it still trips the first missing-rejection regex and never exercises the later successful-commit guard at `policy_scheduler_validate.py:174-176`."
This is a real test bug that Gemini correctly flagged in r1 — the author missed it.
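The shadowing can be reproduced with two sequential guards. The sketch below is an assumption-laden reconstruction: the regexes, the function name, and the first error message are hypothetical, chosen only to match the substrings quoted in the review (`'strict rejection'` and `looks like a successful commit`).

```python
import re


class ValidationFailure(Exception):
    """Raised when an artifact does not prove the expected state."""


# Hypothetical patterns; the real validator's regexes are not shown in this thread.
REJECTION_RE = re.compile(r"references undefined scheduler")
SUCCESS_RE = re.compile(r"commit complete")


def check_missing_scheduler_commit(text: str) -> None:
    # Guard 1: the commit output must contain the strict rejection.
    if not REJECTION_RE.search(text):
        raise ValidationFailure("missing scheduler commit lacks strict rejection")
    # Guard 2: even with a rejection present, a success marker is a failure.
    if SUCCESS_RE.search(text):
        raise ValidationFailure(
            "missing scheduler commit artifact looks like a successful commit"
        )
```

A payload of only `"commit complete"` never reaches guard 2, because guard 1 raises first. Only a payload containing both the rejection text and `commit complete` exercises the second guard, which is why the test needs both strings.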
Codex MINORs
- Failover ownership / node identity not validated by the python validator (artifact-trust only)
- entry_programs absent-vs-empty: the new test only covers ABSENT case, not empty dict (though same code path)
- Apply test doesn't pin counter survival (counter survival is pinned in Rust `policy_tests.rs`, so OK)
Recommendation
Block on (Gemini MAJOR): fix test_fails_when_missing_scheduler_commit_succeeds — payload must contain BOTH the strict rejection string AND "commit complete" so the second regex actually fires. Otherwise this test branch never exercises the success-blocker logic and the validator could let a missing-scheduler commit pass.
Fixed example:
`_write_artifacts(root, missing_text="references undefined scheduler ... commit complete")`
Then the test should EXPECT `ValidationFailure("missing scheduler commit artifact looks like a successful commit")` (the SECOND regex's message), not the first.
Defer:
- Failover ownership/identity validation (Codex MINOR)
- Empty-dict entry_programs case (covered by same code path as absent)
Codex task: task-mpapy16s-jaiuf5. Gemini task: task-mpapy9lw-a1c5yj. Not merging — author's decision.
Make entry_programs mandatory in policy-scheduler validation and drive the daemon regression through applyConfigLocked. This closes the optional-validator bypass and keeps the apply test on the production path.

Close the validator-test shadow bug by making the missing-scheduler success test include both the strict rejection text and commit-complete text, so it reaches the successful-commit guard instead of failing the earlier rejection-presence check. Add explicit empty-entry_programs coverage for the mandatory runtime gate.
08a81cb to f794be9

    required = rebuild_counter.packets + min_failover_packet_delta
    if failover_counter.packets < required:
        raise ValidationFailure(
            "failover: policy counter did not advance on the new userspace owner "
            f"({failover_counter.packets} < {required})"
        )

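To make the threshold arithmetic in the snippet above concrete, here is a self-contained sketch. The `PolicyCounter` dataclass and the wrapper function name are assumptions for illustration. The comparison and message follow the quoted lines.

```python
from dataclasses import dataclass


class ValidationFailure(Exception):
    """Raised when an artifact does not prove the expected state."""


@dataclass(frozen=True)
class PolicyCounter:
    # Hypothetical container; only the `packets` attribute appears in the quoted code.
    packets: int


def check_failover_advance(
    rebuild_counter: PolicyCounter,
    failover_counter: PolicyCounter,
    min_failover_packet_delta: int,
) -> None:
    required = rebuild_counter.packets + min_failover_packet_delta
    if failover_counter.packets < required:
        raise ValidationFailure(
            "failover: policy counter did not advance on the new userspace owner "
            f"({failover_counter.packets} < {required})"
        )
```

For example, with a rebuild counter of 100 packets and a minimum delta of 10, a failover counter of 105 fails because 105 < 110, while 110 or more passes.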
Claude r4 review on
Round-4 triple-review synthesis on
| Reviewer | Verdict |
|---|---|
| Claude | MERGE-READY |
| Codex | MERGE-NEEDS-MINOR (failover ownership identity still deferred) |
| Gemini Pro 3 | MERGE-READY |
Gemini r1 shadow-bug MAJOR — VERIFIED CLOSED
Both Codex and Gemini independently confirmed the fix:
Codex:
"A: Yes. `test_fails_when_missing_scheduler_commit_succeeds` now writes both the strict rejection string and `commit complete` in the same payload."

"B: Yes. The expected failure now matches `'looks like a successful commit'`, so it targets the second validator guard, not the first."

"I also sanity-checked the validator in-memory from `f794be92`: the paired rejection/success payload raises `missing scheduler commit artifact looks like a successful commit`."
Gemini quoted the fix:

    def test_fails_when_missing_scheduler_commit_succeeds(self) -> None:
        ...
            _write_artifacts(
                root,
                missing_text=(
                    'policy "scheduled-allow" references undefined scheduler "missing"\n'
                    "commit complete"
                ),
            )
            with self.assertRaisesRegex(
                policy_validate.ValidationFailure,
                "looks like a successful commit",
            ):
                ...

Shadow bug is properly closed: payload trips the SECOND regex; expected error message matches the SECOND guard's wording.
Empty entry_programs case — VERIFIED ADDED
Codex+Gemini confirmed both `test_fails_when_entry_programs_missing` and `test_fails_when_entry_programs_empty` are now present and both expect the same `entry_programs must be a non-empty object` error.
Codex MINOR (carries forward from r2)
"r4 still does not address the failover ownership identity concern. `validate_artifacts()` only checks that `failover-status.json` has a counter at least `rebuild + delta`; it does not prove the status came from the new owner/node."
Deferrable — this is a deeper integration check that needs harness-side machinery to record the node identity.
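For the follow-up issue, one possible shape of the deferred check is sketched below. Everything here is hypothetical: the `node_id` and `ha_state` fields are not known to exist in the status artifacts and would need the harness-side machinery mentioned above to record them.

```python
from typing import Any, Mapping


class ValidationFailure(Exception):
    """Raised when an artifact does not prove the expected state."""


def check_failover_owner(
    rebuild_status: Mapping[str, Any],
    failover_status: Mapping[str, Any],
) -> None:
    # Hypothetical fields: "node_id" and "ha_state" are illustrative names only.
    old_node = rebuild_status.get("node_id")
    new_node = failover_status.get("node_id")
    if not old_node or not new_node:
        raise ValidationFailure("failover: node identity missing from status")
    if old_node == new_node:
        raise ValidationFailure("failover: status did not come from a new owner")
    if failover_status.get("ha_state") != "active":
        raise ValidationFailure("failover: new owner is not HA-active")
```

A check of this kind would complement the counter threshold rather than replace it: the counter shows traffic was scheduled, and the identity fields would show which node scheduled it.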
Recommendation
MERGE-READY. Gemini's r1 shadow-bug catch is structurally closed; the success-blocker guard is finally exercised. Defer the failover ownership identity MINOR for a follow-up issue.
Codex task: task-mpaqzv0t-grpp1n. Gemini task: task-mpar02bb-qry2lt. Not merging — author's decision.
* userspace: add policy scheduler evidence validator
* scheduler: validate entry programs through real apply path

Make entry_programs mandatory in policy-scheduler validation and drive the daemon regression through applyConfigLocked. This closes the optional-validator bypass and keeps the apply test on the production path.

Close the validator-test shadow bug by making the missing-scheduler success test include both the strict rejection text and commit-complete text, so it reaches the successful-commit guard instead of failing the earlier rejection-presence check. Add explicit empty-entry_programs coverage for the mandatory runtime gate.
Summary
Tests
Refs #1378