Add #1378 live policy scheduler evidence #1426

Merged
psaab merged 1 commit into master from codex/b-1378-live-evidence on May 19, 2026

@psaab (Owner) commented May 19, 2026

Summary

Validation

  • python3 test/incus/policy_scheduler_validate.py docs/pr/1373-retire-ebpf-dataplane/evidence-1378-policy-scheduler-live-20260519 --rule-id 'lan->wan/scheduled-allow'
  • git diff --check
  • git diff --cached --check
  • git show -s --format=%B HEAD | awk 'length($0)>72 { print length($0) ":" $0 }'
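The last validation command flags commit-message lines longer than 72 characters. As a minimal sketch of the same check in Python (the function name `overlong_lines` is hypothetical and not part of the repository), assuming the message text has already been read from `git show -s --format=%B HEAD`:

```python
def overlong_lines(message: str, limit: int = 72) -> list[str]:
    """Return 'LENGTH:LINE' for each line longer than `limit`.

    Mirrors the awk filter: length($0) > 72 { print length($0) ":" $0 }.
    Lengths are counted in characters, not bytes.
    """
    return [
        f"{len(line)}:{line}"
        for line in message.splitlines()
        if len(line) > limit
    ]


if __name__ == "__main__":
    # Illustrative message only: one acceptable subject line, one overlong line.
    sample = "Add #1378 live policy scheduler evidence\n\n" + "x" * 73
    for hit in overlong_lines(sample):
        print(hit)
```

An empty result corresponds to the awk command printing nothing, which is the passing case.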

Copilot AI review requested due to automatic review settings May 19, 2026 03:34
Copilot AI (Contributor) left a comment

Pull request overview

This PR captures and commits live HA evidence artifacts for issue #1378 (policy schedulers in userspace dataplane) collected from the loss userspace HA cluster, and updates tracking docs (#1373/#1378) to mark the scheduler live-evidence gate as closed.

Changes:

  • Adds the evidence-1378-policy-scheduler-live-20260519/ artifact set (configs, counters, raw ip-link/XDP captures, failover/restore stdouts, README) accepted by test/incus/policy_scheduler_validate.py --rule-id 'lan->wan/scheduled-allow'.
  • Updates plan-1378-policy-schedulers.md closeout section and switches the documented --rule-id example from trust->untrust/scheduled-allow to lan->wan/scheduled-allow.
  • Updates userspace-dataplane-gaps.md, userspace-dataplane-architecture.md, 1373-retire-ebpf-dataplane/README.md, and plan.md to reflect that #1378 is closed for retirement purposes.

Reviewed changes

Copilot reviewed 53 out of 86 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| docs/userspace-dataplane-gaps.md | Marks #1378 as live HA evidence captured; removes it from remaining work narrative. |
| docs/userspace-dataplane-architecture.md | Reflects scheduler live artifact validation closeout. |
| docs/pr/1373-retire-ebpf-dataplane/README.md | Tracker table entry for #1378 set to closed. |
| docs/pr/1373-retire-ebpf-dataplane/plan.md | Plan table/narrative update reflecting #1378 closure. |
| docs/pr/1373-retire-ebpf-dataplane/plan-1378-policy-schedulers.md | Adds 2026-05-19 closeout note, updates rule-id example, adds validation slice. |
| .../evidence-1378-policy-scheduler-live-20260519/* | New evidence artifact set: configs, counters JSONs, raw link/XDP captures, failover/restore stdouts, README, run metadata. |

@psaab (Owner, Author) commented May 19, 2026

Claude r1 review on 1fd17502

Verdict: MERGE-READY (pending Codex/Gemini scope-check)

+37538/-20 — almost entirely live HA failover evidence artifacts captured under docs/pr/1373-retire-ebpf-dataplane/evidence-1378-policy-scheduler-live-20260519/. The directory contains:

  • README explaining the capture
  • active/rebuild/inactive/failover status JSONs
  • active/inactive/failover traffic logs (stderr/stdout)
  • failover-cluster-status.txt
  • clear-before-active + clear-before-inactive + failover-clear-sessions stdout/stderr
  • active-policy-counters.json
  • active-full.conf
  • failover-load-active-full.stderr

This closes the original outstanding gap in #1378, live HA validation evidence, which had been deferred through previous rounds. The policy_scheduler_validate.py script from #1416 (which we hardened across multiple rounds) is the verification harness; this PR provides the evidence run.

Hostile concerns for Codex/Gemini

A. Scope check: 37538 LOC additions — are they ALL evidence artifacts, or did any production code drift hide in this volume?
B. Validator goalpost not moved during capture — is policy_scheduler_validate.py unchanged between this commit and prior?
C. Does entry_programs in active-status.json show xdp_userspace (the mandatory check from #1416)?
D. Is failover-status.json captured from the new RG owner (not the same node as active)?
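Concern A can be checked mechanically by splitting `git diff --numstat` output into evidence paths and everything else. The following is only a sketch: `split_numstat` is a hypothetical helper, and the sample input is made up for illustration rather than taken from this PR.

```python
EVIDENCE_PREFIX = (
    "docs/pr/1373-retire-ebpf-dataplane/"
    "evidence-1378-policy-scheduler-live-20260519/"
)


def split_numstat(numstat: str, prefix: str = EVIDENCE_PREFIX):
    """Sum added/deleted lines for evidence paths vs. all other paths.

    Expects `git diff --numstat` lines: "<added>\t<deleted>\t<path>".
    Binary files report "-" for both counts and are counted as zero.
    Returns (totals, other_paths) so a reviewer can eyeball the non-evidence files.
    """
    totals = {"evidence": [0, 0], "other": [0, 0]}
    other_paths = []
    for line in numstat.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        bucket = "evidence" if path.startswith(prefix) else "other"
        if bucket == "other":
            other_paths.append(path)
        totals[bucket][0] += 0 if added == "-" else int(added)
        totals[bucket][1] += 0 if deleted == "-" else int(deleted)
    return totals, other_paths


if __name__ == "__main__":
    # Made-up numstat lines, for illustration only.
    sample = (
        f"120\t0\t{EVIDENCE_PREFIX}README.md\n"
        f"-\t-\t{EVIDENCE_PREFIX}capture.bin\n"
        "4\t3\tdocs/userspace-dataplane-gaps.md\n"
    )
    totals, other_paths = split_numstat(sample)
    print(totals)
    print(other_paths)
```

Any path in `other_paths` that is not a documentation file would be the production-code drift the concern asks about.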

Recommendation

MERGE-READY structurally — evidence captures are a doc deliverable, low risk if the scope check passes. Awaiting Codex (task-mpc366uv-rpgfzj) + Gemini Pro 3 (task-mpc36fbn-rnsnw8).

Not merging — author's decision.

@psaab (Owner, Author) commented May 19, 2026

Round-1 triple-review synthesis on 1fd17502

| Reviewer | Verdict |
| --- | --- |
| Claude | MERGE-READY |
| Codex | MERGE-READY |
| Gemini Pro 3 | MERGE-READY |

All three converge. Clean evidence-only PR.

Codex independent verification

"Scope: no production-code drift in 1fd17502. The huge diff is docs-only. Evidence is +37507/-0, five non-evidence docs account for +31/-20. No production files or validator files changed."

"Validator contract passes against the new artifacts: Counters for lan->wan/scheduled-allow are active 5, rebuild 5, inactive 5, failover 20; bytes are 490/490/490/1876. That satisfies active > 0, rebuild >= active, inactive == rebuild, failover >= rebuild + 1. Missing-scheduler artifact shows commit-check rejection, not success."

"xdp_userspace evidence present: All status files and raw *-entry-programs.json show entries 4/5/6 = xdp_userspace_p."

"RG ownership flipped: failover-cluster-status.txt shows RG1/RG2 primary on node1; failover-status.json is from the node1 side (ge-7-0-*, peer 10.99.13.1, RG1/RG2 active), while active/rebuild/inactive are node0 side (ge-0-0-*, peer 10.99.13.2)."

"No validator goalpost movement: test/incus/policy_scheduler_validate.py and policy_scheduler_validate_test.py blob hashes are identical between 1fd17502^ and 1fd17502. The docs changed the invoked --rule-id to match the live topology, but the validator config/code did not change."
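The counter contract Codex quotes (active > 0, rebuild >= active, inactive == rebuild, failover >= rebuild + 1) can be sketched as four comparisons. This is not the code in policy_scheduler_validate.py; `check_counter_contract` is a hypothetical name, and only the inequalities and the packet counts come from the review above.

```python
def check_counter_contract(active: int, rebuild: int, inactive: int, failover: int) -> list[str]:
    """Return the names of any violated conditions; an empty list means pass."""
    conditions = {
        "active > 0": active > 0,
        "rebuild >= active": rebuild >= active,
        "inactive == rebuild": inactive == rebuild,
        "failover >= rebuild + 1": failover >= rebuild + 1,
    }
    return [name for name, ok in conditions.items() if not ok]


if __name__ == "__main__":
    # Packet counters reported for lan->wan/scheduled-allow.
    packets = {"active": 5, "rebuild": 5, "inactive": 5, "failover": 20}
    print(check_counter_contract(**packets))  # → []
```

The reported byte counters (490/490/490/1876) satisfy the same inequalities, although the review does not say whether the validator applies the contract to bytes.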

Recommendation

Merge-ready. Closes the long-standing #1378 live evidence gap with a proper RG flip, xdp_userspace attachment, and counters that never decrease across snapshots and advance after failover. Validator config unchanged.
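The "validator unchanged" claim rests on comparing git blob hashes of the validator files at 1fd17502^ and 1fd17502. As a sketch of why identical blob hashes imply identical file contents, assuming a repository that uses the default SHA-1 object format, a blob ID is the SHA-1 of a short header plus the raw bytes (`git_blob_sha1` is a hypothetical helper name):

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Compute the git blob object ID for `content` (SHA-1 object format).

    Git hashes the header b"blob <size>\\0" followed by the file bytes.
    """
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def same_blob(before: bytes, after: bytes) -> bool:
    """True when both versions of a file would be stored as the same blob."""
    return git_blob_sha1(before) == git_blob_sha1(after)


if __name__ == "__main__":
    # Illustrative contents only, not the real validator source.
    print(same_blob(b"print('ok')\n", b"print('ok')\n"))
    print(same_blob(b"print('ok')\n", b"print('changed')\n"))
```

In a checkout, the same IDs can be read with `git rev-parse <rev>:<path>` for each revision and compared directly.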

Codex task: task-mpc366uv-rpgfzj. Gemini task: task-mpc36fbn-rnsnw8. Not merging — author's decision.

Capture the accepted loss userspace HA evidence set for #1378.
The artifacts cover active, rebuild, inactive, and post-failover
status snapshots for lan->wan/scheduled-allow, plus a strict
undefined-scheduler commit-check rejection.

The failover status was captured from xpf-userspace-fw1 after RG1
and RG2 moved to node1. The scheduled-rule counter advances from
5 packets / 490 bytes before failover to 20 packets / 1876 bytes on
the new owner, while inactive remains equal to rebuild.

Add a concise evidence README and update the #1373/#1378 trackers to
mark policy schedulers closed for the live HA evidence gate. The lab
restore artifacts show node0 primary again and no scheduler-policy
residue in the final config.

Validation:
- python3 test/incus/policy_scheduler_validate.py
  docs/pr/1373-retire-ebpf-dataplane/
  evidence-1378-policy-scheduler-live-20260519
  --rule-id 'lan->wan/scheduled-allow'
- git diff --check
- git diff --cached --check

psaab force-pushed the codex/b-1378-live-evidence branch from 1fd1750 to b1c1410 on May 19, 2026 14:23
psaab merged commit 265f672 into master on May 19, 2026