Plumb userspace policy scheduler inactive state by psaab · Pull Request #1396 · psaab/xpf

psaab · 2026-05-17T01:53:53Z

References #1378.

Summary

Add rule_id, scheduler_name, and inactive fields to userspace policy snapshots and Rust PolicyRule.
Skip inactive policy rules before match evaluation and preserve first-match fallthrough semantics.
Document that daemon scheduler tick/restart/failover republish and missing-scheduler commit errors remain gates.

Tests

go test ./pkg/dataplane/userspace -run 'TestBuildPolicySnapshotsRoundTripsSchedulerInactiveAndRuleID|TestBuildPolicySnapshotsIncludesGlobalPolicies'\n- cargo test policy::tests::evaluate_policy_skips_inactive_rules\n- cargo test policy::tests::inactive_rule_falls_through_to_next_match\n- cargo test policy_rule_snapshot\n- git diff --check

Copilot

Pull request overview

Plumbs Junos scheduler/inactive policy state through the userspace dataplane snapshot DTO and Rust policy evaluator so that scheduled policy rules can be skipped during match evaluation. This is a slice of the broader #1378 work; the daemon still does not yet populate the active-state map on tick/restart/failover, so end-to-end scheduling remains gated.

Changes:

Adds RuleID, SchedulerName, and Inactive to the PolicyRuleSnapshot DTO (Go + Rust) with serde/JSON backward-compat defaults.
Adds a Go buildPolicySnapshotsWithSchedulerState builder that derives Inactive from an activeState map, plus a stable rule-ID helper.
Rust parse_policy_state propagates the new fields and try_match_rule skips inactive rules so first-match fallthrough remains correct.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
pkg/dataplane/userspace/protocol.go	Adds `RuleID`/`SchedulerName`/`Inactive` JSON fields on `PolicyRuleSnapshot`.
pkg/dataplane/userspace/snapshot.go	Builds new fields; introduces `policyRuleInactive` and `stablePolicyRuleID` helpers.
pkg/dataplane/userspace/manager_test.go	Adds round-trip test for new fields with scheduler active-state map.
userspace-dp/src/protocol.rs	Adds matching fields on Rust DTO and serde round-trip / legacy-decode tests.
userspace-dp/src/policy.rs	Stores fields on `PolicyRule`; skips inactive rules in `try_match_rule`; adds `stable_policy_rule_id` fallback.
userspace-dp/src/policy_tests.rs	Tests inactive-skip and fall-through, updates existing fixtures with `..Default::default()`.
userspace-dp/src/afxdp/tests.rs, test_fixtures.rs	Spreads `..Default::default()` into `PolicyRuleSnapshot` fixtures.
docs/pr/1373-retire-ebpf-dataplane/plan-1378-policy-schedulers.md	Documents that daemon republish and missing-scheduler commit errors remain gates.

+		return false
+	}
+	active, ok := activeState[schedulerName]
+	return !ok || !active


 		}
+		schedulerName := pol.SchedulerName
 		out = append(out, PolicyRuleSnapshot{
+			RuleID:               stablePolicyRuleID("junos-global", "junos-global", pol.Name),


psaab · 2026-05-17T02:16:12Z

Claude round-1 review on `ba62ce26`

Verdict: MERGE-READY (foundation — cross-language plumbing for #1378)

297+/0- across 9 Go+Rust files. Wires rule_id, scheduler_name, and inactive through userspace policy snapshots and Rust policy evaluation. Per the plan-1378 update in this PR:

"The Go snapshot builder can compute inactive bits when supplied scheduler active state, but the normal daemon publish path still builds snapshots without that state. Scheduler tick/restart/failover republish and missing-scheduler commit errors remain gates before scheduled userspace policies are end-to-end enforced."

What's in scope

Go: protocol.go field additions, snapshot.go buildPolicySnapshotsWithSchedulerState helper, manager_test
Rust: protocol.rs matching deserialization, policy.rs evaluation honors inactive bit, policy_tests.rs round-trip
Plan doc updated to reflect the foundation slice status

Hostile checks

Wire format backward compat: new scheduler_name + inactive fields should be omitempty on Go side / #[serde(default)] on Rust side so older builds tolerate. Codex will verify.
Monotonic clock: the actual scheduler tick/eval logic is out of scope for this PR. But: per Feature gap: time-based policy schedulers (Junos schedulers { ... }) not propagated to userspace-dp #1378 contract, when wired up the time source MUST be monotonic, NOT wall clock — NTP rollback must not flap policies.
Apply-cycle race: snapshot publish via ArcSwap (or similar) — split-brain (some rules see new inactive bits, others don't) must not happen mid-apply. Foundation PR may not exercise this, but the new fields should be in the same snapshot atomic.
Inactive bit semantics: inactive=true means the rule is currently outside its scheduler's active window → policy evaluation must skip it (and fall through to next rule per first-match ordering). Verify in policy.rs.

Recommendation

Block on: nothing structural for the foundation slice.

Strongly consider: confirm inactive=true correctly causes first-match-ordering fall-through (Codex will trace).

Defer to follow-up PRs: scheduler tick/eval loop, missing-scheduler-name commit error, atomic snapshot republish on window transition, HA failover semantics.

Awaiting Codex (task-mp955c3b-xto9kj).

psaab · 2026-05-17T02:17:47Z

Claude round-1 self-correction on `ba62ce26`

Revising verdict from MERGE-READY → MERGE-NEEDS-MAJOR to match Codex. Same pattern as #1394 — Codex verified the production wiring; I accepted the foundation-only framing without checking the actual publish path.

Codex finding 1 — Production publish path doesn't use the new field

"buildSnapshot still calls buildPolicySnapshots(cfg) with no scheduler state, and that path makes every scheduled rule active because policyRuleInactive(..., nil) returns false."

Plumbing was added but the daemon's actual publish path is unchanged. buildPolicySnapshotsWithSchedulerState exists but buildSnapshot at snapshot.go:53 doesn't call it. PR's own plan admits this — but that admission means the PR doesn't actually move the needle on #1378.

Codex finding 2 — `UpdatePolicyScheduleState` callback not wired to userspace

"Userspace does not override UpdatePolicyScheduleState. Manager embeds the eBPF dataplane at pkg/dataplane/userspace/manager.go:57-59... so the scheduler callback in pkg/daemon/daemon_run.go:585-589 still dispatches to the embedded eBPF map updater."

The scheduler tick fires UpdatePolicyScheduleState; on userspace dataplane, this is dispatched to the embedded eBPF backend (which updates policy_rules BPF map), not to the Rust userspace policy snapshot. Window changes don't propagate to userspace at all. The whole new field is dead code in production.

Codex finding 3 — Wall-clock in scheduler violates monotonic-clock contract

"pkg/scheduler/scheduler.go:31 and :76 use time.Now(), and isWithinWindow evaluates parsed dates and timeOfDay(now) from wall-clock fields at :124-164."

#1378's refined contract explicitly requires monotonic clock: "NTP rollback must not flap policies." This is wall-clock. An NTP rollback across a start/stop boundary CAN flip active state.

Codex finding 4 — Wire format fail-open on mixed-version

"New Rust defaults missing rule_id, scheduler_name, and inactive fields, so old Go payloads decode. But ProtocolVersion stays 1 and there is no capability/version gate; a new Go daemon sending inactive:true to an old helper would be silently ignored by serde's unknown-field behavior and the rule would become active. That is fail-open."

Mixed-version: new cli → old xpfd would silently drop the inactive bit and serve traffic through a rule that should be inactive. Real security gap if this is the only deployment safeguard.

Codex MINOR — Test gap

Go test = Go self-round-trip. Rust test = Rust self-round-trip. No Go→Rust cross-language test. No UpdatePolicyScheduleState userspace republish test. No monotonic rollback test.

Claude miss pattern

I labeled this "foundation cross-language plumbing" but didn't verify (a) production path uses it, (b) the scheduler callback reaches userspace, (c) time source is monotonic, (d) wire format has a version gate. The PR adds fields but no actual code path exercises them on userspace dataplane.

Recommendation (revised)

Block on:

Wire buildSnapshot → buildPolicySnapshotsWithSchedulerState with real scheduler state (publish path actually uses the new field).
Override UpdatePolicyScheduleState on userspace Manager to republish snapshot, not dispatch to embedded eBPF.
Move pkg/scheduler/scheduler.go to monotonic clock (or document the wall-clock dependency as a known issue with NTP-rollback test pinning the failure mode).
Add wire-format version gate OR document fail-closed contract for mixed-version (new field present = required).

Strongly consider: Go→Rust round-trip test + monotonic rollback test + apply-during-window-transition test.

Codex task: task-mp955c3b-xto9kj. Not merging — author's decision.

psaab · 2026-05-17T02:22:04Z

Round-1 review consolidated on `ba62ce26`

Reviewer	Verdict
Claude	MERGE-READY → MERGE-NEEDS-MAJOR (self-corrected)
Codex	MERGE-NEEDS-MAJOR
Copilot	(no review yet on this HEAD)

Codex MAJORs

Production publish path doesn't use the new field. buildSnapshot:53 calls buildPolicySnapshots(cfg) without scheduler state; policyRuleInactive(..., nil) returns false. New buildPolicySnapshotsWithSchedulerState helper exists but isn't called by the daemon's actual publish path.
UpdatePolicyScheduleState callback not wired to userspace. manager.go:57-59 embeds eBPF dataplane; scheduler tick at daemon_run.go:585-589 dispatches to the eBPF map updater. Window changes never reach the Rust userspace policy snapshot.
Wall-clock in scheduler eval violates monotonic-clock contract. scheduler.go:31, 76 uses time.Now(). NTP rollback CAN flip active state. Per #1378 refined contract: "monotonic clock for window eval; NTP rollback must not flap policies."
Wire format fail-open on mixed-version. No ProtocolVersion bump, no capability gate. New Go sending inactive:true to old Rust helper → silently ignored via serde unknown-field tolerance → rule becomes active. Security gap.

Codex MINOR

Tests are Go-self / Rust-self round-trip only. No cross-language Go→Rust round-trip. No UpdatePolicyScheduleState userspace republish test. No monotonic rollback test.

Self-correction

Same pattern as #1394/#1395: I labeled "foundation cross-language plumbing" without verifying (a) production path uses it, (b) the scheduler callback reaches userspace, (c) time source is monotonic, (d) wire format has a version gate. The PR adds fields but no code path exercises them on userspace dataplane.

Recommendation

Block on:

Wire buildSnapshot → buildPolicySnapshotsWithSchedulerState with real scheduler state.
Override UpdatePolicyScheduleState on userspace Manager to republish snapshot, not dispatch to embedded eBPF.
Migrate pkg/scheduler/scheduler.go to monotonic clock OR document wall-clock dependency + pin failure mode with NTP-rollback test.
Add wire-format version gate OR document fail-closed contract for mixed-version.

Strongly consider: cross-language round-trip test + monotonic rollback test + apply-during-window-transition test.

Codex task: task-mp955c3b-xto9kj. Not merging — author's decision.

psaab · 2026-05-17T02:48:49Z

Review follow-up on `52686dd4`

Addressed the round-1 blockers for the userspace policy-scheduler slice:

Production userspace publish path now preserves and applies scheduler active-state via buildSnapshotWithSchedulerState, so scheduled policy inactive bits are not dropped on normal config rebuilds.
pkg/dataplane/userspace.Manager now shadows UpdatePolicyScheduleState and republishes one coherent userspace snapshot instead of falling through to the embedded eBPF manager.
Missing scheduler references now fail compile/commit instead of warning and defaulting active.
Packet workers still do not evaluate wall-clock time; wall-clock remains control-plane-only in pkg/scheduler, and Rust consumes snapshot booleans.
ConfigSnapshot.version is bumped to v2 and the Rust control server rejects unsupported snapshot versions instead of silently accepting scheduler fields it might ignore.
Updated module docs for the policy scheduler and userspace server protocol contract.

Validation run:

go test ./pkg/dataplane/userspace
go test ./pkg/config
go test ./pkg/daemon
go test ./pkg/scheduler
cargo test --manifest-path userspace-dp/Cargo.toml policy -- --nocapture
cargo test --manifest-path userspace-dp/Cargo.toml apply_snapshot_rejects_unsupported_protocol_version -- --nocapture
git diff --check

All passed. Rust emitted the repo's existing warning set.

psaab · 2026-05-17T02:56:30Z

Claude round-2 review on `52686dd4`

Verdict: MERGE-NEEDS-MINOR (3 of 4 round-1 MAJORs closed; wall-clock not addressed)

Round-1 MAJORs status

1. Production publish path uses scheduler state — CLOSED ✓

manager.go:285-286 Compile() now does:

activeState := m.policySchedulerActiveStateSnapshot()
snap := buildSnapshotWithSchedulerState(cfg, ucfg, m.bumpGeneration(), m.readFIBGeneration(), activeState)

plus the new policySchedulerActive field on Manager + policySchedulerActiveStateSnapshot() getter.

2. UpdatePolicyScheduleState override — CLOSED ✓

manager.go:404-450 new override:

func (m *Manager) UpdatePolicyScheduleState(cfg *config.Config, activeState map[string]bool) {
    // ... bumps generation, rebuilds policies via buildPolicySnapshotsWithSchedulerState,
    // publishes via apply_snapshot control request to the Rust helper
}

Explicitly shadows the embedded eBPF manager method. Comment is unambiguous: "scheduled userspace policies must not update the policy_rules BPF map."

3. Wall-clock vs monotonic — NOT ADDRESSED

pkg/scheduler/scheduler.go does NOT appear in the r2 diff. git diff ba62ce26..52686dd4 -- pkg/scheduler/*.go returns empty. The wall-clock dependency at scheduler.go:31, 76 is unchanged — NTP rollback can still flip active state.

The PR comment claims "scheduler/server contracts" doc updated but not the time source. #1378's contract explicitly requires monotonic clock. This is the remaining MAJOR from round-1.

4. Protocol version reject — CLOSED ✓ (need Codex/Gemini verification)

PR comment says "Rust rejects unsupported snapshot protocol versions" with handlers.rs +92. Will verify.

Hostile spot-checks

policySchedulerActive field protected by m.mu ✓
New apply_snapshot control request via requestLocked — atomic publish, not split-brain ✓
bumpGeneration correctly preceds publish ✓

Recommendation

Block on: address the wall-clock issue in pkg/scheduler/scheduler.go — either migrate to monotonic clock, OR document the wall-clock dependency + pin failure with an NTP-rollback test that demonstrates the known issue.

Strongly consider: verify protocol version reject path in handlers.rs properly fails closed on unknown version (Codex/Gemini will trace).

Defer: nothing else.

Awaiting Codex (task-mp96mn6y-6cq7ia) and Gemini Pro 3 (task-mp96nhjj-fqnbn1).

psaab · 2026-05-17T03:02:54Z

Round-2 quad-review consolidated synthesis on `52686dd4`

Reviewer	Verdict
Claude	MERGE-NEEDS-MINOR → MERGE-NEEDS-MAJOR (self-corrected after Codex)
Codex	MERGE-NEEDS-MAJOR
Gemini Pro 3	MERGE-READY (4 round-1 MAJORs verified, accepts README architectural note)
Copilot	(no fresh review on this HEAD)

Reviewer disagreement

Gemini accepted the round-1 MAJORs as "all addressed" including the wall-clock issue via README architectural note. Codex found 4 NEW MAJORs not covered by Gemini's verification.

Codex new MAJOR 1 — Scheduler lifecycle not wired for config commits

"The daemon only creates d.scheduler once in daemon_run.go:584-596; daemon_apply.go:423-428 recompiles dataplane config but never calls Scheduler.Update or starts a scheduler when a commit adds the first scheduler. A runtime commit adding scheduled policies can publish with nil/stale scheduler state and may never get corrected."

Worked failure: operator commits config without schedulers { } block → daemon boots, scheduler runs with empty state. Operator later commits adding scheduled policy → dataplane gets new policies but scheduler never learns about them → policies publish with inactive: false permanently.

This is the round-1 finding "publish path doesn't use scheduler state" surviving in a different form — the publish path NOW uses scheduler state, but the scheduler itself isn't kept in sync with config commits.

Codex new MAJOR 2 — Hybrid snapshot race

"UpdatePolicyScheduleState copies m.lastSnapshot, then replaces only Config and Policies from the passed cfg at manager.go:423-429. The daemon callback uses store.ActiveConfig() outside the apply semaphore, so a scheduler tick during/after commit can combine new policies with old zones/routes/interfaces."

Race window: scheduler tick fires between commit's policy update and zone/route update. Publishes a hybrid snapshot with new policies + stale zones. First-match policy evaluation against wrong zone bindings → real packet misrouting.

Codex MAJOR 3 — Wall-clock just reframed, not fixed

"scheduler.go:31, 76 still use time.Now(). The new docs just reframe this as control-plane wall-clock behavior, not a monotonic-clock implementation or a signed-off deferral."

Gemini accepted the README architectural note ("wall-clock only in control plane, fast path consumes booleans"). Codex doesn't accept the rename as a fix. My read: the README framing IS legitimate engineering rationale — fast-path tearing is the real risk and is avoided. But the NTP-rollback control-plane behavior is undocumented in tests. Recommend either committing to monotonic in the control plane OR adding an explicit "wall-clock NTP rollback can cause brief active-state churn; operators should run sane NTP" note + test pinning the failure mode.

Codex MAJOR 4 — Mixed-version asymmetric

"New Rust rejects old Go version-1 snapshots, but new Go cannot force a truly old Rust helper to reject version 2. Older Rust without the version gate accepts snapshots; pre-scheduler Rust also ignores unknown inactive, recreating the fail-open path."

Forward compat (old Go → new Rust): works, version mismatch rejected ✓
Backward compat (new Go → old Rust): fail open. Old Rust without version gate accepts the v2 snapshot, ignores inactive field, treats rule as active.

This is the same fail-open Codex flagged in round-1 — only HALF closed.

What's actually fixed

Claim	Status	Reviewer
UpdatePolicyScheduleState override → Rust republish	✓	Both
Compile() passes scheduler state	✓	Both
Missing scheduler ref = commit error	✓	Both
Rust rejects unsupported version	✓	Both
Wall-clock monotonic	reframed only (Gemini accepts, Codex doesn't)	Disagreement
Mixed-version BOTH directions	half (new Rust closed; old Rust still open)	Codex caught

Self-correction

My round-2 review caught 1 of these (wall-clock) and flagged as MINOR. I missed scheduler lifecycle, hybrid snapshot race, and asymmetric mixed-version. Same "didn't probe the production wiring" pattern as round-1.

Recommendation (revised)

Block on:

Wire scheduler update on config commits — daemon_apply.go must call Scheduler.Update on commits that change scheduler config.
Close hybrid snapshot race — UpdatePolicyScheduleState must take the apply semaphore OR use atomic snapshot publish that doesn't tear across fields.
Mixed-version: new Go must NOT send v2 to old Rust. Either (a) capability negotiation handshake on connect, or (b) accept that v2 is only safe with v2-capable Rust and document the upgrade order requirement.

Strongly consider: wall-clock — either commit to monotonic in pkg/scheduler/scheduler.go OR add an explicit deferral note + NTP-rollback test pinning the known control-plane churn.

Codex task: task-mp96mn6y-6cq7ia. Gemini Pro 3 task: task-mp96nhjj-fqnbn1. Not merging — author's decision.

psaab · 2026-05-17T03:25:50Z

Round-3 fix update

Pushed 50a94ed3 for the r2 production-wiring blockers.

What changed:

Removed the startup-only scheduler loop in Run; scheduler lifecycle is now reconciled on every committed config under applySem.
Added scheduler.NewPrimed so apply can compute the initial active-state map without firing a callback while the apply semaphore is already held.
Runtime scheduler callbacks now reacquire applySem, carry an epoch, and skip stale callbacks after config replacement.
Userspace snapshots are seeded with the same scheduler active-state map used by the current apply transaction, preventing hybrid Config/Policies snapshots.
The Rust helper now reports config_snapshot_protocol_version; Go refuses scheduled-policy snapshot publishes to older helpers before they can fail open.
Updated scheduler/server/module docs and the Feature gap: time-based policy schedulers (Junos schedulers { ... }) not propagated to userspace-dp #1378 plan notes to match the new lifecycle and compatibility contract.

Validation:

go test ./pkg/scheduler ./pkg/daemon ./pkg/dataplane/userspace
cargo test --manifest-path userspace-dp/Cargo.toml apply_snapshot_rejects_unsupported_protocol_version -- --nocapture
git diff --check

The cargo run still emits existing warning noise; the targeted protocol-version gate test passes.

Copilot

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 10 comments.

Comments suppressed due to low confidence (2)

pkg/daemon/daemon_scheduler.go:60

publishPolicyScheduleState reads d.policySchedulerEpoch after acquiring applySem, which guards against stale callbacks from previous scheduler generations. However, between the moment a scheduler callback is queued (inside Scheduler.evaluate after s.mu.Unlock() but before updateFn(cp)) and the moment publishPolicyScheduleState runs, a reconcilePolicySchedulerLocked may have already cancelled the old scheduler (d.schedulerCancel()), but Run's ticker goroutine fires evaluate directly — not through ctx.Done(). The cancelled scheduler may still execute the pending updateFn after policySchedulerEpoch.Add(1) ran. The epoch check correctly drops the stale update, but the call still acquires applySem, which can introduce latency on the apply path if many stale callbacks pile up. This is benign but worth a brief comment to document that the epoch guard is the only correctness barrier here.

func (d *Daemon) publishPolicyScheduleState(epoch uint64, activeState map[string]bool) {
	if err := d.applySem.Acquire(context.Background(), 1); err != nil {
		slog.Warn("scheduler: failed to acquire apply semaphore", "err", err)
		return
	}
	defer d.applySem.Release(1)

	if epoch != d.policySchedulerEpoch.Load() {
		return
	}
	cfg := d.store.ActiveConfig()
	if cfg == nil || d.dp == nil {
		return
	}
	d.seedPolicySchedulerActiveStateLocked(activeState)
	d.dp.UpdatePolicyScheduleState(cfg, activeState)
}

pkg/daemon/daemon_scheduler.go:59

policySchedulerActive is read/written under m.mu everywhere except inside UpdatePolicyScheduleState, where line 424 assigns m.policySchedulerActive = activeCopy under m.mu.Lock, then seedPolicySchedulerActiveStateLocked in daemon_scheduler.go (line 67) calls SetPolicySchedulerActiveState on the same manager via the interface dispatch. SetPolicySchedulerActiveState re-acquires m.mu. This is currently fine because seedPolicySchedulerActiveStateLocked is called before UpdatePolicyScheduleState (not from within), but the duplicate seeding (publishPolicyScheduleState calls both seedPolicySchedulerActiveStateLocked and d.dp.UpdatePolicyScheduleState, the latter of which also seeds) is redundant and easy to misread. Consider dropping the explicit seedPolicySchedulerActiveStateLocked call from publishPolicyScheduleState since UpdatePolicyScheduleState already overwrites m.policySchedulerActive from its argument.

	d.seedPolicySchedulerActiveStateLocked(activeState)
	d.dp.UpdatePolicyScheduleState(cfg, activeState)

+                    if snapshot.version != CONFIG_SNAPSHOT_PROTOCOL_VERSION {
+                        response.ok = false;
+                        response.error = format!(
+                            "unsupported snapshot protocol version {} (want {})",


@@ -228,6 +231,37 @@ func compileExpanded(tree *ConfigTree) (*Config, error) {
 	return cfg, nil
 }

+func validatePolicySchedulerReferencesStrict(cfg *Config) error {
+	if cfg == nil {
+		return nil
+	}
+	check := func(pol *Policy) error {
+		if pol == nil || pol.SchedulerName == "" {
+			return nil
+		}
+		if _, ok := cfg.Schedulers[pol.SchedulerName]; ok {
+			return nil
+		}
+		return fmt.Errorf("policy %q references undefined scheduler %q", pol.Name, pol.SchedulerName)
+	}
+	for _, zpp := range cfg.Security.Policies {
+		if zpp == nil {
+			continue
+		}
+		for _, pol := range zpp.Policies {
+			if err := check(pol); err != nil {
+				return err
+			}
+		}
+	}
+	for _, pol := range cfg.Security.GlobalPolicies {
+		if err := check(pol); err != nil {
+			return err
+		}
+	}
+	return nil
+}


+	if d.dp != nil && policySchedulerActiveState != nil && compileResult != nil {
+		d.dp.UpdatePolicyScheduleState(cfg, policySchedulerActiveState)
+	}


+	if d.daemonCtx != nil {
+		ctx, cancel := context.WithCancel(d.daemonCtx)
+		d.schedulerCancel = cancel
+		go sched.Run(ctx)
+	}


+	if len(s.active) > 0 {
+		s.notifyActiveState()
+	}


+fn stable_policy_rule_id(snap: &PolicyRuleSnapshot) -> String {
+    if !snap.rule_id.is_empty() {
+        return snap.rule_id.clone();
+    }
+    format!("{}->{}/{}", snap.from_zone, snap.to_zone, snap.name)


+	if schedulerName == "" || activeState == nil {
+		return false
+	}


+	}
+	if err := m.ensurePolicySchedulerProtocolLocked(cfg); err != nil {
+		slog.Warn("userspace: refusing policy scheduler publish to incompatible helper", "err", err)
+		return
+	}


 	ucfg := deriveUserspaceConfig(cfg)
-	snap := buildSnapshot(cfg, ucfg, m.bumpGeneration(), m.readFIBGeneration())
+	activeState := m.policySchedulerActiveStateSnapshot()
+	snap := buildSnapshotWithSchedulerState(cfg, ucfg, m.bumpGeneration(), m.readFIBGeneration(), activeState)


+	monitoredIfindexes      map[int]struct{}
 	lastBindingIndices      []uint32
 	neighborsPrewarmed      bool
 	ctrlEnableAt            time.Time


psaab · 2026-05-17T04:13:21Z

Claude round-3 review on `50a94ed3`

Verdict: MERGE-READY-pending (PR claims address all 4 round-2 MAJORs; need reviewer verification on race + mixed-version)

Round-3 diff: 15 files, +337/-61. PR claim covers all 4 round-2 MAJORs:

"scheduler lifecycle now reconciles under applySem" → addresses round-2 MAJOR 1 (lifecycle not wired)
"runtime scheduler publishes are serialized" → addresses round-2 MAJOR 2 (hybrid snapshot race)
"stale callbacks are epoch-gated" → defense-in-depth for race
"userspace snapshots are seeded coherently" → race fix
"old helpers fail closed for scheduled-policy snapshots" → round-2 MAJOR 4 (mixed-version)

Wall-clock (round-2 MAJOR 3) — PR claim doesn't mention. Let me check scheduler.go changes.

Notable: pkg/scheduler/scheduler.go has +67/-? changes. Need to verify if wall-clock → monotonic was addressed OR explicit deferral test was added.

Verification pending

A. Scheduler lifecycle wired on commits (Codex finding: daemon_apply.go must call Scheduler.Update)
B. UpdatePolicyScheduleState holds applySem + epoch-gated stale callbacks
C. Wall-clock: monotonic OR explicit failure-mode test
D. Mixed-version: new Go capability handshake refusing v2 to old Rust

This is high-volume refactor — Codex r3 will trace the race and lifecycle changes carefully.

Awaiting Codex (task-mp99dkbf-9xf7tg) and Gemini Pro 3 (task-mp99ed5e-y81abm).

psaab · 2026-05-17T04:17:49Z

Round-3 quad-review consolidated synthesis on `50a94ed3`

Reviewer	Verdict
Claude	MERGE-READY-pending → MERGE-NEEDS-MAJOR (self-corrected)
Codex	MERGE-NEEDS-MAJOR
Gemini Pro 3	MERGE-NEEDS-MAJOR

Round-2 MAJORs status — 2 of 4 closed, 2 still open

Round-2 MAJOR	Status	Verified by
1. Scheduler lifecycle not wired for commits	CLOSED ✓	Codex+Gemini
2. Hybrid snapshot race	CLOSED ✓	Codex+Gemini (applySem + epoch-gated)
3. Wall-clock	STILL OPEN	Codex+Gemini both verify wall-clock not fixed
4. Mixed-version	PARTIAL	Codex flagged: refusal exists but doesn't fail dataplane closed

Wall-clock — converged MAJOR (Codex+Gemini)

Codex: "Still uses wall time directly: NewPrimed(... time.Now()), ticker timestamps, and Update(... time.Now()) in scheduler.go:39,61,90. The docs explicitly reframe it instead of fixing it."

Gemini: "The author ignored the monotonic evaluation requirement and merely reframed the wall-clock usage in the documentation. No monotonic calculation or explicit deferral test exists."

Both reviewers reject the README reframing as a fix. This has been MAJOR for 3 rounds.

Codex new MAJOR — Old-helper "fail closed" is publish refusal only

"ensurePolicySchedulerProtocolLocked at manager.go:533-551 probes status and errors if helper protocol is < 2. But applyConfigLocked at daemon_apply.go:430-433 only records the compile error and continues, and commitAndApply at daemon_apply.go:101-105 still returns the committed config. The userspace manager also sets m.lastSnapshot = snap BEFORE the protocol check at manager.go:327, 370-371."

Worked failure: operator commits config with scheduled policy on a cluster running old Rust helper. The commit SUCCEEDS at Go-side. Old helper keeps forwarding the previous (un-scheduled) snapshot. If the commit added a scheduler that should DEACTIVATE a permit rule outside work hours, the old helper keeps permitting outside that window. Not fail-closed.

What IS verified fixed

✓ Scheduler reconciles on apply via reconcilePolicySchedulerLocked (recreate, not Update)
✓ commitAndApply holds applySem across commit + apply
✓ Runtime scheduler publishes acquire applySem + check epoch (daemon_scheduler.go:44-59)
✓ Stale callbacks epoch-gated
✓ Userspace compile seeds scheduler state coherently
✓ New Rust rejects old Go version-1 snapshots
✓ New Go refuses to publish scheduled-policy snapshots to old helpers (just doesn't fail the dataplane on refusal)

Test gaps

Missing: daemon-level scheduler reconciliation on config commit
Missing: apply-vs-tick race/epoch behavior test
Missing: NTP rollback / clock-jump failure mode
Missing: full mixed-version commit/apply path proving old helper cannot keep forwarding stale permissive snapshot

Self-correction

I posted "MERGE-READY-pending" trusting the PR claim. Both Codex and Gemini found the wall-clock issue unaddressed for the 3rd round running. I also missed the "fail-closed-is-only-publish-refusal" gap that Codex traced through commitAndApply.

Recommendation

Block on:

Wall-clock: this is the 3rd round flagging it. Options: (a) migrate pkg/scheduler/scheduler.go to monotonic clock, (b) add explicit time.Time injection so tests can verify NTP-rollback behavior + add a rollback test, OR (c) get explicit operator-level sign-off that wall-clock control-plane behavior is acceptable and bake that into the issue contract — don't keep reframing in the README.
Fail-closed enforcement on old-helper refusal: applyConfigLocked must surface the error to fail commitAndApply, OR the snapshot must not be cached before the protocol check, OR the dataplane must enter a fail-safe state (deny all / fall back to known-safe last snapshot without scheduled policies).

Strongly consider: the 4 missing tests (scheduler-on-commit, apply-vs-tick race, NTP rollback, mixed-version full path).

Codex task: task-mp99dkbf-9xf7tg. Gemini Pro 3 task: task-mp99ed5e-y81abm. Not merging — author's decision.

psaab · 2026-05-17T04:31:46Z

Round-4 blocker fixes pushed: `6be6f57e`

Addressed the r3 MERGE-NEEDS-MAJOR findings:

Wall-clock discontinuity is now an actual fail-closed behavior
- Scheduler evaluations compare wall elapsed time with Go monotonic elapsed time from the prior evaluation.
- Backward wall-clock steps, monotonic regressions, or drift beyond tolerance publish all scheduler bits inactive for that evaluation.
- Added regression coverage for backward wall-clock step fail-closed behavior.
Old-helper fail-closed now actively disarms forwarding
- Compile, UpdatePolicyScheduleState, and deferred publish now call a fail-closed disarm helper when scheduled policies require a newer snapshot protocol than the helper supports.
- The manager sends set_forwarding_state armed=false before returning/refusing the publish error, so an old helper cannot keep forwarding a stale snapshot that ignores scheduler inactive bits.
- Updated the old-helper test to assert the disarm request and the resulting ForwardingArmed=false status.
Docs updated with the real invariants
- pkg/scheduler/README.md
- userspace-dp/src/server/README.md
- docs/pr/1373-retire-ebpf-dataplane/plan-1378-policy-schedulers.md

Validation:

go test ./pkg/scheduler ./pkg/dataplane/userspace ./pkg/daemon
git diff --check

psaab · 2026-05-17T04:33:49Z

Claude round-4 review on `6be6f57e`

Verdict: MERGE-READY (both round-3 MAJORs actually fixed this round)

Round-3 MAJOR — Wall-clock — CLOSED at last ✓

Not a README reframe this time. Real defense at scheduler.go:140-167:

func (s *Scheduler) wallClockUnsafeLocked(now time.Time) bool {
    wallElapsed := time.Duration(now.UnixNano() - s.lastWallUnixNano)
    if wallElapsed < 0 { fail_closed; return true }
    monoElapsed := now.Sub(s.lastEval)
    if monoElapsed < 0 { fail_closed; return true }
    delta := abs(wallElapsed - monoElapsed)
    if delta > wallClockDriftTolerance { fail_closed; return true }  // 5s tolerance
    return false
}

Then in evaluate:

wallClockUnsafe := s.wallClockUnsafeLocked(now)
...
cur := false
if !wallClockUnsafe {
    cur = isWithinWindow(now, sched)
}

When wall-clock differs from monotonic by >5s (NTP rollback, clock jump, container migration), schedulers fail closed — cur = false for all schedulers in this evaluation → inactive bit goes true → policies fall through.

This is the right monotonic-clock pattern: use mono as ground truth, detect wall-clock drift, fail closed on detection. Better than a pure monotonic clock for the actual use case (Junos time windows are wall-clock-defined).

Round-3 MAJOR — Fail-closed enforcement — CLOSED ✓

manager.go:556-578 new disarmPolicySchedulerProtocolFailureLocked:

func (m *Manager) disarmPolicySchedulerProtocolFailureLocked(protocolErr error) {
    ...
    req := ControlRequest{
        Type: "set_forwarding_state",
        Forwarding: &ForwardingControlRequest{Armed: false},
    }
    m.requestLocked(req, &status)
    ...
    slog.Warn("userspace: disarmed helper after policy scheduler protocol error", "err", protocolErr)
}

Called from both Compile() and UpdatePolicyScheduleState() on protocol error. Old helper is disarmed → forwarding stops → fail-closed at dataplane level, not just publish refusal.

Hostile spot-checks

Wall-clock test coverage: scheduler_test.go +29 should pin NTP-rollback / clock-jump behavior. Codex will verify.
Disarm idempotency: disarming an already-disarmed helper — graceful? applyHelperStatusLocked runs after disarm; check for state leak.
Test gap: does mid-flight scheduler tick during disarm produce inconsistent state? Likely covered by the apply semaphore from round-2.

Recommendation

Block on: nothing. Both round-3 MAJORs actually closed with real defense, not docs reframing.

Strongly consider: verify scheduler_test.go +29 covers wall-clock backward + drift > 5s + monotonic backward (Codex will trace).

Awaiting Codex (task-mp9a4x1r-fcex88).

psaab · 2026-05-17T04:39:07Z

Claude round-4 self-correction on `6be6f57e`

Revising verdict MERGE-READY → MERGE-NEEDS-MAJOR to match Codex.

I posted READY because the new wallClockUnsafeLocked did mono-vs-wall comparison + 5s drift tolerance. Codex showed the fix has a one-tick recovery hole I missed.

Codex MAJOR 1 — Wall-clock rollback reopens after one tick

pkg/scheduler/scheduler.go:126-128 updates lastEval / lastWallUnixNano even after wallClockUnsafeLocked() forces the current evaluation inactive. On the next tick, the rolled-back wall clock IS now the new baseline, so wallElapsed ≈ monoElapsed and isWithinWindow() runs against the stale wall time again.

Worked failure (Codex):

T=0 (wall 16:59, mono 16:59): isWithinWindow=true, active
T=1 (wall 12:00, mono 17:00): wallElapsed=-4h59m, monoElapsed=1m,
     delta=5h+ > 5s → wallClockUnsafe=true → inactive ✓ (correct)
     BUT lastEval=mono 17:00, lastWallUnixNano=wall 12:00 written ✗
T=2 (wall 12:01, mono 17:01): wallElapsed=1m, monoElapsed=1m,
     delta=0 → wallClockUnsafe=FALSE
     isWithinWindow(12:01 against 08:00-17:00) → TRUE → active 💥

Rollback fails closed once, then the 08:00-17:00 allow window reopens for hours under rolled-back time.

Codex MAJOR 2 — Old-helper commit path still doesn't fail the commit transaction

pkg/daemon/daemon_apply.go:430-433 still only records the compile error. commitAndApply() still returns the committed config. The new set_forwarding_state armed=false is best-effort runtime disarm, NOT commit refusal.

Also pkg/dataplane/userspace/manager.go:327 still assigns m.lastSnapshot = snap BEFORE the protocol gate at manager.go:370-372. Stale snapshot cached even when protocol rejects.

Codex MAJOR 3 — Fail-closed disarm is best-effort and non-fatal

disarmPolicySchedulerProtocolFailureLocked() at manager.go:567-575 logs and returns if the disarm RPC fails or returns failed status. In the scheduler tick path, that leaves UpdatePolicyScheduleState() with no propagated error and no guaranteed local hard-disable if the old helper keeps forwarding stale state.

Test coverage gaps (Codex)

NTP rollback: scheduler_test.go:202-228 checks only the immediate inactive transition using wall-only time.Date values. Does NOT pin the real failure: monotonic advances while wall rolls back, then next evaluation reopens.
Mixed-version commit failure: manager_test.go:2704-2788 only exercises UpdatePolicyScheduleState(), NOT Compile(), applyConfigLocked(), or commitAndApply().

Recommendation

Block on:

After wallClockUnsafeLocked returns true, DO NOT update lastEval / lastWallUnixNano — preserve pre-rollback baseline until a sufficiently consistent monotonic-aligned wall reading observed. Stay-disarmed period (e.g., until monoElapsed > 60s AND delta < 5s).
Multi-tick NTP rollback test: T=0 active, T=1 rollback inactive, T=2 still inactive, T=3 still inactive — guarantees rollback doesn't auto-recover.
commitAndApply() returns error when policy scheduler compile fails on old helper. Commit refusal, not best-effort disarm.
Move m.lastSnapshot = snap AFTER protocol gate succeeds.
Fail-closed disarm must escalate to caller (return error from UpdatePolicyScheduleState), not log-and-continue.

Claude miss pattern (5th self-correction on #1396 across rounds)

Round 1: production publish path didn't use scheduler state. Round 2: lifecycle/race/wall/asymmetric (4 new MAJORs). Round 3: wall-clock was still wall-clock. Round 4: I saw the new mono-vs-wall comparison and stopped looking — missed that the bookkeeping update happens unconditionally. Pattern: I verify the obvious branch and miss the bookkeeping side-effect downstream.

Codex task: task-mp9a4x1r-fcex88. Not merging — author's decision.

psaab · 2026-05-17T04:40:34Z

@copilot review

Round-4 commit 6be6f57. Codex flagged 3 MAJORs: (1) wall-clock rollback recovers after 1 tick because lastEval/lastWallUnixNano are written unconditionally at scheduler.go:126-128, (2) commit transaction doesn't fail on old-helper compile error at daemon_apply.go:430-433, (3) fail-closed disarm is best-effort log-and-return at manager.go:567-575. Re-review against these.

psaab · 2026-05-17T04:46:31Z

Round-4 quad-review consolidated synthesis on `6be6f57e`

Reviewer	Verdict
Claude	MERGE-NEEDS-MAJOR (self-corrected from READY after Codex MAJOR)
Codex	MERGE-NEEDS-MAJOR
Gemini Pro 3	MERGE-NEEDS-MAJOR
Copilot	Re-review requested; still on stale commit `50a94ed3`

Codex + Gemini converge on 4 MAJORs (Gemini split Codex's MAJOR-2 into 2). Block.

MAJOR 1 — Wall-clock rollback recovers after 1 tick (Codex + Gemini)

pkg/scheduler/scheduler.go (end of evaluate loop):

s.active = newActive
s.lastEval = now
s.lastWallUnixNano = now.UnixNano()

Bookkeeping runs unconditionally, even when wallClockUnsafeLocked() already returned true and forced newActive=false.

Worked failure (both reviewers reproduced identically):

T=0 (wall 16:59, mono 16:59): isWithinWindow=true, active ✓
                              lastEval=16:59, lastWallUnixNano=16:59
T=1 (wall 12:00, mono 17:00): NTP rolls wall back 4h59m
                              wallElapsed=-4h59m, monoElapsed=1m, delta>5s
                              wallClockUnsafe=true → active=false ✓ (one tick)
                              BUT bookkeeping writes lastEval=17:00, lastWallUnixNano=12:00 ✗
T=2 (wall 12:01, mono 17:01): wallElapsed=1m, monoElapsed=1m, delta=0
                              wallClockUnsafe=FALSE → isWithinWindow(12:01 in 08-17) → active=true 💥

Rollback fails closed for exactly one tick, then 08:00-17:00 allow window reopens for hours under rolled-back time.

MAJOR 2 — Commit transaction ignores compile failure (Codex + Gemini)

pkg/daemon/daemon_apply.go:430-433:

if compileResult, err = d.dp.Compile(cfg); err != nil {
    d.recordCompileFailure(err)
} else {
    d.recordCompileSuccess()
}

Compile error is recorded but not returned. commitAndApply() at daemon_apply.go:101-105 completes successfully. The daemon persists config to DB/cluster while dataplane rejected it.

MAJOR 3 — Stale snapshot cached on protocol rejection (Codex + Gemini)

pkg/dataplane/userspace/manager.go:327:

m.lastSnapshot = snap   // <-- written BEFORE protocol gate
// ...
if err := m.ensurePolicySchedulerProtocolLocked(cfg); err != nil {
    m.disarmPolicySchedulerProtocolFailureLocked(err)
    return result, err
}

When protocol gate rejects, m.lastSnapshot already holds the unpublished config. Next BumpFIBGeneration / RegenerateNeighborSnapshot / UpdatePolicyScheduleState reads stale state and may force an apply_snapshot of the rejected config.

MAJOR 4 — Fail-closed disarm is best-effort (Codex + Gemini)

pkg/dataplane/userspace/manager.go:567-575:

if err := m.requestLocked(req, &status); err != nil {
    slog.Warn("userspace: failed to disarm incompatible helper after policy scheduler protocol error",
        "protocol_err", protocolErr, "err", err)
    return
}

If the disarm RPC fails (socket full, helper hung, transport error), function logs and returns. Caller sees no error. Helper continues forwarding under old rules. Fail-closed never achieved.

Test coverage gap (both reviewers)

Wall-clock rollback: TestScheduler_WallClockBackwardStepFailsClosed at scheduler_test.go:202-228 only asserts T=1 inactive. Stops before T=2. Blinds CI to the one-tick recovery bug.
Mixed-version commit failure: manager_test.go:2704-2788 only exercises UpdatePolicyScheduleState() directly. Does NOT exercise Compile(), applyConfigLocked(), or commitAndApply() with old-helper protocol.

Block on

Wall-clock: After wallClockUnsafeLocked() returns true, preserve pre-rollback lastEval / lastWallUnixNano baseline. Stay-disarmed until a consistent monotonic-aligned wall reading is observed across multiple ticks.
Multi-tick rollback test: T=0 active → T=1 rollback inactive → T=2 still inactive → T=3 still inactive. Pin against current code regressing.
Commit transaction: Compile() failure on mixed-version protocol → commitAndApply() returns error → commit refused. Not best-effort disarm.
Stale snapshot: Move m.lastSnapshot = snap AFTER protocol gate succeeds.
Fail-closed escalation: disarmPolicySchedulerProtocolFailureLocked must propagate disarm-RPC failures to caller, not log-and-continue.

Round-4 claim trace

Codex task-mp9a4x1r-fcex88 — 3 MAJORs (Codex bundled wall-clock + commit + disarm into 3)
Gemini task-mp9adv2c-4jm4p3 — split into 4 MAJORs (A/B/C/D); same quote-grounded evidence
Copilot 50a94ed3 — stale; re-review @copilot requested

Not merging — author's decision.

psaab · 2026-05-17T08:32:34Z

Claude round-13 review on `8553583e`

Verdict: MERGE-READY

Trivial dead-code cleanup commit. Author shipped exactly what all 3 r12 reviewers (Claude + Codex + Gemini) requested:

✓ pkg/dataplane/policy_scheduler.go deleted (-123 lines, the obsolete BuildScheduledPolicyRuleSlots helper from r11)
✓ pkg/dataplane/policy_scheduler_test.go deleted (-88 lines)

No production code touched. Runtime path (compiler records exact slots on CompileResult.PolicyScheduleRuleSlots, runtime reads source-of-truth) remains intact and verified by Codex+Gemini r12.

Carry-overs (separate follow-up issues recommended)

Test asserting recorded slots match actual map writes (Codex r12 MINOR)
Disarm-failure Prometheus counter
Pre-Commit Compile validation for protocol-mismatch commit-atomicity
Apply-flow integration test for non-protocol Compile no-abort
Hermetic test setup (real netlink cleanup in compile-error tests)

No full quad-review dispatched for this trivial 4-insert/211-delete cleanup since the substantive design (r12) was already verified by 3 reviewers and this commit makes ZERO production code changes.

Not merging — author's decision.

Copilot · 2026-05-17T08:33:50Z

@copilot review

Round-12 commit d168648 takes new approach: compiler records exact (PolicySetID, RuleIndex) on CompileResult.PolicyScheduleRuleSlots during compilePolicies; runtime reads source-of-t...

Re-reviewed and applied the r12 MINOR cleanup in 8553583.

Removed obsolete recomputation path (pkg/dataplane/policy_scheduler.go) and its self-only test (pkg/dataplane/policy_scheduler_test.go).
Runtime remains sourced from compile-recorded CompileResult.PolicyScheduleRuleSlots (compilePolicies -> LastCompileResult() consumers in eBPF/DPDK).

Validation re-run on this commit:

go test ./pkg/dataplane ./pkg/config ./pkg/daemon ./pkg/dataplane/userspace ./pkg/scheduler
go test ./pkg/dataplane/dpdk
git diff --check

All passed. No UI changes in this backend-only update, so no screenshot applies.

…rrors Agent-Logs-Url: https://github.com/psaab/xpf/sessions/3a39f0a4-bc8e-440c-8652-dcf9a84b021a Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/3a39f0a4-bc8e-440c-8652-dcf9a84b021a Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/8be45280-5d1e-434e-b8d7-7ee35e5789f3 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/1bb33a9b-c0bb-4973-8771-b143edac1b95 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

…eduler protocol mismatch Agent-Logs-Url: https://github.com/psaab/xpf/sessions/1bb33a9b-c0bb-4973-8771-b143edac1b95 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/4088459f-cc3c-47bb-8ec6-831e389d7a37 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/913eb4a1-a048-478c-832f-e80f4c7fd212 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 35 out of 35 changed files in this pull request and generated 5 comments.

+func validatePolicySchedulerReferencesStrict(cfg *Config) error {
+	if cfg == nil {
+		return nil
+	}
+	check := func(pol *Policy) error {
+		if pol == nil || pol.SchedulerName == "" {
+			return nil
+		}
+		if _, ok := cfg.Schedulers[pol.SchedulerName]; ok {
+			return nil
+		}
+		return fmt.Errorf("policy %q references undefined scheduler %q", pol.Name, pol.SchedulerName)
+	}
+	for _, zpp := range cfg.Security.Policies {
+		if zpp == nil {
+			continue
+		}
+		for _, pol := range zpp.Policies {
+			if err := check(pol); err != nil {
+				return err
+			}
+		}
+	}
+	for _, pol := range cfg.Security.GlobalPolicies {
+		if err := check(pol); err != nil {
+			return err
+		}
+	}
+	return nil
+}


+	}
+	ctx, cancel := context.WithCancel(d.daemonCtx)
+	d.schedulerCancel = cancel
+	go d.scheduler.Run(ctx)


+	m.mu.Lock()
+	defer m.mu.Unlock()
+
+	m.policySchedulerActive = activeCopy
+	if cfg == nil {
+		if m.lastSnapshot == nil {
+			return
+		}
+		cfg = m.lastSnapshot.Config
+	}
+	if cfg == nil || m.lastSnapshot == nil {
+		return
+	}
+	if m.proc == nil || m.proc.Process == nil {
+		return
+	}
+
+	if err := m.ensurePolicySchedulerProtocolLocked(cfg); err != nil {
+		if disarmErr := m.disarmPolicySchedulerProtocolFailureLocked(err); disarmErr != nil {
+			slog.Warn("userspace: failed to disarm helper after refusing policy scheduler publish",
+				"protocol_err", err, "err", disarmErr)
+		}
+		slog.Warn("userspace: refusing policy scheduler publish to incompatible helper", "err", err)
+		return
+	}


+- **Timestamp**: 2026-05-17T05:12:00Z
+  - **Action**: Round-5 follow-up fix — in userspace pending-XSK-startup compile path, defer `lastSnapshot` cache update until ingress/local/NAT map sync succeeds so sync failures cannot poison cached snapshot state with an unpublished generation.
+  - **File(s)**: `pkg/dataplane/userspace/manager.go`, `_Log.md`
+
+- **Timestamp**: 2026-05-17T05:16:00Z
+  - **Action**: Restored `go.mod` after an unintended direct/indirect dependency classification flip introduced by an automation-only progress update.
+  - **File(s)**: `go.mod`, `_Log.md`
+
+- **Timestamp**: 2026-05-17T04:48:51Z
+  - **Action**: Re-restored `go.mod` after a subsequent tooling pass reintroduced the same direct/indirect dependency classification flip.
+  - **File(s)**: `go.mod`, `_Log.md`
+


+	if d.dp == nil || activeState == nil || compileResult == nil {
+		return
+	}
+	if _, isUserspace := d.dp.(*dpuserspace.Manager); isUserspace {
+		return
+	}


Copilot AI review requested due to automatic review settings May 17, 2026 01:53

Copilot started reviewing on behalf of psaab May 17, 2026 01:54 View session

Copilot AI reviewed May 17, 2026

View reviewed changes

This was referenced May 17, 2026

Userspace three-color policer core #1395

Merged

userspace: add port mirror snapshot foundation #1392

Merged

userspace-dp: add dataplane event codec foundation #1394

Merged

Copilot AI review requested due to automatic review settings May 17, 2026 03:25

Copilot started reviewing on behalf of psaab May 17, 2026 03:26 View session

Copilot AI reviewed May 17, 2026

View reviewed changes

Copilot started work on behalf of psaab May 17, 2026 04:40 View session

Copilot AI review requested due to automatic review settings May 17, 2026 04:44

psaab review requested due to automatic review settings May 17, 2026 04:44

Copilot AI review requested due to automatic review settings May 17, 2026 04:46

psaab review requested due to automatic review settings May 17, 2026 04:46

Copilot AI review requested due to automatic review settings May 17, 2026 04:47

psaab review requested due to automatic review settings May 17, 2026 04:47

Copilot finished work on behalf of psaab May 17, 2026 08:34

psaab and others added 20 commits May 17, 2026 07:23

plumb userspace policy scheduler inactive state

850c618

userspace: publish policy scheduler state

572859f

userspace: serialize policy scheduler snapshots

9d369d9

userspace: fail closed scheduled policy drift

bd59177

fix: keep scheduler fail-closed and surface protocol mismatch apply e…

94aa3f2

…rrors Agent-Logs-Url: https://github.com/psaab/xpf/sessions/3a39f0a4-bc8e-440c-8652-dcf9a84b021a Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

chore: tighten protocol error text and test clarity

bdd81d8

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/3a39f0a4-bc8e-440c-8652-dcf9a84b021a Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

chore: restore go.mod classification drift

0135c3a

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/3a39f0a4-bc8e-440c-8652-dcf9a84b021a Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

chore: finalize round-4 blocker validation

5fe6429

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/3a39f0a4-bc8e-440c-8652-dcf9a84b021a Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

userspace: avoid caching rejected scheduler snapshots

f4c38b0

fix: cache deferred userspace snapshot only after map sync succeeds

f0c503d

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/8be45280-5d1e-434e-b8d7-7ee35e5789f3 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

chore: restore go.mod after unintended dependency flip

363f92a

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/8be45280-5d1e-434e-b8d7-7ee35e5789f3 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

userspace: harden policy scheduler fail-closed paths

32000b8

Harden policy scheduler apply lifecycle

1074c62

fix: keep degraded compile contract except protocol mismatch

33baa49

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/1bb33a9b-c0bb-4973-8771-b143edac1b95 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

Preserve degraded compile contract; fail closed only on userspace sch…

29a0c27

…eduler protocol mismatch Agent-Logs-Url: https://github.com/psaab/xpf/sessions/1bb33a9b-c0bb-4973-8771-b143edac1b95 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

daemon: pin policy scheduler apply edge cases

2fda579

daemon: preserve scheduler state across aborting applies

e243e51

fix: align scheduler rule slot updates for ebpf/dpdk

ffa694b

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/4088459f-cc3c-47bb-8ec6-831e389d7a37 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

dataplane: schedule compiled policy slots

fb36b3e

dataplane: remove obsolete scheduler slot helper

3a79c99

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/913eb4a1-a048-478c-832f-e80f4c7fd212 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 17, 2026 14:24

psaab force-pushed the codex/1378-userspace-policy-scheduler branch from 8553583 to 3a79c99 Compare May 17, 2026 14:24

psaab merged commit e3f2176 into master May 17, 2026
2 checks passed

Copilot started reviewing on behalf of psaab May 17, 2026 14:24 View session

Copilot AI reviewed May 17, 2026

View reviewed changes

This was referenced May 17, 2026

Userspace SYN cookie core (#1374) #1393

Merged

dataplane: add runtime interface split scaffolding #1399

Open

Conversation

psaab commented May 17, 2026

Summary

Tests

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

psaab commented May 17, 2026

Claude round-1 review on ba62ce26

What's in scope

Hostile checks

Recommendation

Uh oh!

psaab commented May 17, 2026

Claude round-1 self-correction on ba62ce26

Codex finding 1 — Production publish path doesn't use the new field

Codex finding 2 — UpdatePolicyScheduleState callback not wired to userspace

Codex finding 3 — Wall-clock in scheduler violates monotonic-clock contract

Codex finding 4 — Wire format fail-open on mixed-version

Codex MINOR — Test gap

Claude miss pattern

Recommendation (revised)

Uh oh!

psaab commented May 17, 2026

Round-1 review consolidated on ba62ce26

Codex MAJORs

Codex MINOR

Self-correction

Recommendation

Uh oh!

psaab commented May 17, 2026

Review follow-up on 52686dd4

Uh oh!

psaab commented May 17, 2026

Claude round-2 review on 52686dd4

Round-1 MAJORs status

Hostile spot-checks

Recommendation

Uh oh!

psaab commented May 17, 2026

Round-2 quad-review consolidated synthesis on 52686dd4

Reviewer disagreement

Codex new MAJOR 1 — Scheduler lifecycle not wired for config commits

Codex new MAJOR 2 — Hybrid snapshot race

Codex MAJOR 3 — Wall-clock just reframed, not fixed

Codex MAJOR 4 — Mixed-version asymmetric

What's actually fixed

Self-correction

Recommendation (revised)

Uh oh!

psaab commented May 17, 2026

Round-3 fix update

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

psaab commented May 17, 2026

Claude round-3 review on 50a94ed3

Verification pending

Uh oh!

psaab commented May 17, 2026

Round-3 quad-review consolidated synthesis on 50a94ed3

Round-2 MAJORs status — 2 of 4 closed, 2 still open

Wall-clock — converged MAJOR (Codex+Gemini)

Codex new MAJOR — Old-helper "fail closed" is publish refusal only

What IS verified fixed

Test gaps

Self-correction

Recommendation

Uh oh!

psaab commented May 17, 2026

Round-4 blocker fixes pushed: 6be6f57e

Uh oh!

psaab commented May 17, 2026

Claude round-4 review on 6be6f57e

Round-3 MAJOR — Wall-clock — CLOSED at last ✓

Claude round-1 review on `ba62ce26`

Claude round-1 self-correction on `ba62ce26`

Codex finding 2 — `UpdatePolicyScheduleState` callback not wired to userspace

Round-1 review consolidated on `ba62ce26`

Review follow-up on `52686dd4`

Claude round-2 review on `52686dd4`

Round-2 quad-review consolidated synthesis on `52686dd4`

Claude round-3 review on `50a94ed3`

Round-3 quad-review consolidated synthesis on `50a94ed3`

Round-4 blocker fixes pushed: `6be6f57e`

Claude round-4 review on `6be6f57e`

Claude round-4 self-correction on `6be6f57e`

Round-4 quad-review consolidated synthesis on `6be6f57e`

Claude round-13 review on `8553583e`