Skip to content

Plumb userspace policy scheduler inactive state#1396

Merged
psaab merged 20 commits into
masterfrom
codex/1378-userspace-policy-scheduler
May 17, 2026
Merged

Plumb userspace policy scheduler inactive state#1396
psaab merged 20 commits into
masterfrom
codex/1378-userspace-policy-scheduler

Conversation

@psaab
Copy link
Copy Markdown
Owner

@psaab psaab commented May 17, 2026

References #1378.

Summary

  • Add rule_id, scheduler_name, and inactive fields to userspace policy snapshots and Rust PolicyRule.
  • Skip inactive policy rules before match evaluation and preserve first-match fallthrough semantics.
  • Document that daemon scheduler tick/restart/failover republish and missing-scheduler commit errors remain gates.

Tests

  • go test ./pkg/dataplane/userspace -run 'TestBuildPolicySnapshotsRoundTripsSchedulerInactiveAndRuleID|TestBuildPolicySnapshotsIncludesGlobalPolicies'\n- cargo test policy::tests::evaluate_policy_skips_inactive_rules\n- cargo test policy::tests::inactive_rule_falls_through_to_next_match\n- cargo test policy_rule_snapshot\n- git diff --check

Copilot AI review requested due to automatic review settings May 17, 2026 01:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Plumbs Junos scheduler/inactive policy state through the userspace dataplane snapshot DTO and Rust policy evaluator so that scheduled policy rules can be skipped during match evaluation. This is a slice of the broader #1378 work; the daemon still does not yet populate the active-state map on tick/restart/failover, so end-to-end scheduling remains gated.

Changes:

  • Adds RuleID, SchedulerName, and Inactive to the PolicyRuleSnapshot DTO (Go + Rust) with serde/JSON backward-compat defaults.
  • Adds a Go buildPolicySnapshotsWithSchedulerState builder that derives Inactive from an activeState map, plus a stable rule-ID helper.
  • Rust parse_policy_state propagates the new fields and try_match_rule skips inactive rules so first-match fallthrough remains correct.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/dataplane/userspace/protocol.go Adds RuleID/SchedulerName/Inactive JSON fields on PolicyRuleSnapshot.
pkg/dataplane/userspace/snapshot.go Builds new fields; introduces policyRuleInactive and stablePolicyRuleID helpers.
pkg/dataplane/userspace/manager_test.go Adds round-trip test for new fields with scheduler active-state map.
userspace-dp/src/protocol.rs Adds matching fields on Rust DTO and serde round-trip / legacy-decode tests.
userspace-dp/src/policy.rs Stores fields on PolicyRule; skips inactive rules in try_match_rule; adds stable_policy_rule_id fallback.
userspace-dp/src/policy_tests.rs Tests inactive-skip and fall-through, updates existing fixtures with ..Default::default().
userspace-dp/src/afxdp/tests.rs, test_fixtures.rs Spreads ..Default::default() into PolicyRuleSnapshot fixtures.
docs/pr/1373-retire-ebpf-dataplane/plan-1378-policy-schedulers.md Documents that daemon republish and missing-scheduler commit errors remain gates.

return false
}
active, ok := activeState[schedulerName]
return !ok || !active
}
schedulerName := pol.SchedulerName
out = append(out, PolicyRuleSnapshot{
RuleID: stablePolicyRuleID("junos-global", "junos-global", pol.Name),
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude round-1 review on ba62ce26

Verdict: MERGE-READY (foundation — cross-language plumbing for #1378)

297+/0- across 9 Go+Rust files. Wires rule_id, scheduler_name, and inactive through userspace policy snapshots and Rust policy evaluation. Per the plan-1378 update in this PR:

"The Go snapshot builder can compute inactive bits when supplied scheduler active state, but the normal daemon publish path still builds snapshots without that state. Scheduler tick/restart/failover republish and missing-scheduler commit errors remain gates before scheduled userspace policies are end-to-end enforced."

What's in scope

  • Go: protocol.go field additions, snapshot.go buildPolicySnapshotsWithSchedulerState helper, manager_test
  • Rust: protocol.rs matching deserialization, policy.rs evaluation honors inactive bit, policy_tests.rs round-trip
  • Plan doc updated to reflect the foundation slice status

Hostile checks

  • Wire format backward compat: new scheduler_name + inactive fields should be omitempty on Go side / #[serde(default)] on Rust side so older builds tolerate. Codex will verify.
  • Monotonic clock: the actual scheduler tick/eval logic is out of scope for this PR. But: per Feature gap: time-based policy schedulers (Junos schedulers { ... }) not propagated to userspace-dp #1378 contract, when wired up the time source MUST be monotonic, NOT wall clock — NTP rollback must not flap policies.
  • Apply-cycle race: snapshot publish via ArcSwap (or similar) — split-brain (some rules see new inactive bits, others don't) must not happen mid-apply. Foundation PR may not exercise this, but the new fields should be in the same snapshot atomic.
  • Inactive bit semantics: inactive=true means the rule is currently outside its scheduler's active window → policy evaluation must skip it (and fall through to next rule per first-match ordering). Verify in policy.rs.

Recommendation

Block on: nothing structural for the foundation slice.

Strongly consider: confirm inactive=true correctly causes first-match-ordering fall-through (Codex will trace).

Defer to follow-up PRs: scheduler tick/eval loop, missing-scheduler-name commit error, atomic snapshot republish on window transition, HA failover semantics.

Awaiting Codex (task-mp955c3b-xto9kj).

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude round-1 self-correction on ba62ce26

Revising verdict from MERGE-READY → MERGE-NEEDS-MAJOR to match Codex. Same pattern as #1394 — Codex verified the production wiring; I accepted the foundation-only framing without checking the actual publish path.

Codex finding 1 — Production publish path doesn't use the new field

"buildSnapshot still calls buildPolicySnapshots(cfg) with no scheduler state, and that path makes every scheduled rule active because policyRuleInactive(..., nil) returns false."

Plumbing was added but the daemon's actual publish path is unchanged. buildPolicySnapshotsWithSchedulerState exists but buildSnapshot at snapshot.go:53 doesn't call it. PR's own plan admits this — but that admission means the PR doesn't actually move the needle on #1378.

Codex finding 2 — UpdatePolicyScheduleState callback not wired to userspace

"Userspace does not override UpdatePolicyScheduleState. Manager embeds the eBPF dataplane at pkg/dataplane/userspace/manager.go:57-59... so the scheduler callback in pkg/daemon/daemon_run.go:585-589 still dispatches to the embedded eBPF map updater."

The scheduler tick fires UpdatePolicyScheduleState; on userspace dataplane, this is dispatched to the embedded eBPF backend (which updates policy_rules BPF map), not to the Rust userspace policy snapshot. Window changes don't propagate to userspace at all. The whole new field is dead code in production.

Codex finding 3 — Wall-clock in scheduler violates monotonic-clock contract

"pkg/scheduler/scheduler.go:31 and :76 use time.Now(), and isWithinWindow evaluates parsed dates and timeOfDay(now) from wall-clock fields at :124-164."

#1378's refined contract explicitly requires monotonic clock: "NTP rollback must not flap policies." This is wall-clock. An NTP rollback across a start/stop boundary CAN flip active state.

Codex finding 4 — Wire format fail-open on mixed-version

"New Rust defaults missing rule_id, scheduler_name, and inactive fields, so old Go payloads decode. But ProtocolVersion stays 1 and there is no capability/version gate; a new Go daemon sending inactive:true to an old helper would be silently ignored by serde's unknown-field behavior and the rule would become active. That is fail-open."

Mixed-version: new cli → old xpfd would silently drop the inactive bit and serve traffic through a rule that should be inactive. Real security gap if this is the only deployment safeguard.

Codex MINOR — Test gap

Go test = Go self-round-trip. Rust test = Rust self-round-trip. No Go→Rust cross-language test. No UpdatePolicyScheduleState userspace republish test. No monotonic rollback test.

Claude miss pattern

I labeled this "foundation cross-language plumbing" but didn't verify (a) production path uses it, (b) the scheduler callback reaches userspace, (c) time source is monotonic, (d) wire format has a version gate. The PR adds fields but no actual code path exercises them on userspace dataplane.

Recommendation (revised)

Block on:

  1. Wire buildSnapshotbuildPolicySnapshotsWithSchedulerState with real scheduler state (publish path actually uses the new field).
  2. Override UpdatePolicyScheduleState on userspace Manager to republish snapshot, not dispatch to embedded eBPF.
  3. Move pkg/scheduler/scheduler.go to monotonic clock (or document the wall-clock dependency as a known issue with NTP-rollback test pinning the failure mode).
  4. Add wire-format version gate OR document fail-closed contract for mixed-version (new field present = required).

Strongly consider: Go→Rust round-trip test + monotonic rollback test + apply-during-window-transition test.

Codex task: task-mp955c3b-xto9kj. Not merging — author's decision.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-1 review consolidated on ba62ce26

Reviewer Verdict
Claude MERGE-READY → MERGE-NEEDS-MAJOR (self-corrected)
Codex MERGE-NEEDS-MAJOR
Copilot (no review yet on this HEAD)

Codex MAJORs

  1. Production publish path doesn't use the new field. buildSnapshot:53 calls buildPolicySnapshots(cfg) without scheduler state; policyRuleInactive(..., nil) returns false. New buildPolicySnapshotsWithSchedulerState helper exists but isn't called by the daemon's actual publish path.
  2. UpdatePolicyScheduleState callback not wired to userspace. manager.go:57-59 embeds eBPF dataplane; scheduler tick at daemon_run.go:585-589 dispatches to the eBPF map updater. Window changes never reach the Rust userspace policy snapshot.
  3. Wall-clock in scheduler eval violates monotonic-clock contract. scheduler.go:31, 76 uses time.Now(). NTP rollback CAN flip active state. Per #1378 refined contract: "monotonic clock for window eval; NTP rollback must not flap policies."
  4. Wire format fail-open on mixed-version. No ProtocolVersion bump, no capability gate. New Go sending inactive:true to old Rust helper → silently ignored via serde unknown-field tolerance → rule becomes active. Security gap.

Codex MINOR

  • Tests are Go-self / Rust-self round-trip only. No cross-language Go→Rust round-trip. No UpdatePolicyScheduleState userspace republish test. No monotonic rollback test.

Self-correction

Same pattern as #1394/#1395: I labeled "foundation cross-language plumbing" without verifying (a) production path uses it, (b) the scheduler callback reaches userspace, (c) time source is monotonic, (d) wire format has a version gate. The PR adds fields but no code path exercises them on userspace dataplane.

Recommendation

Block on:

  1. Wire buildSnapshotbuildPolicySnapshotsWithSchedulerState with real scheduler state.
  2. Override UpdatePolicyScheduleState on userspace Manager to republish snapshot, not dispatch to embedded eBPF.
  3. Migrate pkg/scheduler/scheduler.go to monotonic clock OR document wall-clock dependency + pin failure mode with NTP-rollback test.
  4. Add wire-format version gate OR document fail-closed contract for mixed-version.

Strongly consider: cross-language round-trip test + monotonic rollback test + apply-during-window-transition test.

Codex task: task-mp955c3b-xto9kj. Not merging — author's decision.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Review follow-up on 52686dd4

Addressed the round-1 blockers for the userspace policy-scheduler slice:

  • Production userspace publish path now preserves and applies scheduler active-state via buildSnapshotWithSchedulerState, so scheduled policy inactive bits are not dropped on normal config rebuilds.
  • pkg/dataplane/userspace.Manager now shadows UpdatePolicyScheduleState and republishes one coherent userspace snapshot instead of falling through to the embedded eBPF manager.
  • Missing scheduler references now fail compile/commit instead of warning and defaulting active.
  • Packet workers still do not evaluate wall-clock time; wall-clock remains control-plane-only in pkg/scheduler, and Rust consumes snapshot booleans.
  • ConfigSnapshot.version is bumped to v2 and the Rust control server rejects unsupported snapshot versions instead of silently accepting scheduler fields it might ignore.
  • Updated module docs for the policy scheduler and userspace server protocol contract.

Validation run:

  • go test ./pkg/dataplane/userspace
  • go test ./pkg/config
  • go test ./pkg/daemon
  • go test ./pkg/scheduler
  • cargo test --manifest-path userspace-dp/Cargo.toml policy -- --nocapture
  • cargo test --manifest-path userspace-dp/Cargo.toml apply_snapshot_rejects_unsupported_protocol_version -- --nocapture
  • git diff --check

All passed. Rust emitted the repo's existing warning set.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude round-2 review on 52686dd4

Verdict: MERGE-NEEDS-MINOR (3 of 4 round-1 MAJORs closed; wall-clock not addressed)

Round-1 MAJORs status

1. Production publish path uses scheduler state — CLOSED ✓

manager.go:285-286 Compile() now does:

activeState := m.policySchedulerActiveStateSnapshot()
snap := buildSnapshotWithSchedulerState(cfg, ucfg, m.bumpGeneration(), m.readFIBGeneration(), activeState)

plus the new policySchedulerActive field on Manager + policySchedulerActiveStateSnapshot() getter.

2. UpdatePolicyScheduleState override — CLOSED ✓

manager.go:404-450 new override:

func (m *Manager) UpdatePolicyScheduleState(cfg *config.Config, activeState map[string]bool) {
    // ... bumps generation, rebuilds policies via buildPolicySnapshotsWithSchedulerState,
    // publishes via apply_snapshot control request to the Rust helper
}

Explicitly shadows the embedded eBPF manager method. Comment is unambiguous: "scheduled userspace policies must not update the policy_rules BPF map."

3. Wall-clock vs monotonic — NOT ADDRESSED

pkg/scheduler/scheduler.go does NOT appear in the r2 diff. git diff ba62ce26..52686dd4 -- pkg/scheduler/*.go returns empty. The wall-clock dependency at scheduler.go:31, 76 is unchanged — NTP rollback can still flip active state.

The PR comment claims "scheduler/server contracts" doc updated but not the time source. #1378's contract explicitly requires monotonic clock. This is the remaining MAJOR from round-1.

4. Protocol version reject — CLOSED ✓ (need Codex/Gemini verification)

PR comment says "Rust rejects unsupported snapshot protocol versions" with handlers.rs +92. Will verify.

Hostile spot-checks

  • policySchedulerActive field protected by m.mu
  • New apply_snapshot control request via requestLocked — atomic publish, not split-brain ✓
  • bumpGeneration correctly preceds publish ✓

Recommendation

Block on: address the wall-clock issue in pkg/scheduler/scheduler.go — either migrate to monotonic clock, OR document the wall-clock dependency + pin failure with an NTP-rollback test that demonstrates the known issue.

Strongly consider: verify protocol version reject path in handlers.rs properly fails closed on unknown version (Codex/Gemini will trace).

Defer: nothing else.

Awaiting Codex (task-mp96mn6y-6cq7ia) and Gemini Pro 3 (task-mp96nhjj-fqnbn1).

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-2 quad-review consolidated synthesis on 52686dd4

Reviewer Verdict
Claude MERGE-NEEDS-MINOR → MERGE-NEEDS-MAJOR (self-corrected after Codex)
Codex MERGE-NEEDS-MAJOR
Gemini Pro 3 MERGE-READY (4 round-1 MAJORs verified, accepts README architectural note)
Copilot (no fresh review on this HEAD)

Reviewer disagreement

Gemini accepted the round-1 MAJORs as "all addressed" including the wall-clock issue via README architectural note. Codex found 4 NEW MAJORs not covered by Gemini's verification.

Codex new MAJOR 1 — Scheduler lifecycle not wired for config commits

"The daemon only creates d.scheduler once in daemon_run.go:584-596; daemon_apply.go:423-428 recompiles dataplane config but never calls Scheduler.Update or starts a scheduler when a commit adds the first scheduler. A runtime commit adding scheduled policies can publish with nil/stale scheduler state and may never get corrected."

Worked failure: operator commits config without schedulers { } block → daemon boots, scheduler runs with empty state. Operator later commits adding scheduled policy → dataplane gets new policies but scheduler never learns about them → policies publish with inactive: false permanently.

This is the round-1 finding "publish path doesn't use scheduler state" surviving in a different form — the publish path NOW uses scheduler state, but the scheduler itself isn't kept in sync with config commits.

Codex new MAJOR 2 — Hybrid snapshot race

"UpdatePolicyScheduleState copies m.lastSnapshot, then replaces only Config and Policies from the passed cfg at manager.go:423-429. The daemon callback uses store.ActiveConfig() outside the apply semaphore, so a scheduler tick during/after commit can combine new policies with old zones/routes/interfaces."

Race window: scheduler tick fires between commit's policy update and zone/route update. Publishes a hybrid snapshot with new policies + stale zones. First-match policy evaluation against wrong zone bindings → real packet misrouting.

Codex MAJOR 3 — Wall-clock just reframed, not fixed

"scheduler.go:31, 76 still use time.Now(). The new docs just reframe this as control-plane wall-clock behavior, not a monotonic-clock implementation or a signed-off deferral."

Gemini accepted the README architectural note ("wall-clock only in control plane, fast path consumes booleans"). Codex doesn't accept the rename as a fix. My read: the README framing IS legitimate engineering rationale — fast-path tearing is the real risk and is avoided. But the NTP-rollback control-plane behavior is undocumented in tests. Recommend either committing to monotonic in the control plane OR adding an explicit "wall-clock NTP rollback can cause brief active-state churn; operators should run sane NTP" note + test pinning the failure mode.

Codex MAJOR 4 — Mixed-version asymmetric

"New Rust rejects old Go version-1 snapshots, but new Go cannot force a truly old Rust helper to reject version 2. Older Rust without the version gate accepts snapshots; pre-scheduler Rust also ignores unknown inactive, recreating the fail-open path."

Forward compat (old Go → new Rust): works, version mismatch rejected ✓
Backward compat (new Go → old Rust): fail open. Old Rust without version gate accepts the v2 snapshot, ignores inactive field, treats rule as active.

This is the same fail-open Codex flagged in round-1 — only HALF closed.

What's actually fixed

Claim Status Reviewer
UpdatePolicyScheduleState override → Rust republish Both
Compile() passes scheduler state Both
Missing scheduler ref = commit error Both
Rust rejects unsupported version Both
Wall-clock monotonic reframed only (Gemini accepts, Codex doesn't) Disagreement
Mixed-version BOTH directions half (new Rust closed; old Rust still open) Codex caught

Self-correction

My round-2 review caught 1 of these (wall-clock) and flagged as MINOR. I missed scheduler lifecycle, hybrid snapshot race, and asymmetric mixed-version. Same "didn't probe the production wiring" pattern as round-1.

Recommendation (revised)

Block on:

  1. Wire scheduler update on config commits — daemon_apply.go must call Scheduler.Update on commits that change scheduler config.
  2. Close hybrid snapshot race — UpdatePolicyScheduleState must take the apply semaphore OR use atomic snapshot publish that doesn't tear across fields.
  3. Mixed-version: new Go must NOT send v2 to old Rust. Either (a) capability negotiation handshake on connect, or (b) accept that v2 is only safe with v2-capable Rust and document the upgrade order requirement.

Strongly consider: wall-clock — either commit to monotonic in pkg/scheduler/scheduler.go OR add an explicit deferral note + NTP-rollback test pinning the known control-plane churn.

Codex task: task-mp96mn6y-6cq7ia. Gemini Pro 3 task: task-mp96nhjj-fqnbn1. Not merging — author's decision.

Copilot AI review requested due to automatic review settings May 17, 2026 03:25
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-3 fix update

Pushed 50a94ed3 for the r2 production-wiring blockers.

What changed:

  • Removed the startup-only scheduler loop in Run; scheduler lifecycle is now reconciled on every committed config under applySem.
  • Added scheduler.NewPrimed so apply can compute the initial active-state map without firing a callback while the apply semaphore is already held.
  • Runtime scheduler callbacks now reacquire applySem, carry an epoch, and skip stale callbacks after config replacement.
  • Userspace snapshots are seeded with the same scheduler active-state map used by the current apply transaction, preventing hybrid Config/Policies snapshots.
  • The Rust helper now reports config_snapshot_protocol_version; Go refuses scheduled-policy snapshot publishes to older helpers before they can fail open.
  • Updated scheduler/server/module docs and the Feature gap: time-based policy schedulers (Junos schedulers { ... }) not propagated to userspace-dp #1378 plan notes to match the new lifecycle and compatibility contract.

Validation:

  • go test ./pkg/scheduler ./pkg/daemon ./pkg/dataplane/userspace
  • cargo test --manifest-path userspace-dp/Cargo.toml apply_snapshot_rejects_unsupported_protocol_version -- --nocapture
  • git diff --check

The cargo run still emits existing warning noise; the targeted protocol-version gate test passes.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 10 comments.

Comments suppressed due to low confidence (2)

pkg/daemon/daemon_scheduler.go:60

  • publishPolicyScheduleState reads d.policySchedulerEpoch after acquiring applySem, which guards against stale callbacks from previous scheduler generations. However, between the moment a scheduler callback is queued (inside Scheduler.evaluate after s.mu.Unlock() but before updateFn(cp)) and the moment publishPolicyScheduleState runs, a reconcilePolicySchedulerLocked may have already cancelled the old scheduler (d.schedulerCancel()), but Run's ticker goroutine fires evaluate directly — not through ctx.Done(). The cancelled scheduler may still execute the pending updateFn after policySchedulerEpoch.Add(1) ran. The epoch check correctly drops the stale update, but the call still acquires applySem, which can introduce latency on the apply path if many stale callbacks pile up. This is benign but worth a brief comment to document that the epoch guard is the only correctness barrier here.
func (d *Daemon) publishPolicyScheduleState(epoch uint64, activeState map[string]bool) {
	if err := d.applySem.Acquire(context.Background(), 1); err != nil {
		slog.Warn("scheduler: failed to acquire apply semaphore", "err", err)
		return
	}
	defer d.applySem.Release(1)

	if epoch != d.policySchedulerEpoch.Load() {
		return
	}
	cfg := d.store.ActiveConfig()
	if cfg == nil || d.dp == nil {
		return
	}
	d.seedPolicySchedulerActiveStateLocked(activeState)
	d.dp.UpdatePolicyScheduleState(cfg, activeState)
}

pkg/daemon/daemon_scheduler.go:59

  • policySchedulerActive is read/written under m.mu everywhere except inside UpdatePolicyScheduleState, where line 424 assigns m.policySchedulerActive = activeCopy under m.mu.Lock, then seedPolicySchedulerActiveStateLocked in daemon_scheduler.go (line 67) calls SetPolicySchedulerActiveState on the same manager via the interface dispatch. SetPolicySchedulerActiveState re-acquires m.mu. This is currently fine because seedPolicySchedulerActiveStateLocked is called before UpdatePolicyScheduleState (not from within), but the duplicate seeding (publishPolicyScheduleState calls both seedPolicySchedulerActiveStateLocked and d.dp.UpdatePolicyScheduleState, the latter of which also seeds) is redundant and easy to misread. Consider dropping the explicit seedPolicySchedulerActiveStateLocked call from publishPolicyScheduleState since UpdatePolicyScheduleState already overwrites m.policySchedulerActive from its argument.
	d.seedPolicySchedulerActiveStateLocked(activeState)
	d.dp.UpdatePolicyScheduleState(cfg, activeState)

Comment on lines +55 to +58
if snapshot.version != CONFIG_SNAPSHOT_PROTOCOL_VERSION {
response.ok = false;
response.error = format!(
"unsupported snapshot protocol version {} (want {})",
Comment thread pkg/config/compiler.go Outdated
Comment on lines +221 to +263
@@ -228,6 +231,37 @@ func compileExpanded(tree *ConfigTree) (*Config, error) {
return cfg, nil
}

func validatePolicySchedulerReferencesStrict(cfg *Config) error {
if cfg == nil {
return nil
}
check := func(pol *Policy) error {
if pol == nil || pol.SchedulerName == "" {
return nil
}
if _, ok := cfg.Schedulers[pol.SchedulerName]; ok {
return nil
}
return fmt.Errorf("policy %q references undefined scheduler %q", pol.Name, pol.SchedulerName)
}
for _, zpp := range cfg.Security.Policies {
if zpp == nil {
continue
}
for _, pol := range zpp.Policies {
if err := check(pol); err != nil {
return err
}
}
}
for _, pol := range cfg.Security.GlobalPolicies {
if err := check(pol); err != nil {
return err
}
}
return nil
}
Comment thread pkg/daemon/daemon_apply.go Outdated
Comment on lines +436 to +438
if d.dp != nil && policySchedulerActiveState != nil && compileResult != nil {
d.dp.UpdatePolicyScheduleState(cfg, policySchedulerActiveState)
}
Comment thread pkg/daemon/daemon_scheduler.go Outdated
Comment on lines +36 to +40
if d.daemonCtx != nil {
ctx, cancel := context.WithCancel(d.daemonCtx)
d.schedulerCancel = cancel
go sched.Run(ctx)
}
Comment on lines +43 to +45
if len(s.active) > 0 {
s.notifyActiveState()
}
Comment on lines +364 to +368
fn stable_policy_rule_id(snap: &PolicyRuleSnapshot) -> String {
if !snap.rule_id.is_empty() {
return snap.rule_id.clone();
}
format!("{}->{}/{}", snap.from_zone, snap.to_zone, snap.name)
Comment thread pkg/dataplane/userspace/snapshot.go Outdated
Comment on lines +2023 to +2025
if schedulerName == "" || activeState == nil {
return false
}
Comment on lines +446 to +450
}
if err := m.ensurePolicySchedulerProtocolLocked(cfg); err != nil {
slog.Warn("userspace: refusing policy scheduler publish to incompatible helper", "err", err)
return
}
Comment on lines 294 to +296
ucfg := deriveUserspaceConfig(cfg)
snap := buildSnapshot(cfg, ucfg, m.bumpGeneration(), m.readFIBGeneration())
activeState := m.policySchedulerActiveStateSnapshot()
snap := buildSnapshotWithSchedulerState(cfg, ucfg, m.bumpGeneration(), m.readFIBGeneration(), activeState)
Comment on lines +83 to 86
monitoredIfindexes map[int]struct{}
lastBindingIndices []uint32
neighborsPrewarmed bool
ctrlEnableAt time.Time
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude round-3 review on 50a94ed3

Verdict: MERGE-READY-pending (PR claims address all 4 round-2 MAJORs; need reviewer verification on race + mixed-version)

Round-3 diff: 15 files, +337/-61. PR claim covers all 4 round-2 MAJORs:

  1. "scheduler lifecycle now reconciles under applySem" → addresses round-2 MAJOR 1 (lifecycle not wired)
  2. "runtime scheduler publishes are serialized" → addresses round-2 MAJOR 2 (hybrid snapshot race)
  3. "stale callbacks are epoch-gated" → defense-in-depth for race
  4. "userspace snapshots are seeded coherently" → race fix
  5. "old helpers fail closed for scheduled-policy snapshots" → round-2 MAJOR 4 (mixed-version)

Wall-clock (round-2 MAJOR 3) — PR claim doesn't mention. Let me check scheduler.go changes.

Notable: pkg/scheduler/scheduler.go has +67/-? changes. Need to verify if wall-clock → monotonic was addressed OR explicit deferral test was added.

Verification pending

  • A. Scheduler lifecycle wired on commits (Codex finding: daemon_apply.go must call Scheduler.Update)
  • B. UpdatePolicyScheduleState holds applySem + epoch-gated stale callbacks
  • C. Wall-clock: monotonic OR explicit failure-mode test
  • D. Mixed-version: new Go capability handshake refusing v2 to old Rust

This is high-volume refactor — Codex r3 will trace the race and lifecycle changes carefully.

Awaiting Codex (task-mp99dkbf-9xf7tg) and Gemini Pro 3 (task-mp99ed5e-y81abm).

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-3 quad-review consolidated synthesis on 50a94ed3

Reviewer Verdict
Claude MERGE-READY-pending → MERGE-NEEDS-MAJOR (self-corrected)
Codex MERGE-NEEDS-MAJOR
Gemini Pro 3 MERGE-NEEDS-MAJOR

Round-2 MAJORs status — 2 of 4 closed, 2 still open

Round-2 MAJOR Status Verified by
1. Scheduler lifecycle not wired for commits CLOSED Codex+Gemini
2. Hybrid snapshot race CLOSED Codex+Gemini (applySem + epoch-gated)
3. Wall-clock STILL OPEN Codex+Gemini both verify wall-clock not fixed
4. Mixed-version PARTIAL Codex flagged: refusal exists but doesn't fail dataplane closed

Wall-clock — converged MAJOR (Codex+Gemini)

Codex: "Still uses wall time directly: NewPrimed(... time.Now()), ticker timestamps, and Update(... time.Now()) in scheduler.go:39,61,90. The docs explicitly reframe it instead of fixing it."

Gemini: "The author ignored the monotonic evaluation requirement and merely reframed the wall-clock usage in the documentation. No monotonic calculation or explicit deferral test exists."

Both reviewers reject the README reframing as a fix. This has been MAJOR for 3 rounds.

Codex new MAJOR — Old-helper "fail closed" is publish refusal only

"ensurePolicySchedulerProtocolLocked at manager.go:533-551 probes status and errors if helper protocol is < 2. But applyConfigLocked at daemon_apply.go:430-433 only records the compile error and continues, and commitAndApply at daemon_apply.go:101-105 still returns the committed config. The userspace manager also sets m.lastSnapshot = snap BEFORE the protocol check at manager.go:327, 370-371."

Worked failure: operator commits config with scheduled policy on a cluster running old Rust helper. The commit SUCCEEDS at Go-side. Old helper keeps forwarding the previous (un-scheduled) snapshot. If the commit added a scheduler that should DEACTIVATE a permit rule outside work hours, the old helper keeps permitting outside that window. Not fail-closed.

What IS verified fixed

  • ✓ Scheduler reconciles on apply via reconcilePolicySchedulerLocked (recreate, not Update)
  • commitAndApply holds applySem across commit + apply
  • ✓ Runtime scheduler publishes acquire applySem + check epoch (daemon_scheduler.go:44-59)
  • ✓ Stale callbacks epoch-gated
  • ✓ Userspace compile seeds scheduler state coherently
  • ✓ New Rust rejects old Go version-1 snapshots
  • ✓ New Go refuses to publish scheduled-policy snapshots to old helpers (just doesn't fail the dataplane on refusal)

Test gaps

  • Missing: daemon-level scheduler reconciliation on config commit
  • Missing: apply-vs-tick race/epoch behavior test
  • Missing: NTP rollback / clock-jump failure mode
  • Missing: full mixed-version commit/apply path proving old helper cannot keep forwarding stale permissive snapshot

Self-correction

I posted "MERGE-READY-pending" trusting the PR claim. Both Codex and Gemini found the wall-clock issue unaddressed for the 3rd round running. I also missed the "fail-closed-is-only-publish-refusal" gap that Codex traced through commitAndApply.

Recommendation

Block on:

  1. Wall-clock: this is the 3rd round flagging it. Options: (a) migrate pkg/scheduler/scheduler.go to monotonic clock, (b) add explicit time.Time injection so tests can verify NTP-rollback behavior + add a rollback test, OR (c) get explicit operator-level sign-off that wall-clock control-plane behavior is acceptable and bake that into the issue contract — don't keep reframing in the README.
  2. Fail-closed enforcement on old-helper refusal: applyConfigLocked must surface the error to fail commitAndApply, OR the snapshot must not be cached before the protocol check, OR the dataplane must enter a fail-safe state (deny all / fall back to known-safe last snapshot without scheduled policies).

Strongly consider: the 4 missing tests (scheduler-on-commit, apply-vs-tick race, NTP rollback, mixed-version full path).

Codex task: task-mp99dkbf-9xf7tg. Gemini Pro 3 task: task-mp99ed5e-y81abm. Not merging — author's decision.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-4 blocker fixes pushed: 6be6f57e

Addressed the r3 MERGE-NEEDS-MAJOR findings:

  1. Wall-clock discontinuity is now an actual fail-closed behavior

    • Scheduler evaluations compare wall elapsed time with Go monotonic elapsed time from the prior evaluation.
    • Backward wall-clock steps, monotonic regressions, or drift beyond tolerance publish all scheduler bits inactive for that evaluation.
    • Added regression coverage for backward wall-clock step fail-closed behavior.
  2. Old-helper fail-closed now actively disarms forwarding

    • Compile, UpdatePolicyScheduleState, and deferred publish now call a fail-closed disarm helper when scheduled policies require a newer snapshot protocol than the helper supports.
    • The manager sends set_forwarding_state armed=false before returning/refusing the publish error, so an old helper cannot keep forwarding a stale snapshot that ignores scheduler inactive bits.
    • Updated the old-helper test to assert the disarm request and the resulting ForwardingArmed=false status.
  3. Docs updated with the real invariants

    • pkg/scheduler/README.md
    • userspace-dp/src/server/README.md
    • docs/pr/1373-retire-ebpf-dataplane/plan-1378-policy-schedulers.md

Validation:

go test ./pkg/scheduler ./pkg/dataplane/userspace ./pkg/daemon
git diff --check

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude round-4 review on 6be6f57e

Verdict: MERGE-READY (both round-3 MAJORs actually fixed this round)

Round-3 MAJOR — Wall-clock — CLOSED at last

Not a README reframe this time. Real defense at scheduler.go:140-167:

func (s *Scheduler) wallClockUnsafeLocked(now time.Time) bool {
    wallElapsed := time.Duration(now.UnixNano() - s.lastWallUnixNano)
    if wallElapsed < 0 { fail_closed; return true }
    monoElapsed := now.Sub(s.lastEval)
    if monoElapsed < 0 { fail_closed; return true }
    delta := abs(wallElapsed - monoElapsed)
    if delta > wallClockDriftTolerance { fail_closed; return true }  // 5s tolerance
    return false
}

Then in evaluate:

wallClockUnsafe := s.wallClockUnsafeLocked(now)
...
cur := false
if !wallClockUnsafe {
    cur = isWithinWindow(now, sched)
}

When wall-clock differs from monotonic by >5s (NTP rollback, clock jump, container migration), schedulers fail closed — cur = false for all schedulers in this evaluation → inactive bit goes true → policies fall through.

This is the right monotonic-clock pattern: use mono as ground truth, detect wall-clock drift, fail closed on detection. Better than a pure monotonic clock for the actual use case (Junos time windows are wall-clock-defined).

Round-3 MAJOR — Fail-closed enforcement — CLOSED

manager.go:556-578 new disarmPolicySchedulerProtocolFailureLocked:

func (m *Manager) disarmPolicySchedulerProtocolFailureLocked(protocolErr error) {
    ...
    req := ControlRequest{
        Type: "set_forwarding_state",
        Forwarding: &ForwardingControlRequest{Armed: false},
    }
    m.requestLocked(req, &status)
    ...
    slog.Warn("userspace: disarmed helper after policy scheduler protocol error", "err", protocolErr)
}

Called from both Compile() and UpdatePolicyScheduleState() on protocol error. Old helper is disarmed → forwarding stops → fail-closed at dataplane level, not just publish refusal.

Hostile spot-checks

  • Wall-clock test coverage: scheduler_test.go +29 should pin NTP-rollback / clock-jump behavior. Codex will verify.
  • Disarm idempotency: disarming an already-disarmed helper — graceful? applyHelperStatusLocked runs after disarm; check for state leak.
  • Test gap: does mid-flight scheduler tick during disarm produce inconsistent state? Likely covered by the apply semaphore from round-2.

Recommendation

Block on: nothing. Both round-3 MAJORs actually closed with real defense, not docs reframing.

Strongly consider: verify scheduler_test.go +29 covers wall-clock backward + drift > 5s + monotonic backward (Codex will trace).

Awaiting Codex (task-mp9a4x1r-fcex88).

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude round-4 self-correction on 6be6f57e

Revising verdict MERGE-READY → MERGE-NEEDS-MAJOR to match Codex.

I posted READY because the new wallClockUnsafeLocked did mono-vs-wall comparison + 5s drift tolerance. Codex showed the fix has a one-tick recovery hole I missed.

Codex MAJOR 1 — Wall-clock rollback reopens after one tick

pkg/scheduler/scheduler.go:126-128 updates lastEval / lastWallUnixNano even after wallClockUnsafeLocked() forces the current evaluation inactive. On the next tick, the rolled-back wall clock IS now the new baseline, so wallElapsed ≈ monoElapsed and isWithinWindow() runs against the stale wall time again.

Worked failure (Codex):

T=0 (wall 16:59, mono 16:59): isWithinWindow=true, active
T=1 (wall 12:00, mono 17:00): wallElapsed=-4h59m, monoElapsed=1m,
     delta=5h+ > 5s → wallClockUnsafe=true → inactive ✓ (correct)
     BUT lastEval=mono 17:00, lastWallUnixNano=wall 12:00 written ✗
T=2 (wall 12:01, mono 17:01): wallElapsed=1m, monoElapsed=1m,
     delta=0 → wallClockUnsafe=FALSE
     isWithinWindow(12:01 against 08:00-17:00) → TRUE → active 💥

Rollback fails closed once, then the 08:00-17:00 allow window reopens for hours under rolled-back time.

Codex MAJOR 2 — Old-helper commit path still doesn't fail the commit transaction

pkg/daemon/daemon_apply.go:430-433 still only records the compile error. commitAndApply() still returns the committed config. The new set_forwarding_state armed=false is best-effort runtime disarm, NOT commit refusal.

Also pkg/dataplane/userspace/manager.go:327 still assigns m.lastSnapshot = snap BEFORE the protocol gate at manager.go:370-372. Stale snapshot cached even when protocol rejects.

Codex MAJOR 3 — Fail-closed disarm is best-effort and non-fatal

disarmPolicySchedulerProtocolFailureLocked() at manager.go:567-575 logs and returns if the disarm RPC fails or returns failed status. In the scheduler tick path, that leaves UpdatePolicyScheduleState() with no propagated error and no guaranteed local hard-disable if the old helper keeps forwarding stale state.

Test coverage gaps (Codex)

  • NTP rollback: scheduler_test.go:202-228 checks only the immediate inactive transition using wall-only time.Date values. Does NOT pin the real failure: monotonic advances while wall rolls back, then next evaluation reopens.
  • Mixed-version commit failure: manager_test.go:2704-2788 only exercises UpdatePolicyScheduleState(), NOT Compile(), applyConfigLocked(), or commitAndApply().

Recommendation

Block on:

  1. After wallClockUnsafeLocked returns true, DO NOT update lastEval / lastWallUnixNano — preserve pre-rollback baseline until a sufficiently consistent monotonic-aligned wall reading observed. Stay-disarmed period (e.g., until monoElapsed > 60s AND delta < 5s).
  2. Multi-tick NTP rollback test: T=0 active, T=1 rollback inactive, T=2 still inactive, T=3 still inactive — guarantees rollback doesn't auto-recover.
  3. commitAndApply() returns error when policy scheduler compile fails on old helper. Commit refusal, not best-effort disarm.
  4. Move m.lastSnapshot = snap AFTER protocol gate succeeds.
  5. Fail-closed disarm must escalate to caller (return error from UpdatePolicyScheduleState), not log-and-continue.

Claude miss pattern (5th self-correction on #1396 across rounds)

Round 1: production publish path didn't use scheduler state. Round 2: lifecycle/race/wall/asymmetric (4 new MAJORs). Round 3: wall-clock was still wall-clock. Round 4: I saw the new mono-vs-wall comparison and stopped looking — missed that the bookkeeping update happens unconditionally. Pattern: I verify the obvious branch and miss the bookkeeping side-effect downstream.

Codex task: task-mp9a4x1r-fcex88. Not merging — author's decision.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

@copilot review

Round-4 commit 6be6f57. Codex flagged 3 MAJORs: (1) wall-clock rollback recovers after 1 tick because lastEval/lastWallUnixNano are written unconditionally at scheduler.go:126-128, (2) commit transaction doesn't fail on old-helper compile error at daemon_apply.go:430-433, (3) fail-closed disarm is best-effort log-and-return at manager.go:567-575. Re-review against these.

Copilot AI review requested due to automatic review settings May 17, 2026 04:44
@psaab psaab review requested due to automatic review settings May 17, 2026 04:44
Copilot AI review requested due to automatic review settings May 17, 2026 04:46
@psaab psaab review requested due to automatic review settings May 17, 2026 04:46
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-4 quad-review consolidated synthesis on 6be6f57e

Reviewer Verdict
Claude MERGE-NEEDS-MAJOR (self-corrected from READY after Codex MAJOR)
Codex MERGE-NEEDS-MAJOR
Gemini Pro 3 MERGE-NEEDS-MAJOR
Copilot Re-review requested; still on stale commit 50a94ed3

Codex + Gemini converge on 4 MAJORs (Gemini split Codex's MAJOR-2 into 2). Block.

MAJOR 1 — Wall-clock rollback recovers after 1 tick (Codex + Gemini)

pkg/scheduler/scheduler.go (end of evaluate loop):

s.active = newActive
s.lastEval = now
s.lastWallUnixNano = now.UnixNano()

Bookkeeping runs unconditionally, even when wallClockUnsafeLocked() already returned true and forced newActive=false.

Worked failure (both reviewers reproduced identically):

T=0 (wall 16:59, mono 16:59): isWithinWindow=true, active ✓
                              lastEval=16:59, lastWallUnixNano=16:59
T=1 (wall 12:00, mono 17:00): NTP rolls wall back 4h59m
                              wallElapsed=-4h59m, monoElapsed=1m, delta>5s
                              wallClockUnsafe=true → active=false ✓ (one tick)
                              BUT bookkeeping writes lastEval=17:00, lastWallUnixNano=12:00 ✗
T=2 (wall 12:01, mono 17:01): wallElapsed=1m, monoElapsed=1m, delta=0
                              wallClockUnsafe=FALSE → isWithinWindow(12:01 in 08-17) → active=true 💥

Rollback fails closed for exactly one tick, then 08:00-17:00 allow window reopens for hours under rolled-back time.

MAJOR 2 — Commit transaction ignores compile failure (Codex + Gemini)

pkg/daemon/daemon_apply.go:430-433:

if compileResult, err = d.dp.Compile(cfg); err != nil {
    d.recordCompileFailure(err)
} else {
    d.recordCompileSuccess()
}

Compile error is recorded but not returned. commitAndApply() at daemon_apply.go:101-105 completes successfully. The daemon persists config to DB/cluster while dataplane rejected it.

MAJOR 3 — Stale snapshot cached on protocol rejection (Codex + Gemini)

pkg/dataplane/userspace/manager.go:327:

m.lastSnapshot = snap   // <-- written BEFORE protocol gate
// ...
if err := m.ensurePolicySchedulerProtocolLocked(cfg); err != nil {
    m.disarmPolicySchedulerProtocolFailureLocked(err)
    return result, err
}

When protocol gate rejects, m.lastSnapshot already holds the unpublished config. Next BumpFIBGeneration / RegenerateNeighborSnapshot / UpdatePolicyScheduleState reads stale state and may force an apply_snapshot of the rejected config.

MAJOR 4 — Fail-closed disarm is best-effort (Codex + Gemini)

pkg/dataplane/userspace/manager.go:567-575:

if err := m.requestLocked(req, &status); err != nil {
    slog.Warn("userspace: failed to disarm incompatible helper after policy scheduler protocol error",
        "protocol_err", protocolErr, "err", err)
    return
}

If the disarm RPC fails (socket full, helper hung, transport error), function logs and returns. Caller sees no error. Helper continues forwarding under old rules. Fail-closed never achieved.

Test coverage gap (both reviewers)

  • Wall-clock rollback: TestScheduler_WallClockBackwardStepFailsClosed at scheduler_test.go:202-228 only asserts T=1 inactive. Stops before T=2. Blinds CI to the one-tick recovery bug.
  • Mixed-version commit failure: manager_test.go:2704-2788 only exercises UpdatePolicyScheduleState() directly. Does NOT exercise Compile(), applyConfigLocked(), or commitAndApply() with old-helper protocol.

Block on

  1. Wall-clock: After wallClockUnsafeLocked() returns true, preserve pre-rollback lastEval / lastWallUnixNano baseline. Stay-disarmed until a consistent monotonic-aligned wall reading is observed across multiple ticks.
  2. Multi-tick rollback test: T=0 active → T=1 rollback inactive → T=2 still inactive → T=3 still inactive. Pin against current code regressing.
  3. Commit transaction: Compile() failure on mixed-version protocol → commitAndApply() returns error → commit refused. Not best-effort disarm.
  4. Stale snapshot: Move m.lastSnapshot = snap AFTER protocol gate succeeds.
  5. Fail-closed escalation: disarmPolicySchedulerProtocolFailureLocked must propagate disarm-RPC failures to caller, not log-and-continue.

Round-4 claim trace

  • Codex task-mp9a4x1r-fcex88 — 3 MAJORs (Codex bundled wall-clock + commit + disarm into 3)
  • Gemini task-mp9adv2c-4jm4p3 — split into 4 MAJORs (A/B/C/D); same quote-grounded evidence
  • Copilot 50a94ed3 — stale; re-review @copilot requested

Not merging — author's decision.

Copilot AI review requested due to automatic review settings May 17, 2026 04:47
@psaab psaab review requested due to automatic review settings May 17, 2026 04:47
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude round-13 review on 8553583e

Verdict: MERGE-READY

Trivial dead-code cleanup commit. Author shipped exactly what all 3 r12 reviewers (Claude + Codex + Gemini) requested:

  • pkg/dataplane/policy_scheduler.go deleted (-123 lines, the obsolete BuildScheduledPolicyRuleSlots helper from r11)
  • pkg/dataplane/policy_scheduler_test.go deleted (-88 lines)

No production code touched. Runtime path (compiler records exact slots on CompileResult.PolicyScheduleRuleSlots, runtime reads source-of-truth) remains intact and verified by Codex+Gemini r12.

Carry-overs (separate follow-up issues recommended)

  • Test asserting recorded slots match actual map writes (Codex r12 MINOR)
  • Disarm-failure Prometheus counter
  • Pre-Commit Compile validation for protocol-mismatch commit-atomicity
  • Apply-flow integration test for non-protocol Compile no-abort
  • Hermetic test setup (real netlink cleanup in compile-error tests)

No full quad-review dispatched for this trivial 4-insert/211-delete cleanup since the substantive design (r12) was already verified by 3 reviewers and this commit makes ZERO production code changes.

Not merging — author's decision.

Copy link
Copy Markdown
Contributor

Copilot AI commented May 17, 2026

@copilot review

Round-12 commit d168648 takes new approach: compiler records exact (PolicySetID, RuleIndex) on CompileResult.PolicyScheduleRuleSlots during compilePolicies; runtime reads source-of-t...

Re-reviewed and applied the r12 MINOR cleanup in 8553583.

  • Removed obsolete recomputation path (pkg/dataplane/policy_scheduler.go) and its self-only test (pkg/dataplane/policy_scheduler_test.go).
  • Runtime remains sourced from compile-recorded CompileResult.PolicyScheduleRuleSlots (compilePolicies -> LastCompileResult() consumers in eBPF/DPDK).

Validation re-run on this commit:

  • go test ./pkg/dataplane ./pkg/config ./pkg/daemon ./pkg/dataplane/userspace ./pkg/scheduler
  • go test ./pkg/dataplane/dpdk
  • git diff --check

All passed. No UI changes in this backend-only update, so no screenshot applies.

Copilot finished work on behalf of psaab May 17, 2026 08:34
psaab and others added 20 commits May 17, 2026 07:23
Agent-Logs-Url: https://github.com/psaab/xpf/sessions/3a39f0a4-bc8e-440c-8652-dcf9a84b021a

Co-authored-by: psaab <196946+psaab@users.noreply.github.com>
Agent-Logs-Url: https://github.com/psaab/xpf/sessions/3a39f0a4-bc8e-440c-8652-dcf9a84b021a

Co-authored-by: psaab <196946+psaab@users.noreply.github.com>
…eduler protocol mismatch

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/1bb33a9b-c0bb-4973-8771-b143edac1b95

Co-authored-by: psaab <196946+psaab@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 17, 2026 14:24
@psaab psaab force-pushed the codex/1378-userspace-policy-scheduler branch from 8553583 to 3a79c99 Compare May 17, 2026 14:24
@psaab psaab merged commit e3f2176 into master May 17, 2026
2 checks passed
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 35 out of 35 changed files in this pull request and generated 5 comments.

Comment thread pkg/config/compiler.go
Comment on lines +275 to +304
func validatePolicySchedulerReferencesStrict(cfg *Config) error {
if cfg == nil {
return nil
}
check := func(pol *Policy) error {
if pol == nil || pol.SchedulerName == "" {
return nil
}
if _, ok := cfg.Schedulers[pol.SchedulerName]; ok {
return nil
}
return fmt.Errorf("policy %q references undefined scheduler %q", pol.Name, pol.SchedulerName)
}
for _, zpp := range cfg.Security.Policies {
if zpp == nil {
continue
}
for _, pol := range zpp.Policies {
if err := check(pol); err != nil {
return err
}
}
}
for _, pol := range cfg.Security.GlobalPolicies {
if err := check(pol); err != nil {
return err
}
}
return nil
}
}
ctx, cancel := context.WithCancel(d.daemonCtx)
d.schedulerCancel = cancel
go d.scheduler.Run(ctx)
Comment on lines +433 to +457
m.mu.Lock()
defer m.mu.Unlock()

m.policySchedulerActive = activeCopy
if cfg == nil {
if m.lastSnapshot == nil {
return
}
cfg = m.lastSnapshot.Config
}
if cfg == nil || m.lastSnapshot == nil {
return
}
if m.proc == nil || m.proc.Process == nil {
return
}

if err := m.ensurePolicySchedulerProtocolLocked(cfg); err != nil {
if disarmErr := m.disarmPolicySchedulerProtocolFailureLocked(err); disarmErr != nil {
slog.Warn("userspace: failed to disarm helper after refusing policy scheduler publish",
"protocol_err", err, "err", disarmErr)
}
slog.Warn("userspace: refusing policy scheduler publish to incompatible helper", "err", err)
return
}
Comment thread _Log.md
Comment on lines +14 to +25
- **Timestamp**: 2026-05-17T05:12:00Z
- **Action**: Round-5 follow-up fix — in userspace pending-XSK-startup compile path, defer `lastSnapshot` cache update until ingress/local/NAT map sync succeeds so sync failures cannot poison cached snapshot state with an unpublished generation.
- **File(s)**: `pkg/dataplane/userspace/manager.go`, `_Log.md`

- **Timestamp**: 2026-05-17T05:16:00Z
- **Action**: Restored `go.mod` after an unintended direct/indirect dependency classification flip introduced by an automation-only progress update.
- **File(s)**: `go.mod`, `_Log.md`

- **Timestamp**: 2026-05-17T04:48:51Z
- **Action**: Re-restored `go.mod` after a subsequent tooling pass reintroduced the same direct/indirect dependency classification flip.
- **File(s)**: `go.mod`, `_Log.md`

Comment on lines +1052 to +1057
if d.dp == nil || activeState == nil || compileResult == nil {
return
}
if _, isUserspace := d.dp.(*dpuserspace.Manager); isUserspace {
return
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants