Plumb userspace policy scheduler inactive state#1396
Pull request overview
Plumbs Junos scheduler/inactive policy state through the userspace dataplane snapshot DTO and Rust policy evaluator so that scheduled policy rules can be skipped during match evaluation. This is a slice of the broader #1378 work; the daemon still does not yet populate the active-state map on tick/restart/failover, so end-to-end scheduling remains gated.
Changes:
- Adds `RuleID`, `SchedulerName`, and `Inactive` to the `PolicyRuleSnapshot` DTO (Go + Rust) with serde/JSON backward-compat defaults.
- Adds a Go `buildPolicySnapshotsWithSchedulerState` builder that derives `Inactive` from an `activeState` map, plus a stable rule-ID helper.
- Rust `parse_policy_state` propagates the new fields and `try_match_rule` skips inactive rules so first-match fallthrough remains correct.
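A minimal sketch of the `Inactive` derivation described above, assuming the semantics of the `policyRuleInactive` fragments quoted further down this thread. The `"work-hours"` scheduler name is illustrative only.

```go
package main

import "fmt"

// policyRuleInactive reports whether a rule should be skipped during match
// evaluation. Unscheduled rules, and any rule evaluated without scheduler
// state (nil map), are treated as active. A scheduled rule is inactive
// unless its scheduler is present in the map and marked active.
func policyRuleInactive(schedulerName string, activeState map[string]bool) bool {
	if schedulerName == "" || activeState == nil {
		return false
	}
	active, ok := activeState[schedulerName]
	return !ok || !active
}

func main() {
	state := map[string]bool{"work-hours": false}
	fmt.Println(policyRuleInactive("work-hours", state)) // scheduler known, window closed
	fmt.Println(policyRuleInactive("work-hours", nil))   // no scheduler state supplied
}
```

Note that the nil-map branch returns false (rule stays active), so a publish path that never passes scheduler state cannot deactivate anything.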
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/dataplane/userspace/protocol.go | Adds RuleID/SchedulerName/Inactive JSON fields on PolicyRuleSnapshot. |
| pkg/dataplane/userspace/snapshot.go | Builds new fields; introduces policyRuleInactive and stablePolicyRuleID helpers. |
| pkg/dataplane/userspace/manager_test.go | Adds round-trip test for new fields with scheduler active-state map. |
| userspace-dp/src/protocol.rs | Adds matching fields on Rust DTO and serde round-trip / legacy-decode tests. |
| userspace-dp/src/policy.rs | Stores fields on PolicyRule; skips inactive rules in try_match_rule; adds stable_policy_rule_id fallback. |
| userspace-dp/src/policy_tests.rs | Tests inactive-skip and fall-through, updates existing fixtures with ..Default::default(). |
| userspace-dp/src/afxdp/tests.rs, test_fixtures.rs | Spreads ..Default::default() into PolicyRuleSnapshot fixtures. |
| docs/pr/1373-retire-ebpf-dataplane/plan-1378-policy-schedulers.md | Documents that daemon republish and missing-scheduler commit errors remain gates. |
| return false | ||
| } | ||
| active, ok := activeState[schedulerName] | ||
| return !ok || !active |
| } | ||
| schedulerName := pol.SchedulerName | ||
| out = append(out, PolicyRuleSnapshot{ | ||
| RuleID: stablePolicyRuleID("junos-global", "junos-global", pol.Name), |
Claude round-1 review on
|
Claude round-1 self-correction on
|
Round-1 review consolidated on
|
| Reviewer | Verdict |
|---|---|
| Claude | MERGE-READY → MERGE-NEEDS-MAJOR (self-corrected) |
| Codex | MERGE-NEEDS-MAJOR |
| Copilot | (no review yet on this HEAD) |
Codex MAJORs
- Production publish path doesn't use the new field. `buildSnapshot:53` calls `buildPolicySnapshots(cfg)` without scheduler state; `policyRuleInactive(..., nil)` returns false. New `buildPolicySnapshotsWithSchedulerState` helper exists but isn't called by the daemon's actual publish path.
- `UpdatePolicyScheduleState` callback not wired to userspace. `manager.go:57-59` embeds eBPF dataplane; scheduler tick at `daemon_run.go:585-589` dispatches to the eBPF map updater. Window changes never reach the Rust userspace policy snapshot.
- Wall-clock in scheduler eval violates monotonic-clock contract. `scheduler.go:31, 76` uses `time.Now()`. NTP rollback CAN flip active state. Per #1378 refined contract: "monotonic clock for window eval; NTP rollback must not flap policies."
- Wire format fail-open on mixed-version. No `ProtocolVersion` bump, no capability gate. New Go sending `inactive:true` to old Rust helper → silently ignored via serde unknown-field tolerance → rule becomes active. Security gap.
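The fail-open mechanism in the last MAJOR can be illustrated in Go rather than Rust: `encoding/json`, like serde's default behavior, ignores fields the receiving struct does not declare. This is a sketch only; `oldRule` and the payload are hypothetical and do not mirror the real DTO.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// oldRule models a pre-scheduler receiver's view of a rule: it has no
// field for "inactive". (Hypothetical type for illustration.)
type oldRule struct {
	Name string `json:"name"`
}

// decodeLenient tolerates unknown fields, so "inactive" is dropped silently.
func decodeLenient(data []byte) (oldRule, error) {
	var r oldRule
	err := json.Unmarshal(data, &r)
	return r, err
}

// decodeStrict rejects payloads carrying fields the receiver does not know.
func decodeStrict(data []byte) (oldRule, error) {
	var r oldRule
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	err := dec.Decode(&r)
	return r, err
}

func main() {
	payload := []byte(`{"name":"permit-office","inactive":true}`)

	r, err := decodeLenient(payload)
	fmt.Println("lenient:", r.Name, err) // rule accepted, inactive flag lost

	_, err = decodeStrict(payload)
	fmt.Println("strict rejected:", err != nil)
}
```

Strict decoding only helps on receivers that already ship it, which is why a version or capability gate on the sender side is still needed for helpers already deployed.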
Codex MINOR
- Tests are Go-self / Rust-self round-trip only. No cross-language Go→Rust round-trip. No `UpdatePolicyScheduleState` userspace republish test. No monotonic rollback test.
Self-correction
Same pattern as #1394/#1395: I labeled "foundation cross-language plumbing" without verifying (a) production path uses it, (b) the scheduler callback reaches userspace, (c) time source is monotonic, (d) wire format has a version gate. The PR adds fields but no code path exercises them on userspace dataplane.
Recommendation
Block on:
- Wire `buildSnapshot` → `buildPolicySnapshotsWithSchedulerState` with real scheduler state.
- Override `UpdatePolicyScheduleState` on userspace `Manager` to republish snapshot, not dispatch to embedded eBPF.
- Migrate `pkg/scheduler/scheduler.go` to monotonic clock OR document wall-clock dependency + pin failure mode with NTP-rollback test.
- Add wire-format version gate OR document fail-closed contract for mixed-version.
Strongly consider: cross-language round-trip test + monotonic rollback test + apply-during-window-transition test.
Codex task: task-mp955c3b-xto9kj. Not merging — author's decision.
Review follow-up on
|
Claude round-2 review on
|
Round-2 quad-review consolidated synthesis on
|
| Reviewer | Verdict |
|---|---|
| Claude | MERGE-NEEDS-MINOR → MERGE-NEEDS-MAJOR (self-corrected after Codex) |
| Codex | MERGE-NEEDS-MAJOR |
| Gemini Pro 3 | MERGE-READY (4 round-1 MAJORs verified, accepts README architectural note) |
| Copilot | (no fresh review on this HEAD) |
Reviewer disagreement
Gemini accepted the round-1 MAJORs as "all addressed" including the wall-clock issue via README architectural note. Codex found 4 NEW MAJORs not covered by Gemini's verification.
Codex new MAJOR 1 — Scheduler lifecycle not wired for config commits
"The daemon only creates
`d.scheduler` once in daemon_run.go:584-596; daemon_apply.go:423-428 recompiles dataplane config but never calls `Scheduler.Update` or starts a scheduler when a commit adds the first scheduler. A runtime commit adding scheduled policies can publish with nil/stale scheduler state and may never get corrected."
Worked failure: operator commits config without schedulers { } block → daemon boots, scheduler runs with empty state. Operator later commits adding scheduled policy → dataplane gets new policies but scheduler never learns about them → policies publish with inactive: false permanently.
This is the round-1 finding "publish path doesn't use scheduler state" surviving in a different form — the publish path NOW uses scheduler state, but the scheduler itself isn't kept in sync with config commits.
Codex new MAJOR 2 — Hybrid snapshot race
"
`UpdatePolicyScheduleState` copies `m.lastSnapshot`, then replaces only `Config` and `Policies` from the passed `cfg` at manager.go:423-429. The daemon callback uses `store.ActiveConfig()` outside the apply semaphore, so a scheduler tick during/after commit can combine new policies with old zones/routes/interfaces."
Race window: scheduler tick fires between commit's policy update and zone/route update. Publishes a hybrid snapshot with new policies + stale zones. First-match policy evaluation against wrong zone bindings → real packet misrouting.
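One way to avoid tearing is to never patch individual fields of the last snapshot: commits and scheduler ticks go through one serialized path that rebuilds the whole snapshot from a single config. The sketch below assumes simplified `Config`, `Snapshot`, and `Publisher` types that are not the PR's real structures.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Config and Snapshot are simplified, hypothetical stand-ins.
type Config struct {
	Policies []string
	Zones    []string
}

type Snapshot struct {
	Config *Config         // policies and zones always come from the same config
	Active map[string]bool // scheduler state at publish time
}

type Publisher struct {
	mu      sync.Mutex // serializes commits and scheduler ticks
	cfg     *Config
	active  map[string]bool
	current atomic.Pointer[Snapshot]
}

// Commit installs a new config and republishes a whole snapshot.
func (p *Publisher) Commit(cfg *Config) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.cfg = cfg
	p.publishLocked()
}

// Tick records new scheduler state and republishes a whole snapshot.
func (p *Publisher) Tick(active map[string]bool) {
	cp := make(map[string]bool, len(active))
	for k, v := range active {
		cp[k] = v
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	p.active = cp
	p.publishLocked()
}

// publishLocked swaps in a freshly built snapshot; readers never observe a
// mix of fields from two configs.
func (p *Publisher) publishLocked() {
	if p.cfg == nil {
		return
	}
	p.current.Store(&Snapshot{Config: p.cfg, Active: p.active})
}

// Current returns the last published snapshot, or nil if none exists yet.
func (p *Publisher) Current() *Snapshot { return p.current.Load() }

func main() {
	var p Publisher
	p.Commit(&Config{Policies: []string{"p1"}, Zones: []string{"trust"}})
	p.Tick(map[string]bool{"work-hours": true})
	fmt.Println(len(p.Current().Config.Zones), p.Current().Active["work-hours"])
}
```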
Codex MAJOR 3 — Wall-clock just reframed, not fixed
"scheduler.go:31, 76 still use
`time.Now()`. The new docs just reframe this as control-plane wall-clock behavior, not a monotonic-clock implementation or a signed-off deferral."
Gemini accepted the README architectural note ("wall-clock only in control plane, fast path consumes booleans"). Codex doesn't accept the rename as a fix. My read: the README framing IS legitimate engineering rationale — fast-path tearing is the real risk and is avoided. But the NTP-rollback control-plane behavior is undocumented in tests. Recommend either committing to monotonic in the control plane OR adding an explicit "wall-clock NTP rollback can cause brief active-state churn; operators should run sane NTP" note + test pinning the failure mode.
Codex MAJOR 4 — Mixed-version asymmetric
"New Rust rejects old Go version-1 snapshots, but new Go cannot force a truly old Rust helper to reject version 2. Older Rust without the version gate accepts snapshots; pre-scheduler Rust also ignores unknown
`inactive`, recreating the fail-open path."
Forward compat (old Go → new Rust): works, version mismatch rejected ✓
Backward compat (new Go → old Rust): fail open. Old Rust without version gate accepts the v2 snapshot, ignores inactive field, treats rule as active.
This is the same fail-open Codex flagged in round-1 — only HALF closed.
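A sender-side gate closes the remaining half: new Go probes the helper's protocol version and refuses to publish scheduled policies to a helper that would ignore `inactive`. A rough sketch, assuming a hypothetical `helperStatus` probe result and that version 2 is the first scheduler-aware protocol:

```go
package main

import "fmt"

// minSchedulerProtocol is assumed to be the first snapshot protocol version
// that understands the inactive field.
const minSchedulerProtocol = 2

// helperStatus is a hypothetical stand-in for the helper's status reply.
// A helper that predates the version field decodes as ProtocolVersion 0.
type helperStatus struct {
	ProtocolVersion int
}

// ensureSchedulerProtocol refuses scheduled-policy publishes to helpers
// that would silently treat inactive rules as active.
func ensureSchedulerProtocol(st helperStatus, hasScheduledPolicies bool) error {
	if !hasScheduledPolicies {
		return nil // nothing the old helper could misinterpret
	}
	if st.ProtocolVersion < minSchedulerProtocol {
		return fmt.Errorf("helper protocol %d < %d: refusing scheduled-policy publish",
			st.ProtocolVersion, minSchedulerProtocol)
	}
	return nil
}

func main() {
	fmt.Println(ensureSchedulerProtocol(helperStatus{ProtocolVersion: 1}, true))
	fmt.Println(ensureSchedulerProtocol(helperStatus{ProtocolVersion: 2}, true))
}
```

A refusal like this is only fail-closed if the caller also treats the returned error as a commit failure or disarms the helper.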
What's actually fixed
| Claim | Status | Reviewer |
|---|---|---|
| UpdatePolicyScheduleState override → Rust republish | ✓ | Both |
| Compile() passes scheduler state | ✓ | Both |
| Missing scheduler ref = commit error | ✓ | Both |
| Rust rejects unsupported version | ✓ | Both |
| Wall-clock monotonic | reframed only (Gemini accepts, Codex doesn't) | Disagreement |
| Mixed-version BOTH directions | half (new Rust closed; old Rust still open) | Codex caught |
Self-correction
My round-2 review caught 1 of these (wall-clock) and flagged as MINOR. I missed scheduler lifecycle, hybrid snapshot race, and asymmetric mixed-version. Same "didn't probe the production wiring" pattern as round-1.
Recommendation (revised)
Block on:
- Wire scheduler update on config commits: `daemon_apply.go` must call `Scheduler.Update` on commits that change scheduler config.
- Close hybrid snapshot race: `UpdatePolicyScheduleState` must take the apply semaphore OR use atomic snapshot publish that doesn't tear across fields.
- Mixed-version: new Go must NOT send v2 to old Rust. Either (a) capability negotiation handshake on connect, or (b) accept that v2 is only safe with v2-capable Rust and document the upgrade order requirement.
Strongly consider: wall-clock — either commit to monotonic in pkg/scheduler/scheduler.go OR add an explicit deferral note + NTP-rollback test pinning the known control-plane churn.
Codex task: task-mp96mn6y-6cq7ia. Gemini Pro 3 task: task-mp96nhjj-fqnbn1. Not merging — author's decision.
Round-3 fix update pushed. What changed:
Validation:
The cargo run still emits existing warning noise; the targeted protocol-version gate test passes.
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 10 comments.
Comments suppressed due to low confidence (2)
pkg/daemon/daemon_scheduler.go:60
`publishPolicyScheduleState` reads `d.policySchedulerEpoch` after acquiring `applySem`, which guards against stale callbacks from previous scheduler generations. However, between the moment a scheduler callback is queued (inside `Scheduler.evaluate` after `s.mu.Unlock()` but before `updateFn(cp)`) and the moment `publishPolicyScheduleState` runs, a `reconcilePolicySchedulerLocked` may have already cancelled the old scheduler (`d.schedulerCancel()`), but `Run`'s ticker goroutine fires `evaluate` directly, not through `ctx.Done()`. The cancelled scheduler may still execute the pending `updateFn` after `policySchedulerEpoch.Add(1)` ran. The epoch check correctly drops the stale update, but the call still acquires `applySem`, which can introduce latency on the apply path if many stale callbacks pile up. This is benign but worth a brief comment to document that the epoch guard is the only correctness barrier here.
func (d *Daemon) publishPolicyScheduleState(epoch uint64, activeState map[string]bool) {
if err := d.applySem.Acquire(context.Background(), 1); err != nil {
slog.Warn("scheduler: failed to acquire apply semaphore", "err", err)
return
}
defer d.applySem.Release(1)
if epoch != d.policySchedulerEpoch.Load() {
return
}
cfg := d.store.ActiveConfig()
if cfg == nil || d.dp == nil {
return
}
d.seedPolicySchedulerActiveStateLocked(activeState)
d.dp.UpdatePolicyScheduleState(cfg, activeState)
}
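The epoch guard in the function above can be reduced to the following sketch. The `daemon` type and `startScheduler` helper here are hypothetical, and the apply semaphore is deliberately left out so that only the generation check remains.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// daemon is a stripped-down, hypothetical stand-in for the real daemon.
type daemon struct {
	epoch     atomic.Uint64
	published []string // records which scheduler generation got to publish
}

// startScheduler bumps the epoch and returns a publish callback bound to
// that generation. Callbacks from earlier generations become no-ops.
func (d *daemon) startScheduler(name string) func() {
	epoch := d.epoch.Add(1)
	return func() {
		if epoch != d.epoch.Load() {
			return // stale callback from a replaced scheduler
		}
		d.published = append(d.published, name)
	}
}

func main() {
	d := &daemon{}
	old := d.startScheduler("old")
	cur := d.startScheduler("new") // reconcile replaces the scheduler

	old() // pending callback from the cancelled scheduler: dropped
	cur()
	fmt.Println(d.published)
}
```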
pkg/daemon/daemon_scheduler.go:59
`policySchedulerActive` is read/written under `m.mu` everywhere except inside `UpdatePolicyScheduleState`, where line 424 assigns `m.policySchedulerActive = activeCopy` under `m.mu.Lock`, then `seedPolicySchedulerActiveStateLocked` in daemon_scheduler.go (line 67) calls `SetPolicySchedulerActiveState` on the same manager via the interface dispatch. `SetPolicySchedulerActiveState` re-acquires `m.mu`. This is currently fine because `seedPolicySchedulerActiveStateLocked` is called before `UpdatePolicyScheduleState` (not from within), but the duplicate seeding (`publishPolicyScheduleState` calls both `seedPolicySchedulerActiveStateLocked` and `d.dp.UpdatePolicyScheduleState`, the latter of which also seeds) is redundant and easy to misread. Consider dropping the explicit `seedPolicySchedulerActiveStateLocked` call from `publishPolicyScheduleState` since `UpdatePolicyScheduleState` already overwrites `m.policySchedulerActive` from its argument.
d.seedPolicySchedulerActiveStateLocked(activeState)
d.dp.UpdatePolicyScheduleState(cfg, activeState)
| if snapshot.version != CONFIG_SNAPSHOT_PROTOCOL_VERSION { | ||
| response.ok = false; | ||
| response.error = format!( | ||
| "unsupported snapshot protocol version {} (want {})", |
| @@ -228,6 +231,37 @@ func compileExpanded(tree *ConfigTree) (*Config, error) { | |||
| return cfg, nil | |||
| } | |||
|
|
|||
| func validatePolicySchedulerReferencesStrict(cfg *Config) error { | |||
| if cfg == nil { | |||
| return nil | |||
| } | |||
| check := func(pol *Policy) error { | |||
| if pol == nil || pol.SchedulerName == "" { | |||
| return nil | |||
| } | |||
| if _, ok := cfg.Schedulers[pol.SchedulerName]; ok { | |||
| return nil | |||
| } | |||
| return fmt.Errorf("policy %q references undefined scheduler %q", pol.Name, pol.SchedulerName) | |||
| } | |||
| for _, zpp := range cfg.Security.Policies { | |||
| if zpp == nil { | |||
| continue | |||
| } | |||
| for _, pol := range zpp.Policies { | |||
| if err := check(pol); err != nil { | |||
| return err | |||
| } | |||
| } | |||
| } | |||
| for _, pol := range cfg.Security.GlobalPolicies { | |||
| if err := check(pol); err != nil { | |||
| return err | |||
| } | |||
| } | |||
| return nil | |||
| } | |||
| if d.dp != nil && policySchedulerActiveState != nil && compileResult != nil { | ||
| d.dp.UpdatePolicyScheduleState(cfg, policySchedulerActiveState) | ||
| } |
| if d.daemonCtx != nil { | ||
| ctx, cancel := context.WithCancel(d.daemonCtx) | ||
| d.schedulerCancel = cancel | ||
| go sched.Run(ctx) | ||
| } |
| if len(s.active) > 0 { | ||
| s.notifyActiveState() | ||
| } |
| fn stable_policy_rule_id(snap: &PolicyRuleSnapshot) -> String { | ||
| if !snap.rule_id.is_empty() { | ||
| return snap.rule_id.clone(); | ||
| } | ||
| format!("{}->{}/{}", snap.from_zone, snap.to_zone, snap.name) |
| if schedulerName == "" || activeState == nil { | ||
| return false | ||
| } |
| } | ||
| if err := m.ensurePolicySchedulerProtocolLocked(cfg); err != nil { | ||
| slog.Warn("userspace: refusing policy scheduler publish to incompatible helper", "err", err) | ||
| return | ||
| } |
| ucfg := deriveUserspaceConfig(cfg) | ||
| snap := buildSnapshot(cfg, ucfg, m.bumpGeneration(), m.readFIBGeneration()) | ||
| activeState := m.policySchedulerActiveStateSnapshot() | ||
| snap := buildSnapshotWithSchedulerState(cfg, ucfg, m.bumpGeneration(), m.readFIBGeneration(), activeState) |
| monitoredIfindexes map[int]struct{} | ||
| lastBindingIndices []uint32 | ||
| neighborsPrewarmed bool | ||
| ctrlEnableAt time.Time |
Claude round-3 review on
|
Round-3 quad-review consolidated synthesis on
|
| Reviewer | Verdict |
|---|---|
| Claude | MERGE-READY-pending → MERGE-NEEDS-MAJOR (self-corrected) |
| Codex | MERGE-NEEDS-MAJOR |
| Gemini Pro 3 | MERGE-NEEDS-MAJOR |
Round-2 MAJORs status — 2 of 4 closed, 2 still open
| Round-2 MAJOR | Status | Verified by |
|---|---|---|
| 1. Scheduler lifecycle not wired for commits | CLOSED ✓ | Codex+Gemini |
| 2. Hybrid snapshot race | CLOSED ✓ | Codex+Gemini (applySem + epoch-gated) |
| 3. Wall-clock | STILL OPEN | Codex+Gemini both verify wall-clock not fixed |
| 4. Mixed-version | PARTIAL | Codex flagged: refusal exists but doesn't fail dataplane closed |
Wall-clock — converged MAJOR (Codex+Gemini)
Codex: "Still uses wall time directly: NewPrimed(... time.Now()), ticker timestamps, and Update(... time.Now()) in scheduler.go:39,61,90. The docs explicitly reframe it instead of fixing it."
Gemini: "The author ignored the monotonic evaluation requirement and merely reframed the wall-clock usage in the documentation. No monotonic calculation or explicit deferral test exists."
Both reviewers reject the README reframing as a fix. This has been MAJOR for 3 rounds.
Codex new MAJOR — Old-helper "fail closed" is publish refusal only
"ensurePolicySchedulerProtocolLocked at manager.go:533-551 probes status and errors if helper protocol is
< 2. But `applyConfigLocked` at daemon_apply.go:430-433 only records the compile error and continues, and `commitAndApply` at daemon_apply.go:101-105 still returns the committed config. The userspace manager also sets `m.lastSnapshot = snap` BEFORE the protocol check at manager.go:327, 370-371."
Worked failure: operator commits config with scheduled policy on a cluster running old Rust helper. The commit SUCCEEDS at Go-side. Old helper keeps forwarding the previous (un-scheduled) snapshot. If the commit added a scheduler that should DEACTIVATE a permit rule outside work hours, the old helper keeps permitting outside that window. Not fail-closed.
What IS verified fixed
- ✓ Scheduler reconciles on apply via `reconcilePolicySchedulerLocked` (recreate, not Update)
- ✓ `commitAndApply` holds `applySem` across commit + apply
- ✓ Runtime scheduler publishes acquire `applySem` + check epoch (daemon_scheduler.go:44-59)
- ✓ Stale callbacks epoch-gated
- ✓ Userspace compile seeds scheduler state coherently
- ✓ New Rust rejects old Go version-1 snapshots
- ✓ New Go refuses to publish scheduled-policy snapshots to old helpers (just doesn't fail the dataplane on refusal)
Test gaps
- Missing: daemon-level scheduler reconciliation on config commit
- Missing: apply-vs-tick race/epoch behavior test
- Missing: NTP rollback / clock-jump failure mode
- Missing: full mixed-version commit/apply path proving old helper cannot keep forwarding stale permissive snapshot
Self-correction
I posted "MERGE-READY-pending" trusting the PR claim. Both Codex and Gemini found the wall-clock issue unaddressed for the 3rd round running. I also missed the "fail-closed-is-only-publish-refusal" gap that Codex traced through commitAndApply.
Recommendation
Block on:
- Wall-clock: this is the 3rd round flagging it. Options: (a) migrate `pkg/scheduler/scheduler.go` to monotonic clock, (b) add explicit `time.Time` injection so tests can verify NTP-rollback behavior + add a rollback test, OR (c) get explicit operator-level sign-off that wall-clock control-plane behavior is acceptable and bake that into the issue contract; don't keep reframing in the README.
- Fail-closed enforcement on old-helper refusal: `applyConfigLocked` must surface the error to fail `commitAndApply`, OR the snapshot must not be cached before the protocol check, OR the dataplane must enter a fail-safe state (deny all / fall back to known-safe last snapshot without scheduled policies).
Strongly consider: the 4 missing tests (scheduler-on-commit, apply-vs-tick race, NTP rollback, mixed-version full path).
Codex task: task-mp99dkbf-9xf7tg. Gemini Pro 3 task: task-mp99ed5e-y81abm. Not merging — author's decision.
Round-4 blocker fixes pushed:
|
Claude round-4 review on
|
Claude round-4 self-correction on
|
|
@copilot review Round-4 commit 6be6f57. Codex flagged 3 MAJORs: (1) wall-clock rollback recovers after 1 tick because lastEval/lastWallUnixNano are written unconditionally at scheduler.go:126-128, (2) commit transaction doesn't fail on old-helper compile error at daemon_apply.go:430-433, (3) fail-closed disarm is best-effort log-and-return at manager.go:567-575. Re-review against these. |
Round-4 quad-review consolidated synthesis on
|
| Reviewer | Verdict |
|---|---|
| Claude | MERGE-NEEDS-MAJOR (self-corrected from READY after Codex MAJOR) |
| Codex | MERGE-NEEDS-MAJOR |
| Gemini Pro 3 | MERGE-NEEDS-MAJOR |
| Copilot | Re-review requested; still on stale commit 50a94ed3 |
Codex + Gemini converge on 4 MAJORs (Gemini split Codex's MAJOR-2 into 2). Block.
MAJOR 1 — Wall-clock rollback recovers after 1 tick (Codex + Gemini)
pkg/scheduler/scheduler.go (end of evaluate loop):
s.active = newActive
s.lastEval = now
s.lastWallUnixNano = now.UnixNano()
Bookkeeping runs unconditionally, even when `wallClockUnsafeLocked()` already returned true and forced newActive=false.
Worked failure (both reviewers reproduced identically):
T=0 (wall 16:59, mono 16:59): isWithinWindow=true, active ✓
lastEval=16:59, lastWallUnixNano=16:59
T=1 (wall 12:00, mono 17:00): NTP rolls wall back 4h59m
wallElapsed=-4h59m, monoElapsed=1m, delta>5s
wallClockUnsafe=true → active=false ✓ (one tick)
BUT bookkeeping writes lastEval=17:00, lastWallUnixNano=12:00 ✗
T=2 (wall 12:01, mono 17:01): wallElapsed=1m, monoElapsed=1m, delta=0
wallClockUnsafe=FALSE → isWithinWindow(12:01 in 08-17) → active=true 💥
Rollback fails closed for exactly one tick, then 08:00-17:00 allow window reopens for hours under rolled-back time.
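The trace above suggests the fix direction: once drift is detected, leave the baseline untouched so later ticks keep comparing against the pre-rollback reading. The following is a sketch under assumptions, not the PR's scheduler: `windowScheduler`, `evaluate`, and `tick` are hypothetical names, time is injected rather than read from `time.Now()`, and the multi-tick re-arm hysteresis is omitted.

```go
package main

import (
	"fmt"
	"time"
)

// driftTolerance mirrors the 5s threshold used in the trace above.
const driftTolerance = 5 * time.Second

// windowScheduler evaluates a daily [startHour, endHour) allow window.
type windowScheduler struct {
	startHour, endHour int
	primed             bool
	baseWall           time.Time     // last trusted wall-clock reading
	baseMono           time.Duration // monotonic reading taken with baseWall
}

// evaluate returns the active state for one tick. If wall-clock elapsed
// time disagrees with monotonic elapsed time, it fails closed and keeps
// the old baseline, so the rollback stays visible on subsequent ticks.
func (s *windowScheduler) evaluate(wall time.Time, mono time.Duration) bool {
	if !s.primed {
		s.primed = true
		s.baseWall, s.baseMono = wall, mono
		return s.inWindow(wall)
	}
	drift := wall.Sub(s.baseWall) - (mono - s.baseMono)
	if drift < 0 {
		drift = -drift
	}
	if drift > driftTolerance {
		return false // unsafe: do NOT update the baseline
	}
	s.baseWall, s.baseMono = wall, mono
	return s.inWindow(wall)
}

func (s *windowScheduler) inWindow(t time.Time) bool {
	h := t.Hour()
	return h >= s.startHour && h < s.endHour
}

// tick feeds evaluate a wall time on a fixed day plus a monotonic reading
// expressed in minutes since start.
func tick(s *windowScheduler, hour, min, monoMinutes int) bool {
	wall := time.Date(2026, time.January, 5, hour, min, 0, 0, time.UTC)
	return s.evaluate(wall, time.Duration(monoMinutes)*time.Minute)
}

func main() {
	s := &windowScheduler{startHour: 8, endHour: 17}
	fmt.Println("T=0:", tick(s, 16, 59, 0))
	fmt.Println("T=1:", tick(s, 12, 0, 1)) // wall rolled back
	fmt.Println("T=2:", tick(s, 12, 1, 2))
	fmt.Println("T=3:", tick(s, 12, 2, 3))
}
```

Because the baseline is only advanced on trusted ticks, T=2 and T=3 still see roughly five hours of disagreement and stay inactive.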
MAJOR 2 — Commit transaction ignores compile failure (Codex + Gemini)
pkg/daemon/daemon_apply.go:430-433:
if compileResult, err = d.dp.Compile(cfg); err != nil {
d.recordCompileFailure(err)
} else {
d.recordCompileSuccess()
}
Compile error is recorded but not returned. `commitAndApply()` at daemon_apply.go:101-105 completes successfully. The daemon persists config to DB/cluster while dataplane rejected it.
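A sketch of the transactional shape being asked for, where a dataplane compile failure refuses the commit instead of being recorded and ignored. The `store` type, the string configs, and the compile-before-persist ordering are all assumptions here; the real daemon may need an explicit rollback instead.

```go
package main

import (
	"errors"
	"fmt"
)

// errCompile is a hypothetical compile failure, e.g. an old-helper
// protocol mismatch.
var errCompile = errors.New("helper protocol too old for scheduled policies")

// store is a hypothetical stand-in for the config store.
type store struct {
	active string
}

// commitAndApply only persists the candidate once the dataplane accepted
// it; a compile failure is returned to the caller and nothing is committed.
func commitAndApply(s *store, candidate string, compile func(string) error) error {
	if err := compile(candidate); err != nil {
		return fmt.Errorf("commit refused: dataplane compile failed: %w", err)
	}
	s.active = candidate
	return nil
}

func main() {
	s := &store{active: "v1"}
	err := commitAndApply(s, "v2", func(string) error { return errCompile })
	fmt.Println(s.active, err)
}
```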
MAJOR 3 — Stale snapshot cached on protocol rejection (Codex + Gemini)
pkg/dataplane/userspace/manager.go:327:
m.lastSnapshot = snap // <-- written BEFORE protocol gate
// ...
if err := m.ensurePolicySchedulerProtocolLocked(cfg); err != nil {
m.disarmPolicySchedulerProtocolFailureLocked(err)
return result, err
}
When protocol gate rejects, `m.lastSnapshot` already holds the unpublished config. Next `BumpFIBGeneration` / `RegenerateNeighborSnapshot` / `UpdatePolicyScheduleState` reads stale state and may force an `apply_snapshot` of the rejected config.
MAJOR 4 — Fail-closed disarm is best-effort (Codex + Gemini)
pkg/dataplane/userspace/manager.go:567-575:
if err := m.requestLocked(req, &status); err != nil {
slog.Warn("userspace: failed to disarm incompatible helper after policy scheduler protocol error",
"protocol_err", protocolErr, "err", err)
return
}
If the disarm RPC fails (socket full, helper hung, transport error), function logs and returns. Caller sees no error. Helper continues forwarding under old rules. Fail-closed never achieved.
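Propagating the disarm failure could look like the sketch below, which returns both the original protocol error and the disarm error so the caller can escalate. The function name, the sentinel errors, and the `disarm` callback are hypothetical; `errors.Join` requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors for illustration.
var (
	errOldHelper = errors.New("helper protocol < 2")
	errSocket    = errors.New("control socket write failed")
)

// disarmOnProtocolFailure always returns the protocol error. If the disarm
// request itself fails, that failure is joined in rather than only logged,
// so the caller knows the helper may still be forwarding.
func disarmOnProtocolFailure(protocolErr error, disarm func() error) error {
	if err := disarm(); err != nil {
		return errors.Join(protocolErr, fmt.Errorf("disarm helper: %w", err))
	}
	return protocolErr
}

func main() {
	err := disarmOnProtocolFailure(errOldHelper, func() error { return errSocket })
	fmt.Println(errors.Is(err, errOldHelper), errors.Is(err, errSocket))
}
```

With both errors visible, the caller can distinguish "refused and disarmed" from "refused but helper state unknown" and pick a stronger fallback for the second case.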
Test coverage gap (both reviewers)
- Wall-clock rollback: `TestScheduler_WallClockBackwardStepFailsClosed` at `scheduler_test.go:202-228` only asserts T=1 inactive. Stops before T=2. Blinds CI to the one-tick recovery bug.
- Mixed-version commit failure: `manager_test.go:2704-2788` only exercises `UpdatePolicyScheduleState()` directly. Does NOT exercise `Compile()`, `applyConfigLocked()`, or `commitAndApply()` with old-helper protocol.
Block on
- Wall-clock: After `wallClockUnsafeLocked()` returns true, preserve pre-rollback `lastEval`/`lastWallUnixNano` baseline. Stay disarmed until a consistent monotonic-aligned wall reading is observed across multiple ticks.
- Multi-tick rollback test: T=0 active → T=1 rollback inactive → T=2 still inactive → T=3 still inactive. Pin against current code regressing.
- Commit transaction: `Compile()` failure on mixed-version protocol → `commitAndApply()` returns error → commit refused. Not best-effort disarm.
- Stale snapshot: Move `m.lastSnapshot = snap` AFTER protocol gate succeeds.
- Fail-closed escalation: `disarmPolicySchedulerProtocolFailureLocked` must propagate disarm-RPC failures to caller, not log-and-continue.
Round-4 claim trace
- Codex `task-mp9a4x1r-fcex88`: 3 MAJORs (Codex bundled wall-clock + commit + disarm into 3)
- Gemini `task-mp9adv2c-4jm4p3`: split into 4 MAJORs (A/B/C/D); same quote-grounded evidence
- Copilot `50a94ed3`: stale; re-review @copilot requested
Not merging — author's decision.
Claude round-13 review on
|
Re-reviewed and applied the r12 MINOR cleanup in
Validation re-run on this commit:
All passed. No UI changes in this backend-only update, so no screenshot applies. |
…rrors Agent-Logs-Url: https://github.com/psaab/xpf/sessions/3a39f0a4-bc8e-440c-8652-dcf9a84b021a Co-authored-by: psaab <196946+psaab@users.noreply.github.com>
…eduler protocol mismatch Agent-Logs-Url: https://github.com/psaab/xpf/sessions/1bb33a9b-c0bb-4973-8771-b143edac1b95 Co-authored-by: psaab <196946+psaab@users.noreply.github.com>
8553583 to
3a79c99
Compare
| func validatePolicySchedulerReferencesStrict(cfg *Config) error { | ||
| if cfg == nil { | ||
| return nil | ||
| } | ||
| check := func(pol *Policy) error { | ||
| if pol == nil || pol.SchedulerName == "" { | ||
| return nil | ||
| } | ||
| if _, ok := cfg.Schedulers[pol.SchedulerName]; ok { | ||
| return nil | ||
| } | ||
| return fmt.Errorf("policy %q references undefined scheduler %q", pol.Name, pol.SchedulerName) | ||
| } | ||
| for _, zpp := range cfg.Security.Policies { | ||
| if zpp == nil { | ||
| continue | ||
| } | ||
| for _, pol := range zpp.Policies { | ||
| if err := check(pol); err != nil { | ||
| return err | ||
| } | ||
| } | ||
| } | ||
| for _, pol := range cfg.Security.GlobalPolicies { | ||
| if err := check(pol); err != nil { | ||
| return err | ||
| } | ||
| } | ||
| return nil | ||
| } |
| } | ||
| ctx, cancel := context.WithCancel(d.daemonCtx) | ||
| d.schedulerCancel = cancel | ||
| go d.scheduler.Run(ctx) |
| m.mu.Lock() | ||
| defer m.mu.Unlock() | ||
|
|
||
| m.policySchedulerActive = activeCopy | ||
| if cfg == nil { | ||
| if m.lastSnapshot == nil { | ||
| return | ||
| } | ||
| cfg = m.lastSnapshot.Config | ||
| } | ||
| if cfg == nil || m.lastSnapshot == nil { | ||
| return | ||
| } | ||
| if m.proc == nil || m.proc.Process == nil { | ||
| return | ||
| } | ||
|
|
||
| if err := m.ensurePolicySchedulerProtocolLocked(cfg); err != nil { | ||
| if disarmErr := m.disarmPolicySchedulerProtocolFailureLocked(err); disarmErr != nil { | ||
| slog.Warn("userspace: failed to disarm helper after refusing policy scheduler publish", | ||
| "protocol_err", err, "err", disarmErr) | ||
| } | ||
| slog.Warn("userspace: refusing policy scheduler publish to incompatible helper", "err", err) | ||
| return | ||
| } |
| - **Timestamp**: 2026-05-17T05:12:00Z | ||
| - **Action**: Round-5 follow-up fix — in userspace pending-XSK-startup compile path, defer `lastSnapshot` cache update until ingress/local/NAT map sync succeeds so sync failures cannot poison cached snapshot state with an unpublished generation. | ||
| - **File(s)**: `pkg/dataplane/userspace/manager.go`, `_Log.md` | ||
|
|
||
| - **Timestamp**: 2026-05-17T05:16:00Z | ||
| - **Action**: Restored `go.mod` after an unintended direct/indirect dependency classification flip introduced by an automation-only progress update. | ||
| - **File(s)**: `go.mod`, `_Log.md` | ||
|
|
||
| - **Timestamp**: 2026-05-17T04:48:51Z | ||
| - **Action**: Re-restored `go.mod` after a subsequent tooling pass reintroduced the same direct/indirect dependency classification flip. | ||
| - **File(s)**: `go.mod`, `_Log.md` | ||
|
|
| if d.dp == nil || activeState == nil || compileResult == nil { | ||
| return | ||
| } | ||
| if _, isUserspace := d.dp.(*dpuserspace.Manager); isUserspace { | ||
| return | ||
| } |
References #1378.
Summary
Tests