events: close filter-log enforcement gaps#1430
Conversation
There was a problem hiding this comment.
Pull request overview
This PR closes several remaining gaps in userspace dataplane event emission and enforcement, ensuring FILTER_LOG actions (including reject) are preserved end-to-end and that lo0/local-delivery filter terminal actions are enforced before slow-path reinjection or pass-to-kernel session installation.
Changes:
- Enforce lo0/local-delivery filter discard/reject actions as terminal during packet processing (preventing slow-path reinjection/session installation) while still emitting FILTER_LOG when configured.
- Pin/validate FILTER_LOG
rejectaction handling across the Rust wire codec + encoder and the Go event-stream → syslog fanout path (with added tests). - Update #1379 event plan documentation to reflect the enforced lo0 terminal behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| userspace-dp/src/event_stream/codec.rs | Adds RT_FLOW reject action constant to the codec contract. |
| userspace-dp/src/event_stream/codec_tests.rs | Adds round-trip test ensuring FILTER_LOG can encode/decode reject. |
| userspace-dp/src/afxdp/tests.rs | Adds regression test asserting lo0 discard drops without slow-path reinjection/session install. |
| userspace-dp/src/afxdp/poll_descriptor.rs | Applies lo0 filter action before LocalDelivery slow-path reinjection/session caching; emits lo0 filter-log when matched. |
| userspace-dp/src/afxdp/event_emit.rs | Adds test ensuring emit path preserves reject action in emitted FILTER_LOG events. |
| pkg/logging/ringbuf.go | Corrects binary record documentation for action encoding (0=deny, 1=permit, 2=reject). |
| pkg/dataplane/userspace/eventstream_test.go | Extends syslog fanout test coverage to include FILTER_LOG reject action. |
| docs/pr/1373-retire-ebpf-dataplane/plan-1379-dataplane-events.md | Updates plan text to match implemented lo0 terminal behavior and wording. |
| ) | ||
| { | ||
| telemetry.dbg.local += 1; | ||
| telemetry.dbg.policy_deny += 1; |
Claude round-1 review on
|
Round-1 synthesis on
|
| Reviewer | Verdict |
|---|---|
| Claude r1 (self-corrected) | MERGE-NEEDS-MAJOR (was MERGE-READY) |
| Codex r1 | MERGE-NEEDS-MAJOR (2 MAJORs) |
| Gemini Pro 3 | pending (still in tool_call at 5m+; task-mpct5w6y-8a5q18) |
Self-correction (Claude r1 → MAJOR)
My round-1 Claude review marked this MERGE-READY with H1 (Reject contract divergence) and H4 (cached bypass) as non-blocking observations. Codex r1 elevated both to MAJOR, and reframed H4 from "userspace flow cache bypass" (which Codex correctly notes is a non-issue — LocalDelivery isn't cacheable in the flow cache) to "BPF session-map PASS_TO_KERNEL bypass" — a deeper issue I should have probed.
Verified Codex r1 MAJORs
M1 — BPF session-map PASS_TO_KERNEL bypass for existing LocalDelivery sessions.
The new apply_lo0_filter_action gate runs BEFORE session install — correct for NEW sessions. But the install path explicitly publishes a BPF session-map entry so subsequent packets bypass userspace entirely:
poll_descriptor.rs:1271-1274 (verified)
// Keep firewall-local sessions in the helper only for HA
// state. Publish only the exact observed key back into the
// BPF session map so subsequent established packets bypass
// userspace and return directly to the kernel.
And forwarding/mod.rs:1106-1107 confirms:
let _ = publish_session_map_entry_for_session(session_map_fd, key, decision, &local_entry.metadata);No mechanism flushes the BPF session map on lo0 filter config change. I grepped userspace-dp/src/afxdp/ for session_map.*delete, clear_session_map, flush_session_map, invalidate.*session and found nothing. coordinator/worker_manager.rs:55-75 (worker stop) clears only XSK and heartbeat slots, not session-map entries.
Real consequence: if an operator establishes a flow when lo0 was Accept, then changes config to add a lo0 Discard filter, existing long-lived sessions (BGP, SSH, established TCP) continue to bypass userspace via the BPF session map. The new gate is never reached. The PR's plan-1379 update claims "discard/reject actions are terminal and prevent slow-path reinjection" — true only for NEW sessions, not existing ones.
M2 — Reject is still a contract lie.
filter/mod.rs:40 enum docstring /// Drop with ICMP unreachable. is unchanged. The new lo0 path drops Reject identically to Discard (scratch_recycle + continue, no ICMP-unreachable generated). The new tests pin RT_FLOW_ACTION_REJECT in the wire codec, so the LOG correctly says reject, but the wire behavior is silent drop. Either implement ICMP-unreachable OR update enum doc + plan-1379 to say "currently silent drop, ICMP unreachable not yet implemented".
Codex r1 confirmed
- M3 lo0 placement on miss path: fixed. Old removed call at
:2368, new placement at:1240-1253BEFORE LocalDelivery cache install. - No double-eval of lo0 filter; single call at
poll_descriptor.rs:275. - Scope OK: 8 files, 321/53, all filter/log/codec/test related.
Codex r1 found additional MINORs (matching my Claude r1 H2/H3)
- Go syslog test isolation gap: the PBR FILTER_LOG frame doesn't assert ABSENCE of
action=reject, so a regression that mistakenly emits reject in the PBR path wouldn't fail this test. - No lo0 reject path test: only
discardis tested (poll_descriptor_lo0_filter_discard_drops_without_reinject). - Counter taxonomy:
telemetry.dbg.policy_deny += 1for lo0 filter drops conflates filter rejections with policy denies in debug telemetry. Should be a separatefilter_dropcounter.
Recommendation
Block on M1 and M2:
- M1 (BPF session-map bypass): critical. Either (a) flush PASS_TO_KERNEL session-map entries when the lo0 filter config changes, or (b) add an explicit "filter generation" check at session-map publish time so stale entries auto-invalidate, or (c) document the gap explicitly in plan-1379 (operators with long-lived local sessions need to know that filter config changes don't take effect on established flows). Option (c) is the minimum acceptable; (a)/(b) are the proper fix.
- M2 (Reject contract): pick one — implement ICMP-unreachable OR update enum docstring + plan-1379 to say "silent drop, ICMP unreachable not yet implemented." Don't ship the log/wire divergence.
Strongly consider in this PR (MINORs):
- Add a lo0
rejectaction test (symmetric coverage). - Tighten the Go syslog test isolation: PBR frame should assert ABSENCE of
action=reject. - Rename or split
telemetry.dbg.policy_denyto distinguish filter rejections from policy denies.
Codex task: task-mpct579q-l3ges5. Gemini Pro 3 still in tool_call at 5m+ (task-mpct5w6y-8a5q18); will update synthesis when it lands.
Not merging — author's decision.
Round-1 Gemini Pro 3 verification — all 2 Codex MAJORs INDEPENDENTLY VERIFIEDGemini Pro 3 ( Triple-review fully converged on MERGE-NEEDS-MAJOR
Gemini's quoted verificationM2 — Reject contract divergence (Gemini's "DIVERGENCE FOUND").
Identical to Codex M2. M1 — Cached LocalDelivery bypass (Gemini's "BYPASS CONFIRMED"). Gemini reframes the same gap from a complementary angle:
Codex framed the bypass via BPF session-map PASS_TO_KERNEL (kernel direct path). Gemini framed it via userspace SessionTable hit (when BPF map fails / is bypassed). Both describe the same root cause from different angles: the new gate runs only on session-miss; established sessions in EITHER bypass mechanism (BPF map OR userspace SessionTable hit) never re-evaluate the lo0 filter. Gemini also confirmed
Gemini also confirmed coverage gap
Recommendation (unchanged from earlier synthesis)Block on M1 and M2. Both reviewers agree on root cause + severity; the only divergence is which bypass mechanism each emphasizes (and both apply). For M1, the fix needs to either:
For M2, pick: (a) implement ICMP-unreachable for Reject, OR (b) update enum docstring + plan-1379 to say "currently silent drop, ICMP unreachable not yet implemented". Strongly consider in this PR (Codex+Gemini both flagged):
Strongly consider (Codex only):
Codex task: Not merging — author's decision. |
a911d14 to
e36ba81
Compare
|
Round-2 fix pushed on Fixes addressed:
Validation:
|
Claude round-2 review on
|
Round-2 synthesis on
|
| Reviewer | Verdict |
|---|---|
| Claude r2 (self-corrected) | MERGE-NEEDS-MAJOR (was MERGE-READY) |
| Codex r2 | MERGE-NEEDS-MAJOR (3 MAJORs + 1 MEDIUM) |
| Gemini Pro 3 | pending (still in tool_call at 6m+; task-mpcwcn3d-7wkz7c) |
Self-correction (Claude r2 → MAJOR)
My round-2 Claude review marked this MERGE-READY with the retroactive PASS_TO_KERNEL flush gap (H1) as non-blocking. Codex r2 elevates it to MAJOR and found two additional MAJORs I missed (shared_sessions cleanup, PolicyAction::Reject parallel divergence).
Verified Codex r2 MAJORs
M1 — Pre-existing PASS_TO_KERNEL not retroactively flushed (elevation of my H1).
refresh_runtime_snapshot() rotates the forwarding Arc but the worker update path only refreshes screen profiles and session timeouts. No code calls publish_worker_session_map_entry on filter config change, no session-map scan, no delete of old kernel-local entries. A LocalDelivery session installed BEFORE lo0 filter was configured continues bypassing userspace indefinitely until normal expiry.
The session-hit re-eval doesn't help because BPF PASS_TO_KERNEL packets never reach userspace. The publish-time gate only catches new sessions. Pre-existing sessions remain bypassed.
I called this "operationally rare" in my r2; Codex correctly treats it as a real security gap. Long-lived local sessions (BGP, SSH, idle TCP) are exactly the scenario where filter config changes happen during established sessions.
M2 — Session-hit lo0 drop doesn't clean shared_sessions (NEW).
At poll_descriptor.rs:843-852 the cleanup is:
delete_session_map_entry_for_removed_session_with_origin(...);
sessions.delete(&resolved.key);SessionTable::delete() in session/mod.rs is literally pub fn delete(&mut self, key: &SessionKey) { self.remove_entry(key); } — just removes the local entry, no close delta emission, no shared_sessions removal. Shared cleanup is only in flush_session_deltas() or explicit remove_shared_session(...); the new path calls neither.
Consequence: lookup_session_across_scopes() can re-find the entry in shared_sessions and rematerialize it. A lo0 filter drop is not actually durable — the next packet can re-resolve the same session via the shared map.
M3 — PolicyAction::Reject has the same contract divergence as FilterAction::Reject had (NEW).
The M2 fix from r1 made FilterAction::Reject fail closed and log as deny. But PolicyAction::Reject is a separate type with the same issue:
policy.rs:38definesPolicyAction::Rejectpolicy.rs:515parses"reject" → PolicyAction::Rejectevent_emit.rs:191-196mapsemit_policy_deny_eventpolicy reject →RT_FLOW_ACTION_REJECTin syslogpoll_descriptor.rs:1944-1955policy deny path just emits the event and setsdisposition = PolicyDenied. No ICMP unreachable or TCP RST generated anywhere in the userspace policy path.
So a then reject policy rule emits RT_FLOW ... action=reject in syslog while the dataplane silently drops the packet. Same contract divergence as the filter path had pre-r1. Either:
- (a) implement ICMP/RST synthesis on the policy reject branch, OR
- (b) update PolicyAction enum doc +
emit_policy_deny_eventto map policy reject → DENY (parallel to the filter fix).
Verified Codex MEDIUM
Output filter terminal action: TxSelectionFilterResult drops the terminal action; output filters can log reject while forwarding unless a policer drops. The r1 fix closed input/lo0 filter terminals but didn't audit the output-filter terminal action.
Confirmed correct
- Session-hit lo0 re-eval placement: correctly placed before TTL check on session-hit branch; cleans BPF map + local SessionTable.
- Publish-time gate coverage: all 3 sites in
apply_worker_commands(session_glue/mod.rs:422, 507, 586) threadforwardingand usesession_key_has_lo0_filter. - FilterAction::Reject fix: filter path correctly maps Reject → DENY in syslog; tests updated.
- Codec wire:
RT_FLOW_ACTION_REJECTreserved for policy events; filter events forced to DENY when action == Reject.
Codex coverage gaps (matching my r2 observations)
- No integration test for publish-time gate (only the unit test for
session_key_has_lo0_filterhelper). - The new
poll_descriptor_lo0_filter_drops_cached_local_delivery_session_hitpreinstalls only aSessionOrigin::LocalMisssession, not a BPFPASS_TO_KERNELpeer-synced local session. - No lo0 reject path test.
- No Go-side end-to-end test for policy reject.
Codex scope correction
Codex reports actual diff size is 12 files, +645/-55, not the +335/-13 I quoted in the prompt. Files are still all filter/event/session-glue/docs/tests/log-formatting related.
Recommendation
Block on M1, M2, M3.
For M1 (retroactive flush): add a snapshot-apply handler that re-publishes existing kernel-local sessions when lo0 filter config changes (or any filter config change). OR document the gap as a known operator limitation in plan-1379 with explicit operator guidance (e.g., "filter config changes may not take effect on existing long-lived local sessions until session expiry; flush sessions manually via clear flow session").
For M2 (shared_sessions cleanup): the lo0 drop path needs remove_shared_session(...) in addition to sessions.delete(...), and should emit a close delta so HA peers see the removal. Otherwise the drop is not durable.
For M3 (policy reject parity): pick (a) or (b) per the filter-path precedent. Simplest is (b) — map policy reject → DENY in emit_policy_deny_event until ICMP/RST synthesis is wired.
Strongly consider (MEDIUM): audit output-filter terminal action handling.
Codex task: task-mpcwbybx-2hrpuy. Gemini Pro 3 still in tool_call at 6m+; will update synthesis when it lands.
Not merging — author's decision.
Round-2 Gemini Pro 3 verification — 2 Codex MAJORs INDEPENDENTLY VERIFIED, 1 not coveredGemini Pro 3 (
Gemini quoted verificationM1 — Retroactive PASS_TO_KERNEL flush gap (Gemini "FAIL (Blocker)").
Gemini directly refutes the commit message claim: "'BPF cached state cannot bypass userspace lo0 enforcement after a filter change.' This logic is flawed." Matches Codex's MAJOR #1 framing. M2 — shared_sessions cleanup missing (Gemini "FAIL (Minor)").
Codex labeled this MAJOR (rematerialization via Gemini didn't probe (Codex unique)M3 — PolicyAction::Reject contract divergence. Codex independently found that policy reject has the same log-vs-wire divergence the original M2 fix closed for the filter path. Gemini's hostile checks focused on the filter side and confirmed This finding stands on Codex's single-reviewer verification only, but it's structurally verifiable: emit_policy_deny_event maps policy reject to RT_FLOW_ACTION_REJECT, and grep finds no ICMP/RST synthesis on the PolicyDenied branch. Same class of bug as the original M2; same fix pattern applies. MEDIUM — output filter terminal action. Codex flagged that Gemini also confirmed
Both reviewers flag test gaps
Note on Codex/Gemini scope differenceCodex measured the PR vs parent commit ( Recommendation (unchanged)Block on M1, M2, M3:
Strongly consider:
Codex task: Not merging — author's decision. |
e36ba81 to
053537c
Compare
Round-3 fix-forward on
|
| forwarding = new_forwarding; | ||
| let republished = republish_local_delivery_sessions_for_lo0_filter( | ||
| &sessions, | ||
| session_map_fd, | ||
| &forwarding, | ||
| ); |
| resolved.origin, | ||
| ); | ||
| telemetry.dbg.local += 1; | ||
| telemetry.dbg.policy_deny += 1; |
| ) | ||
| { | ||
| telemetry.dbg.local += 1; | ||
| telemetry.dbg.policy_deny += 1; |
Claude round-3 review on
|
Round-3 partial synthesis on
|
| Reviewer | Verdict |
|---|---|
| Claude r3 (self-corrected) | MERGE-NEEDS-MAJOR (was MERGE-READY) |
| Codex r3 | running 6m+ (task-mpcxo2ir-cm9ttp) |
| Gemini Pro 3 r3 | MERGE-NEEDS-MAJOR (3 FAILs/MAJORs) |
Self-correction (Claude r3 → MAJOR)
My round-3 review marked this MERGE-READY with H1 (republish triggers on every forwarding change), H2 (reverse direction not handled), and H4 (overwrite semantics) as non-blocking observations. Gemini r3 elevated H1 and H2 to MAJOR, confirmed H4 (verified via libbpf BPF_ANY), and found NAT64 cross-family edge case I didn't probe.
Verified Gemini r3 findings
Verified PASS (5/10):
- publish_live_session_entry overwrite: cascades to
libbpf_sys::bpf_map_update_elemwithBPF_ANY— reliably overwrites PASS_TO_KERNEL. - M2 cleanup: all 5 components present and called in the right order at
session_glue/mod.rs:303-345. - M3 PolicyAction::Reject: consistent everywhere;
policy.rsparses"reject" → PolicyAction::Reject;poll_descriptor.rstreats Deny+Reject identically viaif let PolicyAction::Permit .... - Test cleanup assertions: pre-installs to both local + shared, asserts all 4 maps cleared and close delta emitted.
- emit_close_delta_with_origin skips is_reverse: guard at line 956.
- HA peer race: mitigated by
apply_lo0_filter_actionre-evaluating on session-hit per worker — even if peer B doesn't see delete command yet, it independently drops + cleans up.
Verified MAJOR (3):
Gemini M1 (republish O(N) on every forwarding-Arc rotation). Quote at worker/mod.rs:1278-1290 confirms placement inside load_arc_if_changed branch. Forwarding rotates on snapshot install, FIB updates, HA config — not just lo0 filter changes. With many LocalDelivery sessions, every unrelated rotation incurs an O(N) iteration. Gemini calls this a "jitter bomb under dynamic conditions."
My read: forwarding rotation is typically rare (config events, batched netlink), but the over-broad trigger is real. Worth either:
- (a) Tracking lo0_filter_v4_fast / lo0_filter_v6_fast separately and only iterating when those change.
- (b) Documenting the operational expectation that forwarding rotations are rare.
Reasonable to flag as MAJOR if operator-environment forwarding rotates frequently; downgradeable to MINOR if it doesn't.
Gemini M3 (reverse direction not handled — lo0 filter removal). Quote at session_glue/mod.rs:242 shows the early return when !session_key_has_lo0_filter. If filter is removed after sessions were demoted to userspace-live, those sessions stay userspace-live forever. Permanent performance regression for long-lived local sessions during filter-add-then-remove cycles.
My read: real but bounded. Operators rarely add-then-remove lo0 filters; sessions eventually age out. Author can either implement the reverse (publish_kernel_local_session_key on filter removal) or document as accepted permanent demotion.
Gemini M10 (NAT64 cross-family bypass) — NEW MAJOR I didn't probe. session_key_has_lo0_filter checks addr_family of the pre-translation session key. A NAT64 session arrives v6 (key=AF_INET6), gets translated to v4 for local delivery. If operator only configured v4 lo0 filter (typical since host lo0 is v4), lo0_filter_v6_fast.is_some() returns false → no lo0 detected → PASS_TO_KERNEL bypass for the v6 session despite the v4 lo0 filter being configured for the post-translation destination.
My read: this is operator-surprising but arguably expected Junos semantics (filters are per-family, NAT64 changes family). However, if the threat model is "block all local-delivery via lo0 filter", a single-family filter doesn't catch NAT64. Either:
- (a) Apply lo0 filter check to the POST-translation family in the NAT64 path.
- (b) Document explicitly that NAT64 + single-family lo0 filter has a bypass — operator must configure both families.
If choice is (b), I'd recommend a commit warning or status message when the operator configures a v4-only lo0 filter while NAT64 is active.
Recommendation
Block on Gemini's 3 MAJORs:
- Gemini M1 (republish trigger scope): narrow trigger to lo0-filter-state-transition only, OR document operational cost expectation.
- Gemini M3 (reverse demotion): implement reverse republish on filter removal, OR document permanent demotion.
- Gemini M10 (NAT64 cross-family): either fix the filter check to consider post-translation family, OR add operator-facing warning + plan-1379 doc.
Strongly consider (Gemini caught): add integration test exercising the real BPF map update (currently unit test uses fd=-1).
Codex r3 still running; will update synthesis when it lands. Self-correction documented above.
Not merging — author's decision.
Round-3 final synthesis on
|
| Reviewer | Verdict |
|---|---|
| Claude r3 (self-corrected) | MERGE-NEEDS-MAJOR |
| Codex r3 | MERGE-NEEDS-MINOR |
| Gemini Pro 3 r3 | MERGE-NEEDS-MAJOR |
All reviewers converge: 3 r2 MAJORs materially fixed
Codex and Gemini both verify with quoted-line evidence:
- M2 cleanup (full terminal cleanup across local + 3 shared maps + owner_rg + peer + close delta): complete at
session_glue/mod.rs:303-345. - M3 PolicyAction::Reject fail-closed:
event_emit.rs:191mapsReject → RT_FLOW_ACTION_DENY.RT_FLOW_ACTION_REJECTremains only as a producer-side const + codec compat test. - M1 retroactive republish overwrites PASS_TO_KERNEL:
publish_live_session_entry → bpf_map_update_elem(BPF_ANY)reliably overwrites the BPF map entry. emit_close_delta_with_originskips reverse: verified.- HA peer race: mitigated by per-worker
apply_lo0_filter_actionon session-hit.
Codex vs Gemini severity split on residuals
| Residual gap | Codex r3 | Gemini r3 | Claude read |
|---|---|---|---|
| M1 republish scope (every forwarding rotation) | MINOR | MAJOR | MINOR-leaning |
| Reverse demotion (lo0 filter removal) | MINOR/MEDIUM | MAJOR | MINOR |
| Output filter terminal action (r2 MEDIUM open) | MEDIUM | not flagged | real bug class |
| NAT64 cross-family bypass | not flagged | MAJOR | doc-required |
M1 republish scope. Codex: "fires on every forwarding snapshot change observed by the worker, not every loop tick." Gemini: "jitter bomb under dynamic conditions." Both correct framings. Forwarding rotations are typically rare (config events, batched netlink), but environments with frequent FIB updates or HA transitions would see O(N) scans more often. I lean MINOR for the default case with a guidance note for pathological scenarios.
Reverse demotion. Codex: "no bulk demotion for existing live entries after filter removal." Gemini: "permanent userspace helper traversal until session ages out naturally." Both correct. Performance regression, not correctness. MINOR with optional follow-up.
Output filter terminal action (Codex's MEDIUM). This is a REAL bug class still open from r2:
Output tx evaluation records log_match and policer_drop, but does not carry
FilterAction::Discard/Rejectinto drop;build_live_forward_request_from_frameemits the output filter log, then only returns None whencos.dropis true. That means a logged outputdiscard/rejectcan produce a deny-looking event while the packet still forwards.
Same class of bug as the original M2/M3 was for filter input + lo0. This means there's still ONE filter path where syslog says "deny" while wire behavior is "forward". Cited files: filter/engine.rs:379, tx/cos_classify.rs:321, forward_request.rs:152.
NAT64 cross-family bypass (Gemini-only). session_key_has_lo0_filter checks pre-translation addr_family. A v6 NAT64 session destined for v4 lo0 with v4-only filter configured → no lo0 detected → PASS_TO_KERNEL bypass. Operator-surprising. Either:
- (a) Apply filter check to post-translation family in NAT64 path.
- (b) Document as known limitation in plan-1379 + commit warning when v4-only lo0 filter is configured while NAT64 rules exist.
Codex didn't probe this; Gemini found it on hostile check #10. Doc-required at minimum.
Test gaps flagged by both reviewers
- No integration test exercising real BPF map fd for the republish helper (unit test uses fd=-1).
- Enhanced cached-hit test doesn't verify peer command replication, owner-RG removal with owner > 0, or BPF map deletion (peer commands empty, owner_rg = 0, fds = -1 in the test).
Recommendation
I'm splitting Codex/Gemini's verdicts — neither extreme is right.
Block on the output filter terminal action gap (Codex's MEDIUM): this is a real same-class bug as the original M3 was. Should be fixed in this PR for completeness, OR explicitly tracked as a follow-up with a plan-1379 doc note that says output-filter logged-deny does NOT imply drop until that path is fixed. Without the doc note, operators reading the new RT_FLOW_FILTER_LOG events will misinterpret output-filter deny logs.
Strongly consider (Gemini-flagged):
- NAT64 cross-family: at minimum, plan-1379 doc note explaining single-family lo0 filters + NAT64 produces bypass; operator must configure both families.
- Reverse demotion: either implement publish_kernel_local_session_key on filter removal OR doc as accepted permanent demotion (low operational impact).
Strongly consider (both reviewers flagged):
- Real-BPF-map integration test for republish.
- Owner_rg > 0 / peer command / BPF map deletion coverage in the cached-hit test.
Defer-acceptable (Gemini-elevated to MAJOR but Codex-minor):
- Republish scope (forwarding-rotation trigger). Optimize only if operational evidence shows it.
Codex task: task-mpcxo2ir-cm9ttp. Gemini Pro 3 task: task-mpcxot2g-rsz0l3.
Not merging — author's decision.
053537c to
d0f863b
Compare
Round-4 fix-forward on
|
Claude round-4 review on
|
Round-4 partial synthesis on
|
| Reviewer | Verdict |
|---|---|
| Claude r4 (self-corrected) | MERGE-NEEDS-MAJOR (was MERGE-READY) |
| Codex r4 | MERGE-NEEDS-MAJOR (1 MAJOR — cached DSCP bypass) |
| Gemini Pro 3.1 r4 | empty output (only "Task completed."; re-dispatching) |
Self-correction (Claude r4 → MAJOR)
My round-4 Claude review marked this MERGE-READY focused on whether the output filter terminal action enforcement was plumbed through TX selection. I verified the plumbing but did NOT probe the cached-flow invariants. Codex r4 found a sharp cached-flow bypass specific to DSCP-match terminal filters.
Verified Codex r4 MAJOR
Cached output-filter terminal decisions are stale for DSCP-changing flows.
Verified against the code at d0f863b3:
- Filter engine matches on DSCP:
filter/engine.rs:17,32,45,58,...— eval functions takedscp: u8and use it in term_matches. - SessionKey / flow cache key does NOT include DSCP:
session/key.rs:10-16shows fields are onlyaddr_family, protocol, src_ip, dst_ip, src_port, dst_port. - Cache install at
tx/cos_classify.rs:158:drop: output_result.action != crate::filter::FilterAction::Accept,
output_resultwas evaluated using the FIRST packet'smeta.dscp. The cacheddropvalue is frozen against that DSCP. - Cached fast path at
poll_descriptor.rs:496:No DSCP re-evaluation — just trusts the storedif cached_descriptor.tx_selection.drop || policer_action.drop { ... }
dropbit.
Concrete bypass scenario:
- Operator config:
output filter wan-drop term drop-ef from { dscp ef; destination-port 443; } then discard log; - First packet of flow has DSCP=0 → does NOT match (dscp != ef=46) → cache stamped
drop=false. - Later packet on same 5-tuple has DSCP=46 → SHOULD hit terminal discard → but cached path reads
drop=false→ packet forwards.
This is the SAME class of bug the r4 fix closed for non-DSCP-conditional output discard, except for DSCP-conditional terms. Not a CoS misclassification anymore — security terminal-action bypass on cached flows.
Same root cause applies to ANY per-packet-variable filter match input not in the cache key: TOS, IP options, fragment flag, possibly others. Need an audit of term_matches_v4/v6 for all match fields and which ones aren't in the cache key.
Codex recommended fixes
- Make filters with DSCP-conditional terminal terms non-cacheable.
- Include DSCP in cache validity key.
- Re-evaluate terminal/log-relevant output filter terms on cached hits when ANY packet-varying match input is present.
Option (3) is the most general — restores the input filter's eval-every-packet semantic for cached output filters that have variable-input terms.
Codex confirmed correct
- CoSTxSelection drop propagation: all 6 return sites at
cos_classify.rs:362-415carry thedropfield correctly. - Cached install + replay: works for the non-DSCP-conditional case (the obvious-bug case the r4 fix closed).
has_terminal_action_termsgate: updated at all 4 sites (compiler.rs:108-110, 150-154, 195-199;cos_classify.rs:78-82, 291-295;forwarding_build.rs:421-436).- Input vs output race: input terminal action wins (runs at
poll_descriptor.rs:1071-1088before route/policy/output TX). - Policer OR: correct (
policer_drop || action != Accept). - Filter struct compat:
Filteris internal compiled Rust, not the serialized snapshot.FirewallTermSnapshot.actiondefaults via serde + parse maps unknown/empty to Accept.
Codex confirmed gaps from my r4
- Output Reject test missing: only Discard is tested. (My Claude r4 H1.)
- Reverse demotion doc: acceptable as documented perf caveat; manual
clear flow sessionCLI would be operationally useful but not blocking.
Codex scope correction
Diff is 23 files, +1094/-75. The extra files are still in the event/filter/lo0/session cleanup area; I did not see unrelated scope creep.
I quoted 14 files +245/-15 (r3→r4 incremental). Codex measured PR vs parent commit. Both correct, different bases.
NAT64 doc wording
Codex notes the plan-1379 phrase "NAT64 local-delivery traffic is evaluated using the observed packet family" — actual code at session_glue/mod.rs:223-230 keys off SessionKey.addr_family (pre-translation). Codex: "The doc is acceptable only if 'observed' means pre-NAT64; I would clarify that wording."
Recommend tightening to: "evaluated using the pre-translation (wire-observed) packet family."
Gemini Pro 3.1 status
task-mpd0pd2g-0nt6m9 ran for 2m 10s with gemini-3.1-pro-preview model but produced only "Task completed." with no substantive review content. The 5 tool calls in the log suggest the model did work but the final output was minimal. Re-dispatching with a tighter prompt asking for explicit verdict + per-check quotes.
Recommendation
Block on the cached DSCP bypass. Pick one of Codex's three fixes:
- Recommended: re-evaluate terminal-action output filter terms on cached hits when DSCP (or other per-packet-variable inputs) is part of any term's match criteria. Lowest invasiveness.
- Alternative: make filters with DSCP-conditional terminal terms non-cacheable (mark at compile time, skip cache install).
- Alternative: extend SessionKey/flow cache key with DSCP.
Strongly consider (Codex + Claude flagged):
- Audit
term_matches_v4/v6for all match fields that vary per-packet but aren't in cache key (DSCP, possibly TOS, fragment, IP options). - Add output Reject path test.
- Tighten NAT64 doc wording to clarify "observed" = "pre-translation".
Codex task: task-mpd0onlk-s1dqk0. Gemini re-dispatch pending.
Not merging — author's decision.
d0f863b to
4567c9f
Compare
Round-5 update on
|
Claude round-5 review on
|
Round-5 triple-review consolidated synthesis on
|
| Reviewer | Verdict |
|---|---|
| Claude r5 | MERGE-NEEDS-MAJOR (input-DSCP cache bypass) |
| Codex r5 | MERGE-NEEDS-MAJOR (input parity + session-hit gap) |
| Gemini Pro 3.1 r5 | MERGE-NEEDS-MAJOR (input parity confirmed as CRITICAL) |
Triple-converge: r4 output-DSCP fix is correct; r5 has input-side gap
All three reviewers verify the r5 output fix at flow_cache.rs:281-291 is structurally correct and that lo0 enforcement is correctly per-packet via apply_lo0_filter_action (passes meta.dscp directly into evaluate_lo0_filter_counted).
All three reviewers also verify the same bypass class is open for INPUT filters.
Codex sharpens my finding — session-hit path also needs input re-eval
My Claude r5 said "add symmetric cache-decline for input filters" (mirroring the output fix). Codex r5 says that's necessary BUT INCOMPLETE:
Worse, a cache-decline-only input fix would still be incomplete unless the session-hit path also re-runs DSCP-sensitive input filters, because established session hits at
poll_descriptor.rs:769-895do not callevaluate_non_pbr_input_filter.
So the full input-side fix needs THREE pieces, mirroring how M1 (PASS_TO_KERNEL) was eventually fixed in r2/r3:
- Cache-install gate: decline
from_forward_decisionwhen input filter has DSCP-match terms. Addinterface_input_filter_has_dscp_matchhelper. - Session-hit re-eval: at
poll_descriptor.rs:829+(the session resolution branch), re-runevaluate_non_pbr_input_filterwhen input has DSCP-match terms — analogous to howapply_lo0_filter_actionis called for LocalDelivery on session hit. - Retroactive flush: when input-DSCP filter is added at runtime (filter config rotation), iterate existing sessions to either drop them or re-evaluate (analogous to
republish_local_delivery_sessions_for_lo0_filter).
Without piece (2), even if (1) is added, EXISTING cached/installed sessions (installed when filter wasn't configured) keep bypassing on session hits. Without piece (3), filter config changes don't take effect on pre-existing sessions until expiry.
Gemini r3.1 — back to substantive output with the new format demand
Gemini r5 returned full quoted-line evidence per check on gemini-3.1-pro-preview:
Check 4 — FAIL (CRITICAL)
INPUT filter parity: The OUTPUT fix is incomplete as the same bypass exists for INPUT filters. Input filters are not re-evaluated on cache hits, and
FlowCacheEntry::from_forward_decisionhas no check to decline cache insertion for DSCP-dependent input filters. A flow with a first-packet DSCP of 0 bypassing an input filter's discard term will be cached, and subsequent packets with a matched DSCP of 46 will blindly follow the fast path.
Independently confirms Codex's reading.
Codex additional findings
Codex confirms (matching my Claude r5):
- Output fix at
flow_cache.rs:281-292is correct. has_dscp_match_termsis wired fromdscp_valuesonly (Codex notes: current Go snapshot generation normalizes named/numeric DSCP intoDSCPValuesonly; nodscp_ranges/dscp_exceptfields exist).- lo0 path correct (apply_lo0_filter_action evaluates per-packet).
- Conservatism (decline-on-any-DSCP-term) is correct — DSCP-dependent log/count/policer/forwarding-class/dscp-rewrite decisions are also stale if cached by 5-tuple only. Don't narrow to terminal+DSCP.
- No other cache-install paths beyond
from_forward_decision. - Output Reject regression test STILL missing.
- NAT64 doc wording adequate.
Recommendation
Block on input filter DSCP cache bypass. Three-piece fix required (parallel to how r2/r3 fixed the lo0/PASS_TO_KERNEL story):
- Cache-install gate: add
interface_input_filter_has_dscp_matchhelper + cache-decline infrom_forward_decision. (Symmetric to r5's output fix.) - Session-hit re-eval: invoke
evaluate_non_pbr_input_filterin the session-resolution branch when input has DSCP-match terms. (Symmetric to r3'sapply_lo0_filter_actionfor lo0.) - Retroactive flush: when input filter rotation adds DSCP terms, iterate existing sessions and drop or re-evaluate. (Symmetric to r3's
republish_local_delivery_sessions_for_lo0_filter.)
Strongly consider:
- Add output Reject regression test (carrying from r3/r4/r5 — still open).
- Audit
term_matches_v4/v6for OTHER per-packet-variable match fields not in cache key (TOS, fragment, IHL, IP options).
Codex task: task-mpd5b9r5-mo1dff. Gemini Pro 3.1 task: task-mpd5buaq-j9e1gp.
Not merging — author's decision.
4567c9f to
77d93ce
Compare
Round-6 update on
|
Claude round-6 review on
|
Round-6 triple-review synthesis on
|
| Reviewer | Verdict |
|---|---|
| Claude r6 (self-corrected) | MERGE-NEEDS-MAJOR (was MERGE-READY) |
| Codex r6 | MERGE-NEEDS-MAJOR |
| Gemini Pro 3.1 r6 | MERGE-READY (missed Codex's finding) |
Self-correction (Claude r6 → MAJOR)
My r6 review verified the three-piece input-DSCP fix landed and explicitly stated input_dscp_filter_families_added correctly handles "existing ifindex GAINED DSCP terms." Codex r6 sharpened the analysis: the helper detects when an ifindex transitions from NOT-DSCP-sensitive → DSCP-sensitive (set membership change), but does NOT detect when a DSCP-sensitive ifindex stays DSCP-sensitive while its DSCP TERMS change.
I conflated "ifindex transitioning into the has_dscp_match set" with "DSCP-sensitive filter content changing on an ifindex already in the set." They are different cases. Codex's finding is real.
Verified Codex MAJOR
Purge trigger misses DSCP filter content changes on already-DSCP-sensitive ifindexes.
filter/engine.rs:1058-1071 (verified at the commit):
let v4 = new
.iface_filter_v4_has_dscp_match
.iter()
.any(|ifindex| !old.iface_filter_v4_has_dscp_match.contains(ifindex));This checks "is there an ifindex in NEW's DSCP set that's not in OLD's DSCP set?" — set membership transition only.
Concrete bypass scenario:
- OLD config:
reth1.0has input filter A withfrom dscp 0 destination-port 443 then accept→ ifindex 10 ∈ has_dscp_match set. - NEW config:
reth1.0has input filter A modified tofrom dscp 46 destination-port 443 then discard log→ ifindex 10 still ∈ has_dscp_match set. input_dscp_filter_families_added(old, new)returns(false, false). No purge runs.- Existing sessions installed under the OLD config (when DSCP=0 was the match criterion) keep their cached state.
- The session-hit re-eval (
evaluate_dscp_sensitive_input_filter_on_session_hit) WILL eventually re-evaluate using the NEW filter — but only for packets that ARRIVE at the worker. Already-published PASS_TO_KERNEL session-map entries bypass userspace entirely and don't trigger re-eval.
Codex's framing:
The session-hit re-eval only protects packets that reach userspace; it is not a substitute for deleting previously published session state on rotation.
This is the same class of gap r3 closed for lo0 (M1 retroactive PASS_TO_KERNEL flush). The r6 implementation handles ifindex-set transitions but not within-set content changes.
Codex recommended fixes
- Conservative purge (recommended): purge family when ANY DSCP-sensitive input filter on that family changes (not just ifindex set membership). Compare filter identity / content on each ifindex.
- Filter-content compare: compute a per-ifindex hash of DSCP-sensitive term content and compare old vs new.
Option (1) is simpler — could iterate old.iface_filter_v4_has_dscp_match ∪ new.iface_filter_v4_has_dscp_match and check if the filter Arc identity OR a content fingerprint changed.
Required regression test (Codex)
Add a regression for "same ifindex, old DSCP-sensitive filter, new DSCP-sensitive filter with stricter/new DSCP term"
Gemini r6 missed this finding
Gemini r6 returned MERGE-READY. Gemini's Check 4 stated:
Retroactive purge helper cleanly detects interfaces that gained a DSCP filter over the rotation.
True for the ifindex-set-transition case. Gemini didn't probe the orthogonal content-change case. This is a category of bug Codex tends to catch better — mechanical edge case in a comparison helper.
Both Codex and Gemini confirmed
- Cache-decline order correct (input before output, ingress logical ifindex via
resolve_ingress_logical_ifindex). interface_input_filter_has_dscp_matchreads from new FilterState sets directly.- Session-hit re-eval placement correct: after session_ingress_zone set, before lo0 check; falls through on Accept; emits log + recycles on non-Accept.
- Helper short-circuits with
Nonewhen interface has no DSCP-match terms — zero per-packet cost when filters aren't DSCP-sensitive. purge_sessions_for_input_dscp_filter_revalidationis comprehensive for SELECTED sessions: snapshots matching family, callsrelease_source_nat_allocation+delete_terminal_filtered_session(5-step cleanup).- DSCP filter REMOVAL doesn't need symmetric purge (existing cache entries are correctly cached now that DSCP no longer matters).
- NAT64 not reachable (rejected by
should_cache). - Output reject regression test exercises
FilterAction::Rejectdirectly (no log/count). - Diff scope correct (14 files, +580/-15).
Codex additional minor note
The code is still fragile: output filter family selection elsewhere is based on pre-translation
meta.addr_family, so NAT64 output-filter semantics need a dedicated test if this PR is claiming that surface.
Defer-acceptable since NAT64 isn't currently cacheable; flag for follow-up if cacheability is extended.
Recommendation
Block on the DSCP content-change purge gap. Either:
- (a) Conservative purge — trigger purge when any DSCP-sensitive input filter on a family CHANGES (compare filter Arc identity OR content fingerprint across rotation), OR
- (b) Identity-based compare — track per-ifindex DSCP-sensitive filter content hash in FilterState and compare in
input_dscp_filter_families_added.
Required test: regression for "same ifindex, old DSCP filter, new DSCP filter with different DSCP terms" → purge fires.
Strongly consider in follow-up:
- Codex r5 carry: audit
term_matches_v4/v6for OTHER per-packet-variable match fields not in cache key (TOS, fragment, IHL, IP options). - NAT64 output-filter-family test if NAT64 cacheability is ever extended.
Codex task: task-mpd63nsc-100xjn. Gemini Pro 3.1 task: task-mpd64cnv-hf59kw.
Not merging — author's decision.
77d93ce to
e9aa5bc
Compare
Round-7 fix pushed on
|
Claude round-7 review on
|
| Case | OLD | NEW | Caught by |
|---|---|---|---|
| Filter added | absent | DSCP-sensitive | Second arm |
| Filter removed | DSCP-sensitive | absent | First arm |
| Same ifindex, content changed (r6 case) | DSCP-sensitive A | DSCP-sensitive B | Both arms |
| Same ifindex, DSCP terms removed | DSCP-sensitive | non-DSCP | First arm |
| Same ifindex, DSCP terms gained | non-DSCP | DSCP-sensitive | Second arm |
Deep semantic compare:
filter_term_semantics_matchatengine.rs:1058-1080compares 20 FilterTerm config fields (id, name, source_v4/v6, dest_v4/v6, protocol_bitmap, protocol_match_enabled, source_ports, dest_ports, dscp_bitmap, dscp_match_enabled, action, count, has_count, log, policer_name, routing_instance, forwarding_class, dscp_rewrite).dscp_sensitive_filter_semantics_matchatengine.rs:1082-1100compares Filter struct fields (id, name, family, affects_, has__terms) + term-by-term via the above.
Worker rename: callsite at worker/mod.rs:1273 updated input_dscp_filter_families_added → input_dscp_filter_families_changed. No other callers.
New tests at filter/tests.rs:
input_dscp_filter_families_changed_detects_same_ifindex_content_change — exactly the r6 bypass case. ifindex 10, OLD has filter with dscp 0 accept, NEW has filter (same name) with dscp 46 discard log. Asserts (true, false) — v4 family detected as changed.
input_dscp_filter_families_changed_ignores_unchanged_filter — same filter content reused across rotations. Asserts (false, false) — no spurious purges.
Doc update at plan-1379-dataplane-events.md:58-61: "add, remove, or change DSCP-sensitive input filters" (was just "add").
Hostile observations (non-blocking)
H1 — FilterTerm fields NOT compared. filter_term_semantics_match skips two FilterTerm fields:
three_color_policer: Option<Arc<ThreeColorPolicerRuntime>>— runtime policer state. Comparison would be Arc-pointer-only (no semantic deep compare available). Filter changes that swap the policer runtime but keep the term'spolicer_namewould not trigger DSCP purge — acceptable since policer changes don't affect DSCP-conditional cache decisions.counter: Arc<FilterTermCounter>— atomic counter values (change per packet). Correctly excluded.
Both exclusions are intentional and correct for the DSCP-cache-bypass concern. Worth a code comment explaining the exclusions.
H2 — Filter.id stability question. filter_term_semantics_match and dscp_sensitive_filter_semantics_match both compare old.id == new.id. If Filter.id is auto-incremented per recompile, identical filters across rotations would have DIFFERENT IDs → comparison always returns false → unnecessary purges on every rotation. If Filter.id is content-derived (hash) or name-derived (stable), comparison works correctly. Need to verify how Filter.id is assigned at filter/compiler.rs.
H3 — Performance. Helper now iterates BOTH old.iface_filter_v4_fast AND new.iface_filter_v4_fast (full per-interface maps), filtering by has_dscp_match_terms. Previous helper iterated only the smaller has_dscp_match set. For configs with many interfaces but few DSCP-sensitive filters, the new helper still scans all interfaces. Forwarding rotations are rare so cost is acceptable; would benefit from O(|set|) iteration if the iface_filter_v4_has_dscp_match index were preserved alongside the deep compare.
H4 — r5 carry still open. The Codex r5 finding about auditing term_matches_v4/v6 for other per-packet-variable match fields (TOS, fragment, IHL, IP options) is NOT addressed in r7. Defer-acceptable but worth tracking. Need to grep term_matches_v4 for any non-cache-key match input.
Recommendation
MERGE-READY for the r6 closure. The fix is structurally correct, comprehensive (5/5 transition cases), and has a directly-targeted regression test.
Strongly consider:
- H2: verify Filter.id assignment is stable across recompiles (hash-derived or name-derived, not auto-increment). If not stable, the helper falsely reports "changed" on every rotation, triggering unnecessary purges.
- H1: add a comment explaining why
three_color_policerandcounterare excluded from the term-semantic compare. - H4: file follow-up issue to audit
term_matches_v4/v6for other per-packet-variable match fields.
Awaiting Codex (task-mpd6m8cx-tf97ba) and Gemini Pro 3.1 (task-mpd6mw6a-l8o3k8).
Not merging — author's decision.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
userspace-dp/src/afxdp/poll_descriptor.rs:1339
telemetry.dbg.policy_denyis incremented for lo0 filter drops on the session-miss local-delivery path as well, which makes thepolicy_denydebug counter cover more than policy denies. Consider tracking lo0/filter terminal drops separately so debug output remains interpretable.
)
{
telemetry.dbg.local += 1;
telemetry.dbg.policy_deny += 1;
binding.scratch.scratch_recycle.push(desc.addr);
continue;
| sessions.iter_with_origin(|key, decision, metadata, origin| { | ||
| let family_matches = | ||
| (purge_v4 && key.addr_family == libc::AF_INET as u8) | ||
| || (purge_v6 && key.addr_family == libc::AF_INET6 as u8); | ||
| if family_matches { | ||
| stale.push((key.clone(), decision, metadata.clone(), origin)); | ||
| } | ||
| }); | ||
| let purged = stale.len(); | ||
| for (key, decision, metadata, origin) in stale { | ||
| release_source_nat_allocation( | ||
| &forwarding.source_nat_rules, | ||
| &key, | ||
| decision.nat, | ||
| metadata.is_reverse, | ||
| now_ns, | ||
| ); | ||
| delete_terminal_filtered_session( | ||
| sessions, | ||
| session_map_fd, | ||
| conntrack_v4_fd, | ||
| conntrack_v6_fd, | ||
| shared_sessions, | ||
| shared_nat_sessions, | ||
| shared_forward_wire_sessions, | ||
| shared_owner_rg_indexes, | ||
| peer_worker_commands, | ||
| &key, | ||
| decision, | ||
| &metadata, | ||
| origin, | ||
| ); |
| resolved.origin, | ||
| ); | ||
| telemetry.dbg.local += 1; | ||
| telemetry.dbg.policy_deny += 1; |
Round-7 triple-review synthesis on
|
| Reviewer | Verdict |
|---|---|
| Claude r7 (self-corrected) | MERGE-NEEDS-MINOR (was MERGE-READY) |
| Codex r7 | MERGE-NEEDS-MINOR |
| Gemini Pro 3.1 r7 | MERGE-NEEDS-MAJOR (Critical) |
Self-correction (Claude r7 → MINOR)
My r7 review flagged Filter.id stability as H2 ("verify, may be auto-incremented") but kept verdict at MERGE-READY pending verification. Both Codex and Gemini independently verified IDs ARE positional (assigned by filter_idx as u32 at filter/compiler.rs:101), confirming the concern. I should have escalated to MINOR after raising the question.
Triple-converge: r6 bypass is closed
Both Codex and Gemini verify the r6 same-ifindex DSCP filter content-change bypass is structurally fixed:
Codex: "Confirmed exact case in
e9aa5bc6:userspace-dp/src/filter/tests.rs:1733... Bidirectional Checkengine.rs:1100-1119catches all requested transitions"Gemini: "The new deep content comparison and regression tests correctly identify when an unchanged interface modifies its DSCP-sensitive terms... Bidirectional check catches ADDED, REMOVED, GAINED-DSCP, LOST-DSCP, and CONTENT-CHANGED scenarios"
5/5 transition cases covered.
Codex/Gemini disagree on severity of two residuals
1. Filter.id is positional, not content-stable.
Both verify at filter/compiler.rs:90-99:
for (filter_idx, snap) in filters.iter().enumerate() {
...
id: filter_idx as u32,The deep comparison starts with old.id == new.id (engine.rs:1080). If operator prepends an unrelated filter to the snapshot, all subsequent filter IDs shift → comparison fails → purge triggers.
Codex framing (MINOR): "Reused IDs cannot hide content changes because the deep field compare still runs. Same content with reordered filter snapshot gets a different Filter.id and causes a conservative purge."
Gemini framing (CRITICAL): "An unrelated index shift will fail the match and trick the engine into thinking the DSCP filter changed, spuriously purging all sessions."
My calibration: MINOR. Codex is right on the substance. The behavior is conservative over-purge, not a security bypass or correctness regression. Sessions are still functionally correct after purge — they just hit the slow path one more time. Performance/operator-experience concern, not a correctness blocker. Gemini's "Critical" is overstatement.
Recommended fix (non-blocking): either remove id from the comparison (rely on name + term content), OR derive Filter.id from content-hash for stability across rotations.
2. three_color_policer field NOT in filter_term_semantics_match.
Both reviewers flag this; same severity split.
Codex framing (MINOR): "Three_color_policer can affect packet treatment through policer_drop / DSCP rewrite in TX selection, so the helper is not exhaustive if 'semantics' means full packet behavior. Either compare the resolved runtime identity/shape or document that this is 'filter snapshot content' equality."
Gemini framing (Critical): "Cached sessions hold an Arc to the policer runtime; if the runtime is replaced (e.g. changing bandwidth limits), the cached session will hold a stale pointer and enforce the old bandwidth limits."
My calibration: MINOR. Two reasons:
- The helper is specifically about INPUT DSCP filter changes that drive cache-decline + purge for DSCP-cache-bypass correctness. Three-color policers are TX-selection (egress), not DSCP-match-conditional caching.
- Three-color policer stale-state is a SEPARATE bug class — rate limits applying stale bandwidth limits. Not a bypass, not a DSCP-conditional correctness issue.
Worth either (a) adding a comment explaining the intentional exclusion, OR (b) filing a separate follow-up for cached three-color policer staleness.
Test gap (Gemini Check 6, valid)
Gemini correctly notes the "unchanged filter" test uses same single-element snapshot for old + new → doesn't exercise the index-shift over-purge scenario. To prove "unchanged DSCP filter ignored on unrelated config rotation," the test should add a sentinel filter to one side and verify still (false, false) — currently impossible because the ID-shift would fire a spurious purge.
This is a real test gap. Closes naturally if the Filter.id stability fix lands.
Codex r5 carry status
Codex r7 confirms the r5 carry is STILL deferred:
"term_matches_v4/v6 still only checks protocol, src/dst prefixes, src/dst ports, and DSCP. Config has fields like ICMP type/code, TCP flags,
is-fragment, and flexible match; those are not in the Rust userspace filter snapshot/evaluator path. DSCP/TOS is covered; fragment/IHL/IP-options-style matching is not fixed by r7."
Gemini independently confirms:
"The FilterTerm AST does not currently parse or support these metadata fields, meaning there are no hidden cache bypass vectors for them in this PR."
Both agree: the carry items (TOS, fragment, IHL, IP options) are not yet in the parser/evaluator surface, so there's no equivalent cached-bypass for them right now. If those fields get added to FilterTerm later, the symmetric cache-decline pattern would need to apply.
Recommendation
MERGE-NEEDS-MINOR. The r6 bypass is closed. Two non-blocking observations:
- Filter.id positional instability: either remove
idfromfilter_term_semantics_match+dscp_sensitive_filter_semantics_match, OR derive Filter.id from content hash. Comment the choice. - three_color_policer exclusion: add a code comment explaining that the helper compares filter-SNAPSHOT semantics for DSCP-cache-decline purposes, not full TX-selection runtime equality. File separate follow-up for cached three-color policer staleness if that's a real concern.
Strongly consider:
- Improve the "unchanged" test to add a sentinel non-DSCP filter to NEW (current test can't catch the ID-shift over-purge).
- File follow-up issue for cached three-color policer runtime staleness on rotation.
Defer:
- r5 carry items (TOS/fragment/IHL/IP options) — not yet in the FilterTerm surface.
Codex task: task-mpd6m8cx-tf97ba. Gemini Pro 3.1 task: task-mpd6mw6a-l8o3k8.
Not merging — author's decision.
Close the userspace RT_FLOW filter-log gaps that let logs describe a security decision that the dataplane did not actually enforce. Existing local-delivery sessions could keep bypassing a newly-added lo0 filter because the helper had already published PASS_TO_KERNEL state in the BPF session map. Add a forwarding-snapshot reload hook that scans current local-delivery sessions when lo0 filters are present and republishes matching entries as helper-visible live sessions. This keeps established local-delivery traffic visible to the userspace lo0 gate. Cached local-delivery session hits also need terminal cleanup when the current lo0 filter denies the packet. Replace the bare local session remove with a helper that deletes BPF/session-map state, removes shared session/NAT/forward-wire indexes, replicates the peer delete, and emits the close delta. Extend the cached-hit regression test so it proves the local table, shared indexes, and HA delta are all cleared together. Policy reject events now fail closed consistently with the forwarding behavior. The userspace PolicyDenied path does not synthesize ICMP or TCP reset packets yet, so PolicyAction::Reject is emitted as RT_FLOW deny instead of claiming reject-on-wire behavior that did not happen. Output filters now carry terminal actions through TX selection. The compiler marks output filters with discard/reject terms as needing TX path evaluation even if the term has no log, count, policer, forwarding-class, or DSCP rewrite side effect. Runtime and cached TX descriptors preserve the matched action and drop before enqueue when the output term is terminal. Logged output deny/reject events therefore no longer describe packets that still forward. DSCP-matched output filters deliberately decline flow-cache insertion. The flow-cache key is only the 5-tuple, while DSCP is per-packet metadata. Caching the first packet's output-filter decision would let a later packet on the same tuple with a different DSCP bypass terminal terms. Keep those flows on the live TX-selection path instead. Apply the same invariant to DSCP-matched input filters. Cache insertion now declines when the ingress input filter has DSCP match terms, session hits re-run DSCP-sensitive input filters before forwarding, and config rotations purge existing sessions when DSCP-sensitive input filters are added, removed, or semantically changed. The purge is intentionally conservative because session metadata does not retain the original logical ingress ifindex needed for exact per-interface re-evaluation. The rotation comparison now inspects actual DSCP-sensitive filter content, not only ifindex set membership. It compares stable names, term match/action semantics, and three-color policer runtime shape while ignoring compiler-positional filter and term IDs. That closes the same-ifindex bypass without over-purging after unrelated filter reorder, and still purges when a DSCP-sensitive term points at a changed policer shape. Update the AF_XDP and filter module docs plus the #1379 closeout plan to state the enforced input/output filter behavior and the remaining lo0 caveats. NAT64 local-delivery filtering is documented in terms of the pre-translation packet family. Validation: - RUSTFLAGS='-Awarnings' cargo test input_dscp_filter_families_changed -- --nocapture - RUSTFLAGS='-Awarnings' cargo test \ from_forward_decision_skips_cache_for_dscp_matched_input_filter \ -- --nocapture - RUSTFLAGS='-Awarnings' cargo test \ poll_descriptor_session_hit_rechecks_dscp_input_filter \ -- --nocapture - RUSTFLAGS='-Awarnings' cargo test \ purge_sessions_for_input_dscp_filter_revalidation_removes_family \ -- --nocapture - RUSTFLAGS='-Awarnings' cargo test input_filter -- --nocapture - RUSTFLAGS='-Awarnings' cargo test output_filter -- --nocapture - RUSTFLAGS='-Awarnings' cargo test filter_log -- --nocapture - RUSTFLAGS='-Awarnings' cargo test lo0_filter -- --nocapture - go test ./pkg/dataplane/userspace ./pkg/logging -run \ TestEventStreamRawDataplaneEventsFeedSyslogFanout - go test ./pkg/dataplane/userspace ./pkg/logging -run \ TestDecodeDataplaneEventPolicyDenyRTFlow - go test ./pkg/dataplane/userspace ./pkg/logging -run \ TestFormatBinaryRecord_Basic - git diff --check
e9aa5bc to
d2bb24b
Compare
Round-8 cleanup pushed on
|
Claude round-8 review on
|
Round-8 triple-review final synthesis on
|
| Reviewer | Verdict |
|---|---|
| Claude r8 | MERGE-READY |
| Codex r8 | MERGE-READY |
| Gemini Pro 3.1 r8 | MERGE-READY (8/8 checks PASS) |
After 8 rounds of hostile review with self-corrections and finding-iteration on every round, all three reviewers independently converged on MERGE-READY with quoted-line verification.
Triple-verified
Positional ID exclusion (r7 MINOR #1 closed):
- Codex: "No
old.id == new.idremains in either helper." - Gemini: "The
idfields ... are correctly omitted from bothfilter_term_semantics_matchanddscp_sensitive_filter_semantics_match." - Test asserts
assert_ne!on old/new IDs to prove non-vacuous, thenassert_eq!returns(false, false).
Three-color policer runtime semantics (r7 MINOR #2 closed):
- All four cases verified:
(None, None) → true,(Some, Some) → Arc::ptr_eq || same_runtime_shape_as, asymmetric(None, Some)/(Some, None) → false. same_runtime_shapecompares 7 fields: mode, color_blind, committed_rate, committed_burst, peak_or_excess_rate, peak_or_excess_burst, treatments. Codex: "All fields that dictate the runtime performance envelope or treatment ... are systematically compared."
Lock-ordering safety (Gemini's sharp framing):
selfis the old, currently-live policer runtime which is contended by dataplane workers.otheris the newly-parsed unexposed policer runtime (which no workers can access yet). Thus,other.stateis uncontended. Dataplane workers only lock a single policer at a time (self.state), so a cyclic deadlock is impossible. If the mutex is poisoned, it safely returnsfalseto trigger a conservative purge.
Better articulation than my Claude r8 lock-ordering note — explains why deadlock is impossible (not just "callers maintain order").
Tests rigorous: both new regression tests explicitly assert the test setup forces the relevant divergence (ID-shift OR new Arc), then assert the helper makes the correct decision. Not vacuous.
Performance: Codex framing: "Common unchanged rotations preserve Arcs and hit Arc::ptr_eq, so no mutexes. Worst case is roughly 4 * workers * DSCP-ifindexes * policer-terms mutex acquisitions per rotation when runtimes are not pointer-equal." Forwarding rotations are control-plane events, not packet path. Not a merge blocker.
Carried forward (out of scope for this PR)
r5 carry — audit other per-packet-variable filter match fields. Codex r8: "deferred/not addressed by r8. The r8 delta is filter semantic comparison only; it does not touch AF_XDP parser/event paths."
The FilterTerm AST does not currently support TOS, fragment, IHL, or IP options as match fields, so no equivalent cached-bypass exists for them in the userspace dataplane today. If those fields are added to the FilterTerm surface in a future PR, the symmetric pattern (cache-decline + content-comparison + session purge) will need to apply.
Recommend filing this as a follow-up issue tracker entry.
Recommendation
MERGE-READY converged across all three reviewers.
The complete fix chain delivered in this PR closes 6 bug classes across 8 rounds:
- r1: Output filter terminal action enforcement (
Discard/Rejectcarried through TX selection, drop before enqueue) - r1: Reject contract divergence (filter-path Reject fail-closed as deny; policy-path semantics preserved)
- r2: lo0 enforcement on cached LocalDelivery (re-eval on session hit + publish-time gate + retroactive republish)
- r4: Cached output-filter DSCP bypass (cache-decline for output filters with DSCP-match terms)
- r5/r6: Cached input-filter DSCP bypass (cache-decline + session-hit re-eval + retroactive purge with content-aware detection)
- r8: Filter.id positional instability and three-color policer runtime shape (proper semantic comparison for purge trigger)
Strongly consider in follow-up (non-blocking):
- Code comment in
same_runtime_shape_asdocumenting Gemini's lock-ordering analysis (self=live, other=unexposed → no contention). - File issue for r5 carry (TOS/fragment/IHL/IP options) when/if those FilterTerm fields are added.
Codex task: task-mpd7grdg-urepem. Gemini Pro 3.1 task: task-mpd7hbxx-07v7b5.
Not merging — author's decision.
PR #1430 closed the immediate filter-log enforcement gaps by treating DSCP as cache-sensitive metadata living outside the 5-tuple session/flow-cache key. #1431 codifies the contract going forward so the next per-packet match field (TOS lower bits/ECN, TCP flags, fragment/IHL, IP options, ICMP type/code, flex_match, etc.) cannot silently bypass flow-cache. This PR is documentation plus tests — no runtime change. Three deliverables: 1. userspace-dp/src/filter/README.md grows a 'Cache-key invariants for per-packet match fields (#1431)' section: the invariant wording verbatim from the issue body, a classification table covering every match criterion on FirewallTermSnapshot and FilterTerm (including future candidates), a path (b) runbook listing the seven hooks (aggregate flag, per-interface set, lookup helpers, flow-cache gate, session-hit re-eval, rotation purge, tests) with file:line refs to the DSCP reference implementation, a path (a) pointer for promoting a field into SessionKey, and a lo0 note clarifying that LocalDelivery is non-cacheable (is_cacheable() returns false) so lo0 filters do not need a per-interface has_<X>_match set. 2. In-source CACHE-KEY INVARIANT comment blocks above FilterTerm in userspace-dp/src/filter/mod.rs and above FirewallTermSnapshot in userspace-dp/src/protocol/security.rs. These are the loud reviewer-facing tripwires: anyone adding a match field to either surface will see the contract in the diff. Both blocks point at the README section for the full classification table and recipe. 3. Two new runbook-shaped tests in userspace-dp/src/afxdp/flow_cache_tests.rs: - dscp_input_gate_blocks_flow_cache_insertion_via_runbook_pattern - dscp_output_gate_blocks_flow_cache_insertion_via_runbook_pattern These re-exercise the DSCP gate already covered by the bespoke tests at lines 644/696, but in an explicitly runbook-shaped layout with step-by-step comments. They serve as the canonical 'clone this when adding a new cache-sensitive match field' references that the README cites. Cited (not duplicated) existing coverage: - rotation purge positional-ID immunity: filter/tests.rs:1806 input_dscp_filter_families_changed_ignores_positional_filter_id_change - session-hit re-evaluation: afxdp/tests.rs:3184 Methodology: 4 rounds of Codex + AGY adversarial plan review. - r1: Codex PLAN-KILL / AGY PLAN-NEEDS-MAJOR — both rejected the v1 PER_PACKET_MATCH_FIELDS constant list + PerPacketMatchField trait + fake-field harness as compile-time theater (Rust has no struct-field reflection without proc-macros). - r2: Codex PLAN-NEEDS-MINOR / AGY PLAN-READY — kept the salvage path (README + harness as runbook reference); Codex caught pub(super) visibility on FlowCacheEntry::from_forward_decision forcing the tests into afxdp/ not filter/. - r3: both PLAN-NEEDS-MINOR on stale-from-v2 leftovers in v3. - r4: both PLAN-READY on v5. Plan + reviewer task ids preserved under docs/pr/1431-filter-cache-invariants/. Validation: - cargo build --release clean (warnings unchanged from master). - 2 new tests pass 5/5 (flake check). - Full cargo --release passes except a pre-existing snat_contract_documents_current_fail_closed_runtime doc-guard that also fails on origin/master (docs/userspace-dataplane-gaps.md has 0 occurrences of 'fail-closed' on master) — unrelated to #1431. - Go suite clean. Closes #1431
Codex r1 returned MERGE-NEEDS-MINOR on PR #1604 with two findings: 1. The CACHE-KEY INVARIANT block above FirewallTermSnapshot was a shortened version of the block above FilterTerm and dropped the concrete file:line hooks. For a doc-invariant PR, the two blocks should either be identical or both be intentionally short pointers to the README. Harmonized: protocol/security.rs now mirrors filter/mod.rs's block, including the extend-SessionKey prerequisites, the cache-sensitive runbook hooks with afxdp/flow_cache.rs:297-309 / afxdp/poll_descriptor/mod.rs:217-244 / afxdp/worker/loop_body/mod.rs:295-330 references, the #1430 precedent, and the README pointer. 2. The 'Step 4: same gate, different per-interface set (iface_filter_out_v4_fast.has_dscp_match_terms)' comment in the output runbook test was sloppy — has_dscp_match_terms is an aggregate bool on Filter (looked up via the per-interface iface_filter_out_v{4,6}_fast map), not a HashSet. Reworded to 'fast map lookup plus aggregate flag, not a HashSet' so the reference comment is accurate. Copilot r1 (COMMENTED) flagged two inline issues: 3. The README's 'parse_flow_ports stores the ICMP echo identifier in src_port' was over-specifying. parse_flow_ports (frame/inspect.rs:212-232) unconditionally reads bytes 4-6 of the ICMP header for ALL ICMP types — those bytes only carry an 'identifier' for Echo Request/Reply, and are opaque for other ICMP types. Reworded the README explanation and the classification-table row accordingly. 4. The 'lines 644 and 696' references in the new tests' header comment and the README runbook are off-by-one against current master (the bespoke tests are at 643/695) AND will drift over time. Replaced both occurrences with test names: from_forward_decision_skips_cache_for_dscp_matched_input_filter / _output_filter. Per /triple-review Step 8.5, the Claude SMR code-review doc is now on the branch at docs/pr/1431-filter-cache-invariants/claude-smr-code-r1.md with MERGE-READY verdict. The doc covers diff-coverage check, allocation/lock/numerical/HA invariants, test-coverage shape verification, both Codex r1 minors and both Copilot inline findings analysed in detail, ICMP + lo0 claims independently re-verified, pre-existing snat_contract doc-guard failure proven master-broken, and a self-correction section listing the misses Codex / AGY caught in earlier plan rounds. Build + 2 new tests still pass 5/5 flake. Go suite still clean.
PR #1430 closed the immediate filter-log enforcement gaps by treating DSCP as cache-sensitive metadata living outside the 5-tuple session/flow-cache key. #1431 codifies the contract going forward so the next per-packet match field (TOS lower bits/ECN, TCP flags, fragment/IHL, IP options, ICMP type/code, flex_match, etc.) cannot silently bypass flow-cache. This PR is documentation plus tests — no runtime change. Three deliverables: 1. userspace-dp/src/filter/README.md grows a 'Cache-key invariants for per-packet match fields (#1431)' section: the invariant wording verbatim from the issue body, a classification table covering every match criterion on FirewallTermSnapshot and FilterTerm (including future candidates), a path (b) runbook listing the seven hooks (aggregate flag, per-interface set, lookup helpers, flow-cache gate, session-hit re-eval, rotation purge, tests) with file:line refs to the DSCP reference implementation, a path (a) pointer for promoting a field into SessionKey, and a lo0 note clarifying that LocalDelivery is non-cacheable (is_cacheable() returns false) so lo0 filters do not need a per-interface has_<X>_match set. 2. In-source CACHE-KEY INVARIANT comment blocks above FilterTerm in userspace-dp/src/filter/mod.rs and above FirewallTermSnapshot in userspace-dp/src/protocol/security.rs. These are the loud reviewer-facing tripwires: anyone adding a match field to either surface will see the contract in the diff. Both blocks point at the README section for the full classification table and recipe. 3. Two new runbook-shaped tests in userspace-dp/src/afxdp/flow_cache_tests.rs: - dscp_input_gate_blocks_flow_cache_insertion_via_runbook_pattern - dscp_output_gate_blocks_flow_cache_insertion_via_runbook_pattern These re-exercise the DSCP gate already covered by the bespoke tests at lines 644/696, but in an explicitly runbook-shaped layout with step-by-step comments. They serve as the canonical 'clone this when adding a new cache-sensitive match field' references that the README cites. Cited (not duplicated) existing coverage: - rotation purge positional-ID immunity: filter/tests.rs:1806 input_dscp_filter_families_changed_ignores_positional_filter_id_change - session-hit re-evaluation: afxdp/tests.rs:3184 Methodology: 4 rounds of Codex + AGY adversarial plan review. - r1: Codex PLAN-KILL / AGY PLAN-NEEDS-MAJOR — both rejected the v1 PER_PACKET_MATCH_FIELDS constant list + PerPacketMatchField trait + fake-field harness as compile-time theater (Rust has no struct-field reflection without proc-macros). - r2: Codex PLAN-NEEDS-MINOR / AGY PLAN-READY — kept the salvage path (README + harness as runbook reference); Codex caught pub(super) visibility on FlowCacheEntry::from_forward_decision forcing the tests into afxdp/ not filter/. - r3: both PLAN-NEEDS-MINOR on stale-from-v2 leftovers in v3. - r4: both PLAN-READY on v5. Plan + reviewer task ids preserved under docs/pr/1431-filter-cache-invariants/. Validation: - cargo build --release clean (warnings unchanged from master). - 2 new tests pass 5/5 (flake check). - Full cargo --release passes except a pre-existing snat_contract_documents_current_fail_closed_runtime doc-guard that also fails on origin/master (docs/userspace-dataplane-gaps.md has 0 occurrences of 'fail-closed' on master) — unrelated to #1431. - Go suite clean. Closes #1431
Codex r1 returned MERGE-NEEDS-MINOR on PR #1604 with two findings: 1. The CACHE-KEY INVARIANT block above FirewallTermSnapshot was a shortened version of the block above FilterTerm and dropped the concrete file:line hooks. For a doc-invariant PR, the two blocks should either be identical or both be intentionally short pointers to the README. Harmonized: protocol/security.rs now mirrors filter/mod.rs's block, including the extend-SessionKey prerequisites, the cache-sensitive runbook hooks with afxdp/flow_cache.rs:297-309 / afxdp/poll_descriptor/mod.rs:217-244 / afxdp/worker/loop_body/mod.rs:295-330 references, the #1430 precedent, and the README pointer. 2. The 'Step 4: same gate, different per-interface set (iface_filter_out_v4_fast.has_dscp_match_terms)' comment in the output runbook test was sloppy — has_dscp_match_terms is an aggregate bool on Filter (looked up via the per-interface iface_filter_out_v{4,6}_fast map), not a HashSet. Reworded to 'fast map lookup plus aggregate flag, not a HashSet' so the reference comment is accurate. Copilot r1 (COMMENTED) flagged two inline issues: 3. The README's 'parse_flow_ports stores the ICMP echo identifier in src_port' was over-specifying. parse_flow_ports (frame/inspect.rs:212-232) unconditionally reads bytes 4-6 of the ICMP header for ALL ICMP types — those bytes only carry an 'identifier' for Echo Request/Reply, and are opaque for other ICMP types. Reworded the README explanation and the classification-table row accordingly. 4. The 'lines 644 and 696' references in the new tests' header comment and the README runbook are off-by-one against current master (the bespoke tests are at 643/695) AND will drift over time. Replaced both occurrences with test names: from_forward_decision_skips_cache_for_dscp_matched_input_filter / _output_filter. Per /triple-review Step 8.5, the Claude SMR code-review doc is now on the branch at docs/pr/1431-filter-cache-invariants/claude-smr-code-r1.md with MERGE-READY verdict. The doc covers diff-coverage check, allocation/lock/numerical/HA invariants, test-coverage shape verification, both Codex r1 minors and both Copilot inline findings analysed in detail, ICMP + lo0 claims independently re-verified, pre-existing snat_contract doc-guard failure proven master-broken, and a self-correction section listing the misses Codex / AGY caught in earlier plan rounds. Build + 2 new tests still pass 5/5 flake. Go suite still clean.
Summary
Why
#1428 was merged at
f967a112before the round-4 review findings were fully closed. This fix-forward PR carries the follow-up commit that closes the three verified blockers:Validation
RUSTFLAGS='-Awarnings' cargo test input_filter -- --nocaptureRUSTFLAGS='-Awarnings' cargo test lo0_filter -- --nocaptureRUSTFLAGS='-Awarnings' cargo test filter_log_can_encode_reject_action -- --nocaptureRUSTFLAGS='-Awarnings' cargo test filter_log_event_emit_preserves_reject_action -- --nocapturego test ./pkg/dataplane/userspace ./pkg/logging -run 'TestEventStreamRawDataplaneEventsFeedSyslogFanout|TestDecodeDataplaneEventPolicyDenyRTFlow|TestFormatBinaryRecord_Basic'git diff --check HEAD^Refs #1379
Refs #1373