diff --git a/README.md b/README.md index 102da5bd4..d4b29ffee 100644 --- a/README.md +++ b/README.md @@ -85,7 +85,7 @@ TC Egress: main -> screen_egress -> conntrack -> nat -> forward | NAT64 (IPv6↔IPv4) | Yes | Yes | | NPTv6 (RFC 6296) | Yes | Yes | | Screen/IDS (11 checks) | Yes | Most checks yes; SYN-cookie behavior falls back | -| Firewall filters + policers | Yes | Filters yes; three-color policers still gated | +| Firewall filters + policers | Yes | Filters yes; three-color policers admitted for color-blind `then discard` slice, with remaining #1375 hardening still open | | TCP MSS clamping | Yes | Yes | | GRE tunnel transit | Yes | Yes (passthrough) | | IPsec / XFRM | Yes | Yes (passthrough) | @@ -97,11 +97,12 @@ TC Egress: main -> screen_egress -> conntrack -> nat -> forward The userspace dataplane now covers most of the transit feature set in native Rust, but it is not "fallback-free". Current explicit gates in code still -include SYN-cookie-dependent screen behavior, three-color policers, and port -mirroring. Pool-mode SNAT is admitted, and #1385 added userspace-v1 -`address-persistent` selection; #1377 still owns per-pool `persistent-nat` -lease reuse, allocator/exhaustion counters, and the mixed-backend rollback -boundary. The exact admission boundary is documented in +include SYN-cookie-dependent screen behavior and port mirroring. Three-color +policers are admitted only for the bounded color-blind `then discard` runtime +slice while #1375 hardening remains. Pool-mode SNAT is admitted, and #1385 +added userspace-v1 `address-persistent` selection; #1377 still owns per-pool +`persistent-nat` lease reuse, allocator/exhaustion counters, and the +mixed-backend rollback boundary. The exact admission boundary is documented in [`docs/userspace-dataplane-gaps.md`](docs/userspace-dataplane-gaps.md). 
## Architecture diff --git a/_Log.md b/_Log.md index e32288fc0..bda9cce60 100644 --- a/_Log.md +++ b/_Log.md @@ -17,6 +17,28 @@ - **File(s)**: `userspace-dp/src/afxdp/umem/mod.rs`, `_Log.md` - **Validation**: `go test ./pkg/dataplane/userspace`; `cargo test --manifest-path userspace-dp/Cargo.toml mirror::` (expected environment failure: missing libelf headers/pkg-config); `git diff --check` +- **Timestamp**: 2026-05-17T21:58:24Z + - **Action**: PR #1410 residual review follow-up — made emit-on-wire inject tuple identity an explicit Go/Rust control-wire contract, added helper status gating for mixed-version fail-closed behavior, and stopped Rust from synthesizing source tuple fields for emitted inject packets. + - **File(s)**: `pkg/dataplane/userspace/protocol.go`, `pkg/dataplane/userspace/inject.go`, `pkg/dataplane/userspace/inject_test.go`, `pkg/dataplane/userspace/protocol_test.go`, `pkg/cmdtree/tree.go`, `userspace-dp/src/protocol.rs`, `userspace-dp/src/server/lifecycle.rs`, `userspace-dp/src/server/README.md`, `userspace-dp/src/afxdp/coordinator/inject.rs`, `userspace-dp/src/afxdp/coordinator/tests.rs`, `userspace-dp/src/afxdp/frame/mod.rs`, `_Log.md` + - **Validation**: `go test ./pkg/dataplane/userspace ./pkg/cmdtree`; `cargo test --manifest-path userspace-dp/Cargo.toml inject_packet -- --nocapture`; `cargo test --manifest-path userspace-dp/Cargo.toml injected_packet -- --nocapture`; `cargo check --manifest-path userspace-dp/Cargo.toml`; `git diff --check` + +- **Timestamp**: 2026-05-17T21:39:00Z + - **Action**: PR #1410 review follow-up — reconciled README userspace capability wording so three-color policers are described as partially admitted (color-blind `then discard` slice) rather than fully gated, matching current userspace capability documentation and runtime admission behavior. 
+ - **File(s)**: `README.md`, `_Log.md` + +- **Timestamp**: 2026-05-17T21:23:55Z + - **Action**: PR #1410 round-3 blocker follow-up — added explicit pending-forward CoS resolution state so resolved `None`/`None` selections are not metered again, carried metadata-derived ICMP flow keys through local and embedded ICMP prebuilt-forward paths, stamped emitted inject packets with synthetic ICMP tuples before TX selection, and preserved local tunnel tuple metadata through TX. + - **File(s)**: `userspace-dp/src/afxdp/types/tx.rs`, `userspace-dp/src/afxdp/forward_request.rs`, `userspace-dp/src/afxdp/tx/dispatch.rs`, `userspace-dp/src/afxdp/icmp.rs`, `userspace-dp/src/afxdp/poll_descriptor.rs`, `userspace-dp/src/afxdp/coordinator/inject.rs`, `userspace-dp/src/afxdp/tunnel.rs`, `userspace-dp/src/afxdp/tx/dispatch_tests.rs`, `userspace-dp/src/afxdp/tests.rs`, `userspace-dp/src/afxdp/frame/tests.rs`, `userspace-dp/src/afxdp/coordinator/tests.rs`, `_Log.md` + - **Validation**: `cargo test --manifest-path userspace-dp/Cargo.toml build_local_time_exceeded_request -- --nocapture`; `cargo test --manifest-path userspace-dp/Cargo.toml pending_forward_cos_resolution -- --nocapture`; `cargo test --manifest-path userspace-dp/Cargo.toml stamp_injected_packet_tuple -- --nocapture`; `cargo test --manifest-path userspace-dp/Cargo.toml build_live_forward_request_marks_empty_cos_selection_resolved -- --nocapture`; `cargo test --manifest-path userspace-dp/Cargo.toml build_live_forward_request_meters_non_l4_metadata_flow -- --nocapture`; `cargo test --manifest-path userspace-dp/Cargo.toml three_color -- --nocapture`; `cargo check --manifest-path userspace-dp/Cargo.toml`; `git diff --check` + +- **Timestamp**: 2026-05-17T20:30:00Z + - **Action**: PR #1410 round-1 review follow-up — removed flow-cache hit TX-selection cloning from the packet fast path, switched local ICMP/tunnel/control-packet CoS resolution to timestamped `_at` evaluation with flow-key fallback, and enforced `cos.drop` handling 
on those paths so three-color policer drops are not bypassed when metadata-only classification is used. + - **File(s)**: `userspace-dp/src/afxdp/poll_descriptor.rs`, `userspace-dp/src/afxdp/icmp.rs`, `userspace-dp/src/afxdp/tunnel.rs`, `userspace-dp/src/afxdp/coordinator/inject.rs`, `_Log.md` + +- **Timestamp**: 2026-05-17T20:37:00Z + - **Action**: Addressed post-validation review nit by lazily constructing cached precomputed TX-selection descriptors only on flow-cache fallback forwarding, avoiding unnecessary per-hit descriptor construction on successful in-place TX hits. + - **File(s)**: `userspace-dp/src/afxdp/poll_descriptor.rs`, `_Log.md` + - **Timestamp**: 2026-05-17T15:28:13Z - **Action**: PR #1397 follow-up — fixed mouse-latency diagnostics review findings by making `cwnd_settle_ok` tri-state in manifests (unknown/true/false), correcting cwnd byte-unit parsing to 1024-based `K/M/G/TBytes`, recording probe phase timings even on failed/timed-out connect/drain/read attempts, tightening fairness-regimes settle-evidence wording, and extending unit coverage for settle-diagnostics CLI output/status and failure-phase timing counts. 
- **File(s)**: `test/incus/test-mouse-latency.sh`, `test/incus/mouse_latency_orchestrate.py`, `test/incus/mouse_latency_orchestrate_test.py`, `test/incus/mouse_latency_probe.py`, `test/incus/mouse_latency_probe_test.py`, `test/incus/test_mouse_latency_shell_test.py`, `docs/fairness-regimes.md`, `_Log.md` diff --git a/cmd/cli/request.go b/cmd/cli/request.go index 165abcebd..df7422df8 100644 --- a/cmd/cli/request.go +++ b/cmd/cli/request.go @@ -177,7 +177,7 @@ func (c *ctl) handleRequestChassisClusterDataPlane(args []string) error { return err } action = fmt.Sprintf("userspace-inject:%d:%s", slot, mode) - target = extra["destination-ip"] + target = dpuserspace.EncodeInjectPacketTarget(extra) case len(args) > 0 && args[0] == "forwarding": armed, err := dpuserspace.ParseForwardingCommand(args) if err != nil { diff --git a/docs/pr/1373-retire-ebpf-dataplane/plan-1375-three-color-policers.md b/docs/pr/1373-retire-ebpf-dataplane/plan-1375-three-color-policers.md index adf7181f5..37cca530e 100644 --- a/docs/pr/1373-retire-ebpf-dataplane/plan-1375-three-color-policers.md +++ b/docs/pr/1373-retire-ebpf-dataplane/plan-1375-three-color-policers.md @@ -5,6 +5,37 @@ Add userspace support for Junos three-color policers so configs under `firewall three-color-policer` no longer require the eBPF dataplane. +## Current Status + +The bounded runtime slice is implemented after #1395: + +- Rust compiles three-color policer snapshots into stable name-sorted runtime + IDs and links filter terms to shared runtime handles. +- Live forwarding-path TX selection meters srTCM/trTCM policers, applies red + drops for `then discard`, and records green/yellow/red/drop packet and byte + counters. +- Flow-cache hits carry cached policer handles and meter them before cached + forwarding. +- Rust status, Go protocol, status formatting, and Prometheus expose + per-color/drop counters. 
+- `deriveUserspaceCapabilities()` admits the current color-blind `then + discard` runtime slice for `firewall three-color-policer` configs. + +Remaining #1375 work is validation and hardening rather than admission: + +- Color-aware inherited-color handling remains fail-closed until packet + metadata carries trusted incoming color end-to-end. This avoids silently + promoting yellow/red traffic to green. +- Replace the per-policer mutex runtime with the approved sharded or packed + atomic state if throughput testing shows contention. +- Preserve counters and token state across snapshot rebuilds if operator + continuity is required for #1373 removal. +- Wire non-drop per-color actions, especially loss-priority propagation, into + the downstream forwarding/CoS path. Until then, non-`discard` three-color + actions remain fail-closed. +- Run integration traffic, failover, and performance evidence for + green/yellow/red classification and red drops. + ## Dependencies - #1381 should land first so userspace capability removal and snapshot delivery @@ -16,7 +47,9 @@ Add userspace support for Junos three-color policers so configs under Extend the userspace policer snapshot and Rust types with srTCM, trTCM, `color_blind`, color-aware input handling, and per-color actions for DSCP -rewrite plus red drop/count behavior. +rewrite plus red drop/count behavior. The current runtime enables only the +subset with enforceable semantics: color-blind metering and red drop/count +for `then discard`. Use `u128` token refill math with `monotonic_nanos`. Reject invalid config at compile/commit time: zero rate, zero burst, `PIR < CIR`, `PBS < CBS`, and @@ -30,17 +63,21 @@ requires both C and P tokens, yellow requires only P tokens, red otherwise. Color-aware mode must respect incoming color and never promote packets above their incoming color. Color-blind mode evaluates each packet without inherited -color. +color. 
Until inherited color is carried in trusted packet metadata, userspace +must reject color-aware three-color policers rather than defaulting every +packet to green. ## Hot-Path Invariants - Flow-cache hits still execute the policer before forwarding. - No `f64` token math in the dataplane. -- No `FxHashMap` mutable hot-path lookup as the final - production model; use stable rule IDs with sharded or packed atomic state. +- No `FxHashMap` mutable hot-path lookup. The current + runtime uses stable name-sorted IDs with shared handles; sharded or packed + atomic state remains the scaling follow-up. - Per-color DSCP rewrite and red drop decisions happen in the same forwarding decision that accounts tokens. -- Per-color counters are updated without central hot atomics. +- Per-color counters are attached to the stable policer runtime. The current + counters use relaxed atomics per logical policer/color. ## State and HA Behavior @@ -49,8 +86,8 @@ color. adds token sync. - Config snapshots carry stable policer/rule identity so counters can survive snapshot rebuilds where practical. -- Status exposes green/yellow/red packet and byte counters, DSCP rewrites, and - red drops through Rust status, Go protocol, CLI, and Prometheus. +- Status exposes green/yellow/red packet and byte counters plus red drops + through Rust status, Go protocol, CLI, and Prometheus. ## Risks @@ -61,9 +98,9 @@ color. preserving one logical bucket per configured policer identity. - Color semantics: color-aware mode must never promote incoming yellow/red traffic; one wrong branch turns a security control into a bandwidth grant. -- Counter attribution: green/yellow/red/drop counters must survive snapshot - rebuilds by stable identity, or operators cannot audit policer behavior after - commits. +- Counter attribution: green/yellow/red/drop counters are stable inside a + compiled runtime. Carrying them across snapshot rebuilds remains a follow-up + if operators need continuity across commits. 
## Exact Tests @@ -74,14 +111,16 @@ color. - Cargo: `policer::color_blind_ignores_incoming_color`. - Cargo: `policer::u128_bucket_math_boundary_inputs`. - Cargo: `policer::three_color_dscp_rewrite`. -- Cargo: `policer::flow_cache_hits_run_policer`. +- Cargo: `filter::tests::three_color_runtime_ids_and_miss_path_counters_are_stable`. +- Cargo: `filter::tests::flow_cache_hits_run_three_color_policer`. - Go: userspace snapshot round-trip for three-color policer fields, per-color actions, and `ColorBlind`. - Go: compiler validation rejects zero rates/bursts, `PIR < CIR`, and `PBS < CBS`. -- Go: `deriveUserspaceCapabilities()` admits three-color policer configs only - after the userspace snapshot and Rust runtime support are wired, and rejects - them before that point. +- Go: `deriveUserspaceCapabilities()` admits three-color policer configs after + the userspace snapshot and Rust runtime support are wired. +- Go: ProcessStatus, status formatting, and Prometheus tests cover + three-color per-color/drop counters. - Integration: controlled-rate traffic against userspace cluster verifies green/yellow/red classification, DSCP rewrite, red drop behavior, and per-color counters. diff --git a/docs/userspace-dataplane-gaps.md b/docs/userspace-dataplane-gaps.md index abfb25b5e..c986c8d9f 100644 --- a/docs/userspace-dataplane-gaps.md +++ b/docs/userspace-dataplane-gaps.md @@ -35,6 +35,7 @@ These capabilities exist in the current Rust userspace dataplane code path: | NPTv6 | Implemented | Stateless prefix translation | | Firewall filters | Implemented | Filter snapshots and evaluation in Rust | | Flow export | Implemented | Userspace flow export snapshot and runtime | +| Three-color policers | Implemented with caveats | srTCM/trTCM runtime, forwarding-path and flow-cache-hit metering, red drops for `then discard`, status/CLI/Prometheus counters. Sharded state, cross-snapshot continuity, non-drop color actions, and integration evidence remain #1375 follow-up work. 
| | TCP MSS clamping | Implemented | Flow snapshot fields are delivered and used in Rust | | Embedded ICMP NAT reversal | Implemented | Includes reverse-session repair paths | | Configurable session timeouts | Implemented | Snapshot-driven timeouts in `session.rs` | @@ -52,7 +53,6 @@ These are the remaining explicit configuration gates in |----------------------|-------------|--------------------| | Unsupported policy shapes | Gated | Address/application expansion must succeed for userspace | | Screen behavior requiring SYN cookies | Gated; userspace screen runtime has fail-closed cookie challenge/ACK-validation/cache scaffolding, but no HA key publication or SYN-ACK/RST TX yet | #1374 | -| Three-color policers | Gated | #1375 | | Port mirroring | Gated; partial runtime | #1376 still needs full path coverage and integration evidence before the gate is removed | Port mirroring now has snapshot/wire plumbing plus a bounded forwarded-path @@ -87,7 +87,7 @@ The current #1373 audit produced these tracked blockers: | #1378 | Finish the policy-scheduler retirement contract after #1396 userspace propagation: hit-counter survival across scheduler snapshot rebuilds and strict missing-scheduler commit behavior landed in the 2026-05-17 closeout slice; remaining blocker is integration/failover validation evidence | Phase 4 BPF source removal | | #1379 | Emit policy-deny, screen-drop, and filter-log dataplane events from userspace | Phase 4 BPF source removal | | #1374 | Implement userspace SYN-cookie flood protection or an approved equivalent. #1393 and the 2026-05-17 runtime slice cover deterministic cookie codec/layout, snapshot propagation, fail-closed screen challenge selection, session-miss ACK validation, and a bounded validated-client cache. Lower-layer coverage in `userspace-dp/src/screen_tests.rs` pins 4-way validated-client cache replacement; poll-stage tests only pin the operational invalid-ACK drop/bypass semantics. 
Remaining: validated-client cache expiration semantics, secret-epoch rotation, bounded SYN-ACK TX, ACK RST emission, HA-safe secret publication/cache survivability, counters/status, integration/failover validation, and userspace capability gate removal. | Phase 4 BPF source removal | -| #1375 | Implement userspace RFC 2697/2698 three-color policers | Phase 4 BPF source removal | +| #1375 | Finish userspace RFC 2697/2698 three-color policer hardening: sharded/packed state decision, cross-snapshot counter continuity decision, non-drop color action handling, and integration/failover/performance evidence | Phase 4 BPF source removal | | #1376 | Implement userspace port mirroring or explicitly retire the feature | Phase 4 BPF source removal | | #1380 | Retire the remaining BPF-map-oriented `show system buffers` operator surface. Userspace now renders the bounded helper status that exists; only optional new helper capacity denominators for session-table / flow-cache / neighbor-cache fill remain undecided. | Phase 5 CLI / observability cleanup | @@ -101,11 +101,11 @@ Recommended dependency order: userspace-v1 selector plus mixed-backend rollback boundary, but per-pool `persistent-nat` and allocator exhaustion counters remain #1377 runtime gaps. #1378 is no longer missing basic userspace propagation after #1396, - and the 2026-05-17 closeout slice added counter continuity plus strict - missing-scheduler commit rejection; keep it open for the remaining - integration/HA failover evidence. -3. #1374, #1375, and #1376 before Phase 4, because these are explicit feature - gaps currently protected by the legacy eBPF fallback. + but its remaining counter/validation/evidence contract still blocks BPF + source removal. +3. #1374 and #1376 before Phase 4, because these are explicit feature gaps + currently protected by the legacy eBPF fallback. Keep #1375 on the Phase 4 + list for validation and hardening evidence, not as a capability gate. 4. 
#1380 in Phase 5, after the dataplane boundary is settled but before the remaining operator-facing BPF map surface disappears. @@ -149,9 +149,12 @@ The highest-value remaining work on current `master` is: 2. fix #1377 and #1379 to remove silent correctness and visibility regressions; keep #1385 plus the userspace-v1 fixtures as evidence of the current AF_XDP SNAT pool selector, not full persistent-NAT parity. Keep - #1378 open only for the remaining policy-scheduler integration/HA failover - evidence. -3. close #1374, #1375, and #1376 before any BPF source removal + #1378 open for the remaining policy-scheduler counter/validation/evidence + contract after #1396. +3. close #1374 and #1376 before any BPF source removal, and finish the #1375 + hardening/evidence checklist. The three-color capability gate is removed + only for the current color-blind `then discard` slice; color-aware and + non-drop treatments stay fail-closed. 4. carry the narrowed #1380 denominator decision into Phase 5; the current userspace command already avoids BPF-map fallback when helper status is available diff --git a/pkg/api/metrics.go b/pkg/api/metrics.go index bcebc104c..10e8f258d 100644 --- a/pkg/api/metrics.go +++ b/pkg/api/metrics.go @@ -49,6 +49,11 @@ type xpfCollector struct { // Filter counters filterHitsTotal *prometheus.Desc + // Userspace three-color policer counters. 
+ threeColorPolicerPacketsTotal *prometheus.Desc + threeColorPolicerBytesTotal *prometheus.Desc + threeColorPolicerDropsTotal *prometheus.Desc + threeColorPolicerDropBytes *prometheus.Desc // Session gauges (from GC) sessionsActive *prometheus.Desc @@ -261,6 +266,26 @@ func newCollector(srv *Server) *xpfCollector { "Total firewall filter term hits.", []string{"filter", "family", "term"}, nil, ), + threeColorPolicerPacketsTotal: prometheus.NewDesc( + "xpf_userspace_three_color_policer_packets_total", + "Userspace three-color policer packets by resulting color.", + []string{"policer", "color"}, nil, + ), + threeColorPolicerBytesTotal: prometheus.NewDesc( + "xpf_userspace_three_color_policer_bytes_total", + "Userspace three-color policer bytes by resulting color.", + []string{"policer", "color"}, nil, + ), + threeColorPolicerDropsTotal: prometheus.NewDesc( + "xpf_userspace_three_color_policer_drops_total", + "Userspace three-color policer packets dropped by policer treatment.", + []string{"policer"}, nil, + ), + threeColorPolicerDropBytes: prometheus.NewDesc( + "xpf_userspace_three_color_policer_drop_bytes_total", + "Userspace three-color policer bytes dropped by policer treatment.", + []string{"policer"}, nil, + ), sessionsActive: prometheus.NewDesc( "xpf_sessions_active", "Current number of active session entries.", @@ -702,6 +727,10 @@ func (c *xpfCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.zoneBytesTotal ch <- c.policyHitsTotal ch <- c.filterHitsTotal + ch <- c.threeColorPolicerPacketsTotal + ch <- c.threeColorPolicerBytesTotal + ch <- c.threeColorPolicerDropsTotal + ch <- c.threeColorPolicerDropBytes ch <- c.sessionsActive ch <- c.sessionsEstablished ch <- c.sessionsIPv4 @@ -825,10 +854,47 @@ func (c *xpfCollector) collectUserspaceStatus(ch chan<- prometheus.Metric, dp da c.emitBindingActiveFlowCount(ch, status) c.emitBindingTXCompletionTelemetry(ch, status) c.emitCoSActiveFlowCount(ch, status) + c.emitThreeColorPolicerCounters(ch, status) 
c.emitFairnessRSSGauges(ch, status) c.emitFairnessThroughputGauges(ch, status) } +func (c *xpfCollector) emitThreeColorPolicerCounters(ch chan<- prometheus.Metric, status dpuserspace.ProcessStatus) { + for _, p := range status.ThreeColorPolicerCounters { + emitColor := func(color string, packets, bytes uint64) { + ch <- prometheus.MustNewConstMetric( + c.threeColorPolicerPacketsTotal, + prometheus.CounterValue, + float64(packets), + p.Name, + color, + ) + ch <- prometheus.MustNewConstMetric( + c.threeColorPolicerBytesTotal, + prometheus.CounterValue, + float64(bytes), + p.Name, + color, + ) + } + emitColor("green", p.GreenPackets, p.GreenBytes) + emitColor("yellow", p.YellowPackets, p.YellowBytes) + emitColor("red", p.RedPackets, p.RedBytes) + ch <- prometheus.MustNewConstMetric( + c.threeColorPolicerDropsTotal, + prometheus.CounterValue, + float64(p.DropPackets), + p.Name, + ) + ch <- prometheus.MustNewConstMetric( + c.threeColorPolicerDropBytes, + prometheus.CounterValue, + float64(p.DropBytes), + p.Name, + ) + } +} + // #1219: emit per-binding distinct active flow count for the fairness // harness. Reads BindingStatus.ActiveFlowCount populated by the // helper's ~65ms debug-state tick (see plan §3.2-3.3). 
diff --git a/pkg/api/metrics_test.go b/pkg/api/metrics_test.go index e007b1aec..9483a31cc 100644 --- a/pkg/api/metrics_test.go +++ b/pkg/api/metrics_test.go @@ -388,6 +388,72 @@ func TestCollectPolicyCountersExposesSparseAndGlobalPolicyIDs(t *testing.T) { }, 31) } +func TestEmitThreeColorPolicerCounters(t *testing.T) { + c := &xpfCollector{ + threeColorPolicerPacketsTotal: prometheus.NewDesc( + "xpf_userspace_three_color_policer_packets_total", + "packets", + []string{"policer", "color"}, + nil, + ), + threeColorPolicerBytesTotal: prometheus.NewDesc( + "xpf_userspace_three_color_policer_bytes_total", + "bytes", + []string{"policer", "color"}, + nil, + ), + threeColorPolicerDropsTotal: prometheus.NewDesc( + "xpf_userspace_three_color_policer_drops_total", + "drops", + []string{"policer"}, + nil, + ), + threeColorPolicerDropBytes: prometheus.NewDesc( + "xpf_userspace_three_color_policer_drop_bytes_total", + "drop bytes", + []string{"policer"}, + nil, + ), + } + status := dpuserspace.ProcessStatus{ + ThreeColorPolicerCounters: []dpuserspace.ThreeColorPolicerStatus{ + { + Name: "wan-egress", + GreenPackets: 10, + GreenBytes: 1000, + YellowPackets: 3, + YellowBytes: 300, + RedPackets: 2, + RedBytes: 200, + DropPackets: 2, + DropBytes: 200, + }, + }, + } + + ch := make(chan prometheus.Metric) + go func() { + c.emitThreeColorPolicerCounters(ch, status) + close(ch) + }() + var got []prometheus.Metric + for m := range ch { + got = append(got, m) + } + if len(got) != 8 { + t.Fatalf("emitThreeColorPolicerCounters: want 8 metrics, got %d", len(got)) + } + + assertCounterClose(t, got, c.threeColorPolicerPacketsTotal, map[string]string{"policer": "wan-egress", "color": "green"}, 10) + assertCounterClose(t, got, c.threeColorPolicerBytesTotal, map[string]string{"policer": "wan-egress", "color": "green"}, 1000) + assertCounterClose(t, got, c.threeColorPolicerPacketsTotal, map[string]string{"policer": "wan-egress", "color": "yellow"}, 3) + assertCounterClose(t, got, 
c.threeColorPolicerBytesTotal, map[string]string{"policer": "wan-egress", "color": "yellow"}, 300) + assertCounterClose(t, got, c.threeColorPolicerPacketsTotal, map[string]string{"policer": "wan-egress", "color": "red"}, 2) + assertCounterClose(t, got, c.threeColorPolicerBytesTotal, map[string]string{"policer": "wan-egress", "color": "red"}, 200) + assertCounterClose(t, got, c.threeColorPolicerDropsTotal, map[string]string{"policer": "wan-egress"}, 2) + assertCounterClose(t, got, c.threeColorPolicerDropBytes, map[string]string{"policer": "wan-egress"}, 200) +} + func metricValuesByWorker( t *testing.T, metrics []prometheus.Metric, diff --git a/pkg/cmdtree/tree.go b/pkg/cmdtree/tree.go index 2da607885..06746387e 100644 --- a/pkg/cmdtree/tree.go +++ b/pkg/cmdtree/tree.go @@ -824,10 +824,20 @@ var OperationalTree = map[string]*Node{ return out }, Children: map[string]*Node{ "valid": {Desc: "Inject a valid packet using the current snapshot generations", Children: map[string]*Node{ - "destination-ip": {Desc: "Optional destination IP used for forwarding resolution"}, + "destination-ip": {Desc: "Optional destination IP used for forwarding resolution"}, + "emit-on-wire": {Desc: "Emit a resolved synthetic packet on the egress interface"}, + "source-ip": {Desc: "Source IP required when emitting on wire"}, + "source-port": {Desc: "Source tuple port or ICMP identifier"}, + "destination-port": {Desc: "Destination tuple port"}, + "protocol": {Desc: "Tuple protocol for emitted packet"}, }}, "fib-mismatch": {Desc: "Inject a packet with a mismatched FIB generation", Children: map[string]*Node{ - "destination-ip": {Desc: "Optional destination IP used for forwarding resolution"}, + "destination-ip": {Desc: "Optional destination IP used for forwarding resolution"}, + "emit-on-wire": {Desc: "Emit a resolved synthetic packet on the egress interface"}, + "source-ip": {Desc: "Source IP required when emitting on wire"}, + "source-port": {Desc: "Source tuple port or ICMP identifier"}, + 
"destination-port": {Desc: "Destination tuple port"}, + "protocol": {Desc: "Tuple protocol for emitted packet"}, }}, "metadata-parse-error": {Desc: "Inject a malformed metadata packet", Children: map[string]*Node{ "destination-ip": {Desc: "Optional destination IP used for forwarding resolution"}, diff --git a/pkg/dataplane/userspace/inject.go b/pkg/dataplane/userspace/inject.go index 2d22c5a17..35ef8c989 100644 --- a/pkg/dataplane/userspace/inject.go +++ b/pkg/dataplane/userspace/inject.go @@ -1,13 +1,16 @@ package userspace import ( + "encoding/json" "fmt" + "net/netip" "strconv" "strings" "syscall" ) -const InjectPacketUsage = "request chassis cluster data-plane userspace inject-packet slot [destination-ip ] [emit-on-wire true]" +const InjectPacketUsage = "request chassis cluster data-plane userspace inject-packet slot [destination-ip ] [emit-on-wire true source-ip [source-port ] [destination-port ] [protocol ]]" +const injectPacketTargetExtraPrefix = "xpf-inject-extra:" func ParseInjectPacketCommand(args []string) (slot uint32, mode string, extra map[string]string, err error) { if len(args) < 4 || args[0] != "inject-packet" || args[1] != "slot" { @@ -30,6 +33,32 @@ func ParseInjectPacketCommand(args []string) (slot uint32, mode string, extra ma return slot, mode, extra, nil } +func EncodeInjectPacketTarget(extra map[string]string) string { + if len(extra) == 0 { + return "" + } + raw, err := json.Marshal(extra) + if err != nil { + return extra["destination-ip"] + } + return injectPacketTargetExtraPrefix + string(raw) +} + +func DecodeInjectPacketTarget(target string) (map[string]string, error) { + extra := make(map[string]string) + if target == "" { + return extra, nil + } + if !strings.HasPrefix(target, injectPacketTargetExtraPrefix) { + extra["destination-ip"] = target + return extra, nil + } + if err := json.Unmarshal([]byte(strings.TrimPrefix(target, injectPacketTargetExtraPrefix)), &extra); err != nil { + return nil, fmt.Errorf("invalid userspace inject 
target extras: %w", err) + } + return extra, nil +} + func BuildInjectPacketRequest(slot uint32, mode string, extra map[string]string, status ProcessStatus) (InjectPacketRequest, error) { req := InjectPacketRequest{ Slot: slot, @@ -56,5 +85,162 @@ func BuildInjectPacketRequest(slot uint32, mode string, extra map[string]string, default: return InjectPacketRequest{}, fmt.Errorf("unknown inject mode %q", mode) } + if req.EmitOnWire { + if !req.MetadataValid { + return InjectPacketRequest{}, fmt.Errorf("emit-on-wire requires valid metadata") + } + if err := populateInjectPacketTuple(&req, extra, status); err != nil { + return InjectPacketRequest{}, err + } + } return req, nil } + +func validateInjectPacketRequestForHelper(req InjectPacketRequest, status ProcessStatus) error { + if !req.EmitOnWire { + return nil + } + if status.InjectPacketTupleProtocolVersion < InjectPacketTupleProtocolVersion { + return fmt.Errorf("emit-on-wire requires helper inject tuple protocol version %d (helper has %d)", + InjectPacketTupleProtocolVersion, status.InjectPacketTupleProtocolVersion) + } + if req.TupleMetadataVersion < InjectPacketTupleProtocolVersion { + return fmt.Errorf("emit-on-wire request requires tuple metadata version %d (got %d)", + InjectPacketTupleProtocolVersion, req.TupleMetadataVersion) + } + if req.SourceIP == "" { + return fmt.Errorf("emit-on-wire requires source-ip") + } + if req.DestinationIP == "" { + return fmt.Errorf("emit-on-wire requires destination-ip") + } + sourceIP, err := netip.ParseAddr(req.SourceIP) + if err != nil { + return fmt.Errorf("invalid source-ip %q: %w", req.SourceIP, err) + } + destinationIP, err := netip.ParseAddr(req.DestinationIP) + if err != nil { + return fmt.Errorf("invalid destination-ip %q: %w", req.DestinationIP, err) + } + if sourceIP.Is4() != destinationIP.Is4() { + return fmt.Errorf("emit-on-wire source-ip and destination-ip must use the same address family") + } + expectedFamily := uint8(syscall.AF_INET6) + expectedProtocol := 
uint8(58) + if sourceIP.Is4() { + expectedFamily = uint8(syscall.AF_INET) + expectedProtocol = 1 + } + if req.AddrFamily != expectedFamily { + return fmt.Errorf("emit-on-wire tuple addr_family %d does not match packet family %d", req.AddrFamily, expectedFamily) + } + if req.Protocol != expectedProtocol { + return fmt.Errorf("emit-on-wire supports only %s tuples for this address family", injectProtocolName(expectedProtocol)) + } + if req.SourcePort == nil { + return fmt.Errorf("emit-on-wire requires source-port tuple metadata") + } + if req.DestinationPort == nil { + return fmt.Errorf("emit-on-wire requires destination-port tuple metadata") + } + return nil +} + +func populateInjectPacketTuple(req *InjectPacketRequest, extra map[string]string, status ProcessStatus) error { + if status.InjectPacketTupleProtocolVersion < InjectPacketTupleProtocolVersion { + return fmt.Errorf("emit-on-wire requires helper inject tuple protocol version %d (helper has %d)", + InjectPacketTupleProtocolVersion, status.InjectPacketTupleProtocolVersion) + } + if req.DestinationIP == "" { + return fmt.Errorf("emit-on-wire requires destination-ip") + } + sourceText := extra["source-ip"] + if sourceText == "" { + return fmt.Errorf("emit-on-wire requires source-ip") + } + sourceIP, err := netip.ParseAddr(sourceText) + if err != nil { + return fmt.Errorf("invalid source-ip %q: %w", sourceText, err) + } + destinationIP, err := netip.ParseAddr(req.DestinationIP) + if err != nil { + return fmt.Errorf("invalid destination-ip %q: %w", req.DestinationIP, err) + } + if sourceIP.Is4() != destinationIP.Is4() { + return fmt.Errorf("emit-on-wire source-ip and destination-ip must use the same address family") + } + + expectedProtocol := uint8(1) + if sourceIP.Is4() { + req.AddrFamily = uint8(syscall.AF_INET) + } else { + req.AddrFamily = uint8(syscall.AF_INET6) + expectedProtocol = 58 + } + protocol := expectedProtocol + if text := extra["protocol"]; text != "" { + protocol, err = parseInjectProtocol(text) + 
if err != nil { + return err + } + } + if protocol != expectedProtocol { + return fmt.Errorf("emit-on-wire supports only %s tuples for this address family", injectProtocolName(expectedProtocol)) + } + + sourcePort := uint16(req.Slot) + if text := extra["source-port"]; text != "" { + sourcePort, err = parseInjectPort("source-port", text) + if err != nil { + return err + } + } + destinationPort := uint16(0) + if text := extra["destination-port"]; text != "" { + destinationPort, err = parseInjectPort("destination-port", text) + if err != nil { + return err + } + } + + req.TupleMetadataVersion = InjectPacketTupleProtocolVersion + req.SourceIP = sourceIP.String() + req.DestinationIP = destinationIP.String() + req.Protocol = protocol + req.SourcePort = &sourcePort + req.DestinationPort = &destinationPort + return nil +} + +func parseInjectPort(name, value string) (uint16, error) { + n, err := strconv.ParseUint(value, 10, 16) + if err != nil { + return 0, fmt.Errorf("invalid %s %q: %w", name, value, err) + } + return uint16(n), nil +} + +func parseInjectProtocol(value string) (uint8, error) { + switch strings.ToLower(value) { + case "icmp": + return 1, nil + case "icmpv6": + return 58, nil + } + n, err := strconv.ParseUint(value, 10, 8) + if err != nil { + return 0, fmt.Errorf("invalid protocol %q: %w", value, err) + } + return uint8(n), nil +} + +func injectProtocolName(protocol uint8) string { + switch protocol { + case 1: + return "icmp" + case 58: + return "icmpv6" + default: + return fmt.Sprintf("protocol %d", protocol) + } +} diff --git a/pkg/dataplane/userspace/inject_test.go b/pkg/dataplane/userspace/inject_test.go new file mode 100644 index 000000000..f12b544c3 --- /dev/null +++ b/pkg/dataplane/userspace/inject_test.go @@ -0,0 +1,184 @@ +package userspace + +import ( + "os" + "os/exec" + "strings" + "syscall" + "testing" +) + +func TestBuildInjectPacketRequestEmitOnWireCarriesTuple(t *testing.T) { + req, err := BuildInjectPacketRequest(7, "valid", 
map[string]string{ + "destination-ip": "172.16.80.200", + "emit-on-wire": "true", + "source-ip": "172.16.80.8", + "source-port": "4660", + "destination-port": "0", + "protocol": "icmp", + }, ProcessStatus{ + LastSnapshotGeneration: 11, + LastFIBGeneration: 12, + InjectPacketTupleProtocolVersion: InjectPacketTupleProtocolVersion, + }) + if err != nil { + t.Fatalf("BuildInjectPacketRequest: %v", err) + } + if req.TupleMetadataVersion != InjectPacketTupleProtocolVersion { + t.Fatalf("TupleMetadataVersion = %d, want %d", req.TupleMetadataVersion, InjectPacketTupleProtocolVersion) + } + if req.AddrFamily != uint8(syscall.AF_INET) { + t.Fatalf("AddrFamily = %d, want AF_INET", req.AddrFamily) + } + if req.Protocol != 1 { + t.Fatalf("Protocol = %d, want ICMP", req.Protocol) + } + if req.SourceIP != "172.16.80.8" || req.DestinationIP != "172.16.80.200" { + t.Fatalf("tuple IPs = %s -> %s", req.SourceIP, req.DestinationIP) + } + if req.SourcePort == nil || *req.SourcePort != 4660 { + t.Fatalf("SourcePort = %v, want 4660", req.SourcePort) + } + if req.DestinationPort == nil || *req.DestinationPort != 0 { + t.Fatalf("DestinationPort = %v, want 0", req.DestinationPort) + } +} + +func TestBuildInjectPacketRequestEmitOnWireFailsClosedWithoutHelperTupleProtocol(t *testing.T) { + _, err := BuildInjectPacketRequest(7, "valid", map[string]string{ + "destination-ip": "172.16.80.200", + "emit-on-wire": "true", + "source-ip": "172.16.80.8", + }, ProcessStatus{}) + if err == nil { + t.Fatal("BuildInjectPacketRequest succeeded without helper tuple protocol") + } + if !strings.Contains(err.Error(), "helper inject tuple protocol version") { + t.Fatalf("error = %v, want tuple protocol failure", err) + } +} + +func TestBuildInjectPacketRequestEmitOnWireRequiresSourceIP(t *testing.T) { + _, err := BuildInjectPacketRequest(7, "valid", map[string]string{ + "destination-ip": "172.16.80.200", + "emit-on-wire": "true", + }, ProcessStatus{InjectPacketTupleProtocolVersion: 
InjectPacketTupleProtocolVersion}) + if err == nil { + t.Fatal("BuildInjectPacketRequest succeeded without source-ip") + } + if !strings.Contains(err.Error(), "requires source-ip") { + t.Fatalf("error = %v, want source-ip failure", err) + } +} + +func TestBuildInjectPacketRequestEmitOnWireRejectsUnsupportedProtocol(t *testing.T) { + _, err := BuildInjectPacketRequest(7, "valid", map[string]string{ + "destination-ip": "172.16.80.200", + "emit-on-wire": "true", + "source-ip": "172.16.80.8", + "protocol": "tcp", + }, ProcessStatus{InjectPacketTupleProtocolVersion: InjectPacketTupleProtocolVersion}) + if err == nil { + t.Fatal("BuildInjectPacketRequest succeeded with tcp protocol") + } + if !strings.Contains(err.Error(), "invalid protocol") { + t.Fatalf("error = %v, want invalid protocol failure", err) + } +} + +func TestInjectPacketTargetRoundTripCarriesEmitOnWireOptions(t *testing.T) { + target := EncodeInjectPacketTarget(map[string]string{ + "destination-ip": "172.16.80.200", + "emit-on-wire": "true", + "source-ip": "172.16.80.8", + "source-port": "7", + "destination-port": "0", + "protocol": "icmp", + }) + extra, err := DecodeInjectPacketTarget(target) + if err != nil { + t.Fatalf("DecodeInjectPacketTarget: %v", err) + } + req, err := BuildInjectPacketRequest(7, "valid", extra, ProcessStatus{ + InjectPacketTupleProtocolVersion: InjectPacketTupleProtocolVersion, + }) + if err != nil { + t.Fatalf("BuildInjectPacketRequest: %v", err) + } + if !req.EmitOnWire { + t.Fatal("EmitOnWire = false after target round-trip") + } + if req.SourceIP != "172.16.80.8" || req.DestinationIP != "172.16.80.200" { + t.Fatalf("tuple IPs = %s -> %s", req.SourceIP, req.DestinationIP) + } + if req.TupleMetadataVersion != InjectPacketTupleProtocolVersion { + t.Fatalf("TupleMetadataVersion = %d, want %d", req.TupleMetadataVersion, InjectPacketTupleProtocolVersion) + } +} + +func TestInjectPacketEmitOnWireFailsClosedBeforeHelperIPCWithoutTupleProtocol(t *testing.T) { + port := uint16(7) + proc, 
err := os.FindProcess(os.Getpid()) + if err != nil { + t.Fatalf("FindProcess: %v", err) + } + m := New() + m.inner = nil + m.proc = &exec.Cmd{Process: proc} + m.cfg.ControlSocket = "/tmp/xpf-test-inject-packet-must-not-dial.sock" + m.lastStatus = ProcessStatus{} + + _, err = m.InjectPacket(InjectPacketRequest{ + Slot: 7, + PacketLength: 128, + AddrFamily: uint8(syscall.AF_INET), + Protocol: 1, + MetadataValid: true, + DestinationIP: "172.16.80.200", + EmitOnWire: true, + TupleMetadataVersion: InjectPacketTupleProtocolVersion, + SourceIP: "172.16.80.8", + SourcePort: &port, + DestinationPort: &port, + }) + if err == nil { + t.Fatal("InjectPacket succeeded without helper tuple protocol") + } + if !strings.Contains(err.Error(), "helper inject tuple protocol version") { + t.Fatalf("error = %v, want helper tuple protocol failure", err) + } +} + +func TestInjectPacketEmitOnWireRejectsLegacyRemoteRequestMetadata(t *testing.T) { + port := uint16(7) + proc, err := os.FindProcess(os.Getpid()) + if err != nil { + t.Fatalf("FindProcess: %v", err) + } + m := New() + m.inner = nil + m.proc = &exec.Cmd{Process: proc} + m.cfg.ControlSocket = "/tmp/xpf-test-inject-packet-must-not-dial.sock" + m.lastStatus = ProcessStatus{ + InjectPacketTupleProtocolVersion: InjectPacketTupleProtocolVersion, + } + + _, err = m.InjectPacket(InjectPacketRequest{ + Slot: 7, + PacketLength: 128, + AddrFamily: uint8(syscall.AF_INET), + Protocol: 1, + MetadataValid: true, + DestinationIP: "172.16.80.200", + EmitOnWire: true, + SourceIP: "172.16.80.8", + SourcePort: &port, + DestinationPort: &port, + }) + if err == nil { + t.Fatal("InjectPacket succeeded with legacy remote emit-on-wire metadata") + } + if !strings.Contains(err.Error(), "request requires tuple metadata version") { + t.Fatalf("error = %v, want request tuple metadata failure", err) + } +} diff --git a/pkg/dataplane/userspace/manager.go b/pkg/dataplane/userspace/manager.go index 86e4205e1..46e4635d8 100644 --- 
a/pkg/dataplane/userspace/manager.go +++ b/pkg/dataplane/userspace/manager.go @@ -1153,11 +1153,14 @@ func deriveUserspaceCapabilities(cfg *config.Config) UserspaceCapabilities { if !userspaceSupportsScreenProfiles(cfg) { addReason("screen features requiring SYN cookies are not implemented in the userspace dataplane") } - // Firewall filters and policers are now supported in the userspace dataplane. - // Three-color policers remain unsupported. - if len(cfg.Firewall.ThreeColorPolicers) > 0 { - addReason("three-color policers are not implemented in the userspace dataplane") - } + if !userspaceSupportsThreeColorPolicers(cfg) { + addReason("userspace three-color policers require color-blind mode and then discard") + } + // Firewall filters and legacy policers are supported in the userspace + // dataplane. Three-color policers are supported for the color-blind + // `then discard` runtime slice above; unsupported color-aware and + // non-drop actions remain fail-closed so the dataplane does not silently + // promote inherited color or ignore configured treatment. // IPsec: kernel XFRM handles ESP encryption/decryption; the userspace // dataplane passes ESP/IKE traffic to the kernel via the slow-path. 
// GRE transit is now modeled as native userspace tunnel endpoints on the @@ -1170,6 +1173,24 @@ func deriveUserspaceCapabilities(cfg *config.Config) UserspaceCapabilities { return caps } +func userspaceSupportsThreeColorPolicers(cfg *config.Config) bool { + if cfg == nil { + return true + } + for _, pol := range cfg.Firewall.ThreeColorPolicers { + if pol == nil { + continue + } + if !pol.ColorBlind { + return false + } + if pol.ThenAction != "" && pol.ThenAction != "discard" { + return false + } + } + return true +} + func userspaceSupportsSecurityPolicies(cfg *config.Config) bool { if cfg == nil { return true @@ -1587,6 +1608,9 @@ func (m *Manager) InjectPacket(req InjectPacketRequest) (ProcessStatus, error) { if m.proc == nil { return ProcessStatus{}, errors.New("userspace dataplane helper not running") } + if err := validateInjectPacketRequestForHelper(req, m.lastStatus); err != nil { + return ProcessStatus{}, err + } var status ProcessStatus if err := m.requestLocked(ControlRequest{Type: "inject_packet", Packet: &req}, &status); err != nil { return ProcessStatus{}, err diff --git a/pkg/dataplane/userspace/manager_test.go b/pkg/dataplane/userspace/manager_test.go index b6092daa2..d5013e6fc 100644 --- a/pkg/dataplane/userspace/manager_test.go +++ b/pkg/dataplane/userspace/manager_test.go @@ -11,6 +11,7 @@ import ( "os/exec" "path/filepath" "reflect" + "slices" "sort" "strings" "testing" @@ -2212,8 +2213,8 @@ func TestDeriveUserspaceCapabilitiesDetectsFirewallFeatures(t *testing.T) { cfg.Security.Zones = map[string]*config.ZoneConfig{"trust": {Name: "trust"}} cfg.Security.NAT.Source = []*config.NATRuleSet{{Name: "src"}} cfg.Security.Flow.AllowDNSReply = true - // Firewall filters (inet/inet6) and single-rate policers are now supported. - // Only three-color policers remain unsupported. + // Firewall filters (inet/inet6), single-rate policers, and three-color + // policers are now supported. 
cfg.Firewall.FiltersInet = map[string]*config.FirewallFilter{"f1": {Name: "f1"}} cfg.Services.FlowMonitoring = &config.FlowMonitoringConfig{} @@ -2223,28 +2224,57 @@ func TestDeriveUserspaceCapabilitiesDetectsFirewallFeatures(t *testing.T) { } } -func TestDeriveUserspaceCapabilitiesGatesThreeColorPolicers(t *testing.T) { +func TestDeriveUserspaceCapabilitiesAdmitsThreeColorPolicers(t *testing.T) { cfg := &config.Config{} cfg.Firewall.ThreeColorPolicers = map[string]*config.ThreeColorPolicerConfig{ - "tcp1": {Name: "tcp1", CIR: 1000000, CBS: 50000}, + "tcp1": {Name: "tcp1", ColorBlind: true, CIR: 1000000, CBS: 50000, ThenAction: "discard"}, } caps := deriveUserspaceCapabilities(cfg) - if caps.ForwardingSupported { - t.Fatal("ForwardingSupported = true, want false for three-color policers") + if !caps.ForwardingSupported { + t.Fatalf("ForwardingSupported = false, want true for three-color policers. Reasons: %+v", caps.UnsupportedReasons) } - found := false - for _, r := range caps.UnsupportedReasons { - if r == "three-color policers are not implemented in the userspace dataplane" { - found = true - } +} + +func TestDeriveUserspaceCapabilitiesRejectsUnsupportedThreeColorPolicerActions(t *testing.T) { + tests := []struct { + name string + pol *config.ThreeColorPolicerConfig + }{ + { + name: "color-aware", + pol: &config.ThreeColorPolicerConfig{Name: "aware", CIR: 1000000, CBS: 50000, PBS: 50000, ThenAction: "discard"}, + }, + { + name: "loss-priority", + pol: &config.ThreeColorPolicerConfig{ + Name: "loss", + ColorBlind: true, + CIR: 1000000, + CBS: 50000, + PBS: 50000, + ThenAction: "loss-priority high", + }, + }, } - if !found { - t.Fatalf("expected three-color policer unsupported reason, got: %+v", caps.UnsupportedReasons) + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + cfg := &config.Config{} + cfg.Firewall.ThreeColorPolicers = map[string]*config.ThreeColorPolicerConfig{ + tt.pol.Name: tt.pol, + } + caps := deriveUserspaceCapabilities(cfg) + if 
caps.ForwardingSupported { + t.Fatal("ForwardingSupported = true, want fail-closed for unsupported three-color policer mode/action") + } + if !slices.Contains(caps.UnsupportedReasons, "userspace three-color policers require color-blind mode and then discard") { + t.Fatalf("UnsupportedReasons = %+v, want three-color reason", caps.UnsupportedReasons) + } + }) } } -func TestBuildSnapshotIncludesThreeColorPolicerSchemaWhileGateClosed(t *testing.T) { +func TestBuildSnapshotIncludesThreeColorPolicerSchema(t *testing.T) { cfg := &config.Config{} cfg.Firewall.ThreeColorPolicers = map[string]*config.ThreeColorPolicerConfig{ "sr": { @@ -2258,17 +2288,18 @@ func TestBuildSnapshotIncludesThreeColorPolicerSchemaWhileGateClosed(t *testing. "tr": { Name: "tr", TwoRate: true, + ColorBlind: true, CIR: 125000, CBS: 50000, PIR: 250000, PBS: 100000, - ThenAction: "loss-priority high", + ThenAction: "discard", }, } snap := buildSnapshot(cfg, config.UserspaceConfig{}, 1, 0) - if snap.Capabilities.ForwardingSupported { - t.Fatal("ForwardingSupported = true, want three-color policers to remain fail-closed") + if !snap.Capabilities.ForwardingSupported { + t.Fatalf("ForwardingSupported = false, want three-color policers admitted. Reasons: %+v", snap.Capabilities.UnsupportedReasons) } want := []ThreeColorPolicerSnapshot{ { @@ -2283,11 +2314,12 @@ func TestBuildSnapshotIncludesThreeColorPolicerSchemaWhileGateClosed(t *testing. 
{ Name: "tr", Mode: "two-rate", + ColorBlind: true, CommittedRateBytes: 125000, CommittedBurstBytes: 50000, PeakOrExcessRateBytes: 250000, PeakOrExcessBurstBytes: 100000, - ThenAction: "loss-priority high", + ThenAction: "discard", }, } if !reflect.DeepEqual(snap.ThreeColorPolicers, want) { diff --git a/pkg/dataplane/userspace/protocol.go b/pkg/dataplane/userspace/protocol.go index 64a4fbeea..67301236d 100644 --- a/pkg/dataplane/userspace/protocol.go +++ b/pkg/dataplane/userspace/protocol.go @@ -7,8 +7,9 @@ import ( ) const ( - ProtocolVersion = 2 - TypeUserspace = "userspace" + ProtocolVersion = 2 + InjectPacketTupleProtocolVersion = 1 + TypeUserspace = "userspace" ) type ControlRequest struct { @@ -441,29 +442,30 @@ type UserspaceCapabilities struct { } type ProcessStatus struct { - PID int `json:"pid"` - ConfigSnapshotProtocolVersion int `json:"config_snapshot_protocol_version,omitempty"` - StartedAt time.Time `json:"started_at"` - ControlSocket string `json:"control_socket"` - StateFile string `json:"state_file"` - Workers int `json:"workers"` - RingEntries int `json:"ring_entries"` - HelperMode string `json:"helper_mode"` - IOUringPlanned bool `json:"io_uring_planned"` - IOUringActive bool `json:"io_uring_active,omitempty"` - IOUringMode string `json:"io_uring_mode,omitempty"` - IOUringLastError string `json:"io_uring_last_error,omitempty"` - Enabled bool `json:"enabled"` - ForwardingArmed bool `json:"forwarding_armed,omitempty"` - Capabilities UserspaceCapabilities `json:"capabilities"` - LastSnapshotGeneration uint64 `json:"last_snapshot_generation"` - LastFIBGeneration uint32 `json:"last_fib_generation,omitempty"` - LastSnapshotAt time.Time `json:"last_snapshot_at,omitempty"` - InterfaceAddresses int `json:"interface_addresses,omitempty"` - NeighborEntries int `json:"neighbor_entries,omitempty"` - NeighborGeneration uint64 `json:"neighbor_generation,omitempty"` - RouteEntries int `json:"route_entries,omitempty"` - WorkerHeartbeats []time.Time 
`json:"worker_heartbeats,omitempty"` + PID int `json:"pid"` + ConfigSnapshotProtocolVersion int `json:"config_snapshot_protocol_version,omitempty"` + InjectPacketTupleProtocolVersion int `json:"inject_packet_tuple_protocol_version,omitempty"` + StartedAt time.Time `json:"started_at"` + ControlSocket string `json:"control_socket"` + StateFile string `json:"state_file"` + Workers int `json:"workers"` + RingEntries int `json:"ring_entries"` + HelperMode string `json:"helper_mode"` + IOUringPlanned bool `json:"io_uring_planned"` + IOUringActive bool `json:"io_uring_active,omitempty"` + IOUringMode string `json:"io_uring_mode,omitempty"` + IOUringLastError string `json:"io_uring_last_error,omitempty"` + Enabled bool `json:"enabled"` + ForwardingArmed bool `json:"forwarding_armed,omitempty"` + Capabilities UserspaceCapabilities `json:"capabilities"` + LastSnapshotGeneration uint64 `json:"last_snapshot_generation"` + LastFIBGeneration uint32 `json:"last_fib_generation,omitempty"` + LastSnapshotAt time.Time `json:"last_snapshot_at,omitempty"` + InterfaceAddresses int `json:"interface_addresses,omitempty"` + NeighborEntries int `json:"neighbor_entries,omitempty"` + NeighborGeneration uint64 `json:"neighbor_generation,omitempty"` + RouteEntries int `json:"route_entries,omitempty"` + WorkerHeartbeats []time.Time `json:"worker_heartbeats,omitempty"` // #869: per-worker busy/idle runtime telemetry. 
WorkerRuntime []WorkerRuntimeStatus `json:"worker_runtime,omitempty"` HAGroups []HAGroupStatus `json:"ha_groups,omitempty"` @@ -491,6 +493,7 @@ type ProcessStatus struct { CoSInterfaces []CoSInterfaceStatus `json:"cos_interfaces,omitempty"` PolicyRuleCounters []PolicyRuleCounterStatus `json:"policy_rule_counters,omitempty"` FilterTermCounters []FirewallFilterTermCounterStatus `json:"filter_term_counters,omitempty"` + ThreeColorPolicerCounters []ThreeColorPolicerStatus `json:"three_color_policer_counters,omitempty"` LastResolution *PacketResolution `json:"last_resolution,omitempty"` SlowPath SlowPathStatus `json:"slow_path,omitempty"` LastCacheFlushAt uint64 `json:"last_cache_flush_at,omitempty"` // monotonic secs (#312) @@ -528,6 +531,21 @@ type CoSInterfaceStatus struct { Queues []CoSQueueStatus `json:"queues,omitempty"` } +type ThreeColorPolicerStatus struct { + ID uint32 `json:"id,omitempty"` + Name string `json:"name,omitempty"` + Mode string `json:"mode,omitempty"` + ColorBlind bool `json:"color_blind,omitempty"` + GreenPackets uint64 `json:"green_packets,omitempty"` + GreenBytes uint64 `json:"green_bytes,omitempty"` + YellowPackets uint64 `json:"yellow_packets,omitempty"` + YellowBytes uint64 `json:"yellow_bytes,omitempty"` + RedPackets uint64 `json:"red_packets,omitempty"` + RedBytes uint64 `json:"red_bytes,omitempty"` + DropPackets uint64 `json:"drop_packets,omitempty"` + DropBytes uint64 `json:"drop_bytes,omitempty"` +} + type CoSQueueStatus struct { QueueID int `json:"queue_id,omitempty"` OwnerWorkerID *uint32 `json:"owner_worker_id,omitempty"` @@ -1037,15 +1055,19 @@ type ExceptionStatus struct { } type InjectPacketRequest struct { - Slot uint32 `json:"slot"` - PacketLength uint32 `json:"packet_length,omitempty"` - AddrFamily uint8 `json:"addr_family,omitempty"` - Protocol uint8 `json:"protocol,omitempty"` - ConfigGeneration uint64 `json:"config_generation,omitempty"` - FIBGeneration uint32 `json:"fib_generation,omitempty"` - MetadataValid bool 
`json:"metadata_valid"` - DestinationIP string `json:"destination_ip,omitempty"` - EmitOnWire bool `json:"emit_on_wire,omitempty"` + Slot uint32 `json:"slot"` + PacketLength uint32 `json:"packet_length,omitempty"` + AddrFamily uint8 `json:"addr_family,omitempty"` + Protocol uint8 `json:"protocol,omitempty"` + ConfigGeneration uint64 `json:"config_generation,omitempty"` + FIBGeneration uint32 `json:"fib_generation,omitempty"` + MetadataValid bool `json:"metadata_valid"` + DestinationIP string `json:"destination_ip,omitempty"` + EmitOnWire bool `json:"emit_on_wire,omitempty"` + TupleMetadataVersion int `json:"tuple_metadata_version,omitempty"` + SourceIP string `json:"source_ip,omitempty"` + SourcePort *uint16 `json:"source_port,omitempty"` + DestinationPort *uint16 `json:"destination_port,omitempty"` } type SessionDeltaDrainRequest struct { diff --git a/pkg/dataplane/userspace/protocol_test.go b/pkg/dataplane/userspace/protocol_test.go index f2b45c0ae..aaa720913 100644 --- a/pkg/dataplane/userspace/protocol_test.go +++ b/pkg/dataplane/userspace/protocol_test.go @@ -316,6 +316,118 @@ func TestConfigSnapshotThreeColorPolicersRoundTrip(t *testing.T) { } } +func TestProcessStatusThreeColorPolicerCountersRoundTrip(t *testing.T) { + in := ProcessStatus{ + ThreeColorPolicerCounters: []ThreeColorPolicerStatus{ + { + ID: 1, + Name: "wan-egress", + Mode: "single-rate", + ColorBlind: true, + GreenPackets: 10, + GreenBytes: 1000, + YellowPackets: 3, + YellowBytes: 300, + RedPackets: 2, + RedBytes: 200, + DropPackets: 2, + DropBytes: 200, + }, + }, + } + + raw, err := json.Marshal(&in) + if err != nil { + t.Fatalf("marshal: %v", err) + } + var obj map[string]json.RawMessage + if err := json.Unmarshal(raw, &obj); err != nil { + t.Fatalf("unmarshal obj: %v", err) + } + if _, ok := obj["three_color_policer_counters"]; !ok { + t.Fatalf("wire key missing from ProcessStatus JSON: %s", string(raw)) + } + + var back ProcessStatus + if err := json.Unmarshal(raw, &back); err != nil { + 
t.Fatalf("unmarshal ProcessStatus: %v", err) + } + if !reflect.DeepEqual(back.ThreeColorPolicerCounters, in.ThreeColorPolicerCounters) { + t.Fatalf("ThreeColorPolicerCounters = %+v, want %+v", + back.ThreeColorPolicerCounters, in.ThreeColorPolicerCounters) + } +} + +func TestProcessStatusInjectPacketTupleVersionRoundTrip(t *testing.T) { + in := ProcessStatus{ + InjectPacketTupleProtocolVersion: InjectPacketTupleProtocolVersion, + } + raw, err := json.Marshal(&in) + if err != nil { + t.Fatalf("marshal: %v", err) + } + var obj map[string]json.RawMessage + if err := json.Unmarshal(raw, &obj); err != nil { + t.Fatalf("unmarshal obj: %v", err) + } + if _, ok := obj["inject_packet_tuple_protocol_version"]; !ok { + t.Fatalf("wire key missing from ProcessStatus JSON: %s", string(raw)) + } + var back ProcessStatus + if err := json.Unmarshal(raw, &back); err != nil { + t.Fatalf("unmarshal ProcessStatus: %v", err) + } + if back.InjectPacketTupleProtocolVersion != InjectPacketTupleProtocolVersion { + t.Fatalf("InjectPacketTupleProtocolVersion = %d, want %d", + back.InjectPacketTupleProtocolVersion, InjectPacketTupleProtocolVersion) + } +} + +func TestInjectPacketRequestTupleMetadataRoundTrip(t *testing.T) { + sourcePort := uint16(4660) + destinationPort := uint16(0) + in := InjectPacketRequest{ + Slot: 7, + PacketLength: 128, + AddrFamily: 2, + Protocol: 1, + ConfigGeneration: 11, + FIBGeneration: 12, + MetadataValid: true, + DestinationIP: "172.16.80.200", + EmitOnWire: true, + TupleMetadataVersion: InjectPacketTupleProtocolVersion, + SourceIP: "172.16.80.8", + SourcePort: &sourcePort, + DestinationPort: &destinationPort, + } + raw, err := json.Marshal(&in) + if err != nil { + t.Fatalf("marshal: %v", err) + } + var obj map[string]json.RawMessage + if err := json.Unmarshal(raw, &obj); err != nil { + t.Fatalf("unmarshal obj: %v", err) + } + for _, key := range []string{ + "tuple_metadata_version", + "source_ip", + "source_port", + "destination_port", + } { + if _, ok := 
obj[key]; !ok { + t.Fatalf("wire key %q missing from InjectPacketRequest JSON: %s", key, string(raw)) + } + } + var back InjectPacketRequest + if err := json.Unmarshal(raw, &back); err != nil { + t.Fatalf("unmarshal InjectPacketRequest: %v", err) + } + if !reflect.DeepEqual(back, in) { + t.Fatalf("round-trip mismatch: got %+v, want %+v", back, in) + } +} + func TestCoSQueueStatusDrainPhaseCountersRoundTrip(t *testing.T) { in := CoSQueueStatus{ QueueID: 0, diff --git a/pkg/dataplane/userspace/statusfmt.go b/pkg/dataplane/userspace/statusfmt.go index 14729c92b..b72efb781 100644 --- a/pkg/dataplane/userspace/statusfmt.go +++ b/pkg/dataplane/userspace/statusfmt.go @@ -330,6 +330,34 @@ func FormatStatusSummary(status ProcessStatus) string { fmt.Fprintf(&b, " CoS buffer drops: %d\n", currentRuntimeCoSBufferDrops) fmt.Fprintf(&b, " CoS ECN marked: %d\n", cosAdmissionEcnMarked) } + if len(status.ThreeColorPolicerCounters) > 0 { + rows := append([]ThreeColorPolicerStatus(nil), status.ThreeColorPolicerCounters...) 
+ sort.Slice(rows, func(i, j int) bool { + if rows[i].ID != rows[j].ID { + return rows[i].ID < rows[j].ID + } + return rows[i].Name < rows[j].Name + }) + fmt.Fprintln(&b, "Three-color policers:") + fmt.Fprintf(&b, " %-5s %-16s %-11s %-6s %-9s %-9s %-9s %-9s %-10s %-10s %-10s %-10s\n", + "ID", "Name", "Mode", "Blind", "GreenPkts", "YellowPkts", "RedPkts", "DropPkts", "GreenB", "YellowB", "RedB", "DropB") + for _, row := range rows { + fmt.Fprintf(&b, " %-5d %-16s %-11s %-6t %-9d %-9d %-9d %-9d %-10d %-10d %-10d %-10d\n", + row.ID, + row.Name, + row.Mode, + row.ColorBlind, + row.GreenPackets, + row.YellowPackets, + row.RedPackets, + row.DropPackets, + row.GreenBytes, + row.YellowBytes, + row.RedBytes, + row.DropBytes, + ) + } + } fmt.Fprintf(&b, " TX shared recycle unk: %d\n", txSharedRecycleUnknownSlotDrops) fmt.Fprintf(&b, " TX completions: %d\n", txCompletions) fmt.Fprintf(&b, " Mirrored packets: %d\n", mirroredPackets) diff --git a/pkg/dataplane/userspace/statusfmt_test.go b/pkg/dataplane/userspace/statusfmt_test.go index 21c8859a3..96c39dac9 100644 --- a/pkg/dataplane/userspace/statusfmt_test.go +++ b/pkg/dataplane/userspace/statusfmt_test.go @@ -182,6 +182,44 @@ func TestFormatStatusSummaryWorkerRuntimeRolling60sColumn(t *testing.T) { } } +func TestFormatStatusSummaryIncludesThreeColorPolicerCounters(t *testing.T) { + status := ProcessStatus{ + ThreeColorPolicerCounters: []ThreeColorPolicerStatus{ + { + ID: 2, + Name: "wan-egress", + Mode: "single-rate", + ColorBlind: true, + GreenPackets: 10, + GreenBytes: 1000, + YellowPackets: 3, + YellowBytes: 300, + RedPackets: 2, + RedBytes: 200, + DropPackets: 2, + DropBytes: 200, + }, + }, + } + + out := FormatStatusSummary(status) + for _, want := range []string{ + "Three-color policers:", + "GreenPkts", + "wan-egress", + "single-rate", + "true", + "10", + "3", + "2", + "1000", + } { + if !strings.Contains(out, want) { + t.Fatalf("summary missing three-color policer field %q:\n%s", want, out) + } + } +} + func 
TestFormatStatusSummaryReportsStandbyArmedRole(t *testing.T) { status := ProcessStatus{ ForwardingArmed: true, diff --git a/pkg/grpcapi/server_diag.go b/pkg/grpcapi/server_diag.go index f2abe7291..f7e232873 100644 --- a/pkg/grpcapi/server_diag.go +++ b/pkg/grpcapi/server_diag.go @@ -845,9 +845,9 @@ func (s *Server) SystemAction(ctx context.Context, req *pb.SystemActionRequest) if err != nil { return nil, status.Errorf(codes.Unavailable, "userspace status: %v", err) } - extra := map[string]string{} - if req.Target != "" { - extra["destination-ip"] = req.Target + extra, err := dpuserspace.DecodeInjectPacketTarget(req.Target) + if err != nil { + return nil, status.Error(codes.InvalidArgument, err.Error()) } injectReq, err := dpuserspace.BuildInjectPacketRequest(uint32(slot), mode, extra, statusNow) if err != nil { diff --git a/pkg/grpcapi/system_action_test.go b/pkg/grpcapi/system_action_test.go index 55c27c1df..45f2e394e 100644 --- a/pkg/grpcapi/system_action_test.go +++ b/pkg/grpcapi/system_action_test.go @@ -2,15 +2,45 @@ package grpcapi import ( "context" + "syscall" "testing" "github.com/psaab/xpf/pkg/cluster" + "github.com/psaab/xpf/pkg/dataplane" + dpuserspace "github.com/psaab/xpf/pkg/dataplane/userspace" pb "github.com/psaab/xpf/pkg/grpcapi/xpfv1" "google.golang.org/grpc/codes" "google.golang.org/grpc/metadata" "google.golang.org/grpc/status" ) +type fakeUserspaceInjectControl struct { + dataplane.DataPlane + status dpuserspace.ProcessStatus + got dpuserspace.InjectPacketRequest +} + +func (f *fakeUserspaceInjectControl) Status() (dpuserspace.ProcessStatus, error) { + return f.status, nil +} + +func (f *fakeUserspaceInjectControl) SetForwardingArmed(bool) (dpuserspace.ProcessStatus, error) { + return dpuserspace.ProcessStatus{}, nil +} + +func (f *fakeUserspaceInjectControl) SetQueueState(uint32, bool, bool) (dpuserspace.ProcessStatus, error) { + return dpuserspace.ProcessStatus{}, nil +} + +func (f *fakeUserspaceInjectControl) SetBindingState(uint32, bool, 
bool) (dpuserspace.ProcessStatus, error) { + return dpuserspace.ProcessStatus{}, nil +} + +func (f *fakeUserspaceInjectControl) InjectPacket(req dpuserspace.InjectPacketRequest) (dpuserspace.ProcessStatus, error) { + f.got = req + return dpuserspace.ProcessStatus{PID: 1234}, nil +} + func TestSystemActionClusterFailoverProxiesPeerTarget(t *testing.T) { s := NewServer("", Config{Cluster: cluster.NewManager(0, 1)}) @@ -93,3 +123,48 @@ func TestSystemActionClusterFailoverDataRejectsUnsupportedTargetNode(t *testing. t.Fatalf("status code = %s, want %s (err=%v)", status.Code(err), codes.InvalidArgument, err) } } + +func TestSystemActionUserspaceInjectDecodesEmitOnWireTargetExtras(t *testing.T) { + dp := &fakeUserspaceInjectControl{ + status: dpuserspace.ProcessStatus{ + InjectPacketTupleProtocolVersion: dpuserspace.InjectPacketTupleProtocolVersion, + LastSnapshotGeneration: 11, + LastFIBGeneration: 12, + }, + } + s := NewServer("", Config{DP: dp}) + target := dpuserspace.EncodeInjectPacketTarget(map[string]string{ + "destination-ip": "172.16.80.200", + "emit-on-wire": "true", + "source-ip": "172.16.80.8", + "source-port": "7", + "destination-port": "0", + "protocol": "icmp", + }) + + _, err := s.SystemAction(context.Background(), &pb.SystemActionRequest{ + Action: "userspace-inject:7:valid", + Target: target, + }) + if err != nil { + t.Fatalf("SystemAction() error = %v", err) + } + if !dp.got.EmitOnWire { + t.Fatal("remote userspace inject request lost emit-on-wire") + } + if dp.got.TupleMetadataVersion != dpuserspace.InjectPacketTupleProtocolVersion { + t.Fatalf("TupleMetadataVersion = %d, want %d", dp.got.TupleMetadataVersion, dpuserspace.InjectPacketTupleProtocolVersion) + } + if dp.got.AddrFamily != uint8(syscall.AF_INET) || dp.got.Protocol != 1 { + t.Fatalf("family/protocol = %d/%d, want AF_INET/ICMP", dp.got.AddrFamily, dp.got.Protocol) + } + if dp.got.SourceIP != "172.16.80.8" || dp.got.DestinationIP != "172.16.80.200" { + t.Fatalf("tuple IPs = %s -> %s", 
dp.got.SourceIP, dp.got.DestinationIP) + } + if dp.got.SourcePort == nil || *dp.got.SourcePort != 7 { + t.Fatalf("SourcePort = %v, want 7", dp.got.SourcePort) + } + if dp.got.DestinationPort == nil || *dp.got.DestinationPort != 0 { + t.Fatalf("DestinationPort = %v, want 0", dp.got.DestinationPort) + } +} diff --git a/userspace-dp/src/afxdp/coordinator/inject.rs b/userspace-dp/src/afxdp/coordinator/inject.rs index e2c21efd9..63ca77b70 100644 --- a/userspace-dp/src/afxdp/coordinator/inject.rs +++ b/userspace-dp/src/afxdp/coordinator/inject.rs @@ -1,4 +1,103 @@ use super::*; +use crate::INJECT_PACKET_TUPLE_PROTOCOL_VERSION; + +#[derive(Clone, Copy, Debug, PartialEq, Eq)] +pub(super) struct InjectedPacketTuple { + pub source_ip: IpAddr, + pub destination_ip: IpAddr, + pub source_port: u16, + pub destination_port: u16, + pub addr_family: u8, + pub protocol: u8, +} + +pub(super) fn validate_injected_packet_tuple( + req: &InjectPacketRequest, + dst: IpAddr, +) -> Result<InjectedPacketTuple, String> { + if req.tuple_metadata_version != INJECT_PACKET_TUPLE_PROTOCOL_VERSION { + return Err(format!( + "emit-on-wire requires tuple metadata version {} (got {})", + INJECT_PACKET_TUPLE_PROTOCOL_VERSION, req.tuple_metadata_version + )); + } + let source_ip = req + .source_ip + .parse::<IpAddr>() + .map_err(|e| format!("invalid injected source_ip {}: {e}", req.source_ip))?; + let source_port = req + .source_port + .ok_or_else(|| "emit-on-wire requires source_port tuple metadata".to_string())?; + let destination_port = req + .destination_port + .ok_or_else(|| "emit-on-wire requires destination_port tuple metadata".to_string())?; + let (addr_family, protocol) = match (source_ip, dst) { + (IpAddr::V4(_), IpAddr::V4(_)) => (libc::AF_INET as u8, PROTO_ICMP), + (IpAddr::V6(_), IpAddr::V6(_)) => (libc::AF_INET6 as u8, PROTO_ICMPV6), + _ => { + return Err( + "emit-on-wire source_ip and destination_ip must use the same address family" + .to_string(), + ); + } + }; + if req.addr_family != addr_family { + return Err(format!( +
"emit-on-wire tuple addr_family {} does not match packet family {}", + req.addr_family, addr_family + )); + } + if req.protocol != protocol { + return Err(format!( + "emit-on-wire supports only protocol {} for this address family (got {})", + protocol, req.protocol + )); + } + + Ok(InjectedPacketTuple { + source_ip, + destination_ip: dst, + source_port, + destination_port, + addr_family, + protocol, + }) +} + +pub(super) fn stamp_injected_packet_tuple( + meta: &mut UserspaceDpMeta, + frame_len: usize, + tuple: InjectedPacketTuple, + egress: &EgressInterface, +) -> Result<(), String> { + meta.pkt_len = frame_len.min(u16::MAX as usize) as u16; + let l3_offset = if egress.vlan_id > 0 { 18 } else { 14 }; + meta.l3_offset = l3_offset; + meta.flow_src_addr = [0; 16]; + meta.flow_dst_addr = [0; 16]; + meta.flow_src_port = tuple.source_port; + meta.flow_dst_port = tuple.destination_port; + meta.addr_family = tuple.addr_family; + meta.protocol = tuple.protocol; + + match (tuple.source_ip, tuple.destination_ip) { + (IpAddr::V4(src_v4), IpAddr::V4(dst_v4)) => { + meta.l4_offset = l3_offset + 20; + meta.payload_offset = meta.l4_offset + 8; + meta.flow_src_addr[..4].copy_from_slice(&src_v4.octets()); + meta.flow_dst_addr[..4].copy_from_slice(&dst_v4.octets()); + } + (IpAddr::V6(src_v6), IpAddr::V6(dst_v6)) => { + meta.l4_offset = l3_offset + 40; + meta.payload_offset = meta.l4_offset + 8; + meta.flow_src_addr.copy_from_slice(&src_v6.octets()); + meta.flow_dst_addr.copy_from_slice(&dst_v6.octets()); + } + _ => return Err("injected tuple address family mismatch".to_string()), + } + + Ok(()) +} /// `request inject-packet` RPC handler. 
Builds a synthetic packet /// against the live ForwardingState/HA snapshot, runs it through the @@ -91,7 +190,8 @@ impl super::Coordinator { && candidate.queue_id == ident.queue_id }) .or_else(|| { - self.workers.identities + self.workers + .identities .values() .find(|candidate| candidate.ifindex == egress.bind_ifindex) }) @@ -105,19 +205,36 @@ impl super::Coordinator { let target_live = self.workers.live.get(&target_slot).ok_or_else(|| { format!("binding slot {} has no live state", target_slot) })?; - let frame = build_injected_packet(&req, dst, resolution, egress)?; - let cos = resolve_cos_tx_selection( + let tuple = validate_injected_packet_tuple(&req, dst)?; + let frame = build_injected_packet( + &req, + tuple.source_ip, + tuple.destination_ip, + tuple.source_port, + resolution, + egress, + )?; + let mut tx_meta = meta; + stamp_injected_packet_tuple(&mut tx_meta, frame.len(), tuple, egress)?; + let now_ns = monotonic_nanos(); + let cos_flow = parse_session_flow_from_meta(tx_meta); + let cos = resolve_cos_tx_selection_at( &self.forwarding, resolution.egress_ifindex, - meta, - None, + tx_meta, + cos_flow.as_ref().map(|flow| &flow.forward_key), + now_ns, ); + if cos.drop { + return Ok(()); + } + let flow_key = cos_flow.map(|flow| flow.forward_key); target_live.enqueue_tx(TxRequest { bytes: frame, expected_ports: None, - expected_addr_family: 0, - expected_protocol: 0, - flow_key: None, + expected_addr_family: tx_meta.addr_family, + expected_protocol: tx_meta.protocol, + flow_key, egress_ifindex: resolution.egress_ifindex, cos_queue_id: cos.queue_id, dscp_rewrite: cos.dscp_rewrite, diff --git a/userspace-dp/src/afxdp/coordinator/status.rs b/userspace-dp/src/afxdp/coordinator/status.rs index 17c8a93a4..fccae3457 100644 --- a/userspace-dp/src/afxdp/coordinator/status.rs +++ b/userspace-dp/src/afxdp/coordinator/status.rs @@ -101,6 +101,10 @@ impl super::Coordinator { out } + pub fn three_color_policer_counters(&self) -> Vec { + 
self.forwarding.filter_state.three_color_policer_statuses() + } + pub fn flow_worker_map(&self) -> (Vec, bool) { const FLOW_WORKER_MAP_MAX_ROWS: usize = 4096; let mut out = Vec::new(); diff --git a/userspace-dp/src/afxdp/coordinator/tests.rs b/userspace-dp/src/afxdp/coordinator/tests.rs index 3df434be9..3ff242289 100644 --- a/userspace-dp/src/afxdp/coordinator/tests.rs +++ b/userspace-dp/src/afxdp/coordinator/tests.rs @@ -4,12 +4,182 @@ // `#[path = "tests.rs"]` from coordinator/mod.rs. use super::*; +use crate::INJECT_PACKET_TUPLE_PROTOCOL_VERSION; use crate::test_zone_ids::*; use crate::{ ClassOfServiceSnapshot, CoSForwardingClassSnapshot, CoSSchedulerMapEntrySnapshot, CoSSchedulerMapSnapshot, }; +#[test] +fn stamp_injected_packet_tuple_builds_ipv4_icmp_flow_key() { + let mut meta = UserspaceDpMeta { + addr_family: libc::AF_INET as u8, + protocol: 0, + ..UserspaceDpMeta::default() + }; + let dst = IpAddr::V4(Ipv4Addr::new(172, 16, 80, 200)); + let req = InjectPacketRequest { + addr_family: libc::AF_INET as u8, + protocol: PROTO_ICMP, + destination_ip: "172.16.80.200".into(), + tuple_metadata_version: INJECT_PACKET_TUPLE_PROTOCOL_VERSION, + source_ip: "172.16.80.8".into(), + source_port: Some(0x1234), + destination_port: Some(0), + ..Default::default() + }; + let tuple = inject::validate_injected_packet_tuple(&req, dst).expect("validate tuple"); + let egress = EgressInterface { + bind_ifindex: 12, + vlan_id: 0, + mtu: 1500, + src_mac: [0; 6], + zone_id: TEST_WAN_ZONE_ID, + redundancy_group: 0, + primary_v4: Some(Ipv4Addr::new(172, 16, 80, 8)), + primary_v6: None, + }; + + inject::stamp_injected_packet_tuple(&mut meta, 98, tuple, &egress).expect("stamp tuple"); + + let flow = parse_session_flow_from_meta(meta).expect("metadata flow"); + assert_eq!(meta.protocol, PROTO_ICMP); + assert_eq!(meta.l3_offset, 14); + assert_eq!(meta.l4_offset, 34); + assert_eq!(meta.payload_offset, 42); + assert_eq!( + flow.forward_key.src_ip, + IpAddr::V4(Ipv4Addr::new(172, 16, 80, 8)) + 
); + assert_eq!( + flow.forward_key.dst_ip, + IpAddr::V4(Ipv4Addr::new(172, 16, 80, 200)) + ); + assert_eq!(flow.forward_key.src_port, 0x1234); + assert_eq!(flow.forward_key.dst_port, 0); +} + +#[test] +fn stamp_injected_packet_tuple_builds_ipv6_icmp_flow_key() { + let mut meta = UserspaceDpMeta { + addr_family: libc::AF_INET6 as u8, + protocol: 0, + ..UserspaceDpMeta::default() + }; + let src = "2001:db8:80::8".parse::<Ipv6Addr>().unwrap(); + let dst = "2001:db8:80::200".parse::<Ipv6Addr>().unwrap(); + let req = InjectPacketRequest { + addr_family: libc::AF_INET6 as u8, + protocol: PROTO_ICMPV6, + destination_ip: "2001:db8:80::200".into(), + tuple_metadata_version: INJECT_PACKET_TUPLE_PROTOCOL_VERSION, + source_ip: "2001:db8:80::8".into(), + source_port: Some(0x4321), + destination_port: Some(0), + ..Default::default() + }; + let tuple = + inject::validate_injected_packet_tuple(&req, IpAddr::V6(dst)).expect("validate tuple"); + let egress = EgressInterface { + bind_ifindex: 12, + vlan_id: 80, + mtu: 1500, + src_mac: [0; 6], + zone_id: TEST_WAN_ZONE_ID, + redundancy_group: 0, + primary_v4: None, + primary_v6: Some(src), + }; + + inject::stamp_injected_packet_tuple(&mut meta, 118, tuple, &egress).expect("stamp tuple"); + + let flow = parse_session_flow_from_meta(meta).expect("metadata flow"); + assert_eq!(meta.protocol, PROTO_ICMPV6); + assert_eq!(meta.l3_offset, 18); + assert_eq!(meta.l4_offset, 58); + assert_eq!(meta.payload_offset, 66); + assert_eq!(flow.forward_key.src_ip, IpAddr::V6(src)); + assert_eq!(flow.forward_key.dst_ip, IpAddr::V6(dst)); + assert_eq!(flow.forward_key.src_port, 0x4321); + assert_eq!(flow.forward_key.dst_port, 0); +} + +#[test] +fn validate_injected_packet_tuple_rejects_legacy_wire_request() { + let req = InjectPacketRequest { + addr_family: libc::AF_INET as u8, + protocol: PROTO_ICMP, + destination_ip: "172.16.80.200".into(), + source_ip: "172.16.80.8".into(), + source_port: Some(0x1234), + destination_port: Some(0), + ..Default::default() + }; + + let err = 
+ inject::validate_injected_packet_tuple(&req, IpAddr::V4(Ipv4Addr::new(172, 16, 80, 200))) + .expect_err("legacy request must fail closed"); + assert!(err.contains("tuple metadata version"), "{err}"); +} + +#[test] +fn validate_injected_packet_tuple_rejects_protocol_mismatch() { + let req = InjectPacketRequest { + addr_family: libc::AF_INET as u8, + protocol: PROTO_TCP, + destination_ip: "172.16.80.200".into(), + tuple_metadata_version: INJECT_PACKET_TUPLE_PROTOCOL_VERSION, + source_ip: "172.16.80.8".into(), + source_port: Some(0x1234), + destination_port: Some(0), + ..Default::default() + }; + + let err = + inject::validate_injected_packet_tuple(&req, IpAddr::V4(Ipv4Addr::new(172, 16, 80, 200))) + .expect_err("non-ICMP tuple must fail closed"); + assert!(err.contains("supports only protocol"), "{err}"); +} + +#[test] +fn build_injected_packet_uses_wire_tuple_source_ipv4() { + let req = InjectPacketRequest { + packet_length: 98, + ..Default::default() + }; + let src = IpAddr::V4(Ipv4Addr::new(172, 16, 80, 8)); + let dst = IpAddr::V4(Ipv4Addr::new(172, 16, 80, 200)); + let egress = EgressInterface { + bind_ifindex: 12, + vlan_id: 0, + mtu: 1500, + src_mac: [0x02, 0xbf, 0x72, 0x00, 0x01, 0x01], + zone_id: TEST_WAN_ZONE_ID, + redundancy_group: 0, + primary_v4: Some(Ipv4Addr::new(192, 0, 2, 1)), + primary_v6: None, + }; + let resolution = ForwardingResolution { + disposition: ForwardingDisposition::ForwardCandidate, + local_ifindex: 0, + egress_ifindex: 80, + tx_ifindex: 80, + tunnel_endpoint_id: 0, + next_hop: Some(dst), + neighbor_mac: Some([0xde, 0xad, 0xbe, 0xef, 0x00, 0x01]), + src_mac: Some(egress.src_mac), + tx_vlan_id: 0, + }; + + let frame = build_injected_packet(&req, src, dst, 0x3456, resolution, &egress) + .expect("build injected packet"); + + assert_eq!(&frame[26..30], &[172, 16, 80, 8]); + assert_eq!(&frame[30..34], &[172, 16, 80, 200]); + assert_eq!(&frame[38..40], &0x3456u16.to_be_bytes()); +} + #[test] fn 
build_cos_owner_worker_by_queue_prefers_lowest_worker_with_tx_binding() { let mut forwarding = ForwardingState::default(); diff --git a/userspace-dp/src/afxdp/flow_cache.rs b/userspace-dp/src/afxdp/flow_cache.rs index 5791f112d..9ed2b5c5d 100644 --- a/userspace-dp/src/afxdp/flow_cache.rs +++ b/userspace-dp/src/afxdp/flow_cache.rs @@ -28,6 +28,7 @@ pub(super) struct CachedTxSelectionDescriptor { pub(super) queue_id: Option, pub(super) dscp_rewrite: Option, pub(super) filter_counter: Option>, + pub(super) three_color_policers: crate::filter::CachedThreeColorPolicers, } /// Precomputed rewrite descriptor for an established flow. diff --git a/userspace-dp/src/afxdp/forward_request.rs b/userspace-dp/src/afxdp/forward_request.rs index 6c0402cb9..143cd8c10 100644 --- a/userspace-dp/src/afxdp/forward_request.rs +++ b/userspace-dp/src/afxdp/forward_request.rs @@ -13,7 +13,10 @@ use super::*; -pub(super) fn should_install_local_reverse_session(decision: SessionDecision, fabric_ingress: bool) -> bool { +pub(super) fn should_install_local_reverse_session( + decision: SessionDecision, + fabric_ingress: bool, +) -> bool { let fabric_wire_placeholder = shared_ops::is_fabric_wire_placeholder(fabric_ingress, false, decision); decision.resolution.disposition != ForwardingDisposition::FabricRedirect @@ -33,6 +36,7 @@ pub(super) fn build_live_forward_request( flow: Option<&SessionFlow>, fabric_ingress_zone: Option, apply_nat_on_fabric: bool, + now_ns: u64, ) -> Option { let frame = area.slice(desc.addr as usize, desc.len as usize)?; build_live_forward_request_from_frame( @@ -47,6 +51,7 @@ pub(super) fn build_live_forward_request( flow, fabric_ingress_zone, apply_nat_on_fabric, + now_ns, None, None, ) @@ -64,6 +69,7 @@ pub(super) fn build_live_forward_request_from_frame( flow: Option<&SessionFlow>, fabric_ingress_zone: Option, apply_nat_on_fabric: bool, + now_ns: u64, hints: Option, precomputed_tx_selection: Option<&CachedTxSelectionDescriptor>, ) -> Option { @@ -103,19 +109,31 @@ 
pub(super) fn build_live_forward_request_from_frame( { decision.resolution.src_mac = zone_redirect.src_mac; } + let fallback_flow; + let tx_selection_flow = if flow.is_some() { + flow + } else { + fallback_flow = parse_session_flow_from_meta(meta); + fallback_flow.as_ref() + }; let cos = precomputed_tx_selection .map(|selection| CoSTxSelection { queue_id: selection.queue_id, dscp_rewrite: selection.dscp_rewrite, + drop: false, }) .unwrap_or_else(|| { - resolve_cos_tx_selection( + resolve_cos_tx_selection_at( forwarding, decision.resolution.egress_ifindex, meta, - flow.map(|flow| &flow.forward_key), + tx_selection_flow.map(|flow| &flow.forward_key), + now_ns, ) }); + if cos.drop { + return None; + } Some(PendingForwardRequest { target_ifindex, target_binding_index, @@ -126,9 +144,10 @@ pub(super) fn build_live_forward_request_from_frame( decision, apply_nat_on_fabric, expected_ports, - flow_key: flow.map(|flow| flow.forward_key.clone()), + flow_key: tx_selection_flow.map(|flow| flow.forward_key.clone()), nat64_reverse: None, cos_queue_id: cos.queue_id, dscp_rewrite: cos.dscp_rewrite, + cos_tx_selection_resolved: true, }) } diff --git a/userspace-dp/src/afxdp/forwarding_build.rs b/userspace-dp/src/afxdp/forwarding_build.rs index effcc66f9..053e1542b 100644 --- a/userspace-dp/src/afxdp/forwarding_build.rs +++ b/userspace-dp/src/afxdp/forwarding_build.rs @@ -396,9 +396,10 @@ pub(super) fn build_forwarding_state_with_policy_counters( state.tcp_mss_gre_in = snapshot.flow.tcp_mss_gre_in; state.tcp_mss_gre_out = snapshot.flow.tcp_mss_gre_out; // Build filter state from snapshot - state.filter_state = crate::filter::parse_filter_state( + state.filter_state = crate::filter::parse_filter_state_with_three_color( &snapshot.filters, &snapshot.policers, + &snapshot.three_color_policers, &snapshot.interfaces, &snapshot.flow.lo0_filter_input_v4, &snapshot.flow.lo0_filter_input_v6, @@ -407,10 +408,20 @@ pub(super) fn build_forwarding_state_with_policy_counters( let 
has_cos_interfaces = !state.cos.interfaces.is_empty(); state.tx_selection_enabled_v4 = has_cos_interfaces || state.filter_state.has_input_tx_selection_v4 - || state.filter_state.has_output_tx_selection_v4; + || state.filter_state.has_output_tx_selection_v4 + || state.filter_state.has_input_three_color_policer_v4 + || !state + .filter_state + .iface_filter_out_v4_needs_tx_eval + .is_empty(); state.tx_selection_enabled_v6 = has_cos_interfaces || state.filter_state.has_input_tx_selection_v6 - || state.filter_state.has_output_tx_selection_v6; + || state.filter_state.has_output_tx_selection_v6 + || state.filter_state.has_input_three_color_policer_v6 + || !state + .filter_state + .iface_filter_out_v6_needs_tx_eval + .is_empty(); // Build flow export config from snapshot state.flow_export_config = snapshot.flow_export.as_ref().and_then(|fe| { let addr = format!("{}:{}", fe.collector_address, fe.collector_port); diff --git a/userspace-dp/src/afxdp/frame/mod.rs b/userspace-dp/src/afxdp/frame/mod.rs index 4b678604c..64c1fcd64 100644 --- a/userspace-dp/src/afxdp/frame/mod.rs +++ b/userspace-dp/src/afxdp/frame/mod.rs @@ -119,16 +119,23 @@ pub(in crate::afxdp) fn apply_dscp_rewrite_to_frame(frame: &mut [u8], dscp: u8) pub(super) fn build_injected_packet( req: &InjectPacketRequest, + src: IpAddr, dst: IpAddr, + src_port: u16, resolution: ForwardingResolution, egress: &EgressInterface, ) -> Result<Vec<u8>, String> { let dst_mac = resolution .neighbor_mac .ok_or_else(|| "missing neighbor MAC".to_string())?; - match dst { - IpAddr::V4(dst_v4) => build_injected_ipv4(req, dst_mac, dst_v4, egress), - IpAddr::V6(dst_v6) => build_injected_ipv6(req, dst_mac, dst_v6, egress), + match (src, dst) { + (IpAddr::V4(src_v4), IpAddr::V4(dst_v4)) => { + build_injected_ipv4(req, dst_mac, src_v4, dst_v4, src_port, egress) + } + (IpAddr::V6(src_v6), IpAddr::V6(dst_v6)) => { + build_injected_ipv6(req, dst_mac, src_v6, dst_v6, src_port, egress) + } + _ => Err("injected packet source and destination address 
families differ".to_string()), } } @@ -1441,12 +1448,11 @@ pub(super) fn restore_l4_tuple_from_meta( pub(super) fn build_injected_ipv4( req: &InjectPacketRequest, dst_mac: [u8; 6], + src_ip: Ipv4Addr, dst_ip: Ipv4Addr, + src_port: u16, egress: &EgressInterface, ) -> Result, String> { - let src_ip = egress - .primary_v4 - .ok_or_else(|| "egress interface has no IPv4 source address".to_string())?; let eth_len = if egress.vlan_id > 0 { 18 } else { 14 }; let min_total = eth_len + 20 + 8 + 16; let target_len = req.packet_length.max(min_total as u32) as usize; @@ -1479,7 +1485,7 @@ pub(super) fn build_injected_ipv4( let icmp_start = frame.len(); frame.extend_from_slice(&[8, 0, 0, 0]); - frame.extend_from_slice(&(req.slot as u16).to_be_bytes()); + frame.extend_from_slice(&src_port.to_be_bytes()); frame.extend_from_slice(&1u16.to_be_bytes()); for i in 0..payload_len { frame.push((i & 0xff) as u8); @@ -1493,12 +1499,11 @@ pub(super) fn build_injected_ipv4( pub(super) fn build_injected_ipv6( req: &InjectPacketRequest, dst_mac: [u8; 6], + src_ip: Ipv6Addr, dst_ip: Ipv6Addr, + src_port: u16, egress: &EgressInterface, ) -> Result, String> { - let src_ip = egress - .primary_v6 - .ok_or_else(|| "egress interface has no IPv6 source address".to_string())?; let eth_len = if egress.vlan_id > 0 { 18 } else { 14 }; let min_total = eth_len + 40 + 8 + 16; let target_len = req.packet_length.max(min_total as u32) as usize; @@ -1522,7 +1527,7 @@ pub(super) fn build_injected_ipv6( let icmp_start = frame.len(); frame.extend_from_slice(&[128, 0, 0, 0]); - frame.extend_from_slice(&(req.slot as u16).to_be_bytes()); + frame.extend_from_slice(&src_port.to_be_bytes()); frame.extend_from_slice(&1u16.to_be_bytes()); for i in 0..payload_len { frame.push((i & 0xff) as u8); diff --git a/userspace-dp/src/afxdp/frame/tests.rs b/userspace-dp/src/afxdp/frame/tests.rs index 030c8b9cf..eec8fbbb7 100644 --- a/userspace-dp/src/afxdp/frame/tests.rs +++ b/userspace-dp/src/afxdp/frame/tests.rs @@ -4,6 +4,7 @@ use 
super::super::test_fixtures::*; use super::*; +use crate::{FirewallFilterSnapshot, FirewallTermSnapshot, ThreeColorPolicerSnapshot}; use crate::test_zone_ids::*; fn active_ha_runtime(now_secs: u64) -> HAGroupRuntime { @@ -1933,6 +1934,7 @@ fn build_live_forward_request_prefers_session_flow_ports_over_frame() { Some(&session_flow), None, false, + 0, ) .expect("request"); // Session flow ports (1025, 5201) take priority over frame ports (38276, 5201) @@ -2036,11 +2038,248 @@ fn build_live_forward_request_uses_live_frame_ports_when_no_session_flow() { None, None, false, + 0, ) .expect("request"); assert_eq!(req.expected_ports, Some((real_src_port, real_dst_port))); } +#[test] +fn build_live_forward_request_meters_non_l4_metadata_flow() { + let src_ip = Ipv4Addr::new(10, 0, 0, 1); + let dst_ip = Ipv4Addr::new(10, 0, 0, 2); + let area = MmapArea::new(4096).expect("mmap"); + let meta = UserspaceDpMeta { + magic: USERSPACE_META_MAGIC, + version: USERSPACE_META_VERSION, + length: std::mem::size_of::() as u16, + l3_offset: 14, + l4_offset: 34, + ingress_ifindex: 10, + addr_family: libc::AF_INET as u8, + protocol: PROTO_GRE, + pkt_len: 128, + flow_src_addr: { + let mut bytes = [0u8; 16]; + bytes[..4].copy_from_slice(&src_ip.octets()); + bytes + }, + flow_dst_addr: { + let mut bytes = [0u8; 16]; + bytes[..4].copy_from_slice(&dst_ip.octets()); + bytes + }, + ..UserspaceDpMeta::default() + }; + let filter_state = crate::filter::parse_filter_state_with_three_color( + &[FirewallFilterSnapshot { + name: "policed".into(), + family: "inet".into(), + terms: vec![FirewallTermSnapshot { + name: "meter-gre".into(), + action: "accept".into(), + protocols: vec!["gre".into()], + policer: "gre-pol".into(), + ..Default::default() + }], + }], + &[], + &[ThreeColorPolicerSnapshot { + name: "gre-pol".into(), + mode: "single-rate".into(), + color_blind: true, + committed_rate_bytes_per_sec: 1, + committed_burst_bytes: 64, + peak_or_excess_burst_bytes: 32, + then_action: "discard".into(), + 
..Default::default() + }], + &[crate::InterfaceSnapshot { + name: "ge-0/0/1.0".into(), + ifindex: 10, + filter_input_v4: "policed".into(), + ..Default::default() + }], + "policed", + "", + ); + let mut forwarding = ForwardingState { + filter_state, + tx_selection_enabled_v4: true, + ..ForwardingState::default() + }; + forwarding.cos.interfaces.insert( + 12, + CoSInterfaceConfig { + shaping_rate_bytes: 1_000_000, + burst_bytes: crate::afxdp::cos::COS_MIN_BURST_BYTES, + default_queue: 0, + dscp_classifier: String::new(), + ieee8021_classifier: String::new(), + dscp_queue_by_dscp: [u8::MAX; 64], + ieee8021_queue_by_pcp: [u8::MAX; 8], + queue_by_forwarding_class: FastMap::default(), + queues: Vec::new(), + }, + ); + let decision = SessionDecision { + resolution: ForwardingResolution { + disposition: ForwardingDisposition::ForwardCandidate, + local_ifindex: 0, + egress_ifindex: 12, + tx_ifindex: 12, + tunnel_endpoint_id: 0, + next_hop: Some(IpAddr::V4(dst_ip)), + neighbor_mac: Some([0xba, 0x86, 0xe9, 0xf6, 0x4b, 0xd5]), + src_mac: Some([0x02, 0xbf, 0x72, 0x00, 0x80, 0x08]), + tx_vlan_id: 0, + }, + nat: NatDecision::default(), + }; + let ingress = BindingIdentity { + slot: 0, + queue_id: 0, + worker_id: 0, + interface: Arc::::from("ge-0-0-1"), + ifindex: 10, + }; + + let req = build_live_forward_request( + &area, + &WorkerBindingLookup::default(), + 0, + &ingress, + XdpDesc { + addr: 0, + len: 0, + options: 0, + }, + meta, + &decision, + &forwarding, + None, + None, + false, + 0, + ); + + assert!(req.is_none(), "red-drop policer should reject non-L4 metadata flow"); + let status = forwarding.filter_state.three_color_policer_statuses(); + assert_eq!(status[0].red_packets, 1); + assert_eq!(status[0].drop_packets, 1); +} + +#[test] +fn build_live_forward_request_marks_empty_cos_selection_resolved() { + let src_ip = Ipv4Addr::new(10, 0, 0, 1); + let dst_ip = Ipv4Addr::new(10, 0, 0, 2); + let area = MmapArea::new(4096).expect("mmap"); + let meta = UserspaceDpMeta { + magic: 
USERSPACE_META_MAGIC, + version: USERSPACE_META_VERSION, + length: std::mem::size_of::() as u16, + ingress_ifindex: 10, + addr_family: libc::AF_INET as u8, + protocol: PROTO_GRE, + pkt_len: 64, + flow_src_addr: { + let mut bytes = [0u8; 16]; + bytes[..4].copy_from_slice(&src_ip.octets()); + bytes + }, + flow_dst_addr: { + let mut bytes = [0u8; 16]; + bytes[..4].copy_from_slice(&dst_ip.octets()); + bytes + }, + ..UserspaceDpMeta::default() + }; + let filter_state = crate::filter::parse_filter_state_with_three_color( + &[FirewallFilterSnapshot { + name: "policed".into(), + family: "inet".into(), + terms: vec![FirewallTermSnapshot { + name: "meter-gre".into(), + action: "accept".into(), + protocols: vec!["gre".into()], + policer: "gre-pol".into(), + ..Default::default() + }], + }], + &[], + &[ThreeColorPolicerSnapshot { + name: "gre-pol".into(), + mode: "single-rate".into(), + color_blind: true, + committed_rate_bytes_per_sec: 1, + committed_burst_bytes: 128, + peak_or_excess_burst_bytes: 64, + then_action: "discard".into(), + ..Default::default() + }], + &[crate::InterfaceSnapshot { + name: "ge-0/0/1.0".into(), + ifindex: 10, + filter_input_v4: "policed".into(), + ..Default::default() + }], + "policed", + "", + ); + let forwarding = ForwardingState { + filter_state, + tx_selection_enabled_v4: true, + ..ForwardingState::default() + }; + let decision = SessionDecision { + resolution: ForwardingResolution { + disposition: ForwardingDisposition::ForwardCandidate, + local_ifindex: 0, + egress_ifindex: 12, + tx_ifindex: 12, + tunnel_endpoint_id: 0, + next_hop: Some(IpAddr::V4(dst_ip)), + neighbor_mac: Some([0xba, 0x86, 0xe9, 0xf6, 0x4b, 0xd5]), + src_mac: Some([0x02, 0xbf, 0x72, 0x00, 0x80, 0x08]), + tx_vlan_id: 0, + }, + nat: NatDecision::default(), + }; + let ingress = BindingIdentity { + slot: 0, + queue_id: 0, + worker_id: 0, + interface: Arc::::from("ge-0-0-1"), + ifindex: 10, + }; + + let req = build_live_forward_request( + &area, + &WorkerBindingLookup::default(), + 
0, + &ingress, + XdpDesc { + addr: 0, + len: 0, + options: 0, + }, + meta, + &decision, + &forwarding, + None, + None, + false, + 0, + ) + .expect("green policer should permit request"); + + assert_eq!(req.cos_queue_id, None); + assert_eq!(req.dscp_rewrite, None); + assert!(req.cos_tx_selection_resolved); + let status = forwarding.filter_state.three_color_policer_statuses(); + assert_eq!(status[0].green_packets, 1); +} + #[test] fn build_live_forward_request_uses_flow_or_metadata_ports_when_frame_ports_unavailable() { let src_ip = "2001:559:8585:ef00::102".parse::().unwrap(); @@ -2110,6 +2349,7 @@ fn build_live_forward_request_uses_flow_or_metadata_ports_when_frame_ports_unava Some(&flow), None, false, + 0, ) .expect("request"); assert_eq!(req.expected_ports, Some((54688, 5201))); @@ -2178,6 +2418,7 @@ fn build_live_forward_request_marks_session_fabric_redirect_for_nat_and_zone() { Some(&flow), Some(TEST_WAN_ZONE_ID), true, + 0, ) .expect("request"); @@ -2260,6 +2501,7 @@ fn build_live_forward_request_caches_target_binding_index() { None, None, false, + 0, ) .expect("request"); diff --git a/userspace-dp/src/afxdp/icmp.rs b/userspace-dp/src/afxdp/icmp.rs index 4b8f119ff..43f038996 100644 --- a/userspace-dp/src/afxdp/icmp.rs +++ b/userspace-dp/src/afxdp/icmp.rs @@ -22,7 +22,7 @@ pub(super) fn build_local_time_exceeded_request( desc: XdpDesc, meta: UserspaceDpMeta, ingress_ident: &BindingIdentity, - _flow: &SessionFlow, + flow: &SessionFlow, forwarding: &ForwardingState, _dynamic_neighbors: &Arc, _ha_state: &BTreeMap, @@ -48,7 +48,17 @@ pub(super) fn build_local_time_exceeded_request( _ => return None, }?; - let cos = resolve_cos_tx_selection(forwarding, ingress_ident.ifindex, meta, None); + let now_ns = monotonic_nanos(); + let cos = resolve_cos_tx_selection_at( + forwarding, + ingress_ident.ifindex, + meta, + Some(&flow.forward_key), + now_ns, + ); + if cos.drop { + return None; + } Some(PendingForwardRequest { target_ifindex, target_binding_index: None, @@ -72,10 
+82,11 @@ pub(super) fn build_local_time_exceeded_request( }, apply_nat_on_fabric: false, expected_ports: None, - flow_key: None, + flow_key: Some(flow.forward_key.clone()), nat64_reverse: None, cos_queue_id: cos.queue_id, dscp_rewrite: cos.dscp_rewrite, + cos_tx_selection_resolved: true, }) } diff --git a/userspace-dp/src/afxdp/mod.rs b/userspace-dp/src/afxdp/mod.rs index 576a1fdf0..702d7f632 100644 --- a/userspace-dp/src/afxdp/mod.rs +++ b/userspace-dp/src/afxdp/mod.rs @@ -278,10 +278,10 @@ const XDP_OPTIONS_ZEROCOPY: u32 = 1; const PENDING_NEIGH_TIMEOUT_NS: u64 = 2_000_000_000; // 2 seconds // GEMINI-NEXT.md Section 3 cold start: admission cap bumped 64 → 4096 so a // per-binding burst of new connections during the ARP/NDP probe window -// doesn't drop frames. PendingNeighPacket is 224 B on x86_64 (XdpDesc + -// UserspaceDpMeta + SessionDecision + queued_ns + probe_attempts), so -// worst-case per binding when the queue is fully populated is ~896 KiB. -// To avoid paying that ~896 KiB up front per binding × N bindings, the +// doesn't drop frames. PendingNeighPacket is 264 B on x86_64 (XdpDesc + +// UserspaceDpMeta + SessionDecision + flow key + queued_ns + probe_attempts), +// so worst-case per binding when the queue is fully populated is ~1.0 MiB. +// To avoid paying that ~1.0 MiB up front per binding × N bindings, the // underlying VecDeque is now constructed with `VecDeque::new()` (zero // capacity) at worker init — see worker/mod.rs. 
// The buffer grows on push only when traffic actually queues up, and the diff --git a/userspace-dp/src/afxdp/neighbor_dispatch.rs b/userspace-dp/src/afxdp/neighbor_dispatch.rs index 90c3e0070..6f32ec8fd 100644 --- a/userspace-dp/src/afxdp/neighbor_dispatch.rs +++ b/userspace-dp/src/afxdp/neighbor_dispatch.rs @@ -183,12 +183,17 @@ pub(super) fn retry_pending_neigh( binding.tx_pipeline.pending_fill_frames.push_back(pkt.addr); continue; }; - let cos = resolve_cos_tx_selection( + let cos = resolve_cos_tx_selection_at( forwarding, decision.resolution.egress_ifindex, pkt.meta, - None, + pkt.flow_key.as_ref(), + now_ns, ); + if cos.drop { + binding.tx_pipeline.pending_fill_frames.push_back(pkt.addr); + continue; + } let req = PreparedTxRequest { offset: rewrite_result.offset, len: rewrite_result.len, @@ -200,7 +205,7 @@ pub(super) fn retry_pending_neigh( expected_ports: None, expected_addr_family: pkt.meta.addr_family, expected_protocol: pkt.meta.protocol, - flow_key: None, + flow_key: pkt.flow_key.clone(), egress_ifindex: decision.resolution.egress_ifindex, cos_queue_id: cos.queue_id, dscp_rewrite: cos.dscp_rewrite, diff --git a/userspace-dp/src/afxdp/poll_descriptor.rs b/userspace-dp/src/afxdp/poll_descriptor.rs index f263b094f..d0e084e89 100644 --- a/userspace-dp/src/afxdp/poll_descriptor.rs +++ b/userspace-dp/src/afxdp/poll_descriptor.rs @@ -179,6 +179,20 @@ pub(super) fn poll_binding_process_descriptor( meta.pkt_len as u64, ); } + let policer_action = + crate::filter::apply_cached_three_color_policers( + &cached_descriptor.tx_selection.three_color_policers, + now_ns, + meta.pkt_len as u64, + ); + if policer_action.drop { + binding.scratch.scratch_recycle.push(desc.addr); + continue; + } + let cached_queue_id = cached_descriptor.tx_selection.queue_id; + let cached_dscp_rewrite = policer_action + .dscp_rewrite + .or(cached_descriptor.tx_selection.dscp_rewrite); // Amortize session timestamp touch — every 64 cache hits. 
binding.flow.flow_cache_session_touch += 1; if binding.flow.flow_cache_session_touch & 63 == 0 { @@ -374,12 +388,8 @@ pub(super) fn poll_binding_process_descriptor( egress_ifindex: cached_decision .resolution .egress_ifindex, - cos_queue_id: cached_descriptor - .tx_selection - .queue_id, - dscp_rewrite: cached_descriptor - .tx_selection - .dscp_rewrite, + cos_queue_id: cached_queue_id, + dscp_rewrite: cached_dscp_rewrite, mirror_clone: false, }, ); @@ -394,6 +404,12 @@ pub(super) fn poll_binding_process_descriptor( } // Fallback: use PendingForwardRequest path for cross-binding or failure. if recycle_now { + let cached_precomputed_tx_selection = + CachedTxSelectionDescriptor { + queue_id: cached_queue_id, + dscp_rewrite: cached_dscp_rewrite, + ..CachedTxSelectionDescriptor::default() + }; if let Some(mut request) = build_live_forward_request_from_frame( worker_ctx.binding_lookup, @@ -407,11 +423,12 @@ pub(super) fn poll_binding_process_descriptor( Some(flow), Some(cached_metadata.ingress_zone), cached_descriptor.apply_nat_on_fabric, + now_ns, Some(PendingForwardHints { expected_ports, target_binding_index: target_bi, }), - Some(&cached_descriptor.tx_selection), + Some(&cached_precomputed_tx_selection), ) { request.frame = owned_packet_frame @@ -590,6 +607,7 @@ pub(super) fn poll_binding_process_descriptor( Some(flow), None, false, + now_ns, None, None, ) { @@ -998,35 +1016,39 @@ pub(super) fn poll_binding_process_descriptor( icmp_decision.resolution.egress_ifindex, ) }; - let cos = resolve_cos_tx_selection( + let cos = resolve_cos_tx_selection_at( worker_ctx.forwarding, icmp_decision.resolution.egress_ifindex, meta, - None, + Some(&flow.forward_key), + now_ns, ); - binding.scratch.scratch_forwards.push(PendingForwardRequest { - target_ifindex, - target_binding_index: worker_ctx.binding_lookup.target_index( - binding_index, - worker_ctx.ident.ifindex, - worker_ctx.ident.queue_id, + if !cos.drop { + binding.scratch.scratch_forwards.push(PendingForwardRequest { 
target_ifindex, - ), - ingress_queue_id: worker_ctx.ident.queue_id, - desc, - frame: PendingForwardFrame::Prebuilt( - rewritten_frame, - ), - meta: meta.into(), - decision: icmp_decision, - apply_nat_on_fabric: false, - expected_ports: None, - flow_key: None, - nat64_reverse: None, - cos_queue_id: cos.queue_id, - dscp_rewrite: cos.dscp_rewrite, - }); - recycle_now = false; + target_binding_index: worker_ctx.binding_lookup.target_index( + binding_index, + worker_ctx.ident.ifindex, + worker_ctx.ident.queue_id, + target_ifindex, + ), + ingress_queue_id: worker_ctx.ident.queue_id, + desc, + frame: PendingForwardFrame::Prebuilt( + rewritten_frame, + ), + meta: meta.into(), + decision: icmp_decision, + apply_nat_on_fabric: false, + expected_ports: None, + flow_key: Some(flow.forward_key.clone()), + nat64_reverse: None, + cos_queue_id: cos.queue_id, + dscp_rewrite: cos.dscp_rewrite, + cos_tx_selection_resolved: true, + }); + recycle_now = false; + } #[cfg(feature = "debug-log")] if icmpv6_trace { debug_log!( @@ -1822,6 +1844,7 @@ pub(super) fn poll_binding_process_descriptor( flow.as_ref(), session_ingress_zone, apply_nat_on_fabric, + now_ns, None, None, ) { @@ -2137,11 +2160,19 @@ pub(super) fn poll_binding_process_descriptor( // in zero-copy mode (mlx5). The ICMP probe + netlink // monitor + buffer-retry path bypasses this issue. 
if binding.pending_neigh.len() < MAX_PENDING_NEIGH { + let pending_flow_key = flow + .as_ref() + .map(|flow| flow.forward_key.clone()) + .or_else(|| { + parse_session_flow_from_meta(meta) + .map(|flow| flow.forward_key) + }); binding.pending_neigh.push_back(PendingNeighPacket { addr: desc.addr, desc, meta, decision: pending_decision, + flow_key: pending_flow_key, queued_ns: now_ns, probe_attempts: 0, }); diff --git a/userspace-dp/src/afxdp/tests.rs b/userspace-dp/src/afxdp/tests.rs index 815921823..d985fd5a2 100644 --- a/userspace-dp/src/afxdp/tests.rs +++ b/userspace-dp/src/afxdp/tests.rs @@ -3,8 +3,9 @@ use super::*; use crate::test_zone_ids::*; use crate::xsk_ffi::IfInfo; use crate::{ - DestinationNATRuleSnapshot, InterfaceAddressSnapshot, PolicyRuleSnapshot, - SourceNATRuleSnapshot, StaticNATRuleSnapshot, + DestinationNATRuleSnapshot, FirewallFilterSnapshot, FirewallTermSnapshot, + InterfaceAddressSnapshot, PolicyRuleSnapshot, SourceNATRuleSnapshot, StaticNATRuleSnapshot, + ThreeColorPolicerSnapshot, }; #[test] @@ -243,6 +244,7 @@ fn build_live_forward_request_from_frame_uses_precomputed_hints() { None, None, false, + 0, Some(hints), None, ) @@ -1124,6 +1126,7 @@ fn build_local_time_exceeded_request_returns_prebuilt_forward_for_ttl_expiry() { let meta = UserspaceDpMeta { l3_offset: 14, l4_offset: 34, + ingress_ifindex: 5, addr_family: libc::AF_INET as u8, protocol: PROTO_ICMP, ..UserspaceDpMeta::default() @@ -1183,9 +1186,121 @@ fn build_local_time_exceeded_request_returns_prebuilt_forward_for_ttl_expiry() { assert_eq!(request.target_ifindex, 5); assert_eq!(request.ingress_queue_id, ingress_ident.queue_id); assert_eq!(request.desc.addr, desc.addr); + assert_eq!(request.flow_key.as_ref(), Some(&flow.forward_key)); + assert!(request.cos_tx_selection_resolved); assert!(matches!(request.frame, PendingForwardFrame::Prebuilt(_))); } +#[test] +fn build_local_time_exceeded_request_meters_icmp_flow_key() { + let client_ip = Ipv4Addr::new(10, 0, 61, 102); + let dst_ip = 
Ipv4Addr::new(1, 1, 1, 1); + let frame = build_icmp_echo_frame_v4(client_ip, dst_ip, 1); + let meta = UserspaceDpMeta { + l3_offset: 14, + l4_offset: 34, + ingress_ifindex: 5, + addr_family: libc::AF_INET as u8, + protocol: PROTO_ICMP, + pkt_len: 128, + ..UserspaceDpMeta::default() + }; + let desc = XdpDesc { + addr: 4096, + len: frame.len() as u32, + options: 0, + }; + let ingress_ident = BindingIdentity { + slot: 0, + queue_id: 7, + worker_id: 0, + interface: Arc::::from("ge-0-0-1"), + ifindex: 5, + }; + let flow = SessionFlow { + src_ip: IpAddr::V4(client_ip), + dst_ip: IpAddr::V4(dst_ip), + forward_key: SessionKey { + addr_family: libc::AF_INET as u8, + protocol: PROTO_ICMP, + src_ip: IpAddr::V4(client_ip), + dst_ip: IpAddr::V4(dst_ip), + src_port: 0x1234, + dst_port: 0, + }, + }; + let filter_state = crate::filter::parse_filter_state_with_three_color( + &[FirewallFilterSnapshot { + name: "policed-icmp".into(), + family: "inet".into(), + terms: vec![FirewallTermSnapshot { + name: "meter-icmp".into(), + action: "accept".into(), + protocols: vec!["icmp".into()], + policer: "icmp-pol".into(), + ..Default::default() + }], + }], + &[], + &[ThreeColorPolicerSnapshot { + name: "icmp-pol".into(), + mode: "single-rate".into(), + color_blind: true, + committed_rate_bytes_per_sec: 1, + committed_burst_bytes: 64, + peak_or_excess_burst_bytes: 32, + then_action: "discard".into(), + ..Default::default() + }], + &[crate::InterfaceSnapshot { + name: "ge-0/0/1.0".into(), + ifindex: 5, + filter_input_v4: "policed-icmp".into(), + ..Default::default() + }], + "policed-icmp", + "", + ); + let mut forwarding = ForwardingState { + filter_state, + tx_selection_enabled_v4: true, + ..ForwardingState::default() + }; + forwarding.egress.insert( + 5, + EgressInterface { + bind_ifindex: 5, + vlan_id: 0, + mtu: 1500, + src_mac: [0x02, 0xbf, 0x72, 0x00, 0x61, 0x01], + zone_id: TEST_LAN_ZONE_ID, + redundancy_group: 1, + primary_v4: Some(Ipv4Addr::new(10, 0, 61, 1)), + primary_v6: None, + }, + 
); + + let request = build_local_time_exceeded_request( + &frame, + desc, + meta, + &ingress_ident, + &flow, + &forwarding, + &Arc::new(ShardedNeighborMap::new()), + &BTreeMap::new(), + 0, + ); + + assert!( + request.is_none(), + "red-drop policer should reject the generated ICMP response" + ); + let status = forwarding.filter_state.three_color_policer_statuses(); + assert_eq!(status[0].red_packets, 1); + assert_eq!(status[0].drop_packets, 1); +} + #[test] fn build_local_time_exceeded_request_skips_fabric_ingress_packets() { let client_ip = Ipv4Addr::new(10, 0, 61, 102); diff --git a/userspace-dp/src/afxdp/tunnel.rs b/userspace-dp/src/afxdp/tunnel.rs index 776c83104..5e5175640 100644 --- a/userspace-dp/src/afxdp/tunnel.rs +++ b/userspace-dp/src/afxdp/tunnel.rs @@ -216,15 +216,25 @@ pub(super) fn build_local_origin_tunnel_tx_request( &session_entry, monotonic_nanos() / 1_000_000_000, ); - let cos = resolve_cos_tx_selection(forwarding, decision.resolution.egress_ifindex, meta, None); + let now_ns = monotonic_nanos(); + let cos = resolve_cos_tx_selection_at( + forwarding, + decision.resolution.egress_ifindex, + meta, + Some(&session_entry.key), + now_ns, + ); + if cos.drop { + return Err("local_tunnel_packet_dropped_by_three_color_policer".to_string()); + } Ok(LocalTunnelTxPlan { tx_ifindex: decision.resolution.tx_ifindex, tx_request: TxRequest { bytes, expected_ports: None, - expected_addr_family: 0, - expected_protocol: 0, - flow_key: None, + expected_addr_family: meta.addr_family, + expected_protocol: meta.protocol, + flow_key: Some(session_entry.key.clone()), egress_ifindex: decision.resolution.egress_ifindex, cos_queue_id: cos.queue_id, dscp_rewrite: cos.dscp_rewrite, diff --git a/userspace-dp/src/afxdp/tx/cos_classify.rs b/userspace-dp/src/afxdp/tx/cos_classify.rs index 4be8d52d0..873332683 100644 --- a/userspace-dp/src/afxdp/tx/cos_classify.rs +++ b/userspace-dp/src/afxdp/tx/cos_classify.rs @@ -10,6 +10,7 @@ use crate::afxdp::mirror::MIRROR_TX_FRAME_RESERVE; 
pub(in crate::afxdp) struct CoSTxSelection { pub(in crate::afxdp) queue_id: Option, pub(in crate::afxdp) dscp_rewrite: Option, + pub(in crate::afxdp) drop: bool, } fn map_cached_forwarding_class_queue( @@ -31,6 +32,7 @@ pub(in crate::afxdp) fn resolve_cached_cos_tx_selection( queue_id: iface.map(|iface| iface.default_queue), dscp_rewrite: None, filter_counter: None, + three_color_policers: crate::filter::CachedThreeColorPolicers::default(), }; }; @@ -42,7 +44,13 @@ pub(in crate::afxdp) fn resolve_cached_cos_tx_selection( ); let has_input_tx_selection = crate::filter::filter_state_has_input_tx_selection(&forwarding.filter_state, is_v6); - if iface.is_none() && !has_output_tx_eval && !has_input_tx_selection { + let has_input_three_color_policer = + crate::filter::filter_state_has_input_three_color_policer(&forwarding.filter_state, is_v6); + if iface.is_none() + && !has_output_tx_eval + && !has_input_tx_selection + && !has_input_three_color_policer + { return CachedTxSelectionDescriptor::default(); } let output_filter = if has_output_tx_eval { @@ -63,7 +71,11 @@ pub(in crate::afxdp) fn resolve_cached_cos_tx_selection( None }; let output_result = output_filter - .filter(|filter| filter.affects_tx_selection || filter.has_counter_terms) + .filter(|filter| { + filter.affects_tx_selection + || filter.has_counter_terms + || filter.has_three_color_policer_terms + }) .map(|filter| { crate::filter::evaluate_filter_ref_tx_selection_cached( filter, @@ -80,8 +92,9 @@ pub(in crate::afxdp) fn resolve_cached_cos_tx_selection( let mut effective_dscp_rewrite = output_result.dscp_rewrite; let mut forwarding_class = output_result.forwarding_class.clone(); let mut filter_counter = output_result.counter.clone(); + let mut three_color_policers = output_result.three_color_policers; - if output_filter.is_none() && has_input_tx_selection { + if (output_filter.is_none() && has_input_tx_selection) || has_input_three_color_policer { let ingress_ifindex = resolve_ingress_logical_ifindex( 
forwarding, meta.ingress_ifindex as i32, @@ -101,7 +114,10 @@ pub(in crate::afxdp) fn resolve_cached_cos_tx_selection( .get(&ingress_ifindex) .map(Arc::as_ref) }; - if let Some(ingress_filter) = ingress_filter.filter(|filter| filter.affects_tx_selection) { + if let Some(ingress_filter) = ingress_filter.filter(|filter| { + (output_filter.is_none() && filter.affects_tx_selection) + || filter.has_three_color_policer_terms + }) { let ingress_result = crate::filter::evaluate_filter_ref_tx_selection_cached( ingress_filter, flow_key.src_ip, @@ -112,8 +128,11 @@ pub(in crate::afxdp) fn resolve_cached_cos_tx_selection( meta.dscp, ); effective_dscp_rewrite = effective_dscp_rewrite.or(ingress_result.dscp_rewrite); - forwarding_class = ingress_result.forwarding_class; - filter_counter = ingress_result.counter; + if output_filter.is_none() { + forwarding_class = ingress_result.forwarding_class; + filter_counter = ingress_result.counter; + } + three_color_policers.extend(ingress_result.three_color_policers); } } @@ -134,6 +153,7 @@ pub(in crate::afxdp) fn resolve_cached_cos_tx_selection( queue_id, dscp_rewrite: effective_dscp_rewrite, filter_counter, + three_color_policers, } } @@ -151,6 +171,26 @@ pub(in crate::afxdp) fn resolve_cos_tx_selection( egress_ifindex: i32, meta: impl Into, flow_key: Option<&SessionKey>, +) -> CoSTxSelection { + resolve_cos_tx_selection_internal(forwarding, egress_ifindex, meta, flow_key, None) +} + +pub(in crate::afxdp) fn resolve_cos_tx_selection_at( + forwarding: &ForwardingState, + egress_ifindex: i32, + meta: impl Into, + flow_key: Option<&SessionKey>, + now_ns: u64, +) -> CoSTxSelection { + resolve_cos_tx_selection_internal(forwarding, egress_ifindex, meta, flow_key, Some(now_ns)) +} + +fn resolve_cos_tx_selection_internal( + forwarding: &ForwardingState, + egress_ifindex: i32, + meta: impl Into, + flow_key: Option<&SessionKey>, + now_ns: Option, ) -> CoSTxSelection { let meta = meta.into(); let tx_selection_enabled = if meta.addr_family as i32 
== libc::AF_INET6 { @@ -166,6 +206,7 @@ pub(in crate::afxdp) fn resolve_cos_tx_selection( return CoSTxSelection { queue_id: iface.map(|iface| iface.default_queue), dscp_rewrite: None, + drop: false, }; }; let is_v6 = meta.addr_family as i32 == libc::AF_INET6; @@ -176,10 +217,17 @@ pub(in crate::afxdp) fn resolve_cos_tx_selection( ); let has_input_tx_selection = crate::filter::filter_state_has_input_tx_selection(&forwarding.filter_state, is_v6); - if iface.is_none() && !has_output_tx_eval && !has_input_tx_selection { + let has_input_three_color_policer = + crate::filter::filter_state_has_input_three_color_policer(&forwarding.filter_state, is_v6); + if iface.is_none() + && !has_output_tx_eval + && !has_input_tx_selection + && !has_input_three_color_policer + { return CoSTxSelection { queue_id: None, dscp_rewrite: None, + drop: false, }; } let output_filter = if has_output_tx_eval { @@ -200,69 +248,108 @@ pub(in crate::afxdp) fn resolve_cos_tx_selection( None }; let has_output_filter = output_filter.is_some(); - let ingress_ifindex = if !has_output_filter && has_input_tx_selection { - resolve_ingress_logical_ifindex( - forwarding, - meta.ingress_ifindex as i32, - meta.ingress_vlan_id, - ) - .unwrap_or(meta.ingress_ifindex as i32) - } else { - 0 - }; - let ingress_filter = if !has_output_filter && has_input_tx_selection { - if is_v6 { - forwarding - .filter_state - .iface_filter_v6_fast - .get(&ingress_ifindex) - .map(Arc::as_ref) + let ingress_ifindex = + if (!has_output_filter && has_input_tx_selection) || has_input_three_color_policer { + resolve_ingress_logical_ifindex( + forwarding, + meta.ingress_ifindex as i32, + meta.ingress_vlan_id, + ) + .unwrap_or(meta.ingress_ifindex as i32) } else { - forwarding - .filter_state - .iface_filter_v4_fast - .get(&ingress_ifindex) - .map(Arc::as_ref) + 0 + }; + let ingress_filter = + if (!has_output_filter && has_input_tx_selection) || has_input_three_color_policer { + if is_v6 { + forwarding + .filter_state + 
.iface_filter_v6_fast + .get(&ingress_ifindex) + .map(Arc::as_ref) + } else { + forwarding + .filter_state + .iface_filter_v4_fast + .get(&ingress_ifindex) + .map(Arc::as_ref) + } + } else { + None + }; + let output_result = if let Some(output_filter) = output_filter.filter(|filter| { + filter.affects_tx_selection + || filter.has_counter_terms + || filter.has_three_color_policer_terms + }) { + if let Some(now_ns) = now_ns { + crate::filter::evaluate_filter_ref_tx_selection_runtime_counted( + output_filter, + flow_key.src_ip, + flow_key.dst_ip, + flow_key.protocol, + flow_key.src_port, + flow_key.dst_port, + meta.dscp, + meta.pkt_len as u64, + now_ns, + ) + } else { + crate::filter::evaluate_filter_ref_tx_selection_counted( + output_filter, + flow_key.src_ip, + flow_key.dst_ip, + flow_key.protocol, + flow_key.src_port, + flow_key.dst_port, + meta.dscp, + meta.pkt_len as u64, + ) } - } else { - None - }; - let output_result = if let Some(output_filter) = - output_filter.filter(|filter| filter.affects_tx_selection || filter.has_counter_terms) - { - crate::filter::evaluate_filter_ref_tx_selection_counted( - output_filter, - flow_key.src_ip, - flow_key.dst_ip, - flow_key.protocol, - flow_key.src_port, - flow_key.dst_port, - meta.dscp, - meta.pkt_len as u64, - ) } else { crate::filter::TxSelectionFilterResult::default() }; let mut effective_dscp_rewrite = output_result.dscp_rewrite; + let mut policer_drop = output_result.policer_drop; let mut ingress_forwarding_class = None; - if let Some(ingress_filter) = ingress_filter.filter(|filter| filter.affects_tx_selection) { - let ingress_result = crate::filter::evaluate_filter_ref_tx_selection_counted( - ingress_filter, - flow_key.src_ip, - flow_key.dst_ip, - flow_key.protocol, - flow_key.src_port, - flow_key.dst_port, - meta.dscp, - meta.pkt_len as u64, - ); + if let Some(ingress_filter) = ingress_filter.filter(|filter| { + (!has_output_filter && filter.affects_tx_selection) || filter.has_three_color_policer_terms + }) { + let 
ingress_result = if let Some(now_ns) = now_ns { + crate::filter::evaluate_filter_ref_tx_selection_runtime_counted( + ingress_filter, + flow_key.src_ip, + flow_key.dst_ip, + flow_key.protocol, + flow_key.src_port, + flow_key.dst_port, + meta.dscp, + meta.pkt_len as u64, + now_ns, + ) + } else { + crate::filter::evaluate_filter_ref_tx_selection_counted( + ingress_filter, + flow_key.src_ip, + flow_key.dst_ip, + flow_key.protocol, + flow_key.src_port, + flow_key.dst_port, + meta.dscp, + meta.pkt_len as u64, + ) + }; effective_dscp_rewrite = effective_dscp_rewrite.or(ingress_result.dscp_rewrite); - ingress_forwarding_class = ingress_result.forwarding_class; + policer_drop |= ingress_result.policer_drop; + if !has_output_filter { + ingress_forwarding_class = ingress_result.forwarding_class; + } } let Some(iface) = iface else { return CoSTxSelection { queue_id: None, dscp_rewrite: effective_dscp_rewrite, + drop: policer_drop, }; }; if let Some(forwarding_class) = output_result.forwarding_class { @@ -270,6 +357,7 @@ pub(in crate::afxdp) fn resolve_cos_tx_selection( return CoSTxSelection { queue_id: Some(*queue_id), dscp_rewrite: effective_dscp_rewrite, + drop: policer_drop, }; } } @@ -278,6 +366,7 @@ pub(in crate::afxdp) fn resolve_cos_tx_selection( return CoSTxSelection { queue_id: Some(*queue_id), dscp_rewrite: effective_dscp_rewrite, + drop: policer_drop, }; } } @@ -285,6 +374,7 @@ pub(in crate::afxdp) fn resolve_cos_tx_selection( return CoSTxSelection { queue_id: Some(queue_id), dscp_rewrite: effective_dscp_rewrite, + drop: policer_drop, }; } if let Some(queue_id) = resolve_cos_ieee8021_classifier_queue_id( @@ -295,11 +385,13 @@ pub(in crate::afxdp) fn resolve_cos_tx_selection( return CoSTxSelection { queue_id: Some(queue_id), dscp_rewrite: effective_dscp_rewrite, + drop: policer_drop, }; } CoSTxSelection { queue_id: Some(iface.default_queue), dscp_rewrite: effective_dscp_rewrite, + drop: policer_drop, } } diff --git a/userspace-dp/src/afxdp/tx/dispatch.rs 
b/userspace-dp/src/afxdp/tx/dispatch.rs index 49b920393..52df10bb4 100644 --- a/userspace-dp/src/afxdp/tx/dispatch.rs +++ b/userspace-dp/src/afxdp/tx/dispatch.rs @@ -103,16 +103,19 @@ pub(in crate::afxdp) fn enqueue_pending_forwards( for request in pending_forwards.iter_mut() { let source_offset = request.desc.addr; let ingress_slot = ingress_binding.slot; - let tx_selection_enabled = if request.meta.addr_family as i32 == libc::AF_INET6 { - tx_selection_enabled_v6 - } else { - tx_selection_enabled_v4 - }; - if tx_selection_enabled && request.cos_queue_id.is_none() && request.dscp_rewrite.is_none() - { - let cos = resolve_pending_forward_cos_tx_selection(forwarding, &request); + if pending_forward_needs_cos_tx_selection( + request, + tx_selection_enabled_v4, + tx_selection_enabled_v6, + ) { + let cos = resolve_pending_forward_cos_tx_selection(forwarding, &request, now_ns); + if cos.drop { + recycle_ingress_frame(ingress_binding, source_offset, now_ns); + continue; + } request.cos_queue_id = cos.queue_id; request.dscp_rewrite = cos.dscp_rewrite; + request.cos_tx_selection_resolved = true; } let target_binding_index = request.target_binding_index.or_else(|| { binding_lookup.target_index( @@ -145,7 +148,7 @@ pub(in crate::afxdp) fn enqueue_pending_forwards( expected_ports: None, expected_addr_family: request.meta.addr_family, expected_protocol: request.meta.protocol, - flow_key: None, + flow_key: request.flow_key.clone(), egress_ifindex: request.decision.resolution.egress_ifindex, cos_queue_id: request.cos_queue_id, dscp_rewrite: request.dscp_rewrite, @@ -1193,15 +1196,30 @@ pub(in crate::afxdp) fn resolve_tx_binding_ifindex( fn resolve_pending_forward_cos_tx_selection( forwarding: &ForwardingState, request: &PendingForwardRequest, + now_ns: u64, ) -> CoSTxSelection { - resolve_cos_tx_selection( + resolve_cos_tx_selection_at( forwarding, request.decision.resolution.egress_ifindex, request.meta, request.flow_key.as_ref(), + now_ns, ) } +fn 
pending_forward_needs_cos_tx_selection( + request: &PendingForwardRequest, + tx_selection_enabled_v4: bool, + tx_selection_enabled_v6: bool, +) -> bool { + let tx_selection_enabled = if request.meta.addr_family as i32 == libc::AF_INET6 { + tx_selection_enabled_v6 + } else { + tx_selection_enabled_v4 + }; + tx_selection_enabled && !request.cos_tx_selection_resolved +} + pub(in crate::afxdp) fn maybe_reinject_slow_path( binding: &BindingIdentity, live: &BindingLiveState, diff --git a/userspace-dp/src/afxdp/tx/dispatch_tests.rs b/userspace-dp/src/afxdp/tx/dispatch_tests.rs index 7b4d0692c..bcb7bfaf2 100644 --- a/userspace-dp/src/afxdp/tx/dispatch_tests.rs +++ b/userspace-dp/src/afxdp/tx/dispatch_tests.rs @@ -40,6 +40,35 @@ fn test_decision() -> SessionDecision { } } +fn test_pending_forward_request( + addr_family: u8, + cos_tx_selection_resolved: bool, +) -> PendingForwardRequest { + PendingForwardRequest { + target_ifindex: 11, + target_binding_index: None, + ingress_queue_id: 0, + desc: XdpDesc { + addr: 0, + len: 64, + options: 0, + }, + frame: PendingForwardFrame::Live, + meta: ForwardPacketMeta { + addr_family, + ..ForwardPacketMeta::default() + }, + decision: test_decision(), + apply_nat_on_fabric: false, + expected_ports: None, + flow_key: None, + nat64_reverse: None, + cos_queue_id: None, + dscp_rewrite: None, + cos_tx_selection_resolved, + } +} + fn test_cos_fast_interfaces( egress_ifindex: i32, default_queue: u8, @@ -74,6 +103,29 @@ fn test_cos_fast_interfaces( interfaces } +#[test] +fn pending_forward_cos_resolution_uses_resolved_bit_not_empty_outputs() { + let resolved = test_pending_forward_request(libc::AF_INET as u8, true); + assert!( + !pending_forward_needs_cos_tx_selection(&resolved, true, false), + "a resolved None/None selection must not be metered again" + ); + + let unresolved_v4 = test_pending_forward_request(libc::AF_INET as u8, false); + assert!(pending_forward_needs_cos_tx_selection( + &unresolved_v4, + true, + false + )); + + let 
unresolved_v6 = test_pending_forward_request(libc::AF_INET6 as u8, false); + assert!(pending_forward_needs_cos_tx_selection( + &unresolved_v6, + false, + true + )); +} + #[test] fn forwarded_tcp_may_need_segmentation_skips_mtu_sized_frame() { let forwarding = test_forwarding_with_egress_mtu(1500); diff --git a/userspace-dp/src/afxdp/tx/mod.rs b/userspace-dp/src/afxdp/tx/mod.rs index bf29de4a0..dad380df0 100644 --- a/userspace-dp/src/afxdp/tx/mod.rs +++ b/userspace-dp/src/afxdp/tx/mod.rs @@ -33,7 +33,7 @@ pub(super) mod tcp_segmentation; pub(in crate::afxdp) use cos_classify::cos_queue_dscp_rewrite; pub(super) use cos_classify::{ CoSTxSelection, enqueue_local_into_cos, resolve_cached_cos_tx_selection, resolve_cos_queue_id, - resolve_cos_tx_selection, + resolve_cos_tx_selection, resolve_cos_tx_selection_at, }; // Private use, not a re-export: a `pub(super) use` of a `pub(super)` // item triggers E0364. drain.rs reaches this through `use super::*;`. diff --git a/userspace-dp/src/afxdp/types/mod.rs b/userspace-dp/src/afxdp/types/mod.rs index 07f53e370..f36f594ae 100644 --- a/userspace-dp/src/afxdp/types/mod.rs +++ b/userspace-dp/src/afxdp/types/mod.rs @@ -80,6 +80,7 @@ pub(super) struct PendingNeighPacket { pub(super) desc: XdpDesc, pub(super) meta: UserspaceDpMeta, pub(super) decision: SessionDecision, + pub(super) flow_key: Option, pub(super) queued_ns: u64, /// Cold-start probe schedule attempts (GEMINI-NEXT.md Section 3). /// 0 means no retries fired yet beyond the initial probe; each @@ -88,14 +89,10 @@ pub(super) struct PendingNeighPacket { pub(super) probe_attempts: u8, } -// Compile-time size guard: the `probe_attempts: u8` added in #1082 -// fit within the existing trailing alignment padding, so the struct -// is the same 224 B as before. If a future field bumps this past 224, -// re-evaluate the per-binding worst-case (224 B × MAX_PENDING_NEIGH -// = ~896 KiB; the comment in afxdp.rs above MAX_PENDING_NEIGH must -// be updated to match). 
+// Compile-time size guard: pending-neighbor retry carries the session key so +// runtime TX-selection policers still meter packets after ARP/NDP resolution. const _: () = assert!( - core::mem::size_of::<PendingNeighPacket>() == 224, + core::mem::size_of::<PendingNeighPacket>() == 264, "PendingNeighPacket size changed — update afxdp.rs MAX_PENDING_NEIGH commentary", ); diff --git a/userspace-dp/src/afxdp/types/tx.rs b/userspace-dp/src/afxdp/types/tx.rs index d24438ce8..b4307bd34 100644 --- a/userspace-dp/src/afxdp/types/tx.rs +++ b/userspace-dp/src/afxdp/types/tx.rs @@ -73,6 +73,7 @@ pub(in crate::afxdp) struct PendingForwardRequest { pub(in crate::afxdp) nat64_reverse: Option, pub(in crate::afxdp) cos_queue_id: Option, pub(in crate::afxdp) dscp_rewrite: Option, + pub(in crate::afxdp) cos_tx_selection_resolved: bool, } pub(in crate::afxdp) struct PreparedTxRequest { diff --git a/userspace-dp/src/afxdp/umem/tests.rs b/userspace-dp/src/afxdp/umem/tests.rs index d000dc8a5..5d89b0ec0 100644 --- a/userspace-dp/src/afxdp/umem/tests.rs +++ b/userspace-dp/src/afxdp/umem/tests.rs @@ -1462,6 +1462,7 @@ fn active_flow_debug_test_entry( queue_id: Some(2), dscp_rewrite: Some(46), filter_counter: None, + three_color_policers: crate::filter::CachedThreeColorPolicers::default(), }, nat64: false, nptv6: false, diff --git a/userspace-dp/src/filter/README.md b/userspace-dp/src/filter/README.md index 0a439a24b..208ebd1ac 100644 --- a/userspace-dp/src/filter/README.md +++ b/userspace-dp/src/filter/README.md @@ -7,17 +7,21 @@ Mirrors the BPF firewall-filter pipeline in userspace. - `mod.rs` — public surface: `FilterAction` (`Accept` / `Discard` / `Reject` only), `FilterTerm` (the matched-and-action carrier), - `PortMatcher`, `FilterTermCounter`. Side-effect actions like - counting, logging, policing, forwarding-class assignment, and DSCP - rewrite are **fields on `FilterTerm`** (e.g.
`count`, `log`, - `policer_name`, `forwarding_class`, `dscp_rewrite`), not enum + `PortMatcher`, `FilterTermCounter`, and three-color policer runtime + counters. Side-effect actions like counting, logging, policing, + forwarding-class assignment, and DSCP rewrite are **fields on + `FilterTerm`** (e.g. `count`, `log`, `policer_name`, + `three_color_policer`, `forwarding_class`, `dscp_rewrite`), not enum variants — the engine applies them around the action verdict. - `compiler.rs` — parses the typed config's filter terms and lowers them to `FilterTerm`s (prefix vectors, protocol bitmap, port - matcher, DSCP bitmap). + matcher, DSCP bitmap). Three-color policer snapshots are sorted by + name and compiled into stable runtime IDs before terms are linked. - `engine.rs` — per-term evaluation, first-match-wins. It carries the - matched `then policer ...` name in the filter result; forwarding-path - enforcement is a separate wiring step. + matched `then policer ...` name in the filter result. TX-selection + evaluation meters three-color policers on live forwarding paths and + cached evaluation returns runtime handles so flow-cache hits can meter + the same policer before forwarding. - `policer.rs` — token-bucket implementation plus the #1375 RFC 2697/2698 three-color meter core. Token math is integer-only: the legacy token bucket keeps its bits/sec constructor contract, and @@ -38,7 +42,7 @@ Mirrors the BPF firewall-filter pipeline in userspace. - `from-interface` is matched at the binding level (caller sets the ingress interface; the term doesn't re-derive it). -## #1375 Three-Color Policer Foundation +## #1375 Three-Color Policer Runtime Implemented here: @@ -50,29 +54,34 @@ Implemented here: packets. Color-blind classification ignores inherited color. - Per-color treatments can carry DSCP rewrite and drop decisions in the meter decision. +- The Go snapshot schema, Rust wire DTO, and commit-time structural + validation are wired for three-color policers. 
Commit validation + rejects ambiguous mode declarations (`single-rate` with `two-rate`) + and ambiguous color declarations (`color-blind` with `color-aware`) + before they can reach the helper. +- Filter terms link to stable name-sorted runtime handles. The live + forwarding path meters the handle at packet time, applies red drops, + and records per-color/drop counters. Flow-cache hits carry the same + handle in the cached TX-selection descriptor and meter before cached + forwarding. Packets buffered for missing-neighbor retry carry their + session key and meter at retry dispatch time before prepared TX. +- Rust status, Go status, CLI status formatting, and Prometheus export + expose green/yellow/red packet and byte counters plus drop counters. +- `deriveUserspaceCapabilities()` no longer rejects the color-blind `then + discard` `firewall three-color-policer` runtime slice. -Still gated before removing the userspace capability rejection: +Remaining limitations: -- The Go snapshot schema, Rust wire DTO, and commit-time structural - validation are wired for three-color policers. They are published only - so the control plane and dataplane agree on the future wire shape. - Commit validation rejects ambiguous mode declarations (`single-rate` - with `two-rate`) and ambiguous color declarations (`color-blind` with - `color-aware`) before they can reach the helper. Duplicate - hierarchical `firewall three-color-policer ` blocks are compiled - as one logical policer before that validation, so load - merge/override cannot hide an ambiguity behind last-write-wins map - assignment. Repeated same-mode sub-blocks (`single-rate` or - `two-rate`) are merged before strict validation as well, so split - color-mode declarations in sibling blocks are surfaced and rejected. -- Filter terms still carry a policer name in the evaluation result. - The hot forwarding path must move to stable policer IDs with - ID-indexed or sharded state before three-color policers are enabled. 
-- Flow-cache hits do not yet execute policer decisions, because the - forwarding path does not consume this meter core. -- Forwarding-path application of per-color counters, red drops, DSCP - rewrites, Rust status, Go status, CLI, and Prometheus export remain - follow-on wiring. -- Until those runtime pieces land, any config containing - `firewall three-color-policer` stays fail-closed for userspace - forwarding through the capability gate. +- Runtime token state is one `Mutex` per logical policer, not a sharded + or packed atomic implementation. This preserves correctness and + stable identity but is not the final high-throughput contention model. +- Counters and token buckets are rebuilt on snapshot replacement; they + are stable within one compiled runtime but not yet carried across + config rebuilds. +- Snapshot `then_action` handling currently wires red drop for + `then discard`. Other actions, such as loss-priority propagation, stay + fail-closed until downstream loss-priority behavior is wired. Color-aware + mode also stays fail-closed until inherited packet color is carried through + trusted metadata. +- Traffic-level integration, failover, and performance evidence still + need to be collected before treating #1375 as fully retired. diff --git a/userspace-dp/src/filter/compiler.rs b/userspace-dp/src/filter/compiler.rs index 705042531..0c77fcef3 100644 --- a/userspace-dp/src/filter/compiler.rs +++ b/userspace-dp/src/filter/compiler.rs @@ -4,7 +4,6 @@ use super::*; - /// Build the complete FilterState from snapshot data. 
pub(crate) fn parse_filter_state( filters: &[FirewallFilterSnapshot], @@ -12,42 +11,81 @@ pub(crate) fn parse_filter_state( interfaces: &[crate::InterfaceSnapshot], lo0_filter_v4: &str, lo0_filter_v6: &str, +) -> FilterState { + parse_filter_state_with_three_color( + filters, + policers, + &[], + interfaces, + lo0_filter_v4, + lo0_filter_v6, + ) +} + +/// Build the complete FilterState from snapshot data, including stable +/// three-color policer runtimes. +pub(crate) fn parse_filter_state_with_three_color( + filters: &[FirewallFilterSnapshot], + policers: &[PolicerSnapshot], + three_color_policers: &[ThreeColorPolicerSnapshot], + interfaces: &[crate::InterfaceSnapshot], + lo0_filter_v4: &str, + lo0_filter_v6: &str, ) -> FilterState { let mut state = FilterState::default(); + // Parse legacy token-bucket policers. + for snap in policers { + state.policers.insert( + snap.name.clone(), + PolicerState::new( + snap.name.clone(), + snap.bandwidth_bps, + snap.burst_bytes, + snap.discard_excess, + ), + ); + } + + // Parse three-color policers by stable name order. Terms store Arc + // handles, so cache-hit enforcement uses the same logical runtime. 
+ let mut three_color = three_color_policers.iter().collect::>(); + three_color.sort_by(|a, b| a.name.cmp(&b.name)); + for snap in three_color { + let Some(runtime) = parse_three_color_policer(snap, state.three_color_policers.len() + 1) + else { + continue; + }; + state + .three_color_policer_by_name + .insert(runtime.name.to_string(), runtime.clone()); + state.three_color_policers.push(runtime); + } + // Parse filters for snap in filters { let key = qualify_filter_key(&snap.family, &snap.name); + let terms = snap + .terms + .iter() + .map(|t| parse_term(t, &state.three_color_policer_by_name)) + .collect::>(); let filter = Filter { name: snap.name.clone(), family: snap.family.clone(), - terms: snap.terms.iter().map(|t| parse_term(t)).collect(), - affects_tx_selection: snap - .terms + affects_tx_selection: terms .iter() .any(|term| !term.forwarding_class.is_empty() || term.dscp_rewrite.is_some()), - affects_route_lookup: snap - .terms + affects_route_lookup: terms.iter().any(|term| !term.routing_instance.is_empty()), + has_counter_terms: terms.iter().any(|term| term.has_count), + has_three_color_policer_terms: terms .iter() - .any(|term| !term.routing_instance.is_empty()), - has_counter_terms: snap.terms.iter().any(|term| !term.count.is_empty()), + .any(|term| term.three_color_policer.is_some()), + terms, }; state.filters.insert(key, Arc::new(filter)); } - // Parse policers - for snap in policers { - state.policers.insert( - snap.name.clone(), - PolicerState::new( - snap.name.clone(), - snap.bandwidth_bps, - snap.burst_bytes, - snap.discard_excess, - ), - ); - } - // Build per-interface filter assignments for iface in interfaces { if iface.ifindex <= 0 { @@ -62,6 +100,9 @@ pub(crate) fn parse_filter_state( .insert(iface.ifindex); state.has_input_tx_selection_v4 = true; } + if filter.has_three_color_policer_terms { + state.has_input_three_color_policer_v4 = true; + } if filter.affects_route_lookup { state .iface_filter_v4_affects_route_lookup @@ -76,7 +117,10 @@ 
pub(crate) fn parse_filter_state( if !iface.filter_output_v4.is_empty() { let key = qualify_filter_key("inet", &iface.filter_output_v4); if let Some(filter) = state.filters.get(&key) { - if filter.affects_tx_selection || filter.has_counter_terms { + if filter.affects_tx_selection + || filter.has_counter_terms + || filter.has_three_color_policer_terms + { state .iface_filter_out_v4_needs_tx_eval .insert(iface.ifindex); @@ -99,6 +143,9 @@ pub(crate) fn parse_filter_state( .insert(iface.ifindex); state.has_input_tx_selection_v6 = true; } + if filter.has_three_color_policer_terms { + state.has_input_three_color_policer_v6 = true; + } if filter.affects_route_lookup { state .iface_filter_v6_affects_route_lookup @@ -113,7 +160,10 @@ pub(crate) fn parse_filter_state( if !iface.filter_output_v6.is_empty() { let key = qualify_filter_key("inet6", &iface.filter_output_v6); if let Some(filter) = state.filters.get(&key) { - if filter.affects_tx_selection || filter.has_counter_terms { + if filter.affects_tx_selection + || filter.has_counter_terms + || filter.has_three_color_policer_terms + { state .iface_filter_out_v6_needs_tx_eval .insert(iface.ifindex); @@ -145,11 +195,56 @@ pub(crate) fn parse_filter_state( state } +fn parse_three_color_policer( + snap: &ThreeColorPolicerSnapshot, + id: usize, +) -> Option<Arc<ThreeColorPolicerRuntime>> { + let treatments = treatments_from_then_action(&snap.then_action); + let state = match snap.mode.as_str() { + "single-rate" => ThreeColorPolicerState::sr_tcm_with_treatments( + snap.committed_rate_bytes_per_sec, + snap.committed_burst_bytes, + snap.peak_or_excess_burst_bytes, + snap.color_blind, + treatments, + ) + .ok()?, + "two-rate" => ThreeColorPolicerState::tr_tcm_with_treatments( + snap.committed_rate_bytes_per_sec, + snap.committed_burst_bytes, + snap.peak_or_excess_rate_bytes_per_sec, + snap.peak_or_excess_burst_bytes, + snap.color_blind, + treatments, + ) + .ok()?, + _ => return None, + }; + Some(Arc::new(ThreeColorPolicerRuntime::new( + id as u32, +
snap.name.clone(), + state, + ))) +} + +fn treatments_from_then_action(action: &str) -> ThreeColorTreatments { + if action == "discard" { + return ThreeColorTreatments { + red: ColorTreatment::drop(), + ..ThreeColorTreatments::default() + }; + } + ThreeColorTreatments::default() +} + fn qualify_filter_key(family: &str, filter_name: &str) -> String { format!("{family}:{filter_name}") } -fn parse_term(snap: &FirewallTermSnapshot) -> FilterTerm { +fn parse_term( + snap: &FirewallTermSnapshot, + three_color_policers: &rustc_hash::FxHashMap<String, Arc<ThreeColorPolicerRuntime>>, +) -> FilterTerm { let mut source_v4 = Vec::new(); let mut source_v6 = Vec::new(); for addr in &snap.source_addresses { @@ -202,6 +297,7 @@ fn parse_term(snap: &FirewallTermSnapshot) -> FilterTerm { has_count: !snap.count.is_empty(), log: snap.log, policer_name: snap.policer.clone(), + three_color_policer: three_color_policers.get(&snap.policer).cloned(), routing_instance: snap.routing_instance.clone(), forwarding_class: Arc::<str>::from(snap.forwarding_class.as_str()), dscp_rewrite, @@ -316,4 +412,3 @@ fn build_u6_match_bitmap(values: &[u8]) -> u64 { } bitmap } - diff --git a/userspace-dp/src/filter/engine.rs b/userspace-dp/src/filter/engine.rs index b5ad26498..816f0ecaa 100644 --- a/userspace-dp/src/filter/engine.rs +++ b/userspace-dp/src/filter/engine.rs @@ -93,6 +93,55 @@ pub(crate) fn evaluate_filter_ref_tx_selection_counted<'a>( dst_port: u16, dscp: u8, packet_bytes: u64, +) -> TxSelectionFilterResult<'a> { + evaluate_filter_ref_tx_selection_runtime( + filter, + src_ip, + dst_ip, + protocol, + src_port, + dst_port, + dscp, + packet_bytes, + None, + ) +} + +pub(crate) fn evaluate_filter_ref_tx_selection_runtime_counted<'a>( + filter: &'a Filter, + src_ip: IpAddr, + dst_ip: IpAddr, + protocol: u8, + src_port: u16, + dst_port: u16, + dscp: u8, + packet_bytes: u64, + now_ns: u64, +) -> TxSelectionFilterResult<'a> { + evaluate_filter_ref_tx_selection_runtime( + filter, + src_ip, + dst_ip, + protocol, + src_port, + dst_port, + dscp, +
packet_bytes, + Some(now_ns), + ) +} + +#[inline] +fn evaluate_filter_ref_tx_selection_runtime<'a>( + filter: &'a Filter, + src_ip: IpAddr, + dst_ip: IpAddr, + protocol: u8, + src_port: u16, + dst_port: u16, + dscp: u8, + packet_bytes: u64, + now_ns: Option<u64>, ) -> TxSelectionFilterResult<'a> { match (src_ip, dst_ip) { (IpAddr::V4(src), IpAddr::V4(dst)) => evaluate_filter_ref_tx_selection_counted_v4( @@ -104,6 +153,7 @@ pub(crate) fn evaluate_filter_ref_tx_selection_counted<'a>( dst_port, dscp, packet_bytes, + now_ns, ), (IpAddr::V6(src), IpAddr::V6(dst)) => evaluate_filter_ref_tx_selection_counted_v6( filter, @@ -114,6 +164,7 @@ pub(crate) fn evaluate_filter_ref_tx_selection_counted<'a>( dst_port, dscp, packet_bytes, + now_ns, ), _ => TxSelectionFilterResult::default(), } @@ -209,6 +260,7 @@ fn evaluate_filter_ref_tx_selection_counted_v4<'a>( dst_port: u16, dscp: u8, packet_bytes: u64, + now_ns: Option<u64>, ) -> TxSelectionFilterResult<'a> { for term in &filter.terms { if !term_matches_v4(term, src_ip, dst_ip, protocol, src_port, dst_port, dscp) { @@ -217,10 +269,12 @@ fn evaluate_filter_ref_tx_selection_counted_v4<'a>( if term.has_count { record_filter_counter(&term.counter, packet_bytes); } + let policer_action = apply_term_three_color_policer(term, now_ns, packet_bytes); return TxSelectionFilterResult { forwarding_class: (!term.forwarding_class.is_empty()) .then_some(term.forwarding_class.as_ref()), - dscp_rewrite: term.dscp_rewrite, + dscp_rewrite: policer_action.dscp_rewrite.or(term.dscp_rewrite), + policer_drop: policer_action.drop, }; } TxSelectionFilterResult::default() @@ -236,6 +290,7 @@ fn evaluate_filter_ref_tx_selection_counted_v6<'a>( dst_port: u16, dscp: u8, packet_bytes: u64, + now_ns: Option<u64>, ) -> TxSelectionFilterResult<'a> { for term in &filter.terms { if !term_matches_v6(term, src_ip, dst_ip, protocol, src_port, dst_port, dscp) { @@ -244,15 +299,50 @@ fn evaluate_filter_ref_tx_selection_counted_v6<'a>( if term.has_count {
record_filter_counter(&term.counter, packet_bytes); } + let policer_action = apply_term_three_color_policer(term, now_ns, packet_bytes); return TxSelectionFilterResult { forwarding_class: (!term.forwarding_class.is_empty()) .then_some(term.forwarding_class.as_ref()), - dscp_rewrite: term.dscp_rewrite, + dscp_rewrite: policer_action.dscp_rewrite.or(term.dscp_rewrite), + policer_drop: policer_action.drop, }; } TxSelectionFilterResult::default() } +#[inline] +fn apply_term_three_color_policer( + term: &FilterTerm, + now_ns: Option<u64>, + packet_bytes: u64, +) -> ThreeColorPolicerAction { + let Some(runtime) = term.three_color_policer.as_ref() else { + return ThreeColorPolicerAction::default(); + }; + let Some(now_ns) = now_ns else { + return ThreeColorPolicerAction::default(); + }; + let decision = runtime.meter(now_ns, packet_bytes, PacketColor::Green); + ThreeColorPolicerAction { + dscp_rewrite: decision.dscp_rewrite, + drop: decision.drop, + } +} + +pub(crate) fn apply_cached_three_color_policers( + policers: &CachedThreeColorPolicers, + now_ns: u64, + packet_bytes: u64, +) -> ThreeColorPolicerAction { + let mut action = ThreeColorPolicerAction::default(); + policers.for_each(|policer| { + let decision = policer.meter(now_ns, packet_bytes, PacketColor::Green); + action.dscp_rewrite = action.dscp_rewrite.or(decision.dscp_rewrite); + action.drop |= decision.drop; + }); + action +} + fn evaluate_filter_ref_tx_selection_cached_v4( filter: &Filter, src_ip: Ipv4Addr, @@ -271,6 +361,9 @@ fn evaluate_filter_ref_tx_selection_cached_v4( .then(|| term.forwarding_class.clone()), dscp_rewrite: term.dscp_rewrite, counter: term.has_count.then(|| term.counter.clone()), + three_color_policers: CachedThreeColorPolicers::from_option( + term.three_color_policer.clone(), + ), }; } CachedTxSelectionFilterResult::default() @@ -294,6 +387,9 @@ fn evaluate_filter_ref_tx_selection_cached_v6( .then(|| term.forwarding_class.clone()), dscp_rewrite: term.dscp_rewrite, counter:
term.has_count.then(|| term.counter.clone()), + three_color_policers: CachedThreeColorPolicers::from_option( + term.three_color_policer.clone(), + ), }; } CachedTxSelectionFilterResult::default() @@ -628,6 +724,14 @@ pub(crate) fn interface_filter_affects_tx_selection( } } +pub(crate) fn filter_state_has_input_three_color_policer(state: &FilterState, is_v6: bool) -> bool { + if is_v6 { + state.has_input_three_color_policer_v6 + } else { + state.has_input_three_color_policer_v4 + } +} + pub(crate) fn interface_filter_affects_route_lookup( state: &FilterState, ifindex: i32, @@ -762,4 +866,3 @@ fn term_matches_v6( } true } - diff --git a/userspace-dp/src/filter/mod.rs b/userspace-dp/src/filter/mod.rs index fe74cbb9b..f68ab3a34 100644 --- a/userspace-dp/src/filter/mod.rs +++ b/userspace-dp/src/filter/mod.rs @@ -12,13 +12,15 @@ use crate::prefix::{PrefixV4, PrefixV6}; // #1049 P2: Snapshot types come from the crate root (protocol.rs) and are // referenced by both compiler.rs and the tests module. Importing here makes // them visible to all submodules via `use super::*;`. 
-use crate::{FirewallFilterSnapshot, FirewallTermSnapshot, PolicerSnapshot}; +use crate::{ + FirewallFilterSnapshot, FirewallTermSnapshot, PolicerSnapshot, ThreeColorPolicerSnapshot, +}; use ipnet::IpNet; #[cfg(not(test))] use std::cell::RefCell; use std::net::{IpAddr, Ipv4Addr, Ipv6Addr}; -use std::sync::Arc; use std::sync::atomic::{AtomicU64, Ordering}; +use std::sync::{Arc, Mutex}; const PROTO_TCP: u8 = 6; const PROTO_UDP: u8 = 17; @@ -59,6 +61,7 @@ pub(crate) struct FilterTerm { pub(crate) has_count: bool, pub(crate) log: bool, pub(crate) policer_name: String, + pub(crate) three_color_policer: Option>, pub(crate) routing_instance: String, pub(crate) forwarding_class: Arc, pub(crate) dscp_rewrite: Option, @@ -104,6 +107,7 @@ pub(crate) struct Filter { pub(crate) affects_tx_selection: bool, pub(crate) affects_route_lookup: bool, pub(crate) has_counter_terms: bool, + pub(crate) has_three_color_policer_terms: bool, } #[derive(Debug, Default)] @@ -112,6 +116,163 @@ pub(crate) struct FilterTermCounter { pub(crate) bytes: AtomicU64, } +#[derive(Debug, Default)] +pub(crate) struct ThreeColorPolicerCounter { + pub(crate) packets: AtomicU64, + pub(crate) bytes: AtomicU64, +} + +impl ThreeColorPolicerCounter { + fn record(&self, packet_bytes: u64) { + self.packets.fetch_add(1, Ordering::Relaxed); + self.bytes.fetch_add(packet_bytes, Ordering::Relaxed); + } +} + +#[derive(Debug, Default)] +pub(crate) struct ThreeColorPolicerCounters { + pub(crate) green: ThreeColorPolicerCounter, + pub(crate) yellow: ThreeColorPolicerCounter, + pub(crate) red: ThreeColorPolicerCounter, + pub(crate) drop: ThreeColorPolicerCounter, +} + +#[derive(Debug)] +pub(crate) struct ThreeColorPolicerRuntime { + pub(crate) id: u32, + pub(crate) name: Arc, + state: Mutex, + counters: ThreeColorPolicerCounters, +} + +impl PartialEq for ThreeColorPolicerRuntime { + fn eq(&self, other: &Self) -> bool { + self.id == other.id && self.name == other.name + } +} + +impl Eq for ThreeColorPolicerRuntime {} + 
+#[derive(Clone, Debug, Default, PartialEq, Eq)] +pub(crate) struct CachedThreeColorPolicers { + first: Option<Arc<ThreeColorPolicerRuntime>>, + second: Option<Arc<ThreeColorPolicerRuntime>>, +} + +impl CachedThreeColorPolicers { + #[inline] + pub(crate) fn from_option(runtime: Option<Arc<ThreeColorPolicerRuntime>>) -> Self { + Self { + first: runtime, + second: None, + } + } + + #[inline] + pub(crate) fn push(&mut self, runtime: Arc<ThreeColorPolicerRuntime>) { + if self + .first + .as_ref() + .is_some_and(|existing| existing.id == runtime.id) + || self + .second + .as_ref() + .is_some_and(|existing| existing.id == runtime.id) + { + return; + } + if self.first.is_none() { + self.first = Some(runtime); + } else if self.second.is_none() { + self.second = Some(runtime); + } + } + + #[inline] + pub(crate) fn extend(&mut self, other: Self) { + if let Some(runtime) = other.first { + self.push(runtime); + } + if let Some(runtime) = other.second { + self.push(runtime); + } + } + + #[inline] + pub(crate) fn len(&self) -> usize { + usize::from(self.first.is_some()) + usize::from(self.second.is_some()) + } + + #[inline] + pub(crate) fn for_each(&self, mut f: impl FnMut(&Arc<ThreeColorPolicerRuntime>)) { + if let Some(runtime) = self.first.as_ref() { + f(runtime); + } + if let Some(runtime) = self.second.as_ref() { + f(runtime); + } + } +} + +impl ThreeColorPolicerRuntime { + pub(crate) fn new(id: u32, name: String, state: ThreeColorPolicerState) -> Self { + Self { + id, + name: Arc::<str>::from(name), + state: Mutex::new(state), + counters: ThreeColorPolicerCounters::default(), + } + } + + pub(crate) fn meter( + &self, + now_ns: u64, + packet_bytes: u64, + incoming_color: PacketColor, + ) -> ThreeColorDecision { + let decision = self + .state + .lock() + .map(|mut state| state.meter(now_ns, packet_bytes, incoming_color)) + .unwrap_or_else(|_| ThreeColorDecision { + color: PacketColor::Red, + dscp_rewrite: None, + drop: true, + }); + match decision.color { + PacketColor::Green => self.counters.green.record(packet_bytes), + PacketColor::Yellow => self.counters.yellow.record(packet_bytes), + PacketColor::Red =>
self.counters.red.record(packet_bytes), + } + if decision.drop { + self.counters.drop.record(packet_bytes); + } + decision + } + + pub(crate) fn status(&self) -> crate::protocol::ThreeColorPolicerStatus { + let (mode, color_blind) = self + .state + .lock() + .map(|state| (state.mode_name().to_string(), state.color_blind())) + .unwrap_or_else(|_| ("unknown".to_string(), false)); + crate::protocol::ThreeColorPolicerStatus { + id: self.id, + name: self.name.to_string(), + mode, + color_blind, + green_packets: self.counters.green.packets.load(Ordering::Relaxed), + green_bytes: self.counters.green.bytes.load(Ordering::Relaxed), + yellow_packets: self.counters.yellow.packets.load(Ordering::Relaxed), + yellow_bytes: self.counters.yellow.bytes.load(Ordering::Relaxed), + red_packets: self.counters.red.packets.load(Ordering::Relaxed), + red_bytes: self.counters.red.bytes.load(Ordering::Relaxed), + drop_packets: self.counters.drop.packets.load(Ordering::Relaxed), + drop_bytes: self.counters.drop.bytes.load(Ordering::Relaxed), + } + } +} + impl FilterTermCounter { pub(crate) fn record(&self, packet_bytes: u64) { self.packets.fetch_add(1, Ordering::Relaxed); @@ -209,6 +370,11 @@ pub(crate) struct FilterState { pub(crate) filters: rustc_hash::FxHashMap>, /// Named policer states keyed by policer name. pub(crate) policers: rustc_hash::FxHashMap, + /// Stable three-color policer runtimes keyed by policer name. + pub(crate) three_color_policer_by_name: + rustc_hash::FxHashMap>, + /// Stable ID-indexed three-color policer runtimes. + pub(crate) three_color_policers: Vec>, /// Per-interface (ifindex) input filter key for inet. pub(crate) iface_filter_v4: rustc_hash::FxHashMap, /// Direct per-interface inet filter reference for packet hot-path evaluation. @@ -217,6 +383,8 @@ pub(crate) struct FilterState { pub(crate) iface_filter_v4_affects_tx_selection: rustc_hash::FxHashSet, /// Whether any inet input filter can affect CoS TX selection. 
pub(crate) has_input_tx_selection_v4: bool, + /// Whether any inet input filter contains a three-color policer. + pub(crate) has_input_three_color_policer_v4: bool, /// Per-interface inet input filters that can affect route-table selection. pub(crate) iface_filter_v4_affects_route_lookup: rustc_hash::FxHashSet, /// Per-interface (ifindex) input filter key for inet6. @@ -227,6 +395,8 @@ pub(crate) struct FilterState { pub(crate) iface_filter_v6_affects_tx_selection: rustc_hash::FxHashSet, /// Whether any inet6 input filter can affect CoS TX selection. pub(crate) has_input_tx_selection_v6: bool, + /// Whether any inet6 input filter contains a three-color policer. + pub(crate) has_input_three_color_policer_v6: bool, /// Per-interface inet6 input filters that can affect route-table selection. pub(crate) iface_filter_v6_affects_route_lookup: rustc_hash::FxHashSet, /// Per-interface (ifindex) output filter key for inet. @@ -255,6 +425,20 @@ pub(crate) struct FilterState { pub(crate) lo0_filter_v6_fast: Option>, } +impl FilterState { + pub(crate) fn three_color_policer_statuses( + &self, + ) -> Vec { + let mut statuses = self + .three_color_policers + .iter() + .map(|policer| policer.status()) + .collect::>(); + statuses.sort_by_key(|status| status.id); + statuses + } +} + /// Result of filter evaluation. 
#[derive(Clone, Debug, PartialEq, Eq)] pub(crate) struct FilterResult { @@ -266,10 +450,11 @@ pub(crate) struct FilterResult { pub(crate) log: bool, } -#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)] +#[derive(Clone, Debug, Default)] pub(crate) struct TxSelectionFilterResult<'a> { pub(crate) forwarding_class: Option<&'a str>, pub(crate) dscp_rewrite: Option, + pub(crate) policer_drop: bool, } #[derive(Clone, Debug, Default)] @@ -277,6 +462,13 @@ pub(crate) struct CachedTxSelectionFilterResult { pub(crate) forwarding_class: Option>, pub(crate) dscp_rewrite: Option, pub(crate) counter: Option>, + pub(crate) three_color_policers: CachedThreeColorPolicers, +} + +#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)] +pub(crate) struct ThreeColorPolicerAction { + pub(crate) dscp_rewrite: Option, + pub(crate) drop: bool, } impl Default for FilterResult { diff --git a/userspace-dp/src/filter/policer.rs b/userspace-dp/src/filter/policer.rs index b43c4c4d2..ae4b43ea4 100644 --- a/userspace-dp/src/filter/policer.rs +++ b/userspace-dp/src/filter/policer.rs @@ -297,6 +297,17 @@ impl ThreeColorPolicerState { } } + pub(crate) fn mode_name(&self) -> &'static str { + match self.mode { + ThreeColorMode::SingleRate => "single-rate", + ThreeColorMode::TwoRate => "two-rate", + } + } + + pub(crate) fn color_blind(&self) -> bool { + self.color_blind + } + fn refill(&mut self, now_ns: u64) { if !self.initialized { self.initialized = true; diff --git a/userspace-dp/src/filter/tests.rs b/userspace-dp/src/filter/tests.rs index 7dfd74eb1..d027de39e 100644 --- a/userspace-dp/src/filter/tests.rs +++ b/userspace-dp/src/filter/tests.rs @@ -12,6 +12,13 @@ fn make_filter_state( parse_filter_state(filters, policers, &[], "", "") } +fn make_filter_state_with_three_color( + filters: &[FirewallFilterSnapshot], + three_color_policers: &[ThreeColorPolicerSnapshot], +) -> FilterState { + parse_filter_state_with_three_color(filters, &[], three_color_policers, &[], "", "") +} + #[test] fn 
basic_accept_discard() { let state = make_filter_state( @@ -293,6 +300,203 @@ fn token_bucket_policer() { assert!(conforming, "packet after refill should conform"); } +#[test] +fn three_color_runtime_ids_and_miss_path_counters_are_stable() { + let state = make_filter_state_with_three_color( + &[FirewallFilterSnapshot { + name: "policed".into(), + family: "inet".into(), + terms: vec![FirewallTermSnapshot { + name: "meter".into(), + action: "accept".into(), + policer: "alpha".into(), + ..Default::default() + }], + }], + &[ + ThreeColorPolicerSnapshot { + name: "zeta".into(), + mode: "single-rate".into(), + color_blind: true, + committed_rate_bytes_per_sec: 1, + committed_burst_bytes: 100, + peak_or_excess_burst_bytes: 50, + then_action: "discard".into(), + ..Default::default() + }, + ThreeColorPolicerSnapshot { + name: "alpha".into(), + mode: "single-rate".into(), + color_blind: true, + committed_rate_bytes_per_sec: 1, + committed_burst_bytes: 100, + peak_or_excess_burst_bytes: 50, + then_action: "discard".into(), + ..Default::default() + }, + ], + ); + + let ids = state + .three_color_policers + .iter() + .map(|runtime| (runtime.id, runtime.name.as_ref().to_string())) + .collect::>(); + assert_eq!(ids, vec![(1, "alpha".into()), (2, "zeta".into())]); + + let filter = state.filters.get("inet:policed").unwrap(); + assert!(filter.has_three_color_policer_terms); + let first = evaluate_filter_ref_tx_selection_runtime_counted( + filter, + IpAddr::V4(Ipv4Addr::new(10, 0, 0, 1)), + IpAddr::V4(Ipv4Addr::new(10, 0, 0, 2)), + PROTO_UDP, + 12345, + 5000, + 0, + 100, + 0, + ); + assert!(!first.policer_drop); + + let second = evaluate_filter_ref_tx_selection_runtime_counted( + filter, + IpAddr::V4(Ipv4Addr::new(10, 0, 0, 1)), + IpAddr::V4(Ipv4Addr::new(10, 0, 0, 2)), + PROTO_UDP, + 12345, + 5000, + 0, + 51, + 0, + ); + assert!(second.policer_drop); + + let status = state.three_color_policer_statuses(); + let alpha = status.iter().find(|item| item.name == "alpha").unwrap(); + 
assert_eq!(alpha.mode, "single-rate"); + assert!(alpha.color_blind); + assert_eq!(alpha.green_packets, 1); + assert_eq!(alpha.green_bytes, 100); + assert_eq!(alpha.red_packets, 1); + assert_eq!(alpha.red_bytes, 51); + assert_eq!(alpha.drop_packets, 1); + assert_eq!(alpha.drop_bytes, 51); +} + +#[test] +fn flow_cache_hits_run_three_color_policer() { + let state = make_filter_state_with_three_color( + &[FirewallFilterSnapshot { + name: "policed".into(), + family: "inet".into(), + terms: vec![FirewallTermSnapshot { + name: "meter".into(), + action: "accept".into(), + policer: "cache-pol".into(), + ..Default::default() + }], + }], + &[ThreeColorPolicerSnapshot { + name: "cache-pol".into(), + mode: "single-rate".into(), + color_blind: true, + committed_rate_bytes_per_sec: 1, + committed_burst_bytes: 100, + peak_or_excess_burst_bytes: 50, + then_action: "discard".into(), + ..Default::default() + }], + ); + + let filter = state.filters.get("inet:policed").unwrap(); + let cached = evaluate_filter_ref_tx_selection_cached( + filter, + IpAddr::V4(Ipv4Addr::new(10, 0, 0, 1)), + IpAddr::V4(Ipv4Addr::new(10, 0, 0, 2)), + PROTO_UDP, + 12345, + 5000, + 0, + ); + assert_eq!(cached.three_color_policers.len(), 1); + + let first = apply_cached_three_color_policers(&cached.three_color_policers, 0, 100); + assert!(!first.drop); + let second = apply_cached_three_color_policers(&cached.three_color_policers, 0, 51); + assert!(second.drop); + + let status = state.three_color_policer_statuses(); + assert_eq!(status[0].green_packets, 1); + assert_eq!(status[0].red_packets, 1); + assert_eq!(status[0].drop_packets, 1); +} + +#[test] +fn cached_three_color_descriptor_dedupes_without_vec_allocation() { + let state = make_filter_state_with_three_color( + &[ + FirewallFilterSnapshot { + name: "in".into(), + family: "inet".into(), + terms: vec![FirewallTermSnapshot { + name: "meter-in".into(), + action: "accept".into(), + policer: "same-pol".into(), + ..Default::default() + }], + }, + 
FirewallFilterSnapshot { + name: "out".into(), + family: "inet".into(), + terms: vec![FirewallTermSnapshot { + name: "meter-out".into(), + action: "accept".into(), + policer: "same-pol".into(), + ..Default::default() + }], + }, + ], + &[ThreeColorPolicerSnapshot { + name: "same-pol".into(), + mode: "single-rate".into(), + color_blind: true, + committed_rate_bytes_per_sec: 1, + committed_burst_bytes: 100, + peak_or_excess_burst_bytes: 50, + then_action: "discard".into(), + ..Default::default() + }], + ); + + let mut combined = evaluate_filter_ref_tx_selection_cached( + state.filters.get("inet:out").unwrap(), + IpAddr::V4(Ipv4Addr::new(10, 0, 0, 1)), + IpAddr::V4(Ipv4Addr::new(10, 0, 0, 2)), + PROTO_UDP, + 12345, + 5000, + 0, + ) + .three_color_policers; + combined.extend( + evaluate_filter_ref_tx_selection_cached( + state.filters.get("inet:in").unwrap(), + IpAddr::V4(Ipv4Addr::new(10, 0, 0, 1)), + IpAddr::V4(Ipv4Addr::new(10, 0, 0, 2)), + PROTO_UDP, + 12345, + 5000, + 0, + ) + .three_color_policers, + ); + + assert_eq!(combined.len(), 1); + assert!(!apply_cached_three_color_policers(&combined, 0, 100).drop); + assert!(apply_cached_three_color_policers(&combined, 0, 51).drop); +} + #[test] fn multiple_terms_first_match_wins() { let state = make_filter_state( diff --git a/userspace-dp/src/protocol.rs b/userspace-dp/src/protocol.rs index e52df25e2..dd3d0a055 100644 --- a/userspace-dp/src/protocol.rs +++ b/userspace-dp/src/protocol.rs @@ -8,6 +8,7 @@ use chrono::{DateTime, Utc}; use serde::{Deserialize, Serialize}; pub(crate) const CONFIG_SNAPSHOT_PROTOCOL_VERSION: i32 = 2; +pub(crate) const INJECT_PACKET_TUPLE_PROTOCOL_VERSION: i32 = 1; // --------------------------------------------------------------------------- // Snapshot schema @@ -727,6 +728,8 @@ pub(crate) struct ProcessStatus { pub pid: i32, #[serde(rename = "config_snapshot_protocol_version", default)] pub config_snapshot_protocol_version: i32, + #[serde(rename = "inject_packet_tuple_protocol_version", 
default)] + pub inject_packet_tuple_protocol_version: i32, #[serde(rename = "started_at")] pub started_at: DateTime, #[serde(rename = "control_socket")] @@ -826,6 +829,8 @@ pub(crate) struct ProcessStatus { pub policy_rule_counters: Vec, #[serde(rename = "filter_term_counters", default)] pub filter_term_counters: Vec, + #[serde(rename = "three_color_policer_counters", default)] + pub three_color_policer_counters: Vec, #[serde(rename = "last_resolution", skip_serializing_if = "Option::is_none")] pub last_resolution: Option, #[serde(rename = "slow_path", default)] @@ -1073,6 +1078,34 @@ pub(crate) struct FirewallFilterTermCounterStatus { pub bytes: u64, } +#[derive(Clone, Debug, Serialize, Deserialize, Default)] +pub(crate) struct ThreeColorPolicerStatus { + #[serde(default)] + pub id: u32, + #[serde(default)] + pub name: String, + #[serde(default)] + pub mode: String, + #[serde(rename = "color_blind", default)] + pub color_blind: bool, + #[serde(rename = "green_packets", default)] + pub green_packets: u64, + #[serde(rename = "green_bytes", default)] + pub green_bytes: u64, + #[serde(rename = "yellow_packets", default)] + pub yellow_packets: u64, + #[serde(rename = "yellow_bytes", default)] + pub yellow_bytes: u64, + #[serde(rename = "red_packets", default)] + pub red_packets: u64, + #[serde(rename = "red_bytes", default)] + pub red_bytes: u64, + #[serde(rename = "drop_packets", default)] + pub drop_packets: u64, + #[serde(rename = "drop_bytes", default)] + pub drop_bytes: u64, +} + #[derive(Clone, Debug, Serialize, Deserialize, Default)] pub(crate) struct SlowPathStatus { #[serde(default)] @@ -2045,6 +2078,14 @@ pub(crate) struct InjectPacketRequest { pub destination_ip: String, #[serde(rename = "emit_on_wire", default)] pub emit_on_wire: bool, + #[serde(rename = "tuple_metadata_version", default)] + pub tuple_metadata_version: i32, + #[serde(rename = "source_ip", default)] + pub source_ip: String, + #[serde(rename = "source_port", default)] + pub source_port: 
Option, + #[serde(rename = "destination_port", default)] + pub destination_port: Option, } #[derive(Clone, Debug, Serialize, Deserialize, Default)] @@ -2198,6 +2239,88 @@ pub(crate) struct SessionDeltaInfo { mod tests { use super::*; + #[test] + fn process_status_inject_packet_tuple_protocol_version_roundtrip() { + let status = ProcessStatus { + inject_packet_tuple_protocol_version: INJECT_PACKET_TUPLE_PROTOCOL_VERSION, + ..Default::default() + }; + let value: serde_json::Value = + serde_json::to_value(&status).expect("serialize ProcessStatus to Value"); + assert_eq!( + value["inject_packet_tuple_protocol_version"], + INJECT_PACKET_TUPLE_PROTOCOL_VERSION + ); + let back: ProcessStatus = serde_json::from_value(value).expect("deserialize ProcessStatus"); + assert_eq!( + back.inject_packet_tuple_protocol_version, + INJECT_PACKET_TUPLE_PROTOCOL_VERSION + ); + } + + #[test] + fn inject_packet_request_tuple_metadata_wire_roundtrip() { + let req = InjectPacketRequest { + slot: 7, + packet_length: 128, + addr_family: libc::AF_INET as u8, + protocol: 1, + config_generation: 11, + fib_generation: 12, + metadata_valid: true, + destination_ip: "172.16.80.200".into(), + emit_on_wire: true, + tuple_metadata_version: INJECT_PACKET_TUPLE_PROTOCOL_VERSION, + source_ip: "172.16.80.8".into(), + source_port: Some(4660), + destination_port: Some(0), + }; + let value: serde_json::Value = + serde_json::to_value(&req).expect("serialize InjectPacketRequest to Value"); + let obj = value + .as_object() + .expect("InjectPacketRequest serializes as object"); + for key in [ + "tuple_metadata_version", + "source_ip", + "source_port", + "destination_port", + ] { + assert!( + obj.contains_key(key), + "InjectPacketRequest wire key `{key}` missing: {value}" + ); + } + let back: InjectPacketRequest = + serde_json::from_value(value).expect("deserialize InjectPacketRequest"); + assert_eq!( + back.tuple_metadata_version, + INJECT_PACKET_TUPLE_PROTOCOL_VERSION + ); + assert_eq!(back.source_ip, 
"172.16.80.8"); + assert_eq!(back.source_port, Some(4660)); + assert_eq!(back.destination_port, Some(0)); + } + + #[test] + fn inject_packet_request_legacy_tuple_metadata_defaults_absent() { + let legacy_json = r#"{ + "slot": 7, + "packet_length": 128, + "addr_family": 2, + "protocol": 1, + "metadata_valid": true, + "destination_ip": "172.16.80.200", + "emit_on_wire": true + }"#; + let req: InjectPacketRequest = + serde_json::from_str(legacy_json).expect("legacy InjectPacketRequest decodes"); + assert_eq!(req.tuple_metadata_version, 0); + assert_eq!(req.source_ip, ""); + assert_eq!(req.source_port, None); + assert_eq!(req.destination_port, None); + } + // #825 plan §3.9 test #5: wire-format round-trip for // BindingStatus. Construct with non-zero values on all four // kick-latency fields, serialize, deserialize, assert equality. diff --git a/userspace-dp/src/server/README.md b/userspace-dp/src/server/README.md index fc89025eb..a171303be 100644 --- a/userspace-dp/src/server/README.md +++ b/userspace-dp/src/server/README.md @@ -39,6 +39,12 @@ helper while scheduled policies are configured, it sends the old helper must not keep forwarding a stale snapshot that ignores scheduler inactive bits. +`inject_packet_tuple_protocol_version` is the corresponding status gate for +`inject_packet` requests that set `emit_on_wire=true`. Those requests must carry +the complete tuple metadata (`source_ip`, `destination_ip`, protocol, and ports) +on the control wire; helpers reject legacy emit-on-wire requests instead of +synthesizing tuple identity locally. 
+ ## Reconciliation `replan_queues` derives the binding plan from the current diff --git a/userspace-dp/src/server/helpers.rs b/userspace-dp/src/server/helpers.rs index 21e3f2ee2..075373a2e 100644 --- a/userspace-dp/src/server/helpers.rs +++ b/userspace-dp/src/server/helpers.rs @@ -80,6 +80,7 @@ pub(crate) fn refresh_status(state: &mut ServerState) { state.status.cos_interfaces = state.afxdp.cos_statuses(); state.status.policy_rule_counters = state.afxdp.policy_rule_counters(); state.status.filter_term_counters = state.afxdp.filter_term_counters(); + state.status.three_color_policer_counters = state.afxdp.three_color_policer_counters(); let (flow_worker_map, flow_worker_map_truncated) = state.afxdp.flow_worker_map(); state.status.flow_worker_map = flow_worker_map; state.status.flow_worker_map_truncated = flow_worker_map_truncated; diff --git a/userspace-dp/src/server/lifecycle.rs b/userspace-dp/src/server/lifecycle.rs index b7f23e469..37e6a41d9 100644 --- a/userspace-dp/src/server/lifecycle.rs +++ b/userspace-dp/src/server/lifecycle.rs @@ -74,6 +74,7 @@ pub(crate) fn run() -> Result<(), String> { status: ProcessStatus { pid: std::process::id() as i32, config_snapshot_protocol_version: CONFIG_SNAPSHOT_PROTOCOL_VERSION, + inject_packet_tuple_protocol_version: INJECT_PACKET_TUPLE_PROTOCOL_VERSION, started_at: Utc::now(), control_socket: args.control_socket.clone(), state_file: args.state_file.clone(), @@ -111,6 +112,7 @@ pub(crate) fn run() -> Result<(), String> { cos_interfaces: Vec::new(), policy_rule_counters: Vec::new(), filter_term_counters: Vec::new(), + three_color_policer_counters: Vec::new(), last_resolution: None, slow_path: SlowPathStatus::default(), debug_worker_threads: 0,
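Reviewer note on the metering behavior exercised by the new filter tests: the diff does not show the body of `ThreeColorPolicerState::meter`, so the following is only a minimal sketch of a color-blind single-rate three-color meter in the style of RFC 2697, with red packets dropped to mirror the `then discard` slice. The names `SrTcmSketch`, `Color`, and `Decision` are hypothetical and are not types from this PR. The refill granularity is a simplification and may differ from what the PR's `refill` does.

```rust
/// Color assigned to a packet by the meter (hypothetical type, not the PR's `PacketColor`).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Color {
    Green,
    Yellow,
    Red,
}

/// Outcome of metering one packet (hypothetical, loosely mirrors `ThreeColorDecision`).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Decision {
    pub color: Color,
    pub drop: bool,
}

/// Color-blind single-rate three-color meter sketch (RFC 2697 style).
pub struct SrTcmSketch {
    rate_bytes_per_sec: u64,
    cbs: u64,     // committed burst size, in bytes
    ebs: u64,     // excess burst size, in bytes
    tc: u64,      // committed bucket fill
    te: u64,      // excess bucket fill
    last_ns: u64, // timestamp of the last refill that earned tokens
}

impl SrTcmSketch {
    /// Both buckets start full, as RFC 2697 specifies.
    pub fn new(rate_bytes_per_sec: u64, cbs: u64, ebs: u64) -> Self {
        Self { rate_bytes_per_sec, cbs, ebs, tc: cbs, te: ebs, last_ns: 0 }
    }

    fn refill(&mut self, now_ns: u64) {
        let elapsed_ns = now_ns.saturating_sub(self.last_ns);
        let earned = (elapsed_ns as u128 * self.rate_bytes_per_sec as u128) / 1_000_000_000;
        if earned == 0 {
            // Leave last_ns alone so short intervals keep accumulating.
            return;
        }
        let earned = u64::try_from(earned).unwrap_or(u64::MAX);
        // Simplification: the sub-byte remainder of the interval is discarded.
        self.last_ns = now_ns;
        // Tokens fill the committed bucket first; overflow spills into the excess bucket.
        let to_committed = earned.min(self.cbs - self.tc);
        self.tc += to_committed;
        self.te = self.te.saturating_add(earned - to_committed).min(self.ebs);
    }

    /// Meter one packet. Red packets are dropped; green and yellow pass.
    pub fn meter(&mut self, now_ns: u64, packet_bytes: u64) -> Decision {
        self.refill(now_ns);
        let color = if self.tc >= packet_bytes {
            self.tc -= packet_bytes;
            Color::Green
        } else if self.te >= packet_bytes {
            self.te -= packet_bytes;
            Color::Yellow
        } else {
            Color::Red
        };
        Decision { color, drop: color == Color::Red }
    }
}
```

With the parameters the tests use (rate 1 byte/s, committed burst 100, excess burst 50), a 100-byte packet at `now_ns = 0` is green under this sketch and a following 51-byte packet at the same timestamp is red and dropped, which matches the `green_packets == 1`, `red_packets == 1`, `drop_packets == 1` assertions in `three_color_runtime_ids_and_miss_path_counters_are_stable`.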