CFP-45118: Add shared policy lpm trie proposal. #90
Open
tsotne95 wants to merge 1 commit into cilium:main from tsotne95:shared-policy-lpm
+153 −0
| # CFP-45118: Shared Policy LPM | ||
|
|
||
| **SIG: SIG-Policy, SIG-Datapath, SIG-Scalability** | ||
|
|
||
| **Begin Design Discussion:** 2026-04-01 | ||
|
|
||
| **Cilium Release:** 1.20+ | ||
|
|
||
| **Authors:** Tsotne Chakhvadze <tsotne@google.com> | ||
|
|
||
| **Status:** Proposal | ||
|
|
||
| # Sharing Policy Maps to Save Memory (Without BPF Arena) | ||
|
|
||
| ## Summary | ||
|
|
||
| This proposal introduces a **Shared LPM Trie architecture** to deduplicate network policy rules across endpoints on a single node. | ||
|
|
||
| Currently, Cilium creates a per-endpoint policy map. If multiple pods run the same application profile (e.g. 1,000 replicas of a microservice), Cilium replicates the same rules for each endpoint. This proposal introduces a single, node-wide Shared LPM Trie map in which endpoints share one set of rules, significantly reducing control-plane memory usage. | ||
|
|
||
| This is a practical, immediate alternative to **BPF Arena**, as it works on older kernels and avoids complex pointer management. | ||
|
|
||
| ## Motivation | ||
|
|
||
| As Kubernetes clusters scale to thousands of nodes and tens of thousands of pods, Cilium's per-endpoint policy architecture faces critical scaling bottlenecks: | ||
|
|
||
| 1. **Per-Endpoint Map Exhaustion**: Each endpoint has a fixed-size policy map (typically 16,384 entries). Complex environments with fine-grained segmentation can exceed this limit, causing policy drops or failures to attach. | ||
| 2. **Node Memory Pressure**: Replicating the same rule set across thousands of pods consumes gigabytes of memory per node. This reduces the memory available for user applications and increases infrastructure costs. | ||
| 3. **BPF Arena Blockers**: While BPF Arena could solve this, it requires very new kernels (Linux 6.9+) and introduces pointer-management complexity that slows adoption. | ||
|
|
||
| Most of the memory savings (estimated ~99%) come from **Rule Set Sharing** (Level 1 Deduplication). We can realize these savings immediately using standard, broadly supported BPF primitives (`BPF_MAP_TYPE_LPM_TRIE`) that work on older kernels. This lets us scale to massive clusters without hitting map limits or wasting node RAM. | ||
|
|
||
| ## Goals | ||
|
|
||
| * **Memory Efficiency**: Reduce node memory footprint by sharing rule sets across identical endpoints. | ||
| * **Kernel Compatibility**: Support older kernels without requiring Arena. | ||
| * **Safety**: Update rules without dropping packets and without ever mixing rule sets between unrelated endpoints. | ||
|
|
||
| ## Non-Goals | ||
|
|
||
| * Per-rule fine-grained pointer sharing (Level 2 Deduplication), which remains the domain of future BPF Arena enhancements. | ||
|
|
||
| ## Proposal | ||
|
|
||
| ### Overview | ||
|
|
||
| The design replaces the private per-endpoint policy map with a two-stage lookup mechanism using two BPF maps: | ||
|
|
||
| 1. **Overlay Map** (Hash): Maps an `Endpoint ID` to a `Rule Set ID`. | ||
| 2. **Shared LPM Trie Map** (Global): Maps `Rule Set ID` + packet details to a verdict. | ||
|
|
||
| ```mermaid | ||
| graph LR | ||
| subgraph Traffic ["Packet Flow"] | ||
| direction TB | ||
| EP["Endpoint ID: 42"] | ||
| Pkt["[Identity: 100, Port: 80]"] | ||
| end | ||
|
|
||
| subgraph Overlay ["Step 1: Get Rule Set ID"] | ||
| direction TB | ||
| EP_Key["Lookup Ep 42"] | ||
| EP_Val["Returns: Rule Set 7"] | ||
| end | ||
|
|
||
| subgraph SharedLPM ["Step 2: Evaluate Rules"] | ||
| direction TB | ||
| LPM_Key["Match Set 7 + Identity 100 + Port 80"] | ||
| LPM_Val["Verdict: ALLOW"] | ||
| end | ||
|
|
||
| EP --> EP_Key | ||
| EP_Val --> LPM_Key | ||
| Pkt --> LPM_Key | ||
| LPM_Key --> LPM_Val | ||
| ``` | ||
|
|
||
| ### 1. BPF Maps Structure | ||
|
|
||
| #### The Overlay Map (Endpoint → Rule Set ID) | ||
|
|
||
| ```c | ||
| struct { | ||
| __uint(type, BPF_MAP_TYPE_HASH); | ||
| __type(key, __u16); // Endpoint ID | ||
| __type(value, __u32); // Rule Set ID | ||
| __uint(max_entries, 65535); // Example size | ||
| } cilium_policy_overlay SEC(".maps"); | ||
| ``` | ||
|
|
||
| #### The Shared Rules Map (LPM Trie) | ||
|
|
||
| The key combines the `rule_set_id` with protocol selectors. | ||
|
|
||
| ```c | ||
| struct shared_policy_key { | ||
| __u32 prefixlen; // LPM prefix length | ||
| __u32 rule_set_id; // Scopes the rules to a set | ||
| __u32 sec_label; // Destination Security Identity | ||
| __u8 egress; // Direction | ||
| __u8 protocol; // L4 Proto | ||
| __u16 dport; // Dest Port | ||
| } __attribute__((packed)); | ||
|
|
||
| struct policy_entry { | ||
| __be16 proxy_port; | ||
| __u8 deny; | ||
| __u8 wildcard_protocol; | ||
| __u8 wildcard_dport; | ||
| __u16 auth_type; | ||
| __u8 pad; | ||
| __u8 pad2; | ||
| }; | ||
|
|
||
| struct { | ||
| __uint(type, BPF_MAP_TYPE_LPM_TRIE); | ||
| __type(key, struct shared_policy_key); | ||
| __type(value, struct policy_entry); | ||
| __uint(max_entries, 1000000); // Configurable example | ||
| __uint(map_flags, BPF_F_NO_PREALLOC); | ||
| } cilium_policy_shared SEC(".maps"); | ||
| ``` | ||
|
|
||
| ### 2. Userspace Manager (Go) | ||
|
|
||
| The Go agent owns the rule-set lifecycle: it deduplicates rule sets and programs both BPF maps accordingly. | ||
|
|
||
| * **Finding whether a Rule Set already exists (Zero Collisions)**: | ||
The agent computes a fingerprint (hash) of the rule set. If another endpoint group has the same fingerprint, the agent compares the two rule sets entry-by-entry to confirm they are truly identical before sharing an ID; otherwise it allocates a new ID. This guarantees that two different configurations are never merged by a hash collision. | ||
| * **Updating rules without dropping packets**: | ||
| If a rule changes for a group of endpoints, Go does not edit the live rules directly (which could cause temporary packet drops or security holes while editing). Instead, Go: | ||
| 1. Writes a **complete copy** of the new rule set into the Shared Map under a fresh Rule Set ID. | ||
| 2. Atomically updates the Overlay Map entries to point the endpoints at the new ID, so every packet is evaluated against either the complete old set or the complete new set. | ||
| 3. **Cleans up** (deletes) the old rule set once no endpoints reference it, freeing the memory. | ||
|
|
||
| ## Impacts / Key Questions | ||
|
|
||
| ### Impact: Memory Scale | ||
|
|
||
| Calculated based on 500 rules per application profile across 1000 pods (10 application types total). | ||
|
|
||
| | Approach | Total Map Entries | Est. Node Memory | | ||
| | :--- | :--- | :--- | | ||
| **Legacy (Individual maps)** | 500,000 entries | ≈ 50 MB | ||
| **Proposed (Shared LPM)** | 5,000 entries | < 1 MB | ||
|
|
||
| ### Key Question: Map Limitations | ||
The `BPF_F_NO_PREALLOC` flag is vital here: memory is committed only on demand, so `max_entries` can be set high (e.g., millions) without reserving that memory upfront.
|
|
||
| ## Future Milestones | ||
|
|
||
| ### BPF Arena Consolidation | ||
| In the future we can transition to Level 2 Deduplication via BPF Arena without breaking the API surface between Go and BPF. | ||
You might be able to eliminate this map. Instead, you could provide file descriptors to the endpoint BPF programs.
It's not a big deal to rebuild an endpoint if it transitions from shared to shared+overlay maps.