`tetragon/CFP-4191-per-workload-policy.md` (18 changes: 13 additions & 5 deletions)
@@ -12,17 +12,25 @@

## Summary

-Today, with Tetragon it is not possible to share a common enforcement logic across many Kubernetes workloads. For each workload users need to create a separate `TracingPolicy` with the same enforcement but with different values.
+Today, applying Tetragon policies to different Kubernetes (K8s) workloads requires a separate policy per workload (using the `podSelector` and `containerSelector` fields), and each policy loads its own eBPF program. In each eBPF program, the eBPF code checks whether the executing process is subject to the corresponding policy. This approach introduces scalability limitations both from the UX perspective and from the eBPF datapath perspective. The goal of this CFP is to explore approaches that address these limitations and to propose a concrete solution.

This topic was initially discussed in https://github.com/cilium/tetragon/issues/4191.

## Motivation

-The current approach of one `TracingPolicy` per workload leads to 2 main problems:
+Let's consider the following use case: creating per-workload policies that specify the subset of binaries each workload is allowed to execute.

+The current approach of one `TracingPolicy` per workload and a separate eBPF program per policy leads to two problems, which we discuss separately.

+### Scalability limitations of the eBPF implementation (P1)

+The current eBPF implementation (one program + many maps per `TracingPolicy`) scales poorly in clusters with many per-workload policies. First, certain hook types have an attachment limit of `BPF_MAX_TRAMP_LINKS` ([38 on x86](https://elixir.bootlin.com/linux/v6.14.11/source/include/linux/bpf.h#L1138)), which means that we cannot load more than 38 policies on the same hook of such a type. Second, each loaded program [checks](https://github.com/cilium/tetragon/blob/fdd7f014e4172d09f4fcc250f8a5790e764428f8/bpf/process/policy_filter.h#L51-L54) whether its policy applies to the given workload. This wastes CPU cycles, especially when processes match only a small subset of the existing policies (see the sketch below).
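To make the second point concrete, the per-program check has roughly the following shape. This is a simplified sketch loosely modeled on the linked `policy_filter.h`; the map layout, names, and sizes are illustrative rather than the exact Tetragon source.

```c
// Simplified sketch of the per-policy filter preamble (illustrative names
// and sizes, loosely modeled on bpf/process/policy_filter.h). Every loaded
// policy program executes a check like this, so a process that matches only
// a few policies still pays the cost once per loaded program.
#include <stdbool.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* inner map: set of cgroup ids that one policy applies to */
struct cgroup_set {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 32768);
	__type(key, __u64);  /* cgroup id */
	__type(value, __u8);
};

/* outer map: policy id -> set of cgroup ids */
struct {
	__uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
	__uint(max_entries, 128);
	__type(key, __u64);  /* policy id */
	__array(values, struct cgroup_set);
} policy_filter_maps SEC(".maps");

static __always_inline bool policy_applies(__u64 policy_id)
{
	void *cgroups = bpf_map_lookup_elem(&policy_filter_maps, &policy_id);

	if (!cgroups)
		return false;

	__u64 cgid = bpf_get_current_cgroup_id();

	/* the policy applies only if the current cgroup is tracked for it */
	return bpf_map_lookup_elem(cgroups, &cgid) != NULL;
}
```

With N loaded per-workload policies on a hook, a single event can fire up to N such programs, each paying this lookup even when the process matches none of them.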

-- P1 (Scaling): The current eBPF implementation (one program + many maps per `TracingPolicy`) scales poorly in clusters with many per-workload policies (attachment limits, redundant evaluations, memory growth).
-(Note(kkourt): is there an argument to be made for reducing memory footprint on the eBPF maps?)
**Owner:**

I would say yes. Memory is the main blocker we have right now.
Even with all the optimizations we have opened upstream, on nodes with many CPUs (e.g., 96), there are some per-CPU maps (e.g., `process_call_heap`, `string_maps_heap`, `data_heap`) that bring the memory usage to something like 9 MB per policy. This is not ideal for our use case, where we want to create a `TracingPolicy` for each container inside each Pod.
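For a rough sense of scale (illustrative numbers, not the actual map sizes): a single 32 KB per-CPU entry costs 96 CPUs × 32 KB ≈ 3 MB on such a node, so three per-CPU heaps of that size per policy already account for roughly the 9 MB figure above.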

**Author (@kkourt), Dec 19, 2025:**

I see.

In my mind, the approach to address the scalability issues of having one program per policy is to have a single program per hook (for all policies) and have them access different state (bpf maps). To reduce memory footprint, we need to take this approach one step further and have the different policies share maps (somehow).

Is that the general idea?

**Owner:**

> In my mind, the approach to address the scalability issues of having one program per policy is to have a single program per hook (for all policies) and have them access different state (bpf maps).

Yep, that's exactly what we ended up doing. We have a little agent that does exactly this. Of course, it is easier in our case, since we hook a single point (`security_bprm_creds_for_exec`) and support just two operators (`Equal`, `In`).

**Author (@kkourt):**

But this does not address the memory footprint issue, correct? Both approaches (one hooked program per policy, and one program per hook with different maps) use the same amount of memory in BPF maps. Or am I missing something?

**Owner:**

To solve the memory footprint issue, we used a strategy very similar to what we did in the POC cilium/tetragon#4279.
We have a unique eBPF program with 2 maps:

1. `hash_map` (key: cgroupID, value: policyID)
2. hash of maps (key: policyID, value: `hash_map` (key: string, value: 0/1)); this is the hashset of values for each policy.

From the cgroup ID, we find the associated policy, and then we check whether the current binary is present in the policy's hashset.

It is probably possible to achieve the same memory footprint with both approaches (one program per hook vs. one hooked program per policy). We chose the one-program-per-hook approach because it is enough for us and lets us use a single eBPF program for all the policies.
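For concreteness, a minimal eBPF C sketch of this two-map layout (the map names, sizes, and the fixed-size binary-path key are assumptions for illustration, not the actual implementation):

```c
// Sketch of the two-map design described above (illustrative names and
// sizes): one shared program resolves cgroup -> policy, then checks
// whether the executing binary is in that policy's hashset.
#include <stdbool.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define BIN_PATH_LEN 256

/* map 1: cgroup id -> policy id (assumes one policy per cgroup) */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, __u64);   /* cgroup id */
	__type(value, __u32); /* policy id */
} cgroup_policy SEC(".maps");

/* inner map: hashset of binaries allowed by one policy */
struct binary_set {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__uint(key_size, BIN_PATH_LEN); /* zero-padded, fixed-size path */
	__uint(value_size, sizeof(__u8));
};

/* map 2: policy id -> hashset of allowed binaries */
struct {
	__uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
	__uint(max_entries, 1024);
	__type(key, __u32); /* policy id */
	__array(values, struct binary_set);
} policy_binaries SEC(".maps");

/* caller provides the binary path in a zero-padded BIN_PATH_LEN buffer */
static __always_inline bool exec_allowed(const char path[BIN_PATH_LEN])
{
	__u64 cgid = bpf_get_current_cgroup_id();
	__u32 *policy = bpf_map_lookup_elem(&cgroup_policy, &cgid);

	if (!policy)
		return true; /* no policy attached to this cgroup */

	void *bins = bpf_map_lookup_elem(&policy_binaries, policy);

	if (!bins)
		return true;

	/* allow the exec only if the binary is in the policy's hashset */
	return bpf_map_lookup_elem(bins, path) != NULL;
}
```

Here the program is shared across all policies, so adding a policy adds only map entries, not another attached program.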

**Author (@kkourt):**

OK, but this does not work for cases where policies match multiple workloads (which is a common enough use case that we cannot exclude it). How would the above work in the generic case?

**Author (@kkourt):**

Moreover, I can see how we save memory with:

> `hash_map` (key: cgroupID, value: policyID)

But it's not clear to me how we save memory with:

> hash of maps (key: policyID, value: `hash_map` (key: string, value: 0/1))

Isn't the above the same in terms of memory footprint as what we have now (where we hold one map per policy)? Or am I missing something?

**Owner:**

> OK, but this does not work for cases where policies match multiple workloads

That's true, it doesn't cover this case. In our use case, a cgroup can be associated with one and only one policy. This cannot work with today's generic Tetragon `TracingPolicy` concept, unless we want to introduce a specific policy type that enforces this constraint by default.

> Isn't the above the same in terms of memory footprint as what we have now (where we hold one map per policy)?

You are right, I should correct my previous statement:

> To solve the memory footprint issue, we used a strategy very similar to what we did in the POC cilium/tetragon#4279. We have a unique eBPF program with 2 maps: ...

What we actually did to solve the memory issue was to get rid of all the maps we don't need for our use case, ending up with just the 2 maps reported above. I listed the most memory-consuming maps in cilium/tetragon#4191 (comment). So yes, the memory saving doesn't come from that map layout, but from not using all the other maps that are unnecessary.


-- P2 (UX / Composition): Crafting many near-identical `TracingPolicy` resources with only small differences (e.g. filter values) is operationally cumbersome and error prone.
+### Ergonomic (UX / Composition) Challenges (P2)

-In this document we propose to tackle scaling (P1) and UX (P2) separately.
+Crafting many near-identical `TracingPolicy` resources with only small differences (e.g. filter values) is operationally cumbersome, error prone, and makes the policies hard to maintain over time.

## Goals
