btf: replace global BTF cache with user provided cache #1988
Force-pushed 7f48e14 → dfebdc7
ti-mo left a comment
Thanks! Left some comments. Looks like there's still a failure in the selftests.
Force-pushed d4513a2 → fa31975
@ti-mo looks like the CI passes now
Force-pushed fa31975 → 21aba8b
@ti-mo can you check my proposal on the module BTF?
…tives (#786)

Temporarily points two dependencies to forks that implement pending upstream PRs, allowing CI to build a testable image before the PRs land.

inspektor-gadget → matthyx/inspektor-gadget @ fd383d3
- Release collectionSpec after eBPF load (~14 MiB heap) #5461
- Release oras target after load (~27 MiB RssFile) #5462
- Release program bytes after load (~6 MiB heap) #5466
- Reduce trace_capabilities current_syscall map 1M→10K (~43 MiB kernel) #5468

Upstream: inspektor-gadget/inspektor-gadget

cilium/ebpf → matthyx/ebpf @ 8a32d06
- Weak-pointer kernel BTF global cache (~17 MiB heap auto-freed)
- Invalidate module cache on kernel spec reload (fixes kmod BTF errors)

Upstream PR: cilium/ebpf#1988

These replace directives should be removed once the upstream PRs are merged and released.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@matthyx I've given this a lot of thought, but something tells me this isn't the right approach. Turning these into weak pointers completely defeats their purpose as a long-lived cache, since there's no guarantee whatsoever that the global cache will outlive two sequential calls to LoadCollectionSpec, at which point we may as well remove it, since btf.Cache handles intra-load caching now. I understand the difficulty in finding a good point to call Flush(); I assume you do this after every collection load to keep consumption down.

Though I have one question: what does that 20 MiB of heap contain? We have lazy decoding now, so we should only (mostly) inflate types that get returned to the caller. I don't know your use case, but 20 MiB feels like a lot. Are you counting heap or RSS?

It's worth noting that a few things have changed since the global cache was introduced. Originally, it was intended to amortize the cost of parsing BTF across multiple programs within a single LoadCollectionSpec, but the fact that it persisted across multiple collection loads was a nice bonus. Now that decoding is lazy and mmap-backed, a lot of that cost is gone. Conversely, I find the lower-level logic around how the caches interact almost impossible to reason about today. With btf.Cache in the picture, it may be time to sunset the global cache altogether and allow the caller to pass one in through CollectionOptions.
Thanks, this makes sense. You're right that weak pointers turn the global cache from a deterministic long-lived cache into a best-effort memoization layer, which is probably the wrong trade-off here. Given that we already use a btf.Cache during collection load, surfacing that through CollectionOptions seems like the cleaner direction. That lets callers choose the cache lifetime explicitly instead of the library trying to guess a safe flush point.

The memory saving is heap, and closer to 17 MiB than 20; you can see some info here: kubescape/node-agent#786 (comment)

Should I close this?
Hi, are you generating these replies using an LLM? If you're up for making the CollectionOptions change, feel free to reuse this PR for that.
@ti-mo yes, I use one to turn bullet points into full sentences, as I'm not a native speaker of English. Sorry if that offended you.
Force-pushed 21aba8b → 131f528
@ti-mo something like that?
dylandreimerink left a comment
I think this updated PR is nice, having less memory overhead by default, and allowing power users to enable caching between collections.
ti-mo left a comment
Thanks! This cleans up nicely. Left a few nits, and we should probably make btf.Cache concurrency-safe.
```go
//
// Deprecated: kernel BTF is no longer cached at the package level.
// Discard your [Cache] instances to release cached BTF instead.
func FlushKernelSpec() {}
```
We should simply remove this as a call to action for anyone who depended on this behaviour. For anyone relying on package-level caching, Flush() becoming a no-op is already sort of a breaking change. We may as well make it explicit.
```go
	return loadKernelModuleSpec(module, base)
}

func loadKernelSpec() (*Spec, error) {
```
Nit: move this above LoadKernelModuleSpec and export instead?
```go
	MapReplacements map[string]*Map

	// Cache amortises the cost of decoding kernel BTF across multiple
	// collection loads. Callers loading more than one collection should
```
Nit:
```diff
-	// collection loads. Callers loading more than one collection should
+	// Collection loads. Callers loading more than one Collection should
```
```go
	//
	// Opportunistically reuses a global cache if possible.
	// Pass it to [CollectionOptions] to share kernel BTF across multiple
	// collection loads.
```
Nit:
```diff
-	// collection loads.
+	// Collection loads.
```
```go
	// allocate a Cache via [btf.NewCache] and share it across loads.
	// If nil, a fresh cache is allocated per load and discarded.
	//
	// The same Cache must not be used concurrently.
```
This looks straightforward enough to address, and it would be a rather inconvenient limitation to leave in. Could you pick this up as well?
Force-pushed 131f528 → 3ddbf85
@ti-mo updated
…ionOptions

Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Force-pushed 3ddbf85 → 6bbb73e
Problem
`globalCache` in `btf/kernel.go` holds strong `*Spec` pointers to the kernel and module BTF specs. Once loaded, these ~20 MiB of type information are never released, even after all eBPF programs have been loaded and no `CollectionSpec` retains a reference to them. `FlushKernelSpec()` exists as a manual escape hatch, but it requires callers to know when it is safe to call — which is inherently racy in a library.

Solution
Change `globalCache.kernel` and `globalCache.modules` from strong `*Spec` pointers to `weak.Pointer[Spec]` (Go 1.24+). The GC will collect the cached specs automatically once no `CollectionSpec` (or other caller) holds a strong reference to them.

Key behaviours preserved:
- `/sys/kernel/btf/vmlinux` is parsed once and shared across multiple `LoadKernelSpec` calls that overlap in time.
- If the cached spec has been collected, the next `LoadKernelSpec` call transparently re-parses it.
- `FlushKernelSpec()`: still works; zeroing a `weak.Pointer` is equivalent to the previous `= nil`. It is now deprecated since explicit flushing is no longer needed.
- `NewCache()`: handles the case where the weak pointer has already been collected, falling back to an empty `Cache` (which then loads on demand via `Kernel()`).

`Cache` is now self-contained

Previously, `Cache.Module` re-used the global module cache and then rebased the result against the Cache's own kernel decoder. That created a hidden invariant: the Cache's kernel decoder had to share its raw mmap'd buffer with whatever kernel spec the global cache currently held — otherwise `rebaseDecoder` would fail with `"raw BTF differs"`.

Under strong references this invariant held permanently. Under weak references it could break: if the globally cached kernel spec were GC'd between `Cache.Kernel()` and `Cache.Module()`, the next module load would parse a fresh kernel mmap and the rebase would blow up. `Cache.Module` now loads module BTF directly against the Cache's own kernel via `loadKernelModuleSpec(name, c.kernelTypes)`. The Cache no longer depends on the global cache's state for correctness, and no `kernelSource` anchor field is needed. Memory reclaim is unchanged: the mmap'd vmlinux buffer is freed (`munmap` cleanup on `sharedBuf`) once the last `Cache` (and any other strong holder) is dropped.

The trade-off is that a module loaded via `Cache.Module` is no longer re-cached in the global module cache, so a subsequent `LoadKernelModuleSpec` for the same module reparses sysfs. Module BTF is small (kilobytes), and this only matters if `Cache` and the package-level loaders are mixed for the same modules.

Memory impact
In a long-running process that loads a handful of eBPF programs at startup and then runs indefinitely, this frees ~20 MiB of Go heap automatically — without any caller changes.
Measured in a Kubernetes node-agent workload (inspektor-gadget) where kernel BTF was the dominant post-load heap consumer.
Go version requirement
`weak.Pointer` was stabilised in Go 1.24. The module currently requires Go 1.23; this PR bumps the minimum to Go 1.24.

Testing
Existing `TestLoadKernelSpec` and `TestCache` pass. Cache correctness is unchanged — the weak pointer is semantically transparent to callers as long as they hold a strong reference (via a returned `*Spec`) across the critical section.