fix: cut reprovide peak memory with key count by lidel · Pull Request #1259 · libp2p/go-libp2p-kad-dht

lidel · 2026-06-07T20:49:06Z

Problem

On nodes with large keystores, reprovide cycles spike memory hard enough to get the process OOM-killed.

There are several sources; this PR fixes one of them, in batchReprovide, which, iiuc:

- loaded every multihash in a region with keystore.Get just to learn how many keys there were and make routing decisions
- then, after exploring the swarm, loaded the (often wider) covered region a second time, all while still holding the first slice.

So a single reprovide could hold the whole region in memory twice and keep it resident across a network round trip. The bigger the keystore, the bigger the spike.

Fix

Unsure if this is the best way, but the idea is to make counting its own operation, to avoid keeping a big region in memory twice:

- Add Keystore.KeyCount(prefix), backed by a keys-only datastore query, so the count comes back with bounded memory and without materialising a single multihash.
- Reorder batchReprovide to count first and load the region exactly once, after the covered prefix is known, so the large slice is never held across swarm exploration and never loaded twice.
- Balance the ongoing-reprovides key gauge on the load-error paths, which previously leaked the started count.

So the intention for this PR is to take peak memory per reprovide from two region loads down to one.
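For reviewers who want the ordering spelled out, here is a minimal sketch of "count first, load once". It is not the code in this PR: the `keystore` type, its bit-string keys, the `loads` counter, `exploreFunc`, and `batchReprovideSketch` are all hypothetical stand-ins, and the in-memory walk in `KeyCount` only imitates what a keys-only datastore query would do.

```go
package main

import (
	"fmt"
	"strings"
)

// keystore is a hypothetical in-memory stand-in for the real keystore.
// Keys are bit strings so that prefix matching stays readable.
type keystore struct {
	keys  []string
	loads int // number of times Get materialised a region
}

// Get materialises every key under prefix: this is the expensive call.
func (k *keystore) Get(prefix string) []string {
	k.loads++
	var out []string
	for _, key := range k.keys {
		if strings.HasPrefix(key, prefix) {
			out = append(out, key)
		}
	}
	return out
}

// KeyCount walks the keys and only increments a counter, so no slice of
// keys is ever built. It stands in for a keys-only datastore query.
func (k *keystore) KeyCount(prefix string) int {
	n := 0
	for _, key := range k.keys {
		if strings.HasPrefix(key, prefix) {
			n++
		}
	}
	return n
}

// exploreFunc is a hypothetical stand-in for swarm exploration. It returns
// the prefix that was actually covered, which may be wider (shorter).
type exploreFunc func(prefix string, keyCount int) string

// batchReprovideSketch counts first, explores, then loads exactly once.
func batchReprovideSketch(ks *keystore, prefix string, explore exploreFunc) []string {
	n := ks.KeyCount(prefix) // cheap: nothing large is allocated
	if n == 0 {
		return nil
	}
	covered := explore(prefix, n) // network round trip with no region held
	return ks.Get(covered)        // single load, after the covered prefix is known
}

func main() {
	ks := &keystore{keys: []string{"0010", "0011", "0100", "0111", "1000"}}
	// Pretend exploration widens the region by one bit.
	widen := func(p string, _ int) string { return p[:len(p)-1] }
	keys := batchReprovideSketch(ks, "010", widen)
	fmt.Println("keys loaded:", keys, "region loads:", ks.loads)
}
```

The point of the sketch is only the order of calls: the region slice exists after exploration returns, never before it.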

Note

@guillaumemichel This is low priority, so fine to park it until you have spare time to review. Feels like a safe and easy win, but lmk if I missed anything.

## Problem

batchReprovide loaded every multihash in a region with keystore.Get just to learn the key count and make routing decisions, then loaded the (often wider) covered region a second time after swarm exploration, holding the first slice across that exploration. A region can hold a huge number of keys, so one reprovide could materialise the whole region in memory twice and keep it resident across a network round trip. On nodes with large keystores this is one of the allocation spikes behind reprovide OOM kills.

## Fix

- Add Keystore.KeyCount(prefix), backed by a keys-only datastore query, so the count is obtained with bounded memory and no multihash is materialised.
- Reorder batchReprovide to count first, then load the region exactly once after the covered prefix is known, so the large slice is never held across swarm exploration and never loaded twice.
- Balance the ongoing-reprovides key gauge on the load-error paths, which previously leaked the started count.

Peak memory per reprovide drops from two region loads to one. A single region that is still very large after exploration is loaded whole; bounding that by splitting it into chunks builds on KeyCount and is left as a follow-up.

lidel requested a review from guillaumemichel June 7, 2026 20:49

lidel mentioned this pull request Jun 7, 2026

Release 0.43 ipfs/kubo#11298

Open

11 tasks

lidel wants to merge 1 commit into master from fix/reprovide-keystore-streaming
