fix: cut reprovide peak memory with key count#1259
Open
lidel wants to merge 1 commit into
Open
Conversation
## Problem batchReprovide loaded every multihash in a region with keystore.Get just to learn the key count and make routing decisions, then loaded the (often wider) covered region a second time after swarm exploration, holding the first slice across that exploration. A region can hold a huge number of keys, so one reprovide could materialise the whole region in memory twice and keep it resident across a network round trip. On nodes with large keystores this is one of the allocation spikes behind reprovide OOM kills. ## Fix - Add Keystore.KeyCount(prefix), backed by a keys-only datastore query, so the count is obtained with bounded memory and no multihash is materialised. - Reorder batchReprovide to count first, then load the region exactly once after the covered prefix is known, so the large slice is never held across swarm exploration and never loaded twice. - Balance the ongoing-reprovides key gauge on the load-error paths, which previously leaked the started count. Peak memory per reprovide drops from two region loads to one. A single region that is still very large after exploration is loaded whole; bounding that by splitting it into chunks builds on KeyCount and is left as a follow-up.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
On nodes with large keystores, reprovide cycles spike memory hard enough to get the process OOM-killed.
There are many sources, this PR fixes one of them, located in
batchReprovide, which iiuckeystore.Getjust to learn how many keys there were and make routing decisionsSo a single reprovide could hold the whole region in memory twice and keep it resident across a network round trip. The bigger the keystore, the bigger the spike.
Fix
Unsure if this is the best way, but idea is to make count own op to avoid keeping big region in memory twice:
Keystore.KeyCount(prefix), backed by a keys-only datastore query, so the count comes back with bounded memory and without materialising a single multihash.batchReprovideto count first and load the region exactly once, after the covered prefix is known, so the large slice is never held across swarm exploration and never loaded twice.So the intention for this PR is to take peak memory per reprovide from two region loads down to one.
Note
@guillaumemichel This is low priority, so fine to park it until you have spare time to review. Feels like safe and easy win, but lmk if I missed anything.