Add small-stack feature to shrink Raw block stack buffer #103

zone117x wants to merge 1 commit into KillingSpark:master
The `Raw` block path in `BlockDecoder::decode_block_content` uses a 128 KiB stack buffer to batch-read raw block content. On deeply embedded targets — for example Xtensa ESP32 tasks, whose executor stacks are typically around 40 KiB — this single array blows the task stack the moment a stream contains a Raw block, regardless of how large the block actually is (LLVM reserves the frame at function entry). Add a `small-stack` cargo feature, off by default, that shrinks `BATCH_SIZE` from `128 * 1024` to `1024`. The surrounding loop already handles arbitrarily-sized blocks, so the only cost is more read/push iterations for Raw-heavy streams on tiny-stack platforms that opt in. Default behavior is unchanged. All existing tests pass with both feature states.
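The change described above can be sketched roughly as follows. This is a stand-in for the crate's actual Raw-block copy loop, not its real code; only the `BATCH_SIZE` values and the `small-stack` feature name come from the PR:

```rust
use std::io::Read;

// cfg-gated constant as described in the PR: 128 KiB by default,
// 1 KiB when the opt-in `small-stack` feature is enabled.
#[cfg(not(feature = "small-stack"))]
const BATCH_SIZE: usize = 128 * 1024;
#[cfg(feature = "small-stack")]
const BATCH_SIZE: usize = 1024;

// Illustrative copy loop: one fixed-size stack buffer, arbitrarily large
// blocks handled by iterating. Shrinking BATCH_SIZE only adds iterations.
fn copy_raw_block(mut src: impl Read, len: usize, out: &mut Vec<u8>) -> std::io::Result<()> {
    let mut buf = [0u8; BATCH_SIZE];
    let mut remaining = len;
    while remaining > 0 {
        let want = remaining.min(BATCH_SIZE);
        let n = src.read(&mut buf[..want])?;
        if n == 0 {
            break; // source ended early
        }
        out.extend_from_slice(&buf[..n]);
        remaining -= n;
    }
    Ok(())
}
```

Because `BATCH_SIZE` is a `const`, LLVM sizes the function's stack frame at entry, which is why the large buffer hurts even when the stream's Raw blocks are tiny.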
---
Hey, thanks for the PR and the detailed description. I think this is a good way to solve this specific issue. I'll think about how I want this feature flag named; might take me another few days, very busy right now. I think this kind of issue could arise at a few other places that should be sensibly handled by the same feature.
---
So I've given this some thought and remembered where I encountered a similar issue: the crc-rs crate. It's not my crate, but I contributed a while back. Bigger lookup tables resulted in way better checksumming but, similar to here, weren't suitable for all targets. The solution back then was also to introduce various features to support the different table sizes. That was a pain, and looking at it now the project seems to have ditched this again, going for const generics with sane defaults. Which is what I think this project should do as well.

All the internal buffer sizes could become const generic parameters. The FrameDecoder could then be a newtype wrapper around this generic decoder with the current defaults. Maybe a second newtype wrapper for small-stack environments could be provided, but tbh the tradeoffs are probably very usecase specific, so it's probably better to point people to the generic decoder. Which is also another point pro const generics and con feature flag: being able to scale the buffer sizes optimally for your usecase to get the best throughput seems like a good idea, because platforms out there are pretty varied.

Do you have any thoughts on this? I think it's worth it to increase the API surface in this way, as long as it is only increased for users with the special usecase. I think it's possible to do this without breaking changes.

Or, alternatively, could you elaborate on option 4? I'm not sure how that would work without an intermediate buffer. But maybe I'm just not seeing something.
---
That all makes sense to me. I'll look into const generics. A possible ideal API would let the caller provide the memory, so different platforms could control whether a buffer lives on the stack or the heap (that is at least relevant on the platforms I'm working with). I'll do some testing to see what works out. Ideally no breaking changes, but it sounds like if there were a dramatic improvement to the API, you might be okay with breaking changes?
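A caller-provided-memory API of the kind mentioned here might look roughly like the following. This is purely a hypothetical shape, not ruzstd's actual interface; all the names are illustrative:

```rust
// Hypothetical decoder that borrows its scratch buffer from the caller,
// so the call site decides whether the memory lives on the stack or heap.
struct ScratchDecoder<'a> {
    scratch: &'a mut [u8],
}

impl<'a> ScratchDecoder<'a> {
    fn new(scratch: &'a mut [u8]) -> Self {
        Self { scratch }
    }

    // The effective batch size is just however much memory the caller gave us.
    fn batch_capacity(&self) -> usize {
        self.scratch.len()
    }
}
```

Usage would then differ only at the call site: an embedded task passes `&mut [0u8; 1024]`-style stack storage, while a desktop caller can pass a large heap `Vec`.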
---
If there is a big improvement to be had by breaking the API I might be open to it, but I don't see why it would be necessary. We should be able to provide a FrameDecoder that is still compatible with the current one by choosing the current defaults as the generic parameters. I'm thinking something like a type alias over the generic decoder. Using a newtype wrapper would instead mean duplicating each impl block and forwarding each call to self.0, which would be annoying, and I'm not sure if it's necessary for some compatibility reason.
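The type-alias shape being discussed might look roughly like this. Names, fields, and defaults are assumptions for illustration, not the crate's real internals:

```rust
// Generic decoder: the buffer size is a const generic parameter.
pub struct GenericFrameDecoder<const BATCH_SIZE: usize> {
    scratch: [u8; BATCH_SIZE],
    // ...rest of the decoder state...
}

impl<const BATCH_SIZE: usize> GenericFrameDecoder<BATCH_SIZE> {
    pub fn new() -> Self {
        Self { scratch: [0; BATCH_SIZE] }
    }

    pub fn batch_size(&self) -> usize {
        BATCH_SIZE
    }
}

// Backwards-compatible default: a type alias keeps today's 128 KiB behavior
// without duplicating every impl block the way a newtype wrapper would.
pub type FrameDecoder = GenericFrameDecoder<{ 128 * 1024 }>;
```

With the alias, existing `FrameDecoder` users see no change, while small-stack users can instantiate `GenericFrameDecoder<1024>` directly.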
---
Sorry for the back and forth on my part; I just noticed something important.

For some reason I assumed that this path actually writes directly to the output, which it doesn't. That allows for way better solutions without these intermediate buffers: one can just reserve the required space in the ringbuffer and use that to either read the raw data into it directly or use the modern `Read` APIs.

If you want to try your hand at messing with the internal buffers I'm willing to give you guidance, but I can also just do it in the upcoming days. Let me know what works for you.

For context: the DecodeBuffer has a RingBuffer, and the RingBuffer is basically a VecDeque with the addition of an `extend_from_within` implementation. It has a function for reserving space; after calling it, the raw block's bytes can be written straight into the reserved region.
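The reserve-then-read idea can be sketched with a plain `Vec` standing in for the crate's RingBuffer (the function name `read_raw_into` is made up for this sketch):

```rust
use std::io::Read;

// Reserve the raw block's space in the output up front, then read the bytes
// straight into that region: no intermediate stack buffer at all.
fn read_raw_into(out: &mut Vec<u8>, src: &mut impl Read, len: usize) -> std::io::Result<()> {
    let start = out.len();
    out.resize(start + len, 0);         // reserve space for the whole block
    src.read_exact(&mut out[start..])?; // fill the reserved region directly
    Ok(())
}
```

A real ring buffer complicates this slightly because the reserved region can wrap around and be split into two slices, but the principle is the same.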
---
Could you try running your project with the new commits from today by pointing the dependency to this GitHub repo? It should no longer require excessive stack space.
### Changes

- Add the `small-stack` feature with a comment explaining its purpose and tradeoff.
- Select `BATCH_SIZE` based on `cfg(feature = "small-stack")`.

### Compatibility

Default behavior is unchanged.

### Tests
- `cargo test -p ruzstd` (default features)
- `cargo test -p ruzstd --features small-stack`
- `cargo build -p ruzstd --no-default-features --features small-stack` to confirm the feature composes with `no_std` builds
- Ran with `small-stack` enabled to confirm correctness of the smaller batching loop

The cargo feature approach seemed least invasive; however, I'm happy to rework this. Here are some alternative approaches I considered:
1. **Heap-allocate the buffer instead of using the stack.** Replace the `[u8; BATCH_SIZE]` array with a `Vec<u8>` (or `Box<[u8]>`) when `alloc`/`std` is available. This sidesteps the stack-frame issue entirely without a new feature flag, but it costs an allocation per Raw block and isn't viable for `no_std` + no-`alloc` users.
2. **Shrink the default unconditionally.** Use 1 KiB (or some middle ground like 4-8 KiB) as the default for everyone. Simpler (no feature flag, no cfg) but a throughput regression for the common desktop/server case on Raw-heavy streams, which is why I went with opt-in.
3. **Make `BATCH_SIZE` a const generic / associated const on a trait.** Lets callers pick their own size without a cargo feature. More flexible but a bigger API-surface change, and probably overkill for a single internal buffer.
4. **Read directly into the output sink in smaller chunks without any intermediate buffer.** Would require reshaping the loop around the `Read` + sink APIs; I didn't want to touch that without a signal from you that it's welcome.
5. **Different feature name.** `small-stack` felt descriptive, but `embedded`, `tiny-stack`, or `no-large-stack-buffers` would all work. Let me know if you have a preferred convention.
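For reference, alternative 1 (heap-allocating the batch buffer) could be sketched like this; the function is illustrative, not the crate's code:

```rust
use std::io::Read;

// Alternative 1: heap-allocate the batch buffer (requires `alloc`/`std`).
// Sized to min(len, 128 KiB) so small blocks don't over-allocate, at the
// cost of one allocation per Raw block.
fn copy_raw_block_heap(mut src: impl Read, len: usize, out: &mut Vec<u8>) -> std::io::Result<()> {
    let mut buf = vec![0u8; len.min(128 * 1024)];
    let mut remaining = len;
    while remaining > 0 {
        let want = remaining.min(buf.len());
        let n = src.read(&mut buf[..want])?;
        if n == 0 {
            break; // source ended early
        }
        out.extend_from_slice(&buf[..n]);
        remaining -= n;
    }
    Ok(())
}
```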