Add small-stack feature to shrink Raw block stack buffer#103

Open
zone117x wants to merge 1 commit into KillingSpark:master from zone117x:master

Conversation


@zone117x zone117x commented Apr 12, 2026

Add small-stack feature to shrink Raw block stack buffer

Summary

  • The `Raw` block path in `BlockDecoder::decode_block_content` allocates a 128 KiB buffer on the stack to batch-read raw block content. LLVM reserves that frame at function entry, so any stream containing a `Raw` block blows the stack the moment decoding hits that branch, regardless of the block's actual size.
  • On deeply embedded targets this is fatal. Xtensa ESP32 tasks, for example, typically run with ~40 KiB executor stacks, and a 128 KiB frame overflows immediately.
  • Adds a new `small-stack` cargo feature (off by default) that shrinks `BATCH_SIZE` from `128 * 1024` to `1024`. The surrounding loop already handles arbitrarily sized blocks, so the only cost is extra read/push iterations for Raw-heavy streams on platforms that opt in.
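The mechanics are small; a sketch of what the feature gate looks like (`BATCH_SIZE` is the constant named above, but its exact placement in the crate is assumed here):

```rust
// Sketch of the feature-gated constant described in the PR. The exact
// module it lives in is an assumption for illustration.
#[cfg(not(feature = "small-stack"))]
const BATCH_SIZE: usize = 128 * 1024; // default: large batches, large stack frame

#[cfg(feature = "small-stack")]
const BATCH_SIZE: usize = 1024; // opt-in: fits ~40 KiB embedded task stacks

fn main() {
    // With default features the buffer stays at 128 KiB.
    println!("BATCH_SIZE = {BATCH_SIZE}");
}
```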

Changes

Compatibility

  • Default behavior is unchanged - the feature is opt-in and the constant stays at 128 KiB unless explicitly enabled.
  • No public API changes.

Tests

  • cargo test -p ruzstd (default features)
  • cargo test -p ruzstd --features small-stack
  • cargo build -p ruzstd --no-default-features --features small-stack to confirm the feature composes with no_std builds
  • Spot-check a Raw-block-heavy fixture with small-stack enabled to confirm correctness of the smaller batching loop

The cargo feature approach seemed least invasive; however, I'm happy to rework this. Here are some alternative approaches I considered:

  1. Heap-allocate the buffer instead of stack. Replace the [u8; BATCH_SIZE] array with a Vec<u8> (or Box<[u8]>) when alloc/std is available. This sidesteps the stack-frame issue entirely without a new feature flag, but it costs an allocation per Raw block and isn't viable for no_std + no-alloc users.

  2. Shrink the default unconditionally. Use 1 KiB (or some middle ground like 4-8 KiB) as the default for everyone. Simpler (no feature flag, no cfg), but it's a throughput regression for the common desktop/server case on Raw-heavy streams, which is why I went with opt-in.

  3. Make BATCH_SIZE a generic const / associated const on a trait. Lets callers pick their own size without a cargo feature. More flexible but a bigger API surface change, and probably overkill for a single internal buffer.

  4. Read directly into the output sink in smaller chunks without any intermediate buffer. Would require reshaping the loop around the Read + sink APIs; I didn't want to touch that without a signal from you that it's welcome.

  5. Different feature name. small-stack felt descriptive, but embedded, tiny-stack, or no-large-stack-buffers would all work. Let me know if you have a preferred convention.
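Of these, alternative 1 is the easiest to sketch. The following is hypothetical code, not the actual ruzstd loop; `copy_raw_block` and its signature are stand-ins for the real decode path:

```rust
use std::io::Read;

// Sketch of alternative 1: one heap allocation per Raw block instead of a
// fixed stack array. `copy_raw_block` is a hypothetical stand-in for the
// real loop in `BlockDecoder::decode_block_content`.
fn copy_raw_block(
    mut source: impl Read,
    block_size: usize,
    sink: &mut Vec<u8>,
) -> std::io::Result<()> {
    const BATCH_SIZE: usize = 128 * 1024;
    // The buffer lives on the heap, so 128 KiB no longer sits in the
    // function's stack frame; the cost is an allocation per Raw block.
    let mut buf = vec![0u8; BATCH_SIZE.min(block_size)];
    let mut remaining = block_size;
    while remaining > 0 {
        let chunk = remaining.min(buf.len());
        source.read_exact(&mut buf[..chunk])?;
        sink.extend_from_slice(&buf[..chunk]);
        remaining -= chunk;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let data = vec![7u8; 5000];
    let mut out = Vec::new();
    copy_raw_block(&data[..], data.len(), &mut out)?;
    assert_eq!(out, data);
    println!("copied {} bytes", out.len());
    Ok(())
}
```

The downside noted above still applies: this requires `alloc`, so it doesn't help no_std + no-alloc users.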

The `Raw` block path in `BlockDecoder::decode_block_content` uses a
128 KiB stack buffer to batch-read raw block content. On deeply embedded
targets — for example Xtensa ESP32 tasks, whose executor stacks are
typically around 40 KiB — this single array blows the task stack the
moment a stream contains a Raw block, regardless of how large the block
actually is (LLVM reserves the frame at function entry).

Add a `small-stack` cargo feature, off by default, that shrinks
`BATCH_SIZE` from `128 * 1024` to `1024`. The surrounding loop already
handles arbitrarily-sized blocks, so the only cost is more read/push
iterations for Raw-heavy streams on tiny-stack platforms that opt in.

Default behavior is unchanged. All existing tests pass with both feature
states.
Owner

KillingSpark commented Apr 14, 2026

Hey thanks for the PR and the detailed description.

I think this is a good way to solve this specific issue. I'll think about how I want this feature flag named. Might take me another few days, very busy right now. I think this kind of issue could arise at a few other places that should be sensibly handled by the same feature.

Owner

KillingSpark commented Apr 28, 2026

So I've given this some thought and remembered where I encountered a similar issue: the crc-rs crate. It's not my crate, but I contributed a while back. Bigger lookup tables resulted in way better checksumming but, similar to here, weren't suitable for all targets. The solution back then was also to introduce various features to support the different table sizes. That was a pain, and looking at it now the project seems to have ditched this again in favor of const generics with sane defaults.

Which is what I think this project should do as well. All the consts should be generics in the encoder/decoder respectively.

The FrameDecoder could then be a newtype wrapper around this generic decoder with the current defaults. Maybe a second newtype wrapper for a small stack environment could be provided but tbh the tradeoffs are probably very usecase specific so it's probably better to point people to the generic decoder.

Which is also another point pro const generic and con feature. Being able to scale the buffer sizes optimally for your usecase to get the best throughput seems like a good idea, because platforms are pretty various out there.

Do you have any thoughts on this? I think it's worth it to increase the API surface in this way as long as it is only increased for users with the special usecase. I think it's possible to do this without breaking changes.

Or, alternatively, could you elaborate on option 4? I'm not sure how that would work without an intermediate buffer. But maybe I'm just not seeing something.

Author

That all makes sense to me. I'll look into const generics.

A possible ideal API would let the caller provide the memory, so different platforms could control whether a buffer lives on the stack or the heap (that is at least relevant on the platforms I'm working on).

I'll do some testing to see what works out. Ideally there would be no breaking changes, but it sounds like if there is a dramatic improvement to the API, you might be okay with breaking changes?

Owner

KillingSpark commented Apr 29, 2026

If there is a big improvement to be had by breaking the api, I might be open to it, but I don't see why it would be necessary. We should be able to provide a FrameDecoder that is still compatible with the current one by choosing the current defaults as the generic parameters.

I'm thinking something like this:

```rust
// Newly introduced configurable FrameDecoder
pub struct GenericFrameDecoder<const STACK_BUFFER_SIZE_LIMIT: usize> {
    // Contents of the original FrameDecoder
}

// New functions on the generic frame decoder for each case where we currently
// have a `const` declaration. The current impl blocks on FrameDecoder should
// be changed to impl blocks for GenericFrameDecoder too.
impl<const STACK_BUFFER_SIZE_LIMIT: usize> GenericFrameDecoder<STACK_BUFFER_SIZE_LIMIT> {
    pub const fn rle_buffer_size() -> usize {
        if 512 >= STACK_BUFFER_SIZE_LIMIT {
            STACK_BUFFER_SIZE_LIMIT
        } else {
            512
        }
    }

    pub const fn raw_buffer_size() -> usize {
        STACK_BUFFER_SIZE_LIMIT
    }
}

pub type FrameDecoder = GenericFrameDecoder<{ 1024 * 128 }>;

// Alternatively, we might need an actual newtype wrapper like this:
// pub struct FrameDecoder(GenericFrameDecoder<{ 1024 * 128 }>);
```

Using a newtype wrapper would mean duplicating each impl block and forwarding each call to `self.0`, which would be annoying. I'm not sure if it's necessary for some compatibility reason.
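To make the tradeoff concrete, here is how the buffer-size math from the sketch above plays out for a caller-chosen limit. The type and function names mirror the sketch, not released ruzstd; `TinyFrameDecoder` is hypothetical:

```rust
// Hypothetical, mirroring the sketch above: callers pick their own stack
// budget via the const generic parameter.
pub struct GenericFrameDecoder<const STACK_BUFFER_SIZE_LIMIT: usize>;

impl<const STACK_BUFFER_SIZE_LIMIT: usize> GenericFrameDecoder<STACK_BUFFER_SIZE_LIMIT> {
    pub const fn rle_buffer_size() -> usize {
        if 512 >= STACK_BUFFER_SIZE_LIMIT {
            STACK_BUFFER_SIZE_LIMIT
        } else {
            512
        }
    }

    pub const fn raw_buffer_size() -> usize {
        STACK_BUFFER_SIZE_LIMIT
    }
}

// Current default, and a hypothetical embedded configuration.
type FrameDecoder = GenericFrameDecoder<{ 128 * 1024 }>;
type TinyFrameDecoder = GenericFrameDecoder<256>;

fn main() {
    assert_eq!(FrameDecoder::rle_buffer_size(), 512);
    assert_eq!(FrameDecoder::raw_buffer_size(), 128 * 1024);
    // Below 512, the RLE buffer is clamped to the limit itself.
    assert_eq!(TinyFrameDecoder::rle_buffer_size(), 256);
    println!("ok");
}
```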

Owner

Sorry for the back and forth on this from my part, I just noticed something important.

  4. Read directly into the output sink in smaller chunks without any intermediate buffer. Would require reshaping the loop around the Read + sink APIs; I didn't want to touch that without a signal from you that it's welcome.

For some reason I assumed that this actually writes directly to the output `Write`, but both the RLE and Raw blocks must also be "decoded" into the internal ringbuffer before being drained into the output.

This allows for way better solutions without these intermediate buffers. One can just reserve the required space in the ringbuffer and use that to either read the raw data into it or use the modern `slice::fill` function. This would sidestep the stack-allocated buffers and hopefully be optimized for the respective targets, as well as cut out unnecessary memcpys.

If you want to try your hand at messing with the internal buffers I'm willing to give you guidance, but I can also just do it in the upcoming days. Let me know what works for you.

The DecodeBuffer has a RingBuffer; the RingBuffer is basically a VecDeque with the addition of an `extend_from_within` implementation. It has a function called `reserve` which should be called, and also a function `free_slice_parts` which gives you the raw parts of the currently unused space in the ringbuffer.

So after calling `reserve` you should have one or two slices large enough to hold the new contents, either raw (the slices will be used as the buffer passed to the `read_exact` call) or RLE (just use `slice::fill` to set the values).
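A toy model of the reserve-then-write idea, with a plain `Vec` standing in for the real RingBuffer (so the `reserve`/`free_slice_parts` two-slice details are elided):

```rust
use std::io::Read;

// Toy model: grow the decode buffer first, then decode directly into the
// reserved space, so no intermediate stack buffer is needed. A Vec stands
// in for ruzstd's RingBuffer here.
fn decode_raw(source: &mut impl Read, len: usize, decode_buf: &mut Vec<u8>) -> std::io::Result<()> {
    let start = decode_buf.len();
    decode_buf.resize(start + len, 0); // "reserve" the space
    source.read_exact(&mut decode_buf[start..]) // read straight into it
}

fn decode_rle(byte: u8, len: usize, decode_buf: &mut Vec<u8>) {
    let start = decode_buf.len();
    decode_buf.resize(start + len, 0);
    decode_buf[start..].fill(byte); // slice::fill, no temporary buffer
}

fn main() -> std::io::Result<()> {
    let mut buf = Vec::new();
    decode_raw(&mut &b"hello"[..], 5, &mut buf)?;
    decode_rle(b'!', 3, &mut buf);
    assert_eq!(buf, b"hello!!!");
    println!("{}", String::from_utf8_lossy(&buf));
    Ok(())
}
```

In the real ringbuffer the reserved region can wrap around, which is why `free_slice_parts` hands back up to two slices rather than one contiguous range.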

Owner

Could you try running your project with the new commits from today by pointing the dependency to this GitHub repo? It should no longer require excessive stack space.
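For reference, pointing the dependency at the repository looks something like this (a sketch; the repository URL is assumed from the crate's GitHub org, and you would normally pin a `rev` or `branch`):

```toml
[dependencies]
ruzstd = { git = "https://github.com/KillingSpark/zstd-rs" }
```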
