Fix: Split single lines exceeding chunk_size limit #10
Summary
This PR fixes a bug where single lines longer than the configured `chunk_size` were not split, resulting in chunks larger than the limit.
Problem
The previous chunking logic in `src/core/indexer.rs` appended a line to the current chunk whenever the chunk was empty, even if the line itself exceeded the `chunk_size` limit. As a result, a file containing a single very long line (e.g., minified code or a large data string) produced one oversized chunk, which can cause problems for downstream embedding models that have strict token limits.
Solution
The fix detects whether a line exceeds `chunk_size` before attempting to add it to the current chunk. If it does:
Testing
Verified with a reproduction test case: a string of length 25 with a limit of 10 previously produced 1 chunk of size 25, and now correctly splits into 3 chunks (sizes 10, 10, 5).
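The diff itself is not reproduced in this description, so the following is only a minimal sketch of how splitting an over-long line could look. The function name `split_long_line` is hypothetical and is not taken from `src/core/indexer.rs`. Measuring `chunk_size` in characters rather than bytes is also an assumption, made here so that a split never lands inside a multi-byte UTF-8 sequence.

```rust
/// Hypothetical helper, not the code from `src/core/indexer.rs`.
/// Splits one line into pieces of at most `chunk_size` characters.
/// Assumption: the limit is counted in `char`s, not bytes.
fn split_long_line(line: &str, chunk_size: usize) -> Vec<String> {
    // A zero limit would make splitting impossible, so reject it early.
    assert!(chunk_size > 0, "chunk_size must be non-zero");

    // Collect the characters so the slice can be cut on char boundaries.
    let chars: Vec<char> = line.chars().collect();

    chars
        .chunks(chunk_size) // every piece has chunk_size chars, the last may be shorter
        .map(|piece| piece.iter().collect::<String>())
        .collect()
}
```

Under these assumptions, calling `split_long_line` on a 25-character line with a limit of 10 gives pieces of 10, 10 and 5 characters, which matches the reproduction case described above.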
Related Issue
Fixes PlatformNetwork/bounty-challenge#52