@echobt echobt commented Jan 19, 2026

Summary

This PR fixes a bug where single lines longer than the configured `chunk_size` were not being split, resulting in chunks larger than the limit.

Problem

The previous chunking logic in `src/core/indexer.rs` would simply append a line to the current chunk if the chunk was empty, even if that line itself exceeded the `chunk_size` limit. This meant that a file containing a single very long line (e.g., minified code or a large data string) would produce a single massive chunk, potentially causing issues with downstream embedding models that have strict token limits.

Solution

The fix involves detecting whether a line exceeds `chunk_size` before attempting to add it to the current chunk. If it does:

  1. The current pending chunk is flushed.
  2. The long line is hard-split into segments of at most `chunk_size` (the final segment may be shorter).
  3. These segments are added as separate chunks.
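The steps above can be sketched as follows. This is an illustrative reconstruction, not the actual code in `src/core/indexer.rs`; the function and variable names are hypothetical, and it measures size in characters (the real implementation may count bytes or tokens):

```rust
/// Sketch of the fixed chunking loop (hypothetical names; simplified
/// relative to the real indexer). Sizes are counted in characters here.
fn chunk_lines(lines: &[&str], chunk_size: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for line in lines {
        let line_len = line.chars().count();
        if line_len > chunk_size {
            // 1. Flush the pending chunk before handling the long line.
            if !current.is_empty() {
                chunks.push(std::mem::take(&mut current));
            }
            // 2. + 3. Hard-split the long line into chunk_size-sized
            //    segments and emit each segment as its own chunk.
            let cs: Vec<char> = line.chars().collect();
            for seg in cs.chunks(chunk_size) {
                chunks.push(seg.iter().collect());
            }
        } else {
            // Normal path: flush first if appending would overflow.
            if current.chars().count() + line_len > chunk_size {
                chunks.push(std::mem::take(&mut current));
            }
            current.push_str(line);
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    // Reproduction from the Testing section: one 25-char line, limit 10.
    let long = "x".repeat(25);
    let chunks = chunk_lines(&[long.as_str()], 10);
    let sizes: Vec<usize> = chunks.iter().map(|c| c.chars().count()).collect();
    assert_eq!(sizes, vec![10, 10, 5]);
    println!("{} chunks, sizes {:?}", chunks.len(), sizes);
}
```

Splitting on `char` boundaries (rather than raw bytes) avoids producing invalid UTF-8 segments when a multi-byte character would otherwise straddle a split point.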

Testing

Verified with a reproduction test case: a single line of length 25 with a `chunk_size` of 10 previously produced one chunk of size 25, and now correctly splits into three chunks (sizes 10, 10, and 5).

Related Issue

Fixes PlatformNetwork/bounty-challenge#52 ([BUG] Single lines exceeding chunk_size are not split, creating oversized chunks)


