
perf: tokenize each corpus chunk once during indexing #77

Merged
CGFixIT merged 1 commit into main from claude/cyclaw-indexer-dedup-tokenize on Jun 20, 2026
CGFixIT commented Jun 20, 2026

Owner

Problem

build_index() in retrieval/indexer.py tokenized every chunk twice:

# pass 1 (per chunk, in the build loop)
stem_tags = tokenize_and_stem(clean_chunk)[:20]
...
# pass 2 (later, over the whole corpus)
tokenized_corpus = [tokenize_and_stem(chunk) for chunk in all_chunks]

Both calls run tokenize_and_stem() on the same sanitized chunk text. The second pass re-runs the regex tokenizer and Porter stemmer across the entire corpus purely to rebuild a token list that the first pass already computed (and then threw away after slicing [:20]).

Change

Tokenize each chunk exactly once in the build loop, store the full token list in a parallel tokenized_corpus, and reuse it for both:

  • the BM25 tokenized_corpus (full tokens), and
  • the stem_tags metadata (tokens[:20]).
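
The single-pass loop described above can be sketched roughly as follows. This is not the code from retrieval/indexer.py: the tokenize_and_stem() body is a hypothetical stand-in (a lowercase regex split plus a toy suffix stripper in place of the Porter stemmer), and build_token_data() and tag_limit are illustrative names.

```python
import re

_TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize_and_stem(text):
    """Hypothetical stand-in for the real tokenizer + Porter stemmer."""
    tokens = _TOKEN_RE.findall(text.lower())
    # Toy "stemmer": drop a trailing "s" from tokens longer than 3 characters.
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def build_token_data(all_chunks, tag_limit=20):
    """Tokenize each chunk once and derive both outputs from that one list."""
    tokenized_corpus = []  # full token lists, parallel to all_chunks (BM25 input)
    stem_tags = []         # first tag_limit tokens per chunk (metadata)
    for chunk in all_chunks:
        tokens = tokenize_and_stem(chunk)      # the only tokenization pass
        tokenized_corpus.append(tokens)
        stem_tags.append(tokens[:tag_limit])   # slice, do not re-tokenize
    return tokenized_corpus, stem_tags


chunks = ["Indexes store tokens", "BM25 ranks chunks by token overlap"]
corpus, tags = build_token_data(chunks, tag_limit=3)
```

In this sketch, each entry of tags is a prefix of the matching entry in corpus, so the metadata slice and the BM25 input come from the same token list by construction.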

Benefit

  • Output is byte-for-byte identical — same chunks, same metadata, same BM25 token lists, same ordering.
  • Index-time tokenization work is halved (one regex + stem pass per chunk instead of two), reducing reindex time for larger corpora.
  • Slightly clearer data flow: tokens are produced once, next to the chunk they belong to.
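
One way to check the first two claims is to compare a two-pass and a one-pass variant on the same input while counting tokenizer calls. The sketch below is an illustration only: the tokenizer is a hypothetical regex-only stand-in (no stemming), and two_pass(), one_pass(), and call_count are names invented for this example rather than anything in the repository.

```python
import re

call_count = {"n": 0}  # mutable counter so the tokenizer can record its calls


def tokenize_and_stem(text):
    """Hypothetical stand-in tokenizer that counts how often it is invoked."""
    call_count["n"] += 1
    return re.findall(r"[a-z0-9]+", text.lower())


def two_pass(chunks):
    # Old shape: tokenize for the [:20] tags, then tokenize again for BM25.
    tags = [tokenize_and_stem(c)[:20] for c in chunks]
    corpus = [tokenize_and_stem(c) for c in chunks]
    return corpus, tags


def one_pass(chunks):
    # New shape: tokenize once, slice the stored tokens for the tags.
    corpus = [tokenize_and_stem(c) for c in chunks]
    tags = [tokens[:20] for tokens in corpus]
    return corpus, tags


chunks = ["alpha beta gamma " * 10, "delta epsilon", "zeta"]

call_count["n"] = 0
old_result = two_pass(chunks)
old_calls = call_count["n"]

call_count["n"] = 0
new_result = one_pass(chunks)
new_calls = call_count["n"]

print(old_result == new_result, old_calls, new_calls)
```

The first chunk yields 30 tokens, so the [:20] slice actually truncates and the comparison exercises both the full lists and the sliced tags.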

stem_token() is lru_cached, so the stemmer cost of the old second pass was partly hidden — but the regex findall + cache lookups still ran over every chunk again. This removes that redundant pass entirely.
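
The caching effect can be illustrated with functools.lru_cache and its cache_info() counters. This is a sketch under assumptions: the stem_token() body below is a toy suffix stripper, not the project's stemmer, and the regex is a guess at the tokenizer's general shape.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def stem_token(token):
    """Toy stand-in stemmer; the real one is assumed to be a Porter stemmer."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def tokenize_and_stem(text):
    # The regex scan and one cache lookup per token happen on every call,
    # even when every stem is already cached.
    return [stem_token(t) for t in re.findall(r"[a-z0-9]+", text.lower())]


chunks = ["tokens and chunks", "chunks of tokens"]

for chunk in chunks:            # first pass: fills the cache
    tokenize_and_stem(chunk)
first = stem_token.cache_info()

for chunk in chunks:            # second pass: no new stemming work
    tokenize_and_stem(chunk)
second = stem_token.cache_info()

extra_lookups = second.hits - first.hits
```

In this example the second pass adds no cache misses, but it still performs one lookup for each of the six tokens, plus the regex scan of each chunk. That is the residual work the PR removes.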

🤖 Generated with Claude Code



build_index() called tokenize_and_stem() twice on every chunk: once for
the [:20] stem_tags metadata and again to build the full BM25
tokenized_corpus. Both calls ran on the same sanitized chunk, so the
second pass re-ran the regex tokenizer and stemmer over the entire corpus
for no benefit. Tokenize each chunk once, reuse the result for both the
BM25 corpus and the stem_tags slice. Output is byte-for-byte identical;
index-time tokenization work is halved.
CGFixIT marked this pull request as ready for review June 20, 2026 07:51
CGFixIT merged commit 6f86e22 into main Jun 20, 2026
14 checks passed
CGFixIT deleted the claude/cyclaw-indexer-dedup-tokenize branch June 20, 2026 09:05