perf: tokenize each corpus chunk once during indexing by CGFixIT · Pull Request #77 · CGFixIT/CyClaw

CGFixIT · 2026-06-20T07:48:08Z

Problem

retrieval/indexer.py build_index() tokenized every chunk twice:

# pass 1 (per chunk, in the build loop)
stem_tags = tokenize_and_stem(clean_chunk)[:20]
...
# pass 2 (later, over the whole corpus)
tokenized_corpus = [tokenize_and_stem(chunk) for chunk in all_chunks]

Both calls run tokenize_and_stem() on the same sanitized chunk text. The second pass re-runs the regex tokenizer and Porter stemmer across the entire corpus purely to rebuild a token list that the first pass already computed (and then threw away after slicing [:20]).

Change

Tokenize each chunk exactly once in the build loop, store the full token list in a parallel tokenized_corpus, and reuse it for both:

the BM25 tokenized_corpus (full tokens), and
the stem_tags metadata (tokens[:20]).
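
As a rough illustration of the change (not the actual indexer code), the single-pass loop can be sketched as follows. The `tokenize_and_stem` below is a toy stand-in with a trivial suffix rule rather than the project's regex tokenizer and Porter stemmer, and the names `build_index_once`, `build_index_twice`, and the metadata dict shape are assumptions made for this example only.

```python
import re

_TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize_and_stem(text):
    # Toy stand-in for the real function: lowercase regex tokens,
    # then strip a trailing "s" (NOT a Porter stemmer).
    return [
        tok[:-1] if len(tok) > 3 and tok.endswith("s") else tok
        for tok in _TOKEN_RE.findall(text.lower())
    ]


def build_index_twice(all_chunks):
    # Old shape: one call per chunk for stem_tags, a second for the BM25 corpus.
    metadata = [{"stem_tags": tokenize_and_stem(c)[:20]} for c in all_chunks]
    tokenized_corpus = [tokenize_and_stem(c) for c in all_chunks]
    return tokenized_corpus, metadata


def build_index_once(all_chunks):
    # New shape: tokenize once, keep the full list, slice it for stem_tags.
    tokenized_corpus = []
    metadata = []
    for chunk in all_chunks:
        tokens = tokenize_and_stem(chunk)
        tokenized_corpus.append(tokens)              # full tokens for BM25
        metadata.append({"stem_tags": tokens[:20]})  # slice copies, no aliasing
    return tokenized_corpus, metadata
```

Because `tokens[:20]` creates a new list, the metadata does not alias the BM25 token list, so later mutation of one cannot affect the other.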

Benefit

Output is byte-for-byte identical — same chunks, same metadata, same BM25 token lists, same ordering.
Index-time tokenization work is halved (one regex + stem pass per chunk instead of two), reducing reindex time for larger corpora.
Slightly clearer data flow: tokens are produced once, next to the chunk they belong to.

stem_token() is lru_cached, so the stemmer cost of the old second pass was partly hidden — but the regex findall + cache lookups still ran over every chunk again. This removes that redundant pass entirely.
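
To make the caching point concrete, here is a small sketch under the same caveats: `stem_token` is a toy function decorated with `functools.lru_cache`, not the project's stemmer. A second pass over the same chunks adds no cache misses, yet every token still goes through `findall` and a cache lookup.

```python
import re
from functools import lru_cache

_TOKEN_RE = re.compile(r"[a-z0-9]+")


@lru_cache(maxsize=None)
def stem_token(token):
    # Toy stemmer (assumption): strip a trailing "s" from longer tokens.
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def tokenize_and_stem(text):
    # The regex scan and one cache lookup per token run on every call,
    # even when every stem is already cached.
    return [stem_token(t) for t in _TOKEN_RE.findall(text.lower())]


chunks = ["indexes and tokens", "tokens and stems"]

for chunk in chunks:
    tokenize_and_stem(chunk)
after_first = stem_token.cache_info()

for chunk in chunks:
    tokenize_and_stem(chunk)
after_second = stem_token.cache_info()

# Lookups performed by the second pass alone.
second_pass_lookups = after_second.hits - after_first.hits
```

In this sketch the second pass performs one lookup per token and computes no new stems, which is the redundant work the single-pass loop avoids.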

🤖 Generated with Claude Code


build_index() called tokenize_and_stem() twice on every chunk: once for the [:20] stem_tags metadata and again to build the full BM25 tokenized_corpus. Both calls ran on the same sanitized chunk, so the second pass re-ran the regex tokenizer and stemmer over the entire corpus for no benefit. Tokenize each chunk once, reuse the result for both the BM25 corpus and the stem_tags slice. Output is byte-for-byte identical; index-time tokenization work is halved.

CGFixIT marked this pull request as ready for review June 20, 2026 07:51

CGFixIT merged commit 6f86e22 into main Jun 20, 2026
14 checks passed

CGFixIT deleted the claude/cyclaw-indexer-dedup-tokenize branch June 20, 2026 09:05

CGFixIT merged 1 commit into main from claude/cyclaw-indexer-dedup-tokenize
