perf: drop redundant per-token regex re-validation in tokenize_and_stem by CGFixIT · Pull Request #78 · CGFixIT/CyClaw

CGFixIT · 2026-06-20T07:54:31Z

Problem

retrieval/stemmer.py applied two regex checks to every token: a _WORD_RE.findall() pass to extract it, then a _TOKEN_RE.match() call to re-validate it:

_WORD_RE  = re.compile(r'[a-z][a-z0-9_-]+')
_TOKEN_RE = re.compile(r'^[a-z][a-z0-9_-]{1,}$')

def tokenize_and_stem(text):
    tokens = _WORD_RE.findall(text.lower())
    return [stem_token(t) for t in tokens if _TOKEN_RE.match(t)]

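_WORD_RE.findall() returns only non-overlapping matches of [a-z][a-z0-9_-]+, so every returned token is already letter-led, at least 2 characters long, and made up solely of characters from that class. _TOKEN_RE (^[a-z][a-z0-9_-]{1,}$) describes the exact same shape: {1,} is equivalent to +, and the ^ and $ anchors are satisfied trivially because each token is checked as a whole string and can never contain a newline. _TOKEN_RE.match(t) is therefore always true for findall output. The filter (and the regex it uses) is dead work: a redundant regex match executed for every token, on both the index-time and query-time hot path.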

Change

Drop the _TOKEN_RE filter and the now-unused _TOKEN_RE pattern. Tokenization now does one regex pass:

return [stem_token(t) for t in _WORD_RE.findall(text.lower())]

Benefit

Output is byte-for-byte identical — same tokens, same order, same stems.
One fewer compiled-regex match() per token across every query and every chunk indexed.
Clearer intent: the letter-led / min-length guarantee lives in one place (_WORD_RE) instead of being asserted twice.
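The saved match() call could be measured with a micro-benchmark along these lines. This is only a sketch under assumptions: stem_token is replaced by an identity stand-in because the real implementation is not shown in this PR, and the sample text, repeat counts, and function names tokenize_old / tokenize_new are hypothetical. No timings are claimed here; results will vary by machine and Python version.

```python
import re
import timeit

_WORD_RE = re.compile(r'[a-z][a-z0-9_-]+')
_TOKEN_RE = re.compile(r'^[a-z][a-z0-9_-]{1,}$')

def stem_token(token):
    # Hypothetical stand-in: the real stem_token in retrieval/stemmer.py is not shown here.
    return token

def tokenize_old(text):
    # Behaviour before the change: findall, then a per-token match() filter.
    tokens = _WORD_RE.findall(text.lower())
    return [stem_token(t) for t in tokens if _TOKEN_RE.match(t)]

def tokenize_new(text):
    # Behaviour after the change: a single findall pass.
    return [stem_token(t) for t in _WORD_RE.findall(text.lower())]

# Made-up sample text, repeated to get a measurable workload.
SAMPLE = "The quick brown fox jumps over the lazy dog, re-indexing chunk_42 again. " * 200

def time_call(fn, repeats=5, number=20):
    """Best-of-`repeats` wall time in seconds for `number` calls of fn(SAMPLE)."""
    return min(timeit.repeat(lambda: fn(SAMPLE), repeat=repeats, number=number))

if __name__ == "__main__":
    print(f"old: {time_call(tokenize_old):.4f}s")
    print(f"new: {time_call(tokenize_new):.4f}s")
```

Taking the minimum over several repeats is the usual way to reduce noise from other processes; with the real stem_token in place, the relative saving would likely be smaller, since stemming cost is unchanged.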

Verification

Equivalence fuzz-tested over 200,000 random strings (mixed letters, digits, _ - . @ #, whitespace, empty): the old if _TOKEN_RE.match(t) filter and the new unfiltered list produced identical output in every case (0 mismatches). The existing tests in tests/test_stemmer.py (TestTokenizeAndStem, including test_numeric_ignored) continue to pass because numeric-led tokens are excluded by _WORD_RE itself, not by the removed filter.
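The fuzz script itself is not included in this PR. An equivalence check of the kind described might look like the sketch below; the alphabet, seed, maximum length, and function names are assumptions chosen for illustration. Stemming is left out because stem_token is applied identically on both sides, so comparing the token lists before stemming is sufficient.

```python
import random
import re

_WORD_RE = re.compile(r'[a-z][a-z0-9_-]+')
_TOKEN_RE = re.compile(r'^[a-z][a-z0-9_-]{1,}$')

# Assumed character pool: letters in both cases, digits, the punctuation
# named in the PR description, and whitespace.
ALPHABET = "abcxyzABCXYZ0123456789_-.@# \t\n"

def old_tokens(text):
    # Token list as produced before the change (with the redundant filter).
    return [t for t in _WORD_RE.findall(text.lower()) if _TOKEN_RE.match(t)]

def new_tokens(text):
    # Token list as produced after the change (no filter).
    return _WORD_RE.findall(text.lower())

def count_mismatches(trials, seed=0, max_len=40):
    """Count random strings for which the old and new token lists differ."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        length = rng.randint(0, max_len)  # length 0 covers the empty string
        text = "".join(rng.choice(ALPHABET) for _ in range(length))
        if old_tokens(text) != new_tokens(text):
            mismatches += 1
    return mismatches

print(count_mismatches(10_000))
```

A fixed seed keeps the run reproducible; raising trials to 200,000 would match the scale reported above.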

🤖 Generated with Claude Code


_WORD_RE.findall() already returns only maximal [a-z][a-z0-9_-]+ runs (letter-led, length >= 2), so the follow-up 'if _TOKEN_RE.match(t)' filter re-checked the exact same shape and always returned True. It was a redundant regex match per token on the index-time and query-time hot path. Remove the filter (and the now-unused _TOKEN_RE) so tokenization does one regex pass instead of two. Output is byte-for-byte identical; verified by a 200k-string fuzz over the two patterns (0 mismatches).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01ECr44xGUy4SDEJRSmDPNZb

CGFixIT marked this pull request as ready for review June 20, 2026 08:04

CGFixIT merged commit 381bdad into main Jun 20, 2026
14 checks passed

CGFixIT deleted the claude/cyclaw-stemmer-redundant-token-re branch June 20, 2026 09:05
