perf: drop redundant per-token regex re-validation in tokenize_and_stem#78

Merged
CGFixIT merged 1 commit into main from claude/cyclaw-stemmer-redundant-token-re on Jun 20, 2026

@CGFixIT CGFixIT commented Jun 20, 2026

Owner

Problem

retrieval/stemmer.py ran two regex steps during tokenization: one findall() pass over the text, then a match() re-check on every token it returned:

_WORD_RE  = re.compile(r'[a-z][a-z0-9_-]+')
_TOKEN_RE = re.compile(r'^[a-z][a-z0-9_-]{1,}$')

def tokenize_and_stem(text):
    tokens = _WORD_RE.findall(text.lower())
    return [stem_token(t) for t in tokens if _TOKEN_RE.match(t)]

_WORD_RE.findall() returns only maximal runs matching [a-z][a-z0-9_-]+, so every returned token already starts with a letter and is at least 2 characters long. _TOKEN_RE (^[a-z][a-z0-9_-]{1,}$) describes exactly the same shape, since {1,} is equivalent to +, so _TOKEN_RE.match(t) is always true for findall output. The filter (and the regex it uses) is dead work: one redundant regex match per token, on both the index-time and query-time hot paths.
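To make the argument concrete, here is a minimal sketch that runs both patterns on one input. The two patterns are copied from the snippet above; the sample string is a hypothetical input chosen to include a single letter and a bare number, and is not taken from the repository.

```python
import re

# Patterns as quoted in the PR description.
_WORD_RE = re.compile(r'[a-z][a-z0-9_-]+')
_TOKEN_RE = re.compile(r'^[a-z][a-z0-9_-]{1,}$')

# Hypothetical sample input (an assumption for illustration only).
sample = "A re-index of user_id 42 x"

# findall() already skips the lone letters "a" and "x" (too short)
# and the bare number "42" (a match cannot start on a digit).
tokens = _WORD_RE.findall(sample.lower())

# The old filter re-checks each token against the same shape.
survivors = [t for t in tokens if _TOKEN_RE.match(t)]

print(tokens)               # ['re-index', 'of', 'user_id']
print(survivors == tokens)  # True: the filter removed nothing
```

Nothing is filtered out, because every string findall() can return is, by construction, a full match of _TOKEN_RE.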

Change

Drop the _TOKEN_RE filter and the now-unused _TOKEN_RE pattern. Tokenization now does one regex pass:

return [stem_token(t) for t in _WORD_RE.findall(text.lower())]

Benefit

  • Output is byte-for-byte identical — same tokens, same order, same stems.
  • One fewer compiled-regex match() per token across every query and every chunk indexed.
  • Clearer intent: the letter-led / min-length guarantee lives in one place (_WORD_RE) instead of being asserted twice.

Verification

Equivalence was fuzz-tested over 200,000 random strings (mixed letters, digits, _ - . @ #, whitespace, empty): the old if _TOKEN_RE.match(t) filter and the new unfiltered list produced identical output in every case (0 mismatches). The existing tests in tests/test_stemmer.py (TestTokenizeAndStem, including test_numeric_ignored) continue to pass because _WORD_RE itself never starts a match on a digit, so the exclusion of numeric tokens never depended on the removed filter.
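The fuzz script itself is not included in this PR, so the following is only a sketch of how such an equivalence check could look. The alphabet, string lengths, seed, and the identity stem_token stand-in are all assumptions made for illustration; the real stemmer lives in retrieval/stemmer.py.

```python
import random
import re

_WORD_RE = re.compile(r'[a-z][a-z0-9_-]+')
_TOKEN_RE = re.compile(r'^[a-z][a-z0-9_-]{1,}$')

# Assumed character mix: letters (both cases), digits, punctuation, whitespace.
ALPHABET = "abcXYZ019_-.@# \t\n"


def stem_token(token):
    # Stand-in for the real stemmer (assumption): identity is enough here,
    # because both variants stem whatever tokens survive in the same way.
    return token


def tokenize_old(text):
    # Pre-change behaviour: findall, then the per-token re-check.
    tokens = _WORD_RE.findall(text.lower())
    return [stem_token(t) for t in tokens if _TOKEN_RE.match(t)]


def tokenize_new(text):
    # Post-change behaviour: a single regex pass.
    return [stem_token(t) for t in _WORD_RE.findall(text.lower())]


def count_mismatches(trials, seed=0):
    # Compare both variants on random strings, including empty ones.
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        length = rng.randint(0, 40)
        text = "".join(rng.choice(ALPHABET) for _ in range(length))
        if tokenize_old(text) != tokenize_new(text):
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print(count_mismatches(200_000))
```

A fixed seed keeps any mismatch reproducible, which matters more for a check like this than the exact alphabet used.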

🤖 Generated with Claude Code



_WORD_RE.findall() already returns only maximal [a-z][a-z0-9_-]+ runs
(letter-led, length >= 2), so the follow-up 'if _TOKEN_RE.match(t)' filter
re-checked the exact same shape and always returned True. It was a redundant
regex match per token on the index-time and query-time hot path.

Remove the filter (and the now-unused _TOKEN_RE) so tokenization does one
regex pass instead of two. Output is byte-for-byte identical; verified by a
200k-string fuzz over the two patterns (0 mismatches).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01ECr44xGUy4SDEJRSmDPNZb
@CGFixIT CGFixIT marked this pull request as ready for review June 20, 2026 08:04
@CGFixIT CGFixIT merged commit 381bdad into main Jun 20, 2026
14 checks passed
@CGFixIT CGFixIT deleted the claude/cyclaw-stemmer-redundant-token-re branch June 20, 2026 09:05