Why
The English path in scripts/knowledge-graph/tokenizer.ts handles whitespace, kebab / snake / CamelCase splits, and lowercasing, but nothing beyond that. Three gaps stand out:
HEX_PATTERN = /^[0-9a-f]{6,}$/i (scripts/knowledge-graph/constants.ts:39) matches plain English words spelled only with the letters a-f, such as decade, facade, and defaced. They are dropped as "noise" even though they are normal vocabulary, which throws away signal.
The hand-rolled stop word list (constants.ts:7-34, ~100 entries) is significantly smaller than industry-standard sets (NLTK ~180, spaCy ~325). Real session prose contains many high-frequency words that slip through and dilute IDF weights.
No stemming: run / running / runs, test / tested / testing, fix / fixed / fixing end up as separate vocabulary entries, fragmenting IDF and weakening cosine similarity between clearly related sessions.
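The hex false positive is easy to reproduce in isolation. The snippet below is a minimal sketch that uses only the pattern quoted above; the word list and the variable names are illustrative and are not taken from the repository.

```typescript
// Pattern as quoted in this issue (constants.ts:39).
const HEX_PATTERN = /^[0-9a-f]{6,}$/i;

// Ordinary English words spelled only with the letters a-f (illustrative list).
const words: string[] = ["decade", "facade", "defaced"];

// Every word that matches would be classified as a hex literal and dropped.
const dropped: string[] = words.filter((w) => HEX_PATTERN.test(w));

console.log(dropped);
```

All three words pass the pattern, because nothing in it requires a digit.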
What
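Fix the HEX_PATTERN false positive: treat a token as a hex literal only if it contains at least one digit, since an all-letter token built from a-f is far more likely to be an English word. Digit-free hex strings such as deadbeef will then be kept as ordinary tokens.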
Add Porter (or Snowball) stemming after CamelCase split, before stop-word filtering.
Expand STOP_WORDS to roughly NLTK parity (~180 entries).
Add tests that:
assert stemming collapses run/running/runs
assert decade, facade, defaced are not filtered as noise
assert the new stop words are dropped
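The stage ordering proposed above (splits, then stemming, then stop-word filtering) can be sketched as follows. This is a sketch under stated assumptions: toyStem is a stand-in suffix stripper and not the Porter algorithm, the three-entry stop list is hypothetical, and tokenize is an illustrative name rather than the function in tokenizer.ts. One wrinkle of this ordering is that a stemmer can alter stop words themselves (the toy one turns this into thi), so the sketch stems the stop list as well. That choice is an assumption of the sketch, not something the issue specifies.

```typescript
// Stand-in stemmer for illustration only: NOT the Porter algorithm.
// It strips "ing" / "ed" / plural "s" and undoubles a trailing consonant.
function toyStem(token: string): string {
  let stem: string;
  if (token.length > 5 && /ing$/.test(token)) {
    stem = token.slice(0, -3);
  } else if (token.length > 4 && /ed$/.test(token)) {
    stem = token.slice(0, -2);
  } else if (token.length > 3 && /[^s]s$/.test(token)) {
    return token.slice(0, -1);
  } else {
    return token;
  }
  // After stripping "ing" / "ed": "runn" -> "run".
  return /([bdfgmnprt])\1$/.test(stem) ? stem.slice(0, -1) : stem;
}

// Hypothetical stop list; the real STOP_WORDS lives in constants.ts.
const STOP_WORDS: string[] = ["the", "this", "is"];

// Stop words are stemmed too, so they still match after the stemming stage.
const STOP_STEMS: string[] = STOP_WORDS.map((w) => toyStem(w));

function tokenize(text: string): string[] {
  return text
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // CamelCase split
    .split(/[\s_\-]+/) // whitespace, snake_case, kebab-case
    .filter((t) => t.length > 0)
    .map((t) => t.toLowerCase())
    .map((t) => toyStem(t)) // stemming after the splits
    .filter((t) => STOP_STEMS.indexOf(t) === -1); // stop-word filtering last
}

console.log(tokenize("this is the runningTest"));
console.log(tokenize("run running runs"));
```

With a real Porter or Snowball implementation, only toyStem would need to be swapped out; the stage order stays the same.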
How
Porter stemmer: ~70 LoC of pure logic, ship inline (no deps) or use a single-purpose library that has no transitive deps.
HEX_PATTERN: change to /^(?=[0-9a-f]*[0-9])[0-9a-f]{6,}$/i (require ≥1 digit). Sanity-check against real session log fixtures.
Be careful at the Japanese boundary: stemming must not apply to tokens containing CJK characters.
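The last two points can be sketched together. The stricter pattern is the one proposed above; the helper names (isHexNoise, containsCjk, stemIfLatin) and the choice of CJK code-point ranges are assumptions made for illustration and may not match what the Japanese tokenizer path actually uses.

```typescript
// Proposed pattern from this issue: hex characters only, length >= 6,
// and at least one digit (enforced by the lookahead).
const HEX_PATTERN_STRICT = /^(?=[0-9a-f]*[0-9])[0-9a-f]{6,}$/i;

// Hypothetical helper: true when a token should be dropped as hex noise.
function isHexNoise(token: string): boolean {
  return HEX_PATTERN_STRICT.test(token);
}

// Assumed ranges: Hiragana and Katakana, CJK Unified Ideographs Extension A,
// CJK Unified Ideographs, and half-width Katakana.
const CJK_PATTERN = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uff66-\uff9f]/;

// Hypothetical helper: true when the token contains any CJK character.
function containsCjk(token: string): boolean {
  return CJK_PATTERN.test(token);
}

// Apply the given stemmer only to tokens without CJK characters.
function stemIfLatin(token: string, stem: (t: string) => string): string {
  return containsCjk(token) ? token : stem(token);
}

console.log(isHexNoise("decade"), isHexNoise("a1b2c3d"), containsCjk("テスト"));
```

Under this pattern a digit-free hex string such as deadbeef is no longer treated as noise, which is the trade-off accepted in exchange for keeping words like decade.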
Notes
Independent of the Japanese tokenizer issue; the two can ship in either order. The TF-IDF modernization issue benefits from this landing first.