Improve English tokenization: stemming, larger stop word set, fix HEX false positives #30

@chigichan24

Why

  • The English path in scripts/knowledge-graph/tokenizer.ts handles whitespace, kebab / snake / CamelCase splits, and lowercasing — but is otherwise quite naive.
  • HEX_PATTERN = /^[0-9a-f]{6,}$/i (scripts/knowledge-graph/constants.ts:39) matches plain English words spelled only with the letters a-f, such as decade, facade, efface, and defaced. They get dropped as "noise" even though they are normal vocabulary, which throws away signal.
  • The hand-rolled stop word list (constants.ts:7-34, ~100 entries) is significantly smaller than industry-standard sets (NLTK ~180, spaCy ~325). Real session prose contains many high-frequency words that slip through and dilute IDF weights.
  • No stemming: run / running / runs, test / tested / testing, fix / fixed / fixing end up as separate vocabulary entries, fragmenting IDF and weakening cosine similarity between clearly related sessions.
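The false positive is easy to reproduce in isolation. The sketch below copies the pattern quoted above and runs it over a handful of sample tokens; the sample list is made up for illustration and is not taken from real session logs.

```typescript
// Pattern as quoted in this issue from scripts/knowledge-graph/constants.ts.
const HEX_PATTERN = /^[0-9a-f]{6,}$/i;

// Hypothetical sample tokens: English words spelled only with a-f,
// two strings that really look like hex, and one ordinary word.
const samples: string[] = [
  "decade",
  "facade",
  "defaced",
  "a1b2c3d",
  "deadbeef",
  "running",
];

// Tokens the current pattern would discard as noise.
const dropped: string[] = samples.filter((token) => HEX_PATTERN.test(token));

console.log(dropped);
```

Every sample except "running" matches, so the three English words are discarded together with the two hex-looking strings.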

What

  • Fix the HEX_PATTERN false positive: a real hex literal must contain at least one digit (otherwise it is a normal English word built from a-f).
  • Add Porter (or Snowball) stemming after CamelCase split, before stop-word filtering.
  • Expand STOP_WORDS to roughly NLTK parity (~180 entries).
  • Add tests that:
    • assert stemming collapses run/running/runs
    • assert decade, facade, efface are not filtered as noise
    • assert the new stop words are dropped
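As a rough illustration of the proposed ordering (CamelCase split, then stemming, then stop-word filtering), here is a minimal sketch. Everything in it is hypothetical: toyStem is a deliberately crude suffix stripper standing in for a real Porter or Snowball stemmer, STOP_WORD_LIST is a tiny placeholder, and tokenizeEnglish is not the function in tokenizer.ts. The hex filter is left out to keep the example short.

```typescript
// Toy stand-in for a real stemmer: strips a plural "s", or "ing" / "ed"
// followed by undoing consonant doubling ("runn" -> "run").
function toyStem(token: string): string {
  if (token.length > 3 && token.endsWith("s") && !token.endsWith("ss")) {
    return token.slice(0, -1);
  }
  let stem: string;
  if (token.length > 5 && token.endsWith("ing")) stem = token.slice(0, -3);
  else if (token.length > 4 && token.endsWith("ed")) stem = token.slice(0, -2);
  else return token;
  // Keep ll / ss / zz, as Porter does; collapse other doubled consonants.
  return /([^aeioulsz])\1$/.test(stem) ? stem.slice(0, -1) : stem;
}

// Hypothetical, tiny stop-word subset (the real list is much larger).
const STOP_WORD_LIST: string[] = ["the", "is", "this", "was", "and"];

// Because filtering runs after stemming, compare against stemmed stop words:
// this toy stemmer turns "this" into "thi", which a raw list would miss.
const STEMMED_STOP_WORDS = new Set<string>(STOP_WORD_LIST.map(toyStem));

function tokenizeEnglish(text: string): string[] {
  return text
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // CamelCase split
    .split(/[\s_\-]+/) // whitespace, snake_case, kebab-case
    .map((t) => t.toLowerCase().replace(/[^a-z0-9]/g, ""))
    .filter((t) => t.length > 0)
    .map(toyStem)
    .filter((t) => !STEMMED_STOP_WORDS.has(t));
}

console.log(tokenizeEnglish("This test was testing the fixedBug and runs"));
```

One design point the sketch surfaces: if stemming really does run before stop-word filtering, the stop-word set probably needs to be stemmed with the same stemmer (or the filter moved ahead of stemming), otherwise stemmed forms of stop words slip through.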

How

  • Porter stemmer: ~70 LoC of pure logic, ship inline (no deps) or use a single-purpose library that has no transitive deps.
  • HEX_PATTERN: change to /^(?=[0-9a-f]*[0-9])[0-9a-f]{6,}$/i (require ≥1 digit). Sanity-check against real session log fixtures.
  • Be careful at the Japanese boundary: stemming must not apply to tokens containing CJK characters.

Notes

  • Independent from the Japanese tokenizer issue. Can ship in either order. The TF-IDF modernization issue benefits from this landing first.

Metadata

Labels: enhancement (New feature or request)

Status: In Progress