Improve English tokenization: stemming, larger stop word set, fix HEX false positives

# Why
- The English path in `scripts/knowledge-graph/tokenizer.ts` handles whitespace, kebab / snake / CamelCase splits, and lowercasing, but is otherwise quite naive.
- `HEX_PATTERN = /^[0-9a-f]{6,}$/i` (`scripts/knowledge-graph/constants.ts:39`) matches plain English words spelled only with the letters `a-f`, such as `decade`, `facade`, `efface`, `defaced`. These are dropped as "noise" even though they are ordinary vocabulary, which throws away signal.
- The hand-rolled stop word list (`constants.ts:7-34`, ~100 entries) is significantly smaller than industry-standard sets (NLTK ~180, spaCy ~325). Real session prose contains many high-frequency words that slip through and dilute IDF weights.
- No stemming: `run` / `running` / `runs`, `test` / `tested` / `testing`, `fix` / `fixed` / `fixing` end up as separate vocabulary entries, fragmenting IDF and weakening cosine similarity between clearly related sessions.
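
The fragmentation effect in the last bullet can be illustrated with a minimal sketch. It uses plain term-count cosine similarity rather than the project's TF-IDF weighting, the two session token lists are made up, and `STEMS` is a hand-written stand-in for a stemmer's output, not a real stemmer.

```typescript
// Plain term-count cosine similarity (no IDF weighting), for illustration only.
function cosine(a: string[], b: string[]): number {
  const count = (tokens: string[]): Map<string, number> => {
    const m = new Map<string, number>();
    for (const t of tokens) m.set(t, (m.get(t) ?? 0) + 1);
    return m;
  };
  const norm = (m: Map<string, number>): number => {
    let sum = 0;
    m.forEach((n) => { sum += n * n; });
    return Math.sqrt(sum);
  };
  const ca = count(a);
  const cb = count(b);
  let dot = 0;
  ca.forEach((n, term) => { dot += n * (cb.get(term) ?? 0); });
  const denom = norm(ca) * norm(cb);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical tokens from two sessions about the same work.
const sessionA = ["fixed", "tests", "running"];
const sessionB = ["fixing", "test", "runs"];

// Hand-written stand-in for stemmer output (assumption, not a real stemmer).
const STEMS = new Map<string, string>([
  ["fixed", "fix"], ["fixing", "fix"],
  ["tests", "test"], ["test", "test"],
  ["running", "run"], ["runs", "run"],
]);
const stem = (t: string): string => STEMS.get(t) ?? t;

// No shared surface forms, so the raw vectors are orthogonal.
const rawSimilarity = cosine(sessionA, sessionB);
// After collapsing inflections, the two sessions share every term.
const stemmedSimilarity = cosine(sessionA.map(stem), sessionB.map(stem));
```

Without stemming the two sessions have no term in common and score 0; once the inflected forms collapse, the vectors coincide and the score is approximately 1.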

# What
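- Fix the `HEX_PATTERN` false positive: treat a token as hex only if it contains at least one digit (an all-letter `a-f` run is far more likely to be a normal English word).
- Add Porter (or Snowball) stemming after CamelCase split, before stop-word filtering. With this order the stop-word check sees stems, so `STOP_WORDS` must be matched in stemmed form (Porter turns `this` into `thi`, for example).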
- Expand `STOP_WORDS` to roughly NLTK parity (~180 entries).
- Add tests that:
  - assert stemming collapses `run`/`running`/`runs`
  - assert `decade`, `facade`, `defaced` are **not** filtered as noise
  - assert the new stop words are dropped
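
The three assertions above could be sketched as a framework-agnostic checker. Everything here is hypothetical: `checkEnglishTokenizer` and the `Tokenize` type are illustrative names, the real tokenizer's export name and signature are not given in this issue, and the list of newly added stop words is passed in because it is not yet decided.

```typescript
// Assumed shape of the tokenizer entry point; the real signature may differ.
type Tokenize = (text: string) => string[];

// Returns human-readable failures; an empty array means every check passed.
function checkEnglishTokenizer(
  tokenize: Tokenize,
  newStopWords: string[],
): string[] {
  const failures: string[] = [];

  // 1. Inflected forms should collapse to a single vocabulary entry.
  const stems = new Set(tokenize("run running runs"));
  if (stems.size !== 1) {
    failures.push(`expected 1 distinct stem, got ${stems.size}`);
  }

  // 2. Words spelled only with a-f must survive the hex noise filter.
  for (const word of ["decade", "facade", "defaced"]) {
    if (tokenize(word).length === 0) {
      failures.push(`"${word}" was filtered as noise`);
    }
  }

  // 3. Newly added stop words must be dropped.
  for (const word of newStopWords) {
    if (tokenize(word).length !== 0) {
      failures.push(`stop word "${word}" was kept`);
    }
  }

  return failures;
}
```

In a real test file the returned array would be asserted empty with whatever test runner the repository already uses.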

# How
- Porter stemmer: ~70 LoC of pure logic, ship inline (no deps) or use a single-purpose library that has no transitive deps.
- `HEX_PATTERN`: change to `/^(?=[0-9a-f]*[0-9])[0-9a-f]{6,}$/i` (require ≥1 digit). Sanity-check against real session log fixtures.
- Be careful at the Japanese boundary: stemming must **not** apply to tokens containing CJK characters.
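
A minimal sketch of the last two points, comparing the current and proposed patterns and guarding the stemmer at the CJK boundary. The two hex patterns are the ones quoted in this issue; `CJK_PATTERN`, `shouldStem`, and `stemEnglishOnly` are hypothetical names, and the chosen code-point ranges are an assumption that may need widening (half-width katakana, for instance, is not covered).

```typescript
// Current pattern: any run of 6+ hex characters, digits optional.
const OLD_HEX_PATTERN = /^[0-9a-f]{6,}$/i;
// Proposed pattern: same run, but the lookahead requires at least one digit.
const NEW_HEX_PATTERN = /^(?=[0-9a-f]*[0-9])[0-9a-f]{6,}$/i;

// Assumed ranges: Hiragana and Katakana, CJK Extension A, CJK Unified Ideographs.
const CJK_PATTERN = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]/;

// A token is eligible for stemming only if it contains no CJK character.
function shouldStem(token: string): boolean {
  return !CJK_PATTERN.test(token);
}

// Hypothetical wrapper: `stem` is whichever stemmer ends up being chosen.
function stemEnglishOnly(
  token: string,
  stem: (t: string) => string,
): string {
  return shouldStem(token) ? stem(token) : token;
}

// English words built from a-f: matched by the old pattern, not the new one.
const words = ["decade", "facade", "defaced"];
const oldHits = words.filter((w) => OLD_HEX_PATTERN.test(w));
const newHits = words.filter((w) => NEW_HEX_PATTERN.test(w));
```

One trade-off to look for when checking against fixtures: all-letter hex strings such as `deadbeef` are no longer filtered by the proposed pattern, since they contain no digit.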

# Notes
- Independent of the Japanese tokenizer issue; the two can ship in either order. The TF-IDF modernization issue benefits from this landing first.
