Fix Japanese tokenization: a Japanese paragraph collapses into a single token today #29

# Why
- The current tokenizer (`scripts/knowledge-graph/tokenizer.ts:53-82`) splits on `\s+` after replacing ASCII punctuation with whitespace. Japanese text has no whitespace, so a paragraph like 「セッションの分析を実行する。次にテストを書く。」 enters as a single string.
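- The cleaning regex `[^a-z0-9぀-鿿]` keeps everything in U+3040–U+9FFF (hiragana, katakana, CJK ideographs). Note that `。` (U+3002) and `、` (U+3001) sit in the CJK Symbols and Punctuation block (U+3000–U+303F), which lies just below that range rather than inside it, so whether Japanese punctuation is **kept** or acts as a separator depends on the regex's actual lower bound and should be confirmed against the source. Either way, nothing splits the text between punctuation marks.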
- The result is one giant token per Japanese run. It survives stop-word filtering and enters the vocabulary as a term unique to that session, which makes it useless for TF-IDF, clustering, or similarity.
- The existing regression test (`scripts/__tests__/tokenizer.test.ts:137`) only asserts `not.toThrow()` and never inspects token contents, which is why this has stayed silent.
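
A minimal sketch of the failure mode, for reproduction purposes. `naiveTokenize` is a hypothetical stand-in reconstructed from the description above (ASCII punctuation replaced with spaces, then a split on `\s+`); the real code in `tokenizer.ts` may differ in its details.

```typescript
// Hypothetical stand-in for the current tokenizer, reconstructed from the
// description in this issue; it is not the actual code in tokenizer.ts.
function naiveTokenize(text: string): string[] {
  return text
    .toLowerCase()
    // Replace ASCII punctuation with spaces.
    .replace(/[!-\/:-@\[-`{-~]/g, " ")
    // Split on whitespace, the only separator this path knows about.
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

const englishTokens = naiveTokenize("Run the session analysis, then write tests.");
const japaneseTokens = naiveTokenize("セッションの分析を実行する");

console.log(englishTokens.length);  // 7
console.log(japaneseTokens.length); // 1: the whole run is a single token
```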

# What
- Tokenize Japanese into meaningful word-ish units so TF-IDF / clustering / similarity actually carry signal on Japanese-heavy sessions.
- Add tests that assert token **contents** for Japanese input (the current crash-only test is not enough).
- The public surface (`tokenize(text: string): string[]`) should stay the same so call sites do not change.
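
As a rough illustration of what "asserting token contents" could mean, the sketch below checks a few segmenter-independent invariants instead of an exact token list, since exact output may vary with the segmentation backend. `japaneseTokenProblems` is a hypothetical helper, not something that exists in the repo; a real test would wrap it in the project's test framework.

```typescript
// Hypothetical helper: lists the problems found in the tokens produced for a
// Japanese input. An empty list means all content checks passed.
function japaneseTokenProblems(input: string, tokens: string[]): string[] {
  const problems: string[] = [];
  if (tokens.length < 2) {
    problems.push("input was not split into multiple tokens");
  }
  if (tokens.includes(input)) {
    problems.push("a token equals the whole input");
  }
  if (tokens.some((token) => /[。、]/.test(token))) {
    problems.push("a token contains Japanese punctuation");
  }
  if (tokens.some((token) => token.length === 0)) {
    problems.push("empty token");
  }
  return problems;
}

const sampleInput = "セッションの分析を実行する。次にテストを書く。";

// The collapsed output described in this issue fails the checks...
console.log(japaneseTokenProblems(sampleInput, [sampleInput]));
// ...while a word-level split passes them.
console.log(japaneseTokenProblems(sampleInput, ["セッション", "の", "分析"]));
```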

# How (primary proposal)
- Adopt `new Intl.Segmenter('ja', { granularity: 'word' })`. It has shipped with Node since version 16, so the project's Node 22+ CI target covers it. Zero new dependencies, no WASM cold-start, CLDR-backed quality.
- Mixed-script handling stays simple: detect CJK runs and feed them through `Intl.Segmenter`; everything else falls back to the existing whitespace path.
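
The proposal above could look roughly like the following sketch. It keeps the `tokenize(text: string): string[]` signature; everything else is an assumption: `CJK_RUN`, `segmentJapanese`, and `tokenizeWhitespace` are hypothetical names, the U+3040–U+9FFF range mirrors the regex quoted earlier, and `tokenizeWhitespace` is only a simplified stand-in for the existing path.

```typescript
// Word-level segmenter for Japanese, built into the runtime.
const JA_SEGMENTER = new Intl.Segmenter("ja", { granularity: "word" });

// Runs of hiragana, katakana, and CJK ideographs (assumed range U+3040-U+9FFF).
const CJK_RUN = /[\u3040-\u9fff]+/g;

// Segment one CJK run, keeping only word-like segments.
function segmentJapanese(run: string): string[] {
  const words: string[] = [];
  for (const { segment, isWordLike } of JA_SEGMENTER.segment(run)) {
    if (isWordLike) words.push(segment);
  }
  return words;
}

// Simplified stand-in for the existing whitespace path (not the real code).
function tokenizeWhitespace(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

// Same public signature as today: CJK runs go through the segmenter,
// everything between them goes through the whitespace path.
function tokenize(text: string): string[] {
  const tokens: string[] = [];
  let last = 0;
  for (const match of text.matchAll(CJK_RUN)) {
    const start = match.index ?? 0;
    tokens.push(...tokenizeWhitespace(text.slice(last, start)));
    tokens.push(...segmentJapanese(match[0]));
    last = start + match[0].length;
  }
  tokens.push(...tokenizeWhitespace(text.slice(last)));
  return tokens;
}

console.log(tokenize("tokenizer のテストを実行する, then run tests"));
```

Token order follows the input, so downstream consumers that depend on position should see the same ordering as before. The exact Japanese segments depend on the ICU data bundled with the runtime, which is why no expected output is shown.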

# Alternative approaches to weigh
This issue does **not** mandate `Intl.Segmenter`. Other directions worth evaluating before committing:

- **Character n-grams (bi/tri-gram)**: zero deps, robust against mixed-script noise, but inflates vocabulary and adds noise to TF-IDF. Probably best as a *fallback* layer under `Intl.Segmenter`, not as a replacement.
- **kuromoji.js**: classic pure-JS morphological analyzer, IPADIC-based. Highest quality among JS-only options but ships ~30 MB of dictionary on first run and the dictionary is dated.
- **lindera-wasm / sudachi.rs via WASM**: modern morphological analyzers compiled to WASM. Better dictionaries, smaller cold-start than kuromoji once cached, but pulls in a WASM toolchain dependency.
- **Hybrid `Intl.Segmenter` + bigram fallback** when the segmenter returns suspicious output (e.g. one very long token): a cheap defensive layer worth considering whatever the primary choice is.
- **LLM-based segmentation**: explicitly out of scope — this layer must stay offline and cheap.
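
For the bigram and hybrid options above, a minimal sketch of a defensive fallback layer follows. `charBigrams`, `withBigramFallback`, and the `MAX_TOKEN_CHARS` threshold of 12 are all hypothetical; the threshold in particular is an arbitrary placeholder that would need tuning against real sessions.

```typescript
// Overlapping character bigrams, split by code point rather than UTF-16 unit.
function charBigrams(run: string): string[] {
  const chars = Array.from(run);
  if (chars.length < 2) return chars;
  const grams: string[] = [];
  for (let i = 0; i < chars.length - 1; i++) {
    grams.push(chars.slice(i, i + 2).join(""));
  }
  return grams;
}

// Assumed threshold: a CJK token longer than this is treated as suspicious.
const MAX_TOKEN_CHARS = 12;

// Matches tokens made up entirely of kana / CJK ideographs (assumed range).
const CJK_ONLY = /^[\u3040-\u9fff]+$/;

// Replace suspiciously long CJK tokens with their bigrams; keep the rest.
function withBigramFallback(tokens: string[]): string[] {
  return tokens.flatMap((token) =>
    CJK_ONLY.test(token) && Array.from(token).length > MAX_TOKEN_CHARS
      ? charBigrams(token)
      : [token],
  );
}

console.log(charBigrams("分析する")); // ["分析", "析す", "する"]
```

Because the layer only post-processes a token list, it can sit under whichever primary segmenter is chosen without changing the `tokenize` signature.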

Whichever path is taken, document the trade-off in `docs/knowledge-graph-algorithm.md`.

# Notes
- Pairs with the English tokenization issue (separate) and the TF-IDF modernization issue (separate). Land this **first** — current behavior is broken, not just suboptimal.
