Why
- The current tokenizer (scripts/knowledge-graph/tokenizer.ts:53-82) replaces ASCII punctuation with whitespace and then splits on \s+. Japanese text has no whitespace between words, so a paragraph like 「セッションの分析を実行する。次にテストを書く。」 comes through as a single string.
- The cleaning regex [^a-z0-9\u3000-\u9fff] happens to keep Japanese punctuation (。 and 、 sit in the CJK Symbols and Punctuation block, U+3000 to U+303F, which falls inside that range), so even the punctuation does not act as a separator.
- The result is one giant token per Japanese run. It survives stop-word filtering and lands in the vocabulary as a term unique to that session, which makes it useless for TF-IDF, clustering, or similarity.
- The existing regression test (scripts/__tests__/tokenizer.test.ts:137) only asserts not.toThrow() and never inspects token contents, which is why this has stayed silent.
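The failure mode can be reproduced in isolation. The sketch below is a hypothetical stand-in for the cleaning step, not the code in tokenizer.ts: the function name legacyTokenize, the lowercasing step, and the exact order of operations are assumptions made for illustration.

```typescript
// Hypothetical stand-in for the current cleaning + split step.
// Assumes: lowercase, replace everything outside the kept ranges with a
// space, then split on whitespace. Not the real tokenizer.ts.
function legacyTokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\u3000-\u9fff]/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

const sample = "「セッションの分析を実行する。次にテストを書く。」";
const tokens = legacyTokenize(sample);

// Every character of the sample (brackets, kana, kanji, 。) lies in
// U+3000..U+9FFF, so nothing is replaced and nothing is split.
console.log(tokens.length);        // 1
console.log(tokens[0] === sample); // true
```

English input still splits as expected under the same stand-in, which is why the problem only shows up on Japanese-heavy sessions.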
What
- Tokenize Japanese into meaningful word-ish units so TF-IDF / clustering / similarity actually carry signal on Japanese-heavy sessions.
- Add tests that assert token contents for Japanese input (the current crash-only test is not enough).
- The public surface (tokenize(text: string): string[]) should stay the same so call sites do not change.
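A content-level check does not have to pin the exact segmenter output, which can vary with the ICU version. One possible shape, sketched here with a hypothetical helper name (japaneseTokenProblems) and no test-runner dependency, is to assert properties that any acceptable tokenizer should satisfy:

```typescript
type Tokenize = (text: string) => string[];

// Hypothetical helper: returns a list of violations instead of throwing,
// so it can be wrapped in whatever matcher the test runner provides.
function japaneseTokenProblems(tokenize: Tokenize): string[] {
  const input = "セッションの分析を実行する。次にテストを書く。";
  const tokens = tokenize(input);
  const problems: string[] = [];

  if (tokens.length <= 1) problems.push("expected more than one token");
  if (tokens.includes(input)) problems.push("whole paragraph survived as a token");
  for (const t of tokens) {
    // Punctuation should separate tokens, not live inside them.
    if (/[。、「」]/.test(t)) problems.push(`punctuation inside token: ${t}`);
  }
  return problems;
}

// The single-giant-token behaviour violates all three properties.
const brokenTokenize: Tokenize = (text) => [text];
console.log(japaneseTokenProblems(brokenTokenize).length); // 3
```

In the project's test file this would presumably reduce to a single expectation that the returned list is empty for the real tokenize.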
How (primary proposal)
- Adopt new Intl.Segmenter('ja', { granularity: 'word' }). It is built into Node (available since Node 16), so the project's Node 22+ CI target is covered. Zero new dependencies, no WASM cold-start, CLDR-backed quality.
- Mixed-script handling stays simple: detect CJK runs and feed them through Intl.Segmenter, then fall back to the existing whitespace path for the rest.
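The routing described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the proposed patch: CJK_RUN, segmentJapanese, and latinTokens are hypothetical names, the CJK range simply reuses U+3000 to U+9FFF, the Latin path is a simplified stand-in for the existing one, and stop-word filtering is left out. It also assumes a Node build with full ICU data.

```typescript
// Assumed CJK range: the same span the current cleaning regex keeps.
const CJK_RUN = /[\u3000-\u9fff]+/g;
const jaSegmenter = new Intl.Segmenter("ja", { granularity: "word" });

// Segment one CJK run and keep only word-like segments.
function segmentJapanese(run: string): string[] {
  const out: string[] = [];
  for (const seg of jaSegmenter.segment(run)) {
    // isWordLike is false for punctuation and whitespace segments.
    if (seg.isWordLike) out.push(seg.segment);
  }
  return out;
}

// Simplified stand-in for the existing whitespace path.
function latinTokens(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

// Same public shape as today: tokenize(text: string): string[]
function tokenize(text: string): string[] {
  const tokens: string[] = [];
  let last = 0;
  for (const m of text.matchAll(CJK_RUN)) {
    const start = m.index ?? 0;
    tokens.push(...latinTokens(text.slice(last, start))); // text before the run
    tokens.push(...segmentJapanese(m[0]));                // the CJK run itself
    last = start + m[0].length;
  }
  tokens.push(...latinTokens(text.slice(last)));          // trailing text
  return tokens;
}

console.log(tokenize("Run セッションの分析を実行する。 then test"));
```

Token order follows the input, so downstream consumers that care about position see Latin and Japanese tokens interleaved as written.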
Alternative approaches to weigh
This issue does not mandate Intl.Segmenter. Other directions worth evaluating before committing:
- Character n-grams (bi/tri-gram): zero deps, robust against mixed-script noise, but inflates vocabulary and adds noise to TF-IDF. Probably best as a fallback layer under Intl.Segmenter, not as a replacement.
- kuromoji.js: classic pure-JS morphological analyzer, IPADIC-based. Highest quality among JS-only options but ships ~30 MB of dictionary on first run and the dictionary is dated.
- lindera-wasm / sudachi.rs via WASM: modern morphological analyzers compiled to WASM. Better dictionaries, smaller cold-start than kuromoji once cached, but pulls in a WASM toolchain dependency.
- Hybrid Intl.Segmenter + bigram fallback when the segmenter returns suspicious output (e.g. one very long token): a cheap defensive layer worth considering whatever the primary choice is.
- LLM-based segmentation: explicitly out of scope — this layer must stay offline and cheap.
Whichever path is taken, document the trade-off in docs/knowledge-graph-algorithm.md.
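For the hybrid option, a rough sketch of the defensive layer is shown below. The threshold MAX_TOKEN_CHARS and the function names are hypothetical, and "suspicious" is interpreted here only as "longer than the threshold"; a real implementation may want other signals.

```typescript
// Hypothetical threshold for a "suspiciously long" segment, in code points.
const MAX_TOKEN_CHARS = 12;

// Overlapping character bigrams, iterating by code point rather than
// UTF-16 unit so surrogate pairs are not split.
function charBigrams(s: string): string[] {
  const chars = Array.from(s);
  if (chars.length < 2) return chars;
  const grams: string[] = [];
  for (let i = 0; i < chars.length - 1; i++) {
    grams.push(chars.slice(i, i + 2).join(""));
  }
  return grams;
}

const fallbackSegmenter = new Intl.Segmenter("ja", { granularity: "word" });

// Use the segmenter's output, but break any over-long segment into bigrams.
function segmentWithFallback(run: string): string[] {
  const out: string[] = [];
  for (const seg of fallbackSegmenter.segment(run)) {
    if (!seg.isWordLike) continue; // drop punctuation and whitespace
    if (Array.from(seg.segment).length > MAX_TOKEN_CHARS) {
      out.push(...charBigrams(seg.segment));
    } else {
      out.push(seg.segment);
    }
  }
  return out;
}

console.log(charBigrams("テスト")); // [ 'テス', 'スト' ]
```

Whatever the segmenter returns, no token leaving segmentWithFallback is longer than the threshold, which bounds the damage a single bad segment can do to the vocabulary.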
Notes
- Pairs with the English tokenization issue (separate) and the TF-IDF modernization issue (separate). Land this first — current behavior is broken, not just suboptimal.