Fix Japanese tokenization: a Japanese paragraph collapses into a single token today #29

@chigichan24

Why

  • The current tokenizer (scripts/knowledge-graph/tokenizer.ts:53-82) replaces ASCII punctuation with whitespace and then splits on \s+. Japanese text has no whitespace between words, so a paragraph like 「セッションの分析を実行する。次にテストを書く。」 comes out as a single token.
  • The cleaning regex [^a-z0-9぀-鿿] happens to keep Japanese punctuation (、 and 。 are in the U+3000 block, which falls inside that range), so even the punctuation does not act as a separator.
  • The result is one giant token per Japanese run that survives stop-word filtering and lands in the vocabulary unique to that session — useless for TF-IDF, clustering, or similarity.
  • The existing regression test (scripts/__tests__/tokenizer.test.ts:137) only asserts not.toThrow() and never inspects token contents, which is why this has stayed silent.
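
The failure mode can be reproduced in isolation. The sketch below is a hypothetical stand-in for the whitespace path described above, not the code in tokenizer.ts; the function name and the exact punctuation set are assumptions made for illustration.

```typescript
// Hypothetical stand-in for the behavior described above; NOT the project's tokenizer.
function whitespaceTokenize(text: string): string[] {
  return text
    .toLowerCase()
    // Replace every ASCII punctuation character with a space.
    .replace(/[\x21-\x2f\x3a-\x40\x5b-\x60\x7b-\x7e]/g, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

// English input splits as expected.
console.log(whitespaceTokenize("Run the tests, then deploy."));
// → ["run", "the", "tests", "then", "deploy"]

// Japanese input has no whitespace and no ASCII punctuation, so nothing splits it.
console.log(whitespaceTokenize("セッションの分析を実行する。次にテストを書く。").length);
// → 1
```

A test that only checks not.toThrow() passes on both inputs, which is why the second case went unnoticed.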

What

  • Tokenize Japanese into meaningful word-ish units so TF-IDF / clustering / similarity actually carry signal on Japanese-heavy sessions.
  • Add tests that assert token contents for Japanese input (the current crash-only test is not enough).
  • The public surface (tokenize(text: string): string[]) should stay the same so call sites do not change.

How (primary proposal)

  • Adopt new Intl.Segmenter('ja', { granularity: 'word' }). It is built into Node (available since Node 16), so the project's Node 22+ CI target is covered. Zero new dependencies, no WASM cold-start, CLDR-backed quality.
  • Mixed-script handling stays simple: detect CJK runs and feed them through Intl.Segmenter, fall back to the existing whitespace path for the rest.
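
A minimal sketch of the primary proposal, assuming a simplified fallback path. The CJK character ranges, the helper names (segmentJapanese, asciiTokenize) and the ASCII cleaning rule are illustrative assumptions, not the project's code; only the tokenize(text: string): string[] signature comes from this issue. Typing Intl.Segmenter is assumed to require the ES2022 Intl lib in tsconfig.

```typescript
// Assumed CJK ranges: the iteration mark 々, Hiragana and Katakana,
// CJK Extension A, and CJK Unified Ideographs. Japanese punctuation is
// deliberately excluded so that it separates runs.
const CJK_RUN = /[\u3005\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]+/g;

const jaSegmenter = new Intl.Segmenter("ja", { granularity: "word" });

// Split one CJK run into word-like segments.
function segmentJapanese(run: string): string[] {
  const words: string[] = [];
  for (const { segment, isWordLike } of jaSegmenter.segment(run)) {
    if (isWordLike) words.push(segment);
  }
  return words;
}

// Simplified stand-in for the existing whitespace path (assumption):
// lowercase, turn everything outside a-z0-9 into spaces, split on whitespace.
function asciiTokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

// Public surface stays tokenize(text: string): string[].
function tokenize(text: string): string[] {
  const tokens: string[] = [];
  let last = 0;
  for (const match of text.matchAll(CJK_RUN)) {
    const start = match.index ?? 0;
    // Non-CJK text before this run goes through the whitespace path.
    tokens.push(...asciiTokenize(text.slice(last, start)));
    // The CJK run itself goes through Intl.Segmenter.
    tokens.push(...segmentJapanese(match[0]));
    last = start + match[0].length;
  }
  tokens.push(...asciiTokenize(text.slice(last)));
  return tokens;
}

console.log(tokenize("セッションの分析を実行する。次にテストを書く。 Run tests"));
```

Exact word boundaries depend on the ICU data bundled with the runtime, so content tests are probably safest when they assert properties (more than one token per sentence, no token containing 。, ASCII words preserved) plus a small set of unambiguous words, rather than a full expected array.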

Alternative approaches to weigh

This issue does not mandate Intl.Segmenter. Other directions worth evaluating before committing:

  • Character n-grams (bi/tri-gram): zero deps, robust against mixed-script noise, but inflates vocabulary and adds noise to TF-IDF. Probably best as a fallback layer under Intl.Segmenter, not as a replacement.
  • kuromoji.js: classic pure-JS morphological analyzer, IPADIC-based. Highest quality among JS-only options but ships ~30 MB of dictionary on first run and the dictionary is dated.
  • lindera-wasm / sudachi.rs via WASM: modern morphological analyzers compiled to WASM. Better dictionaries, smaller cold-start than kuromoji once cached, but pulls in a WASM toolchain dependency.
  • Hybrid Intl.Segmenter + bigram fallback when the segmenter returns suspicious output (e.g. one very long token): a cheap defensive layer worth considering whatever the primary choice is.
  • LLM-based segmentation: explicitly out of scope — this layer must stay offline and cheap.
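
The hybrid option can be sketched as a post-processing step over whatever the primary segmenter returns. The threshold MAX_TOKEN_CHARS, the CJK check and the function names are assumptions chosen for illustration; a real threshold would need tuning against actual session data.

```typescript
// Assumed threshold: CJK tokens longer than this are treated as suspicious.
const MAX_TOKEN_CHARS = 8;

// Assumed CJK check: Hiragana, Katakana, CJK Extension A, CJK Unified Ideographs.
const HAS_CJK = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]/;

// Overlapping character bigrams, counted by code point rather than UTF-16 unit.
function charBigrams(text: string): string[] {
  const chars = Array.from(text);
  if (chars.length < 2) return chars.length === 1 ? [text] : [];
  const grams: string[] = [];
  for (let i = 0; i < chars.length - 1; i++) {
    grams.push(chars[i] + chars[i + 1]);
  }
  return grams;
}

// Keep normal tokens as-is; expand suspiciously long CJK tokens into bigrams.
function withBigramFallback(tokens: string[]): string[] {
  const result: string[] = [];
  for (const token of tokens) {
    const suspicious = HAS_CJK.test(token) && Array.from(token).length > MAX_TOKEN_CHARS;
    if (suspicious) {
      result.push(...charBigrams(token));
    } else {
      result.push(token);
    }
  }
  return result;
}

console.log(charBigrams("分析実行"));
// → ["分析", "析実", "実行"]
```

Because the fallback only fires on long CJK tokens, well-segmented output passes through unchanged, which keeps the vocabulary inflation of n-grams limited to the cases where the segmenter has visibly failed.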

Whichever path is taken, document the trade-off in docs/knowledge-graph-algorithm.md.

Notes

  • Pairs with the English tokenization issue (separate) and the TF-IDF modernization issue (separate). Land this first — current behavior is broken, not just suboptimal.

Labels

enhancement (New feature or request)

Projects

Status

In Progress
