Fix Japanese tokenization via Intl.Segmenter (#29) by chigichan24 · Pull Request #47 · chigichan24/crune

chigichan24 · 2026-04-25T17:07:20Z

Summary

Fixes a tokenizer bug where any Japanese paragraph collapsed into a single oversized token (issue #29), producing zero TF-IDF / clustering signal for Japanese-heavy sessions.

Closes #29

Method choice & rationale

Adopt Intl.Segmenter('ja', { granularity: 'word' }) for CJK runs.

Built into Node 22+ — matches the project's CI target.
Zero new dependencies, no WASM cold-start, offline, CLDR-backed.
Detects CJK runs (hiragana U+3040–309F, katakana U+30A0–30FF, CJK Unified Ideographs U+4E00–9FFF) and pipes them through the segmenter; non-CJK portions keep the existing whitespace + kebab/snake/CamelCase path untouched.
Public surface tokenize(text: string): string[] is unchanged.
Pure-CJK tokens are kept at length >= 2 because Japanese kanji compounds (e.g. 分析・実装・修正) are 2-character words. English-side tokens still use length > 2.
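
The CJK path described above can be sketched roughly as follows. This is a minimal illustration, not the code in `tokenizer.ts`: the names `CJK_RUN` and `segmentCjkRuns` are hypothetical, and the sketch covers only the CJK side (run detection, segmentation, and the `length >= 2` filter), leaving out the English path and stop-word filtering.

```typescript
// Hypothetical names for illustration; not the PR's actual identifiers.
// Hiragana, katakana, and CJK Unified Ideographs, as listed in the description.
const CJK_RUN = /[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]+/g;
const jaSegmenter = new Intl.Segmenter("ja", { granularity: "word" });

function segmentCjkRuns(text: string): string[] {
  const out: string[] = [];
  // matchAll iterates over each contiguous CJK run; anything else is skipped here.
  for (const match of text.matchAll(CJK_RUN)) {
    for (const seg of jaSegmenter.segment(match[0])) {
      // Keep segments of length >= 2, mirroring the pure-CJK length rule.
      if (seg.segment.length >= 2) out.push(seg.segment);
    }
  }
  return out;
}

const cjkTokens = segmentCjkRuns("セッションの分析を実行する。次にテストを書く。");
console.log(cjkTokens);
```

The exact segments depend on the ICU data bundled with the runtime, so the output is printed rather than stated here.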

STOP_WORDS additions (verb conjugation fragments / connective auxiliaries the segmenter emits as standalone segments and that carry no standalone signal): ている, てい, しない, 次に, に従って. Content-bearing stems like 行う / 書く / 修正 are intentionally not added.

Alternatives considered

kuromoji.js — Pure-JS morphological analyzer, but ships ~30 MB IPADIC dictionary download on first run and the dictionary is dated. Heavy for the marginal quality bump.
lindera-wasm / sudachi.rs via WASM — Better dictionaries, but pulls in a WASM toolchain dependency for limited additional gain over Intl.Segmenter for our TF-IDF feature-extraction use case.
Char n-gram (bi/tri-gram) as primary — Explodes vocabulary and hurts TF-IDF signal. May be revisited as a defensive fallback layer if Intl.Segmenter quality proves insufficient on real corpora — out of scope for this PR.
LLM-based segmentation — Explicitly out of scope; this layer must stay offline and cheap.
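
To make the character n-gram alternative concrete, here is a small sketch of what such a tokenizer would emit. The helper `charNgrams` is hypothetical and not part of this PR; it only illustrates why the vocabulary grows quickly, since every adjacent character pair becomes a term, including pairs that straddle word boundaries.

```typescript
// Hypothetical helper, for illustration only.
function charNgrams(text: string, n: number): string[] {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  const grams: string[] = [];
  for (let i = 0; i + n <= chars.length; i++) {
    grams.push(chars.slice(i, i + n).join(""));
  }
  return grams;
}

console.log(charNgrams("分析を実行", 2)); // → ["分析", "析を", "を実", "実行"]
```

Two of the four bigrams here (析を, を実) cross word boundaries and carry no topical meaning, which is the noise the description refers to.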

Before / after

| input | before (single giant token) | after (segmented) |
| --- | --- | --- |
| `セッションの分析を実行する。次にテストを書く。` | `["セッションの分析を実行する次にテストを書く"]` (one token, the whole sentence) | `["セッション","分析","実行","テスト","書く"]` |
| `TypeScriptの型エラーを修正` | `["typescript","の型エラーを修正"]` (English split OK, JA collapsed) | `["type","script","エラー","修正"]` |
| `バグを修正していく` | `["バグを修正していく"]` | `["バグ","修正"]` |

The bug from #29 is reproduced and asserted as a regression test: a multi-sentence Japanese paragraph must yield multiple tokens, no single token longer than 20 chars, and the whole paragraph must not survive as one token.
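
The three regression conditions can be expressed as a small checker. This is a sketch under assumptions: `regressionViolations` is a hypothetical helper, the real assertions live in the PR's tokenizer test suite, and the token arrays below are copied from the before / after table rather than produced by running the tokenizer.

```typescript
// Hypothetical helper; returns a list of violated regression conditions.
function regressionViolations(paragraph: string, tokens: string[]): string[] {
  const problems: string[] = [];
  if (tokens.length < 2) problems.push("expected multiple tokens");
  if (tokens.some((t) => t.length > 20)) problems.push("token longer than 20 chars");
  // The paragraph with punctuation and whitespace removed is the collapsed form.
  const collapsed = paragraph.replace(/[。、\s]/g, "");
  if (tokens.includes(paragraph) || tokens.includes(collapsed)) {
    problems.push("whole paragraph survived as one token");
  }
  return problems;
}

const paragraph = "セッションの分析を実行する。次にテストを書く。";
// Token arrays taken from the before / after table above.
const beforeTokens = ["セッションの分析を実行する次にテストを書く"];
const afterTokens = ["セッション", "分析", "実行", "テスト", "書く"];

console.log(regressionViolations(paragraph, beforeTokens)); // non-empty: the bug
console.log(regressionViolations(paragraph, afterTokens)); // → []
```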

Test plan

npm test (212 / 212 tests pass; tokenizer suite expanded from 24 → 27)
npm run lint (0 errors; the 2 pre-existing react-hooks/exhaustive-deps warnings in PlaybackSidePanel.tsx are unrelated)
npx tsc --noEmit -p tsconfig.app.json (clean)
npx tsc --noEmit -p tsconfig.node.json (clean)
Manually verified token output on the seven representative inputs from issue #29 (Fix Japanese tokenization: a Japanese paragraph collapses into a single token today)
Issue #18 (Bug: RangeError: Maximum call stack size exceeded in tokenize() on large session corpora) large-input regression tests still pass (no RangeError on 100k path corpora)

🤖 Generated with Claude Code

Before this change, a Japanese paragraph collapsed into a single oversized token: the tokenizer split on `\s+`, but Japanese has no whitespace, and the cleaning regex `[^a-z0-9\u3040-\u9fff]` deletes `。` / `、` (U+3002 / U+3001, in the CJK Symbols and Punctuation block U+3000-303F, just below the kept range) instead of splitting on them, so even Japanese punctuation did not act as a separator. The result was one giant token per Japanese run that survived stop-word filtering and landed in the vocabulary unique to that session, providing zero TF-IDF signal.

Approach: detect CJK runs (hiragana U+3040-309F, katakana U+30A0-30FF, CJK Unified Ideographs U+4E00-9FFF) and pipe them through `Intl.Segmenter('ja', { granularity: 'word' })`. Built into Node 22+ (the project's CI target), CLDR-backed, zero new deps, offline. Non-CJK portions keep the existing whitespace + kebab/snake/CamelCase logic untouched.

The English-side `length > 2` filter is preserved; pure-CJK tokens are kept at `length >= 2` because Japanese kanji compounds (分析・実装・修正) are 2-character words.

STOP_WORDS additions (verb conjugation fragments / connective auxiliaries that the segmenter emits as standalone segments and that carry no standalone signal): ている, てい, しない, 次に, に従って. Content-bearing stems such as 行う / 書く / 修正 are intentionally NOT added.

Public surface `tokenize(text: string): string[]` is unchanged.

Closes #29

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
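
The collapse can be reproduced with a simplified reconstruction of the old pipeline. This is an assumption-laden sketch: `legacyTokenize` is a hypothetical name, and it models only the whitespace split, the cleaning regex, and the `length > 2` filter, omitting the kebab/snake/CamelCase splitting and stop words.

```typescript
// Hypothetical, simplified reconstruction of the pre-fix behavior.
function legacyTokenize(text: string): string[] {
  return text
    .split(/\s+/) // Japanese text contains no whitespace, so nothing splits
    .map((t) => t.toLowerCase().replace(/[^a-z0-9\u3040-\u9fff]/g, ""))
    .filter((t) => t.length > 2);
}

console.log(legacyTokenize("セッションの分析を実行する。次にテストを書く。"));
// → ["セッションの分析を実行する次にテストを書く"]
```

The output matches the "before" column of the PR description: the two sentences are joined into one token with the `。` characters dropped.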

Replace the `not.toThrow()` smoke test with three tests that actually inspect token contents:

- Pure JA: セッションの分析を実行する -> tokens include セッション, 分析, 実行 and the whole sentence does NOT survive as a single token.
- Mixed JA/EN: TypeScriptの型エラーを修正 -> English side lowercased CamelCase split (type, script) and JA side 2-char compounds preserved (エラー, 修正).
- Long-paragraph regression: a multi-sentence Japanese paragraph produces multiple tokens, none longer than 20 chars, and the whole paragraph is not a token. This directly reproduces the bug from #29.

The Issue #18 large-input regression tests for `extractPathTokens` and `tokenize` are kept intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`Intl.Segmenter('ja')` does not emit `ている` as a single segment; it splits as て + いる, so the stop-word entry was never matched. Replace it with `いる` (the standalone Hiragana fragment that does leak through as length-2 noise) and add `したら` (verb conditional, also emitted as one segment with no topic signal).

Add a regression test that asserts the live verb-fragment stop words are filtered while content stems (実装 / 完了 / 仕様 / テスト) survive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
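
A quick way to check which fragments the segmenter actually emits, before adding them to `STOP_WORDS`, is to print the raw segments. The helper name `segmentsOf` and the sample input are hypothetical; the output is not shown here because it depends on the ICU data shipped with the runtime.

```typescript
// Hypothetical inspection helper; prints the raw word segments for a string.
const inspectSegmenter = new Intl.Segmenter("ja", { granularity: "word" });

function segmentsOf(text: string): string[] {
  return Array.from(inspectSegmenter.segment(text), (s) => s.segment);
}

// Inspect how a progressive verb form is split on this runtime.
console.log(segmentsOf("実装している"));
```

Stop-word entries only match if they equal one of the printed segments exactly, which is why an entry for a form the segmenter never emits has no effect.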

…ndex

The CJK-run extraction was mutating `CJK_RUN_RE.lastIndex` on a module-scoped /g regex, then iterating with `.exec()`. The reset to 0 on entry made the current single-threaded path correct, but any future re-entry into `tokenize` (or a callback path through `segmentJapanese`) would corrupt the shared cursor. `String.prototype.matchAll` returns a fresh iterator per call, so the cursor is local to this invocation. Same observable behavior, less fragile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
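
The re-entrancy hazard can be demonstrated in isolation. This sketch uses hypothetical names (`EXEC_RE`, `MATCHALL_RE`, `runsWithExec`, `runsWithMatchAll`) and a Latin-letter pattern for readability; it is not the tokenizer's code, only an illustration of how a shared `lastIndex` behaves when the function is re-entered mid-iteration.

```typescript
// Hypothetical demo of a shared /g regex cursor versus matchAll.
const EXEC_RE = /[a-z]+/g;
const MATCHALL_RE = /[a-z]+/g;

function runsWithExec(text: string, onRun?: (run: string) => void): string[] {
  EXEC_RE.lastIndex = 0; // reset on entry, as the old code did
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = EXEC_RE.exec(text)) !== null) {
    out.push(m[0]);
    onRun?.(m[0]); // a callback may re-enter and move the shared cursor
  }
  return out;
}

function runsWithMatchAll(text: string, onRun?: (run: string) => void): string[] {
  const out: string[] = [];
  // matchAll works on its own copy of the regex, so the cursor is per call.
  for (const m of text.matchAll(MATCHALL_RE)) {
    out.push(m[0]);
    onRun?.(m[0]);
  }
  return out;
}

let reenteredExec = false;
const viaExec = runsWithExec("ab cd", () => {
  if (!reenteredExec) { reenteredExec = true; runsWithExec("zz"); }
});

let reenteredMatchAll = false;
const viaMatchAll = runsWithMatchAll("ab cd", () => {
  if (!reenteredMatchAll) { reenteredMatchAll = true; runsWithMatchAll("zz"); }
});

console.log(viaExec);     // → ["ab", "ab", "cd"] (cursor was reset by the inner call)
console.log(viaMatchAll); // → ["ab", "cd"]
```

The re-entry is guarded to happen once; without the guard, the `exec` variant would rewind to the start on every callback and never terminate.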

Copilot

Pull request overview

Fixes Japanese tokenization so Japanese-heavy text no longer collapses into a single oversized token, restoring TF‑IDF / clustering signal (Issue #29).

Changes:

Add CJK-run detection and Intl.Segmenter('ja', { granularity: 'word' })-based segmentation in the tokenizer while preserving the existing English splitting pipeline.
Extend Japanese stop words to filter low-signal auxiliary fragments emitted by the segmenter.
Expand tokenizer tests with Japanese, mixed-script, and regression coverage for the original “single giant token” failure mode.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `scripts/knowledge-graph/tokenizer.ts` | Adds CJK run detection, `Intl.Segmenter`-based Japanese segmentation, and centralizes token post-processing in `pushClean()`. |
| `scripts/knowledge-graph/constants.ts` | Updates `STOP_WORDS` with additional Japanese auxiliary/fragment tokens relevant to `Intl.Segmenter` output. |
| `scripts/__tests__/tokenizer.test.ts` | Replaces the crash-only Japanese test with assertions for meaningful segmentation and regression tests. |


Copilot · 2026-04-29T15:41:49Z

+ *  - pure-CJK tokens: length >= 2 (Japanese compounds such as
+ *    bunseki / jissou / shusei are 2-character kanji words and must survive).
+ */
+function pushClean(token: string, sink: string[]): void {
+  const clean = token.toLowerCase().replace(/[^a-z0-9\u3040-\u9fff]/g, "");
+  if (clean.length === 0) return;
+  const minLen = CJK_CHAR_RE.test(clean) ? 2 : 3;
+  if (


The doc comment says the relaxed length rule applies to "pure-CJK tokens", but the implementation uses CJK_CHAR_RE.test(clean) (any CJK char present) to choose minLen = 2. Either tighten the check to ensure the token is CJK-only, or adjust the comment to match the actual behavior.
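
The difference between the two checks shows up on mixed tokens. The sketch below uses hypothetical names (`HAS_CJK`, `ONLY_CJK`, `minLenLoose`, `minLenStrict`) and assumes the same `\u3040-\u9fff` range as the cleaning regex in the hunk; it is one possible way to tighten the rule, not necessarily the change that was applied.

```typescript
// Hypothetical comparison of "contains CJK" versus "CJK-only" length rules.
const HAS_CJK = /[\u3040-\u9fff]/;
const ONLY_CJK = /^[\u3040-\u9fff]+$/;

// Loose rule: any CJK character relaxes the minimum length to 2.
function minLenLoose(token: string): number {
  return HAS_CJK.test(token) ? 2 : 3;
}

// Strict rule: only tokens made entirely of CJK characters get the relaxed limit.
function minLenStrict(token: string): number {
  return ONLY_CJK.test(token) ? 2 : 3;
}

console.log(minLenLoose("a分"));  // → 2 (mixed token kept at length 2)
console.log(minLenStrict("a分")); // → 3 (mixed token needs length 3)
console.log(minLenStrict("分析")); // → 2
```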

Copilot · 2026-04-29T15:41:50Z

+  // Note: `ている` is intentionally NOT in this list because the Segmenter
+  // does not emit it as one segment — it is split as て + いる. The
+  // standalone fragment we actually need to filter is `いる`.
+  // See https://github.com/chigichan24/crune/issues/29
+  "いる", "てい", "しない", "したら", "次に", "に従って",


PR description mentions adding ている to STOP_WORDS, but the code/comment here explicitly says ている is not added (and instead relies on filtering いる). Please align the PR description (or the stop-word list/comment) so the documented behavior matches what actually ships.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

chigichan24 and others added 2 commits April 26, 2026 02:06

chigichan24 mentioned this pull request Apr 25, 2026

Improve English tokenization: Porter stemming, NLTK-parity stop words, HEX fix (#30) #48

Open

7 tasks

chigichan24 and others added 2 commits April 26, 2026 02:22

chigichan24 requested a review from Copilot April 29, 2026 15:36

Copilot started reviewing on behalf of chigichan24 April 29, 2026 15:36

Copilot AI reviewed Apr 29, 2026


chigichan24 and others added 3 commits April 30, 2026 00:46

Update scripts/knowledge-graph/tokenizer.ts

1f209ce

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update scripts/knowledge-graph/constants.ts

185eb98

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update scripts/knowledge-graph/tokenizer.ts

8fa2f66

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

chigichan24 wants to merge 7 commits into main from feature/29-japanese-tokenization

2 participants
