Fix Japanese tokenization via Intl.Segmenter (#29) #47
Before this change, a Japanese paragraph collapsed into a single
oversized token: the tokenizer split on `\s+`, but Japanese has no
whitespace, and the cleaning regex `[^a-z0-9\u3000-\u9fff]` happens
to *keep* `。` / `、` (U+3002 and U+3001, both inside the kept range), so
even Japanese punctuation did not act as a separator. The result was
one giant token per Japanese run that survived stop-word filtering and
landed in the vocabulary unique to that session, providing zero TF-IDF
signal.
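The failure mode can be reproduced in a few lines. This is a minimal sketch, not the project's actual code: `oldTokenize` is a hypothetical stand-in that assumes the old pipeline lowercased the text, split on `\s+`, stripped everything outside `a-z`, `0-9`, and U+3000-U+9FFF, and kept tokens longer than 2 characters.

```typescript
// Hypothetical stand-in for the pre-fix pipeline (assumed, not copied from the repo).
function oldTokenize(input: string): string[] {
  return input
    .toLowerCase()
    .split(/\s+/) // Japanese has no spaces, so a JA sentence is never split here
    .map((t) => t.replace(/[^a-z0-9\u3000-\u9fff]/g, "")) // keeps 。 (U+3002) and 、 (U+3001)
    .filter((t) => t.length > 2);
}

const jaTokens = oldTokenize("セッションの分析を実行する。次にテストを書く。");
console.log(jaTokens.length); // the whole paragraph comes back as a single token
```

Because neither the split nor the cleaning step treats Japanese punctuation as a boundary, both sentences end up inside the same token.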
Approach: detect CJK runs (hiragana U+3040-309F, katakana U+30A0-30FF,
CJK Unified Ideographs U+4E00-9FFF) and pipe them through
`Intl.Segmenter('ja', { granularity: 'word' })`. Built into Node 22+
(the project's CI target), CLDR-backed, zero new deps, offline. Non-CJK
portions keep the existing whitespace + kebab/snake/CamelCase logic
untouched. The English-side `length > 2` filter is preserved; pure-CJK
tokens are kept at `length >= 2` because Japanese kanji compounds
(分析・実装・修正) are 2-character words.
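The approach above can be sketched roughly as follows. All names here (`CJK_RUN`, `splitByScript`, `segmentRun`) are hypothetical and chosen for illustration; only the Unicode ranges and the `Intl.Segmenter` options come from the description. The exact segments returned depend on the ICU/CLDR data shipped with the runtime.

```typescript
// Hypothetical names; ranges follow the commit message (hiragana, katakana, CJK ideographs).
const CJK_RUN = /[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]+/g;
const jaSegmenter = new Intl.Segmenter("ja", { granularity: "word" });

interface Part {
  text: string;
  cjk: boolean;
}

// Split the input into alternating non-CJK and CJK parts, preserving order.
function splitByScript(input: string): Part[] {
  const parts: Part[] = [];
  let last = 0;
  for (const m of input.matchAll(CJK_RUN)) {
    const start = m.index ?? 0;
    if (start > last) parts.push({ text: input.slice(last, start), cjk: false });
    parts.push({ text: m[0], cjk: true });
    last = start + m[0].length;
  }
  if (last < input.length) parts.push({ text: input.slice(last), cjk: false });
  return parts;
}

// Segment one CJK run into word-like pieces.
function segmentRun(run: string): string[] {
  return Array.from(jaSegmenter.segment(run))
    .filter((s) => s.isWordLike)
    .map((s) => s.segment);
}

const mixedParts = splitByScript("TypeScriptの型エラーを修正");
const jaSegments = mixedParts.filter((p) => p.cjk).flatMap((p) => segmentRun(p.text));
console.log(mixedParts, jaSegments);
```

In this sketch the non-CJK part (`TypeScript`) would be handed to the existing English splitter unchanged, and only the CJK part goes through the segmenter.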
STOP_WORDS additions (verb conjugation fragments / connective auxiliaries
that the segmenter emits as standalone segments and that carry no
standalone signal): ている, てい, しない, 次に, に従って. Content-bearing
stems such as 行う / 書く / 修正 are intentionally NOT added.
Public surface `tokenize(text: string): string[]` is unchanged.
Closes #29
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the `not.toThrow()` smoke test with three tests that actually inspect token contents:
- Pure JA: セッションの分析を実行する -> tokens include セッション, 分析, 実行 and the whole sentence does NOT survive as a single token.
- Mixed JA/EN: TypeScriptの型エラーを修正 -> English side lowercased CamelCase split (type, script) and JA side 2-char compounds preserved (エラー, 修正).
- Long-paragraph regression: a multi-sentence Japanese paragraph produces multiple tokens, none longer than 20 chars, and the whole paragraph is not a token. This directly reproduces the bug from #29.
The Issue #18 large-input regression tests for `extractPathTokens` and `tokenize` are kept intact.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
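The three conditions of the long-paragraph regression can be expressed as a small framework-independent check. This is only a sketch: `checkNoGiantToken` is a hypothetical helper, not a function from the test suite, and the sample token lists are taken from the PR description rather than produced by the real tokenizer.

```typescript
// Hypothetical helper mirroring the three regression conditions described above.
function checkNoGiantToken(tokens: string[], paragraph: string, maxLen = 20): string[] {
  const problems: string[] = [];
  if (tokens.length < 2) problems.push("expected multiple tokens");
  if (tokens.includes(paragraph)) problems.push("whole paragraph survived as one token");
  for (const t of tokens) {
    if (t.length > maxLen) problems.push(`token longer than ${maxLen} chars: ${t}`);
  }
  return problems;
}

const paragraph = "セッションの分析を実行する。次にテストを書く。";
const buggy = checkNoGiantToken([paragraph], paragraph);
const fixed = checkNoGiantToken(["セッション", "分析", "実行", "テスト", "書く"], paragraph);
console.log(buggy.length, fixed.length);
```

Returning a list of problems instead of throwing keeps the helper usable from any test runner.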
`Intl.Segmenter('ja')` does not emit `ている` as a single segment — it
splits as て + いる, so the stop-word entry was never matched. Replace
it with `いる` (the standalone Hiragana fragment that does leak through
as length-2 noise) and add `したら` (verb conditional, also emitted as
one segment with no topic signal).
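Since a stop word only helps if the segmenter actually emits it as a standalone segment, it can be worth checking candidates against real segmenter output first. The sketch below uses hypothetical helper names (`wordSegments`, `unreachableStopWords`); the output is deliberately not stated because it depends on the runtime's ICU/CLDR data.

```typescript
const seg = new Intl.Segmenter("ja", { granularity: "word" });

// List the word-like segments the runtime produces for a phrase.
function wordSegments(phrase: string): string[] {
  return Array.from(seg.segment(phrase))
    .filter((s) => s.isWordLike)
    .map((s) => s.segment);
}

// Report stop-word candidates that never appear as a standalone segment
// in any sample phrase, and so could never match during filtering.
function unreachableStopWords(candidates: string[], samples: string[]): string[] {
  const seen = new Set(samples.flatMap(wordSegments));
  return candidates.filter((c) => !seen.has(c));
}

console.log(wordSegments("修正している"));
console.log(unreachableStopWords(["ている", "いる"], ["修正している", "実装している"]));
```

Running a check like this over a handful of representative sentences is one way to catch entries such as `ている` that look plausible but are never matched.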
Add a regression test that asserts the live verb-fragment stop words
are filtered while content stems (実装 / 完了 / 仕様 / テスト) survive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndex
The CJK-run extraction was mutating `CJK_RUN_RE.lastIndex` on a module-scoped /g regex, then iterating with `.exec()`. The reset to 0 on entry made the current single-threaded path correct, but any future re-entry into `tokenize` (or a callback path through `segmentJapanese`) would corrupt the shared cursor. `String.prototype.matchAll` returns a fresh iterator per call, so the cursor is local to this invocation. Same observable behavior, less fragile.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
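The difference between the two iteration styles can be seen on a small example. `SHARED_RE` below is a hypothetical stand-in for the module-scoped regex; the character class is an assumption, not the repo's definition.

```typescript
// Hypothetical module-scoped /g regex, standing in for CJK_RUN_RE.
const SHARED_RE = /[\u3040-\u30ff\u4e00-\u9fff]+/g;
const sample = "分析 and 実装";

// exec() advances the cursor stored on the shared regex object itself.
SHARED_RE.lastIndex = 0;
SHARED_RE.exec(sample);
const cursorAfterExec = SHARED_RE.lastIndex; // non-zero: state has leaked into module scope

// matchAll() iterates over an internal clone, so the shared regex is not advanced.
SHARED_RE.lastIndex = 0;
const runs = Array.from(sample.matchAll(SHARED_RE), (m) => m[0]);
const cursorAfterMatchAll = SHARED_RE.lastIndex;

console.log(cursorAfterExec, runs, cursorAfterMatchAll);
```

One caveat: the clone that `matchAll` creates starts from the regex's current `lastIndex`, so the approach is only robust as long as nothing else mutates that property.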
Pull request overview
Fixes Japanese tokenization so Japanese-heavy text no longer collapses into a single oversized token, restoring TF‑IDF / clustering signal (Issue #29).
Changes:
- Add CJK-run detection and `Intl.Segmenter('ja', { granularity: 'word' })`-based segmentation in the tokenizer while preserving the existing English splitting pipeline.
- Extend Japanese stop words to filter low-signal auxiliary fragments emitted by the segmenter.
- Expand tokenizer tests with Japanese, mixed-script, and regression coverage for the original “single giant token” failure mode.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `scripts/knowledge-graph/tokenizer.ts` | Adds CJK run detection, `Intl.Segmenter`-based Japanese segmentation, and centralizes token post-processing in `pushClean()`. |
| `scripts/knowledge-graph/constants.ts` | Updates `STOP_WORDS` with additional Japanese auxiliary/fragment tokens relevant to `Intl.Segmenter` output. |
| `scripts/__tests__/tokenizer.test.ts` | Replaces the crash-only Japanese test with assertions for meaningful segmentation and regression tests. |
     * - pure-CJK tokens: length >= 2 (Japanese compounds such as
     *   bunseki / jissou / shusei are 2-character kanji words and must survive).
     */
    function pushClean(token: string, sink: string[]): void {
      const clean = token.toLowerCase().replace(/[^a-z0-9\u3040-\u9fff]/g, "");
      if (clean.length === 0) return;
      const minLen = CJK_CHAR_RE.test(clean) ? 2 : 3;
      if (
The doc comment says the relaxed length rule applies to "pure-CJK tokens", but the implementation uses `CJK_CHAR_RE.test(clean)` (any CJK char present) to choose `minLen = 2`. Either tighten the check to ensure the token is CJK-only, or adjust the comment to match the actual behavior.
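The gap between the two readings shows up on mixed-script tokens. In this sketch `ANY_CJK` and `ONLY_CJK` are hypothetical definitions (the repo's `CJK_CHAR_RE` may differ); they contrast an "any CJK character present" test with an anchored "CJK-only" test.

```typescript
// Hypothetical definitions; the repo's CJK_CHAR_RE may use different ranges.
const ANY_CJK = /[\u3040-\u30ff\u4e00-\u9fff]/; // true if any CJK char is present
const ONLY_CJK = /^[\u3040-\u30ff\u4e00-\u9fff]+$/; // true only for CJK-only tokens

function minLenAnyCjk(token: string): number {
  return ANY_CJK.test(token) ? 2 : 3;
}

function minLenPureCjk(token: string): number {
  return ONLY_CJK.test(token) ? 2 : 3;
}

// A 2-character mixed token passes the relaxed rule under the "any" check
// but is held to the English-side minimum under the "CJK-only" check.
const mixedToken = "v修";
console.log(minLenAnyCjk(mixedToken), minLenPureCjk(mixedToken));
```

Pure tokens such as `修正` or `ab` get the same minimum under both checks, so the choice only matters for tokens that mix scripts.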
    // Note: `ている` is intentionally NOT in this list because the Segmenter
    // does not emit it as one segment; it is split as て + いる. The
    // standalone fragment we actually need to filter is `いる`.
    // See https://github.com/chigichan24/crune/issues/29
    "いる", "てい", "しない", "したら", "次に", "に従って",
PR description mentions adding `ている` to `STOP_WORDS`, but the code/comment here explicitly says `ている` is not added (and instead relies on filtering `いる`). Please align the PR description (or the stop-word list/comment) so the documented behavior matches what actually ships.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Summary
Fixes a tokenizer bug where any Japanese paragraph collapsed into a single oversized token (issue #29), producing zero TF-IDF / clustering signal for Japanese-heavy sessions.
Closes #29
Method choice & rationale
Adopt `Intl.Segmenter('ja', { granularity: 'word' })` for CJK runs.
- Public surface `tokenize(text: string): string[]` is unchanged.
- Pure-CJK tokens are kept at `length >= 2` because Japanese kanji compounds (e.g. 分析・実装・修正) are 2-character words. English-side tokens still use `length > 2`.
- `STOP_WORDS` additions (verb conjugation fragments / connective auxiliaries the segmenter emits as standalone segments and that carry no standalone signal): `ている`, `てい`, `しない`, `次に`, `に従って`. Content-bearing stems like `行う` / `書く` / `修正` are intentionally not added.
Alternatives considered
- … `Intl.Segmenter` for our TF-IDF feature-extraction use case.
- … `Intl.Segmenter` quality proves insufficient on real corpora; out of scope for this PR.
Before / after
| Input | Before | After |
|---|---|---|
| `セッションの分析を実行する。次にテストを書く。` | `["セッションの分析を実行する次にテストを書く"]` (one token, the whole sentence) | `["セッション","分析","実行","テスト","書く"]` |
| `TypeScriptの型エラーを修正` | `["typescript","の型エラーを修正"]` (English split OK, JA collapsed) | `["type","script","エラー","修正"]` |
| `バグを修正していく` | `["バグを修正していく"]` | `["バグ","修正"]` |

The bug from #29 is reproduced and asserted as a regression test: a multi-sentence Japanese paragraph must yield multiple tokens, no single token longer than 20 chars, and the whole paragraph must not survive as one token.
Test plan
- `npm test` (212 / 212 tests pass; tokenizer suite expanded from 24 → 27)
- `npm run lint` (0 errors; the 2 pre-existing `react-hooks/exhaustive-deps` warnings in `PlaybackSidePanel.tsx` are unrelated)
- `npx tsc --noEmit -p tsconfig.app.json` (clean)
- `npx tsc --noEmit -p tsconfig.node.json` (clean)
- Issue #18 (`RangeError: Maximum call stack size exceeded` in `tokenize()` on large session corpora) large-input regression tests still pass (no `RangeError` on 100k path corpora)

🤖 Generated with Claude Code