Fix Japanese tokenization via Intl.Segmenter (#29) #47

Open
chigichan24 wants to merge 7 commits into main from feature/29-japanese-tokenization

Conversation

chigichan24 (Owner) commented Apr 25, 2026

Summary

Fixes a tokenizer bug where any Japanese paragraph collapsed into a single oversized token (issue #29), producing zero TF-IDF / clustering signal for Japanese-heavy sessions.

Closes #29

Method choice & rationale

Adopt Intl.Segmenter('ja', { granularity: 'word' }) for CJK runs.

  • Built into Node 22+ — matches the project's CI target.
  • Zero new dependencies, no WASM cold-start, offline, CLDR-backed.
  • Detects CJK runs (hiragana U+3040–309F, katakana U+30A0–30FF, CJK Unified Ideographs U+4E00–9FFF) and pipes them through the segmenter; non-CJK portions keep the existing whitespace + kebab/snake/CamelCase path untouched.
  • Public surface tokenize(text: string): string[] is unchanged.
  • Pure-CJK tokens are kept at length >= 2 because Japanese kanji compounds (e.g. 分析・実装・修正) are 2-character words. English-side tokens still use length > 2.

STOP_WORDS additions (verb conjugation fragments / connective auxiliaries the segmenter emits as standalone segments and that carry no standalone signal): ている, てい, しない, 次に, に従って. Content-bearing stems like 行う / 書く / 修正 are intentionally not added.
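
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `tokenizeSketch`, `pushLatin`, `CJK_RUN`, and the two-entry stop-word set are hypothetical stand-ins, and the existing kebab/snake/CamelCase path is reduced to a plain split on non-alphanumerics.

```typescript
// Hypothetical sketch; names and the stop-word set are illustrative only.
const CJK_RUN = /[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]+/g;
const SKETCH_STOP_WORDS = new Set<string>(["しない", "次に"]);
const wordSegmenter = new Intl.Segmenter("ja", { granularity: "word" });

// Latin side: lowercase, split on anything that is not a-z / 0-9,
// keep tokens longer than 2 characters.
function pushLatin(chunk: string, out: string[]): void {
  for (const raw of chunk.toLowerCase().split(/[^a-z0-9]+/)) {
    if (raw.length > 2) out.push(raw);
  }
}

function tokenizeSketch(text: string): string[] {
  const out: string[] = [];
  let cursor = 0;
  for (const m of Array.from(text.matchAll(CJK_RUN))) {
    const start = m.index ?? 0;
    // Text between CJK runs goes through the Latin path.
    pushLatin(text.slice(cursor, start), out);
    // The CJK run itself is segmented into words by Intl.Segmenter.
    for (const s of Array.from(wordSegmenter.segment(m[0]))) {
      if (s.segment.length >= 2 && !SKETCH_STOP_WORDS.has(s.segment)) {
        out.push(s.segment);
      }
    }
    cursor = start + m[0].length;
  }
  pushLatin(text.slice(cursor), out);
  return out;
}

console.log(tokenizeSketch("セッションの分析を実行する。次にテストを書く。"));
```

The exact segments depend on the ICU data bundled with the runtime, so the sketch prints the result rather than assuming a particular split.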

Alternatives considered

  • kuromoji.js — Pure-JS morphological analyzer, but it requires a ~30 MB IPADIC dictionary download on first run, and the dictionary is dated. Heavy for the marginal quality bump.
  • lindera-wasm / sudachi.rs via WASM — Better dictionaries, but pulls in a WASM toolchain dependency for limited additional gain over Intl.Segmenter for our TF-IDF feature-extraction use case.
  • Char n-gram (bi/tri-gram) as primary — Explodes vocabulary and hurts TF-IDF signal. May be revisited as a defensive fallback layer if Intl.Segmenter quality proves insufficient on real corpora — out of scope for this PR.
  • LLM-based segmentation — Explicitly out of scope; this layer must stay offline and cheap.
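
To illustrate why character n-grams were not chosen as the primary method, here is a small sketch of bigram extraction. `charNgrams` is a hypothetical helper, not part of this PR.

```typescript
// Hypothetical helper: sliding-window character n-grams over one CJK run.
function charNgrams(run: string, n: number): string[] {
  const chars = Array.from(run); // iterate by code point, not UTF-16 unit
  const grams: string[] = [];
  for (let i = 0; i + n <= chars.length; i++) {
    grams.push(chars.slice(i, i + n).join(""));
  }
  return grams;
}

console.log(charNgrams("分析を実行", 2));
// → ["分析", "析を", "を実", "実行"]
```

Two of the four bigrams (析を, を実) straddle word boundaries and carry no meaning on their own, which is the kind of vocabulary growth the bullet above refers to.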

Before / after

| Input | Before (single giant token) | After (segmented) |
| --- | --- | --- |
| セッションの分析を実行する。次にテストを書く。 | ["セッションの分析を実行する次にテストを書く"] (one token, the whole sentence) | ["セッション","分析","実行","テスト","書く"] |
| TypeScriptの型エラーを修正 | ["typescript","の型エラーを修正"] (English split OK, JA collapsed) | ["type","script","エラー","修正"] |
| バグを修正していく | ["バグを修正していく"] | ["バグ","修正"] |

The bug from #29 is reproduced and asserted as a regression test: a multi-sentence Japanese paragraph must yield multiple tokens, no single token longer than 20 chars, and the whole paragraph must not survive as one token.
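
The three regression invariants can be expressed as a small checker. This is a sketch only: `giantTokenViolations` is a hypothetical helper and is not taken from the test file; the 20-character limit is the one stated above.

```typescript
// Hypothetical checker for the #29 regression invariants.
// Returns a list of violated invariants; an empty list means the tokens pass.
function giantTokenViolations(
  paragraph: string,
  tokens: string[],
  maxLen: number = 20,
): string[] {
  const problems: string[] = [];
  if (tokens.length < 2) problems.push("expected multiple tokens");
  // The paragraph with whitespace and Japanese punctuation removed is what
  // the pre-fix tokenizer effectively emitted as its single token.
  const stripped = paragraph.replace(/[\s。、]/g, "");
  for (const t of tokens) {
    if (Array.from(t).length > maxLen) problems.push("token too long: " + t);
    if (t === paragraph || t === stripped) {
      problems.push("whole paragraph survived as one token");
    }
  }
  return problems;
}

const paragraph = "セッションの分析を実行する。次にテストを書く。";
// Pre-fix output from the table above: every invariant is violated.
console.log(giantTokenViolations(paragraph, ["セッションの分析を実行する次にテストを書く"]));
// Post-fix output from the table above: no violations.
console.log(giantTokenViolations(paragraph, ["セッション", "分析", "実行", "テスト", "書く"]));
```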

Test plan

🤖 Generated with Claude Code

chigichan24 and others added 2 commits April 26, 2026 02:06
Before this change, a Japanese paragraph collapsed into a single
oversized token: the tokenizer split on `\s+`, but Japanese has no
whitespace, and the cleaning regex `[^a-z0-9\u3040-\u9fff]` simply
*deletes* `。` / `、` (U+3002 / U+3001, in the U+3000 block just below
the kept range) instead of splitting on them, so
even Japanese punctuation did not act as a separator. The result was
one giant token per Japanese run that survived stop-word filtering and
landed in the vocabulary unique to that session, providing zero TF-IDF
signal.
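
The failure mode can be reproduced with a simplified stand-in for the pre-fix path. `legacyTokenize` is a hypothetical name; the real function also handled kebab/snake/CamelCase, which is omitted here.

```typescript
// Hypothetical reduction of the old pipeline: split on whitespace,
// then strip every character outside the kept class.
function legacyTokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\s+/)
    // Stripping removes characters; it never introduces a token boundary.
    .map((t) => t.replace(/[^a-z0-9\u3040-\u9fff]/g, ""))
    .filter((t) => t.length > 2);
}

console.log(legacyTokenize("セッションの分析を実行する。次にテストを書く。"));
// → ["セッションの分析を実行する次にテストを書く"]
```

Because the input contains no whitespace, `split(/\s+/)` returns the whole paragraph as one element, and nothing later in the pipeline breaks it up.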

Approach: detect CJK runs (hiragana U+3040-309F, katakana U+30A0-30FF,
CJK Unified Ideographs U+4E00-9FFF) and pipe them through
`Intl.Segmenter('ja', { granularity: 'word' })`. Built into Node 22+
(the project's CI target), CLDR-backed, zero new deps, offline. Non-CJK
portions keep the existing whitespace + kebab/snake/CamelCase logic
untouched. The English-side `length > 2` filter is preserved; pure-CJK
tokens are kept at `length >= 2` because Japanese kanji compounds
(分析・実装・修正) are 2-character words.

STOP_WORDS additions (verb conjugation fragments / connective auxiliaries
that the segmenter emits as standalone segments and that carry no
standalone signal): ている, てい, しない, 次に, に従って. Content-bearing
stems such as 行う / 書く / 修正 are intentionally NOT added.

Public surface `tokenize(text: string): string[]` is unchanged.

Closes #29

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the `not.toThrow()` smoke test with three tests that actually
inspect token contents:
- Pure JA: セッションの分析を実行する -> tokens include セッション,
  分析, 実行 and the whole sentence does NOT survive as a single token.
- Mixed JA/EN: TypeScriptの型エラーを修正 -> English side lowercased and
  CamelCase-split (type, script), and JA-side content words preserved
  (エラー, 修正).
- Long-paragraph regression: a multi-sentence Japanese paragraph
  produces multiple tokens, none longer than 20 chars, and the whole
  paragraph is not a token. This directly reproduces the bug from #29.

The Issue #18 large-input regression tests for `extractPathTokens` and
`tokenize` are kept intact.
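
For reference, the CamelCase split that the mixed JA/EN test relies on can be sketched as below. `splitIdentifier` is a hypothetical helper written for illustration; the project's actual English-side logic may differ.

```typescript
// Hypothetical sketch of an English-side identifier split:
// CamelCase boundaries, kebab-case and snake_case all become separators.
function splitIdentifier(word: string): string[] {
  return word
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // "TypeScript" -> "Type Script"
    .split(/[\s_-]+/)
    .map((part) => part.toLowerCase())
    .filter((part) => part.length > 2); // English-side length > 2 filter
}

console.log(splitIdentifier("TypeScript"));
// → ["type", "script"]
```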

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chigichan24 and others added 2 commits April 26, 2026 02:22
`Intl.Segmenter('ja')` does not emit `ている` as a single segment — it
splits as て + いる, so the stop-word entry was never matched. Replace
it with `いる` (the standalone Hiragana fragment that does leak through
as length-2 noise) and add `したら` (verb conditional, also emitted as
one segment with no topic signal).
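
A quick way to check which fragments the segmenter actually emits before adding them to the stop-word list is to print the raw segments. `listSegments` is a hypothetical helper, and the output depends on the ICU data shipped with the runtime, so no expected result is shown.

```typescript
// Hypothetical inspection helper: list the raw word-granularity segments.
const jaSegmenter = new Intl.Segmenter("ja", { granularity: "word" });

function listSegments(text: string): string[] {
  return Array.from(jaSegmenter.segment(text), (s) => s.segment);
}

// Print the segments for a phrase containing the ている construction.
console.log(listSegments("修正している"));
```

Only strings that appear verbatim in this output can ever match a stop-word entry.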

Add a regression test that asserts the live verb-fragment stop words
are filtered while content stems (実装 / 完了 / 仕様 / テスト) survive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndex

The CJK-run extraction was mutating `CJK_RUN_RE.lastIndex` on a
module-scoped /g regex, then iterating with `.exec()`. The reset to 0
on entry made the current single-threaded path correct, but any
future re-entry into `tokenize` (or a callback path through
`segmentJapanese`) would corrupt the shared cursor.

`String.prototype.matchAll` returns a fresh iterator per call, so the
cursor is local to this invocation. Same observable behavior, less
fragile.
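
A minimal sketch of the `matchAll` pattern follows. `RUN_RE`, `cjkRuns`, and `cjkRunsReentrant` are hypothetical names, not the identifiers in `tokenizer.ts`.

```typescript
// Module-scoped /g regex, shared by every call.
const RUN_RE = /[\u3040-\u9fff]+/g;

// matchAll works on an internal copy of the regex, so iteration never
// advances RUN_RE.lastIndex itself.
function cjkRuns(text: string): string[] {
  return Array.from(text.matchAll(RUN_RE), (m) => m[0]);
}

// Re-entrancy check: a nested scan over a different string runs while the
// outer iterator is still live, and the outer scan is unaffected.
function cjkRunsReentrant(text: string): string[] {
  const out: string[] = [];
  for (const m of text.matchAll(RUN_RE)) {
    cjkRuns("別の 文字列");
    out.push(m[0]);
  }
  return out;
}

console.log(cjkRunsReentrant("バグ fix 修正")); // → ["バグ", "修正"]
console.log(RUN_RE.lastIndex); // → 0
```

One caveat: `matchAll` copies the starting `lastIndex` from the source regex, so this only stays safe as long as nothing else calls `exec` or `test` on the shared regex.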

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Fixes Japanese tokenization so Japanese-heavy text no longer collapses into a single oversized token, restoring TF‑IDF / clustering signal (Issue #29).

Changes:

  • Add CJK-run detection and Intl.Segmenter('ja', { granularity: 'word' })-based segmentation in the tokenizer while preserving the existing English splitting pipeline.
  • Extend Japanese stop words to filter low-signal auxiliary fragments emitted by the segmenter.
  • Expand tokenizer tests with Japanese, mixed-script, and regression coverage for the original “single giant token” failure mode.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| scripts/knowledge-graph/tokenizer.ts | Adds CJK run detection, Intl.Segmenter-based Japanese segmentation, and centralizes token post-processing in pushClean(). |
| scripts/knowledge-graph/constants.ts | Updates STOP_WORDS with additional Japanese auxiliary/fragment tokens relevant to Intl.Segmenter output. |
| scripts/__tests__/tokenizer.test.ts | Replaces the crash-only Japanese test with assertions for meaningful segmentation and regression tests. |

Comment on lines +146 to +153
 * - pure-CJK tokens: length >= 2 (Japanese compounds such as
 *   bunseki / jissou / shusei are 2-character kanji words and must survive).
 */
function pushClean(token: string, sink: string[]): void {
  const clean = token.toLowerCase().replace(/[^a-z0-9\u3040-\u9fff]/g, "");
  if (clean.length === 0) return;
  const minLen = CJK_CHAR_RE.test(clean) ? 2 : 3;
  if (
Copilot AI Apr 29, 2026

The doc comment says the relaxed length rule applies to "pure-CJK tokens", but the implementation uses CJK_CHAR_RE.test(clean) (any CJK char present) to choose minLen = 2. Either tighten the check to ensure the token is CJK-only, or adjust the comment to match the actual behavior.
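
The difference between the two checks can be sketched as follows. `ANY_CJK`, `ONLY_CJK`, and the two `minLen…` functions are hypothetical names; the character range is the one used in `pushClean` above.

```typescript
// "Any CJK character present" versus "CJK characters only".
const ANY_CJK = /[\u3040-\u9fff]/;
const ONLY_CJK = /^[\u3040-\u9fff]+$/;

// Behavior as currently implemented: one CJK character relaxes the limit.
function minLenAnyCjk(clean: string): number {
  return ANY_CJK.test(clean) ? 2 : 3;
}

// Tightened variant matching the "pure-CJK" wording of the doc comment.
function minLenPureCjk(clean: string): number {
  return ONLY_CJK.test(clean) ? 2 : 3;
}

console.log(minLenAnyCjk("a修"), minLenPureCjk("a修")); // → 2 3
console.log(minLenAnyCjk("修正"), minLenPureCjk("修正")); // → 2 2
```

With the any-CJK check, a mixed two-character token such as `a修` is kept; with the anchored pattern it falls back to the English-side minimum of 3 and is dropped.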

Comment thread scripts/knowledge-graph/constants.ts Outdated
Comment thread scripts/knowledge-graph/constants.ts Outdated
Comment on lines +39 to +43
// Note: `ている` is intentionally NOT in this list because the Segmenter
// does not emit it as one segment — it is split as て + いる. The
// standalone fragment we actually need to filter is `いる`.
// See https://github.com/chigichan24/crune/issues/29
"いる", "てい", "しない", "したら", "次に", "に従って",
Copilot AI Apr 29, 2026

PR description mentions adding ている to STOP_WORDS, but the code/comment here explicitly says ている is not added (and instead relies on filtering いる). Please align the PR description (or the stop-word list/comment) so the documented behavior matches what actually ships.

Comment thread scripts/knowledge-graph/tokenizer.ts Outdated
Comment thread scripts/knowledge-graph/tokenizer.ts Outdated
chigichan24 and others added 3 commits April 30, 2026 00:46
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Fix Japanese tokenization: a Japanese paragraph collapses into a single token today

2 participants