Improve English tokenization: Porter stemming, NLTK-parity stop words, HEX fix (#30) by chigichan24 · Pull Request #48 · chigichan24/crune

chigichan24 · 2026-04-25T17:15:04Z

Summary

Implements the three improvements from issue #30 to the English path of the knowledge-graph tokenizer:

Porter stemmer — new inline module scripts/knowledge-graph/porter-stemmer.ts (full Porter 1980 algorithm, ~200 LoC, no deps). Applied to non-CJK tokens after CamelCase split. Tokens containing CJK characters or digits are skipped so the JA path from Fix Japanese tokenization: a Japanese paragraph collapses into a single token today #29 and tokens like abc123 are not corrupted.
HEX_PATTERN false-positive fix — the noise pattern now requires at least one digit (/^(?=[0-9a-f]*[0-9])[0-9a-f]{6,}$/i). Without that lookahead, plain English words built only from a-f (decade, facade, effect, defaced) were being dropped as noise.
STOP_WORDS expansion — bumped to roughly NLTK English parity (~180 entries). The Japanese particles / connective auxiliaries from Fix Japanese tokenization: a Japanese paragraph collapses into a single token today #29 are preserved, plus the project-specific filler set (file, code, change, ...).

The stop-word check is performed both pre-stem and post-stem so that surface forms (was, using) are dropped before Porter can mangle them into wa / us, while derived inflections whose stem coincides with a stop word are still caught.

Note on base branch

This PR was branched off feature/29-japanese-tokenization (PR #47) to avoid a merge conflict in tokenizer.ts and constants.ts. Until #29 lands, this PR's diff appears to include #29's changes too. Once #29 merges, this branch will be rebased onto main (or the GitHub merge will resolve cleanly via three-way merge).

Examples

Stemming collapses inflected forms:

tokenize("running runs ran") → ["run", "run", "ran"] (running and runs collapse; ran is irregular and Porter is rule-based)
tokenize("tested testing") → ["test", "test"]
tokenize("fixed fixing") → ["fix", "fix"]

HEX fix lets real English words survive:

isNoiseToken("decade") is false (was true)
isNoiseToken("facade") is false (was true)
isNoiseToken("cafebabe") is false (was true) — pure-alpha hex string, kept as a normal token
isNoiseToken("abc123def") is true (digit present, real hex)
isNoiseToken("deadbeef0") is true (digit present)
isNoiseToken("0xabc123") is false (the x is not in [0-9a-f], so the pattern does not match — this is unchanged from before)

Newly added NLTK stop words are filtered:

tokenize("actually really back even ever say seem tell yeah right thing the test") → ["test"]

Test plan

npm test — all 219 tests pass (was 212 before; +7 new tokenizer assertions)
npm run lint — 0 errors (2 pre-existing warnings in PlaybackSidePanel.tsx)
npx tsc --noEmit -p tsconfig.app.json — clean
npx tsc --noEmit -p tsconfig.node.json — clean
JA tests from Fix Japanese tokenization: a Japanese paragraph collapses into a single token today #29 still green (セッションの分析を実行する segments correctly)
Bug: RangeError: Maximum call stack size exceeded in tokenize() on large session corpora #18 regression tests still green (no RangeError on 100k-path corpus)
Existing English assertions for path-token extraction still pass (path tokens bypass stemming by design)

Closes #30.

🤖 Generated with Claude Code

Before this change, a Japanese paragraph collapsed into a single oversized token: the tokenizer split on `\s+`, but Japanese has no whitespace, and the cleaning regex `[^a-z0-9぀-鿿]` happens to *keep* `。` / `、` (both U+3000 block, inside the kept range), so even Japanese punctuation did not act as a separator. The result was one giant token per Japanese run that survived stop-word filtering and landed in the vocabulary unique to that session, providing zero TF-IDF signal. Approach: detect CJK runs (hiragana U+3040-309F, katakana U+30A0-30FF, CJK Unified Ideographs U+4E00-9FFF) and pipe them through `Intl.Segmenter('ja', { granularity: 'word' })`. Built into Node 22+ (the project's CI target), CLDR-backed, zero new deps, offline. Non-CJK portions keep the existing whitespace + kebab/snake/CamelCase logic untouched. The English-side `length > 2` filter is preserved; pure-CJK tokens are kept at `length >= 2` because Japanese kanji compounds (分析・実装・修正) are 2-character words. STOP_WORDS additions (verb conjugation fragments / connective auxiliaries that the segmenter emits as standalone segments and that carry no standalone signal): ている, てい, しない, 次に, に従って. Content-bearing stems such as 行う / 書く / 修正 are intentionally NOT added. Public surface `tokenize(text: string): string[]` is unchanged. Closes #29 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the `not.toThrow()` smoke test with three tests that actually inspect token contents: - Pure JA: セッションの分析を実行する -> tokens include セッション, 分析, 実行 and the whole sentence does NOT survive as a single token. - Mixed JA/EN: TypeScriptの型エラーを修正 -> English side lowercased CamelCase split (type, script) and JA side 2-char compounds preserved (エラー, 修正). - Long-paragraph regression: a multi-sentence Japanese paragraph produces multiple tokens, none longer than 20 chars, and the whole paragraph is not a token. This directly reproduces the bug from #29. The Issue #18 large-input regression tests for `extractPathTokens` and `tokenize` are kept intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Add inline Porter (1980) stemmer in scripts/knowledge-graph/porter-stemmer.ts and apply it to non-CJK tokens after CamelCase split. CJK tokens from Intl.Segmenter and tokens containing digits are skipped to avoid corruption. - Tighten HEX_PATTERN to require at least one digit (/^(?=[0-9a-f]*[0-9])[0-9a-f]{6,}$/i) so plain English words built from a-f (decade, facade, effect, defaced) survive instead of being filtered as noise. - Expand STOP_WORDS to ~NLTK English parity (~180 entries) while preserving the JA additions from #29 and the project-specific filler set. - Stop-word check now runs both pre-stem (catches surface forms like "was", "using") and post-stem (catches derived inflections). - Add tokenizer tests for stemming collapse, HEX fix, and the expanded stop word set; update one session-summarizer assertion to reflect the stemmed keyword form (`authentication` -> `authent`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d vacuous assertion The original test asserted `tokenize("fixed fixing").every(t => t === "fix")`, but `fix` is a project-specific stop word in constants.ts. The pipeline stems both inputs to `fix` and then drops them via the stop-word filter, so the returned array is empty and `[].every(...)` is true vacuously — the assertion provides no real coverage of stemming behavior. Switch to `walk/walked/walking`, which is not a stop word at any stage, and add a length precondition so future regressions cannot pass vacuously. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t why `actually` is not a member of STOP_WORDS. The previous assertion `tokens.not.toContain("actually")` passed only because Porter stems the input to `actual`, replacing the literal surface form. The assertion would therefore hold even if the entire NLTK-parity expansion were reverted, so it provided no real coverage for the test's stated purpose. Remove `actually` from the assertion list and add a comment explaining the intentional omission so future readers don't re-add it as a "missing" case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Improves the knowledge-graph tokenizer’s English pipeline by adding Porter stemming, expanding stop words toward NLTK parity, and fixing hex-noise false positives; it also includes the Japanese Intl.Segmenter path changes from the #29 branch base.

Changes:

Add a dependency-free Porter stemmer and apply it to non-CJK, alphabetic tokens (with pre/post stop-word filtering).
Fix HEX_PATTERN noise filtering to require at least one digit to avoid dropping English words like “decade”.
Expand STOP_WORDS and update/add tests to cover stemming, stop-word filtering, and the HEX regression (plus #29 Japanese segmentation assertions).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
scripts/knowledge-graph/tokenizer.ts	Adds stemming + shared `pushClean()` pipeline and (from #29) CJK run detection + `Intl.Segmenter`-based Japanese segmentation.
scripts/knowledge-graph/porter-stemmer.ts	New inline Porter 1980 stemmer implementation (no external deps).
scripts/knowledge-graph/constants.ts	Expands `STOP_WORDS` and updates `HEX_PATTERN` to avoid a–f word false positives.
scripts/tests/tokenizer.test.ts	Adds/updates tests for HEX fix, stemming behavior, stop-word expansion, and Japanese segmentation regression coverage.
scripts/tests/session-summarizer.test.ts	Updates keyword expectations to match stemmed output (e.g., `authentication` → `authent`).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

chigichan24 and others added 5 commits April 26, 2026 02:06

chigichan24 requested a review from Copilot April 29, 2026 15:41

Copilot started reviewing on behalf of chigichan24 April 29, 2026 15:41 View session

Copilot AI reviewed Apr 29, 2026

View reviewed changes

Comment thread scripts/knowledge-graph/tokenizer.ts Outdated

Update scripts/knowledge-graph/tokenizer.ts

19ef5b7

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve English tokenization: Porter stemming, NLTK-parity stop words, HEX fix (#30)#48

Improve English tokenization: Porter stemming, NLTK-parity stop words, HEX fix (#30)#48
chigichan24 wants to merge 6 commits into
mainfrom
feature/30-english-tokenization

chigichan24 commented Apr 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

chigichan24 commented Apr 25, 2026

Summary

Note on base branch

Examples

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants