Improve English tokenization: Porter stemming, NLTK-parity stop words, HEX fix (#30)#48
Open
chigichan24 wants to merge 6 commits into
Open
Improve English tokenization: Porter stemming, NLTK-parity stop words, HEX fix (#30)#48chigichan24 wants to merge 6 commits into
chigichan24 wants to merge 6 commits into
Conversation
Before this change, a Japanese paragraph collapsed into a single
oversized token: the tokenizer split on `\s+`, but Japanese has no
whitespace, and the cleaning regex `[^a-z0-9-鿿]` happens
to *keep* `。` / `、` (both U+3000 block, inside the kept range), so
even Japanese punctuation did not act as a separator. The result was
one giant token per Japanese run that survived stop-word filtering and
landed in the vocabulary unique to that session, providing zero TF-IDF
signal.
Approach: detect CJK runs (hiragana U+3040-309F, katakana U+30A0-30FF,
CJK Unified Ideographs U+4E00-9FFF) and pipe them through
`Intl.Segmenter('ja', { granularity: 'word' })`. Built into Node 22+
(the project's CI target), CLDR-backed, zero new deps, offline. Non-CJK
portions keep the existing whitespace + kebab/snake/CamelCase logic
untouched. The English-side `length > 2` filter is preserved; pure-CJK
tokens are kept at `length >= 2` because Japanese kanji compounds
(分析・実装・修正) are 2-character words.
STOP_WORDS additions (verb conjugation fragments / connective auxiliaries
that the segmenter emits as standalone segments and that carry no
standalone signal): ている, てい, しない, 次に, に従って. Content-bearing
stems such as 行う / 書く / 修正 are intentionally NOT added.
Public surface `tokenize(text: string): string[]` is unchanged.
Closes #29
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the `not.toThrow()` smoke test with three tests that actually inspect token contents: - Pure JA: セッションの分析を実行する -> tokens include セッション, 分析, 実行 and the whole sentence does NOT survive as a single token. - Mixed JA/EN: TypeScriptの型エラーを修正 -> English side lowercased CamelCase split (type, script) and JA side 2-char compounds preserved (エラー, 修正). - Long-paragraph regression: a multi-sentence Japanese paragraph produces multiple tokens, none longer than 20 chars, and the whole paragraph is not a token. This directly reproduces the bug from #29. The Issue #18 large-input regression tests for `extractPathTokens` and `tokenize` are kept intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add inline Porter (1980) stemmer in scripts/knowledge-graph/porter-stemmer.ts
and apply it to non-CJK tokens after CamelCase split. CJK tokens from
Intl.Segmenter and tokens containing digits are skipped to avoid corruption.
- Tighten HEX_PATTERN to require at least one digit
(/^(?=[0-9a-f]*[0-9])[0-9a-f]{6,}$/i) so plain English words built from
a-f (decade, facade, effect, defaced) survive instead of being filtered as
noise.
- Expand STOP_WORDS to ~NLTK English parity (~180 entries) while preserving
the JA additions from #29 and the project-specific filler set.
- Stop-word check now runs both pre-stem (catches surface forms like "was",
"using") and post-stem (catches derived inflections).
- Add tokenizer tests for stemming collapse, HEX fix, and the expanded stop
word set; update one session-summarizer assertion to reflect the stemmed
keyword form (`authentication` -> `authent`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d vacuous assertion
The original test asserted `tokenize("fixed fixing").every(t => t === "fix")`,
but `fix` is a project-specific stop word in constants.ts. The pipeline stems
both inputs to `fix` and then drops them via the stop-word filter, so the
returned array is empty and `[].every(...)` is true vacuously — the assertion
provides no real coverage of stemming behavior.
Switch to `walk/walked/walking`, which is not a stop word at any stage, and
add a length precondition so future regressions cannot pass vacuously.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t why
`actually` is not a member of STOP_WORDS. The previous assertion
`tokens.not.toContain("actually")` passed only because Porter stems the
input to `actual`, replacing the literal surface form. The assertion would
therefore hold even if the entire NLTK-parity expansion were reverted, so
it provided no real coverage for the test's stated purpose.
Remove `actually` from the assertion list and add a comment explaining the
intentional omission so future readers don't re-add it as a "missing" case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Improves the knowledge-graph tokenizer’s English pipeline by adding Porter stemming, expanding stop words toward NLTK parity, and fixing hex-noise false positives; it also includes the Japanese Intl.Segmenter path changes from the #29 branch base.
Changes:
- Add a dependency-free Porter stemmer and apply it to non-CJK, alphabetic tokens (with pre/post stop-word filtering).
- Fix
HEX_PATTERNnoise filtering to require at least one digit to avoid dropping English words like “decade”. - Expand
STOP_WORDSand update/add tests to cover stemming, stop-word filtering, and the HEX regression (plus #29 Japanese segmentation assertions).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| scripts/knowledge-graph/tokenizer.ts | Adds stemming + shared pushClean() pipeline and (from #29) CJK run detection + Intl.Segmenter-based Japanese segmentation. |
| scripts/knowledge-graph/porter-stemmer.ts | New inline Porter 1980 stemmer implementation (no external deps). |
| scripts/knowledge-graph/constants.ts | Expands STOP_WORDS and updates HEX_PATTERN to avoid a–f word false positives. |
| scripts/tests/tokenizer.test.ts | Adds/updates tests for HEX fix, stemming behavior, stop-word expansion, and Japanese segmentation regression coverage. |
| scripts/tests/session-summarizer.test.ts | Updates keyword expectations to match stemmed output (e.g., authentication → authent). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements the three improvements from issue #30 to the English path of the knowledge-graph tokenizer:
scripts/knowledge-graph/porter-stemmer.ts(full Porter 1980 algorithm, ~200 LoC, no deps). Applied to non-CJK tokens after CamelCase split. Tokens containing CJK characters or digits are skipped so the JA path from Fix Japanese tokenization: a Japanese paragraph collapses into a single token today #29 and tokens likeabc123are not corrupted./^(?=[0-9a-f]*[0-9])[0-9a-f]{6,}$/i). Without that lookahead, plain English words built only froma-f(decade,facade,effect,defaced) were being dropped as noise.file,code,change, ...).The stop-word check is performed both pre-stem and post-stem so that surface forms (
was,using) are dropped before Porter can mangle them intowa/us, while derived inflections whose stem coincides with a stop word are still caught.Note on base branch
This PR was branched off
feature/29-japanese-tokenization(PR #47) to avoid a merge conflict intokenizer.tsandconstants.ts. Until #29 lands, this PR's diff appears to include #29's changes too. Once #29 merges, this branch will be rebased ontomain(or the GitHub merge will resolve cleanly via three-way merge).Examples
Stemming collapses inflected forms:
tokenize("running runs ran")→["run", "run", "ran"](runningandrunscollapse;ranis irregular and Porter is rule-based)tokenize("tested testing")→["test", "test"]tokenize("fixed fixing")→["fix", "fix"]HEX fix lets real English words survive:
isNoiseToken("decade")isfalse(wastrue)isNoiseToken("facade")isfalse(wastrue)isNoiseToken("cafebabe")isfalse(wastrue) — pure-alpha hex string, kept as a normal tokenisNoiseToken("abc123def")istrue(digit present, real hex)isNoiseToken("deadbeef0")istrue(digit present)isNoiseToken("0xabc123")isfalse(thexis not in[0-9a-f], so the pattern does not match — this is unchanged from before)Newly added NLTK stop words are filtered:
tokenize("actually really back even ever say seem tell yeah right thing the test")→["test"]Test plan
npm test— all 219 tests pass (was 212 before; +7 new tokenizer assertions)npm run lint— 0 errors (2 pre-existing warnings inPlaybackSidePanel.tsx)npx tsc --noEmit -p tsconfig.app.json— cleannpx tsc --noEmit -p tsconfig.node.json— cleanセッションの分析を実行するsegments correctly)RangeError: Maximum call stack size exceededintokenize()on large session corpora #18 regression tests still green (noRangeErroron 100k-path corpus)Closes #30.
🤖 Generated with Claude Code