Skip to content

Improve English tokenization: Porter stemming, NLTK-parity stop words, HEX fix (#30)#48

Open
chigichan24 wants to merge 6 commits into
mainfrom
feature/30-english-tokenization
Open

Improve English tokenization: Porter stemming, NLTK-parity stop words, HEX fix (#30)#48
chigichan24 wants to merge 6 commits into
mainfrom
feature/30-english-tokenization

Conversation

@chigichan24
Copy link
Copy Markdown
Owner

Summary

Implements the three improvements from issue #30 to the English path of the knowledge-graph tokenizer:

  • Porter stemmer — new inline module scripts/knowledge-graph/porter-stemmer.ts (full Porter 1980 algorithm, ~200 LoC, no deps). Applied to non-CJK tokens after CamelCase split. Tokens containing CJK characters or digits are skipped so the JA path from Fix Japanese tokenization: a Japanese paragraph collapses into a single token today #29 and tokens like abc123 are not corrupted.
  • HEX_PATTERN false-positive fix — the noise pattern now requires at least one digit (/^(?=[0-9a-f]*[0-9])[0-9a-f]{6,}$/i). Without that lookahead, plain English words built only from a-f (decade, facade, effect, defaced) were being dropped as noise.
  • STOP_WORDS expansion — bumped to roughly NLTK English parity (~180 entries). The Japanese particles / connective auxiliaries from Fix Japanese tokenization: a Japanese paragraph collapses into a single token today #29 are preserved, plus the project-specific filler set (file, code, change, ...).

The stop-word check is performed both pre-stem and post-stem so that surface forms (was, using) are dropped before Porter can mangle them into wa / us, while derived inflections whose stem coincides with a stop word are still caught.

Note on base branch

This PR was branched off feature/29-japanese-tokenization (PR #47) to avoid a merge conflict in tokenizer.ts and constants.ts. Until #29 lands, this PR's diff appears to include #29's changes too. Once #29 merges, this branch will be rebased onto main (or the GitHub merge will resolve cleanly via three-way merge).

Examples

Stemming collapses inflected forms:

  • tokenize("running runs ran")["run", "run", "ran"] (running and runs collapse; ran is irregular and Porter is rule-based)
  • tokenize("tested testing")["test", "test"]
  • tokenize("fixed fixing")["fix", "fix"]

HEX fix lets real English words survive:

  • isNoiseToken("decade") is false (was true)
  • isNoiseToken("facade") is false (was true)
  • isNoiseToken("cafebabe") is false (was true) — pure-alpha hex string, kept as a normal token
  • isNoiseToken("abc123def") is true (digit present, real hex)
  • isNoiseToken("deadbeef0") is true (digit present)
  • isNoiseToken("0xabc123") is false (the x is not in [0-9a-f], so the pattern does not match — this is unchanged from before)

Newly added NLTK stop words are filtered:

  • tokenize("actually really back even ever say seem tell yeah right thing the test")["test"]

Test plan

Closes #30.

🤖 Generated with Claude Code

chigichan24 and others added 5 commits April 26, 2026 02:06
Before this change, a Japanese paragraph collapsed into a single
oversized token: the tokenizer split on `\s+`, but Japanese has no
whitespace, and the cleaning regex `[^a-z0-9぀-鿿]` happens
to *keep* `。` / `、` (both U+3000 block, inside the kept range), so
even Japanese punctuation did not act as a separator. The result was
one giant token per Japanese run that survived stop-word filtering and
landed in the vocabulary unique to that session, providing zero TF-IDF
signal.

Approach: detect CJK runs (hiragana U+3040-309F, katakana U+30A0-30FF,
CJK Unified Ideographs U+4E00-9FFF) and pipe them through
`Intl.Segmenter('ja', { granularity: 'word' })`. Built into Node 22+
(the project's CI target), CLDR-backed, zero new deps, offline. Non-CJK
portions keep the existing whitespace + kebab/snake/CamelCase logic
untouched. The English-side `length > 2` filter is preserved; pure-CJK
tokens are kept at `length >= 2` because Japanese kanji compounds
(分析・実装・修正) are 2-character words.

STOP_WORDS additions (verb conjugation fragments / connective auxiliaries
that the segmenter emits as standalone segments and that carry no
standalone signal): ている, てい, しない, 次に, に従って. Content-bearing
stems such as 行う / 書く / 修正 are intentionally NOT added.

Public surface `tokenize(text: string): string[]` is unchanged.

Closes #29

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the `not.toThrow()` smoke test with three tests that actually
inspect token contents:
- Pure JA: セッションの分析を実行する -> tokens include セッション,
  分析, 実行 and the whole sentence does NOT survive as a single token.
- Mixed JA/EN: TypeScriptの型エラーを修正 -> English side lowercased
  CamelCase split (type, script) and JA side 2-char compounds preserved
  (エラー, 修正).
- Long-paragraph regression: a multi-sentence Japanese paragraph
  produces multiple tokens, none longer than 20 chars, and the whole
  paragraph is not a token. This directly reproduces the bug from #29.

The Issue #18 large-input regression tests for `extractPathTokens` and
`tokenize` are kept intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add inline Porter (1980) stemmer in scripts/knowledge-graph/porter-stemmer.ts
  and apply it to non-CJK tokens after CamelCase split. CJK tokens from
  Intl.Segmenter and tokens containing digits are skipped to avoid corruption.
- Tighten HEX_PATTERN to require at least one digit
  (/^(?=[0-9a-f]*[0-9])[0-9a-f]{6,}$/i) so plain English words built from
  a-f (decade, facade, effect, defaced) survive instead of being filtered as
  noise.
- Expand STOP_WORDS to ~NLTK English parity (~180 entries) while preserving
  the JA additions from #29 and the project-specific filler set.
- Stop-word check now runs both pre-stem (catches surface forms like "was",
  "using") and post-stem (catches derived inflections).
- Add tokenizer tests for stemming collapse, HEX fix, and the expanded stop
  word set; update one session-summarizer assertion to reflect the stemmed
  keyword form (`authentication` -> `authent`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d vacuous assertion

The original test asserted `tokenize("fixed fixing").every(t => t === "fix")`,
but `fix` is a project-specific stop word in constants.ts. The pipeline stems
both inputs to `fix` and then drops them via the stop-word filter, so the
returned array is empty and `[].every(...)` is true vacuously — the assertion
provides no real coverage of stemming behavior.

Switch to `walk/walked/walking`, which is not a stop word at any stage, and
add a length precondition so future regressions cannot pass vacuously.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t why

`actually` is not a member of STOP_WORDS. The previous assertion
`tokens.not.toContain("actually")` passed only because Porter stems the
input to `actual`, replacing the literal surface form. The assertion would
therefore hold even if the entire NLTK-parity expansion were reverted, so
it provided no real coverage for the test's stated purpose.

Remove `actually` from the assertion list and add a comment explaining the
intentional omission so future readers don't re-add it as a "missing" case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves the knowledge-graph tokenizer’s English pipeline by adding Porter stemming, expanding stop words toward NLTK parity, and fixing hex-noise false positives; it also includes the Japanese Intl.Segmenter path changes from the #29 branch base.

Changes:

  • Add a dependency-free Porter stemmer and apply it to non-CJK, alphabetic tokens (with pre/post stop-word filtering).
  • Fix HEX_PATTERN noise filtering to require at least one digit to avoid dropping English words like “decade”.
  • Expand STOP_WORDS and update/add tests to cover stemming, stop-word filtering, and the HEX regression (plus #29 Japanese segmentation assertions).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
scripts/knowledge-graph/tokenizer.ts Adds stemming + shared pushClean() pipeline and (from #29) CJK run detection + Intl.Segmenter-based Japanese segmentation.
scripts/knowledge-graph/porter-stemmer.ts New inline Porter 1980 stemmer implementation (no external deps).
scripts/knowledge-graph/constants.ts Expands STOP_WORDS and updates HEX_PATTERN to avoid a–f word false positives.
scripts/tests/tokenizer.test.ts Adds/updates tests for HEX fix, stemming behavior, stop-word expansion, and Japanese segmentation regression coverage.
scripts/tests/session-summarizer.test.ts Updates keyword expectations to match stemmed output (e.g., authenticationauthent).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/knowledge-graph/tokenizer.ts Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve English tokenization: stemming, larger stop word set, fix HEX false positives

2 participants