Improve search quality with BM25 path_terms and word tokenizer #3

Merged: soomtong merged 19 commits into main from search-quality-improvements on May 27, 2026
Conversation

@soomtong (Owner) commented May 27, 2026

Summary

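  • Added a path_terms field and a word_lower tokenizer to the BM25 schema; title now uses the word tokenizer (body stays on ngram_2_2)
  • Added a text_prep helper that preprocesses identifiers and paths at write time (camelCase split, path separator → space)
  • Raised the WholeFile threshold from 8KB to 16KB; added symbol extraction for Rust top-level trait_item / type_item
  • Bumped INDEX_VERSION from 5 to 6 (triggers an automatic full rebuild)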
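
The write-time preprocessing described above could look roughly like the sketch below. The function names `split_identifier` and `path_terms` are hypothetical, and the acronym handling and lowercasing are assumptions; the actual text_prep helper in this PR may behave differently.

```rust
/// Split an identifier on camelCase boundaries and non-alphanumeric
/// characters, lowercasing each piece.
/// Hypothetical helper for illustration, not the PR's actual code.
fn split_identifier(ident: &str) -> Vec<String> {
    let chars: Vec<char> = ident.chars().collect();
    let mut words = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if !c.is_alphanumeric() {
            // Separators such as '_', '-' or '.' end the current word.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            continue;
        }
        if c.is_uppercase() && !current.is_empty() {
            // `current` is non-empty, so the previous char exists.
            let prev = chars[i - 1];
            let next_is_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            // Boundary at "fooBar" (lower -> upper) and at the end of an
            // acronym such as "HTTPServer" (upper run -> new word).
            if prev.is_lowercase() || prev.is_numeric() || (prev.is_uppercase() && next_is_lower) {
                words.push(std::mem::take(&mut current));
            }
        }
        current.extend(c.to_lowercase());
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}

/// Turn a path into space-separated terms: path separators become word
/// breaks and each segment is split like an identifier.
fn path_terms(path: &str) -> String {
    path.split(|c: char| c == '/' || c == '\\')
        .flat_map(split_identifier)
        .collect::<Vec<_>>()
        .join(" ")
}
```

Under these assumptions, a word-level tokenizer over the output lets a query such as "text prep" match a file named textPrep.rs, which a whole-path token would not.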

Results

| Metric | Baseline | After | Target |
| --- | --- | --- | --- |
| MRR | 0.330 | 0.544 | ≥0.65 (partially met, +65%) |
| Recall@5 | 0.286 | 0.857 | ≥0.65 ✅ |
| Recall@10 | 0.571 | 1.000 | ≥0.85 ✅ |
| NDCG@10 | 0.384 | 0.654 | ≥0.65 ✅ |

All three 0-hit queries (incremental indexing fallback, RRF reciprocal rank fusion, search modal state machine) are resolved and now land at ranks 2 to 5. Only one query, git revwalk topological commit, regressed slightly, from rank 7 to 9.
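
For readers unfamiliar with the metrics in the table, here is a minimal sketch of how per-query values are commonly computed. It assumes binary relevance and string document ids; the function names are illustrative, and the evaluation code behind `glc report` may define these differently.

```rust
/// Reciprocal rank of the first relevant result (0.0 if none is found).
/// MRR is the mean of this value over all queries.
fn reciprocal_rank(ranked: &[&str], relevant: &[&str]) -> f64 {
    ranked
        .iter()
        .position(|d| relevant.contains(d))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Fraction of the relevant documents that appear in the top k results.
fn recall_at_k(ranked: &[&str], relevant: &[&str], k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let hits = relevant
        .iter()
        .filter(|r| ranked.iter().take(k).any(|d| d == *r))
        .count();
    hits as f64 / relevant.len() as f64
}

/// NDCG@k with binary gains: each relevant hit at 0-based position i
/// contributes 1 / log2(i + 2), normalized by the ideal ordering.
fn ndcg_at_k(ranked: &[&str], relevant: &[&str], k: usize) -> f64 {
    let dcg: f64 = ranked
        .iter()
        .take(k)
        .enumerate()
        .filter(|(_, d)| relevant.contains(*d))
        .map(|(i, _)| 1.0 / (i as f64 + 2.0).log2())
        .sum();
    let ideal: f64 = (0..relevant.len().min(k))
        .map(|i| 1.0 / (i as f64 + 2.0).log2())
        .sum();
    if ideal == 0.0 { 0.0 } else { dcg / ideal }
}
```

Because MRR only rewards the first relevant hit, a query that moves from no hit to rank 2 to 5 adds between 0.2 and 0.5 to its reciprocal rank, which fits the pattern of Recall improving more than MRR.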

Test Plan

  • cargo test (222 passed, 0 failed)
  • cargo clippy --all-targets -- -D warnings
  • cargo run --release --bin glc -- index --force (407 docs)
  • cargo run --release --bin glc -- report --out result.md (metric comparison)
  • Reviewer review: decide whether to handle the MRR shortfall and the git revwalk regression in a follow-up round (RRF k tuning, embed_text path prepend, module docstring chunks)
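
As background for the proposed RRF k tuning, the following is a minimal sketch of Reciprocal Rank Fusion. The function name `rrf_fuse` and the document ids are hypothetical. The value k = 60 in the usage note is a commonly cited default, not a value taken from this PR.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists with Reciprocal Rank Fusion:
/// score(d) = sum over lists of 1 / (k + rank), with rank starting at 1.
/// Illustrative sketch; not the project's actual fusion code.
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for list in lists {
        for (i, doc) in list.iter().enumerate() {
            *scores.entry(*doc).or_insert(0.0) += 1.0 / (k + i as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(doc, score)| (doc.to_string(), score))
        .collect();
    // Highest score first; ties are broken by document id for determinism.
    fused.sort_by(|a, b| {
        b.1.partial_cmp(&a.1)
            .unwrap()
            .then_with(|| a.0.cmp(&b.0))
    });
    fused
}
```

Calling `rrf_fuse(&[bm25_hits, vector_hits], 60.0)` would merge a BM25 list and an embedding list. A smaller k gives top-ranked documents in each list more weight, and a larger k flattens the differences between ranks, which is why k is a natural tuning knob when MRR falls short.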

Design / Plan

  • Spec: docs/superpowers/specs/2026-05-26-search-quality-improvements-design.md
  • Plan: docs/superpowers/plans/2026-05-26-search-quality-improvements.md

@soomtong merged commit b3c02d3 into main on May 27, 2026
3 checks passed
@soomtong deleted the search-quality-improvements branch on May 27, 2026 02:38