
feat: Week 5 — reranker, semantic cache, verifier, typed tools, ablation harness #4

Open

MariaMa-GitHub wants to merge 31 commits into main from week-5-implementation

Conversation


@MariaMa-GitHub MariaMa-GitHub commented Apr 24, 2026

Summary

  • Cross-encoder reranker (BAAI/bge-reranker-base) re-scores top RRF candidates before generation; wired behind RERANKER_ENABLED flag
  • Semantic cache scoped to (game_slug, corpus_revision, embedding_identity); applied on non-streaming RAGPipeline.answer() path; auto-invalidates on re-ingest
  • LLM-as-judge verifier with structured JSON output (is_faithful, has_sufficient_evidence, unsupported_claims, rewrite_suggestions); fails open on malformed responses
  • Verifier/refusal wiring through RAGResponse, SSE /chat (buffered — rejected text never streamed), and history persistence so reopened sessions render InsufficientEvidenceCard
  • Entity schema registry + extractor — GameAdapter.entity_schema, per-page Flash-Lite extraction, entities table with SELECT-then-upsert
  • Typed tool use — entity_lookup / list_entities_by_type tools, GeminiProvider.complete_with_tools, bounded tool loop (≤ 3 iterations) in RAGPipeline.answer()
  • Ablation harness — 10-config matrix (baseline → full-no-tools + hybrid ablations), --configs / --resume flags for incremental runs, Markdown report emitter
  • Spoiler tier removed — max_spoiler_tier hardcoded to 3 (endgame) everywhere; SpoilerSlider component deleted
  • Eval datasets — Hades expanded to 200 questions (4 strata); Hades II authored at 50 questions
  • Frontend InsufficientEvidenceCard — yellow-border refusal card wired through ChatView SSE loop, MessageBubble, and session hydration
  • Alembic migrations 007 (semantic-cache scope columns) + 008 (chat_messages.response_meta)
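The verifier's fail-open behavior can be sketched as below. Field names (is_faithful, has_sufficient_evidence, unsupported_claims, rewrite_suggestions) come from the summary above; the class and function names are hypothetical, and the real verifier lives elsewhere in the pipeline:

```python
import json
from dataclasses import dataclass, field

@dataclass
class VerifierVerdict:
    """Parsed judge output; the defaults encode the fail-open stance."""
    is_faithful: bool = True
    has_sufficient_evidence: bool = True
    unsupported_claims: list[str] = field(default_factory=list)
    rewrite_suggestions: list[str] = field(default_factory=list)

def parse_verdict(raw: str) -> VerifierVerdict:
    """Parse the judge's JSON reply; on any malformed output, fail open
    (treat the answer as acceptable rather than refusing it)."""
    try:
        data = json.loads(raw)
        return VerifierVerdict(
            is_faithful=bool(data["is_faithful"]),
            has_sufficient_evidence=bool(data["has_sufficient_evidence"]),
            unsupported_claims=list(data.get("unsupported_claims", [])),
            rewrite_suggestions=list(data.get("rewrite_suggestions", [])),
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        return VerifierVerdict()
```

Failing open means a broken judge response can never turn a good answer into a refusal; only a well-formed negative verdict triggers the InsufficientEvidenceCard path.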

Test plan

  • cd backend && pytest -v — 296 passing, 2 skipped
  • cd backend && ruff check app tests — clean
  • cd frontend && npm run lint — clean
  • alembic upgrade head applied cleanly on live DB (migrations 007 + 008)
  • Ablation run completed (Ollama llama3.1:8b / llama3.2:3b, n=200, full corpus). Results in docs/EVAL_REPORT.md. Ship-gate thresholds (faithfulness ≥ 0.85, recall@5 ≥ 0.80, citation validity ≥ 0.95) are calibrated for Gemini Flash on the full 200-item dataset — recall@5 gate met; faithfulness and citation gates not met on local 8B model, as expected.
  • Manual smoke: grounded question → normal answer with citations; unanswerable question → InsufficientEvidenceCard; re-open refusal session → card still renders

Checklist

  • Populated docs/EVAL_REPORT.md with full n=200 Ollama results
  • Updated README ablation harness description
  • Wire InsufficientEvidenceCard suggestion items to chat input (onSuggestionPick) — deferred to Week 6
  • Extend tool loop to SSE /chat streaming path — deferred to Week 6

🤖 Generated with Claude Code

MariaMa-GitHub and others added 30 commits April 23, 2026 15:53
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CrossEncoderReranker (lazy-loading BAAI/bge-reranker-base, CPU-bound
predict pushed off the event loop via asyncio.to_thread) and NullReranker
(identity pass-through). RerankedHit dataclass carries rerank_score for
downstream pipeline use. 4 new tests, all passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire CrossEncoderReranker into RAGPipeline._retrieve: RRF now fetches
rerank_candidates, the reranker reorders them, and final_top_k is taken
from the reranked list. NullReranker is used when reranker_enabled=False.
Settings gains reranker_enabled, reranker_model, and rerank_candidates.
Services gains a reranker field populated by build_services().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
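The RRF-then-rerank flow described above can be sketched roughly as follows. This is a toy illustration, not the repo's code: `score_fn` stands in for `CrossEncoder.predict`, and the function names are hypothetical:

```python
import asyncio

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion: each source list contributes 1/(k + rank) per doc."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

async def retrieve(query: str, rankings: list[list[str]], score_fn,
                   rerank_candidates: int = 10, final_top_k: int = 3) -> list[str]:
    """Fuse with RRF, rerank the top rerank_candidates, keep final_top_k.
    The CPU-bound scoring call is pushed off the event loop, mirroring the
    asyncio.to_thread pattern in the commit above."""
    candidates = rrf_fuse(rankings)[:rerank_candidates]
    scores = await asyncio.to_thread(score_fn, [(query, c) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order][:final_top_k]
```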
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…alidation

RAGPipeline.answer() now checks SemanticCache before retrieval/generation and
stores responses after. Services.build_services() builds the cache from config;
resolve_corpus_revision_key() provides the invalidation key. Eval runner and
main._get_pipeline both receive the cache and revision fn.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…adata

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mums

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add InsufficientEvidenceCard component with yellow-border styling for
refusal responses; wire the refusal SSE event through ChatView and
MessageBubble; preserve refusal payload in session hydration so
reopened sessions still render the card.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add week-5 devlog covering reranker, semantic cache, verifier/refusal,
typed tools, ablation harness, and dataset expansion; update README
feature list and eval question counts (150 → 200 Hades + 50 Hades II).
Live ablation numbers deferred pending a quota-safe run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add GEMINI_STRONG_MODEL, GEMINI_FAST_MODEL, and GEMINI_MIN_CALL_INTERVAL
settings so the ablation harness can be run with a different model pair
(e.g. flash-lite for both) or with a paced inter-call delay to stay within
free-tier RPM limits.

Update docs/EVAL_REPORT.md with concrete run options (Ollama, free-tier
multi-day, paid) after hitting the 20 req/day Flash-Lite daily cap during
the first live eval attempt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add eval_reports/ to .gitignore to prevent accidental commits of run JSON
- Move fmt() closure out of loop in render_markdown_report
- Remove redundant local import of EntityExtractor in ingestion pipeline
- Add comment explaining 6000-char truncation tradeoff in entity extractor
- Mark suggestion buttons as inert (cursor-default, opacity, tooltip) until
  click-to-insert is wired in Week 6
- Fix README semantic cache description: "cosine-similarity" not "LRU"
- Add tests for provider_supports_tools and _reset_cache_scope (289 passing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Eliminate double embedding on cache miss (compute once, pass to helpers)
- Fix silent empty answer when LLM returns neither text nor tool calls
- Use inspect.isawaitable instead of hasattr(__await__)
- Add TYPE_CHECKING guard for SemanticCache to break circular import
- Batch entity SELECT to eliminate N+1 queries; update tests accordingly
- Add buffered-/chat comment and assert-on-unexpected-status guard in main.py
- Move EntityExtractor import to module level
- Replace multi-paragraph docstrings with inline comments (semantic_cache)
- Add proper type annotations for Services fields (reranker, verifier, cache)
- Fix all E501 / E702 / F841 / I001 ruff errors to pass CI lint check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…kType.EXTRACT

- _to_gemini_contents now handles model_tool_call and tool_results message
  roles, emitting function_call and function_response parts respectively so
  Gemini receives the correct multi-turn structure for function calling
- _run_tool_loop builds structured model_tool_call / tool_results history
  entries instead of freeform user-text tool results
- Add TaskType.EXTRACT for entity extraction (distinct from TaskType.TAG)
  and wire ingestion pipeline + /ingest endpoint to use it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add full-no-tools config to default_matrix() as Ollama-compatible
  ship-gate proxy (all components except tool use)
- Fix ablation session lifetime: short-lived setup session for BM25
  build, then a fresh session per eval question so long Ollama latencies
  do not idle-timeout the shared asyncpg connection
- Replace docs/EVAL_REPORT.md placeholder with real Ollama run results
  (qwen2.5:7b / 3b, n=20, 10 configs)
- Update README ablation harness bullet to reflect 10-config matrix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…er 3)

Spoiler tier was user-configurable (0–3) but defaulted to 0, meaning most
users and all evals were running against spoiler-free passages only.
Remove the control and hardcode max_spoiler_tier=3 everywhere so the
full corpus is always used.

- Remove spoiler_tier field from ChatRequest; hardcode max_spoiler_tier=3
  in the /chat pipeline call
- Eval runner and ablation harness now run at tier 3 (full knowledge base)
- Remove SpoilerSlider component and spoilerTier state from ChatView
- Remove spoilerTier from StreamChatOptions and request body in api.ts
- Remove spoiler tier bounds-validation test (field no longer exists)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… update docs

- test_migration.py: bump expected alembic head from 006 → 008; add schema
  assertions for migration 007 (semantic_cache scope columns) and 008
  (chat_messages.response_meta) — fixes CI failure
- test_eval_pipeline_smoke.py: add reranker (NullReranker), semantic_cache,
  and rerank_candidates to fake_services, which were required by run_pipeline_eval
  after Week 5 additions — fixes second CI failure
- EVAL_REPORT.md: correct "stratified sample" to "first 20 examples (not stratified)"
- devlog/2026-week-5.md: update metrics (286→293 tests, 9→10 configs), mark
  ablation run complete, update follow-ups

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Re-ran ablation matrix after removing spoiler tier cap (max_spoiler_tier=3
for all questions). Recall@5 jumps from 0.55 to 0.89 baseline; citation
validity improves from 0.18 to 0.50. Includes before/after comparison table
and per-config observations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o ablation harness

Allows the Gemini eval to be split across multiple days without re-running
completed configs. Each config writes its own JSON; the report is rendered
from all available JSONs so partial runs accumulate into EVAL_REPORT.md.

- --configs: run a named subset of configs (e.g. baseline +rewriter +rerank)
- --resume: skip configs that already have a completed output JSON
- merge_report_rows: assembles report rows from per-config JSON files in matrix order
- .env.example: document GEMINI_MIN_CALL_INTERVAL for RPM rate-limiting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
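The accumulate-from-per-config-JSONs idea can be sketched as below (file layout and function signatures are assumptions; only the --configs / --resume semantics come from the commit):

```python
import json
from pathlib import Path

def merge_report_rows(report_dir: Path, matrix_order: list[str]) -> list[dict]:
    """Assemble report rows from whatever per-config JSONs exist, in matrix
    order, so partial runs accumulate into one report."""
    rows = []
    for name in matrix_order:
        path = report_dir / f"{name}.json"
        if path.exists():
            rows.append(json.loads(path.read_text()))
    return rows

def configs_to_run(report_dir: Path, matrix_order: list[str], resume: bool) -> list[str]:
    """With --resume, skip configs that already have a completed output JSON."""
    if not resume:
        return list(matrix_order)
    return [n for n in matrix_order if not (report_dir / f"{n}.json").exists()]
```

Because the report is always re-rendered from all JSONs on disk, a run split across several days converges on the same EVAL_REPORT.md as a single uninterrupted run.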
… retry and progress logging

- EVAL_REPORT.md: replace placeholder with full n=200 results using llama3.1:8b
  (ANSWER) + llama3.2:3b (REWRITE/VERIFY/JUDGE); add dataset metadata, gate
  status note, per-config observations, and re-run instructions
- ablations.py: retry _answer up to 3× on DBAPIError to survive Neon idle
  connection drops during long Ollama generation windows
- runner.py: print per-question progress ([run_name] idx/total id) so
  background runs can be monitored without waiting for a full config to finish

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
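The retry-on-transient-error pattern is roughly this (a generic sketch; the real code catches SQLAlchemy's DBAPIError, while ConnectionError here is just a stand-in):

```python
import time

def with_retries(fn, *, attempts: int = 3,
                 retry_on: tuple = (ConnectionError,), delay: float = 0.0):
    """Call fn, retrying up to `attempts` times on transient errors such as
    idle connection drops during long generation windows."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except retry_on as exc:
            last_exc = exc
            if delay:
                time.sleep(delay)  # brief pause before the reconnect attempt
    raise last_exc
```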
… gitignore newline

- InsufficientEvidenceCard: render rewrite suggestions as plain list items
  instead of inert cursor-default buttons
- entities/store.py + db/models.py: set entity spoiler_tier default to 3
  (endgame) to match the project-wide max_spoiler_tier=3 default
- db/models.py: document that SemanticCache.max_spoiler_tier is always 3
- .gitignore: add missing trailing newline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…to RAGPipeline.__init__

- Add `response = None` before the try block in event_stream to prevent
  unbound variable reference in the except/finally paths
- Replace comment-style type hints with proper annotations on RAGPipeline
  constructor params (Reranker, SemanticCache, Callable, Verifier,
  ToolDispatcher, list[ToolDefinition]) and add the required imports

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Update README spoiler control bullet to reflect hardcoded endgame tier
- Shorten SemanticCache.max_spoiler_tier comment to fit 100-char ruff limit
- Guard response.refusal attribute access with `is not None` check in event_stream
- Remove dead `or [""]` fallback from _chunk_text_for_sse
- Eliminate double _load_existing_results call in run_matrix --resume path
- Move _EmptyBM25/_EmptyDense stubs to module level in ablations.py
- Add logger and warn on corrupt ablation result files
- Merge duplicate collections.abc imports in pipeline.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>