
feat: Week 5 — reranker, semantic cache, verifier, typed tools, ablation harness #4

Open

MariaMa-GitHub wants to merge 31 commits into main from week-5-implementation

Conversation


@MariaMa-GitHub MariaMa-GitHub commented Apr 24, 2026

Summary

  • Cross-encoder reranker (BAAI/bge-reranker-base) re-scores top RRF candidates before generation; wired behind RERANKER_ENABLED flag
  • Semantic cache scoped to (game_slug, corpus_revision, embedding_identity); applied on non-streaming RAGPipeline.answer() path; auto-invalidates on re-ingest
  • LLM-as-judge verifier with structured JSON output (is_faithful, has_sufficient_evidence, unsupported_claims, rewrite_suggestions); fails open on malformed responses
  • Verifier/refusal wiring through RAGResponse, SSE /chat (buffered — rejected text never streamed), and history persistence so reopened sessions render InsufficientEvidenceCard
  • Entity schema registry + extractor — GameAdapter.entity_schema, per-page Flash-Lite extraction, entities table with SELECT-then-upsert
  • Typed tool use — entity_lookup / list_entities_by_type tools, GeminiProvider.complete_with_tools, bounded tool loop (≤ 3 iterations) in RAGPipeline.answer()
  • Ablation harness — 10-config matrix (baseline → full-no-tools + hybrid ablations), --configs / --resume flags for incremental runs, Markdown report emitter
  • Spoiler tier removed — max_spoiler_tier hardcoded to 3 (endgame) everywhere; SpoilerSlider component deleted
  • Eval datasets — Hades expanded to 200 questions (4 strata); Hades II authored at 50 questions
  • Frontend InsufficientEvidenceCard — yellow-border refusal card wired through ChatView SSE loop, MessageBubble, and session hydration
  • Alembic migrations 007 (semantic-cache scope columns) + 008 (chat_messages.response_meta)
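The verifier's fail-open behavior can be sketched as below. Field names (is_faithful, has_sufficient_evidence, unsupported_claims, rewrite_suggestions) come from the summary above; the class and function names are hypothetical, and the real verifier lives elsewhere in the pipeline:

```python
import json
from dataclasses import dataclass, field

@dataclass
class VerifierVerdict:
    """Parsed judge output; the defaults encode the fail-open stance."""
    is_faithful: bool = True
    has_sufficient_evidence: bool = True
    unsupported_claims: list[str] = field(default_factory=list)
    rewrite_suggestions: list[str] = field(default_factory=list)

def parse_verdict(raw: str) -> VerifierVerdict:
    """Parse the judge's JSON reply; on any malformed output, fail open
    (treat the answer as acceptable rather than refusing it)."""
    try:
        data = json.loads(raw)
        return VerifierVerdict(
            is_faithful=bool(data["is_faithful"]),
            has_sufficient_evidence=bool(data["has_sufficient_evidence"]),
            unsupported_claims=list(data.get("unsupported_claims", [])),
            rewrite_suggestions=list(data.get("rewrite_suggestions", [])),
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        return VerifierVerdict()
```

Failing open means a broken judge response can never turn a good answer into a refusal; only a well-formed negative verdict triggers the InsufficientEvidenceCard path.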

Test plan

  • cd backend && pytest -v — 296 passing, 2 skipped
  • cd backend && ruff check app tests — clean
  • cd frontend && npm run lint — clean
  • alembic upgrade head applied cleanly on live DB (migrations 007 + 008)
  • Ablation run completed (Ollama llama3.1:8b / llama3.2:3b, n=200, full corpus). Results in docs/EVAL_REPORT.md. Ship-gate thresholds (faithfulness ≥ 0.85, recall@5 ≥ 0.80, citation validity ≥ 0.95) are calibrated for Gemini Flash on the full 200-item dataset — recall@5 gate met; faithfulness and citation gates not met on local 8B model, as expected.
  • Manual smoke: grounded question → normal answer with citations; unanswerable question → InsufficientEvidenceCard; re-open refusal session → card still renders

Checklist

  • Populated docs/EVAL_REPORT.md with full n=200 Ollama results
  • Updated README ablation harness description
  • Wire InsufficientEvidenceCard suggestion items to chat input (onSuggestionPick) — deferred to Week 6
  • Extend tool loop to SSE /chat streaming path — deferred to Week 6

🤖 Generated with Claude Code

MariaMa-GitHub and others added 30 commits April 23, 2026 15:53
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CrossEncoderReranker (lazy-loading BAAI/bge-reranker-base, CPU-bound
predict pushed off the event loop via asyncio.to_thread) and NullReranker
(identity pass-through). RerankedHit dataclass carries rerank_score for
downstream pipeline use. 4 new tests, all passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire CrossEncoderReranker into RAGPipeline._retrieve: RRF now fetches
rerank_candidates, the reranker reorders them, and final_top_k is taken
from the reranked list. NullReranker is used when reranker_enabled=False.
Settings gains reranker_enabled, reranker_model, and rerank_candidates.
Services gains a reranker field populated by build_services().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
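The RRF-then-rerank flow described above can be sketched roughly as follows. This is a toy illustration, not the repo's code: `score_fn` stands in for `CrossEncoder.predict`, and the function names are hypothetical:

```python
import asyncio

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion: each source list contributes 1/(k + rank) per doc."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

async def retrieve(query: str, rankings: list[list[str]], score_fn,
                   rerank_candidates: int = 10, final_top_k: int = 3) -> list[str]:
    """Fuse with RRF, rerank the top rerank_candidates, keep final_top_k.
    The CPU-bound scoring call is pushed off the event loop, mirroring the
    asyncio.to_thread pattern in the commit above."""
    candidates = rrf_fuse(rankings)[:rerank_candidates]
    scores = await asyncio.to_thread(score_fn, [(query, c) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order][:final_top_k]
```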
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…alidation

RAGPipeline.answer() now checks SemanticCache before retrieval/generation and
stores responses after. Services.build_services() builds the cache from config;
resolve_corpus_revision_key() provides the invalidation key. Eval runner and
main._get_pipeline both receive the cache and revision fn.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…adata

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mums

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add InsufficientEvidenceCard component with yellow-border styling for
refusal responses; wire the refusal SSE event through ChatView and
MessageBubble; preserve refusal payload in session hydration so
reopened sessions still render the card.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add week-5 devlog covering reranker, semantic cache, verifier/refusal,
typed tools, ablation harness, and dataset expansion; update README
feature list and eval question counts (150 → 200 Hades + 50 Hades II).
Live ablation numbers deferred pending a quota-safe run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add GEMINI_STRONG_MODEL, GEMINI_FAST_MODEL, and GEMINI_MIN_CALL_INTERVAL
settings so the ablation harness can be run with a different model pair
(e.g. flash-lite for both) or with a paced inter-call delay to stay within
free-tier RPM limits.

Update docs/EVAL_REPORT.md with concrete run options (Ollama, free-tier
multi-day, paid) after hitting the 20 req/day Flash-Lite daily cap during
the first live eval attempt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add eval_reports/ to .gitignore to prevent accidental commits of run JSON
- Move fmt() closure out of loop in render_markdown_report
- Remove redundant local import of EntityExtractor in ingestion pipeline
- Add comment explaining 6000-char truncation tradeoff in entity extractor
- Mark suggestion buttons as inert (cursor-default, opacity, tooltip) until
  click-to-insert is wired in Week 6
- Fix README semantic cache description: "cosine-similarity" not "LRU"
- Add tests for provider_supports_tools and _reset_cache_scope (289 passing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Eliminate double embedding on cache miss (compute once, pass to helpers)
- Fix silent empty answer when LLM returns neither text nor tool calls
- Use inspect.isawaitable instead of hasattr(__await__)
- Add TYPE_CHECKING guard for SemanticCache to break circular import
- Batch entity SELECT to eliminate N+1 queries; update tests accordingly
- Add buffered-/chat comment and assert-on-unexpected-status guard in main.py
- Move EntityExtractor import to module level
- Replace multi-paragraph docstrings with inline comments (semantic_cache)
- Add proper type annotations for Services fields (reranker, verifier, cache)
- Fix all E501 / E702 / F841 / I001 ruff errors to pass CI lint check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…kType.EXTRACT

- _to_gemini_contents now handles model_tool_call and tool_results message
  roles, emitting function_call and function_response parts respectively so
  Gemini receives the correct multi-turn structure for function calling
- _run_tool_loop builds structured model_tool_call / tool_results history
  entries instead of freeform user-text tool results
- Add TaskType.EXTRACT for entity extraction (distinct from TaskType.TAG)
  and wire ingestion pipeline + /ingest endpoint to use it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add full-no-tools config to default_matrix() as Ollama-compatible
  ship-gate proxy (all components except tool use)
- Fix ablation session lifetime: short-lived setup session for BM25
  build, then a fresh session per eval question so long Ollama latencies
  do not idle-timeout the shared asyncpg connection
- Replace docs/EVAL_REPORT.md placeholder with real Ollama run results
  (qwen2.5:7b / 3b, n=20, 10 configs)
- Update README ablation harness bullet to reflect 10-config matrix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…er 3)

Spoiler tier was user-configurable (0–3) but defaulted to 0, meaning most
users and all evals were running against spoiler-free passages only.
Remove the control and hardcode max_spoiler_tier=3 everywhere so the
full corpus is always used.

- Remove spoiler_tier field from ChatRequest; hardcode max_spoiler_tier=3
  in the /chat pipeline call
- Eval runner and ablation harness now run at tier 3 (full knowledge base)
- Remove SpoilerSlider component and spoilerTier state from ChatView
- Remove spoilerTier from StreamChatOptions and request body in api.ts
- Remove spoiler tier bounds-validation test (field no longer exists)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… update docs

- test_migration.py: bump expected alembic head from 006 → 008; add schema
  assertions for migration 007 (semantic_cache scope columns) and 008
  (chat_messages.response_meta) — fixes CI failure
- test_eval_pipeline_smoke.py: add reranker (NullReranker), semantic_cache,
  and rerank_candidates to fake_services, which were required by run_pipeline_eval
  after Week 5 additions — fixes second CI failure
- EVAL_REPORT.md: correct "stratified sample" to "first 20 examples (not stratified)"
- devlog/2026-week-5.md: update metrics (286→293 tests, 9→10 configs), mark
  ablation run complete, update follow-ups

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Re-ran ablation matrix after removing spoiler tier cap (max_spoiler_tier=3
for all questions). Recall@5 jumps from 0.55 to 0.89 baseline; citation
validity improves from 0.18 to 0.50. Includes before/after comparison table
and per-config observations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o ablation harness

Allows the Gemini eval to be split across multiple days without re-running
completed configs. Each config writes its own JSON; the report is rendered
from all available JSONs so partial runs accumulate into EVAL_REPORT.md.

- --configs: run a named subset of configs (e.g. baseline +rewriter +rerank)
- --resume: skip configs that already have a completed output JSON
- merge_report_rows: assembles report rows from per-config JSON files in matrix order
- .env.example: document GEMINI_MIN_CALL_INTERVAL for RPM rate-limiting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
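The accumulate-from-per-config-JSONs idea can be sketched as below (file layout and function signatures are assumptions; only the --configs / --resume semantics come from the commit):

```python
import json
from pathlib import Path

def merge_report_rows(report_dir: Path, matrix_order: list[str]) -> list[dict]:
    """Assemble report rows from whatever per-config JSONs exist, in matrix
    order, so partial runs accumulate into one report."""
    rows = []
    for name in matrix_order:
        path = report_dir / f"{name}.json"
        if path.exists():
            rows.append(json.loads(path.read_text()))
    return rows

def configs_to_run(report_dir: Path, matrix_order: list[str], resume: bool) -> list[str]:
    """With --resume, skip configs that already have a completed output JSON."""
    if not resume:
        return list(matrix_order)
    return [n for n in matrix_order if not (report_dir / f"{n}.json").exists()]
```

Because the report is always re-rendered from all JSONs on disk, a run split across several days converges on the same EVAL_REPORT.md as a single uninterrupted run.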
… retry and progress logging

- EVAL_REPORT.md: replace placeholder with full n=200 results using llama3.1:8b
  (ANSWER) + llama3.2:3b (REWRITE/VERIFY/JUDGE); add dataset metadata, gate
  status note, per-config observations, and re-run instructions
- ablations.py: retry _answer up to 3× on DBAPIError to survive Neon idle
  connection drops during long Ollama generation windows
- runner.py: print per-question progress ([run_name] idx/total id) so
  background runs can be monitored without waiting for a full config to finish

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
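The retry-on-transient-error pattern is roughly this (a generic sketch; the real code catches SQLAlchemy's DBAPIError, while ConnectionError here is just a stand-in):

```python
import time

def with_retries(fn, *, attempts: int = 3,
                 retry_on: tuple = (ConnectionError,), delay: float = 0.0):
    """Call fn, retrying up to `attempts` times on transient errors such as
    idle connection drops during long generation windows."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except retry_on as exc:
            last_exc = exc
            if delay:
                time.sleep(delay)  # brief pause before the reconnect attempt
    raise last_exc
```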
… gitignore newline

- InsufficientEvidenceCard: render rewrite suggestions as plain list items
  instead of inert cursor-default buttons
- entities/store.py + db/models.py: set entity spoiler_tier default to 3
  (endgame) to match the project-wide max_spoiler_tier=3 default
- db/models.py: document that SemanticCache.max_spoiler_tier is always 3
- .gitignore: add missing trailing newline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…to RAGPipeline.__init__

- Add `response = None` before the try block in event_stream to prevent
  unbound variable reference in the except/finally paths
- Replace comment-style type hints with proper annotations on RAGPipeline
  constructor params (Reranker, SemanticCache, Callable, Verifier,
  ToolDispatcher, list[ToolDefinition]) and add the required imports

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Update README spoiler control bullet to reflect hardcoded endgame tier
- Shorten SemanticCache.max_spoiler_tier comment to fit 100-char ruff limit
- Guard response.refusal attribute access with `is not None` check in event_stream
- Remove dead `or [""]` fallback from _chunk_text_for_sse
- Eliminate double _load_existing_results call in run_matrix --resume path
- Move _EmptyBM25/_EmptyDense stubs to module level in ablations.py
- Add logger and warn on corrupt ablation result files
- Merge duplicate collections.abc imports in pipeline.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>