feat: Week 5 — reranker, semantic cache, verifier, typed tools, ablation harness #4
Open
MariaMa-GitHub wants to merge 31 commits into main from
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CrossEncoderReranker (lazy-loading BAAI/bge-reranker-base, CPU-bound predict pushed off the event loop via asyncio.to_thread) and NullReranker (identity pass-through). RerankedHit dataclass carries rerank_score for downstream pipeline use. 4 new tests, all passing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
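The lazy-load plus off-event-loop pattern described above can be sketched roughly as follows. This is a simplified sketch, not the PR's actual classes: the hit-dict shape, method signatures, and score conventions are assumptions.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class RerankedHit:
    chunk_id: str
    text: str
    rerank_score: float

class NullReranker:
    """Identity pass-through: keeps the incoming (RRF) order."""
    async def rerank(self, query: str, hits: list[dict], top_k: int) -> list[RerankedHit]:
        return [
            RerankedHit(h["chunk_id"], h["text"], rerank_score=float(-i))
            for i, h in enumerate(hits[:top_k])
        ]

class CrossEncoderReranker:
    """Lazily loads the cross-encoder on first use; the CPU-bound predict()
    runs in a worker thread so the event loop stays responsive."""
    def __init__(self, model_name: str = "BAAI/bge-reranker-base"):
        self._model_name = model_name
        self._model = None  # loaded on first rerank() call

    def _score(self, query: str, hits: list[dict]) -> list[float]:
        if self._model is None:
            # Deferred import: sentence-transformers is heavy, so load lazily
            from sentence_transformers import CrossEncoder
            self._model = CrossEncoder(self._model_name)
        return list(self._model.predict([(query, h["text"]) for h in hits]))

    async def rerank(self, query: str, hits: list[dict], top_k: int) -> list[RerankedHit]:
        scores = await asyncio.to_thread(self._score, query, hits)
        ranked = sorted(zip(hits, scores), key=lambda pair: pair[1], reverse=True)
        return [RerankedHit(h["chunk_id"], h["text"], float(s)) for h, s in ranked[:top_k]]
```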
Wire CrossEncoderReranker into RAGPipeline._retrieve: RRF now fetches rerank_candidates, the reranker reorders them, and final_top_k is taken from the reranked list. NullReranker is used when reranker_enabled=False. Settings gains reranker_enabled, reranker_model, and rerank_candidates. Services gains a reranker field populated by build_services(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
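For context, the RRF step that feeds the reranker is a simple rank-based fusion. A minimal sketch (the function name and list-of-rankings input are illustrative, not the pipeline's actual API):

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each retriever contributes 1/(k + rank) per
    document, so documents ranked highly by several retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

The fused list would then be truncated to rerank_candidates before the cross-encoder re-scores it.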
…alidation

RAGPipeline.answer() now checks SemanticCache before retrieval/generation and stores responses after. Services.build_services() builds the cache from config; resolve_corpus_revision_key() provides the invalidation key. Eval runner and main._get_pipeline both receive the cache and revision fn.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…adata Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mums Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add InsufficientEvidenceCard component with yellow-border styling for refusal responses; wire the refusal SSE event through ChatView and MessageBubble; preserve refusal payload in session hydration so reopened sessions still render the card. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add week-5 devlog covering reranker, semantic cache, verifier/refusal, typed tools, ablation harness, and dataset expansion; update README feature list and eval question counts (150 → 200 Hades + 50 Hades II). Live ablation numbers deferred pending a quota-safe run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add GEMINI_STRONG_MODEL, GEMINI_FAST_MODEL, and GEMINI_MIN_CALL_INTERVAL settings so the ablation harness can be run with a different model pair (e.g. flash-lite for both) or with a paced inter-call delay to stay within free-tier RPM limits. Update docs/EVAL_REPORT.md with concrete run options (Ollama, free-tier multi-day, paid) after hitting the 20 req/day Flash-Lite daily cap during the first live eval attempt. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
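The paced inter-call delay can be sketched as a minimal throttle. The class name and wiring are hypothetical; only the GEMINI_MIN_CALL_INTERVAL setting comes from the PR.

```python
import asyncio
import time

class CallPacer:
    """Enforce a minimum interval between LLM calls, e.g. to stay under
    free-tier RPM limits. min_interval would come from a setting such as
    GEMINI_MIN_CALL_INTERVAL (seconds)."""
    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic timestamp of the previous call

    async def wait(self) -> None:
        remaining = self.min_interval - (time.monotonic() - self._last)
        if remaining > 0:
            await asyncio.sleep(remaining)
        self._last = time.monotonic()
```

Calling `await pacer.wait()` immediately before each provider request spaces calls out without blocking the event loop.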
- Add eval_reports/ to .gitignore to prevent accidental commits of run JSON
- Move fmt() closure out of loop in render_markdown_report
- Remove redundant local import of EntityExtractor in ingestion pipeline
- Add comment explaining 6000-char truncation tradeoff in entity extractor
- Mark suggestion buttons as inert (cursor-default, opacity, tooltip) until click-to-insert is wired in Week 6
- Fix README semantic cache description: "cosine-similarity" not "LRU"
- Add tests for provider_supports_tools and _reset_cache_scope (289 passing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Eliminate double embedding on cache miss (compute once, pass to helpers)
- Fix silent empty answer when LLM returns neither text nor tool calls
- Use inspect.isawaitable instead of hasattr(__await__)
- Add TYPE_CHECKING guard for SemanticCache to break circular import
- Batch entity SELECT to eliminate N+1 queries; update tests accordingly
- Add buffered-/chat comment and assert-on-unexpected-status guard in main.py
- Move EntityExtractor import to module level
- Replace multi-paragraph docstrings with inline comments (semantic_cache)
- Add proper type annotations for Services fields (reranker, verifier, cache)
- Fix all E501 / E702 / F841 / I001 ruff errors to pass CI lint check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…kType.EXTRACT

- _to_gemini_contents now handles model_tool_call and tool_results message roles, emitting function_call and function_response parts respectively so Gemini receives the correct multi-turn structure for function calling
- _run_tool_loop builds structured model_tool_call / tool_results history entries instead of freeform user-text tool results
- Add TaskType.EXTRACT for entity extraction (distinct from TaskType.TAG) and wire ingestion pipeline + /ingest endpoint to use it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add full-no-tools config to default_matrix() as Ollama-compatible ship-gate proxy (all components except tool use)
- Fix ablation session lifetime: short-lived setup session for BM25 build, then a fresh session per eval question so long Ollama latencies do not idle-timeout the shared asyncpg connection
- Replace docs/EVAL_REPORT.md placeholder with real Ollama run results (qwen2.5:7b / 3b, n=20, 10 configs)
- Update README ablation harness bullet to reflect 10-config matrix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…er 3)

Spoiler tier was user-configurable (0–3) but defaulted to 0, meaning most users and all evals were running against spoiler-free passages only. Remove the control and hardcode max_spoiler_tier=3 everywhere so the full corpus is always used.

- Remove spoiler_tier field from ChatRequest; hardcode max_spoiler_tier=3 in the /chat pipeline call
- Eval runner and ablation harness now run at tier 3 (full knowledge base)
- Remove SpoilerSlider component and spoilerTier state from ChatView
- Remove spoilerTier from StreamChatOptions and request body in api.ts
- Remove spoiler tier bounds-validation test (field no longer exists)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… update docs

- test_migration.py: bump expected alembic head from 006 → 008; add schema assertions for migration 007 (semantic_cache scope columns) and 008 (chat_messages.response_meta) — fixes CI failure
- test_eval_pipeline_smoke.py: add reranker (NullReranker), semantic_cache, and rerank_candidates to fake_services, which were required by run_pipeline_eval after Week 5 additions — fixes second CI failure
- EVAL_REPORT.md: correct "stratified sample" to "first 20 examples (not stratified)"
- devlog/2026-week-5.md: update metrics (286→293 tests, 9→10 configs), mark ablation run complete, update follow-ups

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Re-ran ablation matrix after removing spoiler tier cap (max_spoiler_tier=3 for all questions). Recall@5 jumps from 0.55 to 0.89 baseline; citation validity improves from 0.18 to 0.50. Includes before/after comparison table and per-config observations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o ablation harness

Allows the Gemini eval to be split across multiple days without re-running completed configs. Each config writes its own JSON; the report is rendered from all available JSONs, so partial runs accumulate into EVAL_REPORT.md.

- --configs: run a named subset of configs (e.g. baseline +rewriter +rerank)
- --resume: skip configs that already have a completed output JSON
- merge_report_rows: assembles report rows from per-config JSON files in matrix order
- .env.example: document GEMINI_MIN_CALL_INTERVAL for RPM rate-limiting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… retry and progress logging

- EVAL_REPORT.md: replace placeholder with full n=200 results using llama3.1:8b (ANSWER) + llama3.2:3b (REWRITE/VERIFY/JUDGE); add dataset metadata, gate status note, per-config observations, and re-run instructions
- ablations.py: retry _answer up to 3× on DBAPIError to survive Neon idle connection drops during long Ollama generation windows
- runner.py: print per-question progress ([run_name] idx/total id) so background runs can be monitored without waiting for a full config to finish

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
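The DBAPIError retry can be sketched as a generic async retry helper. The name and linear backoff are illustrative; the PR retries _answer specifically, on SQLAlchemy's DBAPIError.

```python
import asyncio

async def with_retries(fn, attempts: int = 3,
                       retriable: tuple = (ConnectionError,),
                       base_delay: float = 0.5):
    """Retry an async callable on transient errors (e.g. an idle DB connection
    dropped during a long generation window); backs off linearly between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return await fn()
        except retriable:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            await asyncio.sleep(base_delay * attempt)
```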
… gitignore newline

- InsufficientEvidenceCard: render rewrite suggestions as plain list items instead of inert cursor-default buttons
- entities/store.py + db/models.py: set entity spoiler_tier default to 3 (endgame) to match the project-wide max_spoiler_tier=3 default
- db/models.py: document that SemanticCache.max_spoiler_tier is always 3
- .gitignore: add missing trailing newline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…to RAGPipeline.__init__

- Add `response = None` before the try block in event_stream to prevent an unbound variable reference in the except/finally paths
- Replace comment-style type hints with proper annotations on RAGPipeline constructor params (Reranker, SemanticCache, Callable, Verifier, ToolDispatcher, list[ToolDefinition]) and add the required imports

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Update README spoiler control bullet to reflect hardcoded endgame tier
- Shorten SemanticCache.max_spoiler_tier comment to fit 100-char ruff limit
- Guard response.refusal attribute access with `is not None` check in event_stream
- Remove dead `or [""]` fallback from _chunk_text_for_sse
- Eliminate double _load_existing_results call in run_matrix --resume path
- Move _EmptyBM25/_EmptyDense stubs to module level in ablations.py
- Add logger and warn on corrupt ablation result files
- Merge duplicate collections.abc imports in pipeline.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- Reranker: cross-encoder (BAAI/bge-reranker-base) re-scores top RRF candidates before generation; wired behind RERANKER_ENABLED flag
- Semantic cache: keyed by (game_slug, corpus_revision, embedding_identity); applied on the non-streaming RAGPipeline.answer() path; auto-invalidates on re-ingest
- Verifier: structured output (is_faithful, has_sufficient_evidence, unsupported_claims, rewrite_suggestions); fails open on malformed responses
- Refusal path: wired through RAGResponse, SSE /chat (buffered — rejected text never streamed), and history persistence so reopened sessions render InsufficientEvidenceCard
- Entity extraction: GameAdapter.entity_schema, per-page Flash-Lite extraction, entities table with SELECT-then-upsert
- Typed tools: entity_lookup / list_entities_by_type tools, GeminiProvider.complete_with_tools, bounded tool loop (≤ 3 iterations) in RAGPipeline.answer()
- Ablation harness: --configs / --resume flags for incremental runs, Markdown report emitter
- Spoiler tier: max_spoiler_tier hardcoded to 3 (endgame) everywhere; SpoilerSlider component deleted
- InsufficientEvidenceCard: yellow-border refusal card wired through ChatView SSE loop, MessageBubble, and session hydration

Test plan
- cd backend && pytest -v — 296 passing, 2 skipped
- cd backend && ruff check app tests — clean
- cd frontend && npm run lint — clean
- alembic upgrade head applied cleanly on live DB (migrations 007 + 008)
- Ablation matrix run (llama3.1:8b / llama3.2:3b, n=200, full corpus); results in docs/EVAL_REPORT.md. Ship-gate thresholds (faithfulness ≥ 0.85, recall@5 ≥ 0.80, citation validity ≥ 0.95) are calibrated for Gemini Flash on the full 200-item dataset — recall@5 gate met; faithfulness and citation gates not met on the local 8B model, as expected.
- Manual: refusal renders InsufficientEvidenceCard; re-open refusal session → card still renders

Checklist
- docs/EVAL_REPORT.md updated with full n=200 Ollama results
- Wire InsufficientEvidenceCard suggestion items to chat input (onSuggestionPick) — deferred to Week 6
- /chat streaming path — deferred to Week 6

🤖 Generated with Claude Code