fix(retrieval): build index with the caller's config, not the default config.yaml#143
Merged
CGFixIT merged 2 commits intoJun 21, 2026
Conversation
…fig.yaml build_index(config_path) loads its config from config_path and forwards it to sanitize_chunk(), but called get_embeddings_batch(batch_chunks) without it — so the semantic index was always embedded with the model in the default config.yaml, ignoring a custom config_path. Query-time embeddings already honour config_path (HybridRetriever.semantic_search -> get_embedding(query, self.config_path)). When build_index used a different model/dimension, the corpus vectors and the query vectors disagreed and semantic retrieval silently degraded or broke. Pass config_path through to get_embeddings_batch so index-time and query-time embeddings always come from the same model. Adds a regression test that pins the propagation contract. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LvLWMML8cpBq2q81kL1ByJ
…port-from) Use string-target patch() calls instead of importing the module both as 'import retrieval.indexer as indexer' and 'from retrieval.indexer import ...'. Resolves CodeQL alert #521 on the new config-propagation test; behaviour and assertions are unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LvLWMML8cpBq2q81kL1ByJ
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
retrieval/indexer.py::build_index(config_path)loads its configuration fromconfig_pathand forwards it everywhere except the embedding call:get_embeddings_batch(texts, config_path="config.yaml")accepts aconfig_pathbutbuild_indexnever passed it, so the semantic index was always embedded with the model named in the defaultconfig.yaml, regardless of the config the caller actually supplied.Why it matters
Query-time embeddings already honour
config_path:So with a non-default config that selects a different embedding model (or dimension), the corpus vectors and the query vectors come from different models — they're no longer comparable, and semantic retrieval silently degrades or breaks (mismatched dimensions, or meaningless cosine scores). It also undermines test isolation, where a
tmp_pathconfig is the whole point of pointing the indexer at a throwaway model/index.The fix
One line — forward the already-available
config_path:Index-time and query-time embeddings now always come from the same model.
Tests
Adds
TestBuildIndexConfigPropagation::test_config_path_reaches_embeddings, which mocksget_embeddings_batch+chromadband assertsbuild_indexforwards itsconfig_path. Verified it fails on the old code and passes with the fix.🤖 Generated with Claude Code
https://claude.ai/code/session_01LvLWMML8cpBq2q81kL1ByJ
Generated by Claude Code