fix(retrieval): build index with the caller's config, not the default config.yaml #143

Merged
CGFixIT merged 2 commits into main from claude/cyclaw-optimization-review-u3kgru-indexer-config-path
Jun 21, 2026

@CGFixIT CGFixIT commented Jun 21, 2026

Summary

retrieval/indexer.py::build_index(config_path) loads its configuration from config_path and forwards it everywhere except the embedding call:

clean_chunk = sanitize_chunk(chunk, config_path)      # honours config_path
...
batch_embeddings = get_embeddings_batch(batch_chunks)  # ← uses default config.yaml

get_embeddings_batch(texts, config_path="config.yaml") accepts a config_path, but build_index never passed one. The default argument therefore applied on every call, and the semantic index was always embedded with the model named in the default config.yaml, regardless of the config the caller actually supplied.
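
The failure mode can be reproduced in miniature. The sketch below is illustrative only: load_model_name, the model names, and the two build_index_* variants are hypothetical stand-ins, not the project's real code. It shows how a defaulted parameter hides a dropped argument without raising any error.

```python
# Hypothetical stand-ins for the project's functions; names and data are assumptions.
def load_model_name(config_path):
    # Pretend each config file selects a different embedding model.
    models = {"config.yaml": "default-model", "custom.yaml": "custom-model"}
    return models[config_path]


def get_embeddings_batch(texts, config_path="config.yaml"):
    # Tag each text with the model that would have embedded it.
    model = load_model_name(config_path)
    return [(model, text) for text in texts]


def build_index_buggy(chunks, config_path):
    # config_path is dropped, so the default "config.yaml" silently applies.
    return get_embeddings_batch(chunks)


def build_index_fixed(chunks, config_path):
    # config_path is forwarded, so the caller's model is used.
    return get_embeddings_batch(chunks, config_path)


buggy = build_index_buggy(["a"], "custom.yaml")
fixed = build_index_fixed(["a"], "custom.yaml")
```

In this sketch the buggy variant reports default-model even though the caller asked for custom.yaml, which mirrors the behaviour described above.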

Why it matters

Query-time embeddings already honour config_path:

# retrieval/hybrid_search.py
def semantic_search(self, query, ...):
    emb = get_embedding(query, self.config_path)

So with a non-default config that selects a different embedding model (or dimension), the corpus vectors and the query vectors come from different models. They are no longer comparable, and semantic retrieval silently degrades or breaks (mismatched dimensions, or meaningless cosine scores). It also undermines test isolation: the whole point of passing a tmp_path config is to point the indexer at a throwaway model and index.
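
To make the dimension case concrete, here is a minimal sketch of a cosine similarity that refuses vectors of different lengths. The cosine helper and the example vectors are assumptions for illustration; how the project's vector store reacts to a mismatch is not shown in this PR and may differ.

```python
import math


def cosine(a, b):
    # Cosine similarity is only defined for vectors of equal dimension.
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical vectors: the index was built with a 4-d model, the query with a 3-d one.
corpus_vec = [0.1, 0.2, 0.3, 0.4]
query_vec = [0.1, 0.2, 0.3]

try:
    cosine(corpus_vec, query_vec)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Same-dimension vectors from one model compare without error.
same_model_score = cosine([1.0, 0.0], [1.0, 0.0])
```

When two models happen to share a dimension, no error is raised at all, which is the harder case: the scores are computed but carry no meaning.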

The fix

One line — forward the already-available config_path:

batch_embeddings = get_embeddings_batch(batch_chunks, config_path)

Index-time and query-time embeddings now always come from the same model.

Tests

Adds TestBuildIndexConfigPropagation::test_config_path_reaches_embeddings, which mocks get_embeddings_batch + chromadb and asserts build_index forwards its config_path. Verified it fails on the old code and passes with the fix.

tests/test_indexer.py ........  8 passed
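
The shape of such a regression test might look like the sketch below. It is not the test added in this PR: make_indexer and config_path_is_forwarded are hypothetical helpers, and a SimpleNamespace stands in for the retrieval.indexer module so the example runs on its own. It checks both the old behaviour (argument dropped) and the fixed behaviour (argument forwarded).

```python
import types
from unittest import mock


def make_indexer(forward_config):
    # Build a stand-in "module" whose build_index either forwards config_path or drops it.
    ns = types.SimpleNamespace()

    def get_embeddings_batch(texts, config_path="config.yaml"):
        raise RuntimeError("real embedding call; expected to be mocked in tests")

    def build_index(chunks, config_path):
        if forward_config:
            return ns.get_embeddings_batch(chunks, config_path)
        return ns.get_embeddings_batch(chunks)  # old behaviour: argument dropped

    ns.get_embeddings_batch = get_embeddings_batch
    ns.build_index = build_index
    return ns


def config_path_is_forwarded(indexer):
    # Replace the embedding call with a mock and inspect how build_index invoked it.
    with mock.patch.object(indexer, "get_embeddings_batch", return_value=[[0.0]]) as fake:
        indexer.build_index(["chunk"], "tmp/config.yaml")
    try:
        fake.assert_called_once_with(["chunk"], "tmp/config.yaml")
    except AssertionError:
        return False
    return True


old_passes = config_path_is_forwarded(make_indexer(forward_config=False))
new_passes = config_path_is_forwarded(make_indexer(forward_config=True))
```

Asserting on the exact call arguments is what pins the propagation contract: the check fails as soon as the argument is dropped again, without needing a real embedding model.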

🤖 Generated with Claude Code

https://claude.ai/code/session_01LvLWMML8cpBq2q81kL1ByJ



…fig.yaml

build_index(config_path) loads its config from config_path and forwards it to
sanitize_chunk(), but called get_embeddings_batch(batch_chunks) without it — so
the semantic index was always embedded with the model in the default
config.yaml, ignoring a custom config_path.

Query-time embeddings already honour config_path
(HybridRetriever.semantic_search -> get_embedding(query, self.config_path)).
When build_index used a different model/dimension, the corpus vectors and the
query vectors disagreed and semantic retrieval silently degraded or broke.

Pass config_path through to get_embeddings_batch so index-time and query-time
embeddings always come from the same model. Adds a regression test that pins
the propagation contract.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LvLWMML8cpBq2q81kL1ByJ
Comment thread tests/test_indexer.py Fixed
…port-from)

Use string-target patch() calls instead of importing the module both as
'import retrieval.indexer as indexer' and 'from retrieval.indexer import ...'.
Resolves CodeQL alert #521 on the new config-propagation test; behaviour and
assertions are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LvLWMML8cpBq2q81kL1ByJ
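
As a rough illustration of the string-target style that commit describes, the sketch below patches an attribute by its dotted path instead of importing the module in two different ways. The module name demo_indexer is hypothetical and is registered in sys.modules only so the example is self-contained; the real test targets retrieval.indexer.

```python
import sys
import types
from unittest import mock

# Hypothetical stand-in module, registered so patch() can resolve it by name.
demo = types.ModuleType("demo_indexer")


def get_embeddings_batch(texts, config_path="config.yaml"):
    raise RuntimeError("real embedding call; expected to be patched in tests")


def build_index(chunks, config_path):
    # Attribute lookup on the module happens at call time, so the patch is seen.
    return demo.get_embeddings_batch(chunks, config_path)


demo.get_embeddings_batch = get_embeddings_batch
demo.build_index = build_index
sys.modules["demo_indexer"] = demo

# String target: patch() resolves "demo_indexer" itself, so the test needs no
# second import form just to obtain a patchable module object.
with mock.patch("demo_indexer.get_embeddings_batch", return_value=[[0.0]]) as fake:
    demo.build_index(["chunk"], "custom.yaml")

fake.assert_called_once_with(["chunk"], "custom.yaml")
forwarded_path = fake.call_args.args[1]
```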
@CGFixIT CGFixIT marked this pull request as ready for review June 21, 2026 02:25
@CGFixIT CGFixIT merged commit f1d350a into main Jun 21, 2026
14 checks passed
3 participants