Feat/voyage embedding feat: Support Voyage Embedding and strengthen exact recall for technical_terms #17
Open
anmezing wants to merge 8 commits into
added 8 commits
June 6, 2026 16:43
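This PR adds support for the Voyage embedding provider and strengthens exact recall of technical_terms in code retrieval. The main points:
- input_type.
- output_dimension and output_dtype.
- Exact recall for technical_terms.
- project_id isolation, so that different projects do not contaminate one another.

Main changes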
1. Voyage embedding provider
Adds compatibility logic for the Voyage embedding provider.
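As a rough illustration of the provider-specific request bodies this section describes, the following minimal sketch builds an embeddings payload for either kind of provider. The function name `build_embedding_payload` and its parameters are hypothetical and not taken from this PR. The field names `input_type`, `output_dtype`, `output_dimension`, and `encoding_format` come from the description; leaving `encoding_format` out of Voyage requests is an assumption.

```python
from typing import Any, Dict, List, Optional


def build_embedding_payload(
    texts: List[str],
    model: str,
    *,
    voyage: bool,
    is_query: bool,
    output_dimension: Optional[int] = None,
    output_dtype: str = "float",
) -> Dict[str, Any]:
    """Build a JSON-serializable request body for an embeddings call (hypothetical helper)."""
    payload: Dict[str, Any] = {"model": model, "input": texts}
    if voyage:
        # Voyage distinguishes indexed documents from search queries.
        payload["input_type"] = "query" if is_query else "document"
        payload["output_dtype"] = output_dtype
        # Only sent when an output dimension has been configured.
        if output_dimension is not None:
            payload["output_dimension"] = output_dimension
        # Assumption: encoding_format is not sent to Voyage.
    else:
        # OpenAI-compatible providers keep the encoding_format=float behavior.
        payload["encoding_format"] = "float"
    return payload


doc_body = build_embedding_payload(
    ["class Foo {}"], "voyage-code-3", voyage=True, is_query=False, output_dimension=1024
)
query_body = build_embedding_payload(
    ["where is Foo defined"], "voyage-code-3", voyage=True, is_query=True
)
```

Keeping the provider switch in one function means the indexing path and the search path differ only in the `is_query` flag.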
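When the configuration is:
or
EMBEDDINGS_BASE_URL points to a Voyage endpoint, the Voyage request format is used.
Voyage request behavior:
- input_type=document.
- input_type=query.
- output_dtype=float.
- output_dimension is sent when EMBEDDINGS_OUTPUT_DIMENSION is set.
- encoding_format.
Non-Voyage providers, for example SiliconFlow / OpenAI-compatible:
- encoding_format=float behavior.
Additional notes:
- EMBEDDINGS_DIMENSIONS is the dimension of the local vector store.
- EMBEDDINGS_OUTPUT_DIMENSION is the optional output dimension for the Voyage API.
2. technical_terms fixed-string exact retrieval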
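Improves how technical_terms are handled. Previously, technical_terms were mostly just concatenated into the query. Now a fixed-string exact retrieval pass also runs, to precisely recall code snippets that contain the complete term. Terms containing special symbols are supported, for example:
Characteristics of exact retrieval:
As a result:
- @ControllerAdvice does not falsely match @RestController.
- useQuery() does not falsely match useQueryClient.
- R.failed does not match merely because failed appears.
3. SQLite exact substring index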
Adds a SQLite-backed exact substring index to improve the stability and performance of exact retrieval.
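To make the idea concrete, here is a minimal sketch of a gram-based exact substring index in SQLite. It is not this PR's implementation. The table names `exact_chunks` and `exact_grams` and the `project_id` column come from the description, but the other columns, the 3-character gram size, the case-sensitive matching, and all function names are assumptions made for illustration. Grams narrow the candidate set, and a final fixed-string check confirms that the full term really occurs.

```python
import sqlite3
from typing import List, Set

GRAM = 3  # gram length; an assumption for this sketch


def grams(text: str) -> Set[str]:
    """Return every substring of length GRAM (empty if the text is shorter)."""
    return {text[i:i + GRAM] for i in range(len(text) - GRAM + 1)}


def open_index() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE exact_chunks (
            id INTEGER PRIMARY KEY,
            project_id TEXT NOT NULL,
            file_path TEXT NOT NULL,
            content TEXT NOT NULL
        );
        CREATE TABLE exact_grams (
            project_id TEXT NOT NULL,
            gram TEXT NOT NULL,
            chunk_id INTEGER NOT NULL,
            PRIMARY KEY (project_id, gram, chunk_id)
        );
    """)
    return db


def add_chunk(db: sqlite3.Connection, project_id: str, file_path: str, content: str) -> None:
    cur = db.execute(
        "INSERT INTO exact_chunks (project_id, file_path, content) VALUES (?, ?, ?)",
        (project_id, file_path, content),
    )
    db.executemany(
        "INSERT OR IGNORE INTO exact_grams (project_id, gram, chunk_id) VALUES (?, ?, ?)",
        [(project_id, g, cur.lastrowid) for g in grams(content)],
    )


def exact_search(db: sqlite3.Connection, project_id: str, term: str) -> List[str]:
    term_grams = sorted(grams(term))
    if not term_grams:
        return []  # term too short to index; skipped
    placeholders = ",".join("?" * len(term_grams))
    rows = db.execute(
        f"""
        SELECT c.file_path, c.content
        FROM exact_chunks c
        JOIN exact_grams g ON g.chunk_id = c.id
        WHERE g.project_id = ? AND g.gram IN ({placeholders})
        GROUP BY c.id
        HAVING COUNT(DISTINCT g.gram) = ?
        ORDER BY c.id
        """,
        [project_id, *term_grams, len(term_grams)],
    ).fetchall()
    # Grams only produce candidates; confirm with a fixed-string check.
    return [path for path, content in rows if term in content]
```

With this layout, a chunk containing only `useQueryClient()` lacks the gram `ry(` and is never a candidate for `useQuery()`, and rows from another `project_id` are filtered out before the substring check runs.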
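Structures added or used include:
- exact_chunks
- exact_grams
Properties:
- project_id isolation.
- Rebuilding the index with --force clears the exact index of the current project.
- When ... project_id, the index is dropped and rebuilt.
- content and display_code.
4. Index consistency fix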
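Fixes the partially successful state that could result when writing the exact index fails.
Now, if writing the exact index fails:
- vector_index_hash.
This avoids:
the problem of exact retrieval going missing over the long term as a result.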
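One way to picture the fix is the sketch below. It assumes that a failed exact index write leaves the file's vector_index_hash unrecorded, so the next indexing run processes the file again. That reading is inferred from the description, not confirmed by it, and the function `index_file`, its callbacks, and the status strings are all hypothetical.

```python
from typing import Callable, Dict


def index_file(
    path: str,
    content_hash: str,
    write_vectors: Callable[[str], None],
    write_exact: Callable[[str], None],
    recorded_hashes: Dict[str, str],
) -> str:
    """Index one file; record its hash only if every index write succeeded."""
    if recorded_hashes.get(path) == content_hash:
        return "skipped"  # already fully indexed at this content version
    write_vectors(path)
    try:
        write_exact(path)
    except Exception:
        # Do not mark the file as indexed: forgetting any stale hash makes
        # the next run retry instead of skipping a half-indexed file.
        recorded_hashes.pop(path, None)
        return "retry"
    recorded_hashes[path] = content_hash
    return "indexed"
```

Because the hash is written last, a file whose exact index write failed can never look up to date on a later run.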
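5. MCP / CLI exact retrieval diagnostics
MCP / CLI output now includes exact retrieval diagnostics, which make it easier to tell whether technical_terms actually produced exact hits.
It may include:
Meaning:
- Missing exact technical terms: terms that were passed in but had no exact hit.
- seeds: the number of exact seeds that made it into the final context.
- candidates: the number of exact retrieval candidates.
- scanned: the number of chunks scanned or checked.
- elapsed: the time taken by exact retrieval.
- truncated: whether results were truncated by a limit.
- Skipped exact technical terms: terms skipped because they were too short or exceeded the count limit.
6. Test coverage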
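Tests added or extended to cover:
- @ControllerAdvice does not falsely match @RestController.
- useQuery() does not falsely match useQueryClient.
- R.failed does not falsely match failed.
- project_id isolation.
- The display_code scenario.
- vector_index_hash.
7. Documentation updates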
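Documentation updated to cover:
- EMBEDDINGS_OUTPUT_DIMENSION.
- EMBEDDINGS_OUTPUT_DTYPE.
- The relationship between EMBEDDINGS_DIMENSIONS and the dimension returned by the API.
- IGNORE_PATTERNS recommendations.
Verification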
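Run, or recommended to run:
Manual fixture verification:
- @ControllerAdvice exact-matches only files that contain @ControllerAdvice.
- useQuery() does not exact-match useQueryClient.
- R.failed does not exact-match a plain failed.
- missingExactTechnicalTerms.
Real workspace verification:
- missingExactTechnicalTerms behaves as expected.
- Diagnostics such as exactSeedCount, exactCandidateCount, and exactHitsTruncated are output normally.
Notes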
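After changing the embedding provider, embedding model, output dimension, or output dtype, the vector index must be rebuilt:
Changing only the rerank configuration usually does not require reindexing, because rerank only affects result ordering at search time.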