feat(indexing): OpenAI embeddings + broader language coverage #57
PatrickSys wants to merge 4 commits into master
Conversation
- Index meta stores embedding provider/model; search uses stored embedding config
- Expand default indexing include globs to 30+ languages + config formats
- Add LanceDB dimension mismatch guard for incremental updates
- Curate Kotlin Tree-sitter grammar + fixture coverage
- npm packaging: ship only docs/cli.md + docs/capabilities.md; exclude local drafts
- Docs: clarify reindex vs refresh_index; document watcher auto-refresh
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 88c18d47c3
src/core/indexer.ts (outdated)

```ts
  provider: defaultEmbeddingProvider,
  model: DEFAULT_MODEL,
```
Use an OpenAI-compatible default model for env-selected provider
When `EMBEDDING_PROVIDER=openai` is set and `EMBEDDING_MODEL` is not, this config now picks `provider: openai` but keeps `model: DEFAULT_MODEL` (`Xenova/bge-small-en-v1.5`), and `getEmbeddingProvider` only falls back to `text-embedding-3-small` when `model` is empty. In that common env-only setup, indexing sends a non-OpenAI model name to the OpenAI embeddings API and fails (400 invalid model), so the documented provider-only opt-in path is broken unless users also set a compatible model manually.
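A minimal sketch of the fix this comment asks for: make the config merge provider-aware, so an env-only `EMBEDDING_PROVIDER=openai` setup falls back to an OpenAI-compatible model instead of the local transformers default. Names here (`mergeConfig`, `DEFAULT_MODEL`, `DEFAULT_OPENAI_MODEL`) are illustrative, not necessarily the exact identifiers in this repo.

```typescript
type EmbeddingProviderName = 'transformers' | 'openai';

const DEFAULT_MODEL = 'Xenova/bge-small-en-v1.5';       // local transformers default
const DEFAULT_OPENAI_MODEL = 'text-embedding-3-small';  // OpenAI-compatible default

interface EmbeddingConfig {
  provider: EmbeddingProviderName;
  model: string;
}

function mergeConfig(env: Record<string, string | undefined> = {}): EmbeddingConfig {
  const provider: EmbeddingProviderName =
    env.EMBEDDING_PROVIDER === 'openai' ? 'openai' : 'transformers';
  // Key fix: the model default depends on the selected provider, so the
  // OpenAI API never receives the local transformers model name.
  const model =
    env.EMBEDDING_MODEL ??
    (provider === 'openai' ? DEFAULT_OPENAI_MODEL : DEFAULT_MODEL);
  return { provider, model };
}
```

With this shape, setting only `EMBEDDING_PROVIDER=openai` yields `text-embedding-3-small`, while the no-env path keeps the local default.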
src/embeddings/index.ts (outdated)

```ts
export function getConfiguredDimensions(config: Partial<EmbeddingConfig> = {}): number {
  const provider = config.provider ?? parseEmbeddingProviderName(process.env.EMBEDDING_PROVIDER) ?? 'transformers';
  const model = config.model ?? process.env.EMBEDDING_MODEL ?? DEFAULT_MODEL;
  if (provider === 'openai') return 1536; // text-embedding-3-small / ada-002
```
Compute OpenAI dimensions from the selected model
This hardcodes all OpenAI embeddings to 1536 dimensions, but the provider accepts arbitrary model names (for example, `text-embedding-3-large` returns 3072). With such models, full indexing can write 3072-d vectors, then incremental indexing passes `expectedDimensions=1536` and LanceDB initialization throws a dimension-mismatch error every time, effectively breaking incremental updates for valid OpenAI configurations.
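One way to address this comment is to derive dimensions from the model name rather than assuming 1536 for every OpenAI model. This lookup table is a hedged sketch (the dimension values for `text-embedding-3-small`, `text-embedding-3-large`, and `text-embedding-ada-002` are the publicly documented defaults; the function name is an assumption):

```typescript
// Known default output widths for OpenAI embedding models.
const OPENAI_DIMENSIONS: Record<string, number> = {
  'text-embedding-3-small': 1536,
  'text-embedding-3-large': 3072,
  'text-embedding-ada-002': 1536,
};

function openaiDimensionsFor(model: string): number {
  // Unknown model names fall back to 1536, matching the old hardcoded behavior.
  return OPENAI_DIMENSIONS[model] ?? 1536;
}
```

`getConfiguredDimensions` could then call this with the resolved model so full and incremental indexing agree on the expected vector width.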
Greptile Summary

Added OpenAI embeddings provider and embedding mismatch detection to prevent silent vector dimension incompatibilities when switching providers or models.

Key Changes
Issues Found
Confidence Score: 3/5
| Filename | Overview |
|---|---|
| src/embeddings/index.ts | added getConfiguredDimensions for dimension validation + OpenAI provider support, but hardcoded 1536 breaks text-embedding-3-large (3072 dims) |
| src/embeddings/openai.ts | new OpenAI embedding provider using native fetch, but hardcoded dimensions (1536) breaks text-embedding-3-large support |
| src/core/indexer.ts | embedding mismatch detection forces full rebuild when provider/model changes, dimension validation on incremental updates, expanded language coverage |
| src/core/index-meta.ts | added embeddingProvider/embeddingModel fields to vectorDb artifact + checkEmbeddingMismatch function for detecting provider/model changes |
| src/core/search.ts | searcher now uses stored embedding config from index meta instead of env vars, ensures query vectors match stored vectors |
| src/storage/lancedb.ts | added expectedDimensions parameter to validate vector column dimensions on initialization, prevents silent dimension mismatches |
| tests/openai-embeddings.test.ts | comprehensive test suite for OpenAI provider: authorization, dimensions, batch processing, error handling — all tests look correct |
| tests/embedding-mismatch.test.ts | tests for embedding mismatch detection + dimension calculation, but missing test case for text-embedding-3-large (3072 dims) |
Sequence Diagram

```mermaid
sequenceDiagram
    participant Indexer as CodebaseIndexer
    participant Meta as index-meta.ts
    participant EmbedAPI as embeddings/index
    participant Storage as LanceDB
    participant Search as CodebaseSearcher

    Note over Indexer,Storage: Embedding Mismatch Detection (Incremental Index)
    Indexer->>Meta: readIndexMeta(rootPath)
    Meta-->>Indexer: existing metadata with embeddingProvider/Model
    Indexer->>Meta: checkEmbeddingMismatch(meta, currentProvider, currentModel)
    alt Mismatch detected
        Meta-->>Indexer: true (provider or model changed)
        Indexer->>Indexer: Force full rebuild (set diff = null)
    else No mismatch
        Meta-->>Indexer: false (continue incremental)
        Indexer->>EmbedAPI: getConfiguredDimensions(config)
        EmbedAPI-->>Indexer: expected dimensions (e.g., 384 or 1536)
        Indexer->>Storage: initialize(path, {expectedDimensions})
        Storage->>Storage: Check stored vector column dimensions
        alt Dimension mismatch
            Storage-->>Indexer: IndexCorruptedError (rebuild required)
        else Dimensions match
            Storage-->>Indexer: OK (proceed with incremental)
        end
    end

    Note over Search,EmbedAPI: Query Embedding Consistency
    Search->>Meta: Load index metadata
    Meta-->>Search: stored embeddingProvider + embeddingModel
    Search->>EmbedAPI: getEmbeddingProvider({provider, model})
    EmbedAPI-->>Search: Provider matching stored config
    Note over Search: Query vectors now match stored vectors
```
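The mismatch check at the top of the diagram can be sketched roughly as follows. The function and field names follow the summary table (`checkEmbeddingMismatch`, `embeddingProvider`, `embeddingModel`), but the exact types and the handling of pre-existing indexes without these fields are assumptions:

```typescript
interface IndexMeta {
  embeddingProvider?: string;
  embeddingModel?: string;
}

function checkEmbeddingMismatch(
  meta: IndexMeta | null,
  currentProvider: string,
  currentModel: string,
): boolean {
  // Assumption: an index written before these fields existed is not
  // forced to rebuild; only an explicit provider/model change is.
  if (!meta?.embeddingProvider || !meta?.embeddingModel) return false;
  return (
    meta.embeddingProvider !== currentProvider ||
    meta.embeddingModel !== currentModel
  );
}
```

When this returns `true`, the indexer discards the diff and rebuilds from scratch, which is what keeps stored and query vectors in the same embedding space.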
Last reviewed commit: 88c18d4
Additional Comments (1)
- `text-embedding-3-large` returns 3072 dims, not 1536; use a getter on `OpenAIEmbeddingProvider` so dimensions resolve after `modelName` is set
- `getConfiguredDimensions` checks the model name for 'large' before returning the OpenAI dimension value
- `mergeConfig` now defaults to `text-embedding-3-small` when `EMBEDDING_PROVIDER=openai` and `EMBEDDING_MODEL` is unset, avoiding a 400 from the OpenAI API caused by sending `Xenova/bge-small-en-v1.5`
- add `text-embedding-3-large` test case; fix stale test description

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PatrickSys
left a comment

Fixed in 0375021:
- openai.ts: `dimensions` is now a getter so it resolves after `modelName` is assigned — returns 3072 for `text-embedding-3-large`, 1536 for all others.
- embeddings/index.ts: `getConfiguredDimensions` applies the same model-aware logic.
- indexer.ts: `mergeConfig` now defaults to `text-embedding-3-small` when `EMBEDDING_PROVIDER=openai` and `EMBEDDING_MODEL` is unset, avoiding the 400 caused by sending `Xenova/bge-small-en-v1.5` to the OpenAI API.
- tests: added `text-embedding-3-large → 3072` case; fixed stale test description.
…ctor indexer and embeddings code for improved readability
- Added multiple entries to memory.json for conventions and architectural decisions based on auto-extracted commit history.
- Refactored indexer.ts to improve formatting and readability in exclusion patterns and conditional checks.
- Enhanced embeddings/index.ts and types.ts for better code clarity and structure in provider name parsing.
What / Why
Closes out the v1.8.0 scope with three launch-blocking improvements:
**OpenAI embeddings** — pluggable provider architecture (`local` default via `bge-small-en-v1.5`, `openai` opt-in via `OPENAI_API_KEY`). `index-meta.json` now stores `embeddingProvider` and `embeddingModel`; incremental reindex detects a provider/model mismatch and forces a full rebuild with a clear log message instead of silently producing corrupt results.

**Broader language coverage** — default include globs expanded to 30+ languages (Python, Go, Rust, Java, Kotlin, C/C++, C#, Swift, Scala, Ruby, PHP, SQL, shell, config formats). Kotlin Tree-sitter grammar added + fixture coverage.
**npm packaging + doc honesty** — `.npmignore` ships only `docs/cli.md` and `docs/capabilities.md`; local drafts excluded from the tarball. Docs clarify `reindex` (CLI) vs `refresh_index` (MCP tool), document watcher auto-refresh, and correct embedding model defaults.

Release notes

- feat(indexing): OpenAI embeddings support via `OPENAI_API_KEY` + `EMBEDDING_PROVIDER=openai`
- feat(indexing): default language coverage expanded to 30+ languages
- fix(indexing): embedding provider/model mismatch now forces full rebuild with clear log
- fix(storage): LanceDB dimension mismatch guard on incremental updates
- refactor(embeddings): centralize `EMBEDDING_PROVIDER` parsing
- chore: npm packaging — exclude local docs from tarball
- docs: clarify watcher auto-refresh, reindex vs refresh_index, embedding model defaults
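The "LanceDB dimension mismatch guard" in the release notes can be sketched as a loud pre-write check. Names here (`IndexCorruptedError`, `assertVectorDimensions`) are assumptions based on the review summary, not confirmed identifiers:

```typescript
class IndexCorruptedError extends Error {}

// Before an incremental update, compare the stored vector width with the
// width the configured embedding model produces; fail loudly rather than
// silently writing mixed-dimension vectors into the same table.
function assertVectorDimensions(storedDim: number, expectedDim: number): void {
  if (storedDim !== expectedDim) {
    throw new IndexCorruptedError(
      `Vector dimension mismatch: index has ${storedDim}-d vectors but ` +
        `the configured embedding model produces ${expectedDim}-d vectors; ` +
        `a full rebuild is required.`,
    );
  }
}
```

The error message names both widths so the log makes clear this is a config change, not data corruption in the usual sense.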