
feat(indexing): OpenAI embeddings + broader language coverage #57

Open
PatrickSys wants to merge 4 commits into master from feat/v.1.8.0-sanity-check

Conversation


@PatrickSys PatrickSys commented Mar 2, 2026

What / Why

Closes out the v1.8.0 scope with three launch-blocking improvements:

  1. OpenAI embeddings — pluggable provider architecture (local default via bge-small-en-v1.5, openai opt-in via OPENAI_API_KEY). index-meta.json now stores embeddingProvider and embeddingModel; incremental reindex detects provider/model mismatch and forces a full rebuild with a clear log message instead of silently producing corrupt results.

  2. Broader language coverage — default include globs expanded to 30+ languages (Python, Go, Rust, Java, Kotlin, C/C++, C#, Swift, Scala, Ruby, PHP, SQL, shell, config formats). Kotlin Tree-sitter grammar added + fixture coverage.

  3. npm packaging + doc honesty. .npmignore ships only docs/cli.md and docs/capabilities.md; local drafts are excluded from the tarball. Docs clarify reindex (CLI) vs refresh_index (MCP tool), document watcher auto-refresh, and correct the embedding model defaults.
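The mismatch detection from item 1 can be sketched as follows. This is a minimal sketch: the field names follow index-meta.json as described above, but the actual checkEmbeddingMismatch in src/core/index-meta.ts may have a different signature.

```typescript
interface IndexMeta {
  embeddingProvider?: string;
  embeddingModel?: string;
}

// Returns true when the stored index was built with a different
// provider/model than the current config, so the caller can force a
// full rebuild instead of producing mixed-dimension vectors.
function checkEmbeddingMismatch(
  meta: IndexMeta | null,
  currentProvider: string,
  currentModel: string,
): boolean {
  if (!meta) return false; // no prior index: nothing to mismatch against
  return (
    meta.embeddingProvider !== currentProvider ||
    meta.embeddingModel !== currentModel
  );
}
```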

Release notes

  • feat(indexing): OpenAI embeddings support via OPENAI_API_KEY + EMBEDDING_PROVIDER=openai
  • feat(indexing): default language coverage expanded to 30+ languages
  • fix(indexing): embedding provider/model mismatch now forces full rebuild with clear log
  • fix(storage): LanceDB dimension mismatch guard on incremental updates
  • refactor(embeddings): centralize EMBEDDING_PROVIDER parsing
  • chore: npm packaging — exclude local docs from tarball
  • docs: clarify watcher auto-refresh, reindex vs refresh_index, embedding model defaults

- Index meta stores embedding provider/model; search uses stored embedding config
- Expand default indexing include globs to 30+ languages + config formats
- Add LanceDB dimension mismatch guard for incremental updates
- Curate Kotlin Tree-sitter grammar + fixture coverage
- npm packaging: ship only docs/cli.md + docs/capabilities.md; exclude local drafts
- Docs: clarify reindex vs refresh_index; document watcher auto-refresh
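The centralized EMBEDDING_PROVIDER parsing mentioned in the release notes can be sketched like this. The accepted spellings are an assumption; the real parser in src/embeddings may handle more cases.

```typescript
type EmbeddingProviderName = 'transformers' | 'openai';

// Sketch of centralized env parsing: trim + lowercase, and let unknown
// values fall through to undefined so callers can apply their own default.
function parseEmbeddingProviderName(
  raw: string | undefined,
): EmbeddingProviderName | undefined {
  const value = raw?.trim().toLowerCase();
  if (value === 'openai' || value === 'transformers') return value;
  return undefined;
}
```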

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 88c18d47c3


Comment on lines 289 to 290
provider: defaultEmbeddingProvider,
model: DEFAULT_MODEL,


P1: Use an OpenAI-compatible default model for env-selected provider

When EMBEDDING_PROVIDER=openai is set and EMBEDDING_MODEL is not, this config now picks provider: openai but keeps model: DEFAULT_MODEL (Xenova/bge-small-en-v1.5), and getEmbeddingProvider only falls back to text-embedding-3-small when model is empty. In that common env-only setup, indexing sends a non-OpenAI model name to the OpenAI embeddings API and fails (400 invalid model), so the documented provider-only opt-in path is broken unless users also manually set a compatible model.


export function getConfiguredDimensions(config: Partial<EmbeddingConfig> = {}): number {
  const provider = config.provider ?? parseEmbeddingProviderName(process.env.EMBEDDING_PROVIDER) ?? 'transformers';
  const model = config.model ?? process.env.EMBEDDING_MODEL ?? DEFAULT_MODEL;
  if (provider === 'openai') return 1536; // text-embedding-3-small / ada-002


P2: Compute OpenAI dimensions from the selected model

This hardcodes all OpenAI embeddings to 1536 dimensions, but the provider accepts arbitrary model names (for example text-embedding-3-large returns 3072). With such models, full indexing can write 3072-d vectors, then incremental indexing passes expectedDimensions=1536 and LanceDB initialization throws a dimension-mismatch error every time, effectively breaking incremental updates for valid OpenAI configurations.

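A model-aware variant along the lines this review suggests might look like the following. This is a sketch, not the shipped code: only the text-embedding-3-large vs. text-embedding-3-small/ada-002 split is handled, and arbitrary model names still fall back to 1536.

```typescript
// Derive OpenAI embedding dimensions from the model name instead of
// hardcoding 1536 for every model.
function openAiDimensions(model: string): number {
  // text-embedding-3-large returns 3072-d vectors;
  // text-embedding-3-small and text-embedding-ada-002 return 1536-d.
  return model.includes('large') ? 3072 : 1536;
}
```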


greptile-apps bot commented Mar 2, 2026

Greptile Summary

Added OpenAI embeddings provider and embedding mismatch detection to prevent silent vector dimension incompatibilities when switching providers or models.

Key Changes

  • OpenAI provider: lightweight implementation using native fetch (avoids heavy openai npm package dependency)
  • Mismatch detection: tracks embeddingProvider and embeddingModel in index metadata, forces full rebuild when changed
  • Dimension validation: LanceDB checks vector column dimensions on incremental updates to catch mismatches early
  • Search consistency: searcher uses stored embedding config from index (not env vars) to ensure query vectors match stored vectors
  • Broader language coverage: expanded default include patterns to support 30+ languages including Kotlin, shell scripts, SQL, and more

Issues Found

  • OpenAI dimensions hardcoded to 1536 — breaks text-embedding-3-large (3072 dims). Affects both getConfiguredDimensions() and OpenAIEmbeddingProvider.dimensions. Test coverage also missing for this model.

Confidence Score: 3/5

  • Safe to merge after fixing OpenAI dimension handling for text-embedding-3-large
  • The embedding mismatch detection and dimension validation logic is well-designed and thoroughly tested. However, the hardcoded 1536 dimensions for all OpenAI models will break text-embedding-3-large (3072 dims), causing validation failures. This is a critical bug that affects a supported model variant. The rest of the implementation is solid with good test coverage.
  • src/embeddings/index.ts and src/embeddings/openai.ts need OpenAI dimension logic updated to handle text-embedding-3-large

Important Files Changed

Filename Overview
src/embeddings/index.ts added getConfiguredDimensions for dimension validation + OpenAI provider support, but hardcoded 1536 breaks text-embedding-3-large (3072 dims)
src/embeddings/openai.ts new OpenAI embedding provider using native fetch, but hardcoded dimensions (1536) breaks text-embedding-3-large support
src/core/indexer.ts embedding mismatch detection forces full rebuild when provider/model changes, dimension validation on incremental updates, expanded language coverage
src/core/index-meta.ts added embeddingProvider/embeddingModel fields to vectorDb artifact + checkEmbeddingMismatch function for detecting provider/model changes
src/core/search.ts searcher now uses stored embedding config from index meta instead of env vars, ensures query vectors match stored vectors
src/storage/lancedb.ts added expectedDimensions parameter to validate vector column dimensions on initialization, prevents silent dimension mismatches
tests/openai-embeddings.test.ts comprehensive test suite for OpenAI provider: authorization, dimensions, batch processing, error handling — all tests look correct
tests/embedding-mismatch.test.ts tests for embedding mismatch detection + dimension calculation, but missing test case for text-embedding-3-large (3072 dims)
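The expectedDimensions guard described for src/storage/lancedb.ts could be sketched as below. IndexCorruptedError appears in the sequence diagram in this summary; the storedDimensions parameter and the message text are assumptions, not the PR's actual code.

```typescript
class IndexCorruptedError extends Error {}

// Sketch of the dimension guard run during LanceDB initialization:
// compare the stored vector column width against what the current
// embedding config would produce, and fail loudly on mismatch.
function assertVectorDimensions(
  storedDimensions: number,
  expectedDimensions?: number,
): void {
  if (
    expectedDimensions !== undefined &&
    storedDimensions !== expectedDimensions
  ) {
    throw new IndexCorruptedError(
      `vector column has ${storedDimensions} dimensions, expected ` +
        `${expectedDimensions}; a full rebuild is required`,
    );
  }
}
```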

Sequence Diagram

sequenceDiagram
    participant Indexer as CodebaseIndexer
    participant Meta as index-meta.ts
    participant EmbedAPI as embeddings/index
    participant Storage as LanceDB
    participant Search as CodebaseSearcher

    Note over Indexer,Storage: Embedding Mismatch Detection (Incremental Index)
    Indexer->>Meta: readIndexMeta(rootPath)
    Meta-->>Indexer: existing metadata with embeddingProvider/Model
    Indexer->>Meta: checkEmbeddingMismatch(meta, currentProvider, currentModel)
    alt Mismatch detected
        Meta-->>Indexer: true (provider or model changed)
        Indexer->>Indexer: Force full rebuild (set diff = null)
    else No mismatch
        Meta-->>Indexer: false (continue incremental)
        Indexer->>EmbedAPI: getConfiguredDimensions(config)
        EmbedAPI-->>Indexer: expected dimensions (e.g., 384 or 1536)
        Indexer->>Storage: initialize(path, {expectedDimensions})
        Storage->>Storage: Check stored vector column dimensions
        alt Dimension mismatch
            Storage-->>Indexer: IndexCorruptedError (rebuild required)
        else Dimensions match
            Storage-->>Indexer: OK (proceed with incremental)
        end
    end

    Note over Search,EmbedAPI: Query Embedding Consistency
    Search->>Meta: Load index metadata
    Meta-->>Search: stored embeddingProvider + embeddingModel
    Search->>EmbedAPI: getEmbeddingProvider({provider, model})
    EmbedAPI-->>Search: Provider matching stored config
    Note over Search: Query vectors now match stored vectors
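The query-consistency step at the end of the diagram amounts to roughly the following. The shapes are hypothetical and the fallback defaults shown are assumptions, not values taken from the PR.

```typescript
interface EmbeddingConfig {
  provider: string;
  model: string;
}

interface StoredIndexMeta {
  embeddingProvider?: string;
  embeddingModel?: string;
}

// Resolve the searcher's embedding config from stored index metadata
// rather than env vars, so query vectors match the stored vectors even
// when the environment has since changed.
function resolveSearchEmbeddingConfig(meta: StoredIndexMeta): EmbeddingConfig {
  return {
    provider: meta.embeddingProvider ?? 'transformers',
    model: meta.embeddingModel ?? 'Xenova/bge-small-en-v1.5',
  };
}
```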

Last reviewed commit: 88c18d4


@greptile-apps greptile-apps bot left a comment


23 files reviewed, 3 comments



greptile-apps bot commented Mar 2, 2026

Additional Comments (1)

src/embeddings/openai.ts
hardcoded 1536 doesn't work for text-embedding-3-large (3072 dims)

  readonly dimensions = this.modelName.includes('large') ? 3072 : 1536;

@PatrickSys PatrickSys changed the title from "Feat/v.1.8.0 sanity check" to "feat(indexing): OpenAI embeddings + broader language coverage" on Mar 2, 2026
- text-embedding-3-large returns 3072 dims, not 1536; use a getter on
  OpenAIEmbeddingProvider so dimensions resolve after modelName is set
- getConfiguredDimensions checks model name for 'large' before returning
  the OpenAI dimension value
- mergeConfig now defaults to text-embedding-3-small when
  EMBEDDING_PROVIDER=openai and EMBEDDING_MODEL is unset, avoiding a
  400 from the OpenAI API caused by sending 'Xenova/bge-small-en-v1.5'
- add text-embedding-3-large test case; fix stale test description

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@PatrickSys PatrickSys left a comment


Fixed in 0375021:

  • openai.ts: dimensions is now a getter, so it resolves after modelName is assigned; it returns 3072 for text-embedding-3-large and 1536 for all others.
  • embeddings/index.ts: getConfiguredDimensions applies the same model-aware logic.
  • indexer.ts: mergeConfig now defaults to text-embedding-3-small when EMBEDDING_PROVIDER=openai and EMBEDDING_MODEL is unset, avoiding the 400 caused by sending Xenova/bge-small-en-v1.5 to the OpenAI API.
  • tests: added text-embedding-3-large → 3072 case; fixed stale test description.
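The getter fix described above can be sketched as follows; the constructor shape is assumed, not copied from the PR.

```typescript
// dimensions is computed lazily from modelName via a getter, so it is
// correct regardless of property initialization order (the original
// field initializer could read modelName before it was set).
class OpenAIEmbeddingProvider {
  constructor(private readonly modelName: string) {}

  get dimensions(): number {
    return this.modelName.includes('large') ? 3072 : 1536;
  }
}
```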

…ctor indexer and embeddings code for improved readability

- Added multiple entries to memory.json for conventions and architectural decisions based on auto-extracted commit history.
- Refactored indexer.ts to improve formatting and readability in exclusion patterns and conditional checks.
- Enhanced embeddings/index.ts and types.ts for better code clarity and structure in provider name parsing.