
feat(indexing): OpenAI embeddings + broader language coverage #57

Open
PatrickSys wants to merge 4 commits into master from feat/v.1.8.0-sanity-check

Conversation


@PatrickSys PatrickSys commented Mar 2, 2026

What / Why

Closes out the v1.8.0 scope with three launch-blocking improvements:

  1. OpenAI embeddings — pluggable provider architecture (local default via bge-small-en-v1.5, openai opt-in via OPENAI_API_KEY). index-meta.json now stores embeddingProvider and embeddingModel; incremental reindex detects provider/model mismatch and forces a full rebuild with a clear log message instead of silently producing corrupt results.

  2. Broader language coverage — default include globs expanded to 30+ languages (Python, Go, Rust, Java, Kotlin, C/C++, C#, Swift, Scala, Ruby, PHP, SQL, shell, config formats). Kotlin Tree-sitter grammar added + fixture coverage.

  3. npm packaging + doc honesty. .npmignore ships only docs/cli.md and docs/capabilities.md; local drafts are excluded from the tarball. Docs clarify reindex (CLI) vs refresh_index (MCP tool), document watcher auto-refresh, and correct the embedding model defaults.
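The mismatch detection from item 1 can be sketched as follows. This is a minimal sketch: the field names follow index-meta.json as described above, but the actual checkEmbeddingMismatch in src/core/index-meta.ts may have a different signature.

```typescript
interface IndexMeta {
  embeddingProvider?: string;
  embeddingModel?: string;
}

// Returns true when the stored index was built with a different
// provider/model than the current config, so the caller can force a
// full rebuild instead of producing mixed-dimension vectors.
function checkEmbeddingMismatch(
  meta: IndexMeta | null,
  currentProvider: string,
  currentModel: string,
): boolean {
  if (!meta) return false; // no prior index: nothing to mismatch against
  return (
    meta.embeddingProvider !== currentProvider ||
    meta.embeddingModel !== currentModel
  );
}
```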

Release notes

  • feat(indexing): OpenAI embeddings support via OPENAI_API_KEY + EMBEDDING_PROVIDER=openai
  • feat(indexing): default language coverage expanded to 30+ languages
  • fix(indexing): embedding provider/model mismatch now forces full rebuild with clear log
  • fix(storage): LanceDB dimension mismatch guard on incremental updates
  • refactor(embeddings): centralize EMBEDDING_PROVIDER parsing
  • chore: npm packaging — exclude local docs from tarball
  • docs: clarify watcher auto-refresh, reindex vs refresh_index, embedding model defaults

- Index meta stores embedding provider/model; search uses stored embedding config
- Expand default indexing include globs to 30+ languages + config formats
- Add LanceDB dimension mismatch guard for incremental updates
- Curate Kotlin Tree-sitter grammar + fixture coverage
- npm packaging: ship only docs/cli.md + docs/capabilities.md; exclude local drafts
- Docs: clarify reindex vs refresh_index; document watcher auto-refresh
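The centralized EMBEDDING_PROVIDER parsing mentioned in the release notes can be sketched like this. The accepted spellings are an assumption; the real parser in src/embeddings may handle more cases.

```typescript
type EmbeddingProviderName = 'transformers' | 'openai';

// Sketch of centralized env parsing: trim + lowercase, and let unknown
// values fall through to undefined so callers can apply their own default.
function parseEmbeddingProviderName(
  raw: string | undefined,
): EmbeddingProviderName | undefined {
  const value = raw?.trim().toLowerCase();
  if (value === 'openai' || value === 'transformers') return value;
  return undefined;
}
```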

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 88c18d47c3


Comment on lines 289 to 290
provider: defaultEmbeddingProvider,
model: DEFAULT_MODEL,


P1: Use an OpenAI-compatible default model for env-selected provider

When EMBEDDING_PROVIDER=openai is set and EMBEDDING_MODEL is not, this config now picks provider: openai but keeps model: DEFAULT_MODEL (Xenova/bge-small-en-v1.5), and getEmbeddingProvider only falls back to text-embedding-3-small when model is empty. In that common env-only setup, indexing sends a non-OpenAI model name to the OpenAI embeddings API and fails (400 invalid model), so the documented provider-only opt-in path is broken unless users also manually set a compatible model.


export function getConfiguredDimensions(config: Partial<EmbeddingConfig> = {}): number {
  const provider = config.provider ?? parseEmbeddingProviderName(process.env.EMBEDDING_PROVIDER) ?? 'transformers';
  const model = config.model ?? process.env.EMBEDDING_MODEL ?? DEFAULT_MODEL;
  if (provider === 'openai') return 1536; // text-embedding-3-small / ada-002


P2: Compute OpenAI dimensions from the selected model

This hardcodes all OpenAI embeddings to 1536 dimensions, but the provider accepts arbitrary model names (for example text-embedding-3-large returns 3072). With such models, full indexing can write 3072-d vectors, then incremental indexing passes expectedDimensions=1536 and LanceDB initialization throws a dimension-mismatch error every time, effectively breaking incremental updates for valid OpenAI configurations.

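A model-aware variant along the lines this review suggests might look like the following. This is a sketch, not the shipped code: only the text-embedding-3-large vs. text-embedding-3-small/ada-002 split is handled, and arbitrary model names still fall back to 1536.

```typescript
// Derive OpenAI embedding dimensions from the model name instead of
// hardcoding 1536 for every model.
function openAiDimensions(model: string): number {
  // text-embedding-3-large returns 3072-d vectors;
  // text-embedding-3-small and text-embedding-ada-002 return 1536-d.
  return model.includes('large') ? 3072 : 1536;
}
```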


greptile-apps bot commented Mar 2, 2026

Greptile Summary

Added OpenAI embeddings provider and embedding mismatch detection to prevent silent vector dimension incompatibilities when switching providers or models.

Key Changes

  • OpenAI provider: lightweight implementation using native fetch (avoids heavy openai npm package dependency)
  • Mismatch detection: tracks embeddingProvider and embeddingModel in index metadata, forces full rebuild when changed
  • Dimension validation: LanceDB checks vector column dimensions on incremental updates to catch mismatches early
  • Search consistency: searcher uses stored embedding config from index (not env vars) to ensure query vectors match stored vectors
  • Broader language coverage: expanded default include patterns to support 30+ languages including Kotlin, shell scripts, SQL, and more

Issues Found

  • OpenAI dimensions hardcoded to 1536 — breaks text-embedding-3-large (3072 dims). Affects both getConfiguredDimensions() and OpenAIEmbeddingProvider.dimensions. Test coverage also missing for this model.

Confidence Score: 3/5

  • Safe to merge after fixing OpenAI dimension handling for text-embedding-3-large
  • The embedding mismatch detection and dimension validation logic is well-designed and thoroughly tested. However, the hardcoded 1536 dimensions for all OpenAI models will break text-embedding-3-large (3072 dims), causing validation failures. This is a critical bug that affects a supported model variant. The rest of the implementation is solid with good test coverage.
  • src/embeddings/index.ts and src/embeddings/openai.ts need OpenAI dimension logic updated to handle text-embedding-3-large

Important Files Changed

Filename Overview
src/embeddings/index.ts added getConfiguredDimensions for dimension validation + OpenAI provider support, but hardcoded 1536 breaks text-embedding-3-large (3072 dims)
src/embeddings/openai.ts new OpenAI embedding provider using native fetch, but hardcoded dimensions (1536) breaks text-embedding-3-large support
src/core/indexer.ts embedding mismatch detection forces full rebuild when provider/model changes, dimension validation on incremental updates, expanded language coverage
src/core/index-meta.ts added embeddingProvider/embeddingModel fields to vectorDb artifact + checkEmbeddingMismatch function for detecting provider/model changes
src/core/search.ts searcher now uses stored embedding config from index meta instead of env vars, ensures query vectors match stored vectors
src/storage/lancedb.ts added expectedDimensions parameter to validate vector column dimensions on initialization, prevents silent dimension mismatches
tests/openai-embeddings.test.ts comprehensive test suite for OpenAI provider: authorization, dimensions, batch processing, error handling — all tests look correct
tests/embedding-mismatch.test.ts tests for embedding mismatch detection + dimension calculation, but missing test case for text-embedding-3-large (3072 dims)
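The expectedDimensions guard described for src/storage/lancedb.ts could be sketched as below. IndexCorruptedError appears in the sequence diagram in this summary; the storedDimensions parameter and the message text are assumptions, not the PR's actual code.

```typescript
class IndexCorruptedError extends Error {}

// Sketch of the dimension guard run during LanceDB initialization:
// compare the stored vector column width against what the current
// embedding config would produce, and fail loudly on mismatch.
function assertVectorDimensions(
  storedDimensions: number,
  expectedDimensions?: number,
): void {
  if (
    expectedDimensions !== undefined &&
    storedDimensions !== expectedDimensions
  ) {
    throw new IndexCorruptedError(
      `vector column has ${storedDimensions} dimensions, expected ` +
        `${expectedDimensions}; a full rebuild is required`,
    );
  }
}
```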

Sequence Diagram

sequenceDiagram
    participant Indexer as CodebaseIndexer
    participant Meta as index-meta.ts
    participant EmbedAPI as embeddings/index
    participant Storage as LanceDB
    participant Search as CodebaseSearcher

    Note over Indexer,Storage: Embedding Mismatch Detection (Incremental Index)
    Indexer->>Meta: readIndexMeta(rootPath)
    Meta-->>Indexer: existing metadata with embeddingProvider/Model
    Indexer->>Meta: checkEmbeddingMismatch(meta, currentProvider, currentModel)
    alt Mismatch detected
        Meta-->>Indexer: true (provider or model changed)
        Indexer->>Indexer: Force full rebuild (set diff = null)
    else No mismatch
        Meta-->>Indexer: false (continue incremental)
        Indexer->>EmbedAPI: getConfiguredDimensions(config)
        EmbedAPI-->>Indexer: expected dimensions (e.g., 384 or 1536)
        Indexer->>Storage: initialize(path, {expectedDimensions})
        Storage->>Storage: Check stored vector column dimensions
        alt Dimension mismatch
            Storage-->>Indexer: IndexCorruptedError (rebuild required)
        else Dimensions match
            Storage-->>Indexer: OK (proceed with incremental)
        end
    end

    Note over Search,EmbedAPI: Query Embedding Consistency
    Search->>Meta: Load index metadata
    Meta-->>Search: stored embeddingProvider + embeddingModel
    Search->>EmbedAPI: getEmbeddingProvider({provider, model})
    EmbedAPI-->>Search: Provider matching stored config
    Note over Search: Query vectors now match stored vectors
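The query-consistency step at the end of the diagram amounts to roughly the following. The shapes are hypothetical and the fallback defaults shown are assumptions, not values taken from the PR.

```typescript
interface EmbeddingConfig {
  provider: string;
  model: string;
}

interface StoredIndexMeta {
  embeddingProvider?: string;
  embeddingModel?: string;
}

// Resolve the searcher's embedding config from stored index metadata
// rather than env vars, so query vectors match the stored vectors even
// when the environment has since changed.
function resolveSearchEmbeddingConfig(meta: StoredIndexMeta): EmbeddingConfig {
  return {
    provider: meta.embeddingProvider ?? 'transformers',
    model: meta.embeddingModel ?? 'Xenova/bge-small-en-v1.5',
  };
}
```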

Last reviewed commit: 88c18d4


@greptile-apps greptile-apps bot left a comment


23 files reviewed, 3 comments



greptile-apps bot commented Mar 2, 2026

Additional Comments (1)

src/embeddings/openai.ts
hardcoded 1536 doesn't work for text-embedding-3-large (3072 dims)

  readonly dimensions = this.modelName.includes('large') ? 3072 : 1536;

@PatrickSys PatrickSys changed the title from "Feat/v.1.8.0 sanity check" to "feat(indexing): OpenAI embeddings + broader language coverage" on Mar 2, 2026
- text-embedding-3-large returns 3072 dims, not 1536; use a getter on
  OpenAIEmbeddingProvider so dimensions resolve after modelName is set
- getConfiguredDimensions checks model name for 'large' before returning
  the OpenAI dimension value
- mergeConfig now defaults to text-embedding-3-small when
  EMBEDDING_PROVIDER=openai and EMBEDDING_MODEL is unset, avoiding a
  400 from the OpenAI API caused by sending 'Xenova/bge-small-en-v1.5'
- add text-embedding-3-large test case; fix stale test description

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@PatrickSys PatrickSys left a comment


Fixed in 0375021:

  • openai.ts: dimensions is now a getter, so it resolves after modelName is assigned; it returns 3072 for text-embedding-3-large and 1536 for all others.
  • embeddings/index.ts: getConfiguredDimensions applies the same model-aware logic.
  • indexer.ts: mergeConfig now defaults to text-embedding-3-small when EMBEDDING_PROVIDER=openai and EMBEDDING_MODEL is unset, avoiding the 400 caused by sending Xenova/bge-small-en-v1.5 to the OpenAI API.
  • tests: added text-embedding-3-large → 3072 case; fixed stale test description.
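The getter fix described above can be sketched as follows; the constructor shape is assumed, not copied from the PR.

```typescript
// dimensions is computed lazily from modelName via a getter, so it is
// correct regardless of property initialization order (the original
// field initializer could read modelName before it was set).
class OpenAIEmbeddingProvider {
  constructor(private readonly modelName: string) {}

  get dimensions(): number {
    return this.modelName.includes('large') ? 3072 : 1536;
  }
}
```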

…ctor indexer and embeddings code for improved readability

- Added multiple entries to memory.json for conventions and architectural decisions based on auto-extracted commit history.
- Refactored indexer.ts to improve formatting and readability in exclusion patterns and conditional checks.
- Enhanced embeddings/index.ts and types.ts for better code clarity and structure in provider name parsing.