Stabilize chunk upserts, attach git commit context to memory, and stabilize tests#10

Merged
DxTa merged 3 commits into main from codex/find-multiple-git-worktrees-support
Feb 7, 2026
DxTa commented Feb 6, 2026

Motivation

  • Prevent SQLite corruption and inconsistent indexing during re-indexing by making chunk upserts idempotent.
  • Preserve git provenance (commit hash and commit time) for memory items (decisions, timeline events, changelogs) to improve traceability and auditing.
  • Make the test suite stable and offline-friendly by disabling embeddings in CI flows and aligning tests with the current backend API.

Description

  • Replace duplicate-chunk handling in UsearchSqliteBackend.store_chunks_batch with an INSERT OR REPLACE upsert to avoid UNIQUE constraint errors and DB corruption during reindexing (plus schema/column safety checks in _create_tables).
  • Thread git provenance through the stack by adding get_git_commit_context in cli, _get_git_commit_context in indexer.coordinator, and attaching commit_hash/commit_time to chunks and memory writes so GitSyncService and backend storage receive and persist them.
  • Extend models and storage APIs to include commit_hash and commit_time (changes in core.models, storage.base, and usearch_backend), add DB columns for these fields, and propagate them through memory import/export and sync flows.
  • Stabilize tests by updating imports to the current backend, disabling embeddings for CLI/integration tests (set PYTHONPATH for subprocess runs), and converting memvid-specific test code to use the unified backend API (updates in tests/test_basic.py, tests/test_cli_integration.py, tests/test_empty_code_fix.py, tests/test_usearch_backend.py, and tests/unit/test_git_sync.py).
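
As a rough illustration of the provenance lookup described in the second bullet, the sketch below reads the current commit's hash and committer time with `git log`. The function names `parse_commit_context` and `read_commit_context` are hypothetical and not taken from this PR; the actual `get_git_commit_context` may be implemented differently.

```python
import subprocess
from datetime import datetime


def parse_commit_context(output: str) -> dict:
    """Parse the two-line output of `git log -1 --format=%H%n%cI`."""
    lines = output.strip().splitlines()
    if len(lines) < 2:
        return {"commit_hash": None, "commit_time": None}
    commit_hash = lines[0].strip()
    raw_time = lines[1].strip()
    # %cI is strict ISO 8601; normalize a trailing "Z" so that
    # datetime.fromisoformat also accepts it on Python < 3.11.
    if raw_time.endswith("Z"):
        raw_time = raw_time[:-1] + "+00:00"
    return {"commit_hash": commit_hash,
            "commit_time": datetime.fromisoformat(raw_time)}


def read_commit_context(repo_path: str) -> dict:
    """Return commit_hash/commit_time for HEAD, or None values outside a repo."""
    try:
        result = subprocess.run(
            ["git", "-C", repo_path, "log", "-1", "--format=%H%n%cI"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        # git is not installed; treat as "no provenance available".
        return {"commit_hash": None, "commit_time": None}
    if result.returncode != 0:
        return {"commit_hash": None, "commit_time": None}
    return parse_commit_context(result.stdout)
```

Returning `None` values instead of raising keeps indexing usable in directories that are not git checkouts, which is an assumption of this sketch rather than documented behavior of the PR.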

Testing

  • Ran the full test suite with PYTHONPATH=. pytest, which completed with 202 passed, 235 skipped, and no failures.
  • Re-ran the previously failing integration case tests/integration/test_watch_mode.py::TestWatchModeIndexing::test_chunk_index_tracks_stale_chunks, which passed after the upsert fix.
  • Verified targeted backend behaviors (chunk upserts and incremental indexing) with small repro scripts during debugging, all showing expected outcomes.

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 684805a333

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +629 to +633
INSERT OR REPLACE INTO chunks (
uri,
symbol,
chunk_type,
file_path,

P1: Purge old vectors when replacing chunk rows

Using INSERT OR REPLACE for chunks deletes the old row and inserts a fresh one whenever an existing uri is reindexed, so the chunk receives a new row id. The vector index is only ever appended to later in store_chunks_batch, and there is no corresponding removal path for the old row id, so reindexing the same file leaves orphaned vectors in the HNSW index. When embeddings are enabled, semantic search can then return IDs that no longer exist in SQLite (they get filtered out), which reduces recall and steadily bloats the vector index over time. Consider deleting the old vector entry before the replace, or switching to a deterministic row id or an upsert that preserves IDs.
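
The row id behavior behind this comment can be reproduced with a small standalone script. The table below is a simplified, hypothetical stand-in for the real chunks schema (only `id`, `uri`, and `code` columns), and the `ON CONFLICT ... DO UPDATE` form is shown as one possible ID-preserving alternative, not as the fix that was adopted in this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the chunks table; the real schema has more columns.
conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, uri TEXT UNIQUE, code TEXT)"
)


def row_id(uri):
    """Return the current row id stored for a uri."""
    return conn.execute(
        "SELECT id FROM chunks WHERE uri = ?", (uri,)
    ).fetchone()[0]


conn.execute("INSERT INTO chunks (uri, code) VALUES (?, ?)", ("a.py:1", "v1"))
conn.execute("INSERT INTO chunks (uri, code) VALUES (?, ?)", ("b.py:1", "v1"))
id_before = row_id("a.py:1")

# INSERT OR REPLACE deletes the conflicting row and inserts a new one,
# so the uri ends up with a different row id.
conn.execute(
    "INSERT OR REPLACE INTO chunks (uri, code) VALUES (?, ?)", ("a.py:1", "v2")
)
id_after_replace = row_id("a.py:1")

# An upsert (SQLite 3.24+) updates the existing row in place,
# so the row id is preserved.
conn.execute(
    "INSERT INTO chunks (uri, code) VALUES (?, ?) "
    "ON CONFLICT(uri) DO UPDATE SET code = excluded.code",
    ("a.py:1", "v3"),
)
id_after_upsert = row_id("a.py:1")

print(id_before, id_after_replace, id_after_upsert)
```

Any vector stored under `id_before` would be orphaned after the replace step, whereas the upsert step leaves the id that the vector index already knows about unchanged.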

DxTa force-pushed the codex/find-multiple-git-worktrees-support branch from 684805a to 865e368 on February 7, 2026 14:18
DxTa added 2 commits February 7, 2026 18:42
Ensure sqlite-vec/usearch store and import commit_hash/commit_time consistently, apply schema migrations on open for writable indexes, and extend tests to cover parity and legacy schema.
Auto-detect linked git worktrees and store the index in the git common dir by default, while keeping per-repo  for normal checkouts. Adds unit tests and updates docs.
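
One way the worktree detection in the second commit message could work is to compare the outputs of `git rev-parse --git-dir` and `git rev-parse --git-common-dir`, which differ only inside a linked worktree. The sketch below assumes that approach; the helper names (`_rev_parse`, `is_linked_worktree`, `choose_index_root`) are hypothetical, and the function returns only the root location rather than the actual index directory name, which is not stated here.

```python
import subprocess
from pathlib import Path
from typing import Optional


def _rev_parse(repo: Path, flag: str) -> Optional[Path]:
    """Run `git rev-parse <flag>` in repo and return the path, or None on failure."""
    try:
        result = subprocess.run(
            ["git", "-C", str(repo), "rev-parse", flag],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return None  # git is not installed
    if result.returncode != 0:
        return None  # not inside a git repository
    path = Path(result.stdout.strip())
    # git may print a path relative to the directory it ran in.
    return path if path.is_absolute() else repo / path


def is_linked_worktree(git_dir: Path, common_dir: Path) -> bool:
    """In a linked worktree the per-worktree git dir differs from the common dir."""
    return git_dir.resolve() != common_dir.resolve()


def choose_index_root(repo: Path) -> Path:
    """Pick the git common dir for linked worktrees, else the checkout itself."""
    git_dir = _rev_parse(repo, "--git-dir")
    common_dir = _rev_parse(repo, "--git-common-dir")
    if git_dir is None or common_dir is None:
        return repo
    if is_linked_worktree(git_dir, common_dir):
        return common_dir.resolve()
    return repo
```

Storing the index under the common dir means every linked worktree of the same repository resolves to one shared location, while a normal checkout keeps its own.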
DxTa merged commit 2d688ee into main on Feb 7, 2026
15 checks passed
DxTa deleted the codex/find-multiple-git-worktrees-support branch on February 7, 2026 18:12