Skip to content

store: distillery_status blocks behind embedding HTTP calls (single global asyncio.Lock) #558

@norrietaylor

Description

@norrietaylor

Root cause

DuckDBStore._run_sync holds a single global asyncio.Lock across the entire sync function, including the embedding HTTP round-trip:

src/distillery/store/duckdb.py:976:

async with self._get_conn_lock():
    try:
        return await asyncio.to_thread(fn, *args, **kwargs)

…where fn is _sync_store / _sync_store_batch / _sync_update, and those functions call self._embedding_provider.embed(...) inside the lock (lines 1121, 1187, 1392, 1763, 1925, 3088).

The existing TODO acknowledges it:

Throughput on parallel store operations is reduced to one-at-a-time; a future optimisation could narrow the lock to the SQL portion of each _sync_* method (releasing it across the embedding network call) but correctness comes first.

distillery_status calls three locked operations: count_entries, list_feed_sources, probe_readiness (which also goes through _run_sync at line 1034). So when one slow embedding call holds the lock, status probes queue behind it three times over and clients see "server is dead."

Repro

In one session today:

  1. ~10 parallel distillery_store calls fired (embedding-heavy).
  2. Provider (Jina v3) slowed under load.
  3. distillery_status calls timed out repeatedly while writes continued to (slowly) complete.
  4. Server remained unresponsive to status probes for >5 minutes after writes stopped.

Fix

Two-stage:

Stage 1 — split health from worker pool (small change, big win):

  • Make probe_readiness and count_entries use a second, read-only DuckDB connection that is not behind _conn_lock. DuckDB supports multiple concurrent read-only connections to the same file.
  • distillery_status then completes in ms regardless of write contention.

Stage 2 — release lock across embedding call (what the TODO says):

  • In _sync_store / _sync_store_batch / _sync_update, compute embeddings before acquiring the SQL portion of the lock. Pattern:
    async def store(self, entry):
        embedding = await asyncio.to_thread(self._embedding_provider.embed, entry.content)
        return await self._run_sync(self._sync_store_with_embedding, entry, embedding)
  • The SQL critical section then takes the lock for milliseconds, not seconds.
  • Pre-existing thread-safety invariant for DuckDBPyConnection is preserved because the lock still covers all SQL.

Stage 1 alone fixes the "looks dead" symptom. Stage 2 fixes the underlying throughput issue.

Acceptance

  • distillery_status returns within 1s while 10 concurrent embedding-bound writes are in flight.
  • Stage 2: end-to-end p95 latency on parallel distillery_store drops by ≥3x on a slow-provider stub.
  • Integration test: spawn 20 concurrent stores backed by a provider that sleeps 500ms per embed; assert total wall-time < 20 × 500ms (i.e. some parallelism).
  • Existing thread-safety test (issue fix(store): DuckDB connection accessed concurrently from multiple threads, causing heap corruption #416) still passes — connection is never touched from two threads simultaneously.

References

  • src/distillery/store/duckdb.py:943-995 (_run_sync — the single global lock)
  • src/distillery/store/duckdb.py:1017-1039 (probe_readiness — goes through the same lock)
  • src/distillery/store/duckdb.py:1121, 1187, 1392, 1763, 1925, 3088 (embedding calls inside the locked section)
  • src/distillery/mcp/tools/meta.py:120-152 (status handler calling three locked store operations)
  • Related: fix(store): DuckDB connection accessed concurrently from multiple threads, causing heap corruption #416 (the connection-thread-safety incident that motivated the single lock — fix must preserve that guarantee)

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions