fix(ai): cap llama-server embedding batches at its 32-input request limit #1281

Open
mmekkaoui wants to merge 1 commit into garrytan:master from mmekkaoui:fix/embedding-max-batch-items

@mmekkaoui

Summary

  • llama.cpp's llama-server rejects /v1/embeddings requests with more inputs than its launch --batch-size (default 32), returning "batch size 100 > maximum allowed batch size 32". gbrain sends batches of 100 (BATCH_SIZE), so any page with >32 chunks fails to embed, and embed --stale then trips the Postgres statement_timeout while retrying the doomed batches.
  • The existing max_batch_tokens protection (plus recursive halving) is token-based and cannot bound item count: a batch of many tiny chunks stays under the token budget while still exceeding the item limit, so the protection never triggers here. The llama-server recipe also declared no_batch_cap: true, incorrectly asserting that llama.cpp has no per-request item cap.
  • Adds an optional max_batch_items count cap to EmbeddingTouchpoint, enforced as a hard re-split after the token split in embed() (capBatchItems), and sets max_batch_items: 32 on the llama-server recipe. A declared item cap also suppresses the missing-max_batch_tokens startup warning. Operators who launch llama-server with a larger -b can raise it.
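
For readers skimming the description, the re-split step might look roughly like the sketch below. This is not the code from the diff: the signature (a list of already token-split batches in, a list of capped batches out) is an assumption, and only the name capBatchItems and the behaviours listed in the test plan (cap split, exact multiple, order preservation, maxItems<=0 no-op, empty input) come from this PR.

```typescript
// Hypothetical sketch of an item-count re-split; the PR's real
// capBatchItems may take different arguments.
function capBatchItems<T>(batches: T[][], maxItems: number): T[][] {
  const cap = Math.floor(maxItems);
  // A missing, zero, negative, or NaN cap means "no item cap": return input untouched.
  if (!(cap >= 1)) return batches;

  const out: T[][] = [];
  for (const batch of batches) {
    // Slice each batch into consecutive windows of at most `cap` items,
    // which preserves the original chunk order.
    for (let i = 0; i < batch.length; i += cap) {
      out.push(batch.slice(i, i + cap));
    }
  }
  return out;
}

// Usage: one token-split batch of 100 chunks, capped at 32 items per request.
const chunks: string[] = Array.from({ length: 100 }, (_, i) => `chunk-${i}`);
const capped = capBatchItems([chunks], 32);
console.log(capped.map((b) => b.length)); // batch sizes: 32, 32, 32, 4
```

Running the item cap after the token split keeps the two limits independent: the token budget still decides where large chunks are divided, and the count cap only subdivides batches that are small in tokens but long in items.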

Reproduced + fixed against a live llama-server (intfloat/multilingual-e5-large): before, every >32-chunk page 422'd; after, embedding completes (drained an ~18k-chunk backlog cleanly).

Test plan

  • bun run typecheck
  • bun test test/ai/adaptive-embed-batch.test.ts test/ai/no-batch-cap-suppression.serial.test.ts (34 pass)
  • Adds capBatchItems pure-function coverage (cap split, exact multiple, order preservation, maxItems<=0 no-op, empty input); updates the recipe assertion (llama-server now declares max_batch_items: 32 instead of no_batch_cap).

Left VERSION/CHANGELOG to the maintainer per the community-PR flow.

…imit

llama.cpp's llama-server rejects /v1/embeddings requests with more inputs
than its launch --batch-size (default 32): "batch size 100 > maximum allowed
batch size 32". gbrain sends batches of 100, so any page with >32 chunks fails
to embed, and embed --stale then trips the Postgres statement_timeout retrying
the doomed batches. The existing token-based protection (max_batch_tokens)
can't bound item count — N tiny chunks fit under any token budget.

Add an optional max_batch_items count cap to EmbeddingTouchpoint, enforced as a
hard re-split after the token split in embed(), and set it to 32 on the
llama-server recipe (replacing no_batch_cap: true, which wrongly assumed
llama.cpp has no per-request item cap). A declared item cap also suppresses the
missing-max_batch_tokens startup warning.
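
As an illustration of the recipe change and the warning suppression described above, here is a minimal sketch. Only the field names max_batch_tokens, max_batch_items, and no_batch_cap come from this PR; the interface shape, the llamaServerEmbedding constant, and the shouldWarnMissingTokenCap helper are hypothetical, and the assumption that no_batch_cap also silences the warning is inferred from the test file name rather than confirmed by the diff.

```typescript
// Hypothetical shape: the real EmbeddingTouchpoint has more fields.
interface EmbeddingTouchpoint {
  max_batch_tokens?: number; // token budget per request
  max_batch_items?: number; // item-count cap per request (added by this PR)
  no_batch_cap?: boolean; // recipe asserts the provider has no cap
}

// Assumed recipe fragment: llama-server now declares an item cap
// instead of no_batch_cap: true.
const llamaServerEmbedding: EmbeddingTouchpoint = {
  max_batch_items: 32,
};

// Assumed startup check: warn only when the recipe declares no limit
// of any kind and does not opt out explicitly.
function shouldWarnMissingTokenCap(tp: EmbeddingTouchpoint): boolean {
  if (tp.max_batch_tokens !== undefined) return false;
  if (tp.no_batch_cap === true) return false;
  if (tp.max_batch_items !== undefined && tp.max_batch_items > 0) return false;
  return true;
}

console.log(shouldWarnMissingTokenCap(llamaServerEmbedding)); // false
```

Operators who launch llama-server with a larger -b would raise max_batch_items to match, as the description notes.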