fix(ai): cap llama-server embedding batches at its 32-input request limit by mmekkaoui · Pull Request #1281 · garrytan/gbrain

mmekkaoui · 2026-05-21T20:33:47Z

Summary

llama.cpp's llama-server rejects /v1/embeddings requests that carry more inputs than its launch --batch-size (default 32), with the error "batch size 100 > maximum allowed batch size 32". gbrain sends batches of 100 (BATCH_SIZE), so any page with more than 32 chunks fails to embed, and embed --stale then trips the Postgres statement_timeout while retrying the doomed batches.
The existing max_batch_tokens protection (plus recursive halving) is token-based and cannot bound item count: N tiny chunks fit under any token budget, so it never triggers here. The llama-server recipe also declared no_batch_cap: true, incorrectly asserting that llama.cpp has no per-request item cap.
Adds an optional max_batch_items count cap to EmbeddingTouchpoint, enforced as a hard re-split after the token split in embed() (capBatchItems), and sets max_batch_items: 32 on the llama-server recipe. A declared item cap also suppresses the missing-max_batch_tokens startup warning. Operators who launch llama-server with a larger -b can raise the cap to match.
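A minimal sketch of the two-stage split described above, to show why a token budget alone does not bound the item count. Only the names capBatchItems and max_batch_items come from this PR; the Chunk type, the splitByTokens helper, the function signatures, and the 8192-token budget are assumptions made for illustration and may not match the actual gbrain code.

```typescript
// Hypothetical shapes: the real gbrain types and signatures may differ.
type Chunk = { text: string; tokens: number };

// Token-budget split (assumed helper): start a new batch when adding a
// chunk would push the running token total past maxTokens.
function splitByTokens(chunks: Chunk[], maxTokens: number): Chunk[][] {
  const batches: Chunk[][] = [];
  let current: Chunk[] = [];
  let used = 0;
  for (const chunk of chunks) {
    if (current.length > 0 && used + chunk.tokens > maxTokens) {
      batches.push(current);
      current = [];
      used = 0;
    }
    current.push(chunk);
    used += chunk.tokens;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Count cap (assumed signature): re-split every batch so that none holds
// more than maxItems entries, preserving order. maxItems <= 0 means "no cap".
function capBatchItems<T>(batches: T[][], maxItems: number): T[][] {
  if (maxItems <= 0) return batches;
  const capped: T[][] = [];
  for (const batch of batches) {
    for (let i = 0; i < batch.length; i += maxItems) {
      capped.push(batch.slice(i, i + maxItems));
    }
  }
  return capped;
}

// 100 tiny chunks of 10 tokens each: 1,000 tokens in total.
const chunks: Chunk[] = Array.from({ length: 100 }, (_, i) => ({
  text: `chunk ${i}`,
  tokens: 10,
}));

const tokenBatches = splitByTokens(chunks, 8192); // assumed token budget
const cappedBatches = capBatchItems(tokenBatches, 32);

const tokenBatchSizes = tokenBatches.map((b) => b.length);
const cappedBatchSizes = cappedBatches.map((b) => b.length);
console.log(tokenBatchSizes); // [100]: the token split never triggers
console.log(cappedBatchSizes); // [32, 32, 32, 4]
```

Running the count cap after the token split keeps the two limits independent: the token budget still governs large chunks, while the item cap only takes effect when many small chunks would otherwise land in one request.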

Reproduced and fixed against a live llama-server (intfloat/multilingual-e5-large): before, every page with more than 32 chunks failed with a 422; after, embedding completes (an ~18k-chunk backlog drained cleanly).

Test plan

bun run typecheck
bun test test/ai/adaptive-embed-batch.test.ts test/ai/no-batch-cap-suppression.serial.test.ts (34 pass)
Adds capBatchItems pure-function coverage (cap split, exact multiple, order preservation, maxItems<=0 no-op, empty input); updates the recipe assertion (llama-server now declares max_batch_items: 32 instead of no_batch_cap).
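
The recipe change and the warning suppression can be pictured with the sketch below. The three field names (max_batch_tokens, max_batch_items, no_batch_cap) come from this PR; the interface shape, the shouldWarnMissingBatchCap helper, and the exact rule it encodes are assumptions for illustration, not the actual gbrain implementation.

```typescript
// Hypothetical subset of EmbeddingTouchpoint; only the field names are from the PR.
interface EmbeddingTouchpoint {
  max_batch_tokens?: number;
  max_batch_items?: number;
  no_batch_cap?: boolean;
}

// Assumed rule: warn at startup only when a recipe declares no batch limit
// of any kind and has not explicitly opted out with no_batch_cap.
function shouldWarnMissingBatchCap(tp: EmbeddingTouchpoint): boolean {
  if (tp.max_batch_tokens !== undefined) return false;
  if (tp.no_batch_cap === true) return false;
  if (tp.max_batch_items !== undefined && tp.max_batch_items > 0) return false;
  return true;
}

// The llama-server recipe before and after this PR, plus a recipe with no declaration.
const llamaServerBefore: EmbeddingTouchpoint = { no_batch_cap: true };
const llamaServerAfter: EmbeddingTouchpoint = { max_batch_items: 32 };
const undeclared: EmbeddingTouchpoint = {};

const warnings = [llamaServerBefore, llamaServerAfter, undeclared].map(
  shouldWarnMissingBatchCap,
);
console.log(warnings); // [false, false, true]
```

Under this reading, the llama-server recipe stays quiet at startup both before and after the change, but only the new declaration actually bounds the request size.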

Left VERSION/CHANGELOG to the maintainer per the community-PR flow.

…imit llama.cpp's llama-server rejects /v1/embeddings requests with more inputs than its launch --batch-size (default 32): "batch size 100 > maximum allowed batch size 32". gbrain sends batches of 100, so any page with >32 chunks fails to embed, and embed --stale then trips the Postgres statement_timeout retrying the doomed batches. The existing token-based protection (max_batch_tokens) can't bound item count — N tiny chunks fit under any token budget. Add an optional max_batch_items count cap to EmbeddingTouchpoint, enforced as a hard re-split after the token split in embed(), and set it to 32 on the llama-server recipe (replacing no_batch_cap: true, which wrongly assumed llama.cpp has no per-request item cap). A declared item cap also suppresses the missing-max_batch_tokens startup warning.

mmekkaoui wants to merge 1 commit into garrytan:master from mmekkaoui:fix/embedding-max-batch-items
