fix(ai): cap llama-server embedding batches at its 32-input request limit #1281
Open
mmekkaoui wants to merge 1 commit into
…imit

llama.cpp's llama-server rejects /v1/embeddings requests with more inputs than its launch --batch-size (default 32): "batch size 100 > maximum allowed batch size 32". gbrain sends batches of 100, so any page with >32 chunks fails to embed, and embed --stale then trips the Postgres statement_timeout retrying the doomed batches.

The existing token-based protection (max_batch_tokens) can't bound item count: N tiny chunks fit under any token budget.

Add an optional max_batch_items count cap to EmbeddingTouchpoint, enforced as a hard re-split after the token split in embed(), and set it to 32 on the llama-server recipe (replacing no_batch_cap: true, which wrongly assumed llama.cpp has no per-request item cap). A declared item cap also suppresses the missing-max_batch_tokens startup warning.
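To illustrate why a token budget alone cannot bound item count, here is a minimal sketch. The `splitByTokenBudget` helper, its greedy packing strategy, the 4-characters-per-token estimate, and the 8192-token budget are all hypothetical stand-ins, not gbrain's actual splitter.

```typescript
// Hypothetical token-budget splitter (NOT gbrain's real code): greedily packs
// texts into a batch until the estimated token total would exceed maxTokens.
function splitByTokenBudget(texts: string[], maxTokens: number): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let used = 0;
  for (const text of texts) {
    // Crude estimate (assumption): about 4 characters per token, minimum 1.
    const cost = Math.max(1, Math.ceil(text.length / 4));
    if (current.length > 0 && used + cost > maxTokens) {
      batches.push(current);
      current = [];
      used = 0;
    }
    current.push(text);
    used += cost;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// 100 tiny chunks cost about 100 estimated tokens, far below the budget,
// so the token split leaves them in a single 100-item batch.
const tinyChunks = Array.from({ length: 100 }, (_, i) => `c${i}`);
const tokenSplit = splitByTokenBudget(tinyChunks, 8192);
console.log(tokenSplit.map((batch) => batch.length)); // [ 100 ]
```

A single 100-item batch like this is what a server limited to 32 inputs per request would reject, regardless of how small each chunk is.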
Summary
- `llama-server` rejects `/v1/embeddings` requests exceeding its launch `--batch-size` (default 32) with `batch size 100 > maximum allowed batch size 32`. gbrain sends batches of 100 (`BATCH_SIZE`), so any page with >32 chunks fails to embed, and `embed --stale` then trips the Postgres `statement_timeout` retrying the doomed batches.
- The existing `max_batch_tokens` protection (plus recursive halving) is token-based and can't bound item count: N tiny chunks fit under any token budget, so it never triggers here. The `llama-server` recipe also declared `no_batch_cap: true`, incorrectly asserting that llama.cpp has no per-request item cap.
- Adds an optional `max_batch_items` count cap to `EmbeddingTouchpoint`, enforced as a hard re-split after the token split in `embed()` (`capBatchItems`), and sets `max_batch_items: 32` on the `llama-server` recipe. A declared item cap also suppresses the missing-`max_batch_tokens` startup warning. Operators who launch `llama-server` with a larger `-b` can raise it.

Reproduced and fixed against a live `llama-server` (`intfloat/multilingual-e5-large`): before, every >32-chunk page 422'd; after, embedding completes (drained an ~18k-chunk backlog cleanly).

Test plan
- `bun run typecheck`
- `bun test test/ai/adaptive-embed-batch.test.ts test/ai/no-batch-cap-suppression.serial.test.ts` (34 pass)
- Adds `capBatchItems` pure-function coverage (cap split, exact multiple, order preservation, `maxItems<=0` no-op, empty input); updates the recipe assertion (llama-server now declares `max_batch_items: 32` instead of `no_batch_cap`).

Left VERSION/CHANGELOG to the maintainer per the community-PR flow.
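For readers without the diff open, the behaviors listed in the test plan for `capBatchItems` could be satisfied by something like the sketch below. The signature (a list of batches plus a `maxItems` number) and the handling of empty batches are assumptions inferred from the description, not the code in this PR.

```typescript
// Illustrative sketch only: the signature and edge-case behavior are assumed
// from the test-plan wording, not copied from the PR's implementation.
function capBatchItems<T>(batches: T[][], maxItems: number): T[][] {
  // A non-positive cap is treated as "no cap": return the input unchanged.
  if (maxItems <= 0) return batches;
  const capped: T[][] = [];
  for (const batch of batches) {
    // Slice each batch into consecutive runs of at most maxItems, which
    // preserves item order. Empty batches produce no output (assumption).
    for (let start = 0; start < batch.length; start += maxItems) {
      capped.push(batch.slice(start, start + maxItems));
    }
  }
  return capped;
}

// A 100-item batch capped at 32 becomes four requests.
const hundredItems = Array.from({ length: 100 }, (_, i) => i);
const cappedBatches = capBatchItems([hundredItems], 32);
console.log(cappedBatches.map((batch) => batch.length)); // [ 32, 32, 32, 4 ]
```

Running the cap after the token split means each resulting batch is a subset of a batch that already fit the token budget, so the re-split cannot push any request back over it.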