Problem
T5CoreMLEmbedder (and any other Embedder implementation the library provides) has no internal upper bound on the total number of tokens it will pass toward a CoreML prediction. Callers can hand it inputs the underlying CoreML model literally cannot allocate output for, and the failure mode is catastrophic and silent for the pool, not just the call:
One oversized input throws an MLE5OutputPortBinder bindAndReturnError IOSurface allocation failure during output binding.
The ANE pool is left in a degraded state.
Every subsequent inference — including small inputs — then fails with the same allocation error, until the host process restarts.
This was observed during a SafariUnfucker bulk-index run: the first failure had inputLength=593,285 tokens (~600k). At typical T5 dims (768) × 4 bytes per float, the output tensor for a sequence that long is roughly 1.8 GB of contiguous IOSurface — well outside what the runtime can allocate. After that single page, every subsequent input failed (sizes from <1k to 100k+ tokens, 6,157 failures total in the run, all the same error).
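The 1.8 GB estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes, as the estimate above does, one 768-wide Float32 vector per input token; the helper name `outputTensorBytes` is hypothetical and not part of Switchcraft.

```swift
// Back-of-envelope size of a dense output tensor.
// Assumption: one `dim`-wide vector per input token, 4 bytes per element.
func outputTensorBytes(tokens: Int, dim: Int, bytesPerElement: Int = 4) -> Int {
    tokens * dim * bytesPerElement
}

let bytes = outputTensorBytes(tokens: 593_285, dim: 768)
print(bytes)                          // 1822571520
print(Double(bytes) / 1_000_000_000)  // about 1.82 (decimal GB)
```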
The ~600k-token page that triggered the poisoning was a real page from the user's browsing history. Real-world content includes huge docs, paginated GitHub PRs, archived RSS dumps, etc. This is not a synthetic edge case.
Why this is the embedder's job, not the caller's
The embedder is the only component that knows the model's capacity, the dim, the dtype, and the realistic IOSurface ceiling on the platform it's running on. Asking every caller (CLI consumers, SafariUnfucker, MCP servers, batch jobs) to know this and pre-truncate is bug-prone — each caller has to learn this lesson the hard way.
The contract must be: encode(_:) either returns embeddings or throws a typed, recoverable error. It must never put the embedder in a state where the next call fails too.
Even if a caller "should know better," it's a footgun: Switchcraft is the library that exists specifically because consumers don't want to think about CoreML internals.
Summary
Add a configurable overflow policy to T5CoreMLEmbedder (and expose it via a maxInputTokens property on the Embedder protocol or the concrete type) so that no input exceeding a safe maximum ever reaches MLPredictor.predict. On overflow the embedder either silently truncates the token sequence to the safe maximum (default) or throws a typed EmbedderError.inputTooLarge(actual:max:) error, depending on the configured policy. The maximum is documented and queryable by consumers.
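One possible shape for the policy and the typed error is sketched below. Only `.truncate`, `.reject`, `EmbedderError.inputTooLarge(actual:max:)`, and `maxInputTokens` come from this proposal; the `OverflowPolicy` type name, the free function `applyOverflowPolicy`, and the `[Int32]` token representation are assumptions made for illustration, not existing Switchcraft API.

```swift
// Hypothetical policy type; the two cases mirror the proposal.
enum OverflowPolicy: Sendable, Equatable {
    case truncate
    case reject
}

// Typed, recoverable error as described in the proposal.
enum EmbedderError: Error, Sendable, Equatable {
    case inputTooLarge(actual: Int, max: Int)
}

/// Applies the overflow policy to an already-tokenized input and returns
/// the token IDs that are safe to hand to the prediction backend.
func applyOverflowPolicy(
    to tokens: [Int32],
    maxInputTokens: Int,
    policy: OverflowPolicy = .truncate
) throws -> [Int32] {
    // Inputs at or under the limit pass through untouched.
    guard tokens.count > maxInputTokens else { return tokens }
    switch policy {
    case .truncate:
        return Array(tokens.prefix(maxInputTokens))
    case .reject:
        throw EmbedderError.inputTooLarge(actual: tokens.count, max: maxInputTokens)
    }
}

// Usage: the default policy keeps the prefix; .reject surfaces the typed error.
let oversized = [Int32](repeating: 1, count: 20)
let kept = try applyOverflowPolicy(to: oversized, maxInputTokens: 8)
print(kept.count)  // 8

do {
    _ = try applyOverflowPolicy(to: oversized, maxInputTokens: 8, policy: .reject)
} catch {
    print(error)   // the caller can skip, summarize, or split here
}
```

Keeping the guard as a pure function over the token array makes it easy to call immediately after tokenization and to unit-test without loading a model.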
Requirements
R1: Overflow guard — After tokenization, if the token count exceeds the embedder's maxInputTokens limit, the configured overflow policy is applied before any call to MLPredictor.predict (or equivalent). No oversized token sequence is ever passed to the CoreML backend.
R2: Two policies — The embedder supports two policies, selectable at init:
.truncate (default) — silently truncate the token sequence to maxInputTokens and embed the prefix, returning successfully.
.reject — throw EmbedderError.inputTooLarge(actual: Int, max: Int) so the caller can decide whether to skip, summarize, or split the input.
R3: Typed error — EmbedderError.inputTooLarge(actual: Int, max: Int) is a new public error type (or new case on an existing public error enum) in SwitchcraftCore or SwitchcraftCoreML. It must be Sendable and Equatable.
R4: Exposed maximum — T5CoreMLEmbedder exposes a nonisolated let maxInputTokens: Int property. Consumers can read this at any time without entering the actor.
R5: No pool poisoning — A stress test of 1,000 sequential encode calls, interleaved with one or more deliberate inputs well above maxInputTokens, must complete without any IOSurface allocation failures or ANE pool degradation. Every normal-sized input must succeed.
R6: Default is truncate — The default overflow policy (when none is specified at init) is .truncate. This matches the most common consumer need (search, classification, retrieval) where the prefix is informative.
R7: Tests — New tests must cover: (a) truncation policy encodes without error and returns non-empty embeddings; (b) reject policy throws the expected typed error; (c) the stress test described in R5; (d) maxInputTokens property returns the expected value.
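R4 and R5 can be illustrated with a stand-in actor. This is a sketch under stated assumptions: `SketchEmbedder` and `stress` are hypothetical names, `encode` here only reports how many tokens would reach the backend rather than running CoreML, and the real T5CoreMLEmbedder signature may differ.

```swift
// Stand-in for the real embedder. An immutable Sendable property can be
// marked `nonisolated`, so it is readable without entering the actor.
actor SketchEmbedder {
    nonisolated let maxInputTokens: Int
    private var encodeCalls = 0

    init(maxInputTokens: Int) {
        self.maxInputTokens = maxInputTokens
    }

    // Actor-isolated: callers outside the actor must `await` this.
    // Returns the number of tokens that would reach the backend (.truncate behaviour).
    func encode(_ tokens: [Int32]) -> Int {
        encodeCalls += 1
        return min(tokens.count, maxInputTokens)
    }
}

// Shape of the R5 stress test: 1,000 sequential calls, every 100th oversized.
// Returns the largest token count that ever reached the (fake) backend.
func stress(_ embedder: SketchEmbedder) async -> Int {
    var worst = 0
    for i in 0..<1_000 {
        let count = (i % 100 == 0) ? 600_000 : 300
        let reached = await embedder.encode([Int32](repeating: 0, count: count))
        worst = max(worst, reached)
    }
    return worst
}

let embedder = SketchEmbedder(maxInputTokens: 8_192)
let limit = embedder.maxInputTokens   // synchronous read, no `await`
print(limit)                          // 8192
```

From an async context, a test would assert that `await stress(embedder)` never exceeds `embedder.maxInputTokens`; the real test additionally has to check that every normal-sized input produces embeddings.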
Scope
In scope:
Adding the overflow policy and maxInputTokens to T5CoreMLEmbedder
Adding the EmbedderError.inputTooLarge error type
Exposing maxInputTokens on T5CoreMLEmbedder (and optionally adding it to the Embedder protocol — see Open Questions)
Out of scope:
Chunking/splitting strategies (callers who want to split rather than truncate can do so using the typed error from the .reject policy)
Text-length pre-checks before tokenization (a fast but imprecise pre-filter may be added as an optimization, but the authoritative check is on the token count after tokenization)
Changes to SlidingWindow window-level limits (the per-window cap of windowSize = 512 already exists; the missing guard is on total token count)
Open Questions
(None currently blocking. Questions below are design decisions for the Research/Plan stages.)
Q1: Should maxInputTokens be added to the Embedder protocol? Adding it to the protocol makes it queryable on any embedder but requires all conformers to implement it. An alternative is to keep it only on T5CoreMLEmbedder and let callers downcast when needed. Since the safety contract is embedder-internal, keeping it on the concrete type may be sufficient for now.
Q2: What is the concrete safe value for maxInputTokens on the XTR-base-en model? The issue notes it should be "derived from the model's actual capacity." For fixed-window models this is windowSize (512). For models with dynamic input shapes, the limit may be derived from a practical IOSurface budget (e.g., 8,192 or 16,384 total tokens). Research should determine the right value and whether it should be hardcoded, computed at model load time, or passed by the caller.
Q3: What is the mechanism of the 1.8 GB failure: a single dynamic-shape prediction or batch pre-allocation? This matters for where exactly the guard must be inserted (total token count before windowing vs. per-window count) but does not change the observable contract. Research should clarify.
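For Q2, the relationship between a byte budget and a token limit can be sketched as follows. The 768 dimension and 4 bytes per element are taken from the estimate in the Problem section and may not match the XTR-base-en output shape; the 64 MiB budget and the helper name `maxTokens` are purely hypothetical.

```swift
/// Largest token count whose dense output stays within `budgetBytes`.
/// Assumption: one `dim`-wide vector per token, `bytesPerElement` bytes each.
func maxTokens(budgetBytes: Int, dim: Int, bytesPerElement: Int = 4) -> Int {
    budgetBytes / (dim * bytesPerElement)
}

// Output sizes implied by the two candidate limits at dim 768, Float32.
print(8_192 * 768 * 4)    // 25165824 bytes (24 MiB)
print(16_384 * 768 * 4)   // 50331648 bytes (48 MiB)

// A hypothetical 64 MiB budget would allow:
print(maxTokens(budgetBytes: 64 * 1_024 * 1_024, dim: 768))  // 21845
```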
Prior Art / Context
T5's published max_position_embeddings is 512. The Switchcraft implementation already uses a windowSize: Int = 512 parameter and SlidingWindow to handle inputs longer than 512 tokens by splitting into overlapping windows. The new guard applies to total token count across all windows, not the per-window limit.
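Hugging Face Transformers tokenizers (loaded via AutoTokenizer) truncate silently to the model's limit when called with truncation=True, max_length=<model.config.max_position_embeddings>. Truncation is opt-in there, but prefix truncation is a well-established convention and supports the truncation-default pattern.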
ANE pool poisoning via IOSurface allocation failure is a documented footgun in CoreML usage; the fix is always pre-validation, not recovery (recovery of a poisoned pool requires process restart and is unreliable).
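To see why the per-window cap alone does not bound the total work, the number of windows can be estimated for a given input. This is a rough sketch: the overlap used by SlidingWindow is not stated here, so the `stride` of 256 is an assumption, and `windowCount` is a hypothetical helper.

```swift
/// Number of windows needed to cover `tokenCount` tokens.
/// `stride` (spacing between window starts) is an assumed value.
func windowCount(tokenCount: Int, windowSize: Int = 512, stride: Int = 256) -> Int {
    precondition(windowSize > 0 && stride > 0 && stride <= windowSize)
    guard tokenCount > windowSize else { return tokenCount > 0 ? 1 : 0 }
    // The first window covers `windowSize` tokens; each further window adds `stride` new ones.
    let remaining = tokenCount - windowSize
    return 1 + (remaining + stride - 1) / stride
}

print(windowCount(tokenCount: 593_285))  // 2317
```

Under that assumed stride, the page from the Problem section would produce thousands of 512-token windows, which is why the guard has to look at the total token count.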
Risks / Dependencies
Protocol change risk: If maxInputTokens is added to Embedder, all existing conformers (including test mocks) will require updates. The Research stage should enumerate all conformers.
Truncation and recall: Truncating to the first N tokens is semantically lossy for documents where relevant content appears late. This is a known trade-off and acceptable for the library's primary use case (retrieval). The .reject policy exists for callers that cannot tolerate silent data loss.