Embedder must enforce input-size guard internally — oversized inputs poison ANE pool for all subsequent inferences #89

@totalslacker

Problem

T5CoreMLEmbedder (and any other Embedder implementation the library provides) has no internal upper bound on the total number of tokens it will pass toward a CoreML prediction. Callers can hand it inputs the underlying CoreML model literally cannot allocate output for, and the failure mode is catastrophic and silent for the pool, not just the call:

  • One oversized input throws an MLE5OutputPortBinder bindAndReturnError IOSurface allocation failure during output binding.
  • The ANE pool is left in a degraded state.
  • Every subsequent inference — including small inputs — then fails with the same allocation error, until the host process restarts.

This was observed during a SafariUnfucker bulk-index run: the first failure had inputLength=593,285 tokens (~600k). At typical T5 dims (768) × 4 bytes per float, the output tensor for a sequence that long is roughly 1.8 GB of contiguous IOSurface — well outside what the runtime can allocate. After that single page, every subsequent input failed (sizes from <1k to 100k+ tokens, 6,157 failures total in the run, all the same error).
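
The size estimate above can be checked with a few lines of arithmetic. This is a minimal sketch: `outputBytes` is a hypothetical helper, and it assumes one Float32 vector of dimension 768 per token, as stated in the paragraph.

```swift
// Back-of-the-envelope size of a per-token embedding output tensor.
// Assumes Float32 elements (4 bytes) and hidden dimension 768, per the issue text.
func outputBytes(tokens: Int, dim: Int = 768, bytesPerElement: Int = 4) -> Int {
    tokens * dim * bytesPerElement
}

print(outputBytes(tokens: 593_285))  // 1822571520 bytes, i.e. roughly 1.8 GB
```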

The ~600k-token page that triggered the poisoning was a real page from the user's browsing history. Real-world content includes huge docs, paginated GitHub PRs, archived RSS dumps, etc. This is not a synthetic edge case.

Why this is the embedder's job, not the caller's

  • The embedder is the only component that knows the model's capacity, the dim, the dtype, and the realistic IOSurface ceiling on the platform it's running on. Asking every caller (CLI consumers, SafariUnfucker, MCP servers, batch jobs) to know this and pre-truncate is bug-prone — each caller has to learn this lesson the hard way.
  • The contract must be: encode(_:) either returns embeddings or throws a typed, recoverable error. It must never put the embedder in a state where the next call fails too.
  • Even if a caller "should know better," it's a footgun: Switchcraft is the library that exists specifically because consumers don't want to think about CoreML internals.

Summary

Add a configurable overflow policy to T5CoreMLEmbedder, and expose the limit as a maxInputTokens property (on the Embedder protocol or on the concrete type), so that no input exceeding a safe maximum ever reaches MLPredictor.predict. On overflow the embedder either silently truncates the token sequence to the safe maximum (default) or throws a typed EmbedderError.inputTooLarge(actual:max:) error, depending on the configured policy. The maximum is documented and queryable by consumers.
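
One possible shape for the policy and the error, as a sketch only. `OverflowPolicy` and `applyOverflowPolicy` are illustrative names that do not exist in the library today, and the `[Int32]` token type is an assumption; only `EmbedderError.inputTooLarge(actual:max:)` comes from this issue.

```swift
// Hypothetical policy type; the case names mirror the two behaviors described above.
enum OverflowPolicy: Sendable, Equatable {
    case truncate   // embed only the first `max` tokens
    case reject     // throw so the caller can skip, summarize, or split
}

enum EmbedderError: Error, Sendable, Equatable {
    case inputTooLarge(actual: Int, max: Int)
}

/// Returns a token sequence that is safe to hand to the backend, or throws.
/// Intended to run after tokenization and before any prediction call.
func applyOverflowPolicy(
    _ tokens: [Int32],
    max: Int,
    policy: OverflowPolicy = .truncate
) throws -> [Int32] {
    guard tokens.count > max else { return tokens }
    switch policy {
    case .truncate:
        return Array(tokens.prefix(max))
    case .reject:
        throw EmbedderError.inputTooLarge(actual: tokens.count, max: max)
    }
}

let oversized = [Int32](repeating: 1, count: 20)
let kept = try applyOverflowPolicy(oversized, max: 8)
print(kept.count)  // 8

do {
    _ = try applyOverflowPolicy(oversized, max: 8, policy: .reject)
} catch {
    print(error)   // the typed inputTooLarge error, carrying actual 20 and max 8
}
```

Keeping the guard as a pure function of the token array makes it easy to unit-test without loading a model.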

Requirements

  • R1: Overflow guard — After tokenization, if the token count exceeds the embedder's maxInputTokens limit, the configured overflow policy is applied before any call to MLPredictor.predict (or equivalent). No oversized token sequence is ever passed to the CoreML backend.
  • R2: Two policies — The embedder supports two policies, selectable at init:
    • .truncate (default) — silently truncate the token sequence to maxInputTokens and embed the prefix, returning successfully.
    • .reject — throw EmbedderError.inputTooLarge(actual: Int, max: Int) so the caller can decide whether to skip, summarize, or split the input.
  • R3: Typed error — EmbedderError.inputTooLarge(actual: Int, max: Int) is a new public error type (or new case on an existing public error enum) in SwitchcraftCore or SwitchcraftCoreML. It must be Sendable and Equatable.
  • R4: Exposed maximum — T5CoreMLEmbedder exposes a nonisolated let maxInputTokens: Int property. Consumers can read this at any time without entering the actor.
  • R5: No pool poisoning — A stress test of 1,000 sequential encode calls, interleaved with one or more deliberate inputs well above maxInputTokens, must complete without any IOSurface allocation failures or ANE pool degradation. Every normal-sized input must succeed.
  • R6: Default is truncate — The default overflow policy (when none is specified at init) is .truncate. This matches the most common consumer need (search, classification, retrieval) where the prefix is informative.
  • R7: Tests — New tests must cover: (a) truncation policy encodes without error and returns non-empty embeddings; (b) reject policy throws the expected typed error; (c) the stress test described in R5; (d) maxInputTokens property returns the expected value.
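
To make R4 and R5 concrete, here is a sketch of how the stress test could be structured against a stand-in backend. Everything here is hypothetical: `FakeBackend` only imitates the poisoning behavior reported above (it fails forever once it sees an oversized input), and `GuardedEmbedder` stands in for T5CoreMLEmbedder with pre-tokenized input. A real test would run against the actual CoreML model.

```swift
enum EmbedderError: Error, Sendable, Equatable {
    case inputTooLarge(actual: Int, max: Int)
}

enum OverflowPolicy: Sendable { case truncate, reject }

struct BackendFailure: Error {}

// Stand-in for the CoreML backend: one oversized input poisons it permanently.
final class FakeBackend {
    let capacity: Int
    private(set) var poisoned = false
    init(capacity: Int) { self.capacity = capacity }

    func predict(_ tokens: [Int32]) throws -> [Float] {
        if tokens.count > capacity { poisoned = true }
        if poisoned { throw BackendFailure() }
        return [Float](repeating: 0.5, count: 4)  // dummy embedding
    }
}

actor GuardedEmbedder {
    nonisolated let maxInputTokens: Int  // R4: readable without awaiting
    private let policy: OverflowPolicy
    private let backend: FakeBackend

    init(maxInputTokens: Int, policy: OverflowPolicy = .truncate) {
        self.maxInputTokens = maxInputTokens
        self.policy = policy
        self.backend = FakeBackend(capacity: maxInputTokens)
    }

    func encode(_ tokens: [Int32]) throws -> [Float] {
        var safe = tokens
        if tokens.count > maxInputTokens {
            switch policy {
            case .truncate:
                safe = Array(tokens.prefix(maxInputTokens))
            case .reject:
                throw EmbedderError.inputTooLarge(actual: tokens.count, max: maxInputTokens)
            }
        }
        // The backend never receives more than maxInputTokens tokens.
        return try backend.predict(safe)
    }
}

/// R5-style loop: mostly small inputs, with a huge one every 100th call.
/// Returns the number of failed calls; the expectation is zero under .truncate.
func stressTest(iterations: Int = 1_000) async -> Int {
    let embedder = GuardedEmbedder(maxInputTokens: 512)
    var failures = 0
    for i in 0..<iterations {
        let count = i % 100 == 0 ? 600_000 : 128
        let tokens = [Int32](repeating: 1, count: count)
        do { _ = try await embedder.encode(tokens) } catch { failures += 1 }
    }
    return failures
}
```

Under the .reject policy the same loop would count the oversized calls as typed failures, and the assertion would instead be that every normal-sized call still succeeds afterwards.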

Scope

In scope:

  • Adding the overflow policy and maxInputTokens to T5CoreMLEmbedder
  • Adding the EmbedderError.inputTooLarge error type
  • Exposing maxInputTokens on T5CoreMLEmbedder (and optionally adding it to the Embedder protocol — see Open Questions)
  • Tests for all new behavior per R7

Out of scope:

  • Post-hoc ANE pool recovery (that is the concern of issue #87, "T5CoreMLEmbedder leaks ANE IOSurface buffers — bulk-index fails after ~1k inferences"; this issue is structural prevention)
  • Chunking/splitting strategies — callers who want to split rather than truncate can do so using the typed error from the .reject policy
  • Text-length pre-checks before tokenization (a fast but imprecise pre-filter may be added as an optimization, but the authoritative check is on the token count after tokenization)
  • Changes to SlidingWindow window-level limits (the per-window cap of windowSize = 512 already exists; the missing guard is on total token count)
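
To illustrate why the per-window cap alone does not bound the total work, the sketch below counts how many windows a sliding-window splitter would produce for a given total. The `overlap` value of 64 is an assumption for illustration; the actual SlidingWindow settings may differ, and `windowCount` is a hypothetical helper.

```swift
// Number of windows needed to cover `totalTokens` with overlapping windows.
// `overlap` is an assumed parameter, not the library's documented default.
func windowCount(totalTokens: Int, windowSize: Int = 512, overlap: Int = 64) -> Int {
    precondition(overlap >= 0 && overlap < windowSize)
    guard totalTokens > windowSize else { return totalTokens > 0 ? 1 : 0 }
    let step = windowSize - overlap
    // One full window, then one more per step needed to cover the remainder.
    return 1 + (totalTokens - windowSize + step - 1) / step
}

print(windowCount(totalTokens: 593_285))  // 1325
```

Each window respects the 512-token cap, yet the ~600k-token page still turns into well over a thousand windows under these assumptions, which is why the guard belongs on the total count.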

Open Questions

(None currently blocking. Questions below are design decisions for the Research/Plan stages.)

  • Q1: Should maxInputTokens be added to the Embedder protocol? Adding it to the protocol makes it queryable on any embedder but requires all conformers to implement it. An alternative is to keep it only on T5CoreMLEmbedder and let callers downcast when needed. Since the safety contract is embedder-internal, keeping it on the concrete type may be sufficient for now.
  • Q2: What is the concrete safe value for maxInputTokens on the XTR-base-en model? The issue notes it should be "derived from the model's actual capacity." For fixed-window models this is windowSize (512). For models with dynamic input shapes, the limit may be derived from a practical IOSurface budget (e.g., 8,192 or 16,384 total tokens). Research should determine the right value and whether it should be hardcoded, computed at model load time, or passed by the caller.
  • Q3: What is the mechanism of the 1.8 GB failure: a single dynamic-shape prediction or batch pre-allocation? This matters for where exactly the guard must be inserted (on the total token count before windowing vs. on the per-window count) but does not change the observable contract. Research should clarify.
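
For Q2, one way to relate a token limit to a memory budget is to invert the output-size arithmetic. This is a sketch under the same assumptions as the Problem section (Float32 output, dimension 768); `maxTokens(forBudgetBytes:)` is a hypothetical helper and the budgets shown are illustrative, not measured platform ceilings.

```swift
// Largest token count whose per-token Float32 output fits within a byte budget.
func maxTokens(forBudgetBytes budget: Int, dim: Int = 768, bytesPerElement: Int = 4) -> Int {
    budget / (dim * bytesPerElement)
}

let mib = 1_048_576
print(maxTokens(forBudgetBytes: 24 * mib))  // 8192
print(maxTokens(forBudgetBytes: 48 * mib))  // 16384
```

Under these assumptions, the 8,192 and 16,384 figures mentioned in Q2 correspond to output budgets of 24 MiB and 48 MiB respectively.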

Prior Art / Context

  • T5's published max_position_embeddings is 512. The Switchcraft implementation already uses a windowSize: Int = 512 parameter and SlidingWindow to handle inputs longer than 512 tokens by splitting into overlapping windows. The new guard applies to total token count across all windows, not the per-window limit.
  • Hugging Face Transformers tokenizers offer the same behavior when called with truncation=True: sequences are silently cut to max_length, which defaults to the tokenizer's model_max_length when not given. This is a well-established precedent for the truncation-default pattern.
  • ANE pool poisoning via IOSurface allocation failure is a documented footgun in CoreML usage; the fix is always pre-validation, not recovery (recovery of a poisoned pool requires process restart and is unreliable).
  • Issue #87 ("T5CoreMLEmbedder leaks ANE IOSurface buffers — bulk-index fails after ~1k inferences") adds post-hoc recovery as a safety net but is explicitly not a substitute for this structural prevention.

Risks / Dependencies
