Embedder must enforce input-size guard internally: oversized inputs poison ANE pool for all subsequent inferences #89

## Problem

`T5CoreMLEmbedder` (and any other `Embedder` implementation the library provides) has no internal upper bound on the total number of tokens it will pass toward a CoreML prediction. Callers can hand it inputs the underlying CoreML model literally cannot allocate output for, and the failure mode is **catastrophic and silent for the pool, not just the call**:

- One oversized input throws an `MLE5OutputPortBinder bindAndReturnError` IOSurface allocation failure during output binding.
- The ANE pool is left in a degraded state.
- **Every subsequent inference — including small inputs — then fails with the same allocation error**, until the host process restarts.

This was observed during a SafariUnfucker bulk-index run: the first failure had `inputLength=593,285` tokens (~600k). At typical T5 dims (768) × 4 bytes per float, the output tensor for a sequence that long is roughly 1.8 GB of contiguous IOSurface — well outside what the runtime can allocate. After that single page, every subsequent input failed (sizes from <1k to 100k+ tokens, 6,157 failures total in the run, all the same error).
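The size estimate above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the output is a single contiguous `[tokens × dim]` float32 buffer; the function names and the 64 MiB budget are hypothetical and only illustrate how a token ceiling could be derived from a memory budget.

```swift
/// Bytes needed for a [tokenCount x embeddingDim] output tensor.
/// Assumes one contiguous buffer; dim 768 and 4 bytes per float are the figures quoted above.
func outputTensorBytes(tokenCount: Int, embeddingDim: Int = 768, bytesPerElement: Int = 4) -> Int {
    tokenCount * embeddingDim * bytesPerElement
}

/// Largest token count whose output tensor fits in `budgetBytes` (hypothetical helper).
func maxTokens(fittingIn budgetBytes: Int, embeddingDim: Int = 768, bytesPerElement: Int = 4) -> Int {
    precondition(embeddingDim > 0 && bytesPerElement > 0, "dimensions must be positive")
    return budgetBytes / (embeddingDim * bytesPerElement)
}

print(outputTensorBytes(tokenCount: 593_285))     // 1822571520 bytes, about 1.82 GB
print(outputTensorBytes(tokenCount: 8_192))       // 25165824 bytes (24 MiB)
print(maxTokens(fittingIn: 64 * 1_024 * 1_024))   // 21845 tokens for an illustrative 64 MiB budget
```

The real IOSurface ceiling on a given device is not established here; the budget would have to come from measurement.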

The ~600k-token page that triggered the poisoning was a real page from the user's browsing history. Real-world content includes huge docs, paginated GitHub PRs, archived RSS dumps, etc. This is not a synthetic edge case.

## Why this is the embedder's job, not the caller's

- The embedder is the only component that knows the model's capacity, the dim, the dtype, and the realistic IOSurface ceiling on the platform it's running on. Asking every caller (CLI consumers, SafariUnfucker, MCP servers, batch jobs) to know this and pre-truncate is error-prone: each caller has to learn this lesson the hard way.
- The contract must be: `encode(_:)` either returns embeddings or throws a typed, recoverable error. It must never put the embedder in a state where the *next* call fails too.
- Even if a caller "should know better," it's a footgun: Switchcraft is the library that exists specifically because consumers don't want to think about CoreML internals.

## Summary

Add a configurable overflow policy to `T5CoreMLEmbedder`, and expose the limit as a `maxInputTokens` property (on the `Embedder` protocol or on the concrete type), so that no input exceeding a safe maximum ever reaches `MLPredictor.predict`. On overflow the embedder either silently truncates the token sequence to the safe maximum (default) or throws a typed `EmbedderError.inputTooLarge(actual:max:)` error, depending on the configured policy. The maximum is documented and queryable by consumers.

## Requirements

- **R1: Overflow guard** — After tokenization, if the token count exceeds the embedder's `maxInputTokens` limit, the configured overflow policy is applied before any call to `MLPredictor.predict` (or equivalent). No oversized token sequence is ever passed to the CoreML backend.
- **R2: Two policies** — The embedder supports two policies, selectable at `init`:
  - `.truncate` (default) — silently truncate the token sequence to `maxInputTokens` and embed the prefix, returning successfully.
  - `.reject` — throw `EmbedderError.inputTooLarge(actual: Int, max: Int)` so the caller can decide whether to skip, summarize, or split the input.
- **R3: Typed error** — `EmbedderError.inputTooLarge(actual: Int, max: Int)` is a new public error type (or new case on an existing public error enum) in `SwitchcraftCore` or `SwitchcraftCoreML`. It must be `Sendable` and `Equatable`.
- **R4: Exposed maximum** — `T5CoreMLEmbedder` exposes a `nonisolated let maxInputTokens: Int` property. Consumers can read this at any time without entering the actor.
- **R5: No pool poisoning** — A stress test of 1,000 sequential `encode` calls, interleaved with one or more deliberate inputs well above `maxInputTokens`, must complete without any IOSurface allocation failures or ANE pool degradation. Every normal-sized input must succeed.
- **R6: Default is truncate** — The default overflow policy (when none is specified at `init`) is `.truncate`. This matches the most common consumer need (search, classification, retrieval) where the prefix is informative.
- **R7: Tests** — New tests must cover: (a) truncation policy encodes without error and returns non-empty embeddings; (b) reject policy throws the expected typed error; (c) the stress test described in R5; (d) `maxInputTokens` property returns the expected value.
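R1 through R3 could be sketched roughly as follows. Only `.truncate`, `.reject`, `maxInputTokens`, and `EmbedderError.inputTooLarge(actual:max:)` come from the requirements; `OverflowPolicy`, `applyOverflowPolicy`, and the `[Int32]` token representation are hypothetical names chosen for illustration, not the library's actual API.

```swift
/// Hypothetical policy type; the case names follow R2.
enum OverflowPolicy: Sendable, Equatable {
    case truncate
    case reject
}

/// Typed, recoverable error from R3. Shown as a standalone enum here; it could
/// equally be a new case on an existing public error enum.
enum EmbedderError: Error, Sendable, Equatable {
    case inputTooLarge(actual: Int, max: Int)
}

/// Applies the overflow policy to an already-tokenized input.
/// Intended to run before any token sequence reaches the CoreML backend (R1).
func applyOverflowPolicy(
    to tokens: [Int32],
    maxInputTokens: Int,
    policy: OverflowPolicy = .truncate   // R6: truncate is the default
) throws -> [Int32] {
    precondition(maxInputTokens > 0, "maxInputTokens must be positive")
    guard tokens.count > maxInputTokens else { return tokens }
    switch policy {
    case .truncate:
        // Keep the prefix and return successfully.
        return Array(tokens.prefix(maxInputTokens))
    case .reject:
        throw EmbedderError.inputTooLarge(actual: tokens.count, max: maxInputTokens)
    }
}

// Usage: a caller on the `.reject` policy decides what to do with oversized input.
do {
    let tokens = [Int32](repeating: 7, count: 20)
    _ = try applyOverflowPolicy(to: tokens, maxInputTokens: 8, policy: .reject)
} catch let EmbedderError.inputTooLarge(actual, max) {
    print("skipping input: \(actual) tokens exceeds limit of \(max)")
} catch {
    print("unexpected error: \(error)")
}
```

Keeping the guard as a pure function over the token array makes cases (a) and (b) of R7 testable without loading a model.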

## Scope

**In scope:**
- Adding the overflow policy and `maxInputTokens` to `T5CoreMLEmbedder`
- Adding the `EmbedderError.inputTooLarge` error type
- Exposing `maxInputTokens` on `T5CoreMLEmbedder` (and optionally adding it to the `Embedder` protocol — see Open Questions)
- Tests for all new behavior per R7

**Out of scope:**
- Post-hoc ANE pool recovery (that is issue #87's concern; this issue is structural prevention)
- Chunking/splitting strategies — callers who want to split rather than truncate can do so using the typed error from the `.reject` policy
- Text-length pre-checks before tokenization (a fast but imprecise pre-filter may be added as an optimization, but the authoritative check is on the token count after tokenization)
- Changes to `SlidingWindow` window-level limits (the per-window cap of `windowSize = 512` already exists; the missing guard is on total token count)

## Open Questions

_(None currently blocking. Questions below are design decisions for the Research/Plan stages.)_

- [ ] **Q1: Should `maxInputTokens` be added to the `Embedder` protocol?** Adding it to the protocol makes it queryable on any embedder but requires all conformers to implement it. An alternative is to keep it only on `T5CoreMLEmbedder` and let callers downcast when needed. Since the safety contract is embedder-internal, keeping it on the concrete type may be sufficient for now.
- [ ] **Q2: What is the concrete safe value for `maxInputTokens` on the XTR-base-en model?** The issue notes it should be "derived from the model's actual capacity." For fixed-window models this is `windowSize` (512). For models with dynamic input shapes, the limit may be derived from a practical IOSurface budget (e.g., 8,192 or 16,384 total tokens). Research should determine the right value and whether it should be hardcoded, computed at model load time, or passed by the caller.
- [ ] **Q3: What is the mechanism of the 1.8 GB failure: a single dynamic-shape prediction or batch pre-allocation?** This matters for where exactly the guard must be inserted (on the total token count before windowing vs. on the per-window count) but does not change the observable contract. Research should clarify.

## Prior Art / Context

- T5 uses relative position biases rather than a fixed position-embedding table; its original checkpoints were pre-trained on 512-token sequences, and the Hugging Face T5 tokenizers report `model_max_length = 512`. The Switchcraft implementation already uses a `windowSize: Int = 512` parameter and `SlidingWindow` to handle inputs longer than 512 tokens by splitting into overlapping windows. The new guard applies to total token count across all windows, not the per-window limit.
- Hugging Face Transformers tokenizers clip the input when called with `truncation=True` (optionally with `max_length`, which otherwise falls back to the tokenizer's `model_max_length`). This is a well-established convention for the truncate-on-overflow pattern, although there it is opt-in rather than the default.
- ANE pool poisoning via IOSurface allocation failure is a documented footgun in CoreML usage; the fix is always pre-validation, not recovery (recovery of a poisoned pool requires process restart and is unreliable).
- Issue #87 adds post-hoc recovery as a safety net but is explicitly not a substitute for this structural prevention.
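The pre-validation argument, together with the R5 stress test, can be illustrated against a fake backend. This is a sketch under stated assumptions: `FakeBackend`, `GuardedEmbedder`, and `runStressTest` are hypothetical, the poisoning behaviour is simulated rather than driven by CoreML, and the 8,192 limit is a placeholder taken from the examples in Q2. The real embedder is an actor per R4; a plain struct is used here for brevity.

```swift
enum BackendError: Error, Equatable { case allocationFailed }

/// Hypothetical stand-in for the CoreML backend: once it sees an oversized
/// input it stays "poisoned" and fails every later call, mimicking the
/// failure mode described in the Problem section.
final class FakeBackend {
    let hardLimit: Int
    private(set) var poisoned = false
    init(hardLimit: Int) { self.hardLimit = hardLimit }

    func predict(_ tokens: [Int32]) throws -> [Float] {
        if tokens.count > hardLimit { poisoned = true }
        if poisoned { throw BackendError.allocationFailed }
        return [Float(tokens.count)]   // placeholder "embedding"
    }
}

/// Guarded wrapper: truncates before the backend ever sees the input.
struct GuardedEmbedder {
    let maxInputTokens: Int
    let backend: FakeBackend

    func encode(_ tokens: [Int32]) throws -> [Float] {
        let safe = tokens.count > maxInputTokens
            ? Array(tokens.prefix(maxInputTokens))
            : tokens
        return try backend.predict(safe)
    }
}

/// R5-style loop: 1,000 sequential calls, every 100th one far above the limit.
func runStressTest() -> (successes: Int, failures: Int) {
    let backend = FakeBackend(hardLimit: 8_192)
    let embedder = GuardedEmbedder(maxInputTokens: 8_192, backend: backend)
    var successes = 0
    var failures = 0
    for i in 0..<1_000 {
        let count = (i % 100 == 50) ? 600_000 : 256
        let tokens = [Int32](repeating: 1, count: count)
        do {
            _ = try embedder.encode(tokens)
            successes += 1
        } catch {
            failures += 1
        }
    }
    return (successes, failures)
}

let result = runStressTest()
print("successes: \(result.successes), failures: \(result.failures)")
```

A fake like this only verifies that the guard sits in front of the backend; confirming that the real ANE pool stays healthy still requires the on-device stress test from R5.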

## Risks / Dependencies

- **Relationship to #87**: This issue is independent of and takes priority over #87. Even if #87 is never merged, this fix prevents the failure mode entirely. If both land, #87's recovery becomes a true last-resort net rather than the primary defense.
- **Protocol change risk**: If `maxInputTokens` is added to `Embedder`, all existing conformers (including test mocks) will require updates. The Research stage should enumerate all conformers.
- **Truncation and recall**: Truncating to the first N tokens is semantically lossy for documents where relevant content appears late. This is a known trade-off and acceptable for the library's primary use case (retrieval). The `.reject` policy exists for callers that cannot tolerate silent data loss.
