Proposal: New orchestration primitive, sub-requests from within the filter pipeline #28

Description

@usize

Introduce a first-class mechanism for filters (and embedded processors) to make async outbound HTTP calls during request processing — with proper timeout, cancellation, memory bounding, and observability.

This is the foundational primitive that enables praxis to act as an orchestrating proxy: a proxy that doesn't just route requests but executes multi-step request workflows. It is a prerequisite for #16 (llm-d Compatibility), #17 (External Processing), #18 (Wasm Runtime), and #24 (AI Agentic).

Motivation

The ext_proc orchestration gap

The Gateway API Inference Extension (GIE) defines the Endpoint Picker Protocol with a hard requirement: "The EPP MUST implement the Envoy external processing service protocol." This protocol is fundamentally a single decision point — the EPP receives request headers/body, returns a destination endpoint via x-gateway-destination-endpoint, and is done. It cannot express multi-step workflows, conditional branching, or mid-flight preemption.

This is sufficient for simple inference routing (one model, one pod). It breaks down for workloads that require orchestration — coordinating multiple backend interactions within a single client request.

P/D disaggregation in llm-d

llm-d splits LLM inference into separate prefill and decode phases running on independent GPU pools. The inference scheduler supports four disaggregation topologies: EPD, P/D, E/PD, and E/P/D — with the most advanced requiring orchestration across three distinct worker types (encode, prefill, decode).

Because ext_proc can only return a single endpoint, llm-d works around this with a sidecar on the decode pod:

  1. The EPP selects both a D pod and a P pod in a single ext_proc pass
  2. The P pod address is injected as an x-prefiller-host-port header (and x-encoder-hosts-ports for multimodal)
  3. Envoy routes the request to the D pod
  4. The sidecar on the D pod intercepts the request, forwards a prefill sub-request to the P pod
  5. The P pod runs prefill and returns KV cache parameters
  6. KV cache blocks are transferred via NIXL over RDMA
  7. The D pod runs decode using the transferred KV cache

This sidecar is explicitly experimental. The llm-d-routing-sidecar repository states: "This repository is deprecated and shall soon be archived. All future development will [be] done under the llm-d-inference-scheduler repository." The code has been consolidated into the inference scheduler as a transitional measure.

Problems with the sidecar approach:

  • No preemption: Once the EPP picks both P and D in a single ext_proc call, the decision is final. If the P pod becomes overloaded between selection and actual prefill, there is no mechanism to re-route.
  • No fine-grained policy: The sidecar forwards to whatever P pod the EPP selected. There is no policy layer between the gateway decision and the actual prefill execution — no flow control, no priority queuing, no fairness enforcement.
  • Orchestration at the edge: The sidecar runs on the D pod, where it has the least visibility into cluster-wide state. Scheduling intelligence is split between the EPP (which has metrics but can only make one decision) and the sidecar (which can orchestrate but has no metrics).
  • Topology scaling: E/P/D requires the sidecar to coordinate three sequential interactions (encode → prefill → decode). Each added stage compounds the complexity of a component that was designed as a simple reverse proxy.

The fundamental issue is architectural: ext_proc is a decision point, not an orchestrator. The proxy layer needs the ability to execute multi-step workflows, not just select a single destination.

Orchestration in AI policy

P/D disaggregation is not the only pattern that requires orchestration from within request processing. The AI Gateway Working Group's Payload Processing proposal identifies several user stories that imply sub-request capability:

  • Guardrails / safety scanning: A processor calls an external detection engine (prompt injection scanner, PII detector, toxicity classifier) and blocks, sanitizes, or reports based on the result. The processor must call the scanner, await its verdict, and then decide whether to continue or reject — this is a sub-request.

  • Semantic caching: A processor checks a cache service for semantically similar prior requests. On a hit, it returns the cached response directly (short-circuiting the backend). On a miss, the request proceeds to the inference backend and the response is cached on the way back. Both the cache lookup and the cache write are sub-requests.

  • RAG augmentation: A processor calls a retrieval service to fetch relevant context, mutates the request body to inject the retrieved context, then forwards to the inference backend. The retrieval call is a sub-request.

  • MCP routing: A processor needs to look up session state from an external store to determine which MCP server should handle a tool call. The session lookup is a sub-request.

  • Provider failover with API translation: On failure from provider A, a processor translates the request format and retries against provider B. The retry against B is a sub-request with a transformed payload.

In each case, the processor can't simply inspect and annotate — it needs to call out to another service and act on the result. Without a sub-request primitive, every one of these patterns requires either a bespoke sidecar, an external orchestration layer, or multiple ext_proc round-trips chained together.
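
As an illustration of the pattern these stories share, here is a minimal Python sketch of the semantic caching flow. Everything in it is an assumption made for illustration: the `Response` type, the `handle_with_semantic_cache` function, and the three injected callables stand in for sub-requests and the upstream hop, and none of them are praxis APIs. A real semantic cache would match on similarity rather than on the exact prompt.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class Response:
    status: int
    body: bytes
    from_cache: bool = False


# Hypothetical signatures for the two sub-requests and the backend call.
CacheLookup = Callable[[bytes], Awaitable[Optional[bytes]]]
CacheWrite = Callable[[bytes, bytes], Awaitable[None]]
Backend = Callable[[bytes], Awaitable[Response]]


async def handle_with_semantic_cache(
    prompt: bytes, lookup: CacheLookup, write: CacheWrite, backend: Backend
) -> Response:
    # Sub-request 1: ask the cache service before touching the backend.
    cached = await lookup(prompt)
    if cached is not None:
        # Hit: short-circuit, the inference backend is never called.
        return Response(200, cached, from_cache=True)

    # Miss: the request proceeds to the inference backend.
    resp = await backend(prompt)
    if resp.status == 200:
        # Sub-request 2: store the response on the way back.
        await write(prompt, resp.body)
    return resp
```

The guardrail, RAG, and session-lookup stories have the same shape: await an external verdict or payload, then decide how (or whether) to continue the request.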

Proposal

Sub-request API

Provide an HTTP client interface accessible from within filter execution (and by extension, from WASM-embedded processors via host calls). Key properties:

  • Async: Sub-requests are awaited within the filter's async on_request / on_request_body execution. Processing of that request is suspended at the filter until the sub-request completes or times out; other in-flight requests are not blocked.
  • Bounded: Per-sub-request timeout, response size limit, and connection timeout. The sub-request timeout MUST be strictly less than the client-facing request timeout.
  • Observable: Sub-requests carry the parent request ID and are visible in access logs / traces.
  • Cancellable: If the client disconnects or the parent request times out, in-flight sub-requests are cancelled.
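
A minimal sketch of the bounded and cancellable properties, written in Python with asyncio purely for illustration. `SubRequestLimits`, `SubRequestError`, `sub_request`, and the `x-request-id` header name are hypothetical, not part of any existing praxis interface. The `call` argument is assumed to be an async generator that yields response body chunks; connection timeout is not modeled here.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Callable, Dict


class SubRequestError(Exception):
    """Raised when a sub-request exceeds its time or size budget."""


@dataclass(frozen=True)
class SubRequestLimits:
    timeout_s: float           # total budget for one sub-request
    max_response_bytes: int    # hard ceiling on the response body
    client_timeout_s: float    # client-facing request timeout

    def __post_init__(self) -> None:
        # The sub-request timeout must be strictly smaller than the
        # client-facing timeout, so reject invalid configuration early.
        if not self.timeout_s < self.client_timeout_s:
            raise ValueError("sub-request timeout must be < client timeout")


async def _read_bounded(chunks: AsyncIterator[bytes], max_bytes: int) -> bytes:
    buf = bytearray()
    async for chunk in chunks:
        # Abort before buffering anything past the ceiling.
        if len(buf) + len(chunk) > max_bytes:
            raise SubRequestError("sub-request response too large")
        buf += chunk
    return bytes(buf)


async def sub_request(
    call: Callable[[Dict[str, str]], AsyncIterator[bytes]],
    limits: SubRequestLimits,
    parent_request_id: str,
) -> bytes:
    # Propagate the parent request ID (header name is an assumption).
    headers = {"x-request-id": parent_request_id}
    try:
        return await asyncio.wait_for(
            _read_bounded(call(headers), limits.max_response_bytes),
            timeout=limits.timeout_s,
        )
    except asyncio.TimeoutError:
        raise SubRequestError("sub-request timed out") from None
```

In this sketch, cancellation needs no extra code: if the task running `sub_request` is cancelled (for example because the client disconnected), `asyncio.CancelledError` propagates into the in-flight call, which can then release its connection.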

Memory safety

Nested callouts (a processor making a sub-request, which is itself a callout from the proxy) create memory pressure. Each in-flight request holds: the buffered client body, processor state (WASM linear memory if applicable), the sub-request connection, and the sub-request response. Mitigations:

  • Bounded body retention: Once a processor has extracted routing-relevant fields, the body buffer should be releasable (StreamBuffer Release). Don't hold the full prompt while waiting for a sub-request.
  • Sub-request response size limits: Hard ceiling on bytes received from callouts.
  • Concurrency limits: Maximum number of concurrent sub-requests across all in-flight requests. Once the limit is reached, new requests receive a 429 response instead of being queued without bound.
  • WASM instance pooling: If processors run as WASM modules (Epic: Wasm Runtime #18), pool instances rather than allocating per-request.
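
The concurrency limit can be sketched as a counter that rejects rather than waits. This is an illustrative Python sketch under stated assumptions: `SubRequestBudget` and `run_with_budget` are hypothetical names, and the counter needs no lock only because a single asyncio event loop runs one task at a time. A multi-threaded proxy would need an atomic counter instead.

```python
import asyncio
from typing import Awaitable, Callable, Tuple


class SubRequestBudget:
    """Global cap on concurrent sub-requests; rejects instead of queuing."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self) -> bool:
        # Never waits: the caller learns immediately whether a slot is free.
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1


async def run_with_budget(
    budget: SubRequestBudget, call: Callable[[], Awaitable[bytes]]
) -> Tuple[int, bytes]:
    if not budget.try_acquire():
        # Cap reached: answer 429 right away rather than queue.
        return 429, b"sub-request capacity exhausted"
    try:
        return 200, await call()
    finally:
        # Free the slot on success, failure, or cancellation.
        budget.release()
```

An `asyncio.Semaphore` is deliberately not used here, because acquiring it waits for a free slot, which is the unbounded queuing the bullet above rules out.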

Processor safety contract (sub-task)

If processors can orchestrate (make sub-requests that cause side effects on external systems), the framework must define safety guarantees:

  • At-least-once semantics: A processor may be invoked multiple times for the same request (due to retries, connection failures, or proxy restarts). The framework does NOT guarantee exactly-once execution.
  • Idempotency requirement: Processors that make sub-requests with side effects MUST be idempotent. Given the same request ID, repeated invocations must not create duplicate side effects.
  • Request identity propagation: Every sub-request carries the parent request ID and an attempt counter (X-Processor-Attempt), enabling downstream services to deduplicate.
  • Checkpoint state machine: For multi-step orchestration (P/D, E/P/D), processors should be structured as a state machine with well-defined checkpoints (PREFILL_SELECTED → PREFILL_COMPLETE → TRANSFER_COMPLETE → DECODE_FORWARDED). On retry, re-enter at the last successful checkpoint rather than replaying the entire sequence.
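
The checkpoint idea can be sketched as follows. This is a Python illustration, not a specification: the `START` state, `run_workflow`, the `X-Request-Id` header name, and the in-memory dicts used as checkpoint store and attempt counter are assumptions. Only the four checkpoint names and `X-Processor-Attempt` come from the proposal above. Surviving proxy restarts would require a durable store.

```python
import asyncio
from enum import IntEnum
from typing import Awaitable, Callable, Dict


class Checkpoint(IntEnum):
    START = 0               # assumed initial state, not in the proposal
    PREFILL_SELECTED = 1
    PREFILL_COMPLETE = 2
    TRANSFER_COMPLETE = 3
    DECODE_FORWARDED = 4


Step = Callable[[Dict[str, str]], Awaitable[None]]


async def run_workflow(
    request_id: str,
    steps: Dict[Checkpoint, Step],
    store: Dict[str, Checkpoint],
    attempts: Dict[str, int],
) -> Checkpoint:
    # Count this invocation so downstream services can deduplicate.
    attempt = attempts.get(request_id, 0) + 1
    attempts[request_id] = attempt
    headers = {
        "X-Request-Id": request_id,            # header name is an assumption
        "X-Processor-Attempt": str(attempt),
    }

    # Resume from the last checkpoint recorded for this request ID.
    reached = store.get(request_id, Checkpoint.START)
    for target in Checkpoint:
        if target <= reached:
            continue                           # already done on an earlier attempt
        await steps[target](headers)           # may raise; checkpoint is then not advanced
        store[request_id] = target             # record progress only after success
        reached = target
    return reached
```

If the step for TRANSFER_COMPLETE fails on the first attempt, a second call with the same request ID skips the two prefill steps and re-enters at the transfer step with `X-Processor-Attempt: 2`. Each step still has to be idempotent, since a step can complete remotely and fail before its checkpoint is recorded.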

Design

  • Sub-request HTTP client API design (trait, configuration, lifetime)
  • WASM host-call interface for sub-requests (bridges to Epic: Wasm Runtime #18)
  • Memory bounding strategy (body release, response limits, concurrency caps)
  • Cancellation propagation (client disconnect → sub-request cancel)
  • Observability integration (access log fields, tracing spans for sub-requests)
  • Processor safety contract specification (idempotency, attempt tracking, checkpoints)
