diff --git a/.github/workflows/checks.yml b/.github/workflows/checks.yml
index 9da7cca..1dab768 100644
--- a/.github/workflows/checks.yml
+++ b/.github/workflows/checks.yml
@@ -1,4 +1,4 @@
-name: Checks
+name: Quality gates
on:
push:
diff --git a/README.md b/README.md
index e7f4427..e545f54 100644
--- a/README.md
+++ b/README.md
@@ -3,6 +3,7 @@
Serverless, event-driven AWS infrastructure for asynchronous key information extraction with LLMs.
+
@@ -117,7 +118,7 @@ The extraction queue sits between the ingestion bucket and the extractor Lambda.
| Lever | Value | What it controls |
|---|---|---|
| Visibility timeout | `6 × lambda_timeout_seconds` (computed) | Hides an in-flight message long enough to cover the worst-case extractor run plus handoff jitter, eliminating the most common SQS+Lambda misconfiguration |
-| `maxReceiveCount` | 3 | Bounds retries on transient failures before the message is shunted to the DLQ |
+| `maxReceiveCount` | 3 (single-pass), 2 (agentic) | Bounds retries on transient failures before the message is shunted to the DLQ. Follows `extractor_flavor`: agentic failures are mostly logic (a non-terminating loop), not transient, so retrying an expensive doomed run buys nothing ([ADR-0016](docs/adr/0016-agentic-flavor-deployment.md)) |
| Long polling | `receive_wait_time_seconds = 20` | Reduces empty receives and smooths Lambda triggering at no extra cost |
| TLS-only policy | Deny on `aws:SecureTransport = false` (main + DLQ) | Mirrors the bucket's transport posture across the pipeline |
| Source-scoped send | `aws:SourceArn` condition on `events.amazonaws.com` | Closes the confused-deputy class of misconfigurations on the EventBridge → SQS hop |
@@ -160,6 +161,8 @@ The extractor is a container-image Lambda that consumes the extraction queue, ru
| Architecture | `arm64` | ~20% cheaper per GB-second on Graviton; native build on `ubuntu-24.04-arm` so no QEMU emulation |
| `batch_size` | 1 | Per-invocation cost is dominated by the LLM call, so batching does not amortize anything and one-message batches keep the failure model simple |
| `maximum_concurrency` | 10 (staging/local), 25 (prod) | Caps parallel LLM fan-out under an ingestion burst, closing the deferral [ADR-0005](docs/adr/0005-sqs-dlq-retry-topology.md) made |
+| `extractor_flavor` | `single_pass` (default), `agentic` | Which [`agentic-kie`](https://github.com/gafnts/agentic-kie) strategy the handler builds—one structured call vs. a ReAct loop over the document. Selectable per environment at deploy time; it drives the whole parameter profile (`max_iterations`, `maxReceiveCount`) so flipping a flavor is a one-variable change ([ADR-0016](docs/adr/0016-agentic-flavor-deployment.md)) |
+| `max_iterations` (agentic only) | ~30 | Caps LangGraph supersteps (`recursion_limit`, ≈ 2× the LLM-call count), bounding a non-terminating agent run. `n/a` for single-pass, which has no loop |
| Idempotency | Conditional `PutItem` + status-guarded `UpdateItem` | At-least-once SQS delivery cannot clobber a terminal row; redelivered terminal messages are a no-op |
| Cold-start | No provisioned concurrency | Async polling model hides the 3–10s container-image cold start from the user |
| Networking | No VPC | Talks only to AWS APIs and external HTTPS endpoints; no NAT cost, no ENI cold-start penalty |
@@ -240,11 +243,11 @@ Eight CloudWatch alarms cover the operational hot path. Each is a 1-of-1 5-minut
| Alarm | Source | Fires when | Why it matters |
|---|---|---|---|
-| `${extractor}-errors` | `AWS/Lambda` `Errors` (Sum) on the extractor | `> 0` over 5 min | Any unhandled exception. With `maxReceiveCount = 3` on the queue, a single bad document fires this up to three times before it lands in the DLQ—the early-warning signal that the DLQ alarm is the confirmation of |
+| `${extractor}-errors` | `AWS/Lambda` `Errors` (Sum) on the extractor | `> 0` over 5 min | Any unhandled exception. A single bad document fires this once per delivery attempt (up to the queue's `maxReceiveCount`: 3 for single-pass, 2 for agentic) before it lands in the DLQ, making this the early-warning signal that the DLQ alarm later confirms |
| `${extractor}-throttles` | `AWS/Lambda` `Throttles` (Sum) on the extractor | `> 0` over 5 min | Invocations rejected because the function hit its `maximum_concurrency` cap. Throttles mean ingestion is exceeding the planned LLM fan-out budget; either the cap is wrong or there's a burst worth investigating |
| `${presigner}-errors` | `AWS/Lambda` `Errors` (Sum) on the presigner | `> 0` over 5 min | The presigner does one `generate_presigned_url` call—non-zero errors imply an IAM regression or a malformed request that slipped past API Gateway |
| `${presigner}-throttles` | `AWS/Lambda` `Throttles` (Sum) on the presigner | `> 0` over 5 min | The presigner has no reserved or maximum concurrency ([ADR-0010](docs/adr/0010-uploader-module.md)); throttles imply the account concurrency ceiling is being approached |
-| `${dlq}-messages-visible` | `AWS/SQS` `ApproximateNumberOfMessagesVisible` (Max) on the DLQ | `> 0` over 5 min | A message in the DLQ means a document exhausted its three retries. The DLQ is the single source of truth for failed messages ([ADR-0005](docs/adr/0005-sqs-dlq-retry-topology.md)); this alarm is the page on it |
+| `${dlq}-messages-visible` | `AWS/SQS` `ApproximateNumberOfMessagesVisible` (Max) on the DLQ | `> 0` over 5 min | A message in the DLQ means a document exhausted its `maxReceiveCount` delivery attempts (3 single-pass, 2 agentic). The DLQ is the single source of truth for failed messages ([ADR-0005](docs/adr/0005-sqs-dlq-retry-topology.md)); this alarm is the page on it |
| `${publisher}-errors` | `AWS/Lambda` `Errors` (Sum) on the publisher | `> 0` over 5 min | An unhandled exception in the Streams consumer. Result objects silently stop reaching S3 while the extractor keeps writing terminal rows to DynamoDB |
| `${publisher}-throttles` | `AWS/Lambda` `Throttles` (Sum) on the publisher | `> 0` over 5 min | The publisher has no reserved or maximum concurrency; throttles stall result publishing and leave `succeeded`/`failed` rows without matching S3 objects |
| `${publisher-dlq}-messages-visible` | `AWS/SQS` `ApproximateNumberOfMessagesVisible` (Max) on the publisher DLQ | `> 0` over 5 min | A stream batch exhausted `maximum_retry_attempts`. The single source of truth for failed batches, mirroring the extractor DLQ alarm ([ADR-0014](docs/adr/0014-split-results-module.md)) |
diff --git a/docs/adr/0015-load-testing-strategy.md b/docs/adr/0015-load-testing-strategy.md
index 2b81db8..7e65b5e 100644
--- a/docs/adr/0015-load-testing-strategy.md
+++ b/docs/adr/0015-load-testing-strategy.md
@@ -38,6 +38,12 @@ There is ample headroom; the provider will not throttle these runs. This is wort
> [!NOTE]
> The `maximum_concurrency` cap is implicitly coupled to the provider's RPM budget: a cap that lets the pipeline issue more RPM than the tier allows turns a burst into DLQ'd documents, not buffered ones. At Tier 1 (4,000 RPM) the staging cap (10 → ~60 RPM) and even the prod cap (25 → ~150 RPM) sit far under the ceiling, so the coupling is currently slack. It is not enforced anywhere in code or config—see Finding 1.
+> [!NOTE]
+> **TPM correction (post-run, 2026-06-07).** The deployed model is **Gemini 3 Flash**, whose Tier-1 input-TPM ceiling is **2M, not the 4M** the table above states. The *measured* burst+sustained peak was **0.317M** (~16% of the 2M ceiling), so the conclusion ("not the binding constraint") held with **~6× headroom**—*more* than the table's ~2×, because the pre-run worst-case draw (~1.8M) sat far above the actual 0.317M, more than offsetting the lower-than-assumed ceiling. RPM and RPD held as predicted. The pre-registered estimates are left intact above; this note records the observed values, per the prediction-then-grade methodology.
+
+> [!NOTE]
+> **RPM/RPD ceiling correction (post-run, 2026-06-07).** The table above and the `maximum_concurrency` coupling note that follows it both state the Gemini Tier-1 ceilings as **4,000 RPM** and **~150,000 RPD**; the deployed Gemini 3 Flash key's actual Tier-1 ceilings are **1,000 RPM** and **10,000 RPD**. The *draws* held as predicted (~60 RPM at staging concurrency, ~400/day across both runs), so the conclusion ("not the binding constraint") is unchanged—but against the corrected ceilings the true headroom is **~16× on RPM** (not the tabulated ~65×) and **~25× on RPD** (not ~375×), still ample. As with the TPM note above, the pre-registered estimates are left intact, per the prediction-then-grade methodology; this note records the corrected ceilings.
+
## Decision
### Scope: end-to-end, through the real front door
@@ -67,7 +73,7 @@ Three reasons this beats a single held-constant document:
**Preserving the controlled-experiment property.** Varying the corpus *and* the arrival pattern at once would change two variables—the confound that made a single document tempting. The fix is to **freeze the sample with a fixed seed and use the identical 200 documents, in the same upload order, for both burst and sustained.** The corpus is then held constant *across* scenarios while varying *within* one: arrival pattern stays the only thing that differs between the two runs, and—better—each document can be paired across runs (same doc, burst vs. sustained) to isolate its queue-wait term cleanly.
-**Sourcing.** The corpus is *not* committed—PDFs would bloat the repo and trip the `check for added large files` hook. A prep step fetches the train partition via the pinned `kleister-nda-preparation` package into a git-ignored directory under `tests/`, so runs are reproducible (pinned package + fixed seed) without versioning the documents. The realized token/size distribution is sanity-checked against the extractor's 120s timeout and the 4M Tier-1 TPM ceiling before a run—a corpus of unusually long NDAs is the one input that could approach either.
+**Sourcing.** The corpus is *not* committed—PDFs would bloat the repo and trip the `check for added large files` hook. A prep step fetches the train partition via the pinned `kleister-nda-preparation` package into a git-ignored directory under `tests/`, so runs are reproducible (pinned package + fixed seed) without versioning the documents. The realized token/size distribution is sanity-checked against the extractor's 120s timeout and the Tier-1 TPM ceiling (2M for the deployed Gemini 3 Flash—corrected post-run; see the provider-budget note above) before a run—a corpus of unusually long NDAs is the one input that could approach either.
### What we measure
diff --git a/docs/adr/0016-agentic-flavor-deployment.md b/docs/adr/0016-agentic-flavor-deployment.md
new file mode 100644
index 0000000..eea1463
--- /dev/null
+++ b/docs/adr/0016-agentic-flavor-deployment.md
@@ -0,0 +1,137 @@
+# ADR-0016: Agentic-Flavor Deployment
+
+## Status
+
+Accepted (2026-06-07).
+
+## Context
+
+The project is named `agentic-kie-deploy`, but every environment to date runs the **single-pass** extractor (`SinglePassExtractor`, [handler.py:84](../../src/extractor/handler.py#L84)). That was a deliberate, measured choice. The offline benchmark ([*When does agency earn its cost?*](https://gabriel.com.gt/blog/when-does-agency-earn-its-cost/)) found that single-pass dominates the matrix on the Kleister NDA corpus: ~91.5% F1 at ~$0.007/doc and ~9.8s. The agentic flavor cost more in both latency and dollars (Claude-standard ran ~$0.038/~65s, ~5× the dollars and ~6× the latency; Gemini Standard agentic ran ~$0.011/~14.6s, ~1.5×) for gains "insufficient to justify the overhead," and lite-tier agentic *regressed* more documents than it improved. Agency did not earn its cost, so we shipped the flavor that did.
+
+That verdict is **offline**: a one-shot accuracy/cost eval on 83 dev documents. It says nothing about what agency costs *the deployed system under arrival pressure*—which is a different and harsher cost than per-document dollars. [ADR-0015](0015-load-testing-strategy.md) measured the deployed behavior of the single-pass flavor (both scenarios passed all five SLOs); the symmetric exercise for the agentic flavor has never been run. So three things are simultaneously true:
+
+- The name promises a capability the deployment doesn't currently exercise.
+- The strongest decision in the project—*not* shipping agentic—is only half-justified, because it rests on offline numbers and never confronts the deployed envelope.
+- The offline verdict has a deployed counterpart no benchmark can produce—the agency premium *in the running pipeline* (drain time, queue dwell, the infra cost the eval never saw)—and the exercise gives ADR-0015's dormant findings a live look: Finding 1's provider-RPM coupling gets *measured* (and, at Tier 1, is likely confirmed slack), and Finding 2's errors-alarm-vs-DLQ question becomes testable via a deliberate stressor.
+
+This ADR settles **how** the agentic flavor is deployed and how its parameter envelope is re-derived. On Gemini, agentic costs ~1.5× single-pass in latency and dollars—modest enough that the existing operating envelope already absorbs it at the ADR-0015 bracket. The re-derivation is therefore narrow: exactly two knobs genuinely move, the rest of the envelope holds, and the real payoff is the *deployed* agency premium plus the capability itself.
+
+### The agentic flavor changes the workload model, not just a constant
+
+`AgenticExtractor` builds a LangChain ReAct agent that explores the PDF via tools (`get_page_count`, `read_text`, optionally `load_images`) and stops when it has enough information. Concretely, versus single-pass:
+
+| Property | Single-pass | Agentic | Consequence |
+|---|---|---|---|
+| LLM calls per document | exactly 1 | N, data-dependent (observed 5–9 in offline traces) | request rate decouples from document rate |
+| Service time | ~10s (p99 31s) | ~14.6s (benchmark, ~1.5×), fatter/bimodal tail | steady-state capacity contracts |
+| Input tokens/doc | fixed per document | inflated (re-reads pages across turns) | provider TPM headroom shrinks |
+| Failure modes | one call succeeds/fails | loop non-termination, repeated tool error, partial state | `max_iterations` exhaustion → `ExtractionError` |
+
+The single-pass parameters were *derived* from its workload model (service ~10s → capacity `cap ÷ service` ≈ 60/min at staging; provider draw = throughput because calls = throughput). The honest move is to re-run that derivation and see which constants actually move—not to assume the whole envelope is wrong. At only ~1.5× service the queue-dynamics constants mostly still fit; as it turns out (below), one knob (`max_iterations`) is wrong independent of latency, one (`maxReceiveCount`) is worth tightening, and the rest hold.
+
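The derivation above is simple enough to re-run by hand. A minimal sketch, assuming steady state with every concurrency slot busy; the function name `capacity_per_min` is illustrative and not part of the repo:

```python
def capacity_per_min(cap: int, service_s: float) -> float:
    """Steady-state documents per minute: `cap` workers, each taking `service_s` per doc."""
    return cap / service_s * 60


# Single-pass at staging: cap 10, ~10s service time.
single_pass = capacity_per_min(cap=10, service_s=10.0)

# Agentic at the same cap: ~14.6s benchmark service time.
agentic = capacity_per_min(cap=10, service_s=14.6)

print(f"single-pass ~{single_pass:.0f} docs/min, agentic ~{agentic:.0f} docs/min")
```

Holding the cap fixed, the only input that changes between flavors is the service time, which is why capacity contracts in proportion to the ~1.5× slowdown.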
+### The architectural change: one concurrency knob becomes two
+
+In single-pass, the SQS event-source `maximum_concurrency` ([extractor/main.tf:130](../../infra/modules/extractor/main.tf#L130)) does three jobs at once *because one document equals one LLM call*: it caps document parallelism (throughput), caps concurrent LLM requests (the cost-burst guardrail), and bounds the provider RPM draw (Finding 1's coupling). Those collapse into a single number only at a 1:1 doc-to-call ratio.
+
+Agentic fans out **inside** a document, so documents-in-flight ≠ requests-in-flight: the request side now scales with `cap × calls_per_doc`, which is variable and which the SQS cap does not control. The cap still governs throughput, but the cost-guardrail and provider-coupling jobs would call for a **second control surface**: a request-level limiter (token bucket / semaphore in the handler) sized against the Gemini RPM/TPM budget. The SQS event-source cap governs *document* parallelism; the in-handler limiter governs *request* parallelism. That decoupling is the real architectural finding—the deployed-infra echo of the offline thesis: agency doesn't merely cost more per document, it breaks the assumption that one knob controls both throughput and provider exposure. *Conceptually* real is not the same as *quantitatively* binding, though: at Tier 1 (1,000 RPM) with ~1.5× service, the request side draws only a few hundred RPM—~164 at staging's cap 10 (~6× under the ceiling), ~410 even at prod's cap 25 (~2.4× under). So the second control surface is a thing to *measure for*, and to reach for as N or the cap grows—comfortably skippable for the staging characterization run, but a thin enough margin at prod's cap that it moves from hypothetical toward real (Finding B).
+
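The draw figures above can be reproduced as request rate = document capacity × LLM calls per document. The sketch below is illustrative only: `request_rpm` and `headroom` are hypothetical helpers, and `calls_per_doc` is an assumed input. Under this formula the ~164 and ~410 RPM figures correspond to roughly 4 calls per document; the upper end of the 5 to 9 calls seen in offline traces gives a proportionally higher draw, which is one reason the run measures the rate rather than assuming it.

```python
TIER1_RPM_CEILING = 1_000  # corrected Tier-1 ceiling cited in this ADR


def request_rpm(cap: int, service_s: float, calls_per_doc: float) -> float:
    """LLM requests per minute when `cap` documents are always in flight."""
    docs_per_min = cap / service_s * 60
    return docs_per_min * calls_per_doc


def headroom(rpm: float, ceiling: float = TIER1_RPM_CEILING) -> float:
    """How many times the draw fits under the ceiling."""
    return ceiling / rpm


for env, cap in (("staging", 10), ("prod", 25)):
    # 4 calls/doc reproduces the figures above; 9 is the top of the observed range.
    for calls in (4, 9):
        rpm = request_rpm(cap, service_s=14.6, calls_per_doc=calls)
        print(f"{env:8s} cap={cap:2d} calls/doc={calls}: ~{rpm:.0f} RPM, {headroom(rpm):.1f}x under")
```

The SQS cap appears only in the `docs_per_min` term, so it bounds the request rate only as long as `calls_per_doc` stays put, which is the decoupling this section describes.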
+## Decision
+
+### Flavor is a deploy-time parameter, single-pass stays the default
+
+Introduce `var.extractor_flavor` (`single_pass` | `agentic`, default `single_pass`). It drives two things:
+
+1. **The handler constructor.** `_extractor()` ([handler.py:84-90](../../src/extractor/handler.py#L84-L90)) reads a new `EXTRACTOR_FLAVOR` env var and builds either `SinglePassExtractor(model, schema)` (today) or `AgenticExtractor(model, schema, modality="text", max_iterations=)`. Both are already exported by `agentic_kie`, share the identical `(model, schema)` interface, and surface failures through the handler's broad `except Exception` ([handler.py:317](../../src/extractor/handler.py#L317)), which already routes them to `batchItemFailures` ([handler.py:356](../../src/extractor/handler.py#L356))—so the agentic failure path (a non-terminating agent's `ExtractionError` included) flows through the existing redrive/DLQ machinery unchanged, caught by the type-agnostic `except` rather than any shared exception class. `Extractor[NDA]` (also exported) becomes the return type so the cache helper covers both.
+2. **The parameter profile** (below), keyed off `extractor_flavor` so the whole envelope moves *with* the flavor rather than being hand-edited—of which, for agentic, only `max_iterations` and `maxReceiveCount` actually differ from single-pass (the timeout and its derived visibility stay put). Switching any environment's flavor is then a one-variable change, which is the point: re-parametrization should be as cheap as flipping the variable.
+
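One way to picture "the profile moves with the flavor" is a lookup keyed by the flavor name. The sketch below is hypothetical: `FlavorProfile`, `PROFILES`, and `profile_for` do not exist in the repo, and the real plumbing would live in Terraform variables and the handler's environment. The values are the ones this ADR proposes.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlavorProfile:
    max_iterations: Optional[int]  # None = not applicable (single-pass has no loop)
    max_receive_count: int         # SQS redrive threshold


# Only the two knobs that differ between flavors are part of the profile.
PROFILES = {
    "single_pass": FlavorProfile(max_iterations=None, max_receive_count=3),
    "agentic": FlavorProfile(max_iterations=30, max_receive_count=2),
}


def profile_for(flavor: Optional[str] = None) -> FlavorProfile:
    """Resolve the profile from an explicit flavor or the EXTRACTOR_FLAVOR env var."""
    flavor = flavor or os.environ.get("EXTRACTOR_FLAVOR", "single_pass")
    try:
        return PROFILES[flavor]
    except KeyError:
        raise ValueError(f"unknown extractor flavor: {flavor!r}") from None


print(profile_for("agentic"))
```

Rejecting unknown flavors loudly, rather than falling back to the default, keeps a typo in the variable from silently deploying the wrong profile.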
+Every environment—staging and prod alike—can run **either** flavor, selected per environment at deploy time, with single-pass the default everywhere. Because the full profile follows `extractor_flavor` (above), pointing any environment at agentic is a one-variable change, and pointing it back is the same. The characterization run is done on **staging** first: you validate a new flavor's deployed envelope before offering it to prod, and staging's single-pass baseline already lives in the ADR-0015 artifacts, so flipping it loses nothing. Prod thereby *gains the capability* to run agentic while keeping single-pass (and its deletion protection) by choice—nothing about prod is reverted, because the infra change is a permanent capability, not a temporary patch. A dedicated `staging-agentic` environment remains an option for a continuous side-by-side (recorded below).
+
+### The re-derived parameter profile
+
+| Knob | Single-pass (today) | Agentic profile | Why it moves |
+|---|---|---|---|
+| `max_iterations` (agent) | n/a | **~30** (down from the library default 50) | The real cost/latency governor—but it caps LangGraph *supersteps* (`recursion_limit`), ≈ 2× the LLM-call count, *not* LLM calls. Offline traces run 5–9 LLM calls (≈ 9–17 supersteps). ~30 clears that ceiling with margin and still caps a runaway at ~15 calls. See Finding A. |
+| `max_retries` (agent) | n/a | **3** | A *third* retry knob, separate from `maxReceiveCount`: `ModelRetryMiddleware` retries each model call up to 3× with backoff on *transient* errors (429/timeout/overload) inside one invocation. Left at 3, but recorded because it interacts with the 120s timeout (transient retries add wall-clock) and because "fail fast and cheap" applies to *logic* failures, not transient ones. |
+| `modality` | `text` | `text` | Avoids image-token blow-up; keeps the per-doc TPM draw bounded. Measured single-pass TPM peaked at 0.317M against the 2M ceiling (~6× headroom); even agentic's per-doc input inflation (~2–4×) at ~0.68× throughput stays well under, so Finding 1's coupling holds slack. |
+| Lambda timeout ([main.tf:33](../../infra/main.tf#L33)) | 120s | **120s (unchanged)** | Benchmark mean is 14.6s, and `max_iterations` ~30 (≈ ~15 LLM calls at ~2s each) bounds the worst run to ~30–50s—well under the existing 120s, which already absorbed single-pass's 50s tail. `max_iterations`, not the clock, is the governor; the timeout is a backstop that already has margin. No reason to move it. |
+| Visibility timeout | 720s (= 120×6) | **720s (unchanged)** | Derived as `timeout × 6` ([queue/main.tf:2](../../infra/modules/queue/main.tf#L2)), so it tracks the timeout automatically. The timeout stays at 120s, so this stays at 720s—and at ~330s peak dwell (below, scaling the measured single-pass baseline) that is ~2.2× headroom. The coupling is worth keeping; it just doesn't need to fire here. |
+| `maximum_concurrency` | 10 staging / 25 prod | **held at the environment's existing cap** (cost-preserving) | A per-environment lever, independent of flavor—not part of the flavor profile; see the fork below. |
+| **(new) request-level limiter** | implicit in the cap | **measure first, build only if the draw warrants** | In-doc fan-out decouples request rate from the cap (see Context), but at Tier 1 the draw sits ~6× under budget at staging's cap 10 (~2.4× at prod's cap 25). Conditional on the run's measured provider rate (Finding B), not built up front. |
+| `maxReceiveCount` ([queue default](../../infra/modules/queue/variables.tf#L16)) | 3 | **2** | Agentic failures are mostly logic (non-terminating loop, repeated tool error), not transient. Retrying an expensive doomed run 3× triples its cost for nothing. The value is single-sourced (queue redrive → `SQS_MAX_RECEIVE_COUNT`, [main.tf:91](../../infra/main.tf#L91)), so the flip is one variable—but several descriptions hard-code "maxReceiveCount=3" (the [extractor-errors alarm](../../infra/modules/extractor/main.tf#L137), the [DLQ alarm](../../infra/modules/queue/main.tf#L132), the publisher variable, the README alarm table) and must be updated alongside it. |
+| `batch_size` / batching window | 1 / 0 | 1 / 0 (unchanged) | One long ReAct run per invocation is already correct; batching would head-of-line-block. |
+| Memory | 2048 MB | 2048 MB (revisit) | Latency is LLM-wall-clock-bound (network), not CPU-bound; memory buys cold-start and glue speed only. A modest lever, left at baseline pending evidence. |
+
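The `max_iterations` row rests on converting between LLM calls and LangGraph supersteps. A small sketch of that arithmetic, assuming the two-node model/tools loop described in Finding A (K calls cost about 2K-1 supersteps); the function names are illustrative, and the exact boundary behavior of `recursion_limit` is not modeled:

```python
def supersteps_for_calls(llm_calls: int) -> int:
    """Supersteps used by a run making `llm_calls` model calls.

    Model and tool nodes alternate and the run ends on a model step,
    so K calls interleave with K-1 tool steps.
    """
    return 2 * llm_calls - 1


def max_calls_within(limit: int) -> int:
    """Largest number of model calls whose supersteps fit inside `limit`."""
    return (limit + 1) // 2


observed = [supersteps_for_calls(k) for k in (5, 9)]  # offline traces: 5 to 9 calls
runaway_cap = max_calls_within(30)                    # proposed max_iterations
library_default_cap = max_calls_within(50)            # library default

print(observed, runaway_cap, library_default_cap)
```

Read this way, ~30 sits between the heaviest legitimate run (17 supersteps) and the library default, which is the margin the table describes.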
+**The downstream half does not move.** The publisher (DynamoDB Streams → analytics S3, 5s batch window) is flavor-agnostic—it runs *after* extraction and neither knows nor cares which extractor wrote the row. The re-derivation touches only the extractor handler (`max_iterations`) and the queue's redrive policy (`maxReceiveCount`)—the event-source mapping, the provider budget, and the whole downstream half stay as they are.
+
+### The one genuine fork: throughput vs. cost containment
+
+Capacity is `cap ÷ service_time`. To hold single-pass-like drain behavior (a 200-burst absorbed and drained in a few minutes) the cap would rise from 10 to ~15 to offset the ~1.5× longer service time. That fights the cost guardrail. The choice:
+
+- **Throughput-preserving**: raise the cap to ~15, keep drains fast, accept a ~1.5× wider cost-burst exposure on the *expensive* flavor.
+- **Cost-preserving** (chosen): hold the cap at its existing per-environment value (10 on staging) and let SQS hold the backlog longer—which the unchanged 720s visibility timeout already absorbs (~330s dwell, ~2.2× headroom), so nothing has to give for it.
+
+**We hold the existing cap (cost-preserving)—but at ~1.5× this is a low-stakes call, not a principled stand.** Raising it to ~15 would cost ~50% more concurrent spend for a faster drain, and either way the 200-doc bracket completes in minutes with the DLQ empty. We change nothing because the cap is a per-environment lever and there's no measured reason to touch it; if drain time ever matters more than spend, ~15 is the one-variable flip. The original *principle*—lean on the buffer, not the throttle—still holds; it just isn't being tested at this scale.
+
+## Pass/fail criteria (SLOs)
+
+The agentic runs reuse ADR-0015's five SLOs—only SLO 4 changes, made flavor-aware (Finding C)—and add criterion 6, which is the point of the exercise.
+
+1. **Correctness (primary run)**—200/200 reach `succeeded`; both DLQs at 0. (A *deliberate low-`max_iterations` stressor run* is exempt and expected to DLQ—see criterion 5.)
+2. **No premature redelivery**—`ApproximateAgeOfOldestMessage` stays well under the 720s visibility timeout and the queue drains to empty. At ~14.6s service time (~1.5× single-pass), a 200-burst drains in ~5–5.5 min—scaling the *measured* single-pass baseline (3.85 min at 51.9 docs/min, not the theoretical 60/min)—so oldest-message age peaks ~330s, ~2.2× under the 720s, and no message ages into a redelivery. Nothing in the envelope had to move for this; the queue simply drains cleanly and dwell stays well under the timeout.
+3. **Concurrency & provider rate hold**: peak `ConcurrentExecutions` ≤ cap; zero `Throttles`; **and** the measured LLM request rate stays under the Gemini RPM/TPM budget. This is the live read on Findings 1/B: at staging's cap of 10, RPM should sit ~6× under the ceiling, and TPM likewise ~6× under (single-pass peaked at 0.317M of the 2M ceiling; agentic inflates per-doc tokens but stays clear). It is also the test of whether a request-level limiter is needed at all: if the draw is that slack, it isn't built.
+4. **Latency—reported and compared, not gated (for agentic).** The *deliverable* is the agentic-vs-single-pass delta on the same corpus in the same deployed pipeline (criterion 6), not a pass/fail bar—agentic is slow by design. **This is not what the harness does today:** as built, SLO 4 *gates* processing p90 (both scenarios) and sustained e2e p90, on thresholds derived from single-pass's <10s benchmark, and a failed SLO hard-fails the run. Agentic trips those bars, so making SLO 4 flavor-aware—reporting rather than gating—is required work; see Finding C.
+5. **Alarms honest**—primary run: no alarm fires. **Stressor run: this finally exercises Finding 2.** A small, separate run—~20 documents with `max_iterations` forced very low (≤4 supersteps, e.g. 2) so they reliably exhaust it → `ExtractionError` → retry → (at `maxReceiveCount=2`) DLQ. The prediction (from [handler.py:356](../../src/extractor/handler.py#L356)) is that `Errors` stays flat (failures are reported as `batchItemFailures`, a *successful* invocation) and **only** the `${dlq}-messages-visible` alarm fires, not `${extractor}-errors`. Confirming this on a live run closes Finding 2.
+6. **The deployed agency premium (new)**: cost/doc and e2e latency, agentic vs. single-pass, measured in the deployed pipeline rather than benchmarked offline. This is the deployed counterpart of the offline "agency doesn't earn its cost" verdict, plus the infra cost the benchmark never saw (slower drain, the retune this ADR documents).
+
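The drain and dwell estimates in criterion 2 can be sanity-checked two ways: from theoretical capacity, and by scaling the measured single-pass baseline. A rough sketch; the service-time ratio (14.6s ÷ 9.8s) and the variable names are assumptions made for illustration:

```python
BURST_DOCS = 200
VISIBILITY_TIMEOUT_S = 6 * 120  # visibility timeout = 6 x Lambda timeout

# Theoretical: cap 10 at ~14.6s per document.
theoretical_min = BURST_DOCS / (10 / 14.6 * 60)

# Scaled: measured single-pass drain (3.85 min) stretched by the service-time ratio.
scaled_min = 3.85 * (14.6 / 9.8)

# Headroom of the visibility timeout over the ~330s peak dwell quoted above.
dwell_headroom = VISIBILITY_TIMEOUT_S / 330

print(f"drain: {theoretical_min:.1f} to {scaled_min:.1f} min, headroom ~{dwell_headroom:.1f}x")
```

The two estimates bracket the ~5–5.5 min figure, and either one leaves the oldest message at roughly half the visibility timeout or less.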
+## Expected behavior (hypotheses to confirm or refute)
+
+- **Service time** ~14.6s mean (benchmark, ~1.5× single-pass), tail bounded by `max_iterations` rather than by a timeout crash; **capacity** contracts from ~60/min to ~41/min at the held cap.
+- **Burst**: queue peaks near 200 (as single-pass), but *drains in ~5–5.5 min*—vs the *measured* single-pass ~3.85 min, not the theoretical ~3.5; concurrency pins at the cap; oldest-message age peaks ~330s—comfortably under the unchanged 720s timeout (~2.2×). DLQ 0 on the primary run; no alarm.
+- **Sustained**: holding ADR-0015's 0.22 doc/s arrival schedule (the harness fixes the 900s window, so the rate is flavor-independent), now ~32% of the reduced ~41/min capacity—still below capacity, so queue ≈ 0 and concurrency hovers low (perhaps a touch above single-pass's peak-5, given the fatter tail—ADR-0015 Finding 3—but under the cap); latency ≈ processing (which is now multi-call).
+- **Cost**: ~$0.011/doc on Gemini text-modality agentic (benchmark); ~$4–5 for both scenarios (200 docs each), plus pennies for the ~20-doc stressor.
+- **Finding 2 stressor**: docs that exhaust the low `max_iterations` DLQ cleanly with `Errors` flat and only the DLQ alarm firing.
+
+If reality diverges, the divergence is the finding.
+
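The sustained-load and cost hypotheses above reduce to two short calculations. A minimal sketch using only figures quoted in this ADR; the variable names are illustrative:

```python
DOCS_PER_SCENARIO = 200
SUSTAINED_WINDOW_S = 900
COST_PER_DOC_USD = 0.011  # Gemini text-modality agentic, benchmark figure

# Arrival rate fixed by the harness window.
arrival_per_min = DOCS_PER_SCENARIO / SUSTAINED_WINDOW_S * 60

# Agentic capacity at staging's cap of 10 and ~14.6s service time.
capacity_per_min = 10 / 14.6 * 60

utilization = arrival_per_min / capacity_per_min

# Burst plus sustained, 200 documents each.
primary_cost_usd = 2 * DOCS_PER_SCENARIO * COST_PER_DOC_USD

print(f"utilization ~{utilization:.0%}, primary runs ~${primary_cost_usd:.2f}")
```

Utilization well below 1 is what keeps the sustained queue near empty; the burst scenario is the one that actually exercises the buffer.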
+## The harness
+
+No new harness. The ADR-0015 driver under `tests/load/` is **flavor-agnostic**: it presigns + PUTs documents, polls for landing, and reads server-side `created_at` / `processing_ms` / `completed_at` / `token_usage` plus the Layer A CloudWatch series and alarm history. None of that is single-pass-specific. So the existing `make load ENV=staging SCENARIO=burst|sustained` runs against the agentic deployment unchanged; the only difference is which flavor profile staging was applied with. The agentic artifacts land alongside the single-pass baseline in `tests/load/reports/`, and the per-document pairing (same corpus, same upload order) extends to a third axis—single-pass vs agentic on the identical document.
+
+## Consequences
+
+Positive:
+
+- The project earns its name: it deploys `agentic-kie`, both flavors, selectable per environment at deploy time—prod included.
+- The offline "agency doesn't earn its cost" verdict gains its deployed counterpart, including the infra cost the benchmark could not measure.
+- Finding 2 gets a live test (via the deliberate stressor sub-run); Finding 1 is *measured* and—at Tier 1 with these caps—expected to stay slack, which is itself a recorded result. The cap-decoupling is documented as a watch-item for higher N / prod's cap, not prematurely built.
+- The re-parametrization is reusable: the flavor profile is the template for any future heavier workload (multimodal, a larger schema).
+
+Negative:
+
+- Real work: a handler constructor switch, a new `extractor_flavor` parameter + profile plumbing, and a harness change so SLO 4 reports rather than gates agentic latency (Finding C)—plus the request-level limiter *only if* the measured draw warrants it (Finding B). More LLM spend (~$4–5) than the single-pass runs.
+- An environment runs one flavor at a time, so flipping staging to agentic means it isn't serving single-pass during the run window (mitigated: the baseline is already captured and flip-back is one variable; or stand up a second environment for a continuous side-by-side).
+- The agentic flavor does not change the production decision—single-pass remains the default. This is characterization, not a reversal.
+
+Neutral:
+
+- The production *decision* is unchanged—prod keeps single-pass by choice—while the *capability* to run agentic is added for every environment. Adding the option is not exercising it; the change reverts nothing.
+
+## Findings
+
+(Recorded as discovered; pre-implementation findings first.)
+
+- **Finding A—`max_iterations` is a LangGraph `recursion_limit` (supersteps ≈ 2× LLM calls), not an LLM-call count; the right value is ~30—not the 8–12 first drafted, nor the library default 50.** `AgenticExtractor` passes `max_iterations` straight to LangGraph's `recursion_limit`, and `create_agent` builds a two-node loop (model ↔ tools), so K LLM calls cost ≈ 2K−1 supersteps. Offline traces show 5–9 LLM calls (≈ 9–17 supersteps); the higher "count tools and chains → ~45" figure is LangSmith *trace spans*, not supersteps, and doesn't bind this knob. Two corrections follow: (1) the draft's 8–12 would clip *every* legit run into a false `ExtractionError`—even a 5-call run needs ~9 supersteps; (2) the "default 50 crashes on the 120s timeout" mechanism is model-specific—it held for the slow Claude run (~65s) but not for the deployed Gemini Flash (~14.6s for 5–9 calls, ~2s/call), where even 50 supersteps (~25 calls) is ~50s and raises `ExtractionError` *cleanly* rather than crashing. So the reason to lower it is cost/latency containment of a doomed doc (cap a runaway at ~15 calls / ~$0.02 / ~30s) and margin above the legit ceiling, not crash-avoidance. ~30 clears the observed 9-call ceiling with ~1.7× margin; the characterization run validates it—a *legit* doc DLQ'ing via recursion means it's still too tight. The single-pass flavor never surfaced any of this because it has no loop.
+- **Finding B (to confirm)—the SQS event-source cap stops being a provider-rate control under agentic.** Because in-doc fan-out decouples request rate from document rate, holding `maximum_concurrency` no longer bounds the provider's requests or tokens per minute (RPM/TPM). Whether the new in-handler limiter is necessary, or whether Tier 1's headroom absorbs `cap × calls_per_doc` anyway, is a question to settle by measuring on the run, not by assumption.
+- **Finding C—the harness's latency SLO is hard-gated and would false-fail agentic.** [report.py:24-25](../../tests/load/report.py#L24-L25) hard-codes `PROCESSING_P90_MAX_S = 15` (gated in both scenarios) and `SUSTAINED_E2E_P90_MAX_S = 20` (sustained), and any failed SLO trips `assert not failures` ([test_scenarios.py:86-88](../../tests/load/test_scenarios.py#L86-L88))—so a red SLO 4 fails the whole run, not just the report. Those bars are 1.5× single-pass's <10s benchmark, and single-pass already clears processing p90 by a hair (13.5/13.8s, ADR-0015 Finding 5), so agentic at ~1.5× trips them on the very metric SLO 4 calls informational. Fix: thread `extractor_flavor` into `report.evaluate()` and return `passed=None` for agentic latency—`None` is not `False`, so it doesn't trip the assert, and the harness already uses that exact pattern for the no-data case ([report.py:159](../../tests/load/report.py#L159)). The agentic-vs-single-pass delta (criterion 6) stays the deliverable. Discovered reading the harness while drafting this ADR; lands in the implementation phase.
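
The arithmetic behind Finding A can be checked with a few lines of Python. This is a minimal sketch, not project code: the helper names are hypothetical, it takes the ADR's `2K − 1` approximation at face value, and it assumes a run completes whenever its superstep count does not exceed the limit (LangGraph's exact boundary behavior is not modeled). The ~2s/call figure is the Gemini Flash estimate quoted in the finding.

```python
"""Back-of-envelope arithmetic for Finding A (illustrative only)."""

SECONDS_PER_CALL = 2.0  # assumption: the ~2s/call estimate quoted in Finding A


def supersteps_for_calls(llm_calls: int) -> int:
    """Supersteps for a model <-> tools loop that ends on a model turn: 2K - 1."""
    if llm_calls < 1:
        raise ValueError("a run makes at least one LLM call")
    return 2 * llm_calls - 1


def max_calls_under_limit(recursion_limit: int) -> int:
    """Largest K whose 2K - 1 supersteps still fit within the limit."""
    return (recursion_limit + 1) // 2


def margin_over_ceiling(recursion_limit: int, ceiling_calls: int) -> float:
    """Ratio of the limit to the supersteps needed by the longest observed run."""
    return recursion_limit / supersteps_for_calls(ceiling_calls)


if __name__ == "__main__":
    for limit in (8, 12, 30, 50):
        calls = max_calls_under_limit(limit)
        print(
            f"limit={limit}: up to {calls} LLM calls, "
            f"~{calls * SECONDS_PER_CALL:.0f}s at ~{SECONDS_PER_CALL:.0f}s/call"
        )
    print(f"margin over the 9-call ceiling: {margin_over_ceiling(30, 9):.2f}x")
```

Under these assumptions a limit of 30 admits about 15 calls and a limit of 50 about 25, which matches the figures in the finding. The drafted 8–12 admits only 4–6 calls, below most of the observed 5–9 range.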
+
+## Alternatives considered
+
+- **Flip the existing extractor by env var only (no parameter profile).** Simplest, but the agentic flavor still wants `maxReceiveCount` lowered and `max_iterations` set, so an env-var-only flip leaves those to be hand-edited per run and can't hold a clean single-pass baseline alongside. Rejected: the flavor and its profile should move together as one variable.
+- **Throughput-preserving cap (~15).** Holds single-pass drain times. Not chosen for v1 (see the fork)—though at ~1.5× the cost delta is small enough that this is nearly a coin-flip. Recorded as a one-variable flip if drain time ever matters more than spend.
+- **Multimodal / image modality.** Closer to what a "read the document like a human" agent implies, and what some benchmark rows used. Rejected for the deploy: image tokens multiply the TPM draw and re-tighten Finding 1's coupling for no measured accuracy win on this text-heavy NDA corpus. `text` keeps the provider budget slack.
+- **Dedicated `staging-agentic` environment.** A true side-by-side: agentic and single-pass live simultaneously, no baseline displacement. Heavier (a full env stand-up, its own alarms, its own teardown) and unnecessary given the baseline is already captured. Recorded as the cleaner path if a *continuous* A/B is ever wanted, per the single-tenant deployment model ([ADR-0013](0013-single-tenant-deployment-model.md)).
+- **Don't deploy agentic; explain the name in prose.** The zero-cost path: a README/blog line saying the name refers to the library, which implements both flavors. Rejected as the anticlimactic answer—it leaves the project's strongest decision resting on offline numbers and forgoes the most interesting load-testing exercise available.
+
+## Post-implementation
+
+(To be completed after the runs, mirroring ADR-0015: the hypotheses above graded against the artifacts, the deployed agency premium reported, and Findings 1/2/A/B/C resolved or carried.)
diff --git a/docs/adr/README.md b/docs/adr/README.md
index 25e3b8e..729bc47 100644
--- a/docs/adr/README.md
+++ b/docs/adr/README.md
@@ -33,3 +33,4 @@ This directory records the significant architectural decisions made in this proj
| [0013](0013-single-tenant-deployment-model.md) | Single-tenant deployment model | Accepted |
| [0014](0014-split-results-module.md) | Split the results module into publisher and analytics | Accepted |
| [0015](0015-load-testing-strategy.md) | Load-testing strategy | Accepted |
+| [0016](0016-agentic-flavor-deployment.md) | Agentic-flavor deployment | Accepted |
diff --git a/infra/envs/local.tfvars b/infra/envs/local.tfvars
index ad4bd7c..244aca5 100644
--- a/infra/envs/local.tfvars
+++ b/infra/envs/local.tfvars
@@ -1 +1,2 @@
-environment = "local"
+environment = "local"
+extractor_flavor = "agentic"
diff --git a/infra/envs/prod.tfvars b/infra/envs/prod.tfvars
index 195908d..0513964 100644
--- a/infra/envs/prod.tfvars
+++ b/infra/envs/prod.tfvars
@@ -1,2 +1,3 @@
-environment = "prod"
-alarm_email = "gafnts@gmail.com"
+environment = "prod"
+extractor_flavor = "single_pass"
+alarm_email = "gafnts@gmail.com"
diff --git a/infra/envs/staging.tfvars b/infra/envs/staging.tfvars
index 9eac881..53f51a5 100644
--- a/infra/envs/staging.tfvars
+++ b/infra/envs/staging.tfvars
@@ -1,2 +1,3 @@
-environment = "staging"
-alarm_email = "gafnts@gmail.com"
+environment = "staging"
+extractor_flavor = "agentic"
+alarm_email = "gafnts@gmail.com"
diff --git a/infra/main.tf b/infra/main.tf
index 19c2046..dd96d10 100644
--- a/infra/main.tf
+++ b/infra/main.tf
@@ -32,6 +32,24 @@ locals {
extractor_timeout_seconds = 120
+ # The parameter profile follows extractor_flavor so the whole envelope moves with
+ # the flavor rather than being hand-edited. Only two knobs differ between flavors:
+ # the agent's max_iterations (n/a for single-pass, which has no loop) and the queue's
+ # maxReceiveCount (agentic failures are mostly logic, not transient, so retrying an
+ # expensive doomed run buys nothing). The timeout, visibility timeout, modality, and
+ # concurrency cap hold across both.
+ flavor_profiles = {
+ single_pass = {
+ max_iterations = null
+ max_receive_count = 3
+ }
+ agentic = {
+ max_iterations = 30
+ max_receive_count = 2
+ }
+ }
+ flavor_profile = local.flavor_profiles[var.extractor_flavor]
+
# Partition root for result objects, single-sourced here and threaded into both
# the publisher (write path) and analytics (Glue/Athena read path) modules.
results_prefix = "extractions"
@@ -67,6 +85,7 @@ module "queue" {
name = "${var.project_name}-${var.environment}-extraction"
source_bucket_name = module.ingestion.bucket_name
lambda_timeout_seconds = local.extractor_timeout_seconds
+ max_receive_count = local.flavor_profile.max_receive_count
alarm_topic_arn = module.alarms.topic_arn
environment = var.environment
}
@@ -82,6 +101,8 @@ module "extractor" {
source = "./modules/extractor"
function_name = "${var.project_name}-${var.environment}-extractor"
image_uri = "${data.aws_ecr_repository.extractor.repository_url}@${var.extractor_image_digest}"
+ extractor_flavor = var.extractor_flavor
+ max_iterations = local.flavor_profile.max_iterations
timeout_seconds = local.extractor_timeout_seconds
memory_mb = 2048
ephemeral_storage_mb = 2048
diff --git a/infra/modules/extractor/main.tf b/infra/modules/extractor/main.tf
index 7a7c8fd..6318e90 100644
--- a/infra/modules/extractor/main.tf
+++ b/infra/modules/extractor/main.tf
@@ -100,14 +100,20 @@ resource "aws_lambda_function" "extractor" {
}
environment {
- variables = {
- LLM_MODEL = var.llm_model
- LLM_PROVIDER_SECRET_ARN = var.llm_provider_secret_arn
- LANGSMITH_SECRET_ARN = var.langsmith_secret_arn
- LANGSMITH_PROJECT = var.langsmith_project
- RESULTS_TABLE_NAME = var.results_table_name
- SQS_MAX_RECEIVE_COUNT = tostring(var.queue_max_receive_count)
- }
+ # Single-pass leaves max_iterations null and omits EXTRACTOR_MAX_ITERATIONS
+ # entirely rather than carry a dead var; only agentic sets it.
+ variables = merge(
+ {
+ LLM_MODEL = var.llm_model
+ LLM_PROVIDER_SECRET_ARN = var.llm_provider_secret_arn
+ LANGSMITH_SECRET_ARN = var.langsmith_secret_arn
+ LANGSMITH_PROJECT = var.langsmith_project
+ RESULTS_TABLE_NAME = var.results_table_name
+ SQS_MAX_RECEIVE_COUNT = tostring(var.queue_max_receive_count)
+ EXTRACTOR_FLAVOR = var.extractor_flavor
+ },
+ var.max_iterations != null ? { EXTRACTOR_MAX_ITERATIONS = tostring(var.max_iterations) } : {}
+ )
}
tags = {
@@ -134,7 +140,7 @@ resource "aws_lambda_event_source_mapping" "extraction" {
resource "aws_cloudwatch_metric_alarm" "errors" {
alarm_name = "${var.function_name}-errors"
- alarm_description = "Lambda invocations that ended in an unhandled exception. With maxReceiveCount=3 on the queue, a single bad document fires this up to three times before it lands in the DLQ — the alarm is the early-warning signal that the DLQ alarm is the confirmation of."
+ alarm_description = "Lambda invocations that ended in an unhandled exception. A single bad document fires this once per delivery attempt before it lands in the DLQ, so this alarm is the early warning and the DLQ alarm is the confirmation."
namespace = "AWS/Lambda"
metric_name = "Errors"
statistic = "Sum"
diff --git a/infra/modules/extractor/variables.tf b/infra/modules/extractor/variables.tf
index 3935228..cbfc253 100644
--- a/infra/modules/extractor/variables.tf
+++ b/infra/modules/extractor/variables.tf
@@ -12,6 +12,22 @@ variable "image_uri" {
}
}
+variable "extractor_flavor" {
+ description = "Which agentic-kie strategy the handler builds, passed through as EXTRACTOR_FLAVOR. 'single_pass' or 'agentic'."
+ type = string
+ default = "single_pass"
+ validation {
+ condition = contains(["single_pass", "agentic"], var.extractor_flavor)
+ error_message = "extractor_flavor must be 'single_pass' or 'agentic'."
+ }
+}
+
+variable "max_iterations" {
+ description = "Agentic-only cap on LangGraph supersteps (recursion_limit, ~2x the LLM-call count), passed as EXTRACTOR_MAX_ITERATIONS. Null for single_pass, which has no loop; only emitted to the function env when set."
+ type = number
+ default = null
+}
+
variable "timeout_seconds" {
description = "Function timeout. The queue's visibility timeout is derived as 6x this value."
type = number
diff --git a/infra/modules/publisher/variables.tf b/infra/modules/publisher/variables.tf
index f05c99a..37365d3 100644
--- a/infra/modules/publisher/variables.tf
+++ b/infra/modules/publisher/variables.tf
@@ -64,7 +64,7 @@ variable "stream_batching_window_seconds" {
}
variable "stream_retry_attempts" {
- description = "Retries before a failed batch lands in the DLQ. Mirrors the extractor's maxReceiveCount=3 for retry-budget symmetry across the pipeline."
+ description = "Retries before a failed batch lands in the DLQ. Mirrors the extractor's single-pass maxReceiveCount (3) for retry-budget symmetry across the pipeline. The publisher is flavor-agnostic, so this holds at 3 even when the extractor tightens to 2 under the agentic flavor."
type = number
default = 3
}
diff --git a/infra/modules/queue/main.tf b/infra/modules/queue/main.tf
index 205f477..82a50bc 100644
--- a/infra/modules/queue/main.tf
+++ b/infra/modules/queue/main.tf
@@ -129,7 +129,7 @@ resource "aws_sqs_queue_policy" "extraction_dlq" {
resource "aws_cloudwatch_metric_alarm" "dlq_messages_visible" {
alarm_name = "${aws_sqs_queue.extraction_dlq.name}-messages-visible"
- alarm_description = "Any message in the DLQ means a document exhausted maxReceiveCount=3 retries. The DLQ alarm is the single source of truth for failed messages."
+ alarm_description = "Any message in the DLQ means a document exhausted its maxReceiveCount retries (3 for single-pass, 2 for agentic). The DLQ alarm is the single source of truth for failed messages."
namespace = "AWS/SQS"
metric_name = "ApproximateNumberOfMessagesVisible"
statistic = "Maximum"
diff --git a/infra/outputs.tf b/infra/outputs.tf
index fcad498..72661dd 100644
--- a/infra/outputs.tf
+++ b/infra/outputs.tf
@@ -43,6 +43,11 @@ output "results_table_arn" {
value = module.table.table_arn
}
+output "extractor_flavor" {
+ description = "The deployed extraction strategy. Read by the load harness so that SLO 4 (latency) is reported rather than gated for the agentic flavor, which is slow by design."
+ value = var.extractor_flavor
+}
+
output "extractor_function_name" {
value = module.extractor.function_name
}
diff --git a/infra/variables.tf b/infra/variables.tf
index a7b1d02..8939e87 100644
--- a/infra/variables.tf
+++ b/infra/variables.tf
@@ -35,6 +35,16 @@ variable "llm_model" {
default = "gemini-3-flash-preview"
}
+variable "extractor_flavor" {
+ description = "Which agentic-kie extraction strategy the extractor runs. 'single_pass' issues one structured LLM call; 'agentic' runs a ReAct loop over the document. Selectable per environment at deploy time; drives the whole parameter profile (max_iterations, maxReceiveCount) so re-parametrization is a one-variable flip."
+ type = string
+ default = "single_pass"
+ validation {
+ condition = contains(["single_pass", "agentic"], var.extractor_flavor)
+ error_message = "extractor_flavor must be 'single_pass' or 'agentic'."
+ }
+}
+
variable "alarm_email" {
description = "Email address subscribed to the alarm SNS topic. Leave null to skip the subscription (alarms still fire in CloudWatch, they just don't notify anyone). The recipient must confirm the subscription from their inbox before delivery starts."
type = string
diff --git a/src/extractor/handler.py b/src/extractor/handler.py
index 2bde83e..6c26974 100644
--- a/src/extractor/handler.py
+++ b/src/extractor/handler.py
@@ -19,7 +19,7 @@
from typing import Any, cast
import boto3
-from agentic_kie import PDFLoader, SinglePassExtractor
+from agentic_kie import AgenticExtractor, Extractor, PDFLoader, SinglePassExtractor
from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.typing import LambdaContext
from botocore.exceptions import ClientError
@@ -81,12 +81,28 @@ def _bootstrap_secrets() -> None:
@cache
-def _extractor() -> SinglePassExtractor[NDA]:
- """Build the key information extractor."""
+def _extractor() -> Extractor[NDA]:
+ """
+ Build the key information extractor for the deployed flavor (ADR-0016).
+
+ ``EXTRACTOR_FLAVOR`` selects the strategy: ``single_pass`` (default) issues
+ one structured LLM call; ``agentic`` runs a ReAct loop over the document,
+ capped at ``EXTRACTOR_MAX_ITERATIONS`` LangGraph supersteps. Both satisfy the
+ ``Extractor`` protocol and share the identical ``(model, schema)`` interface,
+ so the handler's broad ``except`` routes either one's failure—including the
+ agentic non-termination ``ExtractionError``—through the same redrive path.
+ """
_bootstrap_secrets()
model = ChatGoogleGenerativeAI(
model=os.environ["LLM_MODEL"], google_api_key=_llm_api_key()
)
+ if os.environ.get("EXTRACTOR_FLAVOR", "single_pass") == "agentic":
+ return AgenticExtractor(
+ model=model,
+ schema=NDA,
+ modality="text",
+ max_iterations=int(os.environ.get("EXTRACTOR_MAX_ITERATIONS", "30")),
+ )
return SinglePassExtractor(model=model, schema=NDA)
diff --git a/tests/load/conftest.py b/tests/load/conftest.py
index b6ed5dc..a09f558 100644
--- a/tests/load/conftest.py
+++ b/tests/load/conftest.py
@@ -56,6 +56,15 @@ def extractor_log_group_name() -> str:
return _tf_output("extractor_log_group_name")
+@pytest.fixture(scope="session")
+def extractor_flavor() -> str:
+ """
+ The deployed extraction strategy (ADR-0016), read from the live stack so the
+ report reflects what was actually deployed rather than an operator claim.
+ """
+ return _tf_output("extractor_flavor")
+
+
@pytest.fixture(scope="session")
def load_env(ingestion_bucket: str) -> str:
"""
@@ -82,6 +91,7 @@ def load_targets(
extractor_function_name: str,
publisher_function_name: str,
extractor_log_group_name: str,
+ extractor_flavor: str,
results_table_name: str,
uploader_api_endpoint: str,
analytics_bucket: str,
@@ -104,4 +114,5 @@ def load_targets(
api_id=urlparse(uploader_api_endpoint).hostname.split(".")[0], # type: ignore[union-attr]
analytics_bucket=analytics_bucket,
ingestion_bucket=ingestion_bucket,
+ flavor=extractor_flavor,
)
diff --git a/tests/load/measure.py b/tests/load/measure.py
index 90338a6..0456a31 100644
--- a/tests/load/measure.py
+++ b/tests/load/measure.py
@@ -44,6 +44,7 @@ class Targets:
api_id: str
analytics_bucket: str
ingestion_bucket: str
+ flavor: str = "single_pass" # deployed extraction strategy (ADR-0016)
@property
def alarm_prefix(self) -> str:
diff --git a/tests/load/report.py b/tests/load/report.py
index 1129792..59f3a2d 100644
--- a/tests/load/report.py
+++ b/tests/load/report.py
@@ -153,10 +153,29 @@ def evaluate(
)
)
- # 4. Latency
+ # 4. Latency. Gated for single_pass on the <10s-benchmark-derived bars;
+ # reported, not gated for agentic (ADR-0016 Finding C), which is slow by
+ # design—its deliverable is the agentic-vs-single-pass delta (criterion 6),
+ # not a pass/fail bar. passed=None is not False, so a slow agentic run never
+ # trips the harness's `assert not failures`.
+ agentic = targets.flavor == "agentic"
proc = [r.processing_s for r in ok if r.processing_s is not None]
if not proc:
slos.append(SLO(4, "Latency", None, "no processing data"))
+ elif agentic:
+ proc_p90 = _percentile(proc, 0.9)
+ e2e = [r.total_e2e for r in ok if r.total_e2e is not None]
+ e2e_p90 = _percentile(e2e, 0.9) if e2e else None
+ e2e_str = f"; e2e p90 {e2e_p90:.1f}s" if e2e_p90 is not None else ""
+ slos.append(
+ SLO(
+ 4,
+ "Latency",
+ None,
+ f"processing p90 {proc_p90:.1f}s{e2e_str} "
+ "(agentic: reported, not gated)",
+ )
+ )
else:
proc_p90 = _percentile(proc, 0.9)
if scenario == "sustained":
@@ -223,6 +242,7 @@ def build(
return {
"scenario": scenario,
"env": targets.env,
+ "flavor": targets.flavor,
"n": len(results),
"timestamp": datetime.now(UTC).isoformat(),
"window": layer_a["window"],
@@ -253,7 +273,8 @@ def write_artifact(report: dict[str, Any]) -> Path:
def format_report(report: dict[str, Any]) -> str:
lat = report["latency"]
lines = [
- f"\n=== load report: {report['scenario']} / {report['env']} / n={report['n']} ===",
+ f"\n=== load report: {report['scenario']} / {report['env']} / "
+ f"{report.get('flavor', 'single_pass')} / n={report['n']} ===",
f" {'segment':<12}{'p50':>8}{'p90':>8}{'p99':>8}{'max':>8}",
]
for label, key in [
diff --git a/tests/test_extractor.py b/tests/test_extractor.py
index b708184..b24cefd 100644
--- a/tests/test_extractor.py
+++ b/tests/test_extractor.py
@@ -464,6 +464,8 @@ def test_bootstrap_secrets_hydrates_env_vars(self, monkeypatch):
def test_extractor_bootstraps_then_builds_single_pass(self, monkeypatch):
monkeypatch.setenv("LLM_MODEL", "gemini-fake")
+ # Default flavor is single_pass; ensure no stray env flips it.
+ monkeypatch.delenv("EXTRACTOR_FLAVOR", raising=False)
bootstrap = MagicMock()
monkeypatch.setattr(handler, "_bootstrap_secrets", bootstrap)
monkeypatch.setattr(
@@ -474,8 +476,10 @@ def test_extractor_bootstraps_then_builds_single_pass(self, monkeypatch):
fake_extractor_obj = MagicMock()
model_ctor = MagicMock(return_value=fake_model)
ext_ctor = MagicMock(return_value=fake_extractor_obj)
+ agentic_ctor = MagicMock()
monkeypatch.setattr(handler, "ChatGoogleGenerativeAI", model_ctor)
monkeypatch.setattr(handler, "SinglePassExtractor", ext_ctor)
+ monkeypatch.setattr(handler, "AgenticExtractor", agentic_ctor)
assert handler._extractor() is fake_extractor_obj
bootstrap.assert_called_once()
@@ -483,6 +487,52 @@ def test_extractor_bootstraps_then_builds_single_pass(self, monkeypatch):
model="gemini-fake", google_api_key="fake-api-key"
)
ext_ctor.assert_called_once_with(model=fake_model, schema=NDA)
+ agentic_ctor.assert_not_called()
+
+ def test_extractor_builds_agentic_when_flavor_set(self, monkeypatch):
+ monkeypatch.setenv("LLM_MODEL", "gemini-fake")
+ monkeypatch.setenv("EXTRACTOR_FLAVOR", "agentic")
+ monkeypatch.setenv("EXTRACTOR_MAX_ITERATIONS", "30")
+ bootstrap = MagicMock()
+ monkeypatch.setattr(handler, "_bootstrap_secrets", bootstrap)
+ monkeypatch.setattr(
+ handler, "_llm_api_key", MagicMock(return_value="fake-api-key")
+ )
+
+ fake_model = MagicMock()
+ fake_agentic_obj = MagicMock()
+ model_ctor = MagicMock(return_value=fake_model)
+ single_ctor = MagicMock()
+ agentic_ctor = MagicMock(return_value=fake_agentic_obj)
+ monkeypatch.setattr(handler, "ChatGoogleGenerativeAI", model_ctor)
+ monkeypatch.setattr(handler, "SinglePassExtractor", single_ctor)
+ monkeypatch.setattr(handler, "AgenticExtractor", agentic_ctor)
+
+ assert handler._extractor() is fake_agentic_obj
+ bootstrap.assert_called_once()
+ agentic_ctor.assert_called_once_with(
+ model=fake_model, schema=NDA, modality="text", max_iterations=30
+ )
+ single_ctor.assert_not_called()
+
+ def test_extractor_agentic_defaults_max_iterations_when_env_absent(
+ self, monkeypatch
+ ):
+ monkeypatch.setenv("LLM_MODEL", "gemini-fake")
+ monkeypatch.setenv("EXTRACTOR_FLAVOR", "agentic")
+ monkeypatch.delenv("EXTRACTOR_MAX_ITERATIONS", raising=False)
+ monkeypatch.setattr(handler, "_bootstrap_secrets", MagicMock())
+ monkeypatch.setattr(
+ handler, "_llm_api_key", MagicMock(return_value="fake-api-key")
+ )
+ monkeypatch.setattr(
+ handler, "ChatGoogleGenerativeAI", MagicMock(return_value=MagicMock())
+ )
+ agentic_ctor = MagicMock()
+ monkeypatch.setattr(handler, "AgenticExtractor", agentic_ctor)
+
+ handler._extractor()
+ assert agentic_ctor.call_args.kwargs["max_iterations"] == 30
def test_ls_client_bootstraps_then_returns_cached_singleton(self, monkeypatch):
bootstrap = MagicMock()