From 811f0e2cae572261500ad4dd5101b7abe9bf97ec Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sat, 6 Jun 2026 20:46:20 -0600
Subject: [PATCH 01/14] Add ADR-0016 for agentic-flavor deployment and
re-parametrization
---
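Note on the capacity arithmetic used in this ADR: the "Expected behavior" and SLO 2 figures all follow from `cap ÷ service_time`. The sketch below is one way to reproduce them. It assumes a burst is worked off in full waves of `cap` documents at a constant service time, which is a simplification; the function names are illustrative and do not exist in the repository.

```python
import math


def capacity_per_min(cap: int, service_s: float) -> float:
    """Steady-state documents per minute: cap / service_time."""
    return cap / service_s * 60.0


def burst_drain_s(burst: int, cap: int, service_s: float) -> float:
    """Seconds to drain a burst, modelled as full waves of `cap` documents."""
    return math.ceil(burst / cap) * service_s


def last_wave_wait_s(burst: int, cap: int, service_s: float) -> float:
    """Seconds the final wave sits in the queue before a worker picks it up."""
    return (math.ceil(burst / cap) - 1) * service_s


if __name__ == "__main__":
    # Staging cap of 10 and a 200-document burst, as in the ADR.
    for service_s in (10, 25, 30, 40):
        print(
            f"service={service_s}s "
            f"capacity={capacity_per_min(10, service_s):.0f}/min "
            f"drain={burst_drain_s(200, 10, service_s):.0f}s "
            f"last_wave_wait={last_wave_wait_s(200, 10, service_s):.0f}s"
        )
```

At cap 10 and a 30s service time this model gives 20 documents per minute and a 570s wait for the last wave, which is where the "~570s dwell" in SLO 2 comes from.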
docs/adr/0016-agentic-flavor-deployment.md | 135 +++++++++++++++++++++
docs/adr/README.md | 1 +
2 files changed, 136 insertions(+)
create mode 100644 docs/adr/0016-agentic-flavor-deployment.md
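Note on the request-level limiter: the ADR names a "token bucket / semaphore in the handler" without specifying it. A minimal sketch of what such a limiter could look like follows. The class name, its parameters, and the injectable clock are assumptions made for illustration, not code from `handler.py` or `agentic_kie`.

```python
import threading
import time


class TokenBucket:
    """Caps the LLM request rate, independent of how many documents are in flight."""

    def __init__(self, rate_per_min: float, burst: int, clock=time.monotonic):
        self._rate = rate_per_min / 60.0  # tokens refilled per second
        self._capacity = float(burst)     # maximum tokens the bucket can hold
        self._tokens = float(burst)       # start full
        self._clock = clock               # injectable so tests need not sleep
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        with self._lock:
            now = self._clock()
            refill = (now - self._last) * self._rate
            self._tokens = min(self._capacity, self._tokens + refill)
            self._last = now
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False

    def acquire(self, poll_s: float = 0.05) -> None:
        """Block until a token is available, polling every `poll_s` seconds."""
        while not self.try_acquire():
            time.sleep(poll_s)
```

An agent would call `acquire()` before each LLM request. Each concurrent Lambda execution environment holds its own bucket, so an in-process limiter like this bounds the rate per instance only; sizing each instance at the provider budget divided by `maximum_concurrency` is one way to stay under a shared budget, while a truly global limit would need shared state outside the function.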
diff --git a/docs/adr/0016-agentic-flavor-deployment.md b/docs/adr/0016-agentic-flavor-deployment.md
new file mode 100644
index 0000000..be794d7
--- /dev/null
+++ b/docs/adr/0016-agentic-flavor-deployment.md
@@ -0,0 +1,135 @@
+# ADR-0016: Agentic-Flavor Deployment and Re-Parametrization
+
+## Status
+
+Proposed (2026-06-06).
+
+## Context
+
+The project is named `agentic-kie-deploy`, but every environment to date runs the **single-pass** extractor (`SinglePassExtractor`, [handler.py:84](../../src/extractor/handler.py#L84)). That was a deliberate, measured choice: the offline benchmark ([*When does agency earn its cost?*](https://gabriel.com.gt/blog/when-does-agency-earn-its-cost/)) found that on the Kleister NDA corpus single-pass dominates the matrix—~91.5% F1 at ~$0.007/doc and ~9.8s, while the agentic flavor cost 2–4× the latency and dollars (Claude-standard ran ~$0.038/~65s) for gains "insufficient to justify the overhead," and lite-tier agentic *regressed* more documents than it improved. Agency did not earn its cost, so we shipped the flavor that did.
+
+That verdict is **offline**: a one-shot accuracy/cost eval on 83 dev documents. It says nothing about what agency costs *the deployed system under arrival pressure*—which is a different and harsher cost than per-document dollars. [ADR-0015](0015-load-testing-strategy.md) measured the deployed behavior of the single-pass flavor (both scenarios passed all five SLOs); the symmetric exercise for the agentic flavor has never been run. So three things are simultaneously true:
+
+- The name promises a capability the deployment doesn't currently exercise.
+- The strongest decision in the project—*not* shipping agentic—is only half-justified, because it rests on offline numbers and never confronts the deployed envelope.
+- Deploying and load-testing the agentic flavor is where every dormant finding in ADR-0015 stops being hypothetical (the provider-RPM coupling of Finding 1; the errors-alarm-vs-DLQ-alarm question of Finding 2).
+
+This ADR settles **how** the agentic flavor is deployed and, more importantly, how the architecture is *re-derived* for it—because the single-pass parameters are correct only for a ~10s, one-LLM-call-per-document workload, and the agentic flavor invalidates every input to that model.
+
+### The agentic flavor changes the workload model, not just a constant
+
+`AgenticExtractor` builds a LangChain ReAct agent that explores the PDF via tools (`get_page_count`, `read_text`, optionally `load_images`) and stops when it has enough information. Concretely, versus single-pass:
+
+| Property | Single-pass | Agentic | Consequence |
+|---|---|---|---|
+| LLM calls per document | exactly 1 | N, data-dependent (1 → `max_iterations`) | request rate decouples from document rate |
+| Service time | ~10s (p99 31s) | ~25–40s expected (2–4×), fatter/bimodal tail | steady-state capacity collapses |
+| Input tokens/doc | fixed per document | inflated (re-reads pages across turns) | provider TPM headroom shrinks |
+| Failure modes | one call succeeds/fails | loop non-termination, repeated tool error, partial state | `max_iterations` exhaustion → `ExtractionError` |
+
+The single-pass parameters were *derived* from its workload model (service ~10s → capacity `cap ÷ service` ≈ 60/min at staging; provider draw = throughput because calls = throughput). Re-using those constants for agentic isn't conservative—it's mis-tuned. The honest move is to re-run the derivation, the same way ADR-0015 wrote a model and graded it.
+
+### The architectural change: one concurrency knob becomes two
+
+In single-pass, the SQS event-source `maximum_concurrency` ([extractor/main.tf:130](../../infra/modules/extractor/main.tf#L130)) does three jobs at once *because one document equals one LLM call*: it caps document parallelism (throughput), caps concurrent LLM requests (the cost-burst guardrail), and bounds the provider RPM draw (Finding 1's coupling). Those collapse into a single number only at a 1:1 doc-to-call ratio.
+
+Agentic fans out **inside** a document, so documents-in-flight ≠ requests-in-flight: the request side now scales with `cap × calls_per_doc`, which is variable and which the SQS cap does not control. The cap still governs throughput, but the cost-guardrail and provider-coupling jobs need a **second control surface**: a request-level limiter (token bucket / semaphore in the handler) sized against the Gemini RPM/TPM budget. The SQS event-source cap governs *document* parallelism; the in-handler limiter governs *request* parallelism. That decoupling is the real architectural finding—the deployed-infra echo of the offline thesis: agency doesn't merely cost more per document, it breaks the assumption that one knob controls both throughput and provider exposure.
+
+## Decision
+
+### Flavor is a deploy-time parameter, single-pass stays the default
+
+Introduce `var.extractor_flavor` (`single_pass` | `agentic`, default `single_pass`). It drives two things:
+
+1. **The handler constructor.** `_extractor()` ([handler.py:84-90](../../src/extractor/handler.py#L84-L90)) reads a new `EXTRACTOR_FLAVOR` env var and builds either `SinglePassExtractor(model, schema)` (today) or `AgenticExtractor(model, schema, modality="text", max_iterations=)`. Both are already exported by `agentic_kie`, share the identical `(model, schema)` interface, and raise the same `ExtractionError` the handler already catches into `batchItemFailures` ([handler.py:356](../../src/extractor/handler.py#L356))—so the agentic failure path flows through the existing redrive/DLQ machinery unchanged. `Extractor[NDA]` (also exported) becomes the return type so the cache helper covers both.
+2. **The parameter profile** (below), so the infra constants move *with* the flavor rather than being hand-edited per run.
+
+Prod is untouched—it remains single-pass with deletion protection. The agentic profile is applied to **staging** for the characterization run (staging's single-pass baseline already lives in the ADR-0015 artifacts, so re-applying it loses nothing), then reverted. A dedicated `staging-agentic` environment is the cleaner-but-heavier alternative (recorded below).
+
+### The re-derived parameter profile
+
+| Knob | Single-pass (today) | Agentic profile | Why it moves |
+|---|---|---|---|
+| `max_iterations` (agent) | n/a | **8–12** (down from the library default 50) | The real cost/latency governor. A doc that can't terminate should fail fast into a *bounded* cost, not burn 50 LLM calls. This is the agentic analog of single-pass's deterministic single call. |
+| `modality` | `text` | `text` | Avoids image-token blow-up; keeps the per-doc TPM draw bounded and Finding 1's coupling slack. |
+| Lambda timeout ([main.tf:33](../../infra/main.tf#L33)) | 120s | **300s** (backstop) | Above the worst legitimate `max_iterations`-bounded run (~10 calls), not the governor. A timeout is a crash → retry → wasted spend; `max_iterations` should bite first. |
+| Visibility timeout | 720s (= 120×6) | **1800s** (= 300×6, automatic) | Already *derived* as `timeout × 6` ([queue/main.tf:2](../../infra/modules/queue/main.tf#L2)). Raising the Lambda timeout moves it in lockstep—one knob, not two—and is exactly what keeps SLO 2 from breaching under the longer queue dwell. |
+| `maximum_concurrency` | 10 staging / 25 prod | **10** (held — cost-preserving) | See the fork below. |
+| **(new) request-level limiter** | implicit in the cap | explicit token bucket vs RPM/TPM | in-doc fan-out decoupled it from the cap (see Context). |
+| `maxReceiveCount` ([queue default](../../infra/modules/queue/variables.tf#L16)) | 3 | **2** | Agentic failures are mostly logic (non-terminating loop, repeated tool error), not transient. Retrying an expensive doomed run 3× triples its cost for nothing. |
+| `batch_size` / batching window | 1 / 0 | 1 / 0 (unchanged) | One long ReAct run per invocation is already correct; batching would head-of-line-block. |
+| Memory | 2048 MB | 2048 MB (revisit) | Latency is LLM-wall-clock-bound (network), not CPU-bound; memory buys cold-start and glue speed only. A modest lever, left at baseline pending evidence. |
+
+**The downstream half does not move.** The publisher (DynamoDB Streams → analytics S3, 5s batch window) is flavor-agnostic—it runs *after* extraction and neither knows nor cares which extractor wrote the row. The re-derivation is entirely the extractor, its event source, and the provider budget.
+
+### The one genuine fork: throughput vs. cost containment
+
+Capacity is `cap ÷ service_time`. To hold single-pass-like drain behavior (a 200-burst absorbed and drained in a few minutes) the cap would rise from 10 to ~30 to offset the ~3× longer service time. That fights the cost guardrail. The choice:
+
+- **Throughput-preserving**: raise the cap to ~30, keep drains fast, accept a ~3× wider cost-burst exposure on the *expensive* flavor.
+- **Cost-preserving** (chosen): hold the cap at 10, let SQS hold the backlog longer, and pay for the longer dwell with the higher (auto-derived) visibility timeout.
+
+**We choose cost-preserving.** Agentic is the flavor that already doesn't earn its cost; letting it *also* fan out 30-wide and spike spend is the wrong instinct. Lean harder on the buffer the architecture already has, not on the throttle. That stance is itself the finding: *the right response to a slower, costlier workload is to widen the buffer's job, not the throttle's.* (Flip this one knob and the rest of the profile is unchanged—the decision is isolated by design.)
+
+## Pass/fail criteria (SLOs)
+
+The agentic runs reuse ADR-0015's five SLOs, adjusted for the re-derived envelope; criterion 6 is new and is the point of the exercise.
+
+1. **Correctness (primary run)**—200/200 reach `succeeded`; both DLQs at 0. (A *deliberate low-`max_iterations` stressor run* is exempt and expected to DLQ—see criterion 5.)
+2. **No premature redelivery**—`ApproximateAgeOfOldestMessage` stays well under the **new 1800s** visibility timeout and the queue drains to empty. This is the SLO the re-parametrization exists to protect: under the *old* 720s timeout, a 200-burst at ~30s service would push the last messages to ~570s dwell and brush redelivery. Confirming it holds under the new profile—and would not under the old—is the headline.
+3. **Concurrency & provider rate hold**—peak `ConcurrentExecutions` ≤ cap; zero `Throttles`; **and** the in-handler limiter keeps the LLM request rate under the Gemini RPM/TPM budget (the new control surface working).
+4. **Latency—reported, not gated, and compared.** Agentic is slow by design; the e2e/processing percentiles are reported, not failed. The *deliverable* is the agentic-vs-single-pass delta on the same corpus in the same deployed pipeline (criterion 6).
+5. **Alarms honest**—primary run: no alarm fires. **Stressor run: this finally exercises Finding 2.** When `max_iterations` is capped low enough that genuinely hard docs exhaust it → `ExtractionError` → retry → DLQ, the prediction (from [handler.py:356](../../src/extractor/handler.py#L356)) is that `Errors` stays flat (failures are reported as `batchItemFailures`, a *successful* invocation) and **only** the `${dlq}-messages-visible` alarm fires, not `${extractor}-errors`. Confirming this on a live run closes Finding 2.
+6. **The deployed agency premium (new)**—cost/doc and e2e-latency, agentic vs. single-pass, measured not benchmarked: the offline "agency doesn't earn its cost" verdict, plus the infra cost the benchmark never saw (slower drain, the retune this ADR documents).
+
+## Expected behavior (hypotheses to confirm or refute)
+
+- **Service time** ~25–40s mean (2–4× single-pass), tail bounded by `max_iterations` rather than by a 120s crash; **capacity** collapses from ~60/min to ~15–25/min at the held cap.
+- **Burst**: queue peaks near 200 (as single-pass), but *drains in ~8–13 min* not ~4; concurrency pins at the cap; oldest-message age peaks ~400–600s, comfortably under 1800s and **close enough to the old 720s to risk redelivery on a slow run**. DLQ 0 on the primary run; no alarm.
+- **Sustained**: at a rate set to ~22% of the *new* capacity, queue ≈ 0, concurrency hovers low; latency ≈ processing (which is now multi-call and several-fold higher).
+- **Cost**: ~$0.015–0.025/doc on Gemini text-modality agentic (more calls, but no image tokens, cheaper model than the blog's Claude-standard); ~$6–10 for both scenarios.
+- **Finding 2 stressor**: docs that exhaust the low `max_iterations` DLQ cleanly with `Errors` flat and only the DLQ alarm firing.
+
+If reality diverges, the divergence is the finding.
+
+## The harness
+
+No new harness. The ADR-0015 driver under `tests/load/` is **flavor-agnostic**: it presigns + PUTs documents, polls for landing, and reads server-side `created_at` / `processing_ms` / `completed_at` / `token_usage` plus the Layer A CloudWatch series and alarm history. None of that is single-pass-specific. So the existing `make load ENV=staging SCENARIO=burst|sustained` runs against the agentic deployment unchanged; the only difference is which flavor profile staging was applied with. The agentic artifacts land alongside the single-pass baseline in `tests/load/reports/`, and the per-document pairing (same corpus, same upload order) extends to a third axis—single-pass vs agentic on the identical document.
+
+## Consequences
+
+Positive:
+
+- The project earns its name: it deploys `agentic-kie`, both flavors, selected at deploy time.
+- The offline "agency doesn't earn its cost" verdict gains its deployed counterpart, including the infra cost the benchmark could not measure.
+- Findings 1 and 2 move from hypotheses to live results; the request-level limiter and the cap-decoupling are exercised, not just reasoned about.
+- The re-parametrization is reusable: the flavor profile is the template for any future heavier workload (multimodal, a larger schema).
+
+Negative:
+
+- Real work: a handler constructor switch, a new `extractor_flavor` parameter + profile plumbing, and the request-level limiter (genuinely new code, not a config change). More LLM spend (~$6–10) than the single-pass runs.
+- Re-applying staging to the agentic profile displaces its single-pass deployment for the duration (mitigated: the baseline is already captured; or stand up `staging-agentic`).
+- The agentic flavor does not change the production decision—single-pass remains the default. This is characterization, not a reversal.
+
+Neutral:
+
+- Prod is untouched. The agentic profile is staging-only and reverted after the run.
+
+## Findings
+
+(Recorded as discovered; pre-implementation findings first.)
+
+- **Finding A—`max_iterations` defaults to 50, which is a latency/cost bomb in a Lambda.** The library default lets a single document drive up to 50 LLM calls before raising. Under a 120s function timeout that document would crash (timeout) long before iteration 50, turning a logic problem into an infra fault and a retry. The profile caps it at 8–12 so the *agent* governs cost, and raises the timeout so the cap—not the clock—is what bites. The single-pass flavor never surfaced this because it has no loop.
+- **Finding B (to confirm)—the SQS event-source cap stops being a provider-rate control under agentic.** Because in-doc fan-out decouples request rate from document rate, holding `maximum_concurrency` no longer bounds RPM/TPM. Whether the new in-handler limiter is necessary, or Tier 1's headroom absorbs `cap × calls_per_doc` anyway, is a quantity to measure on the run, not assume.
+
+## Alternatives considered
+
+- **Flip the existing staging extractor by env var only (no parameter profile).** Simplest, but re-parametrizing (timeout → visibility, `maxReceiveCount`, the limiter) means editing shared infra by hand per run, and you cannot hold a clean single-pass baseline alongside. Rejected: the flavor and its derived envelope should move together as one parameter.
+- **Throughput-preserving cap (~30).** Holds single-pass drain times. Rejected for v1 (see the fork): it widens cost exposure on the flavor we deploy *because* it's expensive. Recorded as a one-line flip if drain time ever matters more than spend.
+- **Multimodal / image modality.** Closer to what a "read the document like a human" agent implies, and what some benchmark rows used. Rejected for the deploy: image tokens multiply the TPM draw and re-tighten Finding 1's coupling for no measured accuracy win on this text-heavy NDA corpus. `text` keeps the provider budget slack.
+- **Dedicated `staging-agentic` environment.** A true side-by-side: agentic and single-pass live simultaneously, no baseline displacement. Heavier (a full env stand-up, its own alarms, its own teardown) and unnecessary given the baseline is already captured. Recorded as the cleaner path if a *continuous* A/B is ever wanted, per the single-tenant deployment model ([ADR-0013](0013-single-tenant-deployment-model.md)).
+- **Don't deploy agentic; explain the name in prose.** The zero-cost path: a README/blog line saying the name refers to the library, which implements both flavors. Rejected as the anticlimactic answer—it leaves the project's strongest decision resting on offline numbers and forgoes the most interesting load-testing exercise available.
+
+## Post-implementation
+
+(To be completed after the runs, mirroring ADR-0015: the hypotheses above graded against the artifacts, the deployed agency premium reported, and Findings 1/2/A/B resolved or carried.)
diff --git a/docs/adr/README.md b/docs/adr/README.md
index 25e3b8e..bffaeb7 100644
--- a/docs/adr/README.md
+++ b/docs/adr/README.md
@@ -33,3 +33,4 @@ This directory records the significant architectural decisions made in this proj
| [0013](0013-single-tenant-deployment-model.md) | Single-tenant deployment model | Accepted |
| [0014](0014-split-results-module.md) | Split the results module into publisher and analytics | Accepted |
| [0015](0015-load-testing-strategy.md) | Load-testing strategy | Accepted |
+| [0016](0016-agentic-flavor-deployment.md) | Agentic-flavor deployment and re-parametrization | Proposed |
From 74991d3e611740f6ed8d69e7d912bd27d85881ab Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sat, 6 Jun 2026 21:21:38 -0600
Subject: [PATCH 02/14] Revise ADR-0016 with Gemini agentic benchmark numbers
---
docs/adr/0016-agentic-flavor-deployment.md | 58 +++++++++++-----------
1 file changed, 29 insertions(+), 29 deletions(-)
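Note on the revised numbers: the "~290s peak dwell", "~2.5× headroom" and "~410 RPM" figures in this revision can be checked with a few lines of arithmetic. The sketch below is a rough model, not a measurement. `CALLS_PER_DOC = 4` is an assumption: the ADR does not state a calls-per-document value, and 4 is simply the value that reproduces its ~410 RPM figure at cap 25.

```python
SERVICE_S = 14.6     # Gemini agentic benchmark mean, from the ADR
VISIBILITY_S = 720   # Lambda timeout (120s) x 6, from the ADR
TIER1_RPM = 4000     # Gemini Tier 1 ceiling, from the ADR
CALLS_PER_DOC = 4    # ASSUMPTION: back-derived, not stated in the ADR


def drain_s(burst: int, cap: int, service_s: float = SERVICE_S) -> float:
    """Seconds to work off a burst at a fixed concurrency cap."""
    return burst / cap * service_s


def request_rpm(cap: int, calls_per_doc: float = CALLS_PER_DOC,
                service_s: float = SERVICE_S) -> float:
    """LLM requests per minute when `cap` documents are always in flight."""
    return cap / service_s * 60.0 * calls_per_doc


if __name__ == "__main__":
    dwell = drain_s(200, 10)  # 200-document burst at the staging cap
    rpm = request_rpm(25)     # prod cap
    print(f"peak dwell ~{dwell:.0f}s, visibility headroom {VISIBILITY_S / dwell:.2f}x")
    print(f"request draw ~{rpm:.0f} RPM, {TIER1_RPM / rpm:.1f}x under the Tier 1 ceiling")
```

Because the request draw scales linearly with both the cap and the calls per document, doubling either one halves the headroom; that is the condition under which the second control surface would start to matter.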
diff --git a/docs/adr/0016-agentic-flavor-deployment.md b/docs/adr/0016-agentic-flavor-deployment.md
index be794d7..75f2eda 100644
--- a/docs/adr/0016-agentic-flavor-deployment.md
+++ b/docs/adr/0016-agentic-flavor-deployment.md
@@ -1,4 +1,4 @@
-# ADR-0016: Agentic-Flavor Deployment and Re-Parametrization
+# ADR-0016: Agentic-Flavor Deployment
## Status
@@ -6,15 +6,15 @@ Proposed (2026-06-06).
## Context
-The project is named `agentic-kie-deploy`, but every environment to date runs the **single-pass** extractor (`SinglePassExtractor`, [handler.py:84](../../src/extractor/handler.py#L84)). That was a deliberate, measured choice: the offline benchmark ([*When does agency earn its cost?*](https://gabriel.com.gt/blog/when-does-agency-earn-its-cost/)) found that on the Kleister NDA corpus single-pass dominates the matrix—~91.5% F1 at ~$0.007/doc and ~9.8s, while the agentic flavor cost 2–4× the latency and dollars (Claude-standard ran ~$0.038/~65s) for gains "insufficient to justify the overhead," and lite-tier agentic *regressed* more documents than it improved. Agency did not earn its cost, so we shipped the flavor that did.
+The project is named `agentic-kie-deploy`, but every environment to date runs the **single-pass** extractor (`SinglePassExtractor`, [handler.py:84](../../src/extractor/handler.py#L84)). That was a deliberate, measured choice: the offline benchmark ([*When does agency earn its cost?*](https://gabriel.com.gt/blog/when-does-agency-earn-its-cost/)) found that on the Kleister NDA corpus single-pass dominates the matrix—~91.5% F1 at ~$0.007/doc and ~9.8s, while the agentic flavor cost more in latency and dollars—Claude-standard ran ~$0.038/~65s (2–4× single-pass); Gemini Standard agentic is ~$0.011/~14.6s (~1.5×)—for gains "insufficient to justify the overhead," and lite-tier agentic *regressed* more documents than it improved. Agency did not earn its cost, so we shipped the flavor that did.
That verdict is **offline**: a one-shot accuracy/cost eval on 83 dev documents. It says nothing about what agency costs *the deployed system under arrival pressure*—which is a different and harsher cost than per-document dollars. [ADR-0015](0015-load-testing-strategy.md) measured the deployed behavior of the single-pass flavor (both scenarios passed all five SLOs); the symmetric exercise for the agentic flavor has never been run. So three things are simultaneously true:
- The name promises a capability the deployment doesn't currently exercise.
- The strongest decision in the project—*not* shipping agentic—is only half-justified, because it rests on offline numbers and never confronts the deployed envelope.
-- Deploying and load-testing the agentic flavor is where every dormant finding in ADR-0015 stops being hypothetical (the provider-RPM coupling of Finding 1; the errors-alarm-vs-DLQ-alarm question of Finding 2).
+- The offline verdict has a deployed counterpart no benchmark can produce—the agency premium *in the running pipeline* (drain time, queue dwell, the infra cost the eval never saw)—and the exercise gives ADR-0015's dormant findings a live look: Finding 1's provider-RPM coupling gets *measured* (and, at Tier 1, is likely confirmed slack), and Finding 2's errors-alarm-vs-DLQ question becomes testable via a deliberate stressor.
-This ADR settles **how** the agentic flavor is deployed and, more importantly, how the architecture is *re-derived* for it—because the single-pass parameters are correct only for a ~10s, one-LLM-call-per-document workload, and the agentic flavor invalidates every input to that model.
+This ADR settles **how** the agentic flavor is deployed and how its parameter envelope is re-derived. The headline, once the real numbers are in—Gemini agentic is ~1.5× single-pass, not the 2–4× a Claude-standard outlier suggested—is narrower than "re-tune everything": the existing envelope already absorbs agentic at the ADR-0015 bracket, exactly two knobs genuinely move, and the payoff is the *deployed* agency premium plus the capability itself, not a system pushed to breaking.
### The agentic flavor changes the workload model, not just a constant
@@ -23,17 +23,17 @@ This ADR settles **how** the agentic flavor is deployed and, more importantly, h
| Property | Single-pass | Agentic | Consequence |
|---|---|---|---|
| LLM calls per document | exactly 1 | N, data-dependent (1 → `max_iterations`) | request rate decouples from document rate |
-| Service time | ~10s (p99 31s) | ~25–40s expected (2–4×), fatter/bimodal tail | steady-state capacity collapses |
+| Service time | ~10s (p99 31s) | ~14.6s (benchmark, ~1.5×), fatter/bimodal tail | steady-state capacity contracts |
| Input tokens/doc | fixed per document | inflated (re-reads pages across turns) | provider TPM headroom shrinks |
| Failure modes | one call succeeds/fails | loop non-termination, repeated tool error, partial state | `max_iterations` exhaustion → `ExtractionError` |
-The single-pass parameters were *derived* from its workload model (service ~10s → capacity `cap ÷ service` ≈ 60/min at staging; provider draw = throughput because calls = throughput). Re-using those constants for agentic isn't conservative—it's mis-tuned. The honest move is to re-run the derivation, the same way ADR-0015 wrote a model and graded it.
+The single-pass parameters were *derived* from its workload model (service ~10s → capacity `cap ÷ service` ≈ 60/min at staging; provider draw = throughput because calls = throughput). The honest move is to re-run that derivation and see which constants actually move—not to assume the whole envelope is wrong. At only ~1.5× service the queue-dynamics constants mostly still fit; as it turns out (below), one knob (`max_iterations`) is wrong independent of latency, one (`maxReceiveCount`) is worth tightening, and the rest hold.
### The architectural change: one concurrency knob becomes two
In single-pass, the SQS event-source `maximum_concurrency` ([extractor/main.tf:130](../../infra/modules/extractor/main.tf#L130)) does three jobs at once *because one document equals one LLM call*: it caps document parallelism (throughput), caps concurrent LLM requests (the cost-burst guardrail), and bounds the provider RPM draw (Finding 1's coupling). Those collapse into a single number only at a 1:1 doc-to-call ratio.
-Agentic fans out **inside** a document, so documents-in-flight ≠ requests-in-flight: the request side now scales with `cap × calls_per_doc`, which is variable and which the SQS cap does not control. The cap still governs throughput, but the cost-guardrail and provider-coupling jobs need a **second control surface**: a request-level limiter (token bucket / semaphore in the handler) sized against the Gemini RPM/TPM budget. The SQS event-source cap governs *document* parallelism; the in-handler limiter governs *request* parallelism. That decoupling is the real architectural finding—the deployed-infra echo of the offline thesis: agency doesn't merely cost more per document, it breaks the assumption that one knob controls both throughput and provider exposure.
+Agentic fans out **inside** a document, so documents-in-flight ≠ requests-in-flight: the request side now scales with `cap × calls_per_doc`, which is variable and which the SQS cap does not control. The cap still governs throughput, but the cost-guardrail and provider-coupling jobs need a **second control surface**: a request-level limiter (token bucket / semaphore in the handler) sized against the Gemini RPM/TPM budget. The SQS event-source cap governs *document* parallelism; the in-handler limiter governs *request* parallelism. That decoupling is the real architectural finding—the deployed-infra echo of the offline thesis: agency doesn't merely cost more per document, it breaks the assumption that one knob controls both throughput and provider exposure. *Conceptually* real is not the same as *quantitatively* binding, though: at Tier 1 (4,000 RPM) with cap ≤ 25 and ~1.5× service, the request side draws only a few hundred RPM (~410 even at prod's cap 25), ~10× under the ceiling. So the second control surface is a thing to *measure for*, and to reach for as N or the cap grows—not something this deployment needs built today (Finding B).
## Decision
@@ -42,9 +42,9 @@ Agentic fans out **inside** a document, so documents-in-flight ≠ requests-in-f
Introduce `var.extractor_flavor` (`single_pass` | `agentic`, default `single_pass`). It drives two things:
1. **The handler constructor.** `_extractor()` ([handler.py:84-90](../../src/extractor/handler.py#L84-L90)) reads a new `EXTRACTOR_FLAVOR` env var and builds either `SinglePassExtractor(model, schema)` (today) or `AgenticExtractor(model, schema, modality="text", max_iterations=)`. Both are already exported by `agentic_kie`, share the identical `(model, schema)` interface, and raise the same `ExtractionError` the handler already catches into `batchItemFailures` ([handler.py:356](../../src/extractor/handler.py#L356))—so the agentic failure path flows through the existing redrive/DLQ machinery unchanged. `Extractor[NDA]` (also exported) becomes the return type so the cache helper covers both.
-2. **The parameter profile** (below), so the infra constants move *with* the flavor rather than being hand-edited per run.
+2. **The parameter profile** (below), keyed off `extractor_flavor` so the whole envelope—timeout, derived visibility, `maxReceiveCount`, `max_iterations`, the limiter—moves *with* the flavor rather than being hand-edited. Switching any environment's flavor is then a one-variable change, which is the point: re-parametrization should be as cheap as flipping the variable.
-Prod is untouched—it remains single-pass with deletion protection. The agentic profile is applied to **staging** for the characterization run (staging's single-pass baseline already lives in the ADR-0015 artifacts, so re-applying it loses nothing), then reverted. A dedicated `staging-agentic` environment is the cleaner-but-heavier alternative (recorded below).
+Every environment—staging and prod alike—can run **either** flavor, selected per environment at deploy time, with single-pass the default everywhere. Because the full profile follows `extractor_flavor` (above), pointing any environment at agentic is a one-variable change, and pointing it back is the same. The characterization run is done on **staging** first: you validate a new flavor's deployed envelope before offering it to prod, and staging's single-pass baseline already lives in the ADR-0015 artifacts, so flipping it loses nothing. Prod thereby *gains the capability* to run agentic while keeping single-pass (and its deletion protection) by choice—nothing about prod is reverted, because the infra change is a permanent capability, not a temporary patch. A dedicated `staging-agentic` environment remains an option for a continuous side-by-side (recorded below).
### The re-derived parameter profile
@@ -52,10 +52,10 @@ Prod is untouched—it remains single-pass with deletion protection. The agentic
|---|---|---|---|
| `max_iterations` (agent) | n/a | **8–12** (down from the library default 50) | The real cost/latency governor. A doc that can't terminate should fail fast into a *bounded* cost, not burn 50 LLM calls. This is the agentic analog of single-pass's deterministic single call. |
| `modality` | `text` | `text` | Avoids image-token blow-up; keeps the per-doc TPM draw bounded and Finding 1's coupling slack. |
-| Lambda timeout ([main.tf:33](../../infra/main.tf#L33)) | 120s | **300s** (backstop) | Above the worst legitimate `max_iterations`-bounded run (~10 calls), not the governor. A timeout is a crash → retry → wasted spend; `max_iterations` should bite first. |
-| Visibility timeout | 720s (= 120×6) | **1800s** (= 300×6, automatic) | Already *derived* as `timeout × 6` ([queue/main.tf:2](../../infra/modules/queue/main.tf#L2)). Raising the Lambda timeout moves it in lockstep—one knob, not two—and is exactly what keeps SLO 2 from breaching under the longer queue dwell. |
-| `maximum_concurrency` | 10 staging / 25 prod | **10** (held — cost-preserving) | See the fork below. |
-| **(new) request-level limiter** | implicit in the cap | explicit token bucket vs RPM/TPM | in-doc fan-out decoupled it from the cap (see Context). |
+| Lambda timeout ([main.tf:33](../../infra/main.tf#L33)) | 120s | **120s (unchanged)** | Benchmark mean is 14.6s, and `max_iterations` 8–12 bounds the worst legit run to ~40–70s—well under the existing 120s, which already absorbed single-pass's 50s tail. `max_iterations`, not the clock, is the governor; the timeout is a backstop that already has margin. No reason to move it. |
+| Visibility timeout | 720s (= 120×6) | **720s (unchanged)** | Derived as `timeout × 6` ([queue/main.tf:2](../../infra/modules/queue/main.tf#L2)), so it tracks the timeout automatically. The timeout stays at 120s, so this stays at 720s—and at ~290s peak dwell (below) that is ~2.5× headroom. The coupling is worth keeping; it just doesn't need to fire here. |
+| `maximum_concurrency` | 10 staging / 25 prod | **held at the environment's existing cap** (cost-preserving) | A per-environment lever, independent of flavor—not part of the flavor profile; see the fork below. |
+| **(new) request-level limiter** | implicit in the cap | **measure first, build only if the draw warrants** | In-doc fan-out decouples request rate from the cap (see Context), but at Tier 1 + cap ≤ 25 the draw sits ~10× under budget. Conditional on the run's measured provider rate (Finding B), not built up front. |
| `maxReceiveCount` ([queue default](../../infra/modules/queue/variables.tf#L16)) | 3 | **2** | Agentic failures are mostly logic (non-terminating loop, repeated tool error), not transient. Retrying an expensive doomed run 3× triples its cost for nothing. |
| `batch_size` / batching window | 1 / 0 | 1 / 0 (unchanged) | One long ReAct run per invocation is already correct; batching would head-of-line-block. |
| Memory | 2048 MB | 2048 MB (revisit) | Latency is LLM-wall-clock-bound (network), not CPU-bound; memory buys cold-start and glue speed only. A modest lever, left at baseline pending evidence. |
@@ -64,19 +64,19 @@ Prod is untouched—it remains single-pass with deletion protection. The agentic
### The one genuine fork: throughput vs. cost containment
-Capacity is `cap ÷ service_time`. To hold single-pass-like drain behavior (a 200-burst absorbed and drained in a few minutes) the cap would rise from 10 to ~30 to offset the ~3× longer service time. That fights the cost guardrail. The choice:
+Capacity is `cap ÷ service_time`. To hold single-pass-like drain behavior (a 200-burst absorbed and drained in a few minutes) the cap would rise from 10 to ~15 to offset the ~1.5× longer service time. That fights the cost guardrail. The choice:
-- **Throughput-preserving**: raise the cap to ~30, keep drains fast, accept a ~3× wider cost-burst exposure on the *expensive* flavor.
-- **Cost-preserving** (chosen): hold the cap at 10, let SQS hold the backlog longer, and pay for the longer dwell with the higher (auto-derived) visibility timeout.
+- **Throughput-preserving**: raise the cap to ~15, keep drains fast, accept a ~1.5× wider cost-burst exposure on the *expensive* flavor.
+- **Cost-preserving** (chosen): hold the cap at its existing per-environment value (10 on staging), let SQS hold the backlog longer, and pay for the longer dwell with the higher (auto-derived) visibility timeout.
-**We choose cost-preserving.** Agentic is the flavor that already doesn't earn its cost; letting it *also* fan out 30-wide and spike spend is the wrong instinct. Lean harder on the buffer the architecture already has, not on the throttle. That stance is itself the finding: *the right response to a slower, costlier workload is to widen the buffer's job, not the throttle's.* (Flip this one knob and the rest of the profile is unchanged—the decision is isolated by design.)
+**We hold the existing cap (cost-preserving)—but at ~1.5× this is a low-stakes call, not a principled stand.** Raising it to ~15 would cost ~50% more concurrent spend for a faster drain, and either way the 200-doc bracket completes in minutes with the DLQ empty. We change nothing because the cap is a per-environment lever and there's no measured reason to touch it; if drain time ever matters more than spend, ~15 is the one-variable flip. The original *principle*—lean on the buffer, not the throttle—still holds; it just isn't being tested at this scale.
## Pass/fail criteria (SLOs)
The agentic runs reuse ADR-0015's five SLOs, adjusted for the re-derived envelope; criterion 6 is new and is the point of the exercise.
1. **Correctness (primary run)**—200/200 reach `succeeded`; both DLQs at 0. (A *deliberate low-`max_iterations` stressor run* is exempt and expected to DLQ—see criterion 5.)
-2. **No premature redelivery**—`ApproximateAgeOfOldestMessage` stays well under the **new 1800s** visibility timeout and the queue drains to empty. This is the SLO the re-parametrization exists to protect: under the *old* 720s timeout, a 200-burst at ~30s service would push the last messages to ~570s dwell and brush redelivery. Confirming it holds under the new profile—and would not under the old—is the headline.
+2. **No premature redelivery**—`ApproximateAgeOfOldestMessage` stays well under the (unchanged) 720s visibility timeout and the queue drains to empty. At 14.6s actual service time, a 200-burst drains in ~290s—~2.5× under the 720s, so the original threat (the inflated ~30s estimates that pushed dwell toward ~570s) never materializes. Nothing in the envelope needed to move for this; the headline is simply that the queue drains cleanly and dwell stays well under the timeout.
3. **Concurrency & provider rate hold**—peak `ConcurrentExecutions` ≤ cap; zero `Throttles`; **and** the in-handler limiter keeps the LLM request rate under the Gemini RPM/TPM budget (the new control surface working).
4. **Latency—reported, not gated, and compared.** Agentic is slow by design; the e2e/processing percentiles are reported, not failed. The *deliverable* is the agentic-vs-single-pass delta on the same corpus in the same deployed pipeline (criterion 6).
5. **Alarms honest**—primary run: no alarm fires. **Stressor run: this finally exercises Finding 2.** When `max_iterations` is capped low enough that genuinely hard docs exhaust it → `ExtractionError` → retry → DLQ, the prediction (from [handler.py:356](../../src/extractor/handler.py#L356)) is that `Errors` stays flat (failures are reported as `batchItemFailures`, a *successful* invocation) and **only** the `${dlq}-messages-visible` alarm fires, not `${extractor}-errors`. Confirming this on a live run closes Finding 2.
@@ -84,10 +84,10 @@ The agentic runs reuse ADR-0015's five SLOs, adjusted for the re-derived envelop
## Expected behavior (hypotheses to confirm or refute)
-- **Service time** ~25–40s mean (2–4× single-pass), tail bounded by `max_iterations` rather than by a 120s crash; **capacity** collapses from ~60/min to ~15–25/min at the held cap.
-- **Burst**: queue peaks near 200 (as single-pass), but *drains in ~8–13 min* not ~4; concurrency pins at the cap; oldest-message age peaks ~400–600s—comfortably under 1800s, **breaching the old 720s**. DLQ 0 on the primary run; no alarm.
-- **Sustained**: at a rate set to ~22% of the *new* capacity, queue ≈ 0, concurrency hovers low; latency ≈ processing (which is now multi-call and several-fold higher).
-- **Cost**: ~$0.015–0.025/doc on Gemini text-modality agentic (more calls, but no image tokens, cheaper model than the blog's Claude-standard); ~$6–10 for both scenarios.
+- **Service time** ~14.6s mean (benchmark, ~1.5× single-pass), tail bounded by `max_iterations` rather than by a timeout crash; **capacity** contracts from ~60/min to ~41/min at the held cap.
+- **Burst**: queue peaks near 200 (as single-pass), but *drains in ~5 min* not ~3.5; concurrency pins at the cap; oldest-message age peaks ~290s—comfortably under the unchanged 720s timeout. DLQ 0 on the primary run; no alarm.
+- **Sustained**: at a rate set to ~22% of the *new* capacity (~9/min), queue ≈ 0, concurrency hovers low; latency ≈ processing (which is now multi-call).
+- **Cost**: ~$0.011/doc on Gemini text-modality agentic (benchmark); ~$4–5 for both scenarios (200 docs each).
- **Finding 2 stressor**: docs that exhaust the low `max_iterations` DLQ cleanly with `Errors` flat and only the DLQ alarm firing.
If reality diverges, the divergence is the finding.
@@ -100,32 +100,32 @@ No new harness. The ADR-0015 driver under `tests/load/` is **flavor-agnostic**:
Positive:
-- The project earns its name: it deploys `agentic-kie`, both flavors, selected at deploy time.
+- The project earns its name: it deploys `agentic-kie`, both flavors, selectable per environment at deploy time—prod included.
- The offline "agency doesn't earn its cost" verdict gains its deployed counterpart, including the infra cost the benchmark could not measure.
-- Findings 1 and 2 move from hypotheses to live results; the request-level limiter and the cap-decoupling are exercised, not just reasoned about.
+- Finding 2 gets a live test (via the deliberate stressor sub-run); Finding 1 is *measured* and—at Tier 1 with these caps—expected to stay slack, which is itself a recorded result. The cap-decoupling is documented as a watch-item for higher N / prod's cap, not prematurely built.
- The re-parametrization is reusable: the flavor profile is the template for any future heavier workload (multimodal, a larger schema).
Negative:
-- Real work: a handler constructor switch, a new `extractor_flavor` parameter + profile plumbing, and the request-level limiter (genuinely new code, not a config change). More LLM spend (~$6–10) than the single-pass runs.
-- Re-applying staging to the agentic profile displaces its single-pass deployment for the duration (mitigated: the baseline is already captured; or stand up `staging-agentic`).
+- Real work: a handler constructor switch and a new `extractor_flavor` parameter + profile plumbing (plus the request-level limiter *only if* the measured draw warrants it—see Finding B). More LLM spend (~$4–5) than the single-pass runs.
+- An environment runs one flavor at a time, so flipping staging to agentic means it isn't serving single-pass during the run window (mitigated: the baseline is already captured and flip-back is one variable; or stand up a second environment for a continuous side-by-side).
- The agentic flavor does not change the production decision—single-pass remains the default. This is characterization, not a reversal.
Neutral:
-- Prod is untouched. The agentic profile is staging-only and reverted after the run.
+- The production *decision* is unchanged—prod keeps single-pass by choice—while the *capability* to run agentic is added for every environment. Adding the option is not exercising it; the change reverts nothing.
## Findings
(Recorded as discovered; pre-implementation findings first.)
-- **Finding A—`max_iterations` defaults to 50, which is a latency/cost bomb in a Lambda.** The library default lets a single document drive up to 50 LLM calls before raising. Under a 120s function timeout that document would crash (timeout) long before iteration 50, turning a logic problem into an infra fault and a retry. The profile caps it at 8–12 so the *agent* governs cost, and raises the timeout so the cap—not the clock—is what bites. The single-pass flavor never surfaced this because it has no loop.
+- **Finding A—`max_iterations` defaults to 50, which is a latency/cost bomb in a Lambda.** The library default lets a single document drive up to 50 LLM calls before raising. Under a 120s function timeout that document would crash (timeout) long before iteration 50, turning a logic problem into an infra fault and a retry. The profile caps it at 8–12 so the *agent* governs cost and a doomed doc fails fast and cheap—well inside the existing 120s, so the timeout stays put as the backstop. The single-pass flavor never surfaced this because it has no loop.
- **Finding B (to confirm)—the SQS event-source cap stops being a provider-rate control under agentic.** Because in-doc fan-out decouples request rate from document rate, holding `maximum_concurrency` no longer bounds RPM/TPM. Whether the new in-handler limiter is necessary, or Tier 1's headroom absorbs `cap × calls_per_doc` anyway, is a quantity to measure on the run, not assume.
## Alternatives considered
- **Flip the existing staging extractor by env var only (no parameter profile).** Simplest, but re-parametrizing (timeout → visibility, `maxReceiveCount`, the limiter) means editing shared infra by hand per run, and you cannot hold a clean single-pass baseline alongside. Rejected: the flavor and its derived envelope should move together as one parameter.
-- **Throughput-preserving cap (~30).** Holds single-pass drain times. Rejected for v1 (see the fork): it widens cost exposure on the flavor we deploy *because* it's expensive. Recorded as a one-line flip if drain time ever matters more than spend.
+- **Throughput-preserving cap (~15).** Holds single-pass drain times. Not chosen for v1 (see the fork)—though at ~1.5× the cost delta is small enough that this is nearly a coin-flip. Recorded as a one-variable flip if drain time ever matters more than spend.
- **Multimodal / image modality.** Closer to what a "read the document like a human" agent implies, and what some benchmark rows used. Rejected for the deploy: image tokens multiply the TPM draw and re-tighten Finding 1's coupling for no measured accuracy win on this text-heavy NDA corpus. `text` keeps the provider budget slack.
- **Dedicated `staging-agentic` environment.** A true side-by-side: agentic and single-pass live simultaneously, no baseline displacement. Heavier (a full env stand-up, its own alarms, its own teardown) and unnecessary given the baseline is already captured. Recorded as the cleaner path if a *continuous* A/B is ever wanted, per the single-tenant deployment model ([ADR-0013](0013-single-tenant-deployment-model.md)).
- **Don't deploy agentic; explain the name in prose.** The zero-cost path: a README/blog line saying the name refers to the library, which implements both flavors. Rejected as the anticlimactic answer—it leaves the project's strongest decision resting on offline numbers and forgoes the most interesting load-testing exercise available.
From d1402dc51e4c6d9601480e573cefd12e93dfb6b3 Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sat, 6 Jun 2026 21:53:44 -0600
Subject: [PATCH 03/14] Tighten ADR-0016 prose and sharpen parameter envelope
findings
---
docs/adr/0016-agentic-flavor-deployment.md | 29 +++++++++++-----------
docs/adr/README.md | 2 +-
2 files changed, 16 insertions(+), 15 deletions(-)
diff --git a/docs/adr/0016-agentic-flavor-deployment.md b/docs/adr/0016-agentic-flavor-deployment.md
index 75f2eda..6f69c3e 100644
--- a/docs/adr/0016-agentic-flavor-deployment.md
+++ b/docs/adr/0016-agentic-flavor-deployment.md
@@ -14,7 +14,7 @@ That verdict is **offline**: a one-shot accuracy/cost eval on 83 dev documents.
- The strongest decision in the project—*not* shipping agentic—is only half-justified, because it rests on offline numbers and never confronts the deployed envelope.
- The offline verdict has a deployed counterpart no benchmark can produce—the agency premium *in the running pipeline* (drain time, queue dwell, the infra cost the eval never saw)—and the exercise gives ADR-0015's dormant findings a live look: Finding 1's provider-RPM coupling gets *measured* (and, at Tier 1, is likely confirmed slack), and Finding 2's errors-alarm-vs-DLQ question becomes testable via a deliberate stressor.
-This ADR settles **how** the agentic flavor is deployed and how its parameter envelope is re-derived. The headline, once the real numbers are in—Gemini agentic is ~1.5× single-pass, not the 2–4× a Claude-standard outlier suggested—is narrower than "re-tune everything": the existing envelope already absorbs agentic at the ADR-0015 bracket, exactly two knobs genuinely move, and the payoff is the *deployed* agency premium plus the capability itself, not a system pushed to breaking.
+This ADR settles **how** the agentic flavor is deployed and how its parameter envelope is re-derived. On Gemini, agentic costs ~1.5× single-pass in latency and dollars—modest enough that the existing operating envelope already absorbs it at the ADR-0015 bracket. The re-derivation is therefore narrow: exactly two knobs genuinely move, the rest of the envelope holds, and the real payoff is the *deployed* agency premium plus the capability itself.
### The agentic flavor changes the workload model, not just a constant
@@ -33,7 +33,7 @@ The single-pass parameters were *derived* from its workload model (service ~10s
In single-pass, the SQS event-source `maximum_concurrency` ([extractor/main.tf:130](../../infra/modules/extractor/main.tf#L130)) does three jobs at once *because one document equals one LLM call*: it caps document parallelism (throughput), caps concurrent LLM requests (the cost-burst guardrail), and bounds the provider RPM draw (Finding 1's coupling). Those collapse into a single number only at a 1:1 doc-to-call ratio.
-Agentic fans out **inside** a document, so documents-in-flight ≠ requests-in-flight: the request side now scales with `cap × calls_per_doc`, which is variable and which the SQS cap does not control. The cap still governs throughput, but the cost-guardrail and provider-coupling jobs need a **second control surface**: a request-level limiter (token bucket / semaphore in the handler) sized against the Gemini RPM/TPM budget. The SQS event-source cap governs *document* parallelism; the in-handler limiter governs *request* parallelism. That decoupling is the real architectural finding—the deployed-infra echo of the offline thesis: agency doesn't merely cost more per document, it breaks the assumption that one knob controls both throughput and provider exposure. *Conceptually* real is not the same as *quantitatively* binding, though: at Tier 1 (4,000 RPM) with cap ≤ 25 and ~1.5× service, the request side draws only a few hundred RPM (~410 even at prod's cap 25), ~10× under the ceiling. So the second control surface is a thing to *measure for*, and to reach for as N or the cap grows—not something this deployment needs built today (Finding B).
+Agentic fans out **inside** a document, so documents-in-flight ≠ requests-in-flight: the request side now scales with `cap × calls_per_doc`, which is variable and which the SQS cap does not control. The cap still governs throughput, but the cost-guardrail and provider-coupling jobs would call for a **second control surface**: a request-level limiter (token bucket / semaphore in the handler) sized against the Gemini RPM/TPM budget. The SQS event-source cap governs *document* parallelism; the in-handler limiter governs *request* parallelism. That decoupling is the real architectural finding—the deployed-infra echo of the offline thesis: agency doesn't merely cost more per document, it breaks the assumption that one knob controls both throughput and provider exposure. *Conceptually* real is not the same as *quantitatively* binding, though: at Tier 1 (4,000 RPM) with cap ≤ 25 and ~1.5× service, the request side draws only a few hundred RPM (~410 even at prod's cap 25), ~10× under the ceiling. So the second control surface is a thing to *measure for*, and to reach for as N or the cap grows—not something this deployment needs built today (Finding B).
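For orientation, the request-level limiter described above could look roughly like the token bucket below. This is a minimal sketch, not code from the repository: the class name `TokenBucket`, its parameters, and the injectable `clock` are all hypothetical. It also assumes an in-process bucket, which only limits one Lambda execution environment, so each concurrent invocation would need to be sized at roughly the provider budget divided by the cap.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical request-level limiter: about `rate_per_min` acquisitions per minute."""

    def __init__(self, rate_per_min: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_min / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so the first calls are not delayed
        self.clock = clock
        self.updated = clock()

    def _refill(self) -> None:
        # Credit tokens for the time elapsed since the last refill, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated = now

    def try_acquire(self) -> bool:
        """Take one token if one is available; never blocks."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token is available (0.0 if one is ready now)."""
        self._refill()
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate_per_s
```

A handler would call `try_acquire()` (or sleep for `wait_time()`) before each LLM request inside the agent loop. Whether that is worth building is the question Finding B leaves to the measured draw.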
## Decision
@@ -42,7 +42,7 @@ Agentic fans out **inside** a document, so documents-in-flight ≠ requests-in-f
Introduce `var.extractor_flavor` (`single_pass` | `agentic`, default `single_pass`). It drives two things:
1. **The handler constructor.** `_extractor()` ([handler.py:84-90](../../src/extractor/handler.py#L84-L90)) reads a new `EXTRACTOR_FLAVOR` env var and builds either `SinglePassExtractor(model, schema)` (today) or `AgenticExtractor(model, schema, modality="text", max_iterations=)`. Both are already exported by `agentic_kie`, share the identical `(model, schema)` interface, and raise the same `ExtractionError` the handler already catches into `batchItemFailures` ([handler.py:356](../../src/extractor/handler.py#L356))—so the agentic failure path flows through the existing redrive/DLQ machinery unchanged. `Extractor[NDA]` (also exported) becomes the return type so the cache helper covers both.
-2. **The parameter profile** (below), keyed off `extractor_flavor` so the whole envelope—timeout, derived visibility, `maxReceiveCount`, `max_iterations`, the limiter—moves *with* the flavor rather than being hand-edited. Switching any environment's flavor is then a one-variable change, which is the point: re-parametrization should be as cheap as flipping the variable.
+2. **The parameter profile** (below), keyed off `extractor_flavor` so the whole envelope moves *with* the flavor rather than being hand-edited. For agentic, only `max_iterations` and `maxReceiveCount` actually differ from single-pass (the timeout and its derived visibility stay put). Switching any environment's flavor is then a one-variable change, which is the point: re-parametrization should be as cheap as flipping the variable.
Every environment—staging and prod alike—can run **either** flavor, selected per environment at deploy time, with single-pass the default everywhere. Because the full profile follows `extractor_flavor` (above), pointing any environment at agentic is a one-variable change, and pointing it back is the same. The characterization run is done on **staging** first: you validate a new flavor's deployed envelope before offering it to prod, and staging's single-pass baseline already lives in the ADR-0015 artifacts, so flipping it loses nothing. Prod thereby *gains the capability* to run agentic while keeping single-pass (and its deletion protection) by choice—nothing about prod is reverted, because the infra change is a permanent capability, not a temporary patch. A dedicated `staging-agentic` environment remains an option for a continuous side-by-side (recorded below).
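The constructor switch in item 1 can be sketched as follows. The two classes are stand-ins so the example runs on its own; the real ones come from `agentic_kie`, and only the constructor shapes and the `EXTRACTOR_FLAVOR` variable are taken from the text above. The helper name `build_extractor`, the `AGENT_MAX_ITERATIONS` variable, and the default of 10 (picked from inside the 8–12 range) are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Any, Mapping, Optional


# Stand-ins for the classes exported by `agentic_kie` (hypothetical bodies).
@dataclass(frozen=True)
class SinglePassExtractor:
    model: Any
    schema: Any


@dataclass(frozen=True)
class AgenticExtractor:
    model: Any
    schema: Any
    modality: str = "text"
    max_iterations: int = 50  # library default reported in Finding A


def build_extractor(model: Any, schema: Any,
                    env: Optional[Mapping[str, str]] = None):
    """Pick the extractor flavor from the environment; single-pass is the default."""
    env = os.environ if env is None else env
    flavor = env.get("EXTRACTOR_FLAVOR", "single_pass")
    if flavor == "single_pass":
        return SinglePassExtractor(model, schema)
    if flavor == "agentic":
        # Hypothetical variable name; the profile caps iterations at 8-12.
        max_iterations = int(env.get("AGENT_MAX_ITERATIONS", "10"))
        return AgenticExtractor(model, schema, modality="text",
                                max_iterations=max_iterations)
    # Fail loudly on a typo rather than silently falling back to a flavor.
    raise ValueError(f"unknown EXTRACTOR_FLAVOR: {flavor!r}")
```

Raising on an unknown value is a design choice of this sketch: a misspelled flavor should stop the deploy, not quietly run single-pass under an agentic parameter profile.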
@@ -60,26 +60,26 @@ Every environment—staging and prod alike—can run **either** flavor, selected
| `batch_size` / batching window | 1 / 0 | 1 / 0 (unchanged) | One long ReAct run per invocation is already correct; batching would head-of-line-block. |
| Memory | 2048 MB | 2048 MB (revisit) | Latency is LLM-wall-clock-bound (network), not CPU-bound; memory buys cold-start and glue speed only. A modest lever, left at baseline pending evidence. |
-**The downstream half does not move.** The publisher (DynamoDB Streams → analytics S3, 5s batch window) is flavor-agnostic—it runs *after* extraction and neither knows nor cares which extractor wrote the row. The re-derivation is entirely the extractor, its event source, and the provider budget.
+**The downstream half does not move.** The publisher (DynamoDB Streams → analytics S3, 5s batch window) is flavor-agnostic—it runs *after* extraction and neither knows nor cares which extractor wrote the row. The re-derivation touches only the extractor handler (`max_iterations`) and the queue's redrive policy (`maxReceiveCount`)—the event-source mapping, the provider budget, and the whole downstream half stay as they are.
### The one genuine fork: throughput vs. cost containment
Capacity is `cap ÷ service_time`. To hold single-pass-like drain behavior (a 200-burst absorbed and drained in a few minutes) the cap would rise from 10 to ~15 to offset the ~1.5× longer service time. That fights the cost guardrail. The choice:
- **Throughput-preserving**: raise the cap to ~15, keep drains fast, accept a ~1.5× wider cost-burst exposure on the *expensive* flavor.
-- **Cost-preserving** (chosen): hold the cap at its existing per-environment value (10 on staging), let SQS hold the backlog longer, and pay for the longer dwell with the higher (auto-derived) visibility timeout.
+- **Cost-preserving** (chosen): hold the cap at its existing per-environment value (10 on staging) and let SQS hold the backlog longer—which the unchanged 720s visibility timeout already absorbs (~290s dwell, ~2.5× headroom), so nothing has to give for it.
**We hold the existing cap (cost-preserving)—but at ~1.5× this is a low-stakes call, not a principled stand.** Raising it to ~15 would cost ~50% more concurrent spend for a faster drain, and either way the 200-doc bracket completes in minutes with the DLQ empty. We change nothing because the cap is a per-environment lever and there's no measured reason to touch it; if drain time ever matters more than spend, ~15 is the one-variable flip. The original *principle*—lean on the buffer, not the throttle—still holds; it just isn't being tested at this scale.
## Pass/fail criteria (SLOs)
-The agentic runs reuse ADR-0015's five SLOs, adjusted for the re-derived envelope; criterion 6 is new and is the point of the exercise.
+The agentic runs reuse ADR-0015's five SLOs—only SLO 4 changes, made flavor-aware (Finding C)—and add criterion 6, which is the point of the exercise.
1. **Correctness (primary run)**—200/200 reach `succeeded`; both DLQs at 0. (A *deliberate low-`max_iterations` stressor run* is exempt and expected to DLQ—see criterion 5.)
-2. **No premature redelivery**—`ApproximateAgeOfOldestMessage` stays well under the (unchanged) 720s visibility timeout and the queue drains to empty. At 14.6s actual service time, a 200-burst drains in ~290s—~2.5× under the 720s, so the original threat (the inflated ~30s estimates that pushed dwell toward ~570s) never materializes. Nothing in the envelope needed to move for this; the headline is simply that the queue drains cleanly and dwell stays well under the timeout.
-3. **Concurrency & provider rate hold**—peak `ConcurrentExecutions` ≤ cap; zero `Throttles`; **and** the in-handler limiter keeps the LLM request rate under the Gemini RPM/TPM budget (the new control surface working).
-4. **Latency—reported, not gated, and compared.** Agentic is slow by design; the e2e/processing percentiles are reported, not failed. The *deliverable* is the agentic-vs-single-pass delta on the same corpus in the same deployed pipeline (criterion 6).
-5. **Alarms honest**—primary run: no alarm fires. **Stressor run: this finally exercises Finding 2.** When `max_iterations` is capped low enough that genuinely hard docs exhaust it → `ExtractionError` → retry → DLQ, the prediction (from [handler.py:356](../../src/extractor/handler.py#L356)) is that `Errors` stays flat (failures are reported as `batchItemFailures`, a *successful* invocation) and **only** the `${dlq}-messages-visible` alarm fires, not `${extractor}-errors`. Confirming this on a live run closes Finding 2.
+2. **No premature redelivery**—`ApproximateAgeOfOldestMessage` stays well under the 720s visibility timeout and the queue drains to empty. At ~14.6s service time, a 200-burst drains in ~290s, ~2.5× under the 720s, so no message ages into a redelivery. Nothing in the envelope had to move for this; the queue simply drains cleanly and dwell stays well under the timeout.
+3. **Concurrency & provider rate hold**—peak `ConcurrentExecutions` ≤ cap; zero `Throttles`; **and** the measured LLM request rate stays under the Gemini RPM/TPM budget. This is the live read on Findings 1/B: at these caps it should sit ~10× under, which is also the test of whether a request-level limiter is needed at all—if the draw is that slack, it isn't built.
+4. **Latency—reported and compared, not gated (for agentic).** The *deliverable* is the agentic-vs-single-pass delta on the same corpus in the same deployed pipeline (criterion 6), not a pass/fail bar—agentic is slow by design. **This is not what the harness does today:** as built, SLO 4 *gates* processing p90 (both scenarios) and sustained e2e p90, on thresholds derived from single-pass's <10s benchmark, and a failed SLO hard-fails the run. Agentic trips those bars, so making SLO 4 flavor-aware—reporting rather than gating—is required work; see Finding C.
+5. **Alarms honest**—primary run: no alarm fires. **Stressor run: this finally exercises Finding 2.** A small, separate run—~20 documents with `max_iterations` forced very low (1–2) so they reliably exhaust it → `ExtractionError` → retry → (at `maxReceiveCount=2`) DLQ. The prediction (from [handler.py:356](../../src/extractor/handler.py#L356)) is that `Errors` stays flat (failures are reported as `batchItemFailures`, a *successful* invocation) and **only** the `${dlq}-messages-visible` alarm fires, not `${extractor}-errors`. Confirming this on a live run closes Finding 2.
6. **The deployed agency premium (new)**—cost/doc and e2e-latency, agentic vs. single-pass, measured not benchmarked: the offline "agency doesn't earn its cost" verdict, plus the infra cost the benchmark never saw (slower drain, the retune this ADR documents).
## Expected behavior (hypotheses to confirm or refute)
@@ -87,7 +87,7 @@ The agentic runs reuse ADR-0015's five SLOs, adjusted for the re-derived envelop
- **Service time** ~14.6s mean (benchmark, ~1.5× single-pass), tail bounded by `max_iterations` rather than by a timeout crash; **capacity** contracts from ~60/min to ~41/min at the held cap.
- **Burst**: queue peaks near 200 (as single-pass), but *drains in ~5 min* not ~3.5; concurrency pins at the cap; oldest-message age peaks ~290s—comfortably under the unchanged 720s timeout. DLQ 0 on the primary run; no alarm.
- **Sustained**: at a rate set to ~22% of the *new* capacity (~9/min), queue ≈ 0, concurrency hovers low; latency ≈ processing (which is now multi-call).
-- **Cost**: ~$0.011/doc on Gemini text-modality agentic (benchmark); ~$4–5 for both scenarios (200 docs each).
+- **Cost**: ~$0.011/doc on Gemini text-modality agentic (benchmark); ~$4–5 for both scenarios (200 docs each), plus pennies for the ~20-doc stressor.
- **Finding 2 stressor**: docs that exhaust the low `max_iterations` DLQ cleanly with `Errors` flat and only the DLQ alarm firing.
If reality diverges, the divergence is the finding.
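The figures in these hypotheses follow from `cap ÷ service_time` and can be re-derived in a few lines. The sketch below uses only numbers quoted in this ADR (cap 10, 14.6s mean service time, 200-doc burst, 720s visibility timeout, ~$0.011/doc). The function names are hypothetical, and the model is the simplest one: every document takes the mean service time and concurrency stays pinned at the cap.

```python
def capacity_per_min(cap: int, service_s: float) -> float:
    """Documents per minute when `cap` workers each take `service_s` per document."""
    return cap / service_s * 60.0


def burst_drain_s(n_docs: int, cap: int, service_s: float) -> float:
    """Seconds to drain a burst of `n_docs`, assuming concurrency pinned at the cap."""
    return n_docs * service_s / cap


CAP = 10               # staging maximum_concurrency (held)
SERVICE_S = 14.6       # agentic benchmark mean
VISIBILITY_S = 120 * 6  # derived as timeout x 6

capacity = capacity_per_min(CAP, SERVICE_S)  # about 41 docs/min
drain = burst_drain_s(200, CAP, SERVICE_S)   # about 292 s, close to 5 min
headroom = VISIBILITY_S / drain              # about 2.5x
sustained_rate = 0.22 * capacity             # about 9 docs/min
llm_cost = 2 * 200 * 0.011                   # about $4.40 for both scenarios

print(f"capacity {capacity:.1f}/min, drain {drain:.0f}s, "
      f"headroom {headroom:.2f}x, sustained {sustained_rate:.1f}/min, "
      f"cost ${llm_cost:.2f}")
```

Because the model ignores cold starts and tail latency, a measured drain somewhat longer than the computed one would not by itself refute the hypothesis.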
@@ -107,7 +107,7 @@ Positive:
Negative:
-- Real work: a handler constructor switch and a new `extractor_flavor` parameter + profile plumbing (plus the request-level limiter *only if* the measured draw warrants it—see Finding B). More LLM spend (~$4–5) than the single-pass runs.
+- Real work: a handler constructor switch, a new `extractor_flavor` parameter + profile plumbing, and a harness change so SLO 4 reports rather than gates agentic latency (Finding C)—plus the request-level limiter *only if* the measured draw warrants it (Finding B). More LLM spend (~$4–5) than the single-pass runs.
- An environment runs one flavor at a time, so flipping staging to agentic means it isn't serving single-pass during the run window (mitigated: the baseline is already captured and flip-back is one variable; or stand up a second environment for a continuous side-by-side).
- The agentic flavor does not change the production decision—single-pass remains the default. This is characterization, not a reversal.
@@ -121,10 +121,11 @@ Neutral:
- **Finding A—`max_iterations` defaults to 50, which is a latency/cost bomb in a Lambda.** The library default lets a single document drive up to 50 LLM calls before raising. Under a 120s function timeout that document would crash (timeout) long before iteration 50, turning a logic problem into an infra fault and a retry. The profile caps it at 8–12 so the *agent* governs cost and a doomed doc fails fast and cheap—well inside the existing 120s, so the timeout stays put as the backstop. The single-pass flavor never surfaced this because it has no loop.
- **Finding B (to confirm)—the SQS event-source cap stops being a provider-rate control under agentic.** Because in-doc fan-out decouples request rate from document rate, holding `maximum_concurrency` no longer bounds RPM/TPM. Whether the new in-handler limiter is necessary, or Tier 1's headroom absorbs `cap × calls_per_doc` anyway, is a quantity to measure on the run, not assume.
+- **Finding C—the harness's latency SLO is hard-gated and would false-fail agentic.** [report.py:24-25](../../tests/load/report.py#L24-L25) hard-codes `PROCESSING_P90_MAX_S = 15` (gated in both scenarios) and `SUSTAINED_E2E_P90_MAX_S = 20` (sustained), and any failed SLO trips `assert not failures` ([test_scenarios.py:86-88](../../tests/load/test_scenarios.py#L86-L88))—so a red SLO 4 fails the whole run, not just the report. Those bars are 1.5× single-pass's <10s benchmark, and single-pass already clears processing p90 by a hair (13.5/13.8s, ADR-0015 Finding 5), so agentic at ~1.5× trips them on the very metric SLO 4 calls informational. Fix: thread `extractor_flavor` into `report.evaluate()` and return `passed=None` for agentic latency—`None` is not `False`, so it doesn't trip the assert, and the harness already uses that exact pattern for the no-data case ([report.py:159](../../tests/load/report.py#L159)). The agentic-vs-single-pass delta (criterion 6) stays the deliverable. Discovered reading the harness while drafting this ADR; lands in the implementation phase.
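The fix proposed in Finding C can be illustrated with a self-contained sketch. It is not the harness code: `SloResult`, `evaluate_processing_p90`, and `failures` are hypothetical names, and the `<=` comparison and the `passed is False` filter are assumptions about how `report.evaluate()` and the `assert not failures` check behave. Only the 15s threshold and the `passed=None` idea come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional

PROCESSING_P90_MAX_S = 15  # threshold quoted in Finding C


@dataclass
class SloResult:
    name: str
    observed: Optional[float]
    passed: Optional[bool]  # None means "reported, not gated"


def evaluate_processing_p90(p90_s: Optional[float],
                            flavor: str = "single_pass") -> SloResult:
    """Gate latency for single-pass; report it without gating for agentic."""
    if p90_s is None or flavor == "agentic":
        # Same shape as the no-data case: neither a pass nor a failure.
        return SloResult("processing_p90", p90_s, None)
    return SloResult("processing_p90", p90_s, p90_s <= PROCESSING_P90_MAX_S)


def failures(results: List[SloResult]) -> List[SloResult]:
    """Only an explicit False counts as a failure; None is informational."""
    return [r for r in results if r.passed is False]
```

With this shape, an agentic p90 above 15s still appears in the report through `observed`, so the agentic-vs-single-pass delta remains available for criterion 6 while the run itself is not failed.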
## Alternatives considered
-- **Flip the existing staging extractor by env var only (no parameter profile).** Simplest, but re-parametrizing (timeout → visibility, `maxReceiveCount`, the limiter) means editing shared infra by hand per run, and you cannot hold a clean single-pass baseline alongside. Rejected: the flavor and its derived envelope should move together as one parameter.
+- **Flip the existing extractor by env var only (no parameter profile).** Simplest, but the agentic flavor still wants `maxReceiveCount` lowered and `max_iterations` set, so an env-var-only flip leaves those to hand-edit per run and can't hold a clean single-pass baseline alongside. Rejected: the flavor and its profile should move together as one variable.
- **Throughput-preserving cap (~15).** Holds single-pass drain times. Not chosen for v1 (see the fork)—though at ~1.5× the cost delta is small enough that this is nearly a coin-flip. Recorded as a one-variable flip if drain time ever matters more than spend.
- **Multimodal / image modality.** Closer to what a "read the document like a human" agent implies, and what some benchmark rows used. Rejected for the deploy: image tokens multiply the TPM draw and re-tighten Finding 1's coupling for no measured accuracy win on this text-heavy NDA corpus. `text` keeps the provider budget slack.
- **Dedicated `staging-agentic` environment.** A true side-by-side: agentic and single-pass live simultaneously, no baseline displacement. Heavier (a full env stand-up, its own alarms, its own teardown) and unnecessary given the baseline is already captured. Recorded as the cleaner path if a *continuous* A/B is ever wanted, per the single-tenant deployment model ([ADR-0013](0013-single-tenant-deployment-model.md)).
@@ -132,4 +133,4 @@ Neutral:
## Post-implementation
-(To be completed after the runs, mirroring ADR-0015: the hypotheses above graded against the artifacts, the deployed agency premium reported, and Findings 1/2/A/B resolved or carried.)
+(To be completed after the runs, mirroring ADR-0015: the hypotheses above graded against the artifacts, the deployed agency premium reported, and Findings 1/2/A/B/C resolved or carried.)
diff --git a/docs/adr/README.md b/docs/adr/README.md
index bffaeb7..c989d05 100644
--- a/docs/adr/README.md
+++ b/docs/adr/README.md
@@ -33,4 +33,4 @@ This directory records the significant architectural decisions made in this proj
| [0013](0013-single-tenant-deployment-model.md) | Single-tenant deployment model | Accepted |
| [0014](0014-split-results-module.md) | Split the results module into publisher and analytics | Accepted |
| [0015](0015-load-testing-strategy.md) | Load-testing strategy | Accepted |
-| [0016](0016-agentic-flavor-deployment.md) | Agentic-flavor deployment and re-parametrization | Proposed |
+| [0016](0016-agentic-flavor-deployment.md) | Agentic-flavor deployment | Proposed |
From 63dccb65139d91e004ac7a9968cba9074250d399 Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sat, 6 Jun 2026 23:46:37 -0600
Subject: [PATCH 04/14] Record post-run TPM correction for Gemini 3 Flash in
ADR-0015
---
docs/adr/0015-load-testing-strategy.md | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/docs/adr/0015-load-testing-strategy.md b/docs/adr/0015-load-testing-strategy.md
index 2b81db8..a231eee 100644
--- a/docs/adr/0015-load-testing-strategy.md
+++ b/docs/adr/0015-load-testing-strategy.md
@@ -38,6 +38,9 @@ There is ample headroom; the provider will not throttle these runs. This is wort
> [!NOTE]
> The `maximum_concurrency` cap is implicitly coupled to the provider's RPM budget: a cap that lets the pipeline issue more RPM than the tier allows turns a burst into DLQ'd documents, not buffered ones. At Tier 1 (4,000 RPM) the staging cap (10 → ~60 RPM) and even the prod cap (25 → ~150 RPM) sit far under the ceiling, so the coupling is currently slack. It is not enforced anywhere in code or config—see Finding 1.
+> [!NOTE]
+> **TPM correction (post-run, 2026-06-07).** The deployed model is **Gemini 3 Flash**, whose Tier-1 input-TPM ceiling is **2M, not the 4M** the table above states. The *measured* burst+sustained peak was **0.317M** (~16% of the 2M ceiling), so the conclusion ("not the binding constraint") held with **~6× headroom**—*more* than the table's ~2×, because the pre-run worst-case draw (~1.8M) sat far above the actual 0.317M, more than offsetting the lower-than-assumed ceiling. RPM and RPD held as predicted. The pre-registered estimates are left intact above; this note records the observed values, per the prediction-then-grade methodology.
+
## Decision
### Scope: end-to-end, through the real front door
@@ -67,7 +70,7 @@ Three reasons this beats a single held-constant document:
**Preserving the controlled-experiment property.** Varying the corpus *and* the arrival pattern at once would change two variables—the confound that made a single document tempting. The fix is to **freeze the sample with a fixed seed and use the identical 200 documents, in the same upload order, for both burst and sustained.** The corpus is then held constant *across* scenarios while varying *within* one: arrival pattern stays the only thing that differs between the two runs, and—better—each document can be paired across runs (same doc, burst vs. sustained) to isolate its queue-wait term cleanly.
-**Sourcing.** The corpus is *not* committed—PDFs would bloat the repo and trip the `check for added large files` hook. A prep step fetches the train partition via the pinned `kleister-nda-preparation` package into a git-ignored directory under `tests/`, so runs are reproducible (pinned package + fixed seed) without versioning the documents. The realized token/size distribution is sanity-checked against the extractor's 120s timeout and the 4M Tier-1 TPM ceiling before a run—a corpus of unusually long NDAs is the one input that could approach either.
+**Sourcing.** The corpus is *not* committed—PDFs would bloat the repo and trip the `check for added large files` hook. A prep step fetches the train partition via the pinned `kleister-nda-preparation` package into a git-ignored directory under `tests/`, so runs are reproducible (pinned package + fixed seed) without versioning the documents. The realized token/size distribution is sanity-checked against the extractor's 120s timeout and the Tier-1 TPM ceiling (2M for the deployed Gemini 3 Flash—corrected post-run; see the provider-budget note above) before a run—a corpus of unusually long NDAs is the one input that could approach either.
### What we measure
From 629cdcf82c43d9c4b0dedd948be875e6d69f87ec Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sat, 6 Jun 2026 23:47:00 -0600
Subject: [PATCH 05/14] Revise ADR-0016 with Finding A superstep correction and
post-run data
---
docs/adr/0016-agentic-flavor-deployment.md | 31 +++++++++++-----------
1 file changed, 16 insertions(+), 15 deletions(-)
diff --git a/docs/adr/0016-agentic-flavor-deployment.md b/docs/adr/0016-agentic-flavor-deployment.md
index 6f69c3e..9d621de 100644
--- a/docs/adr/0016-agentic-flavor-deployment.md
+++ b/docs/adr/0016-agentic-flavor-deployment.md
@@ -6,7 +6,7 @@ Proposed (2026-06-06).
## Context
-The project is named `agentic-kie-deploy`, but every environment to date runs the **single-pass** extractor (`SinglePassExtractor`, [handler.py:84](../../src/extractor/handler.py#L84)). That was a deliberate, measured choice: the offline benchmark ([*When does agency earn its cost?*](https://gabriel.com.gt/blog/when-does-agency-earn-its-cost/)) found that on the Kleister NDA corpus single-pass dominates the matrix—~91.5% F1 at ~$0.007/doc and ~9.8s, while the agentic flavor cost more in latency and dollars—Claude-standard ran ~$0.038/~65s (2–4× single-pass); Gemini Standard agentic is ~$0.011/~14.6s (~1.5×)—for gains "insufficient to justify the overhead," and lite-tier agentic *regressed* more documents than it improved. Agency did not earn its cost, so we shipped the flavor that did.
+The project is named `agentic-kie-deploy`, but every environment to date runs the **single-pass** extractor (`SinglePassExtractor`, [handler.py:84](../../src/extractor/handler.py#L84)). That was a deliberate, measured choice: the offline benchmark ([*When does agency earn its cost?*](https://gabriel.com.gt/blog/when-does-agency-earn-its-cost/)) found that on the Kleister NDA corpus single-pass dominates the matrix—~91.5% F1 at ~$0.007/doc and ~9.8s, while the agentic flavor cost more in latency and dollars—Claude-standard ran ~$0.038/~65s (~5× the dollars, ~6× the latency); Gemini Standard agentic is ~$0.011/~14.6s (~1.5×)—for gains "insufficient to justify the overhead," and lite-tier agentic *regressed* more documents than it improved. Agency did not earn its cost, so we shipped the flavor that did.
That verdict is **offline**: a one-shot accuracy/cost eval on 83 dev documents. It says nothing about what agency costs *the deployed system under arrival pressure*—which is a different and harsher cost than per-document dollars. [ADR-0015](0015-load-testing-strategy.md) measured the deployed behavior of the single-pass flavor (both scenarios passed all five SLOs); the symmetric exercise for the agentic flavor has never been run. So three things are simultaneously true:
@@ -22,7 +22,7 @@ This ADR settles **how** the agentic flavor is deployed and how its parameter en
| Property | Single-pass | Agentic | Consequence |
|---|---|---|---|
-| LLM calls per document | exactly 1 | N, data-dependent (1 → `max_iterations`) | request rate decouples from document rate |
+| LLM calls per document | exactly 1 | N, data-dependent (observed 5–9 in offline traces) | request rate decouples from document rate |
| Service time | ~10s (p99 31s) | ~14.6s (benchmark, ~1.5×), fatter/bimodal tail | steady-state capacity contracts |
| Input tokens/doc | fixed per document | inflated (re-reads pages across turns) | provider TPM headroom shrinks |
| Failure modes | one call succeeds/fails | loop non-termination, repeated tool error, partial state | `max_iterations` exhaustion → `ExtractionError` |
@@ -41,7 +41,7 @@ Agentic fans out **inside** a document, so documents-in-flight ≠ requests-in-f
Introduce `var.extractor_flavor` (`single_pass` | `agentic`, default `single_pass`). It drives two things:
-1. **The handler constructor.** `_extractor()` ([handler.py:84-90](../../src/extractor/handler.py#L84-L90)) reads a new `EXTRACTOR_FLAVOR` env var and builds either `SinglePassExtractor(model, schema)` (today) or `AgenticExtractor(model, schema, modality="text", max_iterations=)`. Both are already exported by `agentic_kie`, share the identical `(model, schema)` interface, and raise the same `ExtractionError` the handler already catches into `batchItemFailures` ([handler.py:356](../../src/extractor/handler.py#L356))—so the agentic failure path flows through the existing redrive/DLQ machinery unchanged. `Extractor[NDA]` (also exported) becomes the return type so the cache helper covers both.
+1. **The handler constructor.** `_extractor()` ([handler.py:84-90](../../src/extractor/handler.py#L84-L90)) reads a new `EXTRACTOR_FLAVOR` env var and builds either `SinglePassExtractor(model, schema)` (today) or `AgenticExtractor(model, schema, modality="text", max_iterations=)`. Both are already exported by `agentic_kie`, share the identical `(model, schema)` interface, and surface failures through the handler's broad `except Exception` ([handler.py:317](../../src/extractor/handler.py#L317)), which already routes them to `batchItemFailures` ([handler.py:356](../../src/extractor/handler.py#L356))—so the agentic failure path (a non-terminating agent's `ExtractionError` included) flows through the existing redrive/DLQ machinery unchanged, caught by the type-agnostic `except` rather than any shared exception class. `Extractor[NDA]` (also exported) becomes the return type so the cache helper covers both.
2. **The parameter profile** (below), keyed off `extractor_flavor` so the whole envelope moves *with* the flavor rather than being hand-edited—of which, for agentic, only `max_iterations` and `maxReceiveCount` actually differ from single-pass (the timeout and its derived visibility stay put). Switching any environment's flavor is then a one-variable change, which is the point: re-parametrization should be as cheap as flipping the variable.
Every environment—staging and prod alike—can run **either** flavor, selected per environment at deploy time, with single-pass the default everywhere. Because the full profile follows `extractor_flavor` (above), pointing any environment at agentic is a one-variable change, and pointing it back is the same. The characterization run is done on **staging** first: you validate a new flavor's deployed envelope before offering it to prod, and staging's single-pass baseline already lives in the ADR-0015 artifacts, so flipping it loses nothing. Prod thereby *gains the capability* to run agentic while keeping single-pass (and its deletion protection) by choice—nothing about prod is reverted, because the infra change is a permanent capability, not a temporary patch. A dedicated `staging-agentic` environment remains an option for a continuous side-by-side (recorded below).
@@ -50,13 +50,14 @@ Every environment—staging and prod alike—can run **either** flavor, selected
| Knob | Single-pass (today) | Agentic profile | Why it moves |
|---|---|---|---|
-| `max_iterations` (agent) | n/a | **8–12** (down from the library default 50) | The real cost/latency governor. A doc that can't terminate should fail fast into a *bounded* cost, not burn 50 LLM calls. This is the agentic analog of single-pass's deterministic single call. |
-| `modality` | `text` | `text` | Avoids image-token blow-up; keeps the per-doc TPM draw bounded and Finding 1's coupling slack. |
-| Lambda timeout ([main.tf:33](../../infra/main.tf#L33)) | 120s | **120s (unchanged)** | Benchmark mean is 14.6s, and `max_iterations` 8–12 bounds the worst legit run to ~40–70s—well under the existing 120s, which already absorbed single-pass's 50s tail. `max_iterations`, not the clock, is the governor; the timeout is a backstop that already has margin. No reason to move it. |
-| Visibility timeout | 720s (= 120×6) | **720s (unchanged)** | Derived as `timeout × 6` ([queue/main.tf:2](../../infra/modules/queue/main.tf#L2)), so it tracks the timeout automatically. The timeout stays at 120s, so this stays at 720s—and at ~290s peak dwell (below) that is ~2.5× headroom. The coupling is worth keeping; it just doesn't need to fire here. |
+| `max_iterations` (agent) | n/a | **~30** (down from the library default 50) | The real cost/latency governor—but it caps LangGraph *supersteps* (`recursion_limit`), ≈ 2× the LLM-call count, *not* LLM calls. Offline traces run 5–9 LLM calls (≈ 9–17 supersteps); the 8–12 first drafted would have clipped every legit run into a false `ExtractionError`. ~30 clears that ceiling with margin and still caps a runaway at ~15 calls. See Finding A. |
+| `max_retries` (agent) | n/a | **3 (unchanged)** | A *third* retry knob, separate from `maxReceiveCount`: `ModelRetryMiddleware` retries each model call up to 3× with backoff on *transient* errors (429/timeout/overload) inside one invocation. Left at 3, but recorded because it interacts with the 120s timeout (transient retries add wall-clock) and because "fail fast and cheap" applies to *logic* failures, not transient ones. |
+| `modality` | `text` | `text` | Avoids image-token blow-up; keeps the per-doc TPM draw bounded. Measured single-pass TPM peaked at 0.317M against the 2M ceiling (~6× headroom); even agentic's per-doc input inflation (~2–4×) at ~0.68× throughput stays well under, so Finding 1's coupling holds slack. |
+| Lambda timeout ([main.tf:33](../../infra/main.tf#L33)) | 120s | **120s (unchanged)** | Benchmark mean is 14.6s, and `max_iterations` ~30 (≈ ~15 LLM calls at ~2s each) bounds the worst run to ~30–50s—well under the existing 120s, which already absorbed single-pass's 50s tail. `max_iterations`, not the clock, is the governor; the timeout is a backstop that already has margin. No reason to move it. |
+| Visibility timeout | 720s (= 120×6) | **720s (unchanged)** | Derived as `timeout × 6` ([queue/main.tf:2](../../infra/modules/queue/main.tf#L2)), so it tracks the timeout automatically. The timeout stays at 120s, so this stays at 720s—and at ~330s peak dwell (below, scaling the measured single-pass baseline) that is ~2.2× headroom. The coupling is worth keeping; it just doesn't need to fire here. |
| `maximum_concurrency` | 10 staging / 25 prod | **held at the environment's existing cap** (cost-preserving) | A per-environment lever, independent of flavor—not part of the flavor profile; see the fork below. |
| **(new) request-level limiter** | implicit in the cap | **measure first, build only if the draw warrants** | In-doc fan-out decouples request rate from the cap (see Context), but at Tier 1 + cap ≤ 25 the draw sits ~10× under budget. Conditional on the run's measured provider rate (Finding B), not built up front. |
-| `maxReceiveCount` ([queue default](../../infra/modules/queue/variables.tf#L16)) | 3 | **2** | Agentic failures are mostly logic (non-terminating loop, repeated tool error), not transient. Retrying an expensive doomed run 3× triples its cost for nothing. |
+| `maxReceiveCount` ([queue default](../../infra/modules/queue/variables.tf#L16)) | 3 | **2** | Agentic failures are mostly logic (non-terminating loop, repeated tool error), not transient. Retrying an expensive doomed run 3× triples its cost for nothing. The value is single-sourced (queue redrive → `SQS_MAX_RECEIVE_COUNT`, [main.tf:91](../../infra/main.tf#L91)), so the flip is one variable—but several descriptions hard-code "maxReceiveCount=3" (the [extractor-errors alarm](../../infra/modules/extractor/main.tf#L137), the [DLQ alarm](../../infra/modules/queue/main.tf#L132), the publisher variable, the README alarm table) and must be updated alongside it. |
| `batch_size` / batching window | 1 / 0 | 1 / 0 (unchanged) | One long ReAct run per invocation is already correct; batching would head-of-line-block. |
| Memory | 2048 MB | 2048 MB (revisit) | Latency is LLM-wall-clock-bound (network), not CPU-bound; memory buys cold-start and glue speed only. A modest lever, left at baseline pending evidence. |
@@ -67,7 +68,7 @@ Every environment—staging and prod alike—can run **either** flavor, selected
Capacity is `cap ÷ service_time`. To hold single-pass-like drain behavior (a 200-burst absorbed and drained in a few minutes) the cap would rise from 10 to ~15 to offset the ~1.5× longer service time. That fights the cost guardrail. The choice:
- **Throughput-preserving**: raise the cap to ~15, keep drains fast, accept a ~1.5× wider cost-burst exposure on the *expensive* flavor.
-- **Cost-preserving** (chosen): hold the cap at its existing per-environment value (10 on staging) and let SQS hold the backlog longer—which the unchanged 720s visibility timeout already absorbs (~290s dwell, ~2.5× headroom), so nothing has to give for it.
+- **Cost-preserving** (chosen): hold the cap at its existing per-environment value (10 on staging) and let SQS hold the backlog longer—which the unchanged 720s visibility timeout already absorbs (~330s dwell, ~2.2× headroom), so nothing has to give for it.
**We hold the existing cap (cost-preserving)—but at ~1.5× this is a low-stakes call, not a principled stand.** Raising it to ~15 would cost ~50% more concurrent spend for a faster drain, and either way the 200-doc bracket completes in minutes with the DLQ empty. We change nothing because the cap is a per-environment lever and there's no measured reason to touch it; if drain time ever matters more than spend, ~15 is the one-variable flip. The original *principle*—lean on the buffer, not the throttle—still holds; it just isn't being tested at this scale.
@@ -76,17 +77,17 @@ Capacity is `cap ÷ service_time`. To hold single-pass-like drain behavior (a 20
The agentic runs reuse ADR-0015's five SLOs—only SLO 4 changes, made flavor-aware (Finding C)—and add criterion 6, which is the point of the exercise.
1. **Correctness (primary run)**—200/200 reach `succeeded`; both DLQs at 0. (A *deliberate low-`max_iterations` stressor run* is exempt and expected to DLQ—see criterion 5.)
-2. **No premature redelivery**—`ApproximateAgeOfOldestMessage` stays well under the 720s visibility timeout and the queue drains to empty. At ~14.6s service time, a 200-burst drains in ~290s, ~2.5× under the 720s, so no message ages into a redelivery. Nothing in the envelope had to move for this; the queue simply drains cleanly and dwell stays well under the timeout.
-3. **Concurrency & provider rate hold**—peak `ConcurrentExecutions` ≤ cap; zero `Throttles`; **and** the measured LLM request rate stays under the Gemini RPM/TPM budget. This is the live read on Findings 1/B: at these caps it should sit ~10× under, which is also the test of whether a request-level limiter is needed at all—if the draw is that slack, it isn't built.
+2. **No premature redelivery**—`ApproximateAgeOfOldestMessage` stays well under the 720s visibility timeout and the queue drains to empty. At ~14.6s service time (~1.5× single-pass), a 200-burst drains in ~5–5.5 min—scaling the *measured* single-pass baseline (3.85 min at 51.9 docs/min, not the theoretical 60/min)—so oldest-message age peaks ~330s, ~2.2× under the 720s, and no message ages into a redelivery. Nothing in the envelope had to move for this; the queue simply drains cleanly and dwell stays well under the timeout.
+3. **Concurrency & provider rate hold**—peak `ConcurrentExecutions` ≤ cap; zero `Throttles`; **and** the measured LLM request rate stays under the Gemini RPM/TPM budget. This is the live read on Findings 1/B: RPM should sit ~10× under, and TPM ~6× under (single-pass peaked 0.317M of the 2M ceiling; agentic inflates per-doc tokens but stays clear)—which is also the test of whether a request-level limiter is needed at all—if the draw is that slack, it isn't built.
4. **Latency—reported and compared, not gated (for agentic).** The *deliverable* is the agentic-vs-single-pass delta on the same corpus in the same deployed pipeline (criterion 6), not a pass/fail bar—agentic is slow by design. **This is not what the harness does today:** as built, SLO 4 *gates* processing p90 (both scenarios) and sustained e2e p90, on thresholds derived from single-pass's <10s benchmark, and a failed SLO hard-fails the run. Agentic trips those bars, so making SLO 4 flavor-aware—reporting rather than gating—is required work; see Finding C.
-5. **Alarms honest**—primary run: no alarm fires. **Stressor run: this finally exercises Finding 2.** A small, separate run—~20 documents with `max_iterations` forced very low (1–2) so they reliably exhaust it → `ExtractionError` → retry → (at `maxReceiveCount=2`) DLQ. The prediction (from [handler.py:356](../../src/extractor/handler.py#L356)) is that `Errors` stays flat (failures are reported as `batchItemFailures`, a *successful* invocation) and **only** the `${dlq}-messages-visible` alarm fires, not `${extractor}-errors`. Confirming this on a live run closes Finding 2.
+5. **Alarms honest**—primary run: no alarm fires. **Stressor run: this finally exercises Finding 2.** A small, separate run—~20 documents with `max_iterations` forced very low (≤4 supersteps, e.g. 2) so they reliably exhaust it → `ExtractionError` → retry → (at `maxReceiveCount=2`) DLQ. The prediction (from [handler.py:356](../../src/extractor/handler.py#L356)) is that `Errors` stays flat (failures are reported as `batchItemFailures`, a *successful* invocation) and **only** the `${dlq}-messages-visible` alarm fires, not `${extractor}-errors`. Confirming this on a live run closes Finding 2.
6. **The deployed agency premium (new)**—cost/doc and e2e-latency, agentic vs. single-pass, measured not benchmarked: the offline "agency doesn't earn its cost" verdict, plus the infra cost the benchmark never saw (slower drain, the retune this ADR documents).
## Expected behavior (hypotheses to confirm or refute)
- **Service time** ~14.6s mean (benchmark, ~1.5× single-pass), tail bounded by `max_iterations` rather than by a timeout crash; **capacity** contracts from ~60/min to ~41/min at the held cap.
-- **Burst**: queue peaks near 200 (as single-pass), but *drains in ~5 min* not ~3.5; concurrency pins at the cap; oldest-message age peaks ~290s—comfortably under the unchanged 720s timeout. DLQ 0 on the primary run; no alarm.
-- **Sustained**: at a rate set to ~22% of the *new* capacity (~9/min), queue ≈ 0, concurrency hovers low; latency ≈ processing (which is now multi-call).
+- **Burst**: queue peaks near 200 (as single-pass), but *drains in ~5–5.5 min*—vs the *measured* single-pass ~3.85 min, not the theoretical ~3.5; concurrency pins at the cap; oldest-message age peaks ~330s—comfortably under the unchanged 720s timeout (~2.2×). DLQ 0 on the primary run; no alarm.
+- **Sustained**: holding ADR-0015's 0.22 doc/s arrival schedule (the harness fixes the 900s window, so the rate is flavor-independent), now ~32% of the reduced ~41/min capacity—still below capacity, so queue ≈ 0 and concurrency hovers low (perhaps a touch above single-pass's peak-5, given the fatter tail—ADR-0015 Finding 3—but under the cap); latency ≈ processing (which is now multi-call).
- **Cost**: ~$0.011/doc on Gemini text-modality agentic (benchmark); ~$4–5 for both scenarios (200 docs each), plus pennies for the ~20-doc stressor.
- **Finding 2 stressor**: docs that exhaust the low `max_iterations` DLQ cleanly with `Errors` flat and only the DLQ alarm firing.
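The arithmetic behind these hypotheses can be reproduced with a short sketch. The function names are hypothetical; the inputs (cap 10, ~10s and ~14.6s service times, the 0.22 doc/s sustained schedule, the 720s visibility timeout, ~330s peak dwell, ~$0.011/doc) are the figures quoted in this ADR, and the model is the simple `cap ÷ service_time` relation from the fork above, which ignores cold starts and tail effects.

```python
def capacity_per_min(cap: int, service_time_s: float) -> float:
    """Documents per minute the pipeline can serve: cap / service_time."""
    return cap / service_time_s * 60.0


def utilization(arrival_per_s: float, cap: int, service_time_s: float) -> float:
    """Fraction of capacity consumed by a steady arrival rate."""
    return (arrival_per_s * 60.0) / capacity_per_min(cap, service_time_s)


def visibility_headroom(visibility_timeout_s: float, peak_dwell_s: float) -> float:
    """How many times the peak queue dwell fits inside the visibility timeout."""
    return visibility_timeout_s / peak_dwell_s


def run_cost(docs: int, cost_per_doc: float) -> float:
    """Expected LLM spend for a run, ignoring retries."""
    return docs * cost_per_doc


if __name__ == "__main__":
    print(f"single-pass capacity: {capacity_per_min(10, 10.0):.1f} docs/min")
    print(f"agentic capacity:     {capacity_per_min(10, 14.6):.1f} docs/min")
    print(f"sustained load:       {utilization(0.22, 10, 14.6):.0%} of capacity")
    print(f"visibility headroom:  {visibility_headroom(720, 330):.1f}x")
    print(f"two-scenario cost:    ${run_cost(400, 0.011):.2f}")
```

Any of these inputs can be swapped for measured values once the characterization run produces them, which is how the hypotheses get graded.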
@@ -119,7 +120,7 @@ Neutral:
(Recorded as discovered; pre-implementation findings first.)
-- **Finding A—`max_iterations` defaults to 50, which is a latency/cost bomb in a Lambda.** The library default lets a single document drive up to 50 LLM calls before raising. Under a 120s function timeout that document would crash (timeout) long before iteration 50, turning a logic problem into an infra fault and a retry. The profile caps it at 8–12 so the *agent* governs cost and a doomed doc fails fast and cheap—well inside the existing 120s, so the timeout stays put as the backstop. The single-pass flavor never surfaced this because it has no loop.
+- **Finding A—`max_iterations` is a LangGraph `recursion_limit` (supersteps ≈ 2× LLM calls), not an LLM-call count; the right value is ~30—not the 8–12 first drafted, nor the library default 50.** `AgenticExtractor` passes `max_iterations` straight to LangGraph's `recursion_limit`, and `create_agent` builds a two-node loop (model ↔ tools), so K LLM calls cost ≈ 2K−1 supersteps. Offline traces show 5–9 LLM calls (≈ 9–17 supersteps); the higher "count tools and chains → ~45" figure is LangSmith *trace spans*, not supersteps, and doesn't bind this knob. Two corrections follow: (1) the draft's 8–12 would clip *every* legit run into a false `ExtractionError`—even a 5-call run needs ~9 supersteps; (2) the "default 50 crashes on the 120s timeout" mechanism is model-specific—it held for the slow Claude run (~65s) but not for the deployed Gemini Flash (~14.6s for 5–9 calls, ~2s/call), where even 50 supersteps (~25 calls) is ~50s and raises `ExtractionError` *cleanly* rather than crashing. So the reason to lower it is cost/latency containment of a doomed doc (cap a runaway at ~15 calls / ~$0.02 / ~30s) and margin above the legit ceiling, not crash-avoidance. ~30 clears the observed 9-call ceiling with ~1.7× margin; the characterization run validates it—a *legit* doc DLQ'ing via recursion means it's still too tight. The single-pass flavor never surfaced any of this because it has no loop.
- **Finding B (to confirm)—the SQS event-source cap stops being a provider-rate control under agentic.** Because in-doc fan-out decouples request rate from document rate, holding `maximum_concurrency` no longer bounds RPM/TPM. Whether the new in-handler limiter is necessary, or Tier 1's headroom absorbs `cap × calls_per_doc` anyway, is a quantity to measure on the run, not assume.
- **Finding C—the harness's latency SLO is hard-gated and would false-fail agentic.** [report.py:24-25](../../tests/load/report.py#L24-L25) hard-codes `PROCESSING_P90_MAX_S = 15` (gated in both scenarios) and `SUSTAINED_E2E_P90_MAX_S = 20` (sustained), and any failed SLO trips `assert not failures` ([test_scenarios.py:86-88](../../tests/load/test_scenarios.py#L86-L88))—so a red SLO 4 fails the whole run, not just the report. Those bars are 1.5× single-pass's <10s benchmark, and single-pass already clears processing p90 by a hair (13.5/13.8s, ADR-0015 Finding 5), so agentic at ~1.5× trips them on the very metric SLO 4 calls informational. Fix: thread `extractor_flavor` into `report.evaluate()` and return `passed=None` for agentic latency—`None` is not `False`, so it doesn't trip the assert, and the harness already uses that exact pattern for the no-data case ([report.py:159](../../tests/load/report.py#L159)). The agentic-vs-single-pass delta (criterion 6) stays the deliverable. Discovered reading the harness while drafting this ADR; lands in the implementation phase.
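Finding A's superstep arithmetic can be checked with a small sketch. It assumes the two-node model described there (K LLM calls cost about 2K−1 supersteps) and treats a run as clipped when it needs more supersteps than the limit allows. The exact boundary behavior is LangGraph's to define, so the strict inequality below is an assumption and the helper names are hypothetical.

```python
def supersteps_for_calls(llm_calls: int) -> int:
    """Supersteps for K LLM calls in a model <-> tools loop: 2K - 1."""
    return 2 * llm_calls - 1


def max_calls_within(recursion_limit: int) -> int:
    """Largest K whose 2K - 1 supersteps still fit under the limit."""
    return (recursion_limit + 1) // 2


def clipped_runs(recursion_limit: int, observed_calls=range(5, 10)) -> list[int]:
    """Observed call counts (5-9 in the offline traces) that would exceed the limit."""
    return [k for k in observed_calls if supersteps_for_calls(k) > recursion_limit]


if __name__ == "__main__":
    for limit in (8, 30, 50):
        print(
            f"limit={limit}: runaway capped at {max_calls_within(limit)} calls, "
            f"clipped legit runs: {clipped_runs(limit)}"
        )
    # Margin of the chosen ~30 over the observed 9-call (17-superstep) ceiling.
    print(f"margin at 30: {30 / supersteps_for_calls(9):.2f}x")
```

Under this model a limit of 30 admits every observed run and caps a runaway at 15 calls, while a limit of 8 rejects even the shortest 5-call run.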
From 30acd99e5911a586096ef3f3efb858765ba0b343 Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sun, 7 Jun 2026 00:01:48 -0600
Subject: [PATCH 06/14] Correct Gemini Tier-1 RPM/RPD ceilings in ADR-0015 and
ADR-0016
---
docs/adr/0015-load-testing-strategy.md | 3 +++
docs/adr/0016-agentic-flavor-deployment.md | 8 ++++----
2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/docs/adr/0015-load-testing-strategy.md b/docs/adr/0015-load-testing-strategy.md
index a231eee..7e65b5e 100644
--- a/docs/adr/0015-load-testing-strategy.md
+++ b/docs/adr/0015-load-testing-strategy.md
@@ -41,6 +41,9 @@ There is ample headroom; the provider will not throttle these runs. This is wort
> [!NOTE]
> **TPM correction (post-run, 2026-06-07).** The deployed model is **Gemini 3 Flash**, whose Tier-1 input-TPM ceiling is **2M, not the 4M** the table above states. The *measured* burst+sustained peak was **0.317M** (~16% of the 2M ceiling), so the conclusion ("not the binding constraint") held with **~6× headroom**—*more* than the table's ~2×, because the pre-run worst-case draw (~1.8M) sat far above the actual 0.317M, more than offsetting the lower-than-assumed ceiling. RPM and RPD held as predicted. The pre-registered estimates are left intact above; this note records the observed values, per the prediction-then-grade methodology.
+> [!NOTE]
+> **RPM/RPD ceiling correction (post-run, 2026-06-07).** The table above and the `maximum_concurrency` coupling note that follows it both state the Gemini Tier-1 ceilings as **4,000 RPM** and **~150,000 RPD**; the deployed Gemini 3 Flash key's actual Tier-1 ceilings are **1,000 RPM** and **10,000 RPD**. The *draws* held as predicted (~60 RPM at staging concurrency, ~400/day across both runs), so the conclusion ("not the binding constraint") is unchanged—but against the corrected ceilings the true headroom is **~16× on RPM** (not the tabulated ~65×) and **~25× on RPD** (not ~375×), still ample. As with the TPM note above, the pre-registered estimates are left intact, per the prediction-then-grade methodology; this note records the corrected ceilings.
+
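The headroom figures in the two correction notes are simple ratios of ceiling to draw. A minimal sketch, using only the corrected ceilings and the draws quoted in those notes (the dictionary layout and the `headroom` helper are illustrative, not part of the harness):

```python
def headroom(ceiling: float, draw: float) -> float:
    """How many times the observed draw fits under the provider ceiling."""
    return ceiling / draw


# (corrected Tier-1 ceiling, draw) per the notes above
BUDGETS = {
    "RPM": (1_000, 60),             # ~60 RPM at staging concurrency
    "RPD": (10_000, 400),           # ~400 requests/day across both runs
    "input TPM": (2_000_000, 317_000),  # measured burst+sustained peak
}

if __name__ == "__main__":
    for name, (ceiling, draw) in BUDGETS.items():
        print(f"{name}: {headroom(ceiling, draw):.1f}x headroom")
```

The same ratio applied to the originally tabulated ceilings gives the larger pre-registered headroom figures, which is why the conclusion survives the correction.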
## Decision
### Scope: end-to-end, through the real front door
diff --git a/docs/adr/0016-agentic-flavor-deployment.md b/docs/adr/0016-agentic-flavor-deployment.md
index 9d621de..b7427f1 100644
--- a/docs/adr/0016-agentic-flavor-deployment.md
+++ b/docs/adr/0016-agentic-flavor-deployment.md
@@ -33,7 +33,7 @@ The single-pass parameters were *derived* from its workload model (service ~10s
In single-pass, the SQS event-source `maximum_concurrency` ([extractor/main.tf:130](../../infra/modules/extractor/main.tf#L130)) does three jobs at once *because one document equals one LLM call*: it caps document parallelism (throughput), caps concurrent LLM requests (the cost-burst guardrail), and bounds the provider RPM draw (Finding 1's coupling). Those collapse into a single number only at a 1:1 doc-to-call ratio.
-Agentic fans out **inside** a document, so documents-in-flight ≠ requests-in-flight: the request side now scales with `cap × calls_per_doc`, which is variable and which the SQS cap does not control. The cap still governs throughput, but the cost-guardrail and provider-coupling jobs would call for a **second control surface**: a request-level limiter (token bucket / semaphore in the handler) sized against the Gemini RPM/TPM budget. The SQS event-source cap governs *document* parallelism; the in-handler limiter governs *request* parallelism. That decoupling is the real architectural finding—the deployed-infra echo of the offline thesis: agency doesn't merely cost more per document, it breaks the assumption that one knob controls both throughput and provider exposure. *Conceptually* real is not the same as *quantitatively* binding, though: at Tier 1 (4,000 RPM) with cap ≤ 25 and ~1.5× service, the request side draws only a few hundred RPM (~410 even at prod's cap 25), ~10× under the ceiling. So the second control surface is a thing to *measure for*, and to reach for as N or the cap grows—not something this deployment needs built today (Finding B).
+Agentic fans out **inside** a document, so documents-in-flight ≠ requests-in-flight: the request side now scales with `cap × calls_per_doc`, which is variable and which the SQS cap does not control. The cap still governs throughput, but the cost-guardrail and provider-coupling jobs would call for a **second control surface**: a request-level limiter (token bucket / semaphore in the handler) sized against the Gemini RPM/TPM budget. The SQS event-source cap governs *document* parallelism; the in-handler limiter governs *request* parallelism. That decoupling is the real architectural finding—the deployed-infra echo of the offline thesis: agency doesn't merely cost more per document, it breaks the assumption that one knob controls both throughput and provider exposure. *Conceptually* real is not the same as *quantitatively* binding, though: at Tier 1 (1,000 RPM) with ~1.5× service, the request side draws only a few hundred RPM—~164 at staging's cap 10 (~6× under the ceiling), ~410 even at prod's cap 25 (~2.4× under). So the second control surface is a thing to *measure for*, and to reach for as N or the cap grows—comfortably skippable for the staging characterization run, but a thin enough margin at prod's cap that it moves from hypothetical toward real (Finding B).
## Decision
@@ -50,13 +50,13 @@ Every environment—staging and prod alike—can run **either** flavor, selected
| Knob | Single-pass (today) | Agentic profile | Why it moves |
|---|---|---|---|
-| `max_iterations` (agent) | n/a | **~30** (down from the library default 50) | The real cost/latency governor—but it caps LangGraph *supersteps* (`recursion_limit`), ≈ 2× the LLM-call count, *not* LLM calls. Offline traces run 5–9 LLM calls (≈ 9–17 supersteps); the 8–12 first drafted would have clipped every legit run into a false `ExtractionError`. ~30 clears that ceiling with margin and still caps a runaway at ~15 calls. See Finding A. |
+| `max_iterations` (agent) | n/a | **~30** (down from the library default 50) | The real cost/latency governor—but it caps LangGraph *supersteps* (`recursion_limit`), ≈ 2× the LLM-call count, *not* LLM calls. Offline traces run 5–9 LLM calls (≈ 9–17 supersteps). ~30 clears that ceiling with margin and still caps a runaway at ~15 calls. See Finding A. |
| `max_retries` (agent) | n/a | **3 (unchanged)** | A *third* retry knob, separate from `maxReceiveCount`: `ModelRetryMiddleware` retries each model call up to 3× with backoff on *transient* errors (429/timeout/overload) inside one invocation. Left at 3, but recorded because it interacts with the 120s timeout (transient retries add wall-clock) and because "fail fast and cheap" applies to *logic* failures, not transient ones. |
| `modality` | `text` | `text` | Avoids image-token blow-up; keeps the per-doc TPM draw bounded. Measured single-pass TPM peaked at 0.317M against the 2M ceiling (~6× headroom); even agentic's per-doc input inflation (~2–4×) at ~0.68× throughput stays well under, so Finding 1's coupling holds slack. |
| Lambda timeout ([main.tf:33](../../infra/main.tf#L33)) | 120s | **120s (unchanged)** | Benchmark mean is 14.6s, and `max_iterations` ~30 (≈ ~15 LLM calls at ~2s each) bounds the worst run to ~30–50s—well under the existing 120s, which already absorbed single-pass's 50s tail. `max_iterations`, not the clock, is the governor; the timeout is a backstop that already has margin. No reason to move it. |
| Visibility timeout | 720s (= 120×6) | **720s (unchanged)** | Derived as `timeout × 6` ([queue/main.tf:2](../../infra/modules/queue/main.tf#L2)), so it tracks the timeout automatically. The timeout stays at 120s, so this stays at 720s—and at ~330s peak dwell (below, scaling the measured single-pass baseline) that is ~2.2× headroom. The coupling is worth keeping; it just doesn't need to fire here. |
| `maximum_concurrency` | 10 staging / 25 prod | **held at the environment's existing cap** (cost-preserving) | A per-environment lever, independent of flavor—not part of the flavor profile; see the fork below. |
-| **(new) request-level limiter** | implicit in the cap | **measure first, build only if the draw warrants** | In-doc fan-out decouples request rate from the cap (see Context), but at Tier 1 + cap ≤ 25 the draw sits ~10× under budget. Conditional on the run's measured provider rate (Finding B), not built up front. |
+| **(new) request-level limiter** | implicit in the cap | **measure first, build only if the draw warrants** | In-doc fan-out decouples request rate from the cap (see Context), but at Tier 1 the draw sits ~6× under budget at staging's cap 10 (~2.4× at prod's cap 25). Conditional on the run's measured provider rate (Finding B), not built up front. |
| `maxReceiveCount` ([queue default](../../infra/modules/queue/variables.tf#L16)) | 3 | **2** | Agentic failures are mostly logic (non-terminating loop, repeated tool error), not transient. Retrying an expensive doomed run 3× triples its cost for nothing. The value is single-sourced (queue redrive → `SQS_MAX_RECEIVE_COUNT`, [main.tf:91](../../infra/main.tf#L91)), so the flip is one variable—but several descriptions hard-code "maxReceiveCount=3" (the [extractor-errors alarm](../../infra/modules/extractor/main.tf#L137), the [DLQ alarm](../../infra/modules/queue/main.tf#L132), the publisher variable, the README alarm table) and must be updated alongside it. |
| `batch_size` / batching window | 1 / 0 | 1 / 0 (unchanged) | One long ReAct run per invocation is already correct; batching would head-of-line-block. |
| Memory | 2048 MB | 2048 MB (revisit) | Latency is LLM-wall-clock-bound (network), not CPU-bound; memory buys cold-start and glue speed only. A modest lever, left at baseline pending evidence. |
@@ -78,7 +78,7 @@ The agentic runs reuse ADR-0015's five SLOs—only SLO 4 changes, made flavor-aw
1. **Correctness (primary run)**—200/200 reach `succeeded`; both DLQs at 0. (A *deliberate low-`max_iterations` stressor run* is exempt and expected to DLQ—see criterion 5.)
2. **No premature redelivery**—`ApproximateAgeOfOldestMessage` stays well under the 720s visibility timeout and the queue drains to empty. At ~14.6s service time (~1.5× single-pass), a 200-burst drains in ~5–5.5 min—scaling the *measured* single-pass baseline (3.85 min at 51.9 docs/min, not the theoretical 60/min)—so oldest-message age peaks ~330s, ~2.2× under the 720s, and no message ages into a redelivery. Nothing in the envelope had to move for this; the queue simply drains cleanly and dwell stays well under the timeout.
-3. **Concurrency & provider rate hold**—peak `ConcurrentExecutions` ≤ cap; zero `Throttles`; **and** the measured LLM request rate stays under the Gemini RPM/TPM budget. This is the live read on Findings 1/B: RPM should sit ~10× under, and TPM ~6× under (single-pass peaked 0.317M of the 2M ceiling; agentic inflates per-doc tokens but stays clear)—which is also the test of whether a request-level limiter is needed at all—if the draw is that slack, it isn't built.
+3. **Concurrency & provider rate hold**—peak `ConcurrentExecutions` ≤ cap; zero `Throttles`; **and** the measured LLM request rate stays under the Gemini RPM/TPM budget. This is the live read on Findings 1/B: at staging's cap 10 RPM should sit ~6× under, and TPM likewise ~6× under (single-pass peaked 0.317M of the 2M ceiling; agentic inflates per-doc tokens but stays clear)—which is also the test of whether a request-level limiter is needed at all—if the draw is that slack, it isn't built.
4. **Latency—reported and compared, not gated (for agentic).** The *deliverable* is the agentic-vs-single-pass delta on the same corpus in the same deployed pipeline (criterion 6), not a pass/fail bar—agentic is slow by design. **This is not what the harness does today:** as built, SLO 4 *gates* processing p90 (both scenarios) and sustained e2e p90, on thresholds derived from single-pass's <10s benchmark, and a failed SLO hard-fails the run. Agentic trips those bars, so making SLO 4 flavor-aware—reporting rather than gating—is required work; see Finding C.
5. **Alarms honest**—primary run: no alarm fires. **Stressor run: this finally exercises Finding 2.** A small, separate run—~20 documents with `max_iterations` forced very low (≤4 supersteps, e.g. 2) so they reliably exhaust it → `ExtractionError` → retry → (at `maxReceiveCount=2`) DLQ. The prediction (from [handler.py:356](../../src/extractor/handler.py#L356)) is that `Errors` stays flat (failures are reported as `batchItemFailures`, a *successful* invocation) and **only** the `${dlq}-messages-visible` alarm fires, not `${extractor}-errors`. Confirming this on a live run closes Finding 2.
6. **The deployed agency premium (new)**—cost/doc and e2e-latency, agentic vs. single-pass, measured not benchmarked: the offline "agency doesn't earn its cost" verdict, plus the infra cost the benchmark never saw (slower drain, the retune this ADR documents).
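
The RPM headroom figures quoted in this ADR revision (~164 RPM at staging's cap 10, ~410 at prod's cap 25, against the corrected 1,000 RPM Tier-1 ceiling) can be reproduced with a short calculation. This is a sketch under stated assumptions, not the author's derivation: the function names are hypothetical, and `CALLS_PER_DOC = 4` is an assumed value picked only because it reproduces the quoted draws at the ~14.6s service time; the ADR text in this patch does not state it.

```python
# Sketch of the request-rate arithmetic. Names and CALLS_PER_DOC are assumptions.
SERVICE_S = 14.6      # agentic mean service time per document (from the ADR)
CALLS_PER_DOC = 4     # ASSUMED: chosen to reproduce the ADR's ~164 / ~410 RPM
RPM_CEILING = 1_000   # corrected Gemini Tier-1 ceiling (from the ADR)


def request_rpm(
    cap: int,
    calls_per_doc: float = CALLS_PER_DOC,
    service_s: float = SERVICE_S,
) -> float:
    """Steady-state LLM requests per minute with `cap` documents in flight."""
    docs_per_min = cap * 60.0 / service_s
    return docs_per_min * calls_per_doc


def headroom(cap: int, ceiling: float = RPM_CEILING) -> float:
    """How many times the request draw fits under the provider ceiling."""
    return ceiling / request_rpm(cap)


if __name__ == "__main__":
    for env, cap in (("staging", 10), ("prod", 25)):
        print(f"{env}: ~{request_rpm(cap):.0f} RPM, ~{headroom(cap):.1f}x under")
```

The draw is linear in both the cap and the calls-per-document count, which is why the margin that looks comfortable at staging's cap thins at prod's.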
From 6aef92062756bb31f3cdd7c1f15172ceac9b94c6 Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sun, 7 Jun 2026 00:11:43 -0600
Subject: [PATCH 07/14] Accept ADR-0016 agentic-flavor deployment
---
docs/adr/0016-agentic-flavor-deployment.md | 4 ++--
docs/adr/README.md | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/docs/adr/0016-agentic-flavor-deployment.md b/docs/adr/0016-agentic-flavor-deployment.md
index b7427f1..eea1463 100644
--- a/docs/adr/0016-agentic-flavor-deployment.md
+++ b/docs/adr/0016-agentic-flavor-deployment.md
@@ -2,7 +2,7 @@
## Status
-Proposed (2026-06-06).
+Accepted (2026-06-07).
## Context
@@ -51,7 +51,7 @@ Every environment—staging and prod alike—can run **either** flavor, selected
| Knob | Single-pass (today) | Agentic profile | Why it moves |
|---|---|---|---|
| `max_iterations` (agent) | n/a | **~30** (down from the library default 50) | The real cost/latency governor—but it caps LangGraph *supersteps* (`recursion_limit`), ≈ 2× the LLM-call count, *not* LLM calls. Offline traces run 5–9 LLM calls (≈ 9–17 supersteps). ~30 clears that ceiling with margin and still caps a runaway at ~15 calls. See Finding A. |
-| `max_retries` (agent) | n/a | **3 (unchanged)** | A *third* retry knob, separate from `maxReceiveCount`: `ModelRetryMiddleware` retries each model call up to 3× with backoff on *transient* errors (429/timeout/overload) inside one invocation. Left at 3, but recorded because it interacts with the 120s timeout (transient retries add wall-clock) and because "fail fast and cheap" applies to *logic* failures, not transient ones. |
+| `max_retries` (agent) | n/a | **3** | A *third* retry knob, separate from `maxReceiveCount`: `ModelRetryMiddleware` retries each model call up to 3× with backoff on *transient* errors (429/timeout/overload) inside one invocation. Left at 3, but recorded because it interacts with the 120s timeout (transient retries add wall-clock) and because "fail fast and cheap" applies to *logic* failures, not transient ones. |
| `modality` | `text` | `text` | Avoids image-token blow-up; keeps the per-doc TPM draw bounded. Measured single-pass TPM peaked at 0.317M against the 2M ceiling (~6× headroom); even agentic's per-doc input inflation (~2–4×) at ~0.68× throughput stays well under, so Finding 1's coupling holds slack. |
| Lambda timeout ([main.tf:33](../../infra/main.tf#L33)) | 120s | **120s (unchanged)** | Benchmark mean is 14.6s, and `max_iterations` ~30 (≈ ~15 LLM calls at ~2s each) bounds the worst run to ~30–50s—well under the existing 120s, which already absorbed single-pass's 50s tail. `max_iterations`, not the clock, is the governor; the timeout is a backstop that already has margin. No reason to move it. |
| Visibility timeout | 720s (= 120×6) | **720s (unchanged)** | Derived as `timeout × 6` ([queue/main.tf:2](../../infra/modules/queue/main.tf#L2)), so it tracks the timeout automatically. The timeout stays at 120s, so this stays at 720s—and at ~330s peak dwell (below, scaling the measured single-pass baseline) that is ~2.2× headroom. The coupling is worth keeping; it just doesn't need to fire here. |
diff --git a/docs/adr/README.md b/docs/adr/README.md
index c989d05..729bc47 100644
--- a/docs/adr/README.md
+++ b/docs/adr/README.md
@@ -33,4 +33,4 @@ This directory records the significant architectural decisions made in this proj
| [0013](0013-single-tenant-deployment-model.md) | Single-tenant deployment model | Accepted |
| [0014](0014-split-results-module.md) | Split the results module into publisher and analytics | Accepted |
| [0015](0015-load-testing-strategy.md) | Load-testing strategy | Accepted |
-| [0016](0016-agentic-flavor-deployment.md) | Agentic-flavor deployment | Proposed |
+| [0016](0016-agentic-flavor-deployment.md) | Agentic-flavor deployment | Accepted |
From fa7feb61887176adaea71603e03724dada982767 Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sun, 7 Jun 2026 10:58:26 -0600
Subject: [PATCH 08/14] Set extractor_flavor per environment in tfvars
---
infra/envs/local.tfvars | 3 ++-
infra/envs/prod.tfvars | 5 +++--
infra/envs/staging.tfvars | 5 +++--
3 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/infra/envs/local.tfvars b/infra/envs/local.tfvars
index ad4bd7c..244aca5 100644
--- a/infra/envs/local.tfvars
+++ b/infra/envs/local.tfvars
@@ -1 +1,2 @@
-environment = "local"
+environment = "local"
+extractor_flavor = "agentic"
diff --git a/infra/envs/prod.tfvars b/infra/envs/prod.tfvars
index 195908d..0513964 100644
--- a/infra/envs/prod.tfvars
+++ b/infra/envs/prod.tfvars
@@ -1,2 +1,3 @@
-environment = "prod"
-alarm_email = "gafnts@gmail.com"
+environment = "prod"
+extractor_flavor = "single_pass"
+alarm_email = "gafnts@gmail.com"
diff --git a/infra/envs/staging.tfvars b/infra/envs/staging.tfvars
index 9eac881..53f51a5 100644
--- a/infra/envs/staging.tfvars
+++ b/infra/envs/staging.tfvars
@@ -1,2 +1,3 @@
-environment = "staging"
-alarm_email = "gafnts@gmail.com"
+environment = "staging"
+extractor_flavor = "agentic"
+alarm_email = "gafnts@gmail.com"
From 494e88cd3db8133bbc231e1607a274f05ef43587 Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sun, 7 Jun 2026 11:05:45 -0600
Subject: [PATCH 09/14] Inject EXTRACTOR_FLAVOR and conditional
EXTRACTOR_MAX_ITERATIONS into Lambda env
---
infra/modules/extractor/main.tf | 24 +++++++++++++++---------
infra/modules/extractor/variables.tf | 16 ++++++++++++++++
2 files changed, 31 insertions(+), 9 deletions(-)
diff --git a/infra/modules/extractor/main.tf b/infra/modules/extractor/main.tf
index 7a7c8fd..6318e90 100644
--- a/infra/modules/extractor/main.tf
+++ b/infra/modules/extractor/main.tf
@@ -100,14 +100,20 @@ resource "aws_lambda_function" "extractor" {
}
environment {
- variables = {
- LLM_MODEL = var.llm_model
- LLM_PROVIDER_SECRET_ARN = var.llm_provider_secret_arn
- LANGSMITH_SECRET_ARN = var.langsmith_secret_arn
- LANGSMITH_PROJECT = var.langsmith_project
- RESULTS_TABLE_NAME = var.results_table_name
- SQS_MAX_RECEIVE_COUNT = tostring(var.queue_max_receive_count)
- }
+ # Single-pass leaves max_iterations null and omits EXTRACTOR_MAX_ITERATIONS
+ # entirely rather than carry a dead var; only agentic sets it.
+ variables = merge(
+ {
+ LLM_MODEL = var.llm_model
+ LLM_PROVIDER_SECRET_ARN = var.llm_provider_secret_arn
+ LANGSMITH_SECRET_ARN = var.langsmith_secret_arn
+ LANGSMITH_PROJECT = var.langsmith_project
+ RESULTS_TABLE_NAME = var.results_table_name
+ SQS_MAX_RECEIVE_COUNT = tostring(var.queue_max_receive_count)
+ EXTRACTOR_FLAVOR = var.extractor_flavor
+ },
+ var.max_iterations != null ? { EXTRACTOR_MAX_ITERATIONS = tostring(var.max_iterations) } : {}
+ )
}
tags = {
@@ -134,7 +140,7 @@ resource "aws_lambda_event_source_mapping" "extraction" {
resource "aws_cloudwatch_metric_alarm" "errors" {
alarm_name = "${var.function_name}-errors"
- alarm_description = "Lambda invocations that ended in an unhandled exception. With maxReceiveCount=3 on the queue, a single bad document fires this up to three times before it lands in the DLQ — the alarm is the early-warning signal that the DLQ alarm is the confirmation of."
+ alarm_description = "Lambda invocations that ended in an unhandled exception. A single bad document fires this once per delivery attempt before it lands in the DLQ (this alarm is the early warning; the DLQ alarm is the confirmation)."
namespace = "AWS/Lambda"
metric_name = "Errors"
statistic = "Sum"
diff --git a/infra/modules/extractor/variables.tf b/infra/modules/extractor/variables.tf
index 3935228..cbfc253 100644
--- a/infra/modules/extractor/variables.tf
+++ b/infra/modules/extractor/variables.tf
@@ -12,6 +12,22 @@ variable "image_uri" {
}
}
+variable "extractor_flavor" {
+ description = "Which agentic-kie strategy the handler builds, passed through as EXTRACTOR_FLAVOR. 'single_pass' or 'agentic'."
+ type = string
+ default = "single_pass"
+ validation {
+ condition = contains(["single_pass", "agentic"], var.extractor_flavor)
+ error_message = "extractor_flavor must be 'single_pass' or 'agentic'."
+ }
+}
+
+variable "max_iterations" {
+ description = "Agentic-only cap on LangGraph supersteps (recursion_limit, ~2x the LLM-call count), passed as EXTRACTOR_MAX_ITERATIONS. Null for single_pass, which has no loop; only emitted to the function env when set."
+ type = number
+ default = null
+}
+
variable "timeout_seconds" {
description = "Function timeout. The queue's visibility timeout is derived as 6x this value."
type = number
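
The `max_iterations` description above says the cap counts LangGraph supersteps at roughly twice the LLM-call count. A minimal sketch of that conversion follows. The `2n - 1` relation is inferred from the ADR's own figures (5–9 LLM calls ≈ 9–17 supersteps) and assumes a run that strictly alternates model and tool steps and ends on a model step; the function names are hypothetical and real step accounting may differ.

```python
def supersteps_for_calls(llm_calls: int) -> int:
    """Supersteps used by a run of `llm_calls` model steps with a tool step between each.

    ASSUMPTION: strict model/tool alternation ending on a model step (2n - 1).
    """
    if llm_calls < 1:
        raise ValueError("a run makes at least one LLM call")
    return 2 * llm_calls - 1


def max_calls_under_limit(max_iterations: int) -> int:
    """Largest LLM-call count whose supersteps still fit within `max_iterations`."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be positive")
    return (max_iterations + 1) // 2


if __name__ == "__main__":
    print(supersteps_for_calls(5), supersteps_for_calls(9))
    print(max_calls_under_limit(30))
```

Under this reading, a cap of 30 admits runs of up to about 15 LLM calls, which matches the "caps a runaway at ~15 calls" figure in the ADR table.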
From 354f6908a3525b20b7a4e2daee6827998590c8ce Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sun, 7 Jun 2026 11:10:55 -0600
Subject: [PATCH 10/14] Update descriptions to reflect flavor-specific
maxReceiveCount values
---
infra/modules/publisher/variables.tf | 2 +-
infra/modules/queue/main.tf | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/infra/modules/publisher/variables.tf b/infra/modules/publisher/variables.tf
index f05c99a..37365d3 100644
--- a/infra/modules/publisher/variables.tf
+++ b/infra/modules/publisher/variables.tf
@@ -64,7 +64,7 @@ variable "stream_batching_window_seconds" {
}
variable "stream_retry_attempts" {
- description = "Retries before a failed batch lands in the DLQ. Mirrors the extractor's maxReceiveCount=3 for retry-budget symmetry across the pipeline."
+ description = "Retries before a failed batch lands in the DLQ. Mirrors the extractor's single-pass maxReceiveCount (3) for retry-budget symmetry across the pipeline. The publisher is flavor-agnostic, so this holds at 3 even when the extractor tightens to 2 under the agentic flavor."
type = number
default = 3
}
diff --git a/infra/modules/queue/main.tf b/infra/modules/queue/main.tf
index 205f477..82a50bc 100644
--- a/infra/modules/queue/main.tf
+++ b/infra/modules/queue/main.tf
@@ -129,7 +129,7 @@ resource "aws_sqs_queue_policy" "extraction_dlq" {
resource "aws_cloudwatch_metric_alarm" "dlq_messages_visible" {
alarm_name = "${aws_sqs_queue.extraction_dlq.name}-messages-visible"
- alarm_description = "Any message in the DLQ means a document exhausted maxReceiveCount=3 retries. The DLQ alarm is the single source of truth for failed messages."
+ alarm_description = "Any message in the DLQ means a document exhausted its maxReceiveCount retries (3 for single-pass, 2 for agentic). The DLQ alarm is the single source of truth for failed messages."
namespace = "AWS/SQS"
metric_name = "ApproximateNumberOfMessagesVisible"
statistic = "Maximum"
From e7f04337a89121a4962e6f59f4d662a9061f1cc1 Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sun, 7 Jun 2026 11:16:31 -0600
Subject: [PATCH 11/14] Wire extractor_flavor profile through root module and
outputs
---
infra/main.tf | 21 +++++++++++++++++++++
infra/outputs.tf | 5 +++++
infra/variables.tf | 10 ++++++++++
3 files changed, 36 insertions(+)
diff --git a/infra/main.tf b/infra/main.tf
index 19c2046..dd96d10 100644
--- a/infra/main.tf
+++ b/infra/main.tf
@@ -32,6 +32,24 @@ locals {
extractor_timeout_seconds = 120
+ # The parameter profile follows extractor_flavor so the whole envelope moves with
+ # the flavor rather than being hand-edited. Only two knobs differ between flavors:
+ # the agent's max_iterations (n/a for single-pass, which has no loop) and the queue's
+ # maxReceiveCount (agentic failures are mostly logic, not transient, so retrying an
+ # expensive doomed run buys nothing). The timeout, visibility timeout, modality, and
+ # concurrency cap hold across both.
+ flavor_profiles = {
+ single_pass = {
+ max_iterations = null
+ max_receive_count = 3
+ }
+ agentic = {
+ max_iterations = 30
+ max_receive_count = 2
+ }
+ }
+ flavor_profile = local.flavor_profiles[var.extractor_flavor]
+
# Partition root for result objects, single-sourced here and threaded into both
# the publisher (write path) and analytics (Glue/Athena read path) modules.
results_prefix = "extractions"
@@ -67,6 +85,7 @@ module "queue" {
name = "${var.project_name}-${var.environment}-extraction"
source_bucket_name = module.ingestion.bucket_name
lambda_timeout_seconds = local.extractor_timeout_seconds
+ max_receive_count = local.flavor_profile.max_receive_count
alarm_topic_arn = module.alarms.topic_arn
environment = var.environment
}
@@ -82,6 +101,8 @@ module "extractor" {
source = "./modules/extractor"
function_name = "${var.project_name}-${var.environment}-extractor"
image_uri = "${data.aws_ecr_repository.extractor.repository_url}@${var.extractor_image_digest}"
+ extractor_flavor = var.extractor_flavor
+ max_iterations = local.flavor_profile.max_iterations
timeout_seconds = local.extractor_timeout_seconds
memory_mb = 2048
ephemeral_storage_mb = 2048
diff --git a/infra/outputs.tf b/infra/outputs.tf
index fcad498..72661dd 100644
--- a/infra/outputs.tf
+++ b/infra/outputs.tf
@@ -43,6 +43,11 @@ output "results_table_arn" {
value = module.table.table_arn
}
+output "extractor_flavor" {
+ description = "The deployed extraction strategy. Read by the load harness to make SLO 4 (latency) report rather than gate for the agentic flavor, which is slow by design."
+ value = var.extractor_flavor
+}
+
output "extractor_function_name" {
value = module.extractor.function_name
}
diff --git a/infra/variables.tf b/infra/variables.tf
index a7b1d02..8939e87 100644
--- a/infra/variables.tf
+++ b/infra/variables.tf
@@ -35,6 +35,16 @@ variable "llm_model" {
default = "gemini-3-flash-preview"
}
+variable "extractor_flavor" {
+ description = "Which agentic-kie extraction strategy the extractor runs. 'single_pass' issues one structured LLM call; 'agentic' runs a ReAct loop over the document. Selectable per environment at deploy time; drives the whole parameter profile (max_iterations, maxReceiveCount) so re-parametrization is a one-variable flip."
+ type = string
+ default = "single_pass"
+ validation {
+ condition = contains(["single_pass", "agentic"], var.extractor_flavor)
+ error_message = "extractor_flavor must be 'single_pass' or 'agentic'."
+ }
+}
+
variable "alarm_email" {
description = "Email address subscribed to the alarm SNS topic. Leave null to skip the subscription (alarms still fire in CloudWatch, they just don't notify anyone). The recipient must confirm the subscription from their inbox before delivery starts."
type = string
From 219122eefc000eab5097ef32790c35a973e233b1 Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sun, 7 Jun 2026 11:19:42 -0600
Subject: [PATCH 12/14] Select extractor strategy from EXTRACTOR_FLAVOR env var
---
src/extractor/handler.py | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/src/extractor/handler.py b/src/extractor/handler.py
index 2bde83e..6c26974 100644
--- a/src/extractor/handler.py
+++ b/src/extractor/handler.py
@@ -19,7 +19,7 @@
from typing import Any, cast
import boto3
-from agentic_kie import PDFLoader, SinglePassExtractor
+from agentic_kie import AgenticExtractor, Extractor, PDFLoader, SinglePassExtractor
from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.typing import LambdaContext
from botocore.exceptions import ClientError
@@ -81,12 +81,28 @@ def _bootstrap_secrets() -> None:
@cache
-def _extractor() -> SinglePassExtractor[NDA]:
- """Build the key information extractor."""
+def _extractor() -> Extractor[NDA]:
+ """
+ Build the key information extractor for the deployed flavor (ADR-0016).
+
+ ``EXTRACTOR_FLAVOR`` selects the strategy: ``single_pass`` (default) issues
+ one structured LLM call; ``agentic`` runs a ReAct loop over the document,
+ capped at ``EXTRACTOR_MAX_ITERATIONS`` LangGraph supersteps. Both satisfy the
+ ``Extractor`` protocol and share the identical ``(model, schema)`` interface,
+ so the handler's broad ``except`` routes either one's failure—including the
+ agentic non-termination ``ExtractionError``—through the same redrive path.
+ """
_bootstrap_secrets()
model = ChatGoogleGenerativeAI(
model=os.environ["LLM_MODEL"], google_api_key=_llm_api_key()
)
+ if os.environ.get("EXTRACTOR_FLAVOR", "single_pass") == "agentic":
+ return AgenticExtractor(
+ model=model,
+ schema=NDA,
+ modality="text",
+ max_iterations=int(os.environ.get("EXTRACTOR_MAX_ITERATIONS", "30")),
+ )
return SinglePassExtractor(model=model, schema=NDA)
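
ADR-0016 describes a request-level limiter (token bucket or semaphore in the handler) as a second control surface, to be built only if the measured draw warrants it. This series does not build one. The following is a hypothetical sketch of what a token bucket could look like; the class name, its parameters, and the injectable `clock`/`sleep` hooks are all assumptions for illustration, not part of the handler.

```python
import threading
import time


class TokenBucket:
    """Hypothetical limiter: at most `rate_per_min` acquisitions per minute, bursts up to `burst`."""

    def __init__(self, rate_per_min: float, burst: int, clock=time.monotonic, sleep=time.sleep):
        if rate_per_min <= 0 or burst < 1:
            raise ValueError("rate_per_min must be > 0 and burst >= 1")
        self._rate = rate_per_min / 60.0  # tokens added per second
        self._capacity = float(burst)
        self._tokens = float(burst)       # start full so the first calls are not delayed
        self._clock = clock
        self._sleep = sleep
        self._last = clock()
        self._lock = threading.Lock()

    def _refill(self) -> None:
        # Credit tokens for the time elapsed since the last check, capped at capacity.
        now = self._clock()
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        with self._lock:
            self._refill()
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False

    def acquire(self) -> None:
        """Block until a token is available, sleeping one token interval between attempts."""
        while not self.try_acquire():
            self._sleep(1.0 / self._rate)
```

A bucket like this only bounds calls made inside one process. Lambda execution environments do not share memory, so a budget spanning concurrent invocations would need shared state, which this sketch does not attempt.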
From 775aeea35544f081ac73ba85028c74d3eaa8d304 Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sun, 7 Jun 2026 11:28:53 -0600
Subject: [PATCH 13/14] Add flavor-aware load harness and extractor unit tests
---
tests/load/conftest.py | 11 +++++++++
tests/load/measure.py | 1 +
tests/load/report.py | 25 +++++++++++++++++++--
tests/test_extractor.py | 50 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 85 insertions(+), 2 deletions(-)
diff --git a/tests/load/conftest.py b/tests/load/conftest.py
index b6ed5dc..a09f558 100644
--- a/tests/load/conftest.py
+++ b/tests/load/conftest.py
@@ -56,6 +56,15 @@ def extractor_log_group_name() -> str:
return _tf_output("extractor_log_group_name")
+@pytest.fixture(scope="session")
+def extractor_flavor() -> str:
+ """
+ The deployed extraction strategy (ADR-0016), read from the live stack so the
+ report reflects what was actually deployed rather than an operator claim.
+ """
+ return _tf_output("extractor_flavor")
+
+
@pytest.fixture(scope="session")
def load_env(ingestion_bucket: str) -> str:
"""
@@ -82,6 +91,7 @@ def load_targets(
extractor_function_name: str,
publisher_function_name: str,
extractor_log_group_name: str,
+ extractor_flavor: str,
results_table_name: str,
uploader_api_endpoint: str,
analytics_bucket: str,
@@ -104,4 +114,5 @@ def load_targets(
api_id=urlparse(uploader_api_endpoint).hostname.split(".")[0], # type: ignore[union-attr]
analytics_bucket=analytics_bucket,
ingestion_bucket=ingestion_bucket,
+ flavor=extractor_flavor,
)
diff --git a/tests/load/measure.py b/tests/load/measure.py
index 90338a6..0456a31 100644
--- a/tests/load/measure.py
+++ b/tests/load/measure.py
@@ -44,6 +44,7 @@ class Targets:
api_id: str
analytics_bucket: str
ingestion_bucket: str
+ flavor: str = "single_pass" # deployed extraction strategy (ADR-0016)
@property
def alarm_prefix(self) -> str:
diff --git a/tests/load/report.py b/tests/load/report.py
index 1129792..59f3a2d 100644
--- a/tests/load/report.py
+++ b/tests/load/report.py
@@ -153,10 +153,29 @@ def evaluate(
)
)
- # 4. Latency
+ # 4. Latency. Gated for single_pass on the <10s-benchmark-derived bars;
+ # reported, not gated for agentic (ADR-0016 Finding C), which is slow by
+ # design—its deliverable is the agentic-vs-single-pass delta (criterion 6),
+ # not a pass/fail bar. passed=None is not False, so a slow agentic run never
+ # trips the harness's `assert not failures`.
+ agentic = targets.flavor == "agentic"
proc = [r.processing_s for r in ok if r.processing_s is not None]
if not proc:
slos.append(SLO(4, "Latency", None, "no processing data"))
+ elif agentic:
+ proc_p90 = _percentile(proc, 0.9)
+ e2e = [r.total_e2e for r in ok if r.total_e2e is not None]
+ e2e_p90 = _percentile(e2e, 0.9) if e2e else None
+ e2e_str = f"; e2e p90 {e2e_p90:.1f}s" if e2e_p90 is not None else ""
+ slos.append(
+ SLO(
+ 4,
+ "Latency",
+ None,
+ f"processing p90 {proc_p90:.1f}s{e2e_str} "
+ "(agentic: reported, not gated)",
+ )
+ )
else:
proc_p90 = _percentile(proc, 0.9)
if scenario == "sustained":
@@ -223,6 +242,7 @@ def build(
return {
"scenario": scenario,
"env": targets.env,
+ "flavor": targets.flavor,
"n": len(results),
"timestamp": datetime.now(UTC).isoformat(),
"window": layer_a["window"],
@@ -253,7 +273,8 @@ def write_artifact(report: dict[str, Any]) -> Path:
def format_report(report: dict[str, Any]) -> str:
lat = report["latency"]
lines = [
- f"\n=== load report: {report['scenario']} / {report['env']} / n={report['n']} ===",
+ f"\n=== load report: {report['scenario']} / {report['env']} / "
+ f"{report.get('flavor', 'single_pass')} / n={report['n']} ===",
f" {'segment':<12}{'p50':>8}{'p90':>8}{'p99':>8}{'max':>8}",
]
for label, key in [
diff --git a/tests/test_extractor.py b/tests/test_extractor.py
index b708184..b24cefd 100644
--- a/tests/test_extractor.py
+++ b/tests/test_extractor.py
@@ -464,6 +464,8 @@ def test_bootstrap_secrets_hydrates_env_vars(self, monkeypatch):
def test_extractor_bootstraps_then_builds_single_pass(self, monkeypatch):
monkeypatch.setenv("LLM_MODEL", "gemini-fake")
+ # Default flavor is single_pass; ensure no stray env flips it.
+ monkeypatch.delenv("EXTRACTOR_FLAVOR", raising=False)
bootstrap = MagicMock()
monkeypatch.setattr(handler, "_bootstrap_secrets", bootstrap)
monkeypatch.setattr(
@@ -474,8 +476,10 @@ def test_extractor_bootstraps_then_builds_single_pass(self, monkeypatch):
fake_extractor_obj = MagicMock()
model_ctor = MagicMock(return_value=fake_model)
ext_ctor = MagicMock(return_value=fake_extractor_obj)
+ agentic_ctor = MagicMock()
monkeypatch.setattr(handler, "ChatGoogleGenerativeAI", model_ctor)
monkeypatch.setattr(handler, "SinglePassExtractor", ext_ctor)
+ monkeypatch.setattr(handler, "AgenticExtractor", agentic_ctor)
assert handler._extractor() is fake_extractor_obj
bootstrap.assert_called_once()
@@ -483,6 +487,52 @@ def test_extractor_bootstraps_then_builds_single_pass(self, monkeypatch):
model="gemini-fake", google_api_key="fake-api-key"
)
ext_ctor.assert_called_once_with(model=fake_model, schema=NDA)
+ agentic_ctor.assert_not_called()
+
+ def test_extractor_builds_agentic_when_flavor_set(self, monkeypatch):
+ monkeypatch.setenv("LLM_MODEL", "gemini-fake")
+ monkeypatch.setenv("EXTRACTOR_FLAVOR", "agentic")
+ monkeypatch.setenv("EXTRACTOR_MAX_ITERATIONS", "30")
+ bootstrap = MagicMock()
+ monkeypatch.setattr(handler, "_bootstrap_secrets", bootstrap)
+ monkeypatch.setattr(
+ handler, "_llm_api_key", MagicMock(return_value="fake-api-key")
+ )
+
+ fake_model = MagicMock()
+ fake_agentic_obj = MagicMock()
+ model_ctor = MagicMock(return_value=fake_model)
+ single_ctor = MagicMock()
+ agentic_ctor = MagicMock(return_value=fake_agentic_obj)
+ monkeypatch.setattr(handler, "ChatGoogleGenerativeAI", model_ctor)
+ monkeypatch.setattr(handler, "SinglePassExtractor", single_ctor)
+ monkeypatch.setattr(handler, "AgenticExtractor", agentic_ctor)
+
+ assert handler._extractor() is fake_agentic_obj
+ bootstrap.assert_called_once()
+ agentic_ctor.assert_called_once_with(
+ model=fake_model, schema=NDA, modality="text", max_iterations=30
+ )
+ single_ctor.assert_not_called()
+
+ def test_extractor_agentic_defaults_max_iterations_when_env_absent(
+ self, monkeypatch
+ ):
+ monkeypatch.setenv("LLM_MODEL", "gemini-fake")
+ monkeypatch.setenv("EXTRACTOR_FLAVOR", "agentic")
+ monkeypatch.delenv("EXTRACTOR_MAX_ITERATIONS", raising=False)
+ monkeypatch.setattr(handler, "_bootstrap_secrets", MagicMock())
+ monkeypatch.setattr(
+ handler, "_llm_api_key", MagicMock(return_value="fake-api-key")
+ )
+ monkeypatch.setattr(
+ handler, "ChatGoogleGenerativeAI", MagicMock(return_value=MagicMock())
+ )
+ agentic_ctor = MagicMock()
+ monkeypatch.setattr(handler, "AgenticExtractor", agentic_ctor)
+
+ handler._extractor()
+ assert agentic_ctor.call_args.kwargs["max_iterations"] == 30
def test_ls_client_bootstraps_then_returns_cached_singleton(self, monkeypatch):
bootstrap = MagicMock()
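
The `report.py` change in this patch relies on a tri-state result: `passed=None` means "reported, not gated", and only an explicit `False` fails the run. A self-contained sketch of that pattern is below. The `SLO` field names, the `failures` helper, and the `threshold_s` parameter are assumptions made for illustration; the harness's real `SLO` type and thresholds are not shown in this patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SLO:
    number: int
    name: str
    passed: Optional[bool]  # True = pass, False = fail, None = reported only
    detail: str


def failures(slos: list[SLO]) -> list[SLO]:
    """Only an explicit False counts as a failure; None is informational."""
    return [s for s in slos if s.passed is False]


def latency_slo(flavor: str, proc_p90: float, threshold_s: float) -> SLO:
    """Gate latency for single_pass; report it without gating for agentic."""
    if flavor == "agentic":
        detail = f"processing p90 {proc_p90:.1f}s (agentic: reported, not gated)"
        return SLO(4, "Latency", None, detail)
    detail = f"processing p90 {proc_p90:.1f}s vs bar {threshold_s:.1f}s"
    return SLO(4, "Latency", proc_p90 <= threshold_s, detail)


if __name__ == "__main__":
    for flavor in ("single_pass", "agentic"):
        slo = latency_slo(flavor, proc_p90=65.0, threshold_s=10.0)
        print(flavor, slo.passed, len(failures([slo])))
```

Checking `passed is False` rather than `not passed` is the design choice that matters here: a truthiness test would treat `None` as a failure and a slow agentic run would hard-fail the harness.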
From 7d1948572552f72cb2badb84c52b4342481e176f Mon Sep 17 00:00:00 2001
From: Gabriel Fuentes
Date: Sun, 7 Jun 2026 11:30:58 -0600
Subject: [PATCH 14/14] Document extractor_flavor in README and rename CI
workflow
---
.github/workflows/checks.yml | 2 +-
README.md | 9 ++++++---
2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/.github/workflows/checks.yml b/.github/workflows/checks.yml
index 9da7cca..1dab768 100644
--- a/.github/workflows/checks.yml
+++ b/.github/workflows/checks.yml
@@ -1,4 +1,4 @@
-name: Checks
+name: Quality gates
on:
push:
diff --git a/README.md b/README.md
index e7f4427..e545f54 100644
--- a/README.md
+++ b/README.md
@@ -3,6 +3,7 @@
Serverless, event-driven AWS infrastructure for asynchronous key information extraction with LLMs.
+
@@ -117,7 +118,7 @@ The extraction queue sits between the ingestion bucket and the extractor Lambda.
| Lever | Value | What it controls |
|---|---|---|
| Visibility timeout | `6 × lambda_timeout_seconds` (computed) | Hides an in-flight message long enough to cover the worst-case extractor run plus handoff jitter, eliminating the most common SQS+Lambda misconfiguration |
-| `maxReceiveCount` | 3 | Bounds retries on transient failures before the message is shunted to the DLQ |
+| `maxReceiveCount` | 3 (single-pass), 2 (agentic) | Bounds delivery attempts on transient failures before the message is moved to the DLQ. Follows `extractor_flavor`: agentic failures are mostly logic errors (a non-terminating loop) rather than transient ones, so retrying an expensive, doomed run buys nothing ([ADR-0016](docs/adr/0016-agentic-flavor-deployment.md)) |
| Long polling | `receive_wait_time_seconds = 20` | Reduces empty receives and smooths Lambda triggering at no extra cost |
| TLS-only policy | Deny on `aws:SecureTransport = false` (main + DLQ) | Mirrors the bucket's transport posture across the pipeline |
| Source-scoped send | `aws:SourceArn` condition on `events.amazonaws.com` | Closes the confused-deputy class of misconfigurations on the EventBridge → SQS hop |
@@ -160,6 +161,8 @@ The extractor is a container-image Lambda that consumes the extraction queue, ru
| Architecture | `arm64` | ~20% cheaper per GB-second on Graviton; native build on `ubuntu-24.04-arm` so no QEMU emulation |
| `batch_size` | 1 | Per-invocation cost is dominated by the LLM call, so batching does not amortize anything and one-message batches keep the failure model simple |
| `maximum_concurrency` | 10 (staging/local), 25 (prod) | Caps parallel LLM fan-out under an ingestion burst, closing the deferral [ADR-0005](docs/adr/0005-sqs-dlq-retry-topology.md) made |
+| `extractor_flavor` | `single_pass` (default), `agentic` | Which [`agentic-kie`](https://github.com/gafnts/agentic-kie) strategy the handler builds: one structured call vs. a ReAct loop over the document. Selectable per environment at deploy time; it drives the whole parameter profile (`max_iterations`, `maxReceiveCount`), so switching flavors is a one-variable change ([ADR-0016](docs/adr/0016-agentic-flavor-deployment.md)) |
+| `max_iterations` (agentic only) | ~30 | Caps LangGraph supersteps via `recursion_limit`. Supersteps run at ≈ 2× the LLM-call count, so the cap allows roughly 15 LLM calls and bounds a non-terminating agent run. `n/a` for single-pass, which has no loop |
| Idempotency | Conditional `PutItem` + status-guarded `UpdateItem` | At-least-once SQS delivery cannot clobber a terminal row; redelivered terminal messages are a no-op |
| Cold-start | No provisioned concurrency | Async polling model hides the 3–10s container-image cold start from the user |
| Networking | No VPC | Talks only to AWS APIs and external HTTPS endpoints; no NAT cost, no ENI cold-start penalty |
@@ -240,11 +243,11 @@ Eight CloudWatch alarms cover the operational hot path. Each is a 1-of-1 5-minut
| Alarm | Source | Fires when | Why it matters |
|---|---|---|---|
-| `${extractor}-errors` | `AWS/Lambda` `Errors` (Sum) on the extractor | `> 0` over 5 min | Any unhandled exception. With `maxReceiveCount = 3` on the queue, a single bad document fires this up to three times before it lands in the DLQ—the early-warning signal that the DLQ alarm is the confirmation of |
+| `${extractor}-errors` | `AWS/Lambda` `Errors` (Sum) on the extractor | `> 0` over 5 min | Any unhandled exception. A single bad document fires this once per delivery attempt (up to the queue's `maxReceiveCount`: 3 for single-pass, 2 for agentic) before it lands in the DLQ, so it is the early warning that the DLQ alarm later confirms |
| `${extractor}-throttles` | `AWS/Lambda` `Throttles` (Sum) on the extractor | `> 0` over 5 min | Invocations rejected because the function hit its `maximum_concurrency` cap. Throttles mean ingestion is exceeding the planned LLM fan-out budget; either the cap is wrong or there's a burst worth investigating |
| `${presigner}-errors` | `AWS/Lambda` `Errors` (Sum) on the presigner | `> 0` over 5 min | The presigner does one `generate_presigned_url` call—non-zero errors imply an IAM regression or a malformed request that slipped past API Gateway |
| `${presigner}-throttles` | `AWS/Lambda` `Throttles` (Sum) on the presigner | `> 0` over 5 min | The presigner has no reserved or maximum concurrency ([ADR-0010](docs/adr/0010-uploader-module.md)); throttles imply the account concurrency ceiling is being approached |
-| `${dlq}-messages-visible` | `AWS/SQS` `ApproximateNumberOfMessagesVisible` (Max) on the DLQ | `> 0` over 5 min | A message in the DLQ means a document exhausted its three retries. The DLQ is the single source of truth for failed messages ([ADR-0005](docs/adr/0005-sqs-dlq-retry-topology.md)); this alarm is the page on it |
+| `${dlq}-messages-visible` | `AWS/SQS` `ApproximateNumberOfMessagesVisible` (Max) on the DLQ | `> 0` over 5 min | A message in the DLQ means a document exhausted its `maxReceiveCount` delivery attempts (3 single-pass, 2 agentic). The DLQ is the single source of truth for failed messages ([ADR-0005](docs/adr/0005-sqs-dlq-retry-topology.md)); this alarm is the page on it |
| `${publisher}-errors` | `AWS/Lambda` `Errors` (Sum) on the publisher | `> 0` over 5 min | An unhandled exception in the Streams consumer. Result objects silently stop reaching S3 while the extractor keeps writing terminal rows to DynamoDB |
| `${publisher}-throttles` | `AWS/Lambda` `Throttles` (Sum) on the publisher | `> 0` over 5 min | The publisher has no reserved or maximum concurrency; throttles stall result publishing and leave `succeeded`/`failed` rows without matching S3 objects |
| `${publisher-dlq}-messages-visible` | `AWS/SQS` `ApproximateNumberOfMessagesVisible` (Max) on the publisher DLQ | `> 0` over 5 min | A stream batch exhausted `maximum_retry_attempts`. The single source of truth for failed batches, mirroring the extractor DLQ alarm ([ADR-0014](docs/adr/0014-split-results-module.md)) |