gafnts · Jun 7, 2026
diff --git a/.github/workflows/checks.yml b/.github/workflows/checks.yml
@@ -1,4 +1,4 @@
-name: Checks
+name: Quality gates
 
 on:
   push:

diff --git a/README.md b/README.md
@@ -3,6 +3,7 @@
   <strong>Serverless, event-driven AWS infrastructure for asynchronous key information extraction with LLMs.</strong>
 </p>
 <p align="center">
+<a href="https://github.com/gafnts/agentic-kie-deploy/actions/workflows/checks.yml"><img src="https://github.com/gafnts/agentic-kie-deploy/actions/workflows/checks.yml/badge.svg" alt="Quality gates"></a>
 <a href="https://github.com/gafnts/agentic-kie-deploy/actions/workflows/deploy-staging.yml"><img src="https://github.com/gafnts/agentic-kie-deploy/actions/workflows/deploy-staging.yml/badge.svg" alt="Deploy staging"></a>
 <a href="https://github.com/gafnts/agentic-kie-deploy/actions/workflows/deploy-prod.yml"><img src="https://github.com/gafnts/agentic-kie-deploy/actions/workflows/deploy-prod.yml/badge.svg" alt="Deploy prod"></a>
 <a href="LICENSE"><img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License"></a>
@@ -117,7 +118,7 @@ The extraction queue sits between the ingestion bucket and the extractor Lambda.
 | Lever | Value | What it controls |
 |---|---|---|
 | Visibility timeout | `6 × lambda_timeout_seconds` (computed) | Hides an in-flight message long enough to cover the worst-case extractor run plus handoff jitter, eliminating the most common SQS+Lambda misconfiguration |
-| `maxReceiveCount` | 3 | Bounds retries on transient failures before the message is shunted to the DLQ |
+| `maxReceiveCount` | 3 (single-pass), 2 (agentic) | Bounds retries on transient failures before the message is shunted to the DLQ. Follows `extractor_flavor`: agentic failures are mostly logic (a non-terminating loop), not transient, so retrying an expensive doomed run buys nothing ([ADR-0016](docs/adr/0016-agentic-flavor-deployment.md)) |
 | Long polling | `receive_wait_time_seconds = 20` | Reduces empty receives and smooths Lambda triggering at no extra cost |
 | TLS-only policy | Deny on `aws:SecureTransport = false` (main + DLQ) | Mirrors the bucket's transport posture across the pipeline |
 | Source-scoped send | `aws:SourceArn` condition on `events.amazonaws.com` | Closes the confused-deputy class of misconfigurations on the EventBridge → SQS hop |
@@ -160,6 +161,8 @@ The extractor is a container-image Lambda that consumes the extraction queue, ru
 | Architecture | `arm64` | ~20% cheaper per GB-second on Graviton; native build on `ubuntu-24.04-arm` so no QEMU emulation |
 | `batch_size` | 1 | Per-invocation cost is dominated by the LLM call, so batching does not amortize anything and one-message batches keep the failure model simple |
 | `maximum_concurrency` | 10 (staging/local), 25 (prod) | Caps parallel LLM fan-out under an ingestion burst, closing the deferral [ADR-0005](docs/adr/0005-sqs-dlq-retry-topology.md) made |
+| `extractor_flavor` | `single_pass` (default), `agentic` | Which [`agentic-kie`](https://github.com/gafnts/agentic-kie) strategy the handler builds—one structured call vs. a ReAct loop over the document. Selectable per environment at deploy time; it drives the whole parameter profile (`max_iterations`, `maxReceiveCount`) so flipping a flavor is a one-variable change ([ADR-0016](docs/adr/0016-agentic-flavor-deployment.md)) |
+| `max_iterations` (agentic only) | ~30 | Caps LangGraph supersteps (`recursion_limit`, ≈ 2× the LLM-call count), bounding a non-terminating agent run. `n/a` for single-pass, which has no loop |
 | Idempotency | Conditional `PutItem` + status-guarded `UpdateItem` | At-least-once SQS delivery cannot clobber a terminal row; redelivered terminal messages are a no-op |
 | Cold-start | No provisioned concurrency | Async polling model hides the 3–10s container-image cold start from the user |
 | Networking | No VPC | Talks only to AWS APIs and external HTTPS endpoints; no NAT cost, no ENI cold-start penalty |
@@ -240,11 +243,11 @@ Eight CloudWatch alarms cover the operational hot path. Each is a 1-of-1 5-minut
 
 | Alarm | Source | Fires when | Why it matters |
 |---|---|---|---|
-| `${extractor}-errors` | `AWS/Lambda` `Errors` (Sum) on the extractor | `> 0` over 5 min | Any unhandled exception. With `maxReceiveCount = 3` on the queue, a single bad document fires this up to three times before it lands in the DLQ—the early-warning signal that the DLQ alarm is the confirmation of |
+| `${extractor}-errors` | `AWS/Lambda` `Errors` (Sum) on the extractor | `> 0` over 5 min | Any unhandled exception. A single bad document fires this once per delivery attempt (up to the queue's `maxReceiveCount`—3 for single-pass, 2 for agentic) before it lands in the DLQ—the early-warning signal that the DLQ alarm is the confirmation of |
 | `${extractor}-throttles` | `AWS/Lambda` `Throttles` (Sum) on the extractor | `> 0` over 5 min | Invocations rejected because the function hit its `maximum_concurrency` cap. Throttles mean ingestion is exceeding the planned LLM fan-out budget; either the cap is wrong or there's a burst worth investigating |
 | `${presigner}-errors` | `AWS/Lambda` `Errors` (Sum) on the presigner | `> 0` over 5 min | The presigner does one `generate_presigned_url` call—non-zero errors imply an IAM regression or a malformed request that slipped past API Gateway |
 | `${presigner}-throttles` | `AWS/Lambda` `Throttles` (Sum) on the presigner | `> 0` over 5 min | The presigner has no reserved or maximum concurrency ([ADR-0010](docs/adr/0010-uploader-module.md)); throttles imply the account concurrency ceiling is being approached |
-| `${dlq}-messages-visible` | `AWS/SQS` `ApproximateNumberOfMessagesVisible` (Max) on the DLQ | `> 0` over 5 min | A message in the DLQ means a document exhausted its three retries. The DLQ is the single source of truth for failed messages ([ADR-0005](docs/adr/0005-sqs-dlq-retry-topology.md)); this alarm is the page on it |
+| `${dlq}-messages-visible` | `AWS/SQS` `ApproximateNumberOfMessagesVisible` (Max) on the DLQ | `> 0` over 5 min | A message in the DLQ means a document exhausted its `maxReceiveCount` retries (3 single-pass, 2 agentic). The DLQ is the single source of truth for failed messages ([ADR-0005](docs/adr/0005-sqs-dlq-retry-topology.md)); this alarm is the page on it |
 | `${publisher}-errors` | `AWS/Lambda` `Errors` (Sum) on the publisher | `> 0` over 5 min | An unhandled exception in the Streams consumer. Result objects silently stop reaching S3 while the extractor keeps writing terminal rows to DynamoDB |
 | `${publisher}-throttles` | `AWS/Lambda` `Throttles` (Sum) on the publisher | `> 0` over 5 min | The publisher has no reserved or maximum concurrency; throttles stall result publishing and leave `succeeded`/`failed` rows without matching S3 objects |
 | `${publisher-dlq}-messages-visible` | `AWS/SQS` `ApproximateNumberOfMessagesVisible` (Max) on the publisher DLQ | `> 0` over 5 min | A stream batch exhausted `maximum_retry_attempts`. The single source of truth for failed batches, mirroring the extractor DLQ alarm ([ADR-0014](docs/adr/0014-split-results-module.md)) |

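Review note: the README hunks above tie `maxReceiveCount` and `max_iterations` to `extractor_flavor`, and derive the visibility timeout from the Lambda timeout. The sketch below restates that profile in Python as a reading aid. It is not the repository's Terraform or handler code. `FlavorProfile`, `PROFILES`, and both helper functions are hypothetical names, and the agentic `max_iterations` of 30 takes the README's approximate "~30" at face value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlavorProfile:
    """Hypothetical container for the per-flavor values listed in the README tables."""
    max_receive_count: int          # deliveries before SQS moves the message to the DLQ
    max_iterations: Optional[int]   # LangGraph recursion_limit; None when there is no loop


# Values copied from the README diff; 30 stands in for the documented "~30".
PROFILES = {
    "single_pass": FlavorProfile(max_receive_count=3, max_iterations=None),
    "agentic": FlavorProfile(max_receive_count=2, max_iterations=30),
}


def visibility_timeout_seconds(lambda_timeout_seconds: int) -> int:
    """Visibility timeout as documented: 6 x the extractor's Lambda timeout."""
    return 6 * lambda_timeout_seconds


def max_error_alarm_firings(flavor: str) -> int:
    """Upper bound on extractor-errors firings for one bad document.

    Assumes one unhandled exception per delivery attempt, as the alarm table describes.
    """
    return PROFILES[flavor].max_receive_count
```

With the 120s extractor timeout mentioned in the ADR hunk below, `visibility_timeout_seconds(120)` gives 720 seconds, and a bad document can trip the errors alarm at most `max_error_alarm_firings("agentic")` = 2 times before the DLQ alarm confirms it.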
diff --git a/docs/adr/0015-load-testing-strategy.md b/docs/adr/0015-load-testing-strategy.md
@@ -38,6 +38,12 @@ There is ample headroom; the provider will not throttle these runs. This is wort
 > [!NOTE]
 > The `maximum_concurrency` cap is implicitly coupled to the provider's RPM budget: a cap that lets the pipeline issue more RPM than the tier allows turns a burst into DLQ'd documents, not buffered ones. At Tier 1 (4,000 RPM) the staging cap (10 → ~60 RPM) and even the prod cap (25 → ~150 RPM) sit far under the ceiling, so the coupling is currently slack. It is not enforced anywhere in code or config—see Finding 1.
 
+> [!NOTE]
+> **TPM correction (post-run, 2026-06-07).** The deployed model is **Gemini 3 Flash**, whose Tier-1 input-TPM ceiling is **2M, not the 4M** the table above states. The *measured* burst+sustained peak was **0.317M** (~16% of the 2M ceiling), so the conclusion ("not the binding constraint") held with **~6× headroom**—*more* than the table's ~2×, because the pre-run worst-case draw (~1.8M) sat far above the actual 0.317M, more than offsetting the lower-than-assumed ceiling. RPM and RPD held as predicted. The pre-registered estimates are left intact above; this note records the observed values, per the prediction-then-grade methodology.
+
+> [!NOTE]
+> **RPM/RPD ceiling correction (post-run, 2026-06-07).** The table above and the `maximum_concurrency` coupling note that follows it both state the Gemini Tier-1 ceilings as **4,000 RPM** and **~150,000 RPD**; the deployed Gemini 3 Flash key's actual Tier-1 ceilings are **1,000 RPM** and **10,000 RPD**. The *draws* held as predicted (~60 RPM at staging concurrency, ~400/day across both runs), so the conclusion ("not the binding constraint") is unchanged—but against the corrected ceilings the true headroom is **~16× on RPM** (not the tabulated ~65×) and **~25× on RPD** (not ~375×), still ample. As with the TPM note above, the pre-registered estimates are left intact, per the prediction-then-grade methodology; this note records the corrected ceilings.
+
 ## Decision
 
 ### Scope: end-to-end, through the real front door
@@ -67,7 +73,7 @@ Three reasons this beats a single held-constant document:
 
 **Preserving the controlled-experiment property.** Varying the corpus *and* the arrival pattern at once would change two variables—the confound that made a single document tempting. The fix is to **freeze the sample with a fixed seed and use the identical 200 documents, in the same upload order, for both burst and sustained.** The corpus is then held constant *across* scenarios while varying *within* one: arrival pattern stays the only thing that differs between the two runs, and—better—each document can be paired across runs (same doc, burst vs. sustained) to isolate its queue-wait term cleanly.
 
-**Sourcing.** The corpus is *not* committed—PDFs would bloat the repo and trip the `check for added large files` hook. A prep step fetches the train partition via the pinned `kleister-nda-preparation` package into a git-ignored directory under `tests/`, so runs are reproducible (pinned package + fixed seed) without versioning the documents. The realized token/size distribution is sanity-checked against the extractor's 120s timeout and the 4M Tier-1 TPM ceiling before a run—a corpus of unusually long NDAs is the one input that could approach either.
+**Sourcing.** The corpus is *not* committed—PDFs would bloat the repo and trip the `check for added large files` hook. A prep step fetches the train partition via the pinned `kleister-nda-preparation` package into a git-ignored directory under `tests/`, so runs are reproducible (pinned package + fixed seed) without versioning the documents. The realized token/size distribution is sanity-checked against the extractor's 120s timeout and the Tier-1 TPM ceiling (2M for the deployed Gemini 3 Flash—corrected post-run; see the provider-budget note above) before a run—a corpus of unusually long NDAs is the one input that could approach either.
 
 ### What we measure