Enhance agentic extractor deployment and SQS monitoring by gafnts · Pull Request #39 · gafnts/agentic-kie-deploy

gafnts · 2026-06-07T21:16:58Z

This pull request addresses several issues identified during load testing of the agentic flavor deployment, focusing on improving harness reliability, cleanup safety, and the accuracy of post-run reporting. The main changes increase the default harness timeout, refine cleanup logic to avoid phantom DLQ entries, clarify documentation, and enhance the concurrency cap reporting for better diagnostics.

Harness reliability and cleanup safety:

Increased the default await_completion timeout from 600s to 900s in harness.py to ensure that documents failing on their first attempt have enough time for a retry before being marked as "timeout", reducing false SLO 1 failures. Added documentation clarifying the relationship between this timeout and the SQS visibility timeout.
Updated the cleanup function in harness.py to skip deletion of objects and rows for documents still marked as "timeout". This prevents premature deletion that could cause in-flight retries to fail with AccessDenied, which previously resulted in phantom DLQ entries. Added a log message to inform users about skipped documents and the need for manual cleanup after backlog drains.
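
The two harness changes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in harness.py: the `Doc` dataclass, the `"pending"` and `"done"` status strings, the `get_status` and `delete_doc` callables, and the injectable `clock`/`sleep` parameters are all hypothetical names introduced for the example. Only the 900s default and the `"timeout"` status come from the description.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

DEFAULT_TIMEOUT_S = 900  # raised from 600 so a first-attempt failure can still be retried


@dataclass
class Doc:
    doc_id: str
    status: str = "pending"  # hypothetical states: "pending", "done", "timeout"


def await_completion(
    docs: List[Doc],
    get_status: Callable[[str], str],
    timeout_s: float = DEFAULT_TIMEOUT_S,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until every document leaves "pending" or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        for doc in docs:
            if doc.status == "pending":
                doc.status = get_status(doc.doc_id)
        if all(doc.status != "pending" for doc in docs):
            return
        if clock() >= deadline:
            break
        sleep(poll_s)
    # Anything still pending at the deadline is marked "timeout".
    for doc in docs:
        if doc.status == "pending":
            doc.status = "timeout"


def cleanup(docs: List[Doc], delete_doc: Callable[[str], None]) -> List[str]:
    """Delete finished documents; leave timed-out ones for in-flight retries."""
    skipped = []
    for doc in docs:
        if doc.status == "timeout":
            skipped.append(doc.doc_id)  # a retry may still need the object and row
            continue
        delete_doc(doc.doc_id)
    if skipped:
        print(
            f"Skipped {len(skipped)} timed-out document(s); "
            "clean them up manually after the backlog drains."
        )
    return skipped
```

Injecting `clock` and `sleep` is a choice made for this sketch so the timeout path can be exercised without waiting; the real harness may structure this differently.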

Reporting and documentation improvements:

Enhanced the concurrency cap SLO reporting in report.py to clarify that the authoritative concurrency signals are Lambda's CloudWatch metrics, and the SQS in-flight metric is a noisy proxy. The proxy value is now reported but not used for gating SLOs, improving the accuracy and interpretability of test results.
Updated the docstring in extractor/handler.py to reflect that the extraction now uses the deployed flavor's NDA extraction, not just single-pass extraction.
Added detailed post-implementation analysis to the ADR (0016-agentic-flavor-deployment.md), documenting test artifacts, SLO verdicts, findings, and rationale for the above changes, including the root causes and fixes for harness timeouts and DLQ measurement artifacts.
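
The concurrency cap gating described above might look something like the sketch below. It is an illustration only: `ConcurrencyVerdict`, `evaluate_concurrency_cap`, and their fields are hypothetical names, not the contents of report.py. The PR does not say which metric names report.py reads; Lambda's `ConcurrentExecutions` and SQS's `ApproximateNumberOfMessagesNotVisible` would be natural candidates, but that is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConcurrencyVerdict:
    passed: bool                         # decided by the Lambda metric only
    lambda_peak: float                   # authoritative signal
    sqs_in_flight_peak: Optional[float]  # noisy proxy, reported for context
    proxy_exceeded_cap: bool             # informational; never changes `passed`


def evaluate_concurrency_cap(
    cap: float,
    lambda_peak: float,
    sqs_in_flight_peak: Optional[float] = None,
) -> ConcurrencyVerdict:
    """Gate the SLO on the Lambda peak; carry the SQS proxy along as a note."""
    proxy_exceeded = sqs_in_flight_peak is not None and sqs_in_flight_peak > cap
    return ConcurrencyVerdict(
        passed=lambda_peak <= cap,
        lambda_peak=lambda_peak,
        sqs_in_flight_peak=sqs_in_flight_peak,
        proxy_exceeded_cap=proxy_exceeded,
    )
```

With this shape, a run whose SQS in-flight peak overshoots the cap while the Lambda peak stays under it still passes, and the report can flag the proxy overshoot without failing the SLO.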

github-actions · 2026-06-07T21:17:34Z

Terraform Plan · `prod` ✅

terraform -chdir=infra plan -var-file=envs/prod.tfvars
module.publisher.data.archive_file.publisher: Reading...
module.publisher.data.archive_file.publisher: Read complete after 0s [id=a8a217aa670dbf877cef8e62ea09f239480a3857]
module.uploader.data.archive_file.presigner: Reading...
module.uploader.data.archive_file.presigner: Read complete after 0s [id=c6003612e7d0a8992056dd501c589d957f338842]
module.publisher.data.aws_iam_policy_document.assume_role: Reading...
data.aws_secretsmanager_secret.llm_provider: Reading...
module.uploader.aws_cloudwatch_log_group.presigner: Refreshing state... [id=/aws/lambda/agentic-kie-deploy-prod-uploader]
module.uploader.aws_apigatewayv2_api.uploader: Refreshing state... [id=pgt5fz6hfb]
module.queue.aws_sqs_queue.extraction_dlq: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/009160074575/agentic-kie-deploy-prod-extraction-dlq]
module.analytics.aws_glue_catalog_database.results: Refreshing state... [id=009160074575:agentic-kie-deploy_prod_analytics]
module.publisher.aws_cloudwatch_log_group.publisher: Refreshing state... [id=/aws/lambda/agentic-kie-deploy-prod-publisher]
data.aws_secretsmanager_secret.langsmith: Reading...
module.alarms.aws_sns_topic.alarms: Refreshing state... [id=arn:aws:sns:us-east-1:009160074575:agentic-kie-deploy-prod-alarms]
module.table.aws_dynamodb_table.results: Refreshing state... [id=agentic-kie-deploy-prod-results]
module.publisher.data.aws_iam_policy_document.assume_role: Read complete after 0s [id=666922913]
module.extractor.data.aws_iam_policy_document.assume_role: Reading...
module.extractor.data.aws_iam_policy_document.assume_role: Read complete after 0s [id=666922913]
module.publisher.aws_sqs_queue.publisher_dlq: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/009160074575/agentic-kie-deploy-prod-publisher-dlq]
data.aws_secretsmanager_secret.llm_provider: Read complete after 1s [id=arn:aws:secretsmanager:us-east-1:009160074575:secret:agentic-kie-deploy/prod/llm-provider-GRUKBj]
module.uploader.aws_cloudwatch_log_group.api_access: Refreshing state... [id=/aws/apigateway/agentic-kie-deploy-prod-uploader]
module.extractor.aws_cloudwatch_log_group.extractor: Refreshing state... [id=/aws/lambda/agentic-kie-deploy-prod-extractor]
module.uploader.data.aws_iam_policy_document.assume_role: Reading...
module.uploader.data.aws_iam_policy_document.assume_role: Read complete after 0s [id=666922913]
data.aws_ecr_repository.extractor: Reading...
data.aws_caller_identity.current: Reading...
module.publisher.aws_iam_role.publisher: Refreshing state... [id=agentic-kie-deploy-prod-publisher-exec]
module.extractor.aws_iam_role.extractor: Refreshing state... [id=agentic-kie-deploy-prod-extractor-exec]
module.alarms.aws_sns_topic_subscription.email[0]: Refreshing state... [id=arn:aws:sns:us-east-1:009160074575:agentic-kie-deploy-prod-alarms:f9db7014-afdf-495e-b3f7-b445b4855d38]
data.aws_caller_identity.current: Read complete after 0s [id=009160074575]
module.publisher.data.aws_iam_policy_document.publisher_dlq: Reading...
module.publisher.data.aws_iam_policy_document.publisher_dlq: Read complete after 0s [id=379898922]
data.aws_secretsmanager_secret.langsmith: Read complete after 1s [id=arn:aws:secretsmanager:us-east-1:009160074575:secret:agentic-kie-deploy/prod/langsmith-yEQDzt]
module.uploader.aws_iam_role.presigner: Refreshing state... [id=agentic-kie-deploy-prod-uploader-exec]
module.queue.aws_sqs_queue.extraction: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/009160074575/agentic-kie-deploy-prod-extraction]
module.queue.data.aws_iam_policy_document.extraction_dlq: Reading...
module.queue.data.aws_iam_policy_document.extraction_dlq: Read complete after 0s [id=3392646447]
module.publisher.aws_sqs_queue_policy.publisher_dlq: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/009160074575/agentic-kie-deploy-prod-publisher-dlq]
module.queue.aws_sqs_queue_policy.extraction_dlq: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/009160074575/agentic-kie-deploy-prod-extraction-dlq]
module.queue.aws_cloudwatch_metric_alarm.dlq_messages_visible: Refreshing state... [id=agentic-kie-deploy-prod-extraction-dlq-messages-visible]
module.publisher.aws_cloudwatch_metric_alarm.dlq_messages_visible: Refreshing state... [id=agentic-kie-deploy-prod-publisher-dlq-messages-visible]
module.uploader.aws_apigatewayv2_stage.default: Refreshing state... [id=$default]
module.analytics.aws_s3_bucket.results_logs: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098-logs]
module.analytics.aws_s3_bucket.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
data.aws_ecr_repository.extractor: Read complete after 0s [id=agentic-kie-deploy-prod-extractor]
module.ingestion.aws_s3_bucket.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.aws_s3_bucket.ingestion_logs: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae-logs]
module.analytics.aws_s3_bucket.athena_results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd2-athena-results]
module.analytics.aws_s3_bucket_ownership_controls.results_logs: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098-logs]
module.analytics.aws_s3_bucket_server_side_encryption_configuration.results_logs: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098-logs]
module.analytics.aws_s3_bucket_public_access_block.results_logs: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098-logs]
module.analytics.aws_s3_bucket_lifecycle_configuration.results_logs: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098-logs]
module.ingestion.aws_s3_bucket_notification.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.data.aws_iam_policy_document.ingestion_tls_only: Reading...
module.ingestion.aws_s3_bucket_versioning.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.aws_s3_bucket_lifecycle_configuration.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.aws_s3_bucket_ownership_controls.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.aws_s3_bucket_server_side_encryption_configuration.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.data.aws_iam_policy_document.ingestion_tls_only: Read complete after 0s [id=245963188]
module.ingestion.aws_s3_bucket_public_access_block.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.uploader.data.aws_iam_policy_document.presigner: Reading...
module.uploader.data.aws_iam_policy_document.presigner: Read complete after 0s [id=3714044795]
module.queue.aws_cloudwatch_event_rule.object_created: Refreshing state... [id=agentic-kie-deploy-prod-extraction-object-created]
module.extractor.data.aws_iam_policy_document.extractor: Reading...
module.extractor.data.aws_iam_policy_document.extractor: Read complete after 0s [id=1692636987]
module.analytics.aws_glue_catalog_table.extractions: Refreshing state... [id=009160074575:agentic-kie-deploy_prod_analytics:extractions]
module.analytics.aws_s3_bucket_server_side_encryption_configuration.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.analytics.data.aws_iam_policy_document.results_tls_only: Reading...
module.analytics.data.aws_iam_policy_document.results_tls_only: Read complete after 0s [id=3894737737]
module.analytics.aws_s3_bucket_lifecycle_configuration.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.analytics.aws_s3_bucket_notification.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.analytics.aws_s3_bucket_logging.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.analytics.aws_s3_bucket_versioning.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.analytics.aws_s3_bucket_public_access_block.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.ingestion.aws_s3_bucket_policy.ingestion_tls_only: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.analytics.aws_s3_bucket_ownership_controls.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.uploader.aws_iam_role_policy.presigner: Refreshing state... [id=agentic-kie-deploy-prod-uploader-exec:presigner]
module.ingestion.aws_s3_bucket_logging.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.aws_s3_bucket_lifecycle_configuration.ingestion_logs: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae-logs]
module.ingestion.aws_s3_bucket_ownership_controls.ingestion_logs: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae-logs]
module.ingestion.aws_s3_bucket_public_access_block.ingestion_logs: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae-logs]
module.ingestion.aws_s3_bucket_server_side_encryption_configuration.ingestion_logs: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae-logs]
module.extractor.aws_iam_role_policy.extractor: Refreshing state... [id=agentic-kie-deploy-prod-extractor-exec:extractor]
module.analytics.aws_s3_bucket_public_access_block.athena_results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd2-athena-results]
module.analytics.aws_s3_bucket_ownership_controls.athena_results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd2-athena-results]
module.analytics.aws_athena_workgroup.results: Refreshing state... [id=agentic-kie-deploy-prod-analytics]
module.analytics.aws_s3_bucket_lifecycle_configuration.athena_results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd2-athena-results]
module.analytics.data.aws_iam_policy_document.athena_results_tls_only: Reading...
module.analytics.data.aws_iam_policy_document.athena_results_tls_only: Read complete after 0s [id=2676727016]
module.analytics.aws_s3_bucket_server_side_encryption_configuration.athena_results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd2-athena-results]
module.analytics.aws_s3_bucket_policy.results_tls_only: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.queue.aws_cloudwatch_event_target.extraction_queue: Refreshing state... [id=agentic-kie-deploy-prod-extraction-object-created-terraform-20260531234921286200000001]
module.queue.data.aws_iam_policy_document.extraction_queue: Reading...
module.queue.data.aws_iam_policy_document.extraction_queue: Read complete after 0s [id=4087084627]
module.uploader.aws_lambda_function.presigner: Refreshing state... [id=agentic-kie-deploy-prod-uploader]
module.extractor.aws_lambda_function.extractor: Refreshing state... [id=agentic-kie-deploy-prod-extractor]
module.publisher.data.aws_iam_policy_document.publisher: Reading...
module.publisher.data.aws_iam_policy_document.publisher: Read complete after 0s [id=2357955661]
module.analytics.aws_s3_bucket_policy.athena_results_tls_only: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd2-athena-results]
module.queue.aws_sqs_queue_policy.extraction: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/009160074575/agentic-kie-deploy-prod-extraction]
module.publisher.aws_iam_role_policy.publisher: Refreshing state... [id=agentic-kie-deploy-prod-publisher-exec:publisher]
module.publisher.aws_lambda_function.publisher: Refreshing state... [id=agentic-kie-deploy-prod-publisher]
module.extractor.aws_cloudwatch_metric_alarm.throttles: Refreshing state... [id=agentic-kie-deploy-prod-extractor-throttles]
module.extractor.aws_cloudwatch_metric_alarm.errors: Refreshing state... [id=agentic-kie-deploy-prod-extractor-errors]
module.extractor.aws_lambda_event_source_mapping.extraction: Refreshing state... [id=a478837a-6fba-4f1f-8fa1-346762d56960]
module.uploader.aws_lambda_permission.apigw_invoke: Refreshing state... [id=AllowAPIGatewayInvoke]
module.uploader.aws_cloudwatch_metric_alarm.throttles: Refreshing state... [id=agentic-kie-deploy-prod-uploader-throttles]
module.uploader.aws_cloudwatch_metric_alarm.errors: Refreshing state... [id=agentic-kie-deploy-prod-uploader-errors]
module.uploader.aws_apigatewayv2_integration.presigner: Refreshing state... [id=4kwa7tq]
module.publisher.aws_cloudwatch_metric_alarm.throttles: Refreshing state... [id=agentic-kie-deploy-prod-publisher-throttles]
module.publisher.aws_cloudwatch_metric_alarm.errors: Refreshing state... [id=agentic-kie-deploy-prod-publisher-errors]
module.publisher.aws_lambda_event_source_mapping.publisher: Refreshing state... [id=175ef08f-4119-44b9-84f9-34059ff7db0b]
module.uploader.aws_apigatewayv2_route.uploads: Refreshing state... [id=v3kyan2]

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # module.alarms.aws_sns_topic_subscription.email[0] will be created
  + resource "aws_sns_topic_subscription" "email" {
      + arn                             = (known after apply)
      + confirmation_timeout_in_minutes = 1
      + confirmation_was_authenticated  = (known after apply)
      + endpoint                        = "gafnts@gmail.com"
      + endpoint_auto_confirms          = false
      + filter_policy_scope             = (known after apply)
      + id                              = (known after apply)
      + owner_id                        = (known after apply)
      + pending_confirmation            = (known after apply)
      + protocol                        = "email"
      + raw_message_delivery            = false
      + region                          = "us-east-1"
      + topic_arn                       = "arn:aws:sns:us-east-1:009160074575:agentic-kie-deploy-prod-alarms"
    }

  # module.extractor.aws_cloudwatch_metric_alarm.errors will be updated in-place
  ~ resource "aws_cloudwatch_metric_alarm" "errors" {
      ~ alarm_description                     = "Lambda invocations that ended in an unhandled exception. With maxReceiveCount=3 on the queue, a single bad document fires this up to three times before it lands in the DLQ — the alarm is the early-warning signal that the DLQ alarm is the confirmation of." -> "Lambda invocations that ended in an unhandled exception. A single bad document fires this once per delivery attempt before it lands in the DLQ (the alarm is the early-warning signal that the DLQ alarm is the confirmation of)."
        id                                    = "agentic-kie-deploy-prod-extractor-errors"
        tags                                  = {
            "Environment" = "prod"
        }
        # (23 unchanged attributes hidden)
    }

  # module.extractor.aws_lambda_function.extractor will be updated in-place
  ~ resource "aws_lambda_function" "extractor" {
        id                             = "agentic-kie-deploy-prod-extractor"
      ~ image_uri                      = "009160074575.dkr.ecr.us-east-1.amazonaws.com/agentic-kie-deploy-prod-extractor@sha256:53a1848f1438d6d21c6e27662544676909d967bd71e25261aca22d1b5f880ddc" -> "009160074575.dkr.ecr.us-east-1.amazonaws.com/agentic-kie-deploy-prod-extractor@sha256:d8ca88c3d31cb8bc7315e65d4e77ce7b65d63de4c2f74e2d538213ab072b5c2c"
      ~ last_modified                  = "2026-05-31T23:49:21.495+0000" -> (known after apply)
        tags                           = {
            "Environment" = "prod"
        }
        # (29 unchanged attributes hidden)

      ~ environment {
          ~ variables = {
              + "EXTRACTOR_FLAVOR"        = "single_pass"
              ~ "LLM_MODEL"               = "gemini-3.1-flash-lite" -> "gemini-3-flash-preview"
                # (5 unchanged elements hidden)
            }
        }

        # (3 unchanged blocks hidden)
    }

  # module.queue.aws_cloudwatch_metric_alarm.dlq_messages_visible will be updated in-place
  ~ resource "aws_cloudwatch_metric_alarm" "dlq_messages_visible" {
      ~ alarm_description                     = "Any message in the DLQ means a document exhausted maxReceiveCount=3 retries. The DLQ alarm is the single source of truth for failed messages." -> "Any message in the DLQ means a document exhausted its maxReceiveCount retries (3 for single-pass, 2 for agentic). The DLQ alarm is the single source of truth for failed messages."
        id                                    = "agentic-kie-deploy-prod-extraction-dlq-messages-visible"
        tags                                  = {
            "Environment" = "prod"
        }
        # (23 unchanged attributes hidden)
    }

Plan: 1 to add, 3 to change, 0 to destroy.

Changes to Outputs:
  + extractor_flavor                = "single_pass"

─────────────────────────────────────────────────────────────────────────────

Note: You didn't use the -out option to save this plan, so Terraform can't
guarantee to take exactly these actions if you run "terraform apply" now.

gafnts added 9 commits June 7, 2026 13:03

Update extract docstring for flavor-selectable extraction

115f6a0

Merge pull request #37 from gafnts/feature/agentic-flavor-deployment

d286abc

Trigger staging extractor rebuild to deploy the agentic image

Gate concurrency SLO on CloudWatch metrics, report SQS in-flight as proxy

bd8f98a

Increase await_completion default timeout to 900s to outlast SQS visibility window

7ff7ee9

Skip cleanup of timed-out docs to avoid phantom DLQ failures on retry

1a894a9

Add agentic burst and sustained baseline reports for staging

d620dcd

Document agentic burst and sustained run results in ADR-0016

16a3230

Linkify ADR-0016 artifact references to local report files

2ef4954

Merge pull request #38 from gafnts/load/agentic-extractor

40da8db

Enhance SQS monitoring and timeout settings with documentation updates

gafnts merged commit c6cb7d0 into main Jun 7, 2026
7 checks passed

gafnts deleted the develop branch June 7, 2026 21:21

gafnts merged 9 commits into main from develop