Enhance agentic extractor deployment and SQS monitoring#39

Merged
gafnts merged 9 commits into main from develop on Jun 7, 2026

@gafnts gafnts commented Jun 7, 2026

Owner

This pull request addresses several issues identified during load testing of the agentic flavor deployment, focusing on improving harness reliability, cleanup safety, and the accuracy of post-run reporting. The main changes increase the default harness timeout, refine cleanup logic to avoid phantom DLQ entries, clarify documentation, and enhance the concurrency cap reporting for better diagnostics.

Harness reliability and cleanup safety:

  • Increased the default await_completion timeout from 600s to 900s in harness.py to ensure that documents failing on their first attempt have enough time for a retry before being marked as "timeout", reducing false SLO 1 failures. Added documentation clarifying the relationship between this timeout and the SQS visibility timeout.
  • Updated the cleanup function in harness.py to skip deletion of objects and rows for documents still marked as "timeout". This prevents premature deletion that could cause in-flight retries to fail with AccessDenied, which previously resulted in phantom DLQ entries. Added a log message to inform users about skipped documents and the need for manual cleanup after backlog drains.
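
The two harness changes above can be sketched roughly as follows. This is a minimal illustration, not the code in harness.py: `DocResult`, `leaves_room_for_retry`, `partition_for_cleanup`, and the numeric values in the usage lines are all hypothetical, and it assumes a failed first attempt is redelivered only after the SQS visibility timeout expires.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class DocResult:
    """Hypothetical per-document record; the real harness type may differ."""
    doc_id: str
    status: str  # for example "succeeded", "failed", or "timeout"


def leaves_room_for_retry(
    await_timeout_s: int, visibility_timeout_s: int, attempt_budget_s: int
) -> bool:
    """True if one retry can finish before the harness gives up.

    Assumption: a failed message becomes visible again only after the
    visibility timeout, and the retry then needs its own processing time.
    """
    return await_timeout_s >= visibility_timeout_s + attempt_budget_s


def partition_for_cleanup(
    results: Iterable[DocResult],
) -> Tuple[List[DocResult], List[DocResult]]:
    """Split results into (safe to delete, skipped because still 'timeout')."""
    deletable: List[DocResult] = []
    skipped: List[DocResult] = []
    for result in results:
        if result.status == "timeout":
            # A retry may still be in flight; deleting now could make it fail.
            skipped.append(result)
        else:
            deletable.append(result)
    return deletable, skipped


# Illustrative values only; they are not taken from the deployment.
results = [
    DocResult("doc-1", "succeeded"),
    DocResult("doc-2", "timeout"),
    DocResult("doc-3", "failed"),
]
deletable, skipped = partition_for_cleanup(results)
if skipped:
    print(
        f"Skipping cleanup for {len(skipped)} document(s) still marked "
        "'timeout'; clean them up manually after the backlog drains."
    )
```

With a hypothetical 600s visibility timeout and a 240s per-attempt budget, `leaves_room_for_retry(900, 600, 240)` holds while `leaves_room_for_retry(600, 600, 240)` does not, which is the kind of gap the larger default is meant to close.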

Reporting and documentation improvements:

  • Enhanced the concurrency cap SLO reporting in report.py to clarify that the authoritative concurrency signals are Lambda's CloudWatch metrics, and the SQS in-flight metric is a noisy proxy. The proxy value is now reported but not used for gating SLOs, improving the accuracy and interpretability of test results.
  • Updated the docstring in extractor/handler.py to reflect that the handler now runs the deployed flavor's NDA extraction rather than only single-pass extraction.
  • Added detailed post-implementation analysis to the ADR (0016-agentic-flavor-deployment.md), documenting test artifacts, SLO verdicts, findings, and rationale for the above changes, including the root causes and fixes for harness timeouts and DLQ measurement artifacts.
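
The concurrency cap reporting described above might look something like the sketch below. It is an assumption-laden illustration rather than the logic in report.py: the function and field names are hypothetical, and it assumes the authoritative signal is the peak of Lambda's `ConcurrentExecutions` metric while the proxy is the peak of SQS `ApproximateNumberOfMessagesNotVisible`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConcurrencyVerdict:
    """Hypothetical report row for the concurrency cap SLO."""
    passed: bool
    lambda_peak: float
    sqs_in_flight_peak: Optional[float]
    note: str


def concurrency_cap_verdict(
    cap: int,
    lambda_peak: float,
    sqs_in_flight_peak: Optional[float] = None,
) -> ConcurrencyVerdict:
    """Gate on the Lambda metric only; carry the SQS proxy for diagnostics."""
    passed = lambda_peak <= cap
    if sqs_in_flight_peak is None:
        note = "SQS in-flight proxy not available."
    elif sqs_in_flight_peak > cap:
        # The proxy can exceed the cap without the cap being breached,
        # so it is reported but never changes the verdict.
        note = "SQS in-flight proxy exceeded the cap (informational only)."
    else:
        note = "SQS in-flight proxy within the cap (informational only)."
    return ConcurrencyVerdict(passed, lambda_peak, sqs_in_flight_peak, note)


# Illustrative values only; they are not results from the load test.
verdict = concurrency_cap_verdict(cap=10, lambda_peak=8, sqs_in_flight_peak=14)
print(verdict.passed, verdict.note)
```

Keeping the proxy in the returned record but outside the `passed` computation mirrors the intent stated in the description: a noisy in-flight count remains visible for diagnosis without causing a false SLO failure.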

github-actions Bot commented Jun 7, 2026

Terraform Plan · prod

Show plan
terraform -chdir=infra plan -var-file=envs/prod.tfvars
module.publisher.data.archive_file.publisher: Reading...
module.publisher.data.archive_file.publisher: Read complete after 0s [id=a8a217aa670dbf877cef8e62ea09f239480a3857]
module.uploader.data.archive_file.presigner: Reading...
module.uploader.data.archive_file.presigner: Read complete after 0s [id=c6003612e7d0a8992056dd501c589d957f338842]
module.publisher.data.aws_iam_policy_document.assume_role: Reading...
data.aws_secretsmanager_secret.llm_provider: Reading...
module.uploader.aws_cloudwatch_log_group.presigner: Refreshing state... [id=/aws/lambda/agentic-kie-deploy-prod-uploader]
module.uploader.aws_apigatewayv2_api.uploader: Refreshing state... [id=pgt5fz6hfb]
module.queue.aws_sqs_queue.extraction_dlq: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/009160074575/agentic-kie-deploy-prod-extraction-dlq]
module.analytics.aws_glue_catalog_database.results: Refreshing state... [id=009160074575:agentic-kie-deploy_prod_analytics]
module.publisher.aws_cloudwatch_log_group.publisher: Refreshing state... [id=/aws/lambda/agentic-kie-deploy-prod-publisher]
data.aws_secretsmanager_secret.langsmith: Reading...
module.alarms.aws_sns_topic.alarms: Refreshing state... [id=arn:aws:sns:us-east-1:009160074575:agentic-kie-deploy-prod-alarms]
module.table.aws_dynamodb_table.results: Refreshing state... [id=agentic-kie-deploy-prod-results]
module.publisher.data.aws_iam_policy_document.assume_role: Read complete after 0s [id=666922913]
module.extractor.data.aws_iam_policy_document.assume_role: Reading...
module.extractor.data.aws_iam_policy_document.assume_role: Read complete after 0s [id=666922913]
module.publisher.aws_sqs_queue.publisher_dlq: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/009160074575/agentic-kie-deploy-prod-publisher-dlq]
data.aws_secretsmanager_secret.llm_provider: Read complete after 1s [id=arn:aws:secretsmanager:us-east-1:009160074575:secret:agentic-kie-deploy/prod/llm-provider-GRUKBj]
module.uploader.aws_cloudwatch_log_group.api_access: Refreshing state... [id=/aws/apigateway/agentic-kie-deploy-prod-uploader]
module.extractor.aws_cloudwatch_log_group.extractor: Refreshing state... [id=/aws/lambda/agentic-kie-deploy-prod-extractor]
module.uploader.data.aws_iam_policy_document.assume_role: Reading...
module.uploader.data.aws_iam_policy_document.assume_role: Read complete after 0s [id=666922913]
data.aws_ecr_repository.extractor: Reading...
data.aws_caller_identity.current: Reading...
module.publisher.aws_iam_role.publisher: Refreshing state... [id=agentic-kie-deploy-prod-publisher-exec]
module.extractor.aws_iam_role.extractor: Refreshing state... [id=agentic-kie-deploy-prod-extractor-exec]
module.alarms.aws_sns_topic_subscription.email[0]: Refreshing state... [id=arn:aws:sns:us-east-1:009160074575:agentic-kie-deploy-prod-alarms:f9db7014-afdf-495e-b3f7-b445b4855d38]
data.aws_caller_identity.current: Read complete after 0s [id=009160074575]
module.publisher.data.aws_iam_policy_document.publisher_dlq: Reading...
module.publisher.data.aws_iam_policy_document.publisher_dlq: Read complete after 0s [id=379898922]
data.aws_secretsmanager_secret.langsmith: Read complete after 1s [id=arn:aws:secretsmanager:us-east-1:009160074575:secret:agentic-kie-deploy/prod/langsmith-yEQDzt]
module.uploader.aws_iam_role.presigner: Refreshing state... [id=agentic-kie-deploy-prod-uploader-exec]
module.queue.aws_sqs_queue.extraction: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/009160074575/agentic-kie-deploy-prod-extraction]
module.queue.data.aws_iam_policy_document.extraction_dlq: Reading...
module.queue.data.aws_iam_policy_document.extraction_dlq: Read complete after 0s [id=3392646447]
module.publisher.aws_sqs_queue_policy.publisher_dlq: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/009160074575/agentic-kie-deploy-prod-publisher-dlq]
module.queue.aws_sqs_queue_policy.extraction_dlq: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/009160074575/agentic-kie-deploy-prod-extraction-dlq]
module.queue.aws_cloudwatch_metric_alarm.dlq_messages_visible: Refreshing state... [id=agentic-kie-deploy-prod-extraction-dlq-messages-visible]
module.publisher.aws_cloudwatch_metric_alarm.dlq_messages_visible: Refreshing state... [id=agentic-kie-deploy-prod-publisher-dlq-messages-visible]
module.uploader.aws_apigatewayv2_stage.default: Refreshing state... [id=$default]
module.analytics.aws_s3_bucket.results_logs: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098-logs]
module.analytics.aws_s3_bucket.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
data.aws_ecr_repository.extractor: Read complete after 0s [id=agentic-kie-deploy-prod-extractor]
module.ingestion.aws_s3_bucket.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.aws_s3_bucket.ingestion_logs: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae-logs]
module.analytics.aws_s3_bucket.athena_results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd2-athena-results]
module.analytics.aws_s3_bucket_ownership_controls.results_logs: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098-logs]
module.analytics.aws_s3_bucket_server_side_encryption_configuration.results_logs: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098-logs]
module.analytics.aws_s3_bucket_public_access_block.results_logs: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098-logs]
module.analytics.aws_s3_bucket_lifecycle_configuration.results_logs: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098-logs]
module.ingestion.aws_s3_bucket_notification.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.data.aws_iam_policy_document.ingestion_tls_only: Reading...
module.ingestion.aws_s3_bucket_versioning.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.aws_s3_bucket_lifecycle_configuration.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.aws_s3_bucket_ownership_controls.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.aws_s3_bucket_server_side_encryption_configuration.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.data.aws_iam_policy_document.ingestion_tls_only: Read complete after 0s [id=245963188]
module.ingestion.aws_s3_bucket_public_access_block.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.uploader.data.aws_iam_policy_document.presigner: Reading...
module.uploader.data.aws_iam_policy_document.presigner: Read complete after 0s [id=3714044795]
module.queue.aws_cloudwatch_event_rule.object_created: Refreshing state... [id=agentic-kie-deploy-prod-extraction-object-created]
module.extractor.data.aws_iam_policy_document.extractor: Reading...
module.extractor.data.aws_iam_policy_document.extractor: Read complete after 0s [id=1692636987]
module.analytics.aws_glue_catalog_table.extractions: Refreshing state... [id=009160074575:agentic-kie-deploy_prod_analytics:extractions]
module.analytics.aws_s3_bucket_server_side_encryption_configuration.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.analytics.data.aws_iam_policy_document.results_tls_only: Reading...
module.analytics.data.aws_iam_policy_document.results_tls_only: Read complete after 0s [id=3894737737]
module.analytics.aws_s3_bucket_lifecycle_configuration.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.analytics.aws_s3_bucket_notification.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.analytics.aws_s3_bucket_logging.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.analytics.aws_s3_bucket_versioning.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.analytics.aws_s3_bucket_public_access_block.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.ingestion.aws_s3_bucket_policy.ingestion_tls_only: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.analytics.aws_s3_bucket_ownership_controls.results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.uploader.aws_iam_role_policy.presigner: Refreshing state... [id=agentic-kie-deploy-prod-uploader-exec:presigner]
module.ingestion.aws_s3_bucket_logging.ingestion: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae]
module.ingestion.aws_s3_bucket_lifecycle_configuration.ingestion_logs: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae-logs]
module.ingestion.aws_s3_bucket_ownership_controls.ingestion_logs: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae-logs]
module.ingestion.aws_s3_bucket_public_access_block.ingestion_logs: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae-logs]
module.ingestion.aws_s3_bucket_server_side_encryption_configuration.ingestion_logs: Refreshing state... [id=agentic-kie-deploy-prod-ingestion-aad6123c6a2f5bae-logs]
module.extractor.aws_iam_role_policy.extractor: Refreshing state... [id=agentic-kie-deploy-prod-extractor-exec:extractor]
module.analytics.aws_s3_bucket_public_access_block.athena_results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd2-athena-results]
module.analytics.aws_s3_bucket_ownership_controls.athena_results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd2-athena-results]
module.analytics.aws_athena_workgroup.results: Refreshing state... [id=agentic-kie-deploy-prod-analytics]
module.analytics.aws_s3_bucket_lifecycle_configuration.athena_results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd2-athena-results]
module.analytics.data.aws_iam_policy_document.athena_results_tls_only: Reading...
module.analytics.data.aws_iam_policy_document.athena_results_tls_only: Read complete after 0s [id=2676727016]
module.analytics.aws_s3_bucket_server_side_encryption_configuration.athena_results: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd2-athena-results]
module.analytics.aws_s3_bucket_policy.results_tls_only: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd21098]
module.queue.aws_cloudwatch_event_target.extraction_queue: Refreshing state... [id=agentic-kie-deploy-prod-extraction-object-created-terraform-20260531234921286200000001]
module.queue.data.aws_iam_policy_document.extraction_queue: Reading...
module.queue.data.aws_iam_policy_document.extraction_queue: Read complete after 0s [id=4087084627]
module.uploader.aws_lambda_function.presigner: Refreshing state... [id=agentic-kie-deploy-prod-uploader]
module.extractor.aws_lambda_function.extractor: Refreshing state... [id=agentic-kie-deploy-prod-extractor]
module.publisher.data.aws_iam_policy_document.publisher: Reading...
module.publisher.data.aws_iam_policy_document.publisher: Read complete after 0s [id=2357955661]
module.analytics.aws_s3_bucket_policy.athena_results_tls_only: Refreshing state... [id=agentic-kie-deploy-prod-extractions-32757ca74bd2-athena-results]
module.queue.aws_sqs_queue_policy.extraction: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/009160074575/agentic-kie-deploy-prod-extraction]
module.publisher.aws_iam_role_policy.publisher: Refreshing state... [id=agentic-kie-deploy-prod-publisher-exec:publisher]
module.publisher.aws_lambda_function.publisher: Refreshing state... [id=agentic-kie-deploy-prod-publisher]
module.extractor.aws_cloudwatch_metric_alarm.throttles: Refreshing state... [id=agentic-kie-deploy-prod-extractor-throttles]
module.extractor.aws_cloudwatch_metric_alarm.errors: Refreshing state... [id=agentic-kie-deploy-prod-extractor-errors]
module.extractor.aws_lambda_event_source_mapping.extraction: Refreshing state... [id=a478837a-6fba-4f1f-8fa1-346762d56960]
module.uploader.aws_lambda_permission.apigw_invoke: Refreshing state... [id=AllowAPIGatewayInvoke]
module.uploader.aws_cloudwatch_metric_alarm.throttles: Refreshing state... [id=agentic-kie-deploy-prod-uploader-throttles]
module.uploader.aws_cloudwatch_metric_alarm.errors: Refreshing state... [id=agentic-kie-deploy-prod-uploader-errors]
module.uploader.aws_apigatewayv2_integration.presigner: Refreshing state... [id=4kwa7tq]
module.publisher.aws_cloudwatch_metric_alarm.throttles: Refreshing state... [id=agentic-kie-deploy-prod-publisher-throttles]
module.publisher.aws_cloudwatch_metric_alarm.errors: Refreshing state... [id=agentic-kie-deploy-prod-publisher-errors]
module.publisher.aws_lambda_event_source_mapping.publisher: Refreshing state... [id=175ef08f-4119-44b9-84f9-34059ff7db0b]
module.uploader.aws_apigatewayv2_route.uploads: Refreshing state... [id=v3kyan2]

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # module.alarms.aws_sns_topic_subscription.email[0] will be created
  + resource "aws_sns_topic_subscription" "email" {
      + arn                             = (known after apply)
      + confirmation_timeout_in_minutes = 1
      + confirmation_was_authenticated  = (known after apply)
      + endpoint                        = "gafnts@gmail.com"
      + endpoint_auto_confirms          = false
      + filter_policy_scope             = (known after apply)
      + id                              = (known after apply)
      + owner_id                        = (known after apply)
      + pending_confirmation            = (known after apply)
      + protocol                        = "email"
      + raw_message_delivery            = false
      + region                          = "us-east-1"
      + topic_arn                       = "arn:aws:sns:us-east-1:009160074575:agentic-kie-deploy-prod-alarms"
    }

  # module.extractor.aws_cloudwatch_metric_alarm.errors will be updated in-place
  ~ resource "aws_cloudwatch_metric_alarm" "errors" {
      ~ alarm_description                     = "Lambda invocations that ended in an unhandled exception. With maxReceiveCount=3 on the queue, a single bad document fires this up to three times before it lands in the DLQ — the alarm is the early-warning signal that the DLQ alarm is the confirmation of." -> "Lambda invocations that ended in an unhandled exception. A single bad document fires this once per delivery attempt before it lands in the DLQ (the alarm is the early-warning signal that the DLQ alarm is the confirmation of)."
        id                                    = "agentic-kie-deploy-prod-extractor-errors"
        tags                                  = {
            "Environment" = "prod"
        }
        # (23 unchanged attributes hidden)
    }

  # module.extractor.aws_lambda_function.extractor will be updated in-place
  ~ resource "aws_lambda_function" "extractor" {
        id                             = "agentic-kie-deploy-prod-extractor"
      ~ image_uri                      = "009160074575.dkr.ecr.us-east-1.amazonaws.com/agentic-kie-deploy-prod-extractor@sha256:53a1848f1438d6d21c6e27662544676909d967bd71e25261aca22d1b5f880ddc" -> "009160074575.dkr.ecr.us-east-1.amazonaws.com/agentic-kie-deploy-prod-extractor@sha256:d8ca88c3d31cb8bc7315e65d4e77ce7b65d63de4c2f74e2d538213ab072b5c2c"
      ~ last_modified                  = "2026-05-31T23:49:21.495+0000" -> (known after apply)
        tags                           = {
            "Environment" = "prod"
        }
        # (29 unchanged attributes hidden)

      ~ environment {
          ~ variables = {
              + "EXTRACTOR_FLAVOR"        = "single_pass"
              ~ "LLM_MODEL"               = "gemini-3.1-flash-lite" -> "gemini-3-flash-preview"
                # (5 unchanged elements hidden)
            }
        }

        # (3 unchanged blocks hidden)
    }

  # module.queue.aws_cloudwatch_metric_alarm.dlq_messages_visible will be updated in-place
  ~ resource "aws_cloudwatch_metric_alarm" "dlq_messages_visible" {
      ~ alarm_description                     = "Any message in the DLQ means a document exhausted maxReceiveCount=3 retries. The DLQ alarm is the single source of truth for failed messages." -> "Any message in the DLQ means a document exhausted its maxReceiveCount retries (3 for single-pass, 2 for agentic). The DLQ alarm is the single source of truth for failed messages."
        id                                    = "agentic-kie-deploy-prod-extraction-dlq-messages-visible"
        tags                                  = {
            "Environment" = "prod"
        }
        # (23 unchanged attributes hidden)
    }

Plan: 1 to add, 3 to change, 0 to destroy.

Changes to Outputs:
  + extractor_flavor                = "single_pass"

─────────────────────────────────────────────────────────────────────────────

Note: You didn't use the -out option to save this plan, so Terraform can't
guarantee to take exactly these actions if you run "terraform apply" now.

@gafnts gafnts merged commit c6cb7d0 into main Jun 7, 2026
7 checks passed
@gafnts gafnts deleted the develop branch June 7, 2026 21:21