Trigger staging extractor rebuild to deploy the agentic image
Enhance SQS monitoring and timeout settings with documentation updates
This pull request addresses several issues identified during load testing of the agentic flavor deployment, focusing on improving harness reliability, cleanup safety, and the accuracy of post-run reporting. The main changes increase the default harness timeout, refine cleanup logic to avoid phantom DLQ entries, clarify documentation, and enhance the concurrency cap reporting for better diagnostics.
Harness reliability and cleanup safety:
- Increased the default `await_completion` timeout from 600s to 900s in `harness.py` so that documents failing on their first attempt have enough time for a retry before being marked as "timeout", reducing false SLO 1 failures. Added documentation clarifying the relationship between this timeout and the SQS visibility timeout.
- Updated the `cleanup` function in `harness.py` to skip deletion of objects and rows for documents still marked as "timeout". This prevents premature deletion that could cause in-flight retries to fail with `AccessDenied`, which previously resulted in phantom DLQ entries. Added a log message to inform users about skipped documents and the need for manual cleanup after the backlog drains.

Reporting and documentation improvements:
- Updated `report.py` to clarify that the authoritative concurrency signals are Lambda's CloudWatch metrics, and that the SQS in-flight metric is a noisy proxy. The proxy value is now reported but not used for gating SLOs, improving the accuracy and interpretability of test results. [1] [2]
- Updated `extractor/handler.py` to reflect that the extraction now uses the deployed flavor's NDA extraction, not just single-pass extraction.
- Added a write-up (`0016-agentic-flavor-deployment.md`) documenting test artifacts, SLO verdicts, findings, and the rationale for the above changes, including the root causes and fixes for harness timeouts and DLQ measurement artifacts.
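The cleanup and timeout changes described above can be sketched roughly as follows. This is a minimal illustration, not the code in `harness.py`: `DocResult`, `partition_for_cleanup`, and `leaves_room_for_retry` are hypothetical names, and the visibility-timeout and attempt-duration figures in the usage note are illustrative assumptions, not values taken from this pull request.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class DocResult:
    """Hypothetical per-document outcome as the harness might record it."""
    doc_id: str
    status: str  # e.g. "done", "failed", or "timeout"


def partition_for_cleanup(results: Iterable[DocResult]) -> Tuple[List[str], List[str]]:
    """Split document IDs into (safe to delete, skipped).

    Documents still marked "timeout" may have a retry in flight, so their
    objects and rows are left in place for manual cleanup later.
    """
    deletable: List[str] = []
    skipped: List[str] = []
    for result in results:
        if result.status == "timeout":
            skipped.append(result.doc_id)
        else:
            deletable.append(result.doc_id)
    return deletable, skipped


def leaves_room_for_retry(
    await_timeout_s: float,
    visibility_timeout_s: float,
    attempt_duration_s: float,
) -> bool:
    """Rough check that one retry can finish before the harness gives up.

    Assumes a failed message becomes visible again only after the SQS
    visibility timeout elapses, and that the retry then needs roughly one
    attempt's worth of processing time.
    """
    return await_timeout_s >= visibility_timeout_s + attempt_duration_s
```

For example, with an assumed 600s visibility timeout and a 120s attempt, `leaves_room_for_retry(600, 600, 120)` is false while `leaves_room_for_retry(900, 600, 120)` is true, which is the kind of gap the timeout increase is meant to close. The skipped list from `partition_for_cleanup` is what a log message about manual cleanup would report.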