fix: handle CloudFormation throttling in import gateway polling #1185

Merged
Hweinstock merged 3 commits into aws:main from Hweinstock:fix/import-gateway-throttle-resilience
May 8, 2026

Conversation

@Hweinstock (Contributor) commented May 8, 2026

Description

Problem

A test on main has been flaky because CloudFormation throttles requests when tests run in parallel. See https://github.com/aws/agentcore-cli/actions/runs/25528496015/job/74929406576 for an example. The key log line is "[error] Phase 2 failed: Import change set failed: Rate exceeded". The existing retry logic in import does not handle throttling exceptions gracefully.

Solution

Introduce a general poll utility that handles transient errors and can be extended to handle other types of errors, then migrate the import code to use it. Other polling mechanisms already exist in the codebase, but none are generalized or reusable. Migrating them is high-risk and low-reward, so it is out of scope here. They can be moved as needed.
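
As a rough illustration of the approach described above, here is a minimal sketch of a poll helper that retries transient errors on the next iteration and fails fast on everything else. The option names (`fn`, `isDone`, `isTransient`, `intervalMs`, `maxAttempts`, `sleep`) and the `isThrottle` check are assumptions made for this example, not the actual API in `src/lib/utils/polling.ts`.

```typescript
// Hypothetical sketch: names and option shapes are assumptions, not the PR's API.
type PollOptions<T> = {
  fn: () => Promise<T>; // one polling attempt
  isDone: (result: T) => boolean; // true when polling can stop
  isTransient: (err: unknown) => boolean; // true for errors worth retrying
  intervalMs: number; // delay between attempts
  maxAttempts: number; // upper bound on attempts
  sleep?: (ms: number) => Promise<void>; // injectable so tests can skip real delays
};

async function poll<T>(opts: PollOptions<T>): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;

  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      const result = await opts.fn();
      if (opts.isDone(result)) return result;
    } catch (err) {
      // Non-transient errors propagate immediately.
      if (!opts.isTransient(err)) throw err;
      // Transient errors are remembered and retried on the next iteration.
      lastError = err;
    }
    if (attempt < opts.maxAttempts) await sleep(opts.intervalMs);
  }

  const detail = lastError instanceof Error ? `: ${lastError.message}` : "";
  throw new Error(`Polling exhausted after ${opts.maxAttempts} attempts${detail}`);
}

// Assumption: throttling is recognized by error name or by the
// "Rate exceeded" message seen in the failing job log.
function isThrottle(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  return err.name === "Throttling" || /rate exceeded/i.test(err.message);
}
```

A caller would pass the status check as `fn`, a terminal-status predicate as `isDone`, and `isThrottle` as `isTransient`. Making `isTransient` a parameter is one way to keep the helper open to other error types later.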

Related Issue

Partially addresses #1179

Documentation PR

N/A

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Other (please describe):

Testing

How have you tested the change?

  • I ran npm run test:unit and npm run test:integ
  • I ran npm run typecheck
  • I ran npm run lint
  • If I modified src/assets/, I ran npm run test:update-snapshots and committed the updated snapshots

17 unit tests added for the new polling utility. All pass along with typecheck.

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the
terms of your choice.

@github-actions github-actions Bot added size/m PR size: M agentcore-harness-reviewing AgentCore Harness review in progress labels May 8, 2026

@agentcore-cli-automation agentcore-cli-automation left a comment


Thanks for improving the import flow's throttle resilience — the shared poll() utility is a nice abstraction.

A couple of things worth addressing before merging, mostly around the new utility's error semantics. The PR's stated goal is to improve behavior under throttling, but as written the caller loses all context when polling gives up, which somewhat undermines debuggability in the exact scenario this PR targets. Details inline.

Comment thread src/lib/utils/polling.ts Outdated
Comment thread src/lib/utils/polling.ts Outdated
Comment thread src/cli/commands/import/phase2-import.ts Outdated
Comment thread package.json Outdated
@github-actions github-actions Bot removed the agentcore-harness-reviewing AgentCore Harness review in progress label May 8, 2026
Adds a shared poll() utility with throttle-aware retry and migrates
phase2-import.ts to use it. Previously, Rate exceeded errors from
CloudFormation during concurrent e2e tests would crash the import
operation. Now throttle errors are retried on the next poll iteration.

Fixes: import-gateway e2e test failures under parallel execution
@github-actions github-actions Bot added size/m PR size: M and removed size/m PR size: M labels May 8, 2026
@Hweinstock Hweinstock force-pushed the fix/import-gateway-throttle-resilience branch from 53872eb to e38ac3e Compare May 8, 2026 18:28
@github-actions github-actions Bot added size/m PR size: M and removed size/m PR size: M labels May 8, 2026
…pping

Addresses PR review feedback:
- PollExhaustedError and PollTimeoutError now include the last error
  as `cause` for debuggability (e.g., shows 'Rate exceeded' when
  throttling exhausts retries)
- phase2-import.ts wraps poll errors with operation-specific messages
  ('Timed out waiting for change set creation') preserving original
  error as cause
- Fixed misleading message when maxConsecutiveErrors triggers (now
  reports actual attempt count)
- Added 3 tests verifying cause propagation
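
The cause propagation described in this commit message might look roughly like the sketch below. The class names `PollExhaustedError` and `PollTimeoutError` and the message 'Timed out waiting for change set creation' come from the text above. The `ErrorWithCause` base class, the `waitForChangeSetCreation` wrapper, and `describeChain` are hypothetical helpers introduced only for this illustration.

```typescript
// Hypothetical base class. ES2022 lets `new Error(msg, { cause })` do this natively;
// the field is declared explicitly so the sketch also type-checks on older lib targets.
class ErrorWithCause extends Error {
  cause: unknown = undefined;
  constructor(message: string, cause?: unknown) {
    super(message);
    // Keeps `instanceof` working if the code is compiled down to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
    this.cause = cause;
  }
}

class PollExhaustedError extends ErrorWithCause {
  constructor(message: string, cause?: unknown) {
    super(message, cause);
    this.name = "PollExhaustedError";
  }
}

class PollTimeoutError extends ErrorWithCause {
  constructor(message: string, cause?: unknown) {
    super(message, cause);
    this.name = "PollTimeoutError";
  }
}

// Hypothetical call-site wrapper: poll errors get an operation-specific
// message, and the original error is kept as `cause`.
async function waitForChangeSetCreation(pollOnce: () => Promise<void>): Promise<void> {
  try {
    await pollOnce();
  } catch (err) {
    if (err instanceof PollExhaustedError || err instanceof PollTimeoutError) {
      throw new ErrorWithCause("Timed out waiting for change set creation", err);
    }
    throw err; // unrelated errors pass through unchanged
  }
}

// Walks the cause chain so a log line can show the root cause.
function describeChain(err: unknown): string {
  const parts: string[] = [];
  let current: unknown = err;
  while (current instanceof Error) {
    parts.push(current.message);
    current = (current as { cause?: unknown }).cause;
  }
  return parts.join(" <- ");
}
```

With a chain like this, a failure caused by throttling can surface the underlying 'Rate exceeded' message instead of only a generic polling error.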
@Hweinstock Hweinstock force-pushed the fix/import-gateway-throttle-resilience branch from e38ac3e to 39164a3 Compare May 8, 2026 18:31
@github-actions github-actions Bot added size/m PR size: M and removed size/m PR size: M labels May 8, 2026
@github-actions github-actions Bot added size/m PR size: M and removed size/m PR size: M labels May 8, 2026
@Hweinstock Hweinstock closed this May 8, 2026
@Hweinstock Hweinstock reopened this May 8, 2026
@github-actions github-actions Bot added size/m PR size: M agentcore-harness-reviewing AgentCore Harness review in progress and removed size/m PR size: M labels May 8, 2026

@agentcore-cli-automation agentcore-cli-automation left a comment


Reviewed the current state of the PR. The substantive concerns from the earlier review have been addressed:

  • 39164a3 preserves lastError as cause on both PollExhaustedError and PollTimeoutError, so the throttling root cause is no longer discarded.
  • 39164a3 also wraps PollExhaustedError/PollTimeoutError at the call sites in phase2-import.ts with domain-specific messages (Timed out waiting for change set creation / Timed out waiting for import to complete), preserving the underlying error via cause.
  • e3d827a correctly scopes vi.useFakeTimers() to the backoff describe block via beforeEach/afterEach, so timer mocks don't leak into other tests.

The one remaining minor loose end, already covered in the existing comment thread, is that the maxConsecutiveErrors branch still produces "Polling exhausted after N attempts" without noting that the abort was specifically due to consecutive errors (line 82). Because cause is now populated, users can still diagnose the failure, so IMO this is fine to leave as follow-up polish.
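
One possible shape for that follow-up is sketched below, assuming the utility knows which limit it hit. The `ExhaustionReason` type, the `exhaustedMessage` helper, and the wording of the consecutive-errors message are hypothetical and not part of this PR.

```typescript
// Hypothetical helper: picks a message that names the limit that was hit.
type ExhaustionReason = "maxAttempts" | "maxConsecutiveErrors";

function exhaustedMessage(
  reason: ExhaustionReason,
  attempts: number,
  consecutiveErrors: number
): string {
  if (reason === "maxConsecutiveErrors") {
    // Says explicitly that the abort came from back-to-back errors.
    return `Polling aborted after ${consecutiveErrors} consecutive errors (${attempts} attempts made)`;
  }
  // Same wording as the existing message quoted above.
  return `Polling exhausted after ${attempts} attempts`;
}
```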

The tests use real dependencies (no fs/SDK mocking) and only rely on fake timers where needed. No new serious issues from me — LGTM.

@github-actions github-actions Bot removed the agentcore-harness-reviewing AgentCore Harness review in progress label May 8, 2026
@Hweinstock Hweinstock marked this pull request as ready for review May 8, 2026 19:31
@Hweinstock Hweinstock requested a review from a team May 8, 2026 19:31
Contributor

@jesseturner21 jesseturner21 left a comment


Thanks for fixing this!

@Hweinstock Hweinstock merged commit df27f12 into aws:main May 8, 2026
35 of 48 checks passed

Labels

size/m PR size: M

3 participants