feat: resilient region publishing, staging validation, ECR decoupling, and Slack failure reporting #470
Merged
…ne staging, validation, cleanup, and release processes
…30 min
Polling every 15s was noisy and unnecessary: validation takes several minutes on the orchestrator side. Switch to a 300s sleep so the job only checks when results are realistically ready. Default VALIDATION_TIMEOUT_S raised from 600s to 1800s so at least 5–6 polls occur before giving up. Added timeout-minutes: 35 on the validate job to match.
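The polling cadence described in this commit can be sketched roughly as below. This is a minimal illustration, not the contents of poll-validation.sh: `check_validation_result`, `RESULT_FILE`, and `POLL_INTERVAL_S` are hypothetical names, and only `VALIDATION_TIMEOUT_S` and the 300s/1800s values come from the commit message.

```shell
#!/usr/bin/env bash
# Sketch only: a coarse polling loop with a fixed interval and overall timeout.

POLL_INTERVAL_S="${POLL_INTERVAL_S:-300}"            # hypothetical variable name
VALIDATION_TIMEOUT_S="${VALIDATION_TIMEOUT_S:-1800}" # default from the commit message

# Hypothetical check: succeeds once a result is available.
# A real script would query the orchestrator results bucket instead.
check_validation_result() {
  [ -f "${RESULT_FILE:-/nonexistent}" ]
}

poll_validation() {
  local elapsed=0
  while [ "$elapsed" -lt "$VALIDATION_TIMEOUT_S" ]; do
    if check_validation_result; then
      echo "validation result available after ${elapsed}s"
      return 0
    fi
    sleep "$POLL_INTERVAL_S"
    elapsed=$((elapsed + POLL_INTERVAL_S))
  done
  echo "timed out after ${VALIDATION_TIMEOUT_S}s" >&2
  return 1
}
```

With the defaults above the loop checks at 0s, 300s, and so on up to 1500s, so the job-level timeout of 35 minutes leaves headroom over the 30-minute script timeout.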
…mes to improve error handling and reporting
… regions
- Add publish_layer_safe + run_region_loop to libBuild.sh so a single region failure no longer aborts the entire publish loop; all regions are attempted and a pass/fail summary is printed at the end.
- Fix empty layer_version propagation: add || return 1 after publish-layer-version command substitution so failures inside publish_public_layer are correctly surfaced when set -e is inactive (i.e. when called from an if-context).
- Make GITHUB_STEP_SUMMARY and GITHUB_OUTPUT writes non-fatal (|| true) to avoid permission-denied failures inside Docker containers.
- Replace all bare for-region loops in nodejs, python, java, ruby, dotnet, extension publish scripts with run_region_loop.
- Add strategy.fail-fast: false to all matrix publish jobs so e.g. node22/node24 are not cancelled when node20 fails.
- Mount GITHUB_OUTPUT + GITHUB_STEP_SUMMARY into Docker steps (node, java21) so failure_summary can propagate out of the container.
- Add outputs: failure_summary to every publish job; include "N/total regions failed: <list>" in all Slack payloads when set.
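For reviewers who want the shape of the change without opening libBuild.sh, here is a rough sketch of the pattern. The function names match the commit message, but every body below is an assumption written for illustration, and `fake_publish_layer_version` is a hypothetical stand-in for the real AWS CLI call.

```shell
#!/usr/bin/env bash
# Sketch only: bodies are illustrative assumptions, not the code in libBuild.sh.

# Hypothetical stand-in for the real publish call: prints a layer version on
# success and fails for one region to simulate an AWS error.
fake_publish_layer_version() {
  case "$1" in
    sa-east-1) return 1 ;;
    *) echo "42" ;;
  esac
}

publish_public_layer() {
  local region="$1" layer_version
  # Without "|| return 1" a failed command substitution would leave
  # layer_version empty and execution would continue, because set -e is
  # ignored while this function runs as part of an if condition.
  layer_version="$(fake_publish_layer_version "$region")" || return 1
  echo "published version ${layer_version} in ${region}"
}

# Report a single region's failure instead of letting it abort the caller.
publish_layer_safe() {
  if publish_public_layer "$1"; then
    return 0
  fi
  echo "publish failed in $1" >&2
  return 1
}

# Attempt every region, collect failures, print a summary, and return
# non-zero only after all regions have been tried.
run_region_loop() {
  local total=0 failed=0 failed_list="" region
  for region in "$@"; do
    total=$((total + 1))
    if ! publish_layer_safe "$region"; then
      failed=$((failed + 1))
      failed_list="${failed_list:+$failed_list, }$region"
    fi
  done
  if [ "$failed" -gt 0 ]; then
    echo "${failed}/${total} regions failed: ${failed_list}"
    return 1
  fi
  echo "all ${total} regions published"
}
```

Calling `run_region_loop us-east-1 sa-east-1 eu-west-1` with this stub still publishes the first and last regions and ends with a single summary line naming the failed one.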
…s on layer releases
…s vars to publish steps
…echanism
Docker containers run as root but cannot write to GITHUB_OUTPUT/GITHUB_STEP_SUMMARY (owned by the runner user, mode 600). Failures were silently dropped, so no artifact was ever saved and re-runs always published to all regions.
Fix:
- run_region_loop writes to FAILED_REGIONS_FILE (new env var) when set
- publish-node.yml and publish-java.yml pre-create a world-writable temp file, mount it into Docker, then relay its contents to GITHUB_OUTPUT after docker exits
- chmod a+rw GITHUB_STEP_SUMMARY before Docker so the summary table renders
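The relay mechanism in this commit could look roughly like the sketch below. `FAILED_REGIONS_FILE`, `GITHUB_OUTPUT`, and the `failure_summary` output name come from the commits above; the helper names `record_failed_regions` and `relay_failed_regions`, the mount path, and the one-line file format are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and file format are hypothetical.

# Inside the container: record the failure summary only when
# FAILED_REGIONS_FILE is set, and never fail the publish because of it.
record_failed_regions() {
  local summary="$1"
  if [ -n "${FAILED_REGIONS_FILE:-}" ]; then
    printf '%s\n' "$summary" > "$FAILED_REGIONS_FILE" || true
  fi
}

# On the runner, after docker exits: copy the first line of the file into
# GITHUB_OUTPUT so later jobs can read it as failure_summary.
relay_failed_regions() {
  local file="$1"
  if [ -s "$file" ]; then
    echo "failure_summary=$(head -n 1 "$file")" >> "$GITHUB_OUTPUT"
  fi
}

# Assumed workflow-step usage (image name and mount path are placeholders):
#   failed_file="$(mktemp)"; chmod a+rw "$failed_file"
#   docker run -v "$failed_file:/tmp/failed_regions" \
#     -e FAILED_REGIONS_FILE=/tmp/failed_regions <image> <publish command>
#   relay_failed_regions "$failed_file"
```

Routing the data through a pre-created world-writable file keeps the runner-owned GITHUB_OUTPUT file out of the container entirely, which is what avoids the permission problem described above.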
…mpt inputs for better error tracking
ECR publish now runs independently (if: always()) so a region failure in layer publishing no longer prevents Docker images from being pushed. Slack notification now downloads failed-regions-* artifacts and shows which specific regions failed per version instead of a generic message. Added failure_key to all versions_json entries and run_attempt input to all 6 notify-slack jobs. ECR failures are still surfaced separately.
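On the script side, reporting ECR failures separately could be sketched as below. The PR summary names `publish_ecr_safe` and `finalize_ecr_results` helpers, but their real signatures are not shown here, so the arguments, the `ECR_FAILURES` variable, and the messages are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: signatures and messages are hypothetical.

ECR_FAILURES=""

# Run the given push command for a named image. A failure is recorded but the
# function always returns 0, so it cannot change the layer publish exit code.
publish_ecr_safe() {
  local name="$1"
  shift
  if ! "$@"; then
    ECR_FAILURES="${ECR_FAILURES:+$ECR_FAILURES, }$name"
  fi
  return 0
}

# Print a separate ECR result line once all pushes have been attempted.
finalize_ecr_results() {
  if [ -n "$ECR_FAILURES" ]; then
    echo "ECR push failed for: $ECR_FAILURES"
  else
    echo "ECR push succeeded"
  fi
  return 0
}
```

Here the push command is passed in as arguments, for example `publish_ecr_safe node20 docker push <image>`, which keeps the sketch independent of how the real scripts build and tag images.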
…es and improve error handling
ashishsinghnr
previously approved these changes
May 6, 2026
    required: true
  versions_json:
    description: |
      JSON array: [{"key":"20","label":"Node.js 20","job":"publish-node (20)","fallback":"success","failure_key":"nodejs-20"}]
I hope these are example data. Can you update them to look like example data?
Contributor
Author
Hi @thisistrinadh, thank you for highlighting this. I have updated the description.
thisistrinadh
previously approved these changes
May 7, 2026
thisistrinadh
left a comment
Minor comment. Can be addressed later as well.
d556865
Sashwatdas123
approved these changes
May 7, 2026
Contributor
Sashwatdas123
left a comment
LGTM, thank you @aavinash-nr for raising this PR 👍🏻
Summary
Hardens the Lambda layer publishing pipeline across all runtimes (Node.js, Python, Ruby, Java, Dotnet, Extension). Previously, a single AWS region error would abort the entire publish via set -e, silently skipping all remaining regions and ECR pushes. This PR fixes that, adds a staging validation gate before production publish, and surfaces all failures clearly in Slack and the GitHub Actions Summary tab.

Changes

Staging Validation Gate
- poll-validation.sh polls the orchestrator results bucket on S3 every 5 min (up to 30 min timeout)

Region-Resilient Publishing
- run_region_loop in libBuild.sh: iterates all configured regions, captures per-region pass/fail, and continues even when one region throws an AWS error (fixes the set -e abort bug)
- …1/18 regions failed: sa-east-1)

ECR Publishing Decoupled from Layer Publishing
- publish_ecr_safe / finalize_ecr_results helpers: ECR always runs regardless of layer region failures, and ECR failures are reported separately without affecting the layer publish exit code

Configurable Infrastructure via Actions Vars
- LAYER_REGIONS, S3_BUCKET_PREFIX, and ECR_REPOSITORY are now read from GitHub Actions repository variables instead of being hardcoded in scripts

Automatic Region Retry on Workflow Re-run
- …region-retry composite action) instead of republishing to all regions

Structured Slack Notifications
- notify-slack-layer composite action sends one consolidated Slack message per workflow run
- fail-fast: false added to all matrix jobs so a node20 failure no longer auto-cancels node22/node24

Test Plan