feat: resilient region publishing, staging validation, ECR decoupling, and Slack failure reporting #470

Merged
aavinash-nr merged 16 commits into newrelic:master from aavinash-nr:notify-slack
May 7, 2026

Conversation

@aavinash-nr
Contributor

Summary

Hardens the Lambda layer publishing pipeline across all runtimes (Node.js, Python, Ruby, Java, Dotnet, Extension). Previously, a single AWS region error would abort the entire publish via set -e, silently skipping all remaining regions and ECR pushes. This PR fixes that, adds a staging validation gate before production publish, and surfaces all failures clearly in Slack and the GitHub Actions Summary tab.

Changes

Staging Validation Gate

  • Layers are published to a staging environment before production
  • A Lambda orchestrator is invoked to run end-to-end validation against the staged layers
  • poll-validation.sh polls the orchestrator results bucket on S3 every 5 min (up to 30 min timeout)
  • Production publish only proceeds if validation passes; staging layers are always cleaned up afterwards
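
The polling behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual poll-validation.sh: `check_validation_result` is a hypothetical placeholder for whatever reads the orchestrator results from S3, and `POLL_INTERVAL_S` is an assumed variable name (only `VALIDATION_TIMEOUT_S` appears in this PR).

```shell
# Sketch only: poll until validation passes, fails, or times out.
POLL_INTERVAL_S="${POLL_INTERVAL_S:-300}"             # assumed name; 5 minutes
VALIDATION_TIMEOUT_S="${VALIDATION_TIMEOUT_S:-1800}"  # 30 minutes

# Hypothetical placeholder. A real version would inspect the results bucket
# and print "passed", "failed", or "pending".
check_validation_result() { echo "pending"; }

poll_validation() {
  local elapsed=0
  local status
  while [ "$elapsed" -lt "$VALIDATION_TIMEOUT_S" ]; do
    # Treat a failing check as "pending" so a transient error does not end the poll.
    status="$(check_validation_result)" || status="pending"
    case "$status" in
      passed) echo "validation passed after ${elapsed}s"; return 0 ;;
      failed) echo "validation failed after ${elapsed}s"; return 1 ;;
    esac
    sleep "$POLL_INTERVAL_S"
    elapsed=$((elapsed + POLL_INTERVAL_S))
  done
  echo "validation timed out after ${VALIDATION_TIMEOUT_S}s"
  return 2
}
```

With the defaults above, the loop checks at most six times (1800 / 300) before giving up. Using distinct exit codes for failure and timeout is a design choice of this sketch, not something the PR states.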

Region-Resilient Publishing

  • Added run_region_loop in libBuild.sh — iterates all configured regions, captures per-region pass/fail, and continues even when one region throws an AWS error (fixes the set -e abort bug)
  • After each publish, a region summary table is written to the GitHub Actions Summary tab
  • Failed region names are included in the Slack notification (e.g. 1/18 regions failed: sa-east-1)
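
As a rough illustration of the loop described above, the sketch below attempts every region, records failures, and prints a summary in the same shape as the Slack example. It is an assumption about how such a helper could look; the real run_region_loop in libBuild.sh may take different arguments and write additional output.

```shell
# Sketch only. Usage: run_region_loop <publish_command> <region>...
run_region_loop() {
  local publish_cmd="$1"
  shift
  local total=0
  local failed_count=0
  local failed_regions=""
  local region
  for region in "$@"; do
    total=$((total + 1))
    # Running the command as an if-condition means a non-zero exit status
    # does not abort the script, even when set -e is active.
    if "$publish_cmd" "$region"; then
      echo "PASS $region"
    else
      echo "FAIL $region"
      failed_count=$((failed_count + 1))
      failed_regions="${failed_regions:+$failed_regions, }$region"
    fi
  done
  if [ "$failed_count" -gt 0 ]; then
    echo "${failed_count}/${total} regions failed: ${failed_regions}"
    return 1
  fi
  echo "all ${total} regions published"
}
```

The function still returns non-zero when any region fails, so the job can be marked failed after every region has been attempted.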

ECR Publishing Decoupled from Layer Publishing

  • For Python, Ruby, Dotnet, and Extension: ECR push was previously bundled inline in the same script block as layer publishing — a region failure would kill the script before ECR ran
  • Replaced with publish_ecr_safe / finalize_ecr_results helpers: ECR always runs regardless of layer region failures, and ECR failures are reported separately without affecting the layer publish exit code
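
A minimal sketch of the decoupling idea is shown below. The helper names come from this PR, but the bodies, the arguments, and the `ECR_FAILURES` variable are assumptions made for illustration only.

```shell
# Sketch only: collect ECR failures without changing the caller's exit status.
ECR_FAILURES=""   # assumed variable name

# Usage: publish_ecr_safe <label> <command> [args...]
publish_ecr_safe() {
  local label="$1"
  shift
  if "$@"; then
    echo "ECR push succeeded: $label"
  else
    echo "ECR push failed: $label"
    ECR_FAILURES="${ECR_FAILURES:+$ECR_FAILURES, }$label"
  fi
  return 0   # never propagate the ECR failure to the layer publish result
}

finalize_ecr_results() {
  if [ -n "$ECR_FAILURES" ]; then
    echo "ECR failures: $ECR_FAILURES"
  else
    echo "ECR: all pushes succeeded"
  fi
  return 0
}
```

Because failures are accumulated in a shell variable, `publish_ecr_safe` has to be called in the current shell rather than inside a command substitution or pipeline.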

Configurable Infrastructure via Actions Vars

  • LAYER_REGIONS, S3_BUCKET_PREFIX, and ECR_REPOSITORY are now read from GitHub Actions repository variables instead of being hardcoded in scripts
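
On the script side, consuming these variables could look something like the sketch below. The function name is hypothetical, and the comma-separated format for `LAYER_REGIONS` is an assumption; the PR does not say how the value is delimited.

```shell
# Sketch only: fail early if a required variable is missing, then list regions.
load_publish_config() {
  # ${VAR:?message} aborts the (sub)shell with the message when VAR is unset or empty.
  : "${LAYER_REGIONS:?LAYER_REGIONS must be set}"
  : "${S3_BUCKET_PREFIX:?S3_BUCKET_PREFIX must be set}"
  : "${ECR_REPOSITORY:?ECR_REPOSITORY must be set}"
  # Assumption: regions are comma-separated. Emit one region per line.
  printf '%s\n' "$LAYER_REGIONS" | tr ',' '\n'
}
```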

Automatic Region Retry on Workflow Re-run

  • On re-run, only previously failed regions are retried (via region-retry composite action) instead of republishing to all regions
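
The selection logic for a re-run might be sketched like this. It is not the region-retry action itself: the function name and the one-region-per-line file format are assumptions.

```shell
# Sketch only. Usage: regions_to_publish <previous_failures_file> <region>...
# Prints the regions to attempt, one per line.
regions_to_publish() {
  local previous_failures="$1"
  shift
  if [ -s "$previous_failures" ]; then
    # A non-empty file from the previous attempt: retry only those regions.
    cat "$previous_failures"
  else
    # First attempt, or nothing failed last time: publish to every region.
    printf '%s\n' "$@"
  fi
}
```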

Structured Slack Notifications

  • New notify-slack-layer composite action sends one consolidated Slack message per workflow run
  • Message includes per-runtime publish status, failed region list, ECR failure summary, and a direct link to the Actions run
  • fail-fast: false added to all matrix jobs so a node20 failure no longer auto-cancels node22/node24

Test Plan

  • Trigger a publish workflow with a valid tag and verify staging validation runs before production publish
  • Verify Slack message includes runtime status, failed regions (if any), and run URL
  • Simulate a region failure — confirm remaining regions still publish and the summary table appears in the Actions Summary tab
  • Simulate an ECR failure — confirm layer publish result is unaffected and ECR failure appears in Slack
  • Re-run a failed workflow — confirm only failed regions are retried
  • Verify staging layers are deleted after validation regardless of outcome

…ne staging, validation, cleanup, and release processes
…30 min

Polling every 15s was noisy and unnecessary — validation takes several
minutes on the orchestrator side. Switch to 300s sleep so the job only
checks when results are realistically ready.

Default VALIDATION_TIMEOUT_S raised from 600s to 1800s so at least
5–6 polls occur before giving up. Added timeout-minutes: 35 on the
validate job to match.
… regions

- Add publish_layer_safe + run_region_loop to libBuild.sh so a single
  region failure no longer aborts the entire publish loop; all regions
  are attempted and a pass/fail summary is printed at the end.
- Fix empty layer_version propagation: add || return 1 after
  publish-layer-version command substitution so failures inside
  publish_public_layer are correctly surfaced when set -e is inactive
  (i.e. when called from an if-context).
- Make GITHUB_STEP_SUMMARY and GITHUB_OUTPUT writes non-fatal (|| true)
  to avoid permission-denied failures inside Docker containers.
- Replace all bare for-region loops in nodejs, python, java, ruby,
  dotnet, extension publish scripts with run_region_loop.
- Add strategy.fail-fast: false to all matrix publish jobs so e.g.
  node22/node24 are not cancelled when node20 fails.
- Mount GITHUB_OUTPUT + GITHUB_STEP_SUMMARY into Docker steps (node,
  java21) so failure_summary can propagate out of the container.
- Add outputs: failure_summary to every publish job; include
  "N/total regions failed: <list>" in all Slack payloads when set.
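
The `|| return 1` fix above addresses a shell subtlety: when a function is called as an if-condition, `set -e` is ignored inside it, so a failed command substitution does not stop the function. The sketch below demonstrates this with a stand-in for the AWS call; all function names here are hypothetical.

```shell
# Stand-in for the publish-layer-version call; it always fails in this demo.
fake_publish_layer_version() { return 1; }

publish_without_guard() {
  local layer_version
  layer_version=$(fake_publish_layer_version)
  # Still reached in an if-context, with an empty layer_version.
  echo "published version '${layer_version}'"
}

publish_with_guard() {
  local layer_version
  layer_version=$(fake_publish_layer_version) || return 1
  echo "published version '${layer_version}'"
}

set -e
if publish_without_guard; then
  echo "unguarded: reported success"
fi
if publish_with_guard; then
  echo "guarded: reported success"
else
  echo "guarded: reported failure"
fi
```

The unguarded function prints an empty version and reports success, while the guarded one reports failure. Note that `local layer_version=$(...)` on a single line would also hide the failure, because the exit status would then be that of `local`.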
…echanism

Docker containers run as root but cannot write to GITHUB_OUTPUT/GITHUB_STEP_SUMMARY
(owned by the runner user, mode 600). Failures were silently dropped, so no
artifact was ever saved and re-runs always published to all regions.

Fix:
- run_region_loop writes to FAILED_REGIONS_FILE (new env var) when set
- publish-node.yml and publish-java.yml pre-create a world-writable temp file,
  mount it into Docker, then relay its contents to GITHUB_OUTPUT after docker exits
- chmod a+rw GITHUB_STEP_SUMMARY before Docker so the summary table renders
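
A rough sketch of the relay is shown below, split into the three stages the fix describes. The function names are hypothetical, and writing the file contents under the `failure_summary` output key is an assumption about how the pieces connect.

```shell
# Sketch only. Stage 1, on the runner before docker starts:
# create a temp file that any user in the container can write to.
prepare_failed_regions_file() {
  local file
  file="$(mktemp)"
  chmod a+rw "$file"
  echo "$file"
}

# Stage 2, inside the container: record failures only when the variable is set.
record_failed_regions() {
  if [ -n "${FAILED_REGIONS_FILE:-}" ]; then
    printf '%s\n' "$1" > "$FAILED_REGIONS_FILE" || true
  fi
}

# Stage 3, on the runner after docker exits: copy the result into GITHUB_OUTPUT.
relay_failed_regions() {
  local file="$1"
  if [ -s "$file" ] && [ -n "${GITHUB_OUTPUT:-}" ]; then
    echo "failure_summary=$(cat "$file")" >> "$GITHUB_OUTPUT"
  fi
}
```
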
ECR publish now runs independently (if: always()) so a region failure
in layer publishing no longer prevents Docker images from being pushed.

Slack notification now downloads failed-regions-* artifacts and shows
which specific regions failed per version instead of a generic message.
Added failure_key to all versions_json entries and run_attempt input
to all 6 notify-slack jobs. ECR failures still surfaced separately.
ashishsinghnr previously approved these changes May 6, 2026
    required: true
  versions_json:
    description: |
      JSON array: [{"key":"20","label":"Node.js 20","job":"publish-node (20)","fallback":"success","failure_key":"nodejs-20"}]

I hope these are example data. Can you update them to look like example data?

Contributor Author

Hi @thisistrinadh, thank you for highlighting this. I have updated the description.

thisistrinadh previously approved these changes May 7, 2026

@thisistrinadh thisistrinadh left a comment

Minor comment. It can be addressed later as well.

@aavinash-nr aavinash-nr dismissed stale reviews from thisistrinadh and ashishsinghnr via d556865 May 7, 2026 10:14
Contributor

@Sashwatdas123 Sashwatdas123 left a comment

LGTM, thank you @aavinash-nr for raising this PR 👍🏻

@aavinash-nr aavinash-nr merged commit a939fbf into newrelic:master May 7, 2026
12 checks passed
4 participants