ci(vv): aggregate Pages publish into one workflow to avoid concurrency cancellations#128

Open
renatosfagundes wants to merge 5 commits into main from development

Summary

  • Promotes the V&V Pages-publish CI hotfix from development to main, fixing the publish-race that left 2-3 of the five module HTML reports stale on every push to a trunk branch since the v2.0.0 release.
  • Replaces the per-workflow publish-pages jobs with a single aggregator workflow (.github/workflows/publish-vv-reports.yml) triggered by workflow_run on completion of any of the five module V&V workflows.
  • The aggregator's gate runs gh run list for each module workflow on the triggering SHA. It treats "no run for this SHA" as not applicable (so single-module pushes deploy what ran) and deploys once every triggered workflow has completed successfully.
  • Restores branch-namespaced report URLs (vv_<module>/<branch>/), preserving dev/main isolation that the previous per-module destination_dir provided and that every V&V step-summary browse-link still expects.
  • Refreshes the "Reports are published" comment blocks in all five vv-*.yml files and the matching paragraph in publish-index.yml to point readers at publish-vv-reports.yml.
  • No source-code, requirements, or model changes — CI infrastructure only.

Related Issue

Closes #127

Change Type

  • feat
  • fix
  • docs
  • test
  • ci
  • refactor/chore

AEB Areas Affected

  • Perception
  • Decision Logic
  • Driver Alerts
  • Braking Control
  • State Machine
  • CAN Interface
  • UDS Diagnostics
  • Integration
  • Documentation only

Requirements Impacted

Functional Requirements

  • N/A — no module behaviour changes; CI publication path only.

Non-Functional Requirements

  • NFR-VV-CI: every push to development/main whose V&V gates pass must publish the corresponding module reports to GitHub Pages.

Artifacts Updated

  • Requirements document
  • Modeling document
  • Simulink / Stateflow model
  • C source code
  • Tests
  • CI workflow
  • README / SCM documentation
  • CHANGELOG

Validation Evidence

  • Local build passes
  • Automated tests pass
  • CI passes
  • Traceability updated
  • Safety-relevant impact assessed
  • Documentation updated

Evidence

Failure observed on main HEAD (5cf8eba) — the v2.0.0 release:

  Workflow             Stack job   publish-pages
  Perception V&V       success     deployed
  PID + Alert V&V      success     deployed
  Publish V&V index    -           deployed
  CAN V&V              success     cancelled
  Decision V&V         success     cancelled
  UDS V&V              success     cancelled

Cancelled jobs entered completed/cancelled ~3 s after start with steps: [] — classic "third arrival to a cancel-in-progress: false group cancels the previously pending job" pattern.
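
The cancellation pattern can be reproduced with a small model. The sketch below is a simplification, not GitHub's scheduler: it assumes a concurrency group holds one running job plus one pending job, and that a new arrival replaces whatever is pending. The job names, arrival times, and durations are hypothetical, chosen only to show how six contenders can end with three cancellations.

```python
def simulate(jobs):
    """Simulate a '1 running + 1 pending' concurrency group.

    jobs: list of (name, arrival_time, duration), sorted by arrival_time.
    Returns a dict mapping each job name to "deployed" or "cancelled".
    Assumption: a new arrival cancels the job that is currently pending.
    """
    outcome = {}
    running_until = 0.0  # time at which the running job releases the slot
    pending = None       # (name, duration) of the job waiting for the slot

    for name, arrival, duration in jobs:
        # If the slot freed up before this arrival, the pending job took it.
        if pending is not None and running_until <= arrival:
            p_name, p_duration = pending
            running_until += p_duration
            outcome[p_name] = "deployed"
            pending = None

        if running_until <= arrival:
            # Slot is free: the new job starts immediately.
            running_until = arrival + duration
            outcome[name] = "deployed"
        else:
            # Slot is busy: the new job displaces any pending job.
            if pending is not None:
                outcome[pending[0]] = "cancelled"
            pending = (name, duration)

    # Whatever is still pending at the end eventually gets the slot.
    if pending is not None:
        outcome[pending[0]] = "deployed"
    return outcome


# Hypothetical timings: five jobs arrive within seconds, a sixth arrives later.
six = [("A", 0, 10), ("B", 1, 10), ("C", 2, 10),
       ("D", 3, 10), ("E", 4, 10), ("F", 15, 10)]
print(simulate(six))

# Two contenders (index + aggregator) fit in the running and pending slots.
print(simulate([("index", 0, 10), ("reports", 1, 10)]))
```

Under this model, which jobs survive depends only on arrival order and timing, which matches the observation that a different 2-3 reports went stale from push to push.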

Dry-run of the fixed aggregator against the same SHA (no deploy side effects, gh run list + gh run view):

SHA: 5cf8eba6f472513b27b212be3207230a8f40838a   (main, the broken release)

  Workflow            run id        wf concl    stack concl
  CAN V&V             25447897623   cancelled   success
  Decision V&V        25447896084   cancelled   success
  Perception V&V      25447892593   success     success
  PID + Alert V&V     25447891296   success     success
  UDS V&V             25447894289   cancelled   success

  FIXED gate -> DEPLOY (5 module reports published)

Same dry-run on a real partial-trigger push (b89ba86d02 on development, where only 4 of 5 V&V workflows fired because path filters didn't match Perception):

SHA: b89ba86d02b5f3779f4d504837bcfd9097560022

  Workflow            run id        stack concl
  CAN V&V             25413361683   success
  Decision V&V        25413361675   success
  Perception V&V      —             (no run for this SHA — skipped)
  PID + Alert V&V     25413361680   success
  UDS V&V             25413361684   success

  FIXED gate -> DEPLOY (4 module reports published, Perception keeps prior gh-pages content)
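
For reviewers who want to reason about the gate without reading the bash, the decision exercised by the two dry-runs can be sketched as a pure function. This is an illustration under assumptions, not the code in publish-vv-reports.yml: the record keys `status` and `stack_conclusion` and the return labels `DEPLOY`, `WAIT`, and `SKIP` are hypothetical names.

```python
from typing import Optional


def gate(runs: dict[str, Optional[dict]]) -> tuple[str, list[str]]:
    """Decide what the aggregator should do for one SHA.

    runs maps a workflow name to its run record for the SHA, or None when
    no run exists (the workflow's path filters did not match).
    Each record carries hypothetical keys "status" and "stack_conclusion".
    """
    # "No run for this SHA" means not applicable, so drop those entries.
    triggered = {name: run for name, run in runs.items() if run is not None}

    if not triggered:
        return "SKIP", []  # nothing ran, nothing to publish
    if any(run["status"] != "completed" for run in triggered.values()):
        return "WAIT", []  # a later invocation will pick this up
    if any(run["stack_conclusion"] != "success" for run in triggered.values()):
        return "SKIP", []  # a gate failed, so do not deploy
    return "DEPLOY", sorted(triggered)


# Shape of the partial-trigger push on development (Perception did not run).
ok = {"status": "completed", "stack_conclusion": "success"}
partial = {
    "CAN V&V": ok,
    "Decision V&V": ok,
    "Perception V&V": None,
    "PID + Alert V&V": ok,
    "UDS V&V": ok,
}
print(gate(partial))
```

Filtering out the `None` entries before any other check is what removes the deadlock: a workflow that never ran can no longer hold the gate in a waiting state.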

Local validation: yaml.safe_load passes on all 7 touched workflow files.

Reviewer Notes

  • This PR is the dev → main promotion of the bundle already approved on PR #126 (Rian's review + follow-up fixes). The diff into main is exactly that bundle, since the dev and main trees were identical at the v2.0.0 release point.
  • Focus on the Gate step in publish-vv-reports.yml: it's the only piece preventing duplicate or empty deploys. The three "exit silently" branches must not error or they cancel their own concurrency slot.
  • The aggregator hardcodes the five workflow name: strings in three places (workflow_run trigger list, gate's bash array, art_prefix table). Inline # NOTE: comments flag this for future maintainers; there's no CI check enforcing the sync.
  • The aggregator still uses concurrency: gh-pages-deploy so it serialises with publish-index.yml. With per-module publish jobs gone, max in-flight on the group is 2 (this one + index), which fits within the 1 running + 1 pending limit, so neither is cancelled.

Risks / Open Points

  • First CI proof comes after merge, not before. The diff doesn't match any vv-*.yml paths: filter, so merging this PR won't itself trigger any V&V workflow — true end-to-end validation lands with the next source-code commit on main.
  • Workflow-name fragility: any future rename of a workflow's name: field silently drops that module unless all three lists in publish-vv-reports.yml are updated. Adding a CI check that diffs the lists is a sensible follow-up.
  • Artefact retention: gh run download reads from per-run artefacts (90-day default). A re-run of the aggregator alone on an old SHA whose artefacts have expired would fail to fetch; acceptable because Pages publication is forward-only.
  • Manual recovery on partial failures: if 4 of 5 modules pass and 1 fails on a push, the aggregator skips deploy. After the failing module is re-run successfully, the aggregator fires from that re-run event and deploys. Worth a README note once this has been observed in practice.
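
The follow-up check suggested under workflow-name fragility could look roughly like the sketch below. It only covers the comparison step. How the three lists are extracted from publish-vv-reports.yml is left open, and the function name and list labels are hypothetical.

```python
def find_drift(**lists: list[str]) -> dict[str, list[str]]:
    """Report, per list, the workflow names it lacks relative to the union.

    Each keyword argument is one of the hardcoded lists. An empty result
    means all lists contain exactly the same names.
    """
    union = set().union(*lists.values())
    drift = {}
    for label, names in lists.items():
        missing = sorted(union - set(names))
        if missing:
            drift[label] = missing
    return drift


NAMES = ["CAN V&V", "Decision V&V", "Perception V&V", "PID + Alert V&V", "UDS V&V"]

# In sync: nothing to report.
print(find_drift(trigger=NAMES, gate=NAMES, art_prefix=NAMES))

# Hypothetical rename applied to the gate array only.
renamed = NAMES[:-1] + ["UDS Diagnostics V&V"]
print(find_drift(trigger=NAMES, gate=renamed, art_prefix=NAMES))
```

A CI step could fail the build whenever the returned dict is non-empty, which would turn the silent drop into a visible error.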

…y cancellations

Each of the five module V&V workflows used to deploy its own HTML
reports to gh-pages via a `publish-pages` job. With five workflows
triggered in parallel on every push to development/main, three
publish jobs were cancelled by GitHub Actions' "1 running + 1 queued"
concurrency limit (group `gh-pages-deploy`). Net effect: 2-3 of the
five module reports never updated on Pages despite the gates passing.

Replace with a single aggregator (`publish-vv-reports.yml`) triggered
by `workflow_run` on completion of any of the five V&V workflows.
Each invocation polls the GitHub API to check whether all five
sibling workflows for the same SHA have completed successfully; if
not, exits silently. Whichever invocation runs last performs a
single deploy that includes every module's artefacts. With one
job per push contending for the concurrency group, no cancellations.

Removes the `publish-pages` job from:
  - vv-can.yml
  - vv-decision.yml
  - vv-perception.yml
  - vv-pid-alert.yml
  - vv-uds.yml

The publish-index.yml workflow is unchanged — it continues to
deploy the docs/index.html landing page; this aggregator deploys
the per-module subdirectories under it. Both share the
`gh-pages-deploy` concurrency group, but with at most 2 contenders
(index + reports) instead of 6, neither gets cancelled.

Tested locally: all five trimmed workflows pass YAML lint, and the
aggregator matches the existing artefact-naming convention
(`vv-<module>-reports-run<N>`).
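
A minimal sketch of the naming convention, for anyone scripting against the artefacts. The module slugs (`can`, `pid-alert`) are assumed from the workflow file names, and reading `<N>` as a numeric run identifier is also an assumption.

```python
import re

# Pattern for vv-<module>-reports-run<N>; the slug may itself contain hyphens.
ARTIFACT_RE = re.compile(r"^vv-(?P<module>[a-z0-9-]+?)-reports-run(?P<run>\d+)$")


def artifact_name(module: str, run: int) -> str:
    """Build an artefact name following the convention."""
    return f"vv-{module}-reports-run{run}"


def parse_artifact_name(name: str):
    """Return (module, run) for a matching name, or None otherwise."""
    match = ARTIFACT_RE.match(name)
    if match is None:
        return None
    return match.group("module"), int(match.group("run"))


print(artifact_name("can", 7))
print(parse_artifact_name("vv-pid-alert-reports-run42"))
```
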
…eport URLs

Addresses Rian's review of PR #126:

1. Gate deadlock on single-module pushes (HIGH)
   The V&V workflows' `paths:` filters mean a commit touching one
   module's files only triggers that module's workflow. Previously the gate
   waited for all 5 workflows to show completed-success for the SHA;
   the 4 that never ran returned [] from `gh run list`, the gate
   parsed that as status=unknown, set all_done=false, and exited
   silently forever — Pages never updated. Now the gate and the
   download loop treat "no run for this SHA" as "not applicable"
   and deploy once every triggered workflow has completed
   successfully. Modules that did not run keep their previous
   gh-pages content untouched via keep_files: true.

2. URL drift on step-summary browse-links (MEDIUM)
   The previous per-module publish-pages jobs used
   destination_dir: vv_<module>/<branch>/, so the live URL was
   vv_<module>/development/ (or /main/). The aggregator now restores
   that namespacing by downloading into site/<module>/<branch>/ — the
   five step summaries (Browse reports: …/vv_<module>/${ref_name}/)
   resolve again, and dev/main snapshots stay isolated instead of
   overwriting each other.

3. Stale header comments (LOW)
   Refreshed the "Reports are published" blurb in all 5 vv-*.yml
   files and the matching paragraph in publish-index.yml to point
   readers at publish-vv-reports.yml as the canonical deployer.

4. Workflow-name sync hazard
   Added inline comments at the workflow_run trigger list and the
   bash arrays warning that the five names must stay in lockstep.
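
The branch namespacing from item 2 can be illustrated with a short sketch. The helper name and the restriction to the two trunk branches are assumptions for illustration. The `vv_<module>/<branch>/` layout is the one described above.

```python
TRUNK_BRANCHES = ("development", "main")  # assumed set of publishing branches


def report_dir(module: str, branch: str) -> str:
    """Return the Pages subdirectory for one module on one branch."""
    if branch not in TRUNK_BRANCHES:
        raise ValueError(f"reports are only published for {TRUNK_BRANCHES}")
    return f"vv_{module}/{branch}/"


# The same module lands in different directories per branch,
# so development and main snapshots cannot overwrite each other.
for module in ("can", "decision", "uds"):
    print(report_dir(module, "development"), report_dir(module, "main"))
```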

Validation:
- yaml.safe_load passes for all 7 touched workflow files.
- Dry-runs of the new gate against `main` SHA 5cf8eba (the broken
  release, 3-of-5 publish-pages cancelled) and `development` SHA
  b89ba86 (real partial trigger — Perception did not run) both
  evaluate to DEPLOY and would publish the live artefacts.
…lish

ci(vv): aggregate Pages publish into one workflow to avoid concurrency cancellations
The Makefile is in every vv-*.yml `paths:` filter, so this no-op
comment bump fires all five V&V workflows on the merge commit. With
publish-vv-reports.yml now on development, the aggregator fans them
in and produces a single gh-pages deploy commit refreshing
vv_<module>/development/ for every module (last refresh was 2026-05-06,
before the publish-pages race fix landed).

This same commit, carried forward in the upcoming development -> main
PR, will trigger the equivalent fan-in on main and finally populate
vv_can/main/, vv_decision/main/, and vv_uds/main/ -- which have been
404 since the v2.0.0 release because their publish-pages jobs lost
the concurrency race on every push.

No build/runtime impact; comment-only change.
renatosfagundes marked this pull request as draft May 15, 2026 03:07
chore(vv): trigger V&V workflows to repopulate gh-pages snapshots
renatosfagundes marked this pull request as ready for review May 15, 2026 03:29
Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

Pages publish race cancels 2-3 V&V reports per push

1 participant