Copied from upstream: NVIDIA/OpenShell#1946
Problem Statement
main can break when a pull request passes CI on a stale branch head, then GitHub creates a final merge commit against a newer main that was never tested by the PR checks.
We saw this with PR NVIDIA/OpenShell#1870 and PR NVIDIA/OpenShell#1577. PR NVIDIA/OpenShell#1870 passed Branch Checks on stale head 29d57cc, whose merge-base did not include NVIDIA/OpenShell#1577. The final merge commit ff028ce0 combined NVIDIA/OpenShell#1870 TLS reload shutdown handling with NVIDIA/OpenShell#1577 compute watcher shutdown handling, introduced a duplicate shutdown_tx binding, and caused main Rust lint to fail in https://github.com/NVIDIA/OpenShell/actions/runs/27656754843/job/81792472668.
PR NVIDIA/OpenShell#1945 fixes that immediate break, but the integration gap remains.
Proposed Design
Enable GitHub merge queue for the protected main branch and require queued merge groups to pass the same gates required for normal PRs.
Implementation outline:
- Enable "Require merge queue" for the main branch protection/ruleset.
- Add the merge_group trigger to workflows that publish required PR gate inputs, especially:
  - .github/workflows/branch-checks.yml
  - .github/workflows/branch-e2e.yml
  - .github/workflows/helm-lint.yml
- Confirm .github/workflows/required-ci-gates.yml can evaluate and publish required gate statuses for merge-group runs, or update it so the required contexts are reported for merge queue validation.
- Keep the required contexts aligned with the existing PR gate contexts:
  - OpenShell / Branch Checks
  - OpenShell / E2E
  - OpenShell / GPU E2E
  - OpenShell / Helm Lint
- Document the expected maintainer workflow for adding a PR to the merge queue instead of merging directly.
GitHub documentation notes that merge queues validate PR changes applied to the latest target branch and any earlier queued changes, and that GitHub Actions workflows used as required checks must include the merge_group event: https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-a-merge-queue
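As a rough illustration of the trigger change, the sketch below adds merge_group to a branch workflow. It is not the actual content of .github/workflows/branch-checks.yml: the push and workflow_dispatch entries only mirror the triggers described in this issue, and the job name, runner label, and steps are hypothetical placeholders.

```yaml
# Sketch only: trigger block for a branch workflow such as
# .github/workflows/branch-checks.yml. The real file may differ.
on:
  push:
    branches:
      - "pull-request/[0-9]+"
  workflow_dispatch:
  # New: run the same jobs when the merge queue requests checks
  # for a merge group.
  merge_group:
    types: [checks_requested]

jobs:
  branch-checks:            # hypothetical job name
    runs-on: ubuntu-latest  # assumption: real runner labels may differ
    steps:
      - uses: actions/checkout@v4
      - name: Show which commit is being validated
        run: |
          echo "event: ${GITHUB_EVENT_NAME}"
          echo "sha:   ${GITHUB_SHA}"
```

On a merge_group run, GITHUB_SHA is the temporary merge-group commit (on a gh-readonly-queue/ ref) rather than the PR head, so the jobs validate the PR combined with current main and any earlier queued changes.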
Alternatives Considered
Require PR branches to be up to date before merging.
That would likely have caught the NVIDIA/OpenShell#1870/NVIDIA/OpenShell#1577 interaction too, because NVIDIA/OpenShell#1870 would have had to rerun checks after updating to include NVIDIA/OpenShell#1577. However, it pushes more manual branch-update work onto contributors and maintainers. Merge queue is a better fit for a busy main branch because it validates the final integration state without forcing every PR author to repeatedly rebase or merge main by hand.
Rely only on push CI after merge.
This detects breakage after main is already broken, which is what happened here.
Agent Investigation
- Used watch-github-actions to inspect failing run 27656754843 / job 81792472668.
- The failing step was cargo clippy --workspace --all-targets -- -D warnings, rejecting unused variable shutdown_tx in crates/openshell-server/src/lib.rs.
- git blame and history showed the production duplicate channel came from the interaction of:
  - e73745f1 adding compute watcher shutdown handling.
  - ff028ce0 adding TLS reload shutdown handling.
- NVIDIA/OpenShell#1870 passed its checks on stale head 29d57cc, which did not include NVIDIA/OpenShell#1577 (feat(gateway): add reconciler lease for HA multi-replica deployments). The broken code only existed in the final merge commit onto newer main.
Comment from @elezar (maintainer)
Additional example: NVIDIA/OpenShell#1872 is another case this issue should cover.
NVIDIA/OpenShell#1872 fixed the server build after NVIDIA/OpenShell#1566 introduced gRPC rate limiting code that used the private tonic::body::BoxBody alias. The PR checks for NVIDIA/OpenShell#1566 passed before the final integrated state on main exposed the tonic 0.14 incompatibility. A merge queue, or a strict up-to-date-branch requirement, would have forced the final branch-plus-current-main state through the required Rust checks before merge instead of relying on push CI after main was already broken.
This reinforces that the work is not only a repository setting change. The required CI workflows need to support merge queue validation explicitly. In the current workflow set, .github/workflows/branch-checks.yml, .github/workflows/branch-e2e.yml, and .github/workflows/helm-lint.yml trigger on push to pull-request/[0-9]+ plus workflow_dispatch, but not merge_group. .github/workflows/required-ci-gates.yml also keys its aggregation around pull_request_target and completed workflow runs from those branch workflows. The implementation should update these workflows so the required contexts are reported for merge-group SHAs.
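One possible shape for reporting a required context on a merge-group SHA is sketched below. It assumes the required contexts are commit statuses and that a single placeholder job stands in for the real gate evaluation. The actual required-ci-gates.yml is not shown in this issue and may aggregate results differently (for example through check runs or workflow_run events), so the job names, steps, and description text here are hypothetical.

```yaml
# Sketch only: publish a required context on the merge-group commit.
# Not the real .github/workflows/required-ci-gates.yml.
on:
  merge_group:
    types: [checks_requested]

permissions:
  contents: read
  statuses: write   # needed to create commit statuses

jobs:
  checks:                   # hypothetical stand-in for the real gate jobs
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "placeholder for the real branch checks"

  publish-gate:             # hypothetical job name
    needs: checks
    if: always()            # report failure as well as success
    runs-on: ubuntu-latest
    steps:
      - name: Report the gate status on the merge-group SHA
        env:
          GH_TOKEN: ${{ github.token }}
          SHA: ${{ github.event.merge_group.head_sha }}
          STATE: ${{ needs.checks.result == 'success' && 'success' || 'failure' }}
        run: |
          gh api "repos/${GITHUB_REPOSITORY}/statuses/${SHA}" \
            -f state="${STATE}" \
            -f context="OpenShell / Branch Checks" \
            -f description="Merge queue validation (sketch)"
```

The context string matches one of the existing PR gate contexts so that the same branch protection requirement is satisfied for both PR heads and merge groups. The other contexts (E2E, GPU E2E, Helm Lint) would need equivalent reporting.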