🔴 Estate CI-red master cause: @d135b05 reusable bump broke ~300 repos (diagnosis + remediation plan) #426

Description

@hyperpolymath

Round-2 estate CI audit (2026-06-25): ~300 of 377 repos have red CI; ~207 have a required-check deadlock (failing shared checks are required for merge, which is why nearly everything has needed an admin-merge). Root-caused from live failed-run logs. No changes made yet (owner chose diagnose-first).

MASTER CAUSE

The 2026-06-23/24 standardization sweep that bumped all wrapper pins to @d135b05 introduced two startup-failure bugs in the shared reusables/wrappers:

  • Bug A: timeout-minutes: on a uses: reusable-caller job. GitHub forbids it: "when a reusable workflow is called with 'uses', 'timeout-minutes' is not available." → run fails validation, 0 jobs. In governance.yml, secret-scanner.yml, mirror.yml wrappers (generator-emitted).
  • Bug B: reusable job requests permissions above the caller's cap. At @d135b05 the scorecard + secret-scanner (gitleaks) reusables gained job-level permissions: requesting security-events: write / id-token: write / pull-requests: write; callers cap at workflow-level contents: read. GitHub: "permissions can only be maintained or reduced, not elevated." → unsatisfiable → startup_failure. Proven by the sharp transition (success at @3e4bd4c → startup_failure at @d135b05, wrapper otherwise identical).
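
To make the two rules concrete, here is a minimal Python sketch of both checks. It assumes each workflow has already been parsed into a plain dict (roughly what a YAML loader would return). The function names, the `OWNER/REPO` path, and the `governance-reusable.yml` filename are hypothetical and not taken from the estate; the requested scopes mirror the ones listed under Bug B. This only illustrates what the rules reject and is not a substitute for a real linter.

```python
# Sketch only: workflows are plain dicts, as a YAML loader would produce.
# All identifiers and the OWNER/REPO path are hypothetical.

LEVELS = {"none": 0, "read": 1, "write": 2}


def bug_a_jobs(workflow):
    """Ids of jobs that call a reusable workflow via job-level `uses`
    and also set `timeout-minutes` (rejected at validation)."""
    return [
        job_id
        for job_id, job in workflow.get("jobs", {}).items()
        if "uses" in job and "timeout-minutes" in job
    ]


def bug_b_scopes(caller_cap, requested):
    """Scopes where the reusable job asks for more than the caller grants.
    A scope absent from an explicit permissions block is treated as 'none'."""
    return sorted(
        scope
        for scope, level in requested.items()
        if LEVELS[level] > LEVELS[caller_cap.get(scope, "none")]
    )


wrapper = {
    "permissions": {"contents": "read"},
    "jobs": {
        "governance": {
            "uses": "OWNER/REPO/.github/workflows/governance-reusable.yml@d135b05",
            "timeout-minutes": 10,
        }
    },
}
# Scopes a reusable job might request at job level (per Bug B above).
scorecard_job = {"security-events": "write", "id-token": "write"}

print(bug_a_jobs(wrapper))
print(bug_b_scopes(wrapper["permissions"], scorecard_job))
```

Widening the caller's workflow-level block to include both write scopes makes `bug_b_scopes` return an empty list, which corresponds to the "grant at caller workflow-level" option.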

Per-workflow root cause

| Workflow | ~count | Type | Root cause | Fix |
|---|---|---|---|---|
| OSSF Scorecard | 90 | startup_failure | Bug B (security-events/id-token writes exceed cap) | declare writes at reusable top-level permissions: (or grant at caller workflow-level); re-pin |
| Secret Scanner (scan/*) | 87 | startup_failure | Bug B (gitleaks job perms) + Bug A (timeout on some wrappers) | remove gitleaks job-perms block / hoist scopes; strip timeout-minutes; re-pin |
| Governance | 48 | startup_failure | Bug A | delete timeout-minutes: from governance.yml uses: job + fix generator |
| Hypatia Scan | 47 | step-failure | hypatia-cli.sh scan . exits 1 on ≥medium findings as first line of a bash -e step → aborts before the warn-don't-fail logic (all critical=0) | make the scan line non-fatal (the guard named in remediation item 3) |
| dogfood-gate (K9) | 52 | step-failure | validator drift: k9-validate-action@2d96f43c now requires literal K9! first line + signature_required; templates never migrated | fix *.k9.ncl templates OR relax the validator to accept a leading SPDX comment before K9! |
| mirror.yml | 56 | mixed | Bug A (timeout) + stale pre-#416 reds + skipped-misread-as-failure | strip timeout-minutes; the rest is expected-noise |
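
The Hypatia Scan failure mode can be reproduced in isolation. The sketch below uses Python's `subprocess` to run two tiny scripts under `sh -e`, on the assumption that errexit behaves the same way there as in the `bash -e` step for this simple case. `false` is a stand-in for `hypatia-cli.sh scan .` exiting 1, the `echo` line stands in for the warn-don't-fail logic, and `run_step` is a hypothetical helper; none of these come from the real workflow.

```python
import subprocess


def run_step(script):
    """Run `script` with errexit enabled, as a `-e` shell step would.
    Returns (exit code, captured stdout)."""
    proc = subprocess.run(
        ["sh", "-e", "-c", script], capture_output=True, text=True
    )
    return proc.returncode, proc.stdout.strip()


# `false` stands in for the scanner exiting non-zero on findings.
SCAN = "false"
FOLLOW_UP = "echo 'warn-only logic reached'"

# Scan is the first line: errexit aborts before the follow-up runs.
before = run_step(f"{SCAN}\n{FOLLOW_UP}")

# With the guard appended, the step continues to the follow-up.
after = run_step(f"{SCAN} || true\n{FOLLOW_UP}")

print(before)
print(after)
```

In the first run the step exits non-zero with no output; in the second the follow-up line runs and the step exits 0. Note that the guard also discards the scanner's exit code, so any later logic has to rely on the scanner's output rather than its status.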

Remediation plan (highest leverage first)

  1. Strip timeout-minutes: from every reusable-caller (uses:) job (Bug A) + fix the wrapper generator so it stops emitting it. Detector: actionlint. → clears Governance + much of Secret Scanner + Mirror.
  2. Fix Bug B: declare needed write scopes at the reusable top-level permissions: in scorecard-reusable.yml + secret-scanner-reusable.yml (and/or caller workflow-level), then re-pin estate-wide. → clears Scorecard + clean-wrapper Secret Scanner.
  3. Append || true to the hypatia scan line. → clears ~47.
  4. Drop stale required contexts (e.g. scan / trufflehog, renamed) from each ruleset that still lists them (owner-gated, per-repo). → clears the masked deadlock on the ~207.
  5. K9 template migration (owner-gated). → clears dogfood-gate.

Items 1–3 are single central edits (reusables + generator) that clear the large majority of the ~300 red repos; 4–5 are owner-gated/per-repo. After 1–3, re-run the audit and reassess the deadlock before touching rulesets.
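
Item 1 could be sketched as a text-level pass over a wrapper file. The Python below is a rough illustration, not the generator fix itself: it assumes block-style YAML with space indentation and a top-level `jobs:` key, and the sample wrapper, its `OWNER/REPO` path, and the function names are hypothetical. It removes `timeout-minutes:` only from jobs that have a job-level `uses:` key and leaves ordinary jobs alone.

```python
def indent_of(line):
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def timeout_lines_to_drop(lines, block):
    """Given the line indices of one job body, return the indices of its
    `timeout-minutes:` key if the job also has a job-level `uses:` key."""
    if not block:
        return set()
    # Keys directly under the job id share the smallest indent in the body.
    key_indent = min(indent_of(lines[k]) for k in block)
    keys = {k: lines[k].strip() for k in block if indent_of(lines[k]) == key_indent}
    if not any(v.startswith("uses:") for v in keys.values()):
        return set()
    return {k for k, v in keys.items() if v.startswith("timeout-minutes:")}


def strip_caller_timeouts(text):
    """Return `text` without `timeout-minutes:` on reusable-caller jobs."""
    lines = text.splitlines()
    drop, block = set(), []
    in_jobs, job_indent = False, None
    for idx, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        ind = indent_of(line)
        if ind == 0:  # a new top-level key closes any open job body
            drop |= timeout_lines_to_drop(lines, block)
            block = []
            in_jobs, job_indent = stripped == "jobs:", None
        elif in_jobs:
            if job_indent is None:
                job_indent = ind  # first line under `jobs:` is a job id
            if ind == job_indent:  # next job id: close the previous body
                drop |= timeout_lines_to_drop(lines, block)
                block = []
            else:
                block.append(idx)
    drop |= timeout_lines_to_drop(lines, block)
    return "\n".join(l for k, l in enumerate(lines) if k not in drop) + "\n"


WRAPPER = """\
name: Governance
on: [push]
permissions:
  contents: read
jobs:
  governance:
    uses: OWNER/REPO/.github/workflows/governance-reusable.yml@d135b05
    timeout-minutes: 10
  lint:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - run: echo ok
"""

fixed = strip_caller_timeouts(WRAPPER)
print(fixed)
```

Because the wrappers are generator-emitted, a pass like this would only be a stopgap; the lasting fix is for the generator to stop emitting the key.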
