From 06e868f662510c99bbd43f1b24f8b9f2368c5b01 Mon Sep 17 00:00:00 2001 From: Brandon Shrewsbury Date: Wed, 1 Jul 2026 13:49:24 -0600 Subject: [PATCH 1/6] Document the CI failure triage process Add .github/workflows/ci-failure-triage.md describing how scheduled-job failures flow to ci-failure issues (the report-ci-failure composite action) and how the daily Claude Code triage session turns them into fix PRs or comments, including setup constraints (must run in a repo-scoped session) and the full triage prompt kept in sync with the scheduled trigger. Link it from the workflows README's Failure notifications section and from the top-level README's CI table (updating the stale "N/A (Jira)" blocking notes to "opens ci-failure issue"). --- .github/workflows/README.md | 8 +- .github/workflows/ci-failure-triage.md | 124 +++++++++++++++++++++++++ README.md | 32 ++++--- 3 files changed, 146 insertions(+), 18 deletions(-) create mode 100644 .github/workflows/ci-failure-triage.md diff --git a/.github/workflows/README.md b/.github/workflows/README.md index 6b28b91aed..f5014c05ff 100644 --- a/.github/workflows/README.md +++ b/.github/workflows/README.md @@ -220,9 +220,11 @@ calls the local composite action `.github/actions/report-ci-failure`: `issues: write`. No external service or extra secret is required. This replaced the previous `atlassian/gajira-*` Jira integration, which had -stopped authenticating and left the scheduled jobs unmonitored. A Claude Code -Remote routine triages open `ci-failure` issues—opening a fix PR when the cause -is clear, or commenting on the issue when it is not. +stopped authenticating and left the scheduled jobs unmonitored. A scheduled +Claude Code session then triages open `ci-failure` issues daily—opening a fix PR +when the cause is clear, or commenting on the issue when it is not. See +[CI failure triage](ci-failure-triage.md) for the full flow, setup, and the +triage prompt. 
## Helper scripts diff --git a/.github/workflows/ci-failure-triage.md b/.github/workflows/ci-failure-triage.md new file mode 100644 index 0000000000..b7d8715387 --- /dev/null +++ b/.github/workflows/ci-failure-triage.md @@ -0,0 +1,124 @@ +# CI failure triage + +This document describes how failures in the scheduled CI jobs are reported and +triaged, so anyone can understand, run, or change the process. + +## Overview + +The scheduled jobs have no pull request to block, so they can't fail loudly the +way PR checks do. Instead there are two halves: + +1. **Report** — when a scheduled job fails, it opens (or updates) a GitHub issue + labeled `ci-failure`. +2. **Triage** — once a day, an automated Claude Code session reads the open + `ci-failure` issues, verifies each failure against the actual run, and either + opens a fix PR or comments with the root cause. + +```text +scheduled job fails + └─ report-ci-failure composite action → GitHub issue (label: ci-failure) + └─ daily triage session → fix PR (fixable) + └─ comment (transient / owner decision) +``` + +This replaced the previous `atlassian/gajira-*` Jira integration, which had +stopped authenticating and left the jobs silently unmonitored. + +## 1. Reporting (in-repo) + +Each of `test-code-snippets.yml`, `check-methods.yml`, and `run-htmltest.yml` +ends with a `Report failure` step (gated `if: failure()`) that calls the local +composite action [`.github/actions/report-ci-failure`](../actions/report-ci-failure/action.yml): + +- Opens an issue titled `CI failure: ` with the `ci-failure` label and + a link to the failing run. +- If an open `ci-failure` issue for that job already exists, it comments the new + run link instead of opening a duplicate, so repeat failures collapse into one + tracking issue. +- Authenticates with the automatic `GITHUB_TOKEN` (the workflows grant + `issues: write`). No external service or extra secret. 
+ +To wire the reporter into another scheduled workflow, add `issues: write` to its +permissions and a final step: + +```yaml +- name: Report failure + if: failure() + uses: ./.github/actions/report-ci-failure + with: + job-name: +``` + +## 2. Triage (Claude Code scheduled session) + +A scheduled **Claude Code** session runs daily (~6 AM US Eastern) and triages the +open `ci-failure` issues. It runs under a maintainer's Claude account (no API key +or extra secret) and produces PRs on `claude/ci-fix-` branches, mirroring +how the repo's other `claude/*` PRs are made. + +### Setup notes (how it must be configured) + +- **It must run in a session that is scoped to `viamrobotics/docs`** — that is, + the repo is cloned into the session and GitHub is reached through the session's + `mcp__github__*` tools. A blank/unscoped session has no repo access. +- Set it up as a **scheduled trigger from within a repo-scoped Claude Code (web) + session**, so the scheduled runs inherit that repo access. A fresh + session-per-fire trigger created from _outside_ such a session comes up + unscoped and fails at the access check; binding a schedule to a specific + persistent session may be disabled for the org. An in-session cron is + session-only and not durable — don't rely on it for a daily job. +- To change the behavior, edit the **prompt below** and update the scheduled + trigger to use the new text. + +### Conventions the triage session follows + +- Verifies every fact against the live run/logs/artifacts and repo files — it + does not trust issue text or hand-fed classifications. +- Fixes a whole class of errors exhaustively (all occurrences repo-wide), and + confirms each retarget destination actually exists rather than guessing. +- `Fixes #` only when the PR fully resolves what the job checks; otherwise + `Refs #` so a partially-fixed issue is not auto-closed. 
+- Commits as `Brandon Shrewsbury ` with **no** + `Co-Authored-By: …@anthropic.com` trailer (either breaks the repo's CLA check). +- Leaves transient infra flakiness, external-link churn (403/404/429), and + domain-owner decisions (for example `sdk_protos_map.csv`) as issue comments, + not PRs. + +### The triage prompt + +Keep this in sync with the scheduled trigger. When you change the process, edit +here first, then update the trigger. + +```text +Triage CI failures for viamrobotics/docs (cloned fresh at the repo root each run; use the GitHub connector's mcp__github__* tools + local git and file edits — there is no gh CLI). Act only on facts you verify yourself against the live run, logs, artifacts, and repo files; never trust issue numbers, error counts, or classifications you were handed (including in this prompt or the issue body) without confirming them. + +1. List open issues labeled "ci-failure". If none, stop. + +2. For each issue, gather the FULL picture: read the issue and its comments, find its linked failing Actions run, and pull the COMPLETE error list from the run's logs and artifacts (e.g. run-htmltest uploads an htmltest-report artifact listing every broken link) — do not rely on the snippet in the issue body or a truncated log. Jobs: test-code-snippets.yml (Python/Go/TS samples vs a live machine+org), check-methods.yml (SDK coverage), run-htmltest.yml (broken links). + +3. Skip an issue if it already has an open linked PR, OR if a human/maintainer has already posted a substantive root-cause triage that still holds (do not re-post the same conclusion — avoid noise). + +4. Classify every error into groups; for each group decide fixable vs not: + - Fixable by a minimal edit: internal broken links/anchors (retarget to the correct existing location); a sample using a wrong/renamed SDK method signature; stale API usage. 
+ - NOT fixable by an edit (comment only): transient infra flakiness ("Failed to connect to robot" / "host appears to be offline", gRPC DEADLINE_EXCEEDED / INTERNAL, query timeouts, the shared test machine offline); external-link failures (403/404/429 — third-party churn and rate limits, e.g. pkg.go.dev; do NOT edit links for these, but note any that look permanently dead so a human can update them); and domain-owner decisions (e.g. editing SDK-coverage mapping such as sdk_protos_map.csv). + +5. EXHAUSTIVENESS + VERIFICATION for anything you fix: + - Fix the whole class, not a sample: enumerate EVERY occurrence across the entire repo (grep all files, not just those named in the log) and fix all of them. Before opening the PR, reconcile the number you fixed against the total fixable count; if they don't match, keep going. + - Never guess a destination: confirm each new link/anchor target actually exists (grep the destination page for the exact heading/anchor id). If a broken link has no clear valid target (section renamed or removed), leave it unchanged and flag it as "needs human decision". + - Run the CLAUDE.md pre-PR checks relevant to your change (e.g. prettier + markdownlint for docs) and re-verify every changed target resolves. + +6. Open the PR via the mcp__github__ tools on branch claude/ci-fix-. In the PR body, state exactly what you fixed and what you deliberately left, with counts (e.g. "fixes 10 of 10 internal broken anchors; the remaining ~338 errors are external 403/404/429, out of scope"). Use "Fixes #" ONLY if the PR fully resolves what the job checks so it will pass next run; if it fixes only part (e.g. the internal anchors while external/transient errors remain), use "Refs #" so the issue is NOT auto-closed. Commit as Brandon Shrewsbury , with NO "Co-Authored-By: ...@anthropic.com" trailer (it breaks the CLA). Then comment the PR link + a short verified breakdown on the issue. + +7. 
If transient/uncertain, or the correct fix is a domain-owner decision: comment the verified likely cause on the issue; no PR. + +8. Keep changes minimal and scoped strictly to the reported failures; make no unrelated edits and do not close issues. Post one concise summary comment on each issue you actually acted on. +``` + +## Known limitations + +- The `test-code-snippets` machine samples depend on a shared test machine that + is intermittently offline; those failures are transient and the triage session + will comment rather than "fix" them. See the repair notes in + [`README.md`](README.md). +- `run-htmltest` external-link failures are mostly third-party churn/rate-limits; + only internal anchors are auto-fixed. diff --git a/README.md b/README.md index 36c5552732..879470e758 100644 --- a/README.md +++ b/README.md @@ -117,21 +117,23 @@ A GitHub workflow automatically publishes the docs to [https://docs.viam.com/](h GitHub Actions workflows run linting and link checks on every pull request, publish and sync search on merges to `main`, and validate code samples and SDK coverage on a weekly schedule. The table below is a quick reference; see [`.github/workflows/README.md`](.github/workflows/README.md) for full descriptions, triggers, secrets, and known issues. -| Workflow | What it does | When it runs | Blocks PR? 
| -| --- | --- | --- | --- | -| `vale-lint.yml` | Vale prose style check | Pull request | Yes | -| `codespell.yml` | Spell-check `docs/` | Pull request | Yes | -| `run-htmltest-local.yml` | Internal link check | Pull request | Yes | -| `markdown-lint.yml` | Markdown structure lint | Pull request | Informational | -| `prettier-lint.yml` | Prettier formatting check | Pull request, push to `main` | Informational | -| `python-lint.yml` | Lint Python snippets in Markdown | Pull request | Informational | -| `pr-labeler.yml` | Add `safe to build` label / welcome comment | PR opened | No | -| `alias-reminder.yml` | Remind authors to add redirect aliases | PR moved files | No | -| `docs.yml` | Build site, sync search index | Push to `main` | N/A | -| `inkeep.yml` | Sync docs source to Inkeep AI search | Push to `main` (`docs/`) | N/A | -| `run-htmltest.yml` | External link check | Tuesdays 10:00 UTC | N/A (Jira) | -| `test-code-snippets.yml` | Run Python/Go/TS code samples against live Viam | Mondays 09:00 UTC, push to samples | N/A (Jira) | -| `check-methods.yml` | Check docs coverage of SDK API methods | Wednesdays 10:00 UTC | N/A (Jira) | +| Workflow | What it does | When it runs | Blocks PR? 
| +| ------------------------ | ----------------------------------------------- | ---------------------------------- | ------------------------------ | +| `vale-lint.yml` | Vale prose style check | Pull request | Yes | +| `codespell.yml` | Spell-check `docs/` | Pull request | Yes | +| `run-htmltest-local.yml` | Internal link check | Pull request | Yes | +| `markdown-lint.yml` | Markdown structure lint | Pull request | Informational | +| `prettier-lint.yml` | Prettier formatting check | Pull request, push to `main` | Informational | +| `python-lint.yml` | Lint Python snippets in Markdown | Pull request | Informational | +| `pr-labeler.yml` | Add `safe to build` label / welcome comment | PR opened | No | +| `alias-reminder.yml` | Remind authors to add redirect aliases | PR moved files | No | +| `docs.yml` | Build site, sync search index | Push to `main` | N/A | +| `inkeep.yml` | Sync docs source to Inkeep AI search | Push to `main` (`docs/`) | N/A | +| `run-htmltest.yml` | External link check | Tuesdays 10:00 UTC | N/A (opens `ci-failure` issue) | +| `test-code-snippets.yml` | Run Python/Go/TS code samples against live Viam | Mondays 09:00 UTC, push to samples | N/A (opens `ci-failure` issue) | +| `check-methods.yml` | Check docs coverage of SDK API methods | Wednesdays 10:00 UTC | N/A (opens `ci-failure` issue) | > [!NOTE] > The scheduled `test-code-snippets.yml` and `check-methods.yml` jobs run against a live Viam test organization and external SDK sites, so their results depend on that org's state and on upstream changes. See [`.github/workflows/README.md`](.github/workflows/README.md#test-org-dependency) for details. +> +> On failure, the scheduled jobs open a deduplicated GitHub issue labeled `ci-failure`, and a daily Claude Code session triages those issues into fix PRs or comments. See [CI failure triage](.github/workflows/ci-failure-triage.md). 
From e901b16b31cc73963137452e016387dc51caf5c8 Mon Sep 17 00:00:00 2001
From: Brandon Shrewsbury
Date: Wed, 1 Jul 2026 15:03:12 -0600
Subject: [PATCH 2/6] Harden triage prompt: fix failing linked PRs, minimal diff + prettier gate
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add two lessons from the first live runs: (1) if an issue's linked PR has failing checks, fix that PR's branch in place instead of skipping it (the skip-if-open-PR rule would otherwise leave a red PR stuck); (2) make the minimal diff and run prettier (a required check) — do not follow markdownlint into formatting that prettier rejects, which had left a PR red.
---
 .github/workflows/ci-failure-triage.md | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/.github/workflows/ci-failure-triage.md b/.github/workflows/ci-failure-triage.md
index b7d8715387..844d31d318 100644
--- a/.github/workflows/ci-failure-triage.md
+++ b/.github/workflows/ci-failure-triage.md
@@ -76,6 +76,13 @@ how the repo's other `claude/*` PRs are made.
   does not trust issue text or hand-fed classifications.
 - Fixes a whole class of errors exhaustively (all occurrences repo-wide), and
   confirms each retarget destination actually exists rather than guessing.
+- Makes the **minimal diff** — only the substantive change, no incidental
+  reformatting — then runs the pre-PR checks in order (`prettier --write`, then
+  markdownlint, then vale) and confirms `prettier --check` passes. `prettier` is
+  a required status check, so when it and markdownlint disagree on formatting
+  (for example, blank lines around lists), prettier wins.
+- If an issue's linked PR already exists but its **checks are failing**, fixes
+  that PR's branch in place rather than opening a duplicate or skipping it.
 - `Fixes #` only when the PR fully resolves what the job checks; otherwise
   `Refs #` so a partially-fixed issue is not auto-closed.
- Commits as `Brandon Shrewsbury ` with **no** @@ -96,22 +103,29 @@ Triage CI failures for viamrobotics/docs (cloned fresh at the repo root each run 2. For each issue, gather the FULL picture: read the issue and its comments, find its linked failing Actions run, and pull the COMPLETE error list from the run's logs and artifacts (e.g. run-htmltest uploads an htmltest-report artifact listing every broken link) — do not rely on the snippet in the issue body or a truncated log. Jobs: test-code-snippets.yml (Python/Go/TS samples vs a live machine+org), check-methods.yml (SDK coverage), run-htmltest.yml (broken links). -3. Skip an issue if it already has an open linked PR, OR if a human/maintainer has already posted a substantive root-cause triage that still holds (do not re-post the same conclusion — avoid noise). +3. Decide act vs skip vs FIX-EXISTING-PR: + - Skip if the issue already has an open linked PR whose checks are all GREEN, OR a human/maintainer already posted a substantive root-cause triage that still holds (don't duplicate or re-post — avoid noise). + - If the issue has an open linked PR whose CI checks are FAILING: do NOT open a new PR. Check out that PR's branch, diagnose and fix what's failing (including the PR's own check failures, e.g. a prettier failure), and push to the same branch. Then move on. + - Otherwise, proceed to fix below. 4. Classify every error into groups; for each group decide fixable vs not: - Fixable by a minimal edit: internal broken links/anchors (retarget to the correct existing location); a sample using a wrong/renamed SDK method signature; stale API usage. - NOT fixable by an edit (comment only): transient infra flakiness ("Failed to connect to robot" / "host appears to be offline", gRPC DEADLINE_EXCEEDED / INTERNAL, query timeouts, the shared test machine offline); external-link failures (403/404/429 — third-party churn and rate limits, e.g. 
pkg.go.dev; do NOT edit links for these, but note any that look permanently dead so a human can update them); and domain-owner decisions (e.g. editing SDK-coverage mapping such as sdk_protos_map.csv). 5. EXHAUSTIVENESS + VERIFICATION for anything you fix: - - Fix the whole class, not a sample: enumerate EVERY occurrence across the entire repo (grep all files, not just those named in the log) and fix all of them. Before opening the PR, reconcile the number you fixed against the total fixable count; if they don't match, keep going. + - Fix the whole class, not a sample: enumerate EVERY occurrence across the entire repo (grep all files, not just those named in the log) and fix all of them. Before opening/updating the PR, reconcile the number you fixed against the total fixable count; if they don't match, keep going. - Never guess a destination: confirm each new link/anchor target actually exists (grep the destination page for the exact heading/anchor id). If a broken link has no clear valid target (section renamed or removed), leave it unchanged and flag it as "needs human decision". - - Run the CLAUDE.md pre-PR checks relevant to your change (e.g. prettier + markdownlint for docs) and re-verify every changed target resolves. -6. Open the PR via the mcp__github__ tools on branch claude/ci-fix-. In the PR body, state exactly what you fixed and what you deliberately left, with counts (e.g. "fixes 10 of 10 internal broken anchors; the remaining ~338 errors are external 403/404/429, out of scope"). Use "Fixes #" ONLY if the PR fully resolves what the job checks so it will pass next run; if it fixes only part (e.g. the internal anchors while external/transient errors remain), use "Refs #" so the issue is NOT auto-closed. Commit as Brandon Shrewsbury , with NO "Co-Authored-By: ...@anthropic.com" trailer (it breaks the CLA). Then comment the PR link + a short verified breakdown on the issue. +6. 
MINIMAL, PROPERLY-FORMATTED DIFF (this is where a fix most often regresses): + - Change ONLY what the fix requires. Do NOT reformat, re-wrap, or add/remove blank lines on lines unrelated to your fix. + - Then run the CLAUDE.md pre-PR checks IN THIS ORDER and commit exactly what they produce: (1) `npx prettier --write `, (2) `npx markdownlint-cli --config .markdownlint.yaml `, (3) vale. prettier owns formatting and is a REQUIRED status check; it can disagree with markdownlint (e.g. blank lines around lists) — when they conflict, prettier wins. Never hand-apply a markdownlint suggestion that makes prettier fail. + - Before opening/updating the PR, confirm `npx prettier --check ` passes and markdownlint is clean on every file you touched. If a required check tool genuinely can't run in your environment, say so in the PR body and keep the diff as small as possible so you don't introduce issues that tool would catch. -7. If transient/uncertain, or the correct fix is a domain-owner decision: comment the verified likely cause on the issue; no PR. +7. Open (or update) the PR via the mcp__github__ tools on branch claude/ci-fix-. In the PR body, state exactly what you fixed and what you deliberately left, with counts (e.g. "fixes 10 of 10 internal broken anchors; the remaining ~338 errors are external 403/404/429, out of scope"). Use "Fixes #" ONLY if the PR fully resolves what the job checks so it will pass next run; if it fixes only part (e.g. the internal anchors while external/transient errors remain), use "Refs #" so the issue is NOT auto-closed. Commit as Brandon Shrewsbury , with NO "Co-Authored-By: ...@anthropic.com" trailer (it breaks the CLA). Then comment the PR link + a short verified breakdown on the issue. -8. Keep changes minimal and scoped strictly to the reported failures; make no unrelated edits and do not close issues. Post one concise summary comment on each issue you actually acted on. +8. 
If transient/uncertain, or the correct fix is a domain-owner decision: comment the verified likely cause on the issue; no PR.
+
+9. Keep changes minimal and scoped strictly to the reported failures; make no unrelated edits and do not close issues. Post one concise summary comment on each issue you actually acted on.
 ```

 ## Known limitations

From 430e72e4cd2c3adbab60690674e7c3b3d7d6f79d Mon Sep 17 00:00:00 2001
From: Brandon Shrewsbury
Date: Wed, 1 Jul 2026 15:07:30 -0600
Subject: [PATCH 3/6] Harden triage prompt: sync checkout first, robust PR discovery

The scheduled session is long-lived and reused, so its clone can be stale. Add an explicit sync step (git fetch + reset to origin/main; re-fetch before touching a PR branch) and stop assuming a fresh clone. Also make existing-PR discovery robust: a 'Refs #n' mention is not a formal linked-PR, so check the issue's cross-references, comments, and claude/ci-fix-* branches.
---
 .github/workflows/ci-failure-triage.md | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/.github/workflows/ci-failure-triage.md b/.github/workflows/ci-failure-triage.md
index 844d31d318..01c7f31ef5 100644
--- a/.github/workflows/ci-failure-triage.md
+++ b/.github/workflows/ci-failure-triage.md
@@ -72,6 +72,8 @@ how the repo's other `claude/*` PRs are made.

 ### Conventions the triage session follows

+- Syncs the checkout first (`git fetch` + reset to `origin/main`) — the scheduled
+  session is long-lived and reused, so its clone can be stale between runs.
 - Verifies every fact against the live run/logs/artifacts and repo files — it
   does not trust issue text or hand-fed classifications.
 - Fixes a whole class of errors exhaustively (all occurrences repo-wide), and
@@ -97,16 +99,18 @@ Keep this in sync with the scheduled trigger. When you change the process, edit
 here first, then update the trigger.
```text -Triage CI failures for viamrobotics/docs (cloned fresh at the repo root each run; use the GitHub connector's mcp__github__* tools + local git and file edits — there is no gh CLI). Act only on facts you verify yourself against the live run, logs, artifacts, and repo files; never trust issue numbers, error counts, or classifications you were handed (including in this prompt or the issue body) without confirming them. +Triage CI failures for viamrobotics/docs. The repo is cloned at the repo root, but this session may be reused across scheduled runs, so do NOT assume the checkout is current. Use the GitHub connector's mcp__github__* tools + local git and file edits — there is no gh CLI. Act only on facts you verify yourself against the live run, logs, artifacts, and repo files; never trust issue numbers, error counts, or classifications you were handed (including in this prompt or the issue body) without confirming them. + +0. SYNC FIRST. Before reading any files, refresh the checkout: `git fetch origin --prune`, then `git switch main && git reset --hard origin/main`. Always re-fetch before you touch a branch, and edit files only after syncing — otherwise you may analyze or patch a stale tree. 1. List open issues labeled "ci-failure". If none, stop. 2. For each issue, gather the FULL picture: read the issue and its comments, find its linked failing Actions run, and pull the COMPLETE error list from the run's logs and artifacts (e.g. run-htmltest uploads an htmltest-report artifact listing every broken link) — do not rely on the snippet in the issue body or a truncated log. Jobs: test-code-snippets.yml (Python/Go/TS samples vs a live machine+org), check-methods.yml (SDK coverage), run-htmltest.yml (broken links). -3. Decide act vs skip vs FIX-EXISTING-PR: - - Skip if the issue already has an open linked PR whose checks are all GREEN, OR a human/maintainer already posted a substantive root-cause triage that still holds (don't duplicate or re-post — avoid noise). 
- - If the issue has an open linked PR whose CI checks are FAILING: do NOT open a new PR. Check out that PR's branch, diagnose and fix what's failing (including the PR's own check failures, e.g. a prettier failure), and push to the same branch. Then move on. - - Otherwise, proceed to fix below. +3. Decide act vs skip vs FIX-EXISTING-PR. First determine whether a PR for this issue already exists: check the issue's timeline/cross-references AND its comments (the triage posts the PR link on the issue), and look for a `claude/ci-fix-*` branch — a "Refs #" mention does NOT create a formal linked-PR, so don't rely only on GitHub's "linked PR" field. + - Skip if that PR's checks are all GREEN, OR a human/maintainer already posted a substantive root-cause triage that still holds (don't duplicate or re-post — avoid noise). + - If that PR's CI checks are FAILING: do NOT open a new PR. `git fetch origin `, `git switch `, `git reset --hard origin/`, then diagnose and fix what's failing (including the PR's own check failures, e.g. a prettier failure) and push to the same branch. Then move on. + - Otherwise, proceed to fix below (from an up-to-date `main`). 4. Classify every error into groups; for each group decide fixable vs not: - Fixable by a minimal edit: internal broken links/anchors (retarget to the correct existing location); a sample using a wrong/renamed SDK method signature; stale API usage. From dd022b7b5116184e15119a2d2db410fc892cccf4 Mon Sep 17 00:00:00 2001 From: Brandon Shrewsbury Date: Wed, 1 Jul 2026 15:16:23 -0600 Subject: [PATCH 4/6] Prefix triage PR titles with [Claude CI Failure] Make the automated PRs easy to spot and filter; keep the prefix when updating an existing PR title too. 
---
 .github/workflows/ci-failure-triage.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/ci-failure-triage.md b/.github/workflows/ci-failure-triage.md
index 01c7f31ef5..7a7e9b7b0f 100644
--- a/.github/workflows/ci-failure-triage.md
+++ b/.github/workflows/ci-failure-triage.md
@@ -85,6 +85,8 @@ how the repo's other `claude/*` PRs are made.
   (for example, blank lines around lists), prettier wins.
 - If an issue's linked PR already exists but its **checks are failing**, fixes
   that PR's branch in place rather than opening a duplicate or skipping it.
+- Titles its PRs `[Claude CI Failure] …` so maintainers can spot and filter
+  them.
 - `Fixes #` only when the PR fully resolves what the job checks; otherwise
   `Refs #` so a partially-fixed issue is not auto-closed.
 - Commits as `Brandon Shrewsbury ` with **no**
@@ -125,7 +127,7 @@ Triage CI failures for viamrobotics/docs. The repo is cloned at the repo root, b
    - Then run the CLAUDE.md pre-PR checks IN THIS ORDER and commit exactly what they produce: (1) `npx prettier --write `, (2) `npx markdownlint-cli --config .markdownlint.yaml `, (3) vale. prettier owns formatting and is a REQUIRED status check; it can disagree with markdownlint (e.g. blank lines around lists) — when they conflict, prettier wins. Never hand-apply a markdownlint suggestion that makes prettier fail.
    - Before opening/updating the PR, confirm `npx prettier --check ` passes and markdownlint is clean on every file you touched. If a required check tool genuinely can't run in your environment, say so in the PR body and keep the diff as small as possible so you don't introduce issues that tool would catch.

-7. Open (or update) the PR via the mcp__github__ tools on branch claude/ci-fix-. In the PR body, state exactly what you fixed and what you deliberately left, with counts (e.g. "fixes 10 of 10 internal broken anchors; the remaining ~338 errors are external 403/404/429, out of scope").
Use "Fixes #" ONLY if the PR fully resolves what the job checks so it will pass next run; if it fixes only part (e.g. the internal anchors while external/transient errors remain), use "Refs #" so the issue is NOT auto-closed. Commit as Brandon Shrewsbury , with NO "Co-Authored-By: ...@anthropic.com" trailer (it breaks the CLA). Then comment the PR link + a short verified breakdown on the issue. +7. Open (or update) the PR via the mcp__github__ tools on branch claude/ci-fix-. Title it "[Claude CI Failure] " — keep that exact prefix, including when you update an existing PR's title. In the PR body, state exactly what you fixed and what you deliberately left, with counts (e.g. "fixes 10 of 10 internal broken anchors; the remaining ~338 errors are external 403/404/429, out of scope"). Use "Fixes #" ONLY if the PR fully resolves what the job checks so it will pass next run; if it fixes only part (e.g. the internal anchors while external/transient errors remain), use "Refs #" so the issue is NOT auto-closed. Commit as Brandon Shrewsbury , with NO "Co-Authored-By: ...@anthropic.com" trailer (it breaks the CLA). Then comment the PR link + a short verified breakdown on the issue. 8. If transient/uncertain, or the correct fix is a domain-owner decision: comment the verified likely cause on the issue; no PR. From 251be21e51e3e4209e1bcd5c9a72fd62a858bd19 Mon Sep 17 00:00:00 2001 From: Brandon Shrewsbury Date: Wed, 1 Jul 2026 16:22:07 -0600 Subject: [PATCH 5/6] Sharpen README routine note; add at-a-glance to triage doc --- .github/workflows/ci-failure-triage.md | 13 +++++++++++++ README.md | 2 +- 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/.github/workflows/ci-failure-triage.md b/.github/workflows/ci-failure-triage.md index 7a7e9b7b0f..2f94e697fa 100644 --- a/.github/workflows/ci-failure-triage.md +++ b/.github/workflows/ci-failure-triage.md @@ -56,6 +56,19 @@ open `ci-failure` issues. 
It runs under a maintainer's Claude account (no API ke
 or extra secret) and produces PRs on `claude/ci-fix-` branches, mirroring
 how the repo's other `claude/*` PRs are made.

+**At a glance:**
+
+- **Frequency:** daily, ~6 AM US Eastern / 10:00 UTC (fixed offset — no DST
+  adjustment).
+- **Runs as:** a repo-scoped Claude Code session under a maintainer's Claude
+  account — no API key or extra secret.
+- **Trigger:** a scheduled Claude Code (web) trigger created from a repo-scoped
+  session (see [Setup notes](#setup-notes-how-it-must-be-configured)).
+- **Input:** open issues labeled `ci-failure`.
+- **Output:** a fix PR on a `claude/ci-fix-` branch titled
+  `[Claude CI Failure] …`, or a root-cause comment on the issue when it's not
+  auto-fixable.
+
 ### Setup notes (how it must be configured)

 - **It must run in a session that is scoped to `viamrobotics/docs`** — that is,
diff --git a/README.md b/README.md
index 879470e758..1ae8d53913 100644
--- a/README.md
+++ b/README.md
@@ -136,4 +136,4 @@ GitHub Actions workflows run linting and link checks on every pull request, publ
 > [!NOTE]
 > The scheduled `test-code-snippets.yml` and `check-methods.yml` jobs run against a live Viam test organization and external SDK sites, so their results depend on that org's state and on upstream changes. See [`.github/workflows/README.md`](.github/workflows/README.md#test-org-dependency) for details.
 >
-> On failure, the scheduled jobs open a deduplicated GitHub issue labeled `ci-failure`, and a daily Claude Code session triages those issues into fix PRs or comments. See [CI failure triage](.github/workflows/ci-failure-triage.md).
+> On failure, the scheduled jobs open a deduplicated GitHub issue labeled `ci-failure`. A daily **CI-failure triage routine** (a scheduled Claude Code session) then reviews those issues and either opens a fix PR titled `[Claude CI Failure] …` or comments with the root cause.
See [CI failure triage](.github/workflows/ci-failure-triage.md) for the full flow, schedule, setup, and prompt. From 3aa1f9116099a107d578372fd4fa6f01e7fd1a5b Mon Sep 17 00:00:00 2001 From: Brandon Shrewsbury Date: Wed, 1 Jul 2026 22:28:28 +0000 Subject: [PATCH 6/6] Pin prettier to the CI version in pre-PR checks and triage prompt PR #5133 passed a local `npx prettier --check` but failed CI's prettier check: npx resolved an unpinned `prettier` to a newer local version (3.8.1) that formats blank-lines-before-nested-lists differently than the 3.2.5 that .github/workflows/prettier-lint.yml pins. Pin the pre-PR check commands to prettier@3.2.5 in both CLAUDE.md and the triage prompt, and add a step to the triage prompt to verify the pushed branch's actual CI status rather than trusting a local check alone. --- .github/workflows/ci-failure-triage.md | 18 ++++++++++++------ CLAUDE.md | 4 ++-- 2 files changed, 14 insertions(+), 8 deletions(-) diff --git a/.github/workflows/ci-failure-triage.md b/.github/workflows/ci-failure-triage.md index 2f94e697fa..57868d21de 100644 --- a/.github/workflows/ci-failure-triage.md +++ b/.github/workflows/ci-failure-triage.md @@ -92,10 +92,15 @@ how the repo's other `claude/*` PRs are made. - Fixes a whole class of errors exhaustively (all occurrences repo-wide), and confirms each retarget destination actually exists rather than guessing. - Makes the **minimal diff** — only the substantive change, no incidental - reformatting — then runs the pre-PR checks in order (`prettier --write`, then - markdownlint, then vale) and confirms `prettier --check` passes. `prettier` is - a required status check, so when it and markdownlint disagree on formatting - (for example, blank lines around lists), prettier wins. + reformatting — then runs the pre-PR checks in order (`prettier@3.2.5 --write`, + then markdownlint, then vale) and confirms `prettier@3.2.5 --check` passes + before pushing. 
`prettier` is a required status check, so when it and + markdownlint disagree on formatting (for example, blank lines around lists), + prettier wins. The version pin matters: an unpinned `npx prettier` can + resolve to a newer local version than the `3.2.5` that + [`prettier-lint.yml`](prettier-lint.yml) runs in CI, and the two + disagree on blank lines before nested lists — always run the pinned + `npx prettier@3.2.5` (matching CLAUDE.md), not a bare `npx prettier`. - If an issue's linked PR already exists but its **checks are failing**, fixes that PR's branch in place rather than opening a duplicate or skipping it. - Titles its PRs `[Claude CI Failure] …` so maintainers can spot and filter @@ -137,8 +142,9 @@ Triage CI failures for viamrobotics/docs. The repo is cloned at the repo root, b 6. MINIMAL, PROPERLY-FORMATTED DIFF (this is where a fix most often regresses): - Change ONLY what the fix requires. Do NOT reformat, re-wrap, or add/remove blank lines on lines unrelated to your fix. - - Then run the CLAUDE.md pre-PR checks IN THIS ORDER and commit exactly what they produce: (1) `npx prettier --write `, (2) `npx markdownlint-cli --config .markdownlint.yaml `, (3) vale. prettier owns formatting and is a REQUIRED status check; it can disagree with markdownlint (e.g. blank lines around lists) — when they conflict, prettier wins. Never hand-apply a markdownlint suggestion that makes prettier fail. - - Before opening/updating the PR, confirm `npx prettier --check ` passes and markdownlint is clean on every file you touched. If a required check tool genuinely can't run in your environment, say so in the PR body and keep the diff as small as possible so you don't introduce issues that tool would catch. + - Then run the CLAUDE.md pre-PR checks IN THIS ORDER and commit exactly what they produce: (1) `npx prettier@3.2.5 --write `, (2) `npx markdownlint-cli --config .markdownlint.yaml `, (3) vale. 
Pin the prettier version to `3.2.5` (check `.github/workflows/prettier-lint.yml` for the current pin) — a bare `npx prettier` can silently resolve to a newer version that formats blank-lines-before-nested-lists differently than the pinned version, producing a diff that looks clean locally but fails the required `prettier` check in CI. prettier owns formatting and is a REQUIRED status check; it can disagree with markdownlint (e.g. blank lines around lists) — when they conflict, prettier wins. Never hand-apply a markdownlint suggestion that makes prettier fail. + - Before opening/updating the PR, confirm `npx prettier@3.2.5 --check ` passes and markdownlint is clean on every file you touched. + - AFTER PUSHING, before considering the fix done: check the actual CI run status on the pushed branch/PR (not just your local check). If prettier or any other required check fails in CI despite passing locally, re-diagnose the tool-version mismatch (or other environment difference) rather than assuming the CI failure is unrelated, fix it, and push again — repeat until the PR's checks are green. If a required check tool genuinely can't run in your environment at all, say so in the PR body and keep the diff as small as possible so you don't introduce issues that tool would catch. 7. Open (or update) the PR via the mcp__github__ tools on branch claude/ci-fix-. Title it "[Claude CI Failure] " — keep that exact prefix, including when you update an existing PR's title. In the PR body, state exactly what you fixed and what you deliberately left, with counts (e.g. "fixes 10 of 10 internal broken anchors; the remaining ~338 errors are external 403/404/429, out of scope"). Use "Fixes #" ONLY if the PR fully resolves what the job checks so it will pass next run; if it fixes only part (e.g. the internal anchors while external/transient errors remain), use "Refs #" so the issue is NOT auto-closed. 
Commit as Brandon Shrewsbury , with NO "Co-Authored-By: ...@anthropic.com" trailer (it breaks the CLA). Then comment the PR link + a short verified breakdown on the issue. diff --git a/CLAUDE.md b/CLAUDE.md index acd91b403c..668beaf84f 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -7,10 +7,10 @@ Run these checks from the worktree directory before committing or pushing. All f ### 1. Format with prettier ```bash -npx prettier --write docs/[section]/**/*.md +npx prettier@3.2.5 --write docs/[section]/**/*.md ``` -Prettier auto-formats markdown. Run with `--write` to fix in place, or `--check` to see what would change without modifying files. +Prettier auto-formats markdown. Run with `--write` to fix in place, or `--check` to see what would change without modifying files. Pin the version to match CI's `.github/workflows/prettier-lint.yml` (currently `3.2.5`) — a bare `npx prettier` can resolve to a newer local version that formats blank lines before nested lists differently, producing a diff that looks clean locally but fails the required `prettier` check in CI. ### 2. Lint markdown structure