From 948710c75aee047f0772329887457fe1969864fb Mon Sep 17 00:00:00 2001
From: Brandon Shrewsbury <brandon.shrewsbury@viam.com>
Date: Fri, 26 Jun 2026 13:53:55 -0600
Subject: [PATCH] Add detailed CI repair plans for a future session

Add .github/workflows/plans/: a per-issue plan set so a future session can
execute each repair without re-deriving context. Each plan covers problem, root
cause, prerequisites, step-by-step changes with file/line references,
verification, rollback, and risks.

- 00 provision a new org, machine, and secrets (maps every hardcoded ID and
  secret to re-point; foundation for the live-API jobs)
- 01 restore failure notifications (Jira + backup alert)
- 02 fix Test Code Samples (org member, step isolation, ID re-point, flaky
  teardown tolerance)
- 03 fix Alias reminder (set-output -> GITHUB_OUTPUT)
- 04 fix PR Test Label Manager (confirm safe-to-build consumer, trigger/cond)
- 05 fix SDK method coverage (signal, concurrency, scraper hardening)
- 06 cleanup and modernization (action bumps, dead steps, naming)

Index README orders the plans and lists the secrets inventory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../plans/00-provision-org-machine-secrets.md | 143 ++++++++++++++++++
 .../plans/01-restore-notifications.md         |  65 ++++++++
 .../plans/02-fix-test-code-snippets.md        | 125 +++++++++++++++
 .../workflows/plans/03-fix-alias-reminder.md  |  66 ++++++++
 .github/workflows/plans/04-fix-pr-labeler.md  |  63 ++++++++
 .../workflows/plans/05-fix-check-methods.md   |  68 +++++++++
 .../plans/06-cleanup-and-modernization.md     |  83 ++++++++++
 .github/workflows/plans/README.md             |  61 ++++++++
 8 files changed, 674 insertions(+)
 create mode 100644 .github/workflows/plans/00-provision-org-machine-secrets.md
 create mode 100644 .github/workflows/plans/01-restore-notifications.md
 create mode 100644 .github/workflows/plans/02-fix-test-code-snippets.md
 create mode 100644 .github/workflows/plans/03-fix-alias-reminder.md
 create mode 100644 .github/workflows/plans/04-fix-pr-labeler.md
 create mode 100644 .github/workflows/plans/05-fix-check-methods.md
 create mode 100644 .github/workflows/plans/06-cleanup-and-modernization.md
 create mode 100644 .github/workflows/plans/README.md
diff --git a/.github/workflows/plans/00-provision-org-machine-secrets.md b/.github/workflows/plans/00-provision-org-machine-secrets.md
new file mode 100644
index 0000000000..2862207aaf
--- /dev/null
+++ b/.github/workflows/plans/00-provision-org-machine-secrets.md
@@ -0,0 +1,143 @@
+# Plan 00: Provision a new org, machine, and secrets
+
+**Status:** NOT STARTED
+**Fixes:** the live-API foundation for `test-code-snippets.yml` and the
+`docs.yml` search-index jobs.
+**Depended on by:** [Plan 02](02-fix-test-code-snippets.md).
+
+## Problem
+
+The code-sample tests and the search-index sync run against a shared Viam test
+organization whose identity lives in repository secrets and whose resource IDs
+are hardcoded into both the workflow and the sample files. The current org has
+drifted (no roleless member, possibly stale machine and keys), and the secret
+values are not recoverable. The cleanest reset is a fresh org, location,
+machine, and key set, with every secret and hardcoded ID re-pointed in one
+coordinated change.
+
+## What the jobs actually need from the org
+
+Collected from `test-code-snippets.yml` and the sample files:
+
+1. An **organization** whose ID becomes `TEST_ORG_ID`.
+2. An **org owner API key** (key + key ID) for `VIAM_API_KEY` /
+   `VIAM_API_KEY_ID`. The samples authenticate with this and expect owner-level
+   access (create/delete locations, keys, datasets, data pipelines, roles).
+3. A **location** that replaces the hardcoded `pg5q3j3h95`. Samples reference it
+   directly and through the machine address `auto-machine-main.<loc>.viam.cloud`.
+4. A **machine** with a **main part**, used two ways:
+   - The workflow fetches its config to run a local `viam-server`
+     (`config?id=<machine-part-id>`), authenticated with a **machine part API
+     key** (`key_id` + `TEST_MACHINE_KEY`).
+   - Samples connect to it at the machine address above.
+5. A **roleless member** in the org so the roles lifecycle sample can grant a
+   fresh role (see [Plan 02](02-fix-test-code-snippets.md)).
+6. A **user to invite** whose email becomes `TEST_EMAIL` (must not already be a
+   member; the sample invites then deletes the invite).
+7. A **second organization** (`ORG_ID_2`) used only by the share/unshare
+   location sample.
+8. A **second org + key for data regions** (`VIAM_API_KEY_DATA_REGIONS` /
+   `VIAM_API_KEY_ID_DATA_REGIONS`) used by the data-regions sample.
+
+## Hardcoded IDs to replace
+
+These are baked into the workflow and sample files and must be updated to the
+new resources. Counts are approximate occurrence counts across the repo.
+
+| Current value | Meaning | Where | New value |
+| --- | --- | --- | --- |
+| `pg5q3j3h95` | Location ID | samples run by `test-code-snippets.yml` (under `static/include/examples/`); ~11 direct + ~25 inside the machine address | new location ID |
+| `deb8782c-7b48-4d35-812d-2caa94b61f77` | Machine **part** ID | workflow config fetch (`id=`); `MACHINE_PART_ID`/`PART_ID` in samples (~12) | new part ID |
+| `824b6570-7b1d-4622-a19d-37c472dba467` | `VIAM_PART_ID` / `PART_ID` | workflow env line 49; samples (~8) | new part ID (confirm which part) |
+| `1030f25a-f4f2-4872-9762-e33fa1e0444d` | Machine part **key ID** | workflow config-fetch header | new machine key ID |
+| `5ec7266e-f762-4ea8-9c29-4dd592718b48` | Machine ID | samples (~3) | new machine ID |
+| `b5e9f350-cbcf-4d2a-bbb1-a2e2fd6851e1` | `ORG_ID_2` (share target) | samples (~18) | new second org ID |
+| `16b8a3e5-7944-4e1c-8ccd-935c1ba3be59` | resource ID (confirm: dataset/fragment) | samples (~6) | new ID |
+
+> [!NOTE]
+> There appear to be **two distinct part IDs** (`deb8782c...` and `824b6570...`).
+> Confirm whether the machine has two parts, or whether one of these is stale,
+> before re-pointing. The config fetch uses `deb8782c...`; the workflow env var
+> `VIAM_PART_ID` uses `824b6570...`.
+
+## Tooling
+
+A future session can use the available skills:
+
+- `local-viam-server`: create a machine and run a local `viam-server` (handles
+  the machine part secret the CLI does not expose).
+- `viam-modules-fleet`: `viam` CLI for orgs, locations, machines, API keys.
+- `viam-machine-config`: push a machine config if the machine needs components.
+
+The machine only needs to start and stay connected; the samples do not require
+specific components on it unless a sample connects and calls a component API.
+Audit the samples that use `MACHINE_ADDRESS` to confirm what the machine must
+expose, and configure those components (see [Plan 02](02-fix-test-code-snippets.md)).
+
+## Plan
+
+1. **Create the primary test org.** Record its ID for `TEST_ORG_ID`. Give it a
+   recognizable name; note that the orgs sample resets the name to
+   `docs-scheduled-tests` on each run, so expect that name to reappear.
+2. **Create a location** in that org. Record its ID (replaces `pg5q3j3h95`).
+3. **Create a machine** in that location with a main part. Record the machine
+   ID, the part ID, and confirm the machine address
+   (`auto-machine-main.<location-id>.viam.cloud` or the actual address shown in
+   the app).
+4. **Create a machine part API key.** Record the key ID and key value for the
+   workflow config fetch (`key_id` header and `TEST_MACHINE_KEY`).
+5. **Create an org owner API key.** Record the key and key ID for `VIAM_API_KEY`
+   / `VIAM_API_KEY_ID`.
+6. **Add a roleless member** to the org (invite a service/test user and remove
+   any default role, or confirm a member with no authorizations). This unblocks
+   the roles lifecycle sample.
+7. **Choose a `TEST_EMAIL`** for a user who is not a member of the org (the
+   sample invites and then revokes this address).
+8. **Create the second org** (`ORG_ID_2`) for the share/unshare sample. A bare
+   org with no resources is enough.
+9. **Create the data-regions org and key** for `VIAM_API_KEY_DATA_REGIONS` /
+   `VIAM_API_KEY_ID_DATA_REGIONS`. Confirm what the data-regions sample requires
+   (it changes an org region and tolerates the "region cannot be changed" error).
+10. **Configure the machine** so every sample that connects to `MACHINE_ADDRESS`
+    finds the components it calls (audit list in Plan 02). Start a
+    `viam-server` once to confirm the config is valid and the machine reports
+    online.
+11. **Update the workflow** `test-code-snippets.yml`: replace the hardcoded
+    `key_id` and config-fetch `id` (around line 29) and `VIAM_PART_ID`
+    (line ~49) with the new machine values. Prefer moving these into secrets or
+    workflow `env` referencing secrets rather than re-hardcoding.
+12. **Update the sample files** with the new location/machine/org IDs (see the
+    table above). Use a scripted find-and-replace, then grep to confirm no old
+    IDs remain.
+13. **Update the GitHub repository secrets** (Settings -> Secrets and variables
+    -> Actions): `TEST_ORG_ID`, `VIAM_API_KEY`, `VIAM_API_KEY_ID`,
+    `TEST_MACHINE_KEY`, `TEST_EMAIL`, `VIAM_API_KEY_DATA_REGIONS`,
+    `VIAM_API_KEY_ID_DATA_REGIONS`.
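+
+The scripted find-and-replace in step 12 could look like the sketch below. It
+is a minimal example, not existing repository tooling: the right-hand values in
+`ID_MAP` are placeholders for the real new IDs, only two of the old IDs are
+listed, and `repoint_ids` and `find_stale` are hypothetical helper names.
+
+```python
+from pathlib import Path
+
+# Old ID -> new ID. The right-hand values are placeholders (assumptions);
+# extend the map with the remaining rows of the table above.
+ID_MAP = {
+    "pg5q3j3h95": "NEW_LOCATION_ID",
+    "deb8782c-7b48-4d35-812d-2caa94b61f77": "NEW_PART_ID",
+}
+
+
+def _text_files(root):
+    """Yield (path, text) for each UTF-8 text file under root."""
+    for path in sorted(p for p in Path(root).rglob("*") if p.is_file()):
+        try:
+            yield path, path.read_text(encoding="utf-8")
+        except UnicodeDecodeError:
+            continue  # skip binary files
+
+
+def repoint_ids(root, id_map):
+    """Replace every old ID under root; return the files that changed."""
+    changed = []
+    for path, text in _text_files(root):
+        new_text = text
+        for old, new in id_map.items():
+            new_text = new_text.replace(old, new)
+        if new_text != text:
+            path.write_text(new_text, encoding="utf-8")
+            changed.append(path)
+    return changed
+
+
+def find_stale(root, id_map):
+    """Return files that still contain any old ID (expected to be empty)."""
+    return [
+        path
+        for path, text in _text_files(root)
+        if any(old in text for old in id_map)
+    ]
+```
+
+Calling `repoint_ids("static/include/examples", ID_MAP)` and then checking that
+`find_stale` returns an empty list mirrors the grep check in the Verification
+section.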
+
+## Verification
+
+- Run the workflow through `workflow_dispatch` on the branch and confirm the
+  "Start viam-server in background" step fetches the config and the server
+  reports running.
+- Confirm `grep -rE 'pg5q3j3h95|deb8782c|824b6570|1030f25a|5ec7266e|b5e9f350|16b8a3e5' static/include/examples .github/workflows/test-code-snippets.yml`
+  returns nothing.
+- Proceed to [Plan 02](02-fix-test-code-snippets.md) to make the samples pass.
+
+## Rollback
+
+Keep the old secret values recorded until the new org is confirmed working. If
+the new org misbehaves, restore the previous secret values and revert the ID
+changes in one commit. The old org can stay untouched until the new one is
+green.
+
+## Risks and open questions
+
+- **Two part IDs**: confirm the machine's part topology before re-pointing.
+- **Component requirements**: some samples may call specific component APIs on
+  the machine; the machine config must provide them or those samples fail at
+  connect/call time even with a healthy server.
+- **Data-regions org**: regions can only be set once per org, so the
+  data-regions org may be a single-use resource; confirm the sample tolerates a
+  pre-set region (it currently logs and continues on that error).
+- **`ORG_ID_2` permissions**: the owner key for the primary org must be allowed
+  to share a location into `ORG_ID_2`; confirm cross-org sharing is permitted.
diff --git a/.github/workflows/plans/01-restore-notifications.md b/.github/workflows/plans/01-restore-notifications.md
new file mode 100644
index 0000000000..ed8c1d46db
--- /dev/null
+++ b/.github/workflows/plans/01-restore-notifications.md
@@ -0,0 +1,65 @@
+# Plan 01: Restore failure notifications
+
+**Status:** NOT STARTED
+**Fixes:** the Jira notification steps in `test-code-snippets.yml`,
+`check-methods.yml`, and `run-htmltest.yml`.
+**Depended on by:** [Plan 05](05-fix-check-methods.md) (its only signal is Jira).
+
+## Problem
+
+Three scheduled jobs report failures only by opening a Jira ticket, and the
+Jira steps are themselves failing (the `Login to Jira` / `Create Jira ticket`
+steps show as failed in recent runs). The result: these jobs run unmonitored,
+so even a real regression is silent. This is the highest-impact repair because
+it is the prerequisite for trusting every other scheduled job.
+
+## Root cause (to confirm)
+
+The `atlassian/gajira-login@v3` step authenticates with `JIRA_BASE_URL`,
+`JIRA_USER_EMAIL`, and `JIRA_API_TOKEN`. The most likely causes, in order:
+
+1. An expired or rotated `JIRA_API_TOKEN`.
+2. A changed `JIRA_BASE_URL` or account email.
+3. The `DOCS` project no longer accepting the issue types used
+   (`Bug` for the htmltest and code-sample jobs, `Task` for check-methods), or
+   a required field now being enforced on create.
+
+## Plan
+
+1. **Reproduce** by triggering one scheduled workflow through
+   `workflow_dispatch` and reading the `Login to Jira` step log for the exact
+   error (auth failure, project not found, required field, and so on).
+2. **Refresh credentials**: regenerate the Jira API token, confirm the account
+   email and base URL, and update the three secrets.
+3. **Confirm the create payload**: verify the `DOCS` project key, the issue
+   types (`Bug` / `Task`), and any newly required fields. Update the
+   `gajira-create` step inputs if the project schema changed.
+4. **Add a backup notification path** so a single broken integration cannot
+   silence the jobs again. Options, cheapest first:
+   - A Slack incoming-webhook step on `failure()` posting the run URL.
+   - A GitHub issue created on failure with a dedup label.
+   - Email through an action on `failure()`.
+5. **Make notification failures visible**: keep `if: failure()` on the Jira
+   steps (already set today), add it to the backup step, and confirm a failed
+   notification step shows in the run status rather than passing silently.
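+
+As a rough illustration of the Slack option in step 4, the sketch below builds
+the run URL from GitHub's default environment variables and posts it to an
+incoming webhook. It is an assumption-laden example, not a chosen design: the
+function names are hypothetical, and the webhook URL would come from a secret
+that does not exist yet.
+
+```python
+import json
+import urllib.request
+
+
+def run_url(env):
+    """Build the URL of the current Actions run from default GitHub env vars."""
+    server = env["GITHUB_SERVER_URL"]
+    repo = env["GITHUB_REPOSITORY"]
+    run_id = env["GITHUB_RUN_ID"]
+    return f"{server}/{repo}/actions/runs/{run_id}"
+
+
+def build_payload(env):
+    """Return the JSON body for a Slack incoming webhook as bytes."""
+    workflow = env.get("GITHUB_WORKFLOW", "unknown workflow")
+    text = f"{workflow} failed: {run_url(env)}"
+    return json.dumps({"text": text}).encode("utf-8")
+
+
+def notify(webhook_url, env):
+    """POST the failure message; raises on HTTP errors so the step fails."""
+    request = urllib.request.Request(
+        webhook_url,
+        data=build_payload(env),
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+    with urllib.request.urlopen(request, timeout=10) as response:
+        return response.status
+```
+
+A workflow step guarded by `if: failure()` could call
+`notify(webhook_url, dict(os.environ))`. Because `urlopen` raises on an HTTP
+error, a broken webhook fails the step instead of passing silently, which is the
+visibility step 5 asks for.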
+
+## Verification
+
+- Trigger each of the three scheduled jobs through `workflow_dispatch` while
+  forcing a failure (or run against a known-failing state) and confirm a ticket
+  is created and the backup alert fires.
+- Confirm the created ticket lands in the expected `DOCS` project with the
+  correct issue type and a useful title and body (include the run URL).
+
+## Rollback
+
+The notification steps are isolated from the test logic, so reverting the
+`gajira-*` step changes or removing the backup step does not affect what the
+jobs test. Keep the previous step config in the PR description for quick revert.
+
+## Risks and open questions
+
+- Is the `DOCS` Jira project still the right destination, and who owns the
+  service account behind `JIRA_USER_EMAIL`?
+- If Jira is being retired for this purpose, replace it with the backup path as
+  the primary, rather than fixing the login.
diff --git a/.github/workflows/plans/02-fix-test-code-snippets.md b/.github/workflows/plans/02-fix-test-code-snippets.md
new file mode 100644
index 0000000000..90aafbbf16
--- /dev/null
+++ b/.github/workflows/plans/02-fix-test-code-snippets.md
@@ -0,0 +1,125 @@
+# Plan 02: Fix Test Code Samples
+
+**Status:** NOT STARTED
+**Fixes:** `test-code-snippets.yml` (the _Test Code Samples_ job).
+**Depends on:** [Plan 00](00-provision-org-machine-secrets.md) (org, machine,
+secrets) and benefits from [Plan 01](01-restore-notifications.md).
+
+## Problem
+
+The job runs every Python, Go, and TypeScript sample under
+`static/include/examples/` against a live `viam-server` and a live Viam org. It
+has failed on every run since 2025-11-10. The Python step exits non-zero on the
+first failing file, which aborts the Go and TypeScript steps, so a single bad
+sample hides the rest of the suite.
+
+## Root causes
+
+1. **Org data dependency**: `fleet-api/fleet-management-api-orgs.py` grants a
+   location-owner role to `member_list[-1]`, who is now an org owner and already
+   inherits that role, so the grant is rejected. PR #5106 changes the sample to
+   pick a member with no existing authorizations, but the org must actually
+   contain such a member (Plan 00, step 6).
+2. **No step isolation**: the three language steps run sequentially in one job;
+   the Python step's non-zero exit skips Go and TypeScript.
+3. **Stale hardcoded IDs**: the server-start step and many samples hardcode the
+   machine, location, and org IDs (Plan 00).
+4. **Flaky live calls**: some samples fail intermittently on transient
+   `INTERNAL` backend errors (observed in the `data-pipelines` teardown
+   `delete_data_pipeline` calls), unrelated to docs changes.
+
+## Plan
+
+### A. Land the org-data prerequisites
+
+1. Confirm [Plan 00](00-provision-org-machine-secrets.md) is complete: new org,
+   machine, secrets, and a **roleless member**.
+2. Merge or rebase PR #5106 so the orgs sample selects a roleless member and
+   asserts clearly if none exists.
+
+### B. Isolate the language steps
+
+Today a Python failure blocks Go and TypeScript. Pick one:
+
+- **Preferred**: split into three jobs (`test-python`, `test-go`, `test-ts`),
+  each setting up its own runtime and running its own samples. Share the
+  viam-server startup through a reusable setup (composite action or a setup
+  job + artifact), or start the server in each job. This gives independent
+  pass/fail per language and parallel execution.
+- **Lighter**: keep one job but make each language loop record failures and
+  continue, then fail the job once at the end with a combined summary, so all
+  three languages always run.
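+
+The lighter option could be sketched as a small runner like the one below. This
+is a minimal example under stated assumptions, not the workflow's current
+logic: the `RUNNERS` commands (in particular `npx tsx` for TypeScript) are
+guesses at how each language is invoked, and `run_samples` and `main` are
+hypothetical names.
+
+```python
+import subprocess
+import sys
+from pathlib import Path
+
+# Assumed mapping from file extension to the command that runs one sample.
+RUNNERS = {
+    ".py": [sys.executable],
+    ".go": ["go", "run"],
+    ".ts": ["npx", "tsx"],
+}
+
+
+def run_samples(files, runners=RUNNERS):
+    """Run every sample; return {path: exit_code} for the ones that failed."""
+    failures = {}
+    for path in files:
+        command = runners.get(Path(path).suffix)
+        if command is None:
+            continue  # not a sample type we know how to run
+        result = subprocess.run([*command, str(path)])
+        if result.returncode != 0:
+            # Record the failure and keep going instead of aborting.
+            failures[str(path)] = result.returncode
+    return failures
+
+
+def main(root="static/include/examples"):
+    """Run all samples under root, print a summary, return the exit code."""
+    files = sorted(p for p in Path(root).rglob("*") if p.suffix in RUNNERS)
+    failures = run_samples(files)
+    for name, code in failures.items():
+        print(f"FAILED ({code}): {name}")
+    print(f"{len(files) - len(failures)} passed, {len(failures)} failed")
+    return 1 if failures else 0
+```
+
+Ending the script with `sys.exit(main())` fails the job once, after every
+sample has run, and the printed summary gives the per-file pass/fail view that
+the Verification section expects.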
+
+### C. Re-point hardcoded IDs
+
+Apply the ID replacements from [Plan 00](00-provision-org-machine-secrets.md).
+Prefer reading machine identifiers from secrets in the workflow rather than
+hardcoding. Confirm with:
+
+```bash
+grep -rE 'pg5q3j3h95|deb8782c|824b6570|1030f25a|5ec7266e|b5e9f350|16b8a3e5' \
+  static/include/examples .github/workflows/test-code-snippets.yml
+```
+
+### D. Make teardown resilient to flaky backends
+
+For samples whose **teardown** (not the demonstrated behavior) makes a live call
+that can return a transient `INTERNAL` error, wrap the teardown so a flaky
+response does not fail the run. Confirmed cases:
+
+- `data-pipelines/pipeline-create.py` (`delete_data_pipeline` teardown).
+- `data-pipelines/pipeline-list.py` (`delete_data_pipeline` teardown).
+
+Pattern (teardown only, never the asserted behavior):
+
+```python
+try:
+    await data_client.delete_data_pipeline(pipeline_id)
+except GRPCError as e:  # assumes: from grpclib import GRPCError
+    print(f"teardown delete failed (ignored): {e}")
+```
+
+Do not blanket-wrap asserted calls; only teardown/cleanup that is incidental to
+what the sample demonstrates.
+
+### E. Audit machine-dependent samples
+
+List the samples that connect to the machine and confirm the new machine
+exposes the components they call, so they do not fail at connect or call time:
+
+```bash
+grep -rln 'MACHINE_ADDRESS\|auto-machine-main' static/include/examples
+```
+
+For each, note the component or service APIs it calls and ensure the Plan 00
+machine config provides them.
+
+### F. Quieten known anti-patterns (optional, low risk)
+
+- Remove `pip install asyncio` (line ~61): `asyncio` is stdlib.
+- Move Python off 3.9 (end of life since October 2025) and Node off the non-LTS 23.
+
+## Verification
+
+1. Trigger the workflow through `workflow_dispatch` on the branch.
+2. Confirm all three language steps run (not just Python) and report a per-file
+   pass/fail summary.
+3. Confirm a fully green run end to end.
+4. Re-run once to confirm previously flaky samples are now stable.
+
+## Rollback
+
+Each change is independent: the step-split, the ID re-point, and the teardown
+guards can be reverted individually. Keep the single-job version in git history
+so it can be restored if the split causes runner or secret-scoping issues.
+
+## Risks and open questions
+
+- Splitting jobs multiplies viam-server startups (and secret usage); confirm the
+  machine tolerates concurrent connections, or gate the language jobs to run
+  sequentially with `needs`.
+- The "stable" viam-server AppImage is re-pulled each run, so a server release
+  can change behavior with no repo change; consider pinning a known-good server
+  version for reproducibility.
+- Some failures may be genuine sample bugs surfaced once the suite runs fully;
+  budget time to triage newly visible Go and TypeScript failures.
diff --git a/.github/workflows/plans/03-fix-alias-reminder.md b/.github/workflows/plans/03-fix-alias-reminder.md
new file mode 100644
index 0000000000..4464fc9681
--- /dev/null
+++ b/.github/workflows/plans/03-fix-alias-reminder.md
@@ -0,0 +1,66 @@
+# Plan 03: Fix Alias reminder
+
+**Status:** NOT STARTED
+**Fixes:** `alias-reminder.yml` (the _Alias reminder_ job).
+**Depends on:** nothing.
+
+## Problem
+
+On pull requests that rename or move `.md` files, this workflow should post a
+sticky comment reminding the author to add redirect aliases. It currently never
+posts, so authors get no reminder and broken redirects slip through.
+
+## Root cause
+
+The detection step emits its result with the deprecated `::set-output` workflow
+command (`Write-Host "::set-output name=...::"`). GitHub deprecated
+`::set-output` in 2022; if the runner no longer honors it (confirm in a run
+log), `steps.check_files_moved.outputs.*` is never populated, the gate on the
+comment step (`== 'True'`) never evaluates true, and the comment step is dead.
+
+## Plan
+
+1. **Migrate to `$GITHUB_OUTPUT`.** Replace the `::set-output` writes in the
+   PowerShell detection step with appends to the `GITHUB_OUTPUT` file. In `pwsh`:
+
+   ```powershell
+   "files_moved=$filesMoved" | Out-File -FilePath $env:GITHUB_OUTPUT -Append
+   "moved_list=$movedList"   | Out-File -FilePath $env:GITHUB_OUTPUT -Append
+   ```
+
+   Keep the existing output names so the downstream `if:` does not need changing
+   beyond confirming the comparison value (`'True'` vs `'true'`: normalize the
+   value the step writes and the condition it is compared against).
+2. **Handle multi-line output safely.** If `moved_list` can contain multiple
+   filenames or newlines, use the heredoc form for `GITHUB_OUTPUT` so the value
+   is not truncated:
+
+   ```powershell
+   "moved_list<<EOF"  | Out-File $env:GITHUB_OUTPUT -Append
+   $movedList         | Out-File $env:GITHUB_OUTPUT -Append
+   "EOF"              | Out-File $env:GITHUB_OUTPUT -Append
+   ```
+
+3. **Confirm the gate** on the sticky-comment step references the corrected
+   output and boolean value.
+
+## Verification
+
+1. Open a test PR that renames a `.md` file under `docs/`.
+2. Confirm the workflow runs (note: it triggers on `labeled` and `synchronize`,
+   not `opened`, so push a change or apply a label after opening).
+3. Confirm the sticky comment appears and lists the moved file.
+4. Open a PR with no moved files and confirm no comment is posted.
+
+## Rollback
+
+Single-workflow change with no side effects beyond a PR comment; revert the
+commit to restore the previous (non-posting) behavior.
+
+## Risks and open questions
+
+- The workflow uses `pull_request_target` and checks out the PR head SHA; keep
+  the step limited to `git diff` and grep (no execution of PR code) to avoid the
+  known elevated-token risk of that trigger.
+- Consider adding `opened` to the trigger types so the reminder also fires when
+  a PR is first opened with moved files, not only after a label or new push.
diff --git a/.github/workflows/plans/04-fix-pr-labeler.md b/.github/workflows/plans/04-fix-pr-labeler.md
new file mode 100644
index 0000000000..c211ec837a
--- /dev/null
+++ b/.github/workflows/plans/04-fix-pr-labeler.md
@@ -0,0 +1,63 @@
+# Plan 04: Fix PR Test Label Manager
+
+**Status:** NOT STARTED
+**Fixes:** `pr-labeler.yml` (the _PR Test Label Manager_ job).
+**Depends on:** nothing (but step 1 is a discovery task).
+
+## Problem
+
+When a PR opens, the workflow adds a `safe to build` label if the author is a
+`viamrobotics` org member, otherwise it posts a welcome comment. Three issues:
+
+1. **No in-repo consumer of `safe to build`.** Nothing in `.github/` reads the
+   label. It is probably consumed by Netlify (deploy-preview gating for
+   untrusted PRs), but that is unconfirmed. If nothing consumes it, the whole
+   workflow is dead weight.
+2. **Trigger/condition mismatch.** The workflow subscribes only to
+   `types: [opened]`, but the job `if:` checks for
+   `opened`/`synchronize`/`reopened`. The `synchronize` and `reopened` branches
+   can never fire.
+3. **Fragile error handling.** Any `checkMembershipForUser` error (rate limit,
+   transient API failure), not just non-membership, falls into the `catch` and
+   posts the contributor comment. A leftover `console.log("here")` remains.
+
+## Plan
+
+1. **Confirm the consumer first.** Check the Netlify site settings for this repo
+   for a build condition or "deploy only when label present" rule referencing
+   `safe to build`. Also check branch protection and any external automation.
+   - If a consumer exists: document it in the [workflows reference](../README.md)
+     and proceed to fix the workflow.
+   - If no consumer exists: propose removing the workflow (record the decision
+     in the reference) and stop here.
+2. **Resolve the trigger/condition mismatch.** Decide the intended behavior:
+   - If labeling should happen only on open: simplify the job `if:` to
+     `github.event.action == 'opened'`.
+   - If it should also re-check on updates: add `synchronize` and `reopened` to
+     the `on: pull_request_target` types.
+3. **Tighten error handling.** Distinguish "not a member" (the expected 404 path)
+   from other API errors. Only post the contributor comment on a genuine
+   non-membership result; let unexpected errors fail the step (or log and skip)
+   rather than mislabeling a member as an outside contributor.
+4. **Remove the leftover `console.log("here")` debug line.**
+
+## Verification
+
+1. Open a test PR as an org member and confirm the `safe to build` label is
+   added and no welcome comment is posted.
+2. Open or simulate a PR from a non-member and confirm the welcome comment is
+   posted and the label is not added.
+3. If a consumer was confirmed (step 1), verify the downstream build behaves as
+   expected with and without the label.
+
+## Rollback
+
+Single-workflow change; revert the commit. The label itself is idempotent, so
+reverting does not strand PRs.
+
+## Risks and open questions
+
+- `pull_request_target` runs with elevated permissions and `PR_TOKEN`; the
+  script must not check out or execute PR code (it currently does not).
+- The membership check requires the token to have org read scope; confirm
+  `PR_TOKEN` still has it.
diff --git a/.github/workflows/plans/05-fix-check-methods.md b/.github/workflows/plans/05-fix-check-methods.md
new file mode 100644
index 0000000000..ccf44a1fae
--- /dev/null
+++ b/.github/workflows/plans/05-fix-check-methods.md
@@ -0,0 +1,68 @@
+# Plan 05: Fix SDK method coverage
+
+**Status:** NOT STARTED
+**Fixes:** `check-methods.yml` (the _SDK method coverage_ job).
+**Depends on:** [Plan 01](01-restore-notifications.md) (Jira is its only signal).
+
+## Problem
+
+The job detects when the Viam SDKs gain or remove API methods that the docs'
+generated reference has not accounted for. Four issues blunt its value:
+
+1. **Job-level `continue-on-error: true`** means the coverage check never fails
+   the run; its only signal is a Jira ticket, and Jira is currently broken
+   (see Plan 01). Regressions are effectively silent.
+2. **`concurrency.group` keys on `github.event.number`**, which is null for
+   `schedule` and `workflow_dispatch`, so unrelated runs share one group with
+   `cancel-in-progress: true` and can cancel each other.
+3. **Fragile external scraping.** The check parses the HTML of four SDK doc
+   sites (`python.viam.dev`, `pkg.go.dev`, `ts.viam.dev`, `flutter.viam.dev`)
+   and the upstream gRPC protos; an upstream layout change breaks parsing and
+   opens spurious tickets.
+4. **Stale cron comment.** The comment says "weekdays" but the schedule
+   `0 10 * * 3` is Wednesday only.
+
+## Plan
+
+1. **Restore the signal first** (Plan 01). Coverage results are worthless if no
+   one is told.
+2. **Decide on blocking.** Once Jira (or a backup alert) is reliable, decide
+   whether a coverage gap should fail the run. If yes, remove
+   `continue-on-error: true` so the run status reflects reality. If the check is
+   too noisy to block, keep it non-blocking but ensure the alert is dependable.
+3. **Fix the concurrency group.** Key the group on a value that is always set,
+   for example `group: check-methods-${{ github.event.number || github.run_id }}`,
+   or drop `cancel-in-progress`, so a manual run and a scheduled run do not
+   cancel each other.
+4. **Correct the cron comment** to "Wednesdays 10:00 UTC".
+5. **Harden the scrapers (optional, larger).** The parsers
+   (`update_sdk_methods.py` plus `parse_*.py`) depend on exact external HTML.
+   Options: add a clear "parsing failed" error distinct from "coverage gap" so
+   an upstream layout change does not masquerade as a docs regression; or move
+   toward an API/introspection-based source where available instead of HTML
+   scraping.
+6. **Trim inline deps.** The workflow installs `beautifulsoup4 markdownify
+   argparse`; `argparse` is stdlib and can be dropped.
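+
+The "parsing failed" versus "coverage gap" split in step 5 could be sketched as
+below. This is a toy illustration, not the logic of `update_sdk_methods.py` or
+the `parse_*.py` files: the line-based `parse_methods`, the `marker` argument,
+and the exit-code values are all assumptions made for the example.
+
+```python
+import sys
+
+# Hypothetical exit codes so the workflow can tell the two failures apart.
+EXIT_OK = 0
+EXIT_COVERAGE_GAP = 1
+EXIT_PARSE_FAILED = 2
+
+
+class ParseError(Exception):
+    """The upstream page did not have the structure the parser expects."""
+
+
+def parse_methods(page, marker):
+    """Toy parser: collect the name that follows `marker` on each line.
+
+    Finding no entries at all is treated as a layout change upstream, not as
+    an SDK with zero methods, so it raises ParseError.
+    """
+    names = set()
+    for line in page.splitlines():
+        if marker in line:
+            names.add(line.split(marker, 1)[1].strip())
+    if not names:
+        raise ParseError(f"no {marker!r} entries found")
+    return names
+
+
+def check(page, marker, documented):
+    """Return an exit code that separates parse failures from real gaps."""
+    try:
+        sdk_methods = parse_methods(page, marker)
+    except ParseError as err:
+        print(f"parsing failed: {err}", file=sys.stderr)
+        return EXIT_PARSE_FAILED
+    missing = sorted(sdk_methods - set(documented))
+    if missing:
+        print(f"coverage gap: {missing}", file=sys.stderr)
+        return EXIT_COVERAGE_GAP
+    return EXIT_OK
+```
+
+With distinct exit codes, the notification step can title a ticket "scraper
+broke" or "docs missing methods" instead of opening the same ticket for both.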
+
+## Verification
+
+1. Trigger through `workflow_dispatch` and confirm the coverage check runs and,
+   on a seeded gap, both fails appropriately (if made blocking) and notifies.
+2. Trigger a manual run while a scheduled run is in flight (or simulate) and
+   confirm they no longer cancel each other.
+3. Confirm the run distinguishes a scraping/parse failure from a real coverage
+   gap.
+
+## Rollback
+
+The `continue-on-error` and concurrency changes are one-line reverts. The
+scraper hardening, if attempted, should be a separate commit so it can be
+reverted without losing the smaller fixes.
+
+## Risks and open questions
+
+- Making the check blocking will surface real, possibly large, coverage gaps
+  that have accumulated while it was silent; budget triage time before flipping
+  `continue-on-error`.
+- The scrapers carry hand-maintained ignore-lists and resource-name remaps that
+  drift as SDKs evolve; expect maintenance regardless of hardening.
diff --git a/.github/workflows/plans/06-cleanup-and-modernization.md b/.github/workflows/plans/06-cleanup-and-modernization.md
new file mode 100644
index 0000000000..08fece0636
--- /dev/null
+++ b/.github/workflows/plans/06-cleanup-and-modernization.md
@@ -0,0 +1,83 @@
+# Plan 06: Cleanup and modernization
+
+**Status:** NOT STARTED
+**Fixes:** stale and dead configuration across all workflows.
+**Depends on:** nothing; safe to batch independently.
+
+## Problem
+
+Several non-functional issues accumulate maintenance debt and confuse readers:
+old action versions, moving-branch pins, dead setup steps, a commented-out job,
+and misleading names. None blocks today, but they make the workflows fragile and
+hard to trust.
+
+## Plan
+
+Each item is independent; land them as small, reviewable commits.
+
+### Action version bumps
+
+- [ ] `actions/checkout`: bump `v2` (`markdown-lint.yml`, `python-lint.yml`) and
+      `v3` (most others) to the current major.
+- [ ] `actions/setup-python@v4` to current.
+- [ ] `actions/github-script@v6` (`pr-labeler.yml`) to current.
+- [ ] `actions/configure-pages@v2` (`docs.yml`) to current.
+
+### Pin moving references
+
+- [ ] `wjdp/htmltest-action@master` in `run-htmltest.yml` and
+      `run-htmltest-local.yml`: pin to a released tag or commit SHA for
+      reproducibility and supply-chain safety.
+- [ ] `errata-ai/vale-action@reviewdog` (`vale-lint.yml`): pin to a version or
+      SHA instead of a branch.
+
+### Remove dead steps and jobs
+
+- [ ] Delete the unused Python 3.8 venv steps in `vale-lint.yml` (Vale is a Go
+      binary; the venv is never used) and in `python-lint.yml` (the
+      `source env/bin/activate` runs in its own shell and has no effect).
+- [ ] Move `python-lint.yml` off end-of-life Python 3.8.
+- [ ] Delete the commented-out `deploy` job in `docs.yml` (Netlify handles
+      deployment; the Pages permissions and Setup Pages step are then unused).
+- [ ] Remove `pip install asyncio` from `test-code-snippets.yml` (stdlib).
+
+### Fix misleading names and comments
+
+- [ ] `prettier-lint.yml` `name:` says "Lint JS files" but it checks Markdown.
+- [ ] Copy-pasted header comments in `run-htmltest-local.yml` (they refer to
+      `run-htmltest.yml` and `dist/`).
+- [ ] The "Don't fail the build on broken links" comments contradict
+      `continue-on-error: false` in both htmltest workflows.
+
+### Decide on the informational linters
+
+- [ ] `markdown-lint.yml`, `prettier-lint.yml`, and `python-lint.yml` all set
+      `continue-on-error: true`, so they never block, and they duplicate the
+      local pre-commit checks. Decide per workflow: make it blocking (give it
+      teeth) or remove it (reduce noise). Record the decision in the
+      [workflows reference](../README.md).
+
+### Redundant gating
+
+- [ ] `inkeep.yml` has both a top-level `paths` filter and an in-job
+      `dorny/paths-filter` check; remove the redundant inner check.
+
+## Verification
+
+- Open a PR with the batched changes and confirm all PR-triggered checks still
+  run and pass.
+- For scheduled-only workflows, trigger each through `workflow_dispatch` and
+  confirm the bumped actions and new pins resolve and run.
+
+## Rollback
+
+Each item is a small isolated commit; revert individually. Action-version bumps
+are the only ones with behavior risk; if a bumped action changes inputs, pin to
+the last working version and note it.
+
+## Risks and open questions
+
+- Bumped actions occasionally change input names or defaults; read each action's
+  changelog for the major bump.
+- Making the informational linters blocking may surface a backlog of existing
+  violations; fix or baseline them before flipping.
diff --git a/.github/workflows/plans/README.md b/.github/workflows/plans/README.md
new file mode 100644
index 0000000000..b949afbbfd
--- /dev/null
+++ b/.github/workflows/plans/README.md
@@ -0,0 +1,61 @@
+# CI repair plans
+
+These documents are a working plan for repairing the broken and degraded
+GitHub Actions jobs described in the [workflows reference](../README.md). They
+are written so a future session (human or agent) can pick up any one plan and
+execute it without re-deriving the context.
+
+Each plan is self-contained and follows the same structure: problem, root
+cause, prerequisites, step-by-step changes (with exact file and line
+references), verification, rollback, and risks.
+
+## Status legend
+
+- `NOT STARTED`: no work done yet.
+- `IN PROGRESS`: partially done; see the notes in the plan.
+- `BLOCKED`: waiting on a prerequisite (named in the plan).
+- `DONE`: merged and verified; the plan can be deleted.
+
+## Plans and execution order
+
+Run them in roughly this order. Plan 00 must land before 02, and 01 before 05.
+
+| # | Plan | Fixes | Status | Depends on |
+| --- | --- | --- | --- | --- |
+| 00 | [Provision a new org, machine, and secrets](00-provision-org-machine-secrets.md) | Foundation for the live-API jobs | NOT STARTED | — |
+| 01 | [Restore failure notifications](01-restore-notifications.md) | Jira steps in the 3 scheduled jobs | NOT STARTED | — |
+| 02 | [Fix Test Code Samples](02-fix-test-code-snippets.md) | `test-code-snippets.yml` | NOT STARTED | 00 |
+| 03 | [Fix Alias reminder](03-fix-alias-reminder.md) | `alias-reminder.yml` | NOT STARTED | — |
+| 04 | [Fix PR Test Label Manager](04-fix-pr-labeler.md) | `pr-labeler.yml` | NOT STARTED | — |
+| 05 | [Fix SDK method coverage](05-fix-check-methods.md) | `check-methods.yml` | NOT STARTED | 01 |
+| 06 | [Cleanup and modernization](06-cleanup-and-modernization.md) | All workflows (stale actions, dead steps) | NOT STARTED | — |
+
+## Shared context
+
+- The scheduled jobs `test-code-snippets.yml`, `check-methods.yml`, and
+  `run-htmltest.yml` report failures only by opening a Jira ticket, and those
+  Jira steps are themselves failing. Plan 01 restores that signal; do it early
+  so the other jobs' results are actually seen.
+- `test-code-snippets.yml` and the search-index jobs in `docs.yml` authenticate
+  to a shared Viam test organization through repository secrets. Plan 00
+  provisions a fresh org, machine, and API keys and re-points every secret and
+  hardcoded ID. Plan 02 then makes the samples pass against it.
+- Related work already in flight: PR #5106 makes the orgs sample select a
+  roleless member; PR #5107 adds the workflows reference and the repair TODO.
+
+## Repository secrets inventory
+
+Names referenced by the workflows (values are not readable). Plan 00 re-points
+only the secrets marked "Yes" in the last column:
+
+| Secret | Used by | Tied to the test org? |
+| --- | --- | --- |
+| `TEST_ORG_ID` | test-code-snippets, docs (index sync) | Yes |
+| `VIAM_API_KEY` / `VIAM_API_KEY_ID` | test-code-snippets, docs | Yes (org owner key) |
+| `TEST_MACHINE_KEY` | test-code-snippets (config fetch) | Yes (machine part key) |
+| `TEST_EMAIL` | test-code-snippets (invite target) | Yes (a user, not a member) |
+| `VIAM_API_KEY_DATA_REGIONS` / `VIAM_API_KEY_ID_DATA_REGIONS` | test-code-snippets (data-regions sample) | Yes (a second org) |
+| `JIRA_BASE_URL` / `JIRA_USER_EMAIL` / `JIRA_API_TOKEN` | test-code-snippets, check-methods, run-htmltest | No (Jira) |
+| `TYPESENSE_TUTORIALS_API_KEY` / `TYPESENSE_API_KEY_R` | docs (index sync) | No (Typesense) |
+| `INKEEP_API_KEY` | inkeep | No (Inkeep) |
+| `PR_TOKEN` | pr-labeler, alias-reminder | No (GitHub PAT) |