From 948710c75aee047f0772329887457fe1969864fb Mon Sep 17 00:00:00 2001
From: Brandon Shrewsbury
Date: Fri, 26 Jun 2026 13:53:55 -0600
Subject: [PATCH] Add detailed CI repair plans for a future session

Add .github/workflows/plans/: a per-issue plan set so a future session can
execute each repair without re-deriving context. Each plan covers problem,
root cause, prerequisites, step-by-step changes with file/line references,
verification, rollback, and risks.

- 00 provision a new org, machine, and secrets (maps every hardcoded ID and
  secret to re-point; foundation for the live-API jobs)
- 01 restore failure notifications (Jira + backup alert)
- 02 fix Test Code Samples (org member, step isolation, ID re-point, flaky
  teardown tolerance)
- 03 fix Alias reminder (set-output -> GITHUB_OUTPUT)
- 04 fix PR Test Label Manager (confirm safe-to-build consumer, trigger/cond)
- 05 fix SDK method coverage (signal, concurrency, scraper hardening)
- 06 cleanup and modernization (action bumps, dead steps, naming)

Index README orders the plans and lists the secrets inventory.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../plans/00-provision-org-machine-secrets.md | 143 ++++++++++++++++++
 .../plans/01-restore-notifications.md         |  65 ++++++++
 .../plans/02-fix-test-code-snippets.md        | 125 +++++++++++++++
 .../workflows/plans/03-fix-alias-reminder.md  |  66 ++++++++
 .github/workflows/plans/04-fix-pr-labeler.md  |  63 ++++++++
 .../workflows/plans/05-fix-check-methods.md   |  68 +++++++++
 .../plans/06-cleanup-and-modernization.md     |  83 ++++++++++
 .github/workflows/plans/README.md             |  61 ++++++++
 8 files changed, 674 insertions(+)
 create mode 100644 .github/workflows/plans/00-provision-org-machine-secrets.md
 create mode 100644 .github/workflows/plans/01-restore-notifications.md
 create mode 100644 .github/workflows/plans/02-fix-test-code-snippets.md
 create mode 100644 .github/workflows/plans/03-fix-alias-reminder.md
 create mode 100644 .github/workflows/plans/04-fix-pr-labeler.md
 create mode 100644 .github/workflows/plans/05-fix-check-methods.md
 create mode 100644 .github/workflows/plans/06-cleanup-and-modernization.md
 create mode 100644 .github/workflows/plans/README.md

diff --git a/.github/workflows/plans/00-provision-org-machine-secrets.md b/.github/workflows/plans/00-provision-org-machine-secrets.md
new file mode 100644
index 0000000000..2862207aaf
--- /dev/null
+++ b/.github/workflows/plans/00-provision-org-machine-secrets.md
@@ -0,0 +1,143 @@
+# Plan 00: Provision a new org, machine, and secrets
+
+**Status:** NOT STARTED
+**Fixes:** the live-API foundation for `test-code-snippets.yml` and the
+`docs.yml` search-index jobs.
+**Depended on by:** [Plan 02](02-fix-test-code-snippets.md).
+
+## Problem
+
+The code-sample tests and the search-index sync run against a shared Viam test
+organization whose identity lives in repository secrets and whose resource IDs
+are hardcoded into both the workflow and the sample files. The current org has
+drifted (no roleless member, possible stale machine/keys), and the secret
+values are not recoverable.
The cleanest reset is a fresh org, location,
+machine, and key set, with every secret and hardcoded ID re-pointed in one
+coordinated change.
+
+## What the jobs actually need from the org
+
+Collected from `test-code-snippets.yml` and the sample files:
+
+1. An **organization** whose ID becomes `TEST_ORG_ID`.
+2. An **org owner API key** (key + key ID) for `VIAM_API_KEY` /
+   `VIAM_API_KEY_ID`. The samples authenticate with this and expect owner-level
+   access (create/delete locations, keys, datasets, data pipelines, roles).
+3. A **location** that replaces the hardcoded `pg5q3j3h95`. Samples reference it
+   directly and through the machine address `auto-machine-main..viam.cloud`.
+4. A **machine** with a **main part**, used two ways:
+   - The workflow fetches its config to run a local `viam-server`
+     (`config?id=`), authenticated with a **machine part API
+     key** (`key_id` + `TEST_MACHINE_KEY`).
+   - Samples connect to it at the machine address above.
+5. A **roleless member** in the org so the roles lifecycle sample can grant a
+   fresh role (see [Plan 02](02-fix-test-code-snippets.md)).
+6. A **user to invite** whose email becomes `TEST_EMAIL` (must not already be a
+   member; the sample invites then deletes the invite).
+7. A **second organization** (`ORG_ID_2`) used only by the share/unshare
+   location sample.
+8. A **second org + key for data regions** (`VIAM_API_KEY_DATA_REGIONS` /
+   `VIAM_API_KEY_ID_DATA_REGIONS`) used by the data-regions sample.
+
+## Hardcoded IDs to replace
+
+These are baked into the workflow and sample files and must be updated to the
+new resources. Counts are approximate occurrence counts across the repo.
+
+| Current value | Meaning | Where | New value |
+| --- | --- | --- | --- |
+| `pg5q3j3h95` | Location ID | `test-code-snippets.yml`-adjacent samples; ~11 direct + 25 in machine address | new location ID |
+| `deb8782c-7b48-4d35-812d-2caa94b61f77` | Machine **part** ID | workflow config fetch (`id=`); `MACHINE_PART_ID`/`PART_ID` in samples (~12) | new part ID |
+| `824b6570-7b1d-4622-a19d-37c472dba467` | `VIAM_PART_ID` / `PART_ID` | workflow env line 49; samples (~8) | new part ID (confirm which part) |
+| `1030f25a-f4f2-4872-9762-e33fa1e0444d` | Machine part **key ID** | workflow config-fetch header | new machine key ID |
+| `5ec7266e-f762-4ea8-9c29-4dd592718b48` | Machine ID | samples (~3) | new machine ID |
+| `b5e9f350-cbcf-4d2a-bbb1-a2e2fd6851e1` | `ORG_ID_2` (share target) | samples (~18) | new second org ID |
+| `16b8a3e5-7944-4e1c-8ccd-935c1ba3be59` | resource ID (confirm: dataset/fragment) | samples (~6) | new ID |
+
+> [!NOTE]
+> There appear to be **two distinct part IDs** (`deb8782c...` and `824b6570...`).
+> Confirm whether the machine has two parts, or whether one of these is stale,
+> before re-pointing. The config fetch uses `deb8782c...`; the workflow env var
+> `VIAM_PART_ID` uses `824b6570...`.
+
+## Tooling
+
+A future session can use the available skills:
+
+- `local-viam-server`: create a machine and run a local `viam-server` (handles
+  the machine part secret the CLI does not expose).
+- `viam-modules-fleet`: `viam` CLI for orgs, locations, machines, API keys.
+- `viam-machine-config`: push a machine config if the machine needs components.
+
+The machine only needs to start and stay connected; the samples do not require
+specific components on it unless a sample connects and calls a component API.
+Audit the samples that use `MACHINE_ADDRESS` to confirm what the machine must
+expose, and configure those components (see [Plan 02](02-fix-test-code-snippets.md)).
+
+## Plan
+
+1. 
**Create the primary test org.** Record its ID for `TEST_ORG_ID`. Give it a
+   recognizable name; note that the orgs sample resets the name to
+   `docs-scheduled-tests` on each run, so expect that name to reappear.
+2. **Create a location** in that org. Record its ID (replaces `pg5q3j3h95`).
+3. **Create a machine** in that location with a main part. Record the machine
+   ID, the part ID, and confirm the machine address
+   (`auto-machine-main..viam.cloud` or the actual address shown in
+   the app).
+4. **Create a machine part API key.** Record the key ID and key value for the
+   workflow config fetch (`key_id` header and `TEST_MACHINE_KEY`).
+5. **Create an org owner API key.** Record the key and key ID for `VIAM_API_KEY`
+   / `VIAM_API_KEY_ID`.
+6. **Add a roleless member** to the org (invite a service/test user and remove
+   any default role, or confirm a member with no authorizations). This unblocks
+   the roles lifecycle sample.
+7. **Choose a `TEST_EMAIL`** for a user who is not a member of the org (the
+   sample invites and then revokes this address).
+8. **Create the second org** (`ORG_ID_2`) for the share/unshare sample. A bare
+   org with no resources is enough.
+9. **Create the data-regions org and key** for `VIAM_API_KEY_DATA_REGIONS` /
+   `VIAM_API_KEY_ID_DATA_REGIONS`. Confirm what the data-regions sample requires
+   (it changes an org region and tolerates the "region cannot be changed" error).
+10. **Configure the machine** so every sample that connects to `MACHINE_ADDRESS`
+    finds the components it calls (audit list in Plan 02). Start a
+    `viam-server` once to confirm the config is valid and the machine reports
+    online.
+11. **Update the workflow** `test-code-snippets.yml`: replace the hardcoded
+    `key_id` and config-fetch `id` (lines around 29) and `VIAM_PART_ID`
+    (line ~49) with the new machine values. Prefer moving these into secrets or
+    workflow `env` referencing secrets rather than re-hardcoding.
+12. 
**Update the sample files** with the new location/machine/org IDs (see the
+    table above). Use a scripted find-and-replace, then grep to confirm no old
+    IDs remain.
+13. **Update the GitHub repository secrets** (Settings -> Secrets and variables
+    -> Actions): `TEST_ORG_ID`, `VIAM_API_KEY`, `VIAM_API_KEY_ID`,
+    `TEST_MACHINE_KEY`, `TEST_EMAIL`, `VIAM_API_KEY_DATA_REGIONS`,
+    `VIAM_API_KEY_ID_DATA_REGIONS`.
+
+## Verification
+
+- Run the workflow through `workflow_dispatch` on the branch and confirm the
+  "Start viam-server in background" step fetches the config and the server
+  reports running.
+- Confirm `grep -rE 'pg5q3j3h95|deb8782c|824b6570|1030f25a|5ec7266e|b5e9f350|16b8a3e5' static/include/examples .github/workflows/test-code-snippets.yml`
+  returns nothing.
+- Proceed to [Plan 02](02-fix-test-code-snippets.md) to make the samples pass.
+
+## Rollback
+
+Keep the old secret values recorded until the new org is confirmed working. If
+the new org misbehaves, restore the previous secret values and revert the ID
+changes in one commit. The old org can stay untouched until the new one is
+green.
+
+## Risks and open questions
+
+- **Two part IDs**: confirm the machine's part topology before re-pointing.
+- **Component requirements**: some samples may call specific component APIs on
+  the machine; the machine config must provide them or those samples fail at
+  connect/call time even with a healthy server.
+- **Data-regions org**: regions can only be set once per org, so the
+  data-regions org may be a single-use resource; confirm the sample tolerates a
+  pre-set region (it currently logs and continues on that error).
+- **`ORG_ID_2` permissions**: the owner key for the primary org must be allowed
+  to share a location into `ORG_ID_2`; confirm cross-org sharing is permitted.
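Review note on Plan 00, step 12 and its verification grep: the "grep to confirm no old IDs remain" check could also be run as a small script that reports file and line for each leftover. The sketch below is only an illustration under stated assumptions. The ID fragments are copied from the plan's grep pattern, while `find_stale_ids` and `scan_paths` are hypothetical helper names that are not part of this patch or the repository.

```python
from __future__ import annotations

import re
from pathlib import Path

# ID fragments copied from the Plan 00 verification grep.
STALE_ID_PATTERN = re.compile(
    r"pg5q3j3h95|deb8782c|824b6570|1030f25a|5ec7266e|b5e9f350|16b8a3e5"
)


def find_stale_ids(text: str) -> list[tuple[int, str]]:
    """Return (line_number, fragment) for each stale ID fragment in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in STALE_ID_PATTERN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def scan_paths(paths: list[str]) -> dict[str, list[tuple[int, str]]]:
    """Hypothetical helper: scan files or directory trees for stale IDs."""
    report = {}
    for raw in paths:
        root = Path(raw)
        if root.is_file():
            files = [root]
        elif root.is_dir():
            files = sorted(p for p in root.rglob("*") if p.is_file())
        else:
            files = []  # a missing path is treated as nothing to scan
        for path in files:
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            hits = find_stale_ids(text)
            if hits:
                report[str(path)] = hits
    return report


# Small demonstration on an inline string rather than the real tree.
sample = 'LOCATION_ID = "pg5q3j3h95"\nPART_ID = "new-part-id"\n'
print(find_stale_ids(sample))
```

Against the repository, the call would presumably be `scan_paths(["static/include/examples", ".github/workflows/test-code-snippets.yml"])`, with an empty result corresponding to the grep returning nothing.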
diff --git a/.github/workflows/plans/01-restore-notifications.md b/.github/workflows/plans/01-restore-notifications.md
new file mode 100644
index 0000000000..ed8c1d46db
--- /dev/null
+++ b/.github/workflows/plans/01-restore-notifications.md
@@ -0,0 +1,65 @@
+# Plan 01: Restore failure notifications
+
+**Status:** NOT STARTED
+**Fixes:** the Jira notification steps in `test-code-snippets.yml`,
+`check-methods.yml`, and `run-htmltest.yml`.
+**Depended on by:** [Plan 05](05-fix-check-methods.md) (its only signal is Jira).
+
+## Problem
+
+Three scheduled jobs report failures only by opening a Jira ticket, and the
+Jira steps are themselves failing (the `Login to Jira` / `Create Jira ticket`
+steps show as failed in recent runs). The result: these jobs run unmonitored,
+so even a real regression is silent. This is the highest-impact repair because
+it is the prerequisite for trusting every other scheduled job.
+
+## Root cause (to confirm)
+
+The `atlassian/gajira-login@v3` step authenticates with `JIRA_BASE_URL`,
+`JIRA_USER_EMAIL`, and `JIRA_API_TOKEN`. The most likely causes, in order:
+
+1. An expired or rotated `JIRA_API_TOKEN`.
+2. A changed `JIRA_BASE_URL` or account email.
+3. The `DOCS` project no longer accepting the issue types used
+   (`Bug` for the htmltest and code-sample jobs, `Task` for check-methods), or
+   a required field now being enforced on create.
+
+## Plan
+
+1. **Reproduce** by triggering one scheduled workflow through
+   `workflow_dispatch` and reading the `Login to Jira` step log for the exact
+   error (auth failure, project not found, required field, and so on).
+2. **Refresh credentials**: regenerate the Jira API token, confirm the account
+   email and base URL, and update the three secrets.
+3. **Confirm the create payload**: verify the `DOCS` project key, the issue
+   types (`Bug` / `Task`), and any newly required fields. Update the
+   `gajira-create` step inputs if the project schema changed.
+4. 
**Add a backup notification path** so a single broken integration cannot
+   silence the jobs again. Options, cheapest first:
+   - A Slack incoming-webhook step on `failure()` posting the run URL.
+   - A GitHub issue created on failure with a dedup label.
+   - Email through an action on `failure()`.
+5. **Make notification failures visible**: ensure the Jira/Slack steps run with
+   `if: failure()` (they do today) and that a failure of the notification step
+   itself surfaces in the run status rather than passing silently.
+
+## Verification
+
+- Trigger each of the three scheduled jobs through `workflow_dispatch` while
+  forcing a failure (or run against a known-failing state) and confirm a ticket
+  is created and the backup alert fires.
+- Confirm the created ticket lands in the expected `DOCS` project with the
+  correct issue type and a useful title and body (include the run URL).
+
+## Rollback
+
+The notification steps are isolated from the test logic, so reverting the
+`gajira-*` step changes or removing the backup step does not affect what the
+jobs test. Keep the previous step config in the PR description for quick revert.
+
+## Risks and open questions
+
+- Is the `DOCS` Jira project still the right destination, and who owns the
+  service account behind `JIRA_USER_EMAIL`?
+- If Jira is being retired for this purpose, replace it with the backup path as
+  the primary, rather than fixing the login.
diff --git a/.github/workflows/plans/02-fix-test-code-snippets.md b/.github/workflows/plans/02-fix-test-code-snippets.md
new file mode 100644
index 0000000000..90aafbbf16
--- /dev/null
+++ b/.github/workflows/plans/02-fix-test-code-snippets.md
@@ -0,0 +1,125 @@
+# Plan 02: Fix Test Code Samples
+
+**Status:** NOT STARTED
+**Fixes:** `test-code-snippets.yml` (the _Test Code Samples_ job).
+**Depends on:** [Plan 00](00-provision-org-machine-secrets.md) (org, machine,
+secrets) and benefits from [Plan 01](01-restore-notifications.md).
+
+## Problem
+
+The job runs every Python, Go, and TypeScript sample under
+`static/include/examples/` against a live `viam-server` and a live Viam org. It
+has failed on every run since 2025-11-10. The Python step exits non-zero on the
+first failing file, which aborts the Go and TypeScript steps, so a single bad
+sample hides the rest of the suite.
+
+## Root causes
+
+1. **Org data dependency**: `fleet-api/fleet-management-api-orgs.py` granted a
+   location-owner role to `member_list[-1]`, who is now an org owner and already
+   inherits that role, so the grant is rejected. PR #5106 changes the sample to
+   pick a member with no existing authorizations, but the org must actually
+   contain such a member (Plan 00, step 6).
+2. **No step isolation**: the three language steps run sequentially in one job;
+   the Python step's non-zero exit skips Go and TypeScript.
+3. **Stale hardcoded IDs**: the server-start step and many samples hardcode the
+   machine, location, and org IDs (Plan 00).
+4. **Flaky live calls**: some samples fail intermittently on transient
+   `INTERNAL` backend errors (observed in the `data-pipelines` teardown
+   `delete_data_pipeline` calls), unrelated to docs changes.
+
+## Plan
+
+### A. Land the org-data prerequisites
+
+1. Confirm [Plan 00](00-provision-org-machine-secrets.md) is complete: new org,
+   machine, secrets, and a **roleless member**.
+2. Merge or rebase PR #5106 so the orgs sample selects a roleless member and
+   asserts clearly if none exists.
+
+### B. Isolate the language steps
+
+Today a Python failure blocks Go and TypeScript. Pick one:
+
+- **Preferred**: split into three jobs (`test-python`, `test-go`, `test-ts`),
+  each setting up its own runtime and running its own samples. Share the
+  viam-server startup through a reusable setup (composite action or a setup
+  job + artifact), or start the server in each job. This gives independent
+  pass/fail per language and parallel execution.
+- **Lighter**: keep one job but make each language loop record failures and
+  continue, then fail the job once at the end with a combined summary, so all
+  three languages always run.
+
+### C. Re-point hardcoded IDs
+
+Apply the ID replacements from [Plan 00](00-provision-org-machine-secrets.md).
+Prefer reading machine identifiers from secrets in the workflow rather than
+hardcoding. Confirm with:
+
+```bash
+grep -rE 'pg5q3j3h95|deb8782c|824b6570|1030f25a|5ec7266e|b5e9f350|16b8a3e5' \
+  static/include/examples .github/workflows/test-code-snippets.yml
+```
+
+### D. Make teardown resilient to flaky backends
+
+For samples whose **teardown** (not the demonstrated behavior) makes a live call
+that can return a transient `INTERNAL` error, wrap the teardown so a flaky
+response does not fail the run. Confirmed cases:
+
+- `data-pipelines/pipeline-create.py` (`delete_data_pipeline` teardown).
+- `data-pipelines/pipeline-list.py` (`delete_data_pipeline` teardown).
+
+Pattern (teardown only, never the asserted behavior):
+
+```python
+try:
+    await data_client.delete_data_pipeline(pipeline_id)
+except GRPCError as e:
+    print(f"teardown delete failed (ignored): {e}")
+```
+
+Do not blanket-wrap asserted calls; only teardown/cleanup that is incidental to
+what the sample demonstrates.
+
+### E. Audit machine-dependent samples
+
+List the samples that connect to the machine and confirm the new machine
+exposes the components they call, so they do not fail at connect or call time:
+
+```bash
+grep -rln 'MACHINE_ADDRESS\|auto-machine-main' static/include/examples
+```
+
+For each, note the component or service APIs it calls and ensure the Plan 00
+machine config provides them.
+
+### F. Quieten known anti-patterns (optional, low risk)
+
+- Remove `pip install asyncio` (line ~61): `asyncio` is stdlib.
+- Move Python off 3.9 (end of life since October 2025) and Node off the
+  non-LTS 23.
+
+## Verification
+
+1. Trigger the workflow through `workflow_dispatch` on the branch.
+2. 
Confirm all three language steps run (not just Python) and report a per-file
+   pass/fail summary.
+3. Confirm a fully green run end to end.
+4. Re-run once to confirm previously flaky samples are now stable.
+
+## Rollback
+
+Each change is independent: the step-split, the ID re-point, and the teardown
+guards can be reverted individually. Keep the single-job version in git history
+so it can be restored if the split causes runner or secret-scoping issues.
+
+## Risks and open questions
+
+- Splitting jobs multiplies viam-server startups (and secret usage); confirm the
+  machine tolerates concurrent connections, or gate the language jobs to run
+  sequentially with `needs`.
+- The "stable" viam-server AppImage is re-pulled each run, so a server release
+  can change behavior with no repo change; consider pinning a known-good server
+  version for reproducibility.
+- Some failures may be genuine sample bugs surfaced once the suite runs fully;
+  budget time to triage newly visible Go and TypeScript failures.
diff --git a/.github/workflows/plans/03-fix-alias-reminder.md b/.github/workflows/plans/03-fix-alias-reminder.md
new file mode 100644
index 0000000000..4464fc9681
--- /dev/null
+++ b/.github/workflows/plans/03-fix-alias-reminder.md
@@ -0,0 +1,66 @@
+# Plan 03: Fix Alias reminder
+
+**Status:** NOT STARTED
+**Fixes:** `alias-reminder.yml` (the _Alias reminder_ job).
+**Depends on:** nothing.
+
+## Problem
+
+On pull requests that rename or move `.md` files, this workflow should post a
+sticky comment reminding the author to add redirect aliases. It currently never
+posts, so authors get no reminder and broken redirects slip through.
+
+## Root cause
+
+The detection step emits its result with the deprecated `::set-output` workflow
+command (`Write-Host "::set-output name=...::"`). GitHub has disabled
+`::set-output`, so `steps.check_files_moved.outputs.*` is never populated.
The
+gate on the comment step (`== 'True'`) therefore never evaluates true, and the
+comment step is dead.
+
+## Plan
+
+1. **Migrate to `$GITHUB_OUTPUT`.** Replace the `::set-output` writes in the
+   PowerShell detection step with appends to the `GITHUB_OUTPUT` file. In `pwsh`:
+
+   ```powershell
+   "files_moved=$filesMoved" | Out-File -FilePath $env:GITHUB_OUTPUT -Append
+   "moved_list=$movedList" | Out-File -FilePath $env:GITHUB_OUTPUT -Append
+   ```
+
+   Keep the existing output names so the downstream `if:` does not need changing
+   beyond confirming the comparison value (`'True'` vs `'true'`: normalize the
+   value the step writes and the condition it is compared against).
+2. **Handle multi-line output safely.** If `moved_list` can contain multiple
+   filenames or newlines, use the heredoc form for `GITHUB_OUTPUT` so the value
+   is not truncated:
+
+   ```powershell
+   "moved_list<