viamrobotics · btshrewsbury-viam · Jun 26, 2026
diff --git a/.github/workflows/plans/00-provision-org-machine-secrets.md b/.github/workflows/plans/00-provision-org-machine-secrets.md
@@ -0,0 +1,143 @@
+# Plan 00: Provision a new org, machine, and secrets
+
+**Status:** NOT STARTED
+**Fixes:** the live-API foundation for `test-code-snippets.yml` and the
+`docs.yml` search-index jobs.
+**Depended on by:** [Plan 02](02-fix-test-code-snippets.md).
+
+## Problem
+
+The code-sample tests and the search-index sync run against a shared Viam test
+organization whose identity lives in repository secrets and whose resource IDs
+are hardcoded into both the workflow and the sample files. The current org has
+drifted (no roleless member, possibly stale machine and keys), and the secret
+values are not recoverable. The cleanest reset is a fresh org, location,
+machine, and key set, with every secret and hardcoded ID re-pointed in one
+coordinated change.
+
+## What the jobs actually need from the org
+
+Collected from `test-code-snippets.yml` and the sample files:
+
+1. An **organization** whose ID becomes `TEST_ORG_ID`.
+2. An **org owner API key** (key + key ID) for `VIAM_API_KEY` /
+   `VIAM_API_KEY_ID`. The samples authenticate with this and expect owner-level
+   access (create/delete locations, keys, datasets, data pipelines, roles).
+3. A **location** that replaces the hardcoded `pg5q3j3h95`. Samples reference it
+   directly and through the machine address `auto-machine-main.<loc>.viam.cloud`.
+4. A **machine** with a **main part**, used two ways:
+   - The workflow fetches its config to run a local `viam-server`
+     (`config?id=<machine-part-id>`), authenticated with a **machine part API
+     key** (`key_id` + `TEST_MACHINE_KEY`).
+   - Samples connect to it at the machine address above.
+5. A **roleless member** in the org so the roles lifecycle sample can grant a
+   fresh role (see [Plan 02](02-fix-test-code-snippets.md)).
+6. A **user to invite** whose email becomes `TEST_EMAIL` (must not already be a
+   member; the sample invites then deletes the invite).
+7. A **second organization** (`ORG_ID_2`) used only by the share/unshare
+   location sample.
+8. A **third organization and its API key** (`VIAM_API_KEY_DATA_REGIONS` /
+   `VIAM_API_KEY_ID_DATA_REGIONS`) used by the data-regions sample.
+
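+The secret names in this list can be checked up front so that a missing value
+fails fast with a clear message. A minimal sketch, assuming the workflow exports
+each secret as an environment variable of the same name; `REQUIRED` and
+`missing_secrets` are illustrative names, not part of the workflow:
+
+```python
+import os
+
+# Names taken from the list above. Whether each one reaches the job
+# environment under the same name is an assumption.
+REQUIRED = [
+    "TEST_ORG_ID",
+    "VIAM_API_KEY",
+    "VIAM_API_KEY_ID",
+    "TEST_MACHINE_KEY",
+    "TEST_EMAIL",
+    "VIAM_API_KEY_DATA_REGIONS",
+    "VIAM_API_KEY_ID_DATA_REGIONS",
+]
+
+
+def missing_secrets(env):
+    """Return the required names that are unset or empty in env."""
+    return [name for name in REQUIRED if not env.get(name)]
+
+
+def main():
+    """Print the missing names and return a process exit code."""
+    missing = missing_secrets(os.environ)
+    if missing:
+        print("missing secrets: " + ", ".join(missing))
+        return 1
+    print("all required secrets are set")
+    return 0
+```
+
+A first workflow step could end with `sys.exit(main())` so that the job stops
+before any sample runs against a half-configured org.
+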
+## Hardcoded IDs to replace
+
+These are baked into the workflow and sample files and must be updated to the
+new resources. Counts are approximate occurrence counts across the repo.
+
+| Current value | Meaning | Where | New value |
+| --- | --- | --- | --- |
+| `pg5q3j3h95` | Location ID | samples run by `test-code-snippets.yml`; ~11 direct, ~25 inside the machine address | new location ID |
+| `deb8782c-7b48-4d35-812d-2caa94b61f77` | Machine **part** ID | workflow config fetch (`id=`); `MACHINE_PART_ID`/`PART_ID` in samples (~12) | new part ID |
+| `824b6570-7b1d-4622-a19d-37c472dba467` | Second machine **part** ID (`VIAM_PART_ID` / `PART_ID`) | workflow env line 49; samples (~8) | new part ID (confirm which part) |
+| `1030f25a-f4f2-4872-9762-e33fa1e0444d` | Machine part **key ID** | workflow config-fetch header | new machine key ID |
+| `5ec7266e-f762-4ea8-9c29-4dd592718b48` | Machine ID | samples (~3) | new machine ID |
+| `b5e9f350-cbcf-4d2a-bbb1-a2e2fd6851e1` | `ORG_ID_2` (share target) | samples (~18) | new second org ID |
+| `16b8a3e5-7944-4e1c-8ccd-935c1ba3be59` | resource ID (confirm: dataset/fragment) | samples (~6) | new ID |
+
+> [!NOTE]
+> There appear to be **two distinct part IDs** (`deb8782c...` and `824b6570...`).
+> Confirm whether the machine has two parts, or whether one of these is stale,
+> before re-pointing. The config fetch uses `deb8782c...`; the workflow env var
+> `VIAM_PART_ID` uses `824b6570...`.
+
+## Tooling
+
+A future session can use the available skills:
+
+- `local-viam-server`: create a machine and run a local `viam-server` (handles
+  the machine part secret the CLI does not expose).
+- `viam-modules-fleet`: `viam` CLI for orgs, locations, machines, API keys.
+- `viam-machine-config`: push a machine config if the machine needs components.
+
+The machine only needs to start and stay connected; the samples do not require
+specific components on it unless a sample connects and calls a component API.
+Audit the samples that use `MACHINE_ADDRESS` to confirm what the machine must
+expose, and configure those components (see [Plan 02](02-fix-test-code-snippets.md)).
+
+## Plan
+
+1. **Create the primary test org.** Record its ID for `TEST_ORG_ID`. Give it a
+   recognizable name; note that the orgs sample resets the name to
+   `docs-scheduled-tests` on each run, so expect that name to reappear.
+2. **Create a location** in that org. Record its ID (replaces `pg5q3j3h95`).
+3. **Create a machine** in that location with a main part. Record the machine
+   ID, the part ID, and confirm the machine address
+   (`auto-machine-main.<location-id>.viam.cloud` or the actual address shown in
+   the app).
+4. **Create a machine part API key.** Record the key ID and key value for the
+   workflow config fetch (`key_id` header and `TEST_MACHINE_KEY`).
+5. **Create an org owner API key.** Record the key and key ID for `VIAM_API_KEY`
+   / `VIAM_API_KEY_ID`.
+6. **Add a roleless member** to the org (invite a service/test user and remove
+   any default role, or confirm a member with no authorizations). This unblocks
+   the roles lifecycle sample.
+7. **Choose a `TEST_EMAIL`** for a user who is not a member of the org (the
+   sample invites and then revokes this address).
+8. **Create the second org** (`ORG_ID_2`) for the share/unshare sample. A bare
+   org with no resources is enough.
+9. **Create the data-regions org and key** for `VIAM_API_KEY_DATA_REGIONS` /
+   `VIAM_API_KEY_ID_DATA_REGIONS`. Confirm what the data-regions sample requires
+   (it changes an org region and tolerates the "region cannot be changed" error).
+10. **Configure the machine** so every sample that connects to `MACHINE_ADDRESS`
+    finds the components it calls (audit list in Plan 02). Start a
+    `viam-server` once to confirm the config is valid and the machine reports
+    online.
+11. **Update the workflow** `test-code-snippets.yml`: replace the hardcoded
+    `key_id` and config-fetch `id` (lines around 29) and `VIAM_PART_ID`
+    (line ~49) with the new machine values. Prefer moving these into secrets or
+    workflow `env` referencing secrets rather than re-hardcoding.
+12. **Update the sample files** with the new location/machine/org IDs (see the
+    table above). Use a scripted find-and-replace, then grep to confirm no old
+    IDs remain.
+13. **Update the GitHub repository secrets** (Settings -> Secrets and variables
+    -> Actions): `TEST_ORG_ID`, `VIAM_API_KEY`, `VIAM_API_KEY_ID`,
+    `TEST_MACHINE_KEY`, `TEST_EMAIL`, `VIAM_API_KEY_DATA_REGIONS`,
+    `VIAM_API_KEY_ID_DATA_REGIONS`.
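+
+The scripted find-and-replace and the follow-up check for leftover IDs can live
+in one helper. A minimal sketch, assuming the samples are UTF-8 text under a
+single directory; `text_files`, `replace_ids`, and `remaining_ids` are
+hypothetical names, and the replacement values come from the resources created
+in the earlier steps:
+
+```python
+from pathlib import Path
+
+
+def text_files(root):
+    """Yield (path, text) for every UTF-8 file under root."""
+    for path in sorted(Path(root).rglob("*")):
+        if not path.is_file():
+            continue
+        try:
+            text = path.read_bytes().decode("utf-8")
+        except UnicodeDecodeError:
+            continue  # skip binary files
+        yield path, text
+
+
+def replace_ids(root, mapping):
+    """Rewrite every old ID in place and return the paths that changed."""
+    changed = []
+    for path, text in text_files(root):
+        updated = text
+        for old, new in mapping.items():
+            updated = updated.replace(old, new)
+        if updated != text:
+            # Bytes are written back unchanged apart from the IDs,
+            # so existing line endings are preserved.
+            path.write_bytes(updated.encode("utf-8"))
+            changed.append(path)
+    return changed
+
+
+def remaining_ids(root, mapping):
+    """Return (path, old_id) pairs for any old ID still present."""
+    return [
+        (path, old)
+        for path, text in text_files(root)
+        for old in mapping
+        if old in text
+    ]
+```
+
+Calling `replace_ids("static/include/examples", {"pg5q3j3h95": NEW_LOCATION_ID})`
+and then confirming that `remaining_ids` returns an empty list mirrors the grep
+check; the workflow file itself is a single file and is edited in step 11.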
+
+## Verification
+
+- Run the workflow through `workflow_dispatch` on the branch and confirm the
+  "Start viam-server in background" step fetches the config and the server
+  reports running.
+- Confirm `grep -rE 'pg5q3j3h95|deb8782c|824b6570|1030f25a|5ec7266e|b5e9f350|16b8a3e5' static/include/examples .github/workflows/test-code-snippets.yml`
+  returns nothing.
+- Proceed to [Plan 02](02-fix-test-code-snippets.md) to make the samples pass.
+
+## Rollback
+
+Keep the old secret values recorded until the new org is confirmed working. If
+the new org misbehaves, restore the previous secret values and revert the ID
+changes in one commit. The old org can stay untouched until the new one is
+green.
+
+## Risks and open questions
+
+- **Two part IDs**: confirm the machine's part topology before re-pointing.
+- **Component requirements**: some samples may call specific component APIs on
+  the machine; the machine config must provide them or those samples fail at
+  connect/call time even with a healthy server.
+- **Data-regions org**: regions can only be set once per org, so the
+  data-regions org may be a single-use resource; confirm the sample tolerates a
+  pre-set region (it currently logs and continues on that error).
+- **`ORG_ID_2` permissions**: the owner key for the primary org must be allowed
+  to share a location into `ORG_ID_2`; confirm cross-org sharing is permitted.
diff --git a/.github/workflows/plans/01-restore-notifications.md b/.github/workflows/plans/01-restore-notifications.md
@@ -0,0 +1,65 @@
+# Plan 01: Restore failure notifications
+
+**Status:** NOT STARTED
+**Fixes:** the Jira notification steps in `test-code-snippets.yml`,
+`check-methods.yml`, and `run-htmltest.yml`.
+**Depended on by:** [Plan 05](05-fix-check-methods.md) (its only signal is Jira).
+
+## Problem
+
+Three scheduled jobs report failures only by opening a Jira ticket, and the
+Jira steps are themselves failing (the `Login to Jira` / `Create Jira ticket`
+steps show as failed in recent runs). The result: these jobs run unmonitored,
+so even a real regression is silent. This is the highest-impact repair because
+it is the prerequisite for trusting every other scheduled job.
+
+## Root cause (to confirm)
+
+The `atlassian/gajira-login@v3` step authenticates with `JIRA_BASE_URL`,
+`JIRA_USER_EMAIL`, and `JIRA_API_TOKEN`. The most likely causes, in order:
+
+1. An expired or rotated `JIRA_API_TOKEN`.
+2. A changed `JIRA_BASE_URL` or account email.
+3. The `DOCS` project no longer accepting the issue types used
+   (`Bug` for the htmltest and code-sample jobs, `Task` for check-methods), or
+   a required field now being enforced on create.
+
+## Plan
+
+1. **Reproduce** by triggering one scheduled workflow through
+   `workflow_dispatch` and reading the `Login to Jira` step log for the exact
+   error (auth failure, project not found, required field, and so on).
+2. **Refresh credentials**: regenerate the Jira API token, confirm the account
+   email and base URL, and update the three secrets.
+3. **Confirm the create payload**: verify the `DOCS` project key, the issue
+   types (`Bug` / `Task`), and any newly required fields. Update the
+   `gajira-create` step inputs if the project schema changed.
+4. **Add a backup notification path** so a single broken integration cannot
+   silence the jobs again. Options, cheapest first:
+   - A Slack incoming-webhook step on `failure()` posting the run URL.
+   - A GitHub issue created on failure with a dedup label.
+   - Email through an action on `failure()`.
+5. **Make notification failures visible**: ensure the Jira/Slack steps run with
+   `if: failure()` (they do today) and that a failure of the notification step
+   itself surfaces in the run status rather than passing silently.
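+
+The Slack option in step 4 needs very little code. A minimal sketch, assuming a
+Slack incoming webhook whose URL is stored in a secret (the name
+`SLACK_WEBHOOK_URL` is hypothetical) and the default `GITHUB_*` variables that
+GitHub Actions sets on every run; `build_payload` and `notify` are illustrative
+names:
+
+```python
+import json
+import os
+import urllib.request
+
+
+def build_payload(env):
+    """Build the Slack message body from GitHub Actions default variables."""
+    run_url = "{}/{}/actions/runs/{}".format(
+        env["GITHUB_SERVER_URL"], env["GITHUB_REPOSITORY"], env["GITHUB_RUN_ID"]
+    )
+    workflow = env.get("GITHUB_WORKFLOW", "workflow")
+    return {"text": f"{workflow} failed: {run_url}"}
+
+
+def notify(webhook_url, payload):
+    """POST the payload as JSON; urlopen raises on a non-2xx response."""
+    request = urllib.request.Request(
+        webhook_url,
+        data=json.dumps(payload).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+    )
+    with urllib.request.urlopen(request, timeout=10) as response:
+        return response.status
+
+
+def main():
+    # SLACK_WEBHOOK_URL is an assumed secret name, not an existing one.
+    return notify(os.environ["SLACK_WEBHOOK_URL"], build_payload(os.environ))
+```
+
+Because `urlopen` raises on an HTTP error, a broken webhook fails the step
+instead of passing silently, which is the behavior step 5 asks for.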
+
+## Verification
+
+- Trigger each of the three scheduled jobs through `workflow_dispatch` while
+  forcing a failure (or run against a known-failing state) and confirm a ticket
+  is created and the backup alert fires.
+- Confirm the created ticket lands in the expected `DOCS` project with the
+  correct issue type and a useful title and body (include the run URL).
+
+## Rollback
+
+The notification steps are isolated from the test logic, so reverting the
+`gajira-*` step changes or removing the backup step does not affect what the
+jobs test. Keep the previous step config in the PR description for quick revert.
+
+## Risks and open questions
+
+- Is the `DOCS` Jira project still the right destination, and who owns the
+  service account behind `JIRA_USER_EMAIL`?
+- If Jira is being retired for this purpose, replace it with the backup path as
+  the primary, rather than fixing the login.
diff --git a/.github/workflows/plans/02-fix-test-code-snippets.md b/.github/workflows/plans/02-fix-test-code-snippets.md
@@ -0,0 +1,125 @@
+# Plan 02: Fix Test Code Samples
+
+**Status:** NOT STARTED
+**Fixes:** `test-code-snippets.yml` (the _Test Code Samples_ job).
+**Depends on:** [Plan 00](00-provision-org-machine-secrets.md) (org, machine,
+secrets) and benefits from [Plan 01](01-restore-notifications.md).
+
+## Problem
+
+The job runs every Python, Go, and TypeScript sample under
+`static/include/examples/` against a live `viam-server` and a live Viam org. It
+has failed on every run since 2025-11-10. The Python step exits non-zero on the
+first failing file, which aborts the Go and TypeScript steps, so a single bad
+sample hides the rest of the suite.
+
+## Root causes
+
+1. **Org data dependency**: `fleet-api/fleet-management-api-orgs.py` grants a
+   location-owner role to `member_list[-1]`. That member is now an org owner and
+   already inherits the role, so the grant is rejected. PR #5106 changes the
+   sample to pick a member with no existing authorizations, but the org must
+   actually contain such a member (Plan 00, step 6).
+2. **No step isolation**: the three language steps run sequentially in one job;
+   the Python step's non-zero exit skips Go and TypeScript.
+3. **Stale hardcoded IDs**: the server-start step and many samples hardcode the
+   machine, location, and org IDs (Plan 00).
+4. **Flaky live calls**: some samples fail intermittently on transient
+   `INTERNAL` backend errors (observed in the `data-pipelines` teardown
+   `delete_data_pipeline` calls), unrelated to docs changes.
+
+## Plan
+
+### A. Land the org-data prerequisites
+
+1. Confirm [Plan 00](00-provision-org-machine-secrets.md) is complete: new org,
+   machine, secrets, and a **roleless member**.
+2. Merge or rebase PR #5106 so the orgs sample selects a roleless member and
+   asserts clearly if none exists.
+
+### B. Isolate the language steps
+
+Today a Python failure blocks Go and TypeScript. Pick one:
+
+- **Preferred**: split into three jobs (`test-python`, `test-go`, `test-ts`),
+  each setting up its own runtime and running its own samples. Share the
+  viam-server startup through a reusable setup (composite action or a setup
+  job + artifact), or start the server in each job. This gives independent
+  pass/fail per language and parallel execution.
+- **Lighter**: keep one job but make each language loop record failures and
+  continue, then fail the job once at the end with a combined summary, so all
+  three languages always run.
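+
+The lighter option amounts to a loop that records failures instead of exiting
+on the first one. A minimal sketch for the Python samples, assuming each sample
+is a standalone script that exits non-zero on failure; `run_all` and `main` are
+hypothetical names, not the workflow's current step:
+
+```python
+import subprocess
+import sys
+from pathlib import Path
+
+
+def run_all(files, command):
+    """Run command on each file in turn and return the files that failed."""
+    failed = []
+    for path in files:
+        result = subprocess.run([*command, str(path)])
+        if result.returncode != 0:
+            failed.append(path)  # record the failure and keep going
+    return failed
+
+
+def main(root="static/include/examples"):
+    """Run every Python sample, print a summary, and return an exit code."""
+    samples = sorted(Path(root).rglob("*.py"))
+    failed = run_all(samples, [sys.executable])
+    for path in failed:
+        print(f"FAILED: {path}")
+    print(f"{len(samples) - len(failed)} passed, {len(failed)} failed")
+    return 1 if failed else 0
+```
+
+The step would end with `sys.exit(main())`, so the job still fails once at the
+end. The same loop applies to the Go and TypeScript samples with a different
+`command` and file pattern.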
+
+### C. Re-point hardcoded IDs
+
+Apply the ID replacements from [Plan 00](00-provision-org-machine-secrets.md).
+Prefer reading machine identifiers from secrets in the workflow rather than
+hardcoding. Confirm with:
+
+```bash
+grep -rE 'pg5q3j3h95|deb8782c|824b6570|1030f25a|5ec7266e|b5e9f350|16b8a3e5' \
+  static/include/examples .github/workflows/test-code-snippets.yml
+```
+
+### D. Make teardown resilient to flaky backends
+
+For samples whose **teardown** (not the demonstrated behavior) makes a live call
+that can return a transient `INTERNAL` error, wrap the teardown so a flaky
+response does not fail the run. Confirmed cases:
+
+- `data-pipelines/pipeline-create.py` (`delete_data_pipeline` teardown).
+- `data-pipelines/pipeline-list.py` (`delete_data_pipeline` teardown).
+
+Pattern (teardown only, never the asserted behavior):
+
+```python
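+from grpclib.exceptions import GRPCError
+
+# Teardown only: log a transient backend error instead of failing the sample.
+try:
+    await data_client.delete_data_pipeline(pipeline_id)
+except GRPCError as e:
+    print(f"teardown delete failed (ignored): {e}")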
+```
+
+Do not blanket-wrap asserted calls; only teardown/cleanup that is incidental to
+what the sample demonstrates.
+
+### E. Audit machine-dependent samples
+
+List the samples that connect to the machine and confirm the new machine
+exposes the components they call, so they do not fail at connect or call time:
+
+```bash
+grep -rlE 'MACHINE_ADDRESS|auto-machine-main' static/include/examples
+```
+
+For each, note the component or service APIs it calls and ensure the Plan 00
+machine config provides them.
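+
+For the Python samples, part of this audit can be automated. A rough sketch,
+assuming the samples import component and service classes in the usual
+`from viam.components.<name> import ...` form; the `audit` helper is
+hypothetical, import matching is only a heuristic, and Go and TypeScript
+samples still need a manual pass:
+
+```python
+import re
+from pathlib import Path
+
+MARKERS = ("MACHINE_ADDRESS", "auto-machine-main")
+# Matches imports such as "from viam.components.camera import Camera".
+API_IMPORT = re.compile(r"from\s+viam\.(components|services)\.(\w+)\s+import")
+
+
+def audit(root):
+    """Map each machine-dependent Python sample to the APIs it imports."""
+    report = {}
+    for path in sorted(Path(root).rglob("*.py")):
+        text = path.read_text(encoding="utf-8", errors="ignore")
+        if any(marker in text for marker in MARKERS):
+            report[str(path)] = sorted(
+                {f"{kind}.{name}" for kind, name in API_IMPORT.findall(text)}
+            )
+    return report
+```
+
+A sample that maps to an empty list connects to the machine without importing a
+component or service module, so it likely needs no extra machine config.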
+
+### F. Quieten known anti-patterns (optional, low risk)
+
+- Remove `pip install asyncio` (line ~61): `asyncio` is stdlib.
+- Move Python off 3.9 (end of life since October 2025) and Node off the non-LTS 23.
+
+## Verification
+
+1. Trigger the workflow through `workflow_dispatch` on the branch.
+2. Confirm all three language steps run (not just Python) and report a per-file
+   pass/fail summary.
+3. Confirm a fully green run end to end.
+4. Re-run once to confirm previously flaky samples are now stable.
+
+## Rollback
+
+Each change is independent: the step-split, the ID re-point, and the teardown
+guards can be reverted individually. Keep the single-job version in git history
+so it can be restored if the split causes runner or secret-scoping issues.
+
+## Risks and open questions
+
+- Splitting jobs multiplies viam-server startups (and secret usage); confirm the
+  machine tolerates concurrent connections, or gate the language jobs to run
+  sequentially with `needs`.
+- The "stable" viam-server AppImage is re-pulled each run, so a server release
+  can change behavior with no repo change; consider pinning a known-good server
+  version for reproducibility.
+- Some failures may be genuine sample bugs surfaced once the suite runs fully;
+  budget time to triage newly visible Go and TypeScript failures.