143 changes: 143 additions & 0 deletions .github/workflows/plans/00-provision-org-machine-secrets.md
@@ -0,0 +1,143 @@
# Plan 00: Provision a new org, machine, and secrets

**Status:** NOT STARTED
**Fixes:** the live-API foundation for `test-code-snippets.yml` and the
`docs.yml` search-index jobs.
**Depended on by:** [Plan 02](02-fix-test-code-snippets.md).

## Problem

The code-sample tests and the search-index sync run against a shared Viam test
organization whose identity lives in repository secrets and whose resource IDs
are hardcoded into both the workflow and the sample files. The current org has
drifted (no roleless member, possible stale machine/keys), and the secret
values are not recoverable. The cleanest reset is a fresh org, location,
machine, and key set, with every secret and hardcoded ID re-pointed in one
coordinated change.

## What the jobs actually need from the org

Collected from `test-code-snippets.yml` and the sample files:

1. An **organization** whose ID becomes `TEST_ORG_ID`.
2. An **org owner API key** (key + key ID) for `VIAM_API_KEY` /
`VIAM_API_KEY_ID`. The samples authenticate with this and expect owner-level
access (create/delete locations, keys, datasets, data pipelines, roles).
3. A **location** that replaces the hardcoded `pg5q3j3h95`. Samples reference it
directly and through the machine address `auto-machine-main.<loc>.viam.cloud`.
4. A **machine** with a **main part**, used two ways:
   - The workflow fetches its config to run a local `viam-server`
     (`config?id=<machine-part-id>`), authenticated with a **machine part API
     key** (`key_id` + `TEST_MACHINE_KEY`).
   - Samples connect to it at the machine address above.
5. A **roleless member** in the org so the roles lifecycle sample can grant a
fresh role (see [Plan 02](02-fix-test-code-snippets.md)).
6. A **user to invite** whose email becomes `TEST_EMAIL` (must not already be a
member; the sample invites then deletes the invite).
7. A **second organization** (`ORG_ID_2`) used only by the share/unshare
location sample.
8. A **second org + key for data regions** (`VIAM_API_KEY_DATA_REGIONS` /
`VIAM_API_KEY_ID_DATA_REGIONS`) used by the data-regions sample.

## Hardcoded IDs to replace

These are baked into the workflow and sample files and must be updated to the
new resources. Counts are approximate occurrence counts across the repo.

| Current value | Meaning | Where | New value |
| --- | --- | --- | --- |
| `pg5q3j3h95` | Location ID | sample files run by `test-code-snippets.yml`; ~11 direct + ~25 inside the machine address | new location ID |
| `deb8782c-7b48-4d35-812d-2caa94b61f77` | Machine **part** ID | workflow config fetch (`id=`); `MACHINE_PART_ID`/`PART_ID` in samples (~12) | new part ID |
| `824b6570-7b1d-4622-a19d-37c472dba467` | `VIAM_PART_ID` / `PART_ID` | workflow env line 49; samples (~8) | new part ID (confirm which part) |
| `1030f25a-f4f2-4872-9762-e33fa1e0444d` | Machine part **key ID** | workflow config-fetch header | new machine key ID |
| `5ec7266e-f762-4ea8-9c29-4dd592718b48` | Machine ID | samples (~3) | new machine ID |
| `b5e9f350-cbcf-4d2a-bbb1-a2e2fd6851e1` | `ORG_ID_2` (share target) | samples (~18) | new second org ID |
| `16b8a3e5-7944-4e1c-8ccd-935c1ba3be59` | resource ID (confirm: dataset/fragment) | samples (~6) | new ID |

> [!NOTE]
> There appear to be **two distinct part IDs** (`deb8782c...` and `824b6570...`).
> Confirm whether the machine has two parts, or whether one of these is stale,
> before re-pointing. The config fetch uses `deb8782c...`; the workflow env var
> `VIAM_PART_ID` uses `824b6570...`.

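Because the counts in the table are approximate, exact per-ID counts can be produced before re-pointing. The following is a minimal sketch, assuming it runs from the repository root and that the samples live under `static/include/examples` (the path used in the verification grep). `count_ids` is a hypothetical helper name, and the ID prefixes are the ones from the table.

```python
from pathlib import Path

# ID prefixes taken from the table above; each prefix is unique on its own.
OLD_IDS = [
    "pg5q3j3h95", "deb8782c", "824b6570", "1030f25a",
    "5ec7266e", "b5e9f350", "16b8a3e5",
]


def count_ids(roots, ids=OLD_IDS):
    """Count occurrences of each ID in all files under the given roots."""
    counts = {old_id: 0 for old_id in ids}
    for root in map(Path, roots):
        files = [root] if root.is_file() else sorted(root.rglob("*"))
        for path in files:
            if not path.is_file():
                continue
            # Undecodable bytes are dropped so binary files do not abort the scan.
            text = path.read_text(encoding="utf-8", errors="ignore")
            for old_id in ids:
                counts[old_id] += text.count(old_id)
    return counts


if __name__ == "__main__":
    candidates = [
        "static/include/examples",
        ".github/workflows/test-code-snippets.yml",
    ]
    existing = [c for c in candidates if Path(c).exists()]
    for old_id, n in count_ids(existing).items():
        print(f"{old_id}: {n}")
```

Comparing the counts for `deb8782c` and `824b6570` per file may also help settle which of the two part IDs is stale.
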
## Tooling

A future session can use the available skills:

- `local-viam-server`: create a machine and run a local `viam-server` (handles
the machine part secret the CLI does not expose).
- `viam-modules-fleet`: `viam` CLI for orgs, locations, machines, API keys.
- `viam-machine-config`: push a machine config if the machine needs components.

The machine only needs to start and stay connected; the samples do not require
specific components on it unless a sample connects and calls a component API.
Audit the samples that use `MACHINE_ADDRESS` to confirm what the machine must
expose, and configure those components (see [Plan 02](02-fix-test-code-snippets.md)).

## Plan

1. **Create the primary test org.** Record its ID for `TEST_ORG_ID`. Give it a
recognizable name; note that the orgs sample resets the name to
`docs-scheduled-tests` on each run, so expect that name to reappear.
2. **Create a location** in that org. Record its ID (replaces `pg5q3j3h95`).
3. **Create a machine** in that location with a main part. Record the machine
ID, the part ID, and confirm the machine address
(`auto-machine-main.<location-id>.viam.cloud` or the actual address shown in
the app).
4. **Create a machine part API key.** Record the key ID and key value for the
workflow config fetch (`key_id` header and `TEST_MACHINE_KEY`).
5. **Create an org owner API key.** Record the key and key ID for `VIAM_API_KEY`
/ `VIAM_API_KEY_ID`.
6. **Add a roleless member** to the org (invite a service/test user and remove
any default role, or confirm a member with no authorizations). This unblocks
the roles lifecycle sample.
7. **Choose a `TEST_EMAIL`** for a user who is not a member of the org (the
sample invites and then revokes this address).
8. **Create the second org** (`ORG_ID_2`) for the share/unshare sample. A bare
org with no resources is enough.
9. **Create the data-regions org and key** for `VIAM_API_KEY_DATA_REGIONS` /
`VIAM_API_KEY_ID_DATA_REGIONS`. Confirm what the data-regions sample requires
(it changes an org region and tolerates the "region cannot be changed" error).
10. **Configure the machine** so every sample that connects to `MACHINE_ADDRESS`
finds the components it calls (audit list in Plan 02). Start a
`viam-server` once to confirm the config is valid and the machine reports
online.
11. **Update the workflow** `test-code-snippets.yml`: replace the hardcoded
`key_id` and config-fetch `id` (lines around 29) and `VIAM_PART_ID`
(line ~49) with the new machine values. Prefer moving these into secrets or
workflow `env` referencing secrets rather than re-hardcoding.
12. **Update the sample files** with the new location/machine/org IDs (see the
table above). Use a scripted find-and-replace, then grep to confirm no old
IDs remain.
13. **Update the GitHub repository secrets** (Settings -> Secrets and variables
-> Actions): `TEST_ORG_ID`, `VIAM_API_KEY`, `VIAM_API_KEY_ID`,
`TEST_MACHINE_KEY`, `TEST_EMAIL`, `VIAM_API_KEY_DATA_REGIONS`,
`VIAM_API_KEY_ID_DATA_REGIONS`.

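The scripted find-and-replace in step 12 could look roughly like the sketch below. It is an illustration only: `REPLACEMENTS` and `replace_ids` are hypothetical names, the right-hand values are placeholders to be filled in with the IDs recorded in the earlier steps, and the script assumes it runs from the repository root.

```python
from pathlib import Path

# Hypothetical mapping of old ID -> new ID. The "<...>" values are
# placeholders, not real IDs; extend the mapping with the rest of the table.
REPLACEMENTS = {
    "pg5q3j3h95": "<new-location-id>",
    "5ec7266e-f762-4ea8-9c29-4dd592718b48": "<new-machine-id>",
}


def replace_ids(root, replacements):
    """Rewrite text files under root in place; return the paths that changed."""
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            original = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip binary files
        updated = original
        for old, new in replacements.items():
            updated = updated.replace(old, new)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed


if __name__ == "__main__":
    # Refuse to run while the mapping still contains placeholders.
    if any(new.startswith("<") for new in REPLACEMENTS.values()):
        print("Fill in REPLACEMENTS with the real new IDs first.")
    else:
        for path in replace_ids("static/include/examples", REPLACEMENTS):
            print(f"updated {path}")
```

Replacing the full location ID also updates the machine address, since the address embeds that ID as a substring.
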
## Verification

- Run the workflow through `workflow_dispatch` on the branch and confirm the
"Start viam-server in background" step fetches the config and the server
reports running.
- Confirm `grep -rE 'pg5q3j3h95|deb8782c|824b6570|1030f25a|5ec7266e|b5e9f350|16b8a3e5' static/include/examples .github/workflows/test-code-snippets.yml`
returns nothing.
- Proceed to [Plan 02](02-fix-test-code-snippets.md) to make the samples pass.

## Rollback

Keep the old secret values recorded until the new org is confirmed working. If
the new org misbehaves, restore the previous secret values and revert the ID
changes in one commit. The old org can stay untouched until the new one is
green.

## Risks and open questions

- **Two part IDs**: confirm the machine's part topology before re-pointing.
- **Component requirements**: some samples may call specific component APIs on
the machine; the machine config must provide them or those samples fail at
connect/call time even with a healthy server.
- **Data-regions org**: regions can only be set once per org, so the
data-regions org may be a single-use resource; confirm the sample tolerates a
pre-set region (it currently logs and continues on that error).
- **`ORG_ID_2` permissions**: the owner key for the primary org must be allowed
to share a location into `ORG_ID_2`; confirm cross-org sharing is permitted.
65 changes: 65 additions & 0 deletions .github/workflows/plans/01-restore-notifications.md
@@ -0,0 +1,65 @@
# Plan 01: Restore failure notifications

**Status:** NOT STARTED
**Fixes:** the Jira notification steps in `test-code-snippets.yml`,
`check-methods.yml`, and `run-htmltest.yml`.
**Depended on by:** [Plan 05](05-fix-check-methods.md) (its only signal is Jira).

## Problem

Three scheduled jobs report failures only by opening a Jira ticket, and the
Jira steps are themselves failing (the `Login to Jira` / `Create Jira ticket`
steps show as failed in recent runs). The result: these jobs run unmonitored,
so even a real regression is silent. This is the highest-impact repair because
it is the prerequisite for trusting every other scheduled job.

## Root cause (to confirm)

The `atlassian/gajira-login@v3` step authenticates with `JIRA_BASE_URL`,
`JIRA_USER_EMAIL`, and `JIRA_API_TOKEN`. The most likely causes, in order:

1. An expired or rotated `JIRA_API_TOKEN`.
2. A changed `JIRA_BASE_URL` or account email.
3. The `DOCS` project no longer accepting the issue types used
(`Bug` for the htmltest and code-sample jobs, `Task` for check-methods), or
a required field now being enforced on create.

## Plan

1. **Reproduce** by triggering one scheduled workflow through
`workflow_dispatch` and reading the `Login to Jira` step log for the exact
error (auth failure, project not found, required field, and so on).
2. **Refresh credentials**: regenerate the Jira API token, confirm the account
email and base URL, and update the three secrets.
3. **Confirm the create payload**: verify the `DOCS` project key, the issue
types (`Bug` / `Task`), and any newly required fields. Update the
`gajira-create` step inputs if the project schema changed.
4. **Add a backup notification path** so a single broken integration cannot
silence the jobs again. Options, cheapest first:
   - A Slack incoming-webhook step on `failure()` posting the run URL.
   - A GitHub issue created on failure with a dedup label.
   - Email through an action on `failure()`.
5. **Make notification failures visible**: ensure the Jira steps and any backup
step run with `if: failure()` (the Jira steps do today) and that a failure of
a notification step itself surfaces in the run status rather than passing
silently.

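As an illustration of the Slack option in step 4, the sketch below builds the run URL from GitHub Actions default environment variables and posts it to an incoming webhook. It is a sketch under assumptions, not the workflow's actual step: `SLACK_WEBHOOK_URL` is an assumed secret name that does not exist in the repository today, and `build_failure_message` and `post_to_slack` are hypothetical helpers.

```python
import json
import os
import urllib.request


def build_failure_message(env):
    """Build a Slack payload from GitHub Actions default environment variables."""
    run_url = (
        f"{env['GITHUB_SERVER_URL']}/{env['GITHUB_REPOSITORY']}"
        f"/actions/runs/{env['GITHUB_RUN_ID']}"
    )
    workflow = env.get("GITHUB_WORKFLOW", "workflow")
    return {"text": f"{workflow} failed: {run_url}"}


def post_to_slack(webhook_url, payload):
    """POST the JSON payload to the webhook; urlopen raises on HTTP errors."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Only act inside GitHub Actions, where GITHUB_ACTIONS is set to "true".
    if os.environ.get("GITHUB_ACTIONS") == "true":
        # A missing secret raises KeyError, so a misconfigured alert is visible.
        webhook = os.environ["SLACK_WEBHOOK_URL"]
        post_to_slack(webhook, build_failure_message(os.environ))
```

Because a missing secret or a rejected request raises, the notification step fails loudly instead of passing silently, which is the behavior step 5 asks for.
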
## Verification

- Trigger each of the three scheduled jobs through `workflow_dispatch` while
forcing a failure (or run against a known-failing state) and confirm a ticket
is created and the backup alert fires.
- Confirm the created ticket lands in the expected `DOCS` project with the
correct issue type and a useful title and body (include the run URL).

## Rollback

The notification steps are isolated from the test logic, so reverting the
`gajira-*` step changes or removing the backup step does not affect what the
jobs test. Keep the previous step config in the PR description for quick revert.

## Risks and open questions

- Is the `DOCS` Jira project still the right destination, and who owns the
service account behind `JIRA_USER_EMAIL`?
- If Jira is being retired for this purpose, replace it with the backup path as
the primary, rather than fixing the login.
125 changes: 125 additions & 0 deletions .github/workflows/plans/02-fix-test-code-snippets.md
@@ -0,0 +1,125 @@
# Plan 02: Fix Test Code Samples

**Status:** NOT STARTED
**Fixes:** `test-code-snippets.yml` (the _Test Code Samples_ job).
**Depends on:** [Plan 00](00-provision-org-machine-secrets.md) (org, machine,
secrets) and benefits from [Plan 01](01-restore-notifications.md).

## Problem

The job runs every Python, Go, and TypeScript sample under
`static/include/examples/` against a live `viam-server` and a live Viam org. It
has failed on every run since 2025-11-10. The Python step exits non-zero on the
first failing file, which aborts the Go and TypeScript steps, so a single bad
sample hides the rest of the suite.

## Root causes

1. **Org data dependency**: `fleet-api/fleet-management-api-orgs.py` granted a
location-owner role to `member_list[-1]`, who is now an org owner and already
inherits that role, so the grant is rejected. PR #5106 changes the sample to
pick a member with no existing authorizations, but the org must actually
contain such a member (Plan 00, step 6).
2. **No step isolation**: the three language steps run sequentially in one job;
the Python step's non-zero exit skips Go and TypeScript.
3. **Stale hardcoded IDs**: the server-start step and many samples hardcode the
machine, location, and org IDs (Plan 00).
4. **Flaky live calls**: some samples fail intermittently on transient
`INTERNAL` backend errors (observed in the `data-pipelines` teardown
`delete_data_pipeline` calls), unrelated to docs changes.

## Plan

### A. Land the org-data prerequisites

1. Confirm [Plan 00](00-provision-org-machine-secrets.md) is complete: new org,
machine, secrets, and a **roleless member**.
2. Merge or rebase PR #5106 so the orgs sample selects a roleless member and
asserts clearly if none exists.

### B. Isolate the language steps

Today a Python failure blocks Go and TypeScript. Pick one:

- **Preferred**: split into three jobs (`test-python`, `test-go`, `test-ts`),
each setting up its own runtime and running its own samples. Share the
viam-server startup through a reusable setup (composite action or a setup
job + artifact), or start the server in each job. This gives independent
pass/fail per language and parallel execution.
- **Lighter**: keep one job but make each language loop record failures and
continue, then fail the job once at the end with a combined summary, so all
three languages always run.

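The lighter option can be sketched as a small runner that executes every sample, records failures, and exits non-zero only once at the end. This is an assumption about shape, not the workflow's current loop: `run_samples` is a hypothetical helper, and the Python invocation shown would need a different command prefix for the Go and TypeScript samples.

```python
import subprocess
import sys
from pathlib import Path


def run_samples(files, command_prefix):
    """Run every file, continue past failures, and return the failing paths."""
    failures = []
    for path in files:
        result = subprocess.run([*command_prefix, str(path)])
        status = "PASS" if result.returncode == 0 else "FAIL"
        print(f"{status} {path}")
        if result.returncode != 0:
            failures.append(path)
    return failures


if __name__ == "__main__":
    samples = sorted(Path("static/include/examples").rglob("*.py"))
    failed = run_samples(samples, [sys.executable])
    if failed:
        # Fail once, after every sample has had a chance to run.
        print(f"{len(failed)} of {len(samples)} samples failed")
        sys.exit(1)
```

Running each language through a runner like this, with the language steps themselves allowed to continue, gives the combined per-file summary that the verification section expects.
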
### C. Re-point hardcoded IDs

Apply the ID replacements from [Plan 00](00-provision-org-machine-secrets.md).
Prefer reading machine identifiers from secrets in the workflow rather than
hardcoding. Confirm with:

```bash
grep -rE 'pg5q3j3h95|deb8782c|824b6570|1030f25a|5ec7266e|b5e9f350|16b8a3e5' \
  static/include/examples .github/workflows/test-code-snippets.yml
```

### D. Make teardown resilient to flaky backends

For samples whose **teardown** (not the demonstrated behavior) makes a live call
that can return a transient `INTERNAL` error, wrap the teardown so a flaky
response does not fail the run. Confirmed cases:

- `data-pipelines/pipeline-create.py` (`delete_data_pipeline` teardown).
- `data-pipelines/pipeline-list.py` (`delete_data_pipeline` teardown).

Pattern (teardown only, never the asserted behavior):

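```python
from grpclib import GRPCError  # the Viam Python SDK is built on grpclib

try:
    await data_client.delete_data_pipeline(pipeline_id)
except GRPCError as e:
    # Teardown only: log and continue so a flaky backend does not fail the run.
    print(f"teardown delete failed (ignored): {e}")
```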

Do not blanket-wrap asserted calls; only teardown/cleanup that is incidental to
what the sample demonstrates.

### E. Audit machine-dependent samples

List the samples that connect to the machine and confirm the new machine
exposes the components they call, so they do not fail at connect or call time:

```bash
grep -rln 'MACHINE_ADDRESS\|auto-machine-main' static/include/examples
```

For each, note the component or service APIs it calls and ensure the Plan 00
machine config provides them.

### F. Quieten known anti-patterns (optional, low risk)

- Remove `pip install asyncio` (line ~61): `asyncio` is stdlib.
- Move Python off 3.9 (end of life since October 2025) and Node off the non-LTS 23.

## Verification

1. Trigger the workflow through `workflow_dispatch` on the branch.
2. Confirm all three language steps run (not just Python) and report a per-file
pass/fail summary.
3. Confirm a fully green run end to end.
4. Re-run once to confirm previously flaky samples are now stable.

## Rollback

Each change is independent: the step-split, the ID re-point, and the teardown
guards can be reverted individually. Keep the single-job version in git history
so it can be restored if the split causes runner or secret-scoping issues.

## Risks and open questions

- Splitting jobs multiplies viam-server startups (and secret usage); confirm the
machine tolerates concurrent connections, or gate the language jobs to run
sequentially with `needs`.
- The "stable" viam-server AppImage is re-pulled each run, so a server release
can change behavior with no repo change; consider pinning a known-good server
version for reproducibility.
- Some failures may be genuine sample bugs surfaced once the suite runs fully;
budget time to triage newly visible Go and TypeScript failures.