Implement supervised runner lane maintainer #1235

@cbusillo

Objective

Implement the durable Launchplane runner lane maintainer required before product repositories, including cbusillo/odoo-tenant-cm-website, can depend on Launchplane-managed self-hosted runner lanes.

This is the focused follow-up to #414 after PR #1234 disabled the transient registration shortcut.

Current hard rule

No product repo agent should target a Launchplane-managed self-hosted runner lane until this issue is complete.

The disabled proof path registered cm-website-chris-testing and briefly made it appear online, but it started run.sh from inside a GitHub Actions job. That runner went offline after job cleanup. PR #1234 removed that apply behavior. The product repo runner inventory is currently expected to be zero runners.

Track split

Keep these tracks separate so future agents do not treat documentation or route-contract progress as runner readiness.

Track A: cm-website route-contract work

Owner: cbusillo/odoo-tenant-cm-website.

This track is not blocked on runner infrastructure. It may continue on GitHub-hosted ubuntu-latest workflows that call Launchplane over HTTPS/OIDC. It verifies the current Odoo preview/publish/apply route contract and product/runtime records.

Track A does not create, adopt, configure, or require a Launchplane-managed self-hosted runner.

Track B: self-hosted runner adoption

Owner: Launchplane.

This issue implements Track B. cm-website self-hosted runner adoption stays blocked until Launchplane completes the supervised maintainer proof in the live apply slice below.

Runner adoption in Launchplane reusable workflows is also Launchplane-side work, not a cm-website repo edit.

Definition of done

cbusillo/odoo-tenant-cm-website has a Launchplane-managed runner lane only after all of this evidence exists:

  • Launchplane desired-state record exists for repository=cbusillo/odoo-tenant-cm-website, host_name=chris-testing, lane_name=cm-website-chris-testing.
  • The runner service is owned by systemd or an equivalent persistent supervisor, not by a GitHub Actions job process.
  • The runner process runs as the expected constrained service user.
  • GitHub inventory shows the lane online.
  • Labels include self-hosted, launchplane, launchplane-managed, chris-testing, and cm-website.
  • Baseline readiness passes after the service is running.
  • A completed Launchplane audit record includes service state, GitHub inventory, labels, baseline evidence, and redacted provider evidence.
  • A remove/restart path exists for the same managed lane and refuses unmanaged runners.

Global slice rules

Do this as ordered PR slices. Do not skip ahead to live apply.

PRs 1-4 must be shippable without live host mutation. Their tests should assert no GitHub registration/remove token fetch, no config.sh, no privileged helper verb, no process spawn, and no host mutation unless explicitly mocked.

The first slice allowed to return a real completed maintainer audit must verify all completion gates in the definition of done. Anything less is a false green lane.

No code path may start a runner with nohup ./run.sh, plain ./run.sh &, PID-file backgrounding, or any GitHub Actions job-owned process lifecycle.

Execution path

PR 1: Maintainer desired-state contract and planner only

Goal: define the durable desired-state model and fail-closed plan. No host mutation, no GitHub registration token fetch.

Create or modify:

  • control_plane/contracts/runner_lane_maintainer.py
  • tests/test_runner_lane_maintainer.py
  • docs/runner-lane-baseline.md
  • docs/records.md

Contracts:

  • RunnerLaneMaintainerDesiredState
    • repository
    • host_name
    • lane_name
    • registration_root
    • service_user
    • systemd_unit_name
    • labels
    • runner_version_policy
    • managed=true
  • RunnerLaneMaintainerObservedState
    • GitHub inventory lane, if any
    • local runner directory state
    • service/unit state
    • baseline readiness state
  • RunnerLaneMaintainerPlan
    • status: ready | blocked
    • action: create | adopt | reconcile | restart | remove
    • observed shape: absent | github_only | local_service_only | supervised_active | supervised_inactive | mismatched_labels | unknown_conflict
    • blockers
    • next_steps
  • RunnerLaneMaintainerAuditRecord
    • status: planned | completed | failed
    • desired state
    • observed pre/post state
    • redacted provider evidence
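
A minimal sketch of how the desired-state and plan contracts above might look as frozen dataclasses. The field names come from the list above; the field types, the enum class names, and the __post_init__ consistency check are assumptions for illustration, not the agreed contract. The observed-state and audit-record types are omitted for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class PlanStatus(str, Enum):
    READY = "ready"
    BLOCKED = "blocked"


class PlanAction(str, Enum):
    CREATE = "create"
    ADOPT = "adopt"
    RECONCILE = "reconcile"
    RESTART = "restart"
    REMOVE = "remove"


class ObservedShape(str, Enum):
    ABSENT = "absent"
    GITHUB_ONLY = "github_only"
    LOCAL_SERVICE_ONLY = "local_service_only"
    SUPERVISED_ACTIVE = "supervised_active"
    SUPERVISED_INACTIVE = "supervised_inactive"
    MISMATCHED_LABELS = "mismatched_labels"
    UNKNOWN_CONFLICT = "unknown_conflict"


@dataclass(frozen=True)
class RunnerLaneMaintainerDesiredState:
    repository: str
    host_name: str
    lane_name: str
    registration_root: str
    service_user: str
    systemd_unit_name: str
    labels: tuple[str, ...]
    runner_version_policy: str  # assumed to be a string; the real type is undecided
    managed: bool = True


@dataclass(frozen=True)
class RunnerLaneMaintainerPlan:
    status: PlanStatus
    observed_shape: ObservedShape
    action: PlanAction | None = None
    blockers: tuple[str, ...] = ()
    next_steps: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Assumed invariant: a ready plan names an action and has no blockers,
        # and a blocked plan always says why it is blocked.
        if self.status is PlanStatus.READY and (self.blockers or self.action is None):
            raise ValueError("ready plan needs an action and no blockers")
        if self.status is PlanStatus.BLOCKED and not self.blockers:
            raise ValueError("blocked plan needs at least one blocker")
```

Making the plan reject inconsistent combinations at construction time is one way to keep later slices from emitting a ready plan that still carries blockers.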

Fail-closed blockers:

  • repository not allowlisted
  • host not approved
  • registration root outside allowlist
  • missing launchplane-managed label
  • unmanaged existing runner found
  • duplicate lane name
  • service user mismatch
  • unit name mismatch
  • path traversal or unsafe lane name
  • local/GitHub stale state cannot be safely adopted or removed
  • baseline missing or not ready for completed state
  • mutate requested without idempotency/confirmation
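
A few of the blockers above can be sketched as a pure function that returns every blocker it finds instead of stopping at the first. The allowlist values, the /srv/launchplane/runners root, and the lane-name pattern are hypothetical placeholders; only the blocker strings come from this issue.

```python
import re
from pathlib import PurePosixPath

# Hypothetical allowlists; the real values belong in Launchplane configuration.
ALLOWED_REPOSITORIES = frozenset({"cbusillo/odoo-tenant-cm-website"})
APPROVED_HOSTS = frozenset({"chris-testing"})
ALLOWED_REGISTRATION_ROOTS = (PurePosixPath("/srv/launchplane/runners"),)
MANAGED_LABEL = "launchplane-managed"

# Assumed safe-name rule: lowercase letters, digits, inner hyphens, at most 64 chars.
SAFE_LANE_NAME = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,62}[a-z0-9])?")


def collect_blockers(*, repository, host_name, lane_name, registration_root, labels):
    """Return a list of fail-closed blockers; an empty list means none were found."""
    blockers = []
    if repository not in ALLOWED_REPOSITORIES:
        blockers.append("repository not allowlisted")
    if host_name not in APPROVED_HOSTS:
        blockers.append("host not approved")
    if SAFE_LANE_NAME.fullmatch(lane_name) is None:
        blockers.append("path traversal or unsafe lane name")
    root = PurePosixPath(registration_root)
    if not root.is_absolute() or ".." in root.parts or root not in ALLOWED_REGISTRATION_ROOTS:
        blockers.append("registration root outside allowlist")
    if MANAGED_LABEL not in labels:
        blockers.append("missing launchplane-managed label")
    return blockers
```

Checks that need observed state (unmanaged existing runner, duplicate lane name, service user or unit mismatch, stale state, baseline) are left out of this sketch because they depend on adapters that later slices define.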

Tests:

uv run python -m unittest tests.test_runner_lane_maintainer
uv run --extra dev ruff check control_plane/contracts/runner_lane_maintainer.py tests/test_runner_lane_maintainer.py
uv run --extra dev mypy control_plane/contracts/runner_lane_maintainer.py tests/test_runner_lane_maintainer.py

Acceptance:

  • Dry-run never requests a GitHub token.
  • Planner selects create for zero GitHub runners and no local managed service.
  • Planner distinguishes absent, GitHub-only, local-service-only, active supervised, inactive supervised, mismatched-label, and unknown-conflict states.
  • Planner returns blocked for unmanaged matching lane, unsafe path, duplicate lane, or unknown conflict.
  • Planner cannot produce completed state.
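
One possible shape for the planner's classification step, assuming observed state is reduced to three facts: the GitHub labels for the lane (or None when GitHub has no such runner), whether the unit exists, and whether it is active. The ObservedLane type and both function names are hypothetical. Only the create and blocked outcomes stated in this issue are mapped; every other shape fails closed in this sketch rather than guessing an action.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

MANAGED_LABEL = "launchplane-managed"


@dataclass(frozen=True)
class ObservedLane:
    github_labels: Optional[frozenset]  # None when GitHub inventory has no runner for the lane
    unit_present: bool
    unit_active: bool


def classify(observed: ObservedLane, required_labels: frozenset) -> str:
    """Map observed facts onto one of the observed shapes from the contract."""
    if observed.unit_active and not observed.unit_present:
        return "unknown_conflict"  # contradictory local evidence
    labels = observed.github_labels
    if labels is None:
        return "local_service_only" if observed.unit_present else "absent"
    if not observed.unit_present:
        return "github_only"
    if not required_labels <= labels:
        return "mismatched_labels"
    return "supervised_active" if observed.unit_active else "supervised_inactive"


def plan(observed: ObservedLane, required_labels: frozenset) -> tuple[str, str | None, tuple[str, ...]]:
    """Return (status, action, blockers). Never returns a completed state."""
    shape = classify(observed, required_labels)
    if shape == "absent":
        return ("ready", "create", ())
    labels = observed.github_labels
    if labels is not None and MANAGED_LABEL not in labels:
        return ("blocked", None, ("unmanaged existing runner found",))
    # Assumption: shapes without a documented action stay blocked in this sketch.
    return ("blocked", None, (f"no action mapped for observed shape {shape}",))
```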

PR 2: Storage and service audit evidence

Goal: make maintainer audits durable before any live host mutation exists.

Create or modify:

  • storage migration for launchplane_runner_lane_maintainer_audits
  • control_plane/storage/filesystem.py
  • control_plane/storage/postgres.py
  • control_plane/service.py
  • tests/test_filesystem_store.py
  • tests/test_postgres_store.py
  • tests/test_service.py
  • docs/service-boundary.md
  • docs/records.md

Route:

  • POST /v1/evidence/runner-lane-maintainer/audits
  • authz action: runner_lane_maintainer_audit.write
  • idempotency key required

Tests:

uv run python -m unittest tests.test_filesystem_store tests.test_postgres_store tests.test_service

Acceptance:

  • Planned, failed, and completed audit records persist.
  • Filesystem keys are collision-safe and cannot path traverse.
  • Service rejects unauthorized writes.
  • Token strings are not persisted.
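
A hedged sketch of two of the acceptance points above: a filesystem key that cannot collide or traverse, and redaction before persistence. The key layout, the NUL-separated hash input, and the sensitive-key pattern are assumptions; the real stores may key audits differently.

```python
import hashlib
import re

# Assumed rule: any mapping key that looks credential-like has its value replaced.
SENSITIVE_KEY = re.compile(r"token|secret|password", re.IGNORECASE)


def audit_storage_key(repository: str, lane_name: str, idempotency_key: str) -> str:
    """Derive a relative key from a hash so caller input never becomes a path segment."""
    # NUL cannot appear in these fields, so joining on it keeps distinct
    # (repository, lane, key) triples from producing the same hash input.
    material = "\x00".join((repository, lane_name, idempotency_key))
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()
    return f"runner-lane-maintainer-audits/{digest}.json"


def redact(value):
    """Return a copy of a JSON-like value with credential-like fields masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if SENSITIVE_KEY.search(str(key)) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

Key-based redaction only catches fields that are named like credentials. A token pasted into a free-text field would need a separate value-level check, which this sketch does not attempt.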

PR 3: Host service model and privileged helper contract

Goal: specify and test the systemd boundary before any systemd apply path exists. No live host mutation.

Create or modify:

  • control_plane/workflows/runner_lane_maintainer_executor.py
  • helper contract module, for example control_plane/contracts/runner_lane_host_service.py
  • tests for service renderer/helper validation
  • docs for host helper install policy

Required model:

  • Use a persistent systemd system unit, for example launchplane-runner@cm-website-chris-testing.service.
  • Unit runs as the constrained service user.
  • Unit working directory is exactly <registration_root>/<lane_name>.
  • Unit ExecStart points to <runner-dir>/run.sh.
  • Unit has restart policy.
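
The required model above could be rendered by a small pure function like the following. This is a sketch only: Restart=always, RestartSec=5, and the network-online ordering are assumed choices, and callers are assumed to have validated the lane name, root, and user against the allowlists before rendering.

```python
def render_unit(*, lane_name: str, registration_root: str, service_user: str) -> str:
    """Render systemd unit text for one managed lane. No secrets are accepted or embedded."""
    for value in (lane_name, registration_root, service_user):
        # Whitespace or newlines in a field could inject extra unit directives.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError("unit fields must be non-empty and contain no whitespace")
    runner_dir = f"{registration_root.rstrip('/')}/{lane_name}"
    lines = [
        "[Unit]",
        f"Description=Launchplane runner lane {lane_name}",
        "Wants=network-online.target",
        "After=network-online.target",
        "",
        "[Service]",
        f"User={service_user}",
        f"WorkingDirectory={runner_dir}",
        f"ExecStart={runner_dir}/run.sh",
        "Restart=always",  # assumed restart policy
        "RestartSec=5",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
        "",
    ]
    return "\n".join(lines)
```

Because the renderer takes no token parameter at all, a test can assert on its signature as well as its output to show that registration or remove tokens cannot reach the unit file.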

Privileged boundary:

  • Prefer a tiny root-owned helper with explicit verbs:
    • install-or-update-unit
    • daemon-reload
    • enable-now
    • restart
    • stop
    • disable
    • remove-unit
  • If sudo is used, only allow the helper and fixed validated verbs.
  • Do not grant arbitrary systemctl, arbitrary file write, or shell access.
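
A minimal sketch of the helper's request validation, assuming the helper derives the unit name from the lane name instead of accepting one from the caller. The function name, the lane-name pattern, and the returned dict shape are hypothetical; the verbs and the launchplane-runner@ unit naming come from this issue.

```python
import re

ALLOWED_VERBS = frozenset(
    {
        "install-or-update-unit",
        "daemon-reload",
        "enable-now",
        "restart",
        "stop",
        "disable",
        "remove-unit",
    }
)

# Assumed safe-name rule, matching what the planner would enforce.
SAFE_LANE_NAME = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,62}[a-z0-9])?")


def build_helper_request(verb: str, lane_name: str) -> dict:
    """Validate a helper invocation and return a structured, secret-free request."""
    if verb not in ALLOWED_VERBS:
        raise ValueError(f"verb not allowed: {verb!r}")
    if SAFE_LANE_NAME.fullmatch(lane_name) is None:
        raise ValueError("unsafe lane name")
    # The unit name is computed here, so callers cannot target arbitrary units.
    return {"verb": verb, "lane_name": lane_name, "unit": f"launchplane-runner@{lane_name}.service"}
```

Computing the unit name inside the helper is what keeps a caller from pointing restart or remove-unit at an unrelated service such as sshd.service.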

Tests:

uv run python -m unittest tests.test_runner_lane_maintainer

Acceptance:

  • Helper rejects unsafe lane names, path traversal, mismatched user, mismatched host, roots outside allowlist, and arbitrary unit names.
  • Helper exposes only explicit lane-scoped verbs.
  • Helper dry-run output is structured and redacted.
  • Unit renderer never embeds GitHub registration/remove tokens.
  • No code path contains nohup ./run.sh, ./run.sh &, or PID-file backgrounding.
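
The last acceptance point could be backed by a source scan along these lines. The patterns are a rough, assumed approximation of the forbidden start styles (nohup, trailing &, and use of the shell's $! background PID) and would need tuning against the real tree.

```python
import re

# Assumed approximations of the forbidden job-owned start styles.
FORBIDDEN_START_PATTERNS = (
    re.compile(r"\bnohup\b[^\n]*run\.sh"),  # nohup ./run.sh ...
    re.compile(r"run\.sh\s*&(?!&)"),        # ./run.sh &   (but not ./run.sh && ...)
    re.compile(r"\$!"),                     # $! usually means a backgrounded PID is captured
)


def find_forbidden_starts(source: str) -> list:
    """Return the lines of source text that match any forbidden start pattern."""
    return [
        line
        for line in source.splitlines()
        if any(pattern.search(line) for pattern in FORBIDDEN_START_PATTERNS)
    ]
```

A unit test could read each file under control_plane/ and assert that find_forbidden_starts returns an empty list for it.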

PR 4: Executor dry-run and mocked apply

Goal: implement maintainer executor behavior with injected/mocked adapters only. No live workflow dispatch yet.

Create or modify:

  • control_plane/workflows/runner_lane_maintainer_executor.py
  • control_plane/cli_runner_lanes.py
  • tests/test_runner_lane_maintainer.py
  • docs/runner-lane-baseline.md

Executor sequence:

  1. Load desired state and pre-observed state.
  2. Plan action.
  3. If dry-run, write planned audit and stop.
  4. If apply, require idempotency/confirmation.
  5. Fetch GitHub registration token only when plan needs create/adopt/reconcile.
  6. Run config.sh only inside approved root.
  7. Use helper to install/update/start systemd service.
  8. Read service status.
  9. Re-read GitHub inventory.
  10. Run/read baseline readiness.
  11. Write completed audit only if all verification passes; otherwise failed audit.
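
The sequence above can be sketched with injected callables so tests can substitute mocks for every side effect. The MaintainerAdapters type, its field names, and the string statuses are hypothetical. Steps 8 to 10 are folded into a single verify adapter here, and treating an apply against a blocked plan as a failed audit is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Plan actions assumed to need a registration token (step 5).
TOKEN_ACTIONS = frozenset({"create", "adopt", "reconcile"})


@dataclass
class MaintainerAdapters:
    plan: Callable[[], tuple]                   # steps 1-2: returns (status, action)
    fetch_registration_token: Callable[[], str]
    configure_runner: Callable[[str], None]     # step 6: config.sh inside approved root
    start_service: Callable[[], None]           # step 7: privileged helper
    verify: Callable[[], dict]                  # steps 8-10: gate name -> passed
    write_audit: Callable[[str, dict], None]


def run_maintainer(adapters, *, apply, confirmation=None, idempotency_key=None) -> str:
    status, action = adapters.plan()
    if not apply:
        # Step 3: dry-run writes a planned audit and stops before any token fetch.
        adapters.write_audit("planned", {"plan_status": status, "action": action})
        return "planned"
    if status != "ready":
        adapters.write_audit("failed", {"reason": "plan blocked", "action": action})
        return "failed"
    if not confirmation or not idempotency_key:
        # Step 4: apply without confirmation and idempotency key fails closed.
        adapters.write_audit("failed", {"reason": "mutate requested without idempotency/confirmation"})
        return "failed"
    if action in TOKEN_ACTIONS:
        token = adapters.fetch_registration_token()
        adapters.configure_runner(token)  # the token is never placed in an audit payload
    adapters.start_service()
    gates = adapters.verify()
    # Step 11: completed only when every gate is present and true.
    result = "completed" if gates and all(gates.values()) else "failed"
    adapters.write_audit(result, {"action": action, "gates": gates})
    return result
```

Keeping the token in a local variable that is passed only to configure_runner makes the "token never appears in JSON/log payloads" check straightforward to assert in tests.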

Tests:

uv run python -m unittest tests.test_runner_lane_maintainer tests.test_runner_lane_registration

Acceptance:

  • Apply success requires mocked service enabled/active + process user + GitHub online + labels + baseline ready.
  • Service active but GitHub offline fails.
  • GitHub online but service inactive fails.
  • Baseline missing/not ready fails.
  • Token value never appears in JSON/log payloads.
  • Existing runner-lane-registration-executor --mutate keeps failing or delegates only to this maintainer path; it must not revive the shortcut.
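
The completion gates in the acceptance list above could be evaluated by a single function that names every failing gate, which makes the failed audit easier to read. The PostState type and its field names are hypothetical; the required labels are the ones listed in this issue.

```python
from dataclasses import dataclass

REQUIRED_LABELS = frozenset(
    {"self-hosted", "launchplane", "launchplane-managed", "chris-testing", "cm-website"}
)


@dataclass(frozen=True)
class PostState:
    service_enabled: bool
    service_active: bool
    process_user: str
    github_online: bool
    github_labels: frozenset
    baseline_ready: bool


def failed_gates(post: PostState, *, expected_user: str) -> list:
    """Return the names of completion gates that did not pass, in a fixed order."""
    checks = {
        "service enabled": post.service_enabled,
        "service active": post.service_active,
        "process user": post.process_user == expected_user,
        "github online": post.github_online,
        "labels": REQUIRED_LABELS <= post.github_labels,
        "baseline ready": post.baseline_ready,
    }
    return [name for name, passed in checks.items() if not passed]


def audit_status(post: PostState, *, expected_user: str) -> str:
    """completed only when every gate passes; anything less is failed."""
    return "failed" if failed_gates(post, expected_user=expected_user) else "completed"
```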

PR 5: Manual workflow for maintainer dry-run only

Goal: expose a manual ops workflow that can collect dry-run evidence for cm-website.

Create or modify:

  • .github/workflows/runner-lane-maintainer.yml
  • docs/operations.md
  • docs/runner-lane-baseline.md

Workflow properties:

  • manual workflow_dispatch only
  • runs on Launchplane ops lane, not product repo lane
  • dry-run default
  • apply inputs may exist, but live apply must fail closed until PR 6
  • apply requires confirmation phrase and idempotency key
  • uploads maintainer result artifact
  • dry-run accepts only expected planned/blocked result shapes
  • completed result is accepted only after the executor implements all completion gates

Dry-run proof command should target:

  • repository: cbusillo/odoo-tenant-cm-website
  • host: chris-testing
  • lane: cm-website-chris-testing
  • labels: self-hosted, launchplane, launchplane-managed, chris-testing, cm-website

Acceptance:

  • Dry-run produces planned audit and artifact.
  • No token fetch happens during dry-run.
  • cm-website GitHub runner inventory remains zero runners after dry-run.

PR 6: Live cm-website apply proof

Goal: create the first durable product runner lane only after PRs 1-5 are merged, deployed, and reviewed.

Steps:

  1. Confirm cbusillo/odoo-tenant-cm-website runner inventory is zero.
  2. Dispatch maintainer dry-run and archive artifact.
  3. Dispatch maintainer apply with explicit confirmation and idempotency key.
  4. Verify systemd unit enabled and active on chris-testing.
  5. Verify process user is expected service user.
  6. Verify GitHub inventory shows cm-website-chris-testing online.
  7. Verify labels include all required labels.
  8. Run baseline readiness after service is active, including Docker credential isolation and Buildx/toolchain evidence when the lane will receive build work.
  9. Write completed audit only after every completion gate passes.
  10. Run a tiny product-repo no-op workflow on the lane, if product routing is required.

Acceptance:

  • Completed audit evidence exists.
  • Remove/restart dry-runs are available for the managed lane.
  • No product workflow has been changed to require the lane before audit evidence exists.
  • cm-website issue is updated with the runner lane name and evidence links.

cm-website gate

Until PR 6 completes, the cm-website agent may continue only Track A work that uses GitHub-hosted runners and Launchplane HTTPS APIs. It must not edit product workflows to require self-hosted or cm-website-chris-testing.

The cm-website agent may resume Track B self-hosted-runner-dependent work only when this issue has a completed audit artifact proving:

  • lane online
  • systemd service active
  • labels correct
  • baseline ready
  • remove/restart control path present

Validation commands

For each PR, run the focused tests above plus the repo gate appropriate to the changed files. Before merging any slice that touches service/storage, also run:

uv run python -m unittest tests.test_runner_lane_maintainer tests.test_runner_lane_registration tests.test_service tests.test_filesystem_store tests.test_postgres_store
uv run --extra dev ruff check <changed-python-files>
uv run --extra dev mypy <changed-python-files>
git diff --check

Before the live apply PR/dispatch, require main branch CI, Security, CodeQL, and Deploy Launchplane to pass.
