Control self-hosted runner lanes #414

@cbusillo

Objective

Let Launchplane eventually manage self-hosted GitHub Actions runner lanes instead of only observing them.

Finish Line

Launchplane safely manages the runner lane lifecycle from explicit policy.

Current Status

State: Runner-control now has the fail-closed policy boundary needed before any host adapter work.

Completed:

Validation:

  • Local focused tests passed: tests.test_runner_lane_control, tests.test_service_auth, and tests.test_service.LaunchplaneServiceTests.test_local_operator_token_cannot_read_product_profiles.
  • Local full unit suite passed: uv run python -m unittest.
  • Local changed-file Ruff check/format and mypy passed; JetBrains changed-files inspection returned stale cached results.
  • PR #722 (Fail closed runner and local operator controls) checks passed: CI, Security, CodeQL, frontend_validate, static_checks, test, container_scan, workflow_lint, secret_scan.
  • Post-merge main checks passed on cb85056: CI, Security, CodeQL, and Deploy Launchplane.

Next action:

  • Define the host-adapter boundary and disposable/non-production runner lane needed for the first controlled mutation exercise.

Waiting for:

  • A safe disposable runner lane or explicit host adapter target for exercising create/drain/restart/remove behavior. Until then, keep this lane read-only.

Scope

  • Desired runner lane counts per repo/host/pool.
  • Runner provisioning/registration and service/container lifecycle.
  • Draining lanes before restart/removal.
  • Health checks and automatic restart for unhealthy lanes.
  • Label management for lane identity and capability labels.
  • Guardrails so Launchplane cannot delete or reconfigure unmanaged runners by accident.
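The desired-count and guardrail items above could be sketched as a pure planning function. This is a minimal illustration, not Launchplane's actual model: the `Lane` type, the `plan` function, and the `launchplane-managed` ownership label are all hypothetical names chosen for the example. The point it shows is that unmanaged lanes neither count toward the desired total nor ever appear as removal targets.

```python
from dataclasses import dataclass

# Hypothetical label marking a lane as Launchplane-owned (an assumption, not a real convention).
OWNED_LABEL = "launchplane-managed"


@dataclass(frozen=True)
class Lane:
    name: str
    repo: str
    host: str
    pool: str
    labels: frozenset[str]

    @property
    def owned(self) -> bool:
        return OWNED_LABEL in self.labels


def plan(
    desired: dict[tuple[str, str, str], int],
    observed: list[Lane],
) -> list[tuple[str, str]]:
    """Return (action, target) pairs that move owned lanes toward the desired counts.

    Keys of `desired` are (repo, host, pool). Lanes without the ownership
    label are ignored entirely, so they can never be planned for removal.
    """
    actions: list[tuple[str, str]] = []
    for key, want in sorted(desired.items()):
        repo, host, pool = key
        owned = sorted(
            (lane for lane in observed if (lane.repo, lane.host, lane.pool) == key and lane.owned),
            key=lambda lane: lane.name,
        )
        if len(owned) < want:
            # Too few owned lanes: plan one create per missing lane.
            actions.extend(("create", f"{repo}/{host}/{pool}") for _ in range(want - len(owned)))
        else:
            # Too many owned lanes: plan removal of the surplus, by name order.
            actions.extend(("remove", lane.name) for lane in owned[want:])
    return actions
```

Keeping the planner free of side effects means the same output can back a dry-run report and a real execution; a real removal would still need to drain the lane first.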

Acceptance Criteria

  • Launchplane has a policy model for desired runner lanes and allowed hosts.
  • Runner control requires explicit opt-in per host and repo.
  • Launchplane can create a new runner lane with canonical labels and verify GitHub sees it online.
  • Launchplane can drain a lane, wait for no active job, restart it, and verify it returns online.
  • Launchplane can remove only lanes it owns and leaves manual/unmanaged lanes untouched.
  • All mutating actions support dry-run and audit records.
  • Failure leaves the system in a safe state with a clear recovery comment/log.
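One way the opt-in, dry-run, and audit criteria above might fit together is a single wrapper that every mutating action passes through. The sketch below is an assumption-laden illustration: `Policy`, `AuditLog`, `run_action`, and the outcome strings are hypothetical and not taken from the Launchplane codebase. It fails closed (an empty policy allows nothing), defaults to dry-run, and writes an audit record for every outcome, including denials and failures.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Policy:
    # Explicit (host, repo) opt-ins. The empty default allows nothing (fail closed).
    opted_in: frozenset[tuple[str, str]] = frozenset()


@dataclass
class AuditLog:
    records: list[dict] = field(default_factory=list)


def run_action(
    policy: Policy,
    audit: AuditLog,
    *,
    action: str,
    host: str,
    repo: str,
    lane: str,
    apply: Callable[[], None],
    dry_run: bool = True,
) -> str:
    """Authorize, optionally execute, and always audit one mutating action."""
    record = {"action": action, "host": host, "repo": repo, "lane": lane, "dry_run": dry_run}
    if (host, repo) not in policy.opted_in:
        outcome = "denied"  # no explicit opt-in for this host and repo
    elif dry_run:
        outcome = "planned"  # authorized, but nothing is executed
    else:
        try:
            apply()
            outcome = "applied"
        except Exception as exc:  # record the failure instead of losing it
            outcome = "failed"
            record["error"] = str(exc)
    record["outcome"] = outcome
    audit.records.append(record)
    return outcome
```

Passing the mutation in as a callable keeps the host adapter out of the policy layer, which matches the idea of treating host-level execution as a separate privileged boundary.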

Relationships

Related to completed #413 for runner inventory evidence and #710 for queue-wait observation. Related to #410 because merge-train scheduling may use runner capacity signals, but runner control should not block the Level 1 merge train.

Validation

  • Unit tests for desired-state planning and ownership guardrails.
  • A live smoke on a disposable runner lane before touching production repo lanes.
  • Manual rollback instructions documented for service/container cleanup.
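As a rough idea of what the ownership-guardrail unit tests could look like, here is a small self-contained `unittest` case. The `removable` helper and the `launchplane-managed` label are hypothetical stand-ins for whatever the real ownership check turns out to be; the existing tests in `tests.test_runner_lane_control` may be shaped quite differently.

```python
import unittest

# Hypothetical ownership marker; the real label scheme is not specified here.
OWNED_LABEL = "launchplane-managed"


def removable(lanes: dict[str, set[str]]) -> list[str]:
    """Return the sorted names of lanes that carry the ownership label."""
    return sorted(name for name, labels in lanes.items() if OWNED_LABEL in labels)


class OwnershipGuardrailTests(unittest.TestCase):
    def test_unmanaged_lane_is_never_removable(self) -> None:
        lanes = {"manual-1": {"self-hosted", "linux"}}
        self.assertEqual(removable(lanes), [])

    def test_only_owned_lanes_are_removable(self) -> None:
        lanes = {
            "lp-2": {"self-hosted", OWNED_LABEL},
            "manual-1": {"self-hosted"},
            "lp-1": {OWNED_LABEL},
        }
        self.assertEqual(removable(lanes), ["lp-1", "lp-2"])

    def test_lane_without_labels_fails_closed(self) -> None:
        # Missing label data is treated as "not owned", never as permission.
        self.assertEqual(removable({"unknown": set()}), [])
```

Saved as a module under the test directory, this would run with the same `uv run python -m unittest` command used for the full suite.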

Decisions

  • Awareness first, control later.
  • Do not let Launchplane mutate runner services until it can inventory ownership and health reliably.
  • Treat host-level execution as a privileged ops boundary with explicit opt-in.

Open Questions

  • Should Launchplane control systemd runner services, Dockerized runners, or both?
  • Should runner scaling be event-driven, scheduled, or manually requested?
  • How should Launchplane store host credentials or delegate host actions safely?

Metadata

Labels

  • plan (Durable planning issue)
  • plan:waiting (Plan is waiting on non-issue evidence or decision)
