api: HTTP queue API, runner registration, and submission auth for distributed worker mode #31

@AlbinoGeek

Parent: Rethunk-AI/bakeoff#37 — Distributed worker mode (pull client + queue API)
Companion: Rethunk-AI/bakeoff — queue schema + pull client in `bench/`

Context

bakeoff#37 established distributed worker mode: runners poll a central queue, claim jobs, execute locally, and submit results. The queue schema and harness-side pull client live in the bakeoff repo. This ticket covers the bakeoff-results side: the HTTP surface runners hit to claim work and submit results, plus the auth layer that controls who can submit.

Scope

1. Runner registration endpoint

```
POST /api/runners/register
```

  • Accepts runner identity (Ed25519 public key, hostname, declared capabilities)
  • Returns runner token (short-lived JWT or HMAC-signed opaque token)
  • Inserts or upserts into `runners` table (bakeoff schema)
  • Registration gated by submission whitelist (see 4. Auth surface)
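
One way to picture the HMAC-signed opaque token option is the sketch below. It is illustrative only: the `SECRET` value, the 15-minute `TOKEN_TTL_SECONDS`, the payload fields (`sub`, `exp`), and the function names are all assumptions, not part of this ticket or the bakeoff schema.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical server-side secret; a real deployment would load this from config.
SECRET = b"change-me"
TOKEN_TTL_SECONDS = 900  # assumed 15-minute lifetime


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(runner_id: str, now: Optional[float] = None) -> str:
    """Return '<payload>.<mac>' where mac = HMAC-SHA256(SECRET, payload)."""
    issued = time.time() if now is None else now
    payload = json.dumps({"sub": runner_id, "exp": issued + TOKEN_TTL_SECONDS}).encode()
    mac = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(mac)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the runner id if the token is authentic and unexpired, else None."""
    try:
        payload_b64, mac_b64 = token.split(".")
        payload = _unb64(payload_b64)
        mac = _unb64(mac_b64)
    except ValueError:
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    # Constant-time comparison; the payload is only parsed after the MAC checks out.
    if not hmac.compare_digest(mac, expected):
        return None
    claims = json.loads(payload)
    current = time.time() if now is None else now
    return claims["sub"] if current < claims["exp"] else None
```

A JWT would serve the same purpose; the opaque form is shown here only because it needs nothing beyond the standard library.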

2. Queue claim endpoint

```
POST /api/queue/claim
```

  • Runner presents token + capability declaration (VRAM, quantization support)
  • Server claims next eligible `PENDING` `run_queue` row (`FOR UPDATE SKIP LOCKED`, priority order, capability filter)
  • Returns claimed job payload (run_id, model, tasks, config)
  • Heartbeat: runner must POST `/api/queue/:run_id/heartbeat` at a regular interval; if heartbeats stop for longer than a configurable TTL, the job returns to PENDING
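
The claim rule (eligible `PENDING` row, priority order, capability filter) can be sketched in memory as follows. The `Job` fields, the `CLAIMED` status name, and `claim_next` are hypothetical; the real column names live in the bakeoff queue schema, and the real implementation would be a single SQL statement rather than a Python loop.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Rough Postgres shape of the same rule (column names are assumptions):
#   SELECT * FROM run_queue
#   WHERE status = 'PENDING'
#     AND min_vram_gb <= %(vram)s
#     AND quantization = ANY(%(quants)s)
#   ORDER BY priority DESC, run_id
#   LIMIT 1
#   FOR UPDATE SKIP LOCKED;


@dataclass
class Job:
    run_id: int
    priority: int
    min_vram_gb: int
    quantization: str
    status: str = "PENDING"
    claimed_by: Optional[str] = None


def claim_next(queue: List[Job], runner_id: str, vram_gb: int,
               quantizations: Set[str]) -> Optional[Job]:
    """Claim the highest-priority PENDING job this runner is able to execute."""
    eligible = [
        job for job in queue
        if job.status == "PENDING"
        and job.min_vram_gb <= vram_gb
        and job.quantization in quantizations
    ]
    if not eligible:
        return None
    # Higher priority first; ties broken by lowest run_id (oldest first).
    job = min(eligible, key=lambda j: (-j.priority, j.run_id))
    job.status = "CLAIMED"
    job.claimed_by = runner_id
    return job
```

In the database version, `SKIP LOCKED` is what lets two runners claim concurrently without blocking each other or receiving the same row; the in-memory sketch has no equivalent because it is single-threaded.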

3. Result submission endpoint

```
POST /api/queue/:run_id/submit
```

  • Runner submits signed result bundle (same Sigstore/cosign envelope as manual submissions)
  • Server verifies signature, validates schema, writes to `run_model_metrics` / `run_hardware_metrics`
  • Marks `run_queue` row COMPLETE
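
The order of operations on submit (verify signature, validate schema, write metrics, mark the row complete) might look like the sketch below. Signature checking is injected as a callable because the Sigstore/cosign envelope format is not specified here; `REQUIRED_FIELDS`, `SubmissionError`, `handle_submit`, and the dict-based `store` are assumptions standing in for the real schema and tables.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical minimal schema; the real one is defined by the bakeoff result format.
REQUIRED_FIELDS = {"run_id": int, "model_metrics": dict, "hardware_metrics": dict}


class SubmissionError(Exception):
    """Raised when a submitted bundle is rejected."""


def handle_submit(run_id: int, payload: bytes, signature: bytes,
                  queue_row: Dict[str, Any],
                  verify_signature: Callable[[bytes, bytes], bool],
                  store: Dict[str, List[dict]]) -> None:
    # 1. Verify the signature over the raw bytes before parsing anything.
    if not verify_signature(payload, signature):
        raise SubmissionError("signature verification failed")
    # 2. Validate the schema.
    try:
        bundle = json.loads(payload)
    except ValueError as exc:
        raise SubmissionError("payload is not valid JSON") from exc
    if not isinstance(bundle, dict):
        raise SubmissionError("payload must be a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(bundle.get(name), expected_type):
            raise SubmissionError(f"missing or invalid field: {name}")
    if bundle["run_id"] != run_id:
        raise SubmissionError("run_id in bundle does not match URL")
    # 3. Write metrics, then 4. mark the queue row COMPLETE.
    store["run_model_metrics"].append({"run_id": run_id, **bundle["model_metrics"]})
    store["run_hardware_metrics"].append({"run_id": run_id, **bundle["hardware_metrics"]})
    queue_row["status"] = "COMPLETE"
```

In a real handler, steps 3 and 4 would presumably share one database transaction so that a failed write never leaves a row marked COMPLETE without metrics.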

4. Auth surface

Phase 1 — submission whitelist:

  • Static allowlist of registered runner public keys
  • Reject claim/submit from unregistered keys
  • Admin endpoint (authenticated) to add/remove keys
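
A minimal sketch of the Phase 1 allowlist, assuming keys are stored as SHA-256 fingerprints of the raw 32-byte Ed25519 public key. The `KeyAllowlist` class and its method names are hypothetical; the ticket does not say how the list is stored or how the admin endpoint is authenticated.

```python
import hashlib
from typing import Iterable, Set

ED25519_PUBLIC_KEY_LEN = 32  # raw Ed25519 public keys are 32 bytes


class KeyAllowlist:
    """In-memory allowlist of runner public keys, keyed by fingerprint."""

    def __init__(self, keys: Iterable[bytes] = ()) -> None:
        self._fingerprints: Set[str] = {self.fingerprint(k) for k in keys}

    @staticmethod
    def fingerprint(public_key: bytes) -> str:
        if len(public_key) != ED25519_PUBLIC_KEY_LEN:
            raise ValueError("expected a 32-byte Ed25519 public key")
        return hashlib.sha256(public_key).hexdigest()

    def add(self, public_key: bytes) -> None:
        # Would be called by the authenticated admin endpoint.
        self._fingerprints.add(self.fingerprint(public_key))

    def remove(self, public_key: bytes) -> None:
        # Removing an unknown key is a no-op.
        self._fingerprints.discard(self.fingerprint(public_key))

    def is_allowed(self, public_key: bytes) -> bool:
        # Gate for claim and submit: unknown or malformed keys are rejected.
        try:
            return self.fingerprint(public_key) in self._fingerprints
        except ValueError:
            return False
```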

Phase 2 — OAuth + approved keys (follow-on):

  • OAuth provider integration (GitHub OAuth app or similar)
  • Approved-key issuance flow: runner requests approval, admin approves, key enters whitelist
  • Long-term: federated trust model for community runners

Acceptance criteria

  • Runner registration endpoint — insert/upsert, return token
  • Claim endpoint — SKIP LOCKED claim, capability filter, heartbeat contract defined
  • Submit endpoint — signature verification, schema validation, queue row update
  • Whitelist gate enforced on claim and submit
  • Admin key-management endpoint (add/remove from whitelist)
  • Heartbeat timeout: stale claimed jobs returned to PENDING after configurable TTL
  • Cross-linked to bakeoff#37 (Add distributed worker mode to runner)
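
The heartbeat timeout criterion could be met by a periodic sweep like the one sketched here. The dict keys (`status`, `last_heartbeat`, `claimed_by`), the `CLAIMED` status name, and `requeue_stale` are assumptions for illustration; in practice this would likely be a single `UPDATE` against `run_queue`.

```python
from typing import Any, Dict, List


def requeue_stale(jobs: List[Dict[str, Any]], now: float, ttl_seconds: float) -> List[int]:
    """Return claimed jobs whose last heartbeat is older than the TTL to PENDING.

    Returns the run_ids that were requeued.
    """
    requeued = []
    for job in jobs:
        if job["status"] == "CLAIMED" and now - job["last_heartbeat"] > ttl_seconds:
            job["status"] = "PENDING"
            job["claimed_by"] = None
            requeued.append(job["run_id"])
    return requeued
```

Completed jobs are left alone regardless of heartbeat age, so a late sweep cannot resurrect finished work.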

Out of scope (this ticket)

  • Pull client implementation (lives in bakeoff `bench/` — see bakeoff#37)
  • OAuth Phase 2 auth (follow-on ticket)
  • UI for runner management

— Bastion // 050911ZJUN26
