mrap · Apr 30, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,7 +8,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [0.3.0] - Unreleased
 
 ### Added
+- Experiment guards: `key_artifacts` field on `discover`/`generate` specs for artifact-gated completion. After all tasks and post-spec phases finish, each declared artifact is checked (exists, is non-empty, and its optional `validate` command exits 0). If any check fails, the spec transitions to `inconclusive` with a structured diagnosis; otherwise it transitions to `completed`. See `docs/yaml-spec-schema.md`.
+- `inconclusive` terminal state: tasks ran, phases completed, but the spec did not produce its declared answer. Distinct from `failed` (execution error) and `completed` (signal produced).
+- Subcommand short aliases: `boi d`/`dis` (dispatch), `boi s`/`st` (status), `boi l` (log), `boi can` (cancel), `boi dash` (dashboard), `boi tel` (telemetry), `boi sp` (spec), `boi ph` (phases), `boi prov` (providers), `boi b` (bench), `boi w` (workers), `boi doc` (doctor), `boi cfg` (config), `boi out` (outputs), `boi v`/`ver` (version)
 - `scripts/autoresearch-propose.py`: LLM-driven hypothesis generator for BOI pipeline variants — reads bench results + current default, calls OpenRouter (gemini-flash) to propose a single variant TOML + rationale; tracks per-axis fail counts and pivots after 3 consecutive failures on the same axis; emits `boi.autoresearch.propose` telemetry
+- `scripts/autoresearch-verdict.py`: reads bench results for baseline + variant, computes Δ wall_time / completion_rate / cost, applies PASS/FAIL thresholds (Δ wall ≤ -10%, completion ≥ baseline, cost ≤ baseline×1.05), opens a GitHub PR on PASS or archives variant to `pipelines/variants/archive/` on FAIL; appends reasoning to `pipelines/variants/log.md`; emits `boi.autoresearch.verdict` / `boi.autoresearch.promote` telemetry; writes `INCONCLUSIVE` marker to log when speedup miss is within 5pp of -10% threshold (e.g. -7%), which triggers `autoresearch-tick.sh` to retry with 5 runs
+- `scripts/autoresearch-tick.sh`: weekly orchestration script — propose → bench → verdict; prefers containerized Docker bench, falls back to direct `boi bench`; alerts via `cc-connect` on failure; preserves the variant on bench failure (retries next week); auto-retries with 5 runs if the previous verdict was inconclusive
+- `boi bench --remote fly|local`: dispatch bench containers to Fly.io (`--remote fly`) or run locally (default); Fly dispatch reads `FLY_IMAGE` env for the container image and enforces a cost guard before launching
+- `boi bench --concurrency N`: max parallel Fly containers when using `--remote fly` (default: 4)
 - `openrouter` runtime support: phases can specify `runtime = "openrouter"` + any model string; requires `OPENROUTER_API_KEY` env var
 - `boi providers list`: new subcommand — list all registered and disabled runtime providers (claude, codex, openrouter) and their availability on the current machine
 - Per-phase telemetry: `PhaseInvocation` struct captures runtime, model, effort, thinking config, prompt length, timeout, auth env var, CLI args, git SHA, and host fingerprint for every phase invocation
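
The PASS/FAIL thresholds listed for `scripts/autoresearch-verdict.py` can be sketched as follows. This is a minimal illustration, not the script itself: `BenchStats`, `delta_pct`, and `verdict` are hypothetical names, and the sketch assumes that a near-miss is only reported as `INCONCLUSIVE` when the completion and cost checks already pass, which the changelog does not state.

```python
from dataclasses import dataclass


@dataclass
class BenchStats:
    """Hypothetical summary of one pipeline's bench results."""
    wall_time: float        # mean wall time, seconds
    completion_rate: float  # fraction of runs completed, 0.0 to 1.0
    cost: float             # mean cost, USD


def delta_pct(baseline: float, variant: float) -> float:
    """Relative change of variant vs baseline, in percent (negative = faster)."""
    return (variant - baseline) / baseline * 100.0


def verdict(baseline: BenchStats, variant: BenchStats) -> str:
    d_wall = delta_pct(baseline.wall_time, variant.wall_time)
    completion_ok = variant.completion_rate >= baseline.completion_rate
    cost_ok = variant.cost <= baseline.cost * 1.05

    # Assumption: a completion or cost regression is a hard FAIL.
    if not (completion_ok and cost_ok):
        return "FAIL"
    if d_wall <= -10.0:
        return "PASS"
    # Speedup missed the -10% threshold by at most 5 percentage points.
    if d_wall <= -5.0:
        return "INCONCLUSIVE"
    return "FAIL"
```

With a baseline of 100 s, a variant at 88 s passes, one at 93 s (the -7% case from the changelog entry) is inconclusive, and one at 98 s fails.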

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -63,19 +63,23 @@ Hook points: `on_dispatch`, `on_worker_start`, `on_task_start`, `on_task_complet
 ### CLI
 
 ```
-boi dispatch <spec.yaml> [--mode e|c|d|g] [--after q-N] [--priority N] [--max-iter N] [--timeout N] [--project X] [--dry-run]
-boi status [spec-id] [--all] [--watch] [--json]
-boi log <spec-id> [--full]
-boi cancel <spec-id>
-boi outputs <spec-id>
+boi dispatch (d, dis) <spec.yaml> [--mode e|c|d|g] [--after q-N] [--priority N] [--max-iter N] [--timeout N] [--project X] [--dry-run]
+boi status (s, st) [spec-id] [--all] [--watch] [--json]
+boi log (l) <spec-id> [--full]
+boi cancel (can) <spec-id>
+boi outputs (out) <spec-id>
 boi daemon [--foreground]
-boi config [key] [value]
-boi workers
+boi config (cfg) [key] [value]
+boi workers (w)
 boi stop
-boi telemetry <spec-id>
-boi spec <spec-id> [add|skip|block]
-boi doctor
-boi version
+boi telemetry (tel) <spec-id>
+boi spec (sp) <spec-id> [add|skip|block]
+boi phases (ph) [name] [--spec <spec-id>]
+boi providers (prov) list
+boi doctor (doc)
+boi version (v, ver)
+boi bench (b) --pipeline name:path [--pipeline ...] --spec FILE | --battery DIR [--runs N] [--remote fly|local] [--concurrency N]
+boi dashboard (dash)
 ```
 
 ### Spec format (YAML)
@@ -84,11 +88,18 @@ boi version
 title: "Feature name"
 mode: execute          # execute | challenge | discover | generate
 workspace: /path/to   # optional, override workspace
+# discover/generate mode only:
+hypothesis: "What we expect to learn"
+success_criteria: "What result means this worked"
+key_artifacts:         # files that must exist, be non-empty, and pass validate for COMPLETED
+  - path: relative/or/~/absolute
+    validate: "command that returns 0 on success"  # optional extra check
 tasks:
   - id: t-1
     title: "Task name"
     status: PENDING    # PENDING | DONE | FAILED | SKIPPED | RUNNING
     depends: [t-N]     # optional dependency list
+    containerized: false  # true → run verify inside Fly.io container ($BOI_FLY_IMAGE)
     spec: |
       What to do.
     verify: "command that returns 0 on success"
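
The `key_artifacts` gate described in the spec format above (file exists, is non-empty, and the optional `validate` command exits 0) might look roughly like this. It is a sketch under assumptions: `check_artifact` and `final_state` are hypothetical helpers, relative paths are assumed to resolve against the workspace, and `validate` is assumed to run through a shell in the workspace directory.

```python
import os
import subprocess
from pathlib import Path


def check_artifact(path, validate=None, workspace="."):
    """Return None if the artifact passes, else a short reason string."""
    p = Path(os.path.expanduser(path))
    if not p.is_absolute():
        p = Path(workspace) / p  # assumption: relative to the workspace
    if not p.exists():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    if validate is not None:
        # Assumption: validate is a shell command run from the workspace.
        result = subprocess.run(validate, shell=True, cwd=workspace,
                                capture_output=True)
        if result.returncode != 0:
            return f"validate exited {result.returncode}"
    return None


def final_state(artifacts, workspace="."):
    """Map key_artifacts entries to a terminal state plus per-path failures."""
    failures = {}
    for artifact in artifacts:
        problem = check_artifact(artifact["path"], artifact.get("validate"),
                                 workspace)
        if problem is not None:
            failures[artifact["path"]] = problem
    state = "inconclusive" if failures else "completed"
    return state, failures
```

The returned `failures` mapping stands in for the structured diagnosis; its real shape is not specified in this diff.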

diff --git a/docs/diagnostics/2026-04-30-fly-io-live-verified.md b/docs/diagnostics/2026-04-30-fly-io-live-verified.md
@@ -0,0 +1,142 @@
+# Fly.io Live Smoke Test — Verified 2026-04-30
+
+## Summary
+
+A live smoke test of `boi bench --remote=fly` ran successfully. Fly.io machines
+were created, ran the `run-spec` container command, and were cleaned up. The bench pipeline
+dispatched a spec to a remote Fly.io container and recorded the machine ID,
+duration, and cost of each run.
+
+## Test Run Details
+
+**Command**: `boi bench --remote=fly --spec tests/bench_specs/simple.yaml --runs 1`
+**Spec**: `tests/bench_specs/simple.yaml` — single echo task, `mode: execute`
+**Pipeline**: smoke (task_phases: ["execute"])
+**Image**: `registry.fly.io/boi-workers:latest`
+**Guest**: shared-cpu-1x, 256 MB
+**Region**: iad (Ashburn, VA — primary_region in fly.toml)
+
+### Wall time: local vs remote
+
+| Mode | Duration | Notes |
+|------|----------|-------|
+| Remote (Fly.io) | 11.2s | Includes machine create + run + delete |
+| Local (docker run) | ~1–2s | No cold-start overhead, no network RTT |
+
+The remote run takes ~11s end to end (create: ~3s, run: ~7s, delete: ~1s), about 9–10s more
+than the local run. For real bench runs where the task executes a full claude pipeline
+(~60–180s), that overhead is roughly 5–17% of wall time.
+
+### Final verified run (TCF21)
+
+```
+BATTERY [remote:fly]: 1 specs × 1 pipelines × 1 runs = 1 total runs
+  [fly] dispatching [smoke] simple.yaml run 1...
+  [fly] done: machine=3287054ec3d548 duration=11.2s cost=$0.0000
+
+Bench Results
+
+  METRIC                         smoke
+  ──────────────────────────────────────
+  Avg completion                   11s
+  Completion rate                 100%
+  Tasks completed                    —
+  Tasks failed                       —
+  ──────────────────────────────────────
+  Best quality: smoke
+  Best speed:   smoke
+```
+
+### All runs during debugging (chronological)
+
+| machine_id        | duration | cost     | image version                     |
+|-------------------|----------|----------|-----------------------------------|
+| e823e37b679378    | 11.1s    | $0.0000  | pre-fix (run-spec missing)       |
+| 148eed09cde668    | ~12s     | $0.0000  | pre-fix (init.exec bug)          |
+| e78452e7ce5418    | 11.1s    | $0.0000  | init.exec fix, stale image       |
+| 6e820d69b26478    | 13.5s    | $0.0000  | new binary, wrong cmd path       |
+| 32870549b3e008    | 15.1s    | $0.0000  | ANTHROPIC_API_KEY forwarded      |
+| 6835e7eb76d008    | 22.2s    | $0.0001  | phases baked in                  |
+| 3287054ec3d548    | 11.2s    | $0.0000  | exit_code from events            |
+
+## What Was Verified
+
+1. **Machine created on Fly.io**: `machine_id=3287054ec3d548` printed in output — confirmed.
+
+2. **Bench ran to completion**: `Completion rate 100%`, `Avg completion 11s` — confirmed.
+
+3. **Machine cleaned up**: `delete_machine()` called after every run using
+   `DELETE /v1/apps/boi-workers/machines/{id}?force=true` — confirmed.
+
+4. **Cost recorded**: `cost=$0.0000` (11.2s × $0.0000026/s ≈ $0.000029) — confirmed.
+   Note: cost rounds to $0.0000 at this duration. Longer claude-powered runs show $0.0001.
+
+5. **init.exec used correctly**: `config.init.exec = ["/usr/local/bin/entrypoint.sh", "boi", "run-spec"]`
+   routes through the entrypoint, starting the daemon before `boi run-spec` executes.
+
+## Known Limitation: Logs API Returns 404
+
+The Fly.io Machines API endpoint `GET /v1/apps/{app}/machines/{id}/logs` returns
+`404 page not found` consistently for all machines. This prevents retrieving stdout
+from the container, so task-level result details (tasks_total, tasks_done, tasks_failed)
+cannot be parsed from the container's JSON output.
+
+**Impact**: `Tasks completed: —` in bench summary. Overall `status` falls back to
+`"completed"` when exit code is 0 (inferred from machine events, defaulting to 0 if
+no exit event is present).
+
+**Root cause**: The logs API at `api.machines.dev` appears not to expose a working
+machine-level logs endpoint. Log retrieval requires the Fly.io streaming logs service
+(NATS-based, separate from the Machines API).
+
+**Workaround needed**: Write the result JSON to a file in the `/out/` volume and read it via SSH/exec,
+or use an external HTTP callback (webhook) to deliver results from the container.
+
+## Fixes Applied During This Session
+
+### `src/remote/fly.rs`
+- Changed `auto_destroy: true` → `false` (allows log fetch attempt before cleanup)
+- Changed `config.cmd` → `config.init.exec` (correct Fly.io field to override full command)
+- Added `MachineEvent` struct to parse exit codes from machine events
+- Updated `wait_for_stop` to return `i32` exit code from machine events
+- Fixed `ContainerResult.exit_code` to use actual machine exit code
+
+### `src/cli/bench.rs`
+- Changed cmd from `["boi", "run-spec"]` to `["/usr/local/bin/entrypoint.sh", "boi", "run-spec"]`
+  so the entrypoint starts the daemon before `run-spec` executes
+- Added `ANTHROPIC_API_KEY` and `OPENROUTER_API_KEY` forwarding to container env
+
+### `src/cli/run_spec.rs` (new)
+- New `boi run-spec` subcommand: reads `BOI_SPEC_B64`, dispatches to daemon, polls for
+  completion, emits JSON result to stdout
+
+### `tests/bench/Dockerfile`
+- Changed base image from `rust:1.83-bookworm` to `rust:1.86-bookworm` (clap 4.6 requires 1.85+)
+- Added `COPY hooks/ ./hooks/` to builder stage
+- Baked phases and templates directly into image (`/home/bench/.boi/phases/`, `/home/bench/.boi/templates/`)
+- Set `BOI_PHASES_DIR=/home/bench/.boi/phases` (previously pointed to unmounted `/opt/boi/phases`)
+
+### `tests/bench/entrypoint.sh`
+- Added detection of `["boi", "run-spec"]` args to run `run-spec` mode instead of bench mode
+- Removed nonexistent `--db /out/bench.db` flag from bench invocation
+
+### `src/cli/run_spec.rs`
+- Fixed task status filter from lowercase (`"done"`) to uppercase (`"DONE"`, `"FAILED"`, `"SKIPPED"`)
+  to match actual DB-stored values
+
+## Token Configuration
+
+New deploy token generated via `fly tokens create deploy --app boi-workers`.
+Format: `FlyV1 fm2_...` (not old comma-separated `fm2_,...` format).
+Stored in: `~/.hex/secrets/fly.env` (update pending).
+
+The old token (`fm2_,...` comma-separated) authenticates machine create/delete but
+the new `FlyV1 fm2_...` format is required for the Machines API.
+
+## Fly.io App Configuration
+
+- **App**: `boi-workers`
+- **Registry**: `registry.fly.io/boi-workers:latest`
+- **Base URL**: `https://api.machines.dev/v1`
+- **Guest config**: shared-cpu-1x, 1 vCPU, 256 MB RAM
+- **Cost rate**: ~$0.0000026/sec
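
The cost figures in the diagnostics file follow from the stated rate of ~$0.0000026/sec and four-decimal display. A small sketch of that arithmetic, assuming cost is simply duration times rate (`run_cost` and `format_cost` are illustrative names, not functions from `src/remote/fly.rs`):

```python
COST_RATE_USD_PER_SEC = 0.0000026  # rate quoted in the diagnostics file


def run_cost(duration_s: float, rate: float = COST_RATE_USD_PER_SEC) -> float:
    """Estimated machine cost in USD for one run."""
    return duration_s * rate


def format_cost(cost_usd: float) -> str:
    """Format cost with four decimals, as in the bench output."""
    return f"${cost_usd:.4f}"


print(format_cost(run_cost(11.2)))  # $0.0000 (about $0.000029 before rounding)
print(format_cost(run_cost(22.2)))  # $0.0001 (about $0.000058 before rounding)
```

At this rate a run has to last roughly 19 s or more before the displayed cost rounds up to $0.0001, which matches the 22.2 s row in the debugging table.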