7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -8,7 +8,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [0.3.0] - Unreleased

### Added
- Experiment guards: `key_artifacts` field on `discover`/`generate` specs — artifact-gated completion. After all tasks and post-spec phases finish, each declared artifact is checked (exists, non-empty, validate command exits 0). If any check fails, spec transitions to `inconclusive` with a structured diagnosis; otherwise `completed`. See `docs/yaml-spec-schema.md`.
- `inconclusive` terminal state: tasks ran, phases completed, but the spec did not produce its declared answer. Distinct from `failed` (execution error) and `completed` (signal produced).
- Subcommand short aliases: `boi d`/`dis` (dispatch), `boi s`/`st` (status), `boi l` (log), `boi can` (cancel), `boi dash` (dashboard), `boi tel` (telemetry), `boi sp` (spec), `boi ph` (phases), `boi prov` (providers), `boi b` (bench), `boi w` (workers), `boi doc` (doctor), `boi cfg` (config), `boi out` (outputs), `boi v`/`ver` (version)
- `scripts/autoresearch-propose.py`: LLM-driven hypothesis generator for BOI pipeline variants — reads bench results + current default, calls OpenRouter (gemini-flash) to propose a single variant TOML + rationale; tracks per-axis fail counts and pivots after 3 consecutive failures on the same axis; emits `boi.autoresearch.propose` telemetry
- `scripts/autoresearch-verdict.py`: reads bench results for baseline + variant, computes Δ wall_time / completion_rate / cost, applies PASS/FAIL thresholds (Δ wall ≤ -10%, completion ≥ baseline, cost ≤ baseline×1.05), opens a GitHub PR on PASS or archives variant to `pipelines/variants/archive/` on FAIL; appends reasoning to `pipelines/variants/log.md`; emits `boi.autoresearch.verdict` / `boi.autoresearch.promote` telemetry; writes `INCONCLUSIVE` marker to log when speedup miss is within 5pp of -10% threshold (e.g. -7%), which triggers `autoresearch-tick.sh` to retry with 5 runs
- `scripts/autoresearch-tick.sh`: weekly orchestration script — propose → bench → verdict; prefers containerized Docker bench, falls back to direct `boi bench`; alerts via `cc-connect` on failure; preserves the variant on bench failure (retries next week); auto-retries with 5 runs if the previous verdict was inconclusive
- `boi bench --remote fly|local`: dispatch bench containers to Fly.io (`--remote fly`) or run locally (default); Fly dispatch reads `FLY_IMAGE` env for the container image and enforces a cost guard before launching
- `boi bench --concurrency N`: max parallel Fly containers when using `--remote fly` (default: 4)
- `openrouter` runtime support: phases can specify `runtime = "openrouter"` + any model string; requires `OPENROUTER_API_KEY` env var
- `boi providers list`: new subcommand — list all registered and disabled runtime providers (claude, codex, openrouter) and their availability on the current machine
- Per-phase telemetry: `PhaseInvocation` struct captures runtime, model, effort, thinking config, prompt length, timeout, auth env var, CLI args, git SHA, and host fingerprint for every phase invocation
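
The PASS/FAIL thresholds listed for `scripts/autoresearch-verdict.py` can be sketched as below. This is a minimal illustration under stated assumptions, not the script's actual code: the function name `classify`, its arguments, and the choice to report `INCONCLUSIVE` only when completion and cost already pass are hypothetical.

```python
def classify(base_wall, var_wall, base_completion, var_completion,
             base_cost, var_cost):
    """Return "PASS", "FAIL", or "INCONCLUSIVE" for a variant vs. its baseline."""
    # Relative wall-time change in percent; negative means the variant is faster.
    delta_wall_pct = (var_wall - base_wall) / base_wall * 100.0

    # Completion must not regress and cost may grow by at most 5%.
    if var_completion < base_completion or var_cost > base_cost * 1.05:
        return "FAIL"

    # Speedup threshold from the changelog: delta wall <= -10%.
    if delta_wall_pct <= -10.0:
        return "PASS"

    # A miss within 5 percentage points of the threshold (for example -7%)
    # is treated as inconclusive (assumed inclusive at -5%).
    if delta_wall_pct <= -5.0:
        return "INCONCLUSIVE"

    return "FAIL"
```

For example, a variant that takes 93s against a 100s baseline, with equal completion rate and cost, would be classified as `INCONCLUSIVE` under these assumptions.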
33 changes: 22 additions & 11 deletions CLAUDE.md
@@ -63,19 +63,23 @@ Hook points: `on_dispatch`, `on_worker_start`, `on_task_start`, `on_task_complet
### CLI

```
-boi dispatch <spec.yaml> [--mode e|c|d|g] [--after q-N] [--priority N] [--max-iter N] [--timeout N] [--project X] [--dry-run]
-boi status [spec-id] [--all] [--watch] [--json]
-boi log <spec-id> [--full]
-boi cancel <spec-id>
-boi outputs <spec-id>
+boi dispatch (d, dis) <spec.yaml> [--mode e|c|d|g] [--after q-N] [--priority N] [--max-iter N] [--timeout N] [--project X] [--dry-run]
+boi status (s, st) [spec-id] [--all] [--watch] [--json]
+boi log (l) <spec-id> [--full]
+boi cancel (can) <spec-id>
+boi outputs (out) <spec-id>
 boi daemon [--foreground]
-boi config [key] [value]
-boi workers
+boi config (cfg) [key] [value]
+boi workers (w)
 boi stop
-boi telemetry <spec-id>
-boi spec <spec-id> [add|skip|block]
-boi doctor
-boi version
+boi telemetry (tel) <spec-id>
+boi spec (sp) <spec-id> [add|skip|block]
+boi phases (ph) [name] [--spec <spec-id>]
+boi providers (prov) list
+boi doctor (doc)
+boi version (v, ver)
+boi bench (b) --pipeline name:path [--pipeline ...] --spec FILE | --battery DIR [--runs N]
+boi dashboard (dash)
```

### Spec format (YAML)
@@ -84,11 +88,18 @@ boi version
title: "Feature name"
mode: execute # execute | challenge | discover | generate
workspace: /path/to # optional, override workspace
# discover/generate mode only:
hypothesis: "What we expect to learn"
success_criteria: "What result means this worked"
key_artifacts: # files that must exist, be non-empty, and pass validate for COMPLETED
- path: relative/or/~/absolute
validate: "command that returns 0 on success" # optional extra check
tasks:
- id: t-1
title: "Task name"
status: PENDING # PENDING | DONE | FAILED | SKIPPED | RUNNING
depends: [t-N] # optional dependency list
containerized: false # true → run verify inside Fly.io container ($BOI_FLY_IMAGE)
spec: |
What to do.
verify: "command that returns 0 on success"
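
The `key_artifacts` gate described in this spec format (each file must exist, be non-empty, and pass its optional `validate` command) can be sketched as follows. This is an illustrative sketch only: `check_artifact` and `final_state` are hypothetical names, relative paths are resolved against the current directory, and the failure strings are made up for the example.

```python
import os
import subprocess


def check_artifact(path, validate=None):
    """Return None if the artifact passes, otherwise a short failure reason."""
    full_path = os.path.expanduser(path)  # allow ~/ paths as in the spec comment
    if not os.path.isfile(full_path):
        return "missing"
    if os.path.getsize(full_path) == 0:
        return "empty"
    if validate is not None:
        # The validate command passes when it exits with status 0.
        result = subprocess.run(validate, shell=True, capture_output=True)
        if result.returncode != 0:
            return f"validate exited {result.returncode}"
    return None


def final_state(artifacts):
    """Map artifact checks to a terminal state plus a per-path diagnosis."""
    failures = {}
    for artifact in artifacts:
        reason = check_artifact(artifact["path"], artifact.get("validate"))
        if reason is not None:
            failures[artifact["path"]] = reason
    state = "inconclusive" if failures else "completed"
    return state, failures
```

Returning the failure map alongside the state keeps the diagnosis structured, so a caller can report which artifact failed and why.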
142 changes: 142 additions & 0 deletions docs/diagnostics/2026-04-30-fly-io-live-verified.md
@@ -0,0 +1,142 @@
# Fly.io Live Smoke Test — Verified 2026-04-30

## Summary

A live smoke test of `boi bench --remote=fly` completed successfully. Fly.io machines
were created, ran the `run-spec` container command, and were cleaned up. The bench pipeline
dispatched a spec to a remote Fly.io container and recorded the machine ID, duration,
and cost for each run.

## Test Run Details

**Command**: `boi bench --remote=fly --spec tests/bench_specs/simple.yaml --runs 1`
**Spec**: `tests/bench_specs/simple.yaml` — single echo task, `mode: execute`
**Pipeline**: smoke (task_phases: ["execute"])
**Image**: `registry.fly.io/boi-workers:latest`
**Guest**: shared-cpu-1x, 256 MB
**Region**: iad (Ashburn, VA — primary_region in fly.toml)

### Wall time: local vs remote

| Mode | Duration | Notes |
|------|----------|-------|
| Remote (Fly.io) | 11.2s | Includes machine create + run + delete |
| Local (docker run) | ~1–2s | No cold-start overhead, no network RTT |

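The remote run takes about 10s longer than the local one. Its 11.2s breaks down roughly
as create: ~3s, run: ~7s, delete: ~1s. For real bench runs where the task executes a full
claude pipeline (~60–180s), a fixed ~10s adds roughly 6–17% (10s/180s to 10s/60s), so the
overhead stays under 10% only for runs longer than about 100s.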

### Final verified run (TCF21)

```
BATTERY [remote:fly]: 1 specs × 1 pipelines × 1 runs = 1 total runs
[fly] dispatching [smoke] simple.yaml run 1...
[fly] done: machine=3287054ec3d548 duration=11.2s cost=$0.0000

Bench Results

METRIC smoke
──────────────────────────────────────
Avg completion 11s
Completion rate 100%
Tasks completed —
Tasks failed —
──────────────────────────────────────
Best quality: smoke
Best speed: smoke
```

### All runs during debugging (chronological)

| machine_id | duration | cost | image version |
|-------------------|----------|----------|-----------------------------------|
| e823e37b679378 | 11.1s | $0.0000 | pre-fix (run-spec missing) |
| 148eed09cde668 | ~12s | $0.0000 | pre-fix (init.exec bug) |
| e78452e7ce5418 | 11.1s | $0.0000 | init.exec fix, stale image |
| 6e820d69b26478 | 13.5s | $0.0000 | new binary, wrong cmd path |
| 32870549b3e008 | 15.1s | $0.0000 | ANTHROPIC_API_KEY forwarded |
| 6835e7eb76d008 | 22.2s | $0.0001 | phases baked in |
| 3287054ec3d548 | 11.2s | $0.0000 | exit_code from events |

## What Was Verified

1. **Machine created on Fly.io**: `machine_id=3287054ec3d548` printed in output — confirmed.

2. **Bench ran to completion**: `Completion rate 100%`, `Avg completion 11s` — confirmed.

3. **Machine cleaned up**: `delete_machine()` called after every run using
`DELETE /v1/apps/boi-workers/machines/{id}?force=true` — confirmed.

4. **Cost recorded**: `cost=$0.0000` (11.2s × $0.0000026/s ≈ $0.000029) — confirmed.
Note: cost rounds to $0.0000 at this duration. Longer claude-powered runs show $0.0001.

5. **init.exec used correctly**: `config.init.exec = ["/usr/local/bin/entrypoint.sh", "boi", "run-spec"]`
routes through the entrypoint, starting the daemon before `boi run-spec` executes.
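
The cost figures in item 4 follow from multiplying the run duration by the per-second rate and rounding to four decimal places. A minimal sketch of that arithmetic, where the function names are hypothetical and the rate is the approximate figure quoted in this document:

```python
FLY_RATE_PER_SEC = 0.0000026  # approximate shared-cpu-1x rate quoted in this doc


def run_cost(duration_s, rate_per_sec=FLY_RATE_PER_SEC):
    """Estimated cost in dollars for one machine run."""
    return duration_s * rate_per_sec


def format_cost(cost):
    """Format a cost the way the bench output shows it: four decimal places."""
    return f"${cost:.4f}"


print(format_cost(run_cost(11.2)))  # $0.0000 (about $0.000029 before rounding)
print(format_cost(run_cost(22.2)))  # $0.0001 (about $0.000058 before rounding)
```

This matches the run table: the 22.2s run is the only one long enough to round up to `$0.0001`.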

## Known Limitation: Logs API Returns 404

The Fly.io Machines API endpoint `GET /v1/apps/{app}/machines/{id}/logs` returns
`404 page not found` consistently for all machines. This prevents retrieving stdout
from the container, so task-level result details (tasks_total, tasks_done, tasks_failed)
cannot be parsed from the container's JSON output.

**Impact**: `Tasks completed: —` in bench summary. Overall `status` falls back to
`"completed"` when exit code is 0 (inferred from machine events, defaulting to 0 if
no exit event is present).
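
The fallback described above can be sketched as follows. The event shape used here (a `type` of `"exit"` with the code nested under `request.exit_event.exit_code`) is an assumption made for illustration, and mapping a non-zero code to `"failed"` is likewise assumed; only the "default to 0, then report `completed`" behaviour comes from this document.

```python
def exit_code_from_events(events):
    """Return the exit code from the first exit event, or 0 if none is found."""
    for event in events:
        if event.get("type") != "exit":
            continue
        # Assumed nesting; tolerate missing or null intermediate objects.
        exit_event = (event.get("request") or {}).get("exit_event") or {}
        if "exit_code" in exit_event:
            return exit_event["exit_code"]
    # No exit event present: default to 0, as described above.
    return 0


def status_from_exit_code(exit_code):
    """Overall status when task-level details are unavailable."""
    return "completed" if exit_code == 0 else "failed"
```

Note the consequence of the default: a machine whose exit event is never recorded is reported as `completed`, so a success status on its own does not prove the tasks ran.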

**Root cause**: The logs API at `api.machines.dev` appears not to expose a working
machine-level logs endpoint. Log retrieval requires the Fly.io streaming logs service
(NATS-based, separate from the Machines API).

**Workaround needed**: Either write the result JSON to a file in the `/out/` volume and
read it back via SSH/exec, or use an external HTTP callback (webhook) to deliver results
from the container.
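
The file-based workaround could look roughly like this on the container side. It is a sketch under assumptions: `write_result` and the `result.json` filename are hypothetical, and only the three field names come from this document. Writing to a temporary file and renaming it means a reader never sees a half-written result.

```python
import json
import os
import tempfile


def write_result(out_dir, tasks_total, tasks_done, tasks_failed,
                 filename="result.json"):
    """Atomically write a task summary as JSON and return its path."""
    result = {
        "tasks_total": tasks_total,
        "tasks_done": tasks_done,
        "tasks_failed": tasks_failed,
    }
    os.makedirs(out_dir, exist_ok=True)
    # Write to a temp file in the same directory, then rename into place.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(result, handle)
    final_path = os.path.join(out_dir, filename)
    os.replace(tmp_path, final_path)
    return final_path
```

In the container this would be called with `out_dir="/out"`, leaving the host side to fetch the file once the machine has stopped.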

## Fixes Applied During This Session

### `src/remote/fly.rs`
- Changed `auto_destroy: true` → `false` (allows log fetch attempt before cleanup)
- Changed `config.cmd` → `config.init.exec` (correct Fly.io field to override full command)
- Added `MachineEvent` struct to parse exit codes from machine events
- Updated `wait_for_stop` to return `i32` exit code from machine events
- Fixed `ContainerResult.exit_code` to use actual machine exit code

### `src/cli/bench.rs`
- Changed cmd from `["boi", "run-spec"]` to `["/usr/local/bin/entrypoint.sh", "boi", "run-spec"]`
so the entrypoint starts the daemon before `run-spec` executes
- Added `ANTHROPIC_API_KEY` and `OPENROUTER_API_KEY` forwarding to container env

### `src/cli/run_spec.rs` (new)
- New `boi run-spec` subcommand: reads `BOI_SPEC_B64`, dispatches to daemon, polls for
completion, emits JSON result to stdout
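
As an illustration of the first step, reading the spec from `BOI_SPEC_B64`, the sketch below assumes the variable holds standard base64 of the UTF-8 spec YAML. That encoding is an assumption, and `load_spec_from_env` is a hypothetical helper, not the Rust implementation.

```python
import base64
import os


def load_spec_from_env(environ=None, var="BOI_SPEC_B64"):
    """Decode the base64-encoded spec text from an environment mapping."""
    if environ is None:
        environ = os.environ
    raw = environ.get(var)
    if not raw:
        raise RuntimeError(f"{var} is not set")
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(raw, validate=True).decode("utf-8")
```

Passing the spec through an environment variable avoids mounting a file into the machine, at the cost of a size limit on the encoded spec.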

### `tests/bench/Dockerfile`
- Changed base image from `rust:1.83-bookworm` to `rust:1.86-bookworm` (clap 4.6 requires 1.85+)
- Added `COPY hooks/ ./hooks/` to builder stage
- Baked phases and templates directly into image (`/home/bench/.boi/phases/`, `/home/bench/.boi/templates/`)
- Set `BOI_PHASES_DIR=/home/bench/.boi/phases` (previously pointed to unmounted `/opt/boi/phases`)

### `tests/bench/entrypoint.sh`
- Added detection of `["boi", "run-spec"]` args to run `run-spec` mode instead of bench mode
- Removed nonexistent `--db /out/bench.db` flag from bench invocation

### `src/cli/run_spec.rs` (status filter fix)
- Fixed task status filter from lowercase (`"done"`) to uppercase (`"DONE"`, `"FAILED"`, `"SKIPPED"`)
to match actual DB-stored values
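
The effect of the case mismatch is easy to reproduce. In this sketch, `summarize_tasks` and the `tasks_skipped` key are hypothetical; the status strings are the uppercase values named above.

```python
from collections import Counter


def summarize_tasks(statuses):
    """Count tasks by their DB-stored (uppercase) status values."""
    counts = Counter(statuses)
    return {
        "tasks_total": len(statuses),
        "tasks_done": counts["DONE"],
        "tasks_failed": counts["FAILED"],
        "tasks_skipped": counts["SKIPPED"],
    }


stored = ["DONE", "DONE", "FAILED", "SKIPPED"]
summary = summarize_tasks(stored)

# A lowercase filter matches nothing, which is the bug described above.
lowercase_done = sum(1 for status in stored if status == "done")
```

With the lowercase filter every count would be zero even though all tasks had finished.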

## Token Configuration

A new deploy token was generated via `fly tokens create deploy --app boi-workers`.
Format: `FlyV1 fm2_...` (not the old comma-separated `fm2_,...` format).
Stored in: `~/.hex/secrets/fly.env` (update pending).

The old token (`fm2_,...` comma-separated) authenticates machine create/delete but
the new `FlyV1 fm2_...` format is required for the Machines API.

## Fly.io App Configuration

- **App**: `boi-workers`
- **Registry**: `registry.fly.io/boi-workers:latest`
- **Base URL**: `https://api.machines.dev/v1`
- **Guest config**: shared-cpu-1x, 1 vCPU, 256 MB RAM
- **Cost rate**: ~$0.0000026/sec