feat(loop): complete real Harness Loop proof and operator recovery path#52

Merged
CTlanston merged 11 commits into main from claude/overnight-harness-loop on Jun 11, 2026
@CTlanston

Overnight mission — complete real Harness Loop proof and operator recovery path

All cloud-executable phases done in one night, every commit gated. 950 passed / 0 failed (+61 over baseline).

Phase 1 — planner auth recovery ✅

HOLD-PLANNER-AUTH is now distinct from the generic CLI hold and carries calm bilingual text ("运行 claude login…", i.e. "Run claude login…"). Opt-in AEDEV_PLANNER_FALLBACK=codex retries once via codex exec; events record the provider as codex-cli (fallback), so the fallback never impersonates claude and never uses a paid API. In real mode a template is never substituted for a real planner (regression-pinned). +18 tests.
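
The retry-once contract described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `planWithFallback`, `PlannerDeps`, `CliResult`, and the `claude-cli` provider label are hypothetical names, and the runners are synchronous only to keep the sketch short. The hold codes and the `codex-cli (fallback)` label are taken from the description.

```typescript
// Hypothetical result of one CLI invocation (field names are assumptions).
interface CliResult {
  exitCode: number;
  stdout: string;
  stderr: string;
}

type PlannerOutcome =
  | { kind: "ok"; text: string; plannerProvider: string }
  | { kind: "hold"; code: "HOLD-PLANNER-AUTH" | "HOLD-PLANNER-CLI"; reason: string };

// Injectable runners so the flow can be exercised without real CLIs.
interface PlannerDeps {
  runClaude: (prompt: string) => CliResult;
  runCodex: (prompt: string) => CliResult;
  fallback: string | undefined; // value of AEDEV_PLANNER_FALLBACK
}

// Pattern quoted from the commit message in this PR.
const AUTH_HINT = /401|unauthorized|auth|credit|login/i;

function planWithFallback(prompt: string, deps: PlannerDeps): PlannerOutcome {
  const claude = deps.runClaude(prompt);
  if (claude.exitCode === 0) {
    // "claude-cli" is an assumed label for the primary provider.
    return { kind: "ok", text: claude.stdout, plannerProvider: "claude-cli" };
  }
  // Only the exact value "codex" opts in; the retry happens at most once.
  if (deps.fallback === "codex") {
    const codex = deps.runCodex(prompt);
    if (codex.exitCode === 0) {
      // Recorded honestly as a fallback, never as claude.
      return { kind: "ok", text: codex.stdout, plannerProvider: "codex-cli (fallback)" };
    }
  }
  // No template substitution: a failed planner always ends in a hold.
  const authLike = AUTH_HINT.test(claude.stderr);
  return {
    kind: "hold",
    code: authLike ? "HOLD-PLANNER-AUTH" : "HOLD-PLANNER-CLI",
    reason: claude.stderr.trim(),
  };
}
```

Passing the runners in as dependencies mirrors the injectable-deps idea mentioned in the commits and lets a test count how many times the fallback was attempted.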

Phase 3 — operator console ✅

Cards speak operator vocabulary (理解/计划/构建/验证/合并, i.e. Understand/Plan/Build/Verify/PR·Merge; Build↔Verify split visually), agent strip with active highlight (Claude/Codex/Gemini/GitHub), next-step button ON the card (single shared handler map), recovery list with the recommended action emphasized, evidence entries, PR-gate lines 为什么/谁说的/下一步 (why / who said so / next step). User-E2E 7/7 PASS + quality smoke, evidence committed. +17 tests.

Phase 4 — merge policy ✅ (no behavior flip)

docs/product/MERGE_POLICY.md + pure decideMergeAction; exhaustive 864-combination sweep proves autoMergeEnabled=false (GR#10) never yields auto-merge; gemini fail→no_pr; inconclusive→hold; security/workflow/dependency/config→hold always. +14 tests.
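
A pure decision function with an exhaustive sweep might look like the sketch below. It is an illustration only: the input fields, the `human_merge` outcome, the `code`/`docs` categories, and the precedence (sensitive categories checked before the Gemini verdict) are assumptions, and the 96-combination grid here does not reproduce the 864 combinations in the PR's test. Only the outcome names `auto_merge_eligible`, `no_pr`, and `hold`, and the rules quoted above, come from this PR.

```typescript
type GeminiVerdict = "pass" | "fail" | "inconclusive" | "not_configured";
type ChangeCategory = "code" | "docs" | "security" | "workflow" | "dependency" | "system_config";
type MergeAction = "auto_merge_eligible" | "human_merge" | "hold" | "no_pr";

// Hypothetical input shape; the real decideMergeAction may take different fields.
interface MergeInput {
  autoMergeEnabled: boolean;
  validatorsPassed: boolean;
  gemini: GeminiVerdict;
  category: ChangeCategory;
}

const SENSITIVE: ChangeCategory[] = ["security", "workflow", "dependency", "system_config"];
const VERDICTS: GeminiVerdict[] = ["pass", "fail", "inconclusive", "not_configured"];
const CATEGORIES: ChangeCategory[] = ["code", "docs", "security", "workflow", "dependency", "system_config"];

function decideMergeAction(input: MergeInput): MergeAction {
  // Sensitive categories hold regardless of every other input (assumed precedence).
  if (SENSITIVE.indexOf(input.category) !== -1) return "hold";
  if (input.gemini === "fail") return "no_pr";
  // inconclusive / not_configured both hold.
  if (input.gemini !== "pass") return "hold";
  if (!input.validatorsPassed) return "hold";
  // With the flag off, the best possible outcome is a human merge.
  return input.autoMergeEnabled ? "auto_merge_eligible" : "human_merge";
}

// Enumerate every combination of the (assumed) inputs for an exhaustive test.
function allMergeInputs(): MergeInput[] {
  const out: MergeInput[] = [];
  for (const autoMergeEnabled of [false, true]) {
    for (const validatorsPassed of [false, true]) {
      for (const gemini of VERDICTS) {
        for (const category of CATEGORIES) {
          out.push({ autoMergeEnabled, validatorsPassed, gemini, category });
        }
      }
    }
  }
  return out;
}
```

Because the function is pure and its input space is small, a test can assert the invariant over every combination rather than over hand-picked cases.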

Phase 5 — evidence audit ✅

run-summary.md written on ALL four mission exits (done/DAG-failed/held/catch): summary/changed-paths/diff/validators/reviewer/PR-or-gate/cost/HOLDs/artifacts/real-vs-simulated classification; absent inputs say "absent", never fabricated. +12 tests.
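
The "absent, never fabricated" rule can be illustrated with a small renderer. This is a sketch under assumptions: the `RunSummaryInput` fields, the Markdown layout, and the treatment of empty lists as absent are hypothetical, and the real `renderRunSummary` covers more sections (diff, cost, artifacts, classification).

```typescript
// Hypothetical subset of the data a mission exit has in scope.
interface RunSummaryInput {
  exit: "done" | "failed" | "held" | "threw";
  summary?: string;
  changedPaths?: string[];
  validatorVerdict?: "pass" | "fail";
  reviewerVerdict?: string;
  holds?: string[];
}

const ABSENT = "absent";

// Render one line; a missing or empty value becomes an explicit marker
// instead of a default that could be mistaken for a real result.
function summaryLine(label: string, value: string | undefined): string {
  const shown = value === undefined || value === "" ? ABSENT : value;
  return `- ${label}: ${shown}`;
}

function renderRunSummary(input: RunSummaryInput): string {
  return [
    "# Run summary",
    summaryLine("exit", input.exit),
    summaryLine("summary", input.summary),
    summaryLine("changed paths", input.changedPaths?.join(", ")),
    summaryLine("validators", input.validatorVerdict),
    summaryLine("reviewer", input.reviewerVerdict),
    summaryLine("holds", input.holds?.join(", ")),
  ].join("\n");
}
```

Keeping the renderer pure means the same function can serve all four exits; the caller passes whatever it has and the output states plainly what was missing.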

Phase 6 — full 30-min soak ✅ uninterrupted

5/5 PASS: provisioning, no-double-execution, forged-evidence drill (HOLD+freeze+403), idle-zero-credit, per-operator attribution. Evidence is in evidence/fleet-soak/2026-06-11T07-35-14-553Z/.

Phase 2 — honest V6-P3 conclusion (the only HOLD)

REAL Draft PR exists: https://github.com/CTlanston/hermus-agent/pull/4 (operator-produced — the remote-write gate is truly proven). The full cockpit-driven chain + real Gemini verdict artifact remain HOLD-PLANNER-AUTH (operator's claude -p 401). Recovery (incl. tonight's new fallback) documented in evidence/v6/real-proof/.

Remaining HOLD list

  1. HOLD-PLANNER-AUTH — operator: claude login (or AEDEV_PLANNER_FALLBACK=codex), then rerun the cockpit mission → commit gemini-verdict.json + mission-events.jsonl.
  2. Week-long real soak (30-min harness proven; SOAK_OPERATIONS.md ready).
  3. V6-P6 ordinary-user acceptance (needs a real human).

Classification — Real: hermus#4, 30-min soak, 950 tests, browser E2E (real chromium). Simulated: engine sides of E2E/soak (self-labeled). Unproven: cockpit end-to-end real chain + real Gemini verdict (the HOLD's subject). No fabrication anywhere.

Merging per the overnight grant: tests all green, evidence complete, no security/workflow/dependency/config changes, auto-merge policy itself untouched (pure function only).

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ


Generated by Claude Code

claude and others added 11 commits June 11, 2026 07:36
…anner fallback, and bilingual recovery text

Failing-first tests for the operator's real failure (claude -p returns 401):
- detectPlannerAuthFailure / plannerFallbackProvider pure-policy contract
- runLocalPlannerText: 401 -> HOLD-PLANNER-AUTH (not generic), no template
  substitution in real mode, no codex attempt unless AEDEV_PLANNER_FALLBACK=codex
- AEDEV_PLANNER_FALLBACK=codex: ONE read-only codex exec retry recorded as
  planner_provider 'codex-cli (fallback)' (never pretends it was claude)
- runPlannerMissionDesign: same hold/fallback contract via fenced-JSON parse
- user-state + blocker card: calm bilingual fix text (claude login / /status /
  AEDEV_PLANNER_FALLBACK=codex), never raw 401 or HOLD- codes

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
…FALLBACK=codex planner fallback

Fixes the operator's real failure: claude -p returning 401 left the cockpit
in a confusing generic HOLD-PLANNER-CLI.

- packages/daemon/src/planner-auth.ts: pure policy — detectPlannerAuthFailure
  (exitCode!=0 AND /401|unauthorized|auth|credit|login/i over stderr/transcript)
  and plannerFallbackProvider (only the exact value 'codex' enables fallback).
- runLocalPlannerText / runPlannerMissionDesign (now exported, with injectable
  PlannerAdapterDeps): auth-looking claude failures emit HOLD-PLANNER-AUTH with
  the matched hint in the reason; when AEDEV_PLANNER_FALLBACK=codex AND claude
  fails for any reason, retry ONCE via the local codex CLI (read-only exec,
  probe contract, prompt on stdin, same fenced-JSON parse), metered via
  recordHeadlessCall provider 'codex-cli', and recorded honestly as
  planner_provider 'codex-cli (fallback)' — never pretending it was claude.
  No paid-API fallback ever. If codex also fails → AUTH/CLI hold as before.
- Brainstorm/followup/roadmap paths persist the HOLD-PLANNER-AUTH row with the
  one-line nextAction fix; the hold message and HOLD-ROADMAP path carry
  'claude login' / /status / AEDEV_PLANNER_FALLBACK=codex guidance. No
  template is ever substituted in real mode (regression-pinned).
- user-state.ts + loop-cards.ts: calm bilingual explanation and recovery
  actions for HOLD-PLANNER-AUTH; visible text never shows raw 401 or codes.

Gates: pnpm typecheck + lint clean; 907 tests pass (baseline 889 + 18 new).
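
For illustration, the pure policy described in this commit could be sketched as below. The regular expression and the exact-value rule are quoted from the commit message; the `PlannerRunResult` shape, the return types, and passing the environment as a plain record are assumptions (the real functions in packages/daemon/src/planner-auth.ts may differ, for example by returning a boolean).

```typescript
// Hypothetical shape of a finished planner CLI run.
interface PlannerRunResult {
  exitCode: number;
  stderr: string;
  transcript: string;
}

// Pattern quoted from the commit message.
const AUTH_HINT = /401|unauthorized|auth|credit|login/i;

// Returns the matched hint when a failed run looks auth-related, else null.
function detectPlannerAuthFailure(run: PlannerRunResult): string | null {
  // A run that exited 0 is never an auth failure, whatever it printed.
  if (run.exitCode === 0) return null;
  const match = AUTH_HINT.exec(`${run.stderr}\n${run.transcript}`);
  return match ? match[0] : null;
}

// Only the exact value "codex" enables the fallback; any other value disables it.
function plannerFallbackProvider(env: Record<string, string | undefined>): "codex" | null {
  return env.AEDEV_PLANNER_FALLBACK === "codex" ? "codex" : null;
}
```

In a daemon the second function would typically be called with `process.env`. Note that the quoted pattern is deliberately broad: any failed run whose output mentions "login" or "credit" is classified as an auth failure.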

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
…ary titles, agent strip, on-card action, evidence entries, PR-gate why/who/next

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
…vocabulary titles, agent strip, on-card primary action, evidence entries, PR-gate why/who/next

- Visible card titles use the operator vocabulary 理解/计划/构建/验证/合并
  (Understand/Plan/Build/Verify/PR·Merge); progress splits VISUALLY into
  Build (running) vs Verify (evidence_ready/validating/validators_*) via
  machine.stage — title only, internal five types and data-card-type unchanged.
- Agent strip (cockpit-card-agents, data-active-agent) shows
  Claude(澄清/规划/审查) · Codex(编码) · Gemini(终审) · GitHub(PR), i.e.
  Claude (clarify/plan/review), Codex (coding), Gemini (final review), and
  GitHub (PR), with the currently-active one highlighted, derived from card
  type + machine.stage (+ lastActivity.phase fallback for blockers).
- The daemon's primaryAction renders ON the card (cockpit-card-action) via a
  new onAction prop wired through resolvePrimaryActionHandler — the SAME
  id→handler map the guidance buttons use; no duplicated logic.
- Blocker recovery_actions render as a list with the recommended action
  emphasized (cockpit-card-recovery, data-recommended).
- Progress evidence_links and pr_ready files_changed render as
  clickable-looking read-only entries (cockpit-card-evidence).
- PR-gate decisions show three calm lines 为什么/谁说的/下一步
  (why / who said so / next step) in cockpit-card-pr-gate; Gemini vs 安全门
  (security gate) is derived from the gate code, raw codes stay in data-*
  only; HOLD blockers never pretend to be gate decisions.
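
The Build vs Verify title split in the first bullet can be sketched as a small pure function. This is an assumption-laden illustration: `progressCardTitle` is a hypothetical name, stages other than those listed in the commit fall back to Build here by assumption, and `validators_passed` in the usage note is only an example of a `validators_*` stage.

```typescript
// Visible titles for the progress card: 构建 (Build) and 验证 (Verify).
type ProgressTitle = "构建" | "验证";

// Derive the visible title from machine.stage only; the internal card type
// and data-card-type stay unchanged, as the commit describes.
function progressCardTitle(stage: string): ProgressTitle {
  const verifying =
    stage === "evidence_ready" ||
    stage === "validating" ||
    /^validators_/.test(stage);
  // "running" and, by assumption, any other stage show Build.
  return verifying ? "验证" : "构建";
}
```

For example, `progressCardTitle("running")` gives 构建, while `progressCardTitle("validating")` and `progressCardTitle("validators_passed")` give 验证.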

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
…5 PASS

provisioning, no-double-execution, forged-evidence drill (HOLD+freeze+403),
idle-zero-credit, per-operator attribution. soak-pending -> complete.

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
…e invariant

The on-card action button gets the primary LOOK via .ck-loop-action styling
but not the .ck-btn.primary class — the webui quality smoke pins exactly one
.ck-btn.primary per stage (the guidance row), and the card mirrors that same
single action through the shared handler.

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
…p; passing browser evidence

- expectLoopCard now also requires the agent strip (cockpit-card-agents) with
  all four agents and records the active one at every journey step.
- New expectCardAction pins the card's own next-step button
  (cockpit-card-action) at the plan step (generate-plan/approve-roadmap) and
  the approve step (start-execution).
- Evidence: user-journey E2E PASS (7/7 steps) and webui quality smoke PASS,
  both on the final operator-console code.

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
…to-merges) and the run-summary evidence-audit contract

Phase 4: exhaustive matrix tests for decideMergeAction — autoMergeEnabled=false
can never yield auto_merge_eligible; gemini fail -> no_pr; inconclusive /
not_configured -> hold; security/workflow/dependency/system_config -> hold
regardless. Phase 5: run-summary.md renderer honesty contract (absent inputs
render as absent, never a fabricated PASS; real/simulated/unproven
classification) plus mission-runner tests for the happy and held paths.

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
…run-summary.md evidence audit on every mission exit

Phase 4: docs/product/MERGE_POLICY.md (mature-product action matrix,
reconciled with docs/AUTO_MERGE_POLICY.md) + decideMergeAction() in
packages/daemon/src/merge-policy-v6.ts. Export-only, NOT wired into any
merge execution path; autoMergeEnabled=false (GR#10 human merge only) can
never return auto_merge_eligible — pinned exhaustively.

Phase 5: packages/daemon/src/run-summary.ts — pure renderRunSummary /
writeRunSummary with summary, changed paths, diff stats, validator and
reviewer verdicts, PR-or-gate outcome, cost+headless tallies from the event
store, holds, artifacts, and a real/simulated/unproven classification line.
Wired into mission-runner on the done/failed/held/threw exits from data
already in scope; absent inputs render as explicit absent markers, never a
fabricated PASS.

Gates: pnpm typecheck + pnpm lint + full suite green (950 passed, 0 failed;
baseline 924 + 26 new tests).

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
@CTlanston CTlanston marked this pull request as ready for review June 11, 2026 08:29
@CTlanston CTlanston merged commit bb6ad97 into main Jun 11, 2026
4 checks passed

2 participants