feat(loop): complete real Harness Loop proof and operator recovery path by CTlanston · Pull Request #52 · CTlanston/claude-code-247

CTlanston · 2026-06-11T08:24:13Z

Overnight mission — complete real Harness Loop proof and operator recovery path

All cloud-executable phases done in one night, every commit gated. 950 passed / 0 failed (+61 over baseline).

Phase 1 — planner auth recovery ✅

HOLD-PLANNER-AUTH is now distinct from the generic CLI hold and shows calm bilingual text ("运行 claude login…", i.e. "run claude login…"). Opt-in AEDEV_PLANNER_FALLBACK=codex retries once via codex exec; events record codex-cli (fallback), never impersonating claude and never using a paid API. The template never impersonates a real planner (regression-pinned). +18 tests.
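For reviewers who want the Phase 1 control flow at a glance, here is a minimal sketch of the retry-once contract. It is not the code in this PR: `planWithFallbackSketch`, `PlannerCliResult`, `PlannerSketchDeps`, and the `"claude-cli"` provider label are hypothetical names, the runners are synchronous for brevity, and only stderr is inspected.

```typescript
// Hypothetical shapes; the real PlannerAdapterDeps is not shown in this PR text.
type PlannerCliResult = { exitCode: number; stdout: string; stderr: string };
type PlannerSketchDeps = {
  runClaude: (prompt: string) => PlannerCliResult;
  runCodex: (prompt: string) => PlannerCliResult;
  env: Record<string, string | undefined>;
};
type PlannerSketchOutcome =
  | { kind: "ok"; text: string; planner_provider: "claude-cli" | "codex-cli (fallback)" }
  | { kind: "hold"; code: "HOLD-PLANNER-AUTH" | "HOLD-PLANNER-CLI"; reason: string };

// Same pattern the commit message quotes for auth-looking failures.
const AUTH_HINT_SKETCH = /401|unauthorized|auth|credit|login/i;

function planWithFallbackSketch(prompt: string, deps: PlannerSketchDeps): PlannerSketchOutcome {
  const claude = deps.runClaude(prompt);
  if (claude.exitCode === 0) {
    return { kind: "ok", text: claude.stdout, planner_provider: "claude-cli" };
  }
  // Only the exact value "codex" opts in; the retry happens at most once.
  if (deps.env.AEDEV_PLANNER_FALLBACK === "codex") {
    const codex = deps.runCodex(prompt);
    if (codex.exitCode === 0) {
      // Recorded honestly as a fallback, never as claude.
      return { kind: "ok", text: codex.stdout, planner_provider: "codex-cli (fallback)" };
    }
  }
  // No template substitution: a failed planner always ends in a hold.
  const hint = claude.stderr.match(AUTH_HINT_SKETCH);
  return hint
    ? { kind: "hold", code: "HOLD-PLANNER-AUTH", reason: `planner auth failure (matched "${hint[0]}")` }
    : { kind: "hold", code: "HOLD-PLANNER-CLI", reason: "planner CLI failed" };
}
```

Injecting the two runners and the environment keeps the function pure enough to test without a real CLI, which is presumably why the PR exports the planner functions with injectable deps.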

Phase 3 — operator console ✅

Cards speak operator vocabulary (理解/计划/构建/验证/合并, i.e. Understand/Plan/Build/Verify/PR·Merge; Build↔Verify split visually), agent strip with active highlight (Claude/Codex/Gemini/GitHub), next-step button ON the card (single shared handler map), recovery list with recommended emphasis, evidence entries, PR-gate 为什么/谁说的/下一步 (why / who said so / next step). User-E2E 7/7 PASS + quality smoke, evidence committed. +17 tests.
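The "single shared handler map" idea can be illustrated roughly as follows. This is a sketch under assumptions: `makeActionResolverSketch` and `CockpitActionId` are hypothetical names, and the three action ids are the ones the E2E commit in this PR mentions; the real resolvePrimaryActionHandler signature is not shown here.

```typescript
// Action ids taken from the E2E commit text; the full id set is an assumption.
type CockpitActionId = "generate-plan" | "approve-roadmap" | "start-execution";
type CockpitHandler = () => void;

// One id→handler map; both the guidance row and the card resolve through it,
// so the on-card button cannot drift from the guidance button.
function makeActionResolverSketch(handlers: Record<CockpitActionId, CockpitHandler>) {
  return (id: string): CockpitHandler | undefined =>
    Object.prototype.hasOwnProperty.call(handlers, id)
      ? handlers[id as CockpitActionId]
      : undefined;
}

const actionLogSketch: string[] = [];
const resolveActionSketch = makeActionResolverSketch({
  "generate-plan": () => { actionLogSketch.push("plan"); },
  "approve-roadmap": () => { actionLogSketch.push("approve"); },
  "start-execution": () => { actionLogSketch.push("start"); },
});

// Guidance button and card button look up the same function reference.
const fromGuidanceSketch = resolveActionSketch("generate-plan");
const fromCardSketch = resolveActionSketch("generate-plan");
```

Unknown ids resolve to `undefined` instead of throwing, so a card carrying an action the UI does not know about would simply render no button in this sketch.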

Phase 4 — merge policy ✅ (no behavior flip)

docs/product/MERGE_POLICY.md + pure decideMergeAction; exhaustive 864-combination sweep proves autoMergeEnabled=false (GR#10) never yields auto-merge; gemini fail→no_pr; inconclusive→hold; security/workflow/dependency/config→hold always. +14 tests.
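To make the Phase 4 invariant concrete, here is a small sketch of a pure merge-policy function plus an exhaustive sweep. It is illustrative only: `decideMergeActionSketch`, its input shape, the `human_merge` outcome, and the `code`/`docs` categories are assumptions, and this toy matrix has 96 combinations, not the 864 covered by the real sweep. The precedence between "sensitive category → hold" and "gemini fail → no_pr" is also a guess.

```typescript
type GeminiVerdictSketch = "pass" | "fail" | "inconclusive" | "not_configured";
type ChangeCategorySketch = "code" | "docs" | "security" | "workflow" | "dependency" | "system_config";
type MergeActionSketch = "auto_merge_eligible" | "human_merge" | "hold" | "no_pr";

interface MergeInputSketch {
  autoMergeEnabled: boolean;
  gemini: GeminiVerdictSketch;
  validatorsPassed: boolean;
  category: ChangeCategorySketch;
}

const SENSITIVE_SKETCH = new Set<ChangeCategorySketch>([
  "security", "workflow", "dependency", "system_config",
]);

// Pure function: no I/O, so every combination can be enumerated in a test.
function decideMergeActionSketch(input: MergeInputSketch): MergeActionSketch {
  if (SENSITIVE_SKETCH.has(input.category)) return "hold"; // hold regardless (assumed to win)
  if (input.gemini === "fail") return "no_pr";
  if (input.gemini === "inconclusive" || input.gemini === "not_configured") return "hold";
  if (!input.validatorsPassed) return "hold";
  return input.autoMergeEnabled ? "auto_merge_eligible" : "human_merge";
}

// Enumerate the full (toy) input matrix: 2 × 4 × 2 × 6 = 96 combinations.
function allMergeInputsSketch(): MergeInputSketch[] {
  const verdicts: GeminiVerdictSketch[] = ["pass", "fail", "inconclusive", "not_configured"];
  const categories: ChangeCategorySketch[] = ["code", "docs", "security", "workflow", "dependency", "system_config"];
  const out: MergeInputSketch[] = [];
  for (const autoMergeEnabled of [false, true]) {
    for (const gemini of verdicts) {
      for (const validatorsPassed of [false, true]) {
        for (const category of categories) {
          out.push({ autoMergeEnabled, gemini, validatorsPassed, category });
        }
      }
    }
  }
  return out;
}

// The pinned invariant: with auto-merge disabled, no input is auto-merge eligible.
const violationsSketch = allMergeInputsSketch().filter(
  (i) => !i.autoMergeEnabled && decideMergeActionSketch(i) === "auto_merge_eligible",
);
```

Because the function is export-only in this PR, a sweep like this can prove the invariant without touching any merge execution path.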

Phase 5 — evidence audit ✅

run-summary.md written on ALL four mission exits (done/DAG-failed/held/catch): summary/changed-paths/diff/validators/reviewer/PR-or-gate/cost/HOLDs/artifacts/real-vs-simulated classification; absent inputs say "absent", never fabricated. +12 tests.
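The "absent, never fabricated" rule can be sketched as a pure renderer like the one below. This is an assumption-laden illustration, not the renderer in packages/daemon/src/run-summary.ts: `renderRunSummarySketch`, the `RunSummaryInputSketch` fields, and the exact line layout are hypothetical.

```typescript
interface RunSummaryInputSketch {
  exit: "done" | "failed" | "held" | "threw";
  summary?: string;
  changedPaths?: string[];
  validatorVerdict?: string;
  reviewerVerdict?: string;
  holds?: string[];
}

const ABSENT_SKETCH = "absent";

// A missing or blank value is rendered as an explicit marker, never as a default verdict.
function orAbsentSketch(value: string | undefined): string {
  return value !== undefined && value.trim() !== "" ? value : ABSENT_SKETCH;
}

// undefined means "we do not know"; an empty list means "known to be empty".
function listOrAbsentSketch(values: string[] | undefined): string {
  if (values === undefined) return ABSENT_SKETCH;
  return values.length > 0 ? values.join(", ") : "none";
}

function renderRunSummarySketch(input: RunSummaryInputSketch): string {
  return [
    "# Run summary",
    `exit: ${input.exit}`,
    `summary: ${orAbsentSketch(input.summary)}`,
    `changed paths: ${listOrAbsentSketch(input.changedPaths)}`,
    `validators: ${orAbsentSketch(input.validatorVerdict)}`,
    `reviewer: ${orAbsentSketch(input.reviewerVerdict)}`,
    `holds: ${listOrAbsentSketch(input.holds)}`,
  ].join("\n");
}

// A held mission with almost no data still produces a complete, honest summary.
const heldSummarySketch = renderRunSummarySketch({ exit: "held", holds: ["HOLD-PLANNER-AUTH"] });
```

Distinguishing `undefined` from an empty list is a design choice of this sketch: it keeps "no changed paths" separate from "changed paths were never collected".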

Phase 6 — full 30-min soak ✅ uninterrupted

5/5 PASS: provisioning, no-double-execution, forged-evidence drill (HOLD+freeze+403), idle-zero-credit, per-operator attribution. evidence/fleet-soak/2026-06-11T07-35-14-553Z/.

Phase 2 — honest V6-P3 conclusion (the only HOLD)

REAL Draft PR exists: https://github.com/CTlanston/hermus-agent/pull/4 (operator-produced — the remote-write gate is truly proven). The full cockpit-driven chain + real Gemini verdict artifact remain HOLD-PLANNER-AUTH (operator's claude -p 401). Recovery (incl. tonight's new fallback) documented in evidence/v6/real-proof/.

Remaining HOLD list

HOLD-PLANNER-AUTH — operator: claude login (or AEDEV_PLANNER_FALLBACK=codex), then rerun the cockpit mission → commit gemini-verdict.json + mission-events.jsonl.
Week-long real soak (30-min harness proven; SOAK_OPERATIONS.md ready).
V6-P6 ordinary-user acceptance (needs a real human).

Classification — Real: hermus#4, 30-min soak, 950 tests, browser E2E (real chromium). Simulated: engine sides of E2E/soak (self-labeled). Unproven: cockpit end-to-end real chain + real Gemini verdict (the HOLD's subject). No fabrication anywhere.

Merging per the overnight grant: tests all green, evidence complete, no security/workflow/dependency/config changes, auto-merge policy itself untouched (pure function only).

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ

Generated by Claude Code

…anner fallback, and bilingual recovery text

Failing-first tests for the operator's real failure (claude -p returns 401):
- detectPlannerAuthFailure / plannerFallbackProvider pure-policy contract
- runLocalPlannerText: 401 -> HOLD-PLANNER-AUTH (not generic), no template substitution in real mode, no codex attempt unless AEDEV_PLANNER_FALLBACK=codex
- AEDEV_PLANNER_FALLBACK=codex: ONE read-only codex exec retry recorded as planner_provider 'codex-cli (fallback)' (never pretends it was claude)
- runPlannerMissionDesign: same hold/fallback contract via fenced-JSON parse
- user-state + blocker card: calm bilingual fix text (claude login / /status / AEDEV_PLANNER_FALLBACK=codex), never raw 401 or HOLD- codes

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ

…FALLBACK=codex planner fallback

Fixes the operator's real failure: claude -p returning 401 left the cockpit in a confusing generic HOLD-PLANNER-CLI.
- packages/daemon/src/planner-auth.ts: pure policy. detectPlannerAuthFailure (exitCode!=0 AND /401|unauthorized|auth|credit|login/i over stderr/transcript) and plannerFallbackProvider (only the exact value 'codex' enables fallback).
- runLocalPlannerText / runPlannerMissionDesign (now exported, with injectable PlannerAdapterDeps): auth-looking claude failures emit HOLD-PLANNER-AUTH with the matched hint in the reason; when AEDEV_PLANNER_FALLBACK=codex AND claude fails for any reason, retry ONCE via the local codex CLI (read-only exec, probe contract, prompt on stdin, same fenced-JSON parse), metered via recordHeadlessCall provider 'codex-cli', and recorded honestly as planner_provider 'codex-cli (fallback)', never pretending it was claude. No paid-API fallback ever. If codex also fails → AUTH/CLI hold as before.
- Brainstorm/followup/roadmap paths persist the HOLD-PLANNER-AUTH row with the one-line nextAction fix; the hold message and HOLD-ROADMAP path carry 'claude login' / /status / AEDEV_PLANNER_FALLBACK=codex guidance. No template is ever substituted in real mode (regression-pinned).
- user-state.ts + loop-cards.ts: calm bilingual explanation and recovery actions for HOLD-PLANNER-AUTH; visible text never shows raw 401 or codes.

Gates: pnpm typecheck + lint clean; 907 tests pass (baseline 889 + 18 new).

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
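The pure-policy rules quoted in this commit message can be written down in a few lines. The sketch below follows the stated rule (non-zero exit AND the quoted pattern over stderr/transcript; only the exact value 'codex' enables fallback), but the function names carry a `Sketch` suffix and the argument and return shapes are assumptions, since the real signatures in planner-auth.ts are not shown.

```typescript
interface PlannerRunSketch {
  exitCode: number;
  stderr: string;
  transcript?: string;
}

// Pattern copied from the commit message.
const PLANNER_AUTH_PATTERN_SKETCH = /401|unauthorized|auth|credit|login/i;

// Auth failure = the CLI failed AND its output looks auth-related.
function detectPlannerAuthFailureSketch(run: PlannerRunSketch): { authFailure: boolean; hint?: string } {
  if (run.exitCode === 0) return { authFailure: false };
  const haystack = `${run.stderr}\n${run.transcript ?? ""}`;
  const match = haystack.match(PLANNER_AUTH_PATTERN_SKETCH);
  return match ? { authFailure: true, hint: match[0] } : { authFailure: false };
}

// Only the exact, case-sensitive value "codex" enables the fallback.
function plannerFallbackProviderSketch(env: Record<string, string | undefined>): "codex" | null {
  return env.AEDEV_PLANNER_FALLBACK === "codex" ? "codex" : null;
}
```

Note that the quoted pattern is deliberately broad: any failing run whose output contains "auth" (including inside a longer word) would be classified as an auth failure under this rule. The exit-code guard keeps a successful run that merely mentions "login" from being flagged.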

…vocabulary titles, agent strip, on-card primary action, evidence entries, PR-gate why/who/next

- Visible card titles use the operator vocabulary 理解/计划/构建/验证/合并 (Understand/Plan/Build/Verify/PR·Merge); progress splits VISUALLY into Build (running) vs Verify (evidence_ready/validating/validators_*) via machine.stage. This affects the title only: the internal five types and data-card-type are unchanged.
- Agent strip (cockpit-card-agents, data-active-agent) shows Claude (澄清/规划/审查: clarify/plan/review) · Codex (编码: coding) · Gemini (终审: final review) · GitHub (PR) with the currently-active one highlighted, derived from card type + machine.stage (+ lastActivity.phase fallback for blockers).
- The daemon's primaryAction renders ON the card (cockpit-card-action) via a new onAction prop wired through resolvePrimaryActionHandler, the SAME id→handler map the guidance buttons use; no duplicated logic.
- Blocker recovery_actions render as a list with the recommended action emphasized (cockpit-card-recovery, data-recommended).
- Progress evidence_links and pr_ready files_changed render as clickable-looking read-only entries (cockpit-card-evidence).
- PR-gate decisions show three calm lines 为什么/谁说的/下一步 (why / who said so / next step) (cockpit-card-pr-gate); Gemini vs 安全门 (security gate) is derived from the gate code, raw codes stay in data-* only; HOLD blockers never pretend to be gate decisions.

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
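The Build vs Verify title split described in the first bullet can be sketched as a small pure function over machine.stage. The stage names come from the commit text; `progressCardTitleSketch` and the `null` return for other stages are assumptions, and `validators_passed` in the usage lines is only an illustrative member of the validators_* family.

```typescript
type ProgressTitleSketch = "构建" | "验证"; // Build | Verify

// Title-only mapping: the internal card type is not changed by this function.
function progressCardTitleSketch(stage: string): ProgressTitleSketch | null {
  if (stage === "running") return "构建";
  if (
    stage === "evidence_ready" ||
    stage === "validating" ||
    stage.startsWith("validators_")
  ) {
    return "验证";
  }
  return null; // not a progress stage this sketch knows about
}

const buildTitleSketch = progressCardTitleSketch("running");
const verifyTitleSketch = progressCardTitleSketch("validators_passed");
```

Keeping the split in a title function, separate from the card type, matches the commit's note that data-card-type stays unchanged.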

…e invariant

The on-card action button gets the primary LOOK via .ck-loop-action styling but not the .ck-btn.primary class: the webui quality smoke pins exactly one .ck-btn.primary per stage (the guidance row), and the card mirrors that same single action through the shared handler.

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ

…p; passing browser evidence

- expectLoopCard now also requires the agent strip (cockpit-card-agents) with all four agents and records the active one at every journey step.
- New expectCardAction pins the card's own next-step button (cockpit-card-action) at the plan step (generate-plan/approve-roadmap) and the approve step (start-execution).
- Evidence: user-journey E2E PASS (7/7 steps) and webui quality smoke PASS, both on the final operator-console code.

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ

…to-merges) and the run-summary evidence-audit contract

Phase 4: exhaustive matrix tests for decideMergeAction. autoMergeEnabled=false can never yield auto_merge_eligible; gemini fail -> no_pr; inconclusive / not_configured -> hold; security/workflow/dependency/system_config -> hold regardless.

Phase 5: run-summary.md renderer honesty contract (absent inputs render as absent, never a fabricated PASS; real/simulated/unproven classification) plus mission-runner tests for the happy and held paths.

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ

…run-summary.md evidence audit on every mission exit

Phase 4: docs/product/MERGE_POLICY.md (mature-product action matrix, reconciled with docs/AUTO_MERGE_POLICY.md) + decideMergeAction() in packages/daemon/src/merge-policy-v6.ts. Export-only, NOT wired into any merge execution path; autoMergeEnabled=false (GR#10 human merge only) can never return auto_merge_eligible, which is pinned exhaustively.

Phase 5: packages/daemon/src/run-summary.ts provides pure renderRunSummary / writeRunSummary with summary, changed paths, diff stats, validator and reviewer verdicts, PR-or-gate outcome, cost+headless tallies from the event store, holds, artifacts, and a real/simulated/unproven classification line. Wired into mission-runner on the done/failed/held/threw exits from data already in scope; absent inputs render as explicit absent markers, never a fabricated PASS.

Gates: pnpm typecheck + pnpm lint + full suite green (950 passed, 0 failed; baseline 924 + 26 new tests).

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ

claude and others added 11 commits June 11, 2026 07:36

[overnight-p6] soak-pending status artifact — 30-min soak started (st…

96b8dab

…atus=running) https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ

[overnight-p3] test: pin the operator-console card contract — vocabul…

a4430c9

…ary titles, agent strip, on-card action, evidence entries, PR-gate why/who/next https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ

[overnight-p6] evidence: full uninterrupted 30-minute fleet soak — 5/…

dca6c27

…5 PASS provisioning, no-double-execution, forged-evidence drill (HOLD+freeze+403), idle-zero-credit, per-operator attribution. soak-pending -> complete. https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ

[overnight-p2p7] Honest V6-P3 conclusion: real PR proven (hermus#4), …

1d7edf9

…full chain HOLD-PLANNER-AUTH + recovery https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ

CTlanston marked this pull request as ready for review June 11, 2026 08:29

CTlanston merged commit bb6ad97 into main Jun 11, 2026
4 checks passed

CTlanston merged 11 commits into main from claude/overnight-harness-loop