feat(loop): complete real Harness Loop proof and operator recovery path #52
Merged
…anner fallback, and bilingual recovery text

Failing-first tests for the operator's real failure (claude -p returns 401):
- detectPlannerAuthFailure / plannerFallbackProvider pure-policy contract
- runLocalPlannerText: 401 -> HOLD-PLANNER-AUTH (not generic), no template substitution in real mode, no codex attempt unless AEDEV_PLANNER_FALLBACK=codex
- AEDEV_PLANNER_FALLBACK=codex: ONE read-only codex exec retry recorded as planner_provider 'codex-cli (fallback)' (never pretends it was claude)
- runPlannerMissionDesign: same hold/fallback contract via fenced-JSON parse
- user-state + blocker card: calm bilingual fix text (claude login / /status / AEDEV_PLANNER_FALLBACK=codex), never raw 401 or HOLD- codes

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
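The hold/fallback contract these tests pin can be sketched as follows. This is a minimal illustration, not the PR's code: the `PlannerDeps` shape, the function name `runPlannerWithFallback`, the `claude-cli` provider label, and the injected `looksLikeAuthFailure` predicate are all assumptions, and the sketch is kept synchronous for brevity. Only the hold codes, the `AEDEV_PLANNER_FALLBACK=codex` switch, the single retry, and the `codex-cli (fallback)` label come from the commit message.

```typescript
// Hypothetical shapes; the PR's real PlannerAdapterDeps is not shown here.
interface CliResult {
  exitCode: number;
  stdout: string;
  stderr: string;
}

interface PlannerDeps {
  runClaude: (prompt: string) => CliResult;
  runCodex: (prompt: string) => CliResult; // stands in for the read-only codex exec
  fallbackSetting: string | undefined; // value of AEDEV_PLANNER_FALLBACK
  looksLikeAuthFailure: (result: CliResult) => boolean;
}

type PlannerOutcome =
  | { kind: "ok"; text: string; planner_provider: string }
  | { kind: "hold"; code: "HOLD-PLANNER-AUTH" | "HOLD-PLANNER-CLI" };

function runPlannerWithFallback(prompt: string, deps: PlannerDeps): PlannerOutcome {
  const claude = deps.runClaude(prompt);
  if (claude.exitCode === 0) {
    // "claude-cli" is an assumed label for the primary provider.
    return { kind: "ok", text: claude.stdout, planner_provider: "claude-cli" };
  }

  // Only the exact value "codex" enables the fallback, and it runs at most once.
  if (deps.fallbackSetting === "codex") {
    const codex = deps.runCodex(prompt);
    if (codex.exitCode === 0) {
      // Recorded honestly: the result never claims to come from claude.
      return { kind: "ok", text: codex.stdout, planner_provider: "codex-cli (fallback)" };
    }
  }

  // No template is substituted: a failure always ends in a hold.
  const code = deps.looksLikeAuthFailure(claude) ? "HOLD-PLANNER-AUTH" : "HOLD-PLANNER-CLI";
  return { kind: "hold", code };
}
```

Injecting the CLI runners keeps the decision logic testable without spawning either CLI, which matches the "injectable deps" approach the PR describes.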
…FALLBACK=codex planner fallback

Fixes the operator's real failure: claude -p returning 401 left the cockpit in a confusing generic HOLD-PLANNER-CLI.
- packages/daemon/src/planner-auth.ts: pure policy. detectPlannerAuthFailure (exitCode!=0 AND /401|unauthorized|auth|credit|login/i over stderr/transcript) and plannerFallbackProvider (only the exact value 'codex' enables fallback).
- runLocalPlannerText / runPlannerMissionDesign (now exported, with injectable PlannerAdapterDeps): auth-looking claude failures emit HOLD-PLANNER-AUTH with the matched hint in the reason; when AEDEV_PLANNER_FALLBACK=codex AND claude fails for any reason, retry ONCE via the local codex CLI (read-only exec, probe contract, prompt on stdin, same fenced-JSON parse), metered via recordHeadlessCall provider 'codex-cli', and recorded honestly as planner_provider 'codex-cli (fallback)', never pretending it was claude. No paid-API fallback ever. If codex also fails → AUTH/CLI hold as before.
- Brainstorm/followup/roadmap paths persist the HOLD-PLANNER-AUTH row with the one-line nextAction fix; the hold message and HOLD-ROADMAP path carry 'claude login' / /status / AEDEV_PLANNER_FALLBACK=codex guidance. No template is ever substituted in real mode (regression-pinned).
- user-state.ts + loop-cards.ts: calm bilingual explanation and recovery actions for HOLD-PLANNER-AUTH; visible text never shows raw 401 or codes.

Gates: pnpm typecheck + lint clean; 907 tests pass (baseline 889 + 18 new).

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
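A minimal sketch of the two pure policy functions, assuming the rule stated in the commit message (non-zero exit code AND a match of `/401|unauthorized|auth|credit|login/i` over stderr/transcript; only the exact value `'codex'` enables fallback). The parameter and return types below are assumptions, since the real signatures in planner-auth.ts are not shown.

```typescript
// Assumed input shape; the real type in planner-auth.ts may differ.
interface PlannerRunResult {
  exitCode: number;
  stderr: string;
  transcript: string;
}

// Pattern quoted from the commit message.
const AUTH_HINT = /401|unauthorized|auth|credit|login/i;

// Returns the matched hint (so it can be carried in the hold reason),
// or null when the failure does not look auth-related.
function detectPlannerAuthFailure(result: PlannerRunResult): string | null {
  if (result.exitCode === 0) return null; // a successful run is never an auth failure
  const match = AUTH_HINT.exec(`${result.stderr}\n${result.transcript}`);
  return match ? match[0] : null;
}

// Only the exact string "codex" opts in; any other value disables fallback.
function plannerFallbackProvider(value: string | undefined): "codex" | null {
  return value === "codex" ? "codex" : null;
}
```

Because both functions are pure, the environment variable is passed in as a plain value rather than read inside the function, which keeps the policy easy to pin in tests.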
…ary titles, agent strip, on-card action, evidence entries, PR-gate why/who/next https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
…vocabulary titles, agent strip, on-card primary action, evidence entries, PR-gate why/who/next

- Visible card titles use the operator vocabulary 理解/计划/构建/验证/合并 (Understand/Plan/Build/Verify/PR·Merge); progress splits VISUALLY into Build (running) vs Verify (evidence_ready/validating/validators_*) via machine.stage. This affects the title only; the internal five types and data-card-type are unchanged.
- Agent strip (cockpit-card-agents, data-active-agent) shows Claude (澄清/规划/审查: clarify/plan/review) · Codex (编码: coding) · Gemini (终审: final review) · GitHub (PR) with the currently-active one highlighted, derived from card type + machine.stage (+ lastActivity.phase fallback for blockers).
- The daemon's primaryAction renders ON the card (cockpit-card-action) via a new onAction prop wired through resolvePrimaryActionHandler, the SAME id→handler map the guidance buttons use; no duplicated logic.
- Blocker recovery_actions render as a list with the recommended action emphasized (cockpit-card-recovery, data-recommended).
- Progress evidence_links and pr_ready files_changed render as clickable-looking read-only entries (cockpit-card-evidence).
- PR-gate decisions show three calm lines 为什么/谁说的/下一步 (why/who said so/next step) (cockpit-card-pr-gate); Gemini vs 安全门 (security gate) derived from the gate code, raw codes stay in data-* only; HOLD blockers never pretend to be gate decisions.

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
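The Build/Verify title split for progress cards might look like the sketch below. The stage names `running`, `evidence_ready`, `validating`, and the `validators_` prefix come from the commit message; the function names, the `CardTitle` shape, and the choice to fall back to Build for any other stage are assumptions made for illustration.

```typescript
interface CardTitle {
  zh: string;
  en: string;
}

const BUILD: CardTitle = { zh: "构建", en: "Build" };
const VERIFY: CardTitle = { zh: "验证", en: "Verify" };

// Stages that the commit message assigns to the Verify title.
function isVerifyStage(stage: string): boolean {
  return (
    stage === "evidence_ready" ||
    stage === "validating" ||
    stage.startsWith("validators_")
  );
}

// Title only: the card's internal type is not changed by this mapping.
// Assumption: every non-Verify stage (including "running") shows Build.
function progressCardTitle(stage: string): CardTitle {
  return isVerifyStage(stage) ? VERIFY : BUILD;
}
```

Keeping the split in a pure title function is one way to honor the "title only" constraint, since nothing that reads the card type ever sees the distinction.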
…5 PASS provisioning, no-double-execution, forged-evidence drill (HOLD+freeze+403), idle-zero-credit, per-operator attribution. soak-pending -> complete. https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
…e invariant

The on-card action button gets the primary LOOK via .ck-loop-action styling but not the .ck-btn.primary class: the webui quality smoke pins exactly one .ck-btn.primary per stage (the guidance row), and the card mirrors that same single action through the shared handler.

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
…p; passing browser evidence

- expectLoopCard now also requires the agent strip (cockpit-card-agents) with all four agents and records the active one at every journey step.
- New expectCardAction pins the card's own next-step button (cockpit-card-action) at the plan step (generate-plan/approve-roadmap) and the approve step (start-execution).
- Evidence: user-journey E2E PASS (7/7 steps) and webui quality smoke PASS, both on the final operator-console code.

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
…to-merges) and the run-summary evidence-audit contract

Phase 4: exhaustive matrix tests for decideMergeAction (autoMergeEnabled=false can never yield auto_merge_eligible; gemini fail -> no_pr; inconclusive / not_configured -> hold; security/workflow/dependency/system_config -> hold regardless).

Phase 5: run-summary.md renderer honesty contract (absent inputs render as absent, never a fabricated PASS; real/simulated/unproven classification) plus mission-runner tests for the happy and held paths.

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
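The matrix these tests describe can be illustrated with a small pure function. This is a sketch under stated assumptions: the input shape, the `ordinary` category, the `human_merge` action name, and the precedence of the sensitive-category check over the Gemini verdict are all hypothetical. Only `auto_merge_eligible`, `no_pr`, `hold`, the four sensitive categories, and the verdict names come from the commit message.

```typescript
type GeminiVerdict = "pass" | "fail" | "inconclusive" | "not_configured";
// "ordinary" is a hypothetical label for changes outside the sensitive categories.
type ChangeCategory = "security" | "workflow" | "dependency" | "system_config" | "ordinary";
// "human_merge" is a hypothetical name for the non-auto outcome.
type MergeAction = "auto_merge_eligible" | "human_merge" | "hold" | "no_pr";

interface MergeInput {
  autoMergeEnabled: boolean;
  gemini: GeminiVerdict;
  categories: ChangeCategory[];
}

const SENSITIVE: ChangeCategory[] = ["security", "workflow", "dependency", "system_config"];

function decideMergeAction(input: MergeInput): MergeAction {
  // Sensitive categories hold regardless (checked first here by assumption).
  if (input.categories.some((c) => SENSITIVE.indexOf(c) !== -1)) return "hold";
  if (input.gemini === "fail") return "no_pr";
  if (input.gemini === "inconclusive" || input.gemini === "not_configured") return "hold";
  // Auto-merge eligibility is reachable only when the flag is on.
  return input.autoMergeEnabled ? "auto_merge_eligible" : "human_merge";
}
```

Because the flag is consulted in exactly one place, a sweep over every input combination with `autoMergeEnabled: false` is enough to show that `auto_merge_eligible` is unreachable in that mode.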
…run-summary.md evidence audit on every mission exit

Phase 4: docs/product/MERGE_POLICY.md (mature-product action matrix, reconciled with docs/AUTO_MERGE_POLICY.md) + decideMergeAction() in packages/daemon/src/merge-policy-v6.ts. Export-only, NOT wired into any merge execution path; autoMergeEnabled=false (GR#10 human merge only) can never return auto_merge_eligible (pinned exhaustively).

Phase 5: packages/daemon/src/run-summary.ts adds pure renderRunSummary / writeRunSummary with summary, changed paths, diff stats, validator and reviewer verdicts, PR-or-gate outcome, cost+headless tallies from the event store, holds, artifacts, and a real/simulated/unproven classification line. Wired into mission-runner on the done/failed/held/threw exits from data already in scope; absent inputs render as explicit absent markers, never a fabricated PASS.

Gates: pnpm typecheck + pnpm lint + full suite green (950 passed, 0 failed; baseline 924 + 26 new tests).

https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
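The "absent inputs render as absent" rule can be sketched with a reduced renderer. The field names, the output layout, and the literal marker text `absent` below are assumptions; the real renderRunSummary covers more sections (diff stats, PR-or-gate outcome, cost tallies, holds, artifacts) than this illustration does.

```typescript
type Verdict = "PASS" | "FAIL";
type Classification = "real" | "simulated" | "unproven";

// Hypothetical, reduced input: every evidence field is optional on purpose.
interface RunSummaryInput {
  summary?: string;
  changedPaths?: string[];
  validatorVerdict?: Verdict;
  reviewerVerdict?: Verdict;
  classification: Classification;
}

const ABSENT = "absent";

// A missing value is reported as missing; it is never defaulted to PASS.
function orAbsent(value: string | undefined): string {
  return value === undefined || value === "" ? ABSENT : value;
}

function renderRunSummary(input: RunSummaryInput): string {
  const paths =
    input.changedPaths === undefined
      ? ABSENT
      : input.changedPaths.length === 0
        ? "none"
        : input.changedPaths.join(", ");
  return [
    "# Run summary",
    `Summary: ${orAbsent(input.summary)}`,
    `Changed paths: ${paths}`,
    `Validators: ${orAbsent(input.validatorVerdict)}`,
    `Reviewer: ${orAbsent(input.reviewerVerdict)}`,
    `Classification: ${input.classification}`,
  ].join("\n");
}
```

Distinguishing "not provided" (`absent`) from "provided but empty" (`none`) keeps a held or crashed mission from looking like a clean run with no changes.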
…full chain HOLD-PLANNER-AUTH + recovery https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
Overnight mission — complete real Harness Loop proof and operator recovery path
All cloud-executable phases done in one night, every commit gated. 950 passed / 0 failed (+61 over baseline).
Phase 1 — planner auth recovery ✅
HOLD-PLANNER-AUTH (distinct from the generic CLI hold, with calm bilingual text: "运行 claude login…", i.e. "Run claude login…"); opt-in AEDEV_PLANNER_FALLBACK=codex retries once via codex exec; events record codex-cli (fallback), never impersonation, never paid API; template never impersonates a real planner (regression-pinned). +18 tests.
Phase 3 — operator console ✅
Cards speak operator vocabulary (理解/计划/构建/验证/合并, i.e. Understand/Plan/Build/Verify/PR·Merge; Build↔Verify split visually), agent strip with active highlight (Claude/Codex/Gemini/GitHub), next-step button ON the card (single shared handler map), recovery list with recommended emphasis, evidence entries, PR-gate 为什么/谁说的/下一步 (why/who said so/next step). User-E2E 7/7 PASS + quality smoke, evidence committed. +17 tests.
Phase 4 — merge policy ✅ (no behavior flip)
docs/product/MERGE_POLICY.md + pure decideMergeAction; exhaustive 864-combination sweep proves autoMergeEnabled=false (GR#10) never yields auto-merge; gemini fail→no_pr; inconclusive→hold; security/workflow/dependency/config→hold always. +14 tests.
Phase 5 — evidence audit ✅
run-summary.md written on ALL four mission exits (done/DAG-failed/held/catch): summary/changed-paths/diff/validators/reviewer/PR-or-gate/cost/HOLDs/artifacts/real-vs-simulated classification; absent inputs say "absent", never fabricated. +12 tests.
Phase 6 — full 30-min soak ✅ uninterrupted
5/5 PASS: provisioning, no-double-execution, forged-evidence drill (HOLD+freeze+403), idle-zero-credit, per-operator attribution.
evidence/fleet-soak/2026-06-11T07-35-14-553Z/.
Phase 2 — honest V6-P3 conclusion (the only HOLD)
REAL Draft PR exists: https://github.com/CTlanston/hermus-agent/pull/4 (operator-produced; the remote-write gate is truly proven). The full cockpit-driven chain + real Gemini verdict artifact remain HOLD-PLANNER-AUTH (operator's claude -p 401). Recovery (incl. tonight's new fallback) documented in evidence/v6/real-proof/.
Remaining HOLD list
- HOLD-PLANNER-AUTH: the operator runs claude login (or sets AEDEV_PLANNER_FALLBACK=codex), then reruns the cockpit mission → commit gemini-verdict.json + mission-events.jsonl.
- … SOAK_OPERATIONS.md ready).
Classification
Real: hermus#4, 30-min soak, 950 tests, browser E2E (real chromium). Simulated: engine sides of E2E/soak (self-labeled). Unproven: cockpit end-to-end real chain + real Gemini verdict (the HOLD's subject). No fabrication anywhere.
Merging per the overnight grant: tests all green, evidence complete, no security/workflow/dependency/config changes, auto-merge policy itself untouched (pure function only).
https://claude.ai/code/session_01AgdV9SKZZP6JbyTBo2gAWZ
Generated by Claude Code