Skip to content

fix: harden no-mistakes validation contract#97

Merged
kunchenguid merged 3 commits into
mainfrom
fm/nm-brief-contract-c4
Jun 26, 2026
Merged

fix: harden no-mistakes validation contract#97
kunchenguid merged 3 commits into
mainfrom
fm/nm-brief-contract-c4

Conversation

@kunchenguid

Copy link
Copy Markdown
Owner

Intent

Harden firstmate's no-mistakes validation contract to stop two related crewmate failure modes when driving the no-mistakes pipeline: (A) the gate-park deadlock - idle-waiting at a parked gate expecting it to auto-advance - and (B) self-fix duplication - a crewmate aborting an active run and hand-implementing fixes, including fixes for real bugs review found in its OWN code, instead of letting the pipeline apply them via 'axi respond'. Both stem from the contract saying what NOT to do without naming the right move at the gate.

This is a wording/contract change only: no bin/ script logic beyond the brief text, and no changes to the no-mistakes tool itself (separate repo).

Changes:

  1. bin/fm-brief.sh - rewrote the no-mistakes validation block scaffolded into every ship brief so a crewmate knows the right move at each gate: the pipeline owns every fix (auto-fix findings AND the fix for a real bug review finds in the crewmate's own code), applied on the crewmate's branch in the pipeline's own worktree; never hand-edit/git commit/reset/checkout/abort/re-run during an active run (that duplicates the pipeline and forces a full re-validation); the ask-user loop end to end (escalate to firstmate, stop, then feed the decision back via 'axi respond' and let the pipeline apply it - do not hand-implement the decided fix or abort to go fix it); process every return (a run legitimately takes many minutes through test/CI/fix rounds, so quiet is working not stalled - backgrounding the call is fine, but on a returned gate respond and loop until an outcome, never idle-wait for auto-advance); avoid --yes (it silently auto-resolves ALL findings including ask-user with zero escalation); and a factual correction - review auto-fix is disabled (limit 0) so review findings ALWAYS gate (the prior wording wrongly implied the pipeline self-fixes review findings).
  2. AGENTS.md - added a supervisor heuristic in the Validate flow so firstmate keys off the no-mistakes run step status, not shell liveness: running/fixing/ci = working (leave alone); awaiting_approval/fix_review = parked waiting on the agent (surfaced as the 'awaiting_agent: parked ' field on 'axi status' from the merged park-signal work), so the crewmate owes a respond; plus a self-fix-duplication red flag (a validating crewmate making fresh hand-commits, aborting, or re-running mid-validation). Also reconciled the section 8 sentence that previously said a crewmate 'never backgrounds' its own validation, which now contradicted the new brief - it now says the crewmate drives the gate loop synchronously and processes every return, never idle-waiting.

Deliberate decisions a reviewer reading only the diff would not know: I intentionally steered the contract AWAY from --yes even though an earlier internal design investigation recommended --yes as a robust deadlock-proof default, because --yes bypasses the ask-user human-escalation contract and the captain wants the human in the loop on genuine product/intent decisions. The wording is grounded in an internal run-model investigation of the no-mistakes axi run/respond/gate behavior (return-vs-block semantics, --yes resolving ask-user findings, the unbounded approval-gate wait) and uses the actual merged field name 'awaiting_agent: parked '. This resolves the queued gate-park-deadlock item and the self-fix-duplication failure mode.

What Changed

  • Tightens the no-mistakes ship brief validation contract so crewmates route all gate findings through the pipeline/respond loop instead of hand-editing, aborting, or re-running active validation.
  • Documents the expected supervisor read on no-mistakes gate states, including parked agent waits and self-fix duplication as a red flag.
  • Updates contributor guidance to match the revised validation flow for captain escalation and pipeline-owned fixes.

Risk Assessment

✅ Low: Captain, the change is a narrow instruction-only update with no substantiated merge-blocking risks found in the changed contract text.

Testing

Exercised the existing brief-generation regression test and an evidence-producing end-to-end brief scaffold; the generated markdown and supervisor contract show the intended gate-driving behavior, with no UI surface involved.

Evidence: Generated no-mistakes ship brief
You are a crewmate: an autonomous worker agent managed by firstmate. Work on your own; do not wait for a human.

# Task
{TASK}

# Setup
You are in a disposable git worktree of alpha, at a detached HEAD on a clean default branch.

**Verify isolation before anything else.** Run `pwd -P` and `git rev-parse --show-toplevel`; both must resolve to the disposable treehouse worktree you were launched in, typically a path under a `.treehouse/` pool, not the primary checkout firstmate operates from.
The path check is authoritative: `git rev-parse --git-dir` and `git rev-parse --git-common-dir` can help inspect the repo, but they do not prove you are outside the primary checkout.
If the top-level path is the primary checkout or not the worktree you were launched in, STOP - do not branch or commit here - append `blocked: launched in primary checkout, not an isolated worktree` to the status file and stop.

1. First action: create your branch: `git checkout -b fm/gate-contract-e2e`
2. Run `no-mistakes doctor`; if it reports the repo is not initialized here, run `no-mistakes init`.

# Rules
1. Never push to the default branch. Never merge a PR.
2. Stay inside this worktree; modify nothing outside it.
3. Use gh-axi for GitHub operations and chrome-devtools-axi for browser operations.
4. Report status by appending one line:
   `echo "{state}: {one short line}" >> '/var/folders/5x/4nqprlbx0518k3ybcb1sz6gr0000gn/T/no-mistakes-evidence/01KW2TARPF8J51M5T5VY2B6VPF/fm-brief-contract-home/state/gate-contract-e2e.status'`
   States: working, needs-decision, blocked, done, failed.
   Each append wakes firstmate, so report sparingly: only phase changes a supervisor
   would act on (setup done, bug reproduced, fix implemented, validation passed) and the
   needs-decision/blocked/done/failed states. No step-by-step FYI progress lines;
   firstmate reads your pane for that.
5. If you hit the same obstacle twice, append `blocked: {why}` and stop; firstmate will help.
6. If a decision belongs to a human (product choices, destructive actions, ask-user findings),
   append `needs-decision: {summary of options}` and stop. Firstmate will reply with the decision.

# Project memory
If `AGENTS.md` or `CLAUDE.md` already exists, or if this task produced durable project-intrinsic knowledge, run `/Users/kunchen/.no-mistakes/worktrees/016d88035d58/01KW2TARPF8J51M5T5VY2B6VPF/bin/fm-ensure-agents-md.sh .` in the worktree.
If this task produced durable project-intrinsic knowledge, record it in `AGENTS.md` as part of your change.
Keep it proportionate: skip `AGENTS.md` edits for trivial tasks that produced no durable project knowledge.

# Definition of done
The task is complete only when committed on your branch.
When you believe it is complete, append `done: {summary}` to the status file and stop.
Firstmate will then instruct you to run /no-mistakes to validate and ship a PR.

During validation the pipeline owns every fix; you only drive the gates. Once a run is active, every fix - both auto-fix findings and the fix for a real bug the review finds in your own code - is applied by the pipeline on your branch, in its own worktree. Never hand-edit, `git commit`, `git reset`/`git checkout`, abort, or re-run while a run is active: doing so duplicates the pipeline's work and forces a full re-validation. You advance the work only by responding to gates:
- Process every return; never idle-wait. `no-mistakes axi run` / `axi respond` return either at a gate (a `gate:` object) or at a terminal or CI-ready outcome (no `gate:`). A run legitimately runs long - test, CI, and each fix round take many minutes - so a quiet call is working, not stalled. Backgrounding the call is fine; idle-waiting for the run to advance on its own is not, because it never advances past a gate by itself. Read every return: on a `gate:`, respond, and loop until you reach an outcome.
- Auto-fix findings: advance the gate with `no-mistakes axi respond --action fix --findings <ids>`; the pipeline applies the fix on your branch and re-reviews. You never apply it yourself.
- Review findings always gate. Review auto-fix is disabled, so every actionable review finding parks for your response instead of being self-fixed - drive it like any other gate.
- ask-user findings: escalate to firstmate (rule 6) and stop. When the decision comes back, feed it to the gate with `no-mistakes axi respond` (the `--action`/answer the decision implies) and let the pipeline apply it. Even when it is a real bug in your own code, do NOT implement the decided fix yourself and do NOT abort to go fix it - the pipeline applies it from your `respond`.
- Avoid `--yes`: it silently auto-resolves every finding, including `ask-user`, with zero escalation, so a decision the captain should make gets resolved without them. Drive gates manually and escalate `ask-user` findings.

After /no-mistakes reports CI green, append `done: PR {url} checks green` and stop. You are finished.
Evidence: Brief contract and supervisor heuristic evidence
Generated no-mistakes ship brief evidence

Command output:
scaffolded: /var/folders/5x/4nqprlbx0518k3ybcb1sz6gr0000gn/T/no-mistakes-evidence/01KW2TARPF8J51M5T5VY2B6VPF/fm-brief-contract-home/data/gate-contract-e2e/brief.md (ship, mode=no-mistakes; replace {TASK})

Assertions checked:
- Generated no-mistakes ship brief says the pipeline owns every fix, including real bugs found in the crewmate own code.
- Generated brief forbids hand-edit, git commit, git reset/git checkout, abort, and re-run while a run is active.
- Generated brief tells the crewmate to process every gate return, permits backgrounding the long call, and forbids idle-waiting for parked gates.
- Generated brief says review auto-fix is disabled and warns against --yes because it resolves ask-user without escalation.
- AGENTS.md supervisor guidance keys off run step status and names awaiting_agent: parked <duration> plus the self-fix duplication red flag.

Generated brief path:
/var/folders/5x/4nqprlbx0518k3ybcb1sz6gr0000gn/T/no-mistakes-evidence/01KW2TARPF8J51M5T5VY2B6VPF/fm-brief-contract-home/data/gate-contract-e2e/brief.md

Generated brief validation contract excerpt:
During validation the pipeline owns every fix; you only drive the gates. Once a run is active, every fix - both auto-fix findings and the fix for a real bug the review finds in your own code - is applied by the pipeline on your branch, in its own worktree. Never hand-edit, `git commit`, `git reset`/`git checkout`, abort, or re-run while a run is active: doing so duplicates the pipeline's work and forces a full re-validation. You advance the work only by responding to gates:
- Process every return; never idle-wait. `no-mistakes axi run` / `axi respond` return either at a gate (a `gate:` object) or at a terminal or CI-ready outcome (no `gate:`). A run legitimately runs long - test, CI, and each fix round take many minutes - so a quiet call is working, not stalled. Backgrounding the call is fine; idle-waiting for the run to advance on its own is not, because it never advances past a gate by itself. Read every return: on a `gate:`, respond, and loop until you reach an outcome.
- Auto-fix findings: advance the gate with `no-mistakes axi respond --action fix --findings <ids>`; the pipeline applies the fix on your branch and re-reviews. You never apply it yourself.
- Review findings always gate. Review auto-fix is disabled, so every actionable review finding parks for your response instead of being self-fixed - drive it like any other gate.
- ask-user findings: escalate to firstmate (rule 6) and stop. When the decision comes back, feed it to the gate with `no-mistakes axi respond` (the `--action`/answer the decision implies) and let the pipeline apply it. Even when it is a real bug in your own code, do NOT implement the decided fix yourself and do NOT abort to go fix it - the pipeline applies it from your `respond`.
- Avoid `--yes`: it silently auto-resolves every finding, including `ask-user`, with zero escalation, so a decision the captain should make gets resolved without them. Drive gates manually and escalate `ask-user` findings.

After /no-mistakes reports CI green, append `done: PR {url} checks green` and stop. You are finished.

Supervisor heuristic excerpt from AGENTS.md:
For `no-mistakes`-mode ship tasks, when a crewmate's status says `done`, trigger validation using the crew's harness from `state/<id>.meta`.
Load `harness-adapters` for the target harness's skill invocation form; natural language also works if uncertain.

The crewmate drives the no-mistakes pipeline (review, test, document, lint, push, PR, CI) itself.
The no-mistakes pipeline fixes auto-fix findings on its own (inside its own worktree); the crewmate advances each gate with `no-mistakes axi respond`, and must never edit or commit code while a run is active.
When it reports `needs-decision` (ask-user findings), relay the findings to the captain unless `yolo=on` permits routine approval on your judgment, then send the decision back as a short instruction (the crewmate responds via `no-mistakes axi respond`).
Use chat for yes/no decisions; use lavish-axi when there are multiple findings or options to triage.

Judge a validating crewmate by the run's step status, never by whether its shell is still running; read it cheaply with `no-mistakes axi status`.

- `running`/`fixing`/`ci` - the pipeline is working (a fix round, a test, or CI monitoring); these run for many minutes and quiet is normal, so leave it alone.
- `awaiting_approval`/`fix_review` - the run is parked waiting on the agent, surfaced as a top-level `awaiting_agent: parked <duration>` line right after `status:` in `axi status`. The crewmate owes a `respond`; if it is idle-waiting for the run to advance on its own, steer it to drive the gate, because a parked gate never self-resolves.
- Red flag - self-fix duplication: a validating crewmate making fresh hand-commits, aborting the run, or re-running it mid-validation is re-doing work the pipeline already owns. Steer it back to respond-only: the pipeline applies every fix on the branch from `axi respond`, including the fix for a real bug the review found in the crewmate's own code, and hand-fixing forces a full re-validation.

### PR ready

For PR-based ship tasks, the ready signal depends on mode: `no-mistakes` reports `done: PR <url> checks green` after CI is green, while `direct-PR` reports `done: PR <url>` after opening the PR.

Long-command guidance excerpt from AGENTS.md:
Watcher liveness is not enough if you are foreground-blocked.
Whenever one or more tasks are in flight, do not run long foreground-blocking operations in your own session.
This is about firstmate's own session: it includes a no-mistakes pipeline firstmate runs for this repo, long builds, and any other multi-minute command.
Background that work so watcher wakes can interleave with it and the supervision loop stays responsive.
A crewmate driving its own `no-mistakes` validation does the opposite: it drives that gate loop synchronously and processes every return, never idle-waiting for its own validation run to advance on its own.

Pipeline

Updates from git push no-mistakes

✅ **intent** - passed

✅ No issues found.

✅ **Rebase** - passed

✅ No issues found.

✅ **Review** - passed

✅ No issues found.

✅ **Test** - passed

✅ No issues found.

  • tests/fm-tangle-guard.test.sh
  • Generated a real no-mistakes-mode ship brief with FM_HOME=/var/folders/5x/4nqprlbx0518k3ybcb1sz6gr0000gn/T/no-mistakes-evidence/01KW2TARPF8J51M5T5VY2B6VPF/fm-brief-contract-home bin/fm-brief.sh gate-contract-e2e alpha and asserted the new validation contract text with focused grep checks.
  • Checked AGENTS.md with focused grep assertions for running/fixing/ci, awaiting_approval/fix_review, awaiting_agent: parked &lt;duration&gt;, and Red flag - self-fix duplication, plus absence of the old never backgrounding or idle-waiting contradiction.
✅ **Document** - passed

✅ No issues found.

✅ **Lint** - passed

✅ No issues found.

✅ **Push** - passed

✅ No issues found.

…f-fix duplication

Rewrite the no-mistakes validation block scaffolded into every ship brief so a
crewmate knows the right move at each gate: the pipeline owns every fix
(including the fix for a real bug review finds in the crewmate's own code);
respond, never self-implement/abort/re-run; the ask-user loop end to end (feed
the decision back via axi respond, do not hand-fix); process every return
(backgrounding ok, never idle-wait for auto-advance); avoid --yes (it
auto-resolves ask-user with zero escalation); and review findings always gate
(review auto-fix is disabled).

Add the supervisor heuristic in AGENTS.md: firstmate keys off the no-mistakes
run step status - running/fixing/ci means working, awaiting_approval/fix_review
means parked (surfaced as awaiting_agent: parked <duration> on axi status) - not
shell liveness, plus a self-fix-duplication red flag (hand-commits/abort/re-run
mid-validation).

Resolves the gate-park-deadlock and self-fix-duplication failure modes.
@kunchenguid kunchenguid merged commit fe3c867 into main Jun 26, 2026
4 checks passed
@kunchenguid kunchenguid deleted the fm/nm-brief-contract-c4 branch June 26, 2026 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant