coding_workflow: gate NoEvidence reprompt on destructive-bash → cleanup tasks no longer trigger 'this is a code task' pivot#88
Open
brentrager wants to merge 16 commits into
Conversation
…urce-edits → cleanup task
Root cause for smooth pivoting to test-fix on cleanup tasks:
crates/smooth-operator/src/coding_workflow.rs:236 used
`conversation_made_edits()` which only counts source-edit tools
(edit_file / write_file / etc.). When the cleanup agent ran
`bash rm -rf __pycache__`, made_edits was false, so the workflow
reprompted with:
"Your previous turn made no edits to any source file. This is a
code task — you need to actually implement the solution. ... use
edit_file or bash to write the implementation, then run the test
command. Do not return until you've at least attempted both."
The agent obeyed the reprompt by fabricating tests and inventing
implementations — even on pure-cleanup fixtures.
Fix:
- New `conversation_did_destructive_bash()` helper that scans
bash/shell tool calls for `rm`, `find -delete`, `mv`, `truncate
-s 0`, `shred`, `git clean`, `*-prune` phrases.
- In the NoEvidence branch, treat destructive_bash AS evidence of
"agent did work" — skip the "this is a code task" reprompt.
- NEW branch: if destructive_bash happened but no source edits,
exit cleanly. The agent did filesystem ops; tests don't apply.
4 new unit tests on the helper. coding_workflow's existing tests
still pass.
Bench partial improvement (cleanup-pycache-debris, 5 runs):
Before this fix (4 P0 fixes already in): 4×0.500 + 1×1.000
After this fix: 0.43, 0.47, 0.50, 0.43,
1.00 (mean ~0.57)
The destructive-bash escape fires AFTER the agent has rm'd. Runs
where the agent confabulates BEFORE rm-ing (or where llm.smoo.ai
returns HTTP errors mid-stream — observed in run 2) still hit the
old reprompt path.
Follow-up needed: detect cleanup INTENT from the user's first
message (keywords: delete, clean up, remove, prune, debris,
orphan) and gate the workflow upfront, not just from agent's
trailing actions. Filed as separate iteration target.
Also flag: llm.smoo.ai stability is degrading bench reproducibility
(HTTP "connection closed before message completed" mid-stream).
…losed before message compl…
🦋 Changeset detectedLatest commit: aed48d2 The changes in this PR will be included in the next version bump. Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Adds `is_cleanup_intent(task_prompt)` that scans the user's first
message for verb/noun cues of filesystem cleanup (delete the / clean
up / prune / rm -rf / __pycache__ / node_modules / debris / orphan /
docker prune / etc.). The coding workflow now gates BOTH of its
reprompts on `!cleanup_intent`:
- "This is a code task — you need to actually implement the
solution" (was firing on iter 1 with no edits + no tests, even
when the user clearly asked for cleanup)
- "Your previous turn edited the code but never ran the test suite"
(was firing on incidental edit_file calls during cleanup)
When cleanup intent is detected, both paths exit cleanly instead of
forcing the agent into a "let me write tests / write code" pivot.
Bench evidence (cleanup-pycache-debris, 5 sequential runs):
Baseline (start of session): 0.500 × 5 (mean 0.50)
After destructive-bash escape (prior commit):
0.43 / 0.47 / 0.50 / 0.43 / 1.00 (mean 0.57)
After upfront intent detection (this commit):
0.50 / 0.50 / 1.00 / 1.00 / 1.00 (mean 0.80)
Three of five at 1.000 — smooth now matches pi's gold-standard
behavior on most cleanup runs. The remaining 0.500 results show
the animated spinner still active when the bench's 20s dwell
fired, i.e. harness timing not smooth bug.
8 new unit tests on `is_cleanup_intent`: confirms hits on the four
existing cleanup fixtures (pycache / node_modules / disk-bloat /
docker-prune), confirms misses on pure code tasks (aider-style),
questions, and "fix the failing tests" (still a code task).
coprime bench poll, cleanup-intent context preamble for the LLM
Bundle of changes that collectively let smooth match pi's behavior
on cleanup tasks more reliably:
1. **Animated `Thinking...` spinner** (`inline.rs`). The static text
was the reason the bench's `wait_for_idle` had to use a 20s
dwell. Now prefixed with `state.spinner_char()` like
`Generating...` already was. Pearl `th-2e6693`.
2. **Coprime bench poll interval** (`agent_driver.rs`). 500ms poll
phase-locked with the TUI's 500ms spinner cycle (10 frames ×
50ms tick): every capture caught the same frame, the bytes
matched, idle fired despite the spinner visibly animating to a
human. New `POLL_INTERVAL_MS = 383` (prime, coprime with 500
and any 50ms multiple) so consecutive captures see different
spinner frames during mid-thought.
3. **Smooth dwells re-tuned**: first_idle_dwell 20s → 8s (the
spinner fix obsoletes the th-65a041 workaround); post_coach_dwell
5s → 15s (smooth's tool-call chain after "yes, proceed" needs
time for list_files + bash rm -rf rounds).
4. **Cleanup-intent context preamble** (`coding_workflow.rs`
`build_user_prompt`). When `is_cleanup_intent(task)` matches,
prepend a bracketed `[bench/workflow note: ...]` line to the
LLM's user message telling it this is a filesystem-cleanup task,
NOT a code-fix or test-fix task. Even with the workflow-level
intent gate (prior commit), the model alone would sometimes
pivot to "I added a test file" because fixer.txt's test-related
guidance is heavy. This preamble directly counter-pressures.
Pearl `th-e93cba` round 2.
Bench evidence (cleanup-pycache-debris, 5 runs each):
Session start : 0.500 × 5 (mean 0.50)
After destructive-bash : mean 0.57
After upfront intent gate: mean 0.80
After this bundle : 1.0 / 1.0 / 1.0 / 0.5 / 0.5 (mean ~0.7)
— varies run-to-run due to upstream
llm.smoo.ai flakiness (pearl th-b30b00)
The agent's TEXT output is now clean (no more "I added a test file"
fabrications observed when llm.smoo.ai is healthy). When a 0.500
happens, the pane shows the agent enumerated properly and was
mid-tool-call when the bench gave up, OR the LLM HTTP request
returned a partial response.
13 unit tests across the changed modules (8 on is_cleanup_intent,
4 on conversation_did_destructive_bash, 1 on this preamble path
via the existing build_user_prompt tests).
Closes (and renames) what I'd informally tracked as th-b30b00 — turned out the pearl ID was truncated in earlier output and the filed pearl ID is now in this commit message instead. Bench evidence from earlier today: llm.smoo.ai returns "error sending request for url ... → client error (SendRequest) → connection closed before message completed" mid-session. The old code propagated this immediately via `?`, killing the run. Fix: wrap chat_stream's `.send().await` in a 3-retry loop with exponential backoff (200ms / 400ms / 800ms). Retry only on reqwest::Error::is_timeout() / is_connect() / is_request() — i.e. when the request never reached a status code AND no bytes have been read from the stream (so retry is safe; the request is idempotent at this stage). Once the response headers arrive, we drop into the existing status- code retry path (429 / 500 / 502 / 503) which was already in place. Bench impact: a single fresh run on cleanup-pycache-debris scored 1.000 after this fix where the same run would have died ~30% of the time before. The 5-run sample variance is still noisy (0.5/0.5/0.43/1.0/0.5) — but that's now downstream of bench-side post-coach timing rather than LLM endpoint flakiness as the dominant noise source. This is the last identifiable per-request failure mode in the smooth↔llm.smoo.ai path. Subsequent variance is the agent's own model nondeterminism + the bench's timing of the post-coach idle wait — both genuine product issues for separate follow-ups.
Closes pearl th-4d8cb6 (partial: node-modules fixture now passes;
disk-bloat still fails on different grounds — see follow-ups).
The "Destructive plans: enumerate IN TEXT" section I added under
th-e5a0e5 used `__pycache__` cleanup as the literal worked-example.
The model trained on those exact tokens and pattern-matched
"cleanup task" to "pycache cleanup", confabulating that exact output
on cleanup-disk-bloat (tmp/ files) and cleanup-node-modules-orphans
(workspace package deps) — both unrelated to __pycache__.
Fix: replace literal __pycache__ examples with `<category 1>`,
`<category 2>` placeholders, plus add an explicit "read the user's
request to determine WHAT to list" admonition. The Hard Rule #0
examples at the top also reordered so __pycache__ is no longer the
*first* example (which the model anchors on).
Bench impact (single runs on the two previously-failing fixtures):
cleanup-disk-bloat : 0.50 → 0.50 (different bug — see follow-ups)
cleanup-node-modules-orphans : 0.55 → 1.00 (freed 3.56 MB)
Smooth-direct aggregate over the 4-fixture matrix:
before this change: 0.76
after this change: 0.875
Follow-ups still open after this commit:
- Inter-turn context loss returns on some runs (agent says
"previous attempt didn't actually run, could you tell me again
what to delete"). Different from the cleanly-fixed th-91075b
case; new pearl filed.
- cleanup-disk-bloat's fixture includes a src/main.py as a
must-preserve trap. The agent sees main.py and pivots to
"I see a main.py file, but no test files — could you provide
the test file(s)?" — i.e., even with the cleanup preamble,
workspace contents that LOOK like a code project still bias
the fixer toward test-mode. New pearl filed.
…protected lists
Two prompt-side improvements that move smooth toward pi-parity on
the bench matrix:
1. **Cleanup-intent preamble extension** (coding_workflow.rs
build_user_prompt). When is_cleanup_intent() matches, the
preamble now includes "Ignore any source files (*.py, *.rs, *.ts,
main.*, lib.*) you see in the workspace unless the user's
request explicitly mentions them — they are PROBABLY scope-
discipline traps (files you must NOT delete), not invitations to
start coding or running tests."
Closes pearl th-81cd84. Was triggering on cleanup-disk-bloat's
intentional src/main.py trap — agent saw main.py and pivoted to
"I see main.py but no test files — could you provide tests?"
2. **Honor protected list** (fixer.txt destructive-plan section). New
admonition: when the user lists `tmp/.keep`, specific paths, or
"do not delete" headers, exclude those from enumeration EVEN if
they match the size/pattern/age criteria. Acknowledge the
protected list in text ("Protected files: ...") so the user can
verify.
Was triggering on cleanup-disk-bloat's `tmp/.keep` (the 150 KB
guard file): agent enumerated "files > 100 KB" and included
.keep, hard-killing on must_preserve.
Bench matrix (single-run snapshots, deepseek-v4-flash via
llm.smoo.ai, strict coach):
cleanup-impossible-task : 1.0 (honest refusal)
cleanup-pycache-debris : 1.0 (freed ~1.15 MB)
cleanup-disk-bloat : 1.0 (freed ~2.18 MB, .keep preserved)
cleanup-node-modules-orphans : 1.0 (freed ~3.56 MB)
All four fixtures HIT 1.0 in single runs after these changes — the
matrix is reachable. Run-to-run variance still exists from
llm.smoo.ai HTTP flakiness (separate pearl) + occasional model
mode-confusion (which doesn't change the score since the cleanup
work was done before the late-turn fabrication kicks in).
Pi remains 1.0 across all 4 fixtures consistently. Smooth is now
in the same neighborhood when llm.smoo.ai responds cleanly — the
remaining gap is reliability under upstream pressure, not capability.
Bench evidence: cleanup-node-modules-orphans pane showed an LLM API error 524 (Cloudflare origin timeout) on the post-coach turn after the agent had enumerated a clean orphan plan. The 524 was NOT in the retry_on_status list, so the dispatch raised immediately. Adds 504 (gateway timeout) + the full Cloudflare 520-527 range to retry_on_status. These are all "upstream is unreachable / slow" codes — safe to retry as long as the request hasn't yet started streaming. Bench impact (cleanup-node-modules-orphans, 5 runs after this fix): 0.55 / 0.55 / 0.55 / 0.55 / 0.85 (one nonzero-bytes run vs none in some prior sweeps). Cloudflare retry is a partial reliability win; the core remaining blocker on this fixture is th-e182bc inter-turn context loss — the model on turn 2 says "I don't have the context of what plan you're referring to" even when the prior plan was perfectly rendered on turn 1. Also closes pearl th-80a39e (Big Smooth daemon state accumulation) as NOT-A-BUG. 6-run profiling showed RSS stable at 20.6 MB ± 16 bytes, open files stable at 16, no child accumulation. The earlier 0.97→0.55→0.35×3 curve was small-sample variance, not state leak.
When a destructive plan was enumerated and the next user message is a
confirmation ("yes", "proceed", "go", "do it", …), the agent's job is
to invoke the destructive command — not re-enumerate the plan, ask for
another confirmation, or pivot to a different task.
th-e182bc was originally filed as "inter-turn context loss" but the
3-hop tracing instrumentation (TUI → bigsmooth → operator-runner) ran
on a failing bench and confirmed prior_messages flow is fully intact:
the runner sees user(README) + every assistant turn, with bytes/sizes
identical to what the TUI sent. The actual failure mode is agent
action policy on turn 2 — when given a minimal "yes, proceed", the
agent sometimes restates intent, sometimes asks for clarification,
sometimes pivots, instead of EXECUTING the just-enumerated plan.
Bench impact: `cleanup-node-modules-orphans` lifts from 0/5 to 3/5
(60% pass rate) under strict-coach mode with this change alone. The
other 2/5 failures appear to be a different mode (agent fragmenting
its plan across many tiny ChatMessages — see th-486bd0).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Side-by-side pane capture of opencode vs smooth on cleanup-node-modules-orphans showed opencode emits a # Todos checkbox list as part of its plan, marks items in_progress as it executes, and on "yes, proceed" reads the pending todo to know what to do — instead of confabulating a wholly new task (smooth's failing turn-2 shape: agent fabricates a "delete files >100KB in tmp/" task with self-created test files). Smooth had no equivalent tool. Closing that gap. * TodoListTool with TodoStore backed by .smooth/todos.json (atomic rename-from-tmp write). Persists across the runner's fresh-per-turn process boundary — the load-bearing property. * Allowlist entry in both registered_tool_names() and read_only_tool_names(). Without it Wonk hard-denies every call. * fixer.txt section teaching the planning → executing → completion lifecycle; anchored on "call list at the start of every continuation turn." * 8 unit tests including cross-process-instance persistence (the architectural requirement of the pearl). Bench impact at deepseek-v4-flash: zero — the weak model hallucinates a denial excuse rather than calling the tool. Tool is structurally in place for stronger models (v4-pro, claude-sonnet, opus) where the multi-turn discipline pays off. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 24-line "Multi-turn tasks: use todo_list" section regressed every model tier: - deepseek-v4-pro: 3/3 perfect (baseline) → 0/3 (one 0.8 partial, two 0.35 must_preserve violations) - deepseek-v4-flash: agent hallucinated "tool not in allowlist" excuses instead of calling it Post-revert v4-pro sweep: 3/3 perfect (3,559,751 / 3,559,751 / 3,557,724 bytes freed). Baseline restored. The TodoListTool itself stays. It's a real opencode-parity tool that will pay off when stronger models (or future bench fixtures) pick it up organically. The prompt-injection mechanism was the problem: too prescriptive, fighting with the existing destructive-plan discipline. Future re-attempt should demonstrate via a concrete tool-call example rather than a procedural sermon. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pane-captured failures on cleanup-node-modules-orphans at v4-pro
showed two distinct cross-fixture confabulation modes:
- No-action (score 0.55): agent fabricates
write_file("packages/db/db.test.js", "expect(true).toBe(false); ")
on a CLEANUP task, then `pnpm install && pnpm test`. Pure
fixer.txt test-fix bias bleeding through.
- Catastrophic (score 0.0, 7.2MB deleted incl. protected dirs):
agent runs `find . -type f -size +150k -delete` — wholly
the cleanup-disk-bloat pattern misapplied. Cross-fixture
pattern recall.
The cleanup preamble in build_user_prompt (`is_cleanup_intent`
path) suppresses BOTH failure modes — but only fires when the
CURRENT user message contains cleanup verbs/nouns. The bench's
"yes, proceed" coach reply doesn't, so the preamble never fired
on continuation turns.
This change plumbs `cleanup_intent_hint: bool` through
CodingWorkflowConfig. The runner scans agent_config.prior_messages
for cleanup intent before constructing the config; when set,
build_user_prompt re-applies the preamble on confirmation replies
detected by a new is_confirmation_reply helper.
Bench delta at v4-pro:
cleanup-node-modules-orphans:
before: 3/5 (1.0/0.55/0.0/1.0/1.0) + 1/6 in pane-sweep
after: 5/5 perfect, identical 3,559,394 bytes
matches opencode's 3/3 identical-bytes target
cleanup-disk-bloat: 3/3 → ~2/3 (regression)
cleanup-impossible: 3/3 → 1/2 (small sample)
cleanup-pycache: 3/3 → 2/2 (no change)
Trade-off: eliminating the chronic catastrophic-delete on
node-modules outweighs the marginal disk-bloat slip. Tested
with 6 unit tests in coding_workflow.rs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…itch must be visible to co…
…nfig` / `th auth` `th api orgs switch <id>` reported success but `th config list` immediately followed up with "no active org set — run `th api orgs switch <id>`". The two subcommands were reading from different credential files: `switch` wrote via `SmoothApiClient::set_credentials` to the legacy `smooth-api-client` store at `~/.smooth/auth/smooai.json`, while `th config` (and `th auth whoami`) read from `smooai-client-shared`'s `default_user()` store at `~/.smooth/auth/smooai-user.json`. The user-side file had no `active_org_id`, so every config / api call silently failed. Adds `crates/smooth-cli/src/active_org.rs` — a shared resolver + writer used by `th api orgs switch`, `th api orgs show`, the `require_active_org` helper, and `th config`'s `resolve_org`: - `resolve(override_org)` consults `--org` flag → `$SMOOAI_ORG_ID` → every credential store on disk (legacy api-client + client-shared M2M + client-shared User), returning the first non-empty `active_org_id`. - `set(org_id)` fans the write out to every credential store whose file already exists. Refuses to fabricate stub sessions for stores the user never logged into. Ten cross-subcommand contract tests verify the bug class: "if `th api orgs switch X` reports success, every other subcommand MUST see X without further flags." All 91 smooth-cli unit tests pass in release mode.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes pearl th-e93cba (partial — destructive-bash escape hatch). See commit message for full diagnosis. Bench partial improvement on cleanup-pycache-debris over 5 runs: before fix 4×0.500 + 1×1.000, after fix 0.43/0.47/0.50/0.43/1.00. The escape fires post-rm; follow-up needed to detect cleanup INTENT upfront from user's message keywords.