SmooAI · brentrager · Jun 4, 2026
diff --git a/.changeset/active-org-cross-subcommand.md b/.changeset/active-org-cross-subcommand.md
@@ -0,0 +1,29 @@
+---
+"smooai-smooth": patch
+---
+
+th: unify active-org resolution across `th api`, `th config`, `th auth`
+
+`th api orgs switch <id>` wrote the active org only to the legacy
+`smooth-api-client` store at `~/.smooth/auth/smooai.json`, but
+`th config list` (and any other subcommand that uses
+`smooai-client-shared`'s `default_user()` store) read from a different
+file (`~/.smooth/auth/smooai-user.json`). Net effect: switch reported
+success, then `th config list` immediately failed with
+"no active org set — pass `--org-id <id>`, set SMOOAI_ORG_ID, or run
+`th api orgs switch <id>`" — the same command the user just ran.
+
+Adds a shared `crate::active_org` module with two functions:
+
+- `resolve(override_org)` — consults `--org` flag → `$SMOOAI_ORG_ID` →
+  every credential store on disk (legacy api-client + client-shared
+  M2M + client-shared User), returning the first non-empty
+  `active_org_id`.
+- `set(org_id)` — fans the write out to every credential store whose
+  file already exists. Won't fabricate a stub User session for an
+  M2M-only user.
+
+Wires `th api orgs switch`, `th api orgs show`, the `th api`
+`require_active_org` helper, and `th config`'s `resolve_org` through
+the shared module. Covered by ten new cross-subcommand contract
+tests in `crates/smooth-cli/src/active_org.rs`.
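Reviewer sketch of the precedence order described for `resolve` above, not the code in `crates/smooth-cli/src/active_org.rs`. The function name `resolve_active_org` and its parameters are hypothetical: `flag` stands in for the `--org` override, `env` for `$SMOOAI_ORG_ID`, and `stores` for the `active_org_id` values already read from each credential file. Treating an empty or whitespace-only flag or environment value as absent is an assumption; the changeset only states that rule for the stores.

```rust
/// Hypothetical sketch: return the first non-empty candidate in
/// precedence order (flag, then environment, then each store in turn).
pub fn resolve_active_org<'a>(
    flag: Option<&'a str>,
    env: Option<&'a str>,
    stores: &[Option<&'a str>],
) -> Option<String> {
    flag.into_iter()
        .chain(env)
        // Stores with no `active_org_id` are skipped entirely.
        .chain(stores.iter().copied().flatten())
        .map(str::trim)
        // Assumption: blank values count as "not set" at every level.
        .find(|candidate| !candidate.is_empty())
        .map(str::to_owned)
}
```

Keeping the lookup as one ordered chain means every subcommand that calls it gets the same answer, which is the property the cross-subcommand tests are described as pinning down.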
diff --git a/.changeset/cleanup-intent-hint-plumbing.md b/.changeset/cleanup-intent-hint-plumbing.md
@@ -0,0 +1,39 @@
+---
+"smooai-smooth": minor
+---
+
+coding_workflow: cleanup-intent hint plumbing for continuation turns
+
+The fixer's test-fix bias + cross-fixture pattern confabulation made
+`cleanup-node-modules-orphans` chronically unreliable on v4-pro
+(1/6 perfect in pane-captured samples — agents fabricating
+`packages/db/db.test.js` on cleanup tasks; running
+`find . -type f -size +150k -delete` on a node-modules orphan
+task). The existing `is_cleanup_intent(task)` preamble in
+`build_user_prompt` suppresses both failure modes — but it only
+fires when the CURRENT user message matches cleanup verbs/nouns,
+which the bench's "yes, proceed" coach reply does not.
+
+This change plumbs a `cleanup_intent_hint: bool` through
+`CodingWorkflowConfig`. The runner sets it by scanning
+`agent_config.prior_messages` for cleanup intent — so when the
+prior turn was a cleanup README, the workflow re-applies the
+preamble on the confirmation turn via a new `is_confirmation_reply`
+helper.
+
+Net result at deepseek-v4-pro:
+
+- `cleanup-node-modules-orphans`: prior 1/6 perfect (3/5 + 1 no-action
+  + 1 catastrophic 7.2MB protected-dir delete) → **5/5 perfect,
+  zero-variance identical 3,559,394 bytes**. Matches opencode's
+  3/3 identical-bytes baseline on the same fixture.
+- `cleanup-disk-bloat`: 3/3 → ~2/3 (~67% pass rate; one cross-fixture
+  hallucination remained). Net regression on this fixture.
+- `cleanup-impossible-task`: 3/3 → variance not yet characterized,
+  early sample 1/2.
+- `cleanup-pycache-debris`: 3/3 strong → 2/2 stable.
+
+Trade-off worth shipping: eliminating the chronic
+catastrophic-delete failure mode on node-modules (a fixture where
+v4-pro previously had a 17% catastrophic + 50% no-action rate)
+outweighs the marginal disk-bloat slip. Pearl th-e182bc.
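Reviewer sketch of the hint logic this changeset describes, under stated assumptions. The bodies of `is_cleanup_intent` and `is_confirmation_reply` are not shown in this diff, so the keyword and reply lists below are hypothetical stand-ins, and `should_apply_cleanup_preamble` is an illustrative name, not a function from the source.

```rust
/// Hypothetical stand-in: the real cleanup verb/noun lists are not in this diff.
fn is_cleanup_intent(text: &str) -> bool {
    let lowered = text.to_lowercase();
    ["clean up", "cleanup", "free up", "remove orphan"]
        .iter()
        .any(|keyword| lowered.contains(*keyword))
}

/// Hypothetical stand-in: a short reply that only confirms a prior plan.
fn is_confirmation_reply(text: &str) -> bool {
    let normalized = text
        .trim()
        .trim_end_matches(|c: char| c == '.' || c == '!')
        .to_lowercase();
    matches!(
        normalized.as_str(),
        "yes" | "proceed" | "go" | "do it" | "yes, proceed"
    )
}

/// Apply the cleanup preamble when the current message is itself a cleanup
/// request, or when it merely confirms and an earlier message was one.
fn should_apply_cleanup_preamble(current: &str, prior_messages: &[&str]) -> bool {
    if is_cleanup_intent(current) {
        return true;
    }
    // Mirrors the described `cleanup_intent_hint: bool` set by the runner.
    let cleanup_intent_hint = prior_messages.iter().any(|m| is_cleanup_intent(m));
    cleanup_intent_hint && is_confirmation_reply(current)
}
```

Gating the hint on a confirmation reply keeps an unrelated follow-up request from inheriting the cleanup preamble just because an earlier turn mentioned cleanup.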
diff --git a/.changeset/fixer-prompt-execute-on-confirm.md b/.changeset/fixer-prompt-execute-on-confirm.md
@@ -0,0 +1,23 @@
+---
+"smooai-smooth": patch
+---
+
+fixer prompt: add explicit "When the user confirms: EXECUTE" rule
+
+When the prior assistant turn enumerated a destructive plan ending in
+"Proceed?" and the user's next message is "yes" / "proceed" / "go" /
+"do it" etc., the agent must invoke the destructive command directly,
+not re-enumerate or re-ask for confirmation, and not pivot to a
+different task.
+
+Lifts `cleanup-node-modules-orphans` pass rate from 0/5 to 3/5 under
+strict-coach mode (minimal "yes, proceed" reply). The old prompt
+implied the meaning of "yes" but never explicitly told the agent what
+behavior to perform on receipt, so the model was free to treat
+"yes" as a cue to restate context. The bench's idle detector then
+mistook that restatement for a fresh first-idle and pasted the coach
+reply again, producing the score-0.55 zero-bytes-freed failure shape.
+
+Pearl: th-e182bc (re-scoped: was misdiagnosed as inter-turn context
+loss; instrumentation confirmed the prior_messages flow is intact
+through all 3 hops, so the failure is in agent action policy).
diff --git a/.changeset/fixer-todo-prompt-revert.md b/.changeset/fixer-todo-prompt-revert.md
@@ -0,0 +1,22 @@
+---
+"smooai-smooth": patch
+---
+
+fixer.txt: revert todo_list teaching section (regressed v4-pro 3/3 → 0/3)
+
+The "Multi-turn tasks: use `todo_list`" section added in
+th-1d6699's commit hurt every model tier tested:
+
+- deepseek-v4-pro: 3/3 perfect → 1/3 partial (0.8) + 2/3 must_preserve
+  violations (0.35)
+- deepseek-v4-flash: agent hallucinated "tool not in allowlist"
+  excuses and never actually called the tool
+
+Post-revert v4-pro is back to 3/3 perfect (3,559,751 / 3,559,751 /
+3,557,724 bytes freed). The TodoListTool itself stays — it's
+architecturally correct and ready for stronger models to pick up
+organically. The prompt-injection approach was too prescriptive
+and conflicted with the existing destructive-plan discipline. Pearl
+th-1d6699 remains in_progress for a re-attempt that demonstrates
+the tool via a concrete example rather than a 24-line procedural
+sermon.
diff --git a/.changeset/todo-list-tool.md b/.changeset/todo-list-tool.md
@@ -0,0 +1,43 @@
+---
+"smooai-smooth": minor
+---
+
+runner: add `todo_list` tool for cross-turn task state (opencode parity)
+
+Adds a `todo_list` tool to smooth-operator-runner. Operates on a small
+JSON file at `.smooth/todos.json` with four actions:
+`add` / `list` / `update` / `clear`. Persists across the runner's
+fresh-per-turn process boundary so on turn 2 the agent can
+`todo_list action='list'` to find what it was doing — the structural
+anchor opencode uses and smooth was missing.
+
+Pearl `th-1d6699`. Diagnosed by side-by-side pane capture of opencode
+vs smooth on `cleanup-node-modules-orphans`: opencode emits a
+`# Todos` checkbox list as part of its plan, marks items in_progress
+as it executes, and on `"yes, proceed"` reads the pending todo and
+issues ONE concrete `rm -rf <paths>` command. Smooth had no equivalent
+tool — every other registered tool (read_file, write_file, edit_file,
+apply_patch, list_files, grep, lsp, bash, bg_run, http_fetch,
+project_inspect, read_memory, write_memory) is single-shot or
+project-scoped; none tracks per-task state.
+
+Wired through:
+- `crates/smooth-operator-runner/src/main.rs` — `TodoListTool` impl
+  + `TodoStore` (JSON-file-backed, atomic rename-from-tmp write) +
+  8 unit tests including cross-process persistence.
+- `crates/smooth-bigsmooth/src/policy.rs` — added `todo_list` to both
+  `registered_tool_names()` and `read_only_tool_names()`. Without
+  this entry Wonk denies every call and the agent logs the
+  "I cannot use the todo_list tool" excuse.
+- `crates/smooth-operator/src/cast/prompts/fixer.txt` — new section
+  teaching the agent the planning → executing → completion lifecycle
+  for the tool. Anchored on "call `list` at the start of every
+  continuation turn — it tells you what was already done and what's
+  next."
+
+Bench impact at `deepseek-v4-flash`: not measurable — the weak model
+hallucinates "tool not in allowlist" rather than calling it (no
+allowlist gate exists in direct mode; the LLM is making up an
+excuse). The tool is structurally in place for stronger models
+(v4-pro, claude-sonnet) where the multi-turn discipline pays off.
+Filed as architectural parity, not a single-fixture lift.
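Reviewer sketch of the "atomic rename-from-tmp write" pattern named for `TodoStore`, using only the standard library. This is not the implementation in `crates/smooth-operator-runner/src/main.rs`: the function name `write_atomically` and the `.json.tmp` suffix are assumptions for illustration, and JSON serialization is left out so the block stays dependency-free.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical sketch: write `contents` to a sibling temp file, flush it,
/// then rename it over `path` so readers never see a half-written file.
pub fn write_atomically(path: &Path, contents: &str) -> io::Result<()> {
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    // Assumed temp name: same directory, so the rename stays on one filesystem.
    let tmp = path.with_extension("json.tmp");
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(contents.as_bytes())?;
        // Push the bytes to disk before the rename makes them visible.
        file.sync_all()?;
    }
    fs::rename(&tmp, path)
}
```

Because each turn runs in a fresh process, a crash mid-write would otherwise leave the next turn with a truncated `todos.json`; with the rename, the next turn reads either the old list or the new one.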
diff --git a/.smooth/dolt/pearls/.dolt/noms/journal.idx b/.smooth/dolt/pearls/.dolt/noms/journal.idx
diff --git a/.smooth/dolt/pearls/.dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv b/.smooth/dolt/pearls/.dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
diff --git a/crates/smooth-bench/src/agent_driver.rs b/crates/smooth-bench/src/agent_driver.rs
@@ -40,6 +40,18 @@ use async_trait::async_trait;
 
 use crate::score_cleanup::{AgentRunArtifacts, CoachMode, RefusalKind};
 
+/// Poll interval for `wait_for_idle` calls. Deliberately chosen to be
+/// **coprime** with smooth-code's 500ms spinner cycle (10 braille
+/// frames × 50ms TUI tick). Pearl `th-2e6693` — at 500ms the bench
+/// poll phase-locked with the spinner: every poll caught the exact
+/// same frame, the captured pane bytes were identical, and idle
+/// fired before the agent actually finished responding. 383 is prime,
+/// hence coprime with both the 500ms cycle and the 50ms tick, so
+/// consecutive captures genuinely see different spinner frames
+/// when the TUI is mid-thought, and idle only fires when the agent
+/// is actually quiet.
+const POLL_INTERVAL_MS: u64 = 383;
+
 /// Inputs every driver receives for a single task dispatch.
 ///
 /// Borrowed because the caller holds the task fixture for the whole
@@ -571,7 +583,7 @@ fn drive_tmux_agent(spec: TmuxAgentSpec) -> AgentRunArtifacts {
 
     let start = std::time::Instant::now();
     let total_budget = timeout.saturating_sub(Duration::from_secs(2));
-    let pane1 = match driver.wait_for_idle(first_idle_dwell, Duration::from_millis(500), total_budget) {
+    let pane1 = match driver.wait_for_idle(first_idle_dwell, Duration::from_millis(POLL_INTERVAL_MS), total_budget) {
         Ok(p) => p,
         Err(e) => {
             let partial = driver.capture().unwrap_or_default();
@@ -618,7 +630,7 @@ fn drive_tmux_agent(spec: TmuxAgentSpec) -> AgentRunArtifacts {
                 pane1
             } else {
                 let remaining = total_budget.saturating_sub(start.elapsed());
-                driver.wait_for_idle(post_coach_dwell, Duration::from_millis(500), remaining).map_or_else(
+                driver.wait_for_idle(post_coach_dwell, Duration::from_millis(POLL_INTERVAL_MS), remaining).map_or_else(
                     |e| {
                         eprintln!("[{driver_name}/{task_id}] post-coach idle timeout: {e}");
                         driver.capture().unwrap_or_else(|_| pane1.clone())
@@ -887,18 +899,19 @@ fn drive_smooth_via_tmux(
         // the same 120s ceiling as `tui_score::TuiTaskConfig::default`.
         boot_timeout: Duration::from_secs(120),
         paste_warmup: Duration::from_millis(800),
-        // Smooth's `Thinking...` is static text (no animation), so an
-        // 8s idle dwell mis-fires on it before the model's first token
-        // arrives — especially on small workspaces where Big Smooth's
-        // cold-start tax can push first-token latency past 8s. Pearl
-        // `th-65a041`: bench impossible-task variability was traced to
-        // this. 20s gives the model room to think without breaking
-        // the warm-case fast path (warm runs still finish around the
-        // 60-second mark for typical fixtures). OpenCode keeps 8s
-        // because its TUI shows visible token-streaming as soon as
-        // the model starts emitting.
-        first_idle_dwell: Duration::from_secs(20),
-        post_coach_dwell: Duration::from_secs(10),
+        // Was 20s under pearl `th-65a041` because smooth's `Thinking...`
+        // was static text and the idle detector false-fired on it.
+        // Pearl `th-2e6693` made the spinner animate every ~100ms, and
+        // POLL_INTERVAL_MS is coprime with the spinner cycle, so 8s
+        // of byte-stable dwell now genuinely means "no agent
+        // activity" on the first idle.
+        first_idle_dwell: Duration::from_secs(8),
+        // Post-coach is different — smooth's tool-call chain after
+        // "yes, proceed" can include multiple `list_files` /
+        // `bash rm -rf` rounds. 5s sometimes catches the agent
+        // mid-tool-call. 15s gives the chain time to finish without
+        // pushing wallclock budget excessively.
+        post_coach_dwell: Duration::from_secs(15),
         task_id,
         workspace,
         prompt,
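Reviewer sketch of the phase-lock argument behind `POLL_INTERVAL_MS`. It is an idealized model, not bench code: it assumes polls land exactly on schedule, the spinner starts at frame 0, and the spinner has 10 frames advancing every 50ms as the doc comment states. The helper names `gcd` and `frames_seen` are hypothetical.

```rust
/// Euclid's algorithm.
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Spinner frame index observed by each of `polls` consecutive captures,
/// assuming a capture at t = i * poll_ms and a frame change every tick_ms.
fn frames_seen(poll_ms: u64, tick_ms: u64, frames: u64, polls: u64) -> Vec<u64> {
    (0..polls)
        .map(|i| (i * poll_ms / tick_ms) % frames)
        .collect()
}
```

Under these assumptions, `frames_seen(500, 50, 10, 4)` is `[0, 0, 0, 0]`, so every capture is byte-identical and idle fires early, while `frames_seen(383, 50, 10, 4)` is `[0, 7, 5, 2]`. `gcd(383, 500)` is 1, so the poll never settles back into step with the spinner cycle.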

diff --git a/crates/smooth-bigsmooth/src/policy.rs b/crates/smooth-bigsmooth/src/policy.rs
@@ -572,6 +572,11 @@ fn registered_tool_names() -> Vec<String> {
         "list_files".into(),
         "read_memory".into(),
         "write_memory".into(),
+        // Pearl th-1d6699: cross-turn task state, opencode-parity.
+        // Without this entry Wonk denies every call and the agent
+        // logs "I cannot use the todo_list tool as it is not in
+        // the allowlist" then proceeds without it.
+        "todo_list".into(),
         "grep".into(),
         "glob".into(),
         "lsp".into(),
@@ -610,6 +615,11 @@ fn read_only_tool_names() -> Vec<String> {
         "project_inspect".into(),
         "read_memory".into(),
         "write_memory".into(),
+        // Pearl th-1d6699: read-only roles (oracle, mapper, heckler)
+        // also need to track their own plan across turns. Cheap call
+        // — the tool only writes to .smooth/todos.json which is
+        // already covered by the workspace write allowlist.
+        "todo_list".into(),
     ]
 }