16 commits
a1a727a
Pearl th-e93cba: gate NoEvidence reprompt on destructive-bash + no-so…
brentrager Jun 4, 2026
354fea7
pearl: create th-b30b00 bench: llm.smoo.ai HTTP errors ('connection c…
brentrager Jun 4, 2026
2cd90d8
Pearl th-e93cba: upfront cleanup-intent detection (mean 0.60 → 0.80)
brentrager Jun 4, 2026
f47ab73
Pearls th-2e6693 + th-e93cba (round 2): animated Thinking spinner,
brentrager Jun 4, 2026
1dcc9a3
llm.rs: retry chat_stream `.send()` on transient reqwest errors
brentrager Jun 4, 2026
6e20ea0
fixer.txt: remove __pycache__ bias from the destructive-plan example
brentrager Jun 4, 2026
c4d3c8d
Pearls th-81cd84 + scope-discipline: extend cleanup preamble + honor …
brentrager Jun 5, 2026
1e852bd
pearl: close th-81cd84
brentrager Jun 5, 2026
44a9e27
llm.rs: retry on Cloudflare 5xx (520-527) + 504 — observed 524 mid-run
brentrager Jun 5, 2026
f66e711
pearl: close th-80a39e
brentrager Jun 5, 2026
ae10d6b
Pearl th-e182bc (rescoped): fixer prompt — execute on user confirm
brentrager Jun 5, 2026
864e834
Pearl th-1d6699: add todo_list tool to runner (opencode parity)
brentrager Jun 5, 2026
f5abcd3
Pearl th-1d6699: revert fixer.txt todo_list section (regressed v4-pro)
brentrager Jun 5, 2026
06800c1
Pearl th-e182bc: cleanup-intent hint plumbing (5/5 node-modules)
brentrager Jun 5, 2026
c52f12b
pearl: create th-3217db th: cross-subcommand active-org contract — sw…
brentrager Jun 6, 2026
aed48d2
Pearl th-3217db: unify active-org resolution across `th api` / `th co…
brentrager Jun 6, 2026
29 changes: 29 additions & 0 deletions .changeset/active-org-cross-subcommand.md
@@ -0,0 +1,29 @@
---
"smooai-smooth": patch
---

th: unify active-org resolution across `th api`, `th config`, `th auth`

`th api orgs switch <id>` wrote the active org only to the legacy
`smooth-api-client` store at `~/.smooth/auth/smooai.json`, but
`th config list` (and any other subcommand that uses
`smooai-client-shared`'s `default_user()` store) read from a different
file (`~/.smooth/auth/smooai-user.json`). Net effect: switch reported
success, then `th config list` immediately failed with
"no active org set — pass `--org-id <id>`, set SMOOAI_ORG_ID, or run
`th api orgs switch <id>`" — the same command the user just ran.

Adds a shared `crate::active_org` module with two functions:

- `resolve(override_org)` — consults `--org` flag → `$SMOOAI_ORG_ID` →
every credential store on disk (legacy api-client + client-shared
M2M + client-shared User), returning the first non-empty
`active_org_id`.
- `set(org_id)` — fans the write out to every credential store whose
file already exists. Won't fabricate a stub User session for an
M2M-only user.

Wires `th api orgs switch`, `th api orgs show`, the `th api`
`require_active_org` helper, and `th config`'s `resolve_org` through
the shared module. Covered by ten new cross-subcommand contract
tests in `crates/smooth-cli/src/active_org.rs`.
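
The precedence and fan-out rules described above can be sketched as follows. This is a minimal illustration, not the code in `active_org.rs`: the `Store` struct, the extra `env_org` and `stores` parameters, and the returned count are assumptions made so the logic can be exercised without touching the filesystem or the environment.

```rust
/// Hypothetical stand-in for one on-disk credential store.
/// `exists` models "the file is present"; `active_org_id` is what it holds.
#[derive(Debug, Clone, PartialEq)]
pub struct Store {
    pub name: &'static str,
    pub exists: bool,
    pub active_org_id: Option<String>,
}

/// Treat empty or whitespace-only values as unset.
fn non_empty(value: Option<&str>) -> Option<String> {
    value.map(str::trim).filter(|v| !v.is_empty()).map(str::to_owned)
}

/// Precedence: explicit flag, then environment value, then the first
/// existing store (in slice order) with a non-empty `active_org_id`.
pub fn resolve(override_org: Option<&str>, env_org: Option<&str>, stores: &[Store]) -> Option<String> {
    non_empty(override_org)
        .or_else(|| non_empty(env_org))
        .or_else(|| {
            stores
                .iter()
                .filter(|s| s.exists)
                .find_map(|s| non_empty(s.active_org_id.as_deref()))
        })
}

/// Fan the write out to every store whose file already exists and never
/// create a store that was absent. Returns how many stores were updated.
pub fn set(org_id: &str, stores: &mut [Store]) -> usize {
    let mut written = 0;
    for store in stores.iter_mut().filter(|s| s.exists) {
        store.active_org_id = Some(org_id.to_owned());
        written += 1;
    }
    written
}
```

Because `set` writes to every existing store and `resolve` reads from all of them, a switch followed by any read sees the same org regardless of which store a subcommand historically preferred.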
39 changes: 39 additions & 0 deletions .changeset/cleanup-intent-hint-plumbing.md
@@ -0,0 +1,39 @@
---
"smooai-smooth": minor
---

coding_workflow: cleanup-intent hint plumbing for continuation turns

The fixer's test-fix bias + cross-fixture pattern confabulation made
`cleanup-node-modules-orphans` chronically unreliable on v4-pro
(1/6 perfect in pane-captured samples — agents fabricating
`packages/db/db.test.js` on cleanup tasks; running
`find . -type f -size +150k -delete` on a node-modules orphan
task). The existing `is_cleanup_intent(task)` preamble in
`build_user_prompt` suppresses both failure modes — but it only
fires when the CURRENT user message matches cleanup verbs/nouns,
which the bench's "yes, proceed" coach reply does not.

This change plumbs a `cleanup_intent_hint: bool` through
`CodingWorkflowConfig`. The runner sets it by scanning
`agent_config.prior_messages` for cleanup intent — so when the
prior turn was a cleanup README, the workflow re-applies the
preamble on the confirmation turn via a new `is_confirmation_reply`
helper.

Net result at deepseek-v4-pro:

- `cleanup-node-modules-orphans`: prior 1/6 perfect (3/5 + 1 no-action
+ 1 catastrophic 7.2MB protected-dir delete) → **5/5 perfect,
zero-variance identical 3,559,394 bytes**. Matches opencode's
3/3 identical-bytes baseline on the same fixture.
- `cleanup-disk-bloat`: 3/3 → ~2/3 (~67% pass rate; one cross-fixture
hallucination remained). Net regression on this fixture.
- `cleanup-impossible-task`: 3/3 → variance not yet characterized,
early sample 1/2.
- `cleanup-pycache-debris`: 3/3 strong → 2/2 stable.

Trade-off worth shipping: eliminating the chronic
catastrophic-delete failure mode on node-modules (a fixture where
v4-pro previously had a 17% catastrophic + 50% no-action rate)
outweighs the marginal disk-bloat slip. Pearl th-e182bc.
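
A rough sketch of the hint plumbing described in this changeset, assuming simple keyword matching. The keyword lists, the accepted confirmation phrases, and the free-function shape are illustrative guesses; only the names `is_cleanup_intent`, `is_confirmation_reply`, and `cleanup_intent_hint` come from the text above, and `should_apply_cleanup_preamble` is a hypothetical helper.

```rust
/// Hypothetical cleanup detector: a cleanup verb and a cleanup noun must both appear.
pub fn is_cleanup_intent(text: &str) -> bool {
    let t = text.to_lowercase();
    let verb = ["clean", "delete", "remove", "prune", "free up"]
        .iter()
        .any(|v| t.contains(*v));
    let noun = ["node_modules", "__pycache__", "orphan", "disk", "cache"]
        .iter()
        .any(|n| t.contains(*n));
    verb && noun
}

/// Hypothetical confirmation detector for short replies such as "yes, proceed".
pub fn is_confirmation_reply(text: &str) -> bool {
    let t = text
        .trim()
        .trim_end_matches(|c: char| c == '.' || c == '!')
        .to_lowercase();
    matches!(
        t.as_str(),
        "yes" | "y" | "proceed" | "go" | "do it" | "yes, proceed" | "yes proceed"
    )
}

/// Runner side: derive the hint from earlier turns in the conversation.
pub fn cleanup_intent_hint(prior_messages: &[&str]) -> bool {
    prior_messages.iter().any(|m| is_cleanup_intent(m))
}

/// Workflow side: apply the preamble when the current message is itself a
/// cleanup request, or when it is a bare confirmation of an earlier one.
pub fn should_apply_cleanup_preamble(task: &str, hint: bool) -> bool {
    is_cleanup_intent(task) || (hint && is_confirmation_reply(task))
}
```

Gating the hint on `is_confirmation_reply` keeps the preamble from leaking into an unrelated follow-up task that merely happens to come after a cleanup turn.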
23 changes: 23 additions & 0 deletions .changeset/fixer-prompt-execute-on-confirm.md
@@ -0,0 +1,23 @@
---
"smooai-smooth": patch
---

fixer prompt: add explicit "When the user confirms: EXECUTE" rule

When the prior assistant turn enumerated a destructive plan ending in
"Proceed?" and the user's next message is "yes" / "proceed" / "go" /
"do it" etc., the agent must invoke the destructive command directly,
not re-enumerate or re-ask for confirmation, and not pivot to a
different task.

Lifts `cleanup-node-modules-orphans` pass rate from 0/5 to 3/5 under
strict-coach mode (minimal "yes, proceed" reply). The old prompt
implied the meaning of "yes" but never explicitly told the agent what
behavior to perform on receipt — the model was free to interpret
"yes" as a context-restate cue, which the bench's idle detector then
mistook for a fresh first-idle and pasted the coach reply again,
producing the score-0.55 zero-bytes-freed failure shape.

Pearl: th-e182bc (re-scoped — was misdiagnosed as inter-turn context
loss; instrumentation confirmed prior_messages flow is intact through
all 3 hops, the failure is in agent action policy)
22 changes: 22 additions & 0 deletions .changeset/fixer-todo-prompt-revert.md
@@ -0,0 +1,22 @@
---
"smooai-smooth": patch
---

fixer.txt: revert todo_list teaching section (regressed v4-pro 3/3 → 0/3)

The "Multi-turn tasks: use `todo_list`" section added in
th-1d6699's commit hurt every model tier tested:

- deepseek-v4-pro: 3/3 perfect → 1/3 partial (0.8) + 2/3 must_preserve
violations (0.35)
- deepseek-v4-flash: agent hallucinated "tool not in allowlist"
excuses, didn't actually call the tool

Post-revert v4-pro is back to 3/3 perfect (3,559,751 / 3,559,751 /
3,557,724 bytes freed). The TodoListTool itself stays — it's
architecturally correct and ready for stronger models to pick up
organically. The prompt-injection approach was too prescriptive
and conflicted with the existing destructive-plan discipline. Pearl
th-1d6699 remains in_progress for a re-attempt that demonstrates
the tool via a concrete example rather than a 24-line procedural
sermon.
43 changes: 43 additions & 0 deletions .changeset/todo-list-tool.md
@@ -0,0 +1,43 @@
---
"smooai-smooth": minor
---

runner: add `todo_list` tool for cross-turn task state (opencode parity)

Adds a `todo_list` tool to smooth-operator-runner. Operates on a small
JSON file at `.smooth/todos.json` with four actions:
`add` / `list` / `update` / `clear`. Persists across the runner's
fresh-per-turn process boundary so on turn 2 the agent can
`todo_list action='list'` to find what it was doing — the structural
anchor opencode uses and smooth was missing.

Pearl `th-1d6699`. Diagnosed by side-by-side pane capture of opencode
vs smooth on `cleanup-node-modules-orphans`: opencode emits a
`# Todos` checkbox list as part of its plan, marks items in_progress
as it executes, and on `"yes, proceed"` reads the pending todo and
issues ONE concrete `rm -rf <paths>` command. Smooth had no equivalent
tool — every other registered tool (read_file, write_file, edit_file,
apply_patch, list_files, grep, lsp, bash, bg_run, http_fetch,
project_inspect, read_memory, write_memory) is single-shot or
project-scoped, none track per-task state.

Wired through:
- `crates/smooth-operator-runner/src/main.rs` — `TodoListTool` impl
+ `TodoStore` (JSON-file-backed, atomic rename-from-tmp write) +
8 unit tests including cross-process persistence.
- `crates/smooth-bigsmooth/src/policy.rs` — added `todo_list` to both
`registered_tool_names()` and `read_only_tool_names()`. Without
this entry Wonk denies every call and the agent logs the
"I cannot use the todo_list tool" excuse.
- `crates/smooth-operator/src/cast/prompts/fixer.txt` — new section
teaching the agent the planning → executing → completion lifecycle
for the tool. Anchored on "call `list` at the start of every
continuation turn — it tells you what was already done and what's
next."

Bench impact at `deepseek-v4-flash`: not measurable — the weak model
hallucinates "tool not in allowlist" rather than calling it (no
allowlist gate exists in direct mode; the LLM is making up an
excuse). The tool is structurally in place for stronger models
(v4-pro, claude-sonnet) where the multi-turn discipline pays off.
Filed as architectural parity, not a single-fixture lift.
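
The "atomic rename-from-tmp write" mentioned for `TodoStore` could look roughly like the sketch below. It uses only the standard library; the `write_atomic` name and the `.json.tmp` suffix are assumptions, and serialization of the todo items is left out.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Write `contents` to `path` by writing a sibling temp file and renaming it
/// over the target, so a reader never observes a half-written file.
pub fn write_atomic(path: &Path, contents: &[u8]) -> io::Result<()> {
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    // Hypothetical temp name. It sits next to `path` so both are on the
    // same filesystem, which the rename needs in order to be a single step.
    let tmp = path.with_extension("json.tmp");
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(contents)?;
        file.sync_all()?; // flush to disk before the rename publishes it
    }
    fs::rename(&tmp, path)
}
```

Since each turn runs in a fresh process, the next turn either sees the previous complete file or the new complete file, never a truncated one.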
Binary file modified .smooth/dolt/pearls/.dolt/noms/journal.idx
Binary file not shown.
Binary file modified .smooth/dolt/pearls/.dolt/noms/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Binary file not shown.
41 changes: 27 additions & 14 deletions crates/smooth-bench/src/agent_driver.rs
@@ -40,6 +40,18 @@ use async_trait::async_trait;

use crate::score_cleanup::{AgentRunArtifacts, CoachMode, RefusalKind};

+/// Poll interval for `wait_for_idle` calls. Deliberately chosen to be
+/// **coprime** with smooth-code's 500ms spinner cycle (10 braille
+/// frames × 50ms TUI tick). Pearl `th-2e6693`: at 500ms the bench
+/// poll phase-locked with the spinner, so every poll caught the exact
+/// same frame, the captured pane bytes were identical, and idle
+/// fired before the agent actually finished responding. 383 is
+/// prime, so it shares no factor with the 500ms cycle or the 50ms
+/// tick, and consecutive captures genuinely see different spinner
+/// frames when the TUI is mid-thought; idle only fires when the
+/// agent is actually quiet.
+const POLL_INTERVAL_MS: u64 = 383;

/// Inputs every driver receives for a single task dispatch.
///
/// Borrowed because the caller holds the task fixture for the whole
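
The phase-lock argument in the `POLL_INTERVAL_MS` comment can be checked numerically. The sketch below is a simplified model, assuming the spinner starts at frame 0 at t = 0 and advances one frame per tick; `frames_seen` and `consecutive_frames_differ` are hypothetical helpers, not part of the bench.

```rust
/// Greatest common divisor (Euclid).
pub fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Spinner frame visible at each of `polls` captures, assuming the spinner
/// starts at frame 0 at t = 0 and advances one frame every `tick_ms`.
pub fn frames_seen(poll_ms: u64, cycle_ms: u64, tick_ms: u64, polls: u64) -> Vec<u64> {
    (0..polls).map(|k| (k * poll_ms % cycle_ms) / tick_ms).collect()
}

/// True when no two consecutive captures land on the same frame.
pub fn consecutive_frames_differ(frames: &[u64]) -> bool {
    frames.windows(2).all(|w| w[0] != w[1])
}
```

In this model a 500ms poll always samples frame 0, while a 383ms poll shifts the sampled phase by 117ms each time, which is more than two ticks, so adjacent captures cannot show the same frame. Coprimality with 500 additionally means the sampled phase does not repeat until 500 polls have elapsed.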
@@ -571,7 +583,7 @@ fn drive_tmux_agent(spec: TmuxAgentSpec) -> AgentRunArtifacts {

let start = std::time::Instant::now();
let total_budget = timeout.saturating_sub(Duration::from_secs(2));
-let pane1 = match driver.wait_for_idle(first_idle_dwell, Duration::from_millis(500), total_budget) {
+let pane1 = match driver.wait_for_idle(first_idle_dwell, Duration::from_millis(POLL_INTERVAL_MS), total_budget) {
Ok(p) => p,
Err(e) => {
let partial = driver.capture().unwrap_or_default();
@@ -618,7 +630,7 @@ fn drive_tmux_agent(spec: TmuxAgentSpec) -> AgentRunArtifacts {
pane1
} else {
let remaining = total_budget.saturating_sub(start.elapsed());
-driver.wait_for_idle(post_coach_dwell, Duration::from_millis(500), remaining).map_or_else(
+driver.wait_for_idle(post_coach_dwell, Duration::from_millis(POLL_INTERVAL_MS), remaining).map_or_else(
|e| {
eprintln!("[{driver_name}/{task_id}] post-coach idle timeout: {e}");
driver.capture().unwrap_or_else(|_| pane1.clone())
@@ -887,18 +899,19 @@ fn drive_smooth_via_tmux(
// the same 120s ceiling as `tui_score::TuiTaskConfig::default`.
boot_timeout: Duration::from_secs(120),
paste_warmup: Duration::from_millis(800),
-// Smooth's `Thinking...` is static text (no animation), so an
-// 8s idle dwell mis-fires on it before the model's first token
-// arrives — especially on small workspaces where Big Smooth's
-// cold-start tax can push first-token latency past 8s. Pearl
-// `th-65a041`: bench impossible-task variability was traced to
-// this. 20s gives the model room to think without breaking
-// the warm-case fast path (warm runs still finish around the
-// 60-second mark for typical fixtures). OpenCode keeps 8s
-// because its TUI shows visible token-streaming as soon as
-// the model starts emitting.
-first_idle_dwell: Duration::from_secs(20),
-post_coach_dwell: Duration::from_secs(10),
+// Was 20s under pearl `th-65a041` because smooth's `Thinking...`
+// was static text and the idle detector false-fired on it.
+// Pearl `th-2e6693` made the spinner animate every ~100ms +
+// POLL_INTERVAL_MS is coprime with the spinner cycle, so 8s
+// of byte-stable dwell now genuinely means "no agent
+// activity" on the first idle.
+first_idle_dwell: Duration::from_secs(8),
+// Post-coach is different — smooth's tool-call chain after
+// "yes, proceed" can include multiple `list_files` /
+// `bash rm -rf` rounds. 5s sometimes catches the agent
+// mid-tool-call. 15s gives the chain time to finish without
+// pushing wallclock budget excessively.
+post_coach_dwell: Duration::from_secs(15),
task_id,
workspace,
prompt,
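
To make the "byte-stable dwell" idea in the comments above concrete, here is a minimal sketch of an idle detector reduced to a pure function over a sequence of pane captures. It is a simplification under stated assumptions: `idle_fires_at` is a hypothetical name, captures are taken to be exactly `poll_ms` apart, and the total time budget that `wait_for_idle` also receives is ignored.

```rust
/// Index of the capture at which idle would fire: the first capture that
/// completes `dwell_ms` of byte-identical pane content, or `None` if the
/// pane keeps changing. Captures are assumed to be `poll_ms` apart.
pub fn idle_fires_at(captures: &[&str], poll_ms: u64, dwell_ms: u64) -> Option<usize> {
    let mut stable_since = 0usize; // index where the current unchanged run began
    for i in 0..captures.len() {
        if captures[i] != captures[stable_since] {
            stable_since = i; // pane changed: restart the dwell clock
        }
        let stable_ms = (i - stable_since) as u64 * poll_ms;
        if stable_ms >= dwell_ms {
            return Some(i);
        }
    }
    None
}
```

With a static `Thinking...` pane the detector fires as soon as the dwell elapses, whether or not the model has answered; with an animated spinner sampled at a non-locking interval, every capture resets the clock until the agent really goes quiet.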
10 changes: 10 additions & 0 deletions crates/smooth-bigsmooth/src/policy.rs
@@ -572,6 +572,11 @@ fn registered_tool_names() -> Vec<String> {
"list_files".into(),
"read_memory".into(),
"write_memory".into(),
+// Pearl th-1d6699: cross-turn task state, opencode-parity.
+// Without this entry Wonk denies every call and the agent
+// logs "I cannot use the todo_list tool as it is not in
+// the allowlist" then proceeds without it.
+"todo_list".into(),
"grep".into(),
"glob".into(),
"lsp".into(),
@@ -610,6 +615,11 @@ fn read_only_tool_names() -> Vec<String> {
"project_inspect".into(),
"read_memory".into(),
"write_memory".into(),
+// Pearl th-1d6699: read-only roles (oracle, mapper, heckler)
+// also need to track their own plan across turns. Cheap call
+// — the tool only writes to .smooth/todos.json which is
+// already covered by the workspace write allowlist.
+"todo_list".into(),
]
}
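
The comments in this file say that a tool missing from the registered list is denied. A minimal sketch of such a name-based gate follows; the `Decision` enum, the `check_tool` function, and the shortened tool list are assumptions for illustration and do not reflect how Wonk actually evaluates policy.

```rust
/// Hypothetical subset of the registered tool names shown in the diff.
pub fn registered_tool_names() -> Vec<String> {
    vec![
        "list_files".into(),
        "read_memory".into(),
        "write_memory".into(),
        "todo_list".into(),
        "grep".into(),
    ]
}

/// Outcome of a policy check for one tool call.
#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny(String),
}

/// Deny any tool whose name is not in the allowlist.
pub fn check_tool(name: &str, allowlist: &[String]) -> Decision {
    if allowlist.iter().any(|t| t == name) {
        Decision::Allow
    } else {
        Decision::Deny(format!("tool `{name}` is not registered"))
    }
}
```

Under a gate like this, registering the tool in the runner is not enough on its own: the name also has to appear in every role's list, which is why the diff adds it to both functions.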
