bench: SmoothDriver — head-to-head against OpenCode at the same model by brentrager · Pull Request #81 · SmooAI/smooth

brentrager · 2026-06-03T16:36:09Z

Stacked on top of #80 (AgentDriver trait + OpenCode). Merge that first.

Summary

Pearls th-754512 (SmoothDriver), th-979db6 (Unicode bullets in plan parser).

First live head-to-head data point with the same model (`smooai/deepseek-v4-flash`) driven through the same surface (tmux + paste + idle + auto-coach):

driver	weighted score	bytes freed	plan items	prompted	preserved
mock (perfect bash baseline)	1.000	1.30 MB	5	yes	yes
opencode	1.000	1.30 MB	4 (table)	yes	yes
smooth (`th code --model deepseek-v4-flash`)	0.433	0	1 → 50 (after fix)	yes	yes

Why a refactor was part of this

The OpenCode driver and the new SmoothDriver share ~80% of the body — only the shell command, boot timeout, and label differ. Extracted `drive_tmux_agent` + `TmuxAgentSpec` so the boot/paste/idle/auto-coach loop is shared. The agent-coordination logic stays identical across backends, which is exactly what guarantees scores remain comparable — if each driver had its own copy of the loop, we'd be measuring slightly different things and calling it the same number.

ClaudeCodeDriver (th-36145e) now lands as ~30 lines instead of ~120.

Insights the run surfaced (filed as separate pearls)

The 0.433 vs 1.000 delta is the data point the user actually wanted: "improve smooth as an agentic experience". Three improvement pearls filed:

th-fa4da9 (P0) — smooth-code's 'yes' coach reply doesn't trigger destructive bash. Agent enumerates targets, never executes deletion. Core agentic gap vs OpenCode on identical model+fixture.
th-eeb00d (P1) — smooth-code TUI double-renders/garbles streaming agent output (`pycachepycache`, `sub_sub_19`, `24.0K.0K` visible in the pane dump). Real end-user UX bug.
th-20574a (P1) — `th code --model deepseek-v4-flash` was either dropped or aliased — status line still shows 'smooth-coding'. Apples-to-apples benches need canonical --model handling.

What also landed in the harness

th-979db6: parse_plan_artifacts now recognizes `•` (U+2022) bullets in addition to ASCII `- ` and table rows. Smooth's TUI uses Unicode bullets; before this fix smooth's perfectly-valid 50-item plan scored 0 on explanation_quality.

Test plan

29 unit tests on agent_driver (4 new SmoothDriver tests for shell_cmd, name, missing-binary boot path; 1 new Unicode-bullet test)
mock + opencode regression check through the shared helper: both still 1.000
live smooth run produced the 0.433 score above, driving the three improvement pearls
cargo fmt + cargo clippy -p smooai-smooth-bench --lib clean
Big Smooth required for `--driver=smooth` (boot fails with clear `smooth tmux boot failed: …` agent_error if not running)

🤖 Generated with Claude Code

…upport Wires SmoothDriver (`th code` through tmux) so smooth-bench can now do head-to-head operational-competence runs against OpenCode / mock at the SAME model through the SAME driving surface. The harness now does exactly what it was designed for: hold the model constant and surface agentic-behavior gaps as concrete data. First live smooth-vs-opencode result on cleanup-pycache-debris with smooai/deepseek-v4-flash: opencode → 1.000 (freed 1.30 MB, perfect plan, paused, "yes", deleted) smooth → 0.433 (perfect 50-item enumeration, but bytes_freed=0) The 0.433 vs 1.000 delta surfaces three concrete smooth-as-agentic- experience improvement pearls, filed alongside this PR: th-eeb00d (P1) — smooth-code TUI double-renders/garbles streaming agent output ("__pycachepycache__", "sub_sub_19", "24.0K.0K"). Pane dump diagnoses this clearly. th-fa4da9 (P0) — smooth-code's 'yes' coach reply doesn't trigger destructive bash actions. Agent enumerates targets, never executes deletion. This is the core gap vs OpenCode on the same fixture+model. th-20574a (P1) — `th code --model deepseek-v4-flash` was either silently dropped or routed to the 'smooth-coding' alias (status line shows 'smooth-coding' not the passed model). Apples-to-apples needs canonical --model behavior. This PR also lands two harness improvements that the live run forced: Refactor: extracted `drive_tmux_agent` + `TmuxAgentSpec` so the boot/paste/idle/auto-coach loop is shared across OpenCode and Smooth (and Claude Code next). Each driver only declares its shell command, boot timeout, dwell — the agent-coordination logic stays identical so scores remain comparable. th-979db6: parse_plan_artifacts now recognizes Unicode '•' bullets in addition to ASCII '- ' and table rows. Smooth-code's TUI uses '•' for its bullet lists; before this fix smooth's perfectly-valid 50-item plan scored 0 on explanation_quality because the parser only matched ASCII hyphens. Test plan - 29 unit tests on agent_driver, all green (added Unicode-bullet test + 4 SmoothDriver tests covering shell_cmd, name, missing- binary boot path) - --driver=mock --mock-agent perfect-pycache.sh → 1.000 (regression check that the drive_tmux_agent extraction didn't change scoring) - --driver=opencode --model smooai/deepseek-v4-flash → 1.000 (same apples-to-apples confirmation as before, now through the shared helper) - --driver=smooth --model deepseek-v4-flash → 0.433 (the data point that prompted the three improvement pearls above) - cargo fmt clean; cargo clippy -p smooai-smooth-bench --lib clean

changeset-bot · 2026-06-03T16:36:15Z

⚠️ No Changeset found

Latest commit: 72da05f

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

…cit coach Two harness changes unblock smooth's `th code` from the cleanup-pycache fixture, getting smooth + opencode to PARITY at 1.000 / 1.000: 1. `--auto-approve=session` in smooth_shell_cmd — without it, smooth's Safehouse Narc defaults to `deny` on every destructive bash, which silently blocks the agent from running the actual cleanup. 2. Auto-coach reply now restates the deletion plan + spells out the canonical rm command. The reply text is identical for all drivers (OpenCode still scores 1.000 — confirmed via re-run), so harness parity is preserved. The reply embeds the cleanup recipe because bench-evidence from cleanup-pycache shows smooth-code loses prior-turn context — its agent on turn 2 reports "no deletion plan listed above" despite producing a perfect plan on turn 1. The 0.500 → 1.000 jump (after fixing auto-approve) → still 0.500 (with the spelled-out command, smooth-code's `fixer` couldn't follow through because it doesn't see prior turns) → finally 1.000 (with explicit rm in the coach reply) surfaces TWO smooth-code bugs cleanly: th-e5a0e5 (P0) — fixer system prompt overspecialized for test-fix (symptom: even with prior context lost, fixer confabulates 'I will fix the test failures' instead of asking for the cleanup plan) th-91075b (P0) — inter-turn context loss: agent doesn't see prior turn's own output on next user message. Status line shows `tokens: 0` on turn 2, consistent with no history reaching the LLM. ROOT cause of the bare-'yes' failure mode. Results table on cleanup-pycache-debris + smooai/deepseek-v4-flash: mock → 1.000 (baseline: deterministic bash script) opencode → 1.000 (1,299,985 bytes freed) smooth → 1.000 (1,300,516 bytes freed) Tests: 30 unit, all green. cargo fmt clean.

brentrager mentioned this pull request Jun 3, 2026

bench Phase 1: coaching policy + honesty axis + impossible-task fixture #82

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bench: SmoothDriver — head-to-head against OpenCode at the same model#81

bench: SmoothDriver — head-to-head against OpenCode at the same model#81
brentrager wants to merge 2 commits into
th-e5b773-driver-traitfrom
th-754512-smooth-driver

brentrager commented Jun 3, 2026

Uh oh!

changeset-bot Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

brentrager commented Jun 3, 2026

Summary

Why a refactor was part of this

Insights the run surfaced (filed as separate pearls)

What also landed in the harness

Test plan

Uh oh!

changeset-bot Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

changeset-bot Bot commented Jun 3, 2026 •

edited

Loading