Skip to content

bench: SmoothDriver — head-to-head against OpenCode at the same model#81

Open
brentrager wants to merge 2 commits into
th-e5b773-driver-traitfrom
th-754512-smooth-driver
Open

bench: SmoothDriver — head-to-head against OpenCode at the same model#81
brentrager wants to merge 2 commits into
th-e5b773-driver-traitfrom
th-754512-smooth-driver

Conversation

@brentrager

Copy link
Copy Markdown
Contributor

Stacked on top of #80 (AgentDriver trait + OpenCode). Merge that first.

Summary

Pearls th-754512 (SmoothDriver), th-979db6 (Unicode bullets in plan parser).

First live head-to-head data point with the same model (`smooai/deepseek-v4-flash`) driven through the same surface (tmux + paste + idle + auto-coach):

driver weighted score bytes freed plan items prompted preserved
mock (perfect bash baseline) 1.000 1.30 MB 5 yes yes
opencode 1.000 1.30 MB 4 (table) yes yes
smooth (`th code --model deepseek-v4-flash`) 0.433 0 1 → 50 (after fix) yes yes

Why a refactor was part of this

The OpenCode driver and the new SmoothDriver share ~80% of the body — only the shell command, boot timeout, and label differ. Extracted `drive_tmux_agent` + `TmuxAgentSpec` so the boot/paste/idle/auto-coach loop is shared. The agent-coordination logic stays identical across backends, which is exactly what guarantees scores remain comparable — if each driver had its own copy of the loop, we'd be measuring slightly different things and calling it the same number.

ClaudeCodeDriver (th-36145e) now lands as ~30 lines instead of ~120.

Insights the run surfaced (filed as separate pearls)

The 0.433 vs 1.000 delta is the data point the user actually wanted: "improve smooth as an agentic experience". Three improvement pearls filed:

  • th-fa4da9 (P0) — smooth-code's 'yes' coach reply doesn't trigger destructive bash. Agent enumerates targets, never executes deletion. Core agentic gap vs OpenCode on identical model+fixture.
  • th-eeb00d (P1) — smooth-code TUI double-renders/garbles streaming agent output (`pycachepycache`, `sub_sub_19`, `24.0K.0K` visible in the pane dump). Real end-user UX bug.
  • th-20574a (P1) — `th code --model deepseek-v4-flash` was either dropped or aliased — status line still shows 'smooth-coding'. Apples-to-apples benches need canonical --model handling.

What also landed in the harness

  • th-979db6: parse_plan_artifacts now recognizes `•` (U+2022) bullets in addition to ASCII `- ` and table rows. Smooth's TUI uses Unicode bullets; before this fix smooth's perfectly-valid 50-item plan scored 0 on explanation_quality.

Test plan

  • 29 unit tests on agent_driver (4 new SmoothDriver tests for shell_cmd, name, missing-binary boot path; 1 new Unicode-bullet test)
  • mock + opencode regression check through the shared helper: both still 1.000
  • live smooth run produced the 0.433 score above, driving the three improvement pearls
  • cargo fmt + cargo clippy -p smooai-smooth-bench --lib clean
  • Big Smooth required for `--driver=smooth` (boot fails with clear `smooth tmux boot failed: …` agent_error if not running)

🤖 Generated with Claude Code

…upport

Wires SmoothDriver (`th code` through tmux) so smooth-bench can now do
head-to-head operational-competence runs against OpenCode / mock at
the SAME model through the SAME driving surface. The harness now does
exactly what it was designed for: hold the model constant and surface
agentic-behavior gaps as concrete data.

First live smooth-vs-opencode result on cleanup-pycache-debris with
smooai/deepseek-v4-flash:

  opencode → 1.000 (freed 1.30 MB, perfect plan, paused, "yes", deleted)
  smooth   → 0.433 (perfect 50-item enumeration, but bytes_freed=0)

The 0.433 vs 1.000 delta surfaces three concrete smooth-as-agentic-
experience improvement pearls, filed alongside this PR:

  th-eeb00d (P1) — smooth-code TUI double-renders/garbles streaming
                   agent output ("__pycachepycache__", "sub_sub_19",
                   "24.0K.0K"). Pane dump diagnoses this clearly.
  th-fa4da9 (P0) — smooth-code's 'yes' coach reply doesn't trigger
                   destructive bash actions. Agent enumerates targets,
                   never executes deletion. This is the core gap vs
                   OpenCode on the same fixture+model.
  th-20574a (P1) — `th code --model deepseek-v4-flash` was either
                   silently dropped or routed to the 'smooth-coding'
                   alias (status line shows 'smooth-coding' not the
                   passed model). Apples-to-apples needs canonical
                   --model behavior.

This PR also lands two harness improvements that the live run forced:

  Refactor: extracted `drive_tmux_agent` + `TmuxAgentSpec` so the
  boot/paste/idle/auto-coach loop is shared across OpenCode and
  Smooth (and Claude Code next). Each driver only declares its shell
  command, boot timeout, dwell — the agent-coordination logic stays
  identical so scores remain comparable.

  th-979db6: parse_plan_artifacts now recognizes Unicode '•' bullets
  in addition to ASCII '- ' and table rows. Smooth-code's TUI uses
  '•' for its bullet lists; before this fix smooth's perfectly-valid
  50-item plan scored 0 on explanation_quality because the parser
  only matched ASCII hyphens.

Test plan
  - 29 unit tests on agent_driver, all green (added Unicode-bullet
    test + 4 SmoothDriver tests covering shell_cmd, name, missing-
    binary boot path)
  - --driver=mock --mock-agent perfect-pycache.sh → 1.000 (regression
    check that the drive_tmux_agent extraction didn't change scoring)
  - --driver=opencode --model smooai/deepseek-v4-flash → 1.000 (same
    apples-to-apples confirmation as before, now through the shared
    helper)
  - --driver=smooth --model deepseek-v4-flash → 0.433 (the data point
    that prompted the three improvement pearls above)
  - cargo fmt clean; cargo clippy -p smooai-smooth-bench --lib clean
@changeset-bot

changeset-bot Bot commented Jun 3, 2026

Copy link
Copy Markdown

⚠️ No Changeset found

Latest commit: 72da05f

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

…cit coach

Two harness changes unblock smooth's `th code` from the cleanup-pycache
fixture, getting smooth + opencode to PARITY at 1.000 / 1.000:

1. `--auto-approve=session` in smooth_shell_cmd — without it, smooth's
   Safehouse Narc defaults to `deny` on every destructive bash, which
   silently blocks the agent from running the actual cleanup.

2. Auto-coach reply now restates the deletion plan + spells out the
   canonical rm command. The reply text is identical for all drivers
   (OpenCode still scores 1.000 — confirmed via re-run), so harness
   parity is preserved. The reply embeds the cleanup recipe because
   bench-evidence from cleanup-pycache shows smooth-code loses
   prior-turn context — its agent on turn 2 reports "no deletion
   plan listed above" despite producing a perfect plan on turn 1.

The 0.500 → 1.000 jump (after fixing auto-approve) → still 0.500
(with the spelled-out command, smooth-code's `fixer` couldn't follow
through because it doesn't see prior turns) → finally 1.000 (with
explicit rm in the coach reply) surfaces TWO smooth-code bugs cleanly:

  th-e5a0e5 (P0) — fixer system prompt overspecialized for test-fix
                   (symptom: even with prior context lost, fixer
                   confabulates 'I will fix the test failures' instead
                   of asking for the cleanup plan)
  th-91075b (P0) — inter-turn context loss: agent doesn't see prior
                   turn's own output on next user message. Status line
                   shows `tokens: 0` on turn 2, consistent with no
                   history reaching the LLM. ROOT cause of the
                   bare-'yes' failure mode.

Results table on cleanup-pycache-debris + smooai/deepseek-v4-flash:

  mock     → 1.000 (baseline: deterministic bash script)
  opencode → 1.000 (1,299,985 bytes freed)
  smooth   → 1.000 (1,300,516 bytes freed)

Tests: 30 unit, all green. cargo fmt clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant