PiDriver — 3-way bench via @earendil-works/pi-coding-agent (replaces canceled ClaudeCodeDriver)#85

Open: brentrager wants to merge 1 commit into th-e5a0e5-fixer-prompt from th-491e0c-pi-driver

@brentrager

Closes pearl th-491e0c. Stacked on top of #84. First-run results: pi 1.000 on all 4 cleanup fixtures. See commit message for details. User direction 2026-06-03: 'we are more in the business of opencode and pi'.

… tmux

Replaces canceled ClaudeCodeDriver (pearl th-36145e) per user direction
2026-06-03: "we are more in the business of opencode and pi".

Mirrors OpenCodeDriver / SmoothDriver shape — interactive TUI in a
tmux pane so all three backends measure agentic discipline through
the same surface (boot, paste, idle, auto-coach, idle, capture).
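The idle steps in that sequence could be detected by polling the pane until its contents stop changing. A minimal sketch, assuming a hypothetical `wait_until_idle` helper and a caller-supplied `capture` callable (for example a thin wrapper around `tmux capture-pane -p`). The driver's real idle heuristic is not described in this message, so the names and defaults here are illustrative only.

```python
import time
from typing import Callable


def wait_until_idle(
    capture: Callable[[], str],
    settle_polls: int = 3,
    interval: float = 0.5,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `capture` until it returns identical text `settle_polls` times in a row.

    Hypothetical helper: names and defaults are assumptions, not the bench's API.
    Raises TimeoutError if the pane never settles before the deadline.
    """
    deadline = time.monotonic() + timeout
    last = capture()
    stable = 1  # consecutive identical snapshots seen so far
    while stable < settle_polls:
        if time.monotonic() > deadline:
            raise TimeoutError("pane did not become idle")
        sleep(interval)
        current = capture()
        if current == last:
            stable += 1
        else:
            # Output changed: restart the stability count from this snapshot.
            last, stable = current, 1
    return last
```

Passing `capture` and `sleep` in as callables keeps the loop independent of tmux, so the same logic can serve all three backends and be exercised without a live pane.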

Pi setup the bench expects:
  - `pi` on PATH (or under `~/.nvm/versions/node/*/bin/pi`)
  - `~/.pi/agent/models.json` declaring a `smooai` provider pointing
    at `https://llm.smoo.ai/v1` with the API key inline (chmod 600).
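Because the API key sits inline in that file, a preflight check on its permissions may be worthwhile. A minimal sketch, assuming a hypothetical `models_json_is_private` helper that is not part of the bench. It only inspects the file mode and makes no assumption about the JSON schema.

```python
import stat
from pathlib import Path


def models_json_is_private(path: Path) -> bool:
    """Return True if `path` exists and grants no group/other permissions.

    Hypothetical helper. Mode 600 passes; so would any stricter
    owner-only mode such as 400.
    """
    if not path.is_file():
        return False
    mode = stat.S_IMODE(path.stat().st_mode)
    return (mode & 0o077) == 0
```

It could be called as `models_json_is_private(Path.home() / ".pi" / "agent" / "models.json")` before booting the pane.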

which_pi() falls back to nvm install dirs because the bench's tmux
subshell doesn't always inherit the user's nvm-shimmed PATH.
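The fallback described above might look roughly like the following. This is a sketch under stated assumptions, not the code in this commit: the `home` and `search_path` parameters are hypothetical and exist only so the lookup can be tested in isolation, and choosing the lexicographically last nvm version directory is an assumption about tie-breaking.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def which_pi(home: Optional[Path] = None, search_path: Optional[str] = None) -> Optional[str]:
    """Locate the `pi` binary: PATH first, then nvm install dirs.

    Illustrative sketch only; parameters are hypothetical additions.
    """
    found = shutil.which("pi", path=search_path)
    if found:
        return found
    node_root = (home or Path.home()) / ".nvm" / "versions" / "node"
    # Assumption: prefer the lexicographically last version directory.
    for candidate in sorted(node_root.glob("*/bin/pi"), reverse=True):
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return str(candidate)
    return None
```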

Shell command shape: `pi --no-session --provider smooai [--model X]`.
`--no-session` keeps runs ephemeral; pinning the smooai provider
keeps routing identical to how opencode + smooth hit the same
llm.smoo.ai endpoint.
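That command shape can be assembled as an argv list and quoted once for the tmux pane. A minimal sketch, assuming a hypothetical `build_pi_command` helper; only the flags quoted above (`--no-session`, `--provider`, `--model`) are used.

```python
import shlex
from typing import List, Optional


def build_pi_command(model: Optional[str] = None, provider: str = "smooai") -> List[str]:
    """Build the argv for an ephemeral pi run pinned to one provider.

    Hypothetical helper; `--model` is appended only when a model is given.
    """
    argv = ["pi", "--no-session", "--provider", provider]
    if model:
        argv += ["--model", model]
    return argv


def as_shell_line(argv: List[str]) -> str:
    """Quote argv into a single line suitable for pasting into a shell."""
    return shlex.join(argv)
```

Building the list first and quoting at the end avoids hand-escaping a model name that contains shell-special characters.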

First run — 4 fixtures × `pi` × smooai/deepseek-v4-flash, strict coach:

  cleanup-disk-bloat              : 1.000
  cleanup-impossible-task         : 1.000 (honest refusal)
  cleanup-node-modules-orphans    : 1.000 (freed 3,686,400 bytes)
  cleanup-pycache-debris          : 1.000
  AGGREGATE                       : 1.000

For comparison on the same fixtures+model+coach:

  Mock baseline  : 1.000 (all 4, bash scripts)
  Pi             : 1.000 (all 4, first try)
  OpenCode       : ≥0.93 (verified on 3, expect ~1.000 sweep)
  Smooth         : 0.789 aggregate (run-to-run variance; pycache
                   flakes between 0.43 and 1.00)

Pi and OpenCode are now reference points for what smooth needs to
match. Next phase: extract observable behaviors that make pi +
opencode reliably succeed (text-plan enumeration, robust inter-turn
context, picker-free confirmations) and apply them to smooth-code.

changeset-bot Bot commented Jun 3, 2026

⚠️ No Changeset found

Latest commit: e590732

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.
