th code: --model flag now routes through to TaskStart (was silently dropped on TUI path) #87

Open
brentrager wants to merge 5 commits into th-491e0c-pi-driver from th-20574a-model-flag-routes

@brentrager

Closes pearl th-20574a. See the commit message for the full diagnosis. The smoking gun was that the daemon log read `model=smooth-fast-gemini` when the bench passed `--model deepseek-v4-flash`. It now reads `overrode routing.coding.model from opts.model model=deepseek-v4-flash`, as expected.

Root-caused the "Big Smooth crashes unprompted" symptom: it wasn't
crashing — it was gracefully shutting down on a 30-minute idle
timer (server.rs:600+). Bench evidence today showed the 30-minute
cliff fired after almost every /loop pause (1800s wakeup intervals),
killing the daemon mid-session and forcing 3+ manual `th up`s.

Pi + OpenCode have no daemon → no auto-shutdown → no "crashed
unprompted" symptom. Smooth's daemon model meant every loop pause
was implicitly a kill. This is exactly the kind of competitive-
parity gap the bench was designed to surface (user direction
2026-06-03: "i want smooth to learn from pi and opencode and make
smooth competitive").

The 24h default keeps a safety net for genuinely forgotten dev
sessions but doesn't fire during a single work session. The
existing `SMOOTH_BIGSMOOTH_IDLE_TIMEOUT_SECS` env override still
works (set to 0 to disable, or to a smaller value to opt back in
to aggressive timeouts). Caveat: the variable must be set in the
daemon's own process environment, which in sandboxed mode is the
safehouse VM (not the host shell).
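
A minimal sketch of the override semantics described above, assuming the
variable holds a whole number of seconds. The helper names
(`idle_timeout_from`, `idle_timeout_from_env`) are hypothetical and are not
taken from server.rs. The fallback to the default on an unparsable value is
also an assumption.

```rust
use std::time::Duration;

/// 24-hour default, as stated in the commit message.
const DEFAULT_IDLE_TIMEOUT_SECS: u64 = 24 * 60 * 60;

/// Hypothetical helper: map the raw value of
/// `SMOOTH_BIGSMOOTH_IDLE_TIMEOUT_SECS` to an optional timeout.
/// `None` means the idle shutdown is disabled.
fn idle_timeout_from(raw: Option<&str>) -> Option<Duration> {
    let secs = raw
        // Assumption: a missing or unparsable value falls back to the default.
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_IDLE_TIMEOUT_SECS);
    if secs == 0 {
        None // 0 disables the timer entirely
    } else {
        Some(Duration::from_secs(secs))
    }
}

/// Reads the variable from the *current* process environment. In sandboxed
/// mode that is the daemon inside the safehouse VM, not the host shell.
#[allow(dead_code)]
fn idle_timeout_from_env() -> Option<Duration> {
    idle_timeout_from(
        std::env::var("SMOOTH_BIGSMOOTH_IDLE_TIMEOUT_SECS")
            .ok()
            .as_deref(),
    )
}
```

Under these assumptions, an unset variable yields the 24h timeout, `0`
yields no timeout, and `1800` restores the old 30-minute behavior.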

Smoking-gun log line that closed the diagnosis:

  2026-06-03T21:07:17 INFO smooth_bigsmooth::server:
    Idle timeout reached (1800s), shutting down

Required rebuild path: run `scripts/build-safehouse.sh`, `cp` the new
binary to `~/.smooth/runner-bin/safehouse`, then `th down` + `th up`.
The shadow-bin mechanism (smooth-cli/src/main.rs:1292) bind-mounts
this over the OCI image's safehouse binary so dev iteration on
crates/smooth-bigsmooth doesn't need a full image push.

Documents the existing `th up direct` mode that we'd been overlooking:
it boots in ~0.3s on the host instead of the ~30s safehouse-microVM
startup. That meets or beats pi/opencode boot time (both ~3s).

Bench evidence from today:

  smooth-direct    : 0.850 aggregate, ~0.3s boot
  smooth-sandboxed : 0.789 aggregate, ~30s boot (with variance)
  pi               : 1.000 aggregate, ~3s boot
  opencode         : >=0.93 aggregate, ~3s boot

Direct mode trade-off is no isolation — the agent runs as a host
subprocess against the host filesystem. Fine for dev machines + CI
runners you own + bench harnesses. Sandboxed remains the default
for untrusted dispatch.

Required setup for direct mode (the runner-discovery error message
already tells you, but worth surfacing in docs):
  cargo build --release -p smooai-smooth-operator-runner
  SMOOTH_OPERATOR_RUNNER_NATIVE=~/.cargo/shared-target/release/smooth-operator-runner th up direct

Pearl follow-ups still open after this:
  th-6e361d — pycache run-to-run variance (direct still showed 0.500
              on disk-bloat in one run; smooth's nondeterminism isn't
              purely a sandbox artifact)
  th-e74aa6 — runner-discovery UX paper-cut (separate from this work)

The TUI path silently dropped `th code --model <X>`. `cmd_code`
parsed the flag into a local `model: Option<String>`, then called
`run_with_session(working_dir, resumed_session, agent)` — no model.
Every TaskStart fell back to the smooth-coding alias regardless of
what the user asked for.

Bench evidence: the harness passed `--model deepseek-v4-flash`,
the dispatch log read `model=smooth-fast-gemini`. Pi + OpenCode
were verified at deepseek-v4-flash on the same fixtures, so the
comparison wasn't apples-to-apples on model.

Fix:
  - Add `model_override: Option<String>` to AppState (separate from
    `model_name` which is display-only).
  - Thread `model: Option<String>` through `run_with_session`,
    setting `initial_state.model_override` when supplied.
  - In `run_agent_streaming`, read `state.model_override` and pass
    it to `client.run_task(...)` instead of the literal `None`.
  - `run()` callsite gets `None` for backwards compatibility (no
    --model means Big Smooth picks the default, same as before).
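
The threading described in the list above can be sketched as follows. This
is a simplified illustration, not the actual code: `AppState` is reduced to
the two fields mentioned, and `TaskStartRequest`, `initial_state`, and
`build_task_start` are hypothetical stand-ins for the real
`run_with_session` / `run_agent_streaming` / `client.run_task(...)` path.

```rust
/// Trimmed-down stand-in for the TUI's AppState; only the two fields
/// discussed in the fix are shown.
#[allow(dead_code)]
#[derive(Debug, Default)]
struct AppState {
    /// Display-only label; never used for routing.
    model_name: String,
    /// Value of `--model`, if the user passed one.
    model_override: Option<String>,
}

/// Hypothetical stand-in for the payload that reaches TaskStart.
#[derive(Debug, PartialEq)]
struct TaskStartRequest {
    prompt: String,
    /// `None` lets Big Smooth pick its default model.
    model: Option<String>,
}

/// Mirrors the `run_with_session` step: set the override only when supplied.
fn initial_state(model: Option<String>) -> AppState {
    let mut state = AppState::default();
    if let Some(m) = model {
        state.model_override = Some(m);
    }
    state
}

/// Mirrors the `run_agent_streaming` step: forward the override instead of
/// a literal `None`.
fn build_task_start(state: &AppState, prompt: &str) -> TaskStartRequest {
    TaskStartRequest {
        prompt: prompt.to_string(),
        model: state.model_override.clone(),
    }
}
```

Keeping `model_override` separate from `model_name` means the display label
can change without affecting which model is requested, and a call with
`None` behaves exactly as it did before the fix.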

Confirmed in the daemon log after the fix:

  direct dispatch: overrode routing.coding.model from opts.model
    model=deepseek-v4-flash

That's the line that wasn't appearing before because routing always
got `opts.model = None`.

This UNBLOCKS th-6e361d (pycache variance) — now we can measure
smooth's agent behavior at the same model pi + opencode use,
isolating whether remaining variance is the model or the prompt /
workflow. The first post-fix smooth-deepseek run on pycache still
scored 0.500 (the agent confabulated a test-fix narrative), so the
remaining gap is now clearly the fixer system prompt (th-e5a0e5
needs a deeper restructure; the end-of-file addendum isn't enough).

@changeset-bot

changeset-bot Bot commented Jun 4, 2026

⚠️ No Changeset found

Latest commit: 50153d4

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Strong counter-bias against the agent pivoting to "fix the failing
tests" on tasks that have nothing to do with tests. Standalone
defensible improvement (the rule itself is unambiguous), but
empirically NOT sufficient on its own to stop the pivot:

  smooth on cleanup-pycache-debris @ deepseek-v4-flash + Hard Rule #0:
    "I need to fix the failing tests. Since no test files were found,
    I will create a test file tests/test_util.py … fixed the failing
    tests. Test Results: 1 passed, 0 failed."

  Score: 0.500. Agent literally created and ran an invented test.

This means the test-fix orientation is multi-layer:
  - The fixer.txt system prompt is heavy on test guidance (still
    correct for actual test-fix tasks, just badly weighted)
  - coding_workflow.rs's CODING phase may enforce a structure that
    expects tests
  - The model has been trained to pattern-match "Test Results"
    markers as a goal state

Hard Rule #0 stays as one piece of the fix. The deeper surgery is
filed as a new pearl: needs either a general-purpose agent persona
that doesn't go through coding_workflow.rs's test-centric path,
or a workflow-level branch on "is this actually a test task?"
classified from the user's first message.
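
One possible shape for the workflow-level branch mentioned above, purely as
an illustration: a keyword heuristic over the user's first message. The
`TaskKind` enum, the `classify_first_message` function, and the marker
phrases are all hypothetical. The follow-up pearl may well choose a
different mechanism, such as a model-based classifier.

```rust
/// Hypothetical task categories for the workflow branch.
#[derive(Debug, PartialEq, Eq)]
enum TaskKind {
    /// The user explicitly asked about failing tests.
    TestFix,
    /// Anything else, e.g. cleanup or refactoring tasks.
    General,
}

/// Hypothetical classifier: treat the task as a test-fix task only when the
/// first user message mentions failing tests in so many words.
fn classify_first_message(message: &str) -> TaskKind {
    // Illustrative phrases only; not taken from the codebase.
    const TEST_MARKERS: [&str; 4] = [
        "failing test",
        "fix the test",
        "test failure",
        "tests fail",
    ];
    let lower = message.to_lowercase();
    if TEST_MARKERS.iter().any(|marker| lower.contains(marker)) {
        TaskKind::TestFix
    } else {
        TaskKind::General
    }
}
```

With this heuristic, a request such as "Clean up the __pycache__ debris"
would take the general path and would not enter the test-centric flow.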