th code: --model flag now routes through to TaskStart (was silently dropped on TUI path) by brentrager · Pull Request #87 · SmooAI/smooth

brentrager · 2026-06-04T00:18:51Z

Closes pearl th-20574a. See the commit message for the full diagnosis. The smoking gun was that the daemon log read `model=smooth-fast-gemini` when the bench passed `--model deepseek-v4-flash`. It now reads `overrode routing.coding.model from opts.model model=deepseek-v4-flash`, as expected.

Root-caused the "Big Smooth crashes unprompted" symptom: it wasn't crashing; it was gracefully shutting down on a 30-minute idle timer (server.rs:600+).

Bench evidence today showed the 30-minute cliff fired after almost every /loop pause (1800s wakeup intervals), killing the daemon mid-session and forcing 3+ manual `th up`s. Pi + OpenCode have no daemon → no auto-shutdown → no "crashed unprompted" symptom. Smooth's daemon model meant every loop pause was implicitly a kill. This is exactly the kind of competitive-parity gap the bench was designed to surface (user direction 2026-06-03: "i want smooth to learn from pi and opencode and make smooth competitive").

24h default keeps a safety net for genuinely forgotten dev sessions but doesn't fire during a single work session. The existing `SMOOTH_BIGSMOOTH_IDLE_TIMEOUT_SECS` env override still works (set to 0 to disable, or to a smaller value to opt back in to aggressive timeouts), with the caveat that the env var must be set in the daemon's process, which in sandboxed mode is the safehouse VM (not the host shell).

Smoking-gun log line that closed the diagnosis:

    2026-06-03T21:07:17 INFO smooth_bigsmooth::server: Idle timeout reached (1800s), shutting down

Required rebuild path: scripts/build-safehouse.sh + cp the new binary to ~/.smooth/runner-bin/safehouse, then th down + th up. The shadow-bin mechanism (smooth-cli/src/main.rs:1292) bind-mounts this over the OCI image's safehouse binary so dev iteration on crates/smooth-bigsmooth doesn't need a full image push.
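The timeout rules described above (24h default, `0` disables, any other value is taken as seconds) can be sketched roughly as follows. This is an illustration only, not the code in server.rs: the helper names `resolve_idle_timeout`, `should_shut_down`, and `idle_timeout_from_env` are hypothetical, and falling back to the default on an unset or unparseable value is an assumption.

```rust
use std::time::Duration;

/// Default idle timeout described in the commit message: 24 hours.
const DEFAULT_IDLE_TIMEOUT_SECS: u64 = 24 * 60 * 60;

/// Hypothetical helper: turn the raw value of
/// `SMOOTH_BIGSMOOTH_IDLE_TIMEOUT_SECS` into an optional timeout.
/// `None` means the idle timer is disabled (the value was `0`).
/// ASSUMPTION: unset or unparseable values fall back to the 24h default.
fn resolve_idle_timeout(raw: Option<&str>) -> Option<Duration> {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_IDLE_TIMEOUT_SECS);
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}

/// Hypothetical helper: decide whether the daemon should shut down,
/// given how long it has been idle.
fn should_shut_down(idle_for: Duration, timeout: Option<Duration>) -> bool {
    match timeout {
        Some(limit) => idle_for >= limit,
        None => false, // timer disabled: never shut down on idleness
    }
}

/// Reads the variable from the current process environment. In sandboxed
/// mode that is the daemon's environment inside the safehouse VM, not the
/// host shell.
fn idle_timeout_from_env() -> Option<Duration> {
    let raw = std::env::var("SMOOTH_BIGSMOOTH_IDLE_TIMEOUT_SECS").ok();
    resolve_idle_timeout(raw.as_deref())
}
```

Under these assumptions, a 1800s /loop pause trips the old 30-minute setting but stays well inside the 24h default.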

Documents the existing `th up direct` mode that we'd been overlooking: boots in ~0.3s on the host instead of the ~30s safehouse-microVM startup. That's pi/opencode-parity boot time (they're both ~3s).

Bench evidence from today:

    smooth-direct    : 0.850 aggregate, ~0.3s boot
    smooth-sandboxed : 0.789 aggregate, ~30s boot (with variance)
    pi               : 1.000 aggregate, ~3s boot
    opencode         : >=0.93 aggregate, ~3s boot

Direct mode trade-off is no isolation: the agent runs as a host subprocess against the host filesystem. Fine for dev machines + CI runners you own + bench harnesses. Sandboxed remains the default for untrusted dispatch.

Required setup for direct mode (the runner-discovery error message already tells you, but worth surfacing in docs):

    cargo build --release -p smooai-smooth-operator-runner
    SMOOTH_OPERATOR_RUNNER_NATIVE=~/.cargo/shared-target/release/smooth-operator-runner th up direct

Pearl follow-ups still open after this:

- th-6e361d: pycache run-to-run variance (direct still showed 0.500 on disk-bloat in one run; smooth's nondeterminism isn't purely a sandbox artifact)
- th-e74aa6: runner-discovery UX paper-cut (separate from this work)
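For the direct-mode setup above, a runner lookup driven by `SMOOTH_OPERATOR_RUNNER_NATIVE` might look something like the sketch below. The function `resolve_native_runner` and its error strings are hypothetical and are not taken from smooth's actual runner-discovery code; only the variable name and the `cargo build` command come from the commit message.

```rust
use std::path::PathBuf;

/// Hypothetical lookup of the native runner binary from the raw value of
/// `SMOOTH_OPERATOR_RUNNER_NATIVE`. Returns the path if it points at an
/// existing file, otherwise an error message with a build hint.
fn resolve_native_runner(raw: Option<&str>) -> Result<PathBuf, String> {
    const BUILD_HINT: &str = "cargo build --release -p smooai-smooth-operator-runner";

    match raw.map(str::trim).filter(|s| !s.is_empty()) {
        None => Err(format!(
            "SMOOTH_OPERATOR_RUNNER_NATIVE is not set; build the runner with `{BUILD_HINT}`"
        )),
        // A leading `~` only reaches the process when the shell did not
        // expand it (for example, when the value was quoted); Rust's
        // standard library does no tilde expansion of its own.
        Some(p) if p.starts_with('~') => Err(format!(
            "`{p}` still contains an unexpanded `~`; pass an absolute path"
        )),
        Some(p) => {
            let path = PathBuf::from(p);
            if path.is_file() {
                Ok(path)
            } else {
                Err(format!(
                    "runner not found at `{p}`; build it with `{BUILD_HINT}`"
                ))
            }
        }
    }
}
```

In the unquoted form used in the commit message, the shell expands `~` in the assignment before `th` starts, so the tilde branch would only matter for quoted values.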

The TUI path silently dropped `th code --model <X>`. `cmd_code` parsed the flag into a local `model: Option<String>`, then called `run_with_session(working_dir, resumed_session, agent)` with no model argument. Every TaskStart fell back to the smooth-coding alias regardless of what the user asked for.

Bench evidence: the harness passed `--model deepseek-v4-flash`, the dispatch log read `model=smooth-fast-gemini`. Pi + OpenCode were verified at deepseek-v4-flash on the same fixtures, so the comparison wasn't apples-to-apples on model.

Fix:

- Add `model_override: Option<String>` to AppState (separate from `model_name` which is display-only).
- Thread `model: Option<String>` through `run_with_session`, setting `initial_state.model_override` when supplied.
- In `run_agent_streaming`, read `state.model_override` and pass it to `client.run_task(...)` instead of the literal `None`.
- `run()` callsite gets `None` for backwards compatibility (no --model means Big Smooth picks the default, same as before).

Confirmed in the daemon log after the fix:

    direct dispatch: overrode routing.coding.model from opts.model model=deepseek-v4-flash

That's the line that wasn't appearing before because routing always got `opts.model = None`.

This UNBLOCKS th-6e361d (pycache variance): now we can measure smooth's agent behavior at the same model pi + opencode use, isolating whether remaining variance is the model or the prompt / workflow. First post-fix smooth-deepseek run on pycache still scored 0.500 (agent confabulated test-fix narrative), so the remaining gap is now clearly the fixer system prompt (th-e5a0e5 needs deeper restructure; the end-of-file addendum isn't enough).
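The shape of the fix listed above can be illustrated with a trimmed-down sketch. Only the names `AppState`, `model_name`, `model_override`, `run_agent_streaming`, and `run_task` come from the commit message; the struct layouts, the `FakeClient` stand-in, the `initial_state` helper, and all signatures are assumptions made for illustration and will differ from the real code.

```rust
/// Hypothetical, trimmed-down stand-in for the TUI's AppState.
#[derive(Debug, Default, Clone)]
struct AppState {
    /// Display-only label shown in the UI; never used for routing.
    model_name: String,
    /// Model requested via `--model`; `None` lets Big Smooth pick the default.
    model_override: Option<String>,
}

/// Stand-in for the daemon client: it just records the model it was given.
#[derive(Debug, Default)]
struct FakeClient {
    last_model: Option<String>,
}

impl FakeClient {
    fn run_task(&mut self, _prompt: &str, model: Option<String>) {
        self.last_model = model;
    }
}

/// Mirrors what `run_with_session` is described as doing: carry the parsed
/// CLI flag into the initial state instead of dropping it.
fn initial_state(model: Option<String>) -> AppState {
    AppState {
        model_name: "smooth-coding".to_string(), // assumed display label
        model_override: model,
    }
}

/// Mirrors the described change in `run_agent_streaming`: forward the
/// override where the old code passed the literal `None`.
fn run_agent_streaming(state: &AppState, client: &mut FakeClient, prompt: &str) {
    client.run_task(prompt, state.model_override.clone());
}
```

Keeping `model_override` separate from `model_name` means a missing flag still reaches the client as `None`, which matches the backwards-compatible behavior described for the `run()` callsite.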

changeset-bot · 2026-06-04T00:18:57Z

⚠️ No Changeset found

Latest commit: 50153d4

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


Strong counter-bias against the agent pivoting to "fix the failing tests" on tasks that have nothing to do with tests.

Standalone defensible improvement (the rule itself is unambiguous), but empirically NOT sufficient on its own to stop the pivot:

smooth on cleanup-pycache-debris @ deepseek-v4-flash + Hard Rule #0: "I need to fix the failing tests. Since no test files were found, I will create a test file tests/test_util.py … fixed the failing tests. Test Results: 1 passed, 0 failed." Score: 0.500. Agent literally created and ran an invented test.

This means the test-fix orientation is multi-layer:

- The fixer.txt system prompt is heavy on test guidance (still correct for actual test-fix tasks, just badly weighted)
- coding_workflow.rs's CODING phase may enforce a structure that expects tests
- The model has been trained to pattern-match "Test Results" markers as a goal state

Hard Rule #0 stays as one piece of the fix. The deeper surgery is filed as a new pearl: needs either a general-purpose agent persona that doesn't go through coding_workflow.rs's test-centric path, or a workflow-level branch on "is this actually a test task?" classified from the user's first message.
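As a rough illustration of the second option above (a workflow-level branch classified from the user's first message), a keyword heuristic could look like the sketch below. Everything here is hypothetical: `TaskKind`, `classify_first_message`, and the hint list are not part of smooth, and a real classifier would likely need something more robust than substring matching.

```rust
/// Hypothetical task categories for a workflow-level branch.
#[derive(Debug, PartialEq, Eq)]
enum TaskKind {
    /// The user explicitly asked about tests.
    TestFix,
    /// Anything else, e.g. cleanup or refactoring tasks.
    General,
}

/// Hypothetical classifier: treat the task as test-related only when the
/// user's first message mentions tests explicitly.
/// ASSUMPTION: the hint list below is illustrative, not exhaustive.
fn classify_first_message(message: &str) -> TaskKind {
    const TEST_HINTS: [&str; 5] = [
        "failing test",
        "fix the test",
        "test suite",
        "pytest",
        "cargo test",
    ];

    let lower = message.to_lowercase();
    if TEST_HINTS.iter().any(|hint| lower.contains(hint)) {
        TaskKind::TestFix
    } else {
        TaskKind::General
    }
}
```

With a branch like this, a cleanup request that never mentions tests would be routed away from the test-centric path rather than relying on the prompt alone to suppress the pivot.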

brentrager added 4 commits June 3, 2026 19:18

pearl: close th-20574a

10929a1

brentrager wants to merge 5 commits into th-491e0c-pi-driver from th-20574a-model-flag-routes
