th code: --model flag now routes through to TaskStart (was silently dropped on TUI path) #87
Open
brentrager wants to merge 5 commits into
Root-caused the "Big Smooth crashes unprompted" symptom: it wasn't
crashing — it was gracefully shutting down on a 30-minute idle
timer (server.rs:600+). Bench evidence today showed the 30-minute
cliff fired after almost every /loop pause (1800s wakeup intervals),
killing the daemon mid-session and forcing 3+ manual `th up`s.
Pi + OpenCode have no daemon → no auto-shutdown → no "crashed
unprompted" symptom. Smooth's daemon model meant every loop pause
was implicitly a kill. This is exactly the kind of competitive-
parity gap the bench was designed to surface (user direction
2026-06-03: "i want smooth to learn from pi and opencode and make
smooth competitive").
24h default keeps a safety net for genuinely forgotten dev
sessions but doesn't fire during a single work session. The
existing `SMOOTH_BIGSMOOTH_IDLE_TIMEOUT_SECS` env override still
works (set to 0 to disable, or to a smaller value to opt back in
to aggressive timeouts). Caveat: the variable must be set in the
daemon's own process environment, which in sandboxed mode is the
safehouse VM, not the host shell.
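A minimal sketch of the override semantics described above (unset means the 24h default, 0 disables the timer, any other value is the timeout in seconds). The function names and the fallback for unparsable values are assumptions for illustration, not the actual code in server.rs.

```rust
use std::time::Duration;

/// Default idle window: 24 hours, per the commit message.
const DEFAULT_IDLE_TIMEOUT_SECS: u64 = 24 * 60 * 60;

/// Resolve the idle timeout from the raw value of
/// SMOOTH_BIGSMOOTH_IDLE_TIMEOUT_SECS (`None` when the variable is unset).
/// Returns `None` when the idle timer is disabled.
fn resolve_idle_timeout(raw: Option<&str>) -> Option<Duration> {
    let secs = match raw.map(str::trim) {
        // Assumption: an unparsable value falls back to the default.
        Some(v) => v.parse::<u64>().unwrap_or(DEFAULT_IDLE_TIMEOUT_SECS),
        None => DEFAULT_IDLE_TIMEOUT_SECS,
    };
    // 0 means "never shut down on idle".
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}

/// Hypothetical wrapper that reads the daemon's own process environment.
#[allow(dead_code)]
fn idle_timeout_from_env() -> Option<Duration> {
    let raw = std::env::var("SMOOTH_BIGSMOOTH_IDLE_TIMEOUT_SECS").ok();
    resolve_idle_timeout(raw.as_deref())
}
```

Because the lookup happens inside the daemon process, exporting the variable in the host shell has no effect in sandboxed mode unless it is also passed into the safehouse VM.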
Smoking-gun log line that closed the diagnosis:
2026-06-03T21:07:17 INFO smooth_bigsmooth::server:
Idle timeout reached (1800s), shutting down
Required rebuild path: scripts/build-safehouse.sh + cp the new
binary to ~/.smooth/runner-bin/safehouse, then th down + th up.
The shadow-bin mechanism (smooth-cli/src/main.rs:1292) bind-mounts
this over the OCI image's safehouse binary so dev iteration on
crates/smooth-bigsmooth doesn't need a full image push.
Documents the existing `th up direct` mode that we'd been overlooking:
boots in ~0.3s on the host instead of the ~30s safehouse-microVM
startup. That meets or beats pi/opencode boot time (they're both ~3s).
Bench evidence from today:
smooth-direct : 0.850 aggregate, ~0.3s boot
smooth-sandboxed : 0.789 aggregate, ~30s boot (with variance)
pi : 1.000 aggregate, ~3s boot
opencode : >=0.93 aggregate, ~3s boot
Direct mode trade-off is no isolation — the agent runs as a host
subprocess against the host filesystem. Fine for dev machines + CI
runners you own + bench harnesses. Sandboxed remains the default
for untrusted dispatch.
Required setup for direct mode (the runner-discovery error message
already tells you, but worth surfacing in docs):
cargo build --release -p smooai-smooth-operator-runner
SMOOTH_OPERATOR_RUNNER_NATIVE=~/.cargo/shared-target/release/smooth-operator-runner th up direct
Pearl follow-ups still open after this:
th-6e361d — pycache run-to-run variance (direct still showed 0.500
on disk-bloat in one run; smooth's nondeterminism isn't
purely a sandbox artifact)
th-e74aa6 — runner-discovery UX paper-cut (separate from this work)
The TUI path silently dropped `th code --model <X>`. `cmd_code`
parsed the flag into a local `model: Option<String>`, then called
`run_with_session(working_dir, resumed_session, agent)` — no model.
Every TaskStart fell back to the smooth-coding alias regardless of
what the user asked for.
Bench evidence: the harness passed `--model deepseek-v4-flash`,
the dispatch log read `model=smooth-fast-gemini`. Pi + OpenCode
were verified at deepseek-v4-flash on the same fixtures, so the
comparison wasn't apples-to-apples on model.
Fix:
- Add `model_override: Option<String>` to AppState (separate from
`model_name` which is display-only).
- Thread `model: Option<String>` through `run_with_session`,
setting `initial_state.model_override` when supplied.
- In `run_agent_streaming`, read `state.model_override` and pass
it to `client.run_task(...)` instead of the literal `None`.
- `run()` callsite gets `None` for backwards compatibility (no
--model means Big Smooth picks the default, same as before).
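The steps above can be sketched roughly as follows. These are simplified stand-ins: the real `AppState`, `run_with_session`, and `run_agent_streaming` carry more fields and arguments, and the helper names `initial_state_for_session` and `model_arg_for_task` are hypothetical, introduced only to isolate the data flow.

```rust
/// Simplified stand-in for the TUI's AppState; the real struct has more fields.
#[derive(Default, Debug)]
#[allow(dead_code)]
struct AppState {
    /// Display-only label shown in the UI.
    model_name: String,
    /// Model requested via `--model`; `None` means Big Smooth picks the default.
    model_override: Option<String>,
}

/// Stand-in for the part of `run_with_session` that seeds the initial state.
/// The real function also takes working_dir, resumed_session, and agent.
fn initial_state_for_session(model: Option<String>) -> AppState {
    let mut initial_state = AppState::default();
    if let Some(m) = model {
        initial_state.model_override = Some(m);
    }
    initial_state
}

/// Stand-in for the value `run_agent_streaming` now passes to
/// `client.run_task(...)` in place of the literal `None`.
fn model_arg_for_task(state: &AppState) -> Option<&str> {
    state.model_override.as_deref()
}
```

Keeping `model_override` separate from `model_name` means the display label can change without affecting which model is actually dispatched.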
Confirmed in the daemon log after the fix:
direct dispatch: overrode routing.coding.model from opts.model
model=deepseek-v4-flash
That's the line that wasn't appearing before because routing always
got `opts.model = None`.
This UNBLOCKS th-6e361d (pycache variance) — now we can measure
smooth's agent behavior at the same model pi + opencode use,
isolating whether remaining variance is the model or the prompt /
workflow. First post-fix smooth-deepseek run on pycache still
scored 0.500 (agent confabulated test-fix narrative), so the
remaining gap is now clearly the fixer system prompt (th-e5a0e5
needs deeper restructure — the end-of-file addendum isn't enough).
Strong counter-bias against the agent pivoting to "fix the failing
tests" on tasks that have nothing to do with tests. Standalone
defensible improvement (the rule itself is unambiguous), but
empirically NOT sufficient on its own to stop the pivot:
smooth on cleanup-pycache-debris @ deepseek-v4-flash + Hard Rule #0:
"I need to fix the failing tests. Since no test files were found,
I will create a test file tests/test_util.py … fixed the failing
tests. Test Results: 1 passed, 0 failed."
Score: 0.500. Agent literally created and ran an invented test.
This means the test-fix orientation is multi-layer:
- The fixer.txt system prompt is heavy on test guidance (still
correct for actual test-fix tasks, just badly weighted)
- coding_workflow.rs's CODING phase may enforce a structure that
expects tests
- The model has been trained to pattern-match "Test Results"
markers as a goal state
Hard Rule #0 stays as one piece of the fix. The deeper surgery is
filed as a new pearl: needs either a general-purpose agent persona
that doesn't go through coding_workflow.rs's test-centric path,
or a workflow-level branch on "is this actually a test task?"
classified from the user's first message.
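One possible shape for the workflow-level branch, as a sketch only. The enum, the function name, and the keyword list are all hypothetical; the pearl does not specify a design, and a real classifier might be model-driven rather than keyword-based.

```rust
/// Hypothetical branch selector for the coding workflow.
#[derive(Debug, PartialEq)]
enum WorkflowBranch {
    /// Task is actually about fixing tests: keep the test-centric path.
    TestFix,
    /// Anything else: skip the test-centric structure.
    General,
}

/// Classify the user's first message with a naive keyword check.
/// Assumption: these phrases are illustrative, not a vetted list.
fn classify_first_message(msg: &str) -> WorkflowBranch {
    const TEST_HINTS: [&str; 4] = [
        "failing test",
        "fix the test",
        "test failure",
        "tests fail",
    ];
    let lower = msg.to_lowercase();
    if TEST_HINTS.iter().any(|hint| lower.contains(hint)) {
        WorkflowBranch::TestFix
    } else {
        WorkflowBranch::General
    }
}
```

Under this sketch a cleanup request that never mentions tests would route to `General`, so the CODING phase would have no reason to expect a "Test Results" section.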
Closes pearl th-20574a. See commit message for the full diagnosis. Smoking gun was that the daemon log read
`model=smooth-fast-gemini` when bench passed `--model deepseek-v4-flash`. Now reads `overrode routing.coding.model from opts.model model=deepseek-v4-flash` as expected.