[codex] Discourage flow overuse on small tasks #14
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f5a10a9353
    if (timer) clearTimeout(timer);
    if (buffer.trim()) processLine(buffer);
    if (stderr.trim()) state.stderr = stderr.trim();
    resolveExit(code ?? 0);
Treat killed selection evals as failures
When a case times out and this timer kills pi with SIGTERM/SIGKILL, Node emits the `close` event with `code === null` and passes the signal as a separate argument, so `code ?? 0` turns the timeout into a successful exit. If the model already emitted a matching final answer before hanging, or if a future selection case omits `answerPattern`, the harness can mark a timed-out run as passing instead of surfacing the infrastructure failure.
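One possible shape for the fix, as a minimal sketch rather than the actual patch: map the `(code, signal)` pair from the `close` event to a single exit code that is never zero when the process was killed. The helper name `exitCodeFromClose` and the choice of `1` as the failure code are assumptions made for illustration.

```javascript
// Hypothetical helper (not part of the PR): derive one exit code from the
// (code, signal) arguments that Node passes to a child process 'close' handler.
function exitCodeFromClose(code, signal) {
  // A non-null signal means the process was terminated (for example by the
  // timeout timer), so report failure regardless of `code`.
  if (signal) return 1;
  // Normal exit: pass the real exit code through unchanged.
  if (typeof code === "number") return code;
  // Neither value is usable: fail closed instead of assuming success.
  return 1;
}
```

Under these assumptions the handler would call `resolveExit(exitCodeFromClose(code, signal))` instead of `resolveExit(code ?? 0)`. Recording the signal name alongside the code would also let the harness report a timeout as an infrastructure failure rather than a wrong answer.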
Summary
- Tightens the model-facing `flow` tool guidance so parent models reserve pi-flow for substantial delegated work instead of using it as the default path for tiny tasks.
- Adds a dedicated `npm run eval:select` harness that loads the extension into headless pi and scores whether the parent model actually calls `flow`. The new cases cover two no-flow small-task controls and one explicit-flow positive control.

Why
The existing evals prove flow behavior once invoked, but they do not test invocation discipline. Small tasks can be cheaper and clearer in the parent context, so overuse needs its own regression signal.
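To make the idea of an invocation-discipline check concrete, here is a rough sketch of how one selection case could be scored. The case shape (`name`, `expectFlow`) and the function `scoreSelection` are hypothetical and are not taken from the harness in this PR. Only the tool name `flow` comes from the description above.

```javascript
// Hypothetical scoring sketch: a case passes when the parent model's use of
// the `flow` tool matches what the case expects.
function scoreSelection(testCase, toolCalls) {
  // `toolCalls` is assumed to be the list of tool names the parent model invoked.
  const calledFlow = toolCalls.includes("flow");
  return {
    name: testCase.name,
    calledFlow,
    passed: calledFlow === testCase.expectFlow,
  };
}

// Illustrative cases mirroring the mix described above: two no-flow controls
// and one explicit-flow positive control. Names are made up.
const exampleCases = [
  { name: "small-task-a", expectFlow: false },
  { name: "small-task-b", expectFlow: false },
  { name: "explicit-flow", expectFlow: true },
];
```

A small task that triggers `flow` anyway would fail its case under this scoring, which is the regression signal the paragraph above asks for.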
Validation
- `npm run eval:select -- --dry-run`: 3/3 passed
- `npm run eval:select`: 3/3 passed live
- `npm run eval -- --dry-run`: 7/7 behavior checks passed, 2 hard cases score-tracked, 3 canaries
- `npm run eval:compare -- --dry-run`: passed
- `npm test`: 87/87 passed
- `npm ci`: passed; npm reported one high-severity dependency advisory
- `npm run check`: passed