[codex] Discourage flow overuse on small tasks #14

Merged: justintime109 merged 1 commit into main from codex/flow-selection-eval on Jun 13, 2026
@justintime109

Contributor

Summary

Tightens the model-facing flow tool guidance so parent models reserve pi-flow for substantial delegated work instead of using it as the default path for tiny tasks.

Adds a dedicated npm run eval:select harness that loads the extension into headless pi and scores whether the parent model actually calls flow. The new cases cover two no-flow small-task controls and one explicit-flow positive control.
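As a rough illustration of what scoring "whether the parent model actually calls flow" could look like, here is a minimal sketch. It is not the code in evals/select.mjs: the case and run shapes, the `expectFlow` field, the `scoreSelectionCase` name, and the tool name `"flow"` are all assumptions made for this example. Only `answerPattern` is a field name taken from this PR's discussion.

```javascript
// Hypothetical case shape: { name, expectFlow, answerPattern? }
// Hypothetical run shape:  { toolCalls: string[], finalAnswer: string, exitCode: number }
function scoreSelectionCase(testCase, run) {
  // Assumption: the flow tool shows up in the run's tool calls under the name "flow".
  const flowCalled = run.toolCalls.includes("flow");

  // The selection check: did the model call flow exactly when the case expects it?
  const selectionOk = flowCalled === testCase.expectFlow;

  // Optional answer check. Use a non-global RegExp so test() carries no lastIndex state.
  const answerOk = testCase.answerPattern
    ? testCase.answerPattern.test(run.finalAnswer)
    : true;

  // A run that did not exit cleanly should never count as a pass.
  const exitOk = run.exitCode === 0;

  return {
    name: testCase.name,
    flowCalled,
    pass: selectionOk && answerOk && exitOk,
  };
}

// Illustrative cases mirroring the PR's mix: two no-flow controls, one positive control.
const exampleCases = [
  { name: "small-edit", expectFlow: false, answerPattern: /done/i },
  { name: "quick-question", expectFlow: false },
  { name: "explicit-flow", expectFlow: true },
];
```

Including the exit status in the pass condition keeps a selection match from hiding an infrastructure failure.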

Why

The existing evals prove flow behavior once invoked, but they do not test invocation discipline. Small tasks can be cheaper and clearer in the parent context, so overuse needs its own regression signal.

Validation

  • npm run eval:select -- --dry-run — 3/3 passed
  • npm run eval:select — 3/3 passed live
  • npm run eval -- --dry-run — 7/7 behavior checks passed, 2 hard cases score-tracked, 3 canaries
  • npm run eval:compare -- --dry-run — passed
  • npm test — 87/87 passed
  • npm ci — passed; npm reported one high-severity dependency advisory
  • npm run check — passed

@justintime109 justintime109 marked this pull request as ready for review June 13, 2026 19:05
@justintime109 justintime109 merged commit 10890b4 into main Jun 13, 2026
3 checks passed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f5a10a9353

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread evals/select.mjs
if (timer) clearTimeout(timer);
if (buffer.trim()) processLine(buffer);
if (stderr.trim()) state.stderr = stderr.trim();
resolveExit(code ?? 0);

P2: Treat killed selection evals as failures

When a case times out and this timer kills pi with SIGTERM/SIGKILL, Node reports close with code === null and the signal separately, so code ?? 0 turns the timeout into a successful exit. If the model already emitted a matching final answer before hanging, or if a future selection case omits answerPattern, the harness can mark a timed-out run as passing instead of surfacing the infrastructure failure.
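One possible shape for a fix, offered as a sketch and not as the change the repository adopted: Node's child process `close` event passes both `code` and `signal`, so the handler can map a signal-terminated run to a non-zero status instead of defaulting to 0. The helper name `exitStatusFromClose` and the 128 + signal-number encoding (a common shell convention) are assumptions for illustration.

```javascript
// Signal numbers for the two signals the review comment mentions (Linux values).
const SIGNAL_NUMBERS = { SIGTERM: 15, SIGKILL: 9 };

// Hypothetical helper: collapse the (code, signal) pair from a 'close' event
// into one exit status that is non-zero whenever the process was killed.
function exitStatusFromClose(code, signal) {
  // A normal exit reports a numeric code; pass it through unchanged.
  if (typeof code === "number") return code;

  // code === null means the process was terminated by a signal.
  const signalNumber = SIGNAL_NUMBERS[signal];

  // Known signal: shell-style 128 + n. Anything else: a generic failure status.
  return signalNumber ? 128 + signalNumber : 1;
}
```

In the handler quoted above, this would replace `resolveExit(code ?? 0)` with something like `resolveExit(exitStatusFromClose(code, signal))`, assuming the `close` listener is changed to accept the `signal` argument as well.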

Useful? React with 👍 / 👎.
