
Add model-router skill for cost-aware LLM selection #268

Draft
juanmichelini wants to merge 1 commit into main from add-model-router-skill



@juanmichelini (Collaborator)

Summary

Adds a small new skill, model-router, that recommends the most cost-efficient LLM for a given task category, using benchmark data from the public OpenHands Index.

Inspired by a user's observation that "Gemini is better for research cost-wise, Opus 4.7 for planning/analysis, DeepSeek Reasoner for heavy repetitive lifting, etc." This skill encodes those tradeoffs as a reusable lookup table that the agent can consult when picking a model, configuring a sub-agent, or delegating a cloud conversation.

What the skill provides

For each OpenHands Index category, it surfaces a cost pick (best score-per-dollar on the Pareto frontier), a balanced pick, and a premium pick:

| Task type | Cost pick | Balanced | Premium |
|---|---|---|---|
| Research / Information gathering | Gemini-3.1-Pro ($0.12, 76.4) | claude-opus-4-6 ($0.44, 80.0) | GPT-5.5 ($0.74, 86.1) |
| Bug fixing / Issue resolution | Minimax-2.7 ($0.18, 75.6) | claude-opus-4-6 ($0.77, 76.8) | GPT-5.5 ($1.52, 78.2) |
| Planning / Greenfield | GPT-5.4 ($4.04, 56.2) | claude-opus-4-7 ($5.69, 56.2) | claude-opus-4-7 ($5.69, 56.2) |
| Frontend / UI | Gemini-3.1-Pro ($1.24, 44.1) | claude-opus-4-6 ($2.37, 41.8) | claude-opus-4-7 ($2.83, 48.5) |
| Testing | Minimax-2.7 ($0.13, 69.1) | claude-opus-4-6 ($0.43, 78.8) | GPT-5.5 ($0.92, 83.4) |
| Bulk repetitive work | DeepSeek-V3.2-Reasoner ($0.57) | Minimax-2.7 ($0.13-0.18) | n/a (escalate) |

Each cell gives the average cost per problem (USD) and the aggregate score from index.openhands.dev, as of the May 2026 snapshot; the bulk repetitive work row lists cost only.
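To make the lookup concrete, here is a minimal Python sketch that transcribes the table above into a dictionary. The category keys (`research`, `bug_fixing`, ...), the `recommend` and `escalate` helpers, and the `PIPELINE` list are hypothetical names for illustration, not part of the skill, which is a Markdown reference document. `escalate` only moves one tier up; the skill's actual heuristics for *when* to escalate are prose in SKILL.md and are not reproduced here.

```python
from typing import NamedTuple, Optional


class Pick(NamedTuple):
    model: str
    cost_usd: Optional[float]  # average cost per problem (USD); None if no single value
    score: Optional[float]  # aggregate index score; None if the table gives none


TIERS = ("cost", "balanced", "premium")

# Transcribed from the table (May 2026 snapshot). Category keys are
# hypothetical identifiers for this sketch, not names used by the skill.
ROUTING_TABLE: dict[str, dict[str, Optional[Pick]]] = {
    "research": {
        "cost": Pick("Gemini-3.1-Pro", 0.12, 76.4),
        "balanced": Pick("claude-opus-4-6", 0.44, 80.0),
        "premium": Pick("GPT-5.5", 0.74, 86.1),
    },
    "bug_fixing": {
        "cost": Pick("Minimax-2.7", 0.18, 75.6),
        "balanced": Pick("claude-opus-4-6", 0.77, 76.8),
        "premium": Pick("GPT-5.5", 1.52, 78.2),
    },
    "planning": {
        "cost": Pick("GPT-5.4", 4.04, 56.2),
        "balanced": Pick("claude-opus-4-7", 5.69, 56.2),
        "premium": Pick("claude-opus-4-7", 5.69, 56.2),
    },
    "frontend": {
        "cost": Pick("Gemini-3.1-Pro", 1.24, 44.1),
        "balanced": Pick("claude-opus-4-6", 2.37, 41.8),
        "premium": Pick("claude-opus-4-7", 2.83, 48.5),
    },
    "testing": {
        "cost": Pick("Minimax-2.7", 0.13, 69.1),
        "balanced": Pick("claude-opus-4-6", 0.43, 78.8),
        "premium": Pick("GPT-5.5", 0.92, 83.4),
    },
    "bulk": {
        "cost": Pick("DeepSeek-V3.2-Reasoner", 0.57, None),
        "balanced": Pick("Minimax-2.7", None, None),  # table gives a $0.13-0.18 range
        "premium": None,  # "n/a (escalate)" in the table
    },
}


def recommend(category: str, tier: str = "cost") -> Optional[Pick]:
    """Return the pick for (category, tier); None means the table says to escalate."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return ROUTING_TABLE[category][tier]


def escalate(tier: str) -> str:
    """Next tier up (cost -> balanced -> premium); premium stays premium."""
    index = TIERS.index(tier)
    return TIERS[min(index + 1, len(TIERS) - 1)]


# The example mixed pipeline as (stage, category, tier). Mapping the
# "implement" stage to the bug-fixing row is an assumption of this sketch.
PIPELINE = [
    ("research", "research", "cost"),
    ("plan", "planning", "balanced"),
    ("implement", "bug_fixing", "cost"),
    ("test", "testing", "balanced"),
    ("cleanup", "bulk", "cost"),
]

if __name__ == "__main__":
    for stage, category, tier in PIPELINE:
        pick = recommend(category, tier)
        print(f"{stage:<10} {pick.model if pick else 'escalate'}")
```

Returning `None` for the bulk premium slot, instead of silently falling back to a cheaper tier, keeps the table's "n/a (escalate)" visible to the caller.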

The skill also includes:

  • Heuristics for when to escalate from cost pick to balanced/premium.
  • An example mixed pipeline (research with Gemini, plan with Opus, implement with Minimax/Opus, test with Opus, repetitive cleanup with DeepSeek).
  • Caveats about Pareto frontier vs. headline score, benchmark age, one-shot vs. multi-turn, open vs. closed weights, and greenfield being expensive everywhere.
  • Direct links back to each category page on https://index.openhands.dev so users can verify or re-derive the picks.
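The Pareto-frontier vs. headline-score caveat can be illustrated with a short Python sketch. It assumes "best score-per-dollar on the Pareto frontier" means: drop every entry that another entry beats on both cost and score, then take the highest score/cost ratio among the rest. That interpretation, and all function names here, are assumptions of the sketch rather than the index's published method.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    cost_usd: float  # average cost per problem (USD)
    score: float  # aggregate index score


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is no more expensive and no worse than b, and strictly better on one axis."""
    no_worse = a.cost_usd <= b.cost_usd and a.score >= b.score
    strictly_better = a.cost_usd < b.cost_usd or a.score > b.score
    return no_worse and strictly_better


def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Entries that no other entry dominates, cheapest first."""
    frontier = [e for e in entries if not any(dominates(o, e) for o in entries)]
    return sorted(frontier, key=lambda e: e.cost_usd)


def cost_pick(entries: list[Entry]) -> Entry:
    """Frontier entry with the highest score per dollar."""
    return max(pareto_frontier(entries), key=lambda e: e.score / e.cost_usd)


def headline_pick(entries: list[Entry]) -> Entry:
    """Entry with the highest raw score, ignoring cost."""
    return max(entries, key=lambda e: e.score)


# The three Frontend / UI entries from the table above.
FRONTEND = [
    Entry("Gemini-3.1-Pro", 1.24, 44.1),
    Entry("claude-opus-4-6", 2.37, 41.8),
    Entry("claude-opus-4-7", 2.83, 48.5),
]

if __name__ == "__main__":
    print("frontier:", [e.model for e in pareto_frontier(FRONTEND)])
    print("cost pick:", cost_pick(FRONTEND).model)
    print("headline pick:", headline_pick(FRONTEND).model)
```

Restricting to the frontier does not change which entry has the highest score/cost ratio, since a dominating entry is at least as cheap and scores at least as high; it mainly makes the tradeoff curve visible. With only these three points, claude-opus-4-6 is dominated by Gemini-3.1-Pro, while the headline pick (claude-opus-4-7) differs from the cost pick. The index has many more entries, so treat this as an illustration of the calculation rather than a re-derivation of the table.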

Files

  • skills/model-router/SKILL.md - skill body with progressive-disclosure description, decision table, and heuristics.
  • skills/model-router/README.md - human-facing notes.
  • marketplaces/openhands-extensions.json - new catalog entry under productivity.
  • README.md - auto-regenerated catalog section (via scripts/sync_extensions.py).

Validation

  • python scripts/sync_extensions.py --check -> All extensions in sync. ✓
  • pytest tests/test_catalogs.py tests/test_skills_have_readme.py tests/test_sync_extensions.py -> 38 passed.
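As a rough illustration of the kind of layout check that the test name `test_skills_have_readme.py` implies, here is a hypothetical helper, `missing_skill_files`. It assumes each skill is a directory directly under `skills/` that should contain `SKILL.md` and `README.md` (the file pair this PR adds for model-router). The repository's real test may check different or additional things.

```python
from pathlib import Path

# Hypothetical requirement set, based on the two files this PR adds per skill.
REQUIRED_FILES = ("SKILL.md", "README.md")


def missing_skill_files(
    skills_dir: Path, required: tuple[str, ...] = REQUIRED_FILES
) -> dict[str, list[str]]:
    """Map each skill directory name to the required files it lacks."""
    missing: dict[str, list[str]] = {}
    for skill in sorted(p for p in skills_dir.iterdir() if p.is_dir()):
        absent = [name for name in required if not (skill / name).is_file()]
        if absent:
            missing[skill.name] = absent
    return missing


if __name__ == "__main__":
    # Run from the repository root; an empty dict means every skill is complete.
    skills = Path("skills")
    if skills.is_dir():
        print(missing_skill_files(skills))
```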

Triggers

Keyword triggers only (no slash command, since this is reference content rather than a workflow):

which model, model selection, pick a model, model router, cost efficient model, cheapest model, best model for.
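A minimal sketch of how keyword triggers like these could be matched against a user message, assuming case-insensitive substring matching. The hyphen-to-space normalization is an addition of this sketch, not something the PR states about OpenHands; `normalize` and `matches_trigger` are hypothetical names.

```python
# Keyword triggers listed in this PR.
TRIGGERS = (
    "which model", "model selection", "pick a model", "model router",
    "cost efficient model", "cheapest model", "best model for",
)


def normalize(text: str) -> str:
    """Lowercase, treat hyphens as spaces, and collapse runs of whitespace."""
    return " ".join(text.lower().replace("-", " ").split())


def matches_trigger(message: str, triggers: tuple[str, ...] = TRIGGERS) -> bool:
    """True if any trigger phrase appears in the normalized message."""
    text = normalize(message)
    return any(normalize(t) in text for t in triggers)


if __name__ == "__main__":
    print(matches_trigger("Which model should I use for research?"))
```

Without the hyphen normalization, a strict substring matcher would not fire on "cost-efficient model", because the trigger is written as "cost efficient model".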


This pull request was created by an AI agent (OpenHands) on behalf of the user.


Adds a small skill that maps task categories (research, bug fixing,
planning, frontend, testing, bulk repetitive work) to the most
cost-efficient LLM according to the public OpenHands Index benchmark
(https://index.openhands.dev).

For each category the skill recommends a cost pick (best score-per-
dollar on the Pareto frontier), a balanced pick, and a premium pick,
along with usage heuristics and links to the per-category leaderboard.

Co-authored-by: openhands <openhands@all-hands.dev>