feat: add compare-agents skill for cross-framework evaluation #13
Introduces a new skill that uses evaluatorq (from orqkit) to compare agents across frameworks (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK) head-to-head on the same dataset with LLM-as-a-judge scoring. The skill follows repo conventions (role statement, constraints, companion skills, workflow checklist, resources directory) and delegates dataset/evaluator creation to companion skills instead of duplicating them. Supports both Python and TypeScript.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Baukebrenninkmeijer left a comment
PR Review: compare-agents skill
Validated all code, API references, and documentation links against orq.ai docs and Claude Code skill best practices. Overall this is a well-structured skill with accurate API references.
✅ Docs & API Validation (all passed)
| Area | Status |
|---|---|
| `agents.responses.create()` endpoint | ✅ Matches orq.ai API docs |
| A2A message format (`parts` array) | ✅ Correct |
| `invoke()` vs `responses.create()` distinction | ✅ Correctly documented |
| evaluatorq imports & function signatures | ✅ Match official tutorial |
| Package names (`evaluatorq`, `@orq-ai/evaluatorq`) | ✅ Correct |
| All 5 documentation links | ✅ Valid and resolving |
Issues Found
🔴 Critical
1. Missing `import asyncio` in Python job patterns — `resources/job-patterns.md`
The LangGraph (line 77), CrewAI (line 109), and Generic Agent (line 218) patterns all use `asyncio.to_thread()` but never show the `import asyncio` statement. Copy-pasting these patterns will produce `NameError: name 'asyncio' is not defined`.
Suggestion: Add a "Required Imports" section at the top of `job-patterns.md` (a runnable sketch of the pattern follows these critical issues):

```python
import asyncio
from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult
```

2. Reference to nonexistent `setup-observability` companion skill — SKILL.md line 35
- `setup-observability` — instrument agents for tracing
No `skills/setup-observability/` directory exists in the repo. The agent will try to delegate to this skill and fail. The reference should be removed or marked as "(planned)".
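To make issue 1 concrete, here is a self-contained sketch of the `asyncio.to_thread()` pattern with the import in place. `call_my_agent` is a hypothetical stand-in for the blocking framework call (LangGraph, CrewAI, etc.) that the real job patterns wrap inside an evaluatorq job:

```python
import asyncio


def call_my_agent(prompt: str) -> str:
    """Hypothetical stand-in for a blocking framework SDK call."""
    return f"echo: {prompt}"


async def my_agent_job(prompt: str) -> str:
    # asyncio.to_thread runs the blocking call in a worker thread so the
    # event loop stays free; without `import asyncio` at the top of the
    # file, this is exactly the line that raises NameError.
    return await asyncio.to_thread(call_my_agent, prompt)


if __name__ == "__main__":
    print(asyncio.run(my_agent_job("hello")))
```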
🟡 Important
3. `DatasetIdInput` shown in Python signature but undocumented — `resources/evaluatorq-api.md` line 117
The Python function signature includes `data: DatasetIdInput | Sequence[DataPoint]`, but no Python example shows how to use `DatasetIdInput`. The note on line 137 implies only TypeScript supports `{ datasetId: "..." }`, creating a contradiction. If Python supports it too, add an example. If not, remove it from the signature.
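If Python does support it, the missing example might look roughly like the sketch below. This is illustration only: the positional experiment name and the `dataset_id` key are assumptions, not verified evaluatorq API:

```python
import asyncio

from evaluatorq import evaluatorq  # import shape as documented in the skill


async def main() -> None:
    # Hypothetical: reference an existing dataset by id instead of passing
    # inline DataPoints; the TypeScript note shows { datasetId: "..." }.
    await evaluatorq("compare-agents", data={"dataset_id": "dataset_123"})


asyncio.run(main())
```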
4. `pass: false` gotcha references undocumented field — `resources/gotchas.md` line 90
> TypeScript evaluatorq exits with code 1 when any evaluator returns `pass: false`.

But the evaluator scorer examples only show `EvaluationResult` with `value` (a number) and `explanation`. The relationship between the numeric `value` and the boolean `pass` is never explained. This will confuse users trying to understand CI/CD exit behavior.
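The missing explanation might be as simple as a threshold mapping, sketched below. Only `value` and `explanation` are confirmed fields; whether evaluatorq derives `pass` from `value` this way, and what the field is called in Python, is exactly what remains undocumented:

```python
from evaluatorq import EvaluationResult  # import shape as documented

PASS_THRESHOLD = 0.7  # assumed policy, not a documented evaluatorq default


def grade(score: float) -> EvaluationResult:
    # Confirmed fields: a numeric value plus a textual explanation.
    return EvaluationResult(
        value=score,
        explanation=f"score {score:.2f} vs threshold {PASS_THRESHOLD}",
        # Hypothetical: a boolean like the TypeScript `pass` field that
        # drives the exit-code-1 CI behavior described in gotchas.md.
    )
```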
💡 Suggestions
5. Test coverage gaps — `tests/skills.md`
- No scenario tests what happens when an agent invocation fails (bad key, missing env var, unreachable agent). The SKILL.md says "ALWAYS confirm each agent can be invoked independently" but this is untested.
- No multi-agent (3+) test scenario despite the skill explicitly supporting it.
- Phase 5 (run command, ORQ_API_KEY reminder, Experiment UI link) has zero test verification.
Strengths
- Excellent progressive disclosure: `SKILL.md` → `resources/` split keeps the main file focused
- Strong constraint documentation with "Why" rationale paragraphs
- The `invoke()` vs `responses.create()` gotcha is well cross-referenced across all 3 resource files
- Dataset bias prevention guidance is concrete with Wrong/Correct examples
- Test scenarios (5 scenarios) are substantially more thorough than other skills
- Clean delegation to companion skills avoids scope creep
- MCP tool naming follows repo convention (consistent with other skills)
PR Review: compare-agents skill
| Area | Status |
|---|---|
| `agents.responses.create()` endpoint | Matches orq.ai API docs |
| A2A message format (`parts` array) | Matches docs |
| `invoke()` vs `responses.create()` distinction | Correctly documented |
| evaluatorq imports & function signatures | Match official tutorial |
| Package names (`evaluatorq`, `@orq-ai/evaluatorq`) | Correct |
| All 5 documentation links in SKILL.md | Valid and resolving |
No factual mismatches found between the skill and the official docs.
Issues
1. Missing `import asyncio` in Python job patterns
File: `skills/compare-agents/resources/job-patterns.md` (LangGraph, CrewAI, Generic Agent patterns)
These patterns use `asyncio.to_thread()` but never import asyncio. Copy-pasting will produce a `NameError`.
2. `DatasetIdInput` documented in Python signature but no Python example
File: `skills/compare-agents/resources/evaluatorq-api.md`
The Python function signature includes `DatasetIdInput`, but the note on line 137 says only TypeScript supports `{ datasetId: "..." }`, implying Python doesn't. If Python also supports it, add an example. If not, clarify the signature.
3. `pass: false` CI/CD gotcha references unexplained field
File: `skills/compare-agents/resources/gotchas.md` (lines 89-90)
The gotcha says TypeScript exits with code 1 when any evaluator returns `pass: false`, but the evaluator examples only show a `value` field (a number). The relationship between `value` and `pass` is never explained.
Suggestions
4. Document that MCP tool short names map to fully qualified names
File: `skills/compare-agents/SKILL.md` (MCP tools table, lines 85-91)
The table uses short names (`search_entities`, `create_dataset`, etc.), which is consistent with the other skills in this repo. However, the actual tool names at runtime are fully qualified (e.g., `mcp__orq-remote-mcp__search_entities`). Consider adding a note above the table like:
> Tool names below are shortened for readability. At runtime they resolve via the `orq*` glob in `allowed-tools` (e.g., `search_entities` → `mcp__orq-remote-mcp__search_entities`).
This would help skill authors who use this skill as a reference to understand the naming convention.
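For example, the note could live next to the frontmatter that grants the tools. The snippet below is a hypothetical illustration of such a SKILL.md header, not the actual file, and the exact glob spelling is assumed:

```yaml
---
name: compare-agents
# Short names in the MCP tools table (search_entities, create_dataset, ...)
# resolve at runtime to fully qualified names such as
# mcp__orq-remote-mcp__search_entities via the allowed-tools glob below.
allowed-tools: mcp__orq-remote-mcp__*
---
```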
5. Missing test scenarios in `tests/skills.md`
- No error-handling scenario (missing API key, bad agent key, unreachable agent)
- No multi-agent (3+) test scenario despite the skill explicitly supporting it
- Phase 5 (run command, ORQ_API_KEY reminder, Experiment UI link) has zero test verification
Strengths
- Excellent progressive disclosure with `resources/` directory
- Strong constraint documentation with "Why" rationale
- The `invoke()` vs `responses.create()` gotcha is well cross-referenced across all 3 resource files
- Dataset bias prevention guidance is concrete with Wrong/Correct examples
- Test scenarios are more thorough than other skills (5 scenarios vs typical 1-2)
- Clean delegation to companion skills avoids scope creep
Summary
- `compare-agents` skill — runs head-to-head experiments comparing agents across frameworks (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK) using evaluatorq from orqkit
- Delegates dataset and evaluator creation to companion skills (`generate-synthetic-dataset`, `build-evaluator`) instead of duplicating content
- Supports Python (`evaluatorq`) and TypeScript (`@orq-ai/evaluatorq`) with framework-specific job patterns
- Adds test scenarios to `tests/skills.md`

Files
- `skills/compare-agents/SKILL.md` (197 lines)
- `skills/compare-agents/resources/job-patterns.md`
- `skills/compare-agents/resources/evaluatorq-api.md`
- `skills/compare-agents/resources/gotchas.md`
- `tests/skills.md`

Test plan
- … (`@orq-ai/evaluatorq`)

🤖 Generated with Claude Code