
feat: add compare-agents skill for cross-framework evaluation #13

Merged
arianpasquali merged 1 commit into main from orqkit-evaluatorq-agent-evaluation on Mar 27, 2026

Conversation

@arianpasquali

Summary

  • Adds compare-agents skill — runs head-to-head experiments comparing agents across frameworks (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK) using evaluatorq from orqkit
  • Follows repo conventions: role statement, constraints, companion skills, workflow checklist, resources directory
  • Delegates dataset/evaluator creation to companion skills (generate-synthetic-dataset, build-evaluator) instead of duplicating content
  • Covers both Python (evaluatorq) and TypeScript (@orq-ai/evaluatorq) with framework-specific job patterns
  • Adds 5 test scenarios to tests/skills.md

Files

| File | Purpose |
| --- | --- |
| skills/compare-agents/SKILL.md (197 lines) | Main skill — orchestrator with phased workflow |
| skills/compare-agents/resources/job-patterns.md | Framework-specific job patterns (Python + TS) |
| skills/compare-agents/resources/evaluatorq-api.md | evaluatorq/orqkit API reference |
| skills/compare-agents/resources/gotchas.md | Known issues and workarounds |
| tests/skills.md | 5 test scenarios for the skill |

Test plan

  • Verify skill triggers on "compare agents", "benchmark", "test agents"
  • Verify companion skill redirects work (dataset → generate-synthetic-dataset, evaluator → build-evaluator)
  • Verify Python job patterns generate valid evaluatorq scripts
  • Verify TypeScript job patterns use correct imports (@orq-ai/evaluatorq)
  • Verify dataset bias prevention (no mock-data-biased expected outputs)

🤖 Generated with Claude Code

Introduces a new skill that uses evaluatorq (from orqkit) to compare agents
across frameworks (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK)
head-to-head on the same dataset with LLM-as-a-judge scoring.

The skill follows repo conventions (role statement, constraints, companion skills,
workflow checklist, resources directory) and delegates dataset/evaluator creation
to companion skills instead of duplicating them. Supports both Python and TypeScript.
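
For readers who have not used evaluatorq before, the rough shape of such a head-to-head run is sketched below. This is illustrative only: the agent functions are placeholders, the EvaluationResult keyword arguments are assumed from the review comments further down, and the orchestrating evaluatorq() call is elided because its exact signature is documented in resources/evaluatorq-api.md rather than here.

```python
# Illustrative sketch only -- not the skill's actual job patterns.
from evaluatorq import EvaluationResult


async def langgraph_job(prompt: str) -> str:
    # Placeholder for a LangGraph agent; the real patterns in
    # resources/job-patterns.md wrap the framework's synchronous calls.
    return f"LangGraph answer to: {prompt}"


async def crewai_job(prompt: str) -> str:
    # Placeholder for a CrewAI agent invocation.
    return f"CrewAI answer to: {prompt}"


async def llm_judge(prompt: str, output: str) -> EvaluationResult:
    # Placeholder LLM-as-a-judge scorer; a real judge would call a model.
    return EvaluationResult(value=1.0, explanation="placeholder score")

# Both jobs and the judge would then be handed to evaluatorq() together with
# a shared dataset, so every agent is scored on the same inputs.
```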

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@Baukebrenninkmeijer left a comment


PR Review: compare-agents skill

Validated all code, API references, and documentation links against orq.ai docs and Claude Code skill best practices. Overall this is a well-structured skill with accurate API references.

✅ Docs & API Validation (all passed)

| Area | Status |
| --- | --- |
| agents.responses.create() endpoint | ✅ Matches orq.ai API docs |
| A2A message format (parts array) | ✅ Correct |
| invoke() vs responses.create() distinction | ✅ Correctly documented |
| evaluatorq imports & function signatures | ✅ Match official tutorial |
| Package names (evaluatorq, @orq-ai/evaluatorq) | ✅ Correct |
| All 5 documentation links | ✅ Valid and resolving |

Issues Found

🔴 Critical

1. Missing `import asyncio` in Python job patterns — resources/job-patterns.md

LangGraph (line 77), CrewAI (line 109), and Generic Agent (line 218) patterns all use asyncio.to_thread() but never show the import asyncio statement. Copy-pasting these patterns will produce NameError: name 'asyncio' is not defined.

Suggestion: Add a "Required Imports" section at the top of job-patterns.md:

```python
import asyncio
from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult
```
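
To make the failure mode and the fix concrete, here is a minimal, self-contained sketch; `blocking_agent_call` and `run_agent` are hypothetical stand-ins, not the actual patterns from job-patterns.md.

```python
import asyncio


def blocking_agent_call(prompt: str) -> str:
    # Hypothetical stand-in for a synchronous framework entry point such as
    # LangGraph's or CrewAI's invoke()-style calls.
    return f"answer to: {prompt}"


async def run_agent(prompt: str) -> str:
    # Without `import asyncio` at the top of the file, this line raises
    # NameError: name 'asyncio' is not defined -- the bug flagged above.
    return await asyncio.to_thread(blocking_agent_call, prompt)


if __name__ == "__main__":
    print(asyncio.run(run_agent("What does compare-agents do?")))
```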

2. Reference to nonexistent `setup-observability` companion skill — SKILL.md line 35

- `setup-observability` — instrument agents for tracing

No skills/setup-observability/ directory exists in the repo. The agent will try to delegate to this skill and fail. Should be removed or marked as (planned).

🟡 Important

3. `DatasetIdInput` shown in Python signature but undocumented — resources/evaluatorq-api.md line 117

The Python function signature includes `data: DatasetIdInput | Sequence[DataPoint]`, but no Python example shows how to use `DatasetIdInput`. The note on line 137 implies only TypeScript supports `{ datasetId: "..." }`, creating a contradiction. If Python supports it too, add an example. If not, remove it from the signature.

4. `pass: false` gotcha references undocumented field — resources/gotchas.md line 90

"TypeScript evaluatorq exits with code 1 when any evaluator returns pass: false."

But the evaluator scorer examples only show `EvaluationResult` with `value` (a number) and `explanation`. The relationship between the numeric `value` and the boolean `pass` is never explained. This will confuse users trying to understand CI/CD exit behavior.
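
For context, a hypothetical scorer of the shape the examples describe might look like the sketch below; the `EvaluationResult` keyword arguments are assumed from the field names above, and the Python package may behave differently from the TypeScript exit-code behavior the gotcha covers.

```python
from evaluatorq import EvaluationResult


async def exact_match_scorer(expected: str, actual: str) -> EvaluationResult:
    # Returns only a numeric value plus an explanation, mirroring the skill's
    # scorer examples. How (or whether) a boolean `pass` is derived from
    # `value` is exactly what the gotcha leaves unexplained for CI exit codes.
    matched = expected.strip() == actual.strip()
    return EvaluationResult(
        value=1.0 if matched else 0.0,
        explanation="exact match" if matched else "outputs differ",
    )
```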

💡 Suggestions

5. Test coverage gaps — tests/skills.md

  • No scenario tests what happens when an agent invocation fails (bad key, missing env var, unreachable agent). The SKILL.md says "ALWAYS confirm each agent can be invoked independently" but this is untested.
  • No multi-agent (3+) test scenario despite the skill explicitly supporting it.
  • Phase 5 (run command, ORQ_API_KEY reminder, Experiment UI link) has zero test verification.

Strengths

  • Excellent progressive disclosure: the SKILL.md → resources/ split keeps the main file focused
  • Strong constraint documentation with "Why" rationale paragraphs
  • The invoke() vs responses.create() gotcha is well cross-referenced across all 3 resource files
  • Dataset bias prevention guidance is concrete with Wrong/Correct examples
  • Test scenarios (5 scenarios) are substantially more thorough than other skills
  • Clean delegation to companion skills avoids scope creep
  • MCP tool naming follows repo convention (consistent with other skills)


@Baukebrenninkmeijer left a comment


Correction on issue #2 (setup-observability): This skill exists on PR #12 (feat/RES-545-instrument-app-skill), so the companion reference is valid — just a cross-PR dependency. Not a real issue. Apologies for the false positive.

@Baukebrenninkmeijer

PR Review: compare-agents Skill

Validated all code and endpoints against orq.ai docs and checked skill structure against repo conventions and Anthropic skill authoring best practices.

API & Docs Validation (all passed)

| Area | Status |
| --- | --- |
| agents.responses.create() endpoint | Matches orq.ai API docs |
| A2A message format (parts array) | Matches docs |
| invoke() vs responses.create() distinction | Correctly documented |
| evaluatorq imports & function signatures | Match official tutorial |
| Package names (evaluatorq, @orq-ai/evaluatorq) | Correct |
| All 5 documentation links in SKILL.md | Valid and resolving |

No factual mismatches found between the skill and the official docs.


Issues

1. Missing import asyncio in Python job patterns

File: skills/compare-agents/resources/job-patterns.md (LangGraph, CrewAI, Generic Agent patterns)

These patterns use asyncio.to_thread() but never import asyncio. Copy-pasting will produce NameError.

2. DatasetIdInput documented in Python signature but no Python example

File: skills/compare-agents/resources/evaluatorq-api.md

The Python function signature includes `DatasetIdInput`, but the note on line 137 says only TypeScript supports `{ datasetId: "..." }`, implying Python doesn't. If Python also supports it, add an example. If not, clarify the signature.

3. pass: false CI/CD gotcha references unexplained field

File: skills/compare-agents/resources/gotchas.md (lines 89-90)

The gotcha says TypeScript exits with code 1 when any evaluator returns pass: false, but evaluator examples only show a value field (a number). The relationship between value and pass is never explained.


Suggestions

4. Document that MCP tool short names map to fully qualified names

File: skills/compare-agents/SKILL.md (MCP tools table, lines 85-91)

The table uses short names (search_entities, create_dataset, etc.) which is consistent with the other skills in this repo. However, the actual tool names at runtime are fully qualified (e.g., mcp__orq-remote-mcp__search_entities). Consider adding a note above the table like:

Tool names below are shortened for readability. At runtime they resolve via the orq* glob in allowed-tools (e.g., search_entities → mcp__orq-remote-mcp__search_entities).

This would help skill authors who look at this skill as a reference understand the naming convention.

5. Missing test scenarios in tests/skills.md

  • No error-handling scenario (missing API key, bad agent key, unreachable agent)
  • No multi-agent (3+) test scenario despite the skill explicitly supporting it
  • Phase 5 (run command, ORQ_API_KEY reminder, Experiment UI link) has zero test verification

Strengths

  • Excellent progressive disclosure with resources/ directory
  • Strong constraint documentation with "Why" rationale
  • The invoke() vs responses.create() gotcha is well cross-referenced across all 3 resource files
  • Dataset bias prevention guidance is concrete with Wrong/Correct examples
  • Test scenarios are more thorough than other skills (5 scenarios vs typical 1-2)
  • Clean delegation to companion skills avoids scope creep

@arianpasquali merged commit 6bf1159 into main on Mar 27, 2026