
feat: add compare-agents skill for cross-framework evaluation #13

Merged
arianpasquali merged 1 commit into main from orqkit-evaluatorq-agent-evaluation on Mar 27, 2026

Conversation

@arianpasquali

Summary

  • Adds compare-agents skill — runs head-to-head experiments comparing agents across frameworks (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK) using evaluatorq from orqkit
  • Follows repo conventions: role statement, constraints, companion skills, workflow checklist, resources directory
  • Delegates dataset/evaluator creation to companion skills (generate-synthetic-dataset, build-evaluator) instead of duplicating content
  • Covers both Python (evaluatorq) and TypeScript (@orq-ai/evaluatorq) with framework-specific job patterns
  • Adds 5 test scenarios to tests/skills.md

Files

| File | Purpose |
| --- | --- |
| skills/compare-agents/SKILL.md (197 lines) | Main skill — orchestrator with phased workflow |
| skills/compare-agents/resources/job-patterns.md | Framework-specific job patterns (Python + TS) |
| skills/compare-agents/resources/evaluatorq-api.md | evaluatorq/orqkit API reference |
| skills/compare-agents/resources/gotchas.md | Known issues and workarounds |
| tests/skills.md | 5 test scenarios for the skill |

Test plan

  • Verify skill triggers on "compare agents", "benchmark", "test agents"
  • Verify companion skill redirects work (dataset → generate-synthetic-dataset, evaluator → build-evaluator)
  • Verify Python job patterns generate valid evaluatorq scripts
  • Verify TypeScript job patterns use correct imports (@orq-ai/evaluatorq)
  • Verify dataset bias prevention (no mock-data-biased expected outputs)

🤖 Generated with Claude Code

Introduces a new skill that uses evaluatorq (from orqkit) to compare agents
across frameworks (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK)
head-to-head on the same dataset with LLM-as-a-judge scoring.

The skill follows repo conventions (role statement, constraints, companion skills,
workflow checklist, resources directory) and delegates dataset/evaluator creation
to companion skills instead of duplicating them. Supports both Python and TypeScript.
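
For readers who have not used evaluatorq before, the rough shape of such a head-to-head run is sketched below. This is illustrative only: the agent functions are placeholders, the EvaluationResult keyword arguments are assumed from the review comments further down, and the orchestrating evaluatorq() call is elided because its exact signature is documented in resources/evaluatorq-api.md rather than here.

```python
# Illustrative sketch only -- not the skill's actual job patterns.
from evaluatorq import EvaluationResult


async def langgraph_job(prompt: str) -> str:
    # Placeholder for a LangGraph agent; the real patterns in
    # resources/job-patterns.md wrap the framework's synchronous calls.
    return f"LangGraph answer to: {prompt}"


async def crewai_job(prompt: str) -> str:
    # Placeholder for a CrewAI agent invocation.
    return f"CrewAI answer to: {prompt}"


async def llm_judge(prompt: str, output: str) -> EvaluationResult:
    # Placeholder LLM-as-a-judge scorer; a real judge would call a model.
    return EvaluationResult(value=1.0, explanation="placeholder score")

# Both jobs and the judge would then be handed to evaluatorq() together with
# a shared dataset, so every agent is scored on the same inputs.
```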

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@Baukebrenninkmeijer left a comment


PR Review: compare-agents skill

Validated all code, API references, and documentation links against orq.ai docs and Claude Code skill best practices. Overall this is a well-structured skill with accurate API references.

✅ Docs & API Validation (all passed)

| Area | Status |
| --- | --- |
| agents.responses.create() endpoint | ✅ Matches orq.ai API docs |
| A2A message format (parts array) | ✅ Correct |
| invoke() vs responses.create() distinction | ✅ Correctly documented |
| evaluatorq imports & function signatures | ✅ Match official tutorial |
| Package names (evaluatorq, @orq-ai/evaluatorq) | ✅ Correct |
| All 5 documentation links | ✅ Valid and resolving |

Issues Found

🔴 Critical

1. Missing `import asyncio` in Python job patterns — resources/job-patterns.md

LangGraph (line 77), CrewAI (line 109), and Generic Agent (line 218) patterns all use asyncio.to_thread() but never show the import asyncio statement. Copy-pasting these patterns will produce NameError: name 'asyncio' is not defined.

Suggestion: Add a "Required Imports" section at the top of job-patterns.md:

```python
import asyncio
from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult
```
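
To make the failure mode and the fix concrete, here is a minimal, self-contained sketch; `blocking_agent_call` and `run_agent` are hypothetical stand-ins, not the actual patterns from job-patterns.md.

```python
import asyncio


def blocking_agent_call(prompt: str) -> str:
    # Hypothetical stand-in for a synchronous framework entry point such as
    # LangGraph's or CrewAI's invoke()-style calls.
    return f"answer to: {prompt}"


async def run_agent(prompt: str) -> str:
    # Without `import asyncio` at the top of the file, this line raises
    # NameError: name 'asyncio' is not defined -- the bug flagged above.
    return await asyncio.to_thread(blocking_agent_call, prompt)


if __name__ == "__main__":
    print(asyncio.run(run_agent("What does compare-agents do?")))
```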

2. Reference to nonexistent `setup-observability` companion skill — SKILL.md line 35

- `setup-observability` — instrument agents for tracing

No skills/setup-observability/ directory exists in the repo. The agent will try to delegate to this skill and fail. Should be removed or marked as (planned).

🟡 Important

3. `DatasetIdInput` shown in Python signature but undocumented — resources/evaluatorq-api.md line 117

The Python function signature includes `data: DatasetIdInput | Sequence[DataPoint]`, but no Python example shows how to use `DatasetIdInput`. The note on line 137 implies only TypeScript supports `{ datasetId: "..." }`, creating a contradiction. If Python supports it too, add an example. If not, remove it from the signature.

4. `pass: false` gotcha references undocumented field — resources/gotchas.md line 90

"TypeScript evaluatorq exits with code 1 when any evaluator returns pass: false."

But the evaluator scorer examples only show `EvaluationResult` with `value` (a number) and `explanation`. The relationship between the numeric `value` and the boolean `pass` is never explained. This will confuse users trying to understand CI/CD exit behavior.
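
For context, a hypothetical scorer of the shape the examples describe might look like the sketch below; the `EvaluationResult` keyword arguments are assumed from the field names above, and the Python package may behave differently from the TypeScript exit-code behavior the gotcha covers.

```python
from evaluatorq import EvaluationResult


async def exact_match_scorer(expected: str, actual: str) -> EvaluationResult:
    # Returns only a numeric value plus an explanation, mirroring the skill's
    # scorer examples. How (or whether) a boolean `pass` is derived from
    # `value` is exactly what the gotcha leaves unexplained for CI exit codes.
    matched = expected.strip() == actual.strip()
    return EvaluationResult(
        value=1.0 if matched else 0.0,
        explanation="exact match" if matched else "outputs differ",
    )
```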

💡 Suggestions

5. Test coverage gaps — tests/skills.md

  • No scenario tests what happens when an agent invocation fails (bad key, missing env var, unreachable agent). The SKILL.md says "ALWAYS confirm each agent can be invoked independently" but this is untested.
  • No multi-agent (3+) test scenario despite the skill explicitly supporting it.
  • Phase 5 (run command, ORQ_API_KEY reminder, Experiment UI link) has zero test verification.

Strengths

  • Excellent progressive disclosure: the SKILL.md → resources/ split keeps the main file focused
  • Strong constraint documentation with "Why" rationale paragraphs
  • The invoke() vs responses.create() gotcha is well cross-referenced across all 3 resource files
  • Dataset bias prevention guidance is concrete with Wrong/Correct examples
  • Test scenarios (5 scenarios) are substantially more thorough than other skills
  • Clean delegation to companion skills avoids scope creep
  • MCP tool naming follows repo convention (consistent with other skills)


@Baukebrenninkmeijer left a comment


Correction on issue #2 (setup-observability): This skill exists on PR #12 (feat/RES-545-instrument-app-skill), so the companion reference is valid — just a cross-PR dependency. Not a real issue. Apologies for the false positive.

@Baukebrenninkmeijer

PR Review: compare-agents Skill

Validated all code and endpoints against orq.ai docs and checked skill structure against repo conventions and Anthropic skill authoring best practices.

API & Docs Validation (all passed)

| Area | Status |
| --- | --- |
| agents.responses.create() endpoint | Matches orq.ai API docs |
| A2A message format (parts array) | Matches docs |
| invoke() vs responses.create() distinction | Correctly documented |
| evaluatorq imports & function signatures | Match official tutorial |
| Package names (evaluatorq, @orq-ai/evaluatorq) | Correct |
| All 5 documentation links in SKILL.md | Valid and resolving |

No factual mismatches found between the skill and the official docs.


Issues

1. Missing import asyncio in Python job patterns

File: skills/compare-agents/resources/job-patterns.md (LangGraph, CrewAI, Generic Agent patterns)

These patterns use asyncio.to_thread() but never import asyncio. Copy-pasting will produce NameError.

2. DatasetIdInput documented in Python signature but no Python example

File: skills/compare-agents/resources/evaluatorq-api.md

The Python function signature includes `DatasetIdInput`, but the note on line 137 says only TypeScript supports `{ datasetId: "..." }`, implying Python doesn't. If Python also supports it, add an example. If not, clarify the signature.

3. pass: false CI/CD gotcha references unexplained field

File: skills/compare-agents/resources/gotchas.md (lines 89-90)

The gotcha says TypeScript exits with code 1 when any evaluator returns pass: false, but evaluator examples only show a value field (a number). The relationship between value and pass is never explained.


Suggestions

4. Document that MCP tool short names map to fully qualified names

File: skills/compare-agents/SKILL.md (MCP tools table, lines 85-91)

The table uses short names (search_entities, create_dataset, etc.) which is consistent with the other skills in this repo. However, the actual tool names at runtime are fully qualified (e.g., mcp__orq-remote-mcp__search_entities). Consider adding a note above the table like:

Tool names below are shortened for readability. At runtime they resolve via the orq* glob in allowed-tools (e.g., search_entities → mcp__orq-remote-mcp__search_entities).

This would help skill authors who look at this skill as a reference understand the naming convention.

5. Missing test scenarios in tests/skills.md

  • No error-handling scenario (missing API key, bad agent key, unreachable agent)
  • No multi-agent (3+) test scenario despite the skill explicitly supporting it
  • Phase 5 (run command, ORQ_API_KEY reminder, Experiment UI link) has zero test verification

Strengths

  • Excellent progressive disclosure with resources/ directory
  • Strong constraint documentation with "Why" rationale
  • The invoke() vs responses.create() gotcha is well cross-referenced across all 3 resource files
  • Dataset bias prevention guidance is concrete with Wrong/Correct examples
  • Test scenarios are more thorough than other skills (5 scenarios vs typical 1-2)
  • Clean delegation to companion skills avoids scope creep

@arianpasquali merged commit 6bf1159 into main on Mar 27, 2026