From 44429115fabe8f8f6c239c33e0789a495b8cbcf3 Mon Sep 17 00:00:00 2001 From: Arian Pasquali Date: Thu, 26 Mar 2026 17:19:38 +0100 Subject: [PATCH] feat: add compare-agents skill for cross-framework agent evaluation Introduces a new skill that uses evaluatorq (from orqkit) to compare agents across frameworks (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK) head-to-head on the same dataset with LLM-as-a-judge scoring. The skill follows repo conventions (role statement, constraints, companion skills, workflow checklist, resources directory) and delegates dataset/evaluator creation to companion skills instead of duplicating them. Supports both Python and TypeScript. Co-Authored-By: Claude Opus 4.6 (1M context) --- skills/compare-agents/SKILL.md | 197 +++++++++++++ .../resources/evaluatorq-api.md | 174 ++++++++++++ skills/compare-agents/resources/gotchas.md | 98 +++++++ .../compare-agents/resources/job-patterns.md | 264 ++++++++++++++++++ tests/skills.md | 51 ++++ 5 files changed, 784 insertions(+) create mode 100644 skills/compare-agents/SKILL.md create mode 100644 skills/compare-agents/resources/evaluatorq-api.md create mode 100644 skills/compare-agents/resources/gotchas.md create mode 100644 skills/compare-agents/resources/job-patterns.md diff --git a/skills/compare-agents/SKILL.md b/skills/compare-agents/SKILL.md new file mode 100644 index 0000000..c5da7bc --- /dev/null +++ b/skills/compare-agents/SKILL.md @@ -0,0 +1,197 @@ +--- +name: compare-agents +description: Run cross-framework agent comparisons using evaluatorq from orqkit. Compares any combination of agents (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK) head-to-head on the same dataset with LLM-as-a-judge scoring. Use when user says "compare agents", "benchmark", "test agents", or wants side-by-side evaluation. +allowed-tools: Bash, Read, Write, Edit, Grep, Glob, WebFetch, Task, AskUserQuestion, orq* +--- + +# Compare Agents + +You are an **orq.ai agent comparison specialist**. 
Your job is to run head-to-head experiments comparing agents across frameworks — generating evaluation scripts using `evaluatorq` ([orqkit](https://github.com/orq-ai/orqkit)), then viewing results in the orq.ai Experiment UI. + +Supported comparison modes: +- **External vs orq.ai** — e.g., LangGraph agent vs orq.ai agent +- **orq.ai vs orq.ai** — e.g., two orq.ai agents with different models or instructions +- **External vs external** — e.g., LangGraph vs CrewAI, Vercel vs OpenAI Agents SDK +- **Multiple agents** — compare 3+ agents in a single experiment + +## Constraints + +- **NEVER** create datasets inline in the comparison script — delegate to `generate-synthetic-dataset` skill or use `{ datasetId: "..." }` to load from the platform. +- **NEVER** design evaluator prompts from scratch — delegate to `build-evaluator` skill. +- **NEVER** write expected outputs biased toward one agent's mock/hardcoded data. +- **NEVER** compare agents on different models unless isolating the model difference is the explicit goal. +- **ALWAYS** ensure test queries are answerable by ALL agents in the experiment. +- **ALWAYS** use the same evaluator(s) for all agents to ensure fair scoring. +- **ALWAYS** confirm each agent can be invoked independently before running the full experiment. + +**Why these constraints:** Biased datasets produce meaningless rankings. Inline datasets bypass validation. Different models confound framework comparisons. Untested agents waste experiment budget on invocation errors. 
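The last constraint (confirm each agent can be invoked independently) can be checked with a small pre-flight script before any experiment budget is spent. The sketch below is one possible shape and is not part of evaluatorq: `smoke_test`, the `agents` mapping, and the two stand-in agents are hypothetical names, and each agent is assumed to be exposed as an async callable that takes a query string and returns text.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional

# Hypothetical type: each agent is an async callable from query to response text.
AgentFn = Callable[[str], Awaitable[str]]


async def smoke_test(
    agents: Dict[str, AgentFn],
    query: str,
    timeout_s: float = 30.0,
) -> Dict[str, Optional[str]]:
    """Invoke every agent once; map agent name to an error message, or None on success."""
    results: Dict[str, Optional[str]] = {}
    for name, call in agents.items():
        try:
            answer = await asyncio.wait_for(call(query), timeout=timeout_s)
            # An empty answer is treated as a failure too.
            results[name] = None if answer and answer.strip() else "empty response"
        except Exception as exc:  # record the error instead of aborting the check
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


# Stand-in agents for illustration only; replace with real invocations.
async def echo_agent(query: str) -> str:
    return f"echo: {query}"


async def broken_agent(query: str) -> str:
    raise ConnectionError("endpoint unreachable")


if __name__ == "__main__":
    report = asyncio.run(
        smoke_test({"Echo": echo_agent, "Broken": broken_agent}, "What is 2+2?")
    )
    for name, error in report.items():
        print(f"{name}: {'ok' if error is None else error}")
```

Any agent that maps to a non-`None` value should be fixed before it is wired into a comparison job.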
+ +## Companion Skills + +- `generate-synthetic-dataset` — create the evaluation dataset +- `build-evaluator` — design the LLM-as-a-judge evaluator +- `run-experiment` — run orq.ai-native experiments (when no external agents are involved) +- `build-agent` — create orq.ai agents to include in comparisons +- `setup-observability` — instrument agents for tracing + +## Workflow Checklist + +Copy this to track progress: + +``` +Agent Comparison Progress: +- [ ] Phase 1: Identify agents, frameworks, and language (Python/TS) +- [ ] Phase 2: Create dataset (→ generate-synthetic-dataset) +- [ ] Phase 3: Create evaluator (→ build-evaluator) +- [ ] Phase 4: Generate comparison script +- [ ] Phase 5: Run and view results in orq.ai +``` + +## When to use + +- User wants to compare agents built with different frameworks +- User wants to benchmark an orq.ai agent against an external agent +- User wants to compare 3+ agents in a single experiment +- User says "compare agents", "benchmark", "test agents side-by-side" + +## When NOT to use + +- Just need a dataset? → `generate-synthetic-dataset` +- Just need an evaluator? → `build-evaluator` +- Comparing orq.ai configurations only (no external agents)? → `run-experiment` +- Need to identify failure modes first? 
→ `analyze-trace-failures` + +## Resources + +- **Job patterns** (all frameworks, Python + TypeScript): See [resources/job-patterns.md](resources/job-patterns.md) +- **evaluatorq API reference**: See [resources/evaluatorq-api.md](resources/evaluatorq-api.md) +- **Known gotchas**: See [resources/gotchas.md](resources/gotchas.md) + +## orq.ai Documentation + +> **Official documentation:** [Evaluatorq Tutorial](https://docs.orq.ai/docs/tutorials/evaluator-q) + +[Experiments](https://docs.orq.ai/docs/experiments/creating) · [Evaluators](https://docs.orq.ai/docs/evaluators/overview) · [Agent Responses API](https://docs.orq.ai/reference/agents/create-response) · [Datasets](https://docs.orq.ai/docs/datasets/overview) + +### Key Concepts + +- **evaluatorq** is the evaluation runner from [orqkit](https://github.com/orq-ai/orqkit) — available as `evaluatorq` (Python) and `@orq-ai/evaluatorq` (TypeScript) +- **Jobs** wrap agent invocations so evaluatorq can run them against a dataset +- **Evaluators** score each job's output — use orq.ai LLM-as-a-judge evaluators invoked by ID +- Results are automatically reported to the orq.ai Experiment UI when `ORQ_API_KEY` is set + +### orq MCP Tools + +| Tool | Purpose | +|------|---------| +| `search_entities` | Find orq.ai agent keys (use `type: "agent"`) | +| `create_dataset` | Create a dataset | +| `create_datapoints` | Populate dataset with test cases | +| `create_llm_eval` | Create an LLM-as-a-judge evaluator | + +## Prerequisites + +- The orq.ai MCP server is connected +- An `ORQ_API_KEY` environment variable is set +- **Python:** `pip install evaluatorq orq-ai-sdk` +- **TypeScript:** `npm install @orq-ai/evaluatorq` +- The agents to compare exist and are invocable (locally or via API) + +--- + +## Steps + +### Phase 1: Identify Agents + +1. **Ask the user** which agents to compare. 
For each agent, determine: + - Framework (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK, or generic) + - How to invoke it (agent key, import path, HTTP endpoint) + +2. **For orq.ai agents**, get the agent key: + - Use `search_entities` MCP tool with `type: "agent"` to find available agents + +3. **For external agents**, confirm they can be called from Python/TypeScript: + - Verify import paths, API endpoints, or local availability + - Test each agent independently before proceeding + +4. **Ask the user's language preference**: Python or TypeScript. Default to Python if no preference. + +### Phase 2: Create Dataset + +5. **Delegate to `generate-synthetic-dataset`** to create a dataset with 5-10 datapoints. + + Critical reminders for cross-framework comparison datasets: + - Queries must be answerable by **ALL** agents in the experiment + - Expected outputs must NOT be biased toward any agent's mock/hardcoded data + - For dynamic answers, write expected outputs as correctness criteria, not specific values + - Mix question types: computation, tool-dependent, multi-step + +### Phase 3: Create Evaluator + +6. **Delegate to `build-evaluator`** to create an LLM-as-a-judge evaluator. Save the returned evaluator ID. + + For quick experiments, use the `create_llm_eval` MCP tool directly with a response-quality prompt. Ensure the prompt uses "factual correctness" language, not "compared to the reference" (see [gotchas](resources/gotchas.md)). + +### Phase 4: Generate Comparison Script + +7. **Select job patterns** from [resources/job-patterns.md](resources/job-patterns.md) for each agent's framework. + +8. **Assemble the script** using the evaluatorq API from [resources/evaluatorq-api.md](resources/evaluatorq-api.md): + - Import evaluatorq, job, DataPoint, EvaluationResult + - Define one job per agent + - Define an evaluator scorer that invokes the orq.ai LLM-as-a-judge by ID + - Wire jobs + data + evaluators into the `evaluatorq()` call + +9. 
**Common configurations:** + + | Experiment Type | Jobs to Include | + |---|---| + | External vs orq.ai | One external job + one orq.ai job | + | orq.ai vs orq.ai | Two orq.ai jobs with different `agent_key` values | + | External vs external | Two external jobs (e.g., LangGraph + CrewAI) | + | Multi-agent | Three or more jobs of any type | + +10. **Replace all placeholders** in the generated script: + - `` — evaluator ID from Phase 3 + - `` — orq.ai agent key(s) from Phase 1 + - `` — descriptive experiment name + - Framework-specific placeholders (import paths, endpoints) + +### Phase 5: Run and View Results + +11. **Run the script:** + + ```bash + # Python + export ORQ_API_KEY="your-key" + python evaluate.py + + # TypeScript + export ORQ_API_KEY="your-key" + npx tsx evaluate.ts + ``` + +12. **View results** in orq.ai: + - Open the orq.ai Studio → navigate to your project → **Experiments** + - Compare scores across all agents — response quality, latency, and cost + +13. **If issues arise**, check [resources/gotchas.md](resources/gotchas.md) for common pitfalls. + +14. **Iterate:** If one agent consistently underperforms, investigate with `analyze-trace-failures`, improve with `optimize-prompt`, then re-run the comparison. 
+ +--- + +## Checklist + +- [ ] Agents identified: frameworks, invocation methods confirmed +- [ ] Dataset created with unbiased, cross-agent datapoints +- [ ] Evaluator created with factual-correctness language +- [ ] Comparison script generated with correct IDs and import paths +- [ ] Script runs successfully and results appear in the orq.ai Experiment UI + +## Open in orq.ai + +After running the comparison: +- **Experiment results:** orq.ai Studio → Your Project → Experiments +- **Agent details:** orq.ai Studio → Agents +- **Traces:** orq.ai Studio → Observability → Traces diff --git a/skills/compare-agents/resources/evaluatorq-api.md b/skills/compare-agents/resources/evaluatorq-api.md new file mode 100644 index 0000000..7ab702e --- /dev/null +++ b/skills/compare-agents/resources/evaluatorq-api.md @@ -0,0 +1,174 @@ +# evaluatorq API Reference + +Quick reference for the evaluatorq library from [orqkit](https://github.com/orq-ai/orqkit). Available in both Python and TypeScript. + +--- + +## Installation + +| Language | Package | Install | +|----------|---------|---------| +| Python | `evaluatorq` | `pip install evaluatorq orq-ai-sdk` | +| TypeScript | `@orq-ai/evaluatorq` | `npm install @orq-ai/evaluatorq` | + +--- + +## Core API + +### Job Definition + +**Python:** +```python +from evaluatorq import job, DataPoint + +@job("AgentName") +async def my_job(data: DataPoint, row: int): + # Call your agent with data.inputs["query"] + return { + "agent": "AgentName", + "query": data.inputs["query"], + "response": "agent output here", + } +``` + +**TypeScript:** +```typescript +import { job } from "@orq-ai/evaluatorq"; + +const myJob = job("AgentName", async (data) => { + // Call your agent with data.inputs.query + return { + agent: "AgentName", + query: data.inputs.query, + response: "agent output here", + }; +}); +``` + +### Evaluator Scorer + +**Python:** +```python +from evaluatorq import EvaluationResult + +async def my_scorer(params): + data: DataPoint = 
params["data"] + output = params["output"] + # Score the output + return EvaluationResult( + value=0.85, + explanation="Factually correct", + ) +``` + +**TypeScript:** +```typescript +const myScorer = async ({ data, output }) => ({ + value: 0.85, + explanation: "Factually correct", +}); +``` + +### Running evaluatorq + +**Python:** +```python +from evaluatorq import evaluatorq, DataPoint + +async def main(): + await evaluatorq( + "experiment-name", + data=[ + DataPoint( + inputs={"query": "What is 2+2?"}, + expected_output="4", + ), + ], + jobs=[job_a, job_b], + evaluators=[ + {"name": "quality", "scorer": my_scorer}, + ], + parallelism=5, + ) +``` + +**TypeScript:** +```typescript +import { evaluatorq } from "@orq-ai/evaluatorq"; + +await evaluatorq("experiment-name", { + data: [{ inputs: { query: "What is 2+2?" } }], + jobs: [jobA, jobB], + evaluators: [{ name: "quality", scorer: myScorer }], +}); +``` + +--- + +## Function Signature + +### Python + +```python +async def evaluatorq( + name: str, + params: EvaluatorParams | dict | None = None, + *, + data: DatasetIdInput | Sequence[DataPoint] | None = None, + jobs: list[Job] | None = None, + evaluators: list[Evaluator] | None = None, + parallelism: int = 1, + print_results: bool = True, + description: str | None = None, +) -> EvaluatorqResult +``` + +### TypeScript + +```typescript +evaluatorq(name: string, options: { + data: DataPoint[] | { datasetId: string }; + jobs: Job[]; + evaluators: Evaluator[]; + parallelism?: number; +}): Promise +``` + +> **TypeScript supports `{ datasetId: "..." }`** to fetch data directly from the orq.ai platform instead of inlining datapoints. 
+ +--- + +## Built-in Evaluators (Python) + +```python +from evaluatorq import string_contains_evaluator, exact_match_evaluator + +evaluators=[ + string_contains_evaluator(case_insensitive=True, name="contains-check"), + exact_match_evaluator(name="exact-match"), + {"name": "custom", "scorer": my_scorer}, +] +``` + +--- + +## Framework Wrappers (TypeScript) + +```typescript +import { wrapLangGraphAgent } from "@orq-ai/evaluatorq/langchain"; +import { wrapAISdkAgent } from "@orq-ai/evaluatorq/ai-sdk"; + +const langGraphJob = wrapLangGraphAgent("LangGraph", agent); +const vercelJob = wrapAISdkAgent("VercelAgent", agent); +``` + +--- + +## Environment + +| Variable | Purpose | Default | +|----------|---------|---------| +| `ORQ_API_KEY` | orq.ai API key (required for platform integration) | — | +| `ORQ_BASE_URL` | orq.ai base URL | `https://api.orq.ai` | + +When `ORQ_API_KEY` is set, evaluatorq automatically reports results to the orq.ai Experiment UI. diff --git a/skills/compare-agents/resources/gotchas.md b/skills/compare-agents/resources/gotchas.md new file mode 100644 index 0000000..9418857 --- /dev/null +++ b/skills/compare-agents/resources/gotchas.md @@ -0,0 +1,98 @@ +# Known Gotchas + +Common pitfalls when running cross-framework agent comparisons with evaluatorq. + +--- + +## orq.ai SDK + +### `agents.invoke()` vs `agents.responses.create()` + +`invoke()` returns an async A2A task with `status: "submitted"` — it does NOT return the agent's response. Use `responses.create(background=False)` to get a synchronous response with output. + +```python +# Wrong — returns task status, not output +response = orq.agents.invoke(agent_key="my-agent", ...) 
+ +# Correct — returns actual response +response = orq.agents.responses.create( + agent_key="my-agent", + background=False, + message={"role": "user", "parts": [{"kind": "text", "text": query}]}, +) +``` + +### Message format + +The orq agent API uses A2A message format, not OpenAI-style: + +```python +# Wrong +message={"role": "user", "content": "Hello"} + +# Correct +message={"role": "user", "parts": [{"kind": "text", "text": "Hello"}]} +``` + +### Code tools: `additionalProperties: false` + +When creating custom code tools for orq agents, the parameter schema **must** include `"additionalProperties": false` or the model will reject the tool call with a 400 error. + +--- + +## Evaluator Design + +### Wording bias in evaluator prompts + +If the evaluator prompt says "compared to the reference", it will penalize correct answers that use different wording. Use "factual correctness" language instead: + +``` +# Wrong +"Compare the response to the reference answer and score accuracy." + +# Correct +"Check whether the response contains the same core facts as the reference. +Different wording is acceptable as long as the facts match." +``` + +### Same model for fair comparison + +When comparing frameworks, ensure all agents use the same underlying model (e.g., `openai/gpt-4o-mini`). Otherwise you're measuring model differences, not framework differences. + +--- + +## Dataset Design + +### Mock data bias + +Do NOT write expected outputs that match one agent's hardcoded data. If one agent has a fake weather tool returning "Sunny, 22C" and another uses a real API, the fake agent will always win against a reference matching its own data. + +For dynamic answers (weather, stock prices), write expected outputs as correctness criteria: + +``` +# Wrong +"The weather in Tokyo is sunny with a temperature of 22C." + +# Correct +"The response should include the current temperature and conditions in Tokyo from a real data source." 
+``` + +--- + +## TypeScript-specific + +### `wrapLangGraphAgent` type requirements + +`wrapLangGraphAgent` from `@orq-ai/evaluatorq/langchain` expects a LangChain-compatible agent instance — not a raw LangGraph `StateGraph`. + +### Exit codes for CI/CD + +TypeScript evaluatorq exits with code 1 when any evaluator returns `pass: false`. This is intentional for CI/CD pipelines but can be surprising in development. + +--- + +## Staging Environment + +When using `ORQ_BASE_URL` with a staging URL like `https://my.staging.orq.ai`, evaluatorq internally transforms `my.` to `api.` in URLs. If `api.staging.orq.ai` doesn't exist, this will cause connection errors. + +**Workaround:** Set `ORQ_BASE_URL=https://api.staging.orq.ai` directly, bypassing the transformation. diff --git a/skills/compare-agents/resources/job-patterns.md b/skills/compare-agents/resources/job-patterns.md new file mode 100644 index 0000000..6745b3d --- /dev/null +++ b/skills/compare-agents/resources/job-patterns.md @@ -0,0 +1,264 @@ +# Job Patterns for Agent Comparison + +Framework-specific job patterns for use with evaluatorq. Each job wraps an agent invocation so evaluatorq can run it against a dataset. + +Pick the pattern matching each agent's framework, then assemble them into the comparison script. 
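The orq.ai pattern below builds a parts-based user message and then concatenates the text parts of the response. As a rough sketch, those two steps can be factored into small helpers. `a2a_user_message` and `collect_text` are hypothetical names, and plain dicts stand in for the SDK's response objects here (the pattern below reads attributes with `hasattr` instead), so treat this as an illustration of the message shape, not as SDK code.

```python
from typing import Any, Dict, List, Optional


def a2a_user_message(text: str) -> Dict[str, Any]:
    """Build a user message in the parts-based shape used throughout this document."""
    return {"role": "user", "parts": [{"kind": "text", "text": text}]}


def collect_text(output: Optional[List[Dict[str, Any]]]) -> str:
    """Concatenate the text of every text-bearing part; non-text parts are skipped."""
    chunks: List[str] = []
    for msg in output or []:
        for part in msg.get("parts", []):
            text = part.get("text")
            if isinstance(text, str):
                chunks.append(text)
    return "".join(chunks)


if __name__ == "__main__":
    print(a2a_user_message("Hello"))
    # Assumed response shape, mirroring the request shape above.
    sample_output = [
        {"parts": [{"kind": "text", "text": "The answer "}, {"kind": "data"}]},
        {"parts": [{"kind": "text", "text": "is 4."}]},
    ]
    print(collect_text(sample_output))
```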
+ +--- + +## orq.ai Agent + +### Python + +```python +@job("OrqAgent") +async def orq_agent_job(data: DataPoint, row: int): + with Orq(api_key=ORQ_API_KEY, server_url=ORQ_BASE_URL) as orq: + response = orq.agents.responses.create( + agent_key="", + background=False, + message={ + "role": "user", + "parts": [{"kind": "text", "text": data.inputs["query"]}], + }, + ) + + text = "" + if response.output: + for msg in response.output: + for part in msg.parts: + if hasattr(part, "text"): + text += part.text + + return { + "agent": "OrqAgent", + "query": data.inputs["query"], + "response": text, + } +``` + +### TypeScript + +```typescript +const orqAgentJob = job("OrqAgent", async (data) => { + const orq = new Orq({ apiKey: process.env.ORQ_API_KEY }); + const response = await orq.agents.responses.create({ + agentKey: "", + background: false, + message: { + role: "user", + parts: [{ kind: "text", text: data.inputs.query }], + }, + }); + + let text = ""; + for (const msg of response.output ?? []) { + for (const part of msg.parts) { + if ("text" in part) text += part.text; + } + } + return { agent: "OrqAgent", query: data.inputs.query, response: text }; +}); +``` + +> **Important:** Use `agents.responses.create()`, NOT `agents.invoke()`. The `invoke()` method returns an async A2A task (status: "submitted"), not the actual response. See [gotchas](gotchas.md). + +--- + +## LangGraph + +### Python + +```python +@job("LangGraph") +async def langgraph_job(data: DataPoint, row: int): + from agent import agent + + result = await asyncio.to_thread( + agent.invoke, + {"messages": [("user", data.inputs["query"])]}, + ) + return { + "agent": "LangGraph", + "query": data.inputs["query"], + "response": result["messages"][-1].content, + } +``` + +### TypeScript + +```typescript +import { wrapLangGraphAgent } from "@orq-ai/evaluatorq/langchain"; + +const langGraphJob = wrapLangGraphAgent("LangGraph", agent); +``` + +> `wrapLangGraphAgent` expects a LangChain-compatible agent instance. 
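The LangGraph Python job above runs the synchronous `agent.invoke` through `asyncio.to_thread`, which keeps a blocking call from stalling the event loop while other jobs run. A minimal illustration using only the standard library follows; `blocking_agent` is a hypothetical stand-in for a synchronous agent call, not LangGraph code.

```python
import asyncio
import time
from typing import List


def blocking_agent(query: str) -> str:
    """Stand-in for a synchronous agent call (hypothetical)."""
    time.sleep(0.2)  # simulate blocking I/O such as an LLM request
    return f"answer to: {query}"


async def run_in_thread(query: str) -> str:
    # Offload the blocking call to a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_agent, query)


async def main() -> List[str]:
    # The two calls can overlap because each one runs in its own worker thread.
    return list(await asyncio.gather(run_in_thread("q1"), run_in_thread("q2")))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The same wrapping appears in the CrewAI and generic patterns, where the underlying call is also synchronous.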
+ +--- + +## CrewAI + +### Python + +```python +@job("CrewAI") +async def crewai_job(data: DataPoint, row: int): + from crew import crew + + result = await asyncio.to_thread( + crew.kickoff, + inputs={"query": data.inputs["query"]}, + ) + return { + "agent": "CrewAI", + "query": data.inputs["query"], + "response": result.raw, + } +``` + +### TypeScript + +CrewAI is Python-native. Wrap it via HTTP if you need TypeScript: + +```typescript +const crewJob = job("CrewAI", async (data) => { + const res = await fetch("", { + method: "POST", + headers: { "Content-Type": "application/json" }, + body: JSON.stringify({ query: data.inputs.query }), + }); + const result = await res.json(); + return { agent: "CrewAI", query: data.inputs.query, response: result.output }; +}); +``` + +--- + +## OpenAI Agents SDK + +### Python + +```python +@job("OpenAIAgent") +async def openai_agent_job(data: DataPoint, row: int): + from agents import Runner + from my_agent import agent + + result = await Runner.run(agent, data.inputs["query"]) + return { + "agent": "OpenAIAgent", + "query": data.inputs["query"], + "response": result.final_output, + } +``` + +### TypeScript + +```typescript +const openaiAgentJob = job("OpenAIAgent", async (data) => { + const { Runner } = await import("@openai/agents"); + const { agent } = await import("./my-agent"); + + const result = await Runner.run(agent, data.inputs.query); + return { + agent: "OpenAIAgent", + query: data.inputs.query, + response: result.finalOutput, + }; +}); +``` + +--- + +## Vercel AI SDK + +### Python (via HTTP) + +```python +@job("VercelAgent") +async def vercel_agent_job(data: DataPoint, row: int): + import httpx + + async with httpx.AsyncClient(timeout=30.0) as client: + r = await client.post( + "", + json={"prompt": data.inputs["query"]}, + ) + result = r.json() + + return { + "agent": "VercelAgent", + "query": data.inputs["query"], + "response": result["text"], + } +``` + +### TypeScript + +```typescript +import { wrapAISdkAgent } 
from "@orq-ai/evaluatorq/ai-sdk"; + +const vercelJob = wrapAISdkAgent("VercelAgent", agent); +``` + +> `wrapAISdkAgent` works with any Vercel AI SDK agent instance. + +--- + +## Generic Agent (any callable) + +### Python + +```python +@job("MyAgent") +async def generic_agent_job(data: DataPoint, row: int): + from my_agent import run_agent + + response = await asyncio.to_thread(run_agent, data.inputs["query"]) + return { + "agent": "MyAgent", + "query": data.inputs["query"], + "response": response, + } +``` + +### TypeScript + +```typescript +const genericJob = job("MyAgent", async (data) => { + const { runAgent } = await import("./my-agent"); + const response = await runAgent(data.inputs.query); + return { agent: "MyAgent", query: data.inputs.query, response }; +}); +``` + +--- + +## orq.ai vs orq.ai (multiple agent keys) + +When comparing two orq.ai agents, duplicate the orq.ai pattern with different keys: + +### Python + +```python +@job("OrqAgent-GPT4o") +async def orq_agent_a(data: DataPoint, row: int): + # ... agent_key="agent-gpt4o" ... + +@job("OrqAgent-Claude") +async def orq_agent_b(data: DataPoint, row: int): + # ... agent_key="agent-claude" ... +``` + +### TypeScript + +```typescript +const orqGpt4o = job("OrqAgent-GPT4o", async (data) => { + // ... agentKey: "agent-gpt4o" ... +}); + +const orqClaude = job("OrqAgent-Claude", async (data) => { + // ... agentKey: "agent-claude" ... +}); +``` diff --git a/tests/skills.md b/tests/skills.md index 4449b22..e3def4a 100644 --- a/tests/skills.md +++ b/tests/skills.md @@ -42,6 +42,53 @@ Requires `setup.md` to have run first (seed data for `run-experiment` test). 
- Ask: "Run an experiment using orq-skills-test-dataset with orq-skills-test-eval-length" - Verify: calls `create_experiment` with correct references +## `compare-agents` + +### Scenario 1: orq.ai vs external agent (Python) + +- Ask: "Compare my orq.ai agent orq-skills-test-echo against a simple Python function that reverses the input" +- Verify Phase 1: identifies two agents — orq.ai (uses `search_entities` to find `orq-skills-test-echo`) and generic Python +- Verify Phase 1: asks or confirms language preference (Python) +- Verify Phase 2: delegates to `generate-synthetic-dataset` or creates dataset via `create_dataset` + `create_datapoints` with `orq-skills-test-` prefix +- Verify Phase 3: delegates to `build-evaluator` or creates evaluator via `create_llm_eval` +- Verify Phase 4: generates a Python script with: + - `from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult` + - One `@job("OrqAgent")` using `orq.agents.responses.create()` (NOT `agents.invoke()`) + - One `@job("ReverseAgent")` wrapping the Python function + - An evaluator scorer invoking the orq.ai judge by ID + - An `evaluatorq()` call wiring jobs + data + evaluators +- Verify: script uses A2A message format `{"role": "user", "parts": [{"kind": "text", "text": ...}]}` (NOT OpenAI-style) +- Verify: does NOT hardcode datapoints inline if a dataset was created on the platform + +### Scenario 2: orq.ai vs orq.ai + +- Ask: "Compare two versions of my agent — orq-skills-test-echo with model gpt-4o-mini vs the same agent" +- Verify: generates two orq.ai job patterns with different job names (e.g., `OrqAgent-A`, `OrqAgent-B`) +- Verify: uses the same `agent_key` for both (since it's the same agent) +- Verify: warns about same-model comparison if both use the same model + +### Scenario 3: TypeScript preference + +- Ask: "I want to benchmark a LangGraph agent against my orq.ai agent, using TypeScript" +- Verify Phase 4: generates TypeScript, not Python +- Verify: imports from `@orq-ai/evaluatorq` 
+- Verify: uses `wrapLangGraphAgent` from `@orq-ai/evaluatorq/langchain` for the LangGraph job +- Verify: uses `job()` function (not `@job` decorator) + +### Scenario 4: Skill boundary — redirects + +- Ask: "Create a dataset for testing my agents" +- Verify: redirects to `generate-synthetic-dataset` (does NOT handle dataset creation itself) +- Ask: "Run an experiment with my orq.ai deployment" +- Verify: redirects to `run-experiment` (no external agents involved) + +### Scenario 5: Dataset bias prevention + +- Provide: two agents — one with a mock weather tool returning "Sunny, 22C", one with a real API +- Ask: "Compare these agents on weather queries" +- Verify: does NOT write expected outputs matching the mock data +- Verify: expected outputs describe correctness criteria (e.g., "should include current temperature from a real source") + --- ## Critical Files @@ -52,3 +99,7 @@ Requires `setup.md` to have run first (seed data for `run-experiment` test). - `skills/optimize-prompt/SKILL.md` - `skills/analyze-trace-failures/SKILL.md` - `skills/run-experiment/SKILL.md` +- `skills/compare-agents/SKILL.md` +- `skills/compare-agents/resources/job-patterns.md` +- `skills/compare-agents/resources/evaluatorq-api.md` +- `skills/compare-agents/resources/gotchas.md`