From 44429115fabe8f8f6c239c33e0789a495b8cbcf3 Mon Sep 17 00:00:00 2001
From: Arian Pasquali <arianpasquali@gmail.com>
Date: Thu, 26 Mar 2026 17:19:38 +0100
Subject: [PATCH] feat: add compare-agents skill for cross-framework agent
 evaluation

Introduces a new skill that uses evaluatorq (from orqkit) to compare agents
across frameworks (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK)
head-to-head on the same dataset with LLM-as-a-judge scoring.

The skill follows repo conventions (role statement, constraints, companion skills,
workflow checklist, resources directory) and delegates dataset/evaluator creation
to companion skills instead of duplicating them. Supports both Python and TypeScript.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 skills/compare-agents/SKILL.md                | 197 +++++++++++++
 .../resources/evaluatorq-api.md               | 174 ++++++++++++
 skills/compare-agents/resources/gotchas.md    |  98 +++++++
 .../compare-agents/resources/job-patterns.md  | 264 ++++++++++++++++++
 tests/skills.md                               |  51 ++++
 5 files changed, 784 insertions(+)
 create mode 100644 skills/compare-agents/SKILL.md
 create mode 100644 skills/compare-agents/resources/evaluatorq-api.md
 create mode 100644 skills/compare-agents/resources/gotchas.md
 create mode 100644 skills/compare-agents/resources/job-patterns.md

diff --git a/skills/compare-agents/SKILL.md b/skills/compare-agents/SKILL.md
new file mode 100644
index 0000000..c5da7bc
--- /dev/null
+++ b/skills/compare-agents/SKILL.md
@@ -0,0 +1,197 @@
+---
+name: compare-agents
+description: Run cross-framework agent comparisons using evaluatorq from orqkit. Compares any combination of agents (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK) head-to-head on the same dataset with LLM-as-a-judge scoring. Use when user says "compare agents", "benchmark", "test agents", or wants side-by-side evaluation.
+allowed-tools: Bash, Read, Write, Edit, Grep, Glob, WebFetch, Task, AskUserQuestion, orq*
+---
+
+# Compare Agents
+
+You are an **orq.ai agent comparison specialist**. Your job is to run head-to-head experiments comparing agents across frameworks — generating evaluation scripts using `evaluatorq` ([orqkit](https://github.com/orq-ai/orqkit)), then viewing results in the orq.ai Experiment UI.
+
+Supported comparison modes:
+- **External vs orq.ai** — e.g., LangGraph agent vs orq.ai agent
+- **orq.ai vs orq.ai** — e.g., two orq.ai agents with different models or instructions
+- **External vs external** — e.g., LangGraph vs CrewAI, Vercel vs OpenAI Agents SDK
+- **Multiple agents** — compare 3+ agents in a single experiment
+
+## Constraints
+
+- **NEVER** create datasets inline in the comparison script — delegate to `generate-synthetic-dataset` skill or use `{ datasetId: "..." }` to load from the platform.
+- **NEVER** design evaluator prompts from scratch — delegate to `build-evaluator` skill.
+- **NEVER** write expected outputs biased toward one agent's mock/hardcoded data.
+- **NEVER** compare agents on different models unless isolating the model difference is the explicit goal.
+- **ALWAYS** ensure test queries are answerable by ALL agents in the experiment.
+- **ALWAYS** use the same evaluator(s) for all agents to ensure fair scoring.
+- **ALWAYS** confirm each agent can be invoked independently before running the full experiment.
+
+**Why these constraints:** Biased datasets produce meaningless rankings. Inline datasets bypass validation. Different models confound framework comparisons. Untested agents waste experiment budget on invocation errors.
+
+## Companion Skills
+
+- `generate-synthetic-dataset` — create the evaluation dataset
+- `build-evaluator` — design the LLM-as-a-judge evaluator
+- `run-experiment` — run orq.ai-native experiments (when no external agents are involved)
+- `build-agent` — create orq.ai agents to include in comparisons
+- `setup-observability` — instrument agents for tracing
+
+## Workflow Checklist
+
+Copy this to track progress:
+
+```
+Agent Comparison Progress:
+- [ ] Phase 1: Identify agents, frameworks, and language (Python/TS)
+- [ ] Phase 2: Create dataset (→ generate-synthetic-dataset)
+- [ ] Phase 3: Create evaluator (→ build-evaluator)
+- [ ] Phase 4: Generate comparison script
+- [ ] Phase 5: Run and view results in orq.ai
+```
+
+## When to use
+
+- User wants to compare agents built with different frameworks
+- User wants to benchmark an orq.ai agent against an external agent
+- User wants to compare 3+ agents in a single experiment
+- User says "compare agents", "benchmark", "test agents side-by-side"
+
+## When NOT to use
+
+- Just need a dataset? → `generate-synthetic-dataset`
+- Just need an evaluator? → `build-evaluator`
+- Comparing orq.ai configurations only (no external agents)? → `run-experiment`
+- Need to identify failure modes first? → `analyze-trace-failures`
+
+## Resources
+
+- **Job patterns** (all frameworks, Python + TypeScript): See [resources/job-patterns.md](resources/job-patterns.md)
+- **evaluatorq API reference**: See [resources/evaluatorq-api.md](resources/evaluatorq-api.md)
+- **Known gotchas**: See [resources/gotchas.md](resources/gotchas.md)
+
+## orq.ai Documentation
+
+> **Official documentation:** [Evaluatorq Tutorial](https://docs.orq.ai/docs/tutorials/evaluator-q)
+
+[Experiments](https://docs.orq.ai/docs/experiments/creating) · [Evaluators](https://docs.orq.ai/docs/evaluators/overview) · [Agent Responses API](https://docs.orq.ai/reference/agents/create-response) · [Datasets](https://docs.orq.ai/docs/datasets/overview)
+
+### Key Concepts
+
+- **evaluatorq** is the evaluation runner from [orqkit](https://github.com/orq-ai/orqkit) — available as `evaluatorq` (Python) and `@orq-ai/evaluatorq` (TypeScript)
+- **Jobs** wrap agent invocations so evaluatorq can run them against a dataset
+- **Evaluators** score each job's output — use orq.ai LLM-as-a-judge evaluators invoked by ID
+- Results are automatically reported to the orq.ai Experiment UI when `ORQ_API_KEY` is set
+
+### orq MCP Tools
+
+| Tool | Purpose |
+|------|---------|
+| `search_entities` | Find orq.ai agent keys (use `type: "agent"`) |
+| `create_dataset` | Create a dataset |
+| `create_datapoints` | Populate dataset with test cases |
+| `create_llm_eval` | Create an LLM-as-a-judge evaluator |
+
+## Prerequisites
+
+- The orq.ai MCP server is connected
+- An `ORQ_API_KEY` environment variable is set
+- **Python:** `pip install evaluatorq orq-ai-sdk`
+- **TypeScript:** `npm install @orq-ai/evaluatorq`
+- The agents to compare exist and are invocable (locally or via API)
+
+---
+
+## Steps
+
+### Phase 1: Identify Agents
+
+1. **Ask the user** which agents to compare. For each agent, determine:
+   - Framework (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK, or generic)
+   - How to invoke it (agent key, import path, HTTP endpoint)
+
+2. **For orq.ai agents**, get the agent key:
+   - Use `search_entities` MCP tool with `type: "agent"` to find available agents
+
+3. **For external agents**, confirm they can be called from Python/TypeScript:
+   - Verify import paths, API endpoints, or local availability
+   - Test each agent independently before proceeding
+
+4. **Ask the user's language preference**: Python or TypeScript. Default to Python if no preference.
+
+### Phase 2: Create Dataset
+
+5. **Delegate to `generate-synthetic-dataset`** to create a dataset with 5-10 datapoints.
+
+   Critical reminders for cross-framework comparison datasets:
+   - Queries must be answerable by **ALL** agents in the experiment
+   - Expected outputs must NOT be biased toward any agent's mock/hardcoded data
+   - For dynamic answers, write expected outputs as correctness criteria, not specific values
+   - Mix question types: computation, tool-dependent, multi-step
+
+### Phase 3: Create Evaluator
+
+6. **Delegate to `build-evaluator`** to create an LLM-as-a-judge evaluator. Save the returned evaluator ID.
+
+   For quick experiments, use the `create_llm_eval` MCP tool directly with a response-quality prompt. Ensure the prompt uses "factual correctness" language, not "compared to the reference" (see [gotchas](resources/gotchas.md)).
+
+### Phase 4: Generate Comparison Script
+
+7. **Select job patterns** from [resources/job-patterns.md](resources/job-patterns.md) for each agent's framework.
+
+8. **Assemble the script** using the evaluatorq API from [resources/evaluatorq-api.md](resources/evaluatorq-api.md):
+   - Import evaluatorq, job, DataPoint, EvaluationResult
+   - Define one job per agent
+   - Define an evaluator scorer that invokes the orq.ai LLM-as-a-judge by ID
+   - Wire jobs + data + evaluators into the `evaluatorq()` call
+
+9. **Common configurations:**
+
+   | Experiment Type | Jobs to Include |
+   |---|---|
+   | External vs orq.ai | One external job + one orq.ai job |
+   | orq.ai vs orq.ai | Two orq.ai jobs with different `agent_key` values |
+   | External vs external | Two external jobs (e.g., LangGraph + CrewAI) |
+   | Multi-agent | Three or more jobs of any type |
+
+10. **Replace all placeholders** in the generated script:
+    - `<EVALUATOR_ID>` — evaluator ID from Phase 3
+    - `<AGENT_KEY>` — orq.ai agent key(s) from Phase 1
+    - `<experiment-name>` — descriptive experiment name
+    - Framework-specific placeholders (import paths, endpoints)
+
+### Phase 5: Run and View Results
+
+11. **Run the script:**
+
+    ```bash
+    # Python
+    export ORQ_API_KEY="your-key"
+    python evaluate.py
+
+    # TypeScript
+    export ORQ_API_KEY="your-key"
+    npx tsx evaluate.ts
+    ```
+
+12. **View results** in orq.ai:
+    - Open the orq.ai Studio → navigate to your project → **Experiments**
+    - Compare scores across all agents — response quality, latency, and cost
+
+13. **If issues arise**, check [resources/gotchas.md](resources/gotchas.md) for common pitfalls.
+
+14. **Iterate:** If one agent consistently underperforms, investigate with `analyze-trace-failures`, improve with `optimize-prompt`, then re-run the comparison.
+
+---
+
+## Checklist
+
+- [ ] Agents identified: frameworks, invocation methods confirmed
+- [ ] Dataset created with unbiased, cross-agent datapoints
+- [ ] Evaluator created with factual-correctness language
+- [ ] Comparison script generated with correct IDs and import paths
+- [ ] Script runs successfully and results appear in the orq.ai Experiment UI
+
+## Open in orq.ai
+
+After running the comparison:
+- **Experiment results:** orq.ai Studio → Your Project → Experiments
+- **Agent details:** orq.ai Studio → Agents
+- **Traces:** orq.ai Studio → Observability → Traces
diff --git a/skills/compare-agents/resources/evaluatorq-api.md b/skills/compare-agents/resources/evaluatorq-api.md
new file mode 100644
index 0000000..7ab702e
--- /dev/null
+++ b/skills/compare-agents/resources/evaluatorq-api.md
@@ -0,0 +1,174 @@
+# evaluatorq API Reference
+
+Quick reference for the evaluatorq library from [orqkit](https://github.com/orq-ai/orqkit). Available in both Python and TypeScript.
+
+---
+
+## Installation
+
+| Language | Package | Install |
+|----------|---------|---------|
+| Python | `evaluatorq` | `pip install evaluatorq orq-ai-sdk` |
+| TypeScript | `@orq-ai/evaluatorq` | `npm install @orq-ai/evaluatorq` |
+
+---
+
+## Core API
+
+### Job Definition
+
+**Python:**
+```python
+from evaluatorq import job, DataPoint
+
+@job("AgentName")
+async def my_job(data: DataPoint, row: int):
+    # Call your agent with data.inputs["query"]
+    return {
+        "agent": "AgentName",
+        "query": data.inputs["query"],
+        "response": "agent output here",
+    }
+```
+
+**TypeScript:**
+```typescript
+import { job } from "@orq-ai/evaluatorq";
+
+const myJob = job("AgentName", async (data) => {
+  // Call your agent with data.inputs.query
+  return {
+    agent: "AgentName",
+    query: data.inputs.query,
+    response: "agent output here",
+  };
+});
+```
+
+### Evaluator Scorer
+
+**Python:**
+```python
+from evaluatorq import DataPoint, EvaluationResult
+
+async def my_scorer(params):
+    data: DataPoint = params["data"]
+    output = params["output"]
+    # Score the output
+    return EvaluationResult(
+        value=0.85,
+        explanation="Factually correct",
+    )
+```
+
+**TypeScript:**
+```typescript
+const myScorer = async ({ data, output }) => ({
+  value: 0.85,
+  explanation: "Factually correct",
+});
+```
+
+### Running evaluatorq
+
+**Python:**
+```python
+from evaluatorq import evaluatorq, DataPoint
+
+async def main():
+    await evaluatorq(
+        "experiment-name",
+        data=[
+            DataPoint(
+                inputs={"query": "What is 2+2?"},
+                expected_output="4",
+            ),
+        ],
+        jobs=[job_a, job_b],
+        evaluators=[
+            {"name": "quality", "scorer": my_scorer},
+        ],
+        parallelism=5,
+    )
+```
+
+**TypeScript:**
+```typescript
+import { evaluatorq } from "@orq-ai/evaluatorq";
+
+await evaluatorq("experiment-name", {
+  data: [{ inputs: { query: "What is 2+2?" } }],
+  jobs: [jobA, jobB],
+  evaluators: [{ name: "quality", scorer: myScorer }],
+});
+```
+
+---
+
+## Function Signature
+
+### Python
+
+```python
+async def evaluatorq(
+    name: str,
+    params: EvaluatorParams | dict | None = None,
+    *,
+    data: DatasetIdInput | Sequence[DataPoint] | None = None,
+    jobs: list[Job] | None = None,
+    evaluators: list[Evaluator] | None = None,
+    parallelism: int = 1,
+    print_results: bool = True,
+    description: str | None = None,
+) -> EvaluatorqResult
+```
+
+### TypeScript
+
+```typescript
+evaluatorq(name: string, options: {
+  data: DataPoint[] | { datasetId: string };
+  jobs: Job[];
+  evaluators: Evaluator[];
+  parallelism?: number;
+}): Promise<void>
+```
+
+> **Load datasets by ID:** in TypeScript, pass `{ datasetId: "..." }` as `data` to fetch datapoints from the orq.ai platform instead of inlining them. The Python signature above accepts a `DatasetIdInput` for the same purpose.
+
+---
+
+## Built-in Evaluators (Python)
+
+```python
+from evaluatorq import string_contains_evaluator, exact_match_evaluator
+
+evaluators=[
+    string_contains_evaluator(case_insensitive=True, name="contains-check"),
+    exact_match_evaluator(name="exact-match"),
+    {"name": "custom", "scorer": my_scorer},
+]
+```
+
+---
+
+## Framework Wrappers (TypeScript)
+
+```typescript
+import { wrapLangGraphAgent } from "@orq-ai/evaluatorq/langchain";
+import { wrapAISdkAgent } from "@orq-ai/evaluatorq/ai-sdk";
+
+const langGraphJob = wrapLangGraphAgent("LangGraph", agent);
+const vercelJob = wrapAISdkAgent("VercelAgent", agent);
+```
+
+---
+
+## Environment
+
+| Variable | Purpose | Default |
+|----------|---------|---------|
+| `ORQ_API_KEY` | orq.ai API key (required for platform integration) | — |
+| `ORQ_BASE_URL` | orq.ai base URL | `https://api.orq.ai` |
+
+When `ORQ_API_KEY` is set, evaluatorq automatically reports results to the orq.ai Experiment UI.
diff --git a/skills/compare-agents/resources/gotchas.md b/skills/compare-agents/resources/gotchas.md
new file mode 100644
index 0000000..9418857
--- /dev/null
+++ b/skills/compare-agents/resources/gotchas.md
@@ -0,0 +1,98 @@
+# Known Gotchas
+
+Common pitfalls when running cross-framework agent comparisons with evaluatorq.
+
+---
+
+## orq.ai SDK
+
+### `agents.invoke()` vs `agents.responses.create()`
+
+`invoke()` returns an async A2A task with `status: "submitted"` — it does NOT return the agent's response. Use `responses.create(background=False)` to get a synchronous response with output.
+
+```python
+# Wrong — returns task status, not output
+response = orq.agents.invoke(agent_key="my-agent", ...)
+
+# Correct — returns actual response
+response = orq.agents.responses.create(
+    agent_key="my-agent",
+    background=False,
+    message={"role": "user", "parts": [{"kind": "text", "text": query}]},
+)
+```
+
+### Message format
+
+The orq agent API uses A2A message format, not OpenAI-style:
+
+```python
+# Wrong
+message={"role": "user", "content": "Hello"}
+
+# Correct
+message={"role": "user", "parts": [{"kind": "text", "text": "Hello"}]}
+```
+
+### Code tools: `additionalProperties: false`
+
+When creating custom code tools for orq agents, the parameter schema **must** include `"additionalProperties": false` or the model will reject the tool call with a 400 error.
+
+---
+
+## Evaluator Design
+
+### Wording bias in evaluator prompts
+
+If the evaluator prompt says "compared to the reference", the judge will penalize correct answers that use different wording. Use "factual correctness" language instead:
+
+```
+# Wrong
+"Compare the response to the reference answer and score accuracy."
+
+# Correct
+"Check whether the response contains the same core facts as the reference.
+Different wording is acceptable as long as the facts match."
+```
+
+### Same model for fair comparison
+
+When comparing frameworks, ensure all agents use the same underlying model (e.g., `openai/gpt-4o-mini`). Otherwise you're measuring model differences, not framework differences.
+
+---
+
+## Dataset Design
+
+### Mock data bias
+
+Do NOT write expected outputs that match one agent's hardcoded data. If one agent has a fake weather tool returning "Sunny, 22C" and another uses a real API, the fake agent will always win against a reference matching its own data.
+
+For dynamic answers (weather, stock prices), write expected outputs as correctness criteria:
+
+```
+# Wrong
+"The weather in Tokyo is sunny with a temperature of 22C."
+
+# Correct
+"The response should include the current temperature and conditions in Tokyo from a real data source."
+```
+
+---
+
+## TypeScript-specific
+
+### `wrapLangGraphAgent` type requirements
+
+`wrapLangGraphAgent` from `@orq-ai/evaluatorq/langchain` expects a LangChain-compatible agent instance — not a raw LangGraph `StateGraph`.
+
+### Exit codes for CI/CD
+
+TypeScript evaluatorq exits with code 1 when any evaluator returns `pass: false`. This is intentional for CI/CD pipelines but can be surprising in development.
+
+---
+
+## Staging Environment
+
+When using `ORQ_BASE_URL` with a staging URL like `https://my.staging.orq.ai`, evaluatorq internally transforms `my.` to `api.` in URLs. If `api.staging.orq.ai` doesn't exist, this will cause connection errors.
+
+**Workaround:** Set `ORQ_BASE_URL=https://api.staging.orq.ai` directly, bypassing the transformation.
diff --git a/skills/compare-agents/resources/job-patterns.md b/skills/compare-agents/resources/job-patterns.md
new file mode 100644
index 0000000..6745b3d
--- /dev/null
+++ b/skills/compare-agents/resources/job-patterns.md
@@ -0,0 +1,264 @@
+# Job Patterns for Agent Comparison
+
+Framework-specific job patterns for use with evaluatorq. Each job wraps an agent invocation so evaluatorq can run it against a dataset.
+
+Pick the pattern matching each agent's framework, then assemble them into the comparison script.
+
+---
+
+## orq.ai Agent
+
+### Python
+
+```python
+@job("OrqAgent")
+async def orq_agent_job(data: DataPoint, row: int):
+    with Orq(api_key=ORQ_API_KEY, server_url=ORQ_BASE_URL) as orq:
+        response = orq.agents.responses.create(
+            agent_key="<AGENT_KEY>",
+            background=False,
+            message={
+                "role": "user",
+                "parts": [{"kind": "text", "text": data.inputs["query"]}],
+            },
+        )
+
+    text = ""
+    if response.output:
+        for msg in response.output:
+            for part in msg.parts:
+                if hasattr(part, "text"):
+                    text += part.text
+
+    return {
+        "agent": "OrqAgent",
+        "query": data.inputs["query"],
+        "response": text,
+    }
+```
+
+### TypeScript
+
+```typescript
+const orqAgentJob = job("OrqAgent", async (data) => {
+  const orq = new Orq({ apiKey: process.env.ORQ_API_KEY });
+  const response = await orq.agents.responses.create({
+    agentKey: "<AGENT_KEY>",
+    background: false,
+    message: {
+      role: "user",
+      parts: [{ kind: "text", text: data.inputs.query }],
+    },
+  });
+
+  let text = "";
+  for (const msg of response.output ?? []) {
+    for (const part of msg.parts) {
+      if ("text" in part) text += part.text;
+    }
+  }
+  return { agent: "OrqAgent", query: data.inputs.query, response: text };
+});
+```
+
+> **Important:** Use `agents.responses.create()`, NOT `agents.invoke()`. The `invoke()` method returns an async A2A task (status: "submitted"), not the actual response. See [gotchas](gotchas.md).
+
+---
+
+## LangGraph
+
+### Python
+
+```python
+@job("LangGraph")
+async def langgraph_job(data: DataPoint, row: int):
+    from agent import agent
+
+    result = await asyncio.to_thread(
+        agent.invoke,
+        {"messages": [("user", data.inputs["query"])]},
+    )
+    return {
+        "agent": "LangGraph",
+        "query": data.inputs["query"],
+        "response": result["messages"][-1].content,
+    }
+```
+
+### TypeScript
+
+```typescript
+import { wrapLangGraphAgent } from "@orq-ai/evaluatorq/langchain";
+
+const langGraphJob = wrapLangGraphAgent("LangGraph", agent);
+```
+
+> `wrapLangGraphAgent` expects a LangChain-compatible agent instance.
+
+---
+
+## CrewAI
+
+### Python
+
+```python
+@job("CrewAI")
+async def crewai_job(data: DataPoint, row: int):
+    from crew import crew
+
+    result = await asyncio.to_thread(
+        crew.kickoff,
+        inputs={"query": data.inputs["query"]},
+    )
+    return {
+        "agent": "CrewAI",
+        "query": data.inputs["query"],
+        "response": result.raw,
+    }
+```
+
+### TypeScript
+
+CrewAI is Python-native. Wrap it via HTTP if you need TypeScript:
+
+```typescript
+const crewJob = job("CrewAI", async (data) => {
+  const res = await fetch("<CREW_ENDPOINT>", {
+    method: "POST",
+    headers: { "Content-Type": "application/json" },
+    body: JSON.stringify({ query: data.inputs.query }),
+  });
+  const result = await res.json();
+  return { agent: "CrewAI", query: data.inputs.query, response: result.output };
+});
+```
+
+---
+
+## OpenAI Agents SDK
+
+### Python
+
+```python
+@job("OpenAIAgent")
+async def openai_agent_job(data: DataPoint, row: int):
+    from agents import Runner
+    from my_agent import agent
+
+    result = await Runner.run(agent, data.inputs["query"])
+    return {
+        "agent": "OpenAIAgent",
+        "query": data.inputs["query"],
+        "response": result.final_output,
+    }
+```
+
+### TypeScript
+
+```typescript
+const openaiAgentJob = job("OpenAIAgent", async (data) => {
+  const { run } = await import("@openai/agents");
+  const { agent } = await import("./my-agent");
+
+  const result = await run(agent, data.inputs.query);
+  return {
+    agent: "OpenAIAgent",
+    query: data.inputs.query,
+    response: result.finalOutput,
+  };
+});
+```
+
+---
+
+## Vercel AI SDK
+
+### Python (via HTTP)
+
+```python
+@job("VercelAgent")
+async def vercel_agent_job(data: DataPoint, row: int):
+    import httpx
+
+    async with httpx.AsyncClient(timeout=30.0) as client:
+        r = await client.post(
+            "<VERCEL_AGENT_ENDPOINT>",
+            json={"prompt": data.inputs["query"]},
+        )
+        result = r.json()
+
+    return {
+        "agent": "VercelAgent",
+        "query": data.inputs["query"],
+        "response": result["text"],
+    }
+```
+
+### TypeScript
+
+```typescript
+import { wrapAISdkAgent } from "@orq-ai/evaluatorq/ai-sdk";
+
+const vercelJob = wrapAISdkAgent("VercelAgent", agent);
+```
+
+> `wrapAISdkAgent` works with any Vercel AI SDK agent instance.
+
+---
+
+## Generic Agent (any callable)
+
+### Python
+
+```python
+@job("MyAgent")
+async def generic_agent_job(data: DataPoint, row: int):
+    from my_agent import run_agent
+
+    response = await asyncio.to_thread(run_agent, data.inputs["query"])
+    return {
+        "agent": "MyAgent",
+        "query": data.inputs["query"],
+        "response": response,
+    }
+```
+
+### TypeScript
+
+```typescript
+const genericJob = job("MyAgent", async (data) => {
+  const { runAgent } = await import("./my-agent");
+  const response = await runAgent(data.inputs.query);
+  return { agent: "MyAgent", query: data.inputs.query, response };
+});
+```
+
+---
+
+## orq.ai vs orq.ai (multiple agent keys)
+
+When comparing two orq.ai agents, duplicate the orq.ai pattern with different keys:
+
+### Python
+
+```python
+@job("OrqAgent-GPT4o")
+async def orq_agent_a(data: DataPoint, row: int):
+    ...  # same body as the orq.ai pattern, with agent_key="agent-gpt4o"
+
+@job("OrqAgent-Claude")
+async def orq_agent_b(data: DataPoint, row: int):
+    ...  # same body as the orq.ai pattern, with agent_key="agent-claude"
+```
+
+### TypeScript
+
+```typescript
+const orqGpt4o = job("OrqAgent-GPT4o", async (data) => {
+  // ... agentKey: "agent-gpt4o" ...
+});
+
+const orqClaude = job("OrqAgent-Claude", async (data) => {
+  // ... agentKey: "agent-claude" ...
+});
+```
diff --git a/tests/skills.md b/tests/skills.md
index 4449b22..e3def4a 100644
--- a/tests/skills.md
+++ b/tests/skills.md
@@ -42,6 +42,53 @@ Requires `setup.md` to have run first (seed data for `run-experiment` test).
 - Ask: "Run an experiment using orq-skills-test-dataset with orq-skills-test-eval-length"
 - Verify: calls `create_experiment` with correct references
 
+## `compare-agents`
+
+### Scenario 1: orq.ai vs external agent (Python)
+
+- Ask: "Compare my orq.ai agent orq-skills-test-echo against a simple Python function that reverses the input"
+- Verify Phase 1: identifies two agents — orq.ai (uses `search_entities` to find `orq-skills-test-echo`) and generic Python
+- Verify Phase 1: asks or confirms language preference (Python)
+- Verify Phase 2: delegates to `generate-synthetic-dataset` or creates dataset via `create_dataset` + `create_datapoints` with `orq-skills-test-` prefix
+- Verify Phase 3: delegates to `build-evaluator` or creates evaluator via `create_llm_eval`
+- Verify Phase 4: generates a Python script with:
+  - `from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult`
+  - One `@job("OrqAgent")` using `orq.agents.responses.create()` (NOT `agents.invoke()`)
+  - One `@job("ReverseAgent")` wrapping the Python function
+  - An evaluator scorer invoking the orq.ai judge by ID
+  - An `evaluatorq()` call wiring jobs + data + evaluators
+- Verify: script uses A2A message format `{"role": "user", "parts": [{"kind": "text", "text": ...}]}` (NOT OpenAI-style)
+- Verify: does NOT hardcode datapoints inline if a dataset was created on the platform
+
+### Scenario 2: orq.ai vs orq.ai
+
+- Ask: "Compare two versions of my agent — orq-skills-test-echo with model gpt-4o-mini vs the same agent"
+- Verify: generates two orq.ai job patterns with different job names (e.g., `OrqAgent-A`, `OrqAgent-B`)
+- Verify: uses the same `agent_key` for both (since it's the same agent)
+- Verify: warns about same-model comparison if both use the same model
+
+### Scenario 3: TypeScript preference
+
+- Ask: "I want to benchmark a LangGraph agent against my orq.ai agent, using TypeScript"
+- Verify Phase 4: generates TypeScript, not Python
+- Verify: imports from `@orq-ai/evaluatorq`
+- Verify: uses `wrapLangGraphAgent` from `@orq-ai/evaluatorq/langchain` for the LangGraph job
+- Verify: uses `job()` function (not `@job` decorator)
+
+### Scenario 4: Skill boundary — redirects
+
+- Ask: "Create a dataset for testing my agents"
+- Verify: redirects to `generate-synthetic-dataset` (does NOT handle dataset creation itself)
+- Ask: "Run an experiment with my orq.ai deployment"
+- Verify: redirects to `run-experiment` (no external agents involved)
+
+### Scenario 5: Dataset bias prevention
+
+- Provide: two agents — one with a mock weather tool returning "Sunny, 22C", one with a real API
+- Ask: "Compare these agents on weather queries"
+- Verify: does NOT write expected outputs matching the mock data
+- Verify: expected outputs describe correctness criteria (e.g., "should include current temperature from a real source")
+
 ---
 
 ## Critical Files
@@ -52,3 +99,7 @@ Requires `setup.md` to have run first (seed data for `run-experiment` test).
 - `skills/optimize-prompt/SKILL.md`
 - `skills/analyze-trace-failures/SKILL.md`
 - `skills/run-experiment/SKILL.md`
+- `skills/compare-agents/SKILL.md`
+- `skills/compare-agents/resources/job-patterns.md`
+- `skills/compare-agents/resources/evaluatorq-api.md`
+- `skills/compare-agents/resources/gotchas.md`