197 changes: 197 additions & 0 deletions skills/compare-agents/SKILL.md
@@ -0,0 +1,197 @@
---
name: compare-agents
description: Run cross-framework agent comparisons using evaluatorq from orqkit. Compares any combination of agents (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK) head-to-head on the same dataset with LLM-as-a-judge scoring. Use when user says "compare agents", "benchmark", "test agents", or wants side-by-side evaluation.
allowed-tools: Bash, Read, Write, Edit, Grep, Glob, WebFetch, Task, AskUserQuestion, orq*
---

# Compare Agents

You are an **orq.ai agent comparison specialist**. Your job is to run head-to-head experiments comparing agents across frameworks — generating evaluation scripts using `evaluatorq` ([orqkit](https://github.com/orq-ai/orqkit)), then viewing results in the orq.ai Experiment UI.

Supported comparison modes:
- **External vs orq.ai** — e.g., LangGraph agent vs orq.ai agent
- **orq.ai vs orq.ai** — e.g., two orq.ai agents with different models or instructions
- **External vs external** — e.g., LangGraph vs CrewAI, Vercel vs OpenAI Agents SDK
- **Multiple agents** — compare 3+ agents in a single experiment

## Constraints

- **NEVER** create datasets inline in the comparison script — delegate to `generate-synthetic-dataset` skill or use `{ datasetId: "..." }` to load from the platform.
- **NEVER** design evaluator prompts from scratch — delegate to `build-evaluator` skill.
- **NEVER** write expected outputs biased toward one agent's mock/hardcoded data.
- **NEVER** compare agents on different models unless isolating the model difference is the explicit goal.
- **ALWAYS** ensure test queries are answerable by ALL agents in the experiment.
- **ALWAYS** use the same evaluator(s) for all agents to ensure fair scoring.
- **ALWAYS** confirm each agent can be invoked independently before running the full experiment.

**Why these constraints:** Biased datasets produce meaningless rankings. Inline datasets bypass validation. Different models confound framework comparisons. Untested agents waste experiment budget on invocation errors.

## Companion Skills

- `generate-synthetic-dataset` — create the evaluation dataset
- `build-evaluator` — design the LLM-as-a-judge evaluator
- `run-experiment` — run orq.ai-native experiments (when no external agents are involved)
- `build-agent` — create orq.ai agents to include in comparisons
- `setup-observability` — instrument agents for tracing

## Workflow Checklist

Copy this to track progress:

```
Agent Comparison Progress:
- [ ] Phase 1: Identify agents, frameworks, and language (Python/TS)
- [ ] Phase 2: Create dataset (→ generate-synthetic-dataset)
- [ ] Phase 3: Create evaluator (→ build-evaluator)
- [ ] Phase 4: Generate comparison script
- [ ] Phase 5: Run and view results in orq.ai
```

## When to use

- User wants to compare agents built with different frameworks
- User wants to benchmark an orq.ai agent against an external agent
- User wants to compare 3+ agents in a single experiment
- User says "compare agents", "benchmark", "test agents side-by-side"

## When NOT to use

- Just need a dataset? → `generate-synthetic-dataset`
- Just need an evaluator? → `build-evaluator`
- Comparing orq.ai configurations only (no external agents)? → `run-experiment`
- Need to identify failure modes first? → `analyze-trace-failures`

## Resources

- **Job patterns** (all frameworks, Python + TypeScript): See [resources/job-patterns.md](resources/job-patterns.md)
- **evaluatorq API reference**: See [resources/evaluatorq-api.md](resources/evaluatorq-api.md)
- **Known gotchas**: See [resources/gotchas.md](resources/gotchas.md)

## orq.ai Documentation

> **Official documentation:** [Evaluatorq Tutorial](https://docs.orq.ai/docs/tutorials/evaluator-q)

[Experiments](https://docs.orq.ai/docs/experiments/creating) · [Evaluators](https://docs.orq.ai/docs/evaluators/overview) · [Agent Responses API](https://docs.orq.ai/reference/agents/create-response) · [Datasets](https://docs.orq.ai/docs/datasets/overview)

### Key Concepts

- **evaluatorq** is the evaluation runner from [orqkit](https://github.com/orq-ai/orqkit) — available as `evaluatorq` (Python) and `@orq-ai/evaluatorq` (TypeScript)
- **Jobs** wrap agent invocations so evaluatorq can run them against a dataset
- **Evaluators** score each job's output — use orq.ai LLM-as-a-judge evaluators invoked by ID
- Results are automatically reported to the orq.ai Experiment UI when `ORQ_API_KEY` is set

### orq MCP Tools

| Tool | Purpose |
|------|---------|
| `search_entities` | Find orq.ai agent keys (use `type: "agent"`) |
| `create_dataset` | Create a dataset |
| `create_datapoints` | Populate dataset with test cases |
| `create_llm_eval` | Create an LLM-as-a-judge evaluator |

## Prerequisites

- The orq.ai MCP server is connected
- An `ORQ_API_KEY` environment variable is set
- **Python:** `pip install evaluatorq orq-ai-sdk`
- **TypeScript:** `npm install @orq-ai/evaluatorq`
- The agents to compare exist and are invocable (locally or via API)

---

## Steps

### Phase 1: Identify Agents

1. **Ask the user** which agents to compare. For each agent, determine:
- Framework (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK, or generic)
- How to invoke it (agent key, import path, HTTP endpoint)

2. **For orq.ai agents**, get the agent key:
- Use `search_entities` MCP tool with `type: "agent"` to find available agents

3. **For external agents**, confirm they can be called from Python/TypeScript:
- Verify import paths, API endpoints, or local availability
- Test each agent independently before proceeding

4. **Ask the user's preferred language**: Python or TypeScript. Default to Python if the user states no preference.

### Phase 2: Create Dataset

5. **Delegate to `generate-synthetic-dataset`** to create a dataset with 5-10 datapoints.

Critical reminders for cross-framework comparison datasets (a sketch of correctness-criteria datapoints follows this list):
- Queries must be answerable by **ALL** agents in the experiment
- Expected outputs must NOT be biased toward any agent's mock/hardcoded data
- For dynamic answers, write expected outputs as correctness criteria, not specific values
- Mix question types: computation, tool-dependent, multi-step
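
For illustration only, a minimal sketch of datapoints phrased as correctness criteria rather than agent-specific literals. The queries are hypothetical examples; the real datapoints should still come from the `generate-synthetic-dataset` skill or a platform dataset ID, never inlined in the comparison script:

```python
from evaluatorq import DataPoint

# Sketch only: expected outputs describe what a correct answer must contain,
# so no agent's mock or hardcoded data is favored.
example_datapoints = [
    DataPoint(
        inputs={"query": "What is the compound interest on 1000 at 5% for 3 years?"},
        expected_output="States a total close to 1157.63 and shows the yearly compounding steps.",
    ),
    DataPoint(
        inputs={"query": "What is the current weather in Amsterdam?"},
        expected_output="Reports a plausible current temperature and condition; exact values may vary.",
    ),
]
```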

### Phase 3: Create Evaluator

6. **Delegate to `build-evaluator`** to create an LLM-as-a-judge evaluator. Save the returned evaluator ID.

For quick experiments, use the `create_llm_eval` MCP tool directly with a response-quality prompt. Ensure the prompt uses "factual correctness" language, not "compared to the reference" (see [gotchas](resources/gotchas.md)).
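
One possible phrasing for such a prompt (an illustrative assumption, not a prescribed template; refine it via `build-evaluator`):

```
Rate the factual correctness of the agent's response to the user's query on a
scale from 0 to 1. A response scores high if the facts, figures, and reasoning
it states are accurate and it actually answers the query. Do not penalize
differences in wording, formatting, or level of detail.
```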

### Phase 4: Generate Comparison Script

7. **Select job patterns** from [resources/job-patterns.md](resources/job-patterns.md) for each agent's framework.

8. **Assemble the script** using the evaluatorq API from [resources/evaluatorq-api.md](resources/evaluatorq-api.md) (a minimal sketch follows this list):
- Import evaluatorq, job, DataPoint, EvaluationResult
- Define one job per agent
- Define an evaluator scorer that invokes the orq.ai LLM-as-a-judge by ID
- Wire jobs + data + evaluators into the `evaluatorq()` call
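
A minimal Python sketch of the assembled script. The `call_external_agent` and `call_orq_agent` helpers and the constant-scoring `quality_scorer` are hypothetical placeholders; replace them with the framework-specific invocation patterns from [resources/job-patterns.md](resources/job-patterns.md) and a scorer that actually calls the orq.ai LLM-as-a-judge evaluator by ID:

```python
import asyncio

from evaluatorq import DataPoint, EvaluationResult, evaluatorq, job

# Hypothetical placeholders: swap in the real invocation code for each framework
# (see resources/job-patterns.md).
async def call_external_agent(query: str) -> str:
    return "external agent answer"

async def call_orq_agent(agent_key: str, query: str) -> str:
    return "orq.ai agent answer"

@job("LangGraphAgent")
async def langgraph_job(data: DataPoint, row: int):
    response = await call_external_agent(data.inputs["query"])
    return {"agent": "LangGraphAgent", "query": data.inputs["query"], "response": response}

@job("OrqAgent")
async def orq_job(data: DataPoint, row: int):
    response = await call_orq_agent("<AGENT_KEY>", data.inputs["query"])
    return {"agent": "OrqAgent", "query": data.inputs["query"], "response": response}

async def quality_scorer(params):
    # Placeholder: invoke the orq.ai evaluator <EVALUATOR_ID> here and map its
    # verdict into an EvaluationResult instead of returning a constant.
    return EvaluationResult(value=1.0, explanation="replace with evaluator verdict")

async def main():
    await evaluatorq(
        "<experiment-name>",
        # Placeholder datapoint; in practice load the Phase 2 dataset instead of inlining.
        data=[DataPoint(inputs={"query": "What is 2+2?"}, expected_output="4")],
        jobs=[langgraph_job, orq_job],
        evaluators=[{"name": "quality", "scorer": quality_scorer}],
        parallelism=5,
    )

asyncio.run(main())
```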

9. **Common configurations:**

| Experiment Type | Jobs to Include |
|---|---|
| External vs orq.ai | One external job + one orq.ai job |
| orq.ai vs orq.ai | Two orq.ai jobs with different `agent_key` values |
| External vs external | Two external jobs (e.g., LangGraph + CrewAI) |
| Multi-agent | Three or more jobs of any type |

10. **Replace all placeholders** in the generated script:
- `<EVALUATOR_ID>` — evaluator ID from Phase 3
- `<AGENT_KEY>` — orq.ai agent key(s) from Phase 1
- `<experiment-name>` — descriptive experiment name
- Framework-specific placeholders (import paths, endpoints)

### Phase 5: Run and View Results

11. **Run the script:**

```bash
# Python
export ORQ_API_KEY="your-key"
python evaluate.py

# TypeScript
export ORQ_API_KEY="your-key"
npx tsx evaluate.ts
```

12. **View results** in orq.ai:
- Open the orq.ai Studio → navigate to your project → **Experiments**
- Compare scores across all agents — response quality, latency, and cost

13. **If issues arise**, check [resources/gotchas.md](resources/gotchas.md) for common pitfalls.

14. **Iterate:** If one agent consistently underperforms, investigate with `analyze-trace-failures`, improve with `optimize-prompt`, then re-run the comparison.

---

## Checklist

- [ ] Agents identified: frameworks, invocation methods confirmed
- [ ] Dataset created with unbiased, cross-agent datapoints
- [ ] Evaluator created with factual-correctness language
- [ ] Comparison script generated with correct IDs and import paths
- [ ] Script runs successfully and results appear in the orq.ai Experiment UI

## Open in orq.ai

After running the comparison:
- **Experiment results:** orq.ai Studio → Your Project → Experiments
- **Agent details:** orq.ai Studio → Agents
- **Traces:** orq.ai Studio → Observability → Traces
174 changes: 174 additions & 0 deletions skills/compare-agents/resources/evaluatorq-api.md
@@ -0,0 +1,174 @@
# evaluatorq API Reference

Quick reference for the evaluatorq library from [orqkit](https://github.com/orq-ai/orqkit). Available in both Python and TypeScript.

---

## Installation

| Language | Package | Install |
|----------|---------|---------|
| Python | `evaluatorq` | `pip install evaluatorq orq-ai-sdk` |
| TypeScript | `@orq-ai/evaluatorq` | `npm install @orq-ai/evaluatorq` |

---

## Core API

### Job Definition

**Python:**
```python
from evaluatorq import job, DataPoint

@job("AgentName")
async def my_job(data: DataPoint, row: int):
    # Call your agent with data.inputs["query"]
    return {
        "agent": "AgentName",
        "query": data.inputs["query"],
        "response": "agent output here",
    }
```

**TypeScript:**
```typescript
import { job } from "@orq-ai/evaluatorq";

const myJob = job("AgentName", async (data) => {
  // Call your agent with data.inputs.query
  return {
    agent: "AgentName",
    query: data.inputs.query,
    response: "agent output here",
  };
});
```

### Evaluator Scorer

**Python:**
```python
from evaluatorq import DataPoint, EvaluationResult

async def my_scorer(params):
    data: DataPoint = params["data"]
    output = params["output"]
    # Score the output
    return EvaluationResult(
        value=0.85,
        explanation="Factually correct",
    )
```

**TypeScript:**
```typescript
const myScorer = async ({ data, output }) => ({
  value: 0.85,
  explanation: "Factually correct",
});
```
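
For a fuller Python illustration, here is a scorer that grounds its value in the datapoint instead of returning a constant. This is a sketch assuming the datapoint exposes `expected_output` as an attribute (mirroring the `DataPoint` constructor) and that the job returned the dict shown above; the built-in `string_contains_evaluator` listed below covers the same idea out of the box:

```python
from evaluatorq import DataPoint, EvaluationResult

async def contains_expected_scorer(params):
    data: DataPoint = params["data"]
    output = params["output"]
    # Assumes the job returned a dict with a "response" key, as in the job example above.
    expected = (data.expected_output or "").lower()
    response = str(output.get("response", "")).lower()
    hit = bool(expected) and expected in response
    return EvaluationResult(
        value=1.0 if hit else 0.0,
        explanation="expected answer found in response" if hit else "expected answer missing",
    )
```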

### Running evaluatorq

**Python:**
```python
import asyncio

from evaluatorq import evaluatorq, DataPoint

async def main():
    await evaluatorq(
        "experiment-name",
        data=[
            DataPoint(
                inputs={"query": "What is 2+2?"},
                expected_output="4",
            ),
        ],
        jobs=[job_a, job_b],
        evaluators=[
            {"name": "quality", "scorer": my_scorer},
        ],
        parallelism=5,
    )

asyncio.run(main())
```

**TypeScript:**
```typescript
import { evaluatorq } from "@orq-ai/evaluatorq";

await evaluatorq("experiment-name", {
  data: [{ inputs: { query: "What is 2+2?" } }],
  jobs: [jobA, jobB],
  evaluators: [{ name: "quality", scorer: myScorer }],
});
```

---

## Function Signature

### Python

```python
async def evaluatorq(
    name: str,
    params: EvaluatorParams | dict | None = None,
    *,
    data: DatasetIdInput | Sequence[DataPoint] | None = None,
    jobs: list[Job] | None = None,
    evaluators: list[Evaluator] | None = None,
    parallelism: int = 1,
    print_results: bool = True,
    description: str | None = None,
) -> EvaluatorqResult
```

### TypeScript

```typescript
evaluatorq(name: string, options: {
  data: DataPoint[] | { datasetId: string };
  jobs: Job[];
  evaluators: Evaluator[];
  parallelism?: number;
}): Promise<void>
```

> **TypeScript supports `{ datasetId: "..." }`** to fetch data directly from the orq.ai platform instead of inlining datapoints.

---

## Built-in Evaluators (Python)

```python
from evaluatorq import string_contains_evaluator, exact_match_evaluator

evaluators=[
    string_contains_evaluator(case_insensitive=True, name="contains-check"),
    exact_match_evaluator(name="exact-match"),
    {"name": "custom", "scorer": my_scorer},
]
```

---

## Framework Wrappers (TypeScript)

```typescript
import { wrapLangGraphAgent } from "@orq-ai/evaluatorq/langchain";
import { wrapAISdkAgent } from "@orq-ai/evaluatorq/ai-sdk";

const langGraphJob = wrapLangGraphAgent("LangGraph", agent);
const vercelJob = wrapAISdkAgent("VercelAgent", agent);
```

---

## Environment

| Variable | Purpose | Default |
|----------|---------|---------|
| `ORQ_API_KEY` | orq.ai API key (required for platform integration) | — |
| `ORQ_BASE_URL` | orq.ai base URL | `https://api.orq.ai` |

When `ORQ_API_KEY` is set, evaluatorq automatically reports results to the orq.ai Experiment UI.