From 9c75ee42513fea3d6fd93cae6603844296d20be5 Mon Sep 17 00:00:00 2001 From: Arian Pasquali Date: Tue, 24 Mar 2026 12:02:55 +0100 Subject: [PATCH 1/9] feat: add instrument-app skill for orq.ai observability (RES-545) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit New skill that guides users through instrumenting LLM applications with orq.ai tracing — covering AI Router proxy, OpenTelemetry integrations, the @traced decorator, and trace enrichment with metadata. Co-Authored-By: Claude Opus 4.6 (1M context) --- README.md | 4 +- skills/instrument-app/SKILL.md | 248 ++++++++++++++++++ .../resources/baseline-checklist.md | 74 ++++++ .../resources/framework-integrations.md | 104 ++++++++ .../resources/traced-decorator-guide.md | 122 +++++++++ 5 files changed, 551 insertions(+), 1 deletion(-) create mode 100644 skills/instrument-app/SKILL.md create mode 100644 skills/instrument-app/resources/baseline-checklist.md create mode 100644 skills/instrument-app/resources/framework-integrations.md create mode 100644 skills/instrument-app/resources/traced-decorator-guide.md diff --git a/README.md b/README.md index b1e7e02..5f59b8d 100644 --- a/README.md +++ b/README.md @@ -10,6 +10,8 @@ Each skill encodes best practices from prompt engineering, agent design, evaluat Built on the [Agent Skills](https://agentskills.io/home#adoption) standard format, so it works with any compatible agent (Claude Code, Cursor, Gemini CLI, and others). +**Using Claude Code?** Check out [orq-ai/claude-plugins](https://github.com/orq-ai/claude-plugins) — it bundles orq-skills with **orq-trace** (automatic session tracing) and **orq-mcp** (workspace MCP server) in a single install. + ## Setup ### Prerequisites @@ -52,7 +54,6 @@ claude --plugin-dir . > **Note:** Commands (`/orq:quickstart`, `/orq:workspace`, etc.) and agents are only available when installed as a Claude Code plugin. 
- ### Verify Run the interactive onboarding to confirm everything works: @@ -93,6 +94,7 @@ Skills are triggered by describing what you need. Claude picks the right skill a | Skill | What It Does | Documentation | |-------|-------------|---------------| +| **instrument-app** | Instrument LLM applications with orq.ai observability — AI Router proxy, OpenTelemetry, `@traced` decorator, and trace enrichment | [SKILL.md](skills/instrument-app/SKILL.md) | | **build-agent** | Design, create, and configure an orq.ai Agent with tools, instructions, knowledge bases, and memory | [SKILL.md](skills/build-agent/SKILL.md) | | **build-evaluator** | Create validated LLM-as-a-Judge evaluators following evaluation best practices | [SKILL.md](skills/build-evaluator/SKILL.md) | | **analyze-trace-failures** | Read production traces, identify what's failing, build failure taxonomies, and categorize issues | [SKILL.md](skills/analyze-trace-failures/SKILL.md) | diff --git a/skills/instrument-app/SKILL.md b/skills/instrument-app/SKILL.md new file mode 100644 index 0000000..d4b692d --- /dev/null +++ b/skills/instrument-app/SKILL.md @@ -0,0 +1,248 @@ +--- +name: instrument-app +description: Instrument LLM applications with orq.ai observability. Use when setting up tracing, adding the AI Router proxy, integrating OpenTelemetry, auditing existing instrumentation, or enriching traces with metadata. +allowed-tools: Bash, Read, Write, Edit, Grep, Glob, WebFetch, Task, AskUserQuestion, orq* +--- + +# Instrument App + +You are an **orq.ai observability engineer**. Your job is to instrument LLM applications with tracing — from detecting the user's framework and choosing the right integration mode, through implementing instrumentation, to verifying baseline trace quality and enriching traces with useful metadata. + +## Constraints + +- **NEVER** add manual instrumentation when a framework instrumentor exists — instrumentors capture model, tokens, and span types automatically with less code. 
+- **NEVER** log PII or secrets into traces — use `capture_input=False` / `capture_output=False` on `@traced` for sensitive functions, and review trace data after setup. +- **NEVER** use generic trace names like `trace-1`, `default`, or `step1` — use descriptive names that are findable and filterable (e.g., `chat-response`, `classify-intent`). +- **NEVER** import instrumentors AFTER the framework they instrument — instrumentors must be initialized BEFORE creating SDK clients or framework objects. +- **ALWAYS** verify traces appear in the orq.ai UI before adding enrichment — confirm the baseline works first. +- **ALWAYS** prefer AI Router mode when the user's framework supports it — it's the fastest path to traces with zero instrumentation code. +- **ALWAYS** set `service.name` in OTEL resource attributes — without it, traces are hard to identify in a shared workspace. + +**Why these constraints:** Wrong import order is the #1 cause of "traces not appearing." Generic names make traces unfindable at scale. Logging PII creates compliance risk. Framework instrumentors capture 10x more metadata than manual tracing with less code. 
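The naming constraint above can be checked mechanically before traces ship. The following is a minimal sketch under stated assumptions, not part of the orq.ai SDK: `is_generic_trace_name` and the `GENERIC_BASES` deny-list are hypothetical helpers introduced here for illustration, seeded with the generic names this skill calls out (`trace-1`, `default`, `step1`, `process`).

```python
import re

# Hypothetical deny-list (an assumption, not an orq.ai API): base words of the
# generic names this skill warns against ("trace-1", "default", "step1", "process").
GENERIC_BASES = {"trace", "default", "step", "process"}

# Matches a trailing counter such as "-1", "_2" or "3" at the end of a name.
_COUNTER_SUFFIX = re.compile(r"[-_]?\d+$")


def is_generic_trace_name(name: str) -> bool:
    """Return True when a trace name carries no searchable meaning."""
    # Normalize, then drop a trailing counter so "trace-1" and "step1"
    # reduce to their base word before the deny-list lookup.
    base = _COUNTER_SUFFIX.sub("", name.strip().lower())
    return base == "" or base in GENERIC_BASES


if __name__ == "__main__":
    for candidate in ("trace-1", "default", "step1", "chat-response", "classify-intent"):
        verdict = "generic" if is_generic_trace_name(candidate) else "ok"
        print(f"{candidate}: {verdict}")
```

A check like this could run in a unit test or a pre-commit hook over the `name=` values passed to `@traced`. Extend the deny-list to match whatever names turn out to be noise in your own workspace.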
+ +## Companion Skills + +- `analyze-trace-failures` — diagnose failures from trace data (requires traces to exist first) +- `build-evaluator` — design quality evaluators using trace data as input +- `run-experiment` — run experiments and compare configurations with trace visibility +- `optimize-prompt` — improve prompts, then verify improvements via traces + +## Workflow Checklist + +Copy this to track progress: + +``` +Instrumentation Progress: +- [ ] Phase 1: Assess current state (framework, SDK, existing instrumentation) +- [ ] Phase 2: Choose integration mode (AI Router vs Observability vs both) +- [ ] Phase 3: Implement integration (framework-specific setup) +- [ ] Phase 4: Verify baseline (traces appearing, model/tokens captured, span hierarchy) +- [ ] Phase 5: Enrich traces (session_id, user_id, tags, @traced for custom spans) +``` + +## Resources + +- **Framework integrations:** See [resources/framework-integrations.md](resources/framework-integrations.md) +- **@traced decorator guide:** See [resources/traced-decorator-guide.md](resources/traced-decorator-guide.md) +- **Baseline checklist:** See [resources/baseline-checklist.md](resources/baseline-checklist.md) + +--- + +## orq.ai Documentation + +**Observability:** [Traces](https://docs.orq.ai/docs/observability/traces) · [Trace Automations](https://docs.orq.ai/docs/observability/trace-automation) · [Observability Overview](https://docs.orq.ai/docs/observability/overview) + +**Frameworks:** [Framework Integrations](https://docs.orq.ai/docs/proxy/frameworks/overview) · [OpenAI SDK](https://docs.orq.ai/docs/proxy/frameworks/openai) · [LangChain](https://docs.orq.ai/docs/proxy/frameworks/langchain) · [CrewAI](https://docs.orq.ai/docs/proxy/frameworks/crewai) · [Vercel AI](https://docs.orq.ai/docs/proxy/frameworks/vercel-ai) + +**AI Router:** [Getting Started](https://docs.orq.ai/docs/router/getting-started) · [API Keys](https://docs.orq.ai/docs/router/api-keys) · [OpenAI-Compatible 
API](https://docs.orq.ai/docs/proxy/openai-compatible-api) · [Supported Models](https://docs.orq.ai/docs/proxy/supported-models) + +**Integrations:** [Integration Overview](https://docs.orq.ai/docs/integrations/overview) · [OpenTelemetry Tracing](https://docs.orq.ai/docs/integrations/overview#opentelemetry-tracing) + +### Key Concepts + +- **AI Router** (`https://api.orq.ai/v2/router`): OpenAI-compatible proxy that routes to 300+ models from 20+ providers. Traces are generated automatically for every call. +- **Observability** (`https://api.orq.ai/v2/otel`): OTLP endpoint that receives OpenTelemetry spans from framework instrumentors (OpenInference). Captures agent steps, tool calls, chain execution. +- **`@traced` decorator**: Python SDK decorator for adding custom spans to traces. Supports typed spans: `agent`, `llm`, `tool`, `retrieval`, `embedding`, `function`. +- Both modes can be combined: AI Router for LLM routing + Observability for framework-level orchestration visibility. + +## Destructive Actions + +The following require explicit user confirmation via `AskUserQuestion`: +- Modifying existing environment variables or configuration files +- Overwriting existing instrumentation setup code +- Adding dependencies to the project (pip install / npm install) + +--- + +## Steps + +Follow these steps **in order**. Do NOT skip steps. + +### Phase 1: Assess Current State + +1. **Scan the project** to understand the LLM stack. Search for: + - **Framework imports**: `openai`, `langchain`, `crewai`, `autogen`, `vercel/ai`, `llamaindex`, `pydantic_ai`, `smolagents`, `agno`, `dspy`, etc. + - **Existing orq.ai usage**: `orq.ai`, `ORQ_API_KEY`, `api.orq.ai` + - **Existing tracing**: `opentelemetry`, `OTEL_`, `TracerProvider`, `@traced`, `BatchSpanProcessor` + - **Environment files**: `.env`, `.env.example`, config files with API keys or base URLs + +2. 
**Summarize findings** to the user: + - Framework(s) detected + - Whether orq.ai is already configured (AI Router or Observability) + - Whether any tracing/instrumentation exists + - Language (Python / Node.js / both) + +### Phase 2: Choose Integration Mode + +3. **Recommend the integration mode** based on findings. Use [resources/framework-integrations.md](resources/framework-integrations.md) for the decision guide: + + | Situation | Recommendation | + |-----------|---------------| + | No tracing yet, framework supports AI Router | **AI Router** — fastest path, traces are automatic | + | Already calling providers directly, don't want to change LLM calls | **Observability only** — add OTEL instrumentors | + | Want multi-provider routing AND framework-level span detail | **Both** — AI Router for routing, OTEL for orchestration spans | + | Framework only supports Observability (BeeAI, Haystack, LiteLLM, Google AI) | **Observability only** | + +4. **Confirm with the user** before proceeding. Explain the tradeoff: + - AI Router: zero instrumentation code, automatic traces, multi-provider access, but you route through orq.ai + - Observability: keep your existing LLM calls, add tracing on top, more setup but no routing change + +### Phase 3: Implement Integration + +5. 
**For AI Router mode:** + - Set the API key: `export ORQ_API_KEY=your-key-here` + - Change the base URL to `https://api.orq.ai/v2/router` + - Use `provider/model` format for model names (e.g., `openai/gpt-4o`, `anthropic/claude-sonnet-4-5-20250929`) + - That's it — traces appear automatically + + **Python (OpenAI SDK):** + ```python + from openai import OpenAI + import os + + client = OpenAI( + base_url="https://api.orq.ai/v2/router", + api_key=os.getenv("ORQ_API_KEY"), + ) + ``` + + **Node.js (OpenAI SDK):** + ```typescript + import OpenAI from "openai"; + + const client = new OpenAI({ + baseURL: "https://api.orq.ai/v2/router", + apiKey: process.env.ORQ_API_KEY, + }); + ``` + + For framework-specific setup (LangChain, CrewAI, etc.), refer to the framework's docs page linked in [resources/framework-integrations.md](resources/framework-integrations.md). + +6. **For Observability mode:** + - Set OTEL environment variables: + ```bash + export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.orq.ai/v2/otel" + export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer $ORQ_API_KEY" + export OTEL_RESOURCE_ATTRIBUTES="service.name=my-app,service.version=1.0.0" + export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL="http/json" + ``` + - Install the framework's OpenInference instrumentor package + - Initialize the instrumentor BEFORE creating SDK clients + - Refer to the framework's docs page for the exact instrumentor and setup + + **Python (OpenAI example):** + ```python + from opentelemetry import trace + from opentelemetry.sdk.trace import TracerProvider + from opentelemetry.sdk.trace.export import BatchSpanProcessor + from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter + from opentelemetry.instrumentation.openai import OpenAIInstrumentor + + # Initialize BEFORE creating OpenAI client + tracer_provider = TracerProvider() + tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter())) + trace.set_tracer_provider(tracer_provider) + 
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider) + ``` + +7. **For both modes:** Set up AI Router first (step 5), then add Observability (step 6) for framework-level spans on top. + +### Phase 4: Verify Baseline + +8. **Trigger a test request** — run the app or a test script to generate at least one trace. + +9. **Check traces in orq.ai** — direct the user to open [Traces](https://my.orq.ai) in the orq.ai dashboard. + +10. **Verify baseline requirements** using [resources/baseline-checklist.md](resources/baseline-checklist.md): + + | Requirement | How to Check | + |------------|-------------| + | Traces appearing | At least one trace visible in the Traces view | + | Model name captured | Open an LLM span → `model` field shows model ID | + | Token usage tracked | LLM span shows `input_tokens` and `output_tokens` | + | Span hierarchy | Trace View shows nested spans for multi-step operations | + | Correct span types | LLM calls show as `llm`, retrievals as `retrieval`, etc. | + | No sensitive data | Spot-check span inputs/outputs for PII or secrets | + +11. **Fix any gaps** before moving to enrichment. Common fixes: + - Traces not appearing → check import order, API key, OTEL endpoint + - Flat hierarchy → ensure instrumentor is initialized before client creation + - Missing tokens → check if provider/framework supports token reporting + +12. **Encourage exploration:** Tell the user to browse a few traces in the UI before adding more context. This helps them form opinions about what data is useful vs missing. + +### Phase 5: Enrich Traces + +13. **Infer additional context needs from the code.** Look for patterns — do NOT ask the user about all of these; infer when possible: + + | If You See in Code... 
| Suggest Adding | + |----------------------|----------------| + | Conversation history, chat endpoints, message arrays | `session_id` to group conversations | + | User authentication, `user_id` variables | `user_id` for per-user filtering | + | Multiple distinct features or endpoints | `feature` tag for per-feature analytics | + | Customer/tenant identifiers | `customer_id` or tier tag | + | Feedback collection, ratings | Score annotations | + +14. **Add `@traced` for custom spans** where the user has application logic not captured by framework instrumentors. See [resources/traced-decorator-guide.md](resources/traced-decorator-guide.md) for the full reference. + + Priority targets for `@traced`: + - The top-level orchestration function (type: `agent`) + - Data preprocessing / postprocessing (type: `function`) + - Custom tool implementations (type: `tool`) + - RAG retrieval logic (type: `retrieval`) + +15. **Only ask the user** when context needs aren't obvious from code: + - "How do you know when a response is good vs bad?" → determines scoring approach + - "What would you want to filter by in a dashboard?" → surfaces non-obvious tags + - "Are there different user segments you'd want to compare?" → customer tiers, plans + +16. 
**Guide to relevant UI features** based on what was added: + - Traces view: see individual requests + - Timeline view: identify latency bottlenecks + - Thread view: see conversation flows (if session_id added) + - Trace automations: set up automatic quality monitoring + +--- + +## Anti-Patterns + +| Anti-Pattern | What to Do Instead | +|---|---| +| Manual tracing when framework instrumentor exists | Use the framework instrumentor — it captures model, tokens, spans automatically | +| Instrumentor imported AFTER framework client creation | Initialize instrumentor BEFORE creating SDK clients | +| Generic trace names (`default`, `trace-1`) | Use descriptive names: `chat-response`, `classify-intent`, `fetch-orders` | +| Logging PII/secrets in trace inputs | Use `capture_input=False` on `@traced`, review trace data post-setup | +| No `service.name` in OTEL attributes | Always set `service.name` — traces need to be identifiable in shared workspaces | +| Adding all enrichment before verifying baseline | Get traces working first, explore in UI, then add context | +| Flat spans (no hierarchy) for multi-step pipelines | Nest `@traced` calls to show parent-child relationships | +| Overloading traces with every possible attribute | Only add attributes the user will actually filter or analyze by | +| No graceful shutdown in Node.js | Call `sdk.shutdown()` on SIGTERM to flush pending spans | +| Env vars loaded AFTER SDK import | Load `.env` / set env vars BEFORE importing orq or OTEL packages | + +## Open in orq.ai + +After completing this skill, direct the user to: +- **Traces:** [my.orq.ai](https://my.orq.ai/) — inspect trace hierarchy, timing, and captured data +- **AI Router:** [my.orq.ai](https://my.orq.ai/) — manage providers, models, and API keys +- **Trace Automations:** [my.orq.ai](https://my.orq.ai/) — set up automatic monitoring rules +- **Next step:** Use `analyze-trace-failures` to diagnose issues from the traces you're now capturing diff --git 
a/skills/instrument-app/resources/baseline-checklist.md b/skills/instrument-app/resources/baseline-checklist.md new file mode 100644 index 0000000..0819cdc --- /dev/null +++ b/skills/instrument-app/resources/baseline-checklist.md @@ -0,0 +1,74 @@ +# Baseline Instrumentation Checklist + +Verify these requirements after setting up instrumentation. Framework integrations handle most automatically — only manual instrumentation needs all checks. + +## Requirements + +| # | Requirement | Why | Auto with AI Router? | Auto with Framework Instrumentor? | +|---|------------|-----|:---:|:---:| +| 1 | **Model name captured** | Enables model comparison, cost attribution, filtering by model | yes | yes | +| 2 | **Token usage tracked** | Enables cost calculation and usage analytics | yes | yes | +| 3 | **Descriptive trace names** | Makes traces findable — `chat-response` not `trace-1` | partial | partial | +| 4 | **Proper span hierarchy** | Shows which step is slow or failing in multi-step operations | n/a | yes | +| 5 | **Correct span types** | Enables type-specific analytics (LLM latency, retrieval quality) | yes | yes | +| 6 | **Sensitive data masked** | Prevents PII/secrets from leaking into trace storage | no | no | +| 7 | **Trace input/output set explicitly** | Makes traces readable; avoids logging irrelevant function args | partial | partial | + +### How to Verify Each + +**1. Model name** — Open a trace in [Traces](https://my.orq.ai) → click an LLM span → confirm `model` field shows the model ID (e.g., `openai/gpt-4o`). + +**2. Token usage** — Same LLM span → check `input_tokens` and `output_tokens` are populated. If zero, the instrumentor may not support the provider or streaming mode. + +**3. Trace names** — In the Traces list view, scan the Name column. Look for generic names (`default`, `trace-1`, `LLMChain`) and rename with descriptive alternatives. For `@traced`, set the `name` parameter. 
For frameworks, check how to customize trace/chain names in the framework docs. + +**4. Span hierarchy** — Open a trace → switch to Trace View. Multi-step operations should show nested spans (parent → child). Flat traces with all spans at the same level indicate missing nesting. For `@traced`, ensure child functions are called within the parent's traced scope. + +**5. Span types** — In Trace View, check that LLM calls show as `llm` type, retrievals as `retrieval`, tool calls as `tool`, etc. Framework instrumentors set these automatically. For `@traced`, set the `type` parameter correctly. + +**6. Sensitive data** — Review a few traces for PII (names, emails, tokens, API keys) in span inputs/outputs. Use `capture_input=False` / `capture_output=False` on `@traced` for sensitive functions. For framework instrumentors, check if they offer input/output filtering. + +**7. Trace input/output** — Open a trace → check the top-level input shows the user's actual request (not internal state). For `@traced` with `capture_input=True`, only the function args are logged — ensure they represent meaningful input. Use `attributes` for metadata instead of polluting input. + +## After Baseline Passes + +Encourage the user to explore traces in the orq.ai UI before adding more context: + +> "Your traces are appearing in orq.ai. Open a few in [Traces](https://my.orq.ai) — look at the span hierarchy, timing, and captured data. What's useful? What's missing? This helps us decide what additional context to add." + +## Additional Context (Add After Baseline) + +Only add these when relevant — infer from the user's code when possible: + +| If You See in Code... 
| Suggest Adding | Why | +|----------------------|----------------|-----| +| Conversation history, chat endpoints, message arrays | `session_id` | Groups messages from the same conversation | +| User authentication, `user_id` variables | `user_id` on traces | Enables per-user filtering and cost attribution | +| Multiple distinct features or endpoints | `feature` tag via attributes | Enables per-feature analytics | +| Customer/tenant identifiers | `customer_id` or tier tag | Cost/quality breakdown by segment | +| Feedback collection, ratings | Score annotations | Enables quality trend monitoring | +| Environment variables like `NODE_ENV`, `FLASK_ENV` | `environment` tag | Separates dev/staging/prod traces | + +### How to Add Context + +**With `@traced`:** +```python +@traced( + name="chat-response", + type="agent", + attributes={ + "session_id": session_id, + "user_id": user_id, + "feature": "customer-support", + } +) +``` + +**With OpenTelemetry span attributes:** +```python +from opentelemetry import trace + +span = trace.get_current_span() +span.set_attribute("session_id", session_id) +span.set_attribute("user_id", user_id) +``` diff --git a/skills/instrument-app/resources/framework-integrations.md b/skills/instrument-app/resources/framework-integrations.md new file mode 100644 index 0000000..f323aa5 --- /dev/null +++ b/skills/instrument-app/resources/framework-integrations.md @@ -0,0 +1,104 @@ +# Framework Integrations + +## Which Integration Mode? 
+ +| Mode | What It Does | When to Use | +|------|-------------|-------------| +| **AI Router** | Route LLM calls through `https://api.orq.ai/v2/router` — traces generated automatically | You want multi-provider access, fallbacks, caching, cost tracking with zero instrumentation code | +| **Observability** | Send OpenTelemetry traces from your existing setup to `https://api.orq.ai/v2/otel` | You already call providers directly and want to add tracing without changing your LLM calls | +| **Both** | AI Router for routing + Observability for framework-level spans | You want full pipeline visibility: framework orchestration spans + LLM call traces | + +**Rule of thumb:** If the user's framework is in the AI Router column, start there — it's the fastest path to traces. Add Observability on top only if they need framework-level span detail (agent steps, tool calls, chain execution). + +## Supported Frameworks + +| Framework | AI Router | Observability | Control Tower | Docs | +|-----------|:---------:|:-------------:|:-------------:|------| +| Agno | yes | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/agno) | +| AutoGen | yes | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/autogen) | +| AWS Strands | yes | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/aws-strands) | +| Azure AI Agents | yes | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/azure-ai-agents) | +| BeeAI | | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/beeai) | +| CrewAI | yes | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/crewai) | +| DSPy | yes | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/dspy) | +| Google AI | | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/google-ai) | +| Haystack | | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/haystack) | +| Instructor | yes | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/instructor) | +| LangChain | yes | yes | | 
[docs](https://docs.orq.ai/docs/proxy/frameworks/langchain) | +| LangGraph | yes | | yes | [docs](https://docs.orq.ai/docs/proxy/frameworks/langgraph) | +| LiteLLM | | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/litellm) | +| LiveKit | yes | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/livekit) | +| LlamaIndex | yes | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/llamaindex) | +| LlamaIndex Agents | yes | | | [docs](https://docs.orq.ai/docs/proxy/frameworks/llamaindex-agents) | +| Mastra | yes | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/mastra) | +| OpenAI SDK | yes | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/openai) | +| OpenAI Agents | yes | yes | yes | [docs](https://docs.orq.ai/docs/proxy/frameworks/openai-agents) | +| OpenClaw | | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/openclaw) | +| Pydantic AI | yes | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/pydantic-ai) | +| Semantic Kernel | yes | | | [docs](https://docs.orq.ai/docs/proxy/frameworks/semantic-kernel) | +| SmolAgents | yes | yes | | [docs](https://docs.orq.ai/docs/proxy/frameworks/smolagents) | +| Vercel AI SDK | yes | yes | yes | [docs](https://docs.orq.ai/docs/proxy/frameworks/vercel-ai) | + +## AI Router Quick Setup Pattern + +All AI Router integrations follow the same pattern — point your SDK's base URL to orq.ai: + +**Python (OpenAI SDK):** +```python +import os +from openai import OpenAI + +client = OpenAI( + base_url="https://api.orq.ai/v2/router", + api_key=os.getenv("ORQ_API_KEY"), +) +``` + +**Node.js (OpenAI SDK):** +```typescript +import OpenAI from "openai"; + +const client = new OpenAI({ + baseURL: "https://api.orq.ai/v2/router", + apiKey: process.env.ORQ_API_KEY, +}); +``` + +**LangChain:** +```python +import os +from langchain_openai import ChatOpenAI + +llm = ChatOpenAI( + model="openai/gpt-4o", + api_key=os.getenv("ORQ_API_KEY"), + base_url="https://api.orq.ai/v2/router", +) +``` + +## Observability (OpenTelemetry) Quick 
Setup Pattern + +All observability integrations use OpenInference instrumentors with OTLP export to orq.ai: + +**Environment variables (all frameworks):** +```bash +export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.orq.ai/v2/otel" +export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer $ORQ_API_KEY" +export OTEL_RESOURCE_ATTRIBUTES="service.name=my-app,service.version=1.0.0" +export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL="http/json" +``` + +**Python (OpenAI instrumentor):** +```python +from opentelemetry import trace +from opentelemetry.sdk.trace import TracerProvider +from opentelemetry.sdk.trace.export import BatchSpanProcessor +from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter +from opentelemetry.instrumentation.openai import OpenAIInstrumentor + +tracer_provider = TracerProvider() +tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter())) +trace.set_tracer_provider(tracer_provider) + +OpenAIInstrumentor().instrument(tracer_provider=tracer_provider) +``` + +**Key:** Each framework has its own OpenInference instrumentor package. See the framework-specific docs page for the exact package name and import. diff --git a/skills/instrument-app/resources/traced-decorator-guide.md b/skills/instrument-app/resources/traced-decorator-guide.md new file mode 100644 index 0000000..43a68e8 --- /dev/null +++ b/skills/instrument-app/resources/traced-decorator-guide.md @@ -0,0 +1,122 @@ +# The `@traced` Decorator + +The `@traced` decorator from the orq.ai Python SDK adds custom spans to your traces for application logic that isn't automatically captured by framework instrumentors. + +**Docs:** [Custom Tracing using the @traced decorator](https://docs.orq.ai/docs/observability/traces#custom-tracing-using-the-@traced-decorator) + +## When to Use + +| Scenario | Use `@traced` | Use Framework Instrumentor | +|----------|:------------:|:--------------------------:| +| LLM calls via OpenAI/LangChain/etc. 
| | yes | +| Custom business logic between LLM calls | yes | | +| Data preprocessing / postprocessing | yes | | +| Tool implementations in an agent | yes | | +| RAG retrieval logic | yes | | +| Orchestration / routing functions | yes | | + +**Rule:** Use framework instrumentors for LLM calls (they capture model, tokens, etc. automatically). Use `@traced` for everything else that you want visible in the trace. + +## Span Types + +| Type | When to Use | +|------|-------------| +| `agent` | Orchestration workflows, agent execution loops | +| `llm` | Direct LLM API calls (prefer framework instrumentors when available) | +| `tool` | External tool invocations, API calls, database queries | +| `retrieval` | Knowledge lookups, vector search, document fetching | +| `embedding` | Embedding operations | +| `function` | General processing steps, data transformation, validation | + +## Parameters + +```python +@traced( + name="operation_name", # Descriptive name shown in trace UI + type="function", # Span type (see table above) + capture_input=True, # Whether to capture function input args + capture_output=True, # Whether to capture function return value + attributes={ # Custom key-value metadata + "custom_key": "value" + } +) +``` + +| Parameter | Default | Notes | +|-----------|---------|-------| +| `name` | function name | Use descriptive names: `"fetch-user-context"` not `"step1"` | +| `type` | `"function"` | Pick the semantic type that matches the operation | +| `capture_input` | `True` | Set `False` if inputs contain PII or secrets | +| `capture_output` | `True` | Set `False` if outputs contain sensitive data | +| `attributes` | `{}` | Add searchable metadata: user tier, feature name, etc. 
| + +## Examples + +### Sync Function +```python +from orq_ai_sdk.tracing import traced + +@traced(name="extract-keywords", type="function") +def extract_keywords(text: str) -> list[str]: + keywords = text.split()  # Placeholder: replace with your own extraction logic + return keywords +``` + +### Async Function +```python +from orq_ai_sdk.tracing import traced + +@traced(name="fetch-context", type="retrieval") +async def fetch_context(query: str) -> list[dict]: + results = await vector_db.search(query) + return results +``` + +### Agent Orchestration +```python +from orq_ai_sdk.tracing import traced + +@traced(name="support-agent", type="agent") +def run_support_agent(user_message: str) -> str: + context = fetch_context(user_message) # traced as retrieval + response = generate_response(context) # traced by framework instrumentor + log_interaction(user_message, response) # traced as function + return response +``` + +### Hiding Sensitive Data +```python +@traced( + name="process-payment", + type="tool", + capture_input=False, # Don't capture credit card details + capture_output=False, # Don't capture payment tokens + attributes={"service": "payments"} +) +def process_payment(card_number: str, amount: float) -> dict: + ... +``` + +### Adding Custom Attributes +```python +@traced( + name="classify-intent", + type="function", + attributes={ + "feature": "routing", + "version": "2.1", + } +) +def classify_intent(message: str) -> str: + ... 
+``` + +## Common Mistakes + +| Mistake | Problem | Fix | +|---------|---------|-----| +| Using `@traced` for LLM calls when instrumentor exists | Misses model/token metadata | Use framework instrumentor for LLM calls | +| Generic names like `"step1"`, `"process"` | Hard to find in trace UI | Use descriptive names: `"classify-intent"`, `"fetch-user-orders"` | +| `capture_input=True` on functions with secrets | Leaks API keys, tokens, PII into traces | Set `capture_input=False` and use `attributes` for safe metadata | +| Wrong span type | Breaks trace analytics (e.g., retrieval latency dashboard) | Match type to semantic meaning of the operation | +| Forgetting to trace orchestration function | Top-level agent loop invisible in traces | Wrap the entry point with `@traced(type="agent")` | From 12055d49ceafcaf82341a7032f8b73156c80578a Mon Sep 17 00:00:00 2001 From: Arian Pasquali Date: Tue, 24 Mar 2026 12:06:22 +0100 Subject: [PATCH 2/9] docs: remove claude-plugins mention from README The claude-plugins repo is still a work in progress, deferring the mention until it's ready. Co-Authored-By: Claude Opus 4.6 (1M context) --- README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/README.md b/README.md index 5f59b8d..614e6a2 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,6 @@ Each skill encodes best practices from prompt engineering, agent design, evaluat Built on the [Agent Skills](https://agentskills.io/home#adoption) standard format, so it works with any compatible agent (Claude Code, Cursor, Gemini CLI, and others). -**Using Claude Code?** Check out [orq-ai/claude-plugins](https://github.com/orq-ai/claude-plugins) — it bundles orq-skills with **orq-trace** (automatic session tracing) and **orq-mcp** (workspace MCP server) in a single install. 
## Setup From 9f9ffc27bf95d6f0ea1c1e6a978c750418ca116a Mon Sep 17 00:00:00 2001 From: Arian Pasquali Date: Tue, 24 Mar 2026 17:41:19 +0100 Subject: [PATCH 3/9] refactor: rename instrument-app to setup-observability Rename skill to better reflect what it does. Update README skills table and add "Instrument an Existing App" workflow example. Co-Authored-By: Claude Opus 4.6 (1M context) --- README.md | 18 +++++++++++++----- .../SKILL.md | 4 ++-- .../resources/baseline-checklist.md | 0 .../resources/framework-integrations.md | 0 .../resources/traced-decorator-guide.md | 0 5 files changed, 15 insertions(+), 7 deletions(-) rename skills/{instrument-app => setup-observability}/SKILL.md (98%) rename skills/{instrument-app => setup-observability}/resources/baseline-checklist.md (100%) rename skills/{instrument-app => setup-observability}/resources/framework-integrations.md (100%) rename skills/{instrument-app => setup-observability}/resources/traced-decorator-guide.md (100%) diff --git a/README.md b/README.md index 614e6a2..3e020a2 100644 --- a/README.md +++ b/README.md @@ -93,7 +93,7 @@ Skills are triggered by describing what you need. 
Claude picks the right skill a | Skill | What It Does | Documentation | |-------|-------------|---------------| -| **instrument-app** | Instrument LLM applications with orq.ai observability — AI Router proxy, OpenTelemetry, `@traced` decorator, and trace enrichment | [SKILL.md](skills/instrument-app/SKILL.md) | +| **setup-observability** | Set up orq.ai observability for existing LLM applications — AI Router proxy, OpenTelemetry, `@traced` decorator, and trace enrichment | [SKILL.md](skills/setup-observability/SKILL.md) | | **build-agent** | Design, create, and configure an orq.ai Agent with tools, instructions, knowledge bases, and memory | [SKILL.md](skills/build-agent/SKILL.md) | | **build-evaluator** | Create validated LLM-as-a-Judge evaluators following evaluation best practices | [SKILL.md](skills/build-evaluator/SKILL.md) | | **analyze-trace-failures** | Read production traces, identify what's failing, build failure taxonomies, and categorize issues | [SKILL.md](skills/analyze-trace-failures/SKILL.md) | @@ -106,7 +106,15 @@ Skills are triggered by describing what you need. Claude picks the right skill a ## Workflows -### 1. Build a New Agent +### 1. Instrument an Existing App + +``` +"Add orq.ai tracing to my app" → setup-observability +/orq:traces --last 1h # Verify traces are flowing +"Analyze these traces for failures" → analyze-trace-failures +``` + +### 2. Build a New Agent ``` "I need a customer support agent" → build-agent @@ -115,7 +123,7 @@ Skills are triggered by describing what you need. Claude picks the right skill a "Run an experiment to get a baseline" → run-experiment ``` -### 2. Debug Production Issues +### 3. Debug Production Issues ``` /orq:traces --status error --last 24h # Find errors @@ -124,7 +132,7 @@ Skills are triggered by describing what you need. Claude picks the right skill a "Re-run the experiment to verify the fix" → run-experiment ``` -### 3. Improve an Existing Agent +### 4. 
Improve an Existing Agent ``` /orq:analytics --group-by deployment # Spot high error rates @@ -135,7 +143,7 @@ Skills are triggered by describing what you need. Claude picks the right skill a "Optimize the prompt based on results" → optimize-prompt ``` -### 4. Improve an existing Prompt +### 5. Improve an Existing Prompt ``` "My prompt isn't performing well, help me improve it" → optimize-prompt diff --git a/skills/instrument-app/SKILL.md b/skills/setup-observability/SKILL.md similarity index 98% rename from skills/instrument-app/SKILL.md rename to skills/setup-observability/SKILL.md index d4b692d..9578116 100644 --- a/skills/instrument-app/SKILL.md +++ b/skills/setup-observability/SKILL.md @@ -1,6 +1,6 @@ --- -name: instrument-app -description: Instrument LLM applications with orq.ai observability. Use when setting up tracing, adding the AI Router proxy, integrating OpenTelemetry, auditing existing instrumentation, or enriching traces with metadata. +name: setup-observability +description: Set up orq.ai observability for LLM applications. Use when setting up tracing, adding the AI Router proxy, integrating OpenTelemetry, auditing existing instrumentation, or enriching traces with metadata. 
allowed-tools: Bash, Read, Write, Edit, Grep, Glob, WebFetch, Task, AskUserQuestion, orq* --- diff --git a/skills/instrument-app/resources/baseline-checklist.md b/skills/setup-observability/resources/baseline-checklist.md similarity index 100% rename from skills/instrument-app/resources/baseline-checklist.md rename to skills/setup-observability/resources/baseline-checklist.md diff --git a/skills/instrument-app/resources/framework-integrations.md b/skills/setup-observability/resources/framework-integrations.md similarity index 100% rename from skills/instrument-app/resources/framework-integrations.md rename to skills/setup-observability/resources/framework-integrations.md diff --git a/skills/instrument-app/resources/traced-decorator-guide.md b/skills/setup-observability/resources/traced-decorator-guide.md similarity index 100% rename from skills/instrument-app/resources/traced-decorator-guide.md rename to skills/setup-observability/resources/traced-decorator-guide.md From 96161ba76f494791509ba5a545d57ad3b8712c0a Mon Sep 17 00:00:00 2001 From: currentlycodinng <148545995+currentlycodinng@users.noreply.github.com> Date: Wed, 25 Mar 2026 12:22:24 +0100 Subject: [PATCH 4/9] fix: correct OpenInference import path and stale heading MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Fix import: opentelemetry.instrumentation.openai → openinference.instrumentation.openai - Rename heading from "Instrument App" to "Setup Observability" Co-Authored-By: Claude Opus 4.6 --- skills/setup-observability/SKILL.md | 4 ++-- .../resources/framework-integrations.md | 2 +- tests/skills.md | 7 +++++++ 3 files changed, 10 insertions(+), 3 deletions(-) diff --git a/skills/setup-observability/SKILL.md b/skills/setup-observability/SKILL.md index 9578116..60a08fa 100644 --- a/skills/setup-observability/SKILL.md +++ b/skills/setup-observability/SKILL.md @@ -4,7 +4,7 @@ description: Set up orq.ai observability for LLM applications. 
Use when setting allowed-tools: Bash, Read, Write, Edit, Grep, Glob, WebFetch, Task, AskUserQuestion, orq* --- -# Instrument App +# Setup Observability You are an **orq.ai observability engineer**. Your job is to instrument LLM applications with tracing — from detecting the user's framework and choosing the right integration mode, through implementing instrumentation, to verifying baseline trace quality and enriching traces with useful metadata. @@ -156,7 +156,7 @@ Follow these steps **in order**. Do NOT skip steps. from opentelemetry.sdk.trace import TracerProvider from opentelemetry.sdk.trace.export import BatchSpanProcessor from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter - from opentelemetry.instrumentation.openai import OpenAIInstrumentor + from openinference.instrumentation.openai import OpenAIInstrumentor # Initialize BEFORE creating OpenAI client tracer_provider = TracerProvider() diff --git a/skills/setup-observability/resources/framework-integrations.md b/skills/setup-observability/resources/framework-integrations.md index f323aa5..b76610c 100644 --- a/skills/setup-observability/resources/framework-integrations.md +++ b/skills/setup-observability/resources/framework-integrations.md @@ -92,7 +92,7 @@ from opentelemetry import trace from opentelemetry.sdk.trace import TracerProvider from opentelemetry.sdk.trace.export import BatchSpanProcessor from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter -from opentelemetry.instrumentation.openai import OpenAIInstrumentor +from openinference.instrumentation.openai import OpenAIInstrumentor tracer_provider = TracerProvider() tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter())) diff --git a/tests/skills.md b/tests/skills.md index 4449b22..3e74459 100644 --- a/tests/skills.md +++ b/tests/skills.md @@ -6,6 +6,12 @@ Requires `setup.md` to have run first (seed data for `run-experiment` test). 
--- +## `setup-observability` + +- Ask: "Help me add orq.ai tracing to my app" +- Verify: scans project for framework imports and existing instrumentation +- Verify: recommends integration mode (AI Router vs Observability) based on findings + ## `build-agent` - Ask: "Build a simple FAQ agent for a pizza restaurant" @@ -46,6 +52,7 @@ Requires `setup.md` to have run first (seed data for `run-experiment` test). ## Critical Files +- `skills/setup-observability/SKILL.md` - `skills/build-agent/SKILL.md` - `skills/build-evaluator/SKILL.md` - `skills/generate-synthetic-dataset/SKILL.md` From df3cdefbe1afe75e4b3bedfe87f1157c1c5b2bbe Mon Sep 17 00:00:00 2001 From: Arian Pasquali Date: Thu, 26 Mar 2026 13:14:43 +0100 Subject: [PATCH 5/9] fix: correct hallucinated code examples in setup-observability skill MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Fix @traced import path: orq_ai_sdk.tracing → orq_ai_sdk.traced (verified against official docs) - Fix LangChain model format: gpt-4o → openai/gpt-4o (provider/model format) - Replace hardcoded service.name=my-app with placeholder - Soften unsubstantiated "10x more metadata" claim - Add warning about overwriting existing OTEL config (Datadog, Jaeger, etc.) - Add auto-formatter guidance (isort/noqa) for critical import ordering Co-Authored-By: Claude Opus 4.6 (1M context) --- skills/setup-observability/SKILL.md | 8 +++++--- .../resources/framework-integrations.md | 4 ++-- .../resources/traced-decorator-guide.md | 6 +++--- 3 files changed, 10 insertions(+), 8 deletions(-) diff --git a/skills/setup-observability/SKILL.md b/skills/setup-observability/SKILL.md index 60a08fa..71b12df 100644 --- a/skills/setup-observability/SKILL.md +++ b/skills/setup-observability/SKILL.md @@ -18,7 +18,7 @@ You are an **orq.ai observability engineer**. 
Your job is to instrument LLM appl - **ALWAYS** prefer AI Router mode when the user's framework supports it — it's the fastest path to traces with zero instrumentation code. - **ALWAYS** set `service.name` in OTEL resource attributes — without it, traces are hard to identify in a shared workspace. -**Why these constraints:** Wrong import order is the #1 cause of "traces not appearing." Generic names make traces unfindable at scale. Logging PII creates compliance risk. Framework instrumentors capture 10x more metadata than manual tracing with less code. +**Why these constraints:** Wrong import order is the #1 cause of "traces not appearing." Generic names make traces unfindable at scale. Logging PII creates compliance risk. Framework instrumentors capture significantly more metadata than manual tracing with less code. ## Companion Skills @@ -139,11 +139,11 @@ Follow these steps **in order**. Do NOT skip steps. For framework-specific setup (LangChain, CrewAI, etc.), refer to the framework's docs page linked in [resources/framework-integrations.md](resources/framework-integrations.md). 6. **For Observability mode:** - - Set OTEL environment variables: + - Set OTEL environment variables. **Warning:** If the project already has OpenTelemetry configured (e.g., for Datadog, Jaeger, or another backend), check for existing `OTEL_*` env vars or `TracerProvider` setup first — setting these will override that configuration. Confirm with the user before overwriting. ```bash export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.orq.ai/v2/otel" export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer $ORQ_API_KEY" - export OTEL_RESOURCE_ATTRIBUTES="service.name=my-app,service.version=1.0.0" + export OTEL_RESOURCE_ATTRIBUTES="service.name=,service.version=1.0.0" export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL="http/json" ``` - Install the framework's OpenInference instrumentor package @@ -165,6 +165,8 @@ Follow these steps **in order**. Do NOT skip steps. 
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider) ``` + > **Note:** The import order above is critical — instrumentors must be initialized before framework clients. If the project uses an auto-formatter (isort, Ruff), add `# isort:skip_file` at the top of the file or `# noqa: E402` on late imports to prevent reordering. + 7. **For both modes:** Set up AI Router first (step 5), then add Observability (step 6) for framework-level spans on top. ### Phase 4: Verify Baseline diff --git a/skills/setup-observability/resources/framework-integrations.md b/skills/setup-observability/resources/framework-integrations.md index b76610c..f617229 100644 --- a/skills/setup-observability/resources/framework-integrations.md +++ b/skills/setup-observability/resources/framework-integrations.md @@ -68,7 +68,7 @@ const client = new OpenAI({ from langchain_openai import ChatOpenAI llm = ChatOpenAI( - model="gpt-4o", + model="openai/gpt-4o", api_key=os.getenv("ORQ_API_KEY"), base_url="https://api.orq.ai/v2/router", ) @@ -82,7 +82,7 @@ All observability integrations use OpenInference instrumentors with OTLP export ```bash export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.orq.ai/v2/otel" export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer $ORQ_API_KEY" -export OTEL_RESOURCE_ATTRIBUTES="service.name=my-app,service.version=1.0.0" +export OTEL_RESOURCE_ATTRIBUTES="service.name=,service.version=1.0.0" export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL="http/json" ``` diff --git a/skills/setup-observability/resources/traced-decorator-guide.md b/skills/setup-observability/resources/traced-decorator-guide.md index 43a68e8..f68a7b5 100644 --- a/skills/setup-observability/resources/traced-decorator-guide.md +++ b/skills/setup-observability/resources/traced-decorator-guide.md @@ -54,7 +54,7 @@ The `@traced` decorator from the orq.ai Python SDK adds custom spans to your tra ### Sync Function ```python -from orq_ai_sdk.tracing import traced +from orq_ai_sdk.traced import traced 
@traced(name="extract-keywords", type="function") def extract_keywords(text: str) -> list[str]: @@ -64,7 +64,7 @@ def extract_keywords(text: str) -> list[str]: ### Async Function ```python -from orq_ai_sdk.tracing import traced +from orq_ai_sdk.traced import traced @traced(name="fetch-context", type="retrieval") async def fetch_context(query: str) -> list[dict]: @@ -74,7 +74,7 @@ async def fetch_context(query: str) -> list[dict]: ### Agent Orchestration ```python -from orq_ai_sdk.tracing import traced +from orq_ai_sdk.traced import traced @traced(name="support-agent", type="agent") def run_support_agent(user_message: str) -> str: From bbde329255412d16338ae92e9c0dbdd2b355da8c Mon Sep 17 00:00:00 2001 From: Arian Pasquali Date: Thu, 26 Mar 2026 13:15:55 +0100 Subject: [PATCH 6/9] test: add setup-observability smoke tests and resolve merge conflict Co-Authored-By: Claude Opus 4.6 (1M context) --- commands/workspace.md | 8 ++ skills/build-agent/resources/api-reference.md | 4 +- skills/build-evaluator/SKILL.md | 7 +- .../resources/api-reference.md | 2 + .../run-experiment/resources/api-reference.md | 8 +- tests/mcp-tools.md | 16 +-- tests/skills.md | 101 +++++++++++++++++- 7 files changed, 125 insertions(+), 21 deletions(-) diff --git a/commands/workspace.md b/commands/workspace.md index ca5dea0..39fcb0a 100644 --- a/commands/workspace.md +++ b/commands/workspace.md @@ -20,6 +20,7 @@ Show a quick overview of the user's orq.ai workspace — agents, deployments, pr - `experiments` — show only experiments - `projects` — show only projects - `knowledge` — show only knowledge bases +- `evaluator` — show only evaluators If empty, show all sections. 
@@ -35,6 +36,7 @@ Use the `search_entities` MCP tool and `get_analytics_overview` MCP tool to fetc - **Experiments:** `search_entities` with `type: "experiment"` - **Projects:** `search_entities` with `type: "project"` - **Knowledge:** `search_entities` with `type: "knowledge"` +- **Evaluator:** `search_entities` with `type: "evaluator"` Fetch only the sections needed based on arguments. Always fetch analytics overview regardless of section filter. @@ -91,6 +93,12 @@ Manage your workspace at **[Workspace → my.orq.ai](https://my.orq.ai/)**. - **product-docs** — 120 documents - **faq-database** — 45 documents + + +### Evaluators (2) + +- **coherence** — active +- **toxicity** — active ``` #### Formatting rules diff --git a/skills/build-agent/resources/api-reference.md b/skills/build-agent/resources/api-reference.md index a26e33b..4b25422 100644 --- a/skills/build-agent/resources/api-reference.md +++ b/skills/build-agent/resources/api-reference.md @@ -17,10 +17,12 @@ Use the orq MCP server (`https://my.orq.ai/v2/mcp`) as the primary interface. 
Fo | `create_agent` | Create a new agent with configuration | | `get_agent` | Get agent details — verify configuration after creation or updates | | `update_agent` | Update agent configuration (instructions, model, tools) — iterate without recreating | -| `search_entities` | Find agents, knowledge bases (`type: "knowledge"`), memory stores (`type: "memory_store"`) | +| `search_entities` | Find agents, knowledge bases (`type: "knowledge"`), memory stores (`type: "memory_store"`), evaluators (`type: "evaluator"`) | | `search_directories` | Discover workspace project structure and paths — useful for KB `path` selection | | `list_models` | List available models for agent configuration | | `create_llm_eval` | Create evaluators for quality comparison | +| `get_evaluator_llm` | Retrieve an LLM evaluator by key or ID | +| `get_evaluator_python` | Retrieve a Python evaluator by key or ID | | `list_traces` | Inspect traces for latency/cost data | ## HTTP API diff --git a/skills/build-evaluator/SKILL.md b/skills/build-evaluator/SKILL.md index 182850b..cc662e2 100644 --- a/skills/build-evaluator/SKILL.md +++ b/skills/build-evaluator/SKILL.md @@ -81,6 +81,8 @@ Use the orq MCP server (`https://my.orq.ai/v2/mcp`) as the primary interface. 
Fo |------|---------| | `create_llm_eval` | Create an LLM evaluator with your judge prompt | | `create_python_eval` | Create a Python evaluator for code-based checks | +| `get_evaluator_llm` | Retrieve an LLM evaluator by key or ID (not supported for jury-mode evaluators) | +| `get_evaluator_python` | Retrieve a Python evaluator by key or ID | | `list_models` | List available judge models | **HTTP API fallback** (for operations not yet in MCP): @@ -92,11 +94,6 @@ curl -s https://my.orq.ai/v2/evaluators \ -H "Authorization: Bearer $ORQ_API_KEY" \ -H "Content-Type: application/json" | jq -# Get evaluator details -curl -s https://my.orq.ai/v2/evaluators/ \ - -H "Authorization: Bearer $ORQ_API_KEY" \ - -H "Content-Type: application/json" | jq - # Test-invoke an evaluator against a sample output curl -s https://my.orq.ai/v2/evaluators//invoke \ -H "Authorization: Bearer $ORQ_API_KEY" \ diff --git a/skills/generate-synthetic-dataset/resources/api-reference.md b/skills/generate-synthetic-dataset/resources/api-reference.md index 522f828..32f66f0 100644 --- a/skills/generate-synthetic-dataset/resources/api-reference.md +++ b/skills/generate-synthetic-dataset/resources/api-reference.md @@ -21,6 +21,8 @@ Use the orq MCP server (`https://my.orq.ai/v2/mcp`) as the primary interface. 
Fo | `search_entities` | Find existing datasets (`type: "dataset"`) | | `update_datapoint` | Modify existing datapoints (curation) | | `delete_datapoints` | Remove datapoints from a dataset (curation) | +| `get_evaluator_llm` | Retrieve an LLM evaluator to understand dataset requirements | +| `get_evaluator_python` | Retrieve a Python evaluator to understand dataset requirements | ## HTTP API diff --git a/skills/run-experiment/resources/api-reference.md b/skills/run-experiment/resources/api-reference.md index b560f3a..8e32d6f 100644 --- a/skills/run-experiment/resources/api-reference.md +++ b/skills/run-experiment/resources/api-reference.md @@ -15,6 +15,9 @@ Use the orq MCP server (`https://my.orq.ai/v2/mcp`) as the primary interface. Fo | Tool | Purpose | |------|---------| | `create_llm_eval` | Create an LLM evaluator | +| `create_python_eval` | Create a Python evaluator for code-based checks | +| `get_evaluator_llm` | Retrieve an LLM evaluator by key or ID | +| `get_evaluator_python` | Retrieve a Python evaluator by key or ID | | `list_traces` | List and filter traces for error analysis | | `list_spans` | List spans within a trace | | `get_span` | Get detailed span information | @@ -39,11 +42,6 @@ curl -s https://my.orq.ai/v2/evaluators \ -H "Authorization: Bearer $ORQ_API_KEY" \ -H "Content-Type: application/json" | jq -# Get evaluator details -curl -s https://my.orq.ai/v2/evaluators/ \ - -H "Authorization: Bearer $ORQ_API_KEY" \ - -H "Content-Type: application/json" | jq - # Invoke an evaluator curl -s https://my.orq.ai/v2/evaluators//invoke \ -H "Authorization: Bearer $ORQ_API_KEY" \ diff --git a/tests/mcp-tools.md b/tests/mcp-tools.md index fbd6dca..9c7e172 100644 --- a/tests/mcp-tools.md +++ b/tests/mcp-tools.md @@ -8,7 +8,7 @@ Tests the orq.ai MCP server tools directly. Requires `setup.md` to have run firs ## Read-only tools (safe, no cleanup needed) -1. 
`search_entities` — all 8 types (agent, dataset, prompt, experiment, knowledge, memory_store, deployment, project) +1. `search_entities` — all 9 types (agent, dataset, prompt, experiment, knowledge, memory_store, deployment, project, evaluator) 2. `search_directories` — list project dirs 3. `list_models(modelType=chat)` → verify non-empty 4. `list_registry_keys` → verify returns array @@ -27,21 +27,23 @@ Tests the orq.ai MCP server tools directly. Requires `setup.md` to have run firs 14. `delete_datapoints` → delete 1, verify 2 remain 15. `delete_dataset` → delete `orq-skills-test-crud-dataset` (only this test resource) -## Evaluator creation *(manual cleanup required — no MCP delete tool)* +## Evaluator tools *(manual cleanup required — no MCP delete tool)* 16. `create_llm_eval` → key: `orq-skills-test-llm-eval`, with simple judge prompt 17. `create_python_eval` → key: `orq-skills-test-py-eval` +18. `get_evaluator_llm(key=orq-skills-test-llm-eval)` → verify returns prompt and model +19. `get_evaluator_python(key=orq-skills-test-py-eval)` → verify returns code ## Agent tools -18. `get_agent(key=orq-skills-test-echo)` → verify config matches what we created -19. `create_agent` → key: `orq-skills-test-crud-agent` *(manual cleanup required — no MCP delete tool)* -20. `update_agent(key=orq-skills-test-crud-agent)` → update instructions (only our test agent) +20. `get_agent(key=orq-skills-test-echo)` → verify config matches what we created +21. `create_agent` → key: `orq-skills-test-crud-agent` *(manual cleanup required — no MCP delete tool)* +22. `update_agent(key=orq-skills-test-crud-agent)` → update instructions (only our test agent) ## Experiment tools -21. `create_experiment` → key: `orq-skills-test-experiment` with seeded dataset + evaluator *(manual cleanup required — no MCP delete tool)* -22. `list_experiment_runs` +23. `create_experiment` → key: `orq-skills-test-experiment` with seeded dataset + evaluator *(manual cleanup required — no MCP delete tool)* +24. 
`list_experiment_runs` --- diff --git a/tests/skills.md b/tests/skills.md index 3e74459..33a15ba 100644 --- a/tests/skills.md +++ b/tests/skills.md @@ -8,9 +8,50 @@ Requires `setup.md` to have run first (seed data for `run-experiment` test). ## `setup-observability` -- Ask: "Help me add orq.ai tracing to my app" -- Verify: scans project for framework imports and existing instrumentation -- Verify: recommends integration mode (AI Router vs Observability) based on findings +### Scenario 1: Python OpenAI app — AI Router path + +- Provide: a small Python file using `openai.OpenAI()` with no existing tracing +- Ask: "Add orq.ai tracing to my app" +- Verify Phase 1: scans the project, identifies OpenAI SDK, reports no existing tracing +- Verify Phase 2: recommends **AI Router** mode (framework supports it, fastest path) +- Verify Phase 3: changes `base_url` to `https://api.orq.ai/v2/router`, uses `provider/model` format (e.g., `openai/gpt-4o`) +- Verify: does NOT use `from orq_ai_sdk.tracing import traced` (wrong import path) +- Verify: does NOT hardcode `service.name=my-app` + +### Scenario 2: LangChain app — Observability path + +- Provide: a Python file using `langchain_openai.ChatOpenAI()` calling a provider directly +- Ask: "I want to add tracing but keep my existing LLM calls" +- Verify Phase 2: recommends **Observability** mode (user wants to keep existing calls) +- Verify Phase 3: sets OTEL env vars, installs OpenInference instrumentor +- Verify: instrumentor is initialized BEFORE framework client creation +- Verify: warns about existing OTEL config if any `OTEL_*` vars already exist + +### Scenario 3: Verify code correctness + +- Ask: "Show me how to use the @traced decorator" +- Verify: import path is `from orq_ai_sdk.traced import traced` or `from orq_ai_sdk import traced` +- Verify: parameters shown are `name`, `type`, `capture_input`, `capture_output`, `attributes` +- Verify: does NOT show `user_id` as a direct `@traced` parameter (should be in 
`attributes={}`) +- Verify: does NOT use `orq_traced_input()` or `orq_traced_output()` (these don't exist) +- Verify: `capture_input` / `capture_output` defaults documented as `True` + +### Scenario 4: Sensitive data handling + +- Provide: a Python function that takes `card_number` and `user_email` as arguments +- Ask: "Add tracing to this function" +- Verify: uses `capture_input=False` and/or `capture_output=False` +- Verify: explains that defaults are `True` (all inputs/outputs sent to orq.ai unless disabled) + +### Scenario 5: Existing OTEL configuration + +- Provide: a project with existing `OTEL_EXPORTER_OTLP_ENDPOINT` pointing to Datadog +- Ask: "Add orq.ai observability" +- Verify: detects existing OTEL configuration in Phase 1 +- Verify: warns about overwriting before setting new env vars +- Verify: asks user for confirmation before proceeding + +--- ## `build-agent` @@ -48,14 +89,68 @@ Requires `setup.md` to have run first (seed data for `run-experiment` test). - Ask: "Run an experiment using orq-skills-test-dataset with orq-skills-test-eval-length" - Verify: calls `create_experiment` with correct references +## `compare-agents` + +### Scenario 1: orq.ai vs external agent (Python) + +- Ask: "Compare my orq.ai agent orq-skills-test-echo against a simple Python function that reverses the input" +- Verify Phase 1: identifies two agents — orq.ai (uses `search_entities` to find `orq-skills-test-echo`) and generic Python +- Verify Phase 1: asks or confirms language preference (Python) +- Verify Phase 2: delegates to `generate-synthetic-dataset` or creates dataset via `create_dataset` + `create_datapoints` with `orq-skills-test-` prefix +- Verify Phase 3: delegates to `build-evaluator` or creates evaluator via `create_llm_eval` +- Verify Phase 4: generates a Python script with: + - `from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult` + - One `@job("OrqAgent")` using `orq.agents.responses.create()` (NOT `agents.invoke()`) + - One 
`@job("ReverseAgent")` wrapping the Python function + - An evaluator scorer invoking the orq.ai judge by ID + - An `evaluatorq()` call wiring jobs + data + evaluators +- Verify: script uses A2A message format `{"role": "user", "parts": [{"kind": "text", "text": ...}]}` (NOT OpenAI-style) +- Verify: does NOT hardcode datapoints inline if a dataset was created on the platform + +### Scenario 2: orq.ai vs orq.ai + +- Ask: "Compare two versions of my agent — orq-skills-test-echo with model gpt-4o-mini vs the same agent" +- Verify: generates two orq.ai job patterns with different job names (e.g., `OrqAgent-A`, `OrqAgent-B`) +- Verify: uses the same `agent_key` for both (since it's the same agent) +- Verify: warns about same-model comparison if both use the same model + +### Scenario 3: TypeScript preference + +- Ask: "I want to benchmark a LangGraph agent against my orq.ai agent, using TypeScript" +- Verify Phase 4: generates TypeScript, not Python +- Verify: imports from `@orq-ai/evaluatorq` +- Verify: uses `wrapLangGraphAgent` from `@orq-ai/evaluatorq/langchain` for the LangGraph job +- Verify: uses `job()` function (not `@job` decorator) + +### Scenario 4: Skill boundary — redirects + +- Ask: "Create a dataset for testing my agents" +- Verify: redirects to `generate-synthetic-dataset` (does NOT handle dataset creation itself) +- Ask: "Run an experiment with my orq.ai deployment" +- Verify: redirects to `run-experiment` (no external agents involved) + +### Scenario 5: Dataset bias prevention + +- Provide: two agents — one with a mock weather tool returning "Sunny, 22C", one with a real API +- Ask: "Compare these agents on weather queries" +- Verify: does NOT write expected outputs matching the mock data +- Verify: expected outputs describe correctness criteria (e.g., "should include current temperature from a real source") + --- ## Critical Files - `skills/setup-observability/SKILL.md` +- `skills/setup-observability/resources/traced-decorator-guide.md` +- 
`skills/setup-observability/resources/framework-integrations.md` +- `skills/setup-observability/resources/baseline-checklist.md` - `skills/build-agent/SKILL.md` - `skills/build-evaluator/SKILL.md` - `skills/generate-synthetic-dataset/SKILL.md` - `skills/optimize-prompt/SKILL.md` - `skills/analyze-trace-failures/SKILL.md` - `skills/run-experiment/SKILL.md` +- `skills/compare-agents/SKILL.md` +- `skills/compare-agents/resources/job-patterns.md` +- `skills/compare-agents/resources/evaluatorq-api.md` +- `skills/compare-agents/resources/gotchas.md` From 22507aa7d8b4fd24a9a26071dd2385de56559142 Mon Sep 17 00:00:00 2001 From: Arian Pasquali Date: Thu, 2 Apr 2026 11:49:06 +0200 Subject: [PATCH 7/9] fix: address PR review feedback — correct MCP tool names, add missing context MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Replace non-existent `get_evaluator_llm`/`get_evaluator_python` with `evaluator_get` across 4 skills - Add SDK init prerequisite to @traced guide (silent failure without Orq client) - Document capture_input/capture_output defaults as True (PII risk) - Add missing `import os` to framework-integrations code snippets - Explain Control Tower column in framework integrations table - Scope @traced and OTEL examples as Python-only, add Node.js pointers Co-Authored-By: Claude Opus 4.6 (1M context) --- skills/build-agent/resources/api-reference.md | 3 +-- skills/build-evaluator/SKILL.md | 3 +-- .../resources/api-reference.md | 3 +-- .../run-experiment/resources/api-reference.md | 3 +-- skills/setup-observability/SKILL.md | 4 ++-- .../resources/framework-integrations.md | 5 ++++ .../resources/traced-decorator-guide.md | 24 ++++++++++++++++--- 7 files changed, 32 insertions(+), 13 deletions(-) diff --git a/skills/build-agent/resources/api-reference.md b/skills/build-agent/resources/api-reference.md index 4b25422..e22a2d2 100644 ---
a/skills/build-agent/resources/api-reference.md +++ b/skills/build-agent/resources/api-reference.md @@ -21,8 +21,7 @@ Use the orq MCP server (`https://my.orq.ai/v2/mcp`) as the primary interface. Fo | `search_directories` | Discover workspace project structure and paths — useful for KB `path` selection | | `list_models` | List available models for agent configuration | | `create_llm_eval` | Create evaluators for quality comparison | -| `get_evaluator_llm` | Retrieve an LLM evaluator by key or ID | -| `get_evaluator_python` | Retrieve a Python evaluator by key or ID | +| `evaluator_get` | Retrieve any evaluator by ID | | `list_traces` | Inspect traces for latency/cost data | ## HTTP API diff --git a/skills/build-evaluator/SKILL.md b/skills/build-evaluator/SKILL.md index cc662e2..2674c90 100644 --- a/skills/build-evaluator/SKILL.md +++ b/skills/build-evaluator/SKILL.md @@ -81,8 +81,7 @@ Use the orq MCP server (`https://my.orq.ai/v2/mcp`) as the primary interface. Fo |------|---------| | `create_llm_eval` | Create an LLM evaluator with your judge prompt | | `create_python_eval` | Create a Python evaluator for code-based checks | -| `get_evaluator_llm` | Retrieve an LLM evaluator by key or ID (not supported for jury-mode evaluators) | -| `get_evaluator_python` | Retrieve a Python evaluator by key or ID | +| `evaluator_get` | Retrieve any evaluator by ID | | `list_models` | List available judge models | **HTTP API fallback** (for operations not yet in MCP): diff --git a/skills/generate-synthetic-dataset/resources/api-reference.md b/skills/generate-synthetic-dataset/resources/api-reference.md index 32f66f0..d706e3e 100644 --- a/skills/generate-synthetic-dataset/resources/api-reference.md +++ b/skills/generate-synthetic-dataset/resources/api-reference.md @@ -21,8 +21,7 @@ Use the orq MCP server (`https://my.orq.ai/v2/mcp`) as the primary interface. 
Fo | `search_entities` | Find existing datasets (`type: "dataset"`) | | `update_datapoint` | Modify existing datapoints (curation) | | `delete_datapoints` | Remove datapoints from a dataset (curation) | -| `get_evaluator_llm` | Retrieve an LLM evaluator to understand dataset requirements | -| `get_evaluator_python` | Retrieve a Python evaluator to understand dataset requirements | +| `evaluator_get` | Retrieve any evaluator by ID to understand dataset requirements | ## HTTP API diff --git a/skills/run-experiment/resources/api-reference.md b/skills/run-experiment/resources/api-reference.md index 8e32d6f..412358d 100644 --- a/skills/run-experiment/resources/api-reference.md +++ b/skills/run-experiment/resources/api-reference.md @@ -16,8 +16,7 @@ Use the orq MCP server (`https://my.orq.ai/v2/mcp`) as the primary interface. Fo |------|---------| | `create_llm_eval` | Create an LLM evaluator | | `create_python_eval` | Create a Python evaluator for code-based checks | -| `get_evaluator_llm` | Retrieve an LLM evaluator by key or ID | -| `get_evaluator_python` | Retrieve a Python evaluator by key or ID | +| `evaluator_get` | Retrieve any evaluator by ID | | `list_traces` | List and filter traces for error analysis | | `list_spans` | List spans within a trace | | `get_span` | Get detailed span information | diff --git a/skills/setup-observability/SKILL.md b/skills/setup-observability/SKILL.md index 71b12df..33be0de 100644 --- a/skills/setup-observability/SKILL.md +++ b/skills/setup-observability/SKILL.md @@ -150,7 +150,7 @@ Follow these steps **in order**. Do NOT skip steps. 
 - Initialize the instrumentor BEFORE creating SDK clients
 - Refer to the framework's docs page for the exact instrumentor and setup

- **Python (OpenAI example):**
+ **Python (OpenAI example):** *(Node.js uses `@opentelemetry/sdk-node` — see [Integration Overview](https://docs.orq.ai/docs/integrations/overview) for Node.js setup)*
 ```python
 from opentelemetry import trace
 from opentelemetry.sdk.trace import TracerProvider
@@ -205,7 +205,7 @@ Follow these steps **in order**. Do NOT skip steps.
 | Customer/tenant identifiers | `customer_id` or tier tag |
 | Feedback collection, ratings | Score annotations |

-14. **Add `@traced` for custom spans** where the user has application logic not captured by framework instrumentors. See [resources/traced-decorator-guide.md](resources/traced-decorator-guide.md) for the full reference.
+14. **Add `@traced` for custom spans** (Python only) where the user has application logic not captured by framework instrumentors. For Node.js, use OpenTelemetry span APIs directly. See [resources/traced-decorator-guide.md](resources/traced-decorator-guide.md) for the full Python reference.

 Priority targets for `@traced`:
 - The top-level orchestration function (type: `agent`)
diff --git a/skills/setup-observability/resources/framework-integrations.md b/skills/setup-observability/resources/framework-integrations.md
index f617229..99358be 100644
--- a/skills/setup-observability/resources/framework-integrations.md
+++ b/skills/setup-observability/resources/framework-integrations.md
@@ -7,6 +7,7 @@
 | **AI Router** | Route LLM calls through `https://api.orq.ai/v2/router` — traces generated automatically | You want multi-provider access, fallbacks, caching, cost tracking with zero instrumentation code |
 | **Observability** | Send OpenTelemetry traces from your existing setup to `https://api.orq.ai/v2/otel` | You already call providers directly and want to add tracing without changing your LLM calls |
 | **Both** | AI Router for routing + Observability for framework-level spans | You want full pipeline visibility: framework orchestration spans + LLM call traces |
+| **Control Tower** | Full agent lifecycle management — deploy, monitor, and control agents from the orq.ai dashboard | Framework has native orq.ai integration for agent orchestration (currently: LangGraph, OpenAI Agents, Vercel AI SDK) |

 **Rule of thumb:** If the user's framework is in the AI Router column, start there — it's the fastest path to traces. Add Observability on top only if they need framework-level span detail (agent steps, tool calls, chain execution).

@@ -46,6 +47,7 @@ All AI Router integrations follow the same pattern — point your SDK's base URL
 **Python (OpenAI SDK):**
 ```python
 from openai import OpenAI
+import os

 client = OpenAI(
     base_url="https://api.orq.ai/v2/router",
@@ -66,6 +68,7 @@ const client = new OpenAI({
 **LangChain:**
 ```python
 from langchain_openai import ChatOpenAI
+import os

 llm = ChatOpenAI(
     model="openai/gpt-4o",
@@ -102,3 +105,5 @@ OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
 ```

 **Key:** Each framework has its own OpenInference instrumentor package. See the framework-specific docs page for the exact package name and import.
+
+> **Node.js/TypeScript:** The Observability examples above are Python-only. For Node.js OTEL setup, use `@opentelemetry/sdk-node` with `@opentelemetry/exporter-trace-otlp-http` and framework-specific instrumentors from the `@opentelemetry/instrumentation-*` namespace (not OpenInference). See the [orq.ai Integration Overview](https://docs.orq.ai/docs/integrations/overview) for Node.js setup.
diff --git a/skills/setup-observability/resources/traced-decorator-guide.md b/skills/setup-observability/resources/traced-decorator-guide.md
index f68a7b5..6559036 100644
--- a/skills/setup-observability/resources/traced-decorator-guide.md
+++ b/skills/setup-observability/resources/traced-decorator-guide.md
@@ -1,9 +1,27 @@
 # The `@traced` Decorator

-The `@traced` decorator from the orq.ai Python SDK adds custom spans to your traces for application logic that isn't automatically captured by framework instrumentors.
+The `@traced` decorator from the orq.ai **Python** SDK adds custom spans to your traces for application logic that isn't automatically captured by framework instrumentors. Node.js/TypeScript does not have a `@traced` equivalent — use OpenTelemetry span APIs directly for custom spans in Node.js.

 **Docs:** [Custom Tracing using the @traced decorator](https://docs.orq.ai/docs/observability/traces#custom-tracing-using-the-@traced-decorator)

+## Prerequisites
+
+The orq.ai SDK client must be initialized before `@traced` will export spans:
+
+```python
+from orq_ai_sdk import Orq
+from orq_ai_sdk.traced import traced
+import os
+
+client = Orq(api_key=os.getenv("ORQ_API_KEY"))
+
+@traced(name="my-operation", type="function")
+def my_function():
+    ...
+```
+
+Without initializing `Orq(api_key=...)`, the `@traced` decorator will silently do nothing — no error, but no spans exported.
+
 ## When to Use

 | Scenario | Use `@traced` | Use Framework Instrumentor |
@@ -46,8 +64,8 @@ The `@traced` decorator from the orq.ai Python SDK adds custom spans to your tra
 |-----------|---------|-------|
 | `name` | function name | Use descriptive names: `"fetch-user-context"` not `"step1"` |
 | `type` | `"function"` | Pick the semantic type that matches the operation |
-| `capture_input` | `True` | Set `False` if inputs contain PII or secrets |
-| `capture_output` | `True` | Set `False` if outputs contain sensitive data |
+| `capture_input` | `True` | **Default captures all function args.** Set `False` if inputs contain PII or secrets |
+| `capture_output` | `True` | **Default captures all return values.** Set `False` if outputs contain sensitive data |
 | `attributes` | `{}` | Add searchable metadata: user tier, feature name, etc. |

 ## Examples

From 85c4cb0e15b1cd2021c68f09fe25d34e8ed28577 Mon Sep 17 00:00:00 2001
From: Arian Pasquali
Date: Tue, 7 Apr 2026 11:13:05 +0200
Subject: [PATCH 8/9] fix: address remaining PR review items

- Replace non-existent get_evaluator_llm/get_evaluator_python with evaluator_get in mcp-tools tests
- Remove compare-agents test scenarios (should ship with compare-agents PR, not this one)
- Remove compare-agents from Critical Files list

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 tests/mcp-tools.md |  4 ++--
 tests/skills.md    | 51 ----------------------------------------------
 2 files changed, 2 insertions(+), 53 deletions(-)

diff --git a/tests/mcp-tools.md b/tests/mcp-tools.md
index 9c7e172..a7fe1b6 100644
--- a/tests/mcp-tools.md
+++ b/tests/mcp-tools.md
@@ -31,8 +31,8 @@ Tests the orq.ai MCP server tools directly. Requires `setup.md` to have run firs

 16. `create_llm_eval` → key: `orq-skills-test-llm-eval`, with simple judge prompt
 17. `create_python_eval` → key: `orq-skills-test-py-eval`
-18. `get_evaluator_llm(key=orq-skills-test-llm-eval)` → verify returns prompt and model
-19. `get_evaluator_python(key=orq-skills-test-py-eval)` → verify returns code
+18. `evaluator_get(id=)` → verify returns prompt and model
+19. `evaluator_get(id=)` → verify returns code

 ## Agent tools

diff --git a/tests/skills.md b/tests/skills.md
index 33a15ba..269e1b5 100644
--- a/tests/skills.md
+++ b/tests/skills.md
@@ -89,53 +89,6 @@ Requires `setup.md` to have run first (seed data for `run-experiment` test).
 - Ask: "Run an experiment using orq-skills-test-dataset with orq-skills-test-eval-length"
 - Verify: calls `create_experiment` with correct references

-## `compare-agents`
-
-### Scenario 1: orq.ai vs external agent (Python)
-
-- Ask: "Compare my orq.ai agent orq-skills-test-echo against a simple Python function that reverses the input"
-- Verify Phase 1: identifies two agents — orq.ai (uses `search_entities` to find `orq-skills-test-echo`) and generic Python
-- Verify Phase 1: asks or confirms language preference (Python)
-- Verify Phase 2: delegates to `generate-synthetic-dataset` or creates dataset via `create_dataset` + `create_datapoints` with `orq-skills-test-` prefix
-- Verify Phase 3: delegates to `build-evaluator` or creates evaluator via `create_llm_eval`
-- Verify Phase 4: generates a Python script with:
-  - `from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult`
-  - One `@job("OrqAgent")` using `orq.agents.responses.create()` (NOT `agents.invoke()`)
-  - One `@job("ReverseAgent")` wrapping the Python function
-  - An evaluator scorer invoking the orq.ai judge by ID
-  - A `evaluatorq()` call wiring jobs + data + evaluators
-- Verify: script uses A2A message format `{"role": "user", "parts": [{"kind": "text", "text": ...}]}` (NOT OpenAI-style)
-- Verify: does NOT hardcode datapoints inline if a dataset was created on the platform
-
-### Scenario 2: orq.ai vs orq.ai
-
-- Ask: "Compare two versions of my agent — orq-skills-test-echo with model gpt-4o-mini vs the same agent"
-- Verify: generates two orq.ai job patterns with different job names (e.g., `OrqAgent-A`, `OrqAgent-B`)
-- Verify: uses the same `agent_key` for both (since it's the same agent)
-- Verify: warns about same-model comparison if both use the same model
-
-### Scenario 3: TypeScript preference
-
-- Ask: "I want to benchmark a LangGraph agent against my orq.ai agent, using TypeScript"
-- Verify Phase 4: generates TypeScript, not Python
-- Verify: imports from `@orq-ai/evaluatorq`
-- Verify: uses `wrapLangGraphAgent` from `@orq-ai/evaluatorq/langchain` for the LangGraph job
-- Verify: uses `job()` function (not `@job` decorator)
-
-### Scenario 4: Skill boundary — redirects
-
-- Ask: "Create a dataset for testing my agents"
-- Verify: redirects to `generate-synthetic-dataset` (does NOT handle dataset creation itself)
-- Ask: "Run an experiment with my orq.ai deployment"
-- Verify: redirects to `run-experiment` (no external agents involved)
-
-### Scenario 5: Dataset bias prevention
-
-- Provide: two agents — one with a mock weather tool returning "Sunny, 22C", one with a real API
-- Ask: "Compare these agents on weather queries"
-- Verify: does NOT write expected outputs matching the mock data
-- Verify: expected outputs describe correctness criteria (e.g., "should include current temperature from a real source")
-
 ---

 ## Critical Files
@@ -150,7 +103,3 @@ Requires `setup.md` to have run first (seed data for `run-experiment` test).
 - `skills/optimize-prompt/SKILL.md`
 - `skills/analyze-trace-failures/SKILL.md`
 - `skills/run-experiment/SKILL.md`
-- `skills/compare-agents/SKILL.md`
-- `skills/compare-agents/resources/job-patterns.md`
-- `skills/compare-agents/resources/evaluatorq-api.md`
-- `skills/compare-agents/resources/gotchas.md`

From 9c615d8b12bbd9f91972e431e0590c4b030ceee0 Mon Sep 17 00:00:00 2001
From: Arian Pasquali
Date: Tue, 7 Apr 2026 11:18:23 +0200
Subject: [PATCH 9/9] fix: remove compare-agents tests re-introduced by merge

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 tests/skills.md | 51 -------------------------------------------------
 1 file changed, 51 deletions(-)

diff --git a/tests/skills.md b/tests/skills.md
index 33a15ba..269e1b5 100644
--- a/tests/skills.md
+++ b/tests/skills.md
@@ -89,53 +89,6 @@ Requires `setup.md` to have run first (seed data for `run-experiment` test).
 - Ask: "Run an experiment using orq-skills-test-dataset with orq-skills-test-eval-length"
 - Verify: calls `create_experiment` with correct references

-## `compare-agents`
-
-### Scenario 1: orq.ai vs external agent (Python)
-
-- Ask: "Compare my orq.ai agent orq-skills-test-echo against a simple Python function that reverses the input"
-- Verify Phase 1: identifies two agents — orq.ai (uses `search_entities` to find `orq-skills-test-echo`) and generic Python
-- Verify Phase 1: asks or confirms language preference (Python)
-- Verify Phase 2: delegates to `generate-synthetic-dataset` or creates dataset via `create_dataset` + `create_datapoints` with `orq-skills-test-` prefix
-- Verify Phase 3: delegates to `build-evaluator` or creates evaluator via `create_llm_eval`
-- Verify Phase 4: generates a Python script with:
-  - `from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult`
-  - One `@job("OrqAgent")` using `orq.agents.responses.create()` (NOT `agents.invoke()`)
-  - One `@job("ReverseAgent")` wrapping the Python function
-  - An evaluator scorer invoking the orq.ai judge by ID
-  - A `evaluatorq()` call wiring jobs + data + evaluators
-- Verify: script uses A2A message format `{"role": "user", "parts": [{"kind": "text", "text": ...}]}` (NOT OpenAI-style)
-- Verify: does NOT hardcode datapoints inline if a dataset was created on the platform
-
-### Scenario 2: orq.ai vs orq.ai
-
-- Ask: "Compare two versions of my agent — orq-skills-test-echo with model gpt-4o-mini vs the same agent"
-- Verify: generates two orq.ai job patterns with different job names (e.g., `OrqAgent-A`, `OrqAgent-B`)
-- Verify: uses the same `agent_key` for both (since it's the same agent)
-- Verify: warns about same-model comparison if both use the same model
-
-### Scenario 3: TypeScript preference
-
-- Ask: "I want to benchmark a LangGraph agent against my orq.ai agent, using TypeScript"
-- Verify Phase 4: generates TypeScript, not Python
-- Verify: imports from `@orq-ai/evaluatorq`
-- Verify: uses `wrapLangGraphAgent` from `@orq-ai/evaluatorq/langchain` for the LangGraph job
-- Verify: uses `job()` function (not `@job` decorator)
-
-### Scenario 4: Skill boundary — redirects
-
-- Ask: "Create a dataset for testing my agents"
-- Verify: redirects to `generate-synthetic-dataset` (does NOT handle dataset creation itself)
-- Ask: "Run an experiment with my orq.ai deployment"
-- Verify: redirects to `run-experiment` (no external agents involved)
-
-### Scenario 5: Dataset bias prevention
-
-- Provide: two agents — one with a mock weather tool returning "Sunny, 22C", one with a real API
-- Ask: "Compare these agents on weather queries"
-- Verify: does NOT write expected outputs matching the mock data
-- Verify: expected outputs describe correctness criteria (e.g., "should include current temperature from a real source")
-
 ---

 ## Critical Files
@@ -150,7 +103,3 @@ Requires `setup.md` to have run first (seed data for `run-experiment` test).
 - `skills/optimize-prompt/SKILL.md`
 - `skills/analyze-trace-failures/SKILL.md`
 - `skills/run-experiment/SKILL.md`
-- `skills/compare-agents/SKILL.md`
-- `skills/compare-agents/resources/job-patterns.md`
-- `skills/compare-agents/resources/evaluatorq-api.md`
-- `skills/compare-agents/resources/gotchas.md`
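
Review note: patch 7/9 replaced `get_evaluator_llm` and `get_evaluator_python` with `evaluator_get` in four skills, and patch 8/9 then had to fix the same names in `tests/mcp-tools.md`. A small check along the following lines might catch such leftovers before review. This is only a sketch using the Python standard library; the helper names `find_stale_names` and `scan_markdown_tree` are hypothetical and not part of this repository, and the list of stale names is taken from the commit messages above.

```python
import re
from pathlib import Path

# Tool names that this patch series replaces with `evaluator_get`
# (taken from the commit messages; extend the tuple as needed).
STALE_TOOL_NAMES = ("get_evaluator_llm", "get_evaluator_python")


def find_stale_names(text, stale_names=STALE_TOOL_NAMES):
    """Return (line_number, name) pairs for each stale name found in text."""
    pattern = re.compile("|".join(re.escape(name) for name in stale_names))
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


def scan_markdown_tree(root, stale_names=STALE_TOOL_NAMES):
    """Scan every *.md file under root; return {path: hits} for files with hits."""
    report = {}
    for path in sorted(Path(root).rglob("*.md")):
        hits = find_stale_names(path.read_text(encoding="utf-8"), stale_names)
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_markdown_tree(".")` from the repository root would return an empty dict once no stale names remain, so it could serve as a pre-merge check. It only matches literal names in Markdown files and does not verify that `evaluator_get` itself exists on the MCP server.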