diff --git a/README.md b/README.md index 7abf888..9181186 100644 --- a/README.md +++ b/README.md @@ -235,6 +235,7 @@ The `action-plan` skill can file tickets directly to Linear. See [Linear MCP set | **manage-memory** | Create and configure orq.ai Memory Stores for persistent context in conversational agents | [SKILL.md](skills/manage-memory/SKILL.md) | | **monitor-production** | Analyze production trace data for anomalies, cost trends, latency regressions, and emerging failure modes | [SKILL.md](skills/monitor-production/SKILL.md) | | **optimize-prompt** | Systematically iterate on a prompt deployment using trace data, A/B testing, and structured refinement techniques | [SKILL.md](skills/optimize-prompt/SKILL.md) | +| **prompt-learning** | Automatically improve prompts by collecting feedback, generating "If [TRIGGER] then [ACTION]" rules via a meta-prompt, and validating with multi-judge experiments | [SKILL.md](skills/prompt-learning/SKILL.md) | | **regression-test** | Run a quick regression check against a golden dataset to verify recent changes haven't degraded quality | [SKILL.md](skills/regression-test/SKILL.md) | | **run-experiment** | End-to-end LLM evaluation workflow — error analysis, dataset creation, experiment execution, result analysis, and ticket filing | [SKILL.md](skills/run-experiment/SKILL.md) | | **scaffold-integration** | Generate SDK integration code (Python or Node) for orq.ai agents, deployments, and knowledge bases in the user's codebase | [SKILL.md](skills/scaffold-integration/SKILL.md) | diff --git a/agents/AGENTS.md b/agents/AGENTS.md index eef8644..5b03898 100644 --- a/agents/AGENTS.md +++ b/agents/AGENTS.md @@ -20,6 +20,7 @@ These skills are: - manage-memory -> "skills/manage-memory/SKILL.md" - monitor-production -> "skills/monitor-production/SKILL.md" - optimize-prompt -> "skills/optimize-prompt/SKILL.md" + - prompt-learning -> "skills/prompt-learning/SKILL.md" - regression-test -> "skills/regression-test/SKILL.md" - run-experiment -> "skills/run-experiment/SKILL.md" - scaffold-integration -> "skills/scaffold-integration/SKILL.md" @@ -47,6 +48,7 @@ manage-deployment: `Configure, version, and manage orq.ai deployments — model manage-memory: `Create and configure orq.ai Memory Stores for persistent context in conversational agents` monitor-production: `Analyze production trace data for anomalies, cost trends, latency regressions, and emerging failure modes` optimize-prompt: `Systematically iterate on a prompt deployment using trace data, A/B testing, and structured refinement techniques` +prompt-learning: `Automatically improve prompts by collecting feedback, generating "If [TRIGGER] then [ACTION]" rules via a meta-prompt, and validating with multi-judge experiments` regression-test: `Run a quick regression check against a golden dataset to verify recent changes haven't degraded quality` run-experiment: `End-to-end LLM evaluation workflow — error analysis, dataset creation, experiment execution, result analysis, and ticket filing` scaffold-integration: `Generate SDK integration code (Python or Node) for orq.ai agents, deployments, and knowledge bases in the user's codebase` diff --git a/scripts/validate_prompt_learning.py b/scripts/validate_prompt_learning.py new file mode 100644 index 0000000..29aa501 --- /dev/null +++ b/scripts/validate_prompt_learning.py @@ -0,0 +1,286 @@ +#!/usr/bin/env -S uv run +# /// script +# requires-python = ">=3.10" +# dependencies = [] +# /// +"""Validate the prompt-learning skill integration. + +Checks: +1. 
Frontmatter matches existing skill patterns +2. All companion skill references point to existing skills +3. AGENTS.md includes the new skill entry +4. Meta-prompt is inline in SKILL.md with required structural elements +5. RES-205 research findings are reflected (domain gating, multi-judge, P=0) +""" + +from __future__ import annotations + +import re +import sys +from pathlib import Path + + +ROOT = Path(__file__).resolve().parent.parent +SKILL_DIR = ROOT / "skills" / "prompt-learning" +SKILL_MD = SKILL_DIR / "SKILL.md" +AGENTS_MD = ROOT / "agents" / "AGENTS.md" + +PASS = "\033[32m✓\033[0m" +FAIL = "\033[31m✗\033[0m" + + +def parse_frontmatter(text: str) -> dict[str, str]: + match = re.search(r"^---\s*\n(.*?)\n---\s*", text, re.DOTALL) + if not match: + return {} + data: dict[str, str] = {} + for line in match.group(1).splitlines(): + if ":" not in line: + continue + key, value = line.split(":", 1) + data[key.strip()] = value.strip() + return data + + +def collect_existing_skills() -> set[str]: + """Discover all skills that have a SKILL.md file.""" + skills = set() + for skill_md in ROOT.glob("skills/*/SKILL.md"): + meta = parse_frontmatter(skill_md.read_text(encoding="utf-8")) + name = meta.get("name") + if name: + skills.add(name) + return skills + + +def test_frontmatter() -> list[str]: + """Check 1: Frontmatter matches existing skill patterns.""" + errors = [] + + if not SKILL_MD.exists(): + return [f"SKILL.md not found at {SKILL_MD}"] + + text = SKILL_MD.read_text(encoding="utf-8") + meta = parse_frontmatter(text) + + for field in ("name", "description", "allowed-tools"): + if field not in meta: + errors.append(f"Missing frontmatter field: {field}") + + if meta.get("name") != "prompt-learning": + errors.append( + f"Frontmatter name '{meta.get('name')}' doesn't match " + f"directory 'prompt-learning'" + ) + + if not meta.get("description"): + errors.append("Frontmatter description is empty") + + allowed = meta.get("allowed-tools", "") + for tool in ("Bash", "Read", "Write", "Edit", "Grep", "Glob", "AskUserQuestion"): + if tool not in allowed: + errors.append(f"allowed-tools missing core tool: {tool}") + + reference_skill = ROOT / "skills" / "optimize-prompt" / "SKILL.md" + if reference_skill.exists(): + ref_meta = parse_frontmatter( + reference_skill.read_text(encoding="utf-8") + ) + ref_fields = set(ref_meta.keys()) + our_fields = set(meta.keys()) + missing = ref_fields - our_fields + if missing: + errors.append( + f"Frontmatter missing fields present in optimize-prompt: {missing}" + ) + + return errors + + +def test_companion_skills() -> list[str]: + """Check 2: All companion skill references point to existing skills.""" + errors = [] + + if not SKILL_MD.exists(): + return [f"SKILL.md not found at {SKILL_MD}"] + + text = SKILL_MD.read_text(encoding="utf-8") + existing = collect_existing_skills() + + companion_section = re.search( + r"\*\*Companion skills:\*\*\s*\n((?:- .*\n)*)", text + ) + if not companion_section: + errors.append("No 'Companion skills' section found") + return errors + + companion_names = re.findall(r"`([^`]+)`", companion_section.group(1)) + if not companion_names: + errors.append("No companion skills listed") + return errors + + for name in companion_names: + if name not in existing: + errors.append(f"Companion skill '{name}' does not exist as a skill") + + for name in companion_names: + companion_path = ROOT / "skills" / name / "SKILL.md" + if companion_path.exists(): + companion_text = companion_path.read_text(encoding="utf-8") + if "prompt-learning" not in 
companion_text: + errors.append( + f"Companion '{name}' does not reference 'prompt-learning' back" + ) + + return errors + + +def test_agents_md() -> list[str]: + """Check 3: AGENTS.md has correct formatting with new entry.""" + errors = [] + + if not AGENTS_MD.exists(): + return [f"AGENTS.md not found at {AGENTS_MD}"] + + text = AGENTS_MD.read_text(encoding="utf-8") + + expected_path = 'prompt-learning -> "skills/prompt-learning/SKILL.md"' + if expected_path not in text: + errors.append(f"AGENTS.md missing path entry: {expected_path}") + + if "prompt-learning:" not in text: + errors.append("AGENTS.md missing description entry for prompt-learning") + + path_entries = re.findall(r" - (\S+) -> ", text) + if path_entries: + sorted_entries = sorted(path_entries, key=str.lower) + if path_entries != sorted_entries: + errors.append("AGENTS.md skill list is not alphabetically sorted") + + return errors + + +def test_inline_meta_prompt() -> list[str]: + """Check 4: Meta-prompt template is inline in SKILL.md with required elements.""" + errors = [] + + if not SKILL_MD.exists(): + return [f"SKILL.md not found at {SKILL_MD}"] + + text = SKILL_MD.read_text(encoding="utf-8") + + resources_dir = SKILL_DIR / "resources" + if resources_dir.exists() and (resources_dir / "meta-prompt.md").exists(): + errors.append( + "Meta-prompt exists as separate file resources/meta-prompt.md — " + "should be inlined in SKILL.md Phase 3" + ) + + if "Follow the meta-prompt process below" not in text: + errors.append("SKILL.md Phase 3 missing inline meta-prompt instruction") + + required_elements = [ + ("GOAL", "meta-prompt GOAL section"), + ("FAILURE_EXAMPLES", "failure examples input"), + ("FEEDBACK SHAPES", "feedback shape reference"), + ("STEP 1", "failure pattern analysis step"), + ("RULES_TO_APPEND", "rules output format"), + ("REGRESSION_TESTS", "regression test generation"), + ("ITERATION_GUIDANCE", "iteration guidance output"), + ("If [TRIGGER]", "rule format specification"), + ("LEARNED_RULES", "learned rules section reference"), + ] + for element, description in required_elements: + if element not in text: + errors.append(f"SKILL.md missing meta-prompt element: {description} ({element})") + + taxonomy_section = text.split("Issue Taxonomy")[1].split("##")[0] if "Issue Taxonomy" in text else "" + expected_tags = { + "accuracy", "missing_requirement", "policy", "safety", + "formatting", "verbosity", "tone", "tool_use", "reasoning", + "hallucination", + } + taxonomy_tags = re.findall(r"`(\w+)`", taxonomy_section) + found_tags = {t for t in taxonomy_tags if t in expected_tags} + missing_tags = expected_tags - found_tags + if missing_tags: + errors.append(f"Issue Taxonomy section missing tags: {missing_tags}") + + return errors + + +def test_research_alignment() -> list[str]: + """Check 5: RES-205 research findings are reflected in SKILL.md.""" + errors = [] + + if not SKILL_MD.exists(): + return [f"SKILL.md not found at {SKILL_MD}"] + + text = SKILL_MD.read_text(encoding="utf-8") + + # Domain gating + if "When NOT to use" not in text: + errors.append("Missing 'When NOT to use' section (domain gating)") + if "focused" not in text.lower(): + errors.append("Missing focused domain guidance") + if "broad" not in text.lower() and "general" not in text.lower(): + errors.append("Missing warning about broad/general domains") + + # Multi-judge validation + if "multi-judge" not in text.lower() and "multi_judge" not in text.lower(): + errors.append("Missing multi-judge validation requirement") + if "3+" not in text and "3 " not in 
text: + errors.append("Missing 3+ judge requirement") + if "overestimate" not in text.lower(): + errors.append("Missing single-judge overestimation warning (40-60%)") + + # P=0 default + defaults_section = text.split("## Defaults")[1].split("##")[0] if "## Defaults" in text else "" + if "P=0" not in defaults_section and "p) | 0" not in defaults_section: + errors.append("Defaults table should show P=0 (research finding)") + + # Ceiling effect + if "ceiling" not in text.lower(): + errors.append("Missing model ceiling effect warning") + + # No "preliminary" or "pending" language + if "preliminary" in text.lower() or "pending validation" in text.lower(): + errors.append("Still contains 'preliminary'/'pending' language — results are final") + + # Reference comparison anti-pattern + if "reference" not in text.lower() or "worse" not in text.lower(): + errors.append("Missing anti-pattern about reference comparisons making results worse") + + return errors + + +def main() -> None: + checks = [ + ("Frontmatter matches existing skill patterns", test_frontmatter), + ("Companion skill references are valid", test_companion_skills), + ("AGENTS.md includes prompt-learning correctly", test_agents_md), + ("Meta-prompt is inline and well-structured", test_inline_meta_prompt), + ("RES-205 research findings reflected", test_research_alignment), + ] + + total_errors = 0 + for label, check_fn in checks: + errors = check_fn() + if errors: + print(f"{FAIL} {label}") + for err in errors: + print(f" - {err}") + total_errors += len(errors) + else: + print(f"{PASS} {label}") + + print() + if total_errors: + print(f"{total_errors} error(s) found.") + sys.exit(1) + else: + print("All checks passed.") + + +if __name__ == "__main__": + main() diff --git a/skills/build-evaluator/SKILL.md b/skills/build-evaluator/SKILL.md index baf2066..3cf8a0f 100644 --- a/skills/build-evaluator/SKILL.md +++ b/skills/build-evaluator/SKILL.md @@ -8,6 +8,10 @@ allowed-tools: Bash, Read, Write, Edit, Grep, Glob, WebFetch, Task, AskUserQuest Design and create production-grade LLM evaluators on the orq.ai platform, grounded in evaluation best practices. 
+**Companion skills:** +- `prompt-learning` — uses evaluator scores as AI feedback to generate prompt rules +- `run-experiment` — run experiments using evaluators built with this skill + ## When to use - User asks to create an LLM-as-a-Judge evaluator diff --git a/skills/feedback-loop/SKILL.md b/skills/feedback-loop/SKILL.md index 34df427..91a99af 100644 --- a/skills/feedback-loop/SKILL.md +++ b/skills/feedback-loop/SKILL.md @@ -11,6 +11,7 @@ Set up user feedback collection and analyze feedback patterns to drive data-info **Companion skills:** - `trace-analysis` — deep-dive into traces flagged by negative feedback - `action-plan` — prioritize improvements based on feedback patterns +- `prompt-learning` — automatically turn feedback patterns into prompt rules - `scaffold-integration` — generate SDK code for feedback collection ## When to use diff --git a/skills/optimize-prompt/SKILL.md b/skills/optimize-prompt/SKILL.md index 69a646c..0004da5 100644 --- a/skills/optimize-prompt/SKILL.md +++ b/skills/optimize-prompt/SKILL.md @@ -12,6 +12,7 @@ Systematically improve prompt deployments through trace-driven failure analysis, - `trace-analysis` — identify failure patterns that inform prompt edits - `run-experiment` — run A/B experiments comparing prompt versions - `build-evaluator` — create evaluators to measure prompt improvements +- `prompt-learning` — automated feedback-driven rule generation (complementary approach) ## When to use diff --git a/skills/prompt-learning/SKILL.md b/skills/prompt-learning/SKILL.md new file mode 100644 index 0000000..f5990c9 --- /dev/null +++ b/skills/prompt-learning/SKILL.md @@ -0,0 +1,404 @@ +--- +name: prompt-learning +description: Automatically improve prompts by collecting feedback, generating "If [TRIGGER] then [ACTION]" rules via a meta-prompt, and validating with multi-judge experiments +allowed-tools: Bash, Read, Write, Edit, Grep, Glob, WebFetch, Task, AskUserQuestion, mcp__linear-server__*, orq* +--- + +# Prompt Learning + +Automatically improve prompts through feedback-driven rule generation. Collects human or AI feedback, normalizes it to a shared representation, generates targeted "If [TRIGGER] then [ACTION]" rules via a meta-prompt, and appends them to a `### LEARNED_RULES` section — then validates with multi-judge experiments. + +This skill is **automated and feedback-driven**, distinct from `optimize-prompt` (which is manual/trace-driven). The pipeline is: Collect → Normalize → Meta-Prompt → Aggregate → Apply → Validate. + +**Companion skills:** +- `feedback-loop` — set up feedback collection and analyze feedback patterns +- `optimize-prompt` — manual trace-driven prompt refinement (complementary approach) +- `run-experiment` — run A/B experiments comparing prompt versions +- `build-evaluator` — create evaluators to measure prompt improvements +- `trace-analysis` — deep-dive into traces to understand failure modes + +## When to use + +- User has feedback (human or AI evaluator) and wants automated prompt improvement +- User wants to learn rules from production feedback patterns +- User asks "how do I automatically improve my prompt from feedback?" 
+- Action plan recommends feedback-driven prompt improvement
- User wants to close the loop between feedback collection and prompt updates
- User has evaluator scores and wants to turn failures into prompt rules
- The target prompt serves a **focused domain** (email writing, code review, customer support, data extraction)

## When NOT to use

- **Broad/general tasks** — prompt learning shows 0% significant improvement on general helpfulness or open-ended chat (RES-205: 0/70 configs significant on MTBench helpfulness)
- **Top-tier models already at ceiling** — models scoring >4.5/5 on baseline show no improvement (GPT-4o at 4.75/5 had zero gain)
- **Without multi-judge validation** — single-judge evaluation overestimates improvement by 40-60%. If you cannot set up 3+ diverse judge models, results will be unreliable
- Use `optimize-prompt` instead for manual, trace-driven refinement on any domain type

## orq.ai Documentation

Consult these docs when working with the orq.ai platform:
- **Prompts overview:** https://docs.orq.ai/docs/prompts/overview
- **Prompt management:** https://docs.orq.ai/docs/prompts/management
- **Prompt versioning:** https://docs.orq.ai/docs/prompts/versioning
- **Deployments overview:** https://docs.orq.ai/docs/deployments/overview
- **Experiments:** https://docs.orq.ai/docs/experiments/creating
- **Traces:** https://docs.orq.ai/docs/observability/traces
- **Feedback:** https://docs.orq.ai/docs/feedback/overview
- **Evaluators:** https://docs.orq.ai/docs/evaluators/overview

### orq.ai Prompt Capabilities
- Prompts are versioned — each edit creates a new version, previous versions are preserved
- Deployments link to specific prompt versions and model configurations
- Experiments can compare two prompt versions on the same dataset
- Template variables: `{{log.input}}`, `{{log.output}}`, `{{log.messages}}`, `{{log.retrievals}}`, `{{log.reference}}`
- Rules are appended to a `### LEARNED_RULES` section — the rest of the prompt remains untouched

### orq MCP Tools

Use the orq MCP server (`https://my.orq.ai/v2/mcp`) as the primary interface. For operations not yet available via MCP, use the HTTP API as fallback.

**Available MCP tools for this skill:**

| Tool | Purpose |
|------|---------|
| `search_entities` | Find prompts (`type: "prompts"`) and deployments |
| `list_traces` | Pull recent traces with feedback data |
| `list_spans` | List spans within a trace |
| `get_span` | Get detailed span information |
| `create_experiment` | Run A/B experiment comparing prompt versions |
| `list_experiment_runs` | Check experiment progress |
| `get_experiment_run` | Get experiment results |

**HTTP API fallback** (for operations not yet in MCP; `<prompt_id>` is a placeholder for the target prompt's ID):

```bash
# List prompts
curl -s https://my.orq.ai/v2/prompts \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" | jq

# Get prompt details with versions
curl -s https://my.orq.ai/v2/prompts/<prompt_id> \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" | jq

# Create a new prompt version
curl -s -X POST https://my.orq.ai/v2/prompts/<prompt_id>/versions \
  -H "Authorization: Bearer $ORQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [...], "model": "...", "parameters": {...}}' | jq
```

## Core Principles

### 1. Focused Domains Only
Prompt learning works on **narrow, well-defined tasks** (email writing, code review, customer support, data extraction).
It does not work on broad/general helpfulness — RES-205 showed 0% significant improvement across 70 configurations on general helpfulness tasks. Always verify the target prompt serves a focused domain before proceeding. + +### 2. Feedback Is Fuel, Not Truth +Human and AI feedback use the **same core method** — only preprocessing differs. Normalize all feedback to a shared representation (verdict + severity + issue tags + expected behavior) before processing. Never trust a single piece of feedback in isolation. + +### 3. Only Recurring Patterns Get Rules +Require 2+ occurrences of a pattern before generating a rule. One-off issues are noise, not signal. The meta-prompt explicitly skips single-occurrence patterns. + +### 4. Rules Are Additive, Never Destructive +Rules are appended to a `### LEARNED_RULES` section in the prompt. Never rewrite or remove existing prompt instructions. Rules augment the prompt — they don't replace it. + +### 5. Multi-Judge Validation Is Mandatory +Never validate with a single judge model. Single-judge evaluation overestimates improvement by 40-60% (RES-205: single-judge showed +40%, multi-judge showed +6%). Always use 3+ diverse judge models for validation experiments. + +## Issue Taxonomy + +The meta-prompt classifies failures into these types: + +| Issue Type | Description | +|------------|-------------| +| `accuracy` | Factually incorrect or imprecise outputs | +| `missing_requirement` | Fails to address part of the user's request | +| `policy` | Violates organizational policies or guidelines | +| `safety` | Produces harmful, biased, or inappropriate content | +| `formatting` | Wrong output structure, missing fields, schema violations | +| `verbosity` | Too long or too short for the context | +| `tone` | Inappropriate register, persona drift | +| `tool_use` | Wrong tool selected, incorrect arguments, misinterpreted results | +| `reasoning` | Flawed logic, incorrect deductions | +| `hallucination` | Fabricated facts, citations, or capabilities | + +## Destructive Actions + +The following actions require explicit user confirmation via `AskUserQuestion` before execution: +- Applying generated rules to a prompt (creating a new version with `### LEARNED_RULES`) +- Promoting a rule-enhanced prompt version to a production deployment +- Removing or modifying existing learned rules + +## Defaults + +Research-validated configuration (RES-205): + +| Parameter | Default | Range | Notes | +|-----------|---------|-------|-------| +| Failures per batch (f) | 10 | 5-15 | f=10 is optimal; f=5 too few to find patterns | +| Positives per batch (p) | 0 | 0-5 | P=0 outperforms P=3-5 on focused domains | +| Iterations | 1 | 1-2 | 1 iteration gives best results; 2 for consistency. More causes prompt bloat | +| Occurrence threshold | 2+ | — | One-offs are skipped | +| Rules per iteration | 1-5 | — | Prioritized by frequency × severity | +| Total rule cap | 10 | — | Across all iterations | +| Validation judges | 3+ | 3-5 | Diverse models required; single-judge overestimates by 40-60% | + +**Expected effect size:** +0.4 to +0.9 on a 5-point scale for focused domains. Set expectations accordingly. 
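If you are scripting this workflow, the table above can be pinned down as data. A minimal sketch as a Python dict, where the name `PROMPT_LEARNING_DEFAULTS` and the field names are illustrative rather than part of the orq.ai SDK:

```python
# Research-validated defaults (RES-205), mirroring the table above.
# Illustrative sketch: adapt names and types to your own tooling.
PROMPT_LEARNING_DEFAULTS = {
    "failures_per_batch": 10,      # f=10 is optimal; f=5 is too few to surface patterns
    "positives_per_batch": 0,      # P=0 outperforms P=3-5 on focused domains
    "iterations": 1,               # stop at 1; more iterations cause prompt bloat
    "occurrence_threshold": 2,     # one-off issues never become rules
    "max_rules_per_iteration": 5,  # prioritized by frequency × severity
    "total_rule_cap": 10,          # across all iterations
    "min_validation_judges": 3,    # single-judge validation overestimates gains by 40-60%
}
```

Treat these as starting points; the model-specific configs below override them per model family.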
+ +**Model tier matters:** + +| Model Tier | Recommendation | +|------------|----------------| +| Small (Claude Haiku, Gemini Flash) | Best candidate — +40% improvement observed | +| Mid-tier (GPT-4o-mini) | Good candidate — room to improve | +| Top-tier (GPT-4o, Claude Sonnet) | Skip — likely at ceiling (>4.5/5 baseline), -3% to -10% observed | + +**Model-specific optimal configs:** + +| Model Family | F | P | Iterations | Notes | +|-------------|---|---|------------|-------| +| Claude | 10 | 0 | 3 | Small models (Haiku) learn best | +| Gemini | 15 | 0 | 1 | Higher failure count, single iteration | +| GPT | 10-15 | 3-5 | 3 | Benefits from positive anchors unlike others | +| Other / unknown | 10 | 0 | 1 | Conservative defaults; not experimentally validated — monitor closely | + +**Split model strategy (recommended):** Use a cheap model as the learner (the model being improved) and a powerful model as the generator (the model running the meta-prompt). RES-205 showed +20% win rate with split models vs +13% with same-model — the powerful generator produces better rules when analyzing a smaller model's failures. + +## Steps + +Follow these steps **in order**. Do NOT skip steps. + +### Phase 1: Identify Target and Assess Feasibility + +1. **Identify the target prompt/deployment:** + - Use `search_entities` with `type: "prompts"` to find the target prompt + - Use HTTP API to get full prompt details including current version + - Document: system message, user template, model, parameters + +2. **Assess feasibility — check domain and model tier:** + - **Domain check:** Is this a focused task (email, code review, support, extraction) or a broad/general task (open chat, helpfulness)? + - If broad/general → **stop**. Inform user that prompt learning is not effective for this domain type. Suggest `optimize-prompt` instead. + - **Model check:** What is the baseline quality score? + - If baseline > 4.5/5 → **warn** the user that top-tier models show ceiling effects and prompt learning may not yield improvement. + - **Present assessment** to the user before proceeding. + +3. **Collect feedback data:** + - Use `list_traces` to pull traces with feedback for the target deployment + - Collect at least 50 traces (more is better) from a meaningful time period + - Separate into: negative feedback traces and positive feedback traces + - Identify the feedback source: human (thumbs up/down, corrections, free-text) or AI evaluator scores + - **Prefer freetext feedback** when available — RES-205 showed +26.7% improvement from freetext vs +6.7% from categorical feedback. If the user only has thumbs up/down, recommend enriching with freetext explanations. + +4. **Verify sufficient data:** + - Need at least f=10 failure traces to proceed + - If insufficient, inform user and suggest using `feedback-loop` to set up collection first + +### Phase 2: Normalize Feedback (Normalize) + +5. 
**Normalize feedback to shared representation:**

All feedback — human or AI — gets normalized to this shape (`<issue_tag>` is one of the taxonomy tags above):
```json
{
  "verdict": "fail" | "pass" | "borderline",
  "severity": 1-5,
  "issue_tags": ["<issue_tag>", "..."],
  "expected_behavior": "what should have happened"
}
```

**Severity mapping:**

| Condition | Severity |
|-----------|----------|
| Score on 1-5 scale (5 best) | `severity = 6 - score` |
| Score on 1-10 scale (10 best) | `severity = ceil((10 - score) / 2)` |
| Policy/safety violation or hallucination | 5 |
| Wrong answer or missed key requirement | 4 |
| Partial/incomplete response | 3 |
| Minor format/tone/verbosity issue | 2 |
| Nitpick/stylistic preference | 1 |

**For human feedback:**
- Thumbs down → `{"verdict": "fail", "severity": 3, "issue_tags": [], "expected_behavior": ""}`
- Free-text correction → extract issue tags and expected behavior from the text
- Numerical rating (e.g., 1-5) → map to verdict (1-2: fail, 3: borderline, 4-5: pass), use severity mapping above

**For AI evaluator feedback:**
- Boolean false → `{"verdict": "fail", "severity": 3, "issue_tags": [], "expected_behavior": ""}`
- Categorical/numerical → map to verdict based on scale, carry explanation through

If raw feedback lacks explanations (e.g., bare thumbs-down), use the LLM to enrich: pass the input/output pair and ask for a brief failure analysis to populate `issue_tags` and `expected_behavior`.

6. **Sample the batch:**
- Sample f=10 representative failures (diverse issue types, not all the same failure)
- If more failures exist, prioritize diversity across issue types

### Phase 3: Generate Rules (Meta-Prompt)

7. **Build the meta-prompt** by filling in the template below with collected data:
- `PROMPT_TYPE`: "agent" or "evaluator" based on target
- `CURRENT_PROMPT`: full text of the current prompt version
- `ITERATION`: current iteration number (starts at 1)
- `FEEDBACK_SOURCE`: "human" or "ai_eval"
- `FAILURE_EXAMPLES`: the f=10 sampled failures with normalized feedback

8. **Follow the meta-prompt process below** with the variables filled in from the collected data:

~~~
You are a prompt engineer improving a prompt based on feedback from multiple examples.

GOAL: Analyze a batch of feedback failures and produce minimal, high-impact rules that fix recurring failure patterns.

INPUTS:
1) PROMPT_TYPE: "agent" | "evaluator"
2) CURRENT_PROMPT: The prompt to improve
3) ITERATION: Current iteration number
4) FEEDBACK_SOURCE: "human" | "ai_eval"
5) FAILURE_EXAMPLES (10 samples with negative feedback):
[{"user_input": "...", "model_output": "...", "feedback": <feedback>}, ...]

FEEDBACK SHAPES:
- Human categorical: "fail" | "pass" | "borderline"
- Human numerical: 3 (just the number)
- Human free text: "The response was too vague..."
- AI eval boolean: {"value": true|false, "explanation": "..."}
- AI eval categorical: {"value": "A"|"B"|"C", "explanation": "..."}
- AI eval numerical: {"value": 6, "scale": "1-10", "explanation": "..."}
- Enriched normalized: {"verdict": "fail", "severity": 4, "issue_tags": ["missing_requirement"], "expected_behavior": "..."}

PROCESS:

STEP 1 — ANALYZE FAILURE PATTERNS:
Group failures by issue type. Identify recurring patterns (2+ occurrences).
Issue taxonomy: accuracy, missing_requirement, policy, safety, formatting, verbosity, tone, tool_use, reasoning, hallucination.
+ Output: {"patterns": [{"issue_tag": "...", "count": N, "severity": 1-5, "examples": [indices], "root_cause": "..."}], "one_off_issues": [...]} + + STEP 2 — GENERATE RULES (only for recurring patterns): + Create 1-5 rules. Format: "If [TRIGGER], then [ACTION]." + Prioritize by: frequency × severity. + Skip: one-offs, patterns too vague to test. + + STEP 3 — FORMAT RULES_TO_APPEND: + Text block for ### LEARNED_RULES section. + + STEP 4 — GENERATE REGRESSION TESTS: + Create 5-10 test cases: 3-5 "should_now_pass" + 2-5 "should_still_pass". + + STEP 5 — ITERATION GUIDANCE: + Recommend "stop" (default after iteration 1) or "continue" (only if major patterns remain unfixed). + + OUTPUT FORMAT: + A) PATTERN_ANALYSIS — JSON with patterns and one_off_issues + B) RULES — numbered list + C) RULES_TO_APPEND — text block for the prompt + D) REGRESSION_TESTS — JSON array of test cases + E) ITERATION_GUIDANCE — {"recommendation": "continue"|"stop", "reason": "...", "remaining_issues": N} + + NOW PROCESS THE ACTUAL INPUT. + ~~~ + +9. **Produce the structured output** (sections A through E) from the analysis above. + +10. **Review the output** with the user: + - Show the identified patterns and their frequency/severity + - Show the generated rules + - Present ITERATION_GUIDANCE recommendation + +### Phase 4: Aggregate and Apply + +11. **Aggregate rules** across iterations (if iteration > 1): + - Merge new rules with existing `### LEARNED_RULES` section + - Remove duplicates or conflicting rules + - Enforce total rule cap of 10 + +12. **Apply rules to the prompt** — **ask user confirmation first:** + - Create a new prompt version with `### LEARNED_RULES` section appended + - Use HTTP API to create the new version + - Document what rules were added and which patterns they address + + Format of the appended section: + ``` + ### LEARNED_RULES + - If [TRIGGER], then [ACTION]. + - If [TRIGGER], then [ACTION]. + ... + ``` + +### Phase 5: Validate (Multi-Judge Experiment) + +13. **Set up a multi-judge validation experiment:** + - Use `create_experiment` to compare baseline (no rules) vs variant (with rules) + - Use the same dataset for both runs (10-50 examples) + - **Configure 3+ diverse judge models** (e.g., Gemini, GPT, Claude) — single-judge overestimates by 40-60% + - Include evaluators that measure the targeted failure types + - Include the regression tests from the meta-prompt output + +14. **Run the experiment and analyze results:** + - Use `list_experiment_runs` to monitor progress + - Use `get_experiment_run` to fetch results + - Compare across **all judges**: + ``` + | Judge Model | Evaluator | Baseline | Variant | Delta | + |-------------|-----------|----------|---------|-------| + | [judge 1] | [metric] | X% | Y% | +Z% | + | [judge 2] | [metric] | X% | Y% | +Z% | + | [judge 3] | [metric] | X% | Y% | +Z% | + ``` + +15. **Decision framework:** + - **Clear win** — majority of judges show improvement, no regression → Promote variant + - **Mixed results** — judges disagree → Investigate, may be noise + - **No improvement** — most judges show no change → Re-examine feedback, try different samples + - **Regression** — any judge shows regression → Revert, rules may be too aggressive + - **Single judge shows large gain but others don't** → Discard. This is the 40-60% overestimation pattern. + +### Phase 6: Iterate (Usually Stop at 1) + +16. 
**Check iteration guidance:** + - **Default: stop after iteration 1.** Research shows 1 iteration gives best results; more iterations cause prompt bloat and diminishing returns. + - If significant failure patterns remain AND iteration < 2: + - Return to Phase 2 Step 6 with updated prompt (now including rules) + - Use remaining unprocessed failures + - Increment iteration counter + - If meta-prompt recommends `"stop"` OR iteration = 2: + - Present final summary to user + - If validated, **ask user confirmation** to promote to production deployment + +17. **Final summary:** + ``` + ## Prompt Learning Summary + - **Target:** [prompt/deployment name] + - **Domain type:** [focused domain description] + - **Iterations completed:** [N] + - **Rules generated:** [N] + - **Feedback source:** human | ai_eval + - **Key patterns addressed:** [list] + - **Multi-judge validation:** [pass/fail with per-judge metrics] + - **Effect size:** [delta on 5-point scale] + - **Status:** [promoted / pending promotion / reverted] + ``` + +## Anti-Patterns + +| Anti-Pattern | Why It's Wrong | What to Do Instead | +|---|---|---| +| Using on broad/general tasks | 0% significant improvement on helpfulness (RES-205) | Only use on focused domains (email, code review, support) | +| Validating with a single judge | Overestimates improvement by 40-60% | Always use 3+ diverse judge models | +| Acting on single-occurrence feedback | Noise, not signal | Require 2+ occurrences before generating rules | +| Rewriting the whole prompt with rules | Destroys existing instructions | Only append to `### LEARNED_RULES` section | +| Running more than 2 iterations | Prompt bloat, diminishing returns | Stop at 1 iteration (2 max) | +| Treating human and AI feedback differently in the pipeline | Research shows same method works for both | Normalize to shared representation, then process identically | +| Deploying rules without multi-judge validation | No reliable evidence the rules actually help | Always validate with 3+ judges before promoting | +| Applying to top-tier models at ceiling | Models scoring >4.5/5 show no improvement | Check baseline score first; skip if already high | +| Adding reference comparisons to the meta-prompt | CriSPO-style references made results worse (-21% vs -14%) | Use failure-only analysis without reference comparison | + +## Open in orq.ai + +After completing this skill, direct the user to the relevant platform page: + +- **View/edit the prompt:** `https://my.orq.ai/prompts` — review the prompt with `### LEARNED_RULES` section +- **Check traces with feedback:** `https://my.orq.ai/traces` — inspect traces that provided the feedback signal +- **View experiment results:** `https://my.orq.ai/experiments` — review the multi-judge validation experiment +- **Feedback overview:** `https://my.orq.ai/feedback` — monitor ongoing feedback collection diff --git a/skills/run-experiment/SKILL.md b/skills/run-experiment/SKILL.md index 566a8e8..38c26dc 100644 --- a/skills/run-experiment/SKILL.md +++ b/skills/run-experiment/SKILL.md @@ -8,7 +8,9 @@ allowed-tools: Bash, Read, Write, Edit, Grep, Glob, Task, AskUserQuestion, mcp__ End-to-end workflow for evaluating LLM pipelines using the orq.ai platform, grounded in evaluation best practices. -**Companion skill:** `build-evaluator` — use that skill for detailed judge prompt design. This skill orchestrates the broader workflow that wraps around it. +**Companion skills:** +- `build-evaluator` — use that skill for detailed judge prompt design. 
This skill orchestrates the broader workflow that wraps around it. +- `prompt-learning` — automated feedback-driven rule generation validated via experiments ## When to use diff --git a/skills/trace-analysis/SKILL.md b/skills/trace-analysis/SKILL.md index cf9c9e2..d454174 100644 --- a/skills/trace-analysis/SKILL.md +++ b/skills/trace-analysis/SKILL.md @@ -11,6 +11,7 @@ Systematic methodology for reading LLM traces, identifying failure modes, and bu **Companion skills:** - `build-evaluator` — build automated evaluators for persistent failure modes - `action-plan` — turn findings into prioritized improvement plans +- `prompt-learning` — automatically turn trace-identified patterns into prompt rules ## When to use