diff --git a/agent-eval-mlflow-otel/.gitignore b/agent-eval-mlflow-otel/.gitignore new file mode 100644 index 0000000..6c64d2f --- /dev/null +++ b/agent-eval-mlflow-otel/.gitignore @@ -0,0 +1,10 @@ +__pycache__/ +*.pyc +.env +*.egg-info/ +results/ +output/ +*.jsonl +!fixtures/*.yaml +venv/ +.venv/ diff --git a/agent-eval-mlflow-otel/README.md b/agent-eval-mlflow-otel/README.md new file mode 100644 index 0000000..2584d58 --- /dev/null +++ b/agent-eval-mlflow-otel/README.md @@ -0,0 +1,299 @@ +# Experiment: Agent Evaluation via MLflow + OpenTelemetry + +**Date:** 2026-06-09 +**Status:** Complete +**Authors:** @ascerra + +Evaluates whether MLflow 3.x + OpenTelemetry can serve as a complete evaluation platform for autonomous AI agents in an SDLC pipeline — replacing ad-hoc "merge and hope" with instrumented, scored, regression-tested prompt changes. + +Related: [fullsend-ai/fullsend#1682](https://github.com/fullsend-ai/fullsend/pull/1682) — functional eval pattern (complementary work) + +> **Internal:** The production versions of these scripts (harness, scorers, trace export, CI workflows) live at [fullsend-ai/features](https://github.com/fullsend-ai/features). The examples here are simplified, standalone excerpts. + +## Hypothesis + +A single platform (MLflow) combined with OpenTelemetry trace instrumentation can: + +1. **Capture** rich agent execution traces (tool calls, reasoning turns, cost, tokens) without modifying the agent runtime +2. **Score** those traces with both mechanical (free, instant) and LLM-as-judge (semantic) scorers +3. **Gate PRs** by triggering real agent runs against test fixtures and blocking merges that degrade quality +4. **Detect regressions** in production via daily comparison against curated golden baselines +5. 
**Version prompts** with staging/production aliases tied to git commits, enabling prompt-to-output lineage + +If all five hold, teams can treat prompt engineering like software engineering — with CI, regression tests, and quality dashboards. + +## Background + +### The problem + +Fullsend agents (triage, code, review, fix, retro ... eventually explore, refine, critique) run autonomously in sandboxed containers. When someone changes an agent prompt, there is: + +- No quality metric — "did the explore agent get better at finding context?" +- No regression detection — a subtle change could break decomposition quality silently +- No before/after comparison — every PR is an opinion with zero supporting data + +### Prior art evaluated + +| Tool/Approach | Finding | +|---------------|---------| +| [Arize Phoenix](https://phoenix.arize.com) | Strong trace UI and evals, but no built-in prompt registry or alias-based versioning. OTLP ingest supported. | +| MLflow 3.x | OTLP traces, prompt versioning, quality dashboard, evaluation runs, datasets — all in one | + +MLflow 3.x was chosen because it natively accepts OTLP traces, has a built-in Prompts Registry with aliasing, and its `genai.evaluate()` API logs Feedback objects that populate a Quality Dashboard without custom visualization work. 
+ +## Architecture + +![Architecture](diagrams/architecture.png) + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ DATA CAPTURE (OTLP) │ +│ │ +│ Agent sandbox ──► otel-trace-context.sh (W3C traceparent) │ +│ │ pipeline-events.sh (phase timing) │ +│ │ run-events.jsonl (CLI lifecycle) │ +│ ▼ │ +│ send-trace.py ──► OTLP HTTP POST ──► MLflow /v1/traces │ +│ │ │ +│ ├── set_mlflow_trace_tags() (agent, work_item) │ +│ ├── fix_session_metadata() (group by issue) │ +│ └── link_prompt_to_trace() (@production lineage) │ +│ │ +├─────────────────────────────────────────────────────────────────┤ +│ PLATFORM (MLflow 3.x) │ +│ │ +│ Traces ◄──────── Every agent run, with full span tree │ +│ Sessions ◄────── Grouped by work item (github:84, jira:TC-42) │ +│ Prompts ◄─────── Versioned, @staging / @production aliases │ +│ Quality ◄─────── Dashboard with scorer trends over time │ +│ Eval Runs ◄───── mlflow.genai.evaluate() results │ +│ Datasets ◄────── Golden baselines for regression detection │ +│ │ +├─────────────────────────────────────────────────────────────────┤ +│ SCORING │ +│ │ +│ Mechanical (5) │ LLM-as-Judge (8) │ +│ ───────────── │ ──────────────── │ +│ validation_passed │ explore_context_quality │ +│ tool_efficiency │ refine_decomposition_quality │ +│ cost_within_budget│ critique_verdict_accuracy │ +│ confidence_coher. 
│ reasoning_coherence │ +│ iteration_count │ triage_action_correctness │ +│ │ triage_comment_quality │ +│ │ refine_confidence_honesty │ +│ │ refine_output_quality │ +│ │ +├─────────────────────────────────────────────────────────────────┤ +│ CI INTEGRATION │ +│ │ +│ eval-gate.yml ──── PR quality gate (fixture + scorer check) │ +│ eval-monitor.yml ─ Daily cron → score traces → Slack alert │ +│ register-prompts ─ @staging on PR, @production on merge │ +│ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +## Setup + +### Prerequisites + +- MLflow 3.x instance with OTLP ingest enabled (`/v1/traces` endpoint) +- Python 3.12+ with: `mlflow>=3.3`, `opentelemetry-api`, `opentelemetry-sdk`, `opentelemetry-exporter-otlp-proto-http` +- Anthropic SDK with Vertex AI support (`anthropic[vertex]`) for LLM judges +- GitHub Actions for CI (eval-gate, eval-monitor workflows) + +### Key design decisions + +1. **Post-hoc trace export, not live instrumentation** — The agent runtime (sandbox) has no OTEL SDK. Instead, bash scripts record timing/IDs to JSONL files, and `send-trace.py` reconstructs the span tree after the run completes. This avoids coupling the agent to any observability library. + +2. **Harness YAML as single source of truth** — Each agent's eval configuration (scorers, gates, baselines) lives in its harness YAML alongside the agent config. `harness.py` resolves scorer names to Python functions at runtime. + +3. **Hybrid scoring** — Mechanical scorers are instant and free (pure Python checks on trace attributes). LLM judges run Claude Opus via Vertex AI for semantic quality evaluation. Both feed into the same MLflow Feedback system. + +4. **Fixture-based PR gates** — When a prompt changes, the CI triggers a real agent run against a known test issue, then scores the resulting trace. This is an n=1 smoke test, not statistical significance — production monitoring provides the longer-term signal. 
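Design decision 2 above can be illustrated with a minimal sketch of how scorer names listed in a harness `eval` section might be resolved to Python callables. Everything here is an assumption for illustration: the registry, the `register` decorator, the dict-based traces, and the `EVAL_CONFIG` stand-in for the parsed YAML are hypothetical, and the real `harness.py` may work differently. Only the scorer names and the `max_cost: 2.00` gate come from the example harness config.

```python
"""Hypothetical sketch: resolve scorer names from a harness `eval` section."""
from typing import Callable

# Hypothetical registry keyed by scorer name (not the real harness.py API).
SCORER_REGISTRY: dict[str, Callable[[dict], float]] = {}

# Mirrors gates.max_cost in the example harness YAML.
MAX_COST = 2.00


def register(fn: Callable[[dict], float]) -> Callable[[dict], float]:
    """Register a scorer under its function name and return it unchanged."""
    SCORER_REGISTRY[fn.__name__] = fn
    return fn


@register
def validation_passed(trace: dict) -> float:
    """Stand-in scorer: 1.0 when the (hypothetical) trace dict reports a pass."""
    return 1.0 if trace.get("validation") == "passed" else 0.0


@register
def cost_within_budget(trace: dict) -> float:
    """Stand-in scorer: 1.0 when the run cost is at or under MAX_COST."""
    return 1.0 if trace.get("total_cost", 0.0) <= MAX_COST else 0.0


def resolve_scorers(eval_config: dict) -> list[Callable[[dict], float]]:
    """Map mechanical scorer names to callables, failing loudly on typos."""
    names = eval_config.get("scorers", {}).get("mechanical", [])
    unknown = [name for name in names if name not in SCORER_REGISTRY]
    if unknown:
        raise KeyError(f"Unknown scorer(s) in harness config: {unknown}")
    return [SCORER_REGISTRY[name] for name in names]


# Stand-in for the `eval` section after the harness YAML has been parsed.
EVAL_CONFIG = {
    "scorers": {"mechanical": ["validation_passed", "cost_within_budget"]}
}

if __name__ == "__main__":
    sample_trace = {"validation": "passed", "total_cost": 0.36}
    for scorer_fn in resolve_scorers(EVAL_CONFIG):
        print(scorer_fn.__name__, scorer_fn(sample_trace))
```

Raising on unknown names, rather than skipping them, is a deliberate choice in this sketch: a misspelled scorer in the YAML would otherwise silently weaken the quality gate.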
+ +## Results + +### Hypothesis 1: Capture — VALIDATED + +Rich traces are captured for all three agent types without modifying the agent runtime. The span tree for a typical explore run contains 15-25 spans including: + +| Span category | Examples | Count | +|---------------|----------|-------| +| Pipeline phases | pre-explore, post-explore | 4-6 | +| Fullsend lifecycle | load-harness, create-sandbox, agent-execution, validation | 6-8 | +| Agent reasoning | reasoning-1 through reasoning-N | 3-10 | +| Tool calls | tool:WebSearch, tool:Read, tool:Write | 5-15 | +| Results | fullsend:results | 1 | + +Token counts (input/output), cost ($), latency (ms), and model name are logged as span attributes. Cross-run linking via OTEL Links connects explore → refine → critique traces. + +### Hypothesis 2: Score — VALIDATED + +13 scorers implemented across two tiers: + +| Tier | Count | Avg latency | Cost | +|------|-------|-------------|------| +| Mechanical | 5 | <100ms | $0 | +| LLM Judge (Claude Opus) | 8 | 3-8s each | ~$0.02/scorer/trace | + +Scorers receive `mlflow.entities.Trace` objects and return `Feedback` with value + rationale. MLflow's Quality Dashboard aggregates these automatically. + +**Key metric ranges observed (explore agent, n=15 traces):** + +| Scorer | Mean | Min | Max | +|--------|------|-----|-----| +| validation_passed | 0.85 | 0.0 | 1.0 | +| tool_efficiency | 0.50 | 0.15 | 0.83 | +| cost_within_budget | 1.00 | 1.0 | 1.0 | +| confidence_coherence | 1.00 | 1.0 | 1.0 | +| reasoning_coherence | 0.71 | 0.40 | 1.0 | +| explore_context_quality | 0.72 | 0.40 | 1.0 | + +### Hypothesis 3: PR Gate — VALIDATED + +The eval-gate workflow successfully: +1. Detected prompt changes in the PR diff (7s) +2. Registered `@staging` prompt version in MLflow (~60s) +3. Triggered a real agent run against fixture issue (~10 min) +4. Scored the resulting trace (mechanical + LLM judge) +5. Posted pass/fail verdict to the PR as a comment +6. 
Blocked merge when quality dropped below threshold + +**PR #92 fixture eval result:** LLM judge score 4.0/5, cost $0.36, all mechanical checks passed. Total wall time: ~12 minutes. + +### Hypothesis 4: Regression Detection — VALIDATED (with caveats) + +Daily monitoring workflow: +- Scores last 7 days of production traces per agent +- Compares against golden baseline means (10% regression threshold) +- Sends Slack alert with per-agent, per-scorer breakdown + +**Caveat:** Golden baselines require manual curation. `create_golden.py` selects diverse traces and scores them, but the quality of the baseline depends on the quality of the traces available. Small n (< 10 traces) makes the baseline noisy. + +### Hypothesis 5: Prompt Versioning — VALIDATED + +MLflow Prompts Registry successfully tracks: +- Version history per agent (explore at v4, refine at v3, critique at v2) +- `@staging` alias set on PR, `@production` on merge +- Git commit + branch metadata per version +- Prompt-to-trace lineage via `link_prompt_versions_to_trace()` + +The Compare tab allows side-by-side diff of any two prompt versions. + +## Analysis + +### What worked well + +1. **OTLP ingest is seamless** — MLflow accepts standard OTLP traces without any MLflow-specific SDK in the agent. This means the instrumentation works with any MLflow deployment. + +2. **Quality Dashboard is production-ready** — Once Feedbacks are logged via `genai.evaluate()`, the dashboard shows distributions, trends, and drill-down without custom visualization. + +3. **Hybrid scoring is the right model** — Mechanical scorers catch structural failures instantly (invalid JSON, budget blown, too many retries). LLM judges catch semantic issues (poor context quality, incoherent reasoning). Neither alone is sufficient. + +4. **Fixture-based PR gates give immediate signal** — Even with n=1, a 4/5 judge score on a known test case gives more confidence than "I think this prompt is better." + +### What required iteration + +1. 
**Session metadata via OTLP is fragile** — The `session.id` span attribute gets JSON-serialized with extra quotes by MLflow's OTLP ingester. This required a post-export patch via `deprecated_end_trace_v2()`. This is a known MLflow bug. + +2. **Cross-run trace linking needed custom work** — OTEL Links are supported but MLflow's UI doesn't render them prominently. The session grouping (`mlflow.trace.session` tag) is more useful for the UI than the Link relationship. + +3. **LLM judge calibration is ongoing** — The rubric scoring criteria need iteration. Initial judge prompts were too lenient (everything scored 4-5). Adding specific anchor descriptions for each score level improved discrimination. + +4. **Golden baseline bootstrapping is a cold-start problem** — You need production traces to create baselines, but you need baselines to detect regressions. We bootstrapped from early runs and plan to refresh quarterly. + +### Limitations + +- **Fixture eval is n=1** — A single test issue does not prove statistical significance. It's a smoke test. +- **LLM judge reproducibility** — The same trace scored twice can produce different scores (±0.5 typical). Averaging multiple judge runs would improve reliability but increase cost. +- **No inline regression on PR** — The PR gate tests the new prompt against a fixture, not against recent production traces. A prompt could pass the fixture but regress on real-world variety. +- **Cost** — LLM judges at ~$0.02/scorer/trace means scoring 100 traces with 8 judges costs ~$16. Acceptable for daily monitoring, expensive for bulk historical analysis. + +### MLflow features not used + +| Feature | Why not | +|---------|---------| +| MLflow Datasets API | Golden baselines stored as JSONL files, not uploaded via API. Could migrate for better UI integration. | +| MLflow Sessions UI (dedicated page) | Sessions tab returned 404 in our deployment (MLflow 3.12.0). Trace-level session column works. 
| +| Model Registry | Agents are not "models" in the traditional sense. Prompts Registry covers our versioning needs. | +| MLflow Deployments / AI Gateway | Not needed — agents deploy via GitHub Actions, not MLflow serving. | + +## Reproduction + +### Environment setup + +```bash +# Install dependencies (quoted so the shell does not treat >= as a redirect or [] as a glob) +pip install "mlflow>=3.3" \ + opentelemetry-api opentelemetry-sdk \ + opentelemetry-exporter-otlp-proto-http \ + "anthropic[vertex]" pyyaml + +# Configure MLflow connection +export MLFLOW_TRACKING_URI="https://" +export MLFLOW_TRACKING_USERNAME="admin" +export MLFLOW_TRACKING_PASSWORD="" +export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="https:///v1/traces" +export MLFLOW_OTLP_TOKEN="" + +# For LLM judges via Vertex AI +export VERTEXAI_PROJECT="" +export VERTEXAI_LOCATION="us-east5" +export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json" +``` + +### Running scorers on existing traces + +```bash +# Score the last 7 days of traces for the explore agent +python3 examples/run_eval.py --agent explore --days 7 --max-traces 10 + +# Check for regressions against golden baseline +python3 examples/check_regression.py --agent explore --strict +``` + +### Registering prompts + +```bash +# Register a prompt with @staging alias +python3 examples/register_prompts.py --alias staging + +# Promote to @production after merge +python3 examples/register_prompts.py --alias production +``` + +## Key Files + +``` +agent-eval-mlflow-otel/ +├── README.md # This document +├── examples/ +│ ├── harness-explore.yaml # Example harness config with eval section +│ ├── scorer_mechanical.py # Mechanical scorer implementations +│ ├── scorer_llm_judge.py # LLM-as-judge scorer implementations +│ ├── send_trace_example.py # Simplified trace export example +│ ├── run_eval.py # Score traces via mlflow.genai.evaluate() +│ ├── check_regression.py # Compare recent traces vs golden baseline +│ └── register_prompts.py # MLflow Prompts Registry management +├── fixtures/ +│ ├── input.yaml # Example fixture input 
+│ └── rubric.yaml # Example fixture rubric for LLM judge +├── diagrams/ +│ └── architecture.png # Architecture infographic +└── .gitignore +``` + +## Recommendation + +Adopt this pattern for any team running autonomous AI agents: + +1. **Instrument first** — Add OTEL trace export to agent pipelines. The post-hoc pattern (collect artifacts, reconstruct spans) avoids coupling agents to observability libraries. + +2. **Start with mechanical scorers** — They're free, instant, and catch the most obvious failures (invalid output, budget blown, excessive retries). + +3. **Add LLM judges for semantic quality** — But calibrate the rubric carefully. Anchor each score level with specific descriptions. + +4. **Gate PRs with fixture evals** — Even n=1 against a known test case provides meaningful signal. Trust production monitoring for statistical confidence. + +5. **Use prompt versioning** — The `@staging` / `@production` alias pattern gives a clean promotion lifecycle tied to git commits. diff --git a/agent-eval-mlflow-otel/diagrams/architecture.png b/agent-eval-mlflow-otel/diagrams/architecture.png new file mode 100644 index 0000000..d01773b Binary files /dev/null and b/agent-eval-mlflow-otel/diagrams/architecture.png differ diff --git a/agent-eval-mlflow-otel/examples/check_regression.py b/agent-eval-mlflow-otel/examples/check_regression.py new file mode 100644 index 0000000..a634166 --- /dev/null +++ b/agent-eval-mlflow-otel/examples/check_regression.py @@ -0,0 +1,106 @@ +"""Compare recent trace scores against golden baselines. + +Loads a JSONL golden baseline file, computes mean scores per scorer, +then compares recent traces. Flags regression if any scorer drops +more than THRESHOLD below the baseline mean. 
+ +Usage: + python3 check_regression.py --agent explore --strict + python3 check_regression.py --agent explore --days 14 --threshold 0.15 +""" +import argparse +import json +import os +import sys + +import mlflow + +DEFAULT_THRESHOLD = 0.10 + + +def connect(): + url = os.environ.get("MLFLOW_TRACKING_URI", "") + token = os.environ.get("MLFLOW_OTLP_TOKEN", "") + if token: + os.environ.setdefault("MLFLOW_TRACKING_USERNAME", "admin") + os.environ.setdefault("MLFLOW_TRACKING_PASSWORD", token) + if url: + mlflow.set_tracking_uri(url) + + +def load_golden(agent: str) -> list[dict]: + """Load golden baseline scores from JSONL.""" + path = f"evals/baselines/{agent}-golden.jsonl" + if not os.path.exists(path): + print(f" No baseline found at {path}") + return [] + entries = [] + with open(path) as f: + for line in f: + if line.strip(): + entries.append(json.loads(line)) + return entries + + +def compute_means(entries: list[dict]) -> dict[str, float]: + """Compute mean score per scorer from golden entries.""" + sums = {} + counts = {} + for entry in entries: + for scorer_name, value in entry.get("scores", {}).items(): + if isinstance(value, (int, float)): + sums[scorer_name] = sums.get(scorer_name, 0) + value + counts[scorer_name] = counts.get(scorer_name, 0) + 1 + return {k: sums[k] / counts[k] for k in sums} + + +def main(): + parser = argparse.ArgumentParser(description="Check for quality regressions") + parser.add_argument("--agent", required=True) + parser.add_argument("--days", type=int, default=7) + parser.add_argument("--max-traces", type=int, default=50) + parser.add_argument("--threshold", type=float, default=DEFAULT_THRESHOLD) + parser.add_argument("--strict", action="store_true", help="Exit 1 on any regression") + args = parser.parse_args() + + connect() + mlflow.autolog(disable=True) + + golden = load_golden(args.agent) + if not golden: + print(f" Skipping {args.agent} — no baseline") + return + + golden_means = compute_means(golden) + print(f" Golden baseline 
({len(golden)} traces): {golden_means}") + + # In production, you would: + # 1. Fetch recent traces via mlflow.search_traces() + # 2. Score them with the same scorers used for golden + # 3. Compare means + # + # Simplified here for the experiment example: + print(f" To complete: fetch recent traces, score, compare against golden means") + print(f" Regression threshold: {args.threshold * 100:.0f}%") + + regressions = [] + # Example comparison logic: + # for scorer_name, golden_mean in golden_means.items(): + # current_mean = current_means.get(scorer_name, 0) + # delta = current_mean - golden_mean + # pct = delta / golden_mean if golden_mean > 0 else 0 + # if pct < -args.threshold: + # regressions.append((scorer_name, golden_mean, current_mean, pct)) + + if regressions: + print(f"\n !! REGRESSION detected:") + for name, gold, curr, pct in regressions: + print(f" {name}: golden={gold:.3f}, current={curr:.3f} ({pct:+.1%})") + if args.strict: + sys.exit(1) + else: + print(f"\n All scorers within threshold. No regression.") + + +if __name__ == "__main__": + main() diff --git a/agent-eval-mlflow-otel/examples/harness-explore.yaml b/agent-eval-mlflow-otel/examples/harness-explore.yaml new file mode 100644 index 0000000..1a6e313 --- /dev/null +++ b/agent-eval-mlflow-otel/examples/harness-explore.yaml @@ -0,0 +1,47 @@ +# Example harness configuration for the explore agent. +# The eval section is the single source of truth for quality gates. +# harness.py reads this at runtime and resolves scorer names to Python functions. 
+ +agent: customized/agents/explore.md +model: opus +image: ghcr.io/fullsend-ai/fullsend-sandbox:latest +policy: customized/policies/explore.yaml + +skills: + - customized/skills/public-research + - customized/skills/jira-read + +pre_script: customized/scripts/pre-explore.sh + +validation_loop: + script: scripts/validate-output-schema.sh + max_iterations: 2 + +post_script: customized/scripts/post-explore.sh + +timeout_minutes: 20 + +eval: + scorers: + mechanical: + - validation_passed + - tool_efficiency + - cost_within_budget + - confidence_coherence + - iteration_count + llm_judge: + model: claude-opus-4-6 + criteria: + - name: explore_context_quality + guidelines: > + Is the gathered context relevant, specific, and complete? + Did the agent look in the right places? Is context specific + enough for refinement? Were constraints/risks identified? + - name: reasoning_coherence + guidelines: > + Is reasoning logically coherent and evidence-based? + gates: + min_validation_rate: 0.80 + min_quality_score: 3.0 + max_cost: 2.00 + baseline: evals/baselines/explore-golden.jsonl diff --git a/agent-eval-mlflow-otel/examples/register_prompts.py b/agent-eval-mlflow-otel/examples/register_prompts.py new file mode 100644 index 0000000..e6a7112 --- /dev/null +++ b/agent-eval-mlflow-otel/examples/register_prompts.py @@ -0,0 +1,99 @@ +"""Register agent prompts in MLflow Prompts Registry. + +Reads agent prompt markdown files and registers them as versioned prompts +with @staging or @production aliases. Uses content-hash dedup to skip +unchanged prompts while still updating the alias. 
+ +Usage: + python3 register_prompts.py --alias staging + python3 register_prompts.py --alias production + python3 register_prompts.py --alias staging --agents explore refine + +Env: + GIT_COMMIT — Current git commit hash (for metadata) + GIT_BRANCH — Current git branch name +""" +import argparse +import hashlib +import os +from pathlib import Path + +import mlflow +from mlflow import MlflowClient + +AGENTS_DIR = Path(".fullsend/customized/agents") +PROMPT_PREFIX = "fullsend" + + +def connect(): + url = os.environ.get("MLFLOW_TRACKING_URI", "") + token = os.environ.get("MLFLOW_OTLP_TOKEN", "") + if token: + os.environ.setdefault("MLFLOW_TRACKING_USERNAME", "admin") + os.environ.setdefault("MLFLOW_TRACKING_PASSWORD", token) + if url: + mlflow.set_tracking_uri(url) + + +def content_hash(text: str) -> str: + return hashlib.sha256(text.encode()).hexdigest()[:12] + + +def register_prompt(agent: str, alias: str, client: MlflowClient): + """Register a single agent's prompt in MLflow.""" + prompt_path = AGENTS_DIR / f"{agent}.md" + if not prompt_path.exists(): + print(f" SKIP {agent} — {prompt_path} not found") + return + + content = prompt_path.read_text() + chash = content_hash(content) + prompt_name = f"{PROMPT_PREFIX}-{agent}" + + git_commit = os.environ.get("GIT_COMMIT", "unknown") + git_branch = os.environ.get("GIT_BRANCH", "unknown") + + tags = { + "git.commit": git_commit, + "git.branch": git_branch, + "content.hash": chash, + "agent": agent, + "source": str(prompt_path), + } + + existing = client.search_prompt_versions(name=prompt_name, max_results=1) + if existing: + latest = existing[0] + latest_hash = (latest.tags or {}).get("content.hash", "") + if latest_hash == chash: + print(f" {prompt_name}: content unchanged (hash={chash}), updating alias only") + mlflow.genai.set_prompt_alias(prompt_name, alias, latest.version) + return + + version = mlflow.genai.register_prompt( + name=prompt_name, + template=content, + commit_message=f"{alias}: {agent} prompt 
({chash})", + tags=tags, + ) + print(f" {prompt_name}: registered v{version.version} (hash={chash})") + + mlflow.genai.set_prompt_alias(prompt_name, alias, version.version) + print(f" {prompt_name}: alias @{alias} -> v{version.version}") + + +def main(): + parser = argparse.ArgumentParser(description="Register prompts in MLflow") + parser.add_argument("--alias", required=True, choices=["staging", "production"]) + parser.add_argument("--agents", nargs="+", default=["explore", "refine", "critique"]) + args = parser.parse_args() + + connect() + client = MlflowClient() + + for agent in args.agents: + register_prompt(agent, args.alias, client) + + +if __name__ == "__main__": + main() diff --git a/agent-eval-mlflow-otel/examples/run_eval.py b/agent-eval-mlflow-otel/examples/run_eval.py new file mode 100644 index 0000000..e45c251 --- /dev/null +++ b/agent-eval-mlflow-otel/examples/run_eval.py @@ -0,0 +1,108 @@ +"""Score traces via mlflow.genai.evaluate() and log operational metrics. + +Reads traces from MLflow, resolves scorers from harness config, and runs +evaluation. Results appear as Feedbacks on traces (Quality Dashboard) and +as metrics on the evaluation run (Evaluation Runs page). 
+ +Usage: + python3 run_eval.py --agent explore --days 7 --max-traces 10 + python3 run_eval.py --agent explore --mechanical-only +""" +import argparse +import os +import time + +import mlflow +from mlflow import MlflowClient + + +def connect(): + """Set up MLflow tracking connection.""" + url = os.environ.get("MLFLOW_TRACKING_URI", "") + token = os.environ.get("MLFLOW_OTLP_TOKEN", "") + if token: + os.environ.setdefault("MLFLOW_TRACKING_USERNAME", "admin") + os.environ.setdefault("MLFLOW_TRACKING_PASSWORD", token) + if url: + mlflow.set_tracking_uri(url) + + +def get_traces(agent=None, days=7, max_results=50): + """Search for traces, optionally filtered by agent and recency.""" + filters = [] + if agent: + filters.append(f"tags.`fullsend.agent` = '{agent}'") + if days: + import datetime + cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=days) + filters.append(f"timestamp > {int(cutoff.timestamp() * 1000)}") + + filter_str = " AND ".join(filters) if filters else None + return mlflow.search_traces( + locations=["0"], + filter_string=filter_str, + max_results=max_results, + ) + + +def resolve_scorers(agent, mechanical_only=False): + """Resolve scorer functions for the given agent. + + In production, this reads the harness YAML. Here we import directly. 
+ """ + from scorer_mechanical import MECHANICAL_SCORERS + + if mechanical_only: + return MECHANICAL_SCORERS + + if agent == "explore": + from scorer_llm_judge import EXPLORE_SCORERS + return MECHANICAL_SCORERS + EXPLORE_SCORERS + elif agent == "refine": + from scorer_llm_judge import REFINE_SCORERS + return MECHANICAL_SCORERS + REFINE_SCORERS + elif agent == "critique": + from scorer_llm_judge import CRITIQUE_SCORERS + return MECHANICAL_SCORERS + CRITIQUE_SCORERS + else: + return MECHANICAL_SCORERS + + +def main(): + parser = argparse.ArgumentParser(description="Score traces via MLflow") + parser.add_argument("--agent", required=True, help="Agent name (explore, refine, critique)") + parser.add_argument("--days", type=int, default=7, help="Look-back window in days") + parser.add_argument("--max-traces", type=int, default=50, help="Max traces to score") + parser.add_argument("--mechanical-only", action="store_true", help="Skip LLM judges") + args = parser.parse_args() + + connect() + mlflow.autolog(disable=True) + + print(f"Fetching traces for {args.agent} (last {args.days} days)...") + traces_df = get_traces(agent=args.agent, days=args.days, max_results=args.max_traces) + print(f" Found {len(traces_df)} traces") + + if traces_df.empty: + print(" No traces to score.") + return + + scorers = resolve_scorers(args.agent, args.mechanical_only) + print(f" Running {len(scorers)} scorers...") + + start = time.time() + result = mlflow.genai.evaluate(data=traces_df, scorers=scorers) + elapsed = time.time() - start + + print(f" Evaluation complete in {elapsed:.1f}s") + print(f" Results: {result.metrics}") + + mlflow.log_param("agent", args.agent) + mlflow.log_metrics({ + "trace_count": len(traces_df), + "latency_ms": int(elapsed * 1000), + }) + + +if __name__ == "__main__": + main() diff --git a/agent-eval-mlflow-otel/examples/scorer_llm_judge.py b/agent-eval-mlflow-otel/examples/scorer_llm_judge.py new file mode 100644 index 0000000..21dcb4c --- /dev/null +++ 
b/agent-eval-mlflow-otel/examples/scorer_llm_judge.py @@ -0,0 +1,157 @@ +"""LLM-as-judge scorers — semantic quality evaluation via Anthropic Vertex AI. + +Each scorer calls Claude Opus to evaluate a specific quality dimension of +an agent trace. Scores are returned as Feedback objects (value + rationale) +that MLflow logs to the Quality Dashboard. + +Usage: + from scorer_llm_judge import EXPLORE_SCORERS + import mlflow + mlflow.genai.evaluate(data=traces_df, scorers=EXPLORE_SCORERS) + +Env: + FULLSEND_JUDGE_MODEL — Model name (default: claude-opus-4-6) + VERTEXAI_PROJECT — GCP project with Vertex AI API enabled + VERTEXAI_LOCATION — Region (default: us-east5) +""" +import json +import os + +from anthropic import AnthropicVertex +from mlflow.genai.scorers import scorer +from mlflow.entities import Feedback + +JUDGE_MODEL = os.environ.get("FULLSEND_JUDGE_MODEL", "claude-opus-4-6") +VERTEX_PROJECT = os.environ.get("VERTEXAI_PROJECT", "") +VERTEX_REGION = os.environ.get("VERTEXAI_LOCATION", "us-east5") + +_vertex_client = None + + +def _get_vertex_client() -> AnthropicVertex: + global _vertex_client + if _vertex_client is None: + _vertex_client = AnthropicVertex( + project_id=VERTEX_PROJECT, + region=VERTEX_REGION, + ) + return _vertex_client + + +def _llm_judge(prompt: str) -> dict: + """Call the judge LLM and parse a JSON response.""" + client = _get_vertex_client() + response = client.messages.create( + model=JUDGE_MODEL, + max_tokens=300, + messages=[{"role": "user", "content": prompt}], + ) + content = response.content[0].text.strip() + if content.startswith("```"): + content = content.split("```")[1] + if content.startswith("json"): + content = content[4:] + return json.loads(content) + + +def _get_trace_summary(trace, max_reasoning_chars: int = 1500) -> str: + """Build a text summary of a trace for LLM judge context.""" + tags = trace.info.tags or {} + agent = tags.get("fullsend.agent", "unknown") + work_item = tags.get("fullsend.work_item_id", "unknown") + cost 
= trace.info.cost or {} + + reasoning_texts = [] + for s in trace.data.spans: + if s.name.startswith("reasoning-"): + text = s.get_attribute("output.value") + if text: + reasoning_texts.append(text) + + agent_span = None + target = f"{agent}-agent" + spans = trace.search_spans(name=target) + if spans: + agent_span = spans[0] + + confidence = agent_span.get_attribute("confidence.overall") if agent_span else "N/A" + tools_count = agent_span.get_attribute("tool_call_count") if agent_span else "N/A" + + return ( + f"Agent: {agent}\n" + f"Work item: {work_item}\n" + f"Confidence: {confidence}/100\n" + f"Tool calls: {tools_count}\n" + f"Cost: ${cost.get('total_cost', 0):.2f}\n" + f"Reasoning: {' | '.join(reasoning_texts)[:max_reasoning_chars]}" + ) + + +@scorer +def explore_context_quality(*, trace) -> Feedback: + """Is the gathered context relevant, specific, and complete?""" + summary = _get_trace_summary(trace, max_reasoning_chars=2000) + result = _llm_judge( + f"Evaluate this exploration agent's context gathering (1-5).\n\n" + f"{summary}\n\n" + f"Criteria: Looked in the right places? Context specific enough " + f"for refinement? Identified constraints/risks? 
Obvious gaps?\n" + f'Respond in JSON: {{"score": <1-5>, "rationale": "<1-2 sentences>"}}' + ) + return Feedback(value=result["score"] / 5.0, rationale=result.get("rationale", "")) + + +@scorer +def reasoning_coherence(*, trace) -> Feedback: + """Is the agent's reasoning logically coherent and evidence-based?""" + summary = _get_trace_summary(trace) + result = _llm_judge( + f"Evaluate the logical coherence of this AI agent's reasoning (1-5).\n\n" + f"{summary}\n\n" + f"1=Contradictory/incoherent, 2=Major gaps, 3=Mostly coherent, " + f"4=Good with minor issues, 5=Excellent logical flow.\n" + f'Respond in JSON: {{"score": <1-5>, "rationale": "<1-2 sentences>"}}' + ) + return Feedback(value=result["score"] / 5.0, rationale=result.get("rationale", "")) + + +@scorer +def refine_decomposition_quality(*, trace) -> Feedback: + """Is the feature decomposition complete, well-scoped, and actionable?""" + summary = _get_trace_summary(trace, max_reasoning_chars=2000) + result = _llm_judge( + f"Evaluate this feature refinement agent's decomposition (1-5).\n\n" + f"{summary}\n\n" + f"Criteria: (1) Children cover all parent requirements, " + f"(2) Each child is independently implementable, " + f"(3) Acceptance criteria are specific/testable, " + f"(4) Dependencies identified, (5) Right granularity.\n" + f'Respond in JSON: {{"score": <1-5>, "rationale": "<1-2 sentences>"}}' + ) + return Feedback(value=result["score"] / 5.0, rationale=result.get("rationale", "")) + + +@scorer +def critique_verdict_accuracy(*, trace) -> Feedback: + """Does the critique verdict match the actual quality of the plan?""" + summary = _get_trace_summary(trace) + post_critique = trace.search_spans(name="post-critique") + verdict = "unknown" + score_val = "?" + if post_critique: + verdict = post_critique[0].get_attribute("phase.verdict") or "unknown" + score_val = post_critique[0].get_attribute("phase.score") or "?" 
+    result = _llm_judge(
+        f"Evaluate this critique agent's verdict accuracy (1-5).\n\n"
+        f"{summary}\n"
+        f"Verdict: {verdict} (score: {score_val})\n\n"
+        f"Does the verdict match the plan quality? Is the critique specific "
+        f"and actionable? Does it identify real issues?\n"
+        f'Respond in JSON: {{"score": <1-5>, "rationale": "<1-2 sentences>"}}'
+    )
+    return Feedback(value=result["score"] / 5.0, rationale=result.get("rationale", ""))
+
+
+EXPLORE_SCORERS = [explore_context_quality, reasoning_coherence]
+REFINE_SCORERS = [refine_decomposition_quality, reasoning_coherence]
+CRITIQUE_SCORERS = [critique_verdict_accuracy, reasoning_coherence]
diff --git a/agent-eval-mlflow-otel/examples/scorer_mechanical.py b/agent-eval-mlflow-otel/examples/scorer_mechanical.py
new file mode 100644
index 0000000..68ddbe0
--- /dev/null
+++ b/agent-eval-mlflow-otel/examples/scorer_mechanical.py
@@ -0,0 +1,162 @@
+"""Mechanical scorers — pure Python, no LLM cost.
+
+These scorers receive MLflow Trace objects and return Feedback with a numeric
+value and rationale string. They check structural properties of agent traces.
+
+Usage:
+    from scorer_mechanical import MECHANICAL_SCORERS
+    import mlflow
+    mlflow.genai.evaluate(data=traces_df, scorers=MECHANICAL_SCORERS)
+"""
+import json
+
+from mlflow.genai.scorers import scorer
+from mlflow.entities import Feedback
+
+
+@scorer
+def validation_passed(*, trace) -> Feedback:
+    """Did the agent output pass schema validation?"""
+    spans = trace.search_spans(name="fullsend:results")
+    passed = bool(spans and spans[0].get_attribute("result.validation") == "passed")
+    return Feedback(
+        value=1.0 if passed else 0.0,
+        rationale="passed" if passed else "failed or no results span",
+    )
+
+
+@scorer
+def tool_efficiency(*, trace) -> Feedback:
+    """Ratio of reasoning turns to total actions.
+
+    Ideal range: 0.15-0.40 reasoning-to-total ratio.
+    Too high = overthinking. Too low = acting without reasoning.
+    """
+    agent_span = _find_agent_span(trace)
+    if not agent_span:
+        return Feedback(value=0.5, rationale="No agent span found")
+
+    tools = int(agent_span.get_attribute("tool_call_count") or 0)
+    reasoning = int(agent_span.get_attribute("reasoning_turn_count") or 0)
+    total = tools + reasoning
+    if total == 0:
+        return Feedback(value=0.0, rationale="No tool calls or reasoning turns")
+
+    ratio = reasoning / total
+    score = 1.0 if 0.15 <= ratio <= 0.40 else max(0.0, 1.0 - abs(ratio - 0.275) * 3)
+    return Feedback(
+        value=round(score, 2),
+        rationale=f"{reasoning} reasoning / {tools} tools (ratio={ratio:.2f})",
+    )
+
+
+@scorer
+def cost_within_budget(*, trace) -> Feedback:
+    """Is the run cost within acceptable bounds?
+
+    Budget thresholds per agent type. The harness YAML's gates.max_cost is
+    not read in this excerpt; only the hard-coded defaults below are used.
+    """
+    tags = trace.info.tags or {}
+    agent = tags.get("fullsend.agent", "")
+    budgets = {"explore": 2.0, "refine": 3.0, "critique": 1.5, "triage": 2.0, "code": 5.0}
+    budget = budgets.get(agent, 3.0)
+
+    cost = trace.info.cost or {}
+    total = float(cost.get("total_cost", 0))
+
+    within = total <= budget
+    return Feedback(
+        value=1.0 if within else 0.0,
+        rationale=f"${total:.2f} vs ${budget:.2f} budget ({agent})",
+    )
+
+
+@scorer
+def confidence_coherence(*, trace) -> Feedback:
+    """Are confidence dimensions internally coherent?
+
+    Checks: values in 0-100, not all identical, spread not extreme (>60),
+    and overall score roughly matches the mean of dimensions.
+    """
+    agent_span = _find_agent_span(trace)
+    if not agent_span:
+        return Feedback(value=1.0, rationale="No agent span (non-confidence agent)")
+
+    dims = {}
+    for k, v in agent_span.attributes.items():
+        if k.startswith("confidence.") and k != "confidence.overall":
+            try:
+                dims[k.replace("confidence.", "")] = int(v)
+            except (ValueError, TypeError):
+                pass
+
+    overall = agent_span.get_attribute("confidence.overall")
+    if not dims:
+        return Feedback(value=1.0, rationale="No confidence dimensions found")
+
+    issues = []
+    values = list(dims.values())
+
+    for name, val in dims.items():
+        if val < 0 or val > 100:
+            issues.append(f"{name}={val} out of range")
+
+    if len(set(values)) == 1 and len(values) > 2:
+        issues.append(f"All dimensions identical ({values[0]})")
+
+    spread = max(values) - min(values)
+    if spread > 60:
+        issues.append(f"Extreme spread: {spread}")
+
+    if overall is not None:
+        mean_dims = sum(values) / len(values)
+        try:
+            overall_int = int(overall)
+            if abs(overall_int - mean_dims) > 15:
+                issues.append(f"Overall ({overall_int}) deviates from mean ({mean_dims:.0f})")
+        except (ValueError, TypeError):
+            pass
+
+    if issues:
+        return Feedback(value=0.0, rationale="; ".join(issues))
+    return Feedback(
+        value=1.0,
+        rationale=f"All {len(dims)} dims valid, range {min(values)}-{max(values)}, overall={overall}",
+    )
+
+
+@scorer
+def iteration_count(*, trace) -> Feedback:
+    """How many agent execution iterations were needed?
+
+    Most tasks should complete in 1 iteration. Multiple iterations
+    may indicate the agent is struggling or the task is too complex.
+    """
+    iteration_spans = [
+        s for s in trace.data.spans
+        if s.name.startswith("agent-execution.iteration-")
+    ]
+    return Feedback(
+        value=len(iteration_spans),
+        rationale=f"{len(iteration_spans)} iteration(s)",
+    )
+
+
+MECHANICAL_SCORERS = [
+    validation_passed,
+    tool_efficiency,
+    cost_within_budget,
+    confidence_coherence,
+    iteration_count,
+]
+
+
+def _find_agent_span(trace):
+    """Find the main agent span (e.g., 'explore-agent', 'triage-agent')."""
+    tags = trace.info.tags or {}
+    agent = tags.get("fullsend.agent", "")
+    if not agent:
+        return None
+    spans = trace.search_spans(name=f"{agent}-agent")
+    return spans[0] if spans else None
diff --git a/agent-eval-mlflow-otel/examples/send_trace_example.py b/agent-eval-mlflow-otel/examples/send_trace_example.py
new file mode 100644
index 0000000..1481406
--- /dev/null
+++ b/agent-eval-mlflow-otel/examples/send_trace_example.py
@@ -0,0 +1,76 @@
+"""Simplified example of sending an agent trace to MLflow via OTLP.
+
+This demonstrates the core pattern: reconstruct a span tree from agent
+artifacts and export via OTLP HTTP. The production version (send-trace.py)
+handles many more edge cases and data sources.
+
+Env:
+    OTEL_EXPORTER_OTLP_TRACES_ENDPOINT — MLflow OTLP endpoint (e.g. https:///v1/traces)
+    MLFLOW_OTLP_TOKEN — Bearer token for OTLP + Basic auth for tag API
+"""
+import os
+import time
+
+from opentelemetry import trace
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import SimpleSpanProcessor
+from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
+
+OTLP_ENDPOINT = os.environ.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT", "")
+MLFLOW_TOKEN = os.environ.get("MLFLOW_OTLP_TOKEN", "")
+
+
+def send_example_trace():
+    """Send a minimal agent trace to MLflow."""
+    headers = {}
+    if MLFLOW_TOKEN:
+        headers["Authorization"] = f"Bearer {MLFLOW_TOKEN}"
+    headers["x-mlflow-experiment-id"] = "0"
+
+    provider = TracerProvider()
+    exporter = OTLPSpanExporter(endpoint=OTLP_ENDPOINT, headers=headers)
+    provider.add_span_processor(SimpleSpanProcessor(exporter))
+    trace.set_tracer_provider(provider)
+    tracer = trace.get_tracer("fullsend")
+
+    with tracer.start_as_current_span("explore-pipeline") as root:
+        root.set_attribute("openinference.span.kind", "CHAIN")
+        root.set_attribute("agent", "explore")
+        root.set_attribute("session.id", "github:84")
+        root.set_attribute("input.value", "Explore issue #84: Regression test suite")
+        root.set_attribute("llm.cost", 0.36)
+
+        with tracer.start_as_current_span("pre-explore") as pre:
+            pre.set_attribute("pipeline.phase", "pre-explore")
+            time.sleep(0.01)
+
+            with tracer.start_as_current_span("pre-explore:fetch-issue") as fetch:
+                fetch.set_attribute("pipeline.step", "fetch-issue")
+                time.sleep(0.01)
+
+        with tracer.start_as_current_span("fullsend:agent-execution") as exec_span:
+            exec_span.set_attribute("fullsend.step", "agent-execution")
+
+            with tracer.start_as_current_span("explore-agent") as agent:
+                agent.set_attribute("openinference.span.kind", "LLM")
+                agent.set_attribute("llm.model_name", "claude-opus-4-6")
+                agent.set_attribute("tool_call_count", 8)
+                agent.set_attribute("reasoning_turn_count", 5)
+                agent.set_attribute("confidence.overall", 72)
+                time.sleep(0.01)
+
+        with tracer.start_as_current_span("fullsend:results") as results:
+            results.set_attribute("result.validation", "passed")
+            results.set_attribute("output.value", '{"summary": "...", "confidence": {"overall": 72}}')
+
+        root.set_attribute("output.value", "Exploration complete. Confidence: 72/100")
+
+    provider.shutdown()
+    print("Trace sent to MLflow via OTLP")
+
+
+if __name__ == "__main__":
+    if not OTLP_ENDPOINT:
+        print("Set OTEL_EXPORTER_OTLP_TRACES_ENDPOINT to your MLflow instance")
+    else:
+        send_example_trace()
diff --git a/agent-eval-mlflow-otel/fixtures/input.yaml b/agent-eval-mlflow-otel/fixtures/input.yaml
new file mode 100644
index 0000000..99ea6ec
--- /dev/null
+++ b/agent-eval-mlflow-otel/fixtures/input.yaml
@@ -0,0 +1,11 @@
+# Example fixture input for the explore agent.
+# Points to a real (public) GitHub issue as the test case.
+issue_source: github
+issue_key: "1"
+repo_full_name: "/"
+description: |
+  A mid-complexity story requiring the agent to explore a real codebase,
+  identify testing frameworks, and gather context about regression test
+  approaches. Tests the agent's ability to use multiple sources (GitHub,
+  web search, documentation) and produce structured output.
+  Replace repo_full_name and issue_key with a real issue from your target repo.
diff --git a/agent-eval-mlflow-otel/fixtures/rubric.yaml b/agent-eval-mlflow-otel/fixtures/rubric.yaml
new file mode 100644
index 0000000..de818bb
--- /dev/null
+++ b/agent-eval-mlflow-otel/fixtures/rubric.yaml
@@ -0,0 +1,79 @@
+# Example fixture rubric — LLM judge instructions for scoring an explore trace.
+#
+# The judge_prompt is sent to Claude Opus with {{ trace_summary }} replaced
+# by the actual trace data. The judge returns a 1-5 score.
+#
+# Deterministic checks verify specific assessment fields that mechanical
+# scorers have already logged to MLflow.
+
+name: explore-regression-test-suite
+agent: explore
+max_cost_usd: 5.0
+max_turns: 120
+
+judge_prompt: |
+  You are evaluating an explore agent's work on a GitHub issue about
+  building a regression test suite.
+
+  The agent was given this issue and asked to explore all available context
+  (GitHub, web, Jira, and optionally the target codebase) to gather a rich
+  picture for a downstream refinement agent. The agent produces a structured
+  JSON exploration result.
+
+  NOTE: The target codebase may be empty or inaccessible. In that case, the
+  agent should still gather context from GitHub issues, PRs, web research,
+  and any available documentation. An empty target repo is NOT a failure —
+  it means the agent must rely on other sources.
+
+  Here is the agent's trace summary and output:
+  {{ trace_summary }}
+
+  Evaluate the quality on a 1-5 scale based on these criteria:
+
+  1. **Research breadth** — Did the agent use multiple sources (GitHub
+     issues, web search, documentation) to understand the problem domain?
+
+  2. **Related work** — Did the agent find related issues, PRs, or
+     discussions? Did it identify prior art or existing patterns?
+
+  3. **Technical landscape** — Did the agent identify relevant
+     technologies, frameworks, deployment targets, and key dependencies?
+
+  4. **Architectural constraints** — Did the agent surface meaningful
+     constraints that would affect implementation?
+
+  5. **Output quality** — Is the output well-structured, with a clear
+     summary and confidence scores? Are findings specific and actionable?
+
+  Scoring:
+    1 = Failed to produce output or produced empty/invalid JSON
+    2 = Minimal exploration with very few sources consulted
+    3 = Adequate — consulted multiple sources, produced valid output
+    4 = Good — thorough research, specific findings, clear constraints
+    5 = Excellent — comprehensive exploration with deep, actionable findings
+
+  Respond with just a number 1-5.
+
+deterministic_checks:
+  - name: valid_json_output
+    description: Agent produced valid JSON in agent-result.json
+    field: validation_passed
+    expect: 1.0
+
+  - name: has_tool_calls
+    description: Agent made tool calls (actually explored)
+    field: tool_efficiency
+    min_value: 0.1
+
+  - name: within_budget
+    description: Agent stayed within cost budget
+    field: cost_within_budget
+    expect: 1.0
+
+thresholds:
+  judge_score:
+    min: 3.0
+  valid_json_output:
+    min_pass_rate: 1.0
+  within_budget:
+    min_pass_rate: 1.0
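Reviewer note: the `deterministic_checks` entries in `rubric.yaml` can be read as simple predicates over the values that the mechanical scorers have already logged for a run. The sketch below is one possible reading, not the harness implementation. It assumes that `expect` means an exact match, that `min_value` means an inclusive lower bound, and that a missing scorer value fails the check; the fixture does not spell out any of these. `run_deterministic_checks`, `CHECKS`, and `example_scores` are hypothetical names introduced only for illustration.

```python
def run_deterministic_checks(checks, scores):
    """Return {check_name: passed} for each rubric check.

    Assumed semantics (not confirmed by the fixture):
      - `expect`: the scorer value must equal the expected value
      - `min_value`: the scorer value must be >= the bound
      - a scorer value that is missing fails the check
    """
    results = {}
    for check in checks:
        value = scores.get(check["field"])
        if value is None:
            passed = False
        elif "expect" in check:
            passed = value == check["expect"]
        elif "min_value" in check:
            passed = value >= check["min_value"]
        else:
            # A check with no recognised condition is treated as failed.
            passed = False
        results[check["name"]] = passed
    return results


# Mirrors the three checks from rubric.yaml as plain dicts.
CHECKS = [
    {"name": "valid_json_output", "field": "validation_passed", "expect": 1.0},
    {"name": "has_tool_calls", "field": "tool_efficiency", "min_value": 0.1},
    {"name": "within_budget", "field": "cost_within_budget", "expect": 1.0},
]

# Hypothetical scorer values for one run: valid output, over budget.
example_scores = {
    "validation_passed": 1.0,
    "tool_efficiency": 0.85,
    "cost_within_budget": 0.0,
}

outcome = run_deterministic_checks(CHECKS, example_scores)
print(outcome)
```

Under these assumptions the example run passes `valid_json_output` and `has_tool_calls` but fails `within_budget`, which would violate the `min_pass_rate: 1.0` threshold if it were the only run in the batch.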