Labels: enhancement
Description
Context
The Princeton paper "Towards a Science of AI Agent Reliability" (arXiv:2602.16666) proposes 12 metrics across 4 dimensions (consistency, robustness, predictability, safety) for evaluating AI agent reliability. Our judge skill evaluates content quality (semantic, pragmatic, syntactic) but does NOT measure behavioral reliability: a perfect quality score on one run is meaningless if the next run produces contradictory output.
Plan
See plans/agent-reliability-judge-extension.md in cto-executive-system.
Phase 1: Judge Consistency Check (~1.5h)
- Add optional `reliability` object to verdict-schema.json (runs, outcome_consistency, score_variance)
- Add `--reliability` flag to run-judge.sh for multi-pass evaluation (3 runs, compute consistency)
- Flag verdicts with consistency < 0.67 as UNRELIABLE
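The multi-pass consistency computation could look like the following sketch. The helper name, the verdict tuple format, and the threshold default are assumptions for illustration, not part of run-judge.sh:

```python
from collections import Counter
from statistics import pvariance

def consistency_report(verdicts, threshold=0.67):
    """Summarize multi-pass judge runs (hypothetical helper).

    verdicts: one (outcome, score) tuple per judge run.
    """
    outcomes = [o for o, _ in verdicts]
    scores = [s for _, s in verdicts]
    # Outcome consistency: fraction of runs agreeing with the modal outcome.
    modal_count = Counter(outcomes).most_common(1)[0][1]
    consistency = modal_count / len(outcomes)
    return {
        "runs": len(verdicts),
        "outcome_consistency": round(consistency, 2),
        "score_variance": round(pvariance(scores), 4),
        "unreliable": consistency < threshold,
    }
```

Note that 2-of-3 agreement (0.666...) falls just under a strict `< 0.67` comparison, so the final flag logic should decide explicitly whether the 2/3 boundary case counts as reliable.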
Phase 2: Quality-Gate Robustness Check (~1.5h)
- Add prompt robustness check for agent-generated outputs (re-run with a paraphrased prompt, compare verdicts)
- Add robustness section to Quality Gate Report template
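The paraphrase-and-compare step might be sketched as below. Both function names and the `(outcome, score)` verdict shape are illustrative assumptions; a real check would paraphrase with an LLM rather than a string prefix:

```python
def paraphrase(prompt):
    # Stand-in paraphraser; a real check would use an LLM rewrite.
    return "Restated: " + prompt

def robustness_check(evaluate, prompt):
    """Compare the verdict on the original prompt against a paraphrase.

    `evaluate` is assumed to return an (outcome, score) tuple.
    """
    base_outcome, base_score = evaluate(prompt)
    para_outcome, para_score = evaluate(paraphrase(prompt))
    return {
        "outcome_stable": base_outcome == para_outcome,
        "score_delta": round(abs(base_score - para_score), 4),
    }

# Toy judge: scores by prompt length, so paraphrasing shifts the score.
toy_judge = lambda p: ("PASS" if len(p) < 80 else "FAIL", min(len(p) / 100, 1.0))
```

A stable outcome with a small score delta would pass the robustness section of the Quality Gate Report; a flipped outcome would fail it.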
Phase 3: Safety Compliance Extension (~1h)
- Merge the SLB risk-tier model with Princeton's S_comp metric (CRITICAL/DANGEROUS/CAUTION/SAFE tiers for file paths)
- Add `content_hash` (SHA-256) to verdict-schema.json, binding verdicts to exact content
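A minimal sketch of both Phase 3 pieces, assuming glob-style tier patterns (the patterns shown are placeholders, not the actual SLB tier definitions):

```python
import fnmatch
import hashlib

# Hypothetical path patterns per risk tier, most severe first.
TIER_PATTERNS = [
    ("CRITICAL", ["/etc/*", "*/.ssh/*"]),
    ("DANGEROUS", ["*.sh", "*/bin/*"]),
    ("CAUTION", ["*.json", "*.yaml"]),
]

def risk_tier(path):
    """Return the first matching tier for a file path, else SAFE."""
    for tier, patterns in TIER_PATTERNS:
        if any(fnmatch.fnmatch(path, p) for p in patterns):
            return tier
    return "SAFE"

def content_hash(text):
    # SHA-256 binds a verdict to the exact content that was judged.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

The 64-character hex digest would be stored in the verdict alongside the tier, so a verdict can be rejected if the content it references has since changed.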
Related
- Epic: Multi-Model Judge Skill for terraphim-engineering-skills #17 (judge epic)
- feat: integrate Attractor DOT pipeline patterns into disciplined-* skills #60 (Attractor DOT pipeline -- goal gates provide mechanical enforcement of S_comp)
- cto-executive-system KB: knowledge/agent-reliability-metrics-princeton.md
- cto-executive-system KB: knowledge/slb-two-person-rule-agents.md
- Paper: https://arxiv.org/abs/2602.16666