
Extend judge and quality-gate with reliability dimensions (Princeton metrics) #61

@AlexMikhalev

Description


Context

The Princeton paper "Towards a Science of AI Agent Reliability" (arXiv 2602.16666) proposes 12 metrics across 4 dimensions (consistency, robustness, predictability, safety) for evaluating AI agent reliability. Our judge skill evaluates content quality (semantic, pragmatic, syntactic) but does NOT measure behavioral reliability: a perfect quality score on one run is meaningless if the next run produces contradictory output.

Plan

See plans/agent-reliability-judge-extension.md in cto-executive-system.

Phase 1: Judge Consistency Check (~1.5h)

  • Add an optional reliability object to verdict-schema.json (runs, outcome_consistency, score_variance)
  • Add a --reliability flag to run-judge.sh for multi-pass evaluation (3 runs, compute consistency)
  • Flag verdicts with consistency < 0.67 as UNRELIABLE (see the sketch after this list)
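A minimal sketch of the multi-pass consistency check. The `judge` callable, the `outcome`/`score` field names, and everything except the 0.67 threshold are illustrative stand-ins, not the actual run-judge.sh interface:

```python
# Sketch of a multi-pass consistency check. `judge` is a stand-in callable
# returning a verdict dict with "outcome" and "score" fields; the real
# interface lives in run-judge.sh and may differ.
from collections import Counter
from statistics import pvariance

def reliability_check(judge, target, runs: int = 3) -> dict:
    verdicts = [judge(target) for _ in range(runs)]
    outcomes = [v["outcome"] for v in verdicts]
    scores = [v["score"] for v in verdicts]

    # Outcome consistency: fraction of runs agreeing with the majority outcome.
    majority = Counter(outcomes).most_common(1)[0][1]
    consistency = majority / runs

    return {
        "runs": runs,
        "outcome_consistency": consistency,
        "score_variance": pvariance(scores),
        "reliable": consistency >= 0.67,  # below 0.67 -> flag UNRELIABLE
    }
```

Note that with 3 runs the only possible consistency values are 1.0 and 2/3 ≈ 0.667, so a strict < 0.67 cutoff flags any disagreement at all; the spec should state whether 2-of-3 agreement is meant to pass.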

Phase 2: Quality-Gate Robustness Check (~1.5h)

  • Add a prompt-robustness check for agent-generated outputs (re-run with a paraphrased prompt and compare outputs; see the sketch below)
  • Add a robustness section to the Quality Gate Report template
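A minimal sketch of the paraphrase re-run comparison, assuming the agent and a paraphrasing step are available as callables; the callables, the string-similarity measure, and the threshold are placeholders for whatever the quality gate actually uses:

```python
# Sketch of a prompt-robustness check: run the agent on the original and a
# paraphrased prompt, then compare outputs. `agent` and `paraphrase` are
# stand-in callables; the similarity measure and threshold are illustrative.
from difflib import SequenceMatcher

def robustness_check(agent, paraphrase, prompt: str, threshold: float = 0.9) -> dict:
    original = agent(prompt)
    rerun = agent(paraphrase(prompt))
    similarity = SequenceMatcher(None, original, rerun).ratio()
    return {
        "output_similarity": similarity,
        "prompt_robust": similarity >= threshold,
    }
```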

Phase 3: Safety Compliance Extension (~1h)

  • Merge the SLB risk-tier model with the Princeton S_comp metric (CRITICAL/DANGEROUS/CAUTION/SAFE tiers for file paths)
  • Add content_hash (SHA-256) to verdict-schema.json, binding each verdict to the exact content it judged (see the sketch below)
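A minimal sketch of the path risk-tier lookup and the content hash. The glob patterns are illustrative placeholders, not the actual SLB risk-tier definitions:

```python
# Sketch of the Phase 3 pieces: a path risk-tier lookup (CRITICAL/DANGEROUS/
# CAUTION/SAFE) and a SHA-256 content hash for the verdict. The glob patterns
# below are placeholders, not the actual SLB risk-tier rules.
import hashlib
from fnmatch import fnmatch

RISK_TIERS = {
    "CRITICAL": ["/etc/*", "*/.ssh/*"],
    "DANGEROUS": ["*.env", "*/secrets/*"],
    "CAUTION": ["*/config/*"],
}

def risk_tier(path: str) -> str:
    # Most severe matching tier wins; anything unmatched is SAFE.
    for tier, patterns in RISK_TIERS.items():
        if any(fnmatch(path, pattern) for pattern in patterns):
            return tier
    return "SAFE"

def content_hash(content: str) -> str:
    # Binds a verdict to the exact content it judged.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()
```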

