
Clarify supported rubric input path for rubric_based_* metrics #132

@erauner12

Description

Hi! I’m trying to understand the intended public flow for rubric-based metrics such as rubric_based_final_response_quality_v1 and rubric_based_tool_use_quality_v1.

I realize these appear to sit on top of experimental ADK evaluator APIs. When running the final-response rubric evaluator through an internal/repo-owned helper path, I see the expected ADK experimental warnings, for example:

[EXPERIMENTAL] RubricBasedFinalResponseQualityV1Evaluator
[EXPERIMENTAL] RubricBasedEvaluator
[EXPERIMENTAL] LlmAsJudge

In that controlled path, I can construct the metric with build_eval_metric(..., rubrics=[...]) and get rubric-based scoring from rubric_based_final_response_quality_v1. For example, a small calibration run with four reviewed cases produced the expected pass/fail outcomes, and an advisory positive trace scored successfully with score: 1.0.
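For reference, the internal path I used looks roughly like the sketch below. Only build_eval_metric(..., rubrics=[...]) and the metric name are taken from the repo; the import location, the rubric entry shape, and the commented-out runner call are assumptions about what the internal helper does, not the actual signatures.

```python
# Sketch of the internal/controlled path described above. Module path,
# rubric shape, and the runner call are assumptions; only
# build_eval_metric(..., rubrics=[...]) and the metric name come from the repo.
from agentevals.builtin_metrics import build_eval_metric  # assumed location

# Hypothetical rubric entries: an ID plus the criterion text the judge sees.
rubrics = [
    {"id": "grounded", "text": "The final response only uses facts from the tool outputs."},
    {"id": "complete", "text": "The final response answers every part of the user request."},
]

metric = build_eval_metric(
    "rubric_based_final_response_quality_v1",
    rubrics=rubrics,  # the argument I could not reach from the public surface
)

# How the metric is then executed is internal; something along these lines:
# result = run_evaluation(eval_set="calibration.json", metrics=[metric])
# print(result.overall_score)  # e.g. 1.0 for the advisory positive trace
```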

So my question is less “is this broken?” and more: what is the intended public surface for this capability?

From current main, /api/metrics exposes these metrics and marks them as requiring rubrics. I also see rubrics documented on eval-set cases/invocations, and the internal builder accepts rubrics. What I could not find is the supported API/CLI/MCP/config path for supplying those rubrics when running the metrics.

This looks like it may simply be a gap in the public surface rather than a disagreement about direction: the metric metadata, the eval-set docs, and the internal RubricsBasedCriterion construction are already present, but the runner/API/config path does not yet appear to pass rubrics through. If that is the right read, I would be interested in helping fill the gap, but I wanted to ask about the preferred design before opening a PR.
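To make the design question concrete, here is roughly what I imagine the two candidate surfaces could look like. Both shapes are hypothetical; I am not claiming either exists today, and every field name and model id below is a placeholder.

```python
# Option A (hypothetical): config-level rubrics, i.e. something the eval
# config loader / API request could accept and hand to the internal builder.
config_level = {
    "metrics": [
        {
            "name": "rubric_based_final_response_quality_v1",
            "judge_model": "gemini-2.0-flash",  # placeholder model id
            "rubrics": [
                {"id": "grounded", "text": "Only uses facts from tool outputs."},
            ],
        }
    ]
}

# Option B (hypothetical): rubrics attached to the eval-set case/invocation,
# in the spirit of what docs/eval-set-format.md already describes for cases.
case_level = {
    "eval_id": "case_001",
    "conversation": [...],  # existing eval-set content, unchanged
    "rubrics": [
        {"id": "grounded", "text": "Only uses facts from tool outputs."},
    ],
}
```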

Questions:

  • Are rubric-based metrics intended to consume rubrics from eval-set case/invocation fields?
  • Is a request/config-level rubric field planned for API/CLI/MCP runs?
  • Would you prefer config-level rubrics, eval-set rubrics, or both?
  • Are these metrics intentionally marked working=false until that public surface is decided?
  • Should users treat build_eval_metric(..., rubrics=...) as internal only for now?

Relevant code/docs I checked:

  • src/agentevals/api/routes.py
  • src/agentevals/builtin_metrics.py
  • src/agentevals/config.py
  • src/agentevals/eval_config_loader.py
  • src/agentevals/cli.py
  • src/agentevals/mcp_server.py
  • docs/eval-set-format.md

The flow I’m hoping to support eventually is:

  • define rubric criteria with IDs/text
  • run rubric_based_final_response_quality_v1 with a configured judge model
  • get overall and ideally per-rubric scoring
  • do this through a supported API/CLI/MCP/config path rather than reaching into internals (a rough sketch of what that could look like follows)
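Concretely, the end state I have in mind looks something like the sketch below. Every name here (run_evals, the config/eval-set paths, the report fields) is a placeholder for whatever public surface is eventually chosen; it is only meant to illustrate the shape of the flow.

```python
# Hypothetical end-to-end flow; every name below is a placeholder for
# whatever public surface is eventually chosen.
from agentevals import run_evals  # assumed public entry point, does not exist today

report = run_evals(
    eval_set="eval_sets/support_agent.json",
    config="configs/rubric_quality.yaml",  # carries judge model + rubrics
)

for case in report.cases:
    print(case.eval_id, case.overall_score)        # overall rubric score
    for rubric_id, verdict in case.per_rubric.items():
        print(f"  {rubric_id}: {verdict}")         # ideally per-rubric scoring
```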

I asked a similar question in Discord and wanted to open a tracking issue for the intended public API/config direction.

Labels

enhancement (New feature or request)
