
Clarify supported rubric input path for rubric_based_* metrics #132

@erauner12

Description

Hi! I’m trying to understand the intended public flow for rubric-based metrics such as rubric_based_final_response_quality_v1 and rubric_based_tool_use_quality_v1.

I realize these appear to sit on top of experimental ADK evaluator APIs. When running the final-response rubric evaluator through an internal/repo-owned helper path, I see the expected ADK experimental warnings, for example:

[EXPERIMENTAL] RubricBasedFinalResponseQualityV1Evaluator
[EXPERIMENTAL] RubricBasedEvaluator
[EXPERIMENTAL] LlmAsJudge

In that controlled path, I can construct the metric with build_eval_metric(..., rubrics=[...]) and get rubric-based scoring from rubric_based_final_response_quality_v1. For example, a small calibration run with four reviewed cases produced the expected pass/fail outcomes, and an advisory positive trace scored successfully with score: 1.0.
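For reference, the internal path I used looks roughly like the sketch below. Only build_eval_metric(..., rubrics=[...]) and the metric name are taken from the repo; the import location, the rubric entry shape, and the commented-out runner call are assumptions about what the internal helper does, not the actual signatures.

```python
# Sketch of the internal/controlled path described above. Module path,
# rubric shape, and the runner call are assumptions; only
# build_eval_metric(..., rubrics=[...]) and the metric name come from the repo.
from agentevals.builtin_metrics import build_eval_metric  # assumed location

# Hypothetical rubric entries: an ID plus the criterion text the judge sees.
rubrics = [
    {"id": "grounded", "text": "The final response only uses facts from the tool outputs."},
    {"id": "complete", "text": "The final response answers every part of the user request."},
]

metric = build_eval_metric(
    "rubric_based_final_response_quality_v1",
    rubrics=rubrics,  # the argument I could not reach from the public surface
)

# How the metric is then executed is internal; something along these lines:
# result = run_evaluation(eval_set="calibration.json", metrics=[metric])
# print(result.overall_score)  # e.g. 1.0 for the advisory positive trace
```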

So my question is less “is this broken?” and more: what is the intended public surface for this capability?

From current main, /api/metrics exposes these metrics and marks them as requiring rubrics. I also see rubrics documented on eval-set cases/invocations, and the internal builder accepts rubrics. What I could not find is the supported API/CLI/MCP/config path for supplying those rubrics when running the metrics.

This looks like it may simply be a gap in the public surface rather than a disagreement about direction: the metric metadata, the eval-set docs, and the internal RubricsBasedCriterion construction are already present, but the runner/API/config path does not yet appear to pass rubrics through. If that is the right read, I would be interested in helping fill the gap, but I wanted to ask about the preferred design before opening a PR.
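To make the design question concrete, here is roughly what I imagine the two candidate surfaces could look like. Both shapes are hypothetical; I am not claiming either exists today, and every field name and model id below is a placeholder.

```python
# Option A (hypothetical): config-level rubrics, i.e. something the eval
# config loader / API request could accept and hand to the internal builder.
config_level = {
    "metrics": [
        {
            "name": "rubric_based_final_response_quality_v1",
            "judge_model": "gemini-2.0-flash",  # placeholder model id
            "rubrics": [
                {"id": "grounded", "text": "Only uses facts from tool outputs."},
            ],
        }
    ]
}

# Option B (hypothetical): rubrics attached to the eval-set case/invocation,
# in the spirit of what docs/eval-set-format.md already describes for cases.
case_level = {
    "eval_id": "case_001",
    "conversation": [...],  # existing eval-set content, unchanged
    "rubrics": [
        {"id": "grounded", "text": "Only uses facts from tool outputs."},
    ],
}
```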

Questions:

  • Are rubric-based metrics intended to consume rubrics from eval-set case/invocation fields?
  • Is a request/config-level rubric field planned for API/CLI/MCP runs?
  • Would you prefer config-level rubrics, eval-set rubrics, or both?
  • Are these metrics intentionally marked working=false until that public surface is decided?
  • Should users treat build_eval_metric(..., rubrics=...) as internal only for now?

Relevant code/docs I checked:

  • src/agentevals/api/routes.py
  • src/agentevals/builtin_metrics.py
  • src/agentevals/config.py
  • src/agentevals/eval_config_loader.py
  • src/agentevals/cli.py
  • src/agentevals/mcp_server.py
  • docs/eval-set-format.md

The flow I’m hoping to support eventually is:

  • define rubric criteria with IDs/text
  • run rubric_based_final_response_quality_v1 with a configured judge model
  • get overall and ideally per-rubric scoring
  • do this through a supported API/CLI/MCP/config path rather than reaching into internals (a rough sketch of what that could look like follows)
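Concretely, the end state I have in mind looks something like the sketch below. Every name here (run_evals, the config/eval-set paths, the report fields) is a placeholder for whatever public surface is eventually chosen; it is only meant to illustrate the shape of the flow.

```python
# Hypothetical end-to-end flow; every name below is a placeholder for
# whatever public surface is eventually chosen.
from agentevals import run_evals  # assumed public entry point, does not exist today

report = run_evals(
    eval_set="eval_sets/support_agent.json",
    config="configs/rubric_quality.yaml",  # carries judge model + rubrics
)

for case in report.cases:
    print(case.eval_id, case.overall_score)        # overall rubric score
    for rubric_id, verdict in case.per_rubric.items():
        print(f"  {rubric_id}: {verdict}")         # ideally per-rubric scoring
```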

I asked a similar question in Discord and wanted to open a tracking issue for the intended public API/config direction.

Labels

enhancement (New feature or request)
