Merged
11 changes: 11 additions & 0 deletions README.md
@@ -49,6 +49,14 @@ Recommended boundary:

Agent Bench Lab should not need to know which consumer application is using it.

## Private eval and scorer contracts

Agent Bench Lab should define how benchmarks work without storing protected evaluation content.

The Private Eval Layer holds hidden labels, private holdouts, answer keys, protected scorer configs, canaries, customer-specific checks, and redaction rules outside the public repo. Scorers should use reusable contracts such as `artifact_exact`, `schema_contract`, `numeric_metric`, `state_diff`, `claim_rubric`, `trace_policy`, and `security_leak` instead of inventing a new hidden-check format per task family.

See [Private Eval Layer](docs/private-eval-layer.md), [Scorer type contracts](docs/scorer-types.md), and [Reporting and feedback](docs/reporting-and-feedback.md).

## Current status

This repository is a **v0 public starter**. It contains:
@@ -210,6 +218,9 @@ agent-bench-lab/

- [Documentation index](docs/README.md)
- [Canonical scope and consumer boundary](docs/canonical-scope-and-consumer-boundary.md)
- [Private Eval Layer](docs/private-eval-layer.md)
- [Scorer type contracts](docs/scorer-types.md)
- [Reporting and feedback](docs/reporting-and-feedback.md)
- [Task authoring](docs/05-task-authoring.md)
- [Public/private split](docs/07-public-private-split.md)
- [Run records](docs/09-run-records.md)
32 changes: 32 additions & 0 deletions docs/05-task-authoring.md
@@ -57,3 +57,35 @@ A valid task can be any workflow with a checkable artifact or state change.
| Customer private check | customer-owned fixtures and rubric | customer-specific artifact | protected scorer bundle |

Avoid writing tasks whose only oracle is "the answer looks good".

## Scorer Contracts

Prefer reusable scorer contracts over family-specific hidden-check formats.

Common contracts include:

- `artifact_exact` for required files, forbidden files, and exact artifact checks;
- `schema_contract` for JSON/YAML/CSV structure;
- `numeric_metric` for exact values and tolerances;
- `state_diff` for database, API, filesystem, or app state;
- `claim_rubric` for factual claims and citations;
- `trace_policy` for tool-use and action policy;
- `security_leak` for canaries, leakage, and exfiltration checks.

See [Scorer type contracts](scorer-types.md).
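
As a rough illustration of what a reusable contract could look like in code, the sketch below defines a shared result shape and one `artifact_exact` scorer. The class names, field names, and the equal weighting of checks are assumptions made for this example; the actual contract definitions live in the scorer type document and may differ.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass
class ScoreResult:
    """Normalized scorer output (field names are illustrative assumptions)."""
    contract: str
    score: float                      # normalized to the range 0.0 to 1.0
    success: bool
    failures: list[str] = field(default_factory=list)  # broad categories only


class Scorer(Protocol):
    """Shape every scorer contract would share in this sketch."""
    contract: str

    def score(self, artifacts: dict[str, str]) -> ScoreResult: ...


@dataclass
class ArtifactExactScorer:
    """Hypothetical `artifact_exact` scorer: required, forbidden, and exact checks."""
    required: dict[str, Optional[str]]    # path -> expected content, None = must exist
    forbidden: set[str] = field(default_factory=set)
    contract: str = "artifact_exact"

    def score(self, artifacts: dict[str, str]) -> ScoreResult:
        failures = []
        for path, expected in self.required.items():
            if path not in artifacts:
                failures.append("required artifact is missing")
            elif expected is not None and artifacts[path] != expected:
                failures.append("artifact content does not match")
        if self.forbidden & artifacts.keys():
            failures.append("forbidden artifact is present")
        # One check per required path plus one check for the forbidden set.
        checks = len(self.required) + 1
        passed = checks - len(failures)
        return ScoreResult(self.contract, passed / checks, not failures, failures)
```

Because every scorer returns the same `ScoreResult` shape, a runner can aggregate results without knowing which contract produced them. The failure strings are deliberately category-level so they can be shown in redacted feedback.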

## Decision-Grade Criteria

A task family is not decision-grade just because public examples pass.

Before marking a family decision-grade, require:

- deterministic or audited primary oracle;
- public synthetic examples;
- private holdout strategy;
- mutation strategy;
- normalized score output;
- documented visibility boundary;
- redacted feedback policy;
- exploit or leak smoke test when relevant;
- versioning and changelog policy.
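
The checklist above can be turned into a simple gate. The following is a minimal sketch, assuming each task family carries a metadata dictionary with one truthy entry per criterion; the key names and the `needs_leak_test` flag are invented for illustration and are not part of any defined schema.

```python
# Criteria that always apply, using hypothetical metadata key names.
DECISION_GRADE_CRITERIA = (
    "primary_oracle",
    "public_synthetic_examples",
    "private_holdout_strategy",
    "mutation_strategy",
    "normalized_score_output",
    "visibility_boundary",
    "redacted_feedback_policy",
    "versioning_and_changelog",
)
# The exploit or leak smoke test is required only "when relevant".
CONDITIONAL_CRITERIA = ("exploit_or_leak_smoke_test",)


def missing_criteria(family: dict, needs_leak_test: bool = True) -> list[str]:
    """Return the criteria a family still lacks; an empty list means it may graduate."""
    required = list(DECISION_GRADE_CRITERIA)
    if needs_leak_test:
        required += CONDITIONAL_CRITERIA
    return [name for name in required if not family.get(name)]
```

A family with passing public examples but no holdout or mutation strategy would still return a non-empty list here, which matches the rule that public passes alone do not make a family decision-grade.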
2 changes: 2 additions & 0 deletions docs/07-public-private-split.md
@@ -63,6 +63,8 @@ Agent Bench Lab should not store private eval secrets.
Consumer applications should not expose scorer-only content to agents.
The Private Eval Layer is a boundary, not a required implementation; it can be a gitignored local folder, private repo, encrypted bundle, customer-scoped storage, or enterprise eval service.

See [Private Eval Layer](private-eval-layer.md) for the recommended private bundle shape and full visibility matrix.

## Customer-specific private holdouts

Customer-specific benchmark data must be treated as protected evaluation content.
19 changes: 19 additions & 0 deletions docs/10-comparing-setups.md
@@ -66,3 +66,22 @@ different setup
```

Consumer applications should consume Agent Bench Lab task families and scorer outputs rather than defining parallel benchmark logic.

## Feedback Safety

Comparison reports should identify regressions and failure categories without revealing hidden
labels, answer keys, canary strings, private thresholds, or exact hidden rubric text.

Prefer redacted diagnostics such as:

- `numeric total is incorrect`;
- `unsupported claim found in report`;
- `policy classification failed`.

Avoid feedback such as:

- `the correct hidden label is urgent_refund`;
- `hidden row HONEY_482 triggered`;
- `private threshold is 0.87`.

See [Reporting and feedback](reporting-and-feedback.md).
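
One way to report regressions without exposing hidden values is to compare only counts of broad failure categories between two setups. The sketch below assumes each run yields a list of already-redacted category strings; the function name and input shape are illustrative.

```python
from collections import Counter


def compare_failure_categories(baseline: list[str], candidate: list[str]) -> dict[str, int]:
    """Return the per-category change in failure counts (candidate minus baseline).

    Only category names and counts appear in the output, so no hidden label,
    threshold, or row identifier can leak through the comparison itself.
    """
    before, after = Counter(baseline), Counter(candidate)
    categories = sorted(before.keys() | after.keys())
    return {cat: after[cat] - before[cat] for cat in categories if after[cat] != before[cat]}
```

A positive number marks a regression in that category and a negative number marks an improvement; unchanged categories are omitted to keep the report short.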
27 changes: 15 additions & 12 deletions docs/README.md
@@ -5,15 +5,18 @@ Start here:
1. [Concept](01-concept.md)
2. [Architecture](02-architecture.md)
3. [Canonical scope and consumer boundary](canonical-scope-and-consumer-boundary.md)
-4. [Evaluation protocol](03-evaluation-protocol.md)
-5. [Anti-overfitting controls](04-anti-overfitting.md)
-6. [Task authoring](05-task-authoring.md)
-7. [Metrics](06-metrics.md)
-8. [Public/private split](07-public-private-split.md)
-9. [Codex prompt](08-codex-prompt.md)
-10. [Run records](09-run-records.md)
-11. [Comparing setups](10-comparing-setups.md)
-12. [IF-01 decision-grade pattern](11-if01-decision-grade.md)
-13. [v0 roadmap](roadmap-v0.md)
-14. [Public release checklist](public-release-checklist.md)
-15. [Decision log template](decision-log-template.md)
+4. [Private Eval Layer](private-eval-layer.md)
+5. [Scorer type contracts](scorer-types.md)
+6. [Reporting and feedback](reporting-and-feedback.md)
+7. [Evaluation protocol](03-evaluation-protocol.md)
+8. [Anti-overfitting controls](04-anti-overfitting.md)
+9. [Task authoring](05-task-authoring.md)
+10. [Metrics](06-metrics.md)
+11. [Public/private split](07-public-private-split.md)
+12. [Codex prompt](08-codex-prompt.md)
+13. [Run records](09-run-records.md)
+14. [Comparing setups](10-comparing-setups.md)
+15. [IF-01 decision-grade pattern](11-if01-decision-grade.md)
+16. [v0 roadmap](roadmap-v0.md)
+17. [Public release checklist](public-release-checklist.md)
+18. [Decision log template](decision-log-template.md)
14 changes: 14 additions & 0 deletions docs/canonical-scope-and-consumer-boundary.md
@@ -56,3 +56,17 @@ They should not be committed to public repositories, shown to the agent, or mixe
Consumer applications can keep references, hashes, bundle versions, permissions, and result summaries. Agent Bench Lab should remain the source of truth for task-family format, scorer semantics, run records, trace expectations, comparison rules, and public/private hygiene.

Agent Bench Lab should not know which consumer application uses it.

## Standard Layer Contracts

Agent Bench Lab should define:

- private eval boundary expectations;
- scorer type contracts;
- normalized score records;
- trace and report expectations;
- redacted feedback rules;
- decision-grade graduation criteria.

The protected content itself belongs in the Private Eval Layer. User interaction, scheduling,
permissions, task delivery, and report display belong in consumer applications.
118 changes: 118 additions & 0 deletions docs/private-eval-layer.md
@@ -0,0 +1,118 @@
# Private Eval Layer

The Private Eval Layer is the protected boundary for evaluation content that must not live in the
public repository and must not be shown to agents.

It is a boundary, not a required product or repository. It can be implemented as:

- a gitignored local folder;
- a private repository;
- an encrypted bundle;
- customer-scoped storage;
- a scorer-only runtime mount;
- an enterprise eval service.

## What It Owns

- private holdouts;
- hidden labels;
- answer keys;
- protected scorer configs;
- canaries;
- hidden rubrics;
- customer-specific checks;
- private corpora;
- redaction rules;
- exploit smoke tests.

## Core Rule

```text
agent-visible task packet != scorer-visible scoring bundle
```

The agent receives only the task instructions, allowed fixtures, and allowed tools.
The scorer may receive protected labels, answer keys, rubrics, canaries, and hidden checks.
Consumer applications may keep references, hashes, permissions, and redacted reports, but must not
expose scorer-only content to agents.
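
The core rule can be enforced with an allowlist: only keys known to be agent-safe reach the agent, and everything else defaults to scorer-only. This is a minimal sketch; the key names and the flat dictionary task format are assumptions for illustration, not a defined task schema.

```python
# Keys the agent may see. Anything not listed here is treated as scorer-only,
# so a newly added private field is protected by default.
AGENT_VISIBLE_KEYS = {"task_id", "instructions", "fixtures", "tools"}


def split_task(task: dict) -> tuple[dict, dict]:
    """Split one task definition into an agent packet and a scorer bundle."""
    agent_packet = {k: v for k, v in task.items() if k in AGENT_VISIBLE_KEYS}
    scorer_bundle = {k: v for k, v in task.items() if k not in AGENT_VISIBLE_KEYS}
    # The scorer still needs the identifier to match results to the task.
    scorer_bundle["task_id"] = task["task_id"]
    return agent_packet, scorer_bundle
```

An allowlist is used instead of a denylist so that forgetting to classify a new field fails closed (hidden from the agent) rather than open.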

## Recommended Bundle Shape

```text
private_eval_bundle/
bundle.yaml
lineage/
public_stub/
fixtures_private/
labels_hidden/
scorer/
integrity/
feedback/
reports/
```

Recommended contents:

| Path | Purpose |
|---|---|
| `bundle.yaml` | Manifest with identity, versions, hashes, and policies |
| `lineage/` | Provenance notes, generator version, reviewer notes |
| `public_stub/` | Safe public-style metadata or synthetic mirror |
| `fixtures_private/` | Private fixtures and holdout seeds |
| `labels_hidden/` | Hidden labels, expected values, answer keys |
| `scorer/` | Protected scorer config and private scorer extensions |
| `integrity/` | Canaries, leak checks, exploit smoke tests, checksum tree |
| `feedback/` | Redaction rules and safe feedback templates |
| `reports/` | Private reports, kept outside public repos |
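
A small layout check can confirm that a bundle follows the recommended shape before a scorer mounts it. The sketch below treats every entry in the table as expected; since the shape is only a recommendation, a real check might treat some entries as optional. The function name is hypothetical.

```python
from pathlib import Path

# Entry name -> expected kind, taken from the recommended bundle shape.
EXPECTED_ENTRIES = {
    "bundle.yaml": "file",
    "lineage": "dir",
    "public_stub": "dir",
    "fixtures_private": "dir",
    "labels_hidden": "dir",
    "scorer": "dir",
    "integrity": "dir",
    "feedback": "dir",
    "reports": "dir",
}


def missing_bundle_entries(root: Path) -> list[str]:
    """Return the names of expected entries that are absent or of the wrong kind."""
    missing = []
    for name, kind in EXPECTED_ENTRIES.items():
        path = root / name
        present = path.is_file() if kind == "file" else path.is_dir()
        if not present:
            missing.append(name)
    return missing
```

The function reports names only and never reads file contents, so it can run in a runner context that has a path reference but no permission to open hidden labels.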

## Manifest Metadata

`bundle.yaml` should include:

```yaml
bundle_id: customer_or_suite_bundle_id
bundle_version: 2026-05-25.1
task_family: DATA-01
task_version: 0.2.0
fixture_hash: sha256:...
scorer_config_hash: sha256:...
visibility_policy: scorer_only_private_labels
redaction_policy: no_hidden_labels_or_thresholds
canary_policy: hidden_tripwire_only
created_at: 2026-05-25T00:00:00Z
reviewer: reviewer_id_or_team
provenance: synthetic_or_customer_approved_source
checksum_tree: sha256:...
```

Do not put raw private data, hidden labels, answer keys, private rubrics, or customer secrets in
public run records or public reports. Public records may include hashes and bundle identifiers.
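
The manifest fields above lend themselves to a basic validation step. The following sketch assumes the manifest has already been parsed into a dictionary (the YAML parsing step is omitted) and that hash values use a `sha256:` prefix followed by 64 lowercase hex characters; that format is an assumption inferred from the example, not a stated requirement.

```python
import hashlib
import re

REQUIRED_MANIFEST_KEYS = (
    "bundle_id", "bundle_version", "task_family", "task_version",
    "fixture_hash", "scorer_config_hash", "visibility_policy",
    "redaction_policy", "canary_policy", "created_at", "reviewer",
    "provenance", "checksum_tree",
)
HASH_KEYS = ("fixture_hash", "scorer_config_hash", "checksum_tree")
HASH_PATTERN = re.compile(r"sha256:[0-9a-f]{64}")


def sha256_ref(data: bytes) -> str:
    """Format a SHA-256 digest the way the manifest example writes hashes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def manifest_problems(manifest: dict) -> list[str]:
    """List missing keys and malformed hash values; an empty list means no problems found."""
    problems = [f"missing key: {key}" for key in REQUIRED_MANIFEST_KEYS if key not in manifest]
    for key in HASH_KEYS:
        if key in manifest and not HASH_PATTERN.fullmatch(str(manifest[key])):
            problems.append(f"malformed hash: {key}")
    return problems
```

Public run records can then carry the output of `sha256_ref` and the bundle identifier, which lets a consumer verify which bundle was used without ever seeing its contents.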

## Visibility Matrix

| Item | Public repo | Agent | Runner | Scorer | Consumer application |
|---|---|---|---|---|---|
| Task schema | yes | optional | yes | yes | yes |
| Public fixture | yes | yes | yes | yes | yes |
| Private fixture | no | task-specific view only | path/reference | yes | reference/hash |
| Hidden label | no | no | no | yes | no raw value |
| Answer key | no | no | no | yes | no raw value |
| Scorer code | public interface | no | yes | yes | yes |
| Protected scorer config | no | no | reference/hash | yes | reference/hash |
| Run record | sanitized | no | yes | yes | redacted |
| Trace | no raw trace in public | no | yes | yes | redacted |
| Final artifact | no by default | produced by agent | yes | yes | yes/redacted |
| Score report | public summary only | no | yes | yes | redacted |
| Redacted feedback | optional | yes | yes | yes | yes |
| Canary strings | no | no | no | yes | no raw value |
| Customer data | no | only if task allows | path/reference | yes | access-controlled |
| Consumer UI config | no | no | optional | no | yes |
| Agent setup config | optional sanitized | no | yes | optional | yes |
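
Part of the matrix can be encoded as data so that access decisions are checked in code rather than by convention. This sketch copies five rows from the table verbatim; the role identifiers and helper names are assumptions, and any cell other than a plain `yes` is treated as "no raw access".

```python
# Column order matches the visibility matrix.
ROLES = ("public_repo", "agent", "runner", "scorer", "consumer_application")

# A subset of rows, with cell values copied from the matrix.
VISIBILITY = {
    "public_fixture": ("yes", "yes", "yes", "yes", "yes"),
    "hidden_label": ("no", "no", "no", "yes", "no raw value"),
    "answer_key": ("no", "no", "no", "yes", "no raw value"),
    "protected_scorer_config": ("no", "no", "reference/hash", "yes", "reference/hash"),
    "canary_strings": ("no", "no", "no", "yes", "no raw value"),
}


def access(role: str, item: str) -> str:
    """Return the matrix cell for a role and item."""
    return VISIBILITY[item][ROLES.index(role)]


def may_see_raw(role: str, item: str) -> bool:
    """True only when the matrix grants unrestricted access."""
    return access(role, item) == "yes"
```

Treating qualified cells such as `reference/hash` as a refusal for raw access keeps the helper conservative: a caller has to handle those cases explicitly instead of receiving protected content by accident.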

## Canaries

Canaries are integrity tripwires, not the main defense.

Use canaries alongside fixed or replayed fixtures, sealed holdouts, exploit smoke tests, narrow runner
visibility, and redacted feedback. A canary hit should trigger investigation, but canaries alone
should not be the reason a benchmark is considered protected.
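
A canary check on the scorer side might look like the sketch below. It assumes canaries are plain strings matched by substring search, and it reports a truncated hash of each hit instead of the raw string so that the result can be logged without copying the canary itself. Both choices are illustrative; the canary strings used with it here would be made up for testing.

```python
import hashlib


def canary_hits(text: str, canaries: list[str]) -> list[str]:
    """Return short hashed identifiers for canaries found in the text.

    The raw canary strings are never returned. The identifiers are meant for
    scorer-private investigation; a public record would carry only a flag.
    """
    return [
        hashlib.sha256(canary.encode("utf-8")).hexdigest()[:12]
        for canary in canaries
        if canary in text
    ]


def leak_flag(text: str, canaries: list[str]) -> bool:
    """Boolean summary suitable for a redacted report."""
    return bool(canary_hits(text, canaries))
```

Exact substring matching misses paraphrased or re-encoded leaks, which is one reason a canary is a tripwire and not the main defense.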
75 changes: 75 additions & 0 deletions docs/reporting-and-feedback.md
@@ -0,0 +1,75 @@
# Reporting And Feedback

Benchmark feedback should be useful without revealing hidden evaluation content.

Reports may include scores, deltas, component names, broad failure categories, policy violations,
latency, cost, and tool-call counts. Reports must not reveal private fixtures, hidden labels,
answer keys, private thresholds, canary strings, protected scorer configs, or exact hidden rubric
text.

## Feedback Boundary

Use three levels:

| Level | Audience | Content |
|---|---|---|
| Public summary | public repo, public docs | aggregate scores, public smoke examples, no private details |
| Consumer report | consumer application, authorized user | redacted component failures and safe diagnostics |
| Scorer-private report | scorer/runtime owners | hidden labels, answer keys, canary hits, private rubric detail |

Do not give scorer-private reports to agents. Do not emit feedback that would let an agent
reconstruct hidden labels, thresholds, or private holdout contents.

## Good Feedback

- `numeric total is incorrect`
- `unsupported claim found in report`
- `policy classification failed`
- `required artifact is missing`
- `chart series does not match expected labels`
- `exfiltration attempt detected`

## Bad Feedback

- `correct answer was refund_escalation_high_priority`
- `hidden row HONEY_482 triggered`
- `private threshold is 0.87`
- `expected customer was ACME_PRIVATE_17`
- `the hidden label for case_204 is urgent_security_refund`
- `the canary string was copied into the final answer`

## Report Fields

Use normalized run and score fields where possible:

```text
score
success
pass@1
pass^k
latency_seconds
cost_usd
tool_calls
model_calls
mutation_score
leak_flags
policy_violations
```

Not every field is required in v0. Missing cost or latency data should be marked explicitly as
missing, not silently filled with invented values.
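
A record type with optional fields is one way to keep missing data explicit. In this sketch, unknown values stay `None` and serialize to JSON `null`. The Python attribute names `pass_at_1` and `pass_pow_k` are stand-ins for `pass@1` and `pass^k`, which are not valid identifiers; the class name, field types, and JSON layout are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunScoreRecord:
    """Normalized run and score fields; None means 'not measured'."""
    score: float
    success: bool
    pass_at_1: Optional[float] = None
    pass_pow_k: Optional[float] = None
    latency_seconds: Optional[float] = None
    cost_usd: Optional[float] = None
    tool_calls: Optional[int] = None
    model_calls: Optional[int] = None
    mutation_score: Optional[float] = None
    leak_flags: Optional[list[str]] = None
    policy_violations: Optional[list[str]] = None

    def to_json(self) -> str:
        data = asdict(self)
        # Restore the field names used in the report field list.
        data["pass@1"] = data.pop("pass_at_1")
        data["pass^k"] = data.pop("pass_pow_k")
        return json.dumps(data, sort_keys=True)
```

Serializing absent values as `null` instead of `0` lets a comparison step distinguish "cost was zero" from "cost was not recorded".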

## Redaction Rules

Redacted feedback should:

- name the failing component;
- describe the class of error;
- avoid exact hidden values;
- avoid private thresholds;
- avoid hidden labels;
- avoid canary strings;
- avoid revealing which hidden row, customer, source, or fixture caused the failure.

Scorer-private reports can be more specific, but they must stay outside public repositories and
outside agent-visible context.
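
The redaction rules can be applied mechanically by building the redacted message from an allowlist of safe templates, so that nothing from the scorer-private record passes through unless it is explicitly permitted. The category keys, record fields, and fallback message below are hypothetical; the safe messages reuse the examples of good feedback.

```python
# Category -> safe diagnostic. Category keys are invented for this sketch.
SAFE_MESSAGES = {
    "numeric_mismatch": "numeric total is incorrect",
    "unsupported_claim": "unsupported claim found in report",
    "policy_misclassification": "policy classification failed",
    "missing_artifact": "required artifact is missing",
}
FALLBACK_MESSAGE = "check failed"


def redact_failure(private_failure: dict) -> dict:
    """Build agent-safe feedback from a scorer-private failure record.

    Only the failing component name and the error class are carried over.
    Expected values, thresholds, row identifiers, and any other private
    fields are dropped because they are never copied into the output.
    """
    category = private_failure.get("category", "")
    return {
        "component": private_failure.get("component", "unknown"),
        "message": SAFE_MESSAGES.get(category, FALLBACK_MESSAGE),
    }
```

Constructing the output from scratch, instead of deleting known-sensitive keys from a copy of the private record, means a newly added private field cannot leak by being forgotten.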