diff --git a/README.md b/README.md
index fe94c34..35e142f 100644
--- a/README.md
+++ b/README.md
@@ -49,6 +49,14 @@ Recommended boundary:
 
 Agent Bench Lab should not need to know which consumer application is using it.
 
+## Private eval and scorer contracts
+
+Agent Bench Lab should define how benchmarks work without storing protected evaluation content.
+
+The Private Eval Layer holds hidden labels, private holdouts, answer keys, protected scorer configs, canaries, customer-specific checks, and redaction rules outside the public repo. Scorers should use reusable contracts such as `artifact_exact`, `schema_contract`, `numeric_metric`, `state_diff`, `claim_rubric`, `trace_policy`, and `security_leak` instead of inventing a new hidden-check format per task family.
+
+See [Private Eval Layer](docs/private-eval-layer.md), [Scorer type contracts](docs/scorer-types.md), and [Reporting and feedback](docs/reporting-and-feedback.md).
+
 ## Current status
 
 This repository is a **v0 public starter**. It contains:
@@ -210,6 +218,9 @@ agent-bench-lab/
 
 - [Documentation index](docs/README.md)
 - [Canonical scope and consumer boundary](docs/canonical-scope-and-consumer-boundary.md)
+- [Private Eval Layer](docs/private-eval-layer.md)
+- [Scorer type contracts](docs/scorer-types.md)
+- [Reporting and feedback](docs/reporting-and-feedback.md)
 - [Task authoring](docs/05-task-authoring.md)
 - [Public/private split](docs/07-public-private-split.md)
 - [Run records](docs/09-run-records.md)
diff --git a/docs/05-task-authoring.md b/docs/05-task-authoring.md
index 305b4d9..6ef798c 100644
--- a/docs/05-task-authoring.md
+++ b/docs/05-task-authoring.md
@@ -57,3 +57,35 @@ A valid task can be any workflow with a checkable artifact or state change.
 | Customer private check | customer-owned fixtures and rubric | customer-specific artifact | protected scorer bundle |
 
 Avoid writing tasks whose only oracle is "the answer looks good".
+
+## Scorer Contracts
+
+Prefer reusable scorer contracts over family-specific hidden-check formats.
+
+Common contracts include:
+
+- `artifact_exact` for required files, forbidden files, and exact artifact checks;
+- `schema_contract` for JSON/YAML/CSV structure;
+- `numeric_metric` for exact values and tolerances;
+- `state_diff` for database, API, filesystem, or app state;
+- `claim_rubric` for factual claims and citations;
+- `trace_policy` for tool-use and action policy;
+- `security_leak` for canaries, leakage, and exfiltration checks.
+
+See [Scorer type contracts](scorer-types.md).
+
+## Decision-Grade Criteria
+
+A task family is not decision-grade just because public examples pass.
+
+Before marking a family decision-grade, require:
+
+- deterministic or audited primary oracle;
+- public synthetic examples;
+- private holdout strategy;
+- mutation strategy;
+- normalized score output;
+- documented visibility boundary;
+- redacted feedback policy;
+- exploit or leak smoke test when relevant;
+- versioning and changelog policy.
diff --git a/docs/07-public-private-split.md b/docs/07-public-private-split.md
index dd26438..3831ec2 100644
--- a/docs/07-public-private-split.md
+++ b/docs/07-public-private-split.md
@@ -63,6 +63,8 @@ Agent Bench Lab should not store private eval secrets.
 Consumer applications should not expose scorer-only content to agents.
 The Private Eval Layer is a boundary, not a required implementation; it can be a gitignored local folder, private repo, encrypted bundle, customer-scoped storage, or enterprise eval service.
 
+See [Private Eval Layer](private-eval-layer.md) for the recommended private bundle shape and full visibility matrix.
+
 ## Customer-specific private holdouts
 
 Customer-specific benchmark data must be treated as protected evaluation content.
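
The decision-grade criteria added to `docs/05-task-authoring.md` read like a checklist, so one way to make them checkable is a small record with one flag per criterion. The following is a minimal sketch under stated assumptions: the class name, field names, and the convention that `None` means "not relevant" for the smoke-test criterion are all hypothetical and are not defined anywhere in this change.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class DecisionGradeChecklist:
    """One flag per criterion. Field names are illustrative, not a fixed schema."""

    deterministic_or_audited_oracle: bool = False
    public_synthetic_examples: bool = False
    private_holdout_strategy: bool = False
    mutation_strategy: bool = False
    normalized_score_output: bool = False
    documented_visibility_boundary: bool = False
    redacted_feedback_policy: bool = False
    versioning_and_changelog_policy: bool = False
    # "Exploit or leak smoke test when relevant": None marks it as not relevant
    # (an assumption of this sketch).
    exploit_or_leak_smoke_test: Optional[bool] = False


def missing_criteria(checklist: DecisionGradeChecklist) -> List[str]:
    """Return the names of criteria that are relevant but not yet satisfied."""
    missing = []
    for f in fields(checklist):
        value = getattr(checklist, f.name)
        if value is None:  # criterion explicitly marked not relevant
            continue
        if not value:
            missing.append(f.name)
    return missing


def is_decision_grade(checklist: DecisionGradeChecklist) -> bool:
    """A family graduates only when nothing relevant is missing."""
    return not missing_criteria(checklist)
```

Listing what is missing, rather than returning only a boolean, keeps the check useful during review: a family with passing public examples but no private holdout strategy shows up by name.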
diff --git a/docs/10-comparing-setups.md b/docs/10-comparing-setups.md
index dec6696..4ceee95 100644
--- a/docs/10-comparing-setups.md
+++ b/docs/10-comparing-setups.md
@@ -66,3 +66,22 @@ different setup
 ```
 
 Consumer applications should consume Agent Bench Lab task families and scorer outputs rather than defining parallel benchmark logic.
+
+## Feedback Safety
+
+Comparison reports should identify regressions and failure categories without revealing hidden
+labels, answer keys, canary strings, private thresholds, or exact hidden rubric text.
+
+Prefer redacted diagnostics such as:
+
+- `numeric total is incorrect`;
+- `unsupported claim found in report`;
+- `policy classification failed`.
+
+Avoid feedback such as:
+
+- `the correct hidden label is urgent_refund`;
+- `hidden row HONEY_482 triggered`;
+- `private threshold is 0.87`.
+
+See [Reporting and feedback](reporting-and-feedback.md).
diff --git a/docs/README.md b/docs/README.md
index 275d3b0..0162665 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -5,15 +5,18 @@ Start here:
 1. [Concept](01-concept.md)
 2. [Architecture](02-architecture.md)
 3. [Canonical scope and consumer boundary](canonical-scope-and-consumer-boundary.md)
-4. [Evaluation protocol](03-evaluation-protocol.md)
-5. [Anti-overfitting controls](04-anti-overfitting.md)
-6. [Task authoring](05-task-authoring.md)
-7. [Metrics](06-metrics.md)
-8. [Public/private split](07-public-private-split.md)
-9. [Codex prompt](08-codex-prompt.md)
-10. [Run records](09-run-records.md)
-11. [Comparing setups](10-comparing-setups.md)
-12. [IF-01 decision-grade pattern](11-if01-decision-grade.md)
-13. [v0 roadmap](roadmap-v0.md)
-14. [Public release checklist](public-release-checklist.md)
-15. [Decision log template](decision-log-template.md)
+4. [Private Eval Layer](private-eval-layer.md)
+5. [Scorer type contracts](scorer-types.md)
+6. [Reporting and feedback](reporting-and-feedback.md)
+7. [Evaluation protocol](03-evaluation-protocol.md)
+8. [Anti-overfitting controls](04-anti-overfitting.md)
+9. [Task authoring](05-task-authoring.md)
+10. [Metrics](06-metrics.md)
+11. [Public/private split](07-public-private-split.md)
+12. [Codex prompt](08-codex-prompt.md)
+13. [Run records](09-run-records.md)
+14. [Comparing setups](10-comparing-setups.md)
+15. [IF-01 decision-grade pattern](11-if01-decision-grade.md)
+16. [v0 roadmap](roadmap-v0.md)
+17. [Public release checklist](public-release-checklist.md)
+18. [Decision log template](decision-log-template.md)
diff --git a/docs/canonical-scope-and-consumer-boundary.md b/docs/canonical-scope-and-consumer-boundary.md
index 11d8a6c..b1df942 100644
--- a/docs/canonical-scope-and-consumer-boundary.md
+++ b/docs/canonical-scope-and-consumer-boundary.md
@@ -56,3 +56,17 @@ They should not be committed to public repositories, shown to the agent, or mixe
 Consumer applications can keep references, hashes, bundle versions, permissions, and result summaries.
 Agent Bench Lab should remain the source of truth for task-family format, scorer semantics, run records, trace expectations, comparison rules, and public/private hygiene.
 Agent Bench Lab should not know which consumer application uses it.
+
+## Standard Layer Contracts
+
+Agent Bench Lab should define:
+
+- private eval boundary expectations;
+- scorer type contracts;
+- normalized score records;
+- trace and report expectations;
+- redacted feedback rules;
+- decision-grade graduation criteria.
+
+The protected content itself belongs in the Private Eval Layer. User interaction, scheduling,
+permissions, task delivery, and report display belong in consumer applications.
diff --git a/docs/private-eval-layer.md b/docs/private-eval-layer.md
new file mode 100644
index 0000000..9df248b
--- /dev/null
+++ b/docs/private-eval-layer.md
@@ -0,0 +1,118 @@
+# Private Eval Layer
+
+The Private Eval Layer is the protected boundary for evaluation content that must not live in the
+public repository and must not be shown to agents.
+
+It is a boundary, not a required product or repository. It can be implemented as:
+
+- a gitignored local folder;
+- a private repository;
+- an encrypted bundle;
+- customer-scoped storage;
+- a scorer-only runtime mount;
+- an enterprise eval service.
+
+## What It Owns
+
+- private holdouts;
+- hidden labels;
+- answer keys;
+- protected scorer configs;
+- canaries;
+- hidden rubrics;
+- customer-specific checks;
+- private corpora;
+- redaction rules;
+- exploit smoke tests.
+
+## Core Rule
+
+```text
+agent-visible task packet != scorer-visible scoring bundle
+```
+
+The agent receives only the task instructions, allowed fixtures, and allowed tools.
+The scorer may receive protected labels, answer keys, rubrics, canaries, and hidden checks.
+Consumer applications may keep references, hashes, permissions, and redacted reports, but must not
+expose scorer-only content to agents.
+
+## Recommended Bundle Shape
+
+```text
+private_eval_bundle/
+  bundle.yaml
+  lineage/
+  public_stub/
+  fixtures_private/
+  labels_hidden/
+  scorer/
+  integrity/
+  feedback/
+  reports/
+```
+
+Recommended contents:
+
+| Path | Purpose |
+|---|---|
+| `bundle.yaml` | Manifest with identity, versions, hashes, and policies |
+| `lineage/` | Provenance notes, generator version, reviewer notes |
+| `public_stub/` | Safe public-style metadata or synthetic mirror |
+| `fixtures_private/` | Private fixtures and holdout seeds |
+| `labels_hidden/` | Hidden labels, expected values, answer keys |
+| `scorer/` | Protected scorer config and private scorer extensions |
+| `integrity/` | Canaries, leak checks, exploit smoke tests, checksum tree |
+| `feedback/` | Redaction rules and safe feedback templates |
+| `reports/` | Private reports, kept outside public repos |
+
+## Manifest Metadata
+
+`bundle.yaml` should include:
+
+```yaml
+bundle_id: customer_or_suite_bundle_id
+bundle_version: 2026-05-25.1
+task_family: DATA-01
+task_version: 0.2.0
+fixture_hash: sha256:...
+scorer_config_hash: sha256:...
+visibility_policy: scorer_only_private_labels
+redaction_policy: no_hidden_labels_or_thresholds
+canary_policy: hidden_tripwire_only
+created_at: 2026-05-25T00:00:00Z
+reviewer: reviewer_id_or_team
+provenance: synthetic_or_customer_approved_source
+checksum_tree: sha256:...
+```
+
+Do not put raw private data, hidden labels, answer keys, private rubrics, or customer secrets in
+public run records or public reports. Public records may include hashes and bundle identifiers.
+
+## Visibility Matrix
+
+| Item | Public repo | Agent | Runner | Scorer | Consumer application |
+|---|---|---|---|---|---|
+| Task schema | yes | optional | yes | yes | yes |
+| Public fixture | yes | yes | yes | yes | yes |
+| Private fixture | no | task-specific view only | path/reference | yes | reference/hash |
+| Hidden label | no | no | no | yes | no raw value |
+| Answer key | no | no | no | yes | no raw value |
+| Scorer code | public interface | no | yes | yes | yes |
+| Protected scorer config | no | no | reference/hash | yes | reference/hash |
+| Run record | sanitized | no | yes | yes | redacted |
+| Trace | no raw trace in public | no | yes | yes | redacted |
+| Final artifact | no by default | produced by agent | yes | yes | yes/redacted |
+| Score report | public summary only | no | yes | yes | redacted |
+| Redacted feedback | optional | yes | yes | yes | yes |
+| Canary strings | no | no | no | yes | no raw value |
+| Customer data | no | only if task allows | path/reference | yes | access-controlled |
+| Consumer UI config | no | no | optional | no | yes |
+| Agent setup config | optional sanitized | no | yes | optional | yes |
+
+## Canaries
+
+Canaries are integrity tripwires, not the main defense.
+
+Use canaries with fixed or replayed fixtures, sealed holdouts, exploit smoke tests, narrow runner
+visibility, and redacted feedback. A canary hit should trigger investigation; it should not be the
+only reason a benchmark is considered protected.
diff --git a/docs/reporting-and-feedback.md b/docs/reporting-and-feedback.md
new file mode 100644
index 0000000..3414941
--- /dev/null
+++ b/docs/reporting-and-feedback.md
@@ -0,0 +1,75 @@
+# Reporting And Feedback
+
+Benchmark feedback should be useful without revealing hidden evaluation content.
+
+Reports may include scores, deltas, component names, broad failure categories, policy violations,
+latency, cost, and tool-call counts. Reports must not reveal private fixtures, hidden labels,
+answer keys, private thresholds, canary strings, protected scorer configs, or exact hidden rubric
+text.
+
+## Feedback Boundary
+
+Use three levels:
+
+| Level | Audience | Content |
+|---|---|---|
+| Public summary | public repo, public docs | aggregate scores, public smoke examples, no private details |
+| Consumer report | consumer application, authorized user | redacted component failures and safe diagnostics |
+| Scorer-private report | scorer/runtime owners | hidden labels, answer keys, canary hits, private rubric detail |
+
+Do not give scorer-private reports to agents. Do not use feedback that lets an agent reconstruct
+hidden labels, thresholds, or private holdout contents.
+
+## Good Feedback
+
+- `numeric total is incorrect`
+- `unsupported claim found in report`
+- `policy classification failed`
+- `required artifact is missing`
+- `chart series does not match expected labels`
+- `exfiltration attempt detected`
+
+## Bad Feedback
+
+- `correct answer was refund_escalation_high_priority`
+- `hidden row HONEY_482 triggered`
+- `private threshold is 0.87`
+- `expected customer was ACME_PRIVATE_17`
+- `the hidden label for case_204 is urgent_security_refund`
+- `the canary string was copied into the final answer`
+
+## Report Fields
+
+Use normalized run and score fields where possible:
+
+```text
+score
+success
+pass@1
+pass^k
+latency_seconds
+cost_usd
+tool_calls
+model_calls
+mutation_score
+leak_flags
+policy_violations
+```
+
+Not every field is required in v0. Missing cost or latency data should be explicit, not silently
+invented.
+
+## Redaction Rules
+
+Redacted feedback should:
+
+- name the failing component;
+- describe the class of error;
+- avoid exact hidden values;
+- avoid private thresholds;
+- avoid hidden labels;
+- avoid canary strings;
+- avoid revealing which hidden row, customer, source, or fixture caused the failure.
+
+Scorer-private reports can be more specific, but they must stay outside public repositories and
+outside agent-visible context.
diff --git a/docs/scorer-types.md b/docs/scorer-types.md
new file mode 100644
index 0000000..b71f073
--- /dev/null
+++ b/docs/scorer-types.md
@@ -0,0 +1,93 @@
+# Scorer Type Contracts
+
+Agent Bench Lab scorers should be built from reusable scorer contracts instead of one-off
+task-family logic whenever possible.
+
+Every score result should still normalize into the public score record format:
+
+```text
+score
+success
+pass_threshold
+components
+policy_violations
+errors
+metadata
+```
+
+## First-Class Scorer Types
+
+| Type | Purpose | Typical inputs | Public vs private config | Example task families | Common failure modes |
+|---|---|---|---|---|---|
+| `artifact_exact` | Check required files and exact text/bytes/paths | artifacts, allowed file list | public file contract; private forbidden files or hidden probes | IF, DATA, DOC, SUP | missing files, extra files, wrong names |
+| `schema_contract` | Validate JSON/YAML/CSV shape and required fields | artifact, schema, field rules | public schema; private hidden required fields if needed | IF, DATA, API | invalid JSON, missing keys, wrong types |
+| `numeric_metric` | Compare numeric values with exact or tolerance rules | metrics artifact, expected values | public smoke values; private hidden expected metrics | DATA, OFFICE, API | rounding drift, wrong aggregation, boundary errors |
+| `state_diff` | Compare final environment state to expected state | DB state, API state, filesystem state | public state checks; private hidden state diffs | APP, WEB, SUP, API | partial updates, wrong row, stale state |
+| `claim_rubric` | Check factual claims, citations, and unsupported statements | report, citations, source pack | public rubric examples; private hidden labels/rubric | DOC, DATA, RSR, SUP | unsupported claims, missing citations, hallucinated facts |
+| `trace_policy` | Check tool-use and action policy from traces | trace events, tool calls, policy rules | public policy shape; private hidden policy exceptions | API, SEC, MCP-like tasks | forbidden tool use, unsafe order, missing audit trail |
+| `security_leak` | Detect leakage, exfiltration, or prompt-injection failure | artifacts, trace, canary config | public leak patterns; private canaries and tripwires | SEC, PRIV, SUP, API | leaked labels, copied hidden text, unsafe disclosure |
+| `cost_latency` | Score budgets and efficiency diagnostics | run metadata, trace timing, cost fields | public budgets; private customer-specific budgets | all families | too many calls, timeout, excessive cost |
+| `mutation_robustness` | Compare performance across variants and mutations | paired scores, mutation metadata | public mutation examples; private hidden variants | IF, DATA, DOC, SUP | brittle prompt, overfit public case, unstable output |
+
+## Contract Requirements
+
+Each scorer type should define:
+
+- purpose;
+- accepted inputs;
+- expected outputs;
+- public config fields;
+- private config fields;
+- component names;
+- policy violation semantics;
+- error semantics;
+- feedback redaction rules.
+
+Public config can describe the artifact shape and visible examples.
+Private config can hold protected expected values, hidden labels, thresholds, canaries, or private
+rubrics. Private config must not be committed to the public tree.
+
+## Primary And Secondary Oracles
+
+Every task family should have one mandatory primary oracle and optional secondary diagnostics.
+
+Examples:
+
+| Family | Primary oracle | Secondary diagnostics |
+|---|---|---|
+| IF-01 | `artifact_exact` + `schema_contract` | `mutation_robustness`, `cost_latency` |
+| DATA-01 | `numeric_metric` + `schema_contract` | `claim_rubric`, `security_leak` |
+| DOC-01 | `claim_rubric` with audited labels | `schema_contract`, `trace_policy` |
+| SUP-01 | `state_diff` or exact labels | `claim_rubric`, `security_leak` |
+| API-01 | `state_diff` + `trace_policy` | `cost_latency`, `security_leak` |
+
+## Decision-Grade Graduation Criteria
+
+A task family can be marked decision-grade only if:
+
+1. The primary oracle is deterministic or audited.
+2. Public synthetic examples exist.
+3. Private holdout strategy is documented.
+4. Mutation strategy is documented.
+5. Scorer output uses the normalized score record.
+6. Agent/scorer/runner/consumer visibility boundary is documented.
+7. Feedback redaction policy is documented.
+8. Exploit or leak smoke tests exist when relevant.
+9. Versioning and changelog policy exists.
+10. Repeatability is measured or explicitly planned.
+
+Public examples are not final evaluation. They prove the task shape, scorer shape, and local smoke
+loop. Decision-grade comparison needs private holdouts or protected bundles.
+
+## DATA-01 Preview
+
+DATA-01 should use these contracts:
+
+- `schema_contract` for `metrics.json`;
+- `numeric_metric` for expected values and tolerances;
+- `claim_rubric` for factual claims in `report.md`;
+- `schema_contract` and `artifact_exact` for `chart_spec.json`;
+- `security_leak` or canary checks for private bundles when relevant.
+
+That keeps DATA-01 from becoming a custom scorer shape and makes it a reusable pattern for
+spreadsheet, reporting, and customer-data-style task families.
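
To illustrate how per-contract results for a family like DATA-01 might fold into the normalized score record listed in `docs/scorer-types.md`, here is a minimal sketch. The record keys come from that list. Everything else is an assumption of this sketch and is not specified in the change: the `ComponentResult` type, the weighted-mean aggregation, the default threshold, and the rule that any policy violation or error fails the run.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ComponentResult:
    """Outcome of one scorer contract applied to one artifact (hypothetical type)."""

    name: str            # e.g. "metrics_schema"
    scorer_type: str     # e.g. "schema_contract"
    score: float         # expected range 0.0 to 1.0
    weight: float = 1.0  # weighting is an assumption, not part of the docs
    policy_violations: List[str] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)


def normalize(
    components: List[ComponentResult],
    pass_threshold: float = 1.0,
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Fold component results into one record with the documented keys."""
    total_weight = sum(c.weight for c in components)
    # Weighted mean; an empty or zero-weight component list scores 0.0.
    score = (
        sum(c.score * c.weight for c in components) / total_weight
        if total_weight
        else 0.0
    )
    violations = [v for c in components for v in c.policy_violations]
    errors = [e for c in components for e in c.errors]
    return {
        "score": score,
        # Assumed rule: violations or errors fail the run regardless of score.
        "success": score >= pass_threshold and not violations and not errors,
        "pass_threshold": pass_threshold,
        "components": {
            c.name: {"scorer_type": c.scorer_type, "score": c.score}
            for c in components
        },
        "policy_violations": violations,
        "errors": errors,
        "metadata": dict(metadata or {}),
    }
```

Because the record only carries component names, scorer types, and scores, it stays compatible with the redaction rules: hidden expected values and thresholds from private config never need to appear in it.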