heurema · t3chn · May 25, 2026
diff --git a/README.md b/README.md
@@ -49,6 +49,14 @@ Recommended boundary:
 
 Agent Bench Lab should not need to know which consumer application is using it.
 
+## Private eval and scorer contracts
+
+Agent Bench Lab should define how benchmarks work without storing protected evaluation content.
+
+The Private Eval Layer holds hidden labels, private holdouts, answer keys, protected scorer configs, canaries, customer-specific checks, and redaction rules outside the public repo. Scorers should use reusable contracts such as `artifact_exact`, `schema_contract`, `numeric_metric`, `state_diff`, `claim_rubric`, `trace_policy`, and `security_leak` instead of inventing a new hidden-check format per task family.
+
+See [Private Eval Layer](docs/private-eval-layer.md), [Scorer type contracts](docs/scorer-types.md), and [Reporting and feedback](docs/reporting-and-feedback.md).
+
 ## Current status
 
 This repository is a **v0 public starter**. It contains:
@@ -210,6 +218,9 @@ agent-bench-lab/
 
 - [Documentation index](docs/README.md)
 - [Canonical scope and consumer boundary](docs/canonical-scope-and-consumer-boundary.md)
+- [Private Eval Layer](docs/private-eval-layer.md)
+- [Scorer type contracts](docs/scorer-types.md)
+- [Reporting and feedback](docs/reporting-and-feedback.md)
 - [Task authoring](docs/05-task-authoring.md)
 - [Public/private split](docs/07-public-private-split.md)
 - [Run records](docs/09-run-records.md)

diff --git a/docs/05-task-authoring.md b/docs/05-task-authoring.md
@@ -57,3 +57,35 @@ A valid task can be any workflow with a checkable artifact or state change.
 | Customer private check | customer-owned fixtures and rubric | customer-specific artifact | protected scorer bundle |
 
 Avoid writing tasks whose only oracle is "the answer looks good".
+
+## Scorer Contracts
+
+Prefer reusable scorer contracts over family-specific hidden-check formats.
+
+Common contracts include:
+
+- `artifact_exact` for required files, forbidden files, and exact artifact checks;
+- `schema_contract` for JSON/YAML/CSV structure;
+- `numeric_metric` for exact values and tolerances;
+- `state_diff` for database, API, filesystem, or app state;
+- `claim_rubric` for factual claims and citations;
+- `trace_policy` for tool-use and action policy;
+- `security_leak` for canaries, leakage, and exfiltration checks.
+
+See [Scorer type contracts](scorer-types.md).
+
+## Decision-Grade Criteria
+
+A task family is not decision-grade just because public examples pass.
+
+Before marking a family decision-grade, require:
+
+- deterministic or audited primary oracle;
+- public synthetic examples;
+- private holdout strategy;
+- mutation strategy;
+- normalized score output;
+- documented visibility boundary;
+- redacted feedback policy;
+- exploit or leak smoke test when relevant;
+- versioning and changelog policy.
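
Review note (illustrative, not part of the patch): a minimal sketch of how a `numeric_metric` scorer with tolerances and a normalized score output might look. The `NumericCheck` shape, the `abs_tol` field, and the returned record keys are assumptions made for illustration; the actual contract is defined in `docs/scorer-types.md`, which is not shown in this diff.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericCheck:
    """One hidden numeric expectation (hypothetical shape, held scorer-side)."""
    name: str
    expected: float
    abs_tol: float = 0.0  # 0.0 means an exact match is required


def score_numeric_metric(checks, submitted):
    """Return a normalized record: score in [0, 1] plus redacted failures."""
    failures = []
    for check in checks:
        value = submitted.get(check.name)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            failures.append(f"{check.name}: value is missing or not numeric")
        elif not math.isclose(value, check.expected, rel_tol=0.0, abs_tol=check.abs_tol):
            # Name the failing component, never the expected value or tolerance.
            failures.append(f"{check.name}: numeric value is incorrect")
    passed = len(checks) - len(failures)
    return {
        "contract": "numeric_metric",
        "score": passed / len(checks) if checks else 0.0,
        "success": bool(checks) and not failures,
        "failures": failures,
    }
```

The failure strings carry only the component name and the error class, so the same record could be passed to a consumer report without exposing the expected value or the tolerance.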
diff --git a/docs/07-public-private-split.md b/docs/07-public-private-split.md
@@ -63,6 +63,8 @@ Agent Bench Lab should not store private eval secrets.
 Consumer applications should not expose scorer-only content to agents.
 The Private Eval Layer is a boundary, not a required implementation; it can be a gitignored local folder, private repo, encrypted bundle, customer-scoped storage, or enterprise eval service.
 
+See [Private Eval Layer](private-eval-layer.md) for the recommended private bundle shape and full visibility matrix.
+
 ## Customer-specific private holdouts
 
 Customer-specific benchmark data must be treated as protected evaluation content.

diff --git a/docs/10-comparing-setups.md b/docs/10-comparing-setups.md
@@ -66,3 +66,22 @@ different setup
 ```
 
 Consumer applications should consume Agent Bench Lab task families and scorer outputs rather than defining parallel benchmark logic.
+
+## Feedback Safety
+
+Comparison reports should identify regressions and failure categories without revealing hidden
+labels, answer keys, canary strings, private thresholds, or exact hidden rubric text.
+
+Prefer redacted diagnostics such as:
+
+- `numeric total is incorrect`;
+- `unsupported claim found in report`;
+- `policy classification failed`.
+
+Avoid feedback such as:
+
+- `the correct hidden label is urgent_refund`;
+- `hidden row HONEY_482 triggered`;
+- `private threshold is 0.87`.
+
+See [Reporting and feedback](reporting-and-feedback.md).
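
Review note (illustrative, not part of the patch): the redacted diagnostics listed in this file could be produced with an allowlist keyed by failure category, sketched below. The category names, the `component` key, and the fallback message are hypothetical; only the three diagnostic strings come from the text above.

```python
# Fixed, pre-approved messages keyed by a (hypothetical) failure category.
SAFE_MESSAGES = {
    "numeric_mismatch": "numeric total is incorrect",
    "unsupported_claim": "unsupported claim found in report",
    "policy_misclassification": "policy classification failed",
}
FALLBACK_MESSAGE = "check failed"  # assumed generic message for unknown categories


def redact_failure(private_failure):
    """Turn a scorer-private failure dict into agent-safe feedback.

    Only the component name and a fixed message chosen by category are
    copied; every other key (labels, rows, thresholds) is dropped.
    """
    category = private_failure.get("category")
    return {
        "component": private_failure.get("component", "unknown"),
        "message": SAFE_MESSAGES.get(category, FALLBACK_MESSAGE),
    }
```

Copying from an allowlist, rather than stripping known-sensitive keys, means a new private field added to the scorer output stays hidden by default.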
diff --git a/docs/README.md b/docs/README.md
@@ -5,15 +5,18 @@ Start here:
 1. [Concept](01-concept.md)
 2. [Architecture](02-architecture.md)
 3. [Canonical scope and consumer boundary](canonical-scope-and-consumer-boundary.md)
-4. [Evaluation protocol](03-evaluation-protocol.md)
-5. [Anti-overfitting controls](04-anti-overfitting.md)
-6. [Task authoring](05-task-authoring.md)
-7. [Metrics](06-metrics.md)
-8. [Public/private split](07-public-private-split.md)
-9. [Codex prompt](08-codex-prompt.md)
-10. [Run records](09-run-records.md)
-11. [Comparing setups](10-comparing-setups.md)
-12. [IF-01 decision-grade pattern](11-if01-decision-grade.md)
-13. [v0 roadmap](roadmap-v0.md)
-14. [Public release checklist](public-release-checklist.md)
-15. [Decision log template](decision-log-template.md)
+4. [Private Eval Layer](private-eval-layer.md)
+5. [Scorer type contracts](scorer-types.md)
+6. [Reporting and feedback](reporting-and-feedback.md)
+7. [Evaluation protocol](03-evaluation-protocol.md)
+8. [Anti-overfitting controls](04-anti-overfitting.md)
+9. [Task authoring](05-task-authoring.md)
+10. [Metrics](06-metrics.md)
+11. [Public/private split](07-public-private-split.md)
+12. [Codex prompt](08-codex-prompt.md)
+13. [Run records](09-run-records.md)
+14. [Comparing setups](10-comparing-setups.md)
+15. [IF-01 decision-grade pattern](11-if01-decision-grade.md)
+16. [v0 roadmap](roadmap-v0.md)
+17. [Public release checklist](public-release-checklist.md)
+18. [Decision log template](decision-log-template.md)
diff --git a/docs/canonical-scope-and-consumer-boundary.md b/docs/canonical-scope-and-consumer-boundary.md
@@ -56,3 +56,17 @@ They should not be committed to public repositories, shown to the agent, or mixe
 Consumer applications can keep references, hashes, bundle versions, permissions, and result summaries. Agent Bench Lab should remain the source of truth for task-family format, scorer semantics, run records, trace expectations, comparison rules, and public/private hygiene.
 
 Agent Bench Lab should not know which consumer application uses it.
+
+## Standard Layer Contracts
+
+Agent Bench Lab should define:
+
+- private eval boundary expectations;
+- scorer type contracts;
+- normalized score records;
+- trace and report expectations;
+- redacted feedback rules;
+- decision-grade graduation criteria.
+
+The protected content itself belongs in the Private Eval Layer. User interaction, scheduling,
+permissions, task delivery, and report display belong in consumer applications.
diff --git a/docs/private-eval-layer.md b/docs/private-eval-layer.md
@@ -0,0 +1,118 @@
+# Private Eval Layer
+
+The Private Eval Layer is the protected boundary for evaluation content that must not live in the
+public repository and must not be shown to agents.
+
+It is a boundary, not a required product or repository. It can be implemented as:
+
+- a gitignored local folder;
+- a private repository;
+- an encrypted bundle;
+- customer-scoped storage;
+- a scorer-only runtime mount;
+- an enterprise eval service.
+
+## What It Owns
+
+- private holdouts;
+- hidden labels;
+- answer keys;
+- protected scorer configs;
+- canaries;
+- hidden rubrics;
+- customer-specific checks;
+- private corpora;
+- redaction rules;
+- exploit smoke tests.
+
+## Core Rule
+
+```text
+agent-visible task packet != scorer-visible scoring bundle
+```
+
+The agent receives only the task instructions, allowed fixtures, and allowed tools.
+The scorer may receive protected labels, answer keys, rubrics, canaries, and hidden checks.
+Consumer applications may keep references, hashes, permissions, and redacted reports, but must not
+expose scorer-only content to agents.
+
+## Recommended Bundle Shape
+
+```text
+private_eval_bundle/
+  bundle.yaml
+  lineage/
+  public_stub/
+  fixtures_private/
+  labels_hidden/
+  scorer/
+  integrity/
+  feedback/
+  reports/
+```
+
+Recommended contents:
+
+| Path | Purpose |
+|---|---|
+| `bundle.yaml` | Manifest with identity, versions, hashes, and policies |
+| `lineage/` | Provenance notes, generator version, reviewer notes |
+| `public_stub/` | Safe public-style metadata or synthetic mirror |
+| `fixtures_private/` | Private fixtures and holdout seeds |
+| `labels_hidden/` | Hidden labels, expected values, answer keys |
+| `scorer/` | Protected scorer config and private scorer extensions |
+| `integrity/` | Canaries, leak checks, exploit smoke tests, checksum tree |
+| `feedback/` | Redaction rules and safe feedback templates |
+| `reports/` | Private reports, kept outside public repos |
+
+## Manifest Metadata
+
+`bundle.yaml` should include:
+
+```yaml
+bundle_id: customer_or_suite_bundle_id
+bundle_version: 2026-05-25.1
+task_family: DATA-01
+task_version: 0.2.0
+fixture_hash: sha256:...
+scorer_config_hash: sha256:...
+visibility_policy: scorer_only_private_labels
+redaction_policy: no_hidden_labels_or_thresholds
+canary_policy: hidden_tripwire_only
+created_at: 2026-05-25T00:00:00Z
+reviewer: reviewer_id_or_team
+provenance: synthetic_or_customer_approved_source
+checksum_tree: sha256:...
+```
+
+Do not put raw private data, hidden labels, answer keys, private rubrics, or customer secrets in
+public run records or public reports. Public records may include hashes and bundle identifiers.
+
+## Visibility Matrix
+
+| Item | Public repo | Agent | Runner | Scorer | Consumer application |
+|---|---|---|---|---|---|
+| Task schema | yes | optional | yes | yes | yes |
+| Public fixture | yes | yes | yes | yes | yes |
+| Private fixture | no | task-specific view only | path/reference | yes | reference/hash |
+| Hidden label | no | no | no | yes | no raw value |
+| Answer key | no | no | no | yes | no raw value |
+| Scorer code | public interface | no | yes | yes | yes |
+| Protected scorer config | no | no | reference/hash | yes | reference/hash |
+| Run record | sanitized | no | yes | yes | redacted |
+| Trace | no raw trace in public | no | yes | yes | redacted |
+| Final artifact | no by default | produced by agent | yes | yes | yes/redacted |
+| Score report | public summary only | no | yes | yes | redacted |
+| Redacted feedback | optional | yes | yes | yes | yes |
+| Canary strings | no | no | no | yes | no raw value |
+| Customer data | no | only if task allows | path/reference | yes | access-controlled |
+| Consumer UI config | no | no | optional | no | yes |
+| Agent setup config | optional sanitized | no | yes | optional | yes |
+
+## Canaries
+
+Canaries are integrity tripwires, not the main defense.
+
+Use canaries alongside fixed or replayed fixtures, sealed holdouts, exploit smoke tests, narrow
+runner visibility, and redacted feedback. A canary hit should trigger an investigation, but the
+presence of canaries should not be the only reason a benchmark is considered protected.
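
Review note (illustrative, not part of the patch): the Core Rule and the Visibility Matrix in this file could be enforced with a deny-by-default lookup, sketched below. The item-kind keys and the two-audience tuple are assumptions; only rows with an unconditional "yes" or "no" for the agent and scorer are encoded, and conditional rows such as private fixtures and customer data are deliberately left out.

```python
# (agent_visible, scorer_visible) for a simplified subset of the matrix rows.
VISIBILITY = {
    "public_fixture": (True, True),
    "redacted_feedback": (True, True),
    "hidden_label": (False, True),
    "answer_key": (False, True),
    "canary_strings": (False, True),
    "protected_scorer_config": (False, True),
    "consumer_ui_config": (False, False),
}
AUDIENCE_INDEX = {"agent": 0, "scorer": 1}


def view_for(audience, items):
    """Return only the items the given audience is allowed to see.

    Kinds that are not listed in VISIBILITY are hidden from everyone,
    so an unclassified item never reaches the agent by accident.
    """
    index = AUDIENCE_INDEX[audience]
    return {
        kind: payload
        for kind, payload in items.items()
        if VISIBILITY.get(kind, (False, False))[index]
    }
```

Calling `view_for("agent", items)` and `view_for("scorer", items)` on the same bundle yields two different packets, which is the inequality the Core Rule states.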
diff --git a/docs/reporting-and-feedback.md b/docs/reporting-and-feedback.md
@@ -0,0 +1,75 @@
+# Reporting and feedback
+
+Benchmark feedback should be useful without revealing hidden evaluation content.
+
+Reports may include scores, deltas, component names, broad failure categories, policy violations,
+latency, cost, and tool-call counts. Reports must not reveal private fixtures, hidden labels,
+answer keys, private thresholds, canary strings, protected scorer configs, or exact hidden rubric
+text.
+
+## Feedback Boundary
+
+Use three levels:
+
+| Level | Audience | Content |
+|---|---|---|
+| Public summary | public repo, public docs | aggregate scores, public smoke examples, no private details |
+| Consumer report | consumer application, authorized user | redacted component failures and safe diagnostics |
+| Scorer-private report | scorer/runtime owners | hidden labels, answer keys, canary hits, private rubric detail |
+
+Do not give scorer-private reports to agents. Do not emit feedback that lets an agent reconstruct
+hidden labels, thresholds, or private holdout contents.
+
+## Good Feedback
+
+- `numeric total is incorrect`
+- `unsupported claim found in report`
+- `policy classification failed`
+- `required artifact is missing`
+- `chart series does not match expected labels`
+- `exfiltration attempt detected`
+
+## Bad Feedback
+
+- `correct answer was refund_escalation_high_priority`
+- `hidden row HONEY_482 triggered`
+- `private threshold is 0.87`
+- `expected customer was ACME_PRIVATE_17`
+- `the hidden label for case_204 is urgent_security_refund`
+- `the canary string was copied into the final answer`
+
+## Report Fields
+
+Use normalized run and score fields where possible:
+
+```text
+score
+success
+pass@1
+pass^k
+latency_seconds
+cost_usd
+tool_calls
+model_calls
+mutation_score
+leak_flags
+policy_violations
+```
+
+Not every field is required in v0. Missing cost or latency data should be marked explicitly as
+missing, not silently invented.
+
+## Redaction Rules
+
+Redacted feedback should:
+
+- name the failing component;
+- describe the class of error;
+- avoid exact hidden values;
+- avoid private thresholds;
+- avoid hidden labels;
+- avoid canary strings;
+- avoid revealing which hidden row, customer, source, or fixture caused the failure.
+
+Scorer-private reports can be more specific, but they must stay outside public repositories and
+outside agent-visible context.
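
Review note (illustrative, not part of the patch): one way to keep missing cost or latency explicit in a normalized report record is sketched below. The `RunReport` class and the `missing_fields` key are hypothetical, and only a subset of the listed report fields is modelled.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RunReport:
    score: float
    success: bool
    tool_calls: int
    model_calls: int
    latency_seconds: Optional[float] = None  # None means "not measured"
    cost_usd: Optional[float] = None         # None means "not measured"
    leak_flags: list = field(default_factory=list)
    policy_violations: list = field(default_factory=list)

    def to_json(self):
        record = asdict(self)
        # List absent measurements by name instead of defaulting them to 0.
        record["missing_fields"] = sorted(k for k, v in record.items() if v is None)
        return json.dumps(record, sort_keys=True)
```

Serializing an unmeasured value as `null` and naming it in `missing_fields` lets a comparison report distinguish "cost was zero" from "cost was not recorded".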