The verified last mile from Assay evidence to reviewable CI artifacts.
Assay Harness runs its recipes against released Assay capabilities and produces verify-before-diff comparisons of baseline vs candidate evidence. New denials, hash mismatches, or policy changes surface as structured regression output — reviewable by humans and consumable by CI.
Version: 0.8.0 | Status: active development
A CI/review layer over canonical Assay artifacts: a recipe, gate, and report projection. It composes released Assay capabilities; it does not define artifact semantics — those live in Assay.
- not an agent runtime or execution harness
- not an eval runner
- not an observability platform or dashboard
- not a second policy or trust model beside Assay
The assay / assay-action / Assay-Harness split is compiler / button /
playbook: assay compiles evidence, assay-action is the one-click CI gate,
and Assay-Harness is the consumer that reviews claims against the evidence.
Adjacent product boundary: Assay-Harness is the open CI playbook for released Assay artifacts. Plimsoll is the release-review product for capability-surface deltas, human approval, and audit trails. Harness can prove and gate the underlying shapes; Plimsoll decides whether a release ships.
Assay-Harness can read Assay-Runner measured-run archives (.tar.gz carrying assay.runner.archive_manifest.v0) and the precomputed cross-runtime diff JSON Runner produces. This is opt-in alongside the existing NDJSON evidence path; NDJSON callers are unaffected.
Ten Runner-aware verbs are available:
| Verb | What it does | CI exit on regression |
|---|---|---|
assay-harness verify-runner <archive.tar.gz> |
Validate one archive's manifest, per-file SHA-256, and honest-health | — (single archive) |
assay-harness compare --baseline <a> --candidate <b> (with .tar.gz inputs) |
Tier-1 validation of both archives | n/a (Tier 1 only) |
assay-harness runner compare --baseline <a> --candidate <b> |
Strict Tier-2A capability-surface diff over two Tier-1-clean archives | 6 |
assay-harness runner cross-runtime report --diff <diff.json> |
Reviewer projection of cross_runtime_diff.v0 (informational) |
0 (rendered only) |
assay-harness runner cross-runtime gate --diff <diff.json> |
CI-blocking gate on the same cross-runtime signal | 6 |
assay-harness runner coverage report --annotation <annotation.json> |
Reviewer projection of an assay.coverage_aware_drift.annotation.v0 sidecar — per-dimension claim cells (strength × basis) and blocked claims |
0 (rendered only) |
assay-harness runner coverage gate --annotation <annotation.json> --assert-claim TYPE:DIM[,...] |
CI-blocking gate that permits an asserted coverage claim only when the annotation supports it (--format text|json|sarif) |
6 |
assay-harness runner coverage fleet --dir <dir> |
Fold many annotation sidecars into one fleet summary: per-dimension strength distribution + the fleet floor (strongest positive supportable across every run) | 0 (rendered only) |
assay-harness runner claims report --claims <claims.json> --annotation <annotation.json> |
Per asserted claim: does the observed evidence support it at the required strength? Outcomes: supported / degraded / blocked / not_evaluable |
0 (rendered only) |
assay-harness runner claims gate --claims <claims.json> --annotation <annotation.json> |
CI-blocking gate: passes only when every claim is supported (or degraded with --allow-degraded) |
6 |
runner claims reads a claim-assertions document (assay.harness.claim_assertions.v0)
plus a coverage annotation, and asks, per claim, whether the independently
observed evidence supports it at the required strength — and if not, what the
evidence does support. It speaks the claim-support vocabulary directly: claim_kind
∈ {positive, exhaustive, bounded_negative}; observed-effect dimension names
(filesystem_paths_touched, network_endpoints, …); claim_strength ∈
{strong,partial,weak,absent}; claim_basis ∈
{measured,reported,derived,inferred}.
Outcomes: supported (evidence meets the required strength × basis),
degraded (the effect is observed but weaker than required), blocked
(evidence contradicts the claim or coverage cannot justify it), not_evaluable
(no observed evidence for the dimension; fail-safe — the gate blocks). A claim's
value/effect_class are advisory only — support is evaluated at dimension
granularity and these are not independently verified. Observed support is the
ceiling; attestation is not consulted. Consumer-only: no new Runner capture, no
attestation adapter, no state.
Full reference, the four outcomes, the claim-support vocabulary, runnable examples,
and a CI-gate snippet: docs/CLAIM_SUPPORT.md. Worked
examples: examples/claims-clean/ (all supported) and
examples/claims-eval-honesty/ (over-claims
blocked).
runner sandbox consumes the evidence-bundle events that assay sandbox --bundle
emits when a coding agent runs under the sandbox (assay.sandbox.summary / .fs
/ .exec / .degraded). It reports the observed filesystem operations, executed
programs, and containment degradations, and gates on them:
assay-harness runner sandbox report --events events.json
assay-harness runner sandbox gate --events events.json [--allow-degraded]The gate exits 6 when a containment degradation is present (enforcement was
weakened while the run continued) unless --allow-degraded is set; observed
fs/exec effects alone never fail the gate, they are the record. Consumer-only: no
capture, no new semantics. Worked example:
examples/coding-agent-sandbox/. Honest
boundary: this reports what the sandbox observed, not intent, and Landlock is not
VM-level isolation.
runner coverage consumes the coverage annotation the comparator emits with
--coverage-annotation-out. It decides each asserted claim from
claim_strength × claim_basis, never inventing or upgrading a claim:
positive:DIM— permitted only on ameasured_DIM_driftcell with strengthstrong/partial.exhaustive:DIM— permitted only whenexhaustive_DIM_equalityis allowed (partial); a coverage-degradedweakcell is not.bounded_negative:DIM— permitted only on a measured dimension that is not inblocked_claims; not evaluable on a reported/unknown dimension.
The gate exits 6 on any blocked asserted claim and can emit SARIF 2.1.0
(--format sarif) so blocked claims surface in code-scanning UIs. Harness
composes this published Assay shape; it defines no claim semantics of its own.
What a Tier-2A regression looks like when an agent gains a new MCP tool between baseline and candidate:
# Runner Capability-Surface Diff (Tier 2A)
**Status:** RUNNER CAPABILITY REGRESSION
**Summary:** RUNNER CAPABILITY REGRESSION: filesystem_paths_added:1, mcp_tools_added:1, policy_allow_decisions_added:1
## MCP Tools
Added:
- `write_file`
Exit 6.
A reproducible end-to-end walkthrough with all four verbs, using locally-generated synthetic fixtures (no live eBPF, no delegated host), lives in docs/DEMO_RUNNER.md. The full per-verb exit-code contract lives in docs/contracts/EXIT_CODES.md.
Runner archives and cross-runtime diffs are defined and produced by the Runner side of Rul1an/assay — an internal measured-run subsystem of Assay, not a standalone product. Harness consumes and gates; it does not measure and does not compute cross-runtime diffs. Producing a cross-runtime diff from a raw archive pair (Tier 3B) remains deferred as a future demo recipe.
Assay (crates/assay-registry) emits an assay.supply_chain_conformance.v0
carrier when it verifies pack provenance: per-dimension integrity, provenance
(including the Sigstore-keyless dimensions), and pinning statuses, plus an
aggregate policy_result. The Harness consumes that carrier as a CI gate.
# Gate CI on a supply-chain conformance carrier and write reviewer artifacts
npx tsx harness/src/cli.ts carrier supply-chain \
--carrier path/to/supply-chain-conformance.json \
--out-dir results/supply-chain-conformanceIt validates the frozen v0 shape, gates on the producer-owned policy_result
(pass is clean; fail and incomplete are not, and incomplete is never read
as clean), and projects Markdown, JUnit, and SARIF. An unknown status or unknown
policy_result is rejected as a contract error rather than passing silently.
Consumer-not-owner: the Harness surfaces the carrier's own producer-computed
verdict and per-dimension statuses. It does not approve, certify, judge
compliance, or assert provider trust, runtime truth, or supply-chain safety;
policy-aware review is a separate step. See
harness/fixtures/supply-chain-conformance/ and
demo/run-supply-chain-conformance-gate.sh.
Assay (crates/assay-core/src/render_safety) emits an
assay.render_safety_conformance.v0 carrier: per render sink it records raw
secret / PII / terminal-control leak counts, redaction-before-truncation, and
benign-preserved. The Harness gates CI on those facts.
npx tsx harness/src/cli.ts carrier render-safety \
--carrier path/to/render-safety-conformance.json \
--out-dir results/render-safety-conformanceA sink is clean iff it leaked no raw secret/PII/terminal-control, preserved benign output, and honoured redaction-before-truncation; the report is clean iff every sink is clean. This covers render output safety per sink only, not secret lifecycle, vaulting, or rotation.
Assay (crates/assay-mcp-server/src/token_passthrough.rs) emits an
assay.token_passthrough_conformance.v0 carrier: value-free, it records that a
consumed inbound authentication value is not re-emitted on a checked outbound
channel (transport header, JSON body, spawned env). The Harness gates CI on those
per-channel facts.
npx tsx harness/src/cli.ts carrier token-passthrough \
--carrier path/to/token-passthrough-conformance.json \
--out-dir results/token-passthrough-conformanceThe carrier is clean iff every checked outbound channel reports leak_count == 0
and is not pass == false; not_applicable channels are out of scope. This covers
the confused-deputy boundary only, not arbitrary payload scrubbing, provider token
grants, or token lifecycle.
Assay (crates/assay-cli/src/enforcement_health_v1.rs) emits an
assay.enforcement_health.v1 carrier for the Landlock TCP-connect port-allowlist
domain: whether enforcement was active (the ruleset was applied) or failed
(requested but not installed), plus an optional real-block probe. The Harness gates
CI on that producer-reported status.
npx tsx harness/src/cli.ts carrier enforcement-health \
--carrier path/to/enforcement-health.json \
--out-dir results/enforcement-healthstatus=active is clean; status=failed is not clean. A real-block probe upgrades
the evidence to "a denied connect was really blocked before the listener was
reached" and is surfaced, but is not required for a clean gate.
This is the carrier-local honest-state gate (was enforcement requested, and did the
ruleset apply or fail), distinct from the enforcement-truth review (policy-aware
approval over the enforcement outcome), which is a separate step. Consuming this
carrier is a deliberate reversal of the earlier "the Harness does not consume
enforcement_health" position, scoped to the v1 (Landlock) domain; see
docs/RUNNER_SCHEMA_CONSUMPTION.md. The connect4/eBPF v0 carrier is a different
shape and is not consumed here.
Assay (assay mcp inventory, crates/assay-core/src/discovery/inventory_carrier.rs)
emits an assay.mcp_server_inventory.v0 carrier: a coverage-honest projection of
discovered MCP servers (command/args hashed, credentials flagged by name only). The
Harness projects it as reviewer context.
npx tsx harness/src/cli.ts carrier inventory \
--carrier path/to/mcp-server-inventory.json \
--out-dir results/mcp-server-inventoryThis is descriptive / non-gating: a valid inventory exits 0 regardless of
contents; only a malformed / wrong-schema / unknown-coverage-state carrier is a
contract error (exit 3). Coverage honesty is surfaced, never decided: only a
complete scan supports an absence claim, so the projection states which sources
are complete and whether the inventory can support "nothing else is there". Drift
and approval over the inventory (e.g. an unexpected server) are a separate review
step.
The other descriptive Tier-2B carriers, tool_decision_surface.v0 and
mcp_manifest_observed.v0, are not yet adapted (no clean single-carrier golden to
pin against); they are a tracked follow-up, not hidden debt.
carrier check dispatches any conformance carrier by its schema id to the
registered adapter and reports whether the Harness recognizes the contract and the
carrier matches its frozen shape. It is the schema / shape dimension only, distinct
from the per-carrier gate verbs: a well-formed carrier that reports a leak is
contract-valid here (exit 0) and fails its own gate verb (exit 6) separately.
npx tsx harness/src/cli.ts carrier check --carrier path/to/conformance.jsonAn unknown or unregistered schema id, a missing/non-string schema, malformed
JSON, or a recognized schema whose shape has drifted is a contract error (exit 3):
the Harness never silently accepts a carrier it does not recognize. A golden-drift
test pins every registered schema's shape, so a future producer change is caught
rather than mis-parsed.
The suite compatibility matrix (suite.compatibility.v0, checked in at
harness/suite-compatibility.json) is a versioned suite-contract artifact: it records
which Assay version emits each carrier, which Harness version consumes/projects/gates
it, which Plimsoll version reviews it, the support mode, the public/private backing, and
the proof state. It is a VSA-shaped compatibility summary, not a SLSA VSA. The Harness
validates this evidence and projects it; it does not approve carrier semantics or
organization policy.
Each row splits the proof in two, which is the load-bearing honesty: harness_consumption
(the Harness can validate/gate/project the carrier shape, proven by this repo's tests over
real producer golden bytes) and end_to_end (the released Assay binary emitted the carrier
and the Harness consumed it in a hosted run). A declared / pending end-to-end state means
the first is proven but the second is not yet, so it renders as pending, never as approved.
# Validate the matrix shape + digest only
npx tsx harness/src/cli.ts suite check --matrix harness/suite-compatibility.json
# Also check drift vs the live carrier registry (CI)
npx tsx harness/src/cli.ts suite check --matrix harness/suite-compatibility.json --against-registry
# Project a reviewer-facing Markdown (or JSON) view
npx tsx harness/src/cli.ts suite matrix --matrix harness/suite-compatibility.jsonA malformed matrix, an unknown enum state, a digest mismatch, a proven end-to-end row
without its hosted_run + artifact_digest, or (in registry mode) drift between the
matrix and the registry is a contract error (exit 3). An unknown state is never clean.
An Evidence Pack (suite.evidence_pack.v0) is a deterministic, digest-bound bundle of a
proven carrier recipe. It binds raw Assay carrier bytes, the Harness Markdown projection,
the suite compatibility matrix, and the recipe provenance (suite.recipe_provenance.v0) by
digest. It creates no new evidence, approves no policy, and does not replace Plimsoll — it is
VSA-shaped, not a SLSA VSA. v0 carries the proven inventory recipe only.
# Build a pack from a proven recipe's outputs
npx tsx harness/src/cli.ts evidence-pack create \
--carrier carriers/assay.mcp_server_inventory.v0.json \
--suite-matrix harness/suite-compatibility.json \
--provenance recipe.provenance.json \
--markdown inventory.review.md \
--out evidence-pack/
# Verify a pack (strict)
npx tsx harness/src/cli.ts evidence-pack verify evidence-pack/verify separates source-of-truth (carrier, matrix, provenance) from lossy projections
(every projection must resolve to a source digest), enforces path safety (no .. / absolute
/ symlink / duplicate / unlisted / escaping paths), and holds the coherence invariant: the
carrier bytes, the matrix's proven row, and the provenance must all agree on the same artifact
and the same proof — covering the binary, command, fixture, runner, and hosting context, not
just the artifact digest. A pack cannot be internally consistent by digest while lying about where the
evidence came from. The manifest digest is deterministic over the evidence (volatile metadata
excluded), so identical evidence yields the same pack identity.
baseline evidence (main) candidate evidence (PR branch)
│ │
└──────────┬───────────────────┘
│
assay-harness compare
│
┌─────────┴──────────┐
│ Regression Report │
│ ─ new denials? │
│ ─ hash mismatches? │
│ ─ counter deltas? │
└─────────────────────┘
│
exit 0 (ok) or exit 6 (regression)
cd harness && npm install
# Compare two evidence files
npx tsx src/cli.ts compare \
--baseline ../fixtures/valid.assay.ndjson \
--candidate ../fixtures/failure.assay.ndjson
# Output:
# # Evidence Comparison Report
# **Status:** REGRESSION DETECTED
# **Summary:** REGRESSION: 1 new denial(s), ...| Signal | Meaning | Exit code |
|---|---|---|
| New denial | A tool that was allowed is now denied | 6 |
| New event type | Unknown evidence category appeared | 6 |
| Hash mismatch | Same event at same seq has different content | 6 |
| Increased denied count | More actions blocked than baseline | 6 |
Removed denials, approval changes, and counter decreases are reported as changes but do not fail the gate.
# Store current evidence as baseline
npx tsx src/cli.ts baseline update --from ../fixtures/valid.assay.ndjson
# Inspect baseline
npx tsx src/cli.ts baseline show
# → [baseline] events: 4
# → - example.placeholder.harness.approval-interruption
# → - example.placeholder.harness.policy-decision
# → - ...The policy engine is deterministic: same tool name, same decision, every time. No transcript, no model reasoning, no volatile state.
npx tsx src/cli.ts policy --tool read_file
# → { "decision": "allow", ... }
npx tsx src/cli.ts policy --tool write_file
# → { "decision": "require_approval", ... }
npx tsx src/cli.ts policy --tool network_egress
# → { "decision": "deny", ... } (exit 1)python3 mapper/map_to_assay.py \
fixtures/valid.harness.json \
--output /tmp/valid.assay.ndjson \
--import-time 2026-04-16T12:00:00Z \
--overwrite
# Malformed fixtures are rejected with canonical codes
python3 mapper/map_to_assay.py \
fixtures/malformed.harness.json \
--output /tmp/malformed.ndjson \
--import-time 2026-04-16T12:10:00Z \
--overwrite
# → [REJECT_RAW_STATE] rejected key 'raw_run_state'npx tsx src/cli.ts verify ../fixtures/valid.assay.ndjson --category all
# Checks: envelope fields, SHA-256 hashes, type prefixes, rejected keys# JUnit XML (denied actions become <failure> elements)
python3 ci/emit_junit.py fixtures/valid.assay.ndjson --output results/junit.xml
# SARIF 2.1.0 (uploads to GitHub Security tab)
python3 ci/emit_sarif.py fixtures/valid.assay.ndjson --output results/sarif.jsonpython3 ci/emit_otel.py fixtures/valid.assay.ndjson --output results/otel-export.json
# Produces OTLP-shaped JSON for integration with Jaeger, Grafana Tempo, etc.See docs/contracts/OTEL_EXPORT.md for mapping rules and stability caveats.
Promptfoo can produce AI eval outputs in CI; Assay compiles selected outcomes into evidence receipts and Trust Basis artifacts; Harness gates/reports the resulting Trust Basis diff.
ASSAY_BIN=/path/to/assay \
demo/run-promptfoo-receipt-pipeline.sh \
--case nonregression \
--out-dir /tmp/assay-promptfoo-receipt-pipeline \
--overwriteSee docs/PROMPTFOO_RECEIPT_PIPELINE.md for the artifact chain and boundary rules.
OpenFeature can surface runtime flag evaluation details; Assay compiles one
bounded boolean EvaluationDetails input path into decision receipts and Trust
Basis artifacts; Harness gates/reports the resulting Trust Basis diff.
ASSAY_BIN=/path/to/assay \
demo/run-openfeature-decision-receipt-pipeline.sh \
--case nonregression \
--out-dir /tmp/assay-openfeature-decision-receipt-pipeline \
--overwriteSee docs/OPENFEATURE_DECISION_RECEIPT_PIPELINE.md for the artifact chain and boundary rules.
CycloneDX ML-BOM can describe AI/ML inventory surfaces; Assay compiles one
selected machine-learning-model component into inventory receipts and Trust
Basis artifacts; Harness gates/reports the resulting Trust Basis diff.
ASSAY_BIN=/path/to/assay \
demo/run-cyclonedx-mlbom-model-receipt-pipeline.sh \
--case nonregression \
--out-dir /tmp/assay-cyclonedx-mlbom-model-receipt-pipeline \
--overwriteSee docs/CYCLONEDX_MLBOM_MODEL_RECEIPT_PIPELINE.md for the artifact chain and boundary rules.
These receipt recipes require the released Assay v3.8.0 Trust Basis surface
or a later compatible release, tracking Assay through v3.27.0 (release-binary
proof verified through v3.27.0) for
assay.trust-basis.diff.v1, Trust Card schema v5, and the 10-claim eval /
decision / inventory family set, with machine-readable receipt contracts owned
by Assay. See
docs/ASSAY_COMPATIBILITY.md for the exact
compatibility boundary.
cd harness
export OPENAI_API_KEY=sk-...
npx tsx src/cli.ts run \
--input "List files in /tmp and write a summary to /tmp/summary.txt" \
--auto-approveThe regular GitHub Actions workflow runs 8 jobs on every push and PR. A manual
workflow_dispatch compatibility job is available before releases to test the
receipt recipes against a chosen Assay release binary.
| Job | What it checks |
|---|---|
| TypeScript Check | tsc --noEmit passes |
| Contract Validation | Mapper produces golden NDJSON, malformed input rejected |
| Golden Contract Tests | 24 tests: envelope, hashing, NDJSON format, rejection |
| Hardening Tests | 27 tests: resume safety, policy determinism, MCP boundaries |
| Policy Validation | allow/deny/require_approval decisions are correct |
| Verify Evidence | Evidence files pass all contract categories |
| Regression Gate | Baseline vs candidate compare — blocks on regressions |
| Evidence Export | Evidence JUnit + SARIF generated, artifacts uploaded |
| Assay Release Compatibility Recipes | Manual proof against a chosen Assay release binary |
Evidence export can generate SARIF for generic evidence findings and upload it
to the GitHub Security tab. Trust Basis gate/report output is different by
design: the canonical artifact is raw assay.trust-basis.diff.v1 JSON, with
Markdown and JUnit projections only. Trust Basis recipes do not emit SARIF.
For this line, Assay Harness is distributed as the repository CLI and GitHub
release artifacts. The harness/package.json version tracks release metadata;
it is not an npm publication claim.
Four concrete scenarios in fixtures/scenarios/ show the key product flows:
| Scenario | What it demonstrates |
|---|---|
clean-baseline |
No regressions — compare exits 0 |
new-approval |
PR introduces new require_approval tool |
deny-regression |
Policy change causes new denial |
policy-drift-resume |
Resume with different policy snapshot hash |
Run all scenarios locally:
bash demo/run-scenarios.shSee docs/SCENARIOS.md for details on each scenario.
The compare command detects these regression dimensions:
New denials — tool went from allowed → denied
Removed denials — tool went from denied → allowed (change, not regression)
New approvals — tool now requires approval
Removed approvals — tool no longer requires approval
Event count delta — more or fewer evidence events
New event types — unknown event type appeared
Removed event types — event type disappeared
Hash mismatches — same event position, different content hash
Process counter delta — approval/denial/resume counts changed
Example compare output (markdown format):
# Evidence Comparison Report
**Status:** REGRESSION DETECTED
**Summary:** REGRESSION: 1 new denial(s), event count delta: +0, 1 hash mismatch(es), 2 process counter change(s)
## New Denials
- `network_egress` (policy: harness-default-policy@1.0)
## Hash Mismatches
- seq 0 (`example.placeholder.harness.approval-interruption`)
- baseline: `sha256:a1b2c3...`
- candidate: `sha256:d4e5f6...`
## Process Summary Delta
| Counter | Baseline | Candidate | Delta |
|---------|----------|-----------|-------|
| denied_action_count | 0 | 1 | +1 |
| total_tool_calls | 3 | 3 | +0 |Every CLI command uses stable exit codes — safe for CI scripting:
| Code | Name | Meaning |
|---|---|---|
| 0 | success | No failures or regressions |
| 1 | policy_violation | Tool call denied by policy |
| 2 | config_error | Missing file or bad configuration |
| 3 | artifact_contract | Evidence fails contract validation |
| 4 | mapper_failure | Mapper rejected input |
| 5 | resume_error | Resume flow failed |
| 6 | regression | Baseline comparison found regressions |
| 7 | ci_formatter | JUnit/SARIF generation failed |
See docs/contracts/EXIT_CODES.md for per-command behavior.
tools:
allow:
- read_file
- list_directory
deny:
- network_egress
- delete_file
require_approval:
- write_file
- shell_execEvaluation order: deny > require_approval > allow > default deny (closed-by-default).
Policy uses only: tool name, tool category, target kind. No transcript, model reasoning, or volatile state influences the decision. See ADR-001.
This harness explicitly does not claim:
- transcript truth
- session truth
- provider-chaining truth
- full RunState truth
- full MCP protocol truth
Observed is not verified. Runtime signals are evidence input, not verification output. See ADR-002 and ADR-003.
resume_state_ref is an app-level fingerprint, not a portable wire hash.
Per SDK guidance (openai/openai-agents-js#1177),
RunState.toString() guarantees resumability via fromString() but does not
guarantee byte-stability across SDK versions. Consumers that need byte-stable
identity across SDK upgrades should derive their own anchor. See
docs/FIELD_PRESENCE.md for the full seam documentation.
Assay-Harness/
harness/ # TypeScript runtime
src/
cli.ts # CLI: run, verify, compare, policy
policy.ts # Deterministic policy engine
harness.ts # Pause/resume orchestrator
evidence.ts # CloudEvents envelopes + SHA-256 hashing
compare.ts # Baseline vs candidate regression detection
agent.ts # Agent definition + tools
mcp.ts # MCP tool lane (bounded evidence)
policy.yaml # Default policy
mapper/ # Python evidence mapper
map_to_assay.py # Harness JSON → Assay NDJSON
fixtures/ # Test corpus
valid.harness.json # Approval interruption artifact
failure.harness.json # Partial failure artifact
malformed.harness.json # Rejected (raw state, bad pause_reason)
valid.assay.ndjson # Golden mapper output
failure.assay.ndjson # Golden mapper output
valid.mcp.harness.json # MCP interaction fixture
scenarios/ # Demo scenario fixtures
clean-baseline.* # No-regression baseline
new-approval.* # New require_approval tool
deny-regression.* # Policy tightening
policy-drift-resume.* # Policy snapshot mismatch
demo/
run-scenarios.sh # Run all demo scenarios locally
tests/ # Python test suites
test_contracts.py # 24 golden contract tests
test_hardening.py # 22 hardening tests
ci/ # CI output generators
emit_junit.py # Evidence → JUnit XML
emit_sarif.py # Evidence → SARIF 2.1.0
emit_compare_junit.py # Compare results → JUnit XML
emit_compare_sarif.py # Compare results → SARIF 2.1.0
emit_otel.py # Evidence → OTLP JSON (experimental)
docs/
contracts/ # Stable contracts
EVIDENCE_ENVELOPE.md # CloudEvents envelope spec
EXIT_CODES.md # CLI exit code contract
REJECT_CODES.md # Mapper rejection codes
OTEL_EXPORT.md # OTel export mapping (experimental)
adr/ # Architecture decisions
ADR-001-*.md # Deterministic policy
ADR-002-*.md # No transcript truth
ADR-003-*.md # MCP bounded evidence
SCENARIOS.md # Demo scenario documentation
ROADMAP.md # Now / next / later roadmap
.github/workflows/
harness-ci.yml # 8-job CI pipeline (incl. regression gate)
release.yml # Tag-triggered release with attestations
sbom.yml # Dependency submission on push to main
Companion project to Assay. Follows Assay's evidence conventions, policy model, and artifact contract patterns. The mapper produces Assay-shaped NDJSON consumable by Assay's evidence tooling.
MIT