When you run a scenario (`python -m agent.run --local --scenario all`), each crew
produces a set of files in `output/latest/` (and `output/runs/vNN/`). After all
scenarios complete, `--validate` generates a validation report and summary.
This guide explains how to read each file type and what to look for.
| Scenario | Envelope | PROV Graph(s) | Audit | Metrics | Other |
|---|---|---|---|---|---|
| Healthcare | `healthcare_envelope.json` | `healthcare_prov.ttl` | `healthcare_audit.json` | `healthcare_metrics.json` | — |
| Education | `education_grading_envelope.json` | `education_grading_prov.ttl`, `education_equity_prov.ttl` | `education_audit.json` | `education_grading_metrics.json` | — |
| Recommendation | `recommendation_envelope.json` | `recommendation_prov.ttl` | — | `recommendation_metrics.json` | `recommendation_output.json` |

Cross-scenario files: `validation_report.json`, `summary.md`.
The envelope is the core protocol artifact. Key sections:

```json
{
  "context_id": "ctx-...",              ← unique ID linking all files for this run
  "schema_version": "jh:0.3",
  "producer": "did:hospital:...",       ← DID of the system that created it
  "scope": "healthcare_...",            ← what domain this envelope covers
  "semantic_payload": [...],            ← structured UserML data (may be [] if the
                                          LLM didn't produce a valid format — see
                                          the semantic_conformance check)
  "artifacts_registry": [               ← every artifact each agent produced
    {
      "artifact_id": "art-sensor",
      "type": "token_sequence",         ← artifact type (token_sequence,
                                          semantic_extraction, tool_result)
      "content_hash": "sha256:...",     ← tamper-detection hash
      "timestamp": "..."                ← when this artifact was created
    }
  ],
  "passed_artifact_pointer": "art-audit",  ← final output of the pipeline
  "decision_influence": [...],          ← which agents influenced the decision and
                                          what data categories they used
  "privacy": {                          ← GDPR metadata
    "data_category": "behavioral",
    "legal_basis": "consent",
    "retention": "P7D"                  ← 7-day retention
  },
  "compliance": {                       ← regulatory classification
    "risk_level": "high",               ← HIGH requires semantic_forward + oversight
    "human_oversight_required": true,
    "forwarding_policy": "semantic_forward"
  },
  "proof": {                            ← cryptographic seal
    "content_hash": "sha256:...",       ← hash of envelope contents
    "signature": "base64:...",          ← Ed25519 signature
    "signer": "did:hospital:..."        ← who signed it
  }
}
```
What to check:

- `artifacts_registry` should have one entry per pipeline step (e.g., 5 for healthcare)
- `content_hash` values should be unique per artifact (no duplicates = no empty outputs)
- `compliance.risk_level` should match the expected level (`high` for healthcare/education, `low` for recommendation)
- `semantic_payload` — if empty (`[]`), the `semantic_conformance` validation check will FAIL. This means the LLM produced free-form text instead of structured UserML.
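These checks can be scripted. A minimal sketch (the function name and the default step count are illustrative; the envelope layout follows the example above):

```python
import json
from collections import Counter

def check_envelope(path: str, expected_steps: int = 5) -> list[str]:
    """Return a list of problems found in an envelope file (empty = OK)."""
    with open(path) as f:
        env = json.load(f)
    problems = []

    registry = env.get("artifacts_registry", [])
    if len(registry) != expected_steps:
        problems.append(f"expected {expected_steps} artifacts, found {len(registry)}")

    # Duplicate content hashes usually mean two steps produced identical
    # (often empty) output.
    hashes = Counter(a["content_hash"] for a in registry)
    dupes = [h for h, n in hashes.items() if n > 1]
    if dupes:
        problems.append(f"duplicate content_hash values: {dupes}")

    # An empty semantic_payload will fail the semantic_conformance check.
    if not env.get("semantic_payload"):
        problems.append("semantic_payload is empty — semantic_conformance will FAIL")

    return problems
```

For example, `check_envelope("output/latest/healthcare_envelope.json")` returns an empty list for a clean run.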
Turtle RDF format. Three node types to look for:
```turtle
<did:hospital:sensor-agent> a prov:Agent ;
    rdfs:label "did:hospital:sensor-agent" ;
    jh:role "sensor" .
```

Each agent has a DID identifier and a role. Count agents to verify all pipeline participants are recorded.
```turtle
jh:act-situation a prov:Activity ;
    rdfs:label "situation" ;
    prov:startedAtTime "2026-03-24T13:13:26..."^^xsd:dateTime ;
    prov:endedAtTime "2026-03-24T13:13:26..."^^xsd:dateTime ;
    prov:used jh:art-sensor ;
    prov:wasAssociatedWith <did:hospital:situation-agent> .
```

What to check:

- `wasAssociatedWith` links the activity to its agent
- `used` shows what inputs were consumed (dependency chain)
- `startedAtTime`/`endedAtTime` provide temporal evidence — for healthcare, verify oversight activities start AFTER `act-decision`
```turtle
jh:art-situation a prov:Entity ;
    rdfs:label "Task output: situation" ;
    prov:wasDerivedFrom jh:art-sensor ;
    prov:wasGeneratedBy jh:act-situation ;
    jh:contentHash "9ed190..." .
```

What to check:

- `wasGeneratedBy` links the entity to its producing activity
- `wasDerivedFrom` traces the dependency chain (what inputs produced this output)
- `jh:contentHash` should match the corresponding entry in the envelope's `artifacts_registry`
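The hash cross-check can be done without a full RDF parser. A rough sketch (the function name is illustrative, and the regex assumes the literal layout shown above; a real implementation would parse the Turtle properly):

```python
import json
import re

def cross_check_hashes(prov_ttl: str, envelope: dict) -> set[str]:
    """Return contentHash values present in the PROV graph but missing
    from the envelope's artifacts_registry."""
    # Matches literals like: jh:contentHash "9ed190..."
    prov_hashes = set(re.findall(r'jh:contentHash\s+"([^"]+)"', prov_ttl))
    # Envelope hashes carry a "sha256:" prefix; strip it for comparison.
    registry_hashes = {
        a["content_hash"].removeprefix("sha256:")
        for a in envelope.get("artifacts_registry", [])
    }
    return prov_hashes - registry_hashes
```

An empty return set means every PROV entity hash is accounted for in the envelope.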
The healthcare PROV includes fine-grained document access activities:
```turtle
jh:act-access-ct-scan a prov:Activity ;
    prov:startedAtTime "..."^^xsd:dateTime ;
    prov:endedAtTime "..."^^xsd:dateTime ;
    prov:used jh:ent-ct-scan ;
    prov:wasAssociatedWith <did:hospital:dr-chen> .
```

These prove the physician accessed each source document, with timestamps. The
`temporal_oversight` check verifies that all of these occur AFTER `act-decision`.
Two separate `.ttl` files with zero shared entity URIs. The `workflow_isolation`
check compares entity sets across both graphs. If any `jh:art-*` URI appears in both
files, isolation is broken.
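The comparison the `workflow_isolation` check performs can be sketched in a few lines (the function name is illustrative, and the regex assumes the `jh:art-*` URI shape shown in the snippets above):

```python
import re

# Entity URIs of the form jh:art-<name>, as seen in the PROV snippets.
ART_URI = re.compile(r"\bjh:art-[\w-]+")

def shared_entities(grading_ttl: str, equity_ttl: str) -> set[str]:
    """Entity URIs appearing in BOTH graphs — must be empty for isolation."""
    return set(ART_URI.findall(grading_ttl)) & set(ART_URI.findall(equity_ttl))
```

A non-empty result means isolation is broken and the non-discrimination proof fails.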
Audit files have two sections:
```json
{
  "programmatic_checks": {
    "results": [
      {
        "check_name": "temporal_oversight",
        "passed": true,
        "evidence": {
          "ai_timestamp": "...",
          "human_activities_after_ai": ["act-access-ct-scan", ...],
          "total_review_seconds": 10.0,
          "min_required_seconds": 5.0
        },
        "message": "Human oversight verified: 4/4 activities after AI, 10s review time"
      }
    ]
  }
}
```

Reading evidence fields:

- `temporal_oversight`: compare `total_review_seconds` with `min_required_seconds`. In production the minimum should be 300 s (5 minutes); the simulated minimum is 5 s.
- `integrity`: `has_signature: true` but `passed: false` means the envelope was modified after signing (expected when the audit step adds artifacts post-signature).
- `workflow_isolation`: `shared_entities: []` means zero overlap (PASS).
- `negative_proof`: `violations: []` means no identity artifacts in the grading chain (PASS).
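The logic behind `temporal_oversight` reduces to two conditions. A sketch over the evidence fields above (the function name and the event-tuple shape are illustrative; the thresholds come from the text):

```python
from datetime import datetime

def temporal_oversight(ai_ts: str,
                       human_events: list[tuple[str, str]],
                       min_review_seconds: float = 5.0) -> bool:
    """human_events: (started_at, ended_at) ISO timestamps per oversight activity.
    Passes when every activity starts after the AI decision AND the total
    review time meets the minimum (5 s simulated; 300 s suggested for production)."""
    ai = datetime.fromisoformat(ai_ts)
    starts = [datetime.fromisoformat(s) for s, _ in human_events]
    ends = [datetime.fromisoformat(e) for _, e in human_events]
    all_after = all(s > ai for s in starts)
    review_seconds = sum((e - s).total_seconds() for s, e in zip(starts, ends))
    return all_after and review_seconds >= min_review_seconds
```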
```json
{
  "narrative_audit": "# EU AI Act Article 14 Compliance Audit Report\n..."
}
```

A detailed, human-readable report produced by the audit agent. For healthcare, this includes the full decision-chain reconstruction with timestamps, temporal verification, and compliance assessment. This is the artifact that would go to a regulatory auditor.
```json
{
  "context_id": "ctx-...",
  "total_ms": 169053.25,
  "steps": [
    {
      "step": "sensor",
      "agent": "did:hospital:sensor-agent",
      "artifact_id": "art-sensor",
      "content_size_bytes": 653,
      "persist_ms": 0,
      "started_at": "2026-03-24T13:13:20...",
      "ended_at": "2026-03-24T13:13:20..."
    }
  ]
}
```

| Field | Meaning |
|---|---|
| `total_ms` | Wall-clock time for the entire flow (init → last step) |
| `step` | Pipeline step name (matches the agent task name) |
| `content_size_bytes` | Size of the agent's output (larger = more detailed response) |
| `persist_ms` | Time spent saving to the backend API (0 = async/background persist) |
| `started_at` / `ended_at` | LLM call timestamps (identical values mean CrewAI didn't expose the real duration) |
What to look for:

- `total_ms` gives the end-to-end flow duration (healthcare ~170 s, recommendation ~35 s)
- `persist_ms > 0` only shows for steps that persist synchronously (usually the last step)
- Large `content_size_bytes` gaps between steps may indicate the LLM produced very different output lengths
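A quick per-step readout makes these patterns easy to scan. A small sketch (the function name is illustrative; field names follow the example above):

```python
import json

def summarize_metrics(path: str) -> list[str]:
    """One line per step: name, output size, persist behaviour."""
    with open(path) as f:
        m = json.load(f)
    lines = [f"total: {m['total_ms'] / 1000:.1f}s ({m['context_id']})"]
    for s in m["steps"]:
        persist = f"persist {s['persist_ms']}ms" if s["persist_ms"] else "async persist"
        lines.append(f"  {s['step']}: {s['content_size_bytes']} bytes ({persist})")
    return lines
```

Running it over `output/latest/healthcare_metrics.json` gives one line per pipeline step, which makes size gaps and the synchronous persist step stand out.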
Machine-readable report covering all three scenarios:
```json
{
  "scenarios": [
    {
      "scenario": "healthcare",
      "checks": {
        "risk_level": {"passed": true, "value": "high"},
        "forwarding_policy": {"passed": true, "value": "semantic_forward"},
        "semantic_conformance": {"passed": false, "message": "No UserML payload..."},
        "temporal_oversight": {"passed": true, "evidence": {...}},
        "integrity": {"passed": false, "evidence": {...}},
        "overall_passed": {"passed": false}
      },
      "metrics": {
        "envelope_bytes": 3682,
        "prov_bytes": 5608,
        "entity_count": 9,
        "activity_count": 9,
        "agent_count": 5,
        "artifact_count": 5,
        "performance": {...}
      }
    }
  ],
  "overall_passed": false
}
```

Each scenario has `checks` (pass/fail per audit check) and `metrics` (sizes and counts).
The top-level `overall_passed` is `true` only if ALL checks across ALL scenarios pass.
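When `overall_passed` is `false`, the first question is which checks failed where. A sketch that pulls that out (the function name is illustrative; the report structure follows the example above):

```python
import json

def failing_checks(report: dict) -> dict[str, list[str]]:
    """Map scenario name → names of checks that did not pass."""
    return {
        sc["scenario"]: [name for name, c in sc["checks"].items() if not c["passed"]]
        for sc in report["scenarios"]
    }
```

For instance, loading `validation_report.json` and printing `failing_checks(report)` narrows a failed run down to the specific scenario/check pairs to investigate.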
Human-readable interpretation of `validation_report.json`. Contains:
- Artifact characteristics table (envelope/PROV sizes, entity/activity/agent counts)
- Audit checks table with PASS/FAIL/n/a per scenario
- Explanation of each check and what it verifies
- List of all files in the run
This is the file to read first when reviewing a run.
Cause: LLM agents output free-form JSON instead of the structured UserML format
(`{"@model": "UserML", "layers": {...}}`).

Impact: The protocol still functions — envelopes, PROV, and audit all work — but payloads are not formally typed, which weakens the semantic auditing claim.

Fix: Use `FlatEnvelope` with `output_pydantic` to force the LLM into the schema,
or improve the task YAML prompts with stricter output instructions.
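The shape being enforced can also be gated defensively before sealing the envelope. A minimal sketch (the function name is illustrative; the check assumes only the `{"@model": "UserML", "layers": {...}}` shape described above):

```python
def is_userml(payload: object) -> bool:
    """True when a payload item matches the minimal UserML shape described above."""
    return (
        isinstance(payload, dict)
        and payload.get("@model") == "UserML"
        and isinstance(payload.get("layers"), dict)
    )
```

Dropping (or flagging) items that fail this check would surface the conformance problem at envelope-build time rather than at validation time.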
Cause: The envelope is signed by the sensor-agent at the beginning of the pipeline. Subsequent steps (oversight, audit) add artifacts to the same envelope, changing its content hash, so the original signature no longer matches.

Impact: Expected in multi-step pipelines where the envelope accumulates artifacts.

Fix: Call `builder.sign(final_agent_did)` after all artifacts are registered, not
after the first step.
Cause: A shared artifact URI appears in both the grading and equity PROV graphs.

Impact: Critical — it breaks the non-discrimination proof.

Fix: Verify that the equity flow uses a completely separate `context_id` and doesn't
reference any `art-grading` or `art-ingestion` entities.
Cause: Either physician activities have timestamps BEFORE the AI decision, or the total review duration is below the minimum threshold.

Impact: Critical — it breaks the human oversight proof.

Fix: Check that `_persist_oversight_events()` is called with correct timestamps and
that the simulated review delays sum to more than `min_required_seconds`.