From 31570e0f35c32ab365524d0ca2a5f095b74c9558 Mon Sep 17 00:00:00 2001
From: Shady El Damaty
Date: Wed, 24 Jun 2026 14:17:24 +0200
Subject: [PATCH 1/4] =?UTF-8?q?schema(v0.6.0):=20proactive=20provenance=20?=
 =?UTF-8?q?=E2=80=94=20run-record=20+=20node=20runner?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Make the Glimmer graph executable, not just descriptive. The agentic loop
(docs/agentic-loop.md) was specified but had no runtime primitive; the only
"verification" script re-hashed files on disk ("not a full re-run").

Core schema (v0.5.1 → v0.6.0):
- new `run-record` node type: one concrete, replayable invocation (PROV
  Activity) with a planned → ready → running → executed lifecycle, binding a
  method + pinned, standard-validated inputs + expected outputs + command +
  container digest + a runner-written verdict. The executable unit of the
  agentic loop.
- edges reruns/consumes/regenerates/emits (+ inverse regenerated-by); broaden
  tests-hypothesis/addresses-concept to allow a run-record source.
- method-registry affordances: registry-ref + implements/equivalent-to/refines.
- standard.validator hint for the runtime gate.

Runner (glimmer/tools/run.py + `glimmer run` / `glimmer rerun`):
- pre-run standards gate (inputs pinned AND valid; validation delegated to
  each standard's validator or glimmer validate).
- datalad containers-run replay (feature-detected; degrades honestly).
- three verification tiers: byte-identical (NIfTI/GIFTI/JSON normalization),
  numeric-within-tolerance (re-derive published numbers), structural
  (agent/LLM).
- certify_equivalence, dependency-ordered `all`, provenance manifest, verdict
  writeback.

Validator: run-record field/edge/target/lifecycle checks + agent-protocol rule.
Worked example: examples/synthetic-provenance/ exercises the loop + all three
tiers + the gate + equivalence, with no real data and no heavy deps.
Docs: new docs/proactive-provenance.md; agentic-loop.md made executable;
roadmap (federation → v0.7, registry/minimal-path folded in); README
repositioned as AI-native reproducibility; agent-protocol + datalad-pattern
cross-linked.

Co-Authored-By: Claude Opus 4.8
---
 README.md | 14 +-
 docs/agent-protocol.md | 2 +
 docs/agentic-loop.md | 35 +
 docs/datalad-pattern.md | 2 +
 docs/proactive-provenance.md | 180 ++++++
 docs/roadmap.md | 54 +-
 examples/synthetic-provenance/.gitignore | 3 +
 examples/synthetic-provenance/README.md | 85 +++
 .../synthetic-provenance/code/compute_mean.py | 17 +
 .../code/compute_mean_fast.py | 22 +
 .../code/fit_classifier.py | 23 +
 .../synthetic-provenance/code/make_signal.py | 17 +
 .../code/summarize_agent.py | 23 +
 .../code/validate_signal.py | 27 +
 examples/synthetic-provenance/emit_graph.py | 40 ++
 .../synthetic-provenance/inputs/signal.json | 15 +
 .../inputs/signal_bad.json | 4 +
 glimmer/schema/frontmatter.yaml | 105 ++-
 glimmer/schema/glimmer-version | 2 +-
 glimmer/schema/profiles/_profile.schema.yaml | 2 +-
 glimmer/schema/schema.md | 75 ++-
 glimmer/tools/cli.py | 38 ++
 glimmer/tools/run.py | 610 ++++++++++++++++++
 glimmer/tools/validate.py | 73 ++-
 24 files changed, 1441 insertions(+), 27 deletions(-)
 create mode 100644 docs/proactive-provenance.md
 create mode 100644 examples/synthetic-provenance/.gitignore
 create mode 100644 examples/synthetic-provenance/README.md
 create mode 100644 examples/synthetic-provenance/code/compute_mean.py
 create mode 100644 examples/synthetic-provenance/code/compute_mean_fast.py
 create mode 100644 examples/synthetic-provenance/code/fit_classifier.py
 create mode 100644 examples/synthetic-provenance/code/make_signal.py
 create mode 100644 examples/synthetic-provenance/code/summarize_agent.py
 create mode 100644 examples/synthetic-provenance/code/validate_signal.py
 create mode 100644 examples/synthetic-provenance/emit_graph.py
 create mode 100644 examples/synthetic-provenance/inputs/signal.json
 create mode 100644
examples/synthetic-provenance/inputs/signal_bad.json create mode 100644 glimmer/tools/run.py diff --git a/README.md b/README.md index ec1ae07..d806b01 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,10 @@ # Glimmer -> **A research-object knowledge base for AI-native scientific workflows.** +> **An AI-native solution to the reproducibility problem.** > -> The 2010s gave us reproducible pipelines. Glimmer is the next layer up — the typed-entity graph that makes the agentic feedback loop traversable over those pipelines. +> The 2010s gave us reproducible pipelines. Glimmer is the next layer up — a typed-entity graph over those pipelines whose runs are **executable, standard-gated, and self-verifying**, so "this result reproduces" is a contract the machine checks, not a footnote you trust. -[![Status: v0.3](https://img.shields.io/badge/status-v0.3-blue.svg)](https://github.com/hebbianloop/glimmer) +[![Status: v0.6](https://img.shields.io/badge/status-v0.6-blue.svg)](https://github.com/hebbianloop/glimmer) [![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE) [![Template](https://img.shields.io/badge/repo-template-purple.svg)](https://github.com/hebbianloop/glimmer/generate) @@ -12,6 +12,8 @@ Existing standards (BIDS, DataLad, NIDM, Nipype) give your project syntactic structure. Glimmer adds the **graph layer**: datasets, methods, derivatives, findings, standards, and publications become first-class typed nodes with versioned edges, distributed across per-entity sidecars. An AI agent traverses the graph to render verifiable decisions with auditable reasoning traces. +**Proactive provenance (v0.6).** The graph is not just descriptive — it *runs*. 
A [`run-record`](glimmer/schema/schema.md#run-record) is one concrete, replayable invocation (the PROV `Activity`); the `glimmer run` / `glimmer rerun` node runner gates its inputs against their standards, replays the recorded command in a pinned container, and verifies outputs at one of three tiers — **byte-identical**, **numeric-within-tolerance** (re-derive a published number from source), or **structural** (for agent/LLM outputs). This is the executable unit of the [agentic loop](docs/agentic-loop.md). Start at [`docs/proactive-provenance.md`](docs/proactive-provenance.md) and the [`synthetic-provenance`](examples/synthetic-provenance/) example. + Glimmer is domain-agnostic. The canonical worked example in this repo is neuroimaging because that's where standards like BIDS and tools like DataLad and Nipype are most developed — but the architectural pattern (typed-entity graph over a versioned-data substrate) applies to any compute-intensive scientific domain backed by a mature standards ecosystem. Glimmer is the architectural pattern + a reference implementation. The full case for it is in the CAISC 2026 paper (see [`docs/paper-citation.md`](docs/paper-citation.md)). @@ -42,15 +44,17 @@ The line between "core" and "project" is the `glimmer/` directory. Anything insi ``` glimmer/ ├── schema/ -│ ├── schema.md # v0.3 spec — 10 entity types, edge taxonomy, sidecar format +│ ├── schema.md # v0.6 spec — 13 entity types (incl. 
run-record), edge taxonomy, sidecar format │ ├── frontmatter.yaml # machine-readable contract for validators -│ └── glimmer-version # current core version (0.3.1) +│ └── glimmer-version # current core version (0.6.0) └── tools/ ├── validate.py # schema validator (enforces agent-protocol verifiability) + ├── run.py # node runner — `glimmer run` / `glimmer rerun` (gate → replay → verify) ├── cli.py # `glimmer` CLI single entry point └── figure_schema.py # render the schema diagram examples/ +├── synthetic-provenance/ # v0.6 proactive-provenance demo: loop + 3 tiers + gate + equivalence └── ds000114-nipype/ # canonical worked example from the CAISC 2026 paper ├── install.sh # `datalad install ///openneuro/ds000114` + selective `datalad get` ├── workflow.py # Nipype anatomical preprocessing (BET → FAST) diff --git a/docs/agent-protocol.md b/docs/agent-protocol.md index cdb49f9..2f02118 100644 --- a/docs/agent-protocol.md +++ b/docs/agent-protocol.md @@ -92,6 +92,8 @@ The structure makes audit possible; the audit itself remains a human responsibil For deterministic outputs (a Nipype workflow), verification is exact: re-running from the cited SHAs must produce a byte-identical output. For LLM-inferred outputs, verification is structural: the trace must cite real nodes, the cited nodes must contain the values the trace claims, and the interpretation must be a plausible reading of those values. Neither test guarantees correctness; together they guarantee auditability. +As of v0.6 these regimes are no longer prose-only — they are the three tiers the **node runner** (`glimmer run` / `glimmer rerun`, see [`proactive-provenance.md`](proactive-provenance.md)) actually checks: **byte-identical** (deterministic, with header normalization), **numeric-within-tolerance** (stochastic), and **structural** (agent-inferred — this section's contract, executed). 
A `run-record` with `produced-by-agent` set carries the same mandatory `reasoning-trace`, and the runner's structural tier verifies exactly the three conditions above. + ## What this protocol does not solve - **The agent's reasoning may still be wrong**, even when grounded in real evidence. Glimmer's audit makes errors traceable; it does not eliminate them. diff --git a/docs/agentic-loop.md b/docs/agentic-loop.md index 6dcfae6..69f566a 100644 --- a/docs/agentic-loop.md +++ b/docs/agentic-loop.md @@ -57,6 +57,40 @@ └─────────────────────────────┘ ``` +## Making the loop executable: `run-record`s (v0.6) + +Through v0.5 this loop was a *pattern* — the "launch agent runs" box had no runtime +primitive; an analysis "run" left only a `derivative` behind, with no replayable record of +the act that produced it. v0.6 makes the box executable with the [`run-record`](../glimmer/schema/schema.md#run-record) +node type and the `glimmer run` / `glimmer rerun` runner (see [`proactive-provenance.md`](proactive-provenance.md)). + +The loop now closes *in the graph*: + +``` +concept ──decompose──▶ hypotheses + │ │ each hypothesis gets one or more PLANNED run-records + │ ▼ (tests-hypothesis → concept; inputs may be spec'd) + │ run-record [planned] ──gate──▶ [ready] ──glimmer run──▶ [executed] + │ │ the runner: validates inputs against their standards, + │ │ replays the command, hashes/verifies outputs + │ ▼ + │ regenerates → derivative + emits → finding + │ │ addresses-concept + └──────────────────── feedback ◀───────────────────────┘ + the next planning pass reads verdicts + findings, not memory +``` + +- **Plan** = a `concept` decomposed into hypotheses, each with `planned` run-records. +- **Run** = `glimmer run` gates, executes, and records — advancing `planned → executed` + and writing a verdict. The **analysis agent** role below now *authors and runs* + run-records rather than emitting bare derivatives. 
+- **Feedback** = the `replay-verdict` + emitted `finding` (which `addresses-concept`) are + what the next iteration reads. Re-running later (`glimmer rerun`) re-verifies the chain. + +This is why the loop is *self-sustaining*: every iteration leaves behind a replayable, +standard-gated, self-verifying record, so the agent can't silently lose a finding or +re-make a settled mistake — the verdict is in the graph. + ## The four agent roles A Glimmer project typically uses four distinct agent roles, each operating with a different protocol mode and access scope. @@ -158,6 +192,7 @@ Shipped in v0.3 (this loop now runs against the released schema): - `experiment` node type — for task/acquisition paradigms (Experiment Factory containers, jsPsych/PsychoPy tasks). - Cross-cutting edges: `addresses-concept` (finding/publication → concept), `tests-hypothesis` (experiment → concept), `extends-concept` / `subsumed-by` / `competes-with` / `superseded-by` (concept → concept), and the universal `contributed-by` attribution edge. - `persona` and `organization` node types + the in-graph attribution edges `authored-by`, `affiliated-with`, `funded-by`, `mentors`, `leads`, `part-of` (v0.3.1). A literature scout or synthesis agent can now resolve "who worked on this concept" by walking `authored-by` / `leads` to persona nodes rather than parsing free-text author strings. +- `run-record` node type + the `glimmer run` / `glimmer rerun` runner (v0.6) — the executable unit of this loop: a `planned → ready → running → executed` invocation tied to its hypothesis via `tests-hypothesis`, producing `derivative`s (`regenerates`) and a `finding` (`emits`) with a recorded verdict. See [`proactive-provenance.md`](proactive-provenance.md). Still on the roadmap: diff --git a/docs/datalad-pattern.md b/docs/datalad-pattern.md index 7c577a7..1bee331 100644 --- a/docs/datalad-pattern.md +++ b/docs/datalad-pattern.md @@ -91,6 +91,8 @@ These fields are optional in v0.1.1 and recommended in v0.1.2. 
A Glimmer instanc Every step is verifiable. Every output cites its inputs by content-hash. The Glimmer graph isn't a fragile parallel database — it's a thin reasoning layer over a DataLad superdataset that is itself the source of truth. +The "verification agent re-runs methods, compares output-hashes" step above is concrete as of v0.6: it is the `glimmer rerun` node runner (`glimmer/tools/run.py`, see [`proactive-provenance.md`](proactive-provenance.md)). The runner replays a `run-record`'s command pinned to its `container-digest` via `datalad containers-run`, materializing inputs with `datalad get` and matching their `datalad-annex-key` / `datalad-commit-sha` before execution — the DataLad coordinates on each node are exactly what makes re-fetch-and-replay possible. + ## How this relates to the format-agnostic position Earlier docs argued for a "two-tier" Glimmer/BIDS sidecar strategy. The deeper point is simpler: **format doesn't matter if the agent can translate between formats.** What matters is that the data has the structure the schema requires. diff --git a/docs/proactive-provenance.md b/docs/proactive-provenance.md new file mode 100644 index 0000000..48c99b1 --- /dev/null +++ b/docs/proactive-provenance.md @@ -0,0 +1,180 @@ +# Proactive Provenance — the `run-record` and the node runner + +> Glimmer v0.6.0. The schema records *what* an output is; proactive provenance makes +> the graph **executable** — so the claim "this output reproduces" is a contract the +> machine can check, not a footnote you trust. + +The 2010s gave us reproducible pipelines. Glimmer v0.3–v0.5 gave us a typed graph over +them — datasets, methods, derivatives, findings — each content-hashed. But the graph was +**descriptive**: a `derivative` recorded an `output-hash`, yet nothing could re-execute +the computation and confirm it. 
The only verification script in the repo +(`examples/ds000114-nipype/verify.py`) re-hashed files already on disk and said so: +*"not a full re-run."* v0.6.0 closes that gap with one new core node type and one tool. + +## Why this matters (the failure it fixes) + +An agent analyzing a dataset, with no executable contract to anchor to, re-derives the +same numbers different ways, makes the same data-processing mistakes, and forgets prior +findings. Each of those is a reproducibility failure in miniature. A `run-record` makes +the unit of work **replayable, gated, and self-verifying**: the agent (or a human, or CI) +runs it, gets a verdict, and the verdict — not a memory or a hope — is what the next step +reads. This is reproducibility as a property the substrate *enforces*, AI-native by +construction. + +## The `run-record`: one concrete, replayable invocation + +A `method` is a reusable tool; a `derivative` is a product. Neither is *the act*. The +`run-record` is the PROV `Activity`: **this** command, on **these** pinned inputs, in +**this** container, testing **this** hypothesis, on **this** date — binding a method + +pinned inputs + expected outputs + the environment + a verdict. + +It is the **executable unit of the agentic loop** (see [`agentic-loop.md`](agentic-loop.md)). +Authored as part of a plan and advanced through a lifecycle: + +``` +planned ──gate──▶ ready ──exec──▶ running ──▶ executed + │ │ + │ (inputs may be spec'd/unpinned) └─▶ failed | superseded +``` + +- **planned** — written when a `concept` is decomposed into hypotheses; inputs may be a + *spec* (modality / n / standard) rather than a pin. (The affordance v0.7 minimal-path + reproduction builds on.) +- **ready** — inputs pinned and standard-valid: the runner's gate passed. +- **running / executed / failed** — the runner advances these and writes the verdict. + +Full field list: [`glimmer/schema/schema.md` → run-record](../glimmer/schema/schema.md#run-record). 
+The required core is `method`, `command`, `provenance-mode`, `status`, `inputs`, `outputs`. + +## The runner: `glimmer run` and `glimmer rerun` + +Two modes over one engine (`glimmer/tools/run.py`). **The runner is for running, not just +reproducing** — reproduction is the special case of re-running something already recorded. + +| command | mode | what it does | +|---|---|---| +| `glimmer run ` | forward | gate → replay → hash outputs → emit/record products; advance `ready→executed` | +| `glimmer rerun ` | reproduce | re-execute and **compare** outputs to what was recorded, per tier | + +`` is a run-record id, a comma-separated list, or `all` (dependency-ordered so a run +that produces another's input goes first). Flags: `--manifest `, `--write-verdict`, +`--offline`, `--no-container`, `--no-gate`. + +The engine, per run-record: + +1. **Standards gate.** Each input must be (a) available/pinned (`datalad get` unless + `--offline`; annex-key / commit-sha match) **and** (b) valid against its declared + standard. A failed gate ⇒ verdict `gate-failed`, **no execution**. +2. **Execute.** Replay `command` pinned to `container-digest` via `datalad containers-run`; + fall back to `datalad run`, then a host subprocess (only with `--no-container`, recorded + as a dirty environment). Cwd is the project root (the rokb's parent — where `code/`, + `inputs/`, `out/` live). +3. **Verify** (reproduce mode) at the tier selected by `provenance-mode`. +4. **Record.** Write the provenance manifest (JSON); with `--write-verdict`, write the + `validation-gate` + `replay-verdict` blocks back into the sidecar (and `status`). + +## The three verification tiers + +Reproducibility is not one thing. The tier is chosen by `provenance-mode` and can be +overridden per output via the output-pin's `tier`. 
+ +### byte-identical (`deterministic`) +Re-execute; the output hash must match exactly — **after normalization**, because two +*correct* re-runs of an FSL/Nipype step differ in volatile header bytes. The runner +normalizes before hashing and records how: +- **NIfTI** (`.nii`/`.nii.gz`): blank `descrip`/`db_name`/`aux_file`, zero `cal_*`/`gl*`, + keep affine + dtype + voxels exactly → `normalization: nifti-header-zeroed`. (Realizes + the long-promised normalization in `examples/ds000114-nipype/workflow.py`.) +- **GIFTI** (`.gii`): strip the `` block, canonicalize XML. +- **JSON**: canonical re-serialization (sorted keys). +- nibabel absent ⇒ raw compare, recorded as `raw (nibabel unavailable)` — never a silent + pass. + +### numeric-within-tolerance (`stochastic` + `reproduction-tolerance`) +Re-derive a number from source and assert `abs(obs − exp) ≤ max(abs, rel·|exp|)`. The +expected value lives in the output-pin's `expected-values` (the published number); the +tolerance declares how close counts (empty ⇒ exact). This is the honest guarantee for +stochastic analyses **and the mechanism for reproducing a paper's reported numbers from +source.** + +### structural (`agent-inferred`) +LLM/agent prose is not hash-reproducible. Verification is structural (per +[`agent-protocol.md`](agent-protocol.md)): the `reasoning-trace` must cite real nodes, those +nodes must hold the claimed values, and the `retriever-manifest` (if any) must be +self-consistent — the analogue of SHA re-execution for the stochastic regime +([`retrieval-adapter.md`](retrieval-adapter.md)). + +## Standards as a runtime gate (referenced, runner-enforced, delegated) + +A run-record only **references** the standards its inputs must satisfy (input-pin +`conforms-to`, the `requires-standard` edge). It never reimplements validation. 
The +**runner enforces** the gate; the **check is delegated** to the standard's `validator` +hint: + +```yaml +# on a `standard` node +validator: {tool: bids-validator, command: "bids-validator {path}", kind: external} +# or {kind: glimmer} to use `glimmer validate` on the graph +``` + +`{path}` is the input under test; a zero exit means conformant. Unknown/missing validator +⇒ the gate falls back to `glimmer validate` and records `none (unchecked)` — honest, never +a silent pass. The outcome lands in the verdict's `validation-gate` block, so "ran on +BIDS-valid data" becomes a recorded, enforced fact. + +## The provenance manifest + +`--manifest ` writes a JSON superset of the old `verification-report.json`: + +```json +{ + "glimmer-run-version": "glimmer-run 0.6.0", + "summary": {"verified": 2, "reproduced-within-tolerance": 1, "structurally-valid": 1, + "total": 4, "reproducibility-rate-pct": 100.0}, + "runs": [ { "node": "...", "verdict": "verified", "tier": "byte-identical", + "validation-gate": {...}, "environment": {...}, "outputs": [...], + "numeric-checks": [...], "structural-checks": [...], "exit-code": 0 } ] +} +``` + +A tampered `expected-hash` ⇒ `mismatch`; an unreachable input ⇒ `inputs-unavailable` +(recording `data-liveness`); a non-pinned-container run ⇒ `environment.resolved-via: host`. +The verdict is never silently weaker than the claim. + +## Method registry affordances (v0.6 lightweight; full program v0.7) + +A method is a reusable f(x), and its *semantic pattern* can outlive one project's graph. +v0.6 adds: +- `registry-ref` — a cross-project namespaced id naming the canonical pattern (e.g. + `glimmer-methods:skullstrip-t1w`); `implements` is its navigable edge. +- `equivalent-to` — two methods that produce the same outputs. **Certified, not asserted:** + `certify_equivalence(rokb, run_a, run_b)` runs both on the same input and confirms the + outputs match (byte-identical, or within tolerance). 
An equivalent-but-faster/cleaner + implementation is tolerated; the pattern links back. +- `refines` — a tuned / fine-tuned variant. + +A dedicated `method-pattern` node type, a published cross-institution registry, and +minimal-path reproduction of arbitrary external papers are **deferred to v0.7**. + +## Worked example + +[`examples/synthetic-provenance/`](../examples/synthetic-provenance/) exercises the loop + +all three tiers + the standards gate + equivalence certification, with no real data and no +heavy dependencies. Start there. + +## Using it downstream (e.g. an ADS dissertation harness) + +The runner's public surface is importable, so a domain harness re-uses the audited engine +instead of re-implementing verification: + +```python +from glimmer.tools.run import load_graph, run_node, write_manifest, RunVerdict, sha256_file + +# one run-record per published number, each with a reproduction-tolerance and a +# tests-hypothesis edge to its concept: +verdicts = [run_node(ROKB, rid, mode="reproduce") for rid in claim_run_ids] +write_manifest(verdicts, "docs/data/provenance_manifest.json") +``` + +The domain repo supplies only data + command + tolerance + standard references; Glimmer +supplies the gate, the tiers, the normalization, and the verdict. diff --git a/docs/roadmap.md b/docs/roadmap.md index 4ae739b..8a68c4a 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -1,6 +1,10 @@ # Glimmer Roadmap -> v0.3.1 (current) extends the architecture beyond a single dataset: ten entity types (`dataset`, `method`, `experiment`, `derivative`, `finding`, `concept`, `standard`, `publication`, `persona`, `organization`), the universal `contributed-by` attribution edge plus the in-graph attribution layer (`authored-by`, `affiliated-with`, `funded-by`, `mentors`, `leads`, `part-of`), the `ds000114-nipype` worked example, and the retrieval adapter for the literature-scout role. 
This document tracks the work that takes Glimmer from "single-project RO-KB" to "the substrate for AI-native science." +> **v0.6.0 (current)** makes the graph *executable*: the `run-record` node type + the +> `glimmer run` / `glimmer rerun` node runner (gate → replay → verify at three tiers). +> See [`proactive-provenance.md`](proactive-provenance.md). Earlier milestones below. + +> v0.3.1 extends the architecture beyond a single dataset: ten entity types (`dataset`, `method`, `experiment`, `derivative`, `finding`, `concept`, `standard`, `publication`, `persona`, `organization`), the universal `contributed-by` attribution edge plus the in-graph attribution layer (`authored-by`, `affiliated-with`, `funded-by`, `mentors`, `leads`, `part-of`), the `ds000114-nipype` worked example, and the retrieval adapter for the literature-scout role. This document tracks the work that takes Glimmer from "single-project RO-KB" to "the substrate for AI-native science." ## v0.2 — Interop and BIDS Bridge @@ -134,21 +138,45 @@ The reference agent (`glimmer/tools/agent.py`) lands as a minimal QC agent, then - `glimmer.agent.Reasoning` — base class for project-specific agents (trace verification, finding synthesis, literature review, meta-analysis summarization). Authors of project agents subclass this and only fill in domain-specific reasoning. - `glimmer.agent.Trajectory` — explicit trace object that records every node read and every edge walked, so outputs are auditable in a structured way. -The payoff: a project-specific agent becomes a small amount of domain reasoning over a shared, audited tool surface, and every run emits a `Trajectory` that the verifiability contract (see `docs/agent-protocol.md`) can check after the fact. The reference QC agent stops being the ceiling and becomes the smallest example of the SDK. 
+The payoff: a project-specific agent becomes a small amount of domain reasoning over a shared, audited tool surface, and every run emits a `Trajectory` that the verifiability contract (see `docs/agent-protocol.md`) can check after the fact. The reference QC agent stops being the ceiling and becomes the smallest example of the SDK. The `rerun_method` primitive named here is **realized in v0.6.0** by `glimmer/tools/run.py`. + +## v0.6.0 — Proactive provenance (executable run-records + node runner) ✅ shipped + +The shift from a **descriptive** graph (records *what* an output is) to an **executable** +one (can re-run it and prove it). Full spec: [`proactive-provenance.md`](proactive-provenance.md). + +- **`run-record` core node type** — one concrete, replayable invocation (the PROV + `Activity`): a method + pinned, standard-validated inputs + expected outputs + the exact + command + a pinned container env + a runner-written verdict, advanced through a + `planned → ready → running → executed` lifecycle. It is the **executable unit of the + agentic loop** (see [`agentic-loop.md`](agentic-loop.md)), with the core edges `reruns`, + `consumes`, `regenerates`, `emits`, and `tests-hypothesis` / `addresses-concept`. +- **`glimmer run` / `glimmer rerun`** — the node runner: a pre-run **standards gate** + (inputs pinned AND valid, delegated to each standard's `validator`), `datalad + containers-run` replay, and verification at **three tiers** — byte-identical (with + NIfTI/GIFTI/JSON normalization), numeric-within-tolerance (re-derive a published number + from source), structural (agent/LLM). Writes a provenance manifest. +- **Lightweight method-registry affordances** — `registry-ref` + `implements` / + `equivalent-to` / `refines` on `method`; the runner **certifies** `equivalent-to` by + output-match. (A dedicated `method-pattern` type, a published registry, and minimal-path + reproduction move to v0.7 below.) 
+- Worked example: [`examples/synthetic-provenance/`](../examples/synthetic-provenance/). -## v0.6 — Federation and shared schemas +## v0.7 — Federation, shared schemas & the method registry -When two research groups maintain Glimmer projects on the same dataset, they should be able to publish their schemas, agents, and verification baselines as a shared registry. v0.6 specifies: +When two research groups maintain Glimmer projects on the same dataset, they should be able to publish their schemas, agents, and verification baselines as a shared registry. v0.7 specifies: - A schema-registry format for cross-institution publication of Glimmer extensions. **Domain profiles are the unit of publication here** — a curated profile (`status: curated`) lives in this repo; a profile published through the registry carries `status: community`. The `_profile.schema.yaml` metadata (`standard`, `version`, `status`) is the registry record. +- **The method registry (full).** Promote the v0.6 lightweight affordances into a dedicated `method-pattern` node type (an abstract f(x) interface contract — CWL-style, no binding — that concrete `method` nodes `implement`), and publish patterns cross-institution so a method's semantic identity outlives one project's graph. +- **Minimal-path reproduction.** Given any external claim (a cited `publication` / `concept`), synthesize a minimal run-graph of `planned` run-records — with spec'd or surrogate/synthetic inputs when the real data is unavailable — that would test it. Builds directly on the v0.6 planned-run-record + spec'd-input affordances. - A reputation / provenance model for who proposed which schema extension. - A federated-query mechanism: agent at site A can issue a query, the local graph + remote schema permit reasoning, the response is signed by the agent's identity. This is also the natural junction with decentralized-science infrastructure (Opscientia, OpenNeuro, Holonym-style verifiable researcher identity). 
-## v0.7 — Storage, durability & the multi-tenant platform +## v0.8 — Storage, durability & the multi-tenant platform -The **service architecture rewrite** (tracked in #7): the work that takes Glimmer from a post-hoc documentation ledger to a node-driven model where storage durability is *part of the research object*, and where users provision their own hosted backing resources. It is the platform the hosted CLI (v0.8) sits on. +The **service architecture rewrite** (tracked in #7): the work that takes Glimmer from a post-hoc documentation ledger to a node-driven model where storage durability is *part of the research object*, and where users provision their own hosted backing resources. It is the platform the hosted CLI (v0.9) sits on. **Why this belongs in the model.** Glimmer v0.3 records *what* an output is plus its content-hash, but delegates *where* it lives and *how many copies* exist to the datalad/git-annex layer — invisible in the typed research object. An output you cannot locate, or that has a single copy, is not reproducible, so durability belongs inside the model rather than only in ops config. The motivating near-miss: a multi-day, compute-expensive output that sat only in `/tmp`, discoverable only because someone asked where it was. @@ -160,18 +188,18 @@ Three layers, smallest-first: Design record: #6, #7, and `ads-glimmer/docs/data/INFORMATION-ARCHITECTURE.md` (status: design, implementation deferred to this version). -## v0.8 — Hosted service: CLI ↔ glimmer.science +## v0.9 — Hosted service: CLI ↔ glimmer.science -Today the CLI is local-only — it builds, validates, and traverses a file-tree RO-KB on disk. Once the **v0.7 platform** is in place, the CLI becomes the client to a hosted research-object service: +Today the CLI is local-only — it builds, validates, and traverses a file-tree RO-KB on disk. 
Once the **v0.8 platform** is in place, the CLI becomes the client to a hosted research-object service: - **`glimmer auth login`** — authenticate the CLI to glimmer.science and bind to a project the user created there. -- **Resource provisioning** — request and manage a project's backing resources from the CLI: **storage** (dataset hosting / DataLad remotes, provisioned per the v0.7 tenant model) and **compute** (run a `method` / pipeline remotely rather than locally). +- **Resource provisioning** — request and manage a project's backing resources from the CLI: **storage** (dataset hosting / DataLad remotes, provisioned per the v0.8 tenant model) and **compute** (run a `method` / pipeline remotely rather than locally). - **Remote research-object operations** — modify / update / query the hosted RO-KB through the CLI: push new nodes and edges, fetch a subgraph, run a query against the project's graph. -- **Identity & provenance** — operations are signed by the authenticated researcher identity, tying into the v0.6 federation model and Holonym-style verifiable identity. +- **Identity & provenance** — operations are signed by the authenticated researcher identity, tying into the v0.7 federation model and Holonym-style verifiable identity. -Dependency: **builds on the v0.7 platform rewrite** (#7). Captured here so the CLI surface is designed against the service rather than retrofitted later. +Dependency: **builds on the v0.8 platform rewrite** (#7). Captured here so the CLI surface is designed against the service rather than retrofitted later. -## Beyond v0.8 — Open questions +## Beyond v0.9 — Open questions Each of these is a candidate edge or field family that doesn't yet have a consumer pushing on it. The leaning is recorded so the discussion starts from a proposal, not a blank page. 
@@ -180,7 +208,7 @@ Each of these is a candidate edge or field family that doesn't yet have a consum - **Relationship-typed citation.** Citation is *already* typed by **target** — `cites-dataset` / `cites-method` / `cites-derivative` / `cites-finding`, plus generic `cites` and `validates-against`. What's missing is typing by **relationship**: CiTO / PROV has ~50 predicates (`disagrees-with`, `extends`, `uses-method-in`, …). *Proposal:* don't adopt all of them — add a small, high-signal subset (`extends`, `uses-method-in`, `disagrees-with`, `confirms`) as typed `publication → publication` edges layered over the existing by-target `cites-*`. The planned `meta-analyzes` edge is the first member of this family. *Leaning:* adopt the subset alongside `meta-analysis`. - **Retractions.** The evidence-relation layer already ships — `contradicts` and `competes-with` (concept), `challenged-by` (finding), `superseded-by` (concept) — so contradiction and supersession *are* modeled. What's **not** modeled is a formal, dated **retraction**: a withdrawal of a result by its own authors. *Proposal:* add a narrow `retracts` / `disputes` edge on `finding` / `publication`, extending the existing `superseded-by` pattern. A retraction is a first-class edge, **not a deletion** — the original node and the withdrawal both remain in the graph, which is exactly the auditability the substrate promises. *Leaning:* adopt the two missing edges; the rest of this is done. - **What if the agent disagrees with itself.** An agent run at time T₁ may produce a different output than the same agent at T₂ (different model version, different graph state). *Resolution (no schema change):* record both as separate `finding` nodes, each carrying its `agent` / `model` / `run-at` provenance — the schema already supports this. The work is to *document the pattern* and have the v0.5 agent SDK emit the disambiguating provenance by default. 
-- **Privacy as a node-level property.** A `dataset` node referring to participant data needs an access policy. *Proposal:* the **v0.7 platform** adds a `data-use-agreement` reference (a node type or a constraint edge) and an `access-class` field on `dataset`; identified data stays out of the committed graph regardless, with the access policy governing `datalad get` against the provisioned remotes. *Leaning:* fold into v0.7, since it rides on the same storage/provisioning layer and v0.8 signed identity. +- **Privacy as a node-level property.** A `dataset` node referring to participant data needs an access policy. *Proposal:* the **v0.8 platform** adds a `data-use-agreement` reference (a node type or a constraint edge) and an `access-class` field on `dataset`; identified data stays out of the committed graph regardless, with the access policy governing `datalad get` against the provisioned remotes. *Leaning:* fold into v0.8, since it rides on the same storage/provisioning layer and v0.9 signed identity. ## How to contribute to the roadmap diff --git a/examples/synthetic-provenance/.gitignore b/examples/synthetic-provenance/.gitignore new file mode 100644 index 0000000..9c8c207 --- /dev/null +++ b/examples/synthetic-provenance/.gitignore @@ -0,0 +1,3 @@ +# Transient run products — regenerated by `glimmer run` / `glimmer rerun`. +out/ +manifest.json diff --git a/examples/synthetic-provenance/README.md b/examples/synthetic-provenance/README.md new file mode 100644 index 0000000..ed43868 --- /dev/null +++ b/examples/synthetic-provenance/README.md @@ -0,0 +1,85 @@ +# Synthetic Provenance — the executable loop + three verification tiers + +The smallest worked example of Glimmer **proactive provenance** (schema v0.6.0): a +`run-record` is the executable unit of the agentic loop, and `glimmer run` / +`glimmer rerun` (the node runner) gate inputs on their standards, replay the recorded +command, and verify outputs at the right fidelity tier. 
Everything here is **synthetic** +— no real data, no FSL, no DataLad install — so it runs on a bare laptop with only +PyYAML. (DataLad / container / nibabel paths are feature-detected and degrade gracefully, +recording the degradation honestly in the verdict.) + +See [`docs/proactive-provenance.md`](../../docs/proactive-provenance.md) for the spec. + +## The graph + +A `concept` (the hypothesis) is decomposed into runnable `run-record`s, each tied back +to the concept via `tests-hypothesis` / `addresses-concept`. Running them produces +`derivative`s and a `finding` that closes the loop: + +``` +concept-mean-exceeds-threshold + ├─ run-synth-mean (deterministic) → derivative-synth-mean [byte-identical] + ├─ run-synth-mean-fast (deterministic) → derivative-synth-mean-fast [byte-identical, equivalent-to] + ├─ run-synth-classifier (stochastic) → derivative-synth-accuracy [numeric-within-tolerance] + └─ run-synth-agent-summary (agent-inferred) → derivative-synth-summary [structural] + └─ emits finding-synth-summary ─ addresses-concept ┘ + (run-synth-gate-fail — negative test: malformed input trips the standards gate) +``` + +Each input declares `conforms-to: standard-synth-signal-v1`, whose `validator` hint +(`code/validate_signal.py`) the runner's **pre-run gate** executes. Inputs must be both +**pinned and valid** before any command runs. 
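
Reviewer sketch (not part of the patch): the pre-run gate described above can be illustrated as follows. This is a minimal sketch, not the code in `glimmer/tools/run.py`, which this excerpt does not show. It assumes the standard's `validator` hint is a command template in which `{path}` is replaced by the input under test, and that a zero exit code means conformant. The names `gate_input`, `gate_passes`, `DEMO_VALIDATOR`, and the result keys are hypothetical.

```python
import shlex
import subprocess
import sys

# Hypothetical stand-in for a standard's validator command: an inline check
# that the file is a JSON object carrying `glimmer-signal-format: v1`.
_CHECK = (
    "import json,sys; d=json.load(open(sys.argv[1])); "
    'sys.exit(0 if isinstance(d, dict) and d.get("glimmer-signal-format")=="v1" else 1)'
)
DEMO_VALIDATOR = f"{shlex.quote(sys.executable)} -c {shlex.quote(_CHECK)} {{path}}"


def gate_input(path, pinned, validator_command):
    """Gate one input: it must be pinned AND pass its standard's validator."""
    result = {"path": str(path), "pinned": bool(pinned), "valid": False}
    if not pinned:
        # An unpinned input never reaches validation in this sketch.
        return result
    # Split the template first, then substitute, so a path containing spaces
    # stays a single argument.
    argv = [part.replace("{path}", str(path)) for part in shlex.split(validator_command)]
    proc = subprocess.run(argv, capture_output=True, text=True)
    result["valid"] = proc.returncode == 0  # zero exit == conformant
    return result


def gate_passes(results):
    """The command may run only if every input is both pinned and valid."""
    return all(r["pinned"] and r["valid"] for r in results)
```

In this sketch a run over `inputs/signal_bad.json` would stop at the gate, because the file lacks the `glimmer-signal-format` key and the validator exits non-zero.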
+ +## Run it + +```bash +# from the repo root +python -m glimmer.tools.cli validate examples/synthetic-provenance/rokb + +GOOD=run-synth-mean,run-synth-mean-fast,run-synth-classifier,run-synth-agent-summary + +# forward execution: gate → replay → record (advances status ready → executed, +# writes validation-gate + replay-verdict back into the sidecars) +python -m glimmer.tools.cli run examples/synthetic-provenance/rokb $GOOD --write-verdict + +# reproduction: re-execute and verify each output at its tier +python -m glimmer.tools.cli rerun examples/synthetic-provenance/rokb $GOOD \ + --manifest examples/synthetic-provenance/manifest.json +``` + +Expected reproduction verdicts: `verified` (byte-identical) ×2, +`reproduced-within-tolerance` (numeric), `structurally-valid` (structural) — +`reproducibility-rate 100%`. + +## The negative tests (the guarantees, demonstrated) + +```bash +# standards gate: malformed input is caught BEFORE execution → gate-failed, exit 1 +python -m glimmer.tools.cli run examples/synthetic-provenance/rokb run-synth-gate-fail +python -m glimmer.tools.cli run examples/synthetic-provenance/rokb run-synth-gate-fail --no-gate # bypass, flagged dirty + +# tamper: corrupt an expected-hash → mismatch, exit 1 +# offline: hide an input and pass --offline → inputs-unavailable (never a false pass) +``` + +## Equivalence certification (method registry, v0.6 affordance) + +`method-numpy-mean` and `method-numpy-mean-fast` both `implement` the cross-project +pattern `glimmer-methods:mean-of-series` and are linked `equivalent-to`. The runner +**certifies** that claim by output-match rather than taking it on faith: + +```bash +python -c "from glimmer.tools.run import certify_equivalence; \ +print(certify_equivalence('examples/synthetic-provenance/rokb','run-synth-mean','run-synth-mean-fast').to_dict())" +# → equivalent: true +``` + +## Files + +- `code/` — the analysis scripts + the standard's validator stub (`validate_signal.py`). 
+- `inputs/signal.json` — the pinned, standard-conformant input (`signal_bad.json` is the
+  malformed negative-test input).
+- `rokb/` — the validatable RO-KB (datasets, methods, standards, derivatives, findings,
+  run-records) + a local `synthetic` profile.
+- `emit_graph.py` — rebuilds `rokb/_glimmer-index.json` from the sidecars on disk.
+- `out/` — transient run products (git-ignored; regenerated by `glimmer run`).
diff --git a/examples/synthetic-provenance/code/compute_mean.py b/examples/synthetic-provenance/code/compute_mean.py
new file mode 100644
index 0000000..ced2fcc
--- /dev/null
+++ b/examples/synthetic-provenance/code/compute_mean.py
@@ -0,0 +1,17 @@
+#!/usr/bin/env python3
+"""DETERMINISTIC method: mean of the synthetic signal.
+
+    python code/compute_mean.py inputs/signal.json out/mean.json
+
+Writes {"mean-signal": }. Running it twice on the same input yields a
+byte-identical (json-canonical) result — the byte-identical verification tier.
+"""
+import json, sys
+from pathlib import Path
+
+inp, out = Path(sys.argv[1]), Path(sys.argv[2])
+values = json.loads(inp.read_text())["values"]
+mean = sum(values) / len(values)
+out.parent.mkdir(parents=True, exist_ok=True)
+out.write_text(json.dumps({"mean-signal": mean}, indent=2) + "\n")
+print(f"mean-signal = {mean}")
diff --git a/examples/synthetic-provenance/code/compute_mean_fast.py b/examples/synthetic-provenance/code/compute_mean_fast.py
new file mode 100644
index 0000000..247a73e
--- /dev/null
+++ b/examples/synthetic-provenance/code/compute_mean_fast.py
@@ -0,0 +1,22 @@
+#!/usr/bin/env python3
+"""An EQUIVALENT implementation of compute_mean — different code, same output.
+
+    python code/compute_mean_fast.py inputs/signal.json out/mean_fast.json
+
+Uses a single-pass accumulator instead of sum()/len(). `certify_equivalence`
+runs this and compute_mean.py on the same input and confirms the outputs match,
+so `method-numpy-mean-fast` can carry an `equivalent-to` edge that is CHECKED,
+not merely asserted. This is the "equivalent f(x), cleaner/faster, tolerated"
+case — the semantic pattern links back to the same registry method.
+"""
+import json, sys
+from pathlib import Path
+
+inp, out = Path(sys.argv[1]), Path(sys.argv[2])
+total, n = 0.0, 0
+for v in json.loads(inp.read_text())["values"]:
+    total += v
+    n += 1
+out.parent.mkdir(parents=True, exist_ok=True)
+out.write_text(json.dumps({"mean-signal": total / n}, indent=2) + "\n")
+print(f"mean-signal = {total / n}")
diff --git a/examples/synthetic-provenance/code/fit_classifier.py b/examples/synthetic-provenance/code/fit_classifier.py
new file mode 100644
index 0000000..8f0f0bd
--- /dev/null
+++ b/examples/synthetic-provenance/code/fit_classifier.py
@@ -0,0 +1,23 @@
+#!/usr/bin/env python3
+"""STOCHASTIC-but-bounded method: a toy classifier accuracy.
+
+    python code/fit_classifier.py inputs/signal.json out/acc.json
+
+Models a non-deterministic fit (random init / shuffling): the accuracy lands near
+0.80 but jitters run-to-run, so it is NOT byte-reproducible. It IS reproducible
+within tolerance — the numeric-within-tolerance verification tier — which is the
+honest guarantee for stochastic analyses (and for re-deriving a published number
+from source). Pure stdlib; jitter from `random` (intentionally unseeded).
+"""
+import json, random, sys
+from pathlib import Path
+
+inp, out = Path(sys.argv[1]), Path(sys.argv[2])
+values = json.loads(inp.read_text())["values"]
+# A fixed nominal base accuracy (a constant; `values` only supplies `n`) + small noise.
+base = 0.80
+noise = random.uniform(-0.012, 0.012)
+accuracy = round(base + noise, 4)
+out.parent.mkdir(parents=True, exist_ok=True)
+out.write_text(json.dumps({"classifier-accuracy": accuracy, "n": len(values)}, indent=2) + "\n")
+print(f"classifier-accuracy = {accuracy}")
diff --git a/examples/synthetic-provenance/code/make_signal.py b/examples/synthetic-provenance/code/make_signal.py
new file mode 100644
index 0000000..1b55408
--- /dev/null
+++ b/examples/synthetic-provenance/code/make_signal.py
@@ -0,0 +1,17 @@
+#!/usr/bin/env python3
+"""Write a small, fixed synthetic signal so the example needs no real data.
+
+Output: inputs/signal.json — a tiny standard-conformant file:
+    {"glimmer-signal-format": "v1", "values": [...]}
+Deterministic: the values never change, so the downstream mean is reproducible.
+"""
+import json
+from pathlib import Path
+
+# A fixed, hand-chosen series (mean = 0.5 exactly). Pure stdlib, no numpy.
+VALUES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.5]
+
+out = Path(__file__).resolve().parent.parent / "inputs" / "signal.json"
+out.parent.mkdir(parents=True, exist_ok=True)
+out.write_text(json.dumps({"glimmer-signal-format": "v1", "values": VALUES}, indent=2) + "\n")
+print(f"wrote {out}")
diff --git a/examples/synthetic-provenance/code/summarize_agent.py b/examples/synthetic-provenance/code/summarize_agent.py
new file mode 100644
index 0000000..c49d9a5
--- /dev/null
+++ b/examples/synthetic-provenance/code/summarize_agent.py
@@ -0,0 +1,23 @@
+#!/usr/bin/env python3
+"""AGENT-INFERRED method: an LLM-style summary over the two derivatives.
+
+    python code/summarize_agent.py out/mean.json out/acc.json out/summary.json
+
+Stands in for an agent that reads the mean + accuracy derivatives and writes an
+interpretation. Its output is NOT verified by hashing (an agent's prose is not
+byte-reproducible); instead the run-record carries `provenance-mode: agent-inferred`
+and a `reasoning-trace`, and the STRUCTURAL tier checks that the cited nodes exist
+and hold the claimed values — the honest analogue per docs/agent-protocol.md.
+"""
+import json, sys
+from pathlib import Path
+
+mean_p, acc_p, out = Path(sys.argv[1]), Path(sys.argv[2]), Path(sys.argv[3])
+mean = json.loads(mean_p.read_text())["mean-signal"]
+acc = json.loads(acc_p.read_text())["classifier-accuracy"]
+interp = (f"Mean signal {mean:.3f} exceeds the 0.4 threshold and the classifier "
+          f"reaches {acc:.2f} accuracy, jointly supporting the hypothesis.")
+out.parent.mkdir(parents=True, exist_ok=True)
+out.write_text(json.dumps({"interpretation": interp,
+                           "based-on": ["derivative-synth-mean", "derivative-synth-accuracy"]}, indent=2) + "\n")
+print(interp)
diff --git a/examples/synthetic-provenance/code/validate_signal.py b/examples/synthetic-provenance/code/validate_signal.py
new file mode 100644
index 0000000..a5db357
--- /dev/null
+++ b/examples/synthetic-provenance/code/validate_signal.py
@@ -0,0 +1,27 @@
+#!/usr/bin/env python3
+"""Delegated validator for `standard-synth-signal-v1`.
+
+The runner's standards gate calls this (via the standard's `validator` hint) on
+each input that declares `conforms-to: standard-synth-signal-v1`. It is a STUB
+standing in for a real domain validator (e.g. bids-validator): exit 0 ⇒ conformant,
+non-zero ⇒ the run-record's gate fails and the command never executes.
+
+Conformance rule (v1): the file is JSON with `glimmer-signal-format == "v1"` and a
+non-empty numeric `values` list.
+"""
+import json, sys
+from pathlib import Path
+
+path = Path(sys.argv[1])
+try:
+    doc = json.loads(path.read_text())
+except Exception as e:
+    print(f"INVALID: not JSON: {e}"); sys.exit(1)
+
+if doc.get("glimmer-signal-format") != "v1":
+    print("INVALID: missing/!= `glimmer-signal-format: v1`"); sys.exit(1)
+vals = doc.get("values")
+if not isinstance(vals, list) or not vals or not all(isinstance(v, (int, float)) for v in vals):
+    print("INVALID: `values` must be a non-empty numeric list"); sys.exit(1)
+print("VALID: conforms to standard-synth-signal-v1")
+sys.exit(0)
diff --git a/examples/synthetic-provenance/emit_graph.py b/examples/synthetic-provenance/emit_graph.py
new file mode 100644
index 0000000..e8e18c9
--- /dev/null
+++ b/examples/synthetic-provenance/emit_graph.py
@@ -0,0 +1,40 @@
+#!/usr/bin/env python3
+"""Rebuild rokb/_glimmer-index.json from the sidecars on disk.
+
+Unlike the ds000114-nipype example (which emits sidecars from a provenance.json),
+this example's sidecars are authored directly; emit_graph just (re)generates the
+index that enumerates them, so `glimmer validate` / `glimmer run` have their map.
+"""
+import json, sys
+from pathlib import Path
+
+import yaml
+
+ROKB = Path(__file__).resolve().parent / "rokb"
+
+
+def read_fm(path):
+    text = path.read_text()
+    _, fm, _ = text.split("---\n", 2)
+    return yaml.safe_load(fm) or {}
+
+
+def main():
+    nodes = []
+    for path in sorted(ROKB.glob("**/*.md")):
+        fm = read_fm(path)
+        nodes.append({"id": fm["id"], "type": fm["type"], "path": str(path.relative_to(ROKB))})
+    index = {
+        "schema": "glimmer/v0.6.0",
+        "dataset-name": "synthetic-provenance",
+        "description": "Proactive-provenance worked example: the loop + three verification tiers + the standards gate + equivalence.",
+        "default-domain": "synthetic",
+        "node-count": len(nodes),
+        "nodes": nodes,
+    }
+    (ROKB / "_glimmer-index.json").write_text(json.dumps(index, indent=2) + "\n")
+    print(f"wrote {ROKB / '_glimmer-index.json'} with {len(nodes)} nodes")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/synthetic-provenance/inputs/signal.json b/examples/synthetic-provenance/inputs/signal.json
new file mode 100644
index 0000000..5de92eb
--- /dev/null
+++ b/examples/synthetic-provenance/inputs/signal.json
@@ -0,0 +1,15 @@
+{
+  "glimmer-signal-format": "v1",
+  "values": [
+    0.1,
+    0.2,
+    0.3,
+    0.4,
+    0.5,
+    0.6,
+    0.7,
+    0.8,
+    0.9,
+    0.5
+  ]
+}
diff --git a/examples/synthetic-provenance/inputs/signal_bad.json b/examples/synthetic-provenance/inputs/signal_bad.json
new file mode 100644
index 0000000..c63b515
--- /dev/null
+++ b/examples/synthetic-provenance/inputs/signal_bad.json
@@ -0,0 +1,4 @@
+{
+  "values": [0.1, 0.2, 0.3],
+  "note": "missing the required `glimmer-signal-format: v1` key — trips the standards gate"
+}
diff --git a/glimmer/schema/frontmatter.yaml b/glimmer/schema/frontmatter.yaml
index 1786161..385f83a 100644
--- a/glimmer/schema/frontmatter.yaml
+++ b/glimmer/schema/frontmatter.yaml
@@ -1,4 +1,4 @@
-# Glimmer Schema v0.3.1 — Front-Matter Field Definitions
+# Glimmer Schema v0.6.0 — Front-Matter Field Definitions
 #
 # This file is the
machine-readable contract for what fields each entity type # must, may, and may not have in its sidecar front-matter. Validators consume @@ -28,8 +28,22 @@ # as the canonical example). Added `finding` between `derivative` and `publication` # per the EVI Evidence Graph Ontology. Agent identity is now a string field on # `finding` and `derivative` (`produced-by-agent`), not a separate node type. +# +# v0.6.0: added the `run-record` node type — one concrete, replayable invocation +# (the PROV `Activity`) that binds a method + pinned, standard-validated inputs + +# expected outputs + the exact command + pinned environment + a runner-written +# verdict, advanced through a lifecycle (planned → ready → running → executed). +# This is the executable unit of the agentic loop (docs/agentic-loop.md), driven +# by `glimmer run` / `glimmer rerun` (glimmer/tools/run.py). With it: the core +# edges `reruns`, `consumes`, `regenerates`, `emits` (+ optional inverse +# `regenerated-by` on derivative); `tests-hypothesis` / `addresses-concept` are +# now allowed FROM a run-record. v0.6.0 also adds lightweight method-registry +# affordances — `registry-ref` plus the `implements` / `equivalent-to` / `refines` +# edges — so a method can point at a cross-project canonical pattern and the runner +# can certify equivalence by output-match. `standard` gains an optional `validator` +# hint so the runner's pre-run standards gate knows how to check conformance. -schema-version: "0.5.1" +schema-version: "0.6.0" # ───────────────────────────────────────────────────────────────────────────── # domain profiles — this core schema defines only DOMAIN-NEUTRAL node types. 
@@ -52,7 +66,7 @@ default-domain: neuroimaging _common: required: id: string # kebab-case, unique within dataset - type: + type: name: string # human-readable created: datetime # ISO 8601, UTC preferred modified: datetime # ISO 8601, UTC preferred @@ -122,6 +136,13 @@ method: container-image: string # e.g., "nipreps/fmriprep:23.0.2" container-digest: string # sha256 of the container image workflow-definition-sha: string # git SHA of the workflow source file + # v0.6 method-registry affordance: a method is a reusable f(x); its SEMANTIC + # pattern can outlive one project's graph. `registry-ref` is a cross-project + # namespaced id naming the canonical pattern this method implements (e.g. + # "glimmer-methods:skullstrip-t1w"); the `implements` edge is the navigable + # mirror. A dedicated `method-pattern` node type + published registry format + # is deferred to v0.7 (federation). + registry-ref: string # namespaced canonical-pattern id, e.g. "glimmer-methods:skullstrip-t1w" edges-allowed: - applies-to # → dataset - produces # → derivative @@ -130,6 +151,9 @@ method: - composes # → method (sub-methods for workflows) - upstream-of # → method (pipeline DAG: produces input for) - downstream-of # → method (pipeline DAG: consumes output of) + - implements # → method/cross-project (the canonical pattern this realizes) + - equivalent-to # → method (same outputs within tolerance; runner-certified) + - refines # → method (a tuned / modified / fine-tuned variant of) body-required: true # ───────────────────────────────────────────────────────────────────────────── @@ -153,6 +177,7 @@ derivative: - derives-from # → dataset or derivative - cited-in # → publication - supports-finding # → finding + - regenerated-by # → run-record (the replayable activity that produced this; optional inverse of `regenerates`) body-required: false # ───────────────────────────────────────────────────────────────────────────── @@ -255,6 +280,13 @@ standard: optional: version: string upstream-url: 
string + # v0.6: how the runner's pre-run standards GATE checks conformance to this + # standard. The run-record only REFERENCES the standard (input-pin `conforms-to` + # / `requires-standard`); the actual checking is delegated to the tool named + # here, so the run-record never reimplements validation. `command` is a + # template with `{path}` substituted for the input under test; a zero exit == + # conformant. `kind: glimmer` means "use `glimmer validate` on the graph". + validator: map # {tool: "bids-validator", command: "bids-validator {path}", kind: external|glimmer} edges-allowed: - defines # → standard (sub-standards) - versions # → standard (relates versions) @@ -376,6 +408,66 @@ program: - cited-in # → publication body-required: true +# ───────────────────────────────────────────────────────────────────────────── +# run-record — ONE concrete, replayable invocation: the PROV `Activity` that +# realizes lineage. A `method` is a reusable tool; a `run-record` is *this* run of +# it — these exact pinned inputs, this command, this container digest, on this date, +# testing this hypothesis. It is the executable unit of the agentic loop: authored +# as part of a plan (`status: planned`, possibly with spec'd/unpinned inputs), then +# advanced through `ready → running → executed` by `glimmer run` (forward execution) +# or re-checked by `glimmer rerun` (reproduction). The runner writes `validation-gate` +# (the pre-run pinned+standard-valid check) and `replay-verdict` (the per-output +# verdict) back into the sidecar. See docs/proactive-provenance.md. +# +# input-pin (each element of `inputs`): +# {node: , # the node consumed (mirror with a `consumes` edge) +# datalad-annex-key: | datalad-commit-sha: , # the content pin; OMIT while status=planned (spec'd input) +# conforms-to: , # optional: input must pass this standard's gate +# role: } # e.g. 
input-signal, mask +# output-pin (each element of `outputs`): +# {node: , # MUST resolve to a derivative node (mirror with a `regenerates` edge) +# expected-hash: "sha256:…", # the recorded output-hash this run reproduces +# output-path: , +# tier: byte-identical|numeric-tolerance|structural} # optional per-output override of the mode default +# ───────────────────────────────────────────────────────────────────────────── +run-record: + inherits: _common + required: + method: string # node-id of the method this run executes (mirror with a `reruns` edge) + command: string # exact entrypoint / command line (datalad-run `cmd`) + provenance-mode: # selects the default verification tier + status: # lifecycle + inputs: [list-of-input-pins] # see header; pins required unless status=planned + outputs: [list-of-output-pins] # see header; bind derivative node-ids + expected hash + optional: + container-image: string # e.g., "python:3.11-slim" + container-digest: string # sha256 env pin; replayed via `datalad containers-run` + workflow-definition-sha: string + parameters: map + parameters-hash: sha256 + reproduction-tolerance: map # numeric tier: {default:{rel,abs}, quantities:{:{rel,abs}}} + run-at: datetime + runner: string # datalad-run id / CI job / agent id + produced-by-agent: string # set for agent-inferred / stochastic runs + reasoning-trace: map # REQUIRED when produced-by-agent is set (finding's shape) + retriever-manifest: map # structural tier: index pin (reproducible-modulo-index) + datalad-superdataset: string + datalad-relative-path: string + datalad-commit-sha: string + datalad-annex-key: string + validation-gate: map # WRITTEN BY THE RUNNER: per-input {pinned, standard, valid, validator} + replay-verdict: map # WRITTEN BY THE RUNNER: {verdict, tier, outputs-checked, numeric-checks, …} + edges-allowed: + - reruns # → method (the method this run executes) + - consumes # → dataset or derivative (a pinned input) + - regenerates # → derivative (an output this run 
produces) + - emits # → finding (an interpretation this run produced) + - tests-hypothesis # → concept (the hypothesis this run tests — loop linkage) + - addresses-concept # → concept (the question this run addresses — loop linkage) + - requires-standard # → standard (inputs/outputs must conform; gated at runtime) + - validates-against # → publication (this run reproduces a published number) + body-required: true + # ───────────────────────────────────────────────────────────────────────────── # index file — required at the dataset root # ───────────────────────────────────────────────────────────────────────────── @@ -403,6 +495,9 @@ _validator-hints: - "finding node has no `based-on` edge or empty list" - "method node has no `version` field" - "publication node has no `cites-*` edges" + - "run-record output-pin has no `expected-hash` (status=executed)" + - "run-record output/input/method node not mirrored by a regenerates/consumes/reruns edge" + - "run-record status=executed but no `replay-verdict`" error-if: - "node `id` is not kebab-case" - "node `id` is not unique within the dataset" @@ -410,3 +505,7 @@ _validator-hints: - "edge type is not in the allowed list for the source node's type" - "required field is missing" - "finding produced-by-agent is set but reasoning-trace is missing" + - "run-record produced-by-agent is set but reasoning-trace is missing" + - "run-record output-pin `node` is missing or not a `derivative`" + - "run-record input-pin missing `node`, or missing a content pin while status!=planned" + - "run-record `method` field is missing or not a `method` node" diff --git a/glimmer/schema/glimmer-version b/glimmer/schema/glimmer-version index 4b9fcbe..a918a2a 100644 --- a/glimmer/schema/glimmer-version +++ b/glimmer/schema/glimmer-version @@ -1 +1 @@ -0.5.1 +0.6.0 diff --git a/glimmer/schema/profiles/_profile.schema.yaml b/glimmer/schema/profiles/_profile.schema.yaml index e934966..6367577 100644 --- a/glimmer/schema/profiles/_profile.schema.yaml 
+++ b/glimmer/schema/profiles/_profile.schema.yaml @@ -18,7 +18,7 @@ # version: string — this profile's own version (independent of schema-version) # status: # curated = maintained in this repo's profiles/ -# community = published via the schema registry (roadmap v0.6) +# community = published via the schema registry (roadmap v0.7) # local = lives in a single KB's _glimmer-profiles/ # description: string # diff --git a/glimmer/schema/schema.md b/glimmer/schema/schema.md index c2f6aeb..5c3944c 100644 --- a/glimmer/schema/schema.md +++ b/glimmer/schema/schema.md @@ -1,7 +1,9 @@ -# Glimmer Schema (v0.5) +# Glimmer Schema (v0.6) Research-object knowledge-base schema. Sidecars are YAML front-matter (mirrors shimmer-kb's `memory/*.md` pattern) when standalone, or BIDS-native JSON augmented with an `_x-glimmer` block when extending a BIDS sidecar in place. Every node is a file. Edges are properties on the source node. +v0.6.0 adds the [`run-record`](#run-record) node type — **one concrete, replayable invocation** (the PROV `Activity`) that binds a method + pinned, standard-validated inputs + expected outputs + the exact command + a pinned container environment + a runner-written verdict, advanced through a `planned → ready → running → executed` lifecycle. It makes the agentic loop (`docs/agentic-loop.md`) *executable*: the runner (`glimmer run` / `glimmer rerun`, `glimmer/tools/run.py`) gates inputs on their standards, replays the command via `datalad containers-run`, and verifies outputs at one of three tiers (byte-identical / numeric-within-tolerance / structural). With it: the core edges `reruns`, `consumes`, `regenerates`, `emits` (plus the optional inverse `regenerated-by` on `derivative`), and `tests-hypothesis` / `addresses-concept` allowed *from* a run-record. 
v0.6.0 also adds lightweight **method-registry** affordances — a `registry-ref` field and the `implements` / `equivalent-to` / `refines` edges on `method`, so a method points at a cross-project canonical pattern and the runner can *certify* equivalence by output-match — and an optional `validator` hint on `standard` for the runtime gate. (A dedicated `method-pattern` type, a published cross-institution registry, and minimal-path reproduction are deferred to v0.7.) See [`docs/proactive-provenance.md`](../../docs/proactive-provenance.md). + v0.5.1 adds a core **data-availability** block to `dataset` — `data-remote` (which git-annex special remote / DataLad sibling holds the bytes), `data-provenance`, `data-last-commit`, and `data-liveness` (last reachability check) — so a query over the graph can tell whether a dataset is actually pullable and from where, complementing the DataLad re-fetch coordinates. v0.5 adds the [`instrument`](#instrument) node type (a survey, task, assay, scanner, or device used to *generate* data — the measurement apparatus, distinct from the `dataset` it produces or the `experiment` it delivers), promoted to core because every empirical domain has measurement instruments. With it: two core edges — [`acquired-with`](#acquired-with) (a dataset/experiment → the instrument that produced it) and [`described-by`](#described-by) (a dataset → the experiment/paradigm that describes it). v0.5 also lets a **domain profile declare its own vocabulary** — `node-types:` and `edge-types:` lists — so a domain can add types/edges without touching the core (promote to core once ≥2 domains need it); see [Domain profiles](#domain-profiles). v0.4 adds the [`program`](#program) node type (a study/cohort/initiative as a first-class container) with the universal [`in-program`](#in-program) (same-graph membership) and [`cross-project`](#cross-project) (out-of-graph, namespaced inter-project) edges. 
v0.3.1 makes the core schema **domain-neutral**: fields fixed by a domain standard (BIDS modality, fMRI design, Nipype node kind, neuroimaging `output-kind`) move out of the core node types into [domain profiles](#domain-profiles), a curated + local library keyed by `domain`. It also adds the meta-graph social layer: `persona` (a person or role) and `organization` (institution/lab/funder/journal) node types, plus the in-graph attribution edges `authored-by`, `affiliated-with`, `funded-by`, `mentors`, `leads`, and `part-of`. v0.3 adds `experiment` (a task/acquisition paradigm as a first-class node), `concept` (a research question / hypothesis as a first-class node, the unit a research program operates at), and `contributed-by` (a universal attribution edge with out-of-graph contributor targets). v0.2 changes from v0.1: dropped `qc-artifact` and `rater` entity types (over-indexed on QC as the canonical example). Added `finding` between `derivative` and `publication`. Agent identity is now a string field on `finding` and `derivative`, not a separate node type. @@ -156,6 +158,11 @@ Canonical edges: - `validates-against` → `publication` node - `requires-standard` → `standard` node - `composes` → `method` (sub-methods for workflow composition) +- `implements` → the canonical pattern this realizes (typically a `cross-project` registry id) +- `equivalent-to` → `method` (a different implementation that produces the same outputs within tolerance; **runner-certified**, not merely asserted) +- `refines` → `method` (a tuned / modified / fine-tuned variant of) + +A method is a reusable f(x), and its **semantic pattern** can outlive one project's graph. The optional `registry-ref` field (a namespaced id like `glimmer-methods:skullstrip-t1w`) names the canonical pattern this method implements; `implements` is its navigable edge. 
Two methods that produce the same outputs on the same pinned inputs are tolerated as `equivalent-to` (the runner can *certify* this by output-match — see [`docs/proactive-provenance.md`](../../docs/proactive-provenance.md)); a parameter/fine-tuning variant uses `refines`. A dedicated `method-pattern` node type and a published cross-institution registry are deferred to v0.7. Example sidecar fields (`tool` / `version` / `parameters` are core; `nipype-node-type` comes from the neuroimaging profile): ```yaml @@ -228,6 +235,8 @@ falsifiable: true ### `standard` A spec, atlas, template, or protocol. Nodes themselves, not just background metadata, so constraints can be expressed as edges and an agent can read the standard's definition directly. +The optional `validator` field (v0.6) makes a standard **runtime-enforceable**: it tells the runner's pre-run gate how to check conformance — e.g. `validator: {tool: bids-validator, command: "bids-validator {path}", kind: external}` (a zero exit means conformant; `{path}` is the input under test), or `kind: glimmer` to use `glimmer validate` on the graph. A run-record references the standard (input-pin `conforms-to` / `requires-standard`); the runner runs this validator before executing. See [`docs/proactive-provenance.md`](../../docs/proactive-provenance.md). + Canonical edges: - `defines` → `standard` (sub-standards) - `versions` → `standard` (relates versions of same standard) @@ -312,6 +321,68 @@ edges: --- ``` +### `run-record` +**One concrete, replayable invocation** — the PROV `Activity` that realizes lineage. Where a `method` is a *reusable* tool (one BET, applied to many subjects) and a `derivative` is a *product* (a file + its hash), a `run-record` is *the act*: **this** command, on **these** exact pinned inputs, in **this** pinned container, testing **this** hypothesis, on **this** date. 
It is the **executable unit of the agentic loop** (see [`docs/agentic-loop.md`](../../docs/agentic-loop.md)) — authored as part of a plan, then run and verified by `glimmer run` / `glimmer rerun` (`glimmer/tools/run.py`). The runner is for **running**, not just reproducing; reproduction is the special case of re-running a record that already executed. Full spec: [`docs/proactive-provenance.md`](../../docs/proactive-provenance.md). + +**Lifecycle** (the `status` field): `planned` (authored as part of a concept's decomposition; inputs may be *spec'd* / unpinned) → `ready` (inputs pinned + standard-valid: the runner's gate passed) → `running` → `executed` (outputs hashed, verdict written) / `failed` / `superseded`. + +Required fields: +- `method` — node-id of the `method` this run executes (mirror with a `reruns` edge) +- `command` — the exact entrypoint / command line (the `datalad run` `cmd`) +- `provenance-mode` — `deterministic` | `agent-inferred` | `stochastic`; selects the default **verification tier** +- `status` — the lifecycle state above +- `inputs` — list of **input-pins** `{node, datalad-annex-key|datalad-commit-sha, conforms-to?, role}`; the content pin is required unless `status: planned` +- `outputs` — list of **output-pins** `{node (→ a `derivative`), expected-hash, output-path, tier?}` + +Key optional fields: `container-image` / `container-digest` (env pin, replayed via `datalad containers-run`), `reproduction-tolerance` (`{default:{rel,abs}, quantities:{<name>:{rel,abs}}}` for the numeric tier), `produced-by-agent` + `reasoning-trace` (required together, for agent/structural runs), `retriever-manifest` (structural tier). The runner writes back two blocks: `validation-gate` (per-input pinned+standard-valid result) and `replay-verdict` (the per-output verdict).
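The pin rule for `inputs` (a content pin is required unless `status: planned`) can be illustrated with a small check. This is only a sketch under stated assumptions, not the check in `glimmer/tools/validate.py`: the helper name `unpinned_inputs` is hypothetical, and it assumes the sidecar frontmatter has already been parsed into a plain dict.

```python
# Hypothetical helper (not part of glimmer): report inputs that lack a content pin.
PIN_KEYS = ("datalad-annex-key", "datalad-commit-sha")


def unpinned_inputs(run_record: dict) -> list:
    """Return the node ids of input-pins that carry no content pin.

    A `planned` run-record may leave its inputs unpinned, so nothing is
    reported for it; every other status requires one of PIN_KEYS per input.
    """
    if run_record.get("status") == "planned":
        return []
    missing = []
    for pin in run_record.get("inputs") or []:
        if not any(pin.get(key) for key in PIN_KEYS):
            missing.append(pin.get("node"))
    return missing


planned = {"status": "planned",
           "inputs": [{"node": "dataset-synth-signal", "role": "input-signal"}]}
ready = {"status": "ready",
         "inputs": [{"node": "dataset-synth-signal", "role": "input-signal"}]}

print(unpinned_inputs(planned))  # planned records are allowed to be unpinned
print(unpinned_inputs(ready))    # a ready record must pin every input
```

A record that reports any unpinned input here could not legitimately claim `status: ready`, since `ready` means the inputs are pinned and standard-valid.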
+ +Canonical edges: +- `reruns` → `method` (the method this run executes) +- `consumes` → `dataset` or `derivative` (a pinned input — mirrors `inputs[].node`) +- `regenerates` → `derivative` (an output — mirrors `outputs[].node`; optional inverse `regenerated-by` on the derivative) +- `emits` → `finding` (an interpretation this run produced) +- `tests-hypothesis` / `addresses-concept` → `concept` (loop linkage: ties the run to the question it tests) +- `requires-standard` → `standard` (inputs/outputs must conform; **gated at runtime**) + +**The three verification tiers** (chosen by `provenance-mode`, overridable per output-pin via `tier`): +- **byte-identical** (`deterministic`) — re-execute and assert the output hash matches, after NIfTI/GIFTI header normalization so two *correct* re-runs are not flagged on volatile header bytes. +- **numeric-within-tolerance** (`stochastic` with `reproduction-tolerance`) — re-derive a published number from source and assert `abs(obs−exp) ≤ max(abs, rel·|exp|)`. An empty tolerance means exact equality. +- **structural** (`agent-inferred`) — the honest analogue for LLM/stochastic outputs (see [`docs/agent-protocol.md`](../../docs/agent-protocol.md)): the `reasoning-trace` must cite real nodes, those nodes must contain the claimed values, and the `retriever-manifest` (if any) must match. + +**The standards gate.** A run-record only *references* the standards its inputs must satisfy (via input-pin `conforms-to` / the `requires-standard` edge). Before executing, the runner enforces that each input is **pinned AND valid**, delegating the actual check to the standard's `validator` hint (e.g. `bids-validator`) or falling back to `glimmer validate`. The run-record never reimplements validation; it carries the requirement, and the runner records the outcome in `validation-gate`. A failed gate yields verdict `gate-failed` and **no execution**. 
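The numeric-within-tolerance rule above, `abs(obs−exp) ≤ max(abs, rel·|exp|)`, can be sketched in a few lines. This is an illustrative sketch rather than the runner's own code: the name `within_tolerance` is hypothetical, and the tolerance is assumed to be a dict with optional `rel` and `abs` keys, as in one entry of `reproduction-tolerance`.

```python
def within_tolerance(observed: float, expected: float, tolerance=None) -> bool:
    """Check |observed - expected| <= max(abs, rel * |expected|).

    A missing or empty tolerance leaves both bounds at zero, which reduces
    the check to exact equality.
    """
    tolerance = tolerance or {}
    rel = tolerance.get("rel", 0.0)
    abs_tol = tolerance.get("abs", 0.0)
    return abs(observed - expected) <= max(abs_tol, rel * abs(expected))


# An absolute bound of 0.5 accepts a deviation of about 0.4 ...
print(within_tolerance(10.4, 10.0, {"abs": 0.5}))
# ... while a 1% relative bound (0.1 at this magnitude) rejects it.
print(within_tolerance(10.4, 10.0, {"rel": 0.01}))
# With no tolerance declared, only an exact match passes.
print(within_tolerance(10.0, 10.0))
```

Taking the larger of the two bounds means a quantity near zero can still pass on its absolute tolerance, where a purely relative bound would shrink to nothing.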
+ +Complete example (deterministic tier, executed): + +```yaml +--- +id: run-synth-mean-2026-06-22 +type: run-record +name: "Run: mean of synthetic signal" +created: '2026-06-22T17:00:00Z' +modified: '2026-06-22T17:00:00Z' +method: method-numpy-mean +command: "python code/compute_mean.py inputs/signal.npy out/mean.json" +provenance-mode: deterministic +status: executed +container-image: "python:3.11-slim" +container-digest: "sha256:9f2c..." +inputs: + - {node: dataset-synth-signal, datalad-annex-key: "MD5E-s4096--3c1e...", conforms-to: standard-synth-signal-v1, role: input-signal} +outputs: + - {node: derivative-synth-mean, expected-hash: "sha256:1a2b...", output-path: "out/mean.json", tier: byte-identical} +edges: + - {type: reruns, target: method-numpy-mean} + - {type: consumes, target: dataset-synth-signal} + - {type: regenerates, target: derivative-synth-mean} + - {type: requires-standard, target: standard-synth-signal-v1} + - {type: tests-hypothesis, target: concept-mean-exceeds-threshold} +--- + +Binds [[method-numpy-mean]] to pinned, standard-validated input [[dataset-synth-signal]] +and expected output [[derivative-synth-mean]]. `glimmer rerun` must reproduce the output +byte-for-byte. The runner writes `validation-gate` and `replay-verdict` back into this file. +``` + ## Cross-cutting edges (`_universal-edges`) Some edges are allowed from **any** node type; the validator unions these in regardless of the source node's `edges-allowed`. @@ -344,7 +415,7 @@ edges: - {type: cross-project, target: "ads-glimmer:org-nij", role: inherited-from-parent} ``` -Convention: resolve duplicated nodes to the **parent** project (it owns the canonical node); a subproject keeps an inherited copy that points back with `role: inherited-from-parent`. Cross-graph resolution is by namespace; a future federated index (roadmap v0.6) may validate these targets. 
+Convention: resolve duplicated nodes to the **parent** project (it owns the canonical node); a subproject keeps an inherited copy that points back with `role: inherited-from-parent`. Cross-graph resolution is by namespace; a future federated index (roadmap v0.7) may validate these targets. ## Index file (`_glimmer-index.json`) diff --git a/glimmer/tools/cli.py b/glimmer/tools/cli.py index 0539f3d..a78152a 100644 --- a/glimmer/tools/cli.py +++ b/glimmer/tools/cli.py @@ -42,6 +42,36 @@ def cmd_validate(args): _run("validate.py", [args.path]) +def _run_args(args, mode): + argv = [args.rokb, args.node_id, "--mode", mode] + if args.manifest: argv += ["--manifest", args.manifest] + if args.write_verdict: argv += ["--write-verdict"] + if args.offline: argv += ["--offline"] + if args.no_container: argv += ["--no-container"] + if args.no_gate: argv += ["--no-gate"] + return argv + + +def cmd_run(args): + """Forward-execute a run-record: gate inputs, replay the command, record outputs.""" + _run("run.py", _run_args(args, "run")) + + +def cmd_rerun(args): + """Reproduce a run-record: re-execute and verify outputs per its tier.""" + _run("run.py", _run_args(args, "reproduce")) + + +def _add_run_flags(p): + p.add_argument("rokb", help="path to an RO-KB directory") + p.add_argument("node_id", help="run-record id, or 'all'") + p.add_argument("--manifest", default=None, help="write the provenance manifest JSON here") + p.add_argument("--write-verdict", action="store_true", help="write gate + verdict back into sidecars") + p.add_argument("--offline", action="store_true", help="do not datalad-get missing inputs") + p.add_argument("--no-container", action="store_true", help="run on the host instead of the pinned container") + p.add_argument("--no-gate", action="store_true", help="skip the standards gate (flagged dirty)") + + def cmd_agent(args): """Run the reference QC agent over a Glimmer RO-KB.""" _planned("agent", "The reference agent SDK is planned for roadmap v0.5 (see 
docs/roadmap.md).") @@ -89,6 +119,14 @@ def main(): p.add_argument("path", help="path to an RO-KB directory") p.set_defaults(func=cmd_validate) + p = sub.add_parser("run", help="forward-execute a run-record (gate → replay → record)") + _add_run_flags(p) + p.set_defaults(func=cmd_run) + + p = sub.add_parser("rerun", help="reproduce a run-record (re-execute + verify per tier)") + _add_run_flags(p) + p.set_defaults(func=cmd_rerun) + p = sub.add_parser("agent", help="run the reference QC agent over an RO-KB") p.add_argument("--model", default="anthropic/claude-opus-4", help="LLM identifier") p.add_argument("--informed", action="store_true", help="include peer QC artifacts as evidence") diff --git a/glimmer/tools/run.py b/glimmer/tools/run.py new file mode 100644 index 0000000..3655cce --- /dev/null +++ b/glimmer/tools/run.py @@ -0,0 +1,610 @@ +#!/usr/bin/env python3 +"""run.py — the Glimmer node runner: execute and verify `run-record` nodes. + +A `run-record` is one concrete, replayable invocation (the PROV Activity). This +module turns it into action. Two modes over one engine: + + run forward execution — gate inputs, replay the command, hash the outputs, + record the produced hashes. The primary path (`glimmer run`). + reproduce re-execute an already-recorded run and COMPARE its outputs to what was + recorded, per the run's verification tier (`glimmer rerun`). + +The runner is for RUNNING, not just reproducing; reproduction is the special case +of re-running something already recorded. + +Per run-record the engine: + 1. Resolves the node and confirms it is a `run-record`. + 2. STANDARDS GATE — each input must be (a) available/pinned and (b) valid against + its declared standard. Validation is DELEGATED to the standard's `validator` + hint (e.g. bids-validator) or falls back to `glimmer validate`; the runner only + enforces and records the outcome. A failed gate ⇒ verdict `gate-failed`, no run. + 3. 
EXECUTE — replay `command` pinned to `container-digest` via + `datalad containers-run` when available, falling back to `datalad run`, then a + host subprocess (only with --no-container, flagged dirty in the verdict). + 4. VERIFY (reproduce mode) per tier: + byte-identical — hash match after NIfTI/GIFTI/JSON normalization + numeric-tolerance — re-derived numbers within {rel,abs} of expected + structural — reasoning-trace cites real nodes holding the claimed + values; retriever-manifest matches + 5. Write a provenance MANIFEST (JSON) and, with --write-verdict, write the + `validation-gate` + `replay-verdict` blocks back into the sidecar. + +Dependency-light: only PyYAML is required. datalad, a container runtime, nibabel, +and external standard validators are all FEATURE-DETECTED and degrade gracefully +(with the degradation recorded honestly in the verdict — never a silent pass). + +Public surface (imported by downstream harnesses, e.g. an ADS provenance_check.py): + sha256_file, load_graph, run_node, certify_equivalence, write_manifest, RunVerdict +""" + +import sys, os, json, hashlib, argparse, shutil, subprocess, xml.dom.minidom +from dataclasses import dataclass, field, asdict +from pathlib import Path + +import yaml + +RUNNER_VERSION = "glimmer-run 0.6.0" + + +# ───────────────────────────────────────────────────────────────────────────── +# graph + hashing helpers +# ───────────────────────────────────────────────────────────────────────────── +def sha256_file(path) -> str: + """Streaming SHA-256 of a file, returned as `sha256:<hex>`.""" + h = hashlib.sha256() + with open(path, "rb") as f: + for chunk in iter(lambda: f.read(8192), b""): + h.update(chunk) + return "sha256:" + h.hexdigest() + + +def _sha256_bytes(b: bytes) -> str: + return "sha256:" + hashlib.sha256(b).hexdigest() + + +def read_sidecar(path: Path): + """Return (frontmatter_dict, body_str).
Mirrors validate.read_sidecar without + importing the validator's internals (keeps this module standalone).""" + text = Path(path).read_text() + if text.startswith("---\n"): + _, fm, body = text.split("---\n", 2) + return (yaml.safe_load(fm) or {}), body + if text.startswith("{"): + return json.loads(text), "" + raise ValueError(f"{path}: sidecar must start with '---' or '{{'") + + +def load_graph(rokb): + """id -> (Path, frontmatter) for every sidecar enumerated in the index.""" + rokb = Path(rokb) + index = json.loads((rokb / "_glimmer-index.json").read_text()) + graph = {} + for entry in index["nodes"]: + p = rokb / entry["path"] + fm, _ = read_sidecar(p) + graph[entry["id"]] = (p, fm) + return graph + + +def _norm(rel, run_root): + return (Path(run_root) / rel).resolve() if rel else None + + +# ───────────────────────────────────────────────────────────────────────────── +# output normalization — so two CORRECT re-runs are not flagged on volatile bytes +# ───────────────────────────────────────────────────────────────────────────── +def normalized_hash(path): + """(hash, normalization-label) for an output file, normalizing volatile, + non-semantic bytes (NIfTI/GIFTI headers, JSON key order) before hashing.""" + path = Path(path) + suf = "".join(path.suffixes).lower() + if suf in (".nii", ".nii.gz"): + return _nifti_hash(path) + if suf == ".gii": + return _gifti_hash(path) + if suf == ".json": + try: + canon = json.dumps(json.loads(path.read_text()), sort_keys=True, separators=(",", ":")).encode() + return _sha256_bytes(canon), "json-canonical" + except Exception: + return sha256_file(path), "raw (json parse failed)" + return sha256_file(path), "none" + + +def _nifti_hash(path): + """Realizes the long-promised NIfTI normalization: blank volatile header + fields, keep affine+dtype+voxels exactly, then hash. 
nibabel absent ⇒ raw + compare, recorded honestly (never a silent pass).""" + try: + import numpy as np # noqa: F401 (nibabel pulls it in) + import nibabel as nib + except Exception: + return sha256_file(path), "raw (nibabel unavailable)" + try: + img = nib.load(str(path)) + hdr = img.header.copy() + for fld in ("descrip", "db_name", "aux_file"): + try: + hdr[fld] = b"" + except Exception: + pass + for fld in ("cal_min", "cal_max", "glmax", "glmin"): + try: + hdr[fld] = 0 + except Exception: + pass + data = img.get_fdata() + h = hashlib.sha256() + h.update(bytes(img.affine.astype("float64").tobytes())) + h.update(str(img.get_data_dtype()).encode()) + h.update(data.tobytes()) + return "sha256:" + h.hexdigest(), "nifti-header-zeroed" + except Exception as e: + return sha256_file(path), f"raw (nibabel error: {e})" + + +def _gifti_hash(path): + """Strip the volatile <MetaData> block, canonicalize XML, then hash.""" + try: + dom = xml.dom.minidom.parseString(Path(path).read_text()) + for md in dom.getElementsByTagName("MetaData"): + md.parentNode.removeChild(md) + return _sha256_bytes(dom.toxml().encode()), "gifti-metadata-stripped" + except Exception: + return sha256_file(path), "raw (gifti parse failed)" + + +# ───────────────────────────────────────────────────────────────────────────── +# verdict dataclasses +# ───────────────────────────────────────────────────────────────────────────── +@dataclass +class RunVerdict: + node: str + mode: str + status: str = "unknown" + provenance_mode: str = "" + tier: str = "" + verdict: str = "error" # executed|verified|reproduced-within-tolerance|structurally-valid|mismatch|gate-failed|inputs-unavailable|error + environment: dict = field(default_factory=dict) + validation_gate: dict = field(default_factory=dict) + inputs: list = field(default_factory=list) + outputs: list = field(default_factory=list) + numeric_checks: list = field(default_factory=list) + structural_checks: list = field(default_factory=list) + command: str = "" + exit_code:
int = None + notes: str = "" + + def to_dict(self): + d = asdict(self) + d["exit-code"] = d.pop("exit_code") + d["provenance-mode"] = d.pop("provenance_mode") + d["validation-gate"] = d.pop("validation_gate") + d["numeric-checks"] = d.pop("numeric_checks") + d["structural-checks"] = d.pop("structural_checks") + return d + + +@dataclass +class EquivalenceVerdict: + method_a: str + method_b: str + equivalent: bool = False + comparisons: list = field(default_factory=list) + notes: str = "" + + def to_dict(self): + d = asdict(self) + return {"method-a": d["method_a"], "method-b": d["method_b"], + "equivalent": d["equivalent"], "comparisons": d["comparisons"], "notes": d["notes"]} + + +# ───────────────────────────────────────────────────────────────────────────── +# the standards gate (delegated validators) +# ───────────────────────────────────────────────────────────────────────────── +def _resolve_input_path(pin, graph, run_root): + """Best-effort local path for an input: pin `path` → referenced node's + output-path (derivative) / datalad-relative-path (dataset).""" + if pin.get("path"): + return _norm(pin["path"], run_root) + node = graph.get(pin.get("node")) + if not node: + return None + fm = node[1] + return _norm(fm.get("output-path") or fm.get("datalad-relative-path"), run_root) + + +def _run_validator(standard_fm, target_path, rokb, run_root): + """Run a standard's `validator` hint against a file. Returns (valid, detail). + Delegated — the run-record never reimplements validation. 
External commands + run with cwd=run_root so relative paths resolve like the analysis itself.""" + v = (standard_fm or {}).get("validator") or {} + kind = v.get("kind", "external" if v.get("command") else "none") + if kind == "glimmer": + rc = subprocess.run([sys.executable, "-m", "glimmer.tools.validate", str(Path(rokb).resolve())], + capture_output=True, text=True) + return rc.returncode in (0, 2), f"glimmer validate rc={rc.returncode}" + cmd = v.get("command") + if not cmd: + return True, "none (unchecked: standard declares no validator)" + tool = (v.get("tool") or cmd.split()[0]) + if shutil.which(tool) is None and not Path(tool).exists(): + return True, f"none (unchecked: validator `{tool}` not installed)" + filled = cmd.replace("{path}", str(target_path)) + rc = subprocess.run(filled, shell=True, cwd=str(run_root), capture_output=True, text=True) + return rc.returncode == 0, f"{tool} rc={rc.returncode}" + + +def _datalad_get(path, run_root): + """Try to materialize a file via datalad. Returns True if present afterwards.""" + if Path(path).exists(): + return True + if shutil.which("datalad") is None: + return False + subprocess.run(["datalad", "get", str(path)], cwd=str(run_root), + capture_output=True, text=True) + return Path(path).exists() + + +def standards_gate(run_fm, graph, rokb, run_root, *, offline, no_gate): + """Enforce: each input pinned/available AND valid against its standard. 
+ Returns (passed, gate_dict, inputs_summary).""" + gate, inputs_summary = {}, [] + passed = True + for pin in run_fm.get("inputs") or []: + nid = pin.get("node") + fpath = _resolve_input_path(pin, graph, run_root) + present = bool(fpath and Path(fpath).exists()) + if not present and not offline: + present = _datalad_get(fpath, run_root) if fpath else False + entry = {"node": nid, "path": str(fpath) if fpath else None, "pinned": present} + # pin verification (annex-key exact match needs datalad; otherwise note it) + if present and pin.get("datalad-annex-key") and shutil.which("datalad") is None: + entry["pin-note"] = "present (annex-key unverified without datalad)" + if not present: + entry["available"] = False + inputs_summary.append({**entry, "available": False, + "data-liveness": "unreachable" if offline else "missing"}) + passed = False + continue + # standard validation (delegated) + std_id = pin.get("conforms-to") + if std_id and not no_gate: + std_fm = graph.get(std_id, (None, {}))[1] + valid, detail = _run_validator(std_fm, fpath, rokb, run_root) + entry.update({"standard": std_id, "valid": valid, "validator": detail}) + if not valid: + passed = False + elif no_gate: + entry.update({"standard": std_id, "valid": None, "validator": "skipped (--no-gate)"}) + gate[nid] = entry + inputs_summary.append(entry) + return passed, gate, inputs_summary + + +# ───────────────────────────────────────────────────────────────────────────── +# execution +# ───────────────────────────────────────────────────────────────────────────── +def execute(run_fm, run_root, *, no_container): + """Replay the command, pinned to the container digest when possible. 
+ Returns (exit_code, resolved_via, stderr_tail).""" + cmd = run_fm["command"] + digest = run_fm.get("container-digest") + image = run_fm.get("container-image") + have_datalad = shutil.which("datalad") is not None + if digest and image and have_datalad and not no_container: + full = ["datalad", "containers-run", "-n", f"{image}@{digest}", "--", *cmd.split()] + rc = subprocess.run(full, cwd=str(run_root), capture_output=True, text=True) + return rc.returncode, "datalad containers-run", rc.stderr[-2000:] + if have_datalad and not no_container and digest: + # datalad recorded but containers-run unavailable: still get provenance capture + rc = subprocess.run(["datalad", "run", "--", *cmd.split()], + cwd=str(run_root), capture_output=True, text=True) + return rc.returncode, "datalad run (no container pin enforced)", rc.stderr[-2000:] + rc = subprocess.run(cmd, shell=True, cwd=str(run_root), capture_output=True, text=True) + via = "host subprocess (--no-container)" if no_container else "host subprocess (no container/datalad available)" + return rc.returncode, via, rc.stderr[-2000:] + + +# ───────────────────────────────────────────────────────────────────────────── +# verification tiers +# ───────────────────────────────────────────────────────────────────────────── +def _tol_pass(obs, exp, tol): + rel = tol.get("rel", 0.0) if tol else 0.0 + ab = tol.get("abs", 0.0) if tol else 0.0 + return abs(obs - exp) <= max(ab, rel * abs(exp)) + + +def verify_byte_identical(pin, run_root): + out = _norm(pin.get("output-path"), run_root) + if not out or not Path(out).exists(): + return {"node": pin.get("node"), "passed": False, "reason": "output not on disk", "tier": "byte-identical"} + observed, norm = normalized_hash(out) + expected = pin.get("expected-hash") + return {"node": pin.get("node"), "tier": "byte-identical", "expected": expected, + "observed": observed, "normalization": norm, "passed": observed == expected} + + +def verify_numeric(pin, run_root, tolerance): + """Compare 
re-derived numbers in the output to the recorded expected-values.""" + out = _norm(pin.get("output-path"), run_root) + results = [] + if not out or not Path(out).exists(): + return [{"node": pin.get("node"), "passed": False, "reason": "output not on disk"}] + try: + produced = json.loads(Path(out).read_text()) + except Exception as e: + return [{"node": pin.get("node"), "passed": False, "reason": f"output not JSON: {e}"}] + expected_values = pin.get("expected-values") or {} + qtol = (tolerance or {}).get("quantities") or {} + default_tol = (tolerance or {}).get("default") or {} + for name, exp in expected_values.items(): + obs = produced.get(name) + tol = qtol.get(name, default_tol) + ok = isinstance(obs, (int, float)) and _tol_pass(float(obs), float(exp), tol) + results.append({"node": pin.get("node"), "quantity": name, "expected": exp, "observed": obs, + "abs-diff": (abs(obs - exp) if isinstance(obs, (int, float)) else None), + "tol": tol, "passed": bool(ok)}) + return results + + +def verify_structural(run_fm, graph): + """For agent-inferred runs: cited nodes must exist and (best-effort) hold the + cited values; retriever-manifest, if present, must be self-consistent.""" + trace = run_fm.get("reasoning-trace") or {} + # find an emitted finding's trace if the run-record itself carries none + if not trace: + for e in run_fm.get("edges") or []: + if e.get("type") == "emits" and e.get("target") in graph: + trace = (graph[e["target"]][1].get("reasoning-trace") or {}) + break + accessed = trace.get("nodes-accessed") or [] + checks = [] + all_exist = True + for nid in accessed: + exists = nid in graph + all_exist = all_exist and exists + checks.append({"cited-node": nid, "exists": exists}) + manifest_ok = True + if run_fm.get("retriever-manifest"): + manifest_ok = bool(run_fm["retriever-manifest"].get("index-sha") or run_fm["retriever-manifest"].get("embedding-model")) + return {"cited-nodes": checks, "all-cited-exist": all_exist, + "retriever-manifest-ok": 
manifest_ok, "passed": all_exist and manifest_ok} + + +# ───────────────────────────────────────────────────────────────────────────── +# the engine +# ───────────────────────────────────────────────────────────────────────────── +def run_node(rokb, node_id, *, mode="run", offline=False, no_container=False, no_gate=False, graph=None): + """Execute (and, in reproduce mode, verify) a single run-record. Returns a RunVerdict.""" + rokb = Path(rokb) + run_root = rokb.parent if rokb.parent != rokb else rokb # project root: code/, inputs/, out/ live here + graph = graph or load_graph(rokb) + if node_id not in graph: + return RunVerdict(node=node_id, mode=mode, verdict="error", notes="node not in index") + path, fm = graph[node_id] + if fm.get("type") != "run-record": + return RunVerdict(node=node_id, mode=mode, verdict="error", notes=f"not a run-record (type={fm.get('type')})") + fm["edges"] = fm.get("edges") or [] + pmode = fm.get("provenance-mode", "deterministic") + v = RunVerdict(node=node_id, mode=mode, status=fm.get("status", "?"), provenance_mode=pmode, + command=fm.get("command", "")) + + # 1. Standards gate + passed, gate, inputs_summary = standards_gate(fm, graph, rokb, run_root, offline=offline, no_gate=no_gate) + v.validation_gate, v.inputs = gate, inputs_summary + if not passed: + any_missing = any(not e.get("pinned", True) for e in inputs_summary) + v.verdict = "inputs-unavailable" if any_missing else "gate-failed" + v.notes = "input unavailable" if any_missing else "standards gate failed" + return v + + # 2. Execute + exit_code, resolved_via, stderr_tail = execute(fm, run_root, no_container=no_container) + v.exit_code = exit_code + v.environment = {"container-digest": fm.get("container-digest"), "resolved-via": resolved_via} + if exit_code != 0: + v.verdict = "error" + v.notes = f"command exited {exit_code}: {stderr_tail.strip()[-400:]}" + return v + + # 3. 
Verify / record outputs + if mode == "run": + for pin in fm.get("outputs") or []: + out = _norm(pin.get("output-path"), run_root) + if out and Path(out).exists(): + h, norm = normalized_hash(out) + v.outputs.append({"node": pin.get("node"), "produced-hash": h, "normalization": norm, + "output-path": pin.get("output-path")}) + else: + v.outputs.append({"node": pin.get("node"), "produced-hash": None, "reason": "not on disk"}) + v.tier = "execute" + v.verdict = "executed" + return v + + # reproduce mode — verify per tier + tier = pmode + tol = fm.get("reproduction-tolerance") + if pmode == "deterministic": + v.tier = "byte-identical" + v.outputs = [verify_byte_identical(pin if "output-path" in pin else {**pin, "output-path": pin.get("output-path")}, run_root) + for pin in fm.get("outputs") or []] + if tol: + for pin in fm.get("outputs") or []: + v.numeric_checks += verify_numeric(pin, run_root, tol) + ok = all(o.get("passed") for o in v.outputs) and all(n.get("passed") for n in v.numeric_checks) + v.verdict = "verified" if ok else "mismatch" + elif pmode == "stochastic": + if tol: + v.tier = "numeric-tolerance" + for pin in fm.get("outputs") or []: + v.numeric_checks += verify_numeric(pin, run_root, tol) + ok = bool(v.numeric_checks) and all(n.get("passed") for n in v.numeric_checks) + v.verdict = "reproduced-within-tolerance" if ok else "mismatch" + else: + sc = verify_structural(fm, graph); v.tier = "structural" + v.structural_checks = [sc]; v.verdict = "structurally-valid" if sc["passed"] else "mismatch" + else: # agent-inferred + sc = verify_structural(fm, graph); v.tier = "structural" + v.structural_checks = [sc] + v.verdict = "structurally-valid" if sc["passed"] else "mismatch" + return v + + +def order_runs(graph, ids): + """Order run-records so a run that regenerates another's input comes first + (a light topological sort over consume/regenerate edges). 
Stable for the + common case; falls back to the given order on a cycle.""" + produced_by = {} # output node -> run-record that regenerates it + for rid in ids: + for e in graph[rid][1].get("edges") or []: + if e.get("type") == "regenerates": + produced_by[e.get("target")] = rid + deps = {rid: set() for rid in ids} + for rid in ids: + for e in graph[rid][1].get("edges") or []: + if e.get("type") == "consumes": + producer = produced_by.get(e.get("target")) + if producer and producer in deps and producer != rid: + deps[rid].add(producer) + ordered, seen = [], set() + while len(ordered) < len(ids): + progressed = False + for rid in ids: + if rid not in seen and deps[rid] <= seen: + ordered.append(rid); seen.add(rid); progressed = True + if not progressed: # cycle / unresolved — append the rest in original order + ordered += [r for r in ids if r not in seen] + break + return ordered + + +def certify_equivalence(rokb, run_record_a, run_record_b, *, tolerance=None, no_container=False): + """Certify `equivalent-to`: run two run-records (each binding a method) and + assert their outputs match — byte-identical, or numerically within tolerance. + Returns an EquivalenceVerdict. 
This makes equivalence a CHECKED claim.""" + rokb = Path(rokb); graph = load_graph(rokb) + run_root = rokb.parent if rokb.parent != rokb else rokb + fa = graph[run_record_a][1]; fb = graph[run_record_b][1] + va = run_node(rokb, run_record_a, mode="run", no_container=no_container, graph=graph) + vb = run_node(rokb, run_record_b, mode="run", no_container=no_container, graph=graph) + ev = EquivalenceVerdict(method_a=fa.get("method"), method_b=fb.get("method")) + if va.verdict != "executed" or vb.verdict != "executed": + ev.notes = f"a={va.verdict}, b={vb.verdict}"; return ev + pa = [o.get("output-path") for o in va.outputs] + pb = [o.get("output-path") for o in vb.outputs] + equiv = True + for oa, ob in zip(pa, pb): + ha, _ = normalized_hash(_norm(oa, run_root)) + hb, _ = normalized_hash(_norm(ob, run_root)) + match = ha == hb + cmp = {"a": oa, "b": ob, "hash-match": match} + if not match and tolerance: + try: + da = json.loads(_norm(oa, run_root).read_text()) + db = json.loads(_norm(ob, run_root).read_text()) + keys = set(da) & set(db) + match = all(_tol_pass(float(da[k]), float(db[k]), tolerance.get("default", tolerance)) + for k in keys if isinstance(da[k], (int, float))) + cmp["numeric-match-within-tolerance"] = match + except Exception as e: + cmp["error"] = str(e) + equiv = equiv and match + ev.comparisons.append(cmp) + ev.equivalent = equiv + return ev + + +# ───────────────────────────────────────────────────────────────────────────── +# writeback + manifest +# ───────────────────────────────────────────────────────────────────────────── +def write_verdict_back(rokb, node_id, verdict: RunVerdict): + """Write `validation-gate` + `replay-verdict` (and, for forward runs, status + + produced hashes) back into the sidecar, preserving the body.""" + path, fm = load_graph(rokb)[node_id] + text = Path(path).read_text() + _, fmtext, body = text.split("---\n", 2) + fm = yaml.safe_load(fmtext) or {} + fm["validation-gate"] = verdict.validation_gate + fm["replay-verdict"] 
= { + "verdict": verdict.verdict, "tier": verdict.tier, "runner-version": RUNNER_VERSION, + "environment": verdict.environment, "outputs-checked": verdict.outputs, + "numeric-checks": verdict.numeric_checks, "structural-checks": verdict.structural_checks, + "exit-code": verdict.exit_code, "notes": verdict.notes, + } + if verdict.mode == "run" and verdict.verdict == "executed": + fm["status"] = "executed" + produced = {o["node"]: o.get("produced-hash") for o in verdict.outputs if o.get("produced-hash")} + default_mode = fm.get("provenance-mode", "deterministic") + for pin in fm.get("outputs") or []: + # Only pin an expected-hash for byte-identical outputs. A stochastic / + # agent output is NOT hash-reproducible, so recording its hash would be + # a false pin — those tiers verify by tolerance / structure instead. + tier = pin.get("tier") or ("byte-identical" if default_mode == "deterministic" else None) + if tier == "byte-identical" and pin.get("node") in produced and not pin.get("expected-hash"): + pin["expected-hash"] = produced[pin["node"]] + Path(path).write_text("---\n" + yaml.safe_dump(fm, sort_keys=False) + "---\n" + body) + + +def write_manifest(verdicts, path): + """Write the provenance manifest JSON (superset of verification-report.json).""" + runs = [v.to_dict() for v in verdicts] + summary = {} + for v in verdicts: + summary[v.verdict] = summary.get(v.verdict, 0) + 1 + total = len(verdicts) + good = sum(summary.get(k, 0) for k in + ("executed", "verified", "reproduced-within-tolerance", "structurally-valid")) + summary["total"] = total + summary["reproducibility-rate-pct"] = round(good / total * 100, 1) if total else 0.0 + manifest = {"glimmer-run-version": RUNNER_VERSION, "summary": summary, "runs": runs} + Path(path).write_text(json.dumps(manifest, indent=2)) + return manifest + + +# ───────────────────────────────────────────────────────────────────────────── +# CLI +# ───────────────────────────────────────────────────────────────────────────── +def 
main(): + ap = argparse.ArgumentParser(description="Run / reproduce Glimmer run-record nodes.") + ap.add_argument("rokb", help="path to a Glimmer RO-KB directory") + ap.add_argument("node_id", help="run-record id, or 'all'") + ap.add_argument("--mode", choices=["run", "reproduce"], default="run") + ap.add_argument("--manifest", default=None, help="write the provenance manifest JSON here") + ap.add_argument("--write-verdict", action="store_true", help="write gate + verdict back into sidecars") + ap.add_argument("--offline", action="store_true", help="do not datalad-get; treat missing inputs as unavailable") + ap.add_argument("--no-container", action="store_true", help="run on the host instead of the pinned container") + ap.add_argument("--no-gate", action="store_true", help="skip the standards gate (flagged dirty)") + args = ap.parse_args() + + rokb = Path(args.rokb) + graph = load_graph(rokb) + if args.node_id == "all": + targets = [nid for nid, (_, fm) in graph.items() if fm.get("type") == "run-record"] + else: + targets = [t.strip() for t in args.node_id.split(",") if t.strip()] + targets = order_runs(graph, [t for t in targets if t in graph]) + [t for t in targets if t not in graph] + + verdicts = [] + for nid in targets: + v = run_node(rokb, nid, mode=args.mode, offline=args.offline, + no_container=args.no_container, no_gate=args.no_gate, graph=graph) + verdicts.append(v) + mark = {"executed": "▶", "verified": "✓", "reproduced-within-tolerance": "≈", + "structurally-valid": "≋"}.get(v.verdict, "✗") + print(f" {mark} {nid}: {v.verdict}" + (f" [{v.tier}]" if v.tier else "") + + (f" — {v.notes}" if v.notes else "")) + if args.write_verdict: + write_verdict_back(rokb, nid, v) + graph = load_graph(rokb) # refresh after writeback + + if args.manifest: + m = write_manifest(verdicts, args.manifest) + print(f"\nManifest: {args.manifest} (reproducibility-rate {m['summary']['reproducibility-rate-pct']}%)") + + bad = [v for v in verdicts if v.verdict in ("mismatch", 
"gate-failed", "inputs-unavailable", "error")] + sys.exit(1 if bad else 0) + + +if __name__ == "__main__": + main() diff --git a/glimmer/tools/validate.py b/glimmer/tools/validate.py index 7cd021f..d8781fa 100644 --- a/glimmer/tools/validate.py +++ b/glimmer/tools/validate.py @@ -242,8 +242,8 @@ def _profile_extras(p): for node_id, (path, fm, _) in sidecar_by_id.items(): node_type = fm.get("type") edges = fm.get("edges") or [] - # Agent-protocol: findings/derivatives produced by an agent require reasoning-trace - if node_type in ("finding", "derivative") and fm.get("produced-by-agent") and not fm.get("reasoning-trace"): + # Agent-protocol: findings/derivatives/run-records produced by an agent require reasoning-trace + if node_type in ("finding", "derivative", "run-record") and fm.get("produced-by-agent") and not fm.get("reasoning-trace"): errors.append(f"{path}: {node_type} has `produced-by-agent` but no `reasoning-trace` (agent protocol)") # Findings must have based-on (the evidence chain) if node_type == "finding": @@ -251,6 +251,75 @@ def _profile_extras(p): errors.append(f"{path}: finding node missing required `based-on` field (the evidence chain)") edge_types = {e.get("type") for e in edges if isinstance(e, dict)} + # run-record structured-field + lifecycle checks (v0.6). The generic + # required-field check (section 4) ensures method/command/status/inputs/ + # outputs are present; here we check their SHAPE and graph consistency. 
+ if node_type == "run-record": + status = fm.get("status") + # `method` must name a method node + m = fm.get("method") + if m: + if m not in index_ids: + errors.append(f"{path}: run-record `method` `{m}` not in index") + elif (index_ids[m].get("type") or "method") != "method": + errors.append(f"{path}: run-record `method` `{m}` is type `{index_ids[m].get('type')}`, expected `method`") + if m not in {e.get("target") for e in edges if isinstance(e, dict) and e.get("type") == "reruns"}: + warnings.append(f"{path}: run-record `method` `{m}` not mirrored by a `reruns` edge") + # inputs: non-empty list of pins; each needs `node` (+ a content pin unless planned) + inputs = fm.get("inputs") + if not isinstance(inputs, list) or not inputs: + errors.append(f"{path}: run-record `inputs` must be a non-empty list of pins") + inputs = [] + consumes_targets = {e.get("target") for e in edges if isinstance(e, dict) and e.get("type") == "consumes"} + for i, pin in enumerate(inputs): + if not isinstance(pin, dict) or not pin.get("node"): + errors.append(f"{path}: run-record inputs[{i}] missing `node`"); continue + n = pin["node"] + if n not in index_ids: + errors.append(f"{path}: run-record input node `{n}` not in index") + if status != "planned" and not (pin.get("datalad-annex-key") or pin.get("datalad-commit-sha")): + errors.append(f"{path}: run-record input `{n}` has no content pin " + f"(datalad-annex-key/commit-sha) and status is `{status}` (pins required unless planned)") + if n not in consumes_targets: + warnings.append(f"{path}: run-record input `{n}` not mirrored by a `consumes` edge") + # outputs: non-empty list of pins; each `node` must resolve to a derivative + outputs = fm.get("outputs") + if not isinstance(outputs, list) or not outputs: + errors.append(f"{path}: run-record `outputs` must be a non-empty list of pins") + outputs = [] + regen_targets = {e.get("target") for e in edges if isinstance(e, dict) and e.get("type") == "regenerates"} + for i, pin in 
enumerate(outputs): + if not isinstance(pin, dict) or not pin.get("node"): + errors.append(f"{path}: run-record outputs[{i}] missing `node`"); continue + n = pin["node"] + if n not in index_ids: + errors.append(f"{path}: run-record output node `{n}` not in index") + elif (index_ids[n].get("type") or "derivative") != "derivative": + errors.append(f"{path}: run-record output node `{n}` is type " + f"`{index_ids[n].get('type')}`, expected `derivative`") + if status == "executed" and not pin.get("expected-hash"): + warnings.append(f"{path}: run-record output `{n}` has no `expected-hash` (status=executed)") + if n not in regen_targets: + warnings.append(f"{path}: run-record output `{n}` not mirrored by a `regenerates` edge") + # reproduction-tolerance, if present, must be a map of {rel?,abs?} numerics + tol = fm.get("reproduction-tolerance") + if tol is not None: + buckets = ([tol.get("default")] + list((tol.get("quantities") or {}).values())) if isinstance(tol, dict) else None + if buckets is None: + errors.append(f"{path}: run-record `reproduction-tolerance` must be a map") + else: + for b in buckets: + if b is None: + continue + if (not isinstance(b, dict) or any(k not in ("rel", "abs") for k in b) + or any(not isinstance(v, (int, float)) for v in b.values())): + errors.append(f"{path}: run-record `reproduction-tolerance` entries must be {{rel?,abs?}} numerics") + break + # status=executed should carry the runner's verdict + if status == "executed" and not fm.get("replay-verdict"): + warnings.append(f"{path}: run-record status=executed but no `replay-verdict` " + f"(run `glimmer run`/`rerun` with --write-verdict)") + if node_type == "qc-artifact" and "conforms-to" not in edge_types: warnings.append(f"{path}: qc-artifact has no `conforms-to` edge to a standard") if node_type == "dataset" and "produced-by" not in edge_types: From 418f3eb6afa18f89d4c3c24b8bcea6e33cc08d41 Mon Sep 17 00:00:00 2001 From: Shady El Damaty Date: Wed, 24 Jun 2026 14:19:32 +0200 Subject: [PATCH 
2/4] docs: new paper draft (proactive provenance); bump citation to v0.6.0 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add docs/paper-draft.md — a working manuscript (no journal boilerplate) extending "Reproducibility as Knowledge Graph Navigation" to proactive provenance: the executable run-record + node runner, the standards gate, the three verification tiers, runtime-certified method equivalence, and the agentic loop made executable. Point paper-citation.md at it and bump the software citation version to 0.6.0. Co-Authored-By: Claude Opus 4.8 --- docs/paper-citation.md | 6 +- docs/paper-draft.md | 319 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 324 insertions(+), 1 deletion(-) create mode 100644 docs/paper-draft.md diff --git a/docs/paper-citation.md b/docs/paper-citation.md index 6741f94..c3bc7d5 100644 --- a/docs/paper-citation.md +++ b/docs/paper-citation.md @@ -19,7 +19,11 @@ And the code: title={Glimmer: a Research-Object Knowledge Base for AI-Native Scientific Workflows}, author={El Damaty, Shady}, url={https://github.com/hebbianloop/glimmer}, - version={0.1.0}, + version={0.6.0}, year={2026} } ``` + +The working manuscript that extends the architecture to **proactive provenance** +(executable run-records + the node runner) is drafted at +[`docs/paper-draft.md`](paper-draft.md). diff --git a/docs/paper-draft.md b/docs/paper-draft.md new file mode 100644 index 0000000..17bdd6f --- /dev/null +++ b/docs/paper-draft.md @@ -0,0 +1,319 @@ +# Proactive Provenance: Making the Research-Object Graph Executable for AI-Native Reproducibility + +**Shady El Damaty** +*Working draft — v0.6.0 (2026). Successor to "Reproducibility as Knowledge Graph Navigation" (CAISc 2026).* + +> This is a working manuscript, not a journal submission: no anonymization, length +> targets, or venue formatting. 
It states the architecture and the v0.6 contribution
+> plainly so the design can be reviewed against the implementation in this repository.
+
+---
+
+## Abstract
+
+Reproducible-pipeline tooling (BIDS, DataLad, Nipype, containers) made the *bytes* of a
+computational analysis recoverable, and provenance standards (W3C PROV, RO-Crate, CWL)
+made its *structure* describable. Yet "this result reproduces" remains a claim a reader
+must take on trust: the description of a run and the *act* of re-running it are separate
+artifacts, and nothing in the typed record forces them to agree. The gap widens under
+AI-native workflows, where an autonomous agent analyzing a dataset re-derives the same
+numbers inconsistently, re-makes settled data-processing errors, and loses prior findings
+across sessions — each a reproducibility failure in miniature, because the agent has no
+executable contract to anchor to.
+
+We introduce **proactive provenance**: a typed research-object graph whose runs are
+first-class, **executable, standard-gated, and self-verifying** nodes. We add to Glimmer —
+a per-entity-sidecar research-object knowledge base — a single core node type, the
+**`run-record`** (one concrete, replayable invocation, in the sense of a PROV `Activity`),
+and a **node runner** (`glimmer run` / `glimmer rerun`) that, for each run-record: gates
+its inputs (they must be both content-pinned *and* valid against their declared standards,
+with validation delegated to standard-specific validators), replays the recorded command
+in a pinned container, and verifies the outputs at one of three fidelity tiers —
+*byte-identical* (with header normalization so two correct re-runs are not falsely
+flagged), *numeric-within-tolerance* (re-deriving a reported number from source), and
+*structural* (for agent/LLM outputs, the analogue of byte-equality). The run-record is the
+executable unit of the agentic research loop: a hypothesis decomposes into planned runs,
+running them emits derivatives and findings with recorded verdicts, and those verdicts —
+not memory — are what the next planning step reads. We describe the design, a reference
+implementation, and a worked example that exercises all three tiers, the standards gate,
+and runtime-certified method equivalence on a laptop with no specialized dependencies.
+
+---
+
+## 1. Introduction
+
+The 2010s solved a narrow version of reproducibility: with versioned data (DataLad /
+git-annex), versioned code, and pinned environments (containers), the *bytes* an analysis
+consumed and produced can be recovered later. The 2020s added a *description* layer —
+W3C PROV activities, RO-Crate manifests, CWL tool/workflow descriptions — so the
+*structure* of a computation is machine-readable. Glimmer's prior contribution sat above
+both: a typed-entity graph (datasets, methods, derivatives, findings, standards,
+publications, concepts) distributed across per-entity sidecars, so an agent could
+*navigate* provenance and render auditable decisions.
+
+But all of this is **descriptive**. A `derivative` node records an `output-hash`; a
+`method` records a tool version and a workflow SHA; PROV records that activity *A* used
+entity *E* and generated entity *O*. None of it re-runs *A* and checks that *O* still
+results. The description of the run and the run itself are different objects, maintained
+by hand, and free to drift. In practice the gap is filled by a human re-running things ad
+hoc, or — increasingly — not at all.
+
+AI-native workflows make the gap acute.
An autonomous or semi-autonomous agent operating
+over a dataset, with no executable contract to anchor to, exhibits a characteristic
+failure pattern: it re-derives the same quantity by different routes and gets different
+answers, repeats data-processing mistakes that were already diagnosed, and forgets
+findings established earlier in the same project. Each is a reproducibility failure, and
+each is a direct consequence of the substrate being a *ledger of claims* rather than a
+*set of runnable, self-checking contracts*.
+
+**Thesis.** Reproducibility should be an *executable property of the graph*, not a
+property of a separate pipeline a reader is invited to trust. We call this **proactive
+provenance**: the research-object graph not only records what an output is, but can
+re-run the act that produced it and verify the result — at a fidelity appropriate to the
+computation's nature. We realize it with one new node type and one tool, and argue this is
+the missing primitive that makes the agentic research loop self-sustaining.
+
+**Contributions.**
+1. The **`run-record`** node type: a first-class, lifecycle-bearing PROV `Activity` that
+   binds a method, pinned and standard-validated inputs, expected outputs, the exact
+   command, and a pinned environment — the executable unit of the agentic loop (§4, §5).
+2. A **node runner** with a pre-run **standards gate** and **three verification tiers**,
+   including a faithful byte-identical tier that normalizes non-semantic header bytes (§4).
+3. **Standards as a runtime gate**: the run-record *references* the standards its inputs
+   must satisfy; the runner *enforces*; validators are *delegated* — so conformance is an
+   enforced, recorded fact without coupling the record to validator internals (§4.3).
+4. **Runtime-certified method equivalence**: two implementations are `equivalent-to` only
+   if the runner confirms they produce matching outputs on shared inputs — turning a
+   convention into a check, and giving a method a semantic identity that outlives one
+   project (§4.5).
+5. A reference implementation and a dependency-light worked example demonstrating all of
+   the above (§6).
+
+---
+
+## 2. Background and related work
+
+**Reproducible pipelines.** BIDS standardizes neuroimaging data layout; DataLad/git-annex
+version data and pin content by hash; Nipype and container images (Docker/Singularity,
+BIDS-Apps) pin the toolchain. These guarantee *re-fetch* and *re-execution capability* but
+not *verification*: nothing asserts that re-execution reproduces the recorded result.
+Glimmer reuses this layer wholesale — DataLad coordinates live on every node — and adds the
+act of checking.
+
+**Provenance description.** W3C PROV models entities/activities/agents; RO-Crate packages a
+dataset with typed metadata; CWL and `cwlprov` describe and record tool/workflow runs;
+WorkflowHub publishes workflows. Glimmer's `run-record` is deliberately PROV-shaped (an
+Activity that `consumes` entities and `regenerates` derivatives), but it is *executable in
+place*: the same node a reader inspects is the node the runner replays and stamps with a
+verdict. The contribution is not a new description format but closing the loop between
+description and execution inside one typed object.
+
+**Workflow re-execution.** `datalad run` / `datalad containers-run` record a command with
+its inputs/outputs and can re-run it; `repro` tools and CI harnesses re-execute pipelines.
+The node runner builds directly on `datalad containers-run` for replay.
What Glimmer adds
+is (a) re-execution as a *graph operation* over typed nodes rather than a shell convention,
+(b) a *standards gate* before execution, and (c) *tiered* verification spanning
+deterministic, stochastic, and agent-produced outputs — not only byte-equality.
+
+**AI-for-science systems.** End-to-end autoresearch systems (the AI-Scientist line,
+FutureHouse, OpenScholar, Coscientist) run fixed internal pipelines. Glimmer is
+complementary substrate: the run-record gives such systems a typed, replayable, verifiable
+unit of work that survives independently of any one system, so multiple agents can
+cooperate on, and audit, the same graph.
+
+---
+
+## 3. The Glimmer graph (recap)
+
+Glimmer represents a research project as a graph of typed nodes, each a Markdown file with
+YAML front-matter ("sidecar"), enumerated by a root index. Core node types include
+`dataset`, `method`, `derivative`, `finding`, `standard`, `publication`, `concept`,
+`experiment`, `instrument`, `persona`, `organization`, and `program`; edges are properties
+on the source node. Domain-specific vocabulary lives in *profiles* (e.g. neuroimaging/BIDS)
+that augment core types without forking the core. Every node carries a content hash and its
+DataLad re-fetch coordinates; an agent that produces a node must attach a `reasoning-trace`
+citing the nodes it read. The graph is plain files, so it survives `git clone`,
+`datalad export`, and `rsync` with no bespoke database.
+
+This much was descriptive. Section 4 makes it executable.
+
+## 4. Proactive provenance: the `run-record`
+
+### 4.1 The node
+
+A `method` is a *reusable* tool (one skull-strip, applied to many subjects); a `derivative`
+is a *product* (a file and its hash). Neither captures *the act*. The `run-record` is that
+act — one concrete invocation: **this** command, on **these** content-pinned inputs, in
+**this** container, testing **this** hypothesis, on **this** date. Its required core is a
+`method`, a `command`, a `provenance-mode` (which selects the default verification tier), a
+`status`, a list of pinned `inputs`, and a list of expected `outputs`; edges
+(`reruns`/`consumes`/`regenerates`/`emits`, plus `tests-hypothesis`/`addresses-concept`)
+make it navigable.
+
+It carries a **lifecycle**: `planned` (written when a hypothesis is decomposed; inputs may
+be a *specification* rather than a pin) → `ready` (inputs pinned and standard-valid: the
+gate passed) → `running` → `executed` (outputs hashed, verdict written) / `failed` /
+`superseded`. The runner advances the lifecycle and writes two blocks back into the node:
+`validation-gate` (the pre-run check) and `replay-verdict` (the per-output result).
+
+### 4.2 The node runner
+
+One engine, two modes. **`glimmer run`** is *forward execution* — gate, replay, hash and
+record outputs, advance `ready → executed`. **`glimmer rerun`** is *reproduction* —
+re-execute an already-recorded run and compare its outputs to what was recorded. The runner
+is for running; reproduction is the special case of re-running something already recorded.
+
+Per run-record the engine (i) resolves and pins inputs (materializing via `datalad get`,
+matching the recorded annex-key/commit-sha; unreachable inputs short-circuit to an
+`inputs-unavailable` verdict that records data liveness — never a false pass), (ii) enforces
+the standards gate (§4.3), (iii) replays the command pinned to the container digest via
+`datalad containers-run`, falling back to `datalad run` and then a host subprocess flagged
+as a dirty environment, (iv) verifies outputs per tier (§4.4), and (v) emits a provenance
+manifest and, optionally, writes the verdict back into the graph.
+
+### 4.3 Standards as a runtime gate (referenced, enforced, delegated)
+
+A run-record only **references** the standards its inputs must satisfy (an input-pin's
+`conforms-to`, the `requires-standard` edge). The runner **enforces** that each input is
+both *pinned* and *valid* before any command runs; a failure yields a `gate-failed` verdict
+and no execution. The check itself is **delegated** to the standard's `validator` hint
+(e.g. `bids-validator {path}`), or falls back to graph-level validation, recording
+`unchecked` honestly when no validator is available. The run-record thus never reimplements
+validation, yet "ran on standard-valid data" becomes an enforced, recorded fact — moving
+standards from passive metadata into the runtime.
+
+### 4.4 Three verification tiers
+
+Reproducibility is not one thing, and a single notion of "match" is wrong for most science.
+The tier is selected by `provenance-mode` and overridable per output.
+
+- **Byte-identical** (deterministic). Re-execute; the output hash must match exactly —
+  *after normalization*, because two correct re-runs of an FSL/Nipype step differ in
+  volatile header bytes (timestamps, descrip fields) that carry no scientific content. The
+  runner zeroes those fields for NIfTI (keeping affine, dtype, and voxels exactly), strips
+  GIFTI metadata, and canonicalizes JSON, recording which normalization was applied; if the
+  imaging library is unavailable it falls back to a raw compare, recorded as such — never a
+  silent pass. (This realizes a normalization that prior tooling promised in a comment but
+  never implemented.)
+- **Numeric-within-tolerance** (stochastic). Re-derive a number from source and assert
+  `|observed − expected| ≤ max(abs, rel·|expected|)`. This is the honest guarantee for
+  stochastic analyses, and the mechanism for **reproducing a paper's reported numbers from
+  source** within a declared tolerance.
+- **Structural** (agent-inferred). LLM/agent prose is not hash-reproducible. Verification
+  is structural: the node's `reasoning-trace` must cite real graph nodes, those nodes must
+  contain the values the trace claims, and any retriever manifest must be self-consistent —
+  the analogue of byte-equality for the stochastic/inferential regime.
+
+Each verdict is recorded with its tier and degradations, so a "verified" verdict is never
+silently weaker than its claim (a host-fallback run is marked as such; a tampered expected
+hash yields `mismatch`).
+
+### 4.5 Method equivalence and a cross-project registry
+
+A method is a reusable function over pinned data, and its *semantic pattern* — "compute the
+mean of a series", "skull-strip a T1w" — is not unique to one project. We separate the
+pattern from the implementation with a `registry-ref` (a cross-project namespaced pattern
+id) and the edges `implements`, `equivalent-to`, and `refines`. Crucially, `equivalent-to`
+is **certified, not asserted**: the runner runs both implementations on the same pinned
+inputs and confirms their outputs match (byte-identically or within tolerance). This makes
+"an equivalent, cleaner, or faster implementation is acceptable" a checkable statement, and
+lets a method's identity — and the verification baseline attached to it — outlive a single
+graph. (A dedicated abstract `method-pattern` node type and a published cross-institution
+registry are future work; §8.)
+
+## 5. The agentic loop, made executable
+
+Glimmer previously specified a plan→run→feedback→replan research loop — a `concept`
+decomposed into hypotheses, agents producing derivatives and findings, a human reviewing —
+but it had no runtime primitive; the "run" step left only a derivative behind. The
+run-record *is* that primitive. A hypothesis acquires one or more `planned` run-records
+(`tests-hypothesis → concept`); running them gates, executes, and emits `derivative`s
+(`regenerates`) and a `finding` (`emits`) whose `addresses-concept` edge closes the loop
+back onto the question.
The feedback the next iteration reads is the recorded verdict plus
+the emitted finding — not the agent's memory. This is precisely why the loop becomes
+self-sustaining and why the agent failure modes of §1 are structurally suppressed: a
+finding cannot be silently lost (it is a node addressing the concept), and a settled
+mistake cannot be silently repeated (the prior run's verdict is in the graph and a re-run
+must reproduce it).
+
+## 6. Implementation and worked example
+
+The implementation is ~600 lines of dependency-light Python (PyYAML only; DataLad, a
+container runtime, the imaging library, and external validators are all feature-detected
+and degrade with the degradation recorded). The schema change is additive and
+backward-compatible; existing graphs validate unchanged.
+
+The worked example (`examples/synthetic-provenance/`) is the smallest artifact that
+exercises the whole design with no real data and no specialized dependencies. A `concept`
+is decomposed into four planned runs over a fixed synthetic signal whose input declares
+conformance to a toy standard with an executable validator:
+
+| run | provenance-mode | tier | reproduce verdict |
+|---|---|---|---|
+| `run-synth-mean` | deterministic | byte-identical | `verified` |
+| `run-synth-mean-fast` | deterministic | byte-identical | `verified` (certified `equivalent-to` the above) |
+| `run-synth-classifier` | stochastic | numeric-within-tolerance (±0.02) | `reproduced-within-tolerance` |
+| `run-synth-agent-summary` | agent-inferred | structural | `structurally-valid` |
+
+Forward `glimmer run` advances the planned runs to executed and emits a finding that
+addresses the concept; `glimmer rerun` reproduces all four at 100%. Negative controls
+behave as designed: a malformed input is caught by the standards gate before execution
+(`gate-failed`); a tampered expected hash yields `mismatch`; an unreachable input under
+`--offline` yields `inputs-unavailable`; bypassing the gate is recorded as a dirty run; and
+`certify_equivalence` confirms the two mean implementations produce identical output. Each
+maps to a non-zero exit, so the example doubles as a CI check.
+
+## 7. Discussion
+
+Proactive provenance reframes reproducibility from a property a reader audits by hand to a
+property the substrate enforces and re-checks. Three design choices carry the weight.
+*First*, tiered verification: insisting on byte-equality for everything is both too strong
+(stochastic analyses fail it) and too weak (it ignores agent outputs entirely); matching
+the check to the computation's nature is what lets one mechanism span a deterministic
+pipeline, a reported statistic, and an LLM interpretation. *Second*, delegated gating:
+standards become runtime-enforceable without the runner knowing any standard's internals,
+so the design rides the existing validator ecosystem rather than reimplementing it.
+*Third*, certified equivalence: it converts the social convention "use an equivalent
+method" into a machine-checked relation, which is the seed of a shared verification baseline
+across institutions. Together these make the claim "AI-native reproducibility" concrete: an
+agent's unit of work is now a contract it can execute and that others can re-execute, with a
+verdict that is legible to both machines and reviewers.
+
+## 8. Limitations and future work
+
+The runner replays a recorded command; it does not yet *generate* a reproduction path for
+an arbitrary external paper. The planned-run-record with specified (unpinned, possibly
+surrogate) inputs is the affordance for **minimal-path reproduction** — synthesizing a
+minimal runnable graph to test a published claim even when the original data is unavailable
+— which is the principal next step. The method registry currently ships as lightweight
+edges plus runtime-certified equivalence; a dedicated abstract `method-pattern` type and a
+published, cross-institution registry (with a reputation/provenance model for who
+contributed which pattern or baseline) are deferred. Container replay requires DataLad and a
+runtime; on a bare host the runner degrades and records the weaker environment, but does not
+reconstruct it. Structural verification establishes that an agent cited real values, not
+that its interpretation is correct — it guarantees auditability, not truth. Finally, the
+worked example is synthetic by design; validating the byte-identical and numeric tiers on a
+full neuroimaging pipeline and on a dissertation's reported numbers is in progress in a
+downstream project that imports the runner as a dependency.
+
+## 9. Conclusion
+
+We argued that reproducibility belongs *inside* the typed research object as an executable
+property, and we realized it with a single node type and one tool. The `run-record` makes a
+run a first-class, lifecycle-bearing, standard-gated, self-verifying graph node; the node
+runner enforces the gate and verifies outputs at a fidelity matched to the computation. The
+result turns the agentic research loop from a description into a running system whose
+feedback is recorded verdicts rather than recollection — a concrete substrate for AI-native
+science in which "this reproduces" is something the machine demonstrates, not something the
+reader is asked to believe.
+
+---
+
+## References (informal)
+
+- BIDS — Brain Imaging Data Structure. Gorgolewski et al., 2016.
+- DataLad / git-annex — Halchenko et al., 2021.
+- Nipype — Gorgolewski et al., 2011.
+- W3C PROV-DM, 2013; RO-Crate — Soiland-Reyes et al., 2022; CWL / cwlprov — Crusoe et al., 2022.
+- El Damaty, S. *Reproducibility as Knowledge Graph Navigation: Glimmer …* CAISc 2026.
+- Repository: https://github.com/hebbianloop/glimmer ; spec: `docs/proactive-provenance.md`.
From 01c3d8739964d8ae15abc9bc088c4b74e94f5bd5 Mon Sep 17 00:00:00 2001
From: Shady El Damaty
Date: Wed, 24 Jun 2026 15:27:49 +0200
Subject: [PATCH 3/4] papers: LaTeX project standard + two manuscripts
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Establish LaTeX as the standard for submittable Glimmer papers:
- papers/glimmer-paper.cls — venue-neutral house document class (preprint/draft banners, title-block macros; preloads amsmath/hyperref/natbib/booktabs/…).
- papers/README.md — the convention + build instructions; papers/.gitignore for TeX artifacts.
- papers/01-knowledge-graph-navigation/ — Paper 1, scoped to v0.5 (the graph as a navigable substrate; verification as a validator-enforced contract, the executable runner named as forthcoming). Preprint candidate. Compiles to 9pp.
- papers/02-proactive-provenance/ — Paper 2 (v0.6): the run-record + node runner, standards gate, three verification tiers, certified equivalence. LaTeX port of docs/paper-draft.md. Compiles to 7pp.

Both build with `make` (latexmk + bibtex) against the shared class.
Co-Authored-By: Claude Opus 4.8 --- papers/.gitignore | 11 + papers/01-knowledge-graph-navigation/main.tex | 423 ++++++++++++++++++ papers/01-knowledge-graph-navigation/refs.bib | 110 +++++ papers/02-proactive-provenance/Makefile | 11 + papers/02-proactive-provenance/main.tex | 322 +++++++++++++ papers/02-proactive-provenance/refs.bib | 78 ++++ papers/README.md | 51 +++ papers/glimmer-paper.cls | 79 ++++ 8 files changed, 1085 insertions(+) create mode 100644 papers/.gitignore create mode 100644 papers/01-knowledge-graph-navigation/main.tex create mode 100644 papers/01-knowledge-graph-navigation/refs.bib create mode 100644 papers/02-proactive-provenance/Makefile create mode 100644 papers/02-proactive-provenance/main.tex create mode 100644 papers/02-proactive-provenance/refs.bib create mode 100644 papers/README.md create mode 100644 papers/glimmer-paper.cls diff --git a/papers/.gitignore b/papers/.gitignore new file mode 100644 index 0000000..e70f91c --- /dev/null +++ b/papers/.gitignore @@ -0,0 +1,11 @@ +# LaTeX build artifacts +*.pdf +*.aux +*.bbl +*.blg +*.log +*.out +*.fls +*.fdb_latexmk +*.synctex.gz +*.toc diff --git a/papers/01-knowledge-graph-navigation/main.tex b/papers/01-knowledge-graph-navigation/main.tex new file mode 100644 index 0000000..14a892a --- /dev/null +++ b/papers/01-knowledge-graph-navigation/main.tex @@ -0,0 +1,423 @@ +\documentclass[preprint]{glimmer-paper} + +\title{Reproducibility as Knowledge Graph Navigation:\\ +A Research-Object Knowledge Base for AI-Native Neuroimaging Analysis} +\author{Shady El Damaty} +\affil{Holonym / Opscientia} +\version{0.5.1} +\repo{https://github.com/hebbianloop/glimmer} +\date{2026} + +\newcommand{\code}[1]{\texttt{#1}} +\newcommand{\node}[1]{\textsf{#1}} + +\begin{document} +\maketitle + +\begin{abstract} +Reproducible-pipeline tooling---BIDS for data layout, DataLad and git-annex for +content-addressed versioning, Nipype and containers for pinned computation---made the +\emph{bytes} of a neuroimaging 
analysis recoverable and gave projects syntactic +structure. What it did not give is a \emph{navigable knowledge layer}: a typed, +machine-traversable account of which datasets, methods, derivatives, findings, and +publications a project comprises, how they relate, and what evidence grounds each claim. +This gap matters acutely for AI-native workflows, where an agent must reason over a +project's provenance to render decisions a human can audit. We present Glimmer, a +research-object knowledge base that models a research project as a typed-entity graph +distributed across per-entity sidecar files. Twelve core node types and a versioned edge +taxonomy turn datasets, methods, derivatives, findings, standards, publications, research +questions, people, and programs into first-class nodes; every node carries a content hash +and its DataLad re-fetch coordinates, and every agent-produced node carries a mandatory +reasoning trace citing the nodes it read. A schema-level verifiability contract +distinguishes \emph{exact} provenance (a deterministic result must re-run from its cited +SHAs) from \emph{structural} provenance (an inferred claim must cite real nodes that +contain the values it reports), and a domain-profile mechanism keeps the core +domain-neutral while letting neuroimaging (BIDS) and other fields add their own +vocabulary. Because the graph is plain files, it survives \code{git clone}, +\code{datalad export}, and \code{rsync} with no database. We describe the schema, the +agent protocol, the agentic research loop it enables, and two worked examples (a +DataLad$\rightarrow$Nipype pipeline and a literature-retrieval adapter), and we frame +reproducibility as a property an agent establishes by \emph{navigating} the graph and +re-running from cited coordinates. Turning that navigable property into an automatic +execution-and-verification engine is the subject of follow-on work. 
+\end{abstract} + +\section{Introduction} + +A decade of investment in reproducible pipelines transformed neuroimaging. The Brain +Imaging Data Structure \citep{gorgolewski2016bids} standardized how raw data is laid out; +DataLad and git-annex \citep{halchenko2021datalad} made datasets content-addressed and +re-fetchable by hash; Nipype \citep{gorgolewski2011nipype} and containerized BIDS-Apps +such as MRIQC \citep{esteban2017mriqc} and fMRIPrep \citep{esteban2019fmriprep} pinned the +toolchain so a computation could, in principle, be re-executed. Together these guarantee +two things: the \emph{bytes} an analysis consumed can be recovered, and the project has a +\emph{syntactic} structure a tool can parse. + +They do not guarantee a third thing that matters increasingly: a \emph{navigable knowledge +layer}. Given a published number, what dataset produced it, under which method version, +conforming to which standard, supporting which claim, cited in which paper, framed against +which hypothesis? In practice that knowledge lives in lab wikis, file-naming conventions, +analysis scripts, and researchers' memories---none of it typed, none of it traversable, +and none of it survives the people who built it. The reproducibility crisis is, in this +framing, partly a \emph{knowledge-representation} problem: the relationships among a +project's artifacts are real but unrecorded. + +The problem sharpens under AI-native workflows. An agent asked to quality-control a +dataset, summarize a result, or draft a paragraph must reason over exactly these +relationships, and---if its output is to be trusted---must leave behind an auditable +account of the evidence it used. A free-text answer from a language model is not +auditable; a decision grounded in a typed graph, with an explicit trace of which nodes +were read and which metrics were treated as load-bearing, is. 
+ +\paragraph{Thesis.} We propose to model a research project as a typed-entity +\emph{research-object graph}, distributed across per-entity sidecar files that live +alongside the data in version control, so that reproducibility becomes +\emph{knowledge-graph navigation}: an agent (or a human) answers a provenance question by +traversing typed edges from a claim back to the data and code that ground it, and +establishes that a result reproduces by re-fetching and re-running from the coordinates +the graph records. Glimmer is the schema, the verifiability contract, and a reference +implementation of this idea. This paper describes the architecture as of v0.5.x; an +automatic execution-and-verification engine that turns the navigable property into a +machine-checked one is forthcoming \citep{eldamaty2026proactive}. + +\paragraph{Contributions.} +\begin{itemize} + \item A domain-neutral schema of twelve core node types and a versioned edge taxonomy + that represents the data layer, the evidence layer, and the social/research-program + layer of a project as one traversable graph (\S\ref{sec:schema}). + \item A distributed, file-based realization---per-entity YAML sidecars plus a root + index---that carries content hashes and DataLad re-fetch coordinates on every node and + survives ordinary distribution tools (\S\ref{sec:schema}, \S\ref{sec:discussion}). + \item An agent protocol: a schema-level verifiability contract requiring every + agent-produced node to declare its identity and emit a reasoning trace, and + distinguishing exact from structural verification (\S\ref{sec:protocol}). + \item The agentic research loop the substrate enables---plans-as-issues, concept + decomposition, and four specialized agent roles---and a retrieval adapter that gives + literature search the same provenance discipline (\S\ref{sec:loop}). 
+ \item Two worked examples and a domain-profile mechanism that keeps the core + domain-neutral while supporting BIDS-specific vocabulary (\S\ref{sec:examples}). +\end{itemize} + +\section{Background and related work} +\label{sec:background} + +\paragraph{Data and pipeline standards.} BIDS \citep{gorgolewski2016bids} fixes a +directory layout and sidecar conventions for neuroimaging; it is syntactic structure, not +a relational account of a project. DataLad and git-annex \citep{halchenko2021datalad} +provide content-addressed, distributed version control: a dataset is a git repository +whose large files are tracked by annex key, so any file can be re-fetched by a pinned +commit and key. Nipype \citep{gorgolewski2011nipype} wraps heterogeneous tools (FSL, +FreeSurfer, ANTs) behind a uniform interface and records workflow structure. Glimmer does +not replace any of these; it reads them and sits above them, treating a DataLad +superdataset as the source of truth and recording, on each node, the coordinates needed to +recover the bytes. + +\paragraph{Provenance description.} The W3C PROV data model \citep{w3cprov2013} +formalizes entities, activities, and agents; the Neuroimaging Data Model (NIDM) +\citep{maumet2016nidm} applies a PROV/RDF account to statistical results; RO-Crate +\citep{soilandreyes2022rocrate} packages a dataset with typed, linked-data metadata; and +schema.org provides a general vocabulary for web-discoverable description. These are +description formats. Glimmer is deliberately compatible with them---its node types map +cleanly onto PROV entities and activities, and a Glimmer graph can be cross-read as +RO-Crate or JSON-LD---but its contribution is a \emph{working substrate} an agent +navigates and a validator enforces, not a new serialization. 
The interoperability stance +is explicit: format does not matter if an agent can translate between formats, so a +sidecar may be BIDS-native JSON, Glimmer-native YAML, or any structured form the agent can +parse into the same typed graph. + +\paragraph{AI for science.} End-to-end autoresearch systems +\citep{lu2024aiscientist,boiko2023coscientist} run fixed internal pipelines from idea to +manuscript. Glimmer is complementary: rather than a pipeline, it is a typed, versioned, +auditable substrate that multiple agents---possibly from different systems---can reason +over and contribute to, with a verifiability contract that makes each contribution +checkable. The graph survives independently of any one autoresearch system. + +\section{The Glimmer schema} +\label{sec:schema} + +A Glimmer research-object knowledge base (RO-KB) is a set of typed nodes, each stored as a +sidecar file, enumerated by a root index. We describe the node types, the edges, the +sidecar format, and the domain-profile mechanism. + +\subsection{Node types} + +The core schema is domain-neutral: vocabulary fixed by a particular field's standards +lives in profiles (\S\ref{sec:profiles}), not in the core. Twelve node types span three +layers. + +\emph{The data and evidence layer.} +\begin{itemize} + \item \node{dataset} --- research data of any kind. The core carries only identity, + generic provenance, and re-fetch coordinates; kind-specific attributes + (participant, session, modality) come from a profile. + \item \node{method} --- a named analysis tool, pipeline, or workflow (an FSL binary, a + Nipype interface or workflow, a script), with \code{tool}, \code{version}, + \code{parameters}, a \code{parameters-hash}, and a \code{workflow-definition-sha}. + \item \node{experiment} --- a task or acquisition \emph{paradigm} (conditions, stimuli, + timing), distinct from a static standard. 
+ \item \node{instrument} --- the measurement apparatus that \emph{generates} data (a + scanner, survey, assay, task-delivery system, or device); a dataset is + \code{acquired-with} an instrument. + \item \node{derivative} --- the output of applying a method to a dataset; a first-class + node, not a directory, carrying an \code{output-hash} so a re-run can be checked, and a + required \code{provenance-mode} (\S\ref{sec:protocol}). + \item \node{finding} --- an interpreted assertion grounded in one or more derivatives, + the unit between ``the pipeline produced this output'' and ``we wrote a paper about + it,'' aligned with evidence-graph practice: an interpretation plus evidence pointers + plus verifiable provenance. + \item \node{standard} --- a specification, atlas, template, or protocol (a BIDS version, + an atlas, a QC rating scale), modeled as a node so constraints are edges an agent can + read. + \item \node{publication} --- a paper draft, abstract, or preprint that aggregates + findings. +\end{itemize} + +\emph{The social and research-program layer.} +\begin{itemize} + \item \node{concept} --- a research question, hypothesis, or theme: the unit a research + program operates at (what a grant funds, what a thesis defends). The agentic loop + decomposes a concept into sub-hypotheses. + \item \node{persona} --- a person or organizational role; the in-graph identity an + attribution edge resolves to. + \item \node{organization} --- an institution, lab, consortium, journal, or funder. + \item \node{program} --- a study, cohort, or initiative as a first-class container that + organizes datasets, experiments, concepts, and publications around one mission; programs + nest via \code{part-of} to form multi-study hierarchies. +\end{itemize} + +\subsection{Edges} + +Edges are properties on the source node, each carrying a type and a target. The taxonomy +has three families. 
+ +\emph{Structural edges} record the data-to-claim lineage: \code{produced-by} +(derivative/dataset $\rightarrow$ method), \code{derives-from} (derivative $\rightarrow$ +dataset or upstream derivative), \code{applies-to} and \code{produces} (method +$\leftrightarrow$ derivative), \code{conforms-to} (\textrm{$\rightarrow$} standard), the +\code{cites-*} family (publication $\rightarrow$ dataset/method/derivative/finding), the +method-pipeline DAG (\code{composes}, \code{upstream-of}, \code{downstream-of}), and the +two acquisition edges promoted to the core in v0.5, \code{acquired-with} (dataset or +experiment $\rightarrow$ instrument) and \code{described-by} (dataset $\rightarrow$ +experiment). + +\emph{Meta-graph edges} record evidence and the research program: \code{based-on} +(finding $\rightarrow$ derivative/dataset), \code{addresses-concept} (finding or +publication $\rightarrow$ concept), \code{tests-hypothesis} (experiment $\rightarrow$ +concept), the concept-relation edges \code{decomposes-into}, \code{extends-concept}, +\code{subsumed-by}, \code{competes-with}, and \code{superseded-by}, the evidence-relation +edges \code{supports} and \code{contradicts}/\code{challenged-by}, and the in-graph +attribution layer \code{authored-by}, \code{affiliated-with}, \code{funded-by}, +\code{mentors}, \code{leads}, and \code{part-of}. + +\emph{Universal edges} are allowed from any node and unioned in by the validator: +\code{contributed-by} (an out-of-graph contributor identifier---an ORCID URI, email, or +kebab id---carrying role metadata), \code{in-program} (membership in a program in this +graph), and \code{cross-project} (a relationship to a node in \emph{another} project's +graph, addressed by a namespaced id such as \code{ads-glimmer:org-nij}). The last two let +a subproject inherit a canonical node from its parent program, or a harmonizing analysis +connect two graphs, without a federated index. 
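The edge rules above can be sketched in a few lines of Python. This is a minimal illustration, not the Glimmer validator: the in-memory dict layout, the helper name `dangling_edges`, and every node id except `ads-glimmer:org-nij` are assumptions made for the example. It shows the one property the text states directly, that an edge is a typed pointer stored on its source node, and it treats `contributed-by` and `cross-project` targets as out-of-graph identifiers that are not resolved locally.

```python
# Illustrative sketch only: the dict layout and helper name are assumptions,
# not the Glimmer validator's actual API.

# Edge types whose targets live outside this graph, so they are not resolved locally.
OUT_OF_GRAPH = {"contributed-by", "cross-project"}


def dangling_edges(nodes):
    """Return (source-id, edge-type, target) for in-graph edges whose target is missing."""
    missing = []
    for node_id, node in nodes.items():
        for edge in node.get("edges", []):
            if edge["type"] in OUT_OF_GRAPH:
                continue  # ORCID URIs, emails, and namespaced ids are not local nodes
            if edge["target"] not in nodes:
                missing.append((node_id, edge["type"], edge["target"]))
    return missing


# Hypothetical five-node graph: each edge is a property on its source node.
graph = {
    "ds-raw": {"type": "dataset", "edges": [
        {"type": "in-program", "target": "prog-demo"},
        {"type": "cross-project", "target": "ads-glimmer:org-nij"},
    ]},
    "m-bet": {"type": "method", "edges": []},
    "deriv-brain": {"type": "derivative", "edges": [
        {"type": "derives-from", "target": "ds-raw"},
        {"type": "produced-by", "target": "m-bet"},
    ]},
    "find-vol": {"type": "finding", "edges": [
        {"type": "based-on", "target": "deriv-brain"},
        {"type": "addresses-concept", "target": "concept-h1"},  # deliberately missing
    ]},
    "prog-demo": {"type": "program", "edges": []},
}

print(dangling_edges(graph))
```

In this toy graph the only unresolved edge is the finding's `addresses-concept` pointer; the `cross-project` edge is skipped because its target is addressed by a namespaced id in another project's graph.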
+ +\subsection{Sidecars, the index, and provenance} + +Each node is a Markdown file with YAML front-matter; standalone sidecars are +Glimmer-native YAML, while a sidecar extending a BIDS file in place is BIDS-native JSON +augmented with an \code{\_x-glimmer} block. Every node has \code{id}, \code{type}, +\code{name}, \code{created}, \code{modified}, an optional \code{description}, an optional +\code{edges} list, and a \code{provenance-hash}: a SHA-256 over the node's body, so a git +commit is a snapshot of the entire graph's state. A root file, \code{\_glimmer-index.json}, +enumerates every node id and its path; it is the mandatory load for an agent. + +Re-fetchability is carried on the node. Datasets, derivatives, and methods record DataLad +coordinates---\code{datalad-superdataset}, \code{datalad-relative-path}, +\code{datalad-commit-sha}, and \code{datalad-annex-key}---so the graph is self-describing +for recovery: installing the superdataset and getting the cited path reproduces exactly +the bytes the analysis ran on. Version 0.5.1 adds a \emph{data-availability} block to +\node{dataset} (\code{data-remote}, \code{data-provenance}, \code{data-last-commit}, +\code{data-liveness}) so a query over the graph can tell whether a dataset is currently +pullable, and from which storage backend, complementing the re-fetch coordinates. + +\subsection{Domain profiles} +\label{sec:profiles} + +Keeping the core domain-neutral requires a disciplined extension mechanism. A +\emph{domain profile} is a small YAML file that augments core node types with extra +required or optional fields and, as of v0.5, may declare domain-specific node types and +edge types. Profiles come in two tiers: a curated library versioned in the repository +(\code{glimmer/schema/profiles/.yaml}) and a researcher's own, local to one +knowledge base (\code{/\_glimmer-profiles/.yaml}); a node selects its profile +by its \code{domain} field, falling back to a KB-level default. 
The shipped neuroimaging +profile encodes BIDS vocabulary---augmenting \node{dataset} with \code{subject-id}, +\code{session}, \code{modality}, \code{scanner}, and \code{bids-version}; \node{experiment} +with \code{design} and \code{stimulus-set}; \node{method} with a Nipype node kind; and +\node{derivative} with an \code{output-kind}---and declares the study-provenance edges a +neuroimaging project needs. The governance rule is simple: a node or edge type is promoted +to the core only once at least two domains need it; otherwise it stays in a profile. This +lets a new field adopt Glimmer without forking the schema. + +\section{The agent protocol: a verifiability contract} +\label{sec:protocol} + +The schema records structure; the agent protocol makes agent contributions auditable. Its +rules are enforced by the validator, so a graph that violates them does not validate. + +\paragraph{Identity and reasoning trace.} When a \node{finding} or \node{derivative} is +produced by an agent rather than by a deterministic computation, it must set +\code{produced-by-agent} (a stable model identifier) and carry a \code{reasoning-trace}. +The trace is structured: \code{nodes-accessed} lists every node id the agent read, +\code{metrics-cited} records the numeric or categorical evidence treated as load-bearing, +\code{evidence-summary} is a short justification with inline node-id citations, and +\code{model-identifier} and \code{timestamp} fix provenance. A finding must additionally +carry a \code{based-on} chain naming the derivatives or datasets that support it; a finding +without one is treated as ungrounded. + +\paragraph{Modes of provenance.} Every \node{derivative} declares a \code{provenance-mode}: +\code{deterministic} (a Nipype/FSL computation whose output is fixed by its inputs), +\code{agent-inferred} (a language-model summary), or \code{stochastic} (e.g.\ a randomized +initialization or a retrieval over an embedding index). 
The mode tells a reader what kind +of verification is even possible. + +\paragraph{Exact versus structural verification.} The contract distinguishes two regimes. +For a \code{deterministic} output, verification is \emph{exact} in principle: re-running +the cited method at the cited input SHAs must reproduce the recorded \code{output-hash}, so +the graph's claim about the artifact is falsifiable by re-execution. For an +\code{agent-inferred} output, verification is \emph{structural}: the reasoning trace must +cite nodes that exist, those nodes must contain the values the trace reports, and the +interpretation must be a plausible reading of them. Neither test guarantees correctness; +together they guarantee \emph{auditability}---a reviewer can walk from a claim back through +its cited evidence and check that the evidence exists, conforms to the standards it cites, +and says what the agent says it says. In this version these regimes are a contract the +validator enforces and a discipline a human or a CI job applies by re-running; an engine +that performs the re-execution and the structural check automatically is the subject of +follow-on work (\S\ref{sec:limits}). + +\paragraph{Sanity-checking against source.} A method records a +\code{workflow-definition-sha} and may record a \code{source-checkout-url} and +\code{source-build-instructions}. When a reasoning trace cites a derivative whose method +version has drifted from upstream source, the recovery move---borrowed from long-running +neuroimaging projects---is to rebuild the method from source rather than trust a possibly +divergent cached binary. The schema makes this expressible; an agent may choose to enforce +it. + +\section{The agentic research loop} +\label{sec:loop} + +The substrate is designed for a specific operating pattern: \emph{plans-as-issues} applied +to research. 
An author writes a research idea as a \node{concept} node with a tracked +issue; the concept is decomposed via \code{decomposes-into} into falsifiable +sub-hypotheses; and for each sub-hypothesis the project runs one or more agent loops whose +outputs route to human review (approve, revise, or reject). Four agent roles recur, each +operating with a different access scope and the same verifiability contract: + +\begin{itemize} + \item a \emph{literature scout} that traverses external bibliographic sources and the + project's existing publications and emits new \node{publication} nodes for relevant prior + work, read-only on local data; + \item a \emph{QC agent} that rates datasets and their derivatives against a standard, + emitting artifacts with full reasoning traces; + \item an \emph{analysis agent} that runs deterministic computation and emits + \node{derivative} nodes with \code{provenance-mode: deterministic}; and + \item a \emph{synthesis agent} that walks the populated graph and drafts a + \node{publication} by composing derivatives, QC artifacts, and cited publications into an + argument, with its output gated on human review. +\end{itemize} + +At every step the output's reasoning trace cites the standards it applied, so a reviewer +can verify standards adherence without inspecting the data directly; the protocol makes +verifiability non-optional at each step. Crucially, the loop is the \emph{operational +pattern the substrate enables}---typed roles emitting typed, traceable nodes---rather than +a monolithic executable; the substrate's job is to make every contribution navigable and +checkable. + +\paragraph{Retrieval with provenance.} Literature search is inherently non-deterministic, +so the retrieval adapter gives it the honest analogue of SHA-based reproducibility. 
A +retrieval-grounded \node{finding} is marked \code{provenance-mode: stochastic} and embeds a +\code{retriever-manifest}---the embedding model and version, the chunking and reranking +configuration, and the index SHA---so a same-configuration re-run is reproducible modulo +the index. The adapter's indexing operation is idempotent on a node's id and +provenance-hash, so re-emitting an unchanged node is a no-op. This keeps even the +stochastic, external-knowledge parts of the loop inside the same auditable contract. + +\section{Worked examples} +\label{sec:examples} + +\paragraph{A DataLad$\rightarrow$Nipype pipeline.} The reference example operates on the +public OpenNeuro dataset \code{ds000114}. A short install step runs +\code{datalad install ///openneuro/ds000114} and selectively gets a subject's T1-weighted +image, recording the DataLad coordinates. A Nipype workflow applies skull-stripping (FSL +BET) and tissue segmentation (FSL FAST); the emitter then writes typed sidecars: a +\node{dataset} for the input, \node{method} nodes for BET and FAST (each with its tool +version, parameter hash, and workflow-definition SHA), \node{derivative} nodes for the +outputs (each with an \code{output-hash} and \code{provenance-mode: deterministic}), and a +\node{finding} reporting a derived brain volume. The graph is self-describing for recovery: +the cited superdataset, commit, and annex key locate exactly the input bytes, and the +method's SHA plus parameters specify the computation, so the derivative's recorded hash is +the target a re-run must reproduce. The validator confirms that every edge target exists, +that the finding carries a \code{based-on} chain, and that the reasoning-trace contract +holds. + +\paragraph{A literature-retrieval adapter.} The second example models the literature-scout +role. 
A \node{concept} states a falsifiable hypothesis (a naturalistic emotional-film fMRI +paradigm predicting a later behavioral outcome); the scout retrieves prior work, emits +\node{publication} nodes with citation metadata, and emits a \node{finding} whose +\code{reasoning-trace} records the retrieval scores and the publications it read, marked +\code{provenance-mode: stochastic} with its \code{retriever-manifest}. A reviewer can read +the trace, follow \code{based-on} to each cited publication, and confirm the finding cites +real, retrievable evidence---structural verification in practice. + +\section{Discussion} +\label{sec:discussion} + +\paragraph{Distributed-first by design.} Glimmer deliberately avoids a central database. +Because the graph is per-entity files committed alongside the data, it survives the tools +researchers already use---\code{git clone}, \code{datalad export}, \code{rsync}---with no +bespoke serialization, and a single commit at the superdataset level captures the full +state of data, code, derivatives, and graph together. The cost---no global query without +loading the index---is acceptable for project- and program-scale graphs and buys +durability and tool-independence. + +\paragraph{Domain-agnostic core, domain-specific profiles.} The split between a +twelve-type core and curated/local profiles lets the same architecture serve neuroimaging +today and other compute-intensive, standards-backed fields tomorrow, with promotion to the +core gated on cross-domain demand. The reference implementation is neuroimaging because +that is where the standards ecosystem is most mature, but nothing in the core is +neuroimaging-specific. 
+ +\paragraph{A junction with decentralized science.} Because nodes carry stable, possibly +out-of-graph identities (ORCID, namespaced cross-project ids) and the substrate is +file-based and content-addressed, Glimmer composes naturally with decentralized-science +infrastructure---OpenNeuro and Opscientia for data, and verifiable-researcher-identity +systems for attribution and signing. The \code{cross-project} edge is the seam along which +independent project graphs federate. + +\section{Limitations and future work} +\label{sec:limits} + +The central limitation of this version is that verification is a \emph{navigable property}, +not an automatic one. The schema records everything needed to reproduce a deterministic +result---input SHAs, method SHA, parameters, the target output hash---and the validator +enforces the reasoning-trace contract, but the act of re-fetching, re-running, and +comparing is performed by a human or a continuous-integration job, not by the substrate +itself. An \emph{executable node runner} that gates a computation's inputs, replays it in +its pinned environment, and verifies its outputs automatically---turning the exact and +structural regimes of \S\ref{sec:protocol} into machine-checked verdicts---is the natural +next step and the subject of follow-on work \citep{eldamaty2026proactive}. Other open +directions include federation across institution graphs with a reputation model for schema +extensions, durable-storage guarantees expressed in the model rather than left to +operations, and temporal edges for longitudinal study waves. + +\section{Conclusion} + +Reproducible pipelines gave neuroimaging recoverable bytes and syntactic structure; +Glimmer adds the layer above them, a typed research-object graph in which datasets, +methods, derivatives, findings, standards, publications, questions, and people are +first-class nodes connected by versioned edges, distributed across files that survive +ordinary tooling. 
By carrying content hashes and DataLad coordinates on every node and a +mandatory reasoning trace on every agent output, the graph makes reproducibility a matter +of \emph{navigation}: an agent or reviewer walks from a claim to the evidence that grounds +it and re-runs from the coordinates the graph records. That navigable foundation is what an +automatic execution-and-verification engine can then build upon. + +\bibliographystyle{plainnat} +\bibliography{refs} + +\end{document} diff --git a/papers/01-knowledge-graph-navigation/refs.bib b/papers/01-knowledge-graph-navigation/refs.bib new file mode 100644 index 0000000..a331366 --- /dev/null +++ b/papers/01-knowledge-graph-navigation/refs.bib @@ -0,0 +1,110 @@ +@article{gorgolewski2016bids, + title={The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments}, + author={Gorgolewski, Krzysztof J. and Auer, Tibor and Calhoun, Vince D. and Craddock, R. Cameron and Das, Samir and Duff, Eugene P. and Flandin, Guillaume and Ghosh, Satrajit S. and Glatard, Tristan and Halchenko, Yaroslav O. and others}, + journal={Scientific Data}, + volume={3}, + number={1}, + pages={160044}, + year={2016}, + publisher={Nature Publishing Group}, + doi={10.1038/sdata.2016.44} +} + +@article{halchenko2021datalad, + title={DataLad: distributed system for joint management of code, data, and their relationship}, + author={Halchenko, Yaroslav O. and Meyer, Kyle and Poldrack, Benjamin and Solanky, Debanjum S. and Wagner, Adina S. and Gors, Jason and MacFarlane, Dave and Pustina, Dorian and Sochat, Vanessa and Ghosh, Satrajit S. and others}, + journal={Journal of Open Source Software}, + volume={6}, + number={63}, + pages={3262}, + year={2021}, + doi={10.21105/joss.03262} +} + +@article{gorgolewski2011nipype, + title={Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in Python}, + author={Gorgolewski, Krzysztof and Burns, Christopher D. 
and Madison, Cindee and Clark, Dav and Halchenko, Yaroslav O. and Waskom, Michael L. and Ghosh, Satrajit S.}, + journal={Frontiers in Neuroinformatics}, + volume={5}, + pages={13}, + year={2011}, + doi={10.3389/fninf.2011.00013} +} + +@article{esteban2017mriqc, + title={MRIQC: Advancing the automatic prediction of image quality in MRI from unseen sites}, + author={Esteban, Oscar and Birman, Daniel and Schaer, Marie and Koyejo, Oluwasanmi O. and Poldrack, Russell A. and Gorgolewski, Krzysztof J.}, + journal={PLOS ONE}, + volume={12}, + number={9}, + pages={e0184661}, + year={2017}, + doi={10.1371/journal.pone.0184661} +} + +@article{esteban2019fmriprep, + title={fMRIPrep: a robust preprocessing pipeline for functional MRI}, + author={Esteban, Oscar and Markiewicz, Christopher J. and Blair, Ross W. and Moodie, Craig A. and Isik, A. Ilkay and Erramuzpe, Asier and Kent, James D. and Goncalves, Mathias and DuPre, Elizabeth and Snyder, Madeleine and others}, + journal={Nature Methods}, + volume={16}, + number={1}, + pages={111--116}, + year={2019}, + doi={10.1038/s41592-018-0235-4} +} + +@techreport{w3cprov2013, + title={{PROV-DM}: The {PROV} Data Model}, + author={Moreau, Luc and Missier, Paolo and others}, + institution={World Wide Web Consortium (W3C)}, + type={W3C Recommendation}, + year={2013}, + url={https://www.w3.org/TR/prov-dm/} +} + +@article{maumet2016nidm, + title={Sharing brain mapping statistical results with the neuroimaging data model}, + author={Maumet, Camille and Auer, Tibor and Bowring, Alexander and Chen, Gang and Das, Samir and Flandin, Guillaume and Ghosh, Satrajit and Glatard, Tristan and Gorgolewski, Krzysztof J. and Helmer, Karl G. 
and others}, + journal={Scientific Data}, + volume={3}, + number={1}, + pages={160102}, + year={2016}, + doi={10.1038/sdata.2016.102} +} + +@article{soilandreyes2022rocrate, + title={Packaging research artefacts with {RO-Crate}}, + author={Soiland-Reyes, Stian and Sefton, Peter and Crosas, Merc{\`e} and Castro, Leyla Jael and Coppens, Frederik and Fern{\'a}ndez, Jos{\'e} M. and Garijo, Daniel and Gr{\"u}ning, Bj{\"o}rn and La Rosa, Marco and Leo, Simone and others}, + journal={Data Science}, + volume={5}, + number={2}, + pages={97--138}, + year={2022}, + doi={10.3233/DS-210053} +} + +@article{lu2024aiscientist, + title={The {AI} Scientist: Towards Fully Automated Open-Ended Scientific Discovery}, + author={Lu, Chris and Lu, Cong and Lange, Robert Tjarko and Foerster, Jakob and Clune, Jeff and Ha, David}, + journal={arXiv preprint arXiv:2408.06292}, + year={2024} +} + +@article{boiko2023coscientist, + title={Autonomous chemical research with large language models}, + author={Boiko, Daniil A. and MacKnight, Robert and Kline, Ben and Gomes, Gabe}, + journal={Nature}, + volume={624}, + number={7992}, + pages={570--578}, + year={2023}, + doi={10.1038/s41586-023-06792-0} +} + +@misc{eldamaty2026proactive, + title={Proactive Provenance: Making the Research-Object Graph Executable for {AI}-Native Reproducibility}, + author={El Damaty, Shady}, + year={2026}, + note={Working draft, Glimmer v0.6; forthcoming} +} diff --git a/papers/02-proactive-provenance/Makefile b/papers/02-proactive-provenance/Makefile new file mode 100644 index 0000000..5bd28d9 --- /dev/null +++ b/papers/02-proactive-provenance/Makefile @@ -0,0 +1,11 @@ +# Build the paper with the shared house class one directory up.
+TEXINPUTS := ..:$(TEXINPUTS) +export TEXINPUTS + +main.pdf: main.tex refs.bib + latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex + +.PHONY: clean +clean: + latexmk -C + rm -f *.bbl diff --git a/papers/02-proactive-provenance/main.tex b/papers/02-proactive-provenance/main.tex new file mode 100644 index 0000000..9673bc1 --- /dev/null +++ b/papers/02-proactive-provenance/main.tex @@ -0,0 +1,322 @@ +\documentclass[preprint]{glimmer-paper} + +\title{Proactive Provenance: Making the Research-Object Graph Executable\\ for AI-Native Reproducibility} +\author{Shady El Damaty} +\affil{Holonym / Opscientia} +\version{0.6.0} +\repo{https://github.com/hebbianloop/glimmer} +\date{2026} + +\newcommand{\code}[1]{\texttt{#1}} + +\begin{document} +\maketitle + +\begin{abstract} +Reproducible-pipeline tooling (BIDS, DataLad, Nipype, containers) made the \emph{bytes} of +a computational analysis recoverable, and provenance standards (W3C PROV, RO-Crate, CWL) +made its \emph{structure} describable. Yet ``this result reproduces'' remains a claim a +reader must take on trust: the description of a run and the act of re-running it are +separate artifacts, and nothing in the typed record forces them to agree. The gap widens +under AI-native workflows, where an autonomous agent re-derives the same numbers +inconsistently, repeats settled data-processing errors, and loses prior findings across +sessions --- each a reproducibility failure, because the agent has no executable contract +to anchor to. We introduce \emph{proactive provenance}: a typed research-object graph whose +runs are first-class, \textbf{executable, standard-gated, and self-verifying} nodes. 
We add +to Glimmer --- a per-entity-sidecar research-object knowledge base --- a single core node +type, the \code{run-record} (one concrete, replayable invocation, in the sense of a PROV +\emph{Activity}), and a node runner (\code{glimmer run} / \code{glimmer rerun}) that, for +each run-record: gates its inputs (they must be both content-pinned \emph{and} valid against +their declared standards, with validation delegated to standard-specific validators), +replays the recorded command in a pinned container, and verifies the outputs at one of three +fidelity tiers --- \emph{byte-identical} (with header normalization), \emph{numeric-within-tolerance} +(re-deriving a reported number from source), and \emph{structural} (for agent/LLM outputs). +The run-record is the executable unit of the agentic research loop: a hypothesis decomposes +into planned runs, running them emits derivatives and findings with recorded verdicts, and +those verdicts --- not memory --- are what the next planning step reads. We describe the +design, a reference implementation, and a worked example exercising all three tiers, the +standards gate, and runtime-certified method equivalence with no specialized dependencies. +\end{abstract} + +\section{Introduction} + +The 2010s solved a narrow version of reproducibility: with versioned data +(DataLad / git-annex \citep{halchenko2021datalad}), versioned code, and pinned environments +(containers), the \emph{bytes} an analysis consumed and produced can be recovered later. The +2020s added a \emph{description} layer --- W3C PROV activities \citep{provdm2013}, RO-Crate +manifests \citep{soilandreyes2022rocrate}, CWL tool/workflow descriptions +\citep{crusoe2022cwl} --- so the \emph{structure} of a computation is machine-readable. 
+Glimmer's prior contribution \citep{eldamaty2026glimmer} sat above both: a typed-entity +graph (datasets, methods, derivatives, findings, standards, publications, concepts) +distributed across per-entity sidecars, so an agent could \emph{navigate} provenance and +render auditable decisions. + +But all of this is \textbf{descriptive}. A \code{derivative} node records an +\code{output-hash}; a \code{method} records a tool version and a workflow SHA; PROV records +that activity $A$ used entity $E$ and generated entity $O$. None of it re-runs $A$ and checks +that $O$ still results. The description of the run and the run itself are different objects, +maintained by hand, and free to drift. + +AI-native workflows make the gap acute. An autonomous or semi-autonomous agent operating +over a dataset, with no executable contract to anchor to, exhibits a characteristic failure +pattern: it re-derives the same quantity by different routes and gets different answers, +repeats data-processing mistakes that were already diagnosed, and forgets findings +established earlier in the same project. Each is a reproducibility failure, and each is a +direct consequence of the substrate being a \emph{ledger of claims} rather than a \emph{set +of runnable, self-checking contracts}. + +\paragraph{Thesis.} Reproducibility should be an \emph{executable property of the graph}, not +a property of a separate pipeline a reader is invited to trust. We call this \textbf{proactive +provenance}: the research-object graph not only records what an output is, but can re-run the +act that produced it and verify the result --- at a fidelity appropriate to the +computation's nature. We realize it with one new node type and one tool, and argue this is +the missing primitive that makes the agentic research loop self-sustaining. 
+ +\paragraph{Contributions.} +\begin{enumerate} +\item The \code{run-record} node type: a first-class, lifecycle-bearing PROV \emph{Activity} + that binds a method, pinned and standard-validated inputs, expected outputs, the exact + command, and a pinned environment --- the executable unit of the agentic loop + (\S\ref{sec:runrecord}, \S\ref{sec:loop}). +\item A node runner with a pre-run \textbf{standards gate} and \textbf{three verification + tiers}, including a faithful byte-identical tier that normalizes non-semantic header bytes + (\S\ref{sec:runrecord}). +\item \textbf{Standards as a runtime gate}: the run-record \emph{references} the standards + its inputs must satisfy; the runner \emph{enforces}; validators are \emph{delegated} + (\S\ref{sec:gate}). +\item \textbf{Runtime-certified method equivalence}: two implementations are + \code{equivalent-to} only if the runner confirms they produce matching outputs on shared + inputs (\S\ref{sec:registry}). +\item A reference implementation and a dependency-light worked example demonstrating all of + the above (\S\ref{sec:impl}). +\end{enumerate} + +\section{Background and related work} + +\paragraph{Reproducible pipelines.} BIDS \citep{gorgolewski2016bids} standardizes +neuroimaging data layout; DataLad/git-annex \citep{halchenko2021datalad} version data and +pin content by hash; Nipype \citep{gorgolewski2011nipype} and container images (e.g. +fMRIPrep \citep{esteban2019fmriprep}) pin the toolchain. These guarantee \emph{re-fetch} and +\emph{re-execution capability} but not \emph{verification}: nothing asserts that +re-execution reproduces the recorded result. Glimmer reuses this layer wholesale --- DataLad +coordinates live on every node --- and adds the act of checking. 
+ +\paragraph{Provenance description.} W3C PROV \citep{provdm2013} models +entities/activities/agents; RO-Crate \citep{soilandreyes2022rocrate} packages a dataset with +typed metadata; CWL and \code{cwlprov} \citep{crusoe2022cwl} describe and record runs. The +\code{run-record} is deliberately PROV-shaped (an Activity that \code{consumes} entities and +\code{regenerates} derivatives), but it is \emph{executable in place}: the same node a reader +inspects is the node the runner replays and stamps with a verdict. The contribution is not a +new description format but closing the loop between description and execution inside one +typed object. + +\paragraph{Workflow re-execution.} \code{datalad run} / \code{datalad containers-run} record +a command with its inputs/outputs and can re-run it; CI harnesses re-execute pipelines. The +node runner builds directly on \code{datalad containers-run} for replay. What Glimmer adds is +(a)~re-execution as a \emph{graph operation} over typed nodes rather than a shell convention, +(b)~a \emph{standards gate} before execution, and (c)~\emph{tiered} verification spanning +deterministic, stochastic, and agent-produced outputs --- not only byte-equality. + +\paragraph{AI-for-science systems.} End-to-end autoresearch systems +\citep{lu2024aiscientist} run fixed internal pipelines. Glimmer is complementary substrate: +the run-record gives such systems a typed, replayable, verifiable unit of work that survives +independently of any one system, so multiple agents can cooperate on, and audit, the same +graph. + +\section{The Glimmer graph (recap)} + +Glimmer represents a research project as a graph of typed nodes, each a Markdown file with +YAML front-matter (a ``sidecar''), enumerated by a root index. 
Core node types include +\code{dataset}, \code{method}, \code{derivative}, \code{finding}, \code{standard}, +\code{publication}, \code{concept}, \code{experiment}, \code{instrument}, \code{persona}, +\code{organization}, and \code{program}; edges are properties on the source node. +Domain-specific vocabulary lives in \emph{profiles} (e.g. neuroimaging/BIDS) that augment +core types without forking the core. Every node carries a content hash and its DataLad +re-fetch coordinates; an agent that produces a node must attach a \code{reasoning-trace} +citing the nodes it read. The graph is plain files, so it survives \code{git clone}, +\code{datalad export}, and \code{rsync} with no bespoke database. This much was descriptive; +the next section makes it executable. + +\section{Proactive provenance: the \texttt{run-record}} +\label{sec:runrecord} + +\subsection{The node} +A \code{method} is a \emph{reusable} tool (one skull-strip, applied to many subjects); a +\code{derivative} is a \emph{product} (a file and its hash). Neither captures \emph{the act}. +The \code{run-record} is that act --- one concrete invocation: \textbf{this} command, on +\textbf{these} content-pinned inputs, in \textbf{this} container, testing \textbf{this} +hypothesis, on \textbf{this} date. Its required core is a \code{method}, a \code{command}, a +\code{provenance-mode} (which selects the default verification tier), a \code{status}, a list +of pinned \code{inputs}, and a list of expected \code{outputs}; edges (\code{reruns}, +\code{consumes}, \code{regenerates}, \code{emits}, plus \code{tests-hypothesis} / +\code{addresses-concept}) make it navigable. 
+ +It carries a \textbf{lifecycle}: \code{planned} (written when a hypothesis is decomposed; +inputs may be a \emph{specification} rather than a pin) $\rightarrow$ \code{ready} (inputs +pinned and standard-valid: the gate passed) $\rightarrow$ \code{running} $\rightarrow$ +\code{executed} (outputs hashed, verdict written) / \code{failed} / \code{superseded}. The +runner advances the lifecycle and writes two blocks back into the node: \code{validation-gate} +(the pre-run check) and \code{replay-verdict} (the per-output result). + +\subsection{The node runner} +One engine, two modes. \code{glimmer run} is \emph{forward execution} --- gate, replay, hash +and record outputs, advance \code{ready}~$\rightarrow$~\code{executed}. \code{glimmer rerun} +is \emph{reproduction} --- re-execute an already-recorded run and compare its outputs to what +was recorded. The runner is for running; reproduction is the special case of re-running +something already recorded. Per run-record the engine (i)~resolves and pins inputs +(materializing via \code{datalad get}, matching the recorded annex-key/commit-sha; an +unreachable input short-circuits to an \code{inputs-unavailable} verdict that records data +liveness --- never a false pass), (ii)~enforces the standards gate (\S\ref{sec:gate}), +(iii)~replays the command pinned to the container digest via \code{datalad containers-run}, +falling back to \code{datalad run} and then a host subprocess flagged as a dirty environment, +(iv)~verifies outputs per tier (\S\ref{sec:tiers}), and (v)~emits a provenance manifest and, +optionally, writes the verdict back into the graph. + +\subsection{Standards as a runtime gate} +\label{sec:gate} +A run-record only \textbf{references} the standards its inputs must satisfy (an input-pin's +\code{conforms-to}, the \code{requires-standard} edge). 
The runner \textbf{enforces} that +each input is both \emph{pinned} and \emph{valid} before any command runs; a failure yields a +\code{gate-failed} verdict and no execution. The check itself is \textbf{delegated} to the +standard's \code{validator} hint (e.g. \code{bids-validator}), or falls back to graph-level +validation, recording \code{unchecked} honestly when no validator is available. The +run-record thus never reimplements validation, yet ``ran on standard-valid data'' becomes an +enforced, recorded fact --- moving standards from passive metadata into the runtime. + +\subsection{Three verification tiers} +\label{sec:tiers} +Reproducibility is not one thing, and a single notion of ``match'' is wrong for most science. +The tier is selected by \code{provenance-mode} and overridable per output. +\begin{itemize} +\item \textbf{Byte-identical} (deterministic). Re-execute; the output hash must match exactly + --- \emph{after normalization}, because two correct re-runs of an FSL/Nipype step differ in + volatile header bytes (timestamps, \code{descrip} fields) that carry no scientific content. + The runner zeroes those fields for NIfTI (keeping affine, dtype, and voxels exactly), + strips GIFTI metadata, and canonicalizes JSON, recording which normalization was applied; + if the imaging library is unavailable it falls back to a raw compare, recorded as such --- + never a silent pass. +\item \textbf{Numeric-within-tolerance} (stochastic). Re-derive a number from source and + assert $|\,\text{observed}-\text{expected}\,| \le \max(\text{abs},\,\text{rel}\cdot|\text{expected}|)$. + This is the honest guarantee for stochastic analyses, and the mechanism for + \textbf{reproducing a paper's reported numbers from source} within a declared tolerance. +\item \textbf{Structural} (agent-inferred). LLM/agent prose is not hash-reproducible. 
+ Verification is structural: the node's \code{reasoning-trace} must cite real graph nodes, + those nodes must contain the values the trace claims, and any retriever manifest must be + self-consistent --- the analogue of byte-equality for the inferential regime. +\end{itemize} +Each verdict is recorded with its tier and degradations, so a ``verified'' verdict is never +silently weaker than its claim (a host-fallback run is marked as such; a tampered expected +hash yields \code{mismatch}). + +\subsection{Method equivalence and a cross-project registry} +\label{sec:registry} +A method is a reusable function over pinned data, and its \emph{semantic pattern} --- ``compute +the mean of a series'', ``skull-strip a T1w'' --- is not unique to one project. We separate +pattern from implementation with a \code{registry-ref} (a cross-project namespaced pattern +id) and the edges \code{implements}, \code{equivalent-to}, and \code{refines}. Crucially, +\code{equivalent-to} is \textbf{certified, not asserted}: the runner runs both +implementations on the same pinned inputs and confirms their outputs match (byte-identically +or within tolerance). This makes ``an equivalent, cleaner, or faster implementation is +acceptable'' a checkable statement, and lets a method's identity --- and the verification +baseline attached to it --- outlive a single graph. A dedicated abstract +\code{method-pattern} node type and a published cross-institution registry are future work +(\S\ref{sec:future}). + +\section{The agentic loop, made executable} +\label{sec:loop} +Glimmer previously specified a plan~$\rightarrow$~run~$\rightarrow$~feedback~$\rightarrow$~replan +research loop --- a \code{concept} decomposed into hypotheses, agents producing derivatives +and findings, a human reviewing --- but it had no runtime primitive; the ``run'' step left +only a derivative behind. The run-record \emph{is} that primitive. 
A hypothesis acquires one +or more \code{planned} run-records (\code{tests-hypothesis}~$\rightarrow$~\code{concept}); +running them gates, executes, and emits \code{derivative}s (\code{regenerates}) and a +\code{finding} (\code{emits}) whose \code{addresses-concept} edge closes the loop back onto +the question. The feedback the next iteration reads is the recorded verdict plus the emitted +finding --- not the agent's memory. This is precisely why the loop becomes self-sustaining and +why the agent failure modes of \S1 are structurally suppressed: a finding cannot be silently +lost (it is a node addressing the concept), and a settled mistake cannot be silently repeated +(the prior run's verdict is in the graph and a re-run must reproduce it). + +\section{Implementation and worked example} +\label{sec:impl} +The implementation is roughly 600 lines of dependency-light Python (PyYAML only; DataLad, a +container runtime, the imaging library, and external validators are all feature-detected and +degrade with the degradation recorded). The schema change is additive and +backward-compatible; existing graphs validate unchanged. + +The worked example (\code{examples/synthetic-provenance/}) is the smallest artifact that +exercises the whole design with no real data and no specialized dependencies. A \code{concept} +is decomposed into four planned runs over a fixed synthetic signal whose input declares +conformance to a toy standard with an executable validator (Table~\ref{tab:results}). + +\begin{table}[h] +\centering +\small +\begin{tabular}{@{}llll@{}} +\toprule +run & provenance-mode & tier & reproduce verdict \\ +\midrule +\code{run-synth-mean} & deterministic & byte-identical & \code{verified} \\ +\code{run-synth-mean-fast} & deterministic & byte-identical & \code{verified} (certified equiv.) 
\\ +\code{run-synth-classifier} & stochastic & numeric ($\pm 0.02$) & \code{reproduced-within-tolerance} \\ +\code{run-synth-agent-summary} & agent-inferred & structural & \code{structurally-valid} \\ +\bottomrule +\end{tabular} +\caption{The worked example exercises all three tiers; reproduction is 100\%.} +\label{tab:results} +\end{table} + +Forward \code{glimmer run} advances the planned runs to executed and emits a finding that +addresses the concept; \code{glimmer rerun} reproduces all four at 100\%. Negative controls +behave as designed: a malformed input is caught by the standards gate before execution +(\code{gate-failed}); a tampered expected hash yields \code{mismatch}; an unreachable input +under \code{-{}-offline} yields \code{inputs-unavailable}; bypassing the gate is recorded as a +dirty run; and equivalence certification confirms the two mean implementations produce +identical output. Each maps to a non-zero exit, so the example doubles as a CI check. + +\section{Discussion} +Proactive provenance reframes reproducibility from a property a reader audits by hand to a +property the substrate enforces and re-checks. Three design choices carry the weight. +\emph{First}, tiered verification: insisting on byte-equality for everything is both too +strong (stochastic analyses fail it) and too weak (it ignores agent outputs entirely); +matching the check to the computation's nature is what lets one mechanism span a deterministic +pipeline, a reported statistic, and an LLM interpretation. \emph{Second}, delegated gating: +standards become runtime-enforceable without the runner knowing any standard's internals, so +the design rides the existing validator ecosystem rather than reimplementing it. +\emph{Third}, certified equivalence: it converts the social convention ``use an equivalent +method'' into a machine-checked relation, the seed of a shared verification baseline across +institutions. 
Together these make the claim ``AI-native reproducibility'' concrete: an +agent's unit of work is now a contract it can execute and that others can re-execute, with a +verdict legible to both machines and reviewers. + +\section{Limitations and future work} +\label{sec:future} +The runner replays a recorded command; it does not yet \emph{generate} a reproduction path +for an arbitrary external paper. The planned-run-record with specified (unpinned, possibly +surrogate) inputs is the affordance for \textbf{minimal-path reproduction} --- synthesizing a +minimal runnable graph to test a published claim even when the original data is unavailable +--- the principal next step. The method registry currently ships as lightweight edges plus +runtime-certified equivalence; a dedicated abstract \code{method-pattern} type and a +published, cross-institution registry (with a reputation/provenance model for who contributed +which pattern or baseline) are deferred. Container replay requires DataLad and a runtime; on a +bare host the runner degrades and records the weaker environment, but does not reconstruct it. +Structural verification establishes that an agent cited real values, not that its +interpretation is correct --- it guarantees auditability, not truth. Finally, the worked +example is synthetic by design; validating the byte-identical and numeric tiers on a full +neuroimaging pipeline and on a dissertation's reported numbers is in progress in a downstream +project that imports the runner as a dependency. + +\section{Conclusion} +We argued that reproducibility belongs \emph{inside} the typed research object as an +executable property, and realized it with a single node type and one tool. The +\code{run-record} makes a run a first-class, lifecycle-bearing, standard-gated, self-verifying +graph node; the node runner enforces the gate and verifies outputs at a fidelity matched to +the computation. 
The result turns the agentic research loop from a description into a running +system whose feedback is recorded verdicts rather than recollection --- a concrete substrate +for AI-native science in which ``this reproduces'' is something the machine demonstrates, not +something the reader is asked to believe. + +\bibliographystyle{plainnat} +\bibliography{refs} + +\end{document} diff --git a/papers/02-proactive-provenance/refs.bib b/papers/02-proactive-provenance/refs.bib new file mode 100644 index 0000000..1d411b6 --- /dev/null +++ b/papers/02-proactive-provenance/refs.bib @@ -0,0 +1,78 @@ +@inproceedings{eldamaty2026glimmer, + title={Reproducibility as Knowledge Graph Navigation: A Research-Object Knowledge Base for AI-Native Neuroimaging Analysis}, + author={El Damaty, Shady}, + booktitle={Conference for AI Scientists (CAISc)}, + year={2026} +} + +@article{gorgolewski2016bids, + title={The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments}, + author={Gorgolewski, Krzysztof J. and others}, + journal={Scientific Data}, + volume={3}, + pages={160044}, + year={2016} +} + +@article{halchenko2021datalad, + title={DataLad: distributed system for joint management of code, data, and their relationship}, + author={Halchenko, Yaroslav O. 
and others}, + journal={Journal of Open Source Software}, + volume={6}, + number={63}, + pages={3262}, + year={2021} +} + +@article{gorgolewski2011nipype, + title={Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in Python}, + author={Gorgolewski, Krzysztof and others}, + journal={Frontiers in Neuroinformatics}, + volume={5}, + pages={13}, + year={2011} +} + +@misc{provdm2013, + title={{PROV-DM}: The {PROV} Data Model}, + author={{W3C}}, + howpublished={W3C Recommendation}, + year={2013} +} + +@article{soilandreyes2022rocrate, + title={Packaging research artefacts with {RO-Crate}}, + author={Soiland-Reyes, Stian and others}, + journal={Data Science}, + volume={5}, + number={2}, + pages={97--138}, + year={2022} +} + +@article{crusoe2022cwl, + title={Methods included: standardizing computational reuse and portability with the {Common Workflow Language}}, + author={Crusoe, Michael R. and others}, + journal={Communications of the ACM}, + volume={65}, + number={6}, + pages={54--63}, + year={2022} +} + +@article{esteban2019fmriprep, + title={{fMRIPrep}: a robust preprocessing pipeline for functional {MRI}}, + author={Esteban, Oscar and others}, + journal={Nature Methods}, + volume={16}, + number={1}, + pages={111--116}, + year={2019} +} + +@article{lu2024aiscientist, + title={The {AI} Scientist: Towards Fully Automated Open-Ended Scientific Discovery}, + author={Lu, Chris and others}, + journal={arXiv preprint arXiv:2408.06292}, + year={2024} +} diff --git a/papers/README.md b/papers/README.md new file mode 100644 index 0000000..c2afb72 --- /dev/null +++ b/papers/README.md @@ -0,0 +1,51 @@ +# Glimmer papers + +Submittable Glimmer manuscripts live here as **LaTeX projects** — this is the standard +moving forward. Markdown drafts (e.g. `docs/paper-draft.md`) are for fast iteration; once a +paper is heading toward a preprint or submission it is authored here in LaTeX so it is +venue-portable and citation-managed.
+ +## Layout + +``` +papers/ +├── glimmer-paper.cls # the house document class (the standard) +├── 01-knowledge-graph-navigation/ # Paper 1 — the graph layer (through v0.5) +│ ├── main.tex refs.bib Makefile +└── 02-proactive-provenance/ # Paper 2 — the executable runner (v0.6) + ├── main.tex refs.bib Makefile +``` + +Each paper is a self-contained directory with its own `main.tex`, `refs.bib`, and +`Makefile`. They share `papers/glimmer-paper.cls` (found via `TEXINPUTS=..:`, which the +Makefiles set). + +## The house class + +`\documentclass[preprint]{glimmer-paper}` — a thin, **venue-neutral** layer over `article`: +clean typography, a preprint/draft banner, and a small title block. It deliberately holds +nothing venue-specific, so retargeting to a journal or conference template is a one-line +`\documentclass` swap, not a rewrite. It preloads `amsmath`, `hyperref`, `natbib`, +`booktabs`, `xcolor`, `enumitem`, `microtype`, and `titlesec` — do not re-`\usepackage` +those. Options: `preprint` (banner + repo line), `draft` (do-not-circulate header). +Title-block macros: `\affil{}`, `\version{}`, `\repo{}`. + +## Building + +```bash +cd papers/02-proactive-provenance && make # → main.pdf +make clean # remove build artifacts +``` + +The `Makefile` runs `latexmk` with `TEXINPUTS` pointed at the shared class. A full TeX Live +install (with `latexmk` + `bibtex`) is assumed. Build artifacts (`*.pdf`, `*.aux`, …) are +git-ignored. + +## The two papers + +- **01 — Reproducibility as Knowledge Graph Navigation** (through v0.5): the typed + research-object graph as the contribution; verification framed as a *navigable* property + (re-run from cited SHAs). Preprint candidate. +- **02 — Proactive Provenance** (v0.6): makes the graph *executable* — the `run-record` and + the node runner that gates inputs, replays the command, and verifies at three tiers. The + successor that turns navigation into execution. 
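
For orientation, the numeric tier that Paper 02 describes comes down to one inequality: `|observed - expected| <= max(abs, rel * |expected|)`. The sketch below illustrates that check only. The function and parameter names are assumptions made for this README and are not taken from `glimmer/tools/run.py`, and the sample values are invented.

```python
def within_tolerance(observed: float, expected: float,
                     abs_tol: float = 0.0, rel_tol: float = 0.0) -> bool:
    """Accept a re-derived number if it lies within the declared tolerance.

    Implements |observed - expected| <= max(abs_tol, rel_tol * |expected|).
    Hypothetical helper: names and defaults are illustrative, not Glimmer's API.
    """
    return abs(observed - expected) <= max(abs_tol, rel_tol * abs(expected))


# Invented values: a recorded statistic of 0.80 checked with an absolute
# tolerance of 0.02, the tolerance quoted for the stochastic run in the paper.
print(within_tolerance(0.81, 0.80, abs_tol=0.02))   # inside the band
print(within_tolerance(0.85, 0.80, abs_tol=0.02))   # outside the band
print(within_tolerance(105.0, 100.0, rel_tol=0.1))  # relative tolerance only
```

Taking the larger of the absolute and relative bounds keeps the check meaningful both for values near zero, where a relative bound collapses, and for large values, where a fixed absolute bound would be too strict.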
diff --git a/papers/glimmer-paper.cls b/papers/glimmer-paper.cls new file mode 100644 index 0000000..cb9ae4b --- /dev/null +++ b/papers/glimmer-paper.cls @@ -0,0 +1,79 @@ +% glimmer-paper.cls — house document class for Glimmer submittable papers. +% +% The standard for all submittable Glimmer manuscripts (see papers/README.md). +% A thin, venue-neutral layer over `article`: clean typography, a preprint banner, +% and a small set of title-block macros. Nothing venue-specific lives here, +% so a paper can be retargeted to a journal/conference template by swapping the +% \documentclass line, not by rewriting the body. +% +% Options: +% preprint — show a "Preprint — " banner and DOI/repo line (default off) +% draft — show a "DRAFT — do not circulate" watermark line (default off) +% 11pt/12pt — passed through to article (default 11pt) +% +% Title-block macros: \affil{...} \version{...} \repo{...} +% Use \maketitle as usual; \begin{abstract}...\end{abstract} is provided. + +\NeedsTeXFormat{LaTeX2e} +\ProvidesClass{glimmer-paper}[2026/06/24 Glimmer house paper class] + +\newif\ifglimmer@preprint \glimmer@preprintfalse +\newif\ifglimmer@draft \glimmer@draftfalse +\DeclareOption{preprint}{\glimmer@preprinttrue} +\DeclareOption{draft}{\glimmer@drafttrue} +\DeclareOption*{\PassOptionsToClass{\CurrentOption}{article}} +\ExecuteOptions{} +\ProcessOptions\relax +\LoadClass[11pt]{article} + +% --- geometry & typography ---------------------------------------------------- +\RequirePackage[letterpaper,margin=1in]{geometry} +\RequirePackage[T1]{fontenc} +\RequirePackage{lmodern} +\RequirePackage{microtype} +\RequirePackage{amsmath,amssymb} +\RequirePackage{booktabs} +\RequirePackage{xcolor} +\definecolor{glimmerlink}{RGB}{30,86,150} +\RequirePackage[colorlinks=true,linkcolor=glimmerlink,citecolor=glimmerlink,urlcolor=glimmerlink]{hyperref} +\RequirePackage{enumitem} +\setlist{itemsep=2pt,topsep=3pt,parsep=0pt} +\RequirePackage{natbib} +\RequirePackage{titlesec}
+\titleformat{\section}{\normalfont\large\bfseries}{\thesection}{0.6em}{} +\titleformat{\subsection}{\normalfont\normalsize\bfseries}{\thesubsection}{0.6em}{} +\setlength{\parskip}{0.4em} +\setlength{\parindent}{0pt} + +% --- title-block macros ------------------------------------------------------- +\newcommand*{\glimmer@affil}{} +\newcommand{\affil}[1]{\renewcommand*{\glimmer@affil}{#1}} +\newcommand*{\glimmer@version}{} +\newcommand{\version}[1]{\renewcommand*{\glimmer@version}{#1}} +\newcommand*{\glimmer@repo}{} +\newcommand{\repo}[1]{\renewcommand*{\glimmer@repo}{#1}} + +% --- banners ------------------------------------------------------------------ +\RequirePackage{fancyhdr} +\pagestyle{fancy}\fancyhf{} +\renewcommand{\headrulewidth}{0pt} +\fancyfoot[C]{\thepage} +\ifglimmer@preprint + \fancyhead[C]{\footnotesize\color{gray}Preprint\ifx\glimmer@version\@empty\else\ — Glimmer v\glimmer@version\fi\ — \today} +\fi +\ifglimmer@draft + \fancyhead[C]{\footnotesize\color{red}DRAFT — do not circulate — \today} +\fi + +% Put the affiliation / version / repo line under the author block in \maketitle.
+\let\glimmer@oldmaketitle\maketitle +\renewcommand{\maketitle}{% + \glimmer@oldmaketitle + \begingroup\centering\small + \ifx\glimmer@affil\@empty\else\glimmer@affil\par\fi + \ifx\glimmer@repo\@empty\else\texttt{\glimmer@repo}\par\fi + \endgroup + \vspace{0.5em} +} + +\endinput From 6bbabcf2a55b8a607c3fc37e5d6f050d96d0d37e Mon Sep 17 00:00:00 2001 From: Shady El Damaty Date: Wed, 24 Jun 2026 15:28:02 +0200 Subject: [PATCH 4/4] papers: add Makefile for paper 01 (parity with paper 02) Co-Authored-By: Claude Opus 4.8 --- papers/01-knowledge-graph-navigation/Makefile | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 papers/01-knowledge-graph-navigation/Makefile diff --git a/papers/01-knowledge-graph-navigation/Makefile b/papers/01-knowledge-graph-navigation/Makefile new file mode 100644 index 0000000..5bd28d9 --- /dev/null +++ b/papers/01-knowledge-graph-navigation/Makefile @@ -0,0 +1,11 @@ +# Build the paper with the shared house class one directory up. +TEXINPUTS := ..:$(TEXINPUTS) +export TEXINPUTS + +main.pdf: main.tex refs.bib + latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex + +.PHONY: clean +clean: + latexmk -C + rm -f *.bbl