An autonomous incident-response coordinator where every finding traces back to the specific arrow that produced it. Submission to the Find Evil! hackathon.
Built on weft for typed arrow composition and the SANS SIFT Workstation for the underlying forensics tooling.
Existing AI-assisted DFIR tooling (Protocol SIFT and similar) gives an
LLM agent a generic execute_shell_cmd plus a long system prompt and
hopes it stays on the rails. The hackathon brief flags this as the
source of the autonomy gap: hallucination, evidence spoliation risk,
loops that don't terminate.
Provenance replaces that surface with a typed action space and a critic-mediated policy:
- Typed arrows wrap each SIFT tool. Inputs and outputs are Go structs, not raw shell text. An agent cannot construct invalid command lines, scan outside the case directory, or run a Windows plugin on a Linux image — those misconfigurations are not expressible.
- Multi-agent loop with specialist proposers (memory / disk / network) and an independent critic that scores proposals before any arrow runs. Specialists are stateless across iterations; the evidence bag is the only memory.
- Deterministic coordinator in Go (no LLM in the loop body) enforces termination, audit-trail emission, and proposal selection.
- Functional seams for every behavioral choice. Selection policy, critic invocation strategy, proposal collection, and termination conditions are all swappable closures.
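
The propose, score, select, execute cycle described in the list above can be sketched roughly as follows. Every name here (`Bag`, `Proposal`, `Specialist`, `Critic`, `Run`) is a hypothetical stand-in chosen for the illustration, not the repository's actual types, and the fixed iteration cap stands in for whatever termination conditions the real coordinator uses.

```go
package main

import "fmt"

// All names below are hypothetical stand-ins, not the repository's types.

// Bag is the evidence bag: the only state carried between iterations.
type Bag map[string]string

// Proposal is one specialist's suggestion for the next arrow to run.
type Proposal struct {
	Specialist string
	Arrow      string
}

// Specialist is a function of the bag alone, so it keeps no memory of its own.
type Specialist func(Bag) []Proposal

// Critic scores a proposal before anything executes.
type Critic func(Bag, Proposal) float64

// Run is a deterministic loop: collect proposals, score them, execute the
// best one, and stop after maxIter rounds or when nobody proposes anything.
// It returns the number of arrows executed.
func Run(bag Bag, specs []Specialist, critic Critic, exec func(Bag, Proposal), maxIter int) int {
	for i := 0; i < maxIter; i++ {
		var best *Proposal
		bestScore := 0.0
		for _, s := range specs {
			for _, p := range s(bag) {
				if sc := critic(bag, p); best == nil || sc > bestScore {
					p := p
					best, bestScore = &p, sc
				}
			}
		}
		if best == nil {
			return i // no proposals left: terminate
		}
		exec(bag, *best)
	}
	return maxIter
}

func main() {
	// A memory specialist that proposes pslist until the bag has processes.
	memory := func(b Bag) []Proposal {
		if _, done := b["processes"]; done {
			return nil
		}
		return []Proposal{{Specialist: "memory", Arrow: "pslist"}}
	}
	critic := func(Bag, Proposal) float64 { return 1.0 }
	exec := func(b Bag, p Proposal) { b["processes"] = "populated by " + p.Arrow }

	bag := Bag{}
	rounds := Run(bag, []Specialist{memory}, critic, exec, 10)
	fmt.Println(rounds, bag["processes"])
}
```

Because the loop body is plain Go, the iteration cap holds regardless of what a specialist or critic returns.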
make all # tidy, build, test
make test # ~25 unit tests, no SIFT install required
make test-verbose # show every test by name
make cover # coverage report
make run # run the mcp-server binary

The tests require nothing beyond Go and this repository: every SIFT
tool invocation is unit-tested via a fake runner.RunFunc against
captured fixture output. The production binary (bin/mcp-server)
expects the real SIFT tools (vol, yara, ...) on PATH and is
intended to be deployed inside the SIFT Workstation VM.
Requires Go 1.23 or newer.
provenance/
├── runner/              functional seam for subprocess execution
│   └── runner.go        RunFunc, RealRun, NewFake
├── sift/                typed SIFT arrows wrapping forensics tools
│   ├── types.go         Process, Timeline, YaraReport, ...
│   ├── pslist.go        Volatility 3 pslist (Win/Linux dispatch)
│   ├── yara.go          YARA scan with case-root path validation
│   ├── triage.go        composed Par(memory, disk) -> synthesis arrow
│   └── fixtures/        captured tool outputs for tests
├── coord/               the multi-agent coordinator
│   ├── types.go         EvidenceBag, Proposal, Verdict, enums
│   ├── predicate.go     DSL parser + field registry + evaluator
│   ├── seams.go         four behavioral seams + defaults
│   └── coordinator.go   the loop, exposed as a weft.Arrow
├── mcpadapter/          generic weft.Arrow -> MCP tool wrapper
│   └── register.go      one function: RegisterArrow[In, Out]
└── cmd/mcp-server/      production binary serving MCP over stdio
Every test/production boundary in this repo is a function type, not an interface. Function types cannot be type-asserted back to a concrete implementation — there is no concrete type behind the seam, only a closure. This makes the boundary opaque and prevents the class of bug where someone reaches through the seam to bypass it.
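
The difference can be seen in a small contrast. The sketch below is illustrative only: `Runner`, `guarded`, `RunFn`, and `newGuarded` are hypothetical names, and the "only `vol` is allowed" rule is an invented placeholder policy.

```go
package main

import "fmt"

// Runner is an interface-based seam, shown only for contrast.
type Runner interface{ Run(cmd string) string }

// guarded is a concrete implementation with a field a caller might be
// tempted to flip.
type guarded struct{ Enforce bool }

func (g *guarded) Run(cmd string) string {
	if g.Enforce && cmd != "vol" {
		return "rejected"
	}
	return "ran " + cmd
}

// RunFn is the function-type version of the same seam.
type RunFn func(cmd string) string

// newGuarded captures the policy in a closure; no named type or field
// exposes it to callers.
func newGuarded() RunFn {
	enforce := true
	return func(cmd string) string {
		if enforce && cmd != "vol" {
			return "rejected"
		}
		return "ran " + cmd
	}
}

func main() {
	var r Runner = &guarded{Enforce: true}
	// With an interface, a caller can assert back to the concrete type
	// and switch the guard off.
	if g, ok := r.(*guarded); ok {
		g.Enforce = false
	}
	fmt.Println(r.Run("rm"))

	f := newGuarded()
	// f is not an interface value, so a type assertion on it does not
	// compile, and the captured variable is unreachable from here.
	fmt.Println(f("rm"))
}
```

In the interface case the guard is bypassed and the disallowed command runs; in the closure case the only thing a caller can do with the seam is call it.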
The seams:
| Seam | Type | Where |
|---|---|---|
| Subprocess runner | RunFunc | runner/ |
| Trace emission | TraceTap | coord/types.go |
| Proposal collection | CollectProposalsFn | coord/seams.go |
| Critic invocation | InvokeCriticFn | coord/seams.go |
| Selection policy | SelectFn | coord/seams.go |
| Termination check | TerminationCondition | coord/seams.go |
| Arrow execution | ArrowExecutor | coord/types.go |
Each ships with a sensible default; each can be replaced field-by-field
on the Coordinator struct before calling Run().
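
A minimal sketch of what field-by-field replacement might look like, under stated assumptions: only the type names `SelectFn` and `TerminationCondition` are taken from the table above. The struct layout, field names, signatures, the `NewCoordinator` constructor, and the default policies are all hypothetical.

```go
package main

import "fmt"

// Hypothetical shapes; only the seam type names come from the README.
type Proposal struct {
	Arrow string
	Score float64
}

type SelectFn func([]Proposal) (Proposal, bool)
type TerminationCondition func(iteration int) bool

type Coordinator struct {
	Select    SelectFn
	Terminate TerminationCondition
}

// NewCoordinator fills every seam with a default (assumed defaults).
func NewCoordinator() *Coordinator {
	return &Coordinator{
		Select: func(ps []Proposal) (Proposal, bool) { // highest score wins
			if len(ps) == 0 {
				return Proposal{}, false
			}
			best := ps[0]
			for _, p := range ps[1:] {
				if p.Score > best.Score {
					best = p
				}
			}
			return best, true
		},
		Terminate: func(i int) bool { return i >= 8 },
	}
}

func main() {
	ps := []Proposal{{"pslist", 0.4}, {"yara", 0.9}}
	c := NewCoordinator()
	byScore, _ := c.Select(ps)

	// Replace one seam and leave the rest at their defaults.
	c.Select = func(ps []Proposal) (Proposal, bool) {
		if len(ps) == 0 {
			return Proposal{}, false
		}
		return ps[0], true // first come, first served
	}
	first, _ := c.Select(ps)
	fmt.Println(byScore.Arrow, first.Arrow, c.Terminate(8))
}
```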
These are read-only and integrity properties enforced before any subprocess is spawned or any LLM is called:
- Volatility plugin dispatch is profile-driven Go code. A `MemoryImage.Profile` starting with `"Win"` selects `windows.pslist`; anything else selects `linux.pslist`. No agent-controlled string could pick the wrong plugin.
- YARA target paths must resolve inside `CaseRoot` after cleaning and `filepath.Rel` normalization. `../../etc/passwd` is rejected before `exec.Command` is called. `CaseRoot` must be absolute.
- Predicate DSL parses at proposal-receipt time. Malformed predicates and references to unknown fields are caught before the critic or the executor sees them.
- The coordinator loop is deterministic Go. Termination conditions, proposal validation, audit-trail emission — none of these run through an LLM, so none of them can be coaxed off-rail by a prompt-injection.
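
The first two checks in the list above could look roughly like this. It is a sketch under assumptions, not the repository's code: `pluginFor` and `resolveInCase` are hypothetical names, the profile strings in `main` are made-up examples, and the paths assume a Unix-style filesystem. The check is purely lexical and does not resolve symlinks.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// pluginFor mirrors the dispatch rule described above: profiles starting
// with "Win" map to windows.pslist, everything else to linux.pslist.
func pluginFor(profile string) string {
	if strings.HasPrefix(profile, "Win") {
		return "windows.pslist"
	}
	return "linux.pslist"
}

// resolveInCase rejects a relative case root and any target that, after
// cleaning, falls outside the case root. Relative targets are interpreted
// relative to the case root (an assumption of this sketch).
func resolveInCase(caseRoot, target string) (string, error) {
	if !filepath.IsAbs(caseRoot) {
		return "", errors.New("case root must be absolute")
	}
	root := filepath.Clean(caseRoot)
	if !filepath.IsAbs(target) {
		target = filepath.Join(root, target)
	}
	target = filepath.Clean(target)
	rel, err := filepath.Rel(root, target)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("%q escapes case root %q", target, root)
	}
	return target, nil
}

func main() {
	// Example profile strings are invented for the demonstration.
	fmt.Println(pluginFor("Win10x64_19041"), pluginFor("LinuxUbuntu_5.4"))
	fmt.Println(resolveInCase("/cases/demo", "disk/image.dd"))
	fmt.Println(resolveInCase("/cases/demo", "../../etc/passwd"))
}
```

Both functions return before any subprocess would be started, which is the property the list describes: the rejection happens in Go, ahead of the tool invocation.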
Every ExecutionStep records the proposing specialist, the arrow that
ran, the rationale and predicate that were supplied, which bag fields
the step populated, and timing data. Findings carry an Evidence
slice naming the arrows that contributed their supporting data. A
judge running a demo can trace any "high severity" claim through
findings -> arrows -> trace entries -> raw tool invocations.
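
To make the tracing step concrete, here is a small sketch of walking from a finding back to its trace entries. `ExecutionStep` and `Finding` are named in this README, but every field shown is an assumed shape, `stepsFor` is a hypothetical helper, and the sample values are invented.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed shapes for the sketch; the real structs may differ.
type ExecutionStep struct {
	Specialist string        // who proposed the step
	Arrow      string        // which arrow ran
	Rationale  string        // why it was proposed
	Predicate  string        // expectation supplied with the proposal
	Populated  []string      // bag fields the step filled in
	Duration   time.Duration // timing data
}

type Finding struct {
	Severity string
	Claim    string
	Evidence []string // names of the arrows that contributed data
}

// stepsFor returns the trace entries for the arrows a finding cites,
// in execution order.
func stepsFor(f Finding, trace []ExecutionStep) []ExecutionStep {
	cited := make(map[string]bool, len(f.Evidence))
	for _, a := range f.Evidence {
		cited[a] = true
	}
	var out []ExecutionStep
	for _, s := range trace {
		if cited[s.Arrow] {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Invented sample data.
	trace := []ExecutionStep{
		{Specialist: "memory", Arrow: "pslist", Populated: []string{"processes"}, Duration: 40 * time.Millisecond},
		{Specialist: "disk", Arrow: "yara", Populated: []string{"yara_report"}, Duration: 90 * time.Millisecond},
	}
	f := Finding{Severity: "high", Claim: "suspicious binary on disk", Evidence: []string{"yara"}}
	for _, s := range stepsFor(f, trace) {
		fmt.Println(f.Severity, "<-", s.Arrow, "proposed by", s.Specialist, "filled", s.Populated)
	}
}
```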
| Layer | State |
|---|---|
| Typed arrows (sift/) | Real |
| Subprocess seam | Real |
| MCP tool exposure | Real |
| Coordinator loop | Real |
| Predicate DSL | Real |
| Termination conditions | Real |
| Specialist agents | Stubs |
| Critic agent | Stub |
| Arrow executor | Stub |
The stubs are deterministic Go that exercise every code path in the
loop. Phase 2 of the build replaces them with LLM-driven specialists
and critic, plus a real executor that dispatches ArrowName to the
concrete sift.* arrows.
For the hackathon, this is plain Go. For the long-term shape:
- Temporal would handle durable execution, retry-aware activities, workflow replay as native audit trail, and signals/queries for human-in-the-loop interaction. The coordinator's `ArrowExecutor` seam is the natural lift point: each arrow execution becomes a Temporal activity, the coordinator loop becomes the workflow body.
- Learned coordinator policy. The proposal/verdict/outcome triples produced by every run are training data. A future version replaces the heuristic critic with a model trained on the trace corpus to predict the next-best arrow given a bag state.
- Cross-case calibration. Specialists carry forward calibration scores between cases — agents whose expectations match reality more often get their proposals weighted higher in selection.
MIT. See LICENSE.