Ground design decisions in cited research — then verify the citations with a different model family before any of it becomes canon.
study-swarm is a protocol, not a tool. When you're making a substantial design decision with an LLM — a new product layer, an architecture choice, a "should we trust the model here" call — improvising from first principles ships designs that are stale, and citing papers from memory ships designs that rest on sources that don't exist or don't say what you think. study-swarm replaces both: dispatch parallel research agents, demand specific cited findings, and gate every citation through an external verifier of a different model family before it informs the design.
It applies its own medicine. The protocol prescribes verifier-protected envelopes for the systems it helps design — so it runs one on itself. No model grades its own homework, including the one running the protocol.
1. Identify 3–5 load-bearing design questions where empirical evidence would change the answer.
2. Dispatch one research agent per question, in parallel. Each must return paper titles, authors, years, URLs, and a one-sentence finding. Specificity beats breadth ("6–8 well-sourced findings beat 20 vague gestures").
3. Synthesize the findings into a Research grounding section, one entry per finding: `N. **<finding>.** <Authors> <year> (<arXiv/DOI>). <design implication>.`
4. Verify externally. A different model family, reasoning-stripped, checks every citation in two stages: a retrieval oracle confirms the paper exists (never the model's memory), then a groundedness lens confirms the finding matches the source. Halt on fabricated or misattributed citations; halt and escalate if the verifier or retrieval oracle is unavailable (never read absence as "citations fine").
5. Connect each architectural choice back to a finding by number. Citations without a design implication are noise.
The full executable detail — the halt table, the sourcing standard, the ensemble rule — is in PROTOCOL.md.
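The two-stage gate in Step 4 can be sketched in a few lines. This is a minimal illustration, not the code in PROTOCOL.md or prism-verify: the names `Citation`, `Verdict`, `gate`, and the stub oracle are hypothetical, the author comparison is a naive string match, and treating an ungrounded finding as a halt is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    PASS = "pass"
    HALT = "halt"                    # fabricated or misattributed citation
    ESCALATE = "halt-and-escalate"   # verifier or retrieval oracle unavailable


@dataclass
class Citation:
    identifier: str   # e.g. "arXiv:2201.11903"
    authors: str
    finding: str


# Hypothetical interfaces. The oracle returns the retrieved record's authors,
# None when nothing resolves, and raises ConnectionError when unreachable.
Oracle = Callable[[str], Optional[str]]
# The lens returns True when the finding matches the retrieved source.
Lens = Callable[[Citation], bool]


def gate(citation: Citation, oracle: Oracle, lens: Lens) -> Verdict:
    # Stage 1: existence, decided by retrieval rather than model memory.
    try:
        retrieved_authors = oracle(citation.identifier)
    except ConnectionError:
        return Verdict.ESCALATE      # never read absence as "citations fine"
    if retrieved_authors is None:
        return Verdict.HALT          # fabricated: the identifier does not resolve
    if retrieved_authors != citation.authors:
        return Verdict.HALT          # misattributed: naive exact-match comparison
    # Stage 2: groundedness, checked by a different model family.
    try:
        grounded = lens(citation)
    except ConnectionError:
        return Verdict.ESCALATE
    # Assumption: a finding the source does not support is also a halt.
    return Verdict.PASS if grounded else Verdict.HALT


# Stub oracle for illustration only; a real one would query arXiv or Crossref.
KNOWN = {"arXiv:2201.11903": "Wei et al."}


def stub_oracle(identifier: str) -> Optional[str]:
    return KNOWN.get(identifier)


def always_grounded(citation: Citation) -> bool:
    return True


trap = Citation("arXiv:2201.11903", "Nakamura & Olsen", "Chain-of-thought prompting")
print(gate(trap, stub_oracle, always_grounded).value)  # halt
```

The point of the shape is that an unreachable oracle or verifier produces its own verdict rather than falling through to a pass.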
Because the failure modes are documented, not hypothetical:
- LLMs can't reliably verify their own output. Huang et al. 2023 (arXiv:2310.01798); Kambhampati et al. 2024 (arXiv:2402.01817, LLM-Modulo); Stechly et al. 2024 (arXiv:2402.08115) — the external verifier carries the gains; the self-critique content is inert.
- Same-family judges self-prefer. Panickssery, Bowman & Feng 2024 (arXiv:2404.13076) — self-recognition correlates linearly with self-preference, so partial blinding doesn't help. Verga et al. 2024 (arXiv:2404.18796, PoLL) — a panel across disjoint families is less biased at ~7× lower cost.
- Citations are where LLMs lie. Walters & Wilder 2023 (doi:10.1038/s41598-023-41032-5) — 55% of GPT-3.5 / 18% of GPT-4 citations are fabricated. Onweller et al. 2026 (arXiv:2605.06635) — links resolve >94% of the time yet only 39–77% of cited content actually supports the claim. So existence must be checked by retrieval, not recall.
- Hide the generator's reasoning. Khalifa et al. 2026 (arXiv:2601.14691, "Gaming the Judge") found that manipulated chain-of-thought alone inflates a judge's false positives by up to 90% with actions held fixed. Turpin et al. 2023 (arXiv:2305.04388) showed that CoT explanations can be unfaithful, rationalizing an answer without reflecting what actually drove it. The verifier sees the bare citation claim, never the "why I included this."
- Diversity beats count. Rajan 2025 (arXiv:2511.16708) — four verifiers at pairwise correlation ρ ∈ [0.05, 0.25] beat any single one via submodular coverage. Kim et al. 2025 (arXiv:2506.07962) — LLM errors are correlated, so the load-bearing variable is lens diversity, not raw count.
As a test, the protocol was run against its own citations. Two decorrelated non-Claude families — Mistral (mistral-small:24b) and IBM Granite (granite4.1:30b) — checked a citation set, reasoning-stripped, seeded with two blind traps:
| Planted trap | Mistral | IBM Granite | Ground truth |
|---|---|---|---|
| Chain-of-thought prompting attributed to "Nakamura & Olsen" | missed | caught (misattributed → really Wei et al. 2022, arXiv:2201.11903) | misattributed |
| a fabricated "98% of errors removed, no oracle needed" paper | caught (fabricated) | caught (fabricated) | fabricated |
Mistral missed the misattribution on its own, while the union of the two families caught 2/2. Had Mistral been the only judge, the misattribution would have shipped. Separately, the retrieval oracle caught two real misattributions in our own design docs (papers cited under the wrong first author) that no parametric LLM could have flagged, and it correctly confirmed genuine 2026 papers that both LLMs false-flagged as fabricated simply because the papers postdate their training. That last point is the whole reason Step 4's existence check must be a retrieval oracle, never an LLM.
That single run is the thesis in miniature: decorrelated lenses + a retrieval oracle for existence beat any one smart judge.
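The union across verifiers can be tallied directly from the table. The sketch below is illustrative only: the trap keys and function names are hypothetical labels, and the verdicts are transcribed from the table above, with `None` standing for "missed".

```python
from typing import Dict, Optional, Set

# Verdicts transcribed from the table above. None means the verifier raised no flag.
# The trap keys are hypothetical labels chosen for this sketch.
VERDICTS: Dict[str, Dict[str, Optional[str]]] = {
    "Mistral": {
        "cot-misattribution": None,
        "fabricated-98-percent": "fabricated",
    },
    "IBM Granite": {
        "cot-misattribution": "misattributed",
        "fabricated-98-percent": "fabricated",
    },
}


def caught_by(verdicts: Dict[str, Dict[str, Optional[str]]]) -> Dict[str, Set[str]]:
    """Map each trap to the set of families that flagged it."""
    traps = {trap for results in verdicts.values() for trap in results}
    return {
        trap: {
            family
            for family, results in verdicts.items()
            if results.get(trap) is not None
        }
        for trap in traps
    }


def union_catches(verdicts: Dict[str, Dict[str, Optional[str]]]) -> int:
    """Count traps flagged by at least one verifier."""
    return sum(1 for families in caught_by(verdicts).values() if families)


print(union_catches(VERDICTS))                             # 2
print(union_catches({"Mistral": VERDICTS["Mistral"]}))     # 1
```

Running the same tally on a single family shows what a one-judge setup would have let through.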
You can run the protocol by hand — any different-family model plus resolving the arXiv/DOI yourself satisfies Step 4. Two sibling tools make it one command:
- prism-verify — the runtime verifier: family-different routing, reasoning-stripped, multi-lens adjudication, a deterministic retrieval existence floor (arXiv → Crossref), and signed receipts.
- role-os provides `roleos verify-citations <dispatch>`, the runner that extracts a dispatch's citations and gates them through prism.
The handoff is the dispatch format itself: a finding written as `N. **finding.** Authors year (arXiv|DOI). implication.`, with one resolvable identifier per finding, is exactly what `roleos verify-citations` lifts and gates. A lint-clean dispatch hands off cleanly; a malformed citation is what the runner flags as unparsed. That contract is what `study-swarm lint` checks locally, so Step 3 and Step 4 agree on what a citation is.
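To make the contract concrete, here is a rough sketch of lifting one finding line into its parts and turning the identifier into a resolvable URL. It is not the parser used by `roleos` or `study-swarm`: the regular expression, the `arXiv:`/`doi:` prefixes, and the function names are assumptions made for illustration, and the implication text in the example line is invented.

```python
import re
from typing import Dict, Optional

# Assumed shape: N. **finding.** Authors year (arXiv:ID or doi:ID). implication.
FINDING = re.compile(
    r"^(?P<n>\d+)\.\s+\*\*(?P<finding>.+?)\*\*\s+"
    r"(?P<authors>.+?)\s+(?P<year>\d{4})\s+"
    r"\((?P<identifier>(?:arXiv|doi):[^)\s]+)\)\.\s+"
    r"(?P<implication>.+)$",
    re.IGNORECASE,
)


def resolve(identifier: str) -> str:
    """Turn an arXiv or DOI identifier into the URL a retrieval step would fetch."""
    scheme, _, value = identifier.partition(":")
    if scheme.lower() == "arxiv":
        return f"https://arxiv.org/abs/{value}"
    return f"https://doi.org/{value}"


def parse_finding(line: str) -> Optional[Dict[str, str]]:
    """Return the finding's fields, or None when the line is unparsed."""
    match = FINDING.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["url"] = resolve(fields["identifier"])
    return fields


# Illustrative line; the implication sentence is made up for this sketch.
EXAMPLE = (
    "1. **LLMs cannot reliably verify their own output.** "
    "Huang et al. 2023 (arXiv:2310.01798). Put an external verifier in the loop."
)

parsed = parse_finding(EXAMPLE)
print(parsed["authors"], parsed["year"], parsed["url"])
print(parse_finding("Studies show that models hallucinate."))  # None
```

A line that returns `None` here corresponds to what the text above calls an unparsed citation.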
```sh
npm i -g @dogfood-lab/study-swarm   # or run ad-hoc: npx @dogfood-lab/study-swarm <command>
```

| Command | What it does |
|---|---|
| `study-swarm protocol` | Print the full protocol: the five steps, the halt table, the sourcing standard. |
| `study-swarm new <slug>` | Scaffold a `<slug>.dispatch.md` with the five-step skeleton to fill in. |
| `study-swarm lint [--json] <path…>` | Check a dispatch's Research grounding against the sourcing standard: every finding needs an author, a year, and a resolvable identifier (arXiv / DOI / URL); "studies show…" hand-waving is rejected. Exit 1 on violations, so it gates CI. A `<path>` may be a file, a directory (linted recursively for `*.dispatch.md`), or `-` for stdin; `--json` emits a machine-readable report. |
lint is deterministic — zero model calls — so it's safe in CI. It enforces Step 3's sourcing standard locally; the model-based Step 4 verification still defers to roleos verify-citations → prism.
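The kind of deterministic check described here can be approximated with a few regular expressions. The sketch below is not the `study-swarm lint` implementation: the patterns, the hand-waving phrase list (only "studies show" comes from the text above), and the function names are assumptions chosen to illustrate the rule that every finding needs an author, a year, and a resolvable identifier.

```python
import re
from typing import Iterable, List

# Heuristic: a capitalized surname (optionally "et al." or "& Surname") right before a year.
AUTHOR_YEAR = re.compile(
    r"[A-Z][A-Za-z'-]+(?: et al\.| & [A-Z][A-Za-z'-]+)? (?:19|20)\d{2}\b"
)
# Heuristic: an arXiv ID, a DOI, or a plain URL.
IDENTIFIER = re.compile(r"arXiv:\d{4}\.\d{4,5}|doi:10\.\S+|https?://\S+", re.IGNORECASE)
# "studies show" is quoted in the text above; the second phrase is an assumed extra.
HAND_WAVING = ("studies show", "research shows")


def violations(finding: str) -> List[str]:
    """List the sourcing problems found in one finding line (no model calls)."""
    problems = []
    if not AUTHOR_YEAR.search(finding):
        problems.append("no author and year")
    if not IDENTIFIER.search(finding):
        problems.append("no resolvable identifier")
    lowered = finding.lower()
    if any(phrase in lowered for phrase in HAND_WAVING):
        problems.append("hand-waving phrase")
    return problems


def exit_code(findings: Iterable[str]) -> int:
    """Mirror the documented behavior: 1 if any finding has a violation, else 0."""
    return 1 if any(violations(finding) for finding in findings) else 0


good = (
    "1. **LLMs cannot reliably verify their own output.** "
    "Huang et al. 2023 (arXiv:2310.01798). Put an external verifier in the loop."
)
bad = "2. **Studies show models hallucinate citations.** Use retrieval."

print(violations(good))
print(violations(bad))
print(exit_code([good, bad]))
```

Because every check is a pure string test, the same input always gives the same result, which is the property that makes a linter of this kind safe to run in CI.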
A typical loop:
```sh
study-swarm new my-decision                        # creates my-decision.dispatch.md
# …fill in the questions, run the research dispatch, write the findings…
study-swarm lint my-decision.dispatch.md           # enforce the sourcing standard (Step 3)
roleos verify-citations my-decision.dispatch.md    # model-based Step 4 (different family, via prism)
```

A complete, lint-clean dispatch (study-swarm applied to its own design) ships in `examples/study-swarm-self.dispatch.md` as a worked reference.
`lint` takes a file, a directory (linted recursively for `*.dispatch.md`), or `-` for stdin, and `--json` emits a machine-readable report. Drop this into your repo to gate every dispatch's sourcing on each PR (a copy-paste sample also lives in `examples/study-swarm-ci.yml`):
```yaml
# .github/workflows/dispatches.yml
name: study-swarm lint
on:
  pull_request:
    paths: ['**/*.dispatch.md', '.github/workflows/dispatches.yml']
  workflow_dispatch:
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npx @dogfood-lab/study-swarm@latest lint dispatches/
```

- Current: the field moves fast; demanding specific studies with years keeps designs from shipping 18 months behind.
- Functional: evidence shows what fails, not just what works (explanations can increase over-reliance on wrong AI; Bansal et al. 2021, arXiv:2006.14779).
- Safe: the verifier-protected envelope is the architecture the evidence supports, and the protocol enforces it on its own output.

Sourcing isn't academic theater; it's the evidence trail.
study-swarm ships a thin, zero-dependency CLI (study-swarm) alongside the methodology. It makes no network or model calls and collects no telemetry; there are no secrets or credentials in the source. At runtime it only reads the file you pass to lint and writes a single <slug>.dispatch.md in the current directory for new (refusing to overwrite, and never outside the working directory). The model-based verification the methodology describes (Step 4) is run by the sibling tools, not by this package. See SECURITY.md.
A working protocol, externally verified by its own machinery — a different model family checks its citations (see the proof above). This repo is the public reference; PROTOCOL.md is the executable shape. Part of the dogfood-lab family — methods and showcases for building in the AI era.
MIT licensed.
Built by MCP Tool Shop.
