protocol-to-ecrf

A Python pipeline that parses clinical trial protocol documents (PDF and DOCX) and automatically generates draft electronic Case Report Form (eCRF) templates for OpenClinica (CDISC ODM-XML 1.3.2) and REDCap (data dictionary CSV), with round-trip validation and a vanilla-JavaScript form preview.

This tool is assistive, not autonomous. Every output is marked draft and must be reviewed by a clinical data manager and validated against the source protocol before use in a live study.

Motivation

Building eCRFs from a protocol is slow, repetitive, and easy to get subtly wrong. A data specialist reads the protocol, extracts the data collection points, picks controlled terminologies, maps types, drafts the form in OpenClinica or REDCap, then hands it back for review. Automation can take the first pass and free the specialist to focus on judgement calls — but only if the output is safe by construction: traceable back to the source, conservatively typed, bound to real controlled terminology, and never trusted blindly.

protocol-to-ecrf is a small, opinionated tool that does that first pass for a defined subset of CDISC ODM 1.3.2 and the REDCap data dictionary format. What it does NOT do: replace a data manager, claim regulatory compliance, handle every ODM element, or hand you a ready-to-deploy CRF.

Pipeline

flowchart LR
    A[PDF / DOCX protocol] --> B[Text extractor<br/>pdfplumber / python-docx]
    B --> C[Section detector<br/>regex anchors]
    C --> D[Rule-based extractor<br/>+ optional Claude fallback]
    D --> E[Codelist matcher<br/>CDISC-CT / NCI / internal]
    E --> F{Emitters}
    F --> G[ODM-XML 1.3.2]
    F --> H[REDCap data dictionary CSV]
    F --> I[JSON intermediate]
    G --> J[Round-trip validator]
    H --> J
    J --> K[Static HTML form preview]
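
The section detector stage can be pictured with a minimal sketch like the one below. It is illustrative only: the heading phrases come from the Limitations section of this README, while the regular expressions, the `SECTION_ANCHORS` table, and the `detect_sections` function are hypothetical names, not the project's actual code.

```python
import re


def _anchor(phrase: str) -> re.Pattern[str]:
    # Match a whole heading line: optional section number, the phrase,
    # and an optional trailing colon. Body sentences do not match.
    return re.compile(
        rf"^\s*(?:\d+(?:\.\d+)*\.?\s+)?{phrase}\s*:?\s*$", re.IGNORECASE
    )


# Hypothetical anchor table; the real patterns may differ.
SECTION_ANCHORS = {
    "inclusion_criteria": _anchor(r"inclusion criteria"),
    "primary_endpoint": _anchor(r"primary endpoints?"),
    "schedule_of_assessments": _anchor(r"schedule of assessments"),
    "adverse_events": _anchor(r"adverse events?(?: reporting)?"),
}


def detect_sections(text: str) -> dict[str, str]:
    """Split protocol text into sections keyed by the anchor that opened them."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        matched = next(
            (name for name, pat in SECTION_ANCHORS.items() if pat.match(line)),
            None,
        )
        if matched is not None:
            # A heading line starts a new section and is not part of the body.
            current = matched
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}
```

Text that appears before the first recognised heading is dropped in this sketch, which mirrors why protocols without clear heading lines extract poorly.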

Installation

Requires Python 3.13+ and uv.

git clone https://github.com/siddh-m/protocol-to-ecrf.git
cd protocol-to-ecrf
uv sync

Usage

uv run protocol2ecrf parse path/to/protocol.pdf --out output/

Example:

$ uv run protocol2ecrf parse \
    tests/fixtures/protocols/synthetic/minimal_oncology.docx \
    --out examples/synthetic-oncology-phase2

WARNING: protocol-to-ecrf produces DRAFT CRF templates for human review.
Outputs MUST be reviewed by a clinical data manager and validated against
the source protocol before use in a live study. This tool is assistive,
not autonomous.

Parsed minimal_oncology.docx -> 14 fields (low-confidence: 13)
Output written to examples/synthetic-oncology-phase2

Three files land in the output directory:

File	What it is
`study.xml`	CDISC ODM-XML 1.3.2 CRF template (OpenClinica-compatible)
`redcap_dict.csv`	REDCap data dictionary (18 canonical columns)
`extracted.json`	Structured intermediate (full `ExtractionResult`)
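
For orientation, here is a minimal sketch of what writing one row of an 18-column REDCap data dictionary could look like. The header strings follow the standard REDCap data dictionary template as best recalled here and should be checked against an export from your own REDCap instance; `redcap_row` and `to_csv` are hypothetical helpers, not functions from this project.

```python
import csv
import io

# Assumed header row of the REDCap data dictionary template (18 columns).
REDCAP_COLUMNS = [
    "Variable / Field Name",
    "Form Name",
    "Section Header",
    "Field Type",
    "Field Label",
    "Choices, Calculations, OR Slider Labels",
    "Field Note",
    "Text Validation Type OR Show Slider Number",
    "Text Validation Min",
    "Text Validation Max",
    "Identifier?",
    "Branching Logic (Show field only if...)",
    "Required Field?",
    "Custom Alignment",
    "Question Number (surveys only)",
    "Matrix Group Name",
    "Matrix Ranking?",
    "Field Annotation",
]


def redcap_row(name, form, field_type, label, *, needs_review=False):
    """Build one dictionary row; unused columns stay empty strings."""
    row = dict.fromkeys(REDCAP_COLUMNS, "")
    row["Variable / Field Name"] = name
    row["Form Name"] = form
    row["Field Type"] = field_type
    row["Field Label"] = label
    if needs_review:
        # Review marker placed in the Field Annotation column.
        row["Field Annotation"] = "@NEEDS_REVIEW"
    return row


def to_csv(rows) -> str:
    """Serialise rows to CSV text with the 18-column header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=REDCAP_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Calling `to_csv([redcap_row("age", "demographics", "text", "Age (years)", needs_review=True)])` yields a header line plus one data row whose last column carries the review marker.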

Diff two protocol versions:

uv run protocol2ecrf diff old.pdf new.pdf --out diff-output/
# writes diff.json and diff.md
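
Conceptually, a protocol diff reduces to comparing the two sets of extracted fields. The sketch below shows one way to do that; the `diff_fields` function and the shape of its result are assumptions for illustration and do not describe the actual layout of diff.json.

```python
def diff_fields(old: dict[str, dict], new: dict[str, dict]) -> dict[str, list[str]]:
    """Compare two {field_name: attributes} mappings from two protocol versions."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # A field counts as changed when it exists in both versions
    # but any of its attributes differ.
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "changed": changed}
```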

Browse a committed example without installing anything: examples/synthetic-oncology-phase2/

Controlled Terminology

Every coded field in an emitted CRF is bound to one of eight canonical codelists, shipped in src/protocol_to_ecrf/terminology/catalogue.json:

OID	Name	Source
`CL.YES_NO`	Yes/No response	internal
`CL.SEX_AT_BIRTH`	Sex assigned at birth	CDISC-CT (C20197, C16576, C45908)
`CL.RACE_NIH`	NIH race categories	NIH OMB-2024
`CL.ETHNICITY_NIH`	NIH ethnicity categories	NIH OMB-2024
`CL.SEVERITY_CTCAE_V5`	CTCAE v5 severity grade	NCI-CTCAE 5.0
`CL.UNIT_COMMON`	Common clinical units	CDISC-CT
`CL.ROUTE_OF_ADMIN`	Route of administration	CDISC-CT
`CL.OUTCOME_GRADE`	Clinical outcome grade	internal (RECIST-flavoured)

NCI Thesaurus concept IDs are embedded as <Alias Context="CDISC-CT" Name="Cxxxxx"/> in the emitted ODM where known, so downstream consumers (and auditors) can resolve to the canonical terminology registry.
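
As a rough illustration of the Alias embedding, the sketch below builds an ODM CodeList with the standard library's ElementTree. The `build_codelist` helper, the coded values ("M", "F", "U"), and the pairing of each concept ID with a label are assumptions made for the example; check the IDs against the NCI Thesaurus before relying on them.

```python
import xml.etree.ElementTree as ET

ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"
ET.register_namespace("", ODM_NS)  # serialise ODM as the default namespace


def _tag(name: str) -> str:
    return f"{{{ODM_NS}}}{name}"


def build_codelist(oid, name, items):
    """items: iterable of (coded_value, decode, nci_code_or_None)."""
    codelist = ET.Element(_tag("CodeList"), {"OID": oid, "Name": name, "DataType": "text"})
    for coded_value, decode, nci_code in items:
        item = ET.SubElement(codelist, _tag("CodeListItem"), {"CodedValue": coded_value})
        text = ET.SubElement(ET.SubElement(item, _tag("Decode")), _tag("TranslatedText"))
        text.text = decode
        if nci_code is not None:
            # Only emit an Alias where a concept ID is actually known.
            ET.SubElement(item, _tag("Alias"), {"Context": "CDISC-CT", "Name": nci_code})
    return codelist


sex = build_codelist(
    "CL.SEX_AT_BIRTH",
    "Sex assigned at birth",
    [("M", "Male", "C20197"), ("F", "Female", "C16576"), ("U", "Unknown", None)],
)
xml_text = ET.tostring(sex, encoding="unicode")
```

Items without a known concept ID simply carry no Alias child, which keeps the output honest about what has and has not been mapped.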

Safety Posture

These rules are enforced in code, not just in documentation:

Outputs are always draft. Every emitted artifact carries a visible NEEDS_REVIEW sentinel — in the ODM <Annotation><Comment>, in the REDCap form_notes row, and as "status": "draft-needs-review" in the JSON intermediate. The CLI prints a prominent WARNING banner on stderr on every run unless explicitly suppressed with --quiet.
Low-confidence fields are flagged. Any ProtocolField with confidence below 0.8 gets needs_human_review = True, an <Alias Context="internal" Name="needs_review"/> child on its ODM ItemDef, and a @NEEDS_REVIEW marker in its REDCap Field Annotation column. Round-trip validators assert this propagation on every run.
LLM fallback may not fabricate. The optional Claude API extractor uses a system prompt that explicitly forbids inventing fields not present in the source text. Every LLM-sourced field is assigned confidence=0.6 (below the 0.8 review threshold) so it is always surfaced to a human before use.
Round-trip validation must pass. If the validator detects any parity violation between extracted fields and emitted artifacts, the CLI exits with code 2 and a list of failing fields on stderr.
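
The rules above can be sketched in a few lines. This is a simplified model under stated assumptions, not the project's code: the `ProtocolField` shown here has only the attributes needed for the example, and `from_llm`, `lost_review_flags`, and `exit_code` are hypothetical helpers. The 0.8 threshold, the 0.6 LLM confidence, and exit code 2 come from the text above.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.8  # fields strictly below this need human review
LLM_CONFIDENCE = 0.6    # fixed confidence for every LLM-sourced field


@dataclass
class ProtocolField:
    name: str
    data_type: str
    confidence: float
    source: str = "rule"

    @property
    def needs_human_review(self) -> bool:
        return self.confidence < REVIEW_THRESHOLD


def from_llm(name: str, data_type: str) -> ProtocolField:
    """LLM-sourced fields always land below the review threshold."""
    return ProtocolField(name, data_type, confidence=LLM_CONFIDENCE, source="llm")


def lost_review_flags(fields, emitted_flags: dict[str, bool]) -> list[str]:
    """Names of fields flagged for review whose marker is missing from an artifact."""
    return [
        f.name
        for f in fields
        if f.needs_human_review and not emitted_flags.get(f.name, False)
    ]


def exit_code(failing_fields: list[str]) -> int:
    """0 when parity holds, 2 when any field failed round-trip validation."""
    return 2 if failing_fields else 0
```

Here `emitted_flags` stands for whatever a validator reads back from an emitted artifact, for example whether an ItemDef carries the needs_review Alias.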

Development

uv run pytest -q                              # default offline suite (~60 tests)
uv run pytest -m llm                          # LLM fallback tests (stub LLM)
uv run pytest -m smoke                        # end-to-end smoke against real protocol
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/

Test markers

@pytest.mark.network — requires network access, auto-skipped by default
@pytest.mark.llm — exercises the LLM fallback path via a stub LLM, auto-skipped by default
@pytest.mark.smoke — end-to-end against a cached real public protocol, auto-skipped by default
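
One common way to get this opt-in behaviour is a `conftest.py` hook that skips marked tests unless their marker is named in the `-m` expression. The sketch below shows that pattern; it is an assumption about how such skipping could be wired up, not a copy of this repository's conftest, and `opted_in` is a hypothetical helper.

```python
import re

OPT_IN_MARKERS = ("network", "llm", "smoke")


def opted_in(marker_expr: str) -> set[str]:
    """Return the opt-in markers named anywhere in a -m expression."""
    tokens = set(re.findall(r"[A-Za-z_]\w*", marker_expr))
    return {m for m in OPT_IN_MARKERS if m in tokens}


def pytest_collection_modifyitems(config, items):
    import pytest  # deferred so the helper above is importable without pytest

    enabled = opted_in(config.getoption("markexpr", default="") or "")
    for item in items:
        for marker in OPT_IN_MARKERS:
            if marker in item.keywords and marker not in enabled:
                # Skipped on a default run; opt in with e.g. `pytest -m llm`.
                item.add_marker(pytest.mark.skip(reason=f"opt in with -m {marker}"))
```

The token match is deliberately naive: an expression such as `not llm` also names the marker, but pytest deselects those tests itself in that case, so nothing runs unintentionally.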

Limitations

ODM subset scope. The emitter produces exactly seven ODM element types in v1: Study, MetaDataVersion, StudyEventDef, FormDef, ItemGroupDef, ItemDef, CodeList. Clinical data, audit trails, admin data, and reference data are out of scope; they are listed here so the gaps are not a surprise.
Protocol format assumptions. The section detector expects identifiable heading lines — "Inclusion Criteria", "Primary Endpoint", "Schedule of Assessments", "Adverse Event Reporting" or close variants. Protocols that bury these in free prose will extract poorly.
LLM confidence caveat. LLM-sourced fields are assigned confidence 0.6 by design, always surfaced as needs_human_review, and bounded by a prompt that forbids fabrication. No guardrail is perfect, so review everything the LLM touched.
Not validated for regulatory submission. This tool is a research aid. Do not use its outputs in an IRB submission, an FDA filing, or a live study without running them past a clinical data manager and (separately) the CRF validation process your unit already uses.

Failure walkthrough

If a rule-based type inference misfires (say, it reads "aged 18-75" and types the field as text instead of integer), the round-trip validator re-parses the emitted ODM, sees DataType="text" on an ItemDef whose source field was marked integer, and returns passed=False with a type_mismatches entry. The CLI exits with code 2 and surfaces the failing field on stderr. The reviewer then either:

fixes the source text so the rule fires correctly, or
re-runs with --use-llm to let Claude re-extract that section, or
hand-edits the emitted ODM and documents the change in their CRF log.

This is the intended workflow: the tool does not try to be clever; it fails loudly.
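
The type check in this walkthrough can be sketched as follows. It is a minimal illustration, not the project's validator: `validate_types`, `RoundTripResult`, and the `expected` mapping of ItemDef OID to source type are hypothetical names; only `passed` and `type_mismatches` are taken from the description above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

ODM_NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}


@dataclass
class RoundTripResult:
    passed: bool
    type_mismatches: list[dict] = field(default_factory=list)


def validate_types(odm_xml: str, expected: dict[str, str]) -> RoundTripResult:
    """Re-parse emitted ODM and compare each ItemDef DataType to the source type."""
    root = ET.fromstring(odm_xml)
    emitted = {
        el.get("OID"): el.get("DataType")
        for el in root.iterfind(".//odm:ItemDef", ODM_NS)
    }
    mismatches = [
        {"oid": oid, "expected": want, "emitted": emitted.get(oid)}
        for oid, want in expected.items()
        if emitted.get(oid) != want  # also catches ItemDefs missing entirely
    ]
    return RoundTripResult(passed=not mismatches, type_mismatches=mismatches)


EMITTED = (
    '<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"><Study OID="S.1">'
    '<MetaDataVersion OID="MDV.1" Name="draft">'
    '<ItemDef OID="IT.AGE" Name="Age" DataType="text"/>'
    "</MetaDataVersion></Study></ODM>"
)
result = validate_types(EMITTED, {"IT.AGE": "integer"})
```

In this example `result.passed` is False and `result.type_mismatches` names IT.AGE, which is the condition under which a CLI would exit non-zero.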

License

MIT
