A Python pipeline that parses clinical trial protocol documents (PDF and DOCX) and automatically generates draft electronic Case Report Form (eCRF) templates for OpenClinica (CDISC ODM-XML 1.3.2) and REDCap (data dictionary CSV), with round-trip validation and a vanilla-JavaScript form preview.
This tool is assistive, not autonomous. Every output is marked draft and must be reviewed by a clinical data manager and validated against the source protocol before use in a live study.
Building eCRFs from a protocol is slow, repetitive, and easy to get subtly wrong. A data specialist reads the protocol, extracts the data collection points, picks controlled terminologies, maps types, drafts the form in OpenClinica or REDCap, then hands it back for review. Automation can take the first pass and free the specialist to focus on judgement calls — but only if the output is safe by construction: traceable back to the source, conservatively typed, bound to real controlled terminology, and never trusted blindly.
protocol-to-ecrf is a small, opinionated tool that does that first pass
for a defined subset of CDISC ODM 1.3.2 and the REDCap data dictionary
format. What it does NOT do: replace a data manager, claim regulatory
compliance, handle every ODM element, or hand you a ready-to-deploy CRF.
flowchart LR
A[PDF / DOCX protocol] --> B[Text extractor<br/>pdfplumber / python-docx]
B --> C[Section detector<br/>regex anchors]
C --> D[Rule-based extractor<br/>+ optional Claude fallback]
D --> E[Codelist matcher<br/>CDISC-CT / NCI / internal]
E --> F{Emitters}
F --> G[ODM-XML 1.3.2]
F --> H[REDCap data dictionary CSV]
F --> I[JSON intermediate]
G --> J[Round-trip validator]
H --> J
J --> K[Static HTML form preview]
Requires Python 3.13+ and uv.
git clone https://github.com/siddh-m/protocol-to-ecrf.git
cd protocol-to-ecrf
uv sync
uv run protocol2ecrf parse path/to/protocol.pdf --out output/
Example:
$ uv run protocol2ecrf parse \
tests/fixtures/protocols/synthetic/minimal_oncology.docx \
--out examples/synthetic-oncology-phase2
WARNING: protocol-to-ecrf produces DRAFT CRF templates for human review.
Outputs MUST be reviewed by a clinical data manager and validated against
the source protocol before use in a live study. This tool is assistive,
not autonomous.
Parsed minimal_oncology.docx -> 14 fields (low-confidence: 13)
Output written to examples/synthetic-oncology-phase2
Three files land in the output directory:
| File | What it is |
|---|---|
| `study.xml` | CDISC ODM-XML 1.3.2 CRF template (OpenClinica-compatible) |
| `redcap_dict.csv` | REDCap data dictionary (18 canonical columns) |
| `extracted.json` | Structured intermediate (full `ExtractionResult`) |
Diff two protocol versions:
uv run protocol2ecrf diff old.pdf new.pdf --out diff-output/
# writes diff.json and diff.md
Browse a committed example without installing anything:
examples/synthetic-oncology-phase2/
Every emitted CRF binds its fields to one of eight canonical codelists,
shipped in src/protocol_to_ecrf/terminology/catalogue.json:
| OID | Name | Source |
|---|---|---|
| `CL.YES_NO` | Yes/No response | internal |
| `CL.SEX_AT_BIRTH` | Sex assigned at birth | CDISC-CT (C20197, C16576, C45908) |
| `CL.RACE_NIH` | NIH race categories | NIH OMB-2024 |
| `CL.ETHNICITY_NIH` | NIH ethnicity categories | NIH OMB-2024 |
| `CL.SEVERITY_CTCAE_V5` | CTCAE v5 severity grade | NCI-CTCAE 5.0 |
| `CL.UNIT_COMMON` | Common clinical units | CDISC-CT |
| `CL.ROUTE_OF_ADMIN` | Route of administration | CDISC-CT |
| `CL.OUTCOME_GRADE` | Clinical outcome grade | internal (RECIST-flavoured) |
NCI Thesaurus concept IDs are embedded as `<Alias Context="CDISC-CT" Name="Cxxxxx"/>`
in the emitted ODM where known, so downstream consumers (and auditors)
can resolve to the canonical terminology registry.
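
As a minimal sketch of how a downstream consumer might collect those concept IDs, the snippet below parses a hand-written ODM fragment with the standard library and lists the `CDISC-CT` aliases per codelist. The fragment, its `CodedValue` values, and the helper name `collect_concept_ids` are illustrative assumptions, not the tool's actual output or API.

```python
import xml.etree.ElementTree as ET

# ODM 1.3 default namespace, needed for every element lookup.
ODM_NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

# Hand-written fragment for illustration only; a real emitted file has more
# elements, and the coded values here are assumptions.
SAMPLE = """\
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="ST.DEMO">
    <MetaDataVersion OID="MDV.1" Name="draft">
      <CodeList OID="CL.SEX_AT_BIRTH" Name="Sex assigned at birth" DataType="text">
        <CodeListItem CodedValue="M">
          <Decode><TranslatedText>Male</TranslatedText></Decode>
          <Alias Context="CDISC-CT" Name="C20197"/>
        </CodeListItem>
        <CodeListItem CodedValue="F">
          <Decode><TranslatedText>Female</TranslatedText></Decode>
          <Alias Context="CDISC-CT" Name="C16576"/>
        </CodeListItem>
      </CodeList>
      <CodeList OID="CL.YES_NO" Name="Yes/No response" DataType="text">
        <CodeListItem CodedValue="Y">
          <Decode><TranslatedText>Yes</TranslatedText></Decode>
        </CodeListItem>
      </CodeList>
    </MetaDataVersion>
  </Study>
</ODM>
"""


def collect_concept_ids(odm_xml: str) -> dict[str, list[str]]:
    """Map each CodeList OID to the CDISC-CT concept IDs found beneath it."""
    root = ET.fromstring(odm_xml)
    result: dict[str, list[str]] = {}
    for codelist in root.iterfind(".//odm:CodeList", ODM_NS):
        aliases = codelist.iterfind(".//odm:Alias[@Context='CDISC-CT']", ODM_NS)
        result[codelist.get("OID")] = [alias.get("Name") for alias in aliases]
    return result


concept_ids = collect_concept_ids(SAMPLE)
```

Codelists from an internal source simply come back with an empty list, which makes it easy to see which ones cannot be resolved against an external registry.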
These rules are enforced in code, not just in documentation:
- Outputs are always draft. Every emitted artifact carries a visible `NEEDS_REVIEW` sentinel: in the ODM `<Annotation><Comment>`, in the REDCap `form_notes` row, and as `"status": "draft-needs-review"` in the JSON intermediate. The CLI prints a prominent WARNING banner on stderr on every run unless explicitly suppressed with `--quiet`.
- Low-confidence fields are flagged. Any `ProtocolField` with confidence below 0.8 gets `needs_human_review = True`, an `<Alias Context="internal" Name="needs_review"/>` child on its ODM `ItemDef`, and a `@NEEDS_REVIEW` marker in its REDCap `Field Annotation` column. Round-trip validators assert this propagation on every run.
- LLM fallback may not fabricate. The optional Claude API extractor uses a system prompt that explicitly forbids inventing fields not present in the source text. Every LLM-sourced field is assigned `confidence=0.6` (below the 0.8 review threshold) so it is always surfaced to a human before use.
- Round-trip validation must pass. If the validator detects any parity violation between extracted fields and emitted artifacts, the CLI exits with code `2` and a list of failing fields on stderr.
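
The threshold and propagation rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the `ProtocolField` here is a stripped-down stand-in (the real class stores `needs_human_review` as an attribute, while the sketch derives it), and `redcap_annotation` and `unflagged_fields` are hypothetical helper names.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.8  # review threshold stated in the rules above
LLM_CONFIDENCE = 0.6    # fixed confidence for LLM-sourced fields, per the rules above


@dataclass
class ProtocolField:
    """Hypothetical, simplified stand-in for the tool's ProtocolField."""

    name: str
    data_type: str
    confidence: float

    @property
    def needs_human_review(self) -> bool:
        # Strictly below the threshold: a field at exactly 0.8 is not flagged.
        return self.confidence < REVIEW_THRESHOLD


def redcap_annotation(field: ProtocolField) -> str:
    """Value for the REDCap 'Field Annotation' column of this field."""
    return "@NEEDS_REVIEW" if field.needs_human_review else ""


def unflagged_fields(fields: list[ProtocolField], annotations: dict[str, str]) -> list[str]:
    """Names of review-worthy fields whose marker is missing from the emitted annotations."""
    return [
        f.name
        for f in fields
        if f.needs_human_review and "@NEEDS_REVIEW" not in annotations.get(f.name, "")
    ]


fields = [
    ProtocolField("age", "integer", 0.95),
    ProtocolField("ecog_status", "integer", LLM_CONFIDENCE),  # LLM-sourced
]
annotations = {f.name: redcap_annotation(f) for f in fields}
failures = unflagged_fields(fields, annotations)
exit_code = 2 if failures else 0  # mirrors the CLI's exit code on a parity violation
```

Because the LLM confidence sits below the threshold by construction, an LLM-sourced field can never reach an emitter without the review marker unless the propagation check itself fails.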
uv run pytest -q # default offline suite (~60 tests)
uv run pytest -m llm # LLM fallback tests (stub LLM)
uv run pytest -m smoke # end-to-end smoke against real protocol
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/
- `@pytest.mark.network`: requires network access, auto-skipped by default
- `@pytest.mark.llm`: exercises the LLM fallback path via a stub LLM, auto-skipped by default
- `@pytest.mark.smoke`: end-to-end against a cached real public protocol, auto-skipped by default
- ODM subset scope. The emitter produces exactly seven ODM element types in v1: `Study`, `MetaDataVersion`, `StudyEventDef`, `FormDef`, `ItemGroupDef`, `ItemDef`, `CodeList`. Clinical data, audit trails, admin data, and reference data are out of scope and documented here so you're not surprised by what's missing.
- Protocol format assumptions. The section detector expects identifiable heading lines: "Inclusion Criteria", "Primary Endpoint", "Schedule of Assessments", "Adverse Event Reporting" or close variants. Protocols that bury these in free prose will extract poorly.
- LLM confidence caveat. LLM-sourced fields are confidence 0.6 by design, always surfaced as `needs_human_review`, and bounded by a prompt that forbids fabrication, but no guardrail is perfect: review everything the LLM touched.
- Not validated for regulatory submission. This tool is a research aid. Do not use its outputs in an IRB submission, an FDA filing, or a live study without running them past a clinical data manager and (separately) the CRF validation process your unit already uses.
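
To make the heading-line assumption concrete, here is a small sketch of regex-anchored section detection. The patterns, the section keys, and the `detect_sections` helper are assumptions for illustration; the tool's actual anchors may differ.

```python
import re

# Optional numbering such as "5", "5.1" or "7." before the heading text.
_NUMBER_PREFIX = r"(?:\d+(?:\.\d+)*\.?\s+)?"


def _anchor(phrase: str) -> re.Pattern[str]:
    # The whole (stripped) line must be the heading, optionally ending in a colon.
    return re.compile(rf"^{_NUMBER_PREFIX}{phrase}\s*:?$", re.IGNORECASE)


# Hypothetical anchors covering the headings named above and close variants.
SECTION_ANCHORS = {
    "inclusion_criteria": _anchor(r"inclusion\s+criteria"),
    "primary_endpoint": _anchor(r"primary\s+endpoints?"),
    "schedule_of_assessments": _anchor(r"schedule\s+of\s+(?:assessments|activities)"),
    "adverse_events": _anchor(r"adverse\s+events?(?:\s+reporting)?"),
}


def detect_sections(text: str) -> dict[str, str]:
    """Group body lines under the most recent recognised heading line."""
    sections: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = next(
            (key for key, pattern in SECTION_ANCHORS.items() if pattern.match(line)),
            None,
        )
        if heading is not None:
            current = heading
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return {key: "\n".join(body) for key, body in sections.items()}


sample = """5.1 Inclusion Criteria
Age 18-75 years
ECOG 0-1
7. Schedule of Assessments:
Vital signs at each visit
Patients must meet the inclusion criteria listed elsewhere.
"""
detected = detect_sections(sample)
```

The last line of the sample mentions inclusion criteria in running prose, so it is treated as body text of the current section rather than as a heading, which is the failure mode the format assumption describes.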
If a rule-based type inference misfires — say, it reads "aged 18-75" and
decides the field is a text instead of an integer — the round-trip
validator re-parses the emitted ODM, sees a DataType="text" on an
ItemDef whose source field was marked integer, and returns
passed=False with a type_mismatches entry. The CLI exits 2 and the
failing field is surfaced on stderr. The reviewer then either:
- fixes the source text so the rule fires correctly, or
- re-runs with `--use-llm` to let Claude re-extract that section, or
- hand-edits the emitted ODM and documents the change in their CRF log.
This is the intended workflow: the tool does not try to be clever, and it fails loudly.
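
A rough sketch of the type check described above, assuming the validator compares each source field's type with the `DataType` attribute re-parsed from the emitted ODM. `RoundTripResult`, `check_types`, the OIDs, and the sample fragment are hypothetical; only `passed`, `type_mismatches`, and exit code 2 come from the description.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

ODM_NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}


@dataclass
class RoundTripResult:
    """Hypothetical result shape; only the two field names come from the text above."""

    passed: bool
    type_mismatches: list[str] = field(default_factory=list)


def check_types(expected: dict[str, str], odm_xml: str) -> RoundTripResult:
    """Compare expected data types (keyed by ItemDef OID) with the emitted ODM."""
    root = ET.fromstring(odm_xml)
    emitted = {
        item.get("OID"): item.get("DataType")
        for item in root.iterfind(".//odm:ItemDef", ODM_NS)
    }
    mismatches = [
        f"{oid}: expected {want}, emitted {emitted.get(oid)}"
        for oid, want in expected.items()
        if emitted.get(oid) != want
    ]
    return RoundTripResult(passed=not mismatches, type_mismatches=mismatches)


# Illustrative fragment in which the age field was wrongly emitted as text.
EMITTED = """\
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="ST.DEMO">
    <MetaDataVersion OID="MDV.1" Name="draft">
      <ItemDef OID="IT.AGE" Name="Age" DataType="text"/>
      <ItemDef OID="IT.ECOG" Name="ECOG status" DataType="integer"/>
    </MetaDataVersion>
  </Study>
</ODM>
"""

result = check_types({"IT.AGE": "integer", "IT.ECOG": "integer"}, EMITTED)
exit_code = 0 if result.passed else 2  # the CLI would also list the failures on stderr
```

An `ItemDef` that is missing from the emitted file is reported as a mismatch too, since its emitted type comes back as `None`.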
MIT