protocol-to-ecrf

A Python pipeline that parses clinical trial protocol documents (PDF and DOCX) and automatically generates draft electronic Case Report Form (eCRF) templates for OpenClinica (CDISC ODM-XML 1.3.2) and REDCap (data dictionary CSV), with round-trip validation and a vanilla-JavaScript form preview.

This tool is assistive, not autonomous. Every output is marked draft and must be reviewed by a clinical data manager and validated against the source protocol before use in a live study.

Motivation

Building eCRFs from a protocol is slow, repetitive, and easy to get subtly wrong. A data specialist reads the protocol, extracts the data collection points, picks controlled terminologies, maps types, drafts the form in OpenClinica or REDCap, then hands it back for review. Automation can take the first pass and free the specialist to focus on judgement calls — but only if the output is safe by construction: traceable back to the source, conservatively typed, bound to real controlled terminology, and never trusted blindly.

protocol-to-ecrf is a small, opinionated tool that does that first pass for a defined subset of CDISC ODM 1.3.2 and the REDCap data dictionary format. What it does NOT do: replace a data manager, claim regulatory compliance, handle every ODM element, or hand you a ready-to-deploy CRF.

Pipeline

flowchart LR
    A[PDF / DOCX protocol] --> B[Text extractor<br/>pdfplumber / python-docx]
    B --> C[Section detector<br/>regex anchors]
    C --> D[Rule-based extractor<br/>+ optional Claude fallback]
    D --> E[Codelist matcher<br/>CDISC-CT / NCI / internal]
    E --> F{Emitters}
    F --> G[ODM-XML 1.3.2]
    F --> H[REDCap data dictionary CSV]
    F --> I[JSON intermediate]
    G --> J[Round-trip validator]
    H --> J
    J --> K[Static HTML form preview]

Installation

Requires Python 3.13+ and uv.

git clone https://github.com/siddh-m/protocol-to-ecrf.git
cd protocol-to-ecrf
uv sync

Usage

uv run protocol2ecrf parse path/to/protocol.pdf --out output/

Example:

$ uv run protocol2ecrf parse \
    tests/fixtures/protocols/synthetic/minimal_oncology.docx \
    --out examples/synthetic-oncology-phase2

WARNING: protocol-to-ecrf produces DRAFT CRF templates for human review.
Outputs MUST be reviewed by a clinical data manager and validated against
the source protocol before use in a live study. This tool is assistive,
not autonomous.

Parsed minimal_oncology.docx -> 14 fields (low-confidence: 13)
Output written to examples/synthetic-oncology-phase2

Three files land in the output directory:

| File | What it is |
| --- | --- |
| study.xml | CDISC ODM-XML 1.3.2 CRF template (OpenClinica-compatible) |
| redcap_dict.csv | REDCap data dictionary (18 canonical columns) |
| extracted.json | Structured intermediate (full ExtractionResult) |
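
A quick sanity check on the emitted REDCap dictionary might look like the sketch below. It assumes the CSV uses REDCap's standard upload header labels ("Variable / Field Name", "Field Annotation") and the @NEEDS_REVIEW marker described under Safety Posture; the function name and return shape are hypothetical and not part of the tool.

```python
import csv
import io

# Assumption: the dictionary uses REDCap's standard upload header labels.
VARIABLE_COLUMN = "Variable / Field Name"
ANNOTATION_COLUMN = "Field Annotation"
REVIEW_MARKER = "@NEEDS_REVIEW"
EXPECTED_COLUMN_COUNT = 18  # the README states 18 canonical columns


def summarize_redcap_dictionary(csv_text: str) -> dict:
    """Report the column count and the variables flagged for human review."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    # Collect every variable whose annotation carries the review marker.
    flagged = [
        row[VARIABLE_COLUMN]
        for row in reader
        if REVIEW_MARKER in (row.get(ANNOTATION_COLUMN) or "")
    ]
    return {
        "column_count": len(columns),
        "has_expected_columns": len(columns) == EXPECTED_COLUMN_COUNT,
        "flagged": flagged,
    }
```

To run it against a real output, read the file first, for example `summarize_redcap_dictionary(Path("output/redcap_dict.csv").read_text(encoding="utf-8-sig"))` after importing `Path` from `pathlib`.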

Diff two protocol versions:

uv run protocol2ecrf diff old.pdf new.pdf --out diff-output/
# writes diff.json and diff.md

Browse a committed example without installing anything: examples/synthetic-oncology-phase2/

Controlled Terminology

Every emitted CRF binds its fields to one of eight canonical codelists, shipped in src/protocol_to_ecrf/terminology/catalogue.json:

| OID | Name | Source |
| --- | --- | --- |
| CL.YES_NO | Yes/No response | internal |
| CL.SEX_AT_BIRTH | Sex assigned at birth | CDISC-CT (C20197, C16576, C45908) |
| CL.RACE_NIH | NIH race categories | NIH OMB-2024 |
| CL.ETHNICITY_NIH | NIH ethnicity categories | NIH OMB-2024 |
| CL.SEVERITY_CTCAE_V5 | CTCAE v5 severity grade | NCI-CTCAE 5.0 |
| CL.UNIT_COMMON | Common clinical units | CDISC-CT |
| CL.ROUTE_OF_ADMIN | Route of administration | CDISC-CT |
| CL.OUTCOME_GRADE | Clinical outcome grade | internal (RECIST-flavoured) |

NCI Thesaurus concept IDs are embedded as <Alias Context="CDISC-CT" Name="Cxxxxx"/> in the emitted ODM where known, so downstream consumers (and auditors) can resolve to the canonical terminology registry.
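
As an illustration of how a downstream consumer could resolve those aliases, the sketch below reads the concept IDs back out of a codelist using only the standard library. The XML fragment is hand-written for this example, not copied from the tool's output; the coded values "M" and "F" and the placement of each Alias under its CodeListItem are assumptions.

```python
import xml.etree.ElementTree as ET

ODM_URI = "http://www.cdisc.org/ns/odm/v1.3"  # ODM 1.3.x namespace
NS = {"odm": ODM_URI}

# Hand-written fragment for illustration only (coded values are assumed).
SAMPLE_CODELIST = f"""
<CodeList xmlns="{ODM_URI}" OID="CL.SEX_AT_BIRTH"
          Name="Sex assigned at birth" DataType="text">
  <CodeListItem CodedValue="M">
    <Decode><TranslatedText>Male</TranslatedText></Decode>
    <Alias Context="CDISC-CT" Name="C20197"/>
  </CodeListItem>
  <CodeListItem CodedValue="F">
    <Decode><TranslatedText>Female</TranslatedText></Decode>
    <Alias Context="CDISC-CT" Name="C16576"/>
  </CodeListItem>
</CodeList>
"""


def concept_ids(codelist_xml: str, context: str = "CDISC-CT") -> dict[str, str]:
    """Map each CodedValue to the concept ID in its Alias for `context`."""
    root = ET.fromstring(codelist_xml.strip())
    mapping: dict[str, str] = {}
    for item in root.findall("odm:CodeListItem", NS):
        for alias in item.findall("odm:Alias", NS):
            if alias.get("Context") == context:
                mapping[item.get("CodedValue")] = alias.get("Name")
    return mapping
```

Calling `concept_ids(SAMPLE_CODELIST)` on the fragment above yields a mapping from coded value to NCI concept ID, which can then be looked up in the terminology registry.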

Safety Posture

These rules are enforced in code, not just in documentation:

  1. Outputs are always draft. Every emitted artifact carries a visible NEEDS_REVIEW sentinel — in the ODM <Annotation><Comment>, in the REDCap form_notes row, and as "status": "draft-needs-review" in the JSON intermediate. The CLI prints a prominent WARNING banner on stderr on every run unless explicitly suppressed with --quiet.

  2. Low-confidence fields are flagged. Any ProtocolField with confidence below 0.8 gets needs_human_review = True, an <Alias Context="internal" Name="needs_review"/> child on its ODM ItemDef, and a @NEEDS_REVIEW marker in its REDCap Field Annotation column. Round-trip validators assert this propagation on every run.

  3. LLM fallback may not fabricate. The optional Claude API extractor uses a system prompt that explicitly forbids inventing fields not present in the source text. Every LLM-sourced field is assigned confidence=0.6 (below the 0.8 review threshold) so it is always surfaced to a human before use.

  4. Round-trip validation must pass. If the validator detects any parity violation between extracted fields and emitted artifacts, the CLI exits with code 2 and a list of failing fields on stderr.
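
Rules 2 and 3 can be pictured with the minimal sketch below. The ProtocolField name, the confidence and needs_human_review attributes, and the 0.8 and 0.6 values come from the rules above; the source attribute, the constant names, and the dataclass layout are assumptions made for illustration and may not match the real model.

```python
from dataclasses import dataclass, field

REVIEW_THRESHOLD = 0.8  # rule 2: anything below this needs human review
LLM_CONFIDENCE = 0.6    # rule 3: fixed confidence for LLM-sourced fields


@dataclass
class ProtocolField:
    name: str
    confidence: float
    source: str = "rule"  # hypothetical attribute: "rule" or "llm"
    needs_human_review: bool = field(init=False)

    def __post_init__(self) -> None:
        # LLM-sourced fields never keep a caller-supplied confidence.
        if self.source == "llm":
            self.confidence = LLM_CONFIDENCE
        # Strictly below the threshold triggers review.
        self.needs_human_review = self.confidence < REVIEW_THRESHOLD
```

Because the LLM confidence is pinned below the threshold, an LLM-sourced field in this sketch can never skip review, whatever value the caller passes in.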

Development

uv run pytest -q                              # default offline suite (~60 tests)
uv run pytest -m llm                          # LLM fallback tests (stub LLM)
uv run pytest -m smoke                        # end-to-end smoke against real protocol
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/

Test markers

  • @pytest.mark.network — requires network access, auto-skipped by default
  • @pytest.mark.llm — exercises the LLM fallback path via a stub LLM, auto-skipped by default
  • @pytest.mark.smoke — end-to-end against a cached real public protocol, auto-skipped by default

Limitations

  • ODM subset scope. The emitter produces exactly seven ODM element types in v1: Study, MetaDataVersion, StudyEventDef, FormDef, ItemGroupDef, ItemDef, CodeList. Clinical data, audit trails, admin data, and reference data are out of scope and documented here so you're not surprised by what's missing.
  • Protocol format assumptions. The section detector expects identifiable heading lines — "Inclusion Criteria", "Primary Endpoint", "Schedule of Assessments", "Adverse Event Reporting" or close variants. Protocols that bury these in free prose will extract poorly.
  • LLM confidence caveat. LLM-sourced fields are assigned confidence 0.6 by design, are always surfaced as needs_human_review, and are bounded by a prompt that forbids fabrication. No guardrail is perfect, though, so review everything the LLM touched.
  • Not validated for regulatory submission. This tool is a research aid. Do not use its outputs in an IRB submission, an FDA filing, or a live study without running them past a clinical data manager and (separately) the CRF validation process your unit already uses.
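
To make the protocol format assumption concrete, here is a rough sketch of heading-anchored section detection. The patterns, dictionary keys, and function name are hypothetical and are not the tool's actual anchors; the point is only that a heading must sit on its own line to be recognised.

```python
import re

# Optional section number such as "5", "5.1" or "5.1." before the heading.
_NUM = r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"
_END = r"\s*:?\s*$"

# Hypothetical anchors; each must match a whole line, not a phrase in prose.
SECTION_ANCHORS = {
    "inclusion_criteria": re.compile(_NUM + r"inclusion\s+criteria" + _END, re.I),
    "primary_endpoint": re.compile(_NUM + r"primary\s+endpoints?" + _END, re.I),
    "schedule_of_assessments": re.compile(
        _NUM + r"schedule\s+of\s+assessments" + _END, re.I
    ),
    "adverse_events": re.compile(
        _NUM + r"adverse\s+events?(?:\s+reporting)?" + _END, re.I
    ),
}


def detect_sections(text: str) -> dict[str, str]:
    """Split text into sections keyed by the heading anchor that opened them."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        matched = next(
            (key for key, pat in SECTION_ANCHORS.items() if pat.match(line)), None
        )
        if matched is not None:
            current = matched  # a heading line starts a new section
            sections[current] = []
        elif current is not None:
            sections[current].append(line)  # body line of the open section
    return {key: "\n".join(body).strip() for key, body in sections.items()}
```

A sentence such as "Patients who meet the inclusion criteria above are eligible" matches no anchor in this sketch, which is why protocols that bury section titles in running prose extract poorly.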

Failure walkthrough

If rule-based type inference misfires (say, it reads "aged 18-75" and emits the field as text instead of integer), the round-trip validator re-parses the emitted ODM, sees DataType="text" on an ItemDef whose source field was marked integer, and returns passed=False with a type_mismatches entry. The CLI exits with code 2 and the failing field is surfaced on stderr. The reviewer then either:

  1. fixes the source text so the rule fires correctly, or
  2. re-runs with --use-llm to let Claude re-extract that section, or
  3. hand-edits the emitted ODM and documents the change in their CRF log.

This is the intended workflow: rather than trying to be clever, the tool fails loudly.
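
The type check in that walkthrough could be sketched roughly as follows. The passed and type_mismatches names come from the description above; the RoundTripResult class, the check_item_types function, the OID values, and the shape of each mismatch entry are assumptions for illustration, not the validator's real interface.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

ODM_URI = "http://www.cdisc.org/ns/odm/v1.3"
ITEMDEF_TAG = f"{{{ODM_URI}}}ItemDef"  # namespace-qualified tag name


@dataclass
class RoundTripResult:
    passed: bool
    type_mismatches: list[dict] = field(default_factory=list)


def check_item_types(odm_xml: str, expected_types: dict[str, str]) -> RoundTripResult:
    """Re-parse emitted ODM and compare each ItemDef DataType to the source."""
    root = ET.fromstring(odm_xml.strip())
    emitted = {
        item.get("OID"): item.get("DataType") for item in root.iter(ITEMDEF_TAG)
    }
    # A missing ItemDef shows up as emitted=None and also counts as a mismatch.
    mismatches = [
        {"oid": oid, "expected": expected, "emitted": emitted.get(oid)}
        for oid, expected in expected_types.items()
        if emitted.get(oid) != expected
    ]
    return RoundTripResult(passed=not mismatches, type_mismatches=mismatches)


# Hand-written ODM fragment with a hypothetical OID, for illustration only.
SAMPLE_ODM = f"""
<ODM xmlns="{ODM_URI}">
  <Study OID="ST.DEMO">
    <MetaDataVersion OID="MDV.1" Name="draft">
      <ItemDef OID="IT.AGE" Name="Age" DataType="text"/>
    </MetaDataVersion>
  </Study>
</ODM>
"""
```

A CLI wrapper around this check would print the entries in `type_mismatches` to stderr and call `sys.exit(2)` whenever `passed` is False, mirroring the exit behaviour described above.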

License

MIT
