Preprocessing
Manu Murugesan edited this page Mar 13, 2026 · 1 revision
The preprocessing module is the foundation of medicaid-utils. It loads raw Parquet claim files into Dask DataFrames and applies validated cleaning and variable construction routines.
Supported MAX file types:

| Type | Description | Class |
|---|---|---|
| `ip` | Inpatient claims | `max_ip.MAXIP` |
| `ot` | Outpatient claims | `max_ot.MAXOT` |
| `ps` | Person Summary | `max_ps.MAXPS` |
| `cc` | Chronic Conditions | `max_cc.MAXCC` |

Supported TAF file types:

| Type | Description | Class |
|---|---|---|
| `ip` | Inpatient claims | `taf_ip.TAFIP` |
| `ot` | Outpatient claims | `taf_ot.TAFOT` |
| `lt` | Long-Term Care | `taf_lt.TAFLT` |
| `rx` | Pharmacy claims | `taf_rx.TAFRX` |
| `ps` | Person Summary (Demographics) | `taf_ps.TAFPS` |

Each file type has tailored cleaning routines that run automatically when `clean=True` (the default):
- Date standardization — converts date columns to consistent datetime types
- Diagnosis code cleaning — strips whitespace, normalizes formatting, handles ICD-9/10 differences
- Procedure code cleaning — validates procedure code systems (CPT, HCPCS, ICD)
- Demographic derivation — computes age, gender flags, and date-of-birth validation
- Duplicate flagging — identifies exact duplicate claims for exclusion
- Encounter/capitation classification — flags FFS, encounter, and capitation claims
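
To make two of the cleaning steps above more concrete (diagnosis code normalization and exact-duplicate flagging), here is a minimal pure-Python sketch. It is not medicaid-utils code: the function names `normalize_dx_code` and `flag_exact_duplicates`, the normalization rules, and the sample column names are all hypothetical, and the library's own routines operate on Dask DataFrames and may apply different rules.

```python
def normalize_dx_code(code):
    """Strip whitespace and dots and uppercase a diagnosis code (illustrative rules only)."""
    if code is None:
        return ""
    return code.strip().replace(".", "").upper()


def flag_exact_duplicates(claims):
    """Return one boolean per claim: True if an identical claim appeared earlier."""
    seen = set()
    flags = []
    for claim in claims:
        # A claim is an exact duplicate only if every field matches.
        key = tuple(sorted(claim.items()))
        flags.append(key in seen)
        seen.add(key)
    return flags


# Hypothetical sample claims; column names are placeholders.
claims = [
    {"BENE_ID": "A1", "SRVC_BGN_DT": "2012-01-05", "DIAG_CD_1": " e11.9 "},
    {"BENE_ID": "A1", "SRVC_BGN_DT": "2012-01-05", "DIAG_CD_1": "E119"},
    {"BENE_ID": "B2", "SRVC_BGN_DT": "2012-02-10", "DIAG_CD_1": "4019"},
]

# Normalize first so formatting differences do not hide duplicates.
for claim in claims:
    claim["DIAG_CD_1"] = normalize_dx_code(claim["DIAG_CD_1"])

duplicate_flags = flag_exact_duplicates(claims)
```

Normalizing before comparing matters here: the first two sample claims differ only in code formatting, so the second is flagged as a duplicate only after both codes are cleaned.
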
Additional derived variables computed when `preprocess=True` (the default):
- Payment calculation — standardized payment amount from available payment fields
- ED use flags — emergency department utilization indicators (CPT, UB-92, revenue center, place of service)
- IP overlap detection — flags outpatient claims that overlap with inpatient stays
- Length of stay — computed from admission and discharge dates
- Eligibility patterns — monthly enrollment strings and gap detection
- Rural classification — RUCA or RUCC codes via ZIP code crosswalk
- Dual eligibility — Medicare-Medicaid dual enrollment flags
- Basis of eligibility — categorization by eligibility group (aged, blind/disabled, child, adult)
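
Two of the derived variables above, length of stay and enrollment gap detection, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, length of stay is taken as the simple day difference (some conventions add one day), and the enrollment pattern is assumed to be a 12-character string with `'1'` for an enrolled month and `'0'` otherwise.

```python
from datetime import date


def length_of_stay(admit, discharge):
    """Days between admission and discharge; None if a date is missing or out of order."""
    if admit is None or discharge is None or discharge < admit:
        return None
    return (discharge - admit).days


def enrollment_gaps(pattern):
    """Find gaps in a monthly enrollment string ('1' enrolled, '0' not enrolled).

    Returns (start_month, end_month) pairs, 1-indexed and inclusive, for each
    run of unenrolled months that falls between two enrolled months.
    """
    gaps = []
    first = pattern.find("1")
    last = pattern.rfind("1")
    if first == -1:
        # Never enrolled: there is no enrolled span that could contain a gap.
        return gaps
    start = None
    for i in range(first, last + 1):
        if pattern[i] == "0" and start is None:
            start = i  # a gap opens at this 0-indexed month
        elif pattern[i] == "1" and start is not None:
            gaps.append((start + 1, i))  # convert to 1-indexed, inclusive months
            start = None
    return gaps


los = length_of_stay(date(2012, 3, 1), date(2012, 3, 5))
gaps = enrollment_gaps("111001110011")
```

In this sketch, months before the first enrolled month or after the last one are not counted as gaps; whether they should be depends on the analysis.
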

```python
from medicaid_utils.preprocessing import max_ip

# Full pipeline (default)
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms")

# Raw data only
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms", clean=False, preprocess=False)

# Clean but skip variable construction
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms", preprocess=False)

# Cache intermediate results to disk
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms", tmp_folder="/tmp/cache")
```

```python
# Export to Parquet (recommended)
ip.export("/path/to/output", output_format="parquet", repartition=True)

# Export to CSV
ip.export("/path/to/output", output_format="csv")
```

See also:

- MAX vs TAF — Key differences between file formats
- Glossary — Column naming conventions
medicaid-utils | Documentation | PyPI | GitHub | MIT License | Research Computing Group, Biostatistics Laboratory, The University of Chicago