Preprocessing

Manu Murugesan edited this page Mar 13, 2026 · 1 revision

The preprocessing module is the foundation of medicaid-utils. It loads raw Parquet claim files into Dask DataFrames and applies validated cleaning and variable construction routines.

Supported File Types

MAX Format (Pre-2016)

Type  Description         Class
ip    Inpatient claims    max_ip.MAXIP
ot    Outpatient claims   max_ot.MAXOT
ps    Person Summary      max_ps.MAXPS
cc    Chronic Conditions  max_cc.MAXCC

TAF Format (2016+)

Type  Description                    Class
ip    Inpatient claims               taf_ip.TAFIP
ot    Outpatient claims              taf_ot.TAFOT
lt    Long-Term Care                 taf_lt.TAFLT
rx    Pharmacy claims                taf_rx.TAFRX
ps    Person Summary (Demographics)  taf_ps.TAFPS

What Cleaning Does

Each file type has tailored cleaning routines that run automatically when clean=True (the default):

  • Date standardization — converts date columns to consistent datetime types
  • Diagnosis code cleaning — strips whitespace, normalizes formatting, handles ICD-9/10 differences
  • Procedure code cleaning — validates procedure code systems (CPT, HCPCS, ICD)
  • Demographic derivation — computes age and gender flags and validates dates of birth
  • Duplicate flagging — identifies exact duplicate claims for exclusion
  • Encounter/capitation classification — flags FFS, encounter, and capitation claims
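
To make two of the cleaning steps above concrete, here is a minimal pandas sketch of diagnosis code normalization and exact-duplicate flagging. It is an illustration only, not the routines medicaid-utils runs: the column names (bene_id, srvc_dt, diag_cd) and the helper functions are hypothetical, and the library itself works on Dask DataFrames rather than pandas.

```python
import pandas as pd


def clean_diag_codes(codes: pd.Series) -> pd.Series:
    """Strip whitespace, drop decimal points, and upper-case diagnosis codes."""
    cleaned = (
        codes.astype("string")
        .str.strip()
        .str.replace(".", "", regex=False)
        .str.upper()
    )
    # Treat empty strings as missing so they are not counted as real codes.
    return cleaned.mask(cleaned == "", pd.NA)


def flag_exact_duplicates(df: pd.DataFrame) -> pd.Series:
    """Return 1 for each row that repeats an earlier, identical row."""
    return df.duplicated(keep="first").astype(int)


# Hypothetical claim rows; column names are illustrative only.
claims = pd.DataFrame(
    {
        "bene_id": ["A", "A", "B", "B"],
        "srvc_dt": ["2012-01-05", "2012-01-05", "2012-02-01", "2012-02-01"],
        "diag_cd": [" 250.00", "25000 ", "e11.9", ""],
    }
)
claims["srvc_dt"] = pd.to_datetime(claims["srvc_dt"])
claims["diag_cd"] = clean_diag_codes(claims["diag_cd"])
claims["duplicate"] = flag_exact_duplicates(claims)
print(claims)
```

Cleaning the codes before flagging duplicates matters here: the first two rows differ only in formatting, so they count as exact duplicates only after normalization.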

What Preprocessing Adds

Additional derived variables are computed when preprocess=True (the default):

  • Payment calculation — standardized payment amount from available payment fields
  • ED use flags — emergency department utilization indicators (CPT, UB-92, revenue center, place of service)
  • IP overlap detection — flags outpatient claims that overlap with inpatient stays
  • Length of stay — computed from admission and discharge dates
  • Eligibility patterns — monthly enrollment strings and gap detection
  • Rural classification — RUCA or RUCC codes via ZIP code crosswalk
  • Dual eligibility — Medicare-Medicaid dual enrollment flags
  • Basis of eligibility — categorization by eligibility group (aged, blind/disabled, child, adult)
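
As a rough illustration of two of the derived variables above, the sketch below computes length of stay and counts gaps in a monthly enrollment string using pandas. It rests on assumptions that may not match the library: length of stay is taken as discharge date minus admission date (so a same-day stay is 0), a gap is a run of unenrolled months between enrolled months, and the column and function names (admsn_dt, dschrg_dt, length_of_stay, enrollment_pattern, count_gaps) are hypothetical.

```python
import pandas as pd


def length_of_stay(admit: pd.Series, discharge: pd.Series) -> pd.Series:
    """Days from admission to discharge; missing when the dates are out of order."""
    days = (discharge - admit).dt.days
    return days.where(days >= 0)


def enrollment_pattern(monthly_flags: list[int]) -> str:
    """Encode monthly 0/1 enrollment flags as a string such as '111001111111'."""
    return "".join("1" if flag else "0" for flag in monthly_flags)


def count_gaps(pattern: str) -> int:
    """Count runs of unenrolled months that sit between enrolled months."""
    trimmed = pattern.strip("0")  # leading/trailing non-enrollment is not a gap
    return sum(1 for run in trimmed.split("1") if run)


# Hypothetical inpatient stays; the third row has discharge before admission.
stays = pd.DataFrame(
    {
        "admsn_dt": pd.to_datetime(["2012-03-01", "2012-03-10", "2012-05-02"]),
        "dschrg_dt": pd.to_datetime(["2012-03-04", "2012-03-10", "2012-05-01"]),
    }
)
stays["los"] = length_of_stay(stays["admsn_dt"], stays["dschrg_dt"])

# One beneficiary enrolled all year except April and May.
pattern = enrollment_pattern([1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1])
gaps = count_gaps(pattern)
print(stays)
print(pattern, gaps)
```

Returning a missing value for out-of-order dates, rather than a negative number, keeps invalid stays from skewing averages downstream.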

Controlling the Pipeline

from medicaid_utils.preprocessing import max_ip

# Full pipeline (default)
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms")

# Raw data only
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms", clean=False, preprocess=False)

# Clean but skip variable construction
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms", preprocess=False)

# Cache intermediate results to disk
ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms", tmp_folder="/tmp/cache")

Exporting Processed Data

# Export to Parquet (recommended)
ip.export("/path/to/output", output_format="parquet", repartition=True)

# Export to CSV
ip.export("/path/to/output", output_format="csv")

See Also

  • MAX vs TAF — Key differences between file formats
  • Glossary — Column naming conventions
