FAQ
medicaid-utils is an open-source Python toolkit for analyzing Medicaid claims data from CMS. It provides preprocessing, cleaning, cohort extraction, risk adjustment algorithms, and quality measures for both MAX (1999–2015) and TAF (2014–present) file formats.
The package is developed by the Research Computing Group in the Biostatistics Laboratory at The University of Chicago. It grew out of research computing infrastructure built for peer-reviewed Medicaid health services research publications.
The package itself is open-source and freely available. However, to use it with actual Medicaid claims data, you need a data use agreement with CMS or access through ResDAC. The package does not include any claims data.
Python 3.11, 3.12, and 3.13 are tested in CI. Python 3.10 and earlier are not supported.
Input data must be in Parquet format, organized by year and state. See Data Layout for the required folder structure.
Other file formats cannot be read directly; convert them to Parquet first. See Data Layout for conversion examples.
MAX (Medicaid Analytic eXtract) covers 1999–2015 with ICD-9 coding. TAF (T-MSIS Analytic Files) covers 2014–present with ICD-10 coding. Both have IP, OT, RX, and PS claim types, though medicaid-utils currently implements RX preprocessing only for TAF. See MAX vs TAF for a detailed comparison.
Memory requirements depend on the size of the state:
- Small states (e.g., WY): 16 GB is usually sufficient
- Medium states (e.g., AL, IL): 32–64 GB recommended
- Large states (e.g., CA, NY, TX): 64 GB+ recommended, or use a distributed cluster
Using tmp_folder for intermediate caching helps manage memory regardless of dataset size.
- MAX files: ip.df (single DataFrame)
- TAF files: ip.dct_files["base"] (dict of sub-file DataFrames)
See MAX vs TAF for details.
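
Code that has to work with both formats can hide this difference behind a small helper. The sketch below is illustrative only: `get_base_frame` is a hypothetical function, not part of medicaid-utils, and the `SimpleNamespace` objects are stand-ins for real claim objects. It assumes that only TAF objects carry a `dct_files` attribute; the attribute names `df` and `dct_files["base"]` are the ones listed above.

```python
from types import SimpleNamespace


def get_base_frame(claim):
    """Return the main DataFrame of a claim object, MAX or TAF.

    Hypothetical helper: TAF-style objects are detected by the presence
    of a `dct_files` attribute; MAX-style objects expose `df` directly.
    """
    dct_files = getattr(claim, "dct_files", None)
    if dct_files is not None:
        return dct_files["base"]
    return claim.df


# Stand-in objects; strings take the place of real Dask DataFrames.
max_ip = SimpleNamespace(df="max-ip-frame")
taf_ip = SimpleNamespace(dct_files={"base": "taf-ip-base-frame"})

print(get_base_frame(max_ip))  # max-ip-frame
print(get_base_frame(taf_ip))  # taf-ip-base-frame
```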
medicaid-utils uses Dask, which is lazy by default. Operations build a task graph but don't execute until you call .compute(). This allows Dask to optimize the computation plan.
Specify ICD-9 and ICD-10 code prefixes in a dictionary. Codes are matched by prefix: "E11" matches "E110", "E1100", "E1101", etc. See Cohort Extraction for examples.
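
Prefix matching itself is simple to illustrate in plain Python. The function below is a hypothetical sketch of the idea, not the package's implementation, and it assumes codes are stored without decimal points, as in the examples above.

```python
def matches_prefix(code, prefixes):
    """Return True if `code` starts with any prefix in `prefixes`."""
    return code.startswith(tuple(prefixes))


icd10_prefixes = ["E11"]
for code in ["E110", "E1100", "E1101", "E10", "I10"]:
    # The first three codes match; "E10" and "I10" do not.
    print(code, matches_prefix(code, icd10_prefixes))
```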
Yes. The dct_diag_proc_codes dictionary supports ICD-9 and ICD-10 codes simultaneously:

    dct_codes = {
        "diag_codes": {
            "diabetes_t2": {
                "incl": {
                    9: ["250"],  # ICD-9
                    10: ["E11"],  # ICD-10
                },
            },
        },
        "proc_codes": {},
    }

Eight algorithms: Elixhauser, CDPS-Rx, BETOS, ED PQI, IP PQI, NYU/Billings, PMCA, and low-value care. See Risk Adjustment Algorithms for details and usage.
This typically happens when object-type columns contain pyarrow sentinel values. medicaid-utils handles this automatically by converting object columns to strings before export. If you encounter this error, make sure you're using the latest version.
- Reduce n_workers and increase memory_limit per worker
- Use tmp_folder to cache intermediate results to disk
- Consider repartitioning your data into smaller partitions
- For multi-state analyses, process states sequentially
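
The last point, combined with writing results to disk, might look like the following sketch. `summarize_state` is a hypothetical placeholder for whatever per-state analysis you run; nothing here is a medicaid-utils API. The point is the pattern: one state at a time, with each result written out before the next state starts.

```python
import json
import tempfile
from pathlib import Path


def summarize_state(state):
    """Hypothetical placeholder for a per-state analysis."""
    return {"state": state, "status": "done"}


def run_sequentially(states, out_dir):
    """Process one state at a time and write each result to disk."""
    out_dir = Path(out_dir)
    paths = []
    for state in states:
        summary = summarize_state(state)  # only one state's result is held
        path = out_dir / f"{state}.json"
        path.write_text(json.dumps(summary))
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = run_sequentially(["WY", "AL", "IL"], tmp)
    names = [p.name for p in written]
    print(names)  # ['WY.json', 'AL.json', 'IL.json']
```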
Check that your data folder structure matches the expected layout. See Data Layout.
See Contributing for guidelines on submitting issues and pull requests.
Open an issue on GitHub with:
- Python version and package version
- Minimal code example that reproduces the issue
- Full error traceback
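
The version information can be collected with the standard library. This sketch assumes the installed distribution is named `medicaid-utils`; adjust `dist_name` if your installation uses a different name. `environment_report` is a helper written for this example.

```python
import platform
from importlib import metadata


def environment_report(dist_name="medicaid-utils"):
    """Collect the Python and package versions for a bug report."""
    try:
        pkg_version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        pkg_version = "not installed"
    return {"python": platform.python_version(), dist_name: pkg_version}


print(environment_report())
```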
medicaid-utils | Documentation | PyPI | GitHub | MIT License | Research Computing Group, Biostatistics Laboratory, The University of Chicago