FAQ

Frequently Asked Questions

General

What is medicaid-utils?

medicaid-utils is an open-source Python toolkit for analyzing Medicaid claims data from CMS. It provides preprocessing, cleaning, cohort extraction, risk adjustment algorithms, and quality measures for both MAX (1999–2015) and TAF (2014–present) file formats.

Who develops medicaid-utils?

The package is developed by the Research Computing Group in the Biostatistics Laboratory at The University of Chicago. It grew out of research computing infrastructure built for peer-reviewed Medicaid health services research publications.

Do I need a CMS data use agreement (DUA) to use this package?

The package itself is open-source and freely available. However, to use it with actual Medicaid claims data, you need a data use agreement with CMS or access through ResDAC. The package does not include any claims data.

What Python versions are supported?

Python 3.11, 3.12, and 3.13 are tested in CI. Python 3.10 and earlier are not supported.
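
To guard a script or notebook against an unsupported interpreter, a check along these lines can run before the package is imported. This is a minimal sketch: `MIN_SUPPORTED` and `is_supported` are illustrative names, not part of medicaid-utils, and the minimum version is taken from the CI statement above.

```python
import sys

# Assumption: 3.11 is the oldest supported release, per the CI statement above.
MIN_SUPPORTED = (3, 11)


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) pair meets the minimum version."""
    return tuple(version_info[:2]) >= MIN_SUPPORTED


if not is_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is older than 3.11")
```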

Data & Setup

What format should my data be in?

Parquet format, organized by year and state. See Data Layout for the required folder structure.

Can I use SAS or CSV data?

Not directly. Convert the data to Parquet first; see Data Layout for conversion examples.

What's the difference between MAX and TAF?

MAX (Medicaid Analytic eXtract) covers 1999–2015 with ICD-9 coding. TAF (T-MSIS Analytic Files) covers 2014–present with ICD-10 coding. Both have IP, OT, RX, and PS claim types, though medicaid-utils currently implements RX preprocessing only for TAF. See MAX vs TAF for a detailed comparison.

How much RAM do I need?

Small states (e.g., WY): 16 GB is usually sufficient
Medium states (e.g., AL, IL): 32–64 GB recommended
Large states (e.g., CA, NY, TX): 64 GB+ recommended, or use a distributed cluster

Using tmp_folder for intermediate caching helps manage memory regardless of dataset size.

Usage

How do I access the DataFrame after loading?

MAX files: ip.df (single DataFrame)
TAF files: ip.dct_files["base"] (dict of sub-file DataFrames)

See MAX vs TAF for details.

Why are operations slow without `.compute()`?

medicaid-utils uses Dask, which is lazy by default. Operations build a task graph but don't execute until you call .compute(). This allows Dask to optimize the computation plan.

How do I define diagnosis codes for cohort extraction?

Use ICD-9 and ICD-10 prefixes in a dictionary. Codes are matched by prefix, so "E11" matches "E110", "E1100", "E1101", and so on. See Cohort Extraction for examples.
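
Prefix matching itself is easy to picture in plain Python. The helper below is a hypothetical illustration of the idea, not the function medicaid-utils uses internally.

```python
def matches_prefix(code, prefixes):
    """Return True if `code` starts with any of the given prefixes."""
    # str.startswith accepts a tuple of candidate prefixes.
    return code.startswith(tuple(prefixes))


print(matches_prefix("E1101", ["E11"]))  # True
print(matches_prefix("E10", ["E11"]))    # False
```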

Can I use both ICD-9 and ICD-10 codes?

Yes. The dct_diag_proc_codes dictionary supports both at once, with inclusion prefixes keyed by ICD version (9 or 10):

dct_codes = {
    "diag_codes": {
        "diabetes_t2": {
            "incl": {
                9: ["250"],    # ICD-9
                10: ["E11"],   # ICD-10
            },
        },
    },
    "proc_codes": {},
}
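
To show why the version keys matter, the sketch below walks a dictionary of this shape and reports which conditions a code matches for a given ICD version. `matched_conditions` is a hypothetical helper written for this FAQ; the package's own matching logic may differ.

```python
dct_codes = {
    "diag_codes": {
        "diabetes_t2": {
            "incl": {
                9: ["250"],    # ICD-9
                10: ["E11"],   # ICD-10
            },
        },
    },
    "proc_codes": {},
}


def matched_conditions(code, icd_version, dct_codes):
    """List condition names whose inclusion prefixes for this ICD version match `code`."""
    hits = []
    for name, rules in dct_codes["diag_codes"].items():
        # Only prefixes registered under the claim's ICD version are considered.
        prefixes = rules.get("incl", {}).get(icd_version, [])
        if code.startswith(tuple(prefixes)):
            hits.append(name)
    return hits


print(matched_conditions("25000", 9, dct_codes))   # ['diabetes_t2']
print(matched_conditions("E119", 10, dct_codes))   # ['diabetes_t2']
print(matched_conditions("25000", 10, dct_codes))  # []
```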

What risk adjustment algorithms are available?

Eight algorithms: Elixhauser, CDPS-Rx, BETOS, ED PQI, IP PQI, NYU/Billings, PMCA, and low-value care. See Risk Adjustment Algorithms for details and usage.

Troubleshooting

`ArrowInvalid` or `ArrowTypeError` when exporting to Parquet

This typically happens when object-type columns contain pyarrow sentinel values. medicaid-utils handles this automatically by converting object columns to strings before export. If you encounter this error, make sure you're using the latest version.

Dask worker running out of memory

Reduce n_workers and increase memory_limit per worker
Use tmp_folder to cache intermediate results to disk
Consider repartitioning your data into smaller partitions
For multi-state analyses, process states sequentially

`FileNotFoundError` when loading claims

Check that your data folder structure matches the expected layout. See Data Layout.

Contributing

How can I contribute?

See Contributing for guidelines on submitting issues and pull requests.

How do I report a bug?

Open an issue on GitHub with:

Python version and package version
Minimal code example that reproduces the issue
Full error traceback
