Manu Murugesan edited this page Mar 14, 2026 · 2 revisions

Frequently Asked Questions

General

What is medicaid-utils?

medicaid-utils is an open-source Python toolkit for analyzing Medicaid claims data from CMS. It provides preprocessing, cleaning, cohort extraction, risk adjustment algorithms, and quality measures for both MAX (1999–2015) and TAF (2014–present) file formats.

Who develops medicaid-utils?

The package is developed by the Research Computing Group in the Biostatistics Laboratory at The University of Chicago. It grew out of research computing infrastructure built for peer-reviewed Medicaid health services research publications.

Do I need a CMS data use agreement (DUA) to use this package?

The package itself is open-source and freely available. However, to use it with actual Medicaid claims data, you need a data use agreement with CMS or access through ResDAC. The package does not include any claims data.

What Python versions are supported?

Python 3.11, 3.12, and 3.13 are tested in CI. Python 3.10 and earlier are not supported.

Data & Setup

What format should my data be in?

Parquet format, organized by year and state. See Data Layout for the required folder structure.

Can I use SAS or CSV data?

Not directly — you need to convert to Parquet first. See Data Layout for conversion examples.

What's the difference between MAX and TAF?

MAX (Medicaid Analytic eXtract) covers 1999–2015 and is predominantly ICD-9 coded. TAF (T-MSIS Analytic Files) covers 2014–present and is predominantly ICD-10 coded; the U.S. switch to ICD-10 took effect on October 1, 2015, so claims from the overlap years can carry either version. Both have IP, OT, RX, and PS claim types, though medicaid-utils currently implements RX preprocessing only for TAF. See MAX vs TAF for a detailed comparison.

How much RAM do I need?

  • Small states (e.g., WY): 16 GB is usually sufficient
  • Medium states (e.g., AL, IL): 32–64 GB recommended
  • Large states (e.g., CA, NY, TX): 64 GB+ recommended, or use a distributed cluster

Using tmp_folder for intermediate caching helps manage memory regardless of dataset size.

Usage

How do I access the DataFrame after loading?

  • MAX files: ip.df (single DataFrame)
  • TAF files: ip.dct_files["base"] (dict of sub-file DataFrames)

See MAX vs TAF for details.

Why do operations return no results until I call .compute()?

medicaid-utils uses Dask, which is lazy by default. Operations build a task graph but don't execute until you call .compute(), so the cost of the whole pipeline is paid at that call rather than step by step. This allows Dask to optimize the computation plan.
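The lazy behaviour can be seen in a small standalone sketch. It uses dask.delayed rather than a claims DataFrame, and the double function and calls list are illustrative names only, not part of medicaid-utils:

```python
from dask import delayed

calls = []  # records each time the function body actually runs


def double(x):
    calls.append(x)
    return 2 * x


# Building the graph only records the work to do; double() is not called yet.
lazy_total = delayed(sum)([delayed(double)(i) for i in range(3)])
ran_before_compute = len(calls)

# .compute() executes the task graph and returns a concrete value.
total = lazy_total.compute()
ran_after_compute = len(calls)

print(ran_before_compute, total, ran_after_compute)
```

Dask DataFrames follow the same pattern: filters and joins return new lazy objects, and the work happens when a result is requested.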

How do I define diagnosis codes for cohort extraction?

Use ICD-9 and ICD-10 prefixes in a dictionary. Codes are matched using prefix matching — "E11" matches "E110", "E1100", "E1101", etc. See Cohort Extraction for examples.

Can I use both ICD-9 and ICD-10 codes?

Yes. The dictionary passed as dct_diag_proc_codes accepts both at once; key each list of prefixes by its ICD version (9 or 10):

dct_codes = {
    "diag_codes": {
        "diabetes_t2": {
            "incl": {
                9: ["250"],    # ICD-9
                10: ["E11"],   # ICD-10
            },
        },
    },
    "proc_codes": {},
}
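To make the prefix-matching rule concrete, here is a minimal sketch of how such a dictionary could be applied to a single diagnosis code. The matching_conditions helper is hypothetical and written only for illustration; it is not the function medicaid-utils uses internally:

```python
dct_codes = {
    "diag_codes": {
        "diabetes_t2": {
            "incl": {
                9: ["250"],    # ICD-9
                10: ["E11"],   # ICD-10
            },
        },
    },
    "proc_codes": {},
}


def matching_conditions(code, icd_version, dct_codes):
    """Return the condition names whose inclusion prefixes match `code`."""
    matches = []
    for condition, rules in dct_codes["diag_codes"].items():
        prefixes = rules.get("incl", {}).get(icd_version, [])
        # str.startswith accepts a tuple and is False for an empty tuple
        if code.startswith(tuple(prefixes)):
            matches.append(condition)
    return matches


print(matching_conditions("E1101", 10, dct_codes))
print(matching_conditions("25000", 9, dct_codes))
print(matching_conditions("E1101", 9, dct_codes))
```

Keying the prefixes by ICD version keeps an ICD-10 code such as "E1101" from being tested against ICD-9 prefixes, and the reverse.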

What risk adjustment algorithms are available?

Eight algorithms: Elixhauser, CDPS-Rx, BETOS, ED PQI, IP PQI, NYU/Billings, PMCA, and low-value care. See Risk Adjustment Algorithms for details and usage.

Troubleshooting

ArrowInvalid or ArrowTypeError when exporting to Parquet

This typically happens when object-type columns hold values that pyarrow cannot convert to a single type, such as strings mixed with numbers or missing-value sentinels. medicaid-utils handles this automatically by converting object columns to strings before export. If you encounter this error, make sure you're using the latest version.
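If you export your own DataFrames outside the package, a similar conversion can be applied by hand. The following is a minimal sketch assuming a pandas DataFrame; stringify_object_columns is a hypothetical helper, not the routine medicaid-utils uses:

```python
import pandas as pd


def stringify_object_columns(df):
    """Return a copy in which object columns hold only strings or missing values."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        is_missing = out[col].isna()
        # Keep missing values as they are; convert everything else to str.
        out[col] = out[col].where(is_missing, out[col].astype(str))
    return out


# A mixed-type column like "code" is the kind that can trip up Parquet export.
raw = pd.DataFrame({"code": ["250", 250, None], "n": [1, 2, 3]})
clean = stringify_object_columns(raw)
print(clean["code"].tolist())
```

Missing values are left untouched so that they are still written as nulls rather than as the literal text "None" or "nan".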

Dask worker running out of memory

  • Reduce n_workers and increase memory_limit per worker
  • Use tmp_folder to cache intermediate results to disk
  • Consider repartitioning your data into smaller partitions
  • For multi-state analyses, process states sequentially
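The trade-off between n_workers and memory_limit is simple arithmetic, sketched below. The plan_local_cluster helper and the 4 GB reserve are assumptions made for illustration; the resulting dictionary uses keyword names accepted by dask.distributed.Client when it starts a local cluster:

```python
def plan_local_cluster(total_ram_gb, n_workers, reserve_gb=4):
    """Split usable RAM evenly across workers, leaving a reserve for the OS."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    usable_gb = total_ram_gb - reserve_gb
    if usable_gb <= 0:
        raise ValueError("reserve_gb leaves no memory for workers")
    return {
        "n_workers": n_workers,
        "threads_per_worker": 1,
        # memory_limit applies to each worker, not to the cluster as a whole
        "memory_limit": f"{usable_gb / n_workers:.1f}GB",
    }


plan = plan_local_cluster(total_ram_gb=64, n_workers=4)
fewer_workers = plan_local_cluster(total_ram_gb=64, n_workers=2)
print(plan["memory_limit"], fewer_workers["memory_limit"])

# To apply a plan (not run here):
#   from dask.distributed import Client
#   client = Client(**fewer_workers)
```

Halving the worker count doubles the memory available to each worker, which helps when a single large partition is what exhausts memory.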

FileNotFoundError when loading claims

Check that your data folder structure matches the expected layout. See Data Layout.

Contributing

How can I contribute?

See Contributing for guidelines on submitting issues and pull requests.

How do I report a bug?

Open an issue on GitHub with:

  • Python version and package version
  • Minimal code example that reproduces the issue
  • Full error traceback
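The version details can be collected with the standard library, as in the sketch below. It assumes the installed distribution is named "medicaid-utils"; adjust the name if your environment reports it differently:

```python
import platform
from importlib import metadata


def environment_report(package="medicaid-utils"):
    """Collect the version details that are useful in a bug report."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        package_version = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        package: package_version,
    }


report = environment_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

Paste the printed lines into the issue alongside the code example and traceback.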
