Data Layout

medicaid-utils expects Medicaid claims files to be stored as Parquet datasets, organized by year and state and sorted by beneficiary ID (BENE_MSIS or MSIS_ID).

MAX Files (Pre-2016, ICD-9 Era)

data_root/
  medicaid/
    {YEAR}/
      {STATE}/
        max/
          ip/parquet/      # Inpatient claims
          ot/parquet/      # Outpatient claims
          ps/parquet/      # Person Summary
          cc/parquet/      # Chronic Conditions

Example: data_root/medicaid/2012/WY/max/ip/parquet/
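
The MAX layout maps directly onto a path template. As a minimal sketch, assuming a hypothetical helper named `max_dataset_path` (it is not part of medicaid-utils), the directory for one claim type could be built like this:

```python
from pathlib import PurePosixPath

# Claim types listed in the MAX layout above.
MAX_CLAIM_TYPES = ("ip", "ot", "ps", "cc")


def max_dataset_path(data_root, year, state, claim_type):
    """Hypothetical helper: return the Parquet directory for one MAX claim type."""
    if claim_type not in MAX_CLAIM_TYPES:
        raise ValueError(f"unknown MAX claim type: {claim_type!r}")
    return PurePosixPath(
        data_root, "medicaid", str(year), state, "max", claim_type, "parquet"
    )


print(max_dataset_path("data_root", 2012, "WY", "ip"))
```

`PurePosixPath` is used only to keep the forward-slash form shown in this page; `pathlib.Path` would work the same way on a real filesystem.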

TAF Files (2016+, ICD-10 Era)

TAF claims are split into multiple subtypes per claim type:

data_root/
  medicaid/
    {YEAR}/
      {STATE}/
        taf/
          ip/                    # Inpatient
            iph/parquet/         #   Header (base)
            ipl/parquet/         #   Line
            ipoccr/parquet/      #   Occurrence codes
            ipdx/parquet/        #   Diagnosis codes
            ipndc/parquet/       #   NDC codes
          ot/                    # Outpatient (oth, otl, otoccr, otdx, otndc)
          lt/                    # Long-Term Care (lth, ltl, ltoccr, ltdx, ltndc)
          rx/                    # Pharmacy (rxh, rxl, rxndc)
          de/                    # Demographics/Eligibility (Person Summary)
            debse/parquet/       #   Base demographics
            dedts/parquet/       #   Dates
            demc/parquet/        #   Managed care
            dedsb/parquet/       #   Disability
            demfp/parquet/       #   Money Follows the Person
            dewvr/parquet/       #   Waiver
            dehsp/parquet/       #   Health Home & State Plan Option
            dedxndc/parquet/     #   Diagnosis & NDC codes
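
Because each TAF claim type fans out into several subtype datasets, it can help to enumerate them programmatically. The sketch below uses a hypothetical helper, `taf_dataset_paths`, and assumes that the ot, lt, and rx subtypes follow the same `{subtype}/parquet/` nesting shown explicitly for ip and de:

```python
from pathlib import PurePosixPath

# Subtype directories per TAF claim type, as listed in the tree above.
TAF_SUBTYPES = {
    "ip": ("iph", "ipl", "ipoccr", "ipdx", "ipndc"),
    "ot": ("oth", "otl", "otoccr", "otdx", "otndc"),
    "lt": ("lth", "ltl", "ltoccr", "ltdx", "ltndc"),
    "rx": ("rxh", "rxl", "rxndc"),
    "de": ("debse", "dedts", "demc", "dedsb", "demfp", "dewvr", "dehsp", "dedxndc"),
}


def taf_dataset_paths(data_root, year, state, claim_type):
    """Hypothetical helper: map each subtype name to its Parquet directory."""
    base = PurePosixPath(data_root, "medicaid", str(year), state, "taf", claim_type)
    # Assumption: every subtype lives under {subtype}/parquet/, as shown for ip and de.
    return {sub: base / sub / "parquet" for sub in TAF_SUBTYPES[claim_type]}


for name, path in taf_dataset_paths("data_root", 2019, "IL", "rx").items():
    print(name, path)
```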

Notes

Each Parquet dataset can be a single file or a directory of partitioned Parquet files.
Files must be pre-sorted by beneficiary ID to enable efficient partition-level operations.
The package uses pyarrow as the default Parquet engine.
{YEAR} is a four-digit year (e.g., 2012, 2019).
{STATE} is a two-letter uppercase state abbreviation (e.g., WY, AL, IL).
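
The naming and sort-order rules above can be checked before loading data. This is a rough sketch with hypothetical function names, not the package's own validation. Plain lists of IDs stand in for the beneficiary ID column of each partition file; in practice that column would be read from Parquet.

```python
import re

YEAR_RE = re.compile(r"\d{4}")      # four-digit year, e.g. 2012
STATE_RE = re.compile(r"[A-Z]{2}")  # two-letter uppercase state, e.g. WY


def is_valid_year_state(year, state):
    """Check the {YEAR} and {STATE} path components against the rules above."""
    return bool(YEAR_RE.fullmatch(str(year))) and bool(STATE_RE.fullmatch(state))


def is_sorted_across_partitions(partitions):
    """Return True if IDs are non-decreasing within and across partitions.

    `partitions` is an iterable of ID sequences, one per partition file,
    given in partition order.
    """
    previous = None
    for ids in partitions:
        for value in ids:
            if previous is not None and value < previous:
                return False
            previous = value
    return True


print(is_valid_year_state(2012, "WY"))
print(is_sorted_across_partitions([["A1", "A2"], ["A2", "B7"]]))
```

Note that string IDs compare lexicographically, so the check matches the order produced by sorting the ID column as text.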

Converting Raw CMS Data

If your CMS data is in SAS (.sas7bdat), CSV, or other formats, you need to convert it to Parquet first. Example using pandas:

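from pathlib import Path

import pandas as pd

# Read SAS file. Passing an encoding decodes string columns to str instead of bytes;
# "latin-1" is an assumption, so adjust it to match your extract.
df = pd.read_sas("max_ip_2012_wy.sas7bdat", encoding="latin-1")

# Sort by beneficiary ID
df = df.sort_values("MSIS_ID")

# Create the target directory (to_parquet does not create it), then write to Parquet
out_dir = Path("data_root/medicaid/2012/WY/max/ip/parquet")
out_dir.mkdir(parents=True, exist_ok=True)
df.to_parquet(out_dir / "data.parquet", engine="pyarrow", index=False)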

For large files, use Dask:

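import dask.dataframe as dd

# Read the ID as a string so dtype inference does not turn it into a number
df = dd.read_csv("max_ip_2012_wy.csv", dtype={"MSIS_ID": str})
# set_index sorts by MSIS_ID across partitions; reset_index restores it as a column
df = df.set_index("MSIS_ID").reset_index()

# Write a directory of partitioned Parquet files
df.to_parquet("data_root/medicaid/2012/WY/max/ip/parquet/", engine="pyarrow", write_index=False)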
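
When converting many extracts, the destination directory can be derived from the raw file name. The sketch below assumes raw MAX files are named like the examples on this page (`max_{type}_{year}_{state}.{ext}`); that pattern and the helper `destination_for` are illustrative assumptions, not a CMS or medicaid-utils convention.

```python
import re
from pathlib import PurePosixPath

# Assumed raw-file naming pattern, e.g. "max_ip_2012_wy.sas7bdat".
RAW_NAME_RE = re.compile(
    r"max_(?P<claim>[a-z]+)_(?P<year>\d{4})_(?P<state>[a-z]{2})\.\w+"
)


def destination_for(raw_name, data_root="data_root"):
    """Hypothetical helper: map a raw MAX file name to its Parquet directory."""
    match = RAW_NAME_RE.fullmatch(raw_name.lower())
    if match is None:
        raise ValueError(f"unrecognized raw file name: {raw_name!r}")
    return PurePosixPath(
        data_root,
        "medicaid",
        match["year"],
        match["state"].upper(),  # layout expects uppercase state codes
        "max",
        match["claim"],
        "parquet",
    )


print(destination_for("max_ip_2012_wy.sas7bdat"))
```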

Next Steps

Quick Start — Load and process your first claims
MAX vs TAF — Understand the differences between file formats
