Data Layout
Manu Murugesan edited this page Mar 13, 2026
medicaid-utils expects Medicaid claim files stored as Parquet datasets, organized by year and state, and sorted by beneficiary ID (BENE_MSIS or MSIS_ID).
data_root/
  medicaid/
    {YEAR}/
      {STATE}/
        max/
          ip/parquet/    # Inpatient claims
          ot/parquet/    # Outpatient claims
          ps/parquet/    # Person Summary
          cc/parquet/    # Chronic Conditions
Example: data_root/medicaid/2012/WY/max/ip/parquet/
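The MAX path convention can be sketched as a small path builder. This is an illustrative sketch only: `max_parquet_dir` and `MAX_CLAIM_TYPES` are hypothetical names, not part of medicaid-utils, and the folder names are copied from the tree above.

```python
from pathlib import Path

# Claim-type folder names copied from the MAX tree above.
MAX_CLAIM_TYPES = ("ip", "ot", "ps", "cc")


def max_parquet_dir(data_root, year, state, claim_type):
    """Build the expected MAX dataset folder (hypothetical helper, not a medicaid-utils API)."""
    if claim_type not in MAX_CLAIM_TYPES:
        raise ValueError(f"unknown MAX claim type: {claim_type!r}")
    return Path(data_root) / "medicaid" / str(year) / state / "max" / claim_type / "parquet"


print(max_parquet_dir("data_root", 2012, "WY", "ip").as_posix())
# data_root/medicaid/2012/WY/max/ip/parquet
```

Building paths in one place like this makes it easier to check for a misplaced folder before any claims are loaded.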
TAF claims are split into multiple subtypes per claim type:
data_root/
  medicaid/
    {YEAR}/
      {STATE}/
        taf/
          ip/                    # Inpatient
            iph/parquet/         # Header (base)
            ipl/parquet/         # Line
            ipoccr/parquet/      # Occurrence codes
            ipdx/parquet/        # Diagnosis codes
            ipndc/parquet/       # NDC codes
          ot/                    # Outpatient (oth, otl, otoccr, otdx, otndc)
          lt/                    # Long-Term Care (lth, ltl, ltoccr, ltdx, ltndc)
          rx/                    # Pharmacy (rxh, rxl, rxndc)
          de/                    # Demographics/Eligibility (Person Summary)
            debse/parquet/       # Base demographics
            dedts/parquet/       # Dates
            demc/parquet/        # Managed care
            dedsb/parquet/       # Disability
            demfp/parquet/       # Money Follows the Person
            dewvr/parquet/       # Waiver
            dehsp/parquet/       # Home health/SPF
            dedxndc/parquet/     # Diagnosis & NDC codes
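To see which folders a given TAF claim type is expected to have, the subtype lists above can be written out as a lookup table. This is a sketch, not the package's API: `TAF_SUBTYPES` and `taf_parquet_dirs` are hypothetical names, and it assumes the `ot`, `lt`, and `rx` subtypes follow the same `{subtype}/parquet/` pattern that the tree shows for `ip` and `de`.

```python
from pathlib import Path

# Subtype folders per TAF claim type, copied from the tree above.
TAF_SUBTYPES = {
    "ip": ("iph", "ipl", "ipoccr", "ipdx", "ipndc"),
    "ot": ("oth", "otl", "otoccr", "otdx", "otndc"),
    "lt": ("lth", "ltl", "ltoccr", "ltdx", "ltndc"),
    "rx": ("rxh", "rxl", "rxndc"),
    "de": ("debse", "dedts", "demc", "dedsb", "demfp", "dewvr", "dehsp", "dedxndc"),
}


def taf_parquet_dirs(data_root, year, state, claim_type):
    """Map each subtype to its expected folder (hypothetical helper)."""
    base = Path(data_root) / "medicaid" / str(year) / state / "taf" / claim_type
    # Assumption: every subtype lives under {subtype}/parquet/, as shown for ip and de.
    return {sub: base / sub / "parquet" for sub in TAF_SUBTYPES[claim_type]}


for subtype, folder in taf_parquet_dirs("data_root", 2019, "IL", "rx").items():
    print(subtype, folder.as_posix())
```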
- Each Parquet dataset can be a single file or a directory of partitioned Parquet files.
- Files must be pre-sorted by beneficiary ID to enable efficient partition-level operations.
- The package uses pyarrow as the default Parquet engine.
- {YEAR} is a four-digit year (e.g., 2012, 2019).
- {STATE} is a two-letter uppercase state abbreviation (e.g., WY, AL, IL).
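The naming and sorting requirements above can be checked before loading data. The sketch below uses only the standard library; `check_layout_keys` and `is_sorted_by_id` are hypothetical helpers written for illustration, not functions provided by medicaid-utils.

```python
import re

YEAR_RE = re.compile(r"[0-9]{4}")   # four-digit year, e.g. 2012
STATE_RE = re.compile(r"[A-Z]{2}")  # two-letter uppercase abbreviation, e.g. WY


def check_layout_keys(year, state):
    """Raise ValueError if the folder names break the conventions listed above."""
    if not YEAR_RE.fullmatch(str(year)):
        raise ValueError(f"year must be four digits, got {year!r}")
    if not STATE_RE.fullmatch(str(state)):
        raise ValueError(f"state must be two uppercase letters, got {state!r}")


def is_sorted_by_id(ids):
    """Return True if the beneficiary IDs are in non-decreasing order."""
    ids = list(ids)
    return all(a <= b for a, b in zip(ids, ids[1:]))


check_layout_keys(2012, "WY")                 # passes silently
print(is_sorted_by_id(["A1", "A1", "B2"]))    # True
print(is_sorted_by_id(["B2", "A1"]))          # False
```

For a frame already in memory, pandas offers the same check as `df["MSIS_ID"].is_monotonic_increasing`.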
If your CMS data is in SAS (.sas7bdat), CSV, or other formats, you need to convert it to Parquet first. Example using pandas:
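from pathlib import Path

import pandas as pd

# Read SAS file
df = pd.read_sas("max_ip_2012_wy.sas7bdat")
# Sort by beneficiary ID
df = df.sort_values("MSIS_ID")
# Write to Parquet; the target folder must exist before a single file is written
out_dir = Path("data_root/medicaid/2012/WY/max/ip/parquet")
out_dir.mkdir(parents=True, exist_ok=True)
df.to_parquet(out_dir / "data.parquet", index=False)

For large files, use Dask: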
import dask.dataframe as dd

df = dd.read_csv("max_ip_2012_wy.csv")
# set_index sorts the data by beneficiary ID; reset_index turns MSIS_ID back into a column
df = df.set_index("MSIS_ID").reset_index()
# Writes a directory of partitioned Parquet files
df.to_parquet("data_root/medicaid/2012/WY/max/ip/parquet/")

- Quick Start: Load and process your first claims
- MAX vs TAF: Understand the differences between file formats
medicaid-utils | Documentation | PyPI | GitHub | MIT License | Research Computing Group, Biostatistics Laboratory, The University of Chicago