Data Layout

Manu Murugesan edited this page Mar 13, 2026 · 1 revision

medicaid-utils expects Medicaid claim files to be stored as Parquet datasets, organized by year and state, with each dataset sorted by beneficiary ID (BENE_MSIS or MSIS_ID).

MAX Files (Pre-2016, ICD-9 Era)

data_root/
  medicaid/
    {YEAR}/
      {STATE}/
        max/
          ip/parquet/      # Inpatient claims
          ot/parquet/      # Outpatient claims
          ps/parquet/      # Person Summary
          cc/parquet/      # Chronic Conditions

Example: data_root/medicaid/2012/WY/max/ip/parquet/
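
The layout above can be expressed as a small path builder. This is a minimal sketch, not part of medicaid-utils: the helper name max_dataset_path and the constant MAX_CLAIM_TYPES are hypothetical, introduced here only to illustrate how the directory pieces fit together.

```python
from pathlib import Path

# Claim types listed in the MAX layout above.
MAX_CLAIM_TYPES = ("ip", "ot", "ps", "cc")


def max_dataset_path(data_root, year, state, claim_type):
    """Return the Parquet dataset directory for one MAX file.

    Hypothetical helper for illustration; not a medicaid-utils function.
    """
    if claim_type not in MAX_CLAIM_TYPES:
        raise ValueError(f"unknown MAX claim type: {claim_type!r}")
    return (
        Path(data_root) / "medicaid" / str(year) / state
        / "max" / claim_type / "parquet"
    )


# Reproduces the example path shown above.
print(max_dataset_path("data_root", 2012, "WY", "ip").as_posix())
# data_root/medicaid/2012/WY/max/ip/parquet
```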

TAF Files (2016+, ICD-10 Era)

TAF claims are split into multiple subtypes per claim type:

data_root/
  medicaid/
    {YEAR}/
      {STATE}/
        taf/
          ip/                    # Inpatient
            iph/parquet/         #   Header (base)
            ipl/parquet/         #   Line
            ipoccr/parquet/      #   Occurrence codes
            ipdx/parquet/        #   Diagnosis codes
            ipndc/parquet/       #   NDC codes
          ot/                    # Outpatient (oth, otl, otoccr, otdx, otndc)
          lt/                    # Long-Term Care (lth, ltl, ltoccr, ltdx, ltndc)
          rx/                    # Pharmacy (rxh, rxl, rxndc)
          de/                    # Demographics/Eligibility (Person Summary)
            debse/parquet/       #   Base demographics
            dedts/parquet/       #   Dates
            demc/parquet/        #   Managed care
            dedsb/parquet/       #   Disability
            demfp/parquet/       #   Money Follows the Person
            dewvr/parquet/       #   Waiver
            dehsp/parquet/       #   Health Home & State Plan Option
            dedxndc/parquet/     #   Diagnosis & NDC codes
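
Because each TAF claim type fans out into several subtype datasets, it can help to enumerate them programmatically. The sketch below copies the subtype names from the tree above; the names TAF_SUBTYPES and taf_dataset_paths are hypothetical and are not part of the medicaid-utils API.

```python
from pathlib import Path

# Subtype directories per TAF claim type, copied from the layout above.
TAF_SUBTYPES = {
    "ip": ("iph", "ipl", "ipoccr", "ipdx", "ipndc"),
    "ot": ("oth", "otl", "otoccr", "otdx", "otndc"),
    "lt": ("lth", "ltl", "ltoccr", "ltdx", "ltndc"),
    "rx": ("rxh", "rxl", "rxndc"),
    "de": ("debse", "dedts", "demc", "dedsb", "demfp", "dewvr", "dehsp", "dedxndc"),
}


def taf_dataset_paths(data_root, year, state, claim_type):
    """Map each subtype of a TAF claim type to its Parquet dataset directory.

    Hypothetical helper for illustration; raises KeyError for unknown claim types.
    """
    base = Path(data_root) / "medicaid" / str(year) / state / "taf" / claim_type
    return {sub: base / sub / "parquet" for sub in TAF_SUBTYPES[claim_type]}


for subtype, path in taf_dataset_paths("data_root", 2019, "IL", "rx").items():
    print(subtype, path.as_posix())
```

A dictionary keyed by subtype keeps the header, line, and code datasets of one claim type together, which mirrors how they are grouped on disk.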

Notes

  • Each Parquet dataset can be a single file or a directory of partitioned Parquet files.
  • Files must be pre-sorted by beneficiary ID to enable efficient partition-level operations.
  • The package uses pyarrow as the default Parquet engine.
  • {YEAR} is a four-digit year (e.g., 2012, 2019).
  • {STATE} is a two-letter uppercase state abbreviation (e.g., WY, AL, IL).
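
The naming and sorting requirements in these notes can be checked before loading data. The following is a rough sketch using only the standard library; is_valid_partition and is_sorted are hypothetical helper names, and the sort check assumes the beneficiary ID column has already been read into a Python sequence.

```python
import re

YEAR_RE = re.compile(r"[0-9]{4}")   # four-digit year, e.g. 2012
STATE_RE = re.compile(r"[A-Z]{2}")  # two-letter uppercase abbreviation, e.g. WY


def is_valid_partition(year, state):
    """Check that {YEAR} and {STATE} follow the directory naming rules."""
    return bool(YEAR_RE.fullmatch(str(year)) and STATE_RE.fullmatch(state))


def is_sorted(ids):
    """Return True if the beneficiary IDs are in non-decreasing order."""
    ids = list(ids)
    return all(a <= b for a, b in zip(ids, ids[1:]))


print(is_valid_partition(2012, "WY"))       # True
print(is_valid_partition(2012, "wy"))       # False (lowercase state)
print(is_sorted(["001", "002", "002"]))     # True
print(is_sorted(["002", "001"]))            # False
```

One way to obtain the IDs for a single-file dataset is pandas.read_parquet(path, columns=["MSIS_ID"]). For a partitioned dataset, the check would need to be applied in file order.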

Converting Raw CMS Data

If your CMS data is in SAS (.sas7bdat), CSV, or other formats, you need to convert it to Parquet first. Example using pandas:

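from pathlib import Path

import pandas as pd

# Read SAS file. Without an encoding, pandas returns string columns as bytes;
# "latin-1" is an assumption here, so adjust it to match your files.
df = pd.read_sas("max_ip_2012_wy.sas7bdat", encoding="latin-1")

# Sort by beneficiary ID
df = df.sort_values("MSIS_ID")

# Make sure the target directory exists, then write to Parquet
out_dir = Path("data_root/medicaid/2012/WY/max/ip/parquet")
out_dir.mkdir(parents=True, exist_ok=True)
df.to_parquet(out_dir / "data.parquet", index=False)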

For large files, use Dask:

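import dask.dataframe as dd

# Read the ID as a string so it is not parsed as a number (which would drop leading zeros)
df = dd.read_csv("max_ip_2012_wy.csv", dtype={"MSIS_ID": str})
# set_index sorts the data by MSIS_ID across partitions; reset_index restores it as a column
df = df.set_index("MSIS_ID").reset_index()
# Writes a directory of partitioned Parquet files
df.to_parquet("data_root/medicaid/2012/WY/max/ip/parquet/")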

Next Steps

  • Quick Start — Load and process your first claims
  • MAX vs TAF — Understand the differences between file formats
