health-data cms data-engineering

How CMS Public Use Files Power Health Plan Data

Mark at Opelyx

Every year, sometime in May, the Centers for Medicare & Medicaid Services (CMS) publishes a collection of CSV files that constitute the most comprehensive public dataset on ACA marketplace health plans available anywhere. These are the Public Use Files (PUFs), and if you’re building anything in the health insurance data space, understanding what they contain — and what they don’t — is foundational.

I’ve processed the 2024, 2025, and 2026 plan year PUFs extensively. Here’s what I’ve learned.

The Five File Types

The CMS PUF release isn’t a single file. It’s five distinct datasets that must be joined together to build a complete picture of any plan.

Plan Attributes PUF is the master plan record. Each row is a unique plan variant — a specific plan offered by a specific issuer in a specific service area. Key fields include the StandardComponentId (the stable plan identifier that persists across years), metal tier, plan type (HMO, PPO, EPO, POS, Indemnity), network URL, formulary URL, and various benefit flags. The raw file contains tens of thousands of rows for federally facilitated marketplace (FFM) states; after deduplication to standard component variants, our processed database holds about 22,000 plans across 30 FFM states.

Rates PUF is the largest file and the most complex. Each row is a premium for a specific plan, in a specific rating area, for a specific age band, with a specific tobacco status variant. For the 2026 plan year covering 30 FFM states, this comes out to approximately 1.4 million rows in our normalized schema. The raw CMS format uses wide-format age columns (Age0, Age1, … Age64, AgeFamily) which require unpivoting into a long format that’s actually queryable.

Benefits and Cost Sharing PUF maps plans to individual benefit records — things like whether a specific service is covered, what the copay is, whether it applies before or after the deductible. A single plan can have 50+ benefit rows. This is where you get the SBC-style data used for cost estimation.

Service Area PUF defines which geographic areas each issuer’s plans are available in, down to the county level. Service areas are separate from rating areas: an issuer might file plans statewide but cover only some of the counties that fall within a given rating area.

Machine Readable PUF is the index of machine-readable URLs — the plans.json files that issuers are required to publish. These URLs point to the issuer’s own JSON files with provider directories, formularies, and plan details. The quality varies significantly; about 42% of URLs in the most recent PUF are populated, but only 32.6% point to actual useful documents — the remaining ~10% are generic issuer landing pages that don’t contain machine-readable plan data.
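As a sketch of how that triage might work, here is a hypothetical heuristic for bucketing Machine Readable PUF URLs before spending any crawl budget on them. The function name and categories are mine, and a real classification ultimately requires fetching each URL and inspecting the response:

```python
from urllib.parse import urlparse

def classify_mrf_url(url):
    """Bucket a Machine Readable PUF URL before crawling it.
    Heuristic only; the real check is fetching and inspecting the body."""
    if not url or not url.strip():
        return "missing"
    parsed = urlparse(url.strip().lower())
    # An explicit .json path is a strong signal of real machine-readable data
    if parsed.path.endswith(".json"):
        return "likely-machine-readable"
    # A bare domain root is almost always a generic issuer landing page
    if parsed.path in ("", "/"):
        return "landing-page"
    # Everything else needs an actual HTTP fetch to decide
    return "needs-fetch"
```

Even a crude pre-filter like this lets the crawler prioritize the URLs most likely to contain plan data.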

The Release Cycle

The PUFs for a given plan year are typically released in May of the preceding calendar year. The 2026 plan year PUFs (covering plans sold during open enrollment in fall 2025) were released in mid-May 2025. That gives the industry about six months to build systems on the new data before open enrollment begins.

There are sometimes revisions after the initial release. Issuers can update their filings, and CMS publishes updated PUFs to reflect changes. If you’re building a production system, you need a mechanism to re-ingest when updates are released — just downloading the files once and never checking again will leave you with stale data.
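A minimal sketch of that re-ingest check, assuming we keep a manifest of content hashes from the previous ingest (the file and manifest names are illustrative):

```python
import hashlib
from pathlib import Path

def file_sha256(path):
    """Stream a file through SHA-256 so large PUFs don't load into memory."""
    h = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_reingest(path, manifest):
    """True if the downloaded file differs from the hash recorded at the
    last ingest (or was never seen before)."""
    path = Path(path)
    return manifest.get(path.name) != file_sha256(path)
```

After each successful ingest, the manifest is rewritten with the new hashes, so a scheduled job can download the PUFs, call needs_reingest, and skip the expensive pipeline when nothing changed.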

The 21 state-based marketplace (SBM) states are entirely absent from the CMS PUF release. California, New York, Colorado, and the other state-run exchanges publish their own data on their own schedules. Some publish in the same May window. Others release later. A few don’t publish PUF-equivalent data at all and require different sourcing approaches.

Processing Challenges at Scale

The naive approach to PUF processing — download CSV, parse with pandas, export SQL — works fine for small subsets but breaks down at production scale. Here are the specific problems we solved:

Age band normalization. The Rates PUF uses a wide format with one column per age. Individual ages 0-64 get their own column, then there’s a single “Family” column for family rates. To make premiums queryable by a specific person’s age, you need to unpivot this into one row per age per plan per rating area. A plan with rates in 10 rating areas goes from 10 rows to 640+ rows after normalization.
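The unpivot itself falls out of pandas melt. This sketch assumes the wide column names Age0 through Age64 plus AgeFamily described above; verify the actual headers against the file you download, since they can shift between plan years:

```python
import pandas as pd

def unpivot_rates(wide):
    """Unpivot wide age columns (Age0..Age64, AgeFamily) into one row per
    age per plan per rating area. Column names are illustrative."""
    age_cols = [c for c in wide.columns if c.startswith("Age")]
    long = wide.melt(
        id_vars=["PlanId", "RatingAreaId"],
        value_vars=age_cols,
        var_name="age_band",
        value_name="individual_rate",
    )
    # "Age37" -> "37"; "AgeFamily" -> "Family" stays as a non-numeric marker
    long["age"] = long["age_band"].str.removeprefix("Age")
    return long.drop(columns=["age_band"])
```

The resulting long table is what makes "premium for a 37-year-old in rating area 5" a single indexed lookup instead of a column selection.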

Tobacco variants. For each age and rating area, the Rates PUF has separate rows for “Tobacco User/Non-Tobacco User,” “No Preference,” and “Tobacco User.” For API purposes, you typically want “No Preference” — the non-tobacco rate. But CMS changed the column naming conventions between plan years, so your parsing logic needs to handle multiple representations of the same semantic concept.
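One way to absorb those year-to-year naming differences is a small alias table mapping every label variant onto a canonical vocabulary. The specific aliases below are illustrative rather than an exhaustive list of what CMS has shipped:

```python
# Illustrative alias table: canonicalizes tobacco-status labels across
# plan years. Extend it when a new plan year introduces a new spelling.
TOBACCO_ALIASES = {
    "no preference": "no_preference",
    "non-tobacco user": "non_tobacco",
    "tobacco user": "tobacco",
    "tobacco user/non-tobacco user": "split",  # row carries both rates
}

def normalize_tobacco(raw):
    """Map a raw CMS tobacco label to a canonical value, rejecting anything
    unrecognized so a new plan-year label fails fast at ingest time."""
    key = raw.strip().lower()
    if key not in TOBACCO_ALIASES:
        raise ValueError(f"Unrecognized tobacco label: {raw!r}")
    return TOBACCO_ALIASES[key]
```

Failing loudly on an unknown label is deliberate: a silent pass-through would let a renamed column variant corrupt a whole plan year's rates.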

StandardComponentId vs PlanId. These are different identifiers, and the join logic matters. PlanId is the full 17-character HIOS plan ID including the variant suffix. StandardComponentId is the first 14 characters, identifying the plan regardless of variant. Rates join to plans on StandardComponentId, not PlanId. Join on the wrong key and you’ll get either duplicates or missing records, depending on how you handle it.
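The relationship is simple enough to capture in a tiny helper. The example ID below follows the documented base-plus-variant-suffix shape but is made up:

```python
def split_plan_id(plan_id):
    """Split a full HIOS plan ID into (standard_component_id, variant).
    e.g. '12345FL1234567-01' -> ('12345FL1234567', '01').
    Rates carry the standard component id, so joins to plans must use the
    base, never the full variant-suffixed id."""
    if "-" in plan_id:
        base, variant = plan_id.rsplit("-", 1)
        return base, variant
    return plan_id, None
```

Deriving the base id once at ingest, and storing it as its own indexed column, is cheaper than doing string surgery inside every join.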

D1 import performance. Our full normalized dataset is around 275MB of SQL. Cloudflare D1 has a maximum import file size, so we split the data into chunks and import sequentially. D1 caches imports by content hash — importing a byte-identical file twice makes the second import a no-op. That’s usually a feature, but when you need to force a re-run of an import, the filename doesn’t matter: you have to change the content itself to bust the cache (we add a timestamp comment to the SQL header).

Concurrent imports fail. D1 allows only one active import at a time per database. If you try to run parallel imports, you get “Currently processing a long-running import” errors. The solution is sequential imports, which means our full initial load takes 15-20 minutes but runs unattended.
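The sequential runner can be as simple as a loop that blocks on each import before starting the next. The wrangler invocation below is from memory, so verify the flags against current wrangler documentation; the runner parameter exists so the loop can be tested without shelling out:

```python
import subprocess
from pathlib import Path

def import_sequentially(chunks, database="health-plans", runner=None):
    """Import each SQL chunk strictly one at a time, waiting for each
    import to finish before starting the next. Database name is illustrative."""
    runner = runner or (lambda p: subprocess.run(
        ["npx", "wrangler", "d1", "execute", database,
         f"--file={p}", "--remote"],
        check=True,  # abort the whole load if any chunk fails
    ))
    for path in sorted(Path(c) for c in chunks):
        runner(path)  # blocks until this import completes
```

Sorting by filename keeps schema chunks ahead of data chunks when the exporter numbers them in dependency order, and check=True stops the load at the first failed chunk instead of importing past it.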

What the Data Enables

After processing, the query pattern is a two-step lookup. First, resolve the ZIP code to a rating area via a dedicated zip_rating_area table. Second, query plans and rates directly by that rating area ID. Separating these concerns keeps each query simple and indexable.

-- Step 1: resolve ZIP to rating area (single-row lookup)
SELECT rating_area_id, state_code
FROM zip_rating_area
WHERE zip_code = ?

-- Step 2: query plans + rates by rating area
SELECT
  p.plan_id,
  p.plan_marketing_name,
  p.metal_level,
  p.plan_type,
  r.individual_rate,
  pcs.medical_deductible_individual,
  pcs.moop_individual
FROM rates r
INNER JOIN plans p ON p.standard_component_id = r.plan_id
LEFT JOIN plan_cost_sharing pcs ON pcs.plan_id = p.plan_id
WHERE r.rating_area_id = ?
  AND r.state_code = ?
  AND r.age = ?
  AND p.plan_id LIKE '%-01'
ORDER BY r.individual_rate ASC

A few things worth noting about this structure. The rates table is the driving table — rates are joined up to plans on standard_component_id, not the other way around. The plan_id LIKE '%-01' filter is important: each plan has up to six CSR variant suffixes (-01 through -06), and only -01 is the base plan; the others are cost-sharing reduction variants for subsidized enrollees. Querying without this filter returns duplicate plans at the same price.

Tobacco status filtering happens at ingest time — the rates table stores only the non-tobacco rates, so there’s no tobacco column to filter on at query time. The plan_cost_sharing left join pulls the deductible and out-of-pocket maximum data needed for cost estimation.

The data pipeline — from raw CMS CSVs to a queryable D1 database — processes about 1.4 million rate records, 22,000 plans across 51 jurisdictions, and multiple join tables for rating areas, ZIPs, and service areas. It runs once per plan year with mid-cycle re-runs when CMS publishes updates.

The hard part isn’t writing the SQL. It’s understanding the data well enough to know which columns to trust, which joins are correct, and where the CMS documentation diverges from what the files actually contain. That understanding took months to build.