health-data sbm data-engineering

The State-Based Marketplace Data Challenge

Mark at Opelyx

When we started building the Opelyx Health Plans API, the CMS Public Use Files were the obvious starting point. Federal marketplaces, 30 states, clean CSV files published every year — relatively straightforward. We had plans, rates, cost-sharing, and even brochure URLs in one place.

Then we looked at the other 21 states.

The SBM Problem

State-Based Marketplaces (SBMs) are exchanges run entirely by individual states rather than the federal government. California, Colorado, New York, Massachusetts — the biggest markets in the country, home to tens of millions of people shopping for coverage. And the federal CMS PUF files contain exactly zero SBM plan data.

That’s not an exaggeration. We parsed machine_readable_PUF_2026.xlsx — the index of all machine-readable plan data — and cross-referenced it against every state. The SBM states are simply absent. If you want to cover the full market, you have to go to each state individually.

The 21 SBM states are: CA, CO, CT, DC, GA, ID, IL, KY, MA, MD, ME, MN, NJ, NM, NV, NY, PA, RI, VA, VT, and WA.

Here’s where it gets complicated: each one does something different.

Three Categories of SBM Data

After building parsers for each state, we found they roughly fall into three categories, though even that taxonomy breaks down in practice.

States with CMS-style PUF data. Georgia is the standout example here. Georgia Access publishes a full Plan PUF on its website that mirrors the CMS format almost exactly — including URL fields for plan brochures and SBCs. It's the only SBM state we found with structured URL data, which made Georgia straightforward to import. If every SBM state did what Georgia does, our lives would be much simpler.

States with XLSX rate sheets. California (Covered California) publishes rate data as Excel files. They include rating area assignments, plan names, and rates by age bracket. The structure is consistent year over year, which helps. Vermont is similar — tiny market (two insurers, Anthem and Blue Cross), but clean and consistent XLSX data. We upload these raw files to R2 immediately on download so we have a chain-of-custody audit trail, then parse them separately.
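The archive-first step is simple to sketch. In this minimal illustration, `DictStorage` stands in for an S3-compatible client pointed at R2, and the key and parser are made up for the example:

```python
import hashlib

def archive_then_parse(raw_bytes, key, storage, parser):
    """Archive-first pattern (sketch): the raw file lands in object storage
    before any transformation runs, so every parsed record can be traced
    back to an immutable source artifact. `storage` is anything with a
    put(key, data) method; in production this would be an S3-compatible
    client talking to R2."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    storage.put(key, raw_bytes)   # raw file lands first, untouched
    parsed = parser(raw_bytes)    # parsing happens only after archival
    return digest, parsed

class DictStorage:
    """In-memory stand-in for the object store, for illustration only."""
    def __init__(self):
        self.objects = {}
    def put(self, key, data):
        self.objects[key] = data
```

The content hash gives us a cheap way to detect when a state silently republishes a file under the same name.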

States where you’re reading regulatory filings. Colorado is the most interesting case. The Colorado DOI publishes rate filings through SERFF (System for Electronic Rate and Form Filing). You go to filingaccess.serff.com/sfa/home/CO, search for health insurance rate filings, and dig into documents filed against Rate Manual Regulation 4-2-39. The rate tables are embedded in PDF or Excel attachments — structured, but buried in administrative filings designed for actuaries, not developers.

What “Structured” Actually Means

The gap between “structured data” and “usable data” is large. Take rating area identifiers. The federal system uses numeric rating area IDs like “Rating Area 43”. Multiple states use this scheme independently, so “Rating Area 43” in Florida is a completely different geography than “Rating Area 43” in South Carolina. You always have to filter by state_code alongside the rating_area field — joining on rating area alone gives you garbage results.
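The composite-key join looks like this in a minimal sketch; the rates and county assignments below are invented for illustration, not real data:

```python
# Rating area IDs are only unique within a state, so any join must key
# on (state_code, rating_area), never rating_area alone.
rates = [
    {"state_code": "FL", "rating_area": "Rating Area 43", "rate": 412.10},
    {"state_code": "SC", "rating_area": "Rating Area 43", "rate": 389.55},
]

areas = [
    {"state_code": "FL", "rating_area": "Rating Area 43", "counties": ["Miami-Dade"]},
    {"state_code": "SC", "rating_area": "Rating Area 43", "counties": ["Charleston"]},
]

def join_rates_to_areas(rates, areas):
    """Join on the composite key; a bare rating_area join would collide
    across states that reuse the same numeric IDs."""
    area_index = {(a["state_code"], a["rating_area"]): a for a in areas}
    joined = []
    for r in rates:
        area = area_index.get((r["state_code"], r["rating_area"]))
        if area is not None:
            joined.append({**r, "counties": area["counties"]})
    return joined
```

Indexing on the tuple key means the two "Rating Area 43" rows resolve to two different geographies, as they should.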

Age brackets are another one. The CMS PUFs use standard age bands, but the specifics vary: some plans list individual ages 0 through 64, some use bands like “0-14”, “21-29”, “30-39”, and so on. California’s XLSX files use their own age tier system. When you’re trying to normalize rates across 51 jurisdictions, you end up writing a lot of band-normalization logic that feels like it should be a solved problem but isn’t.
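A band-expansion helper is the core of that normalization logic. This sketch handles the three label shapes mentioned above — a single age, a range, and an open-ended top band — and assumes rating stops at 64, per the federal age-curve convention:

```python
def expand_age_band(label):
    """Expand an age-band label into the individual ages it covers.

    Handles three shapes seen in source files: a single age ("27"),
    a range ("21-29"), and an open top band ("64 and over"), which we
    clamp at 64 since ACA age rating tops out there.
    """
    label = label.strip().lower()
    if label.endswith("and over"):
        start = int(label.split()[0])
        return list(range(start, 65))
    if "-" in label:
        lo, hi = label.split("-")
        return list(range(int(lo), int(hi) + 1))
    return [int(label)]
```

Once every source's bands are expanded to individual ages, cross-state rate comparisons become simple lookups.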

Then there’s the tobacco surcharge issue. Federal rules allow insurers to charge tobacco users up to 1.5 times the non-tobacco rate. In practice, some plans list a single “No Preference” rate, some list separate “Tobacco User” and “Non-Tobacco User” rates, and some list “Tobacco User/Non-Tobacco User” as a combined category. Our parser has to handle all three. And if you only import the “Non-Tobacco User” rows and skip “No Preference” plans, you miss a chunk of coverage. We import both “No Preference” and “Tobacco User/Non-Tobacco User” rows to make sure nothing falls through.
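The three-way handling can be sketched as a category map that fans each source row out into explicit per-status rows. The mapping of the combined categories to both statuses is our reading of the source files, not a published standard:

```python
# Map each source-file tobacco label to the status rows it should emit.
TOBACCO_CATEGORIES = {
    "no preference": ("non_tobacco", "tobacco"),               # one rate, both statuses
    "non-tobacco user": ("non_tobacco",),
    "tobacco user": ("tobacco",),
    "tobacco user/non-tobacco user": ("non_tobacco", "tobacco"),  # combined category
}

def normalize_tobacco_rows(rows):
    """Fan each source row out into explicit per-status rate rows,
    failing loudly on any category we haven't seen before."""
    out = []
    for row in rows:
        statuses = TOBACCO_CATEGORIES.get(row["tobacco"].strip().lower())
        if statuses is None:
            raise ValueError(f"unknown tobacco category: {row['tobacco']!r}")
        for status in statuses:
            out.append({"plan_id": row["plan_id"], "status": status, "rate": row["rate"]})
    return out
```

Raising on unknown labels matters more than it looks: a silently dropped category is exactly how coverage gaps creep in.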

The URL Problem

One of the most requested features in any health plan dataset is plan document URLs — brochure PDFs, Summary of Benefits and Coverage documents. The CMS PUFs include brochure_url and sbc_url columns for FFM plans.

SBM states almost never have this. Out of 21 states, Georgia is essentially the only one with structured URL data in its PUF. The rest require either crawling issuer websites directly or building state-specific scrapers.

We spent time crawling national issuers’ machine-readable files — the plans.json endpoints that major carriers publish for CMS compliance. This works for FFM states, but for SBM states there’s a catch: the HIOS plan component IDs in issuer machine-readable files often don’t match the IDs in state exchange data. For example, PacificSource Oregon plans appear in both the issuer’s JSON (using one ID prefix) and in state data (using a different prefix) for what is literally the same plan. You can’t join on plan ID alone. You end up building fuzzy crosswalk tables matching on plan name, county, and metal tier to get the URLs aligned.
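A minimal version of that fuzzy crosswalk looks like the sketch below: block on (county, metal tier), then fuzzy-match on plan name. The 0.85 threshold and the plan IDs are illustrative assumptions, not values from our production system:

```python
import difflib

def crosswalk_match(exchange_plan, issuer_plans, threshold=0.85):
    """Find the issuer-file plan that best matches an exchange plan.

    HIOS component IDs often differ between the two sources, so we
    block on exact (county, metal_tier) and score plan names with
    difflib's ratio. Below the threshold we return None rather than
    guess."""
    best, best_score = None, 0.0
    for cand in issuer_plans:
        if (cand["county"], cand["metal_tier"]) != (
            exchange_plan["county"], exchange_plan["metal_tier"]
        ):
            continue
        score = difflib.SequenceMatcher(
            None, exchange_plan["name"].lower(), cand["name"].lower()
        ).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None
```

Returning None on low-confidence matches keeps bad URLs out of the dataset; a missing URL is recoverable later, a wrong one is not.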

Even when you have URLs, a meaningful percentage are junk. We found that roughly 58% of brochure URL fields in FFM data are NULL. Of the ones that are populated, about 10% are generic landing pages — Ambetter links that go to ambetterhealth.com, Regence links that return the bare domain. We NULL those out rather than store them. A NULL is honest. A generic landing page that returns HTTP 200 but contains no plan-specific information is actively misleading.
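The simplest filter for the bare-domain case checks the URL path before anything else; a real pipeline would also fetch the page, but this sketch shows the structural heuristic alone:

```python
from urllib.parse import urlparse

def clean_brochure_url(url):
    """Return the URL only if it looks plan-specific; otherwise None.

    Heuristic sketch: a URL whose path is empty or "/" is a bare
    domain or landing page, which carries no plan-specific
    information, so we treat it as no data at all."""
    if not url:
        return None
    if urlparse(url).path in ("", "/"):
        return None
    return url
```

Anything that survives this filter still gets a content check downstream, but the path test alone knocks out the bare-domain class of junk for free.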

Our Approach: A Unified Parser

After building individual parsers for each state format, we refactored into a shared parsing library with state-specific adapters. The core contracts are:

  • Every parser outputs the same internal record shape: plan metadata, rate rows keyed by age/tobacco/area, cost-sharing tiers
  • Every parser logs what it’s doing at the row level, not just summary counts
  • Raw source files always land in R2 first, before any transformation
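The adapter contract can be sketched as a small base class. The record fields mirror the shape described above; the class names, the stub adapter, and the HIOS-style plan ID are all hypothetical:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class PlanRecord:
    """The shared internal shape every state adapter emits (sketch)."""
    plan_id: str
    state_code: str
    metadata: dict = field(default_factory=dict)
    rate_rows: list = field(default_factory=list)   # keyed by age/tobacco/area
    cost_sharing: list = field(default_factory=list)

class StateAdapter(ABC):
    """One adapter per state; the core library only sees PlanRecords."""
    state_code: str

    @abstractmethod
    def parse(self, raw_path):
        """Parse an already-archived raw file into a list of PlanRecords,
        logging at the row level as it goes."""

class VermontAdapter(StateAdapter):
    state_code = "VT"

    def parse(self, raw_path):
        # A real adapter would read the state's XLSX here; this stub
        # returns a single record with a made-up plan ID.
        return [PlanRecord(plan_id="12345VT0010001", state_code="VT")]
```

Because the core library only consumes `PlanRecord`, a layout change in one state's files stays contained inside that state's adapter.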

The unified approach means we can audit any plan record back to its source file, and we can re-run a single state’s parser without affecting the others. When CMS updates the FFM PUF structure (they do, occasionally), we only touch the FFM adapter. When California changes their XLSX layout (they did, in 2025), we only touch the CA adapter.

The unified library took longer to build than shipping state-specific one-off scripts (we had 21 of them before the refactor). But the maintenance burden of 21 separate scripts, each embedding slightly different assumptions about the data, would have been brutal to sustain.

What’s Still Hard

A few things remain genuinely unsolved, or at least not solved cleanly:

Plan URL coverage for SBM states. Unless a state publishes a PUF with URL data (only Georgia does), we’re relying on issuer machine-readable files with the HIOS ID mismatch problem described above. The coverage is incomplete and the matching is approximate.

Mid-year updates. States update their data at different times. CMS does a major PUF release annually but also publishes corrections. State exchanges vary — some do quarterly updates, some are effectively static for the plan year. We track update timestamps per state and flag stale data, but there’s no clean signal that a state has published new data short of polling their download pages.

Smaller states with thin data. Vermont has two insurers. DC has a handful. When data is thin, edge cases surface more often — a plan that somehow got both tobacco and non-tobacco rates filed, a rating area that has no corresponding ZIP codes in the crosswalk, an age bracket that doesn’t match any standard band.

The SBM challenge is fundamentally a coordination problem at a national scale. There’s no technical reason 21 states couldn’t all publish CMS-compatible PUFs. Georgia proves it’s possible. The rest just haven’t. Until that changes, anyone trying to build a comprehensive health plan dataset is doing 21 bespoke integration projects on top of the federal baseline.