health-data data-quality data-engineering

Health Insurance Data Quality: What We Learned Processing 22,000 Plans

Mark at Opelyx

The conventional assumption about government health data is that it’s boring but reliable. CMS publishes the Public Use Files, you download them, you import them, you build your API. What’s there to get wrong?

A lot, as it turns out. Here’s a collection of real data quality problems we found processing the CMS PUFs and state exchange data — what we caught, what we almost missed, and how we changed our pipeline to defend against future surprises.

The Swapped URL Columns

The CMS Plan Attributes PUF includes two URL fields: brochure_url and sbc_url. The brochure is the marketing document explaining what a plan covers in plain language. The SBC — Summary of Benefits and Coverage — is the standardized, regulated document that every plan is required to provide.

We discovered that these columns were transposed for FFM (Federally-facilitated Marketplace) plan data. The values in brochure_url were SBC document links, and vice versa. Both fields contained plausible-looking URLs, and both pointed to real documents, but each was in the wrong column.

We caught this during QA by spot-checking URLs against what the documents actually contained: a “brochure” URL was returning a PDF clearly labeled “Summary of Benefits and Coverage” at the top. A brochure URL that returns an SBC is not a subtle bug. If we’d loaded the columns as-is and served them through the API, every developer using those fields would have given their users the wrong document for each label.

We corrected the mapping in our ingest pipeline. The fix is four characters in one place. The hours of investigation to confirm we weren’t misreading the spec were not four characters.
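A QA check like this one can catch the swap without opening every PDF by hand. This is a hypothetical sketch: the column names mirror the PUF fields described above, but the keyword heuristic and the `flag_possible_swaps` helper are ours, and the real confirmation still came from reading the documents themselves.

```python
# Hypothetical heuristic: an SBC link usually says so in its URL.
# The hint strings below are guesses, not anything CMS publishes.
SBC_HINTS = ("sbc", "summary-of-benefits", "summaryofbenefits")

def looks_like_sbc(url: str) -> bool:
    """True if the URL's own text suggests an SBC document."""
    return any(hint in url.lower() for hint in SBC_HINTS)

def flag_possible_swaps(rows):
    """Yield plan IDs whose two URL columns appear transposed."""
    for row in rows:
        brochure = row.get("brochure_url") or ""
        sbc = row.get("sbc_url") or ""
        if brochure and sbc and looks_like_sbc(brochure) and not looks_like_sbc(sbc):
            yield row["plan_id"]
```

Flagged rows still need a human to open both documents; the heuristic only tells you where to look first.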

Junk URLs Are Worse Than NULLs

When CMS publishes URL columns in the PUFs, they’re reporting what issuers filed. Issuers are required to provide these URLs, but the enforcement on what a valid URL actually means is minimal.

We analyzed brochure URL coverage across the FFM dataset. About 57-58% of fields are NULL. Of the 42% that have values, a meaningful chunk are generic landing pages — not plan-specific documents.

Ambetter is a good example. Across multiple states, their brochure_url values resolve to ambetterhealth.com or a state-variant landing page like ambetterhealth.com/georgia. These pages exist, return HTTP 200, and contain zero plan-specific information. They’re marketing homepages.

Regence does something similar — URLs that resolve to the domain root or a generic health plans page with no reference to the specific plan being looked up.

A NULL in the database communicates “this data isn’t available.” A generic landing page that returns HTTP 200 communicates “here is your document” while actually providing nothing. The second case actively corrupts downstream products. Any developer who renders a “View Plan Document” button and links it to ambetterhealth.com is providing a broken user experience — and they won’t know until a user clicks it.

Our policy: we validate URLs during ingest and NULL out anything that doesn’t resolve to plan-specific content. We check for a set of known generic domains and path patterns. It’s not exhaustive — a new issuer could file junk URLs we haven’t seen before — but it catches the major offenders.
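In code, the filter amounts to a blocklist plus a path-depth check. This is a simplified sketch: the two hosts shown are the examples from above, and the production list is longer and evolves as we find new offenders.

```python
from urllib.parse import urlparse

# Illustrative blocklist; the real list covers more issuers.
GENERIC_HOSTS = {"ambetterhealth.com", "regence.com"}

def clean_brochure_url(url):
    """Return url if it plausibly points at a plan-specific document,
    otherwise None (stored as NULL downstream)."""
    if not url:
        return None
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # A known-generic host with a shallow path (domain root or a single
    # segment like /georgia) is a landing page, not a plan document.
    depth = len([seg for seg in parsed.path.split("/") if seg])
    if host in GENERIC_HOSTS and depth <= 1:
        return None
    return url
```

Anything this function NULLs out simply never reaches the API, which is the behavior we want: absence is honest, a marketing homepage is not.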

The actual brochure URL coverage after cleaning is around 32-33% of plans. A third, not 42%. The difference matters if you’re deciding whether to surface that feature.

Tobacco Preference: Three Ways to Say the Same Thing

The ACA allows tobacco surcharges up to 50% above base premium. How this plays out in plan data is a mess of inconsistent encoding.

Some plans list a single rate row with tobacco preference “No Preference” — one rate applies regardless of tobacco status. Some plans list two rows: “Tobacco User” and “Non-Tobacco User”. Some list “Tobacco User/Non-Tobacco User” as a single combined category that means the same thing as “No Preference.” Some plans list all three.

If your import query filters to only “Non-Tobacco User” rows, you drop all “No Preference” plans. If you filter to “Tobacco User/Non-Tobacco User”, you may miss separate-rate plans. Getting complete coverage requires explicitly accepting “No Preference” and “Tobacco User/Non-Tobacco User”, in addition to “Non-Tobacco User”, as valid values for the non-surcharge rate, which is counterintuitive.

We maintain a set of valid “base rate” tobacco preference values and filter to those when importing rates. It took three rounds of debugging to get the complete list right — each time we thought we had it, we’d find a state where another variant was in use.
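The complete set we converged on looks like this. The value strings are the ones described above; the `tobacco` key is an internal column name from our schema, not the PUF header.

```python
# All three encodings of "this rate applies without a tobacco surcharge".
# It took three debugging rounds to learn the set needs all of them.
BASE_RATE_TOBACCO_VALUES = {
    "Non-Tobacco User",
    "No Preference",
    "Tobacco User/Non-Tobacco User",
}

def base_rate_rows(rate_rows):
    """Keep only the rows carrying the non-surcharge (base) rate."""
    return [r for r in rate_rows if r["tobacco"] in BASE_RATE_TOBACCO_VALUES]
```

A membership set also makes the next surprise cheap to absorb: when a state turns up with yet another variant, the fix is one more string, not a rewritten query.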

Age Bracket Normalization

Health plan rates vary by age. The CMS PUFs encode this as age bands, but the bands vary by source.

The federal standard uses ages 0-20 as a single band (same rate for everyone under 21), then individual ages 21 through 64, then 65+. Some state exchanges use different bands: “0-14”, “15-20”, “21-24”, etc. Some historical data uses different groupings entirely.

Normalizing these into a common schema without losing fidelity requires care. We chose to store rates at the most granular level available from the source, then derive the standard bands for API responses. Collapsing a granular band to a standard band is safe; expanding a coarse band to granular ages requires assumptions (usually “apply the band rate to all individual ages in range”), which introduces approximation.

The API documents which approach is in use for a given plan’s rate data. If you’re building a premium calculator, you need to know whether the rate for a 22-year-old is an exact filed rate or a band-derived estimate.

HIOS Plan ID Mismatches

HIOS (Health Insurance Oversight System) plan IDs are the identifiers that connect a plan in CMS data to the same plan in issuer machine-readable files. The format is a 14-character string: 5-character HIOS issuer ID + 2-character state code + 7-character plan-specific suffix.
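Splitting the ID into its components is straightforward. One caveat in this sketch: the format description above only pins down the lengths, so the character classes (numeric issuer ID, alphanumeric suffix) are our assumption.

```python
import re

# 5-char HIOS issuer ID + 2-char state code + 7-char plan suffix.
# Character classes are our assumption; only the lengths are given.
HIOS_ID_RE = re.compile(r"^(\d{5})([A-Z]{2})([A-Z0-9]{7})$")

def parse_hios_id(plan_id: str) -> tuple[str, str, str]:
    """Split a 14-character HIOS plan ID into (issuer_id, state, suffix)."""
    m = HIOS_ID_RE.match(plan_id)
    if m is None:
        raise ValueError(f"malformed HIOS plan ID: {plan_id!r}")
    issuer_id, state, suffix = m.groups()
    return issuer_id, state, suffix
```

Parsing at ingest, rather than slicing strings ad hoc at query time, also means malformed IDs fail loudly in the pipeline instead of silently failing to join.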

In theory, you can join CMS PUF plan data to issuer machine-readable (MR) plan data on this ID. In practice, issuers sometimes use different plan component IDs in their MR files than CMS uses in the PUF for what is the same underlying plan.

We found this concretely with PacificSource in Oregon. The plans.json from PacificSource uses one ID prefix; the CMS PUF for Oregon has the same plans under a different prefix. We could not match them by ID. We ended up building a fuzzy crosswalk: match on issuer name, county, metal tier, and plan name string similarity. It’s not perfect, but it gets us to around 85% match rate for the plans where we’re trying to pull in MR URLs.
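The crosswalk core can be sketched with stdlib string similarity. The field names and the 0.8 threshold here are illustrative; the production version weights the signals rather than hard-filtering on all of them.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two plan names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def build_crosswalk(puf_plans, mr_plans, threshold=0.8):
    """Map PUF plan IDs to MR plan IDs when the IDs themselves disagree:
    require issuer/county/metal to match, then pick the closest name."""
    crosswalk = {}
    for p in puf_plans:
        candidates = [
            m for m in mr_plans
            if m["issuer"] == p["issuer"]
            and m["county"] == p["county"]
            and m["metal"] == p["metal"]
        ]
        if not candidates:
            continue
        best = max(candidates, key=lambda m: name_similarity(m["name"], p["name"]))
        if name_similarity(best["name"], p["name"]) >= threshold:
            crosswalk[p["plan_id"]] = best["plan_id"]
    return crosswalk
```

The threshold is the knob that trades match rate against false matches; tightening it is what keeps us at roughly 85% rather than a higher, noisier number.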

This HIOS mismatch isn’t a PacificSource-specific quirk. We’ve seen it with other regional carriers, particularly in SBM (state-based marketplace) states, where the state exchange system and CMS tracking may have diverged over the plan year.

Chain-of-Custody: Store Everything in R2 First

Every raw data file — every XLSX from Covered California, every CSV from CMS, every ZIP archive — lands in R2 before we touch it. The key pattern is raw/{source}/{year}/{filename}. For the main CMS PUF: raw/cms/2026/Individual_Market_Medical.csv.

This seems like overhead. It’s saved us twice.

First, when we found the swapped URL columns, we needed to re-run the import from the original source data after fixing the parser. If we’d discarded the raw file, we’d have had to re-download it and hope the source hadn’t changed. The R2 copy was there.

Second, when we discovered the tobacco preference issue was causing plans to be missing from query results, we could confirm exactly which rows in the original CSV were being dropped and why. Debugging a parser against a database full of transformed data is hard. Debugging it against the original file is much easier.

Raw storage costs on R2 are negligible for this data volume — the full PUF set for a year is a few gigabytes. The audit trail is worth far more than the storage cost.
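The key layout above is the only real logic; the upload itself is an ordinary S3 call, since R2 speaks the S3 API. In this sketch the endpoint URL and bucket name are placeholders.

```python
from pathlib import Path

def raw_key(source: str, year: int, filename: str) -> str:
    """Build the archive key for a raw source file."""
    return f"raw/{source}/{year}/{Path(filename).name}"

# The upload is standard boto3 pointed at the account's R2 endpoint
# (endpoint and bucket names below are placeholders):
#
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="https://<account-id>.r2.cloudflarestorage.com")
#   s3.upload_file(local_path, "raw-plan-data", raw_key("cms", 2026, local_path))
```

Writing to the archive before any parsing runs is the point: the copy exists even when the parser crashes on line one.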

The Lesson: Assume the Data Is Wrong

The right mental model for processing public health data is not “this is government data, it’s reliable.” The right model is “this data was assembled from thousands of issuer filings of varying quality, run through government ETL processes, and exported as CSVs. Assume something is wrong.”

That doesn’t mean the data is bad — most of it is accurate and useful. It means: validate outputs against expectations, spot-check values against known correct answers, test edge cases before they become production bugs, and build your pipeline so you can re-run it with fixes rather than patching the database directly.

Every data quality issue we found was found during development, not after launch. That’s not because we got lucky. It’s because we built QA checks into the pipeline and ran them against real data before shipping.