api-design architecture developer-experience

API Design for Financial Data: Lessons from Building Opelyx

Mark at Opelyx

Building an API for financial data is different from building one for, say, a task manager or a blog platform. The data is authoritative — people make coverage decisions, investment decisions, compliance decisions based on what your API returns. That raises the stakes on every design choice: error messages, validation failures, rate limit responses. If the API lies or says something ambiguous, downstream consequences are real.

Here’s what we got right, what we got wrong, and the decisions that aren’t obvious until you’ve shipped and watched real traffic.

Validate Everything at the Boundary

This sounds obvious. It’s still the most common thing to get wrong.

We use Zod for input validation on every endpoint. Not just parsing — full schema validation with meaningful error messages. When a request comes in with age=abc or a missing required field, the response tells you exactly what failed and why, using RFC 9457 Problem Details (application/problem+json):

{
  "type": "https://api.opelyx.com/problems/validation-error",
  "title": "Validation Error",
  "status": 400,
  "detail": "Invalid query parameters.",
  "instance": "/v1/health/plans?age=abc",
  "errors": [
    {
      "field": "age",
      "message": "Expected number, received string"
    }
  ]
}

Not a 400 Bad Request with a bare string body. Not “Invalid parameters.” A structured, machine-parseable body with a URI problem type, the HTTP status, the request path, and an errors array identifying each invalid field.

The other half of boundary validation is output. We validate our own responses against the same Zod schemas before sending them. This caught a production bug early where a null value in a rates array was propagating to the response body because we were spreading database rows without checking shape. Downstream clients had started treating null as a valid rate, which would have silently corrupted their data.

Three-Tier Auth in Under a Millisecond

Authentication on the hot path has to be fast. Our auth flow hits three sources in order:

  1. KV cache, keyed by SHA-256 hash of the API key, 5-minute TTL
  2. D1 database — join between api_key and subscription tables
  3. Legacy fallback — a comma-separated key list in a Worker secret

On a cache hit (the vast majority of requests), auth adds about 0.3ms of latency. On a cache miss, it’s 5-10ms for the D1 round-trip, and we populate the cache so the next request is fast. The legacy fallback is a linear scan through the secret value — not great, but it only runs when both KV and D1 miss, which in practice means a legacy key that was never migrated into D1.

We store SHA-256 hashes, never raw keys. The keys themselves use the format op_ followed by 40 hex characters — enough entropy to be brute-force resistant, short enough to be copyable. The op_ prefix means you can grep for them in source code, logs, or pastes if you’re auditing for accidental leaks.
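A sketch of that key handling (helper names are ours, not the production code; in a Worker the digest would come from crypto.subtle.digest, but node:crypto keeps this snippet runnable anywhere):

```typescript
import { createHash } from "node:crypto"; // Workers: crypto.subtle.digest("SHA-256", ...)

// op_ prefix plus 40 hex characters, greppable in logs and source.
const KEY_PATTERN = /^op_[0-9a-f]{40}$/;

// Keys are stored and looked up only as hex SHA-256 digests;
// the raw key never persists server-side.
function hashApiKey(key: string): string {
  return createHash("sha256").update(key).digest("hex");
}

// Cheap shape check before hitting KV / D1 / the legacy list.
function looksLikeKey(candidate: string): boolean {
  return KEY_PATTERN.test(candidate);
}
```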

One decision worth calling out: we use Bearer tokens (Authorization: Bearer op_...), not custom headers like X-API-Key. Bearer is the RFC 6750 standard. Every HTTP client library in existence handles it natively. Custom header schemes require documentation and often trip up developers who expect standard auth patterns.

Rate Limiting by Tier, Not by Endpoint

We rate-limit at the API key level with per-tier daily buckets: 100 requests/day for free, 10,000 for pro, 100,000 for enterprise. KV is the counter store — we increment on every request and check against the tier ceiling.

One thing we deliberately did not do: per-endpoint rate limits. It’s tempting to say “search is expensive, cap it separately from lookup.” In practice this creates user confusion — you’ve made 47 search requests and 12 lookup requests, and your rate limit response has to explain which bucket you hit. Users think in terms of their total usage, not your internal cost model.

If individual endpoints get expensive enough to justify separate controls, the right answer is tiered pricing that reflects cost, not hidden per-endpoint caps.

Rate limit responses follow the standard HTTP pattern — 429 Too Many Requests with Retry-After, RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset headers per draft-ietf-httpapi-ratelimit-headers. We also send the legacy X-RateLimit-* variants for clients that haven’t adopted the standard yet. Never just a 200 with an error body. Clients that implement retry logic look for the 429 status code.
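As a sketch of the counter logic, with an in-memory Map standing in for KV (the helper is illustrative; the tier names and limits are the ones above):

```typescript
// Per-tier daily ceilings, as described above.
const TIER_LIMITS: Record<string, number> = { free: 100, pro: 10_000, enterprise: 100_000 };

// Stand-in for the KV counter store.
const counters = new Map<string, number>();

function checkRateLimit(keyHash: string, tier: string, now = new Date()) {
  const day = now.toISOString().slice(0, 10); // one bucket per UTC day
  const bucket = `rl:${keyHash}:${day}`;
  const used = (counters.get(bucket) ?? 0) + 1;
  counters.set(bucket, used);

  const limit = TIER_LIMITS[tier] ?? TIER_LIMITS.free;
  const reset = new Date(now);
  reset.setUTCHours(24, 0, 0, 0); // next UTC midnight

  return {
    allowed: used <= limit,
    headers: {
      "RateLimit-Limit": String(limit),
      "RateLimit-Remaining": String(Math.max(0, limit - used)),
      "RateLimit-Reset": String(Math.ceil((reset.getTime() - now.getTime()) / 1000)),
    },
  };
}
```

When `allowed` comes back false, the handler returns the 429 with these headers (plus Retry-After and the legacy X-RateLimit-* variants) rather than serving the request.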

Cursor Pagination Over Offset

Our plan search endpoint returns up to 22,000 plans. Users filter by state, metal tier, insurer, premium range, and a handful of other dimensions. Deep pagination is common — someone scanning all Gold plans in California hits dozens of pages.

We use cursor pagination rather than offset. The cursor is an opaque base64url-encoded JSON payload containing a compound key ({v: sortValue, id: planId}): the value of whichever sort column is active (premium, deductible, max out-of-pocket, or plan name) plus the plan ID as a tiebreaker for deterministic ordering. Each response includes a next_cursor; to get the next page, pass that cursor back.
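The cursor codec itself is tiny. A sketch, assuming the {v, id} shape described above (Buffer is the Node.js API; in a Worker you’d do the base64url round-trip with btoa/atob plus URL-safe character substitutions):

```typescript
// Opaque cursor: base64url-encoded JSON holding the active sort value
// and the plan ID tiebreaker.
interface Cursor {
  v: string | number; // value of the active sort column
  id: string;         // plan ID, for deterministic ordering
}

function encodeCursor(c: Cursor): string {
  return Buffer.from(JSON.stringify(c)).toString("base64url");
}

function decodeCursor(s: string): Cursor {
  return JSON.parse(Buffer.from(s, "base64url").toString("utf8"));
}
```

Clients never need to know the shape; they just echo next_cursor back, which is what makes it safe to change the encoding later.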

The reason isn’t philosophical: D1 (SQLite) resolves the keyset condition with an index seek whose cost doesn’t grow with how deep you are in the result set. The actual WHERE clause is a compound OR:

WHERE (sort_column > :cursor_value)
   OR (sort_column = :cursor_value AND plan_id > :cursor_plan_id)

An OFFSET 5000 forces a scan and discard of 5,000 rows on every request. At our data size, the difference is measurable.

The tradeoff is real: you can’t jump to page 47 of 200. We return a total count on the first request (no cursor) so clients can compute “how many pages are there,” but navigating directly to an arbitrary page isn’t supported. For our use case — search results, not spreadsheet navigation — that’s fine.

Versioned Routes from Day One

All endpoints live under /v1/. Not /api/, not the root. This cost nothing to implement and gives us a clean path to breaking changes later. If we need to change the shape of a response, /v2/ can coexist with /v1/ while clients migrate.

We’ve already had to make a minor breaking change to the plan search response — we renamed a field in the first month. Because we hadn’t shipped /v1/ publicly yet, we could just fix it. If we had customers on that field, we’d have needed the versioning escape hatch. Lesson: version from before you have customers, not after.

OpenAPI Spec as the Source of Truth

We maintain an OpenAPI 3.1 spec in code — not as a separate YAML file someone updates manually, but as a hand-maintained JSON object inline in the route entry point. It powers the Scalar docs UI at /docs and the raw spec at /openapi.json.

Keeping the spec in the same file as the route definitions means code and documentation can’t silently drift apart: if a route changes, the spec is on the same screen waiting to be updated. We’ve all seen API documentation that diverges from reality over months when it lives in a separate file. When a developer hits the Scalar UI and runs a request directly in the browser, they’re running against the same validation logic as production clients. The playground is not a fake demo environment.
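As a sketch, the inline spec is just an object sitting next to the routes; the paths and descriptions here are illustrative, not the real Opelyx spec:

```typescript
// Hand-maintained OpenAPI 3.1 document, co-located with the route handlers.
const openApiSpec = {
  openapi: "3.1.0",
  info: { title: "Opelyx API", version: "1.0.0" },
  paths: {
    "/v1/health/plans": {
      get: {
        summary: "Search health plans",
        parameters: [{ name: "age", in: "query", schema: { type: "integer" } }],
        responses: { "200": { description: "Paginated plan results" } },
      },
    },
  },
};

// The same entry point serializes this object at /openapi.json and feeds
// the Scalar docs UI, so a route change and its spec update land in one diff.
const rawSpec = JSON.stringify(openApiSpec, null, 2);
```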

Structured Error Responses, Always

Every error response follows RFC 9457 Problem Details (application/problem+json). The shape is the same whether it’s a validation failure, a 404, or a rate limit hit:

interface ProblemDetail {
  type: string; // URI identifying the problem type
  title: string; // Short summary (e.g. "Validation Error")
  status: number; // HTTP status code
  detail: string; // Human-readable explanation specific to this occurrence
  instance?: string; // Request path that triggered the error
  errors?: Array<{ field: string; message: string }>; // Validation details
  limit?: number; // Rate limit ceiling (429 responses)
  reset?: string; // Rate limit reset time (429 responses)
  required_tier?: string; // Minimum tier needed (402 responses)
  upgrade?: string; // Dashboard URL for upgrades
  docs?: string; // Documentation URL
}

The type field is a URI identifying the problem, not an uppercase constant. Our well-known types live under https://api.opelyx.com/problems/ — for example, validation-error, not-found, unauthorized, tier-restricted, rate-limit-exceeded, internal-error, method-not-allowed, not-acceptable, and invalid-year. The title field is human-readable. The instance field carries the request path so you can correlate an error to the exact call that triggered it.

We don’t return 500 with an empty body when something breaks internally. We return 500 with a full Problem Details body including the problem type URI. The X-Request-Id header is on every response — errors and successes. If a developer files a bug report and includes a request ID, we can pull the full trace.
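One way to keep the shape identical across every failure path is to route all errors through a single constructor. A sketch (the helper and its signature are illustrative; the base URI and slugs mirror the well-known types listed above):

```typescript
const PROBLEM_BASE = "https://api.opelyx.com/problems/";

interface ProblemDetail {
  type: string;
  title: string;
  status: number;
  detail: string;
  instance?: string;
  [ext: string]: unknown; // extension members: errors, limit, reset, ...
}

// Every handler builds errors through this one function, so a 404, a 429,
// and a 500 all emit the same RFC 9457 structure.
function problem(
  slug: string,
  title: string,
  status: number,
  detail: string,
  extra: Record<string, unknown> = {}
): ProblemDetail {
  return { type: PROBLEM_BASE + slug, title, status, detail, ...extra };
}

// e.g. a rate-limit response body:
const tooMany = problem("rate-limit-exceeded", "Rate Limit Exceeded", 429,
  "Daily request limit reached.", { limit: 100, reset: "2025-01-01T00:00:00Z" });
```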

CORS Is a Security Decision

We have an explicit CORS origin allowlist in @opelyx/auth-core. It’s a string array. No wildcards.

The allowed origins are our own sites (opelyx.com, *.opelyx.com) and any verified developer domains that have been explicitly added. Everything else gets a CORS rejection.
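A sketch of the check (the exact list and the subdomain rule here are illustrative; the real allowlist lives in @opelyx/auth-core):

```typescript
// Explicitly enumerated origins; the reflected Access-Control-Allow-Origin
// value is always a specific origin, never "*".
const ALLOWED_ORIGINS = new Set(["https://opelyx.com", "https://www.opelyx.com"]);

function corsOriginFor(requestOrigin: string | null): string | null {
  if (!requestOrigin) return null;
  if (ALLOWED_ORIGINS.has(requestOrigin)) return requestOrigin;
  // *.opelyx.com subdomains, matched explicitly rather than via a wildcard header.
  try {
    const host = new URL(requestOrigin).hostname;
    if (host.endsWith(".opelyx.com")) return requestOrigin;
  } catch {
    // malformed Origin header
  }
  return null;
}
```

A null result means no Access-Control-Allow-Origin header is set at all, so the browser blocks the cross-origin read.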

This seems obvious, but I’ve seen production APIs ship Access-Control-Allow-Origin: * because “it’s easier and the Bearer token protects everything anyway.” Bearer tokens in Authorization headers are not automatically sent by browsers, so CSRF is not actually the concern here. The real reason for explicit CORS origins is controlling which origins’ JavaScript can read your API responses at all. A malicious site cannot attach a Bearer token it doesn’t hold, but a wildcard policy lets any origin’s script call your API and read the response whenever credentials do get attached (by a misconfigured proxy or a credential-injecting browser extension, for instance). Defense in depth means the CORS header is a security boundary too, not just a convenience toggle.

Content-Security-Policy, X-Frame-Options, X-Content-Type-Options — all set on every response. Not because we expect attacks today, but because retrofitting security headers after a breach is not a fun exercise.

What We’d Do Differently

Two things. First, we’d design the plan ID scheme earlier. We use standard_component_id from the CMS PUF as the primary key for plans, which is the right choice for data fidelity. But it’s a 14-character HIOS identifier that means nothing to developers. We should have added a stable opaque ID alongside it from the start for use in URLs and cursors.

Second, we’d build the usage tracking schema before the first paying customer. We have it now — a usage_log table in AUTH_DB that records every authenticated request by key, endpoint, and timestamp. We needed it to build the usage dashboard and for billing. But we had to backfill it, and backfilled data has gaps. If you’re building an API that will eventually have tiered billing, instrument usage from request one.

The fundamentals — Zod validation, structured errors, versioned routes, cursor pagination — held up. The operational instrumentation always needs to be built earlier than you think.