API Data Ingestion Explained: What It Is and How It Works

API data ingestion is more than making HTTP requests — it is engineering a pipeline that runs reliably for months unattended. That means handling cursor pagination (not offset), rate limits with exponential backoff, OAuth tokens that expire mid-run, schema changes that arrive silently, and write strategies that stay correct when the pipeline retries. Each concept addresses a specific failure mode that will hit you in production.

Complete incremental ingestion loop

# Simplified production ingestion loop
def run_incremental_ingestion(client):
    watermark = load_watermark('orders_api')

    all_records = []
    for page in paginate(client, '/orders', since=watermark):
        valid, dead = parse_records(page)
        all_records.extend(valid)
        if dead:
            write_dead_letter(dead)

    if all_records:
        upsert_to_warehouse(all_records)
        new_watermark = max(r.updated_at for r in all_records)
        save_watermark('orders_api', new_watermark)

# Each function addresses one production failure mode:
# paginate()       → cursor-based, rate-limit-aware
# parse_records()  → Pydantic validation + dead-letter
# upsert_to_warehouse() → MERGE on primary key
# save_watermark() → incremental sync state

The 5 Core Concepts

01

Cursor Pagination

The API returns a next_cursor token with each page. Pass the cursor as a parameter on the next request. Stop when the response contains no cursor. Stable under concurrent inserts — the cursor anchors to a record, not a row count.

Offset pagination skips records when inserts happen mid-scan. Always prefer cursor when available.
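The ingestion loop above calls paginate(); a minimal sketch of such a generator, assuming the API returns JSON of the form {"data": [...], "next_cursor": "..."} (field names vary by API):

```python
def paginate(client, path, since=None):
    """Yield one page of records at a time until the API returns no cursor."""
    params = {"since": since} if since else {}
    cursor = None
    while True:
        if cursor:
            params["cursor"] = cursor
        page = client.get(path, params=params)
        yield page["data"]
        cursor = page.get("next_cursor")
        if not cursor:  # no cursor in the response means we've reached the end
            break
```

Keeping this as a generator lets the caller process pages as they arrive instead of buffering the whole result set.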

02

Exponential Backoff

On a 429 (rate limit) response, wait 2^n + random() seconds before retrying, where n is the attempt count. The random jitter prevents multiple parallel workers from all retrying at the same moment and hammering the API simultaneously.

Retrying immediately on 429 exhausts your remaining quota instantly and triggers longer backoffs from the API.
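A sketch of the retry wrapper, assuming a client whose responses expose a status_code attribute:

```python
import random
import time

def get_with_backoff(client, path, params=None, max_attempts=5):
    """Retry a request on HTTP 429 with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        response = client.get(path, params=params)
        if response.status_code != 429:
            return response
        # Wait 2^attempt seconds plus up to 1s of random jitter so parallel
        # workers don't all retry at the same instant.
        time.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```

If the API sends a Retry-After header, honoring it instead of the computed delay is even better.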

03

OAuth 2.0 Token Refresh

Access tokens expire (typically 60 minutes). Build automatic refresh into the HTTP client: check expiry before each request, refresh if within 60 seconds of expiry, cache the new token. Never refresh tokens inside the pagination loop — only in the client layer.

A hard-coded token expires silently mid-run: a 3-hour backfill job fails at hour 1 with a 401.
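A sketch of refresh in the client layer, assuming the standard refresh_token grant and a token response containing access_token and expires_in (the hypothetical token_url and session objects stand in for your provider and HTTP library):

```python
import time

class OAuthClient:
    """HTTP client wrapper that refreshes the access token before it expires."""
    REFRESH_MARGIN = 60  # refresh when within 60 seconds of expiry

    def __init__(self, session, token_url, refresh_token):
        self.session = session
        self.token_url = token_url
        self.refresh_token = refresh_token
        self.access_token = None
        self.expires_at = 0.0  # forces a refresh on the first request

    def _ensure_token(self):
        if time.time() >= self.expires_at - self.REFRESH_MARGIN:
            resp = self.session.post(self.token_url, data={
                "grant_type": "refresh_token",
                "refresh_token": self.refresh_token,
            }).json()
            self.access_token = resp["access_token"]
            self.expires_at = time.time() + resp["expires_in"]

    def get(self, path, params=None):
        self._ensure_token()  # refresh lives here, never in the pagination loop
        return self.session.get(path, params=params,
                                headers={"Authorization": f"Bearer {self.access_token}"})
```

Every request goes through _ensure_token(), so the pagination code never needs to know tokens exist.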

04

Watermark Tracking

A watermark is the max updated_at from the last successful run. Pass it as the since parameter to fetch only new or changed records. After each successful run, save the new max updated_at as the watermark in a metadata table or key-value store.

Full refresh on every run wastes API quota, increases run time, and hits rate limits unnecessarily.
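A minimal sketch of the load_watermark/save_watermark pair from the loop above, backed here by SQLite for illustration; any metadata table or key-value store works the same way:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS watermarks (source TEXT PRIMARY KEY, value TEXT)"
)

def load_watermark(source):
    """Return the saved watermark for a source, or None on the first run."""
    row = conn.execute(
        "SELECT value FROM watermarks WHERE source = ?", (source,)
    ).fetchone()
    return row[0] if row else None

def save_watermark(source, value):
    """Upsert the watermark so each source keeps exactly one row."""
    conn.execute(
        "INSERT INTO watermarks (source, value) VALUES (?, ?) "
        "ON CONFLICT(source) DO UPDATE SET value = excluded.value",
        (source, value),
    )
    conn.commit()
```

A None return on the first run signals a full initial load; every run after that is incremental.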

05

Idempotent Writes

Write records using UPSERT (MERGE) on a stable primary key. If the pipeline replays the same API response twice — due to a network error or manual re-run — UPSERT replaces the existing record rather than inserting a duplicate.

INSERT-only writes create duplicate rows every time the pipeline retries. Even one retry doubles the affected rows.
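A sketch of the UPSERT pattern using SQLite's ON CONFLICT clause; warehouse engines express the same idea as MERGE, but the hypothetical orders table and columns here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, updated_at TEXT)"
)

def upsert_orders(records):
    """Write a batch idempotently: replaying it leaves the table unchanged."""
    conn.executemany(
        "INSERT INTO orders (id, status, updated_at) "
        "VALUES (:id, :status, :updated_at) "
        "ON CONFLICT(id) DO UPDATE SET "
        "status = excluded.status, updated_at = excluded.updated_at",
        records,
    )
    conn.commit()
```

Replaying a batch overwrites each row with identical values, so a retry after a network error cannot create duplicates.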

Common Mistakes

Mixing pagination logic with business logic

Keep pagination in a dedicated generator function that yields pages. Keep schema parsing and writing in separate functions. Mixing them makes the pipeline hard to test and impossible to retry at the page level.

Not saving watermarks atomically

If you update the watermark before confirming the warehouse write succeeded, a failure between the two steps leaves the warehouse missing records but the watermark advanced. Always update the watermark only after confirming the write completed.

Over-parallelizing without rate limit coordination

Running 10 parallel workers each using the full rate limit quota will hit the daily cap in 1/10th the expected time. Either use a shared token bucket across workers or partition the API quota explicitly across workers.
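A minimal sketch of a shared token bucket for workers in one process (threads); workers on separate machines would need the same idea backed by a shared store such as Redis:

```python
import threading
import time

class TokenBucket:
    """Shared rate limiter: each worker blocks until a request token is free."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        while True:
            with self.lock:
                # Refill tokens based on elapsed time, capped at capacity.
                now = time.monotonic()
                self.tokens = min(
                    self.capacity, self.tokens + (now - self.updated) * self.rate
                )
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)  # sleep outside the lock so others can refill
```

All workers call bucket.acquire() before each request, so their combined rate never exceeds the API quota regardless of how many run in parallel.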

FAQ

What is API data ingestion in simple terms?
Systematically pulling data from HTTP APIs — handling pagination, rate limits, expired credentials, schema changes, and duplicate prevention — so your warehouse stays current without manual intervention.
Cursor vs offset pagination?
Cursor is stable under concurrent inserts (anchors to a record). Offset shifts when records are inserted mid-scan, causing skips and duplicates. Always prefer cursor for production pipelines.
What is a watermark?
The max updated_at from the last successful run. Passed as the since parameter on the next incremental run so only new/changed records are fetched.
What is idempotent ingestion?
Writing with UPSERT on a primary key so running the pipeline twice produces the same result — no duplicates on retry.
