What is API Data Ingestion? A Complete Guide for Data Engineers (2026)
API data ingestion is the practice of systematically extracting data from REST and GraphQL APIs into data pipelines — handling pagination, rate limits, authentication, incremental sync, and schema drift at production scale.
Quick Answer
API data ingestion is pulling data from external APIs into your data platform reliably and efficiently. The challenge is not the HTTP call — it is handling everything that goes wrong at scale: pagination across millions of records, rate limits that throttle your pipeline, tokens that expire mid-run, schemas that change without notice, and the need to extract only new data on incremental runs without re-scanning the full dataset.
What is API Data Ingestion?
Most production data does not live in databases you can directly query — it lives behind APIs. Salesforce CRM records, Stripe payment events, GitHub repository data, Shopify orders, Twitter/X posts, weather data, financial market feeds — all delivered via REST or GraphQL APIs with authentication, pagination, and rate limits.
API data ingestion is the engineering discipline of extracting this data systematically: fetching pages, handling errors, refreshing credentials, tracking what you have already ingested, and writing results idempotently to your warehouse or lake. Done well, it runs unattended for months. Done poorly, it fills your warehouse with duplicates or silently stops ingesting when a token expires.
Ingestion Flow
1. Authenticate (API key / OAuth 2.0)
2. Fetch first page of results
3. Parse + validate response schema
4. Write to storage (UPSERT / deduplicate)
5. Advance cursor / watermark
6. Repeat until no next page
Core Toolchain
- requests / httpx — HTTP client
- Airbyte / Fivetran — managed connectors
- Singer taps — open-source connectors
- Airflow — pipeline orchestration
- dbt — downstream transformation
- Pydantic — response schema validation
Why Production API Ingestion is Hard
Naive approach
- ✗ Full refresh every run — expensive, slow
- ✗ No 429 handling — pipeline crashes at scale
- ✗ Access tokens hard-coded — expire silently
- ✗ Offset pagination — skips records on inserts
- ✗ No schema validation — silent data corruption
- ✗ Duplicate rows when pipeline replays
Production approach
- ✓ Watermark-based incremental extraction
- ✓ Exponential backoff + jitter on 429
- ✓ Automatic OAuth token refresh
- ✓ Cursor pagination — stable under inserts
- ✓ Pydantic schema validation on every response
- ✓ UPSERT writes — idempotent on replay
How API Ingestion Works
Cursor-based pagination with watermark tracking
```python
import random
import time

import httpx
from pydantic import BaseModel

API_URL = "https://api.example.com/orders"  # replace with the real endpoint
ACCESS_TOKEN = "replace-me"                 # see the OAuth client below for refresh

class OrderRecord(BaseModel):
    order_id: str
    amount_usd: float
    created_at: str
    status: str

def auth_headers() -> dict:
    return {"Authorization": f"Bearer {ACCESS_TOKEN}"}

def fetch_all_orders(since: str) -> list[OrderRecord]:
    records, cursor = [], None
    while True:
        resp = fetch_page_with_backoff(since=since, cursor=cursor)
        records += [OrderRecord(**r) for r in resp["data"]]
        cursor = resp.get("next_cursor")
        if not cursor:
            break
    return records

def fetch_page_with_backoff(**params) -> dict:
    for attempt in range(5):
        r = httpx.get(API_URL, params=params, headers=auth_headers())
        if r.status_code == 429:
            wait = (2 ** attempt) + random.random()  # exponential backoff + jitter
            time.sleep(wait)
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError("rate limited: retries exhausted after 5 attempts")
```
OAuth 2.0 automatic token refresh
```python
import httpx
from datetime import datetime, timedelta

TOKEN_URL = "https://auth.example.com/oauth/token"  # replace with the provider's token endpoint

class OAuthClient:
    _token: str | None = None
    _expires_at: datetime | None = None

    def token(self) -> str:
        if not self._token or datetime.utcnow() >= self._expires_at:
            self._refresh()
        return self._token

    def _refresh(self) -> None:
        # refresh_grant_body: provider-specific fields (grant_type, refresh_token, client_id, ...)
        r = httpx.post(TOKEN_URL, data=refresh_grant_body)
        r.raise_for_status()
        payload = r.json()
        self._token = payload["access_token"]
        self._expires_at = datetime.utcnow() + timedelta(
            seconds=payload["expires_in"] - 60  # refresh 60s early as a safety margin
        )
```
Ingestion Patterns: Polling vs Webhooks vs CDC
Polling (Pull)
Use when: you need a universal baseline — polling works against any API. Best for historical backfill and APIs without push support.
Tradeoff: Latency is bounded by polling interval. Wastes quota scanning unchanged records.
Webhook (Push)
Use when: API supports webhooks. Need sub-minute latency. Change volume is low relative to total dataset.
Tradeoff: Requires a publicly accessible endpoint to receive events. Cannot backfill history.
CDC via API
Use when: API exposes a changelog or event stream (Stripe events, Salesforce Change Data Capture) — the most efficient option for incremental sync.
Tradeoff: Depends on change event retention at the source; gaps longer than the retention window force a full re-sync.
| Dimension | Polling | Webhooks | CDC via API |
|---|---|---|---|
| Latency | Polling interval | Near-real-time | Near-real-time |
| Backfill | Yes — paginate history | No — push only | Depends on retention |
| API quota | High — full scans | Low — event-driven | Low — changes only |
| Reliability | Self-managed retry | Requires ack + retry | Depends on source |
| Complexity | Low | Medium | Medium–High |
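The push side of the webhook row above can be sketched as a framework-agnostic handler: verify the signature, deduplicate on event id, acknowledge fast, and defer slow work to a queue. `SECRET`, the event shape, and `enqueue` are hypothetical stand-ins for your provider's signing scheme and your own queue write:

```python
import hashlib
import hmac
import json

SECRET = b"webhook-signing-secret"  # hypothetical shared secret from the provider
_seen: set[str] = set()             # in production: a durable store, not process memory

def handle_webhook(body: bytes, signature: str) -> tuple[int, str]:
    """Verify signature, deduplicate on event id, ack fast. Returns (status, reason)."""
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return 401, "bad signature"
    event = json.loads(body)
    if event["id"] in _seen:       # providers redeliver on timeout; dedupe on event id
        return 200, "duplicate"    # still ack, so the provider stops retrying
    _seen.add(event["id"])
    enqueue(event)                 # hand off; do slow work outside the request path
    return 200, "accepted"

def enqueue(event: dict) -> None:
    pass  # placeholder for a queue write (e.g., SQS, Kafka, a staging table)
```

Acknowledging before doing heavy processing matters: most providers treat a slow response as a failure and redeliver, which is exactly what the dedupe set absorbs.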
The 6 Core Challenges of API Ingestion
Pagination
APIs return data in pages. Cursor-based pagination is most reliable; offset pagination breaks under concurrent inserts. Always prefer cursor or keyset pagination for production pipelines.
Rate Limiting
Every production API has a rate limit. Handle 429 responses with exponential backoff and jitter. Proactively track your request budget with a token bucket counter.
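The proactive half of this can be sketched as a token bucket: spend a token per request and refill at a fixed rate, sleeping before the cap is hit rather than after a 429. `rate_per_sec` and `capacity` are placeholders you would set from the API's documented limit:

```python
import time

class TokenBucket:
    """Proactive rate limiter: consume a token per request, refill at a fixed rate."""

    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)   # start full: allows an initial burst
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)  # sleep just long enough to refill
```

Calling `bucket.acquire()` before each HTTP request keeps the pipeline under the limit; the backoff handler then only deals with the rare 429 that slips through.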
Authentication Refresh
OAuth 2.0 access tokens expire. Build automatic token refresh into your HTTP client so long-running ingestion jobs do not fail mid-run on a 401.
Schema Changes
External APIs change their response schemas without notice. Use schema validation on ingestion and route unexpected fields to a staging area rather than failing hard.
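A minimal sketch of the routing idea, using a hand-rolled shape check instead of Pydantic to stay dependency-free (in a real pipeline the Pydantic models from the toolchain above would play this role). `EXPECTED` is a hypothetical schema for the order records used elsewhere in this guide:

```python
from typing import Any

# Hypothetical expected shape: field name -> accepted type(s)
EXPECTED = {"order_id": str, "amount_usd": (int, float), "status": str}

def validate(record: dict[str, Any]) -> tuple[dict | None, str | None]:
    """Return (record, None) if it matches the expected shape, else (None, reason)."""
    for field, typ in EXPECTED.items():
        if field not in record:
            return None, f"missing field: {field}"
        if not isinstance(record[field], typ):
            return None, f"bad type for {field}: {type(record[field]).__name__}"
    return record, None

def ingest(batch: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into valid rows and dead-letter rows, keeping the reason."""
    valid, dead = [], []
    for rec in batch:
        ok, reason = validate(rec)
        if ok is not None:
            valid.append(ok)
        else:
            dead.append({"record": rec, "reason": reason})
    return valid, dead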
Incremental Extraction
Full refresh of large APIs is expensive and wasteful. Use watermark columns (updated_at, created_at) and the API's since/after parameters to fetch only new or changed records.
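A sketch of watermark persistence, assuming a local JSON state file (`ingest_state.json` is a hypothetical location; production pipelines usually keep this in the orchestrator's state store or a database):

```python
import json
from pathlib import Path

STATE_FILE = Path("ingest_state.json")  # hypothetical state location

def load_watermark(default: str = "1970-01-01T00:00:00Z") -> str:
    """Return the last stored watermark, or the epoch default on first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["watermark"]
    return default

def save_watermark(value: str) -> None:
    STATE_FILE.write_text(json.dumps({"watermark": value}))

def incremental_run(fetch_since) -> int:
    """Fetch only records updated after the stored watermark, then advance it."""
    since = load_watermark()
    records = fetch_since(since)
    if records:
        # In a real pipeline, advance only after the write to storage succeeds.
        save_watermark(max(r["updated_at"] for r in records))
    return len(records)
```

Because the watermark only advances after a run, a second run with no new data fetches nothing — the pipeline resumes from where it left off instead of re-scanning.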
Idempotent Writes
Network failures mean your pipeline may replay the same API response. Write to storage using UPSERT or deduplication keys so replays do not create duplicate rows.
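An UPSERT sketch using SQLite's `ON CONFLICT` clause — the `orders` table and its key are assumptions for illustration; warehouse engines offer equivalents such as `MERGE`:

```python
import sqlite3

def upsert_orders(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Idempotent write: replaying the same batch leaves exactly one row per order_id."""
    conn.executemany(
        """
        INSERT INTO orders (order_id, amount_usd, status)
        VALUES (:order_id, :amount_usd, :status)
        ON CONFLICT(order_id) DO UPDATE SET
            amount_usd = excluded.amount_usd,
            status = excluded.status
        """,
        rows,
    )
    conn.commit()
```

The natural key (`order_id`) comes from the API, not from your pipeline, so a replayed response overwrites the same rows instead of appending duplicates.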
Common API Ingestion Mistakes
Full refresh on every run
Re-fetching the entire API dataset on every pipeline run wastes API quota, increases run time, and can exhaust rate limits. Use watermark-based incremental extraction from day one.
No backoff on 429 responses
Retrying immediately after a rate limit response makes the problem worse — you burn through your remaining quota. Always implement exponential backoff with jitter when you receive a 429.
Ignoring schema validation
External APIs are uncontrolled surfaces. An unannounced field rename or type change can silently corrupt downstream tables. Validate API responses against a schema on ingestion and alert when unexpected shapes arrive.
Storing raw API responses without normalization
Storing JSON blobs in a data warehouse without flattening makes downstream queries painful and expensive. Normalize API responses to typed, flat columns at ingestion time.
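A minimal recursive flattener illustrating the idea — the separator and collision handling are simplified, and arrays are left as-is here since unnesting them is a separate modeling decision:

```python
def flatten(obj: dict, prefix: str = "", sep: str = "_") -> dict:
    """Flatten nested JSON into top-level columns: {'customer': {'id': 1}} -> {'customer_id': 1}."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name, sep))  # recurse into nested objects
        else:
            out[name] = value                      # scalars (and lists) become columns
    return out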
Who Should Learn API Data Ingestion?
Junior
- ✓ Makes HTTP requests with authentication
- ✓ Handles basic pagination
- ✓ Uses Airbyte or Singer connectors
- ✓ Understands REST vs GraphQL
Senior
- ✓ Builds rate-limit-aware ingestion with backoff
- ✓ Implements watermark-based incremental sync
- ✓ Validates API schemas with Pydantic
- ✓ Designs idempotent write strategies
Staff
- ✓ Designs multi-source ingestion platforms
- ✓ Defines connector standards across teams
- ✓ Architects webhook fan-out at scale
- ✓ Manages API credential rotation and security
Frequently Asked Questions
- What is API data ingestion?
- API data ingestion is the process of systematically pulling data from external or internal REST/GraphQL APIs and loading it into a data warehouse, data lake, or streaming pipeline. It involves authentication (API keys, OAuth 2.0), pagination (cursor-based, offset, keyset), rate limit handling (exponential backoff, token buckets), incremental extraction (watermarks, since parameters), and error recovery (idempotent writes, dead-letter queues).
- What is the difference between REST and GraphQL API ingestion?
- REST API ingestion pulls fixed resource endpoints — each endpoint returns a predefined shape, and you may need multiple calls to join related resources. GraphQL API ingestion sends a query specifying exactly which fields you need, reducing over-fetching and sometimes replacing multiple REST calls with one. For ingestion pipelines, REST is simpler to paginate predictably; GraphQL is better when you need selective fields from deeply nested resources.
- What is cursor-based pagination in API ingestion?
- Cursor-based pagination uses an opaque token (the cursor) returned by the API to fetch the next page, rather than a numeric offset. The cursor typically encodes the position in a sorted result set. Cursor pagination is stable under inserts — if new records are added between page requests, cursor pagination continues from where it left off without skipping or duplicating records. It is the preferred pagination pattern for production ingestion pipelines.
- How do you handle API rate limits in data ingestion?
- Rate limit handling strategies: (1) Exponential backoff with jitter — when you receive a 429, wait 2^n + random seconds before retrying. (2) Token bucket rate limiter — track request counts against the API's limit and sleep proactively before hitting the cap. (3) Respect Retry-After headers — some APIs tell you exactly how long to wait. (4) Parallelize across accounts — if you have multiple API credentials, fan out requests across them to multiply effective throughput.
- When should you use webhooks instead of polling an API?
- Use webhooks when: (1) the API supports them and you need near-real-time data (sub-minute latency), (2) the volume of changes is small relative to the total dataset (polling would scan millions of records to find a few hundred changes), (3) the API provider charges per request and polling would be expensive. Use polling when: the API does not support webhooks, you need historical backfill, or you cannot receive inbound connections (firewall restrictions).
What You'll Build with AI-DE
- ✓ Production REST API ingestion pipeline with cursor pagination
- ✓ Exponential backoff rate-limit handler for 429 responses
- ✓ OAuth 2.0 token auto-refresh for long-running jobs
- ✓ Pydantic schema validation with dead-letter queue routing
- ✓ Watermark-based incremental sync that resumes from last position
- ✓ Idempotent UPSERT writes to prevent duplicate rows on replay