API Data Ingestion Explained: What It Is and How It Works

API data ingestion is more than making HTTP requests — it is engineering a pipeline that runs reliably for months unattended. That means handling cursor pagination (not offset), rate limits with exponential backoff, OAuth tokens that expire mid-run, schema changes that arrive silently, and write strategies that stay correct when the pipeline retries. Each concept addresses a specific failure mode that will hit you in production.

Complete incremental ingestion loop

# Simplified production ingestion loop
def run_incremental_ingestion(client):
    watermark = load_watermark('orders_api')

    all_records = []
    for page in paginate(client, '/orders', since=watermark):
        valid, dead = parse_records(page)
        all_records.extend(valid)
        if dead:
            write_dead_letter(dead)

    if all_records:
        upsert_to_warehouse(all_records)
        new_watermark = max(r.updated_at for r in all_records)
        save_watermark('orders_api', new_watermark)

# Each function addresses one production failure mode:
# paginate()       → cursor-based, rate-limit-aware
# parse_records()  → Pydantic validation + dead-letter
# upsert_to_warehouse() → MERGE on primary key
# save_watermark() → incremental sync state

The 5 Core Concepts

01

Cursor Pagination

The API returns a next_cursor token with each page. Pass the cursor as a parameter on the next request. Stop when the response contains no cursor. Stable under concurrent inserts — the cursor anchors to a record, not a row count.

Offset pagination skips records when inserts happen mid-scan. Always prefer cursor when available.
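The ingestion loop above calls paginate(); a minimal sketch of such a generator, assuming the API returns JSON of the form {"data": [...], "next_cursor": "..."} (field names vary by API):

```python
def paginate(client, path, since=None):
    """Yield one page of records at a time until the API returns no cursor."""
    params = {"since": since} if since else {}
    cursor = None
    while True:
        if cursor:
            params["cursor"] = cursor
        page = client.get(path, params=params)
        yield page["data"]
        cursor = page.get("next_cursor")
        if not cursor:  # no cursor in the response means we've reached the end
            break
```

Keeping this as a generator lets the caller process pages as they arrive instead of buffering the whole result set.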

02

Exponential Backoff

On a 429 (rate limit) response, wait 2^n + random() seconds before retrying, where n is the attempt count. The random jitter prevents multiple parallel workers from all retrying at the same moment and hammering the API simultaneously.

Retrying immediately on 429 exhausts your remaining quota instantly and triggers longer backoffs from the API.
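A sketch of the retry wrapper, assuming a client whose responses expose a status_code attribute:

```python
import random
import time

def get_with_backoff(client, path, params=None, max_attempts=5):
    """Retry a request on HTTP 429 with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        response = client.get(path, params=params)
        if response.status_code != 429:
            return response
        # Wait 2^attempt seconds plus up to 1s of random jitter so parallel
        # workers don't all retry at the same instant.
        time.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```

If the API sends a Retry-After header, honoring it instead of the computed delay is even better.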

03

OAuth 2.0 Token Refresh

Access tokens expire (typically 60 minutes). Build automatic refresh into the HTTP client: check expiry before each request, refresh if within 60 seconds of expiry, cache the new token. Never refresh tokens inside the pagination loop — only in the client layer.

A hard-coded token expires silently mid-run: a 3-hour backfill job fails at hour 1 with a 401.
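A sketch of refresh in the client layer, assuming the standard refresh_token grant and a token response containing access_token and expires_in (the hypothetical token_url and session objects stand in for your provider and HTTP library):

```python
import time

class OAuthClient:
    """HTTP client wrapper that refreshes the access token before it expires."""
    REFRESH_MARGIN = 60  # refresh when within 60 seconds of expiry

    def __init__(self, session, token_url, refresh_token):
        self.session = session
        self.token_url = token_url
        self.refresh_token = refresh_token
        self.access_token = None
        self.expires_at = 0.0  # forces a refresh on the first request

    def _ensure_token(self):
        if time.time() >= self.expires_at - self.REFRESH_MARGIN:
            resp = self.session.post(self.token_url, data={
                "grant_type": "refresh_token",
                "refresh_token": self.refresh_token,
            }).json()
            self.access_token = resp["access_token"]
            self.expires_at = time.time() + resp["expires_in"]

    def get(self, path, params=None):
        self._ensure_token()  # refresh lives here, never in the pagination loop
        return self.session.get(path, params=params,
                                headers={"Authorization": f"Bearer {self.access_token}"})
```

Every request goes through _ensure_token(), so the pagination code never needs to know tokens exist.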

04

Watermark Tracking

A watermark is the max updated_at from the last successful run. Pass it as the since parameter to fetch only new or changed records. After each successful run, save the new max updated_at as the watermark in a metadata table or key-value store.

Full refresh on every run wastes API quota, increases run time, and hits rate limits unnecessarily.
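A minimal sketch of the load_watermark/save_watermark pair from the loop above, backed here by SQLite for illustration; any metadata table or key-value store works the same way:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS watermarks (source TEXT PRIMARY KEY, value TEXT)"
)

def load_watermark(source):
    """Return the saved watermark for a source, or None on the first run."""
    row = conn.execute(
        "SELECT value FROM watermarks WHERE source = ?", (source,)
    ).fetchone()
    return row[0] if row else None

def save_watermark(source, value):
    """Upsert the watermark so each source keeps exactly one row."""
    conn.execute(
        "INSERT INTO watermarks (source, value) VALUES (?, ?) "
        "ON CONFLICT(source) DO UPDATE SET value = excluded.value",
        (source, value),
    )
    conn.commit()
```

A None return on the first run signals a full initial load; every run after that is incremental.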

05

Idempotent Writes

Write records using UPSERT (MERGE) on a stable primary key. If the pipeline replays the same API response twice — due to a network error or manual re-run — UPSERT replaces the existing record rather than inserting a duplicate.

INSERT-only writes create duplicate rows every time the pipeline retries. Even one retry doubles the affected rows.
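A sketch of the UPSERT pattern using SQLite's ON CONFLICT clause; warehouse engines express the same idea as MERGE, but the hypothetical orders table and columns here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, updated_at TEXT)"
)

def upsert_orders(records):
    """Write a batch idempotently: replaying it leaves the table unchanged."""
    conn.executemany(
        "INSERT INTO orders (id, status, updated_at) "
        "VALUES (:id, :status, :updated_at) "
        "ON CONFLICT(id) DO UPDATE SET "
        "status = excluded.status, updated_at = excluded.updated_at",
        records,
    )
    conn.commit()
```

Replaying a batch overwrites each row with identical values, so a retry after a network error cannot create duplicates.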

Common Mistakes

Mixing pagination logic with business logic

Keep pagination in a dedicated generator function that yields pages. Keep schema parsing and writing in separate functions. Mixing them makes the pipeline hard to test and impossible to retry at the page level.

Not saving watermarks atomically

If you update the watermark before confirming the warehouse write succeeded, a failure between the two steps leaves the warehouse missing records but the watermark advanced. Always update the watermark only after confirming the write completed.

Over-parallelizing without rate limit coordination

Running 10 parallel workers each using the full rate limit quota will hit the daily cap in 1/10th the expected time. Either use a shared token bucket across workers or partition the API quota explicitly across workers.
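A minimal sketch of a shared token bucket for workers in one process (threads); workers on separate machines would need the same idea backed by a shared store such as Redis:

```python
import threading
import time

class TokenBucket:
    """Shared rate limiter: each worker blocks until a request token is free."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        while True:
            with self.lock:
                # Refill tokens based on elapsed time, capped at capacity.
                now = time.monotonic()
                self.tokens = min(
                    self.capacity, self.tokens + (now - self.updated) * self.rate
                )
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)  # sleep outside the lock so others can refill
```

All workers call bucket.acquire() before each request, so their combined rate never exceeds the API quota regardless of how many run in parallel.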

FAQ

What is API data ingestion in simple terms?
Systematically pulling data from HTTP APIs — handling pagination, rate limits, expired credentials, schema changes, and duplicate prevention — so your warehouse stays current without manual intervention.
Cursor vs offset pagination?
Cursor is stable under concurrent inserts (anchors to a record). Offset shifts when records are inserted mid-scan, causing skips and duplicates. Always prefer cursor for production pipelines.
What is a watermark?
The max updated_at from the last successful run. Passed as the since parameter on the next incremental run so only new/changed records are fetched.
What is idempotent ingestion?
Writing with UPSERT on a primary key so running the pipeline twice produces the same result — no duplicates on retry.
