ai-de.net/Projects/P28 · Multi-source ingestion service

PRO · module 01 free preview · Platform track · P28

Build a fault-tolerant ingestion service for REST, webhooks, S3, and SaaS

Ship a Python service that pulls from REST APIs with backoff + token-bucket rate limiting, receives HMAC-signed webhooks with bloom-filter dedup, ingests S3 batch drops (CSV / JSON / Parquet / NDJSON), and pulls from SaaS connectors (Stripe, Salesforce Bulk API) — orchestrated by one Airflow DAG with schema gates.

Timeline: 14-18 hours

Difficulty: Mid → Senior

Stack: Python · FastAPI · Airflow · Kafka · Postgres


This is the system-design question every data platform team gets asked at Stripe, Airbnb, Uber and any company running a Fivetran-shaped ingestion stack: how do you ingest from five different shapes of API without duplicating retry logic, rate limiting, and schema gates?

By the end you will have

An httpx client with exponential backoff, token-bucket rate limiter (Redis), and request-fingerprinted idempotency
A FastAPI webhook receiver with HMAC-SHA256 verification and bloom-filter dedup
An S3 batch ingester with manifest tracking + multi-format parser (CSV / JSON / Parquet / NDJSON)
OAuth2 SaaS connectors for Stripe + Salesforce Bulk API 2.0 with SOQL incremental filters
JSON Schema validation, breaking-change gate, and a Confluent Schema Registry client
A unified Airflow DAG with Prometheus freshness exporter and a Z-score volume-anomaly query

PREREQ · Comfortable with Python, HTTP, and at least one orchestrator. It helps if you’ve shipped one ingestion job before. This is not an introduction to APIs: it assumes you know what a 429 means.

ingestion.unified_dag · 4 modes wired · schema gate

Sources: REST · stripe | webhook · POST | s3://drops/ | SaaS · sfdc (4 modes · 3.1K rows)

Ingest + validate: httpx · backoff | FastAPI · HMAC | boto3 · manifest | OAuth2 · bulk | JSON Schema (Draft7 + Pydantic) | schema-registry (BACKWARD compat) (retry · dedup · idemp)

Orchestrate: unified_dag (TaskGroup × 4) | S3KeySensor · per-source pool | freshness SLA · drift detect | Z-score volume (one DAG · 4 modes)

Sinks: postgres · outbox (idemp · watermarks) | kafka · ingested.events (DLQ · replay) (exactly-once tx)
# Idempotent ingestion
fp = sha256(method + url + body)
INSERT ... ON CONFLICT (fp) DO NOTHING
+ Redis bloom · webhook dedup ~5%
→ safe to retry · safe to replay
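The fingerprint-plus-conflict idea above can be sketched as follows. This is a minimal illustration, not the course's implementation: it uses an in-memory SQLite table as a stand-in for the Postgres idempotency store, and the table name `ingested`, the helper names, and the example URL are all assumptions made for the sketch. The fields are length-prefixed before hashing so that different (method, url, body) triples cannot produce the same input bytes.

```python
import hashlib
import sqlite3


def request_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """SHA-256 over method, URL, and body, each length-prefixed to avoid ambiguity."""
    h = hashlib.sha256()
    for part in (method.upper().encode(), url.encode(), body):
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


def record_once(conn: sqlite3.Connection, fp: str, payload: str) -> bool:
    """Store the payload unless this fingerprint already exists. True if inserted."""
    cur = conn.execute(
        "INSERT INTO ingested (fp, payload) VALUES (?, ?) "
        "ON CONFLICT (fp) DO NOTHING",
        (fp, payload),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical table and request, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ingested (fp TEXT PRIMARY KEY, payload TEXT)")

fp = request_fingerprint("GET", "https://api.example.test/v1/charges?page=1")
first = record_once(conn, fp, '{"id": "ch_1"}')   # new fingerprint: stored
second = record_once(conn, fp, '{"id": "ch_1"}')  # retry or replay: skipped
```

Because the second call is a no-op, a retried or replayed request leaves exactly one row behind, which is what makes the retry safe.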

● Schema gate · BACKWARD
new field nullable → ✓ pass
renamed field → ✗ block + alert
confluent.compatibility = BACKWARD
→ drift caught before warehouse writes
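A gate in the spirit of the two rules above might look like the sketch below. It is a deliberately simplified check over JSON-Schema-style dictionaries, not Confluent's compatibility algorithm. The function name `breaking_changes` and the example schemas are assumptions for illustration.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Return human-readable reasons the new schema should be blocked."""
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    problems = []

    for name, spec in old_props.items():
        if name not in new_props:
            # A rename looks like a removal plus an unrelated new field.
            problems.append(f"field removed or renamed: {name}")
        elif spec.get("type") != new_props[name].get("type"):
            problems.append(f"type changed: {name}")

    newly_required = set(new.get("required", [])) - set(old.get("required", []))
    for name in sorted(newly_required):
        problems.append(f"newly required field: {name}")

    return problems


# Hypothetical schema versions.
v1 = {
    "properties": {"id": {"type": "string"}, "amount": {"type": "integer"}},
    "required": ["id", "amount"],
}
v2_additive = {
    "properties": {
        "id": {"type": "string"},
        "amount": {"type": "integer"},
        "currency": {"type": ["string", "null"]},  # new and nullable
    },
    "required": ["id", "amount"],
}
v2_renamed = {
    "properties": {"id": {"type": "string"}, "amount_cents": {"type": "integer"}},
    "required": ["id"],
}

additive_result = breaking_changes(v1, v2_additive)  # empty list: gate passes
renamed_result = breaking_changes(v1, v2_renamed)    # non-empty: block and alert
```

An empty list lets the load proceed. Anything else would stop the pipeline before the warehouse write and feed the alert.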

ingestion modes · 60+ Python modules · exactly-once outbox pattern

Why this matters in 2026

Every data platform begins with ingestion.

Fivetran and Airbyte built billion-dollar businesses on this problem. The patterns you ship here — backoff, rate limiting, idempotency, schema gates — are the load-bearing primitives behind every modern ingestion layer.

The 5-shape problem

REST, webhooks, S3 drops, SaaS exports, and streaming all need the same primitives — but most teams write retry logic five times. This project shows you how to write it once.
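One way to write the retry logic once is a small helper that every source calls with its own fetch function. The sketch below assumes a hypothetical `RetryableError` that each source raises for timeouts, 429s, and 5xx responses. The helper names and default values are illustrative, not taken from the project.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (timeout, 429, 5xx)."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt))."""
    return random.random() * min(cap, base * (2 ** attempt))


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RetryableError. The last failure is re-raised."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")


# Usage: a fake source that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RetryableError("simulated 503")
    return "ok"


slept: list[float] = []
result = call_with_retry(flaky_fetch, sleep=slept.append)  # records delays instead of sleeping
```

Injecting `sleep` keeps the helper testable without real waiting. A REST page fetch, an S3 read, and a SaaS export poll could all be passed in as `fn`.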

Schema drift kills pipelines

An additive field is fine. A renamed field silently corrupts your warehouse. The schema gate + Confluent Registry pattern catches it before downstream tables write.

Idempotency is non-negotiable

Webhook providers retry. APIs return duplicates. Without request fingerprinting and bloom-filter dedup, you get double-billed users and 3 a.m. pages. This project ships both safeguards.
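For intuition on the bloom-filter half, here is a small in-process sketch. The project describes a Redis-backed filter, so the `BloomFilter` class, its sizes, and the `dedup` helper below are stand-ins made up for illustration.

```python
import hashlib
from typing import Iterable, Iterator


class BloomFilter:
    """Fixed-size bit array with k hash positions per item."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive k positions by salting SHA-256 with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def dedup(event_ids: Iterable[str], seen: BloomFilter) -> list[str]:
    """Keep only event ids the filter has definitely not seen before."""
    fresh = []
    for event_id in event_ids:
        if seen.might_contain(event_id):
            continue  # probably a provider retry
        seen.add(event_id)
        fresh.append(event_id)
    return fresh


delivered = ["evt_1", "evt_2", "evt_1", "evt_3", "evt_2"]
unique = dedup(delivered, BloomFilter())
```

A Bloom filter never misses an item it has seen, but it can report a false positive. A common design is to treat a "maybe seen" answer as a cue to check an exact store (such as the fingerprint table) rather than to drop the event outright.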

Airflow is still the standard

Despite Dagster and Prefect, most production ingestion still runs on Airflow. The unified DAG pattern in module 04 is the one that gets shipped at companies running 100+ sources.

Curriculum · 4 modules · 14-18 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 3-4 hours — build the httpx retry client and stand up the token-bucket rate limiter. If the patterns click, upgrade to unlock webhooks, SaaS connectors, and the unified Airflow DAG.

P28 · 14-18 hours · 4 modules

Free preview

Module 01 is free — no card required. Build the retry + rate-limit + idempotency primitives before paying.

M01 · ✓ REST ingestion: backoff, pagination, rate limit, idempotency

Build the httpx-based ingestion client with exponential backoff + jitter, Retry-After header parsing, cursor / offset / keyset pagination, a Redis-backed token-bucket rate limiter, request-fingerprinted idempotency, circuit breakers, and watermark-resumable extraction.
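As a rough picture of the token-bucket primitive named above, the following sketch keeps the bucket in process memory. The module describes a Redis-backed limiter, so this is an assumption-laden simplification: the `TokenBucket` class, its parameters, and the injectable clock are illustrative only.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so an initial burst is allowed
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available. Never blocks."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=3.0, clock=lambda: fake_now[0])

burst = [bucket.try_acquire() for _ in range(4)]         # three allowed, fourth denied
fake_now[0] = 1.0                                        # one second later: 2 tokens refilled
after_refill = [bucket.try_acquire() for _ in range(3)]  # two allowed, third denied
```

Sharing the bucket across workers is why a Redis-backed version is useful: the refill-and-take step then has to run atomically on the server, which this single-process sketch does not attempt.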

3-4h · 18 lessons · FREE PREVIEW

Start →

M02 · ⊘ Push-based: webhook receiver, dedup, S3 batch, exactly-once outbox

Stand up a FastAPI webhook receiver with HMAC-SHA256 verification. Layer Redis + bloom-filter dedup. Add an S3 batch ingester with manifest tracking and multi-format parsing (CSV / JSON / Parquet / NDJSON). Wire the transactional outbox for exactly-once semantics into Kafka via confluent-kafka.
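The HMAC-SHA256 check at the heart of the receiver can be sketched with the standard library alone. The secret, the body, and the choice to sign only the raw body are assumptions for illustration. Real providers define their own header format and often sign a timestamp along with the body to limit replay.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_hex: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), received_hex.encode())


# Hypothetical secret and payload.
secret = b"whsec_test"
body = b'{"id": "evt_1", "type": "charge.succeeded"}'
header_value = sign(secret, body)  # what the sender would attach

ok = verify_signature(secret, body, header_value)
tampered = verify_signature(secret, body + b" ", header_value)
```

Verification has to run against the exact bytes received, before any JSON parsing, because re-serialising the payload can change whitespace or key order and break the signature.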

3-4h · 18 lessons · FREE TIER

Unlock with PRO →

M03 · ⊘ SaaS connectors: OAuth2, schema gates, contracts, registry

Build OAuth2 connectors for Stripe + Salesforce Bulk API 2.0 with SOQL incremental filters. Layer JSON Schema validation (Draft7), Pydantic models, and a breaking-change gate. Encode contracts as YAML and integrate with Confluent Schema Registry (BACKWARD / FORWARD / TRANSITIVE compatibility modes).
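An incremental SOQL filter usually means "only rows changed since the last watermark". The sketch below builds such a query string. The function name, the choice of `SystemModstamp` as the cursor field, and the simple identifier check are assumptions for this example rather than the course's connector code.

```python
import re
from datetime import datetime, timezone

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def soql_incremental(
    sobject: str,
    fields: list[str],
    watermark: datetime,
    cursor_field: str = "SystemModstamp",
) -> str:
    """Build a SOQL query selecting rows modified after the watermark."""
    for name in [sobject, cursor_field, *fields]:
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    if watermark.tzinfo is None:
        raise ValueError("watermark must be timezone-aware")
    # SOQL datetime literals are written unquoted, in ISO 8601 form.
    literal = watermark.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        f"SELECT {', '.join(fields)} FROM {sobject} "
        f"WHERE {cursor_field} > {literal}"
    )


last_run = datetime(2026, 1, 1, tzinfo=timezone.utc)
query = soql_incremental("Account", ["Id", "Name", "SystemModstamp"], last_run)
```

After a successful load, the largest cursor value seen in the batch would become the next watermark. The identifier check here is deliberately strict and would reject relationship fields such as `Owner.Name`.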

3-4h · 18 lessons · FREE TIER

Unlock with PRO →

M04 · ⊘ Unified Airflow DAG, freshness SLAs, drift + volume anomalies

Compose all four ingestion modes into one Airflow DAG with TaskGroups, S3KeySensors for file-arrival triggers, and pools for per-source rate limits. Export Prometheus freshness metrics, detect schema drift against the registry, run a Z-score volume anomaly query, and ship K8s manifests with a blue/green deploy script.
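The arithmetic behind a Z-score volume check is small enough to show directly. The module describes it as a query, so this Python version is only an illustration of the same calculation. The function names, the threshold of 3, and the sample counts are assumptions.

```python
import math
from statistics import mean, stdev


def volume_zscore(history: list[int], today: int) -> float:
    """How many sample standard deviations today's row count is from the mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical counts")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat history: any deviation at all is infinitely surprising.
        return 0.0 if today == mu else math.copysign(math.inf, today - mu)
    return (today - mu) / sigma


def is_anomalous(history: list[int], today: int, threshold: float = 3.0) -> bool:
    return abs(volume_zscore(history, today)) > threshold


# Hypothetical daily row counts for one source.
history = [1000, 1020, 980, 1010, 990]
normal_day = is_anomalous(history, 1005)
bad_day = is_anomalous(history, 400)
```

A check like this catches silent partial loads, for example an upstream export that returns a fraction of the usual rows without raising any error.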

5-6h · 17 lessons · FREE TIER

Unlock with PRO →

3 modules locked · Unlock all PRO content for $29/mo

Upgrade to PRO →

Backed by curriculum

API & External System Integration

10 modules · 12 hours · REST fundamentals · OAuth + secrets · orchestration patterns · schema evolution · scaling + observability

Open curriculum→

This curriculum is the foundation for the project — not a sales add-on. PRO subscribers get full access to every module.

The build, in 3 phases

Three sprints. Three checkpoints. One ingestion service.

Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are in the build.

01 · ~7h

Pull- and push-based foundations (parts 1 + 2)

httpx retry client + Redis token bucket + Postgres idempotency store live. FastAPI webhook receiver with HMAC + bloom-filter dedup. S3 batch ingester with multi-format parsing.

✓ httpx client + circuit breaker + DLQ
✓ FastAPI webhook receiver + bloom dedup
✓ S3 manifest ingester + 4 format parsers
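The manifest-plus-parser idea from this phase can be sketched with the standard library for three of the four formats. Parquet is left out here because it needs a third-party reader. The function names, the in-memory `objects` dict standing in for S3, and the set used as a manifest are all assumptions for illustration.

```python
import csv
import io
import json


def parse_batch(key: str, data: bytes) -> list[dict]:
    """Parse one dropped file into rows, choosing the parser by file suffix."""
    suffix = key.rsplit(".", 1)[-1].lower()
    if suffix == "csv":
        text = data.decode("utf-8")
        return list(csv.DictReader(io.StringIO(text, newline="")))
    if suffix == "ndjson":
        text = data.decode("utf-8")
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if suffix == "json":
        loaded = json.loads(data.decode("utf-8"))
        return loaded if isinstance(loaded, list) else [loaded]
    raise ValueError(f"unsupported format for key: {key}")


def ingest_new(objects: dict[str, bytes], manifest: set[str]) -> list[dict]:
    """Parse only keys not yet in the manifest, then record them as processed."""
    rows: list[dict] = []
    for key in sorted(objects):
        if key in manifest:
            continue  # already ingested in an earlier run
        rows.extend(parse_batch(key, objects[key]))
        manifest.add(key)
    return rows


# Hypothetical bucket contents.
objects = {
    "drops/a.csv": b"id,amount\n1,10\n",
    "drops/b.ndjson": b'{"id": 2}\n{"id": 3}\n',
}
manifest: set[str] = set()
first_run = ingest_new(objects, manifest)
second_run = ingest_new(objects, manifest)  # nothing new to process
```

CSV values arrive as strings while JSON values keep their types, which is one reason a schema validation step after parsing is useful.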

02 · ~4h

SaaS connectors with schema gates (part 3)

OAuth2 connectors for Stripe + Salesforce Bulk API. JSON Schema + Pydantic validation. Breaking-change gate. YAML contracts. Confluent Schema Registry integration with compatibility checks.

✓ Stripe + Salesforce OAuth2 connectors
✓ Schema evolution gate (additive vs breaking)
✓ Schema Registry client + YAML contracts

03 · ~6h

Unified orchestration + monitoring (part 4)

All four ingestion modes composed in one Airflow DAG with TaskGroups + S3KeySensors + pools. Prometheus freshness exporter. Schema drift detector. Z-score volume anomaly query. K8s blue/green manifests.

✓ Unified Airflow DAG + dynamic factory
✓ Prometheus freshness + Z-score volume anomaly
✓ K8s blue/green manifests + post-deploy checks

Project setup · 10 minutes

One command. Local Postgres + Redis + mock API + 4 sample datasets.

You get the full stack on day one — Postgres for the idempotency / outbox / watermark tables, Redis for the token bucket, a mock API for paginated REST fixtures, and pre-built sample datasets across all four ingestion modes.

What lives in the repo

Everything you need to run the four-mode ingestion pipeline on your laptop, plus the fixtures and verification queries used in modules 02–04.

docker-compose.yml — Postgres, Redis, mock REST API
ingestion/ — httpx client, retry, pagination, rate limiter, idempotency
ingestion/webhooks/ — FastAPI receiver, HMAC, dedup, outbox
connectors/ — OAuth2 manager, Stripe + Salesforce Bulk API
dags/ — unified Airflow DAG + dynamic factory + sensors
data/ — REST fixtures, webhook events, S3 batch files, SaaS export

Download · Starter Kit

API Data Ingestion Starter Kit

Pre-configured Docker stack, sample datasets across REST / webhooks / S3 / SaaS, and the module 01 retry-client scaffolds. Skip the boilerplate, start on the patterns.

145 KB · 105 files · 4 sample datasets · PRO required

~/projects/api-data-ingestion — zsh

1. Clone and start the stack
$ git clone https://github.com/ai-de/p28-api-data-ingestion
$ cd p28-api-data-ingestion && docker-compose up -d

2. Install dependencies + set env
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install -r requirements.txt
$ export DB_URL=postgresql://postgres:postgres@localhost:5432/ingestion
$ export REDIS_URL=redis://localhost:6379/0

3. Run the REST ingestion pipeline
$ python -m ingestion.pipeline
# ✓ 500 records ingested across 5 paginated pages

4. Start the webhook receiver
$ uvicorn ingestion.webhooks.receiver:app --reload --port 8000
# POST /webhooks/stripe: HMAC verified, deduped, written to outbox

Sample datasets:
500 · REST records (5 × 100 paginated)
1,500 · Webhook events (~5% dup)
800 · S3 batch records (CSV + NDJSON)
300 · SaaS export records

Production hardening

The same service — but built for the 10x case.

Most ingestion tutorials hand you a requests.get() with a try/except. This one shows what changes when you’re running 100+ sources, the rate limits collide, and a webhook provider retries the same event 17 times.

Tutorial-grade (what you have today) vs. production-grade (module 03–04)

Rate limiting
Tutorial-grade: Single-instance Redis token bucket
Production-grade: ✓ Cluster-mode Redis with EVAL Lua + replication for HA token bucket

Idempotency store
Tutorial-grade: Local Postgres, no retention
Production-grade: ✓ Aurora with read replicas + TTL retention policy on fingerprints

Kafka
Tutorial-grade: Single-broker docker container
Production-grade: ✓ MSK / Confluent Cloud with rack-aware partitioning and min.insync.replicas=2

Schema Registry
Tutorial-grade: Mock client against in-memory map
Production-grade: ✓ Confluent Cloud with subject ACLs + FULL_TRANSITIVE compatibility

Deploy
Tutorial-grade: K8s YAML applied by hand
Production-grade: ✓ ArgoCD GitOps with progressive delivery and kubectl rollout status gates

Outbox
Tutorial-grade: Polling relay in the same process
Production-grade: ✓ CDC-driven outbox via Debezium connector (no polling drift)

PRO benefit · code review

Real review from senior engineers who shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that’s quietly worth thousands of dollars in time-to-staff.

4 reviews / month

Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — ingestion / Airflow / Kafka for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month

2 office hours / month

Live 30-min sessions with a senior data engineer. Architecture questions, whiteboard a tricky migration, mock a system-design interview. Group sessions also available.

30 min per session · 2 / mo included · group sessions: unlimited

What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for engineers who want production-shaped builds and feedback loops — not more tutorials.

What you get, by plan (FREE / PRO / EXPERT)

Projects (production-grade builds): PRO 15+
Curriculum modules (all 7 tracks): FREE Phase 1 only · PRO All · EXPERT All + bonus
Code review credits (senior engineer review): PRO 4 / month · EXPERT Unlimited
Career path access (5 paths × full plans): FREE 1 path · PRO All 5 · EXPERT All 5 + 1:1
Certificate (verifiable on LinkedIn): FREE none · PRO Yes · EXPERT Yes + portfolio review
Community (Discord + office hours): FREE Read-only · PRO Full + 2/mo · EXPERT Full + 4/mo

$29/mo · billed monthly · cancel anytime

or annual: $249/yr (save 28%)

Upgrade to PRO →

Who this is for

Pick this if you’re shipping the integration tier, not just consuming it.

Data engineers

You’ve wired one or two ingestion jobs, hit the same retry / pagination / dedup problem each time, and want the patterns codified before you write the third one.

Platform engineers

You run ingestion for 5+ teams. You need a self-serve framework so new sources don’t reinvent the rate limiter or the idempotency store on every project.

Backend engineers crossing over

You know HTTP and Kafka but the warehouse side is opaque. This project gives you the orchestration + schema gate vocabulary in language you already speak.

Integration engineers

You build connectors for a living. This is the connector framework you wish your last shop had — OAuth2, pagination, contracts, registry, all in one place.

Related curriculum

Going deeper? Three tracks back this project.

API integration is the spine. These three curriculums let you go deeper on the layers the unified DAG sits on — Python craft, Kafka, and event-driven design.

FAQ

Quick answers.

How is module 01 different from a free "REST API tutorial"?

Module 01 builds the four primitives every production ingestion job needs — exponential backoff with jitter, token-bucket rate limiting (Redis-backed), request-fingerprinted idempotency, and watermark-resumable extraction — as 18 working Python files with a circuit breaker and DLQ. Most free tutorials show you `requests.get()` once and call it a day.

What's NOT in this project?

No live cloud (no real AWS account, no live Confluent Cloud — everything runs locally). No CDC from a database (this is API/webhook/SaaS pull, not Debezium). No ML feature store. The K8s blue/green is shipped as manifests + a deploy script, not a running cluster. We're explicit about this so you know what you're getting.

How is this different from kafka-event-routing (P02)?

P02 is broker-internal: routing, fan-out, replay across Kafka topics. This project is the upstream layer — getting external API / webhook / S3 / SaaS data INTO the platform. The two compose: this project lands events in the outbox, P02 routes them downstream.

Do I need cloud credentials?

No. Everything runs locally — Postgres + Redis + a mock REST API in Docker. The S3 ingester runs against a localstack-shaped fixture, the Kafka client points at a single-broker container, and the Confluent Schema Registry calls hit a mock client. Patterns transfer cleanly to AWS / Confluent Cloud with config changes only.

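Is the "transactional outbox for exactly-once" really exactly-once?

The pattern is exactly-once at the outbox boundary: you write to your DB and the outbox table in one transaction, so an event is recorded if and only if the business write commits. A polling relay then publishes to Kafka. Because the relay can re-send a row if it crashes before marking it published, delivery past that boundary is at-least-once unless the producer and consumer deduplicate. Whether the consumer is exactly-once depends on its commit semantics. We show the producer-side pattern, document the assumptions, and don't claim end-to-end exactly-once.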
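The producer-side pattern described in this answer can be sketched as below. SQLite stands in for Postgres and a plain callback stands in for the Kafka producer. The table names, the `place_order` example, and the relay function are assumptions for illustration. Only the topic name `ingested.events` comes from the project diagram.

```python
import json
import sqlite3
from typing import Callable


def place_order(conn: sqlite3.Connection, order_id: str, amount) -> None:
    """Write the business row and its outbox event in one transaction."""
    with conn:  # commits on success, rolls back if anything inside raises
        conn.execute(
            "INSERT INTO orders (id, amount) VALUES (?, ?)", (order_id, amount)
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("ingested.events", json.dumps({"order_id": order_id, "amount": amount})),
        )


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, str], None]) -> int:
    """Publish unpublished outbox rows in order. Returns how many were sent."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        # A crash right here would re-send this row on the next poll,
        # which is why consumers still need to tolerate duplicates.
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, amount)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "topic TEXT NOT NULL, payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
)

place_order(conn, "o1", 1200)
sent: list[tuple[str, str]] = []
first_poll = relay_once(conn, lambda topic, payload: sent.append((topic, payload)))
second_poll = relay_once(conn, lambda topic, payload: sent.append((topic, payload)))
```

The comment inside `relay_once` marks the gap the answer refers to: the guarantee at the outbox boundary is strong, while delivery beyond it depends on how duplicates are handled downstream.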

What does PRO actually unlock for $29/mo?

All 15+ PRO projects, 4 code-review credits per month, 2 office-hours sessions, full curriculum across all 7 tracks, all 5 career paths, certificate of completion, and full community access. Cancel anytime.

Ready to ship the real integration tier?

Start with module 01 — free, no card. About 3-4 hours. By the end you'll have a working httpx ingestion client with backoff, rate limiting, idempotency, and watermarks — the primitives every production ingestion job sits on top of.


Related curriculum tracks: Python for Data Engineers · Kafka Streams Learning Path · Event Design & Data Contracts
