Skip to content
ai-de.net/Projects/P28 · Multi-source ingestion service
PRO · module 01 free previewPlatform trackP28

Build a
fault-tolerant
ingestion service for REST, webhooks, S3, and SaaS

Ship a Python service that pulls from REST APIs with backoff + token-bucket rate limiting, receives HMAC-signed webhooks with bloom-filter dedup, ingests S3 batch drops (CSV / JSON / Parquet / NDJSON), and pulls from SaaS connectors (Stripe, Salesforce Bulk API) — orchestrated by one Airflow DAG with schema gates.

Timeline
14-18 hours
Difficulty
Mid → Senior
Stack
Python · FastAPI · Airflow · Kafka · Postgres

This is the system-design question every data platform team gets asked at Stripe, Airbnb, Uber and any company running a Fivetran-shaped ingestion stack: how do you ingest from five different shapes of API without duplicating retry logic, rate limiting, and schema gates?

By the end you will have
  • An httpx client with exponential backoff, token-bucket rate limiter (Redis), and request-fingerprinted idempotency
  • A FastAPI webhook receiver with HMAC-SHA256 verification and bloom-filter dedup
  • An S3 batch ingester with manifest tracking + multi-format parser (CSV / JSON / Parquet / NDJSON)
  • OAuth2 SaaS connectors for Stripe + Salesforce Bulk API 2.0 with SOQL incremental filters
  • JSON Schema validation, breaking-change gate, and a Confluent Schema Registry client
  • A unified Airflow DAG with Prometheus freshness exporter and a Z-score volume-anomaly query
PREREQComfortable with Python, HTTP, and at least one orchestrator. Helps if you’ve shipped one ingestion job before. Not an introduction to APIs — assumes you know what a 429 means.
ingestion.unified_dag · 4 modes wired
schema gate
Sources
Ingest + validate
Orchestrate
Sinks
REST · stripe
webhook · POST
s3://drops/
SaaS · sfdc
4 modes · 3.1K rows
httpx · backoff
FastAPI · HMAC
boto3 · manifest
OAuth2 · bulk
JSON SchemaDraft7 + Pydantic
schema-registryBACKWARD compat
retry · dedup · idemp
unified_dagTaskGroup × 4
S3KeySensor · per-source pool
freshness SLA · drift detect
Z-score volume
one DAG · 4 modes
postgres · outboxidemp · watermarks
kafka · ingested.eventsDLQ · replay
exactly-once tx
# Idempotent ingestion
fp = sha256(method + url + body)
INSERT ... ON CONFLICT (fp) DO NOTHING
+ Redis bloom · webhook dedup ~5%
→ safe to retry · safe to replay
● Schema gate · BACKWARD
new field nullable → ✓ pass
renamed field → ✗ block + alert
confluent.compatibility = BACKWARD
→ drift caught before warehouse writes
4
ingestion modes
60+
Python modules
exactly-once
outbox pattern
Why this matters in 2026

Every data platform begins with ingestion.

Fivetran and Airbyte built billion-dollar businesses on this problem. The patterns you ship here — backoff, rate limiting, idempotency, schema gates — are the load-bearing primitives behind every modern ingestion layer.

The 5-shape problem

REST, webhooks, S3 drops, SaaS exports, and streaming all need the same primitives — but most teams write retry logic five times. This project shows you how to write it once.

Schema drift kills pipelines

An additive field is fine. A renamed field silently corrupts your warehouse. The schema gate + Confluent Registry pattern catches it before downstream tables write.

Idempotency is non-negotiable

Webhook providers retry. APIs return duplicates. Without request fingerprinting + bloom-filter dedup, you get double-billed users and 3 a.m. pages. This project ships both.

Airflow is still the standard

Despite Dagster and Prefect, most production ingestion still runs on Airflow. The unified DAG pattern in module 04 is the one that gets shipped at companies running 100+ sources.

Curriculum · 4 modules · 14-18 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 3-4 hours — build the httpx retry client and stand up the token-bucket rate limiter. If the patterns click, upgrade to unlock webhooks, SaaS connectors, and the unified Airflow DAG.

P28 · 14-18 hours · 4 modules
Free preview FREE required
Module 01 is free — no card required. Build the retry + rate-limit + idempotency primitives before paying.
M01
REST ingestion: backoff, pagination, rate limit, idempotency
Build the httpx-based ingestion client with exponential backoff + jitter, Retry-After header parsing, cursor / offset / keyset pagination, a Redis-backed token-bucket rate limiter, request-fingerprinted idempotency, circuit breakers, and watermark-resumable extraction.
3-4h18 lessonsFREE PREVIEW
Start →
M02
Push-based: webhook receiver, dedup, S3 batch, exactly-once outbox
Stand up a FastAPI webhook receiver with HMAC-SHA256 verification. Layer Redis + bloom-filter dedup. Add an S3 batch ingester with manifest tracking and multi-format parsing (CSV / JSON / Parquet / NDJSON). Wire the transactional outbox for exactly-once semantics into Kafka via confluent-kafka.
3-4h18 lessonsFREE TIER
Unlock with PRO →
M03
SaaS connectors: OAuth2, schema gates, contracts, registry
Build OAuth2 connectors for Stripe + Salesforce Bulk API 2.0 with SOQL incremental filters. Layer JSON Schema validation (Draft7), Pydantic models, and a breaking-change gate. Encode contracts as YAML and integrate with Confluent Schema Registry (BACKWARD / FORWARD / TRANSITIVE compatibility modes).
3-4h18 lessonsFREE TIER
Unlock with PRO →
M04
Unified Airflow DAG, freshness SLAs, drift + volume anomalies
Compose all four ingestion modes into one Airflow DAG with TaskGroups, S3KeySensors for file-arrival triggers, and pools for per-source rate limits. Export Prometheus freshness metrics, detect schema drift against the registry, run a Z-score volume anomaly query, and ship K8s manifests with a blue/green deploy script.
5-6h17 lessonsFREE TIER
Unlock with PRO →
3 modules locked · Unlock all PRO content for $29/mo
Upgrade to PRO →
Backed by curriculum

API & External System Integration

10 modules·12 hours·REST fundamentals·OAuth + secrets·orchestration patterns·schema evolution·scaling + observability
Open curriculum

This curriculum is the foundation for the project — not a sales add-on. PRO subscribers get full access to every module.

The build, in 3 phases

Three sprints. Three checkpoints. One ingestion service.

Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are in the build.

01~7h
Pull-based foundations (parts 1 + 2)

httpx retry client + Redis token bucket + Postgres idempotency store live. FastAPI webhook receiver with HMAC + bloom-filter dedup. S3 batch ingester with multi-format parsing.

  • httpx client + circuit breaker + DLQ
  • FastAPI webhook receiver + bloom dedup
  • S3 manifest ingester + 4 format parsers
02~4h
SaaS connectors with schema gates (part 3)

OAuth2 connectors for Stripe + Salesforce Bulk API. JSON Schema + Pydantic validation. Breaking-change gate. YAML contracts. Confluent Schema Registry integration with compatibility checks.

  • Stripe + Salesforce OAuth2 connectors
  • Schema evolution gate (additive vs breaking)
  • Schema Registry client + YAML contracts
03~6h
Unified orchestration + monitoring (part 4)

All four ingestion modes composed in one Airflow DAG with TaskGroups + S3KeySensors + pools. Prometheus freshness exporter. Schema drift detector. Z-score volume anomaly query. K8s blue/green manifests.

  • Unified Airflow DAG + dynamic factory
  • Prometheus freshness + Z-score volume anomaly
  • K8s blue/green manifests + post-deploy checks
Project setup · 10 minutes

One command. Local Postgres + Redis + mock API + 4 sample datasets.

You get the full stack on day one — Postgres for the idempotency / outbox / watermark tables, Redis for the token bucket, a mock API for paginated REST fixtures, and pre-built sample datasets across all four ingestion modes.

What lives in the repo

Everything you need to run the four-mode ingestion pipeline on your laptop, plus the fixtures and verification queries used in modules 02–04.

  • docker-compose.yml — Postgres, Redis, mock REST API
  • ingestion/ — httpx client, retry, pagination, rate limiter, idempotency
  • ingestion/webhooks/ — FastAPI receiver, HMAC, dedup, outbox
  • connectors/ — OAuth2 manager, Stripe + Salesforce Bulk API
  • dags/ — unified Airflow DAG + dynamic factory + sensors
  • data/ — REST fixtures, webhook events, S3 batch files, SaaS export
Download · Starter Kit

API Data Ingestion Starter Kit

Pre-configured Docker stack, sample datasets across REST / webhooks / S3 / SaaS, and the module 01 retry-client scaffolds. Skip the boilerplate, start on the patterns.

145 KB · 105 files · 4 sample datasets · PRO required
~/projects/api-data-ingestion — zsh
1. Clone and start the stack
$ git clone github.com/ai-de/p28-api-data-ingestion
$ cd p28-api-data-ingestion && docker-compose up -d
2. Install dependencies + set env
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install -r requirements.txt
$ export DB_URL=postgresql://postgres:postgres@localhost:5432/ingestion
$ export REDIS_URL=redis://localhost:6379/0
3. Run the REST ingestion pipeline
$ python -m ingestion.pipeline
$ # ✓ 500 records ingested across 5 paginated pages
4. Start the webhook receiver
$ uvicorn ingestion.webhooks.receiver:app --reload --port 8000
$ # POST /webhooks/stripe — HMAC verified, deduped, written to outbox
500
REST records (5 × 100 paginated)
1,500
Webhook events (~5% dup)
800
S3 batch records (CSV + NDJSON)
300
SaaS export records
Production hardening

The same service — but built for the 10x case.

Most ingestion tutorials hand you a requests.get() with a try/except. This one shows what changes when you’re running 100+ sources, the rate limits collide, and a webhook provider retries the same event 17 times.

Tutorial-gradeWhat you have today
×
Rate limiting
Single-instance Redis token bucket
×
Idempotency store
Local Postgres, no retention
×
Kafka
Single-broker docker container
×
Schema Registry
Mock client against in-memory map
×
Deploy
K8s YAML applied by hand
×
Outbox
Polling relay in the same process
Production-gradeModule 03–04
Rate limiting
Cluster-mode Redis with EVAL Lua + replication for HA token bucket
Idempotency store
Aurora with read replicas + TTL retention policy on fingerprints
Kafka
MSK / Confluent Cloud with rack-aware partitioning and min.insync.replicas=2
Schema Registry
Confluent Cloud with subject ACLs + FULL_TRANSITIVE compatibility
Deploy
ArgoCD GitOps with progressive delivery and kubectl rollout status gates
Outbox
CDC-driven outbox via Debezium connector — no polling drift
PRO benefit · code review

Real review from senior engineers who shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that’s quietly worth thousands of dollars in time-to-staff.

CR

4 reviews / month

Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — ingestion / Airflow / Kafka for this project. Async, comments inline, average turnaround 31 hours.

31h
avg turnaround
9.2/10
helpfulness
94%
return next month
OH

2 office hours / month

Live 30-min sessions with a senior data engineer. Architecture questions, whiteboard a tricky migration, mock a system-design interview. Group sessions also available.

30 min
per session
2 / mo
included
+ group
unlimited
What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for engineers who want production-shaped builds and feedback loops — not more tutorials.

What you getFREEPROEXPERT
Projects
Production-grade builds
2
15+
8
Curriculum modules
All 7 tracks
Phase 1 only
All
All + bonus
Code review credits
Senior engineer review
0
4 / month
Unlimited
Career path access
5 paths × full plans
1 path
All 5
All 5 + 1:1
Certificate
Verifiable on LinkedIn
Yes
Yes + portfolio review
Community
Discord + office hours
Read-only
Full + 2/mo
Full + 4/mo
$29/mo
billed monthly · cancel anytime
or annual
$249/yr save 28%
Upgrade to PRO
Who this is for

Pick this if you’re shipping the integration tier, not just consuming it.

DE

Data engineers

You’ve wired one or two ingestion jobs, hit the same retry / pagination / dedup problem each time, and want the patterns codified before you write the third one.

PE

Platform engineers

You run ingestion for 5+ teams. You need a self-serve framework so new sources don’t reinvent the rate limiter or the idempotency store on every project.

BE

Backend engineers crossing over

You know HTTP and Kafka but the warehouse side is opaque. This project gives you the orchestration + schema gate vocabulary in language you already speak.

IE

Integration engineers

You build connectors for a living. This is the connector framework you wish your last shop had — OAuth2, pagination, contracts, registry, all in one place.

FAQ

Quick answers.

Module 01 builds the four primitives every production ingestion job needs — exponential backoff with jitter, token-bucket rate limiting (Redis-backed), request-fingerprinted idempotency, and watermark-resumable extraction — as 18 working Python files with a circuit breaker and DLQ. Most free tutorials show you `requests.get()` once and call it a day.
No live cloud (no real AWS account, no live Confluent Cloud — everything runs locally). No CDC from a database (this is API/webhook/SaaS pull, not Debezium). No ML feature store. The K8s blue/green is shipped as manifests + a deploy script, not a running cluster. We're explicit about this so you know what you're getting.
P02 is broker-internal: routing, fan-out, replay across Kafka topics. This project is the upstream layer — getting external API / webhook / S3 / SaaS data INTO the platform. The two compose: this project lands events in the outbox, P02 routes them downstream.
No. Everything runs locally — Postgres + Redis + a mock REST API in Docker. The S3 ingester runs against a localstack-shaped fixture, the Kafka client points at a single-broker container, and the Confluent Schema Registry calls hit a mock client. Patterns transfer cleanly to AWS / Confluent Cloud with config changes only.
The pattern is exactly-once at the outbox boundary — you write to your DB and the outbox table in one transaction, and a polling relay publishes to Kafka. Whether the consumer is exactly-once depends on its commit semantics. We show the producer-side pattern, document the assumptions, and don't claim end-to-end E2E exactly-once.
All 15+ PRO projects, 4 code-review credits per month, 2 office-hours sessions, full curriculum across all 7 tracks, all 5 career paths, certificate of completion, and full community access. Cancel anytime.

Ready to ship the real integration tier?

Start with module 01 — free, no card. About 3-4 hours. By the end you'll have a working httpx ingestion client with backoff, rate limiting, idempotency, and watermarks — the primitives every production ingestion job sits on top of.

P28 · Multi-source ingestion service · PRO · module 01 freeUpgrade to PRO →
Press Cmd+K to open