Build a fault-tolerant ingestion service for REST, webhooks, S3, and SaaS
Ship a Python service that pulls from REST APIs with backoff + token-bucket rate limiting, receives HMAC-signed webhooks with bloom-filter dedup, ingests S3 batch drops (CSV / JSON / Parquet / NDJSON), and pulls from SaaS connectors (Stripe, Salesforce Bulk API) — orchestrated by one Airflow DAG with schema gates.
This is the system-design question data platform teams get asked at Stripe, Airbnb, Uber, and any company running a Fivetran-shaped ingestion stack: how do you ingest from five different shapes of API without duplicating retry logic, rate limiting, and schema gates?
- An httpx client with exponential backoff, token-bucket rate limiter (Redis), and request-fingerprinted idempotency
- A FastAPI webhook receiver with HMAC-SHA256 verification and bloom-filter dedup
- An S3 batch ingester with manifest tracking + multi-format parser (CSV / JSON / Parquet / NDJSON)
- OAuth2 SaaS connectors for Stripe + Salesforce Bulk API 2.0 with SOQL incremental filters
- JSON Schema validation, breaking-change gate, and a Confluent Schema Registry client
- A unified Airflow DAG with Prometheus freshness exporter and a Z-score volume-anomaly query
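As a rough illustration of the first bullet, here is a minimal sketch of exponential backoff with full jitter and a token bucket. It is an in-process, standard-library stand-in: the project itself keeps the bucket in Redis and drives requests through httpx. The names `backoff_delays` and `TokenBucket`, and the default `base` and `cap` values, are assumptions made for this example.

```python
import random
import time


def backoff_delays(retries, base=0.5, cap=30.0, rng=random.random):
    """Yield one sleep duration per retry: exponential growth, capped, with full jitter."""
    for attempt in range(retries):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random point between 0 and the current ceiling.
        yield rng() * ceiling


class TokenBucket:
    """In-process token bucket: refills `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self, cost=1.0):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would check `try_acquire()` before each request and sleep for the next value from `backoff_delays(...)` after a retryable failure. Injecting `rng` and `clock` keeps both pieces deterministic under test.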
Every data platform begins with ingestion.
Fivetran and Airbyte built billion-dollar businesses on this problem. The patterns you ship here — backoff, rate limiting, idempotency, schema gates — are the load-bearing primitives behind every modern ingestion layer.
The 5-shape problem
REST, webhooks, S3 drops, SaaS exports, and streaming all need the same primitives — but most teams write retry logic five times. This project shows you how to write it once.
Schema drift kills pipelines
An additive field is fine. A renamed field silently corrupts your warehouse. The schema gate + Confluent Registry pattern catches it before anything is written to downstream tables.
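The additive-versus-breaking distinction can be sketched as a comparison of two field maps. This is a simplified illustration, not the project's gate and not Schema Registry's compatibility algorithm: the function name `classify_change` and the flat `{field_name: type_name}` schema shape are assumptions made for the example.

```python
def classify_change(old_fields, new_fields):
    """Compare two {field_name: type_name} maps and flag breaking changes."""
    old_names, new_names = set(old_fields), set(new_fields)
    removed = sorted(old_names - new_names)
    added = sorted(new_names - old_names)
    retyped = sorted(
        name for name in old_names & new_names
        if old_fields[name] != new_fields[name]
    )
    # A rename appears as one removed field plus one added field,
    # so it is caught by the "removed" check.
    breaking = bool(removed or retyped)
    return {"added": added, "removed": removed, "retyped": retyped, "breaking": breaking}
```

A gate built on this idea would let a batch through when `breaking` is false and stop the load otherwise, which is how a renamed column gets blocked while a new optional column does not.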
Idempotency is non-negotiable
Webhook providers retry. APIs return duplicates. Without request fingerprinting + bloom-filter dedup, you get double-billed users and 3 a.m. pages. This project ships both.
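A minimal sketch of the two ideas named above, using only the standard library. The canonicalization rules in `request_fingerprint`, the `BloomFilter` class, and its default sizes are assumptions for illustration; the project's own implementation may differ.

```python
import hashlib
import json


def request_fingerprint(method, url, params=None, body=None):
    """Stable SHA-256 over the parts of a request that define 'the same call'."""
    canonical = json.dumps(
        {"method": method.upper(), "url": url, "params": params or {}, "body": body},
        sort_keys=True,            # key order must not change the fingerprint
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class BloomFilter:
    """Fixed-size bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive one bit position per hash from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

Because a bloom filter can report an unseen event as seen, a hit is usually treated as "probably a duplicate" and confirmed against an authoritative store before the event is dropped. A miss is always safe to process.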
Airflow is still the standard
Despite Dagster and Prefect, most production ingestion still runs on Airflow. The unified DAG pattern in module 04 is the one that gets shipped at companies running 100+ sources.
Module 01 is free. The rest unlocks with PRO.
Try the first 3-4 hours — build the httpx retry client and stand up the token-bucket rate limiter. If the patterns click, upgrade to unlock webhooks, SaaS connectors, and the unified Airflow DAG.
API & External System Integration
This curriculum is the foundation for the project — not a sales add-on. PRO subscribers get full access to every module.
Three sprints. Three checkpoints. One ingestion service.
Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are in the build.
httpx retry client + Redis token bucket + Postgres idempotency store live. FastAPI webhook receiver with HMAC + bloom-filter dedup. S3 batch ingester with multi-format parsing.
- ✓ httpx client + circuit breaker + DLQ
- ✓ FastAPI webhook receiver + bloom dedup
- ✓ S3 manifest ingester + 4 format parsers
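The HMAC step in the webhook receiver comes down to recomputing the signature over the raw request body and comparing it in constant time. The sketch below assumes the provider sends a bare hex-encoded SHA-256 HMAC; real providers vary in header name and format (some prefix a timestamp), and the function names here are illustrative.

```python
import hashlib
import hmac


def sign(secret: bytes, payload: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw payload bytes."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, payload: bytes, received_signature: str) -> bool:
    """Return True only if the received signature matches the recomputed one."""
    expected = sign(secret, payload)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), received_signature.encode("utf-8"))
```

The payload has to be the exact bytes received on the wire. Parsing the JSON and re-serializing it before verification can change whitespace or key order and make a valid signature fail.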
OAuth2 connectors for Stripe + Salesforce Bulk API. JSON Schema + Pydantic validation. Breaking-change gate. YAML contracts. Confluent Schema Registry integration with compatibility checks.
- ✓ Stripe + Salesforce OAuth2 connectors
- ✓ Schema evolution gate (additive vs breaking)
- ✓ Schema Registry client + YAML contracts
All four ingestion modes composed in one Airflow DAG with TaskGroups + S3KeySensors + pools. Prometheus freshness exporter. Schema drift detector. Z-score volume anomaly query. K8s blue/green manifests.
- ✓ Unified Airflow DAG + dynamic factory
- ✓ Prometheus freshness + Z-score volume anomaly
- ✓ K8s blue/green manifests + post-deploy checks
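The Z-score volume check is plain arithmetic: compare today's row count with the mean and standard deviation of a trailing window. The project describes it as a query; the sketch below shows the same calculation in Python, with the function names and the threshold of 3 chosen for illustration.

```python
from statistics import mean, stdev


def volume_zscore(history, today):
    """Z-score of today's row count against a trailing window of daily counts."""
    if len(history) < 2:
        return None  # not enough data for a sample standard deviation
    sigma = stdev(history)
    if sigma == 0:
        return None  # perfectly flat history: the z-score is undefined
    return (today - mean(history)) / sigma


def is_anomalous(history, today, threshold=3.0):
    """Flag a day whose volume sits more than `threshold` deviations from the mean."""
    z = volume_zscore(history, today)
    return z is not None and abs(z) > threshold
```

For a window of `[100, 102, 98, 101, 99]`, a day with 50 rows is flagged and a day with 101 rows is not. Returning `None` for short or flat windows avoids a division by zero and leaves the caller to decide how to treat sources with too little history.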
One command. Local Postgres + Redis + mock API + 4 sample datasets.
You get the full stack on day one — Postgres for the idempotency / outbox / watermark tables, Redis for the token bucket, a mock API for paginated REST fixtures, and pre-built sample datasets across all four ingestion modes.
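To give a feel for what the idempotency and watermark tables do, here is a small sketch that uses an in-memory SQLite database as a stand-in for Postgres. The table names, column names, and helper functions are all hypothetical, and the upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3


def open_store():
    """Create an in-memory store with one idempotency table and one watermark table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE idempotency_keys (fingerprint TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE watermarks (source TEXT PRIMARY KEY, cursor_value TEXT NOT NULL)"
    )
    return conn


def claim(conn, fingerprint):
    """Return True the first time a fingerprint is seen, False on every repeat."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO idempotency_keys (fingerprint) VALUES (?)", (fingerprint,)
    )
    conn.commit()
    return cur.rowcount == 1


def advance_watermark(conn, source, cursor_value):
    """Move a source's watermark forward; older values are ignored."""
    conn.execute(
        "INSERT INTO watermarks (source, cursor_value) VALUES (?, ?) "
        "ON CONFLICT(source) DO UPDATE SET cursor_value = excluded.cursor_value "
        "WHERE excluded.cursor_value > watermarks.cursor_value",
        (source, cursor_value),
    )
    conn.commit()


def get_watermark(conn, source):
    row = conn.execute(
        "SELECT cursor_value FROM watermarks WHERE source = ?", (source,)
    ).fetchone()
    return row[0] if row else None
```

The watermark comparison is a string comparison, so it only orders correctly for values that sort lexicographically, such as ISO 8601 timestamps in a single timezone.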
What lives in the repo
Everything you need to run the four-mode ingestion pipeline on your laptop, plus the fixtures and verification queries used in modules 02–04.
- docker-compose.yml — Postgres, Redis, mock REST API
- ingestion/ — httpx client, retry, pagination, rate limiter, idempotency
- ingestion/webhooks/ — FastAPI receiver, HMAC, dedup, outbox
- connectors/ — OAuth2 manager, Stripe + Salesforce Bulk API
- dags/ — unified Airflow DAG + dynamic factory + sensors
- data/ — REST fixtures, webhook events, S3 batch files, SaaS export
API Data Ingestion Starter Kit
Pre-configured Docker stack, sample datasets across REST / webhooks / S3 / SaaS, and the module 01 retry-client scaffolds. Skip the boilerplate, start on the patterns.
The same service — but built for the 10x case.
Most ingestion tutorials hand you a requests.get() with a try/except. This one shows what changes when you’re running 100+ sources, the rate limits collide, and a webhook provider retries the same event 17 times.
- EVAL Lua + replication for HA token bucket
- TTL retention policy on fingerprints
- min.insync.replicas=2
- FULL_TRANSITIVE compatibility
- kubectl rollout status gates
- Debezium connector: no polling drift
Real review from senior engineers who shipped this stack.
Submit your repo, get line-by-line feedback within 48 hours. The kind of review that’s quietly worth thousands of dollars in time-to-staff.
4 reviews / month
Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — ingestion / Airflow / Kafka for this project. Async, comments inline, average turnaround 31 hours.
2 office hours / month
Live 30-min sessions with a senior data engineer. Architecture questions, whiteboard a tricky migration, mock a system-design interview. Group sessions also available.
One subscription. 15+ projects, all curriculum, code review.
PRO is built for engineers who want production-shaped builds and feedback loops — not more tutorials.
Pick this if you’re shipping the integration tier, not just consuming it.
Data engineers
You’ve wired one or two ingestion jobs, hit the same retry / pagination / dedup problem each time, and want the patterns codified before you write the third one.
Platform engineers
You run ingestion for 5+ teams. You need a self-serve framework so new sources don’t reinvent the rate limiter or the idempotency store on every project.
Backend engineers crossing over
You know HTTP and Kafka but the warehouse side is opaque. This project gives you the orchestration + schema gate vocabulary in language you already speak.
Integration engineers
You build connectors for a living. This is the connector framework you wish your last shop had — OAuth2, pagination, contracts, registry, all in one place.
Going deeper? Three tracks back this project.
API integration is the spine. These three curricula let you go deeper on the layers the unified DAG sits on: Python craft, Kafka, and event-driven design.
Ready to ship the real integration tier?
Start with module 01 — free, no card. About 3-4 hours. By the end you'll have a working httpx ingestion client with backoff, rate limiting, idempotency, and watermarks — the primitives every production ingestion job sits on top of.