Hands-On Project
~14 hours · 4 parts

Marketing API Ingestion Service

Build a fault-tolerant Python service that paginates through third-party ad network APIs, handles rate limits, and loads data into the warehouse.

Data Engineer · Platform Engineer · Backend DE · Integration Engineer

Ingestion Pipeline

Sources: REST API · Webhooks · S3 Batch · SaaS Export

INGEST — httpx · FastAPI · boto3
VALIDATE — JSON Schema · Pydantic
STORE — PostgreSQL · Kafka
MONITOR — Prometheus · Airflow

What You'll Build

1

Foundation — REST API Ingestion with Retry Logic

3–4 hours

Build a production-grade REST API client with exponential backoff, cursor/offset/keyset pagination, token-bucket rate limiting, idempotent request handling, and structured error responses. The backbone of every data ingestion layer.

HTTP client with exponential backoff and jitter retry logic
Pagination engine (cursor, offset, keyset) with auto-detection
Token-bucket rate limiter respecting API provider limits
Idempotency layer with request fingerprinting and dedup
Structured error handling with retry vs. fail classification
Incremental extraction with watermark-based state tracking
Checkpoint: REST ingestion client operational
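To make the retry behavior in this part concrete, here is a minimal full-jitter exponential backoff sketch in plain Python. The names (`call_with_retries`, `RetryableError`, `is_retryable`) are illustrative, not part of the project scaffold, and a real client would wrap an httpx call rather than an arbitrary function:

```python
import random
import time


class RetryableError(Exception):
    """Raised by the caller when a response is transient (e.g. HTTP 429/5xx)."""


def is_retryable(status: int) -> bool:
    """429 and 5xx are worth retrying; other 4xx errors should fail fast."""
    return status == 429 or status >= 500


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: sleep a uniform random time in [0, min(cap, base * 2**attempt))."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retries(fn, max_retries: int = 5, sleep=time.sleep):
    """Invoke fn(); on RetryableError, back off with jitter and try again."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RetryableError:
            if attempt == max_retries - 1:
                raise  # budget exhausted; surface the failure
            sleep(backoff_delay(attempt))
```

Full jitter (rather than a fixed `base * 2**attempt` sleep) spreads retries out so that many clients hitting the same rate limit do not retry in lockstep.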
2

Events — Webhook Receiver & S3 Batch Drops

3–4 hours

Handle push-based data: build a webhook receiver with HMAC signature verification and event deduplication, plus an S3 batch ingestion pipeline with manifest tracking, file format detection, and dead letter queues for failed events.

Webhook receiver with HMAC-SHA256 signature verification
Event deduplication with idempotency keys and bloom filters
S3 batch file ingestion with manifest tracking and checksums
Multi-format parser (CSV, JSON, Parquet, NDJSON)
Dead letter queue for failed events with replay capability
Exactly-once delivery guarantee with transactional outbox
Checkpoint: push-based ingestion layer live
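The signature-verification and dedup steps above need only the standard library. A sketch, with the FastAPI handler wiring omitted and the helper names (`sign_payload`, `verify_signature`, `is_duplicate`) chosen for illustration:

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 digest a provider would attach to a webhook delivery."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_sig: str) -> bool:
    """Recompute the signature over the raw body and compare in constant time."""
    expected = sign_payload(secret, body)
    # hmac.compare_digest avoids leaking where the strings diverge (timing attack).
    return hmac.compare_digest(expected, received_sig)


def is_duplicate(seen: set, event_id: str) -> bool:
    """In-memory dedup on the provider's idempotency key; record new ids as seen."""
    if event_id in seen:
        return True
    seen.add(event_id)
    return False
```

In production the `seen` set would be a bounded structure (the bloom filter from the checklist, or a Redis set with TTL), since webhook ids accumulate without bound.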
3

Connectors — SaaS Export & Schema Validation

3–4 hours

Build connectors for third-party SaaS platforms (Salesforce, Stripe, HubSpot patterns), implement JSON Schema validation on every record, handle schema evolution gracefully, and enforce data contracts between source and consumer.

SaaS connector framework with OAuth2 token refresh
Salesforce bulk API extractor with SOQL query builder
JSON Schema validation on every ingested record
Schema evolution handler (additive vs. breaking changes)
Data contract enforcement between source and warehouse
Schema registry integration with compatibility checks
Checkpoint: SaaS connectors with validated schemas
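Per-record validation is the heart of this part. The project itself would lean on the jsonschema library or Pydantic; the toy validator below covers only a tiny subset of JSON Schema (`required`, `properties`, `type`) to show the shape of the check:

```python
# Map JSON Schema type names to Python runtime types (toy subset).
PY_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate_record(record: dict, schema: dict) -> list:
    """Return a list of violations; an empty list means the record passed."""
    errors = []
    for field in schema.get("required", []):
        if field not in record:
            errors.append(f"missing required field: {field}")
    for field, rules in schema.get("properties", {}).items():
        if field in record and "type" in rules:
            value = record[field]
            # bool is a subclass of int in Python, so reject True/False explicitly
            # when the schema asks for a numeric type.
            if isinstance(value, bool) and rules["type"] in ("integer", "number"):
                errors.append(f"{field}: expected {rules['type']}, got boolean")
            elif not isinstance(value, PY_TYPES[rules["type"]]):
                errors.append(f"{field}: expected {rules['type']}")
    return errors
```

Returning a list of violations (rather than raising on the first one) lets the pipeline route a bad record to the dead letter queue with every problem attached.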
4

Production — Unified Pipeline & Monitoring

3–4 hours

Orchestrate all four source types in a unified Airflow DAG, implement source-specific scheduling strategies, build freshness SLA monitoring that alerts on schema drift and volume anomalies, and deploy with blue/green rollback safety.

Unified Airflow DAG orchestrating REST, webhook, S3, and SaaS sources
Source-specific scheduling (cron, event-driven, file-arrival)
Freshness SLA monitoring with per-source latency tracking
Schema drift detection with automated alerting
Volume anomaly detection (missing data, duplicates, spikes)
Blue/green deployment with rollback on ingestion failure
Checkpoint: multi-source ingestion platform deployed
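Volume anomaly detection from this checklist can be as simple as a robust z-score over recent daily record counts. A sketch (thresholds and window are assumptions, not project requirements); median/MAD is used instead of mean/stdev so one past spike does not mask the next:

```python
import statistics


def volume_anomaly(history, today, threshold=3.0):
    """Flag today's record count if it sits > threshold robust z-scores from history.

    history: recent daily counts for one source; today: the count under test.
    """
    med = statistics.median(history)
    # Median absolute deviation; fall back to 1.0 so a perfectly flat history
    # does not divide by zero.
    mad = statistics.median(abs(x - med) for x in history) or 1.0
    # 1.4826 rescales MAD to be comparable to a standard deviation for
    # normally distributed data.
    robust_z = abs(today - med) / (1.4826 * mad)
    return robust_z > threshold
```

The same check catches both missing data (count collapses toward zero) and duplicate-driven spikes, which is why the checklist groups them together.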

Skills This Project Reinforces

API Integration

M1: REST Fundamentals, M2: Auth & Secrets

Error Handling

Retry Logic, Dead Letter Queues, Circuit Breakers

Schema Validation

JSON Schema, Evolution, Contract Testing

Data Quality

Deduplication, Idempotency, Validation

Orchestration

Airflow DAGs, Scheduling Strategies, Dependencies

Observability

SLA Monitoring, Alerting, Drift Detection

Tech Stack

Python
Language
FastAPI
Webhooks
httpx
HTTP Client
Apache Airflow
Orchestration
Kafka
Streaming
PostgreSQL
State Store
Redis
Rate Limiting
JSON Schema
Validation
boto3
AWS S3
Great Expectations
Quality
Prometheus
Monitoring
Docker
Containers

Sample Datasets

mock_rest_api/ — 8 MB · 50K records

Paginated REST API simulator with rate limiting, cursor pagination, and intermittent 429/500 errors
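One way to drain a cursor-paginated endpoint like this simulator, assuming you supply a `fetch_page(cursor) -> (records, next_cursor)` wrapper (the HTTP call plus the retry logic from part 1 — the wrapper name and signature are illustrative):

```python
def paginate_cursor(fetch_page, start_cursor=None):
    """Yield every record from a cursor-paginated source until the cursor runs out.

    fetch_page(cursor) returns (records, next_cursor); a next_cursor of None
    signals the final page.
    """
    cursor = start_cursor
    while True:
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            return
```

Keeping pagination as a generator means the caller controls batching and checkpointing; the loop itself holds no more than one page in memory.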

webhook_events.jsonl — 12 MB · 100K events

Simulated webhook payloads with HMAC signatures, duplicate events, and out-of-order delivery

s3_batch_drop/ — 25 MB · 200K records

S3 batch files in mixed formats (CSV, JSON, Parquet) with manifest files and checksums

saas_export_sample.json — 5 MB · 10K records

Salesforce-style bulk API export with schema evolution scenarios (added fields, type changes)

Resume-Ready Bullets

Built a multi-source data ingestion layer handling REST APIs, webhooks, S3 batch drops, and SaaS exports, with exponential backoff retry logic reducing failed extractions by 95%

Implemented idempotent ingestion pipeline with request fingerprinting and bloom filter deduplication, achieving exactly-once delivery across 4 heterogeneous data sources processing 500K+ daily records

Designed schema validation framework using JSON Schema with automated drift detection, preventing 100% of breaking schema changes from reaching the data warehouse

Orchestrated unified Airflow DAG with source-specific scheduling (cron, event-driven, file-arrival), freshness SLA monitoring, and blue/green deployment with automated rollback


Ready to Build Your Ingestion Layer?

Every data engineering role starts with getting data in. This project gives you the production patterns that separate "it works on my laptop" from "it runs in prod."
