How to Design a Data Pipeline
Design a data pipeline in 6 steps: clarify requirements (throughput, latency, consistency, retention) → estimate capacity (GB/s, TB/day) → select ingestion layer → select processing layer → choose table format → write the RFC. Every component choice must be justified against your specific requirements, not picked by intuition.
Clarify Requirements
Never design before you understand the constraints. Four questions determine every component choice: throughput, latency, consistency, and retention.
requirements checklist
# Questions to answer before touching the architecture
throughput: 500K events/sec peak
latency_slo: p99 < 2 minutes end-to-end
consistency: exactly-once (financial data)
retention: 3 years (compliance requirement)
read_pattern: ad-hoc SQL + ML feature serving
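The checklist above can be pinned down as a single record so that every later component choice references one source of truth. A minimal sketch; the field names here are ours, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineRequirements:
    """The four constraints from the checklist, as one immutable record."""
    peak_events_per_sec: int
    latency_slo_sec: int      # end-to-end p99
    delivery: str             # "at-least-once" or "exactly-once"
    retention_days: int

reqs = PipelineRequirements(
    peak_events_per_sec=500_000,
    latency_slo_sec=120,           # p99 < 2 minutes
    delivery="exactly-once",       # financial data
    retention_days=3 * 365,        # compliance
)
```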
Estimate Capacity
Back-of-envelope math is required in RFCs and interviews. Calculate throughput in GB/s and storage growth in TB/day. Always show peak headroom.
capacity.py — show your math
events_per_sec = 500_000
bytes_per_event = 1_024 # 1 KB avg
throughput_GBps = 500_000 * 1_024 / 1e9 # 0.5 GB/s
daily_raw_TB = 0.5 * 86_400 / 1_000 # 43 TB/day
compressed_TB = 43 * 0.25 # 11 TB/day (4:1 Parquet)
peak_multiplier = 3 # plan for 3× average
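The same arithmetic extends to peak throughput and total retained storage. A sketch reusing the figures above; the 4:1 Parquet ratio and the 3x peak multiplier are planning assumptions, not measurements:

```python
events_per_sec = 500_000
bytes_per_event = 1_024

avg_GBps = events_per_sec * bytes_per_event / 1e9        # ~0.51 GB/s average
peak_GBps = avg_GBps * 3                                 # ~1.5 GB/s, size brokers for this
daily_compressed_TB = avg_GBps * 86_400 / 1_000 * 0.25   # ~11 TB/day after 4:1 Parquet
retained_PB = daily_compressed_TB * 365 * 3 / 1_000      # ~12 PB over the 3-year retention

print(f"peak {peak_GBps:.2f} GB/s, retained {retained_PB:.1f} PB")
```

Note what dominates: compute is sized by the peak GB/s, but cost is sized by the multi-petabyte retention tail.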
Select the Ingestion Layer
The ingestion layer determines your durability, replayability, and throughput ceiling. Kafka is the default for high-volume streaming; scheduled batch workloads can land as files in S3 instead.
ingestion layer decision
# Ingestion layer selection matrix
Kafka → high throughput, durable, replayable, multi-consumer
Kinesis → AWS-native, managed, lower ops overhead
Pub/Sub → GCP-native, global delivery guarantees
S3 + Airflow → scheduled batch, simple, low cost
Debezium CDC → database change capture, row-level events
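A back-of-envelope that often accompanies the Kafka choice is the topic partition count, derived from peak throughput. A sketch; the 10 MB/s per-partition figure is an assumption you should replace with a benchmark from your own cluster:

```python
import math

def kafka_partition_count(peak_mbps: float,
                          per_partition_mbps: float = 10.0,
                          headroom: float = 1.5) -> int:
    """Partitions needed to absorb peak load with headroom.

    Sustained per-partition throughput varies with message size,
    replication factor, and disk type, so treat the default as a
    placeholder, not a Kafka guarantee.
    """
    return math.ceil(peak_mbps * headroom / per_partition_mbps)

# 1.5 GB/s peak from the capacity step
print(kafka_partition_count(peak_mbps=1_500))  # 225
```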
Select the Processing Layer
Processing determines your latency floor and transformation expressiveness. Match the engine to your latency SLO and team skillset.
processing layer decision
# Processing layer selection matrix
Flink → stateful streaming, sub-minute latency, exactly-once
Spark → large-scale batch, micro-batch, rich ML ecosystem
dbt → SQL transformations in warehouse, medallion layers
Kafka Streams → lightweight stateful, no cluster, JVM-native
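To make "stateful streaming" concrete, here is a toy tumbling-window count in plain Python. It only illustrates the windowing semantics; real engines such as Flink or Kafka Streams add checkpointed state, watermarks for late data, and exactly-once output, none of which are modeled here:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_sec=60):
    """Count events per key per tumbling window.

    events: iterable of (epoch_seconds, key) pairs.
    Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_sec) * window_sec
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "a"), (30, "a"), (59, "b"), (60, "a"), (125, "b")]
windows = tumbling_window_counts(events)
print(windows)  # {(0, 'a'): 2, (0, 'b'): 1, (60, 'a'): 1, (120, 'b'): 1}
```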
Choose a Table Format
Table format determines schema evolution, time travel, and multi-engine support. Iceberg is the default for new lakehouses; Delta for Databricks-native workflows.
table format decision
# Table format selection matrix
Iceberg → multi-engine, hidden partitioning, vendor-neutral
Delta Lake → Databricks-native, ACID, OPTIMIZE/ZORDER
Hudi → upsert-first, CDC patterns, row-level updates
Parquet only → no schema evolution, simple batch, lowest overhead
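The matrix collapses into a first-pass heuristic. Treat it as a tiebreaker order, not a rule: all three lakehouse formats support upserts and schema evolution to some degree, and catalog and engine support usually decide in practice:

```python
def pick_table_format(upsert_heavy: bool,
                      databricks_native: bool,
                      needs_schema_evolution: bool) -> str:
    """First-pass mapping from workload traits to the matrix above."""
    if upsert_heavy:
        return "Hudi"
    if databricks_native:
        return "Delta Lake"
    if needs_schema_evolution:
        return "Iceberg"
    return "Parquet only"

print(pick_table_format(False, False, True))   # Iceberg
print(pick_table_format(False, False, False))  # Parquet only
```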
Write the RFC
Document the design before building it. A production RFC includes problem statement, architecture diagram, component justifications, capacity estimates, failure modes, and implementation timeline.
RFC structure
# RFC sections (Netflix/Uber format)
1. Problem Statement — what breaks without this?
2. Scope & Non-Goals — explicit boundaries
3. Architecture — 6-layer diagram + component rationale
4. Capacity Estimates — throughput, storage, peak math
5. Failure Modes — what breaks and how you recover
6. Alternatives — rejected options with scored tradeoffs
7. Timeline — milestones and dependencies
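Section 6 lands better with scored tradeoffs than with prose. A minimal weighted-scoring sketch; the criteria, weights, and 1-5 scores below are illustrative, so calibrate them with your reviewers:

```python
def score_alternatives(options, weights):
    """Weighted sum of per-criterion scores for each alternative."""
    return {
        name: sum(weights[c] * score for c, score in scores.items())
        for name, scores in options.items()
    }

weights = {"latency": 0.4, "ops_cost": 0.3, "team_fit": 0.3}
options = {
    "Flink":       {"latency": 5, "ops_cost": 2, "team_fit": 2},
    "Spark batch": {"latency": 2, "ops_cost": 4, "team_fit": 5},
}
scores = score_alternatives(options, weights)
print(scores)
```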
When Formal Pipeline Design Matters Most
- The pipeline will be consumed by multiple teams or external stakeholders
- Throughput exceeds 10K events/sec or storage grows faster than 1 TB/day
- You are preparing for a staff or senior DE system design interview
- The pipeline must meet compliance requirements (retention, lineage, audit trail)
Common Design Issues
No capacity estimates
Designing without numbers leads to undersized Kafka clusters, overpaying for Redshift, or hitting S3 request limits. Always calculate throughput and storage before choosing tools.
Streaming for everything
Flink is operationally expensive. If your SLO is 1-hour latency, a simple Airflow + Spark batch job is correct and far cheaper to maintain. Match the architecture to the SLO.
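That matching step can be written down as a first cut. The thresholds below follow the rule of thumb used in this guide and are guidelines, not hard cutoffs:

```python
def pick_processing(slo_seconds: int) -> str:
    """Map an end-to-end latency SLO to a processing architecture."""
    if slo_seconds < 5 * 60:
        return "streaming (Kafka + Flink)"
    if slo_seconds < 60 * 60:
        return "micro-batch (Spark Structured Streaming)"
    return "batch (Airflow + Spark/dbt)"

print(pick_processing(2 * 60))    # streaming (Kafka + Flink)
print(pick_processing(4 * 3600))  # batch (Airflow + Spark/dbt)
```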
Missing failure mode analysis
Production pipelines fail. Document what happens when Kafka is unavailable, when a partition skews 10×, or when the upstream schema changes. RFC reviewers will ask.
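When documenting recovery for "Kafka is unavailable", a common policy to specify is retry with exponential backoff and full jitter. A sketch of the delay schedule; the base, cap, and attempt count are illustrative:

```python
import random

def backoff_delays(attempts=5, base=1.0, cap=60.0, seed=None):
    """Exponential backoff with full jitter: attempt i sleeps a random
    amount in [0, min(cap, base * 2**i)] before retrying."""
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]

delays = backoff_delays(seed=7)
print([round(d, 2) for d in delays])
```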
FAQ
- How do I design a scalable data pipeline?
- Start with requirements (throughput, latency, consistency, retention). Calculate capacity (GB/s, TB/day). Select each layer (ingestion, processing, storage, serving). Document every choice with explicit justification in an RFC.
- When should I use batch vs streaming?
- Streaming (Kafka + Flink) for SLOs under 5 minutes. Batch (Airflow + Spark/dbt) when 1+ hour latency is acceptable — which covers most analytics. Micro-batch for sub-hour latency when batch SQL is easier to maintain.
- What should a pipeline RFC include?
- Problem statement, scope boundaries, architecture diagram (all 6 layers), component justifications, capacity estimates, failure modes, alternatives considered, and implementation timeline.