
How to Design a Data Pipeline

Design a data pipeline in 6 steps: clarify requirements (throughput, latency, consistency, retention) → estimate capacity (GB/s, TB/day) → select ingestion layer → select processing layer → choose table format → write the RFC. Every component choice must be justified against your specific requirements, not picked by intuition.

1

Clarify Requirements

Never design before you understand the constraints. Four questions determine every component choice: throughput, latency, consistency, and retention.

requirements checklist

# Questions to answer before touching the architecture

throughput:     500K events/sec peak
latency_slo:    p99 < 2 minutes end-to-end
consistency:    exactly-once (financial data)
retention:      3 years (compliance requirement)
read_pattern:   ad-hoc SQL + ML feature serving
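The checklist above can be pinned down in code so downstream capacity math and component choices all read from one place. A minimal sketch, assuming a hypothetical `PipelineRequirements` dataclass (the field names are illustrative, not from any library):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineRequirements:
    """The four constraints that drive every component choice."""
    peak_events_per_sec: int
    p99_latency_slo_sec: float
    delivery: str            # "at-least-once" | "exactly-once"
    retention_days: int
    read_pattern: str

reqs = PipelineRequirements(
    peak_events_per_sec=500_000,
    p99_latency_slo_sec=120,         # p99 < 2 minutes end-to-end
    delivery="exactly-once",         # financial data
    retention_days=3 * 365,          # 3-year compliance retention
    read_pattern="ad-hoc SQL + ML feature serving",
)
```

Freezing the dataclass makes the requirements immutable, so later design steps can't silently drift from what was agreed.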
2

Estimate Capacity

Back-of-envelope math is required in RFCs and interviews. Calculate throughput in GB/s and storage growth in TB/day. Always show peak headroom.

capacity.py — show your math

events_per_sec  = 500_000
bytes_per_event = 1_024          # 1 KB avg
throughput_GBps = events_per_sec * bytes_per_event / 1e9  # ≈0.5 GB/s
daily_raw_TB    = throughput_GBps * 86_400 / 1_000        # ≈43 TB/day
compressed_TB   = daily_raw_TB * 0.25                     # ≈11 TB/day (4:1 Parquet)
peak_multiplier = 3                                       # plan for 3× average
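The snippet defines `peak_multiplier` but never applies it. A standalone follow-up (same inputs, recomputed from scratch so it runs on its own) that carries the math through peak throughput and total retention:

```python
# Apply the 3× peak headroom and project storage over the retention window.
events_per_sec  = 500_000
bytes_per_event = 1_024
peak_multiplier = 3

avg_GBps      = events_per_sec * bytes_per_event / 1e9  # 0.512 GB/s average
peak_GBps     = avg_GBps * peak_multiplier              # ~1.5 GB/s at peak
daily_raw_TB  = avg_GBps * 86_400 / 1_000               # ~44 TB/day raw
compressed_TB = daily_raw_TB * 0.25                     # ~11 TB/day (4:1 Parquet)
yearly_PB     = compressed_TB * 365 / 1_000             # ~4 PB/year compressed

print(f"peak throughput:  {peak_GBps:.2f} GB/s")
print(f"3-year retention: {yearly_PB * 3:.1f} PB compressed")
```

The 3-year number is the one that matters for the retention requirement in step 1: roughly 12 PB of compressed storage, before replication.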
3

Select the Ingestion Layer

The ingestion layer determines your durability, replayability, and throughput ceiling. Kafka is the default for high-volume streaming; batch via S3 for scheduled workloads.

ingestion layer decision

# Ingestion layer selection matrix
Kafka        high throughput, durable, replayable, multi-consumer
Kinesis      AWS-native, managed, lower ops overhead
Pub/Sub      GCP-native, global delivery guarantees
S3 + Airflow scheduled batch, simple, low cost
Debezium CDC change data capture from databases, row-level events
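The capacity numbers from step 2 also size the ingestion layer itself. A back-of-envelope partition count for Kafka, as a sketch — the ~10 MB/s sustained write per partition is a common operational rule of thumb, not a Kafka guarantee, and real sizing depends on hardware and replication:

```python
import math

# Peak write throughput at 3× headroom, from the step-2 estimates.
peak_MBps          = 500_000 * 1_024 * 3 / 1e6   # ~1536 MB/s
per_partition_MBps = 10                           # rule-of-thumb write target

partitions = math.ceil(peak_MBps / per_partition_MBps)
print(partitions)   # 154
```

Round up to a number with useful divisors (e.g. 160 or 192) so consumer groups can scale evenly.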
4

Select the Processing Layer

Processing determines your latency floor and transformation expressiveness. Match the engine to your latency SLO and team skillset.

processing layer decision

# Processing layer selection matrix
Flink        stateful streaming, sub-minute latency, exactly-once
Spark        large-scale batch, micro-batch, rich ML ecosystem
dbt          SQL transformations in warehouse, medallion layers
Kafka Streams lightweight stateful, no cluster, JVM-native
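The SLO-first rule from the matrix can be sketched as a decision function. `pick_processing` is hypothetical and the thresholds are illustrative defaults, not hard cutoffs:

```python
def pick_processing(latency_slo_sec: float, sql_only: bool) -> str:
    """Map a latency SLO (and team skillset) to a processing engine.
    Thresholds are illustrative: sub-minute → streaming, sub-hour →
    micro-batch, otherwise batch."""
    if latency_slo_sec < 60:
        return "Flink"                        # stateful streaming, exactly-once
    if latency_slo_sec < 3600:
        return "Spark Structured Streaming"   # micro-batch
    return "dbt" if sql_only else "Spark"     # batch; dbt if the team is SQL-first
```

With a 2-minute p99 SLO from step 1, this lands on micro-batch; a sub-minute SLO would force true streaming.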
5

Choose a Table Format

Table format determines schema evolution, time travel, and multi-engine support. Iceberg is the default for new lakehouses; Delta for Databricks-native workflows.

table format decision

# Table format selection matrix
Iceberg      multi-engine, hidden partitioning, vendor-neutral
Delta Lake   Databricks-native, ACID, OPTIMIZE/ZORDER
Hudi         upsert-first, CDC patterns, row-level updates
Parquet only no schema evolution, simple batch, lowest overhead
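The defaults above can be encoded the same way. `pick_table_format` is a hypothetical sketch of the matrix, with Iceberg as the fall-through default for new lakehouses:

```python
def pick_table_format(platform: str, workload: str) -> str:
    """Illustrative encoding of the table-format matrix above."""
    if platform == "databricks":
        return "Delta Lake"        # native OPTIMIZE/ZORDER integration
    if workload == "upsert-heavy":
        return "Hudi"              # row-level updates, CDC patterns
    if workload == "append-only-simple":
        return "Parquet only"      # lowest overhead, no schema evolution
    return "Iceberg"               # vendor-neutral default
```

The point of writing it down is the RFC: each branch is a justification a reviewer can challenge.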
6

Write the RFC

Document the design before building it. A production RFC includes problem statement, architecture diagram, component justifications, capacity estimates, failure modes, and implementation timeline.

RFC structure

# RFC sections (Netflix/Uber format)
1. Problem Statement  — what breaks without this?
2. Scope & Non-Goals  — explicit boundaries
3. Architecture       — 6-layer diagram + component rationale
4. Capacity Estimates — throughput, storage, peak math
5. Failure Modes      — what breaks and how you recover
6. Alternatives       — rejected options with scored tradeoffs
7. Timeline           — milestones and dependencies

When Formal Pipeline Design Matters Most

  • The pipeline will be consumed by multiple teams or external stakeholders
  • Throughput exceeds 10K events/sec or storage grows faster than 1 TB/day
  • You are preparing for a staff or senior DE system design interview
  • The pipeline must meet compliance requirements (retention, lineage, audit trail)

Common Design Issues

No capacity estimates

Designing without numbers leads to undersized Kafka clusters, overpaying for Redshift, or hitting S3 request limits. Always calculate throughput and storage before choosing tools.

Streaming for everything

Flink is operationally expensive. If your SLO is 1-hour latency, a simple Airflow + Spark batch job is correct and far cheaper to maintain. Match the architecture to the SLO.

Missing failure mode analysis

Production pipelines fail. Document what happens when Kafka is unavailable, when a partition skews 10×, or when the upstream schema changes. RFC reviewers will ask.
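One standard answer for the schema-change case is a dead-letter queue: quarantine records that fail validation instead of crashing the pipeline. A minimal sketch — `process`, the `event_id` field, and the list-backed DLQ are all illustrative stand-ins:

```python
import json
from typing import Optional

def process(record: bytes, dlq: list) -> Optional[dict]:
    """Parse and validate one record; route failures to a dead-letter
    queue for replay after the schema fix ships."""
    try:
        event = json.loads(record)
        if "event_id" not in event:       # assumed required field
            raise ValueError("missing event_id")
        return event
    except (ValueError, json.JSONDecodeError):
        dlq.append(record)                # quarantine, don't crash
        return None
```

The RFC's failure-mode section should state where the DLQ lives (e.g. a separate topic or S3 prefix), who is alerted, and how replay works.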

FAQ

How do I design a scalable data pipeline?
Start with requirements (throughput, latency, consistency, retention). Calculate capacity (GB/s, TB/day). Select each layer (ingestion, processing, storage, serving). Document every choice with explicit justification in an RFC.
When should I use batch vs streaming?
Streaming (Kafka + Flink) for SLOs under 5 minutes. Batch (Airflow + Spark/dbt) when 1+ hour latency is acceptable — which covers most analytics. Micro-batch for sub-hour latency when batch SQL is easier to maintain.
What should a pipeline RFC include?
Problem statement, scope boundaries, architecture diagram (all 6 layers), component justifications, capacity estimates, failure modes, alternatives considered, and implementation timeline.
