How to Design a Data Pipeline
Design a data pipeline in 6 steps: clarify requirements (throughput, latency, consistency, retention) → estimate capacity (GB/s, TB/day) → select ingestion layer → select processing layer → choose table format → write the RFC. Every component choice must be justified against your specific requirements, not picked by intuition.
Clarify Requirements
Never design before you understand the constraints. Four questions determine every component choice: throughput, latency, consistency, and retention.
requirements checklist
# Questions to answer before touching the architecture
throughput: 500K events/sec peak
latency_slo: p99 < 2 minutes end-to-end
consistency: exactly-once (financial data)
retention: 3 years (compliance requirement)
read_pattern: ad-hoc SQL + ML feature serving
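The checklist above can be pinned down as a single record so that every later component choice references one source of truth. A minimal sketch; the field names here are ours, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineRequirements:
    """The four constraints from the checklist, as one immutable record."""
    peak_events_per_sec: int
    latency_slo_sec: int      # end-to-end p99
    delivery: str             # "at-least-once" or "exactly-once"
    retention_days: int

reqs = PipelineRequirements(
    peak_events_per_sec=500_000,
    latency_slo_sec=120,           # p99 < 2 minutes
    delivery="exactly-once",       # financial data
    retention_days=3 * 365,        # compliance
)
```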
Estimate Capacity
Back-of-envelope math is required in RFCs and interviews. Calculate throughput in GB/s and storage growth in TB/day. Always show peak headroom.
capacity.py — show your math
events_per_sec = 500_000
bytes_per_event = 1_024 # 1 KB avg
throughput_GBps = 500_000 * 1_024 / 1e9 # 0.5 GB/s
daily_raw_TB = 0.5 * 86_400 / 1_000 # 43 TB/day
compressed_TB = 43 * 0.25 # 11 TB/day (4:1 Parquet)
peak_multiplier = 3 # plan for 3× average
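The same arithmetic extends to peak throughput and total retained storage. A sketch reusing the figures above; the 4:1 Parquet ratio and the 3x peak multiplier are planning assumptions, not measurements:

```python
events_per_sec = 500_000
bytes_per_event = 1_024

avg_GBps = events_per_sec * bytes_per_event / 1e9        # ~0.51 GB/s average
peak_GBps = avg_GBps * 3                                 # ~1.5 GB/s, size brokers for this
daily_compressed_TB = avg_GBps * 86_400 / 1_000 * 0.25   # ~11 TB/day after 4:1 Parquet
retained_PB = daily_compressed_TB * 365 * 3 / 1_000      # ~12 PB over the 3-year retention

print(f"peak {peak_GBps:.2f} GB/s, retained {retained_PB:.1f} PB")
```

Note what dominates: compute is sized by the peak GB/s, but cost is sized by the multi-petabyte retention tail.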
Select the Ingestion Layer
The ingestion layer determines your durability, replayability, and throughput ceiling. Kafka is the default for high-volume streaming; scheduled batch workloads can land as files in S3 instead.
ingestion layer decision
# Ingestion layer selection matrix
Kafka → high throughput, durable, replayable, multi-consumer
Kinesis → AWS-native, managed, lower ops overhead
Pub/Sub → GCP-native, global delivery guarantees
S3 + Airflow → scheduled batch, simple, low cost
Debezium CDC → database change capture, row-level events
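A back-of-envelope that often accompanies the Kafka choice is the topic partition count, derived from peak throughput. A sketch; the 10 MB/s per-partition figure is an assumption you should replace with a benchmark from your own cluster:

```python
import math

def kafka_partition_count(peak_mbps: float,
                          per_partition_mbps: float = 10.0,
                          headroom: float = 1.5) -> int:
    """Partitions needed to absorb peak load with headroom.

    Sustained per-partition throughput varies with message size,
    replication factor, and disk type, so treat the default as a
    placeholder, not a Kafka guarantee.
    """
    return math.ceil(peak_mbps * headroom / per_partition_mbps)

# 1.5 GB/s peak from the capacity step
print(kafka_partition_count(peak_mbps=1_500))  # 225
```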
Select the Processing Layer
Processing determines your latency floor and transformation expressiveness. Match the engine to your latency SLO and team skillset.
processing layer decision
# Processing layer selection matrix
Flink → stateful streaming, sub-minute latency, exactly-once
Spark → large-scale batch, micro-batch, rich ML ecosystem
dbt → SQL transformations in warehouse, medallion layers
Kafka Streams → lightweight stateful, no cluster, JVM-native
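To make "stateful streaming" concrete, here is a toy tumbling-window count in plain Python. It only illustrates the windowing semantics; real engines such as Flink or Kafka Streams add checkpointed state, watermarks for late data, and exactly-once output, none of which are modeled here:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_sec=60):
    """Count events per key per tumbling window.

    events: iterable of (epoch_seconds, key) pairs.
    Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_sec) * window_sec
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "a"), (30, "a"), (59, "b"), (60, "a"), (125, "b")]
windows = tumbling_window_counts(events)
print(windows)  # {(0, 'a'): 2, (0, 'b'): 1, (60, 'a'): 1, (120, 'b'): 1}
```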
Choose a Table Format
Table format determines schema evolution, time travel, and multi-engine support. Iceberg is the default for new lakehouses; Delta for Databricks-native workflows.
table format decision
# Table format selection matrix
Iceberg → multi-engine, hidden partitioning, vendor-neutral
Delta Lake → Databricks-native, ACID, OPTIMIZE/ZORDER
Hudi → upsert-first, CDC patterns, row-level updates
Parquet only → no schema evolution, simple batch, lowest overhead
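The matrix collapses into a first-pass heuristic. Treat it as a tiebreaker order, not a rule: all three lakehouse formats support upserts and schema evolution to some degree, and catalog and engine support usually decide in practice:

```python
def pick_table_format(upsert_heavy: bool,
                      databricks_native: bool,
                      needs_schema_evolution: bool) -> str:
    """First-pass mapping from workload traits to the matrix above."""
    if upsert_heavy:
        return "Hudi"
    if databricks_native:
        return "Delta Lake"
    if needs_schema_evolution:
        return "Iceberg"
    return "Parquet only"

print(pick_table_format(False, False, True))   # Iceberg
print(pick_table_format(False, False, False))  # Parquet only
```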
Write the RFC
Document the design before building it. A production RFC includes problem statement, architecture diagram, component justifications, capacity estimates, failure modes, and implementation timeline.
RFC structure
# RFC sections (Netflix/Uber format)
1. Problem Statement — what breaks without this?
2. Scope & Non-Goals — explicit boundaries
3. Architecture — 6-layer diagram + component rationale
4. Capacity Estimates — throughput, storage, peak math
5. Failure Modes — what breaks and how you recover
6. Alternatives — rejected options with scored tradeoffs
7. Timeline — milestones and dependencies
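Section 6 lands better with scored tradeoffs than with prose. A minimal weighted-scoring sketch; the criteria, weights, and 1-5 scores below are illustrative, so calibrate them with your reviewers:

```python
def score_alternatives(options, weights):
    """Weighted sum of per-criterion scores for each alternative."""
    return {
        name: sum(weights[c] * score for c, score in scores.items())
        for name, scores in options.items()
    }

weights = {"latency": 0.4, "ops_cost": 0.3, "team_fit": 0.3}
options = {
    "Flink":       {"latency": 5, "ops_cost": 2, "team_fit": 2},
    "Spark batch": {"latency": 2, "ops_cost": 4, "team_fit": 5},
}
scores = score_alternatives(options, weights)
print(scores)
```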
When Formal Pipeline Design Matters Most
- The pipeline will be consumed by multiple teams or external stakeholders
- Throughput exceeds 10K events/sec or storage grows faster than 1 TB/day
- You are preparing for a staff or senior DE system design interview
- The pipeline must meet compliance requirements (retention, lineage, audit trail)
Common Design Issues
No capacity estimates
Designing without numbers leads to undersized Kafka clusters, overpaying for Redshift, or hitting S3 request limits. Always calculate throughput and storage before choosing tools.
Streaming for everything
Flink is operationally expensive. If your SLO is 1-hour latency, a simple Airflow + Spark batch job is correct and far cheaper to maintain. Match the architecture to the SLO.
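That matching step can be written down as a first cut. The thresholds below follow the rule of thumb used in this guide and are guidelines, not hard cutoffs:

```python
def pick_processing(slo_seconds: int) -> str:
    """Map an end-to-end latency SLO to a processing architecture."""
    if slo_seconds < 5 * 60:
        return "streaming (Kafka + Flink)"
    if slo_seconds < 60 * 60:
        return "micro-batch (Spark Structured Streaming)"
    return "batch (Airflow + Spark/dbt)"

print(pick_processing(2 * 60))    # streaming (Kafka + Flink)
print(pick_processing(4 * 3600))  # batch (Airflow + Spark/dbt)
```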
Missing failure mode analysis
Production pipelines fail. Document what happens when Kafka is unavailable, when a partition skews 10×, or when the upstream schema changes. RFC reviewers will ask.
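When documenting recovery for "Kafka is unavailable", a common policy to specify is retry with exponential backoff and full jitter. A sketch of the delay schedule; the base, cap, and attempt count are illustrative:

```python
import random

def backoff_delays(attempts=5, base=1.0, cap=60.0, seed=None):
    """Exponential backoff with full jitter: attempt i sleeps a random
    amount in [0, min(cap, base * 2**i)] before retrying."""
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]

delays = backoff_delays(seed=7)
print([round(d, 2) for d in delays])
```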
FAQ
- How do I design a scalable data pipeline?
- Start with requirements (throughput, latency, consistency, retention). Calculate capacity (GB/s, TB/day). Select each layer (ingestion, processing, storage, serving). Document every choice with explicit justification in an RFC.
- When should I use batch vs streaming?
- Streaming (Kafka + Flink) for SLOs under 5 minutes. Batch (Airflow + Spark/dbt) when 1+ hour latency is acceptable — which covers most analytics. Micro-batch for sub-hour latency when batch SQL is easier to maintain.
- What should a pipeline RFC include?
- Problem statement, scope boundaries, architecture diagram (all 6 layers), component justifications, capacity estimates, failure modes, alternatives considered, and implementation timeline.