What is Data Engineering System Design?
The complete guide — architectures, tradeoffs, interview frameworks, and how to communicate design decisions at the staff level.
Quick Answer
Data engineering system design is the practice of architecting end-to-end data pipelines and platforms — selecting ingestion, processing, storage, and serving components — while making explicit tradeoffs between throughput, latency, cost, and reliability. It is both an engineering discipline and a communication skill: the ability to write RFCs, defend ADRs, and present tradeoff matrices to senior stakeholders.
What is Data Engineering System Design?
System design in data engineering goes beyond picking tools. It requires decomposing a business requirement — "we need real-time fraud detection" or "we need a 200-table analytics warehouse" — into a layered architecture where every component is chosen for explicit reasons, every tradeoff is documented, and every failure mode is addressed. Senior and staff engineers distinguish themselves by how rigorously they make and communicate these decisions.
Junior approach
"We need Kafka and Spark." Tools are named without requirements analysis — no latency or throughput estimates, no failure modes addressed, and no record of why alternatives were rejected.
Staff approach
Starts with requirements: 500K events/sec, p99 latency < 2 minutes, exactly-once guarantees, 3-year retention. Evaluates lambda vs kappa, documents why Flink beats Spark Streaming for this SLO, addresses backfill and disaster recovery, writes an RFC with a risk register.
Before vs. After Learning System Design
Before
- ✗ Design reviews expose gaps in tradeoff reasoning
- ✗ Architecture decisions undocumented — impossible to revisit
- ✗ System design interviews feel unstructured and unpredictable
- ✗ Proposals rejected without clear feedback on what was missing
After
- ✓ RFCs pass review on first submission with scored tradeoff matrices
- ✓ ADRs create an auditable record of every architectural decision
- ✓ System design interviews follow a proven 45-minute framework
- ✓ Technical proposals earn buy-in from principal engineers and VPs
What Data Engineering System Design Covers
🏗
Pipeline Architecture
Design ingestion, processing, and serving layers for petabyte-scale data systems with explicit latency and throughput SLOs.
💾
Storage Layer Selection
Choose between data warehouses, lakehouses, and data lakes based on query patterns, schema evolution needs, and cost targets.
⚡
Streaming vs Batch
Decide between lambda, kappa, and medallion architectures based on latency requirements and operational complexity tolerance.
🛡
Fault Tolerance Design
Architect exactly-once guarantees, idempotent pipelines, and backfill strategies for production reliability.
🔄
Schema Evolution
Design systems that handle upstream schema changes without breaking downstream consumers using contracts and table formats.
📝
Interview Communication
Write RFCs, ADRs, and tradeoff matrices that communicate architectural decisions to staff and principal engineers.
The 6-Layer Data Architecture Model
Every production data system can be decomposed into six layers. A complete system design addresses all six — even if some layers are trivial for the specific problem.
INGEST
STORE
PROCESS
FORMAT
SERVE
OBSERVE
Layer breakdown with tool choices
# 6-layer data architecture — example: real-time analytics platform
Layer 1 — Ingest: Kafka (500K events/sec, 7-day retention)
Layer 2 — Store: S3 (raw, Parquet, partitioned by event_date)
Layer 3 — Process: Flink (stateful, exactly-once, 30s windows)
Layer 4 — Format: Apache Iceberg (time-travel, hidden partitioning)
Layer 5 — Serve: Trino (ad-hoc SQL) + Redis (sub-50ms lookups)
Layer 6 — Observe: OpenLineage + Prometheus + Grafana SLO dashboard
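Layer 3's "30s windows" can be illustrated with a toy tumbling-window aggregation. This is plain single-process Python standing in for Flink's stateful windowing, and the event tuples are invented for the example:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs=30):
    """Group (timestamp_secs, key) events into fixed windows and count per key.
    A single-process sketch of the aggregation Flink would run statefully."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_secs) * window_secs  # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical events: (timestamp in seconds, event type)
events = [(0, "click"), (12, "click"), (29, "view"), (31, "click"), (59, "view")]
print(tumbling_window_counts(events))
# {(0, 'click'): 2, (0, 'view'): 1, (30, 'click'): 1, (30, 'view'): 1}
```

The real system would also need event-time watermarks and checkpointed state to preserve the exactly-once guarantee; this sketch only shows the windowing arithmetic.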
Back-of-envelope capacity estimate (RFC section)
# Throughput and storage estimates — always show your math
events_per_sec = 500_000
bytes_per_event = 1_024                                   # 1 KB average
throughput_GBps = events_per_sec * bytes_per_event / 1e9  # ≈ 0.5 GB/s
daily_TB = throughput_GBps * 86_400 / 1_000               # ≈ 44 TB/day raw
compressed_TB = daily_TB / 4                              # 4:1 Parquet ratio ≈ 11 TB/day
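The same math extends to the 3-year retention requirement from the staff-level example above. A hypothetical continuation of the estimate, assuming the compressed daily volume stays flat (no traffic growth modeled):

```python
# Project compressed storage over the 3-year retention window.
# Assumes a flat ~11 TB/day compressed rate — a deliberate simplification.
compressed_TB_per_day = 11
retention_years = 3
retention_days = retention_years * 365        # 1,095 days
total_PB = compressed_TB_per_day * retention_days / 1_000
print(f"≈ {total_PB:.0f} PB retained over {retention_years} years")
```

A real RFC would layer a growth rate on top of this and compare hot (frequently queried) versus cold (archival tier) storage costs.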
Lambda vs Kappa vs Medallion Architecture
Lambda
Two parallel paths: a batch layer (Spark) for correctness and a speed layer (Flink) for latency, merged at query time. A good fit for complex aggregations that must be exactly right, but operationally expensive — you maintain two codebases.
Kappa
Single streaming path — all processing via Flink or Kafka Streams. Reprocess by replaying from the source log. Simpler to operate when streaming alone meets latency and correctness requirements.
Medallion
Three quality tiers: Bronze (raw), Silver (cleaned), Gold (aggregated). Batch-first, maps directly to dbt layers. The dominant pattern for lakehouses built on Iceberg or Delta Lake.
| Dimension | Lambda | Kappa | Medallion |
|---|---|---|---|
| Primary question | How do I get both correctness and speed? | How do I simplify to one path? | How do I organize quality tiers? |
| Processing paths | Two (batch + speed) | One (streaming only) | One per tier (Bronze→Silver→Gold) |
| Reprocessing | Rerun batch layer | Replay from source log | Rerun dbt from Bronze |
| Latency | Minutes (speed) or hours (batch) | Seconds to minutes | Minutes to hours (batch-first) |
| Best for | Complex event processing needing correctness | High-volume streaming with simple logic | Lakehouse with dbt + Iceberg |
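The qualitative table above can be turned into the kind of scored tradeoff matrix reviewers expect in an RFC. A minimal sketch — the weights and 1-5 ratings below are illustrative placeholders, not a verdict on the architectures:

```python
# Weighted tradeoff matrix: score = sum(weight * rating) per option.
# Weights must sum to 1.0; ratings are 1 (worst) to 5 (best).
criteria = {"latency": 0.4, "operational_simplicity": 0.3, "reprocessing_ease": 0.3}
ratings = {
    "lambda":    {"latency": 4, "operational_simplicity": 2, "reprocessing_ease": 3},
    "kappa":     {"latency": 5, "operational_simplicity": 4, "reprocessing_ease": 4},
    "medallion": {"latency": 2, "operational_simplicity": 4, "reprocessing_ease": 5},
}

scores = {
    option: sum(criteria[c] * r[c] for c in criteria)
    for option, r in ratings.items()
}
for option, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{option}: {score:.1f}")
```

The value of the exercise is less the final number than the argument it forces: each weight is a claim about what the business cares about, and reviewers can challenge weights and ratings independently.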
Common Mistakes
- ✗ Jumping to tools before clarifying requirements — always establish throughput, latency, consistency, and cost constraints first
- ✗ No back-of-envelope estimates — reviewers expect to see storage growth and peak throughput calculations in your RFC
- ✗ Ignoring failure modes — every design must address: what happens if the message broker goes down, if a partition skews, if upstream schema changes?
- ✗ Underdocumented tradeoffs — "we chose Iceberg" is not a decision. "We chose Iceberg over Delta Lake because we need multi-engine support for Trino and Spark with a vendor-neutral table format" is.
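The Iceberg decision above, written out as a minimal ADR. The numbering and exact field names are a generic sketch, not any specific company's template:

```
ADR-007: Table format for the lakehouse
Status: Accepted
Context: We need a table format readable by both Trino and Spark, with
  schema evolution and time-travel, and no single-vendor lock-in.
Decision: Apache Iceberg.
Alternatives considered:
  - Delta Lake — rejected: weaker multi-engine support for our Trino +
    Spark requirement at evaluation time.
  - Plain Hive tables — rejected: no ACID guarantees, no hidden partitioning.
Consequences: Teams take on Iceberg maintenance (compaction, snapshot
  expiry); in exchange we get engine-neutral reads and safe schema evolution.
```

The "Alternatives considered" section is what makes the record auditable: a future engineer can see not just what was chosen, but what the rejection criteria were and whether they still hold.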
Who Should Learn Data Engineering System Design?
Senior Data Engineers
Learn to write RFCs, estimate capacity, and present tradeoff matrices. System design skills are the primary differentiator for senior → staff promotion.
Interview Candidates
Staff and senior DE interviews at top companies are dominated by system design rounds. A structured 45-minute framework separates passing from failing candidates.
Tech Leads / Architects
Build the ADR practice, RFC templates, and postmortem culture that scale a data platform team from 3 to 30 engineers without architectural chaos.
Frequently Asked Questions
What is data engineering system design?
Data engineering system design is the practice of architecting end-to-end data systems — pipelines, storage layers, processing engines, and serving infrastructure — to meet throughput, latency, reliability, and cost requirements. It involves selecting the right tools (Kafka, Spark, Iceberg, dbt), defining data flow patterns (lambda, kappa, medallion), making explicit tradeoffs between consistency and availability, and documenting decisions in RFCs and Architecture Decision Records.
What are the key components of a data engineering system design?
A complete data engineering system design covers six layers: (1) ingestion — how data enters the system (Kafka, Kinesis, CDC); (2) storage — where raw data lands (S3, GCS, HDFS); (3) processing — how data is transformed (Spark, Flink, dbt); (4) table format — how files are organized (Iceberg, Delta Lake, Hudi); (5) serving — how consumers access data (data warehouse, feature store, API); (6) observability — how you monitor quality and SLOs.
What is the difference between lambda and kappa architecture?
Lambda architecture maintains two separate processing paths: a batch layer (Spark) for correctness and a speed layer (Kafka Streams or Flink) for low latency. Results are merged at query time. Kappa architecture eliminates the batch layer entirely — all processing runs through a single streaming system (Flink or Kafka Streams), with reprocessing triggered by replaying from the source log. Lambda is more complex but easier to debug; kappa is simpler to operate when streaming alone can meet latency and correctness requirements.
What is the medallion architecture?
Medallion architecture organizes a data lakehouse into three quality tiers: Bronze (raw ingested data, append-only), Silver (cleaned and joined data, schema enforced), and Gold (aggregated, business-ready models consumed by BI and ML). It provides a clear quality progression, simplifies auditing, and maps naturally to dbt layers (staging, intermediate, marts). Most modern lakehouses built on Iceberg or Delta Lake follow this pattern.
How do I approach a data engineering system design interview?
Follow a structured 45-minute framework: (1) clarify requirements — ask about data volume, latency SLOs, read/write patterns, and consistency requirements; (2) estimate scale — calculate throughput, storage growth, and peak QPS; (3) design the high-level architecture — sketch ingestion, processing, storage, and serving layers; (4) deep-dive on the hardest component — fault tolerance, exactly-once guarantees, or schema evolution; (5) discuss tradeoffs explicitly — why Iceberg over Hive, why Flink over Spark Streaming; (6) address operational concerns — monitoring, backfill, and disaster recovery.
What You Will Build
In the System Design learning path you build the Staff Engineer Playbook — the leadership artifacts that get senior engineers promoted: production-quality RFCs, Architecture Decision Records, and postmortems used at Netflix, Uber, and Google.
- → Complete RFC following Netflix/Uber format with tradeoff matrices
- → Architecture Decision Records (ADRs) for 3 major design choices
- → Back-of-envelope capacity estimates with throughput and storage math
- → Risk register with probability scoring and mitigations
- → Blameless postmortem following Google SRE standards
- → System design interview mock sessions with feedback rubric