What is Data Engineering System Design?
The complete guide — architectures, tradeoffs, interview frameworks, and how to communicate design decisions at the staff level.
Quick Answer
Data engineering system design is the practice of architecting end-to-end data pipelines and platforms — selecting ingestion, processing, storage, and serving components — while making explicit tradeoffs between throughput, latency, cost, and reliability. It is both an engineering discipline and a communication skill: the ability to write RFCs, defend ADRs, and present tradeoff matrices to senior stakeholders.
What is Data Engineering System Design?
System design in data engineering goes beyond picking tools. It requires decomposing a business requirement — "we need real-time fraud detection" or "we need a 200-table analytics warehouse" — into a layered architecture where every component is chosen for explicit reasons, every tradeoff is documented, and every failure mode is addressed. Senior and staff engineers distinguish themselves by how rigorously they make and communicate these decisions.
Junior approach
"We need Kafka and Spark." Tools are named without requirements analysis — no latency or throughput estimates, no failure modes addressed, and no record of why alternatives were rejected.
Staff approach
Starts with requirements: 500K events/sec, p99 latency < 2 minutes, exactly-once guarantees, 3-year retention. Evaluates lambda vs kappa, documents why Flink beats Spark Streaming for this SLO, addresses backfill and disaster recovery, writes an RFC with a risk register.
Before vs. After Learning System Design
Before
- ✗ Design reviews expose gaps in tradeoff reasoning
- ✗ Architecture decisions undocumented — impossible to revisit
- ✗ System design interviews feel unstructured and unpredictable
- ✗ Proposals rejected without clear feedback on what was missing
After
- ✓ RFCs pass review on first submission with scored tradeoff matrices
- ✓ ADRs create an auditable record of every architectural decision
- ✓ System design interviews follow a proven 45-minute framework
- ✓ Technical proposals earn buy-in from principal engineers and VPs
What Data Engineering System Design Covers
🏗
Pipeline Architecture
Design ingestion, processing, and serving layers for petabyte-scale data systems with explicit latency and throughput SLOs.
💾
Storage Layer Selection
Choose between data warehouses, lakehouses, and data lakes based on query patterns, schema evolution needs, and cost targets.
⚡
Streaming vs Batch
Decide between lambda, kappa, and medallion architectures based on latency requirements and operational complexity tolerance.
🛡
Fault Tolerance Design
Architect exactly-once guarantees, idempotent pipelines, and backfill strategies for production reliability.
🔄
Schema Evolution
Design systems that handle upstream schema changes without breaking downstream consumers using contracts and table formats.
📝
Interview Communication
Write RFCs, ADRs, and tradeoff matrices that communicate architectural decisions to staff and principal engineers.
The 6-Layer Data Architecture Model
Every production data system can be decomposed into six layers. A complete system design addresses all six — even if some layers are trivial for the specific problem.
INGEST
STORE
PROCESS
FORMAT
SERVE
OBSERVE
Layer breakdown with tool choices
# 6-layer data architecture — example: real-time analytics platform
Layer 1 — Ingest: Kafka (500K events/sec, 7-day retention)
Layer 2 — Store: S3 (raw, Parquet, partitioned by event_date)
Layer 3 — Process: Flink (stateful, exactly-once, 30s windows)
Layer 4 — Format: Apache Iceberg (time-travel, hidden partitioning)
Layer 5 — Serve: Trino (ad-hoc SQL) + Redis (sub-50ms lookups)
Layer 6 — Observe: OpenLineage + Prometheus + Grafana SLO dashboard
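Layer 3's "30s windows" can be illustrated with a toy tumbling-window aggregation. This is plain single-process Python standing in for Flink's stateful windowing, and the event tuples are invented for the example:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs=30):
    """Group (timestamp_secs, key) events into fixed windows and count per key.
    A single-process sketch of the aggregation Flink would run statefully."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_secs) * window_secs  # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical events: (timestamp in seconds, event type)
events = [(0, "click"), (12, "click"), (29, "view"), (31, "click"), (59, "view")]
print(tumbling_window_counts(events))
# {(0, 'click'): 2, (0, 'view'): 1, (30, 'click'): 1, (30, 'view'): 1}
```

The real system would also need event-time watermarks and checkpointed state to preserve the exactly-once guarantee; this sketch only shows the windowing arithmetic.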
Back-of-envelope capacity estimate (RFC section)
# Throughput and storage estimates — always show your math
events_per_sec = 500_000
bytes_per_event = 1_024                                   # 1 KB average
throughput_GBps = events_per_sec * bytes_per_event / 1e9  # ≈ 0.5 GB/s
daily_TB = throughput_GBps * 86_400 / 1_000               # ≈ 44 TB/day raw
compressed_TB = daily_TB / 4                              # 4:1 Parquet ratio ≈ 11 TB/day
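The same math extends to the 3-year retention requirement from the staff-level example above. A hypothetical continuation of the estimate, assuming the compressed daily volume stays flat (no traffic growth modeled):

```python
# Project compressed storage over the 3-year retention window.
# Assumes a flat ~11 TB/day compressed rate — a deliberate simplification.
compressed_TB_per_day = 11
retention_years = 3
retention_days = retention_years * 365        # 1,095 days
total_PB = compressed_TB_per_day * retention_days / 1_000
print(f"≈ {total_PB:.0f} PB retained over {retention_years} years")
```

A real RFC would layer a growth rate on top of this and compare hot (frequently queried) versus cold (archival tier) storage costs.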
Lambda vs Kappa vs Medallion Architecture
Lambda
Two parallel paths: a batch layer (Spark) for correctness and a speed layer (Flink) for latency, merged at query time. A good fit for complex aggregations that must be exactly right, but operationally expensive — you maintain two codebases.
Kappa
Single streaming path — all processing via Flink or Kafka Streams. Reprocess by replaying from the source log. Simpler to operate when streaming alone meets latency and correctness requirements.
Medallion
Three quality tiers: Bronze (raw), Silver (cleaned), Gold (aggregated). Batch-first, maps directly to dbt layers. The dominant pattern for lakehouses built on Iceberg or Delta Lake.
| Dimension | Lambda | Kappa | Medallion |
|---|---|---|---|
| Primary question | How do I get both correctness and speed? | How do I simplify to one path? | How do I organize quality tiers? |
| Processing paths | Two (batch + speed) | One (streaming only) | One per tier (Bronze→Silver→Gold) |
| Reprocessing | Rerun batch layer | Replay from source log | Rerun dbt from Bronze |
| Latency | Minutes (speed) or hours (batch) | Seconds to minutes | Minutes to hours (batch-first) |
| Best for | Complex event processing needing correctness | High-volume streaming with simple logic | Lakehouse with dbt + Iceberg |
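The qualitative table above can be turned into the kind of scored tradeoff matrix reviewers expect in an RFC. A minimal sketch — the weights and 1-5 ratings below are illustrative placeholders, not a verdict on the architectures:

```python
# Weighted tradeoff matrix: score = sum(weight * rating) per option.
# Weights must sum to 1.0; ratings are 1 (worst) to 5 (best).
criteria = {"latency": 0.4, "operational_simplicity": 0.3, "reprocessing_ease": 0.3}
ratings = {
    "lambda":    {"latency": 4, "operational_simplicity": 2, "reprocessing_ease": 3},
    "kappa":     {"latency": 5, "operational_simplicity": 4, "reprocessing_ease": 4},
    "medallion": {"latency": 2, "operational_simplicity": 4, "reprocessing_ease": 5},
}

scores = {
    option: sum(criteria[c] * r[c] for c in criteria)
    for option, r in ratings.items()
}
for option, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{option}: {score:.1f}")
```

The value of the exercise is less the final number than the argument it forces: each weight is a claim about what the business cares about, and reviewers can challenge weights and ratings independently.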
Common Mistakes
- ✗ Jumping to tools before clarifying requirements — always establish throughput, latency, consistency, and cost constraints first
- ✗ No back-of-envelope estimates — reviewers expect to see storage growth and peak throughput calculations in your RFC
- ✗ Ignoring failure modes — every design must address: what happens if the message broker goes down, if a partition skews, if upstream schema changes?
- ✗ Underdocumented tradeoffs — "we chose Iceberg" is not a decision. "We chose Iceberg over Delta Lake because we need multi-engine support for Trino and Spark with a vendor-neutral table format" is.
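The Iceberg decision above, written out as a minimal ADR. The numbering and exact field names are a generic sketch, not any specific company's template:

```
ADR-007: Table format for the lakehouse
Status: Accepted
Context: We need a table format readable by both Trino and Spark, with
  schema evolution and time-travel, and no single-vendor lock-in.
Decision: Apache Iceberg.
Alternatives considered:
  - Delta Lake — rejected: weaker multi-engine support for our Trino +
    Spark requirement at evaluation time.
  - Plain Hive tables — rejected: no ACID guarantees, no hidden partitioning.
Consequences: Teams take on Iceberg maintenance (compaction, snapshot
  expiry); in exchange we get engine-neutral reads and safe schema evolution.
```

The "Alternatives considered" section is what makes the record auditable: a future engineer can see not just what was chosen, but what the rejection criteria were and whether they still hold.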
Who Should Learn Data Engineering System Design?
Senior Data Engineers
Learn to write RFCs, estimate capacity, and present tradeoff matrices. System design skills are the primary differentiator for senior → staff promotion.
Interview Candidates
Staff and senior DE interviews at top companies are dominated by system design rounds. A structured 45-minute framework separates passing from failing candidates.
Tech Leads / Architects
Build the ADR practice, RFC templates, and postmortem culture that scale a data platform team from 3 to 30 engineers without architectural chaos.
Frequently Asked Questions
What is data engineering system design?
Data engineering system design is the practice of architecting end-to-end data systems — pipelines, storage layers, processing engines, and serving infrastructure — to meet throughput, latency, reliability, and cost requirements. It involves selecting the right tools (Kafka, Spark, Iceberg, dbt), defining data flow patterns (lambda, kappa, medallion), making explicit tradeoffs between consistency and availability, and documenting decisions in RFCs and Architecture Decision Records.
What are the key components of a data engineering system design?
A complete data engineering system design covers six layers: (1) ingestion — how data enters the system (Kafka, Kinesis, CDC); (2) storage — where raw data lands (S3, GCS, HDFS); (3) processing — how data is transformed (Spark, Flink, dbt); (4) table format — how files are organized (Iceberg, Delta Lake, Hudi); (5) serving — how consumers access data (data warehouse, feature store, API); (6) observability — how you monitor quality and SLOs.
What is the difference between lambda and kappa architecture?
Lambda architecture maintains two separate processing paths: a batch layer (Spark) for correctness and a speed layer (Kafka Streams or Flink) for low latency. Results are merged at query time. Kappa architecture eliminates the batch layer entirely — all processing runs through a single streaming system (Flink or Kafka Streams), with reprocessing triggered by replaying from the source log. Lambda is more complex but easier to debug; kappa is simpler to operate when streaming alone can meet latency and correctness requirements.
What is the medallion architecture?
Medallion architecture organizes a data lakehouse into three quality tiers: Bronze (raw ingested data, append-only), Silver (cleaned and joined data, schema enforced), and Gold (aggregated, business-ready models consumed by BI and ML). It provides a clear quality progression, simplifies auditing, and maps naturally to dbt layers (staging, intermediate, marts). Most modern lakehouses built on Iceberg or Delta Lake follow this pattern.
How do I approach a data engineering system design interview?
Follow a structured 45-minute framework: (1) clarify requirements — ask about data volume, latency SLOs, read/write patterns, and consistency requirements; (2) estimate scale — calculate throughput, storage growth, and peak QPS; (3) design the high-level architecture — sketch ingestion, processing, storage, and serving layers; (4) deep-dive on the hardest component — fault tolerance, exactly-once guarantees, or schema evolution; (5) discuss tradeoffs explicitly — why Iceberg over Hive, why Flink over Spark Streaming; (6) address operational concerns — monitoring, backfill, and disaster recovery.
What You Will Build
In the System Design learning path you build the Staff Engineer Playbook — the leadership artifacts that get senior engineers promoted: production-quality RFCs, Architecture Decision Records, and postmortems used at Netflix, Uber, and Google.
- → Complete RFC following Netflix/Uber format with tradeoff matrices
- → Architecture Decision Records (ADRs) for 3 major design choices
- → Back-of-envelope capacity estimates with throughput and storage math
- → Risk register with probability scoring and mitigations
- → Blameless postmortem following Google SRE standards
- → System design interview mock sessions with feedback rubric