Data Engineering System Design Explained: What It Is and How It Works
Data engineering system design is the practice of architecting data pipelines and platforms with explicit tradeoffs. It combines a 6-layer architecture model (ingest → store → process → format → serve → observe) with communication artifacts, namely RFCs and ADRs, that make design decisions auditable and transferable. It is the skill that separates senior from staff-level engineers.
Architecture Decision Record (ADR) structure
# adr/001-table-format-selection.md
## Status
Accepted — 2026-01-15
## Context
We need a table format supporting multi-engine access (Trino + Spark),
hidden partitioning, and schema evolution without full rewrites.
## Decision
Use Apache Iceberg over Delta Lake.
## Consequences
+ Vendor-neutral — no Databricks dependency
+ Hidden partitioning eliminates partition misuse bugs
- Less mature Spark MERGE support vs Delta
- Smaller community than Delta Lake in 2024
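An ADR only helps if every record carries the same sections. The following is a minimal sketch, assuming ADRs are markdown files that use the four `##` headings from the example above; the function name `missing_sections` is hypothetical, and a check like this could run in CI or a pre-commit hook.

```python
import re

# Section headings taken from the ADR example above.
REQUIRED_SECTIONS = ("Status", "Context", "Decision", "Consequences")


def missing_sections(adr_text: str) -> list[str]:
    """Return the required sections that the ADR text does not contain."""
    # Collect every level-2 markdown heading ("## Name") in the document.
    found = set(re.findall(r"^##\s+(.+?)\s*$", adr_text, flags=re.MULTILINE))
    return [name for name in REQUIRED_SECTIONS if name not in found]


# Hypothetical draft that forgot its Consequences section.
draft = """# adr/001-table-format-selection.md
## Status
Accepted
## Context
We need multi-engine access.
## Decision
Use Apache Iceberg over Delta Lake.
"""

print(missing_sections(draft))  # ['Consequences']
```

A draft that lists no consequences is usually the one that gets relitigated later, so that is the section worth enforcing first.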
The 6-Layer Architecture Model
Ingest
How data enters the system. Kafka for high-throughput streaming; Kinesis for AWS-native; Debezium CDC for database capture; S3 + Airflow for scheduled batch.
Kafka · Kinesis · Debezium · Pub/Sub
Store
Where raw data lands. S3 or GCS for lake storage, partitioned by date. The raw layer is append-only: never modify it in place. When transformation logic changes, reprocess from the raw data instead.
S3 · GCS · HDFS · ADLS
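The date-partitioned, append-only rule can be sketched in a few lines. This is an illustration under stated assumptions, not a storage client: the key layout `raw/<source>/dt=<date>/<batch>.parquet`, the function `raw_object_key`, and the in-memory `AppendOnlyStore` are all hypothetical stand-ins for a real bucket and its write policy.

```python
from datetime import datetime, timezone


def raw_object_key(source: str, event_time: datetime, batch_id: str) -> str:
    """Build a date-partitioned key for one raw batch (layout is an assumption)."""
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    # Partition by UTC date so every writer agrees on the partition boundary.
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"raw/{source}/dt={day}/{batch_id}.parquet"


class AppendOnlyStore:
    """In-memory stand-in for a bucket that refuses overwrites."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, payload: bytes) -> None:
        # Append-only: an existing raw object is never replaced.
        if key in self._objects:
            raise FileExistsError(f"raw object already exists: {key}")
        self._objects[key] = payload

    def keys(self) -> list[str]:
        return sorted(self._objects)


store = AppendOnlyStore()
key = raw_object_key(
    "orders", datetime(2026, 1, 15, 23, 30, tzinfo=timezone.utc), "batch-0001"
)
store.put(key, b"...")
print(key)  # raw/orders/dt=2026-01-15/batch-0001.parquet
```

Because corrections arrive as new batches rather than overwrites, downstream tables can always be rebuilt from the raw layer.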
Process
How data is transformed. Flink for stateful streaming; Spark for large-scale batch; dbt for SQL transformations in a warehouse or lakehouse.
Flink · Spark · dbt · Kafka Streams
Format
How files are organized on storage. Iceberg for multi-engine support and hidden partitioning; Delta Lake for Databricks-centric stacks; plain Parquet files for simple batch workloads that do not need schema evolution.
Iceberg · Delta Lake · Hudi · Parquet
Serve
How consumers access processed data. Trino or Spark SQL for ad-hoc queries; Redshift or Snowflake for BI; Redis for sub-50ms feature serving; a REST API for application consumers.
Trino · Redshift · Snowflake · Redis
Observe
How you monitor pipeline health and data quality. OpenLineage for lineage; Prometheus + Grafana for metrics; dbt tests for data quality checks; an SLO framework for reliability targets.
OpenLineage · Prometheus · Grafana · dbt tests
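To make "SLO framework" concrete, here is a minimal sketch of a data freshness SLO, assuming freshness is measured as the lag between a dataset's last successful load and the time of each check. The class `FreshnessSLO`, the dataset name, and the 30-minute and 75% figures are illustrative assumptions, not values from any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FreshnessSLO:
    dataset: str
    max_lag: timedelta  # data older than this counts as stale
    target: float       # fraction of checks that must pass


def is_fresh(slo: FreshnessSLO, last_loaded_at: datetime, checked_at: datetime) -> bool:
    """A check passes when the data is no older than the allowed lag."""
    return checked_at - last_loaded_at <= slo.max_lag


def attainment(slo: FreshnessSLO, checks: list[tuple[datetime, datetime]]) -> float:
    """Fraction of (last_loaded_at, checked_at) observations that were fresh."""
    if not checks:
        raise ValueError("need at least one check")
    passed = sum(is_fresh(slo, loaded, checked) for loaded, checked in checks)
    return passed / len(checks)


slo = FreshnessSLO("orders_daily", max_lag=timedelta(minutes=30), target=0.75)
t0 = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
checks = [
    (t0, t0 + timedelta(minutes=10)),                          # fresh
    (t0, t0 + timedelta(minutes=30)),                          # fresh (at the limit)
    (t0, t0 + timedelta(minutes=45)),                          # stale
    (t0 + timedelta(minutes=40), t0 + timedelta(minutes=50)),  # fresh
]
observed = attainment(slo, checks)
print(observed, observed >= slo.target)  # 0.75 True
```

In practice the observations would come from pipeline metadata or a metrics store; the point is that the target and the measurement are both written down.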
Senior vs Staff: What Changes
Senior
- ✓ Designs systems that work reliably
- ✓ Knows which tools to use
- ✓ Can explain design to peers
- ✓ Addresses failure modes when asked
Staff
- ✓ Documents why — RFCs, ADRs, tradeoff matrices
- ✓ Scores alternatives and rejects with reasoning
- ✓ Presents the same design to a VP in 5 minutes and to a principal engineer in 45 minutes
- ✓ Models build vs buy, TCO, team ownership cost
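Scoring alternatives can be as simple as a weighted tradeoff matrix. The sketch below assumes integer weights and 1 to 5 scores; the criteria names, weights, and scores are hypothetical placeholders chosen for illustration, not measurements of Iceberg or Delta Lake.

```python
def rank_options(
    weights: dict[str, int], scores: dict[str, dict[str, int]]
) -> list[tuple[str, int]]:
    """Return (option, weighted total) pairs, highest total first."""
    totals = []
    for option, per_criterion in scores.items():
        missing = set(weights) - set(per_criterion)
        if missing:
            raise ValueError(f"{option} has no score for: {sorted(missing)}")
        total = sum(weights[c] * per_criterion[c] for c in weights)
        totals.append((option, total))
    return sorted(totals, key=lambda item: item[1], reverse=True)


# Placeholder weights (importance) and scores (1 = poor, 5 = strong).
weights = {"multi_engine_access": 3, "schema_evolution": 2, "merge_maturity": 2}
scores = {
    "Iceberg": {"multi_engine_access": 5, "schema_evolution": 4, "merge_maturity": 3},
    "Delta Lake": {"multi_engine_access": 3, "schema_evolution": 4, "merge_maturity": 5},
}

ranking = rank_options(weights, scores)
print(ranking)  # [('Iceberg', 29), ('Delta Lake', 27)]
```

The totals matter less than the fact that the weights are explicit: a reviewer who disagrees can argue about a specific weight instead of the whole decision.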
Common Mistakes
Designing without requirements
Architecture without constraints is just opinion. Always establish throughput, latency, consistency, and retention before selecting any component.
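One way to force requirements onto the page is to encode them and derive a first capacity estimate before naming any tool. This is a rough sketch: the `Requirements` fields mirror the four constraints named above, and the numbers are made-up inputs. The estimate covers uncompressed raw volume only, before replication or compression.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class Requirements:
    events_per_second: int    # throughput
    avg_event_bytes: int      # needed to turn throughput into volume
    max_latency_seconds: int  # end-to-end latency the consumers tolerate
    consistency: str          # e.g. "at-least-once" or "exactly-once"
    retention_days: int       # how long raw data must be kept

    def raw_bytes_per_day(self) -> int:
        return self.events_per_second * self.avg_event_bytes * SECONDS_PER_DAY

    def retained_raw_bytes(self) -> int:
        return self.raw_bytes_per_day() * self.retention_days


# Hypothetical workload: 10k events/s, 1 KB each, 30-minute latency, 30-day retention.
reqs = Requirements(
    events_per_second=10_000,
    avg_event_bytes=1_000,
    max_latency_seconds=1_800,
    consistency="at-least-once",
    retention_days=30,
)

print(reqs.raw_bytes_per_day())   # 864000000000 bytes, about 0.86 TB per day
print(reqs.retained_raw_bytes())  # 25920000000000 bytes, about 25.9 TB retained
```

Even an estimate this coarse narrows the design space: roughly 26 TB of retained raw data and a 30-minute latency budget point toward very different components than a 1-second budget would.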
Tool decisions without ADRs
If there is no ADR for your table format or message broker choice, the next engineer to join will relitigate the decision from scratch. Document the why, not just the what.
Ignoring operational complexity
Flink offers lower latency and richer stateful processing than Spark Structured Streaming, but it is also harder to operate. If your team has no Flink experience and your freshness SLO is 30 minutes, Spark micro-batch is the right choice.
FAQ
- What is data engineering system design?
- The practice of architecting data pipelines and platforms with explicit tradeoffs — selecting the right ingestion, processing, storage, and serving components for a given set of throughput, latency, and cost constraints, documented in RFCs and ADRs.
- What is an RFC in data engineering?
- A design document proposing a significant architectural change. Includes problem statement, architecture diagram, component justifications, capacity estimates, failure modes, alternatives considered, and implementation timeline. Stored in source control as the permanent record of why decisions were made.
- What is an ADR?
- Architecture Decision Record — a short document capturing a single decision: context, decision, consequences (gains and tradeoffs), and status. Stored alongside code so future engineers understand why the system is built the way it is.
- What separates senior from staff-level system design?
- Staff engineers document the why — tradeoff matrices, rejected alternatives, failure mode analysis, cost modeling. They can present the same design to a VP in 5 minutes and a principal engineer in 45 minutes.