# The Data Engineering Interview Cheatsheet

> 47 system-design questions with senior+ framings, organized by stage. Use these to prep before an on-site, to write your own portfolio walkthroughs, or to defend an architecture in a review.

**Author:** AI-DE Team
**Last updated:** 2026-05-15
**License:** Free to share with attribution. https://ai-de.net

---

## How to use this

Each question lists **what to clarify first** (the questions you ask the interviewer before you draw boxes), **the framing** (the architectural shape you'd defend), and **the gotchas** (the senior+ test — what less-experienced candidates miss).

The single most important behavior in any of these: **clarify before you draw**. Senior interviews aren't testing whether you remember Kafka — they're testing whether you can scope a problem without making assumptions.

---

## I. Batch + warehouse (12 questions)

### 1. Design a daily e-commerce sales reporting pipeline

- **Clarify:** What's the source — Postgres replicas, app event stream, raw S3 dumps? What's the freshness SLA — 24h, 1h, 15min? What's the downstream consumer — BI tool, finance API, ML training set?
- **Framing:** Sources → staging (raw ingest) → silver (cleaned + typed) → gold (dimensional marts) using dbt + a warehouse (Snowflake / BigQuery / Redshift). Orchestrate with Airflow.
- **Gotcha:** Most candidates jump to "I'd use dbt." Seniors define the **grain of each table first** (one row per order? per order line? per order-day?) and call out late-arriving data handling explicitly.

### 2. Design a dimensional model for a multi-tenant SaaS

- **Clarify:** Tenant count — 5 enterprise or 5,000 mid-market? Row-level isolation required? Per-tenant retention?
- **Framing:** Shared dim_user, dim_account, fact_event with a tenant_id column. Postgres RLS or warehouse row-access policies at the gold layer.
- **Gotcha:** "I'd just have a tenant_id column" is the wrong answer for regulated tenants. Senior framing distinguishes **logical separation** (column-level) from **physical separation** (schema-per-tenant) and names the threshold where each makes sense.

### 3. Handle late-arriving data in a daily aggregation

- **Clarify:** How late can data arrive — minutes, hours, days? Is the source replayable? Are downstream tables append-only or upserted?
- **Framing:** Watermark-based incremental models (dbt `is_incremental()` + `merge` strategy), with a 7-day reprocessing window for stragglers.
- **Gotcha:** Recompute-all-daily is the lazy answer. Senior framing trades **idempotency** against **incremental complexity** and shows the math on cost (full table rebuild = O(N), incremental = O(delta)).

### 4. Migrate a 10-year-old Postgres warehouse to Snowflake

- **Clarify:** What's the cutover constraint — zero downtime, or scheduled maintenance window? Who else reads the existing tables? What's the FinOps budget?
- **Framing:** Phased dual-write: stream CDC from Postgres → Kafka → Snowflake; backfill historic via Spark + S3; cut consumers over by domain (orders first, then users, then events); decommission Postgres tables after a verification period.
- **Gotcha:** Junior answer is "lift and shift." Senior framing names the **verification strategy** (row counts, checksum sampling, side-by-side dashboards) and the rollback plan.

### 5. Design SCD Type 2 for customer dimensions

- **Clarify:** How often do attributes change? Which attributes are tracked vs. ignored? Is history queried often or rarely?
- **Framing:** `dim_customer` with `valid_from`, `valid_to`, `is_current`, surrogate key `customer_dim_id`. Fact tables FK to the dim by surrogate key, so historical fact rows always join to the historical dim version.
- **Gotcha:** Most candidates remember Type 2 mechanics but skip the **performance implication** — every fact join now hits a versioned dim. Senior framing names when Type 1 is enough and when Type 6 (Type 1 + Type 2 hybrid) wins.

### 6. Design a metrics layer (semantic model) for a product team

- **Framing:** dbt + MetricFlow (or Cube, LookML). One semantic-layer definition per metric (`active_users`, `revenue_30d`) backed by gold-layer fact tables, exposed to BI via dbt exposures.
- **Gotcha:** Seniors emphasize **metric ownership** and the social problem: who's allowed to define a new metric? Without governance, the layer fragments within 6 months.

### 7. Implement incremental dbt models for a 500GB fact table

- **Framing:** Incremental strategy = `merge` for late-arriving updates, `append` for immutable events, `delete+insert` for partition-level reprocessing. Combine with a watermark column + the dbt `is_incremental()` macro.
- **Gotcha:** Senior framing knows when **full refresh** is actually cheaper (small tables, schema changes, dev/staging) and doesn't over-engineer.

### 8. Design data quality tests at warehouse-scale

- **Framing:** Three layers — schema tests (dbt `not_null`, `unique`, `accepted_values`), value tests (dbt-utils + dbt-expectations), and contract tests (ODCS-style with CI gating).
- **Gotcha:** Senior framing names what to do when a test **fails in production** at 3am: fail loud (PagerDuty), fail closed (block downstream), or fail open (warn + continue)? Each has tradeoffs.

### 9. Optimize a warehouse query that scans 900GB

- **Framing:** Partitioning + clustering on the highest-cardinality filter column. Re-architect to read 9GB instead of 900GB. Verify with EXPLAIN ANALYZE.
- **Gotcha:** "Add more compute" is the wrong answer. Senior framing always reduces **scan size first** and only adds compute if scan is already minimal.

### 10. Reverse ETL — sync warehouse → Salesforce / HubSpot

- **Framing:** Hightouch / Census for managed, or a custom worker that reads from the warehouse, applies a mapping config, and posts to the target API with idempotency keys.
- **Gotcha:** Senior framing handles **rate limits** and **partial-failure recovery** — what happens when Salesforce returns 429 for 200 of 1,000 rows? Resume from a checkpoint, don't restart.

### 11. Design a multi-region warehouse for compliance

- **Framing:** Per-region warehouse instances + a global metadata layer. Replicate dim tables, keep fact tables regional. Federated queries via Snowflake replication, BigQuery Omni, or a service mesh.
- **Gotcha:** GDPR + EU data residency means **cross-region joins are the regression**. Senior framing names which dashboards can run cross-region and which must stay regional.

### 12. Cost-optimize a $300K/year Snowflake bill

- **Framing:** Triage = warehouse size right-sizing + auto-suspend + query patterns + storage tier. Build a per-warehouse cost dashboard, attribute spend to teams, set budgets.
- **Gotcha:** Senior framing knows that **the top 10 queries usually consume 80% of the bill**. Don't optimize uniformly — find the hot queries first.

---

## II. Streaming + event-driven (9 questions)

### 13. Design a real-time fraud detection pipeline

- **Clarify:** Decision latency — 100ms, 1s, 10s? False-positive cost vs. false-negative cost? Model retraining cadence?
- **Framing:** Kafka ingest → Flink (keyed state on user_id) → feature store → model service → block/allow decision. Detection windows on event-time with watermarks.
- **Gotcha:** Senior framing handles **late events** — what if a fraud signal arrives 30s after the transaction? Either replay or accept the loss; defend the choice.

### 14. Design exactly-once Kafka → warehouse delivery

- **Framing:** Kafka Connect with idempotent sinks, OR Flink + two-phase commit, OR Kafka Streams with transactional producers.
- **Gotcha:** "At-least-once + dedup" is often the right answer, not exactly-once. Senior framing explains the **performance cost** of true exactly-once and recommends the simpler pattern when the data model tolerates duplicates.

### 15. Handle schema evolution in a Kafka pipeline

- **Framing:** Confluent Schema Registry with backward-compatible evolution mode. Producers validate at write-time, consumers fail gracefully on unknown fields.
- **Gotcha:** Senior framing distinguishes **backward**, **forward**, and **full** compatibility and recommends which to use based on producer-vs-consumer deployment ordering.

### 16. Design watermarks for an event-time aggregation

- **Framing:** Watermark = max(event_time) − allowed_lateness. Bounded out-of-orderness assumption. Late events go to a DLQ or a side output.
- **Gotcha:** Senior framing names the tradeoff: tight watermarks = lower latency but more late events; loose watermarks = higher latency but fewer drops.

### 17. Design a Kafka topic partitioning strategy

- **Framing:** Partition key = the dimension you want ordered (user_id for per-user ordering, region for geo-routing). Partition count = peak throughput / per-partition throughput (~10MB/s).
- **Gotcha:** Senior framing handles **rebalancing pain** — what happens when you add partitions on a live topic? Existing keys remap; downstream consumers may see ordering anomalies.

### 18. Build a CDC pipeline (Postgres → warehouse)

- **Framing:** Debezium → Kafka → Flink / dbt incremental → warehouse. Capture inserts, updates, deletes; convert to upserts at the sink.
- **Gotcha:** Senior framing handles **snapshot vs. streaming cutover** — how do you bootstrap with 10 years of history without missing events during the snapshot phase?

### 19. Design a streaming join (Kafka stream × Kafka stream)

- **Framing:** Co-partitioned topics on the join key, window-bounded join in Flink or Kafka Streams, persistent state in RocksDB.
- **Gotcha:** Senior framing accounts for **state size** — joining two streams with multi-day windows can blow up state. Spill to remote storage or shorten the window.

### 20. Build a real-time leaderboard at 1M events/sec

- **Framing:** Kafka ingest → Flink keyed state with TopK approximation → Redis sorted set for low-latency reads. Eventual consistency between Flink state and Redis cache.
- **Gotcha:** Senior framing knows that **exact TopK at 1M events/sec is impossible** without sharding. Approximation algorithms (Count-Min Sketch + heap) trade precision for throughput.

### 21. Migrate from Kafka to Pulsar (or vice versa)

- **Framing:** Dual-write phase (producers send to both), gradual consumer cutover, then producer cutover, then decommission. Schema registry compatibility throughout.
- **Gotcha:** Senior framing names the **per-partition ordering guarantees** that differ between platforms and which use cases are affected.

---

## III. Lakehouse + storage (7 questions)

### 22. Design an Iceberg lakehouse

- **Framing:** Catalog (REST / Nessie / Glue) → metadata layer → manifest files → Parquet data files. Multi-engine reads (Spark, Trino, Flink, Snowflake).
- **Gotcha:** Senior framing names the **REST vs. Hive vs. Glue vs. Nessie catalog tradeoffs** and which is right for which team size + compliance posture.

### 23. Handle schema evolution in a lakehouse

- **Framing:** Iceberg supports add/drop/rename/reorder by column ID. Most engines preserve historical schema for time-travel reads.
- **Gotcha:** Senior framing knows that **rename** isn't truly free — old engines reading with positional schema break.

### 24. Optimize a lakehouse with many small files

- **Framing:** Run `rewrite_data_files` regularly (Iceberg) or `OPTIMIZE` (Delta) to compact. Tune target file size to 128–256MB for OLAP workloads.
- **Gotcha:** Senior framing names the **read-vs-write tradeoff** — compaction kills write throughput temporarily. Schedule it during off-peak.

### 25. Migrate Hive tables → Iceberg

- **Framing:** Three options — `add_files` (zero data movement, catalog rewrite only), `migrate` (in-place, preserves files), or shadow tables (dual-write + cutover). Choose based on table size + downtime tolerance.
- **Gotcha:** Senior framing handles **partition spec changes** — Hive's partition columns become Iceberg's partition transforms, with subtle semantic differences.

### 26. Compare Iceberg vs. Delta vs. Hudi

- **Framing:** Iceberg = open catalog, multi-engine, mature compaction. Delta = Databricks-first, strongest SQL story. Hudi = best for streaming upserts.
- **Gotcha:** "I'd use Delta because Databricks" is a partial answer. Senior framing names the team's **engine fleet** (Spark only? Flink + Trino?) and the **catalog strategy** as the deciding factors.

### 27. Design partition + clustering for a 10TB fact table

- **Framing:** Partition by date (or higher-granularity if reads are bounded). Cluster by the most common filter column (customer_id, region). Validate with EXPLAIN ANALYZE.
- **Gotcha:** Senior framing knows that **partition cardinality matters** — 1M partitions kills metadata performance. Cap at ~10K.

### 28. Design a multi-engine lakehouse (Spark + Trino + Flink)

- **Framing:** Iceberg as the open table format, REST catalog (or Nessie) as the catalog. Validate that all three engines see the same row counts on the same table version.
- **Gotcha:** Senior framing names the **engine-specific edge cases** — Flink's CDC equality deletes need engine support; Trino's predicate pushdown depends on column statistics.

---

## IV. AI + ML systems (10 questions)

### 29. Design a production RAG system

- **Framing:** Ingest → chunking strategy → embedding model → vector store → retrieval (BM25 + dense + RRF) → reranker → LLM → response. Evaluation via RAGAS or recall@k.
- **Gotcha:** Senior framing benchmarks **multiple chunking strategies** (e.g., fixed, semantic, contextual) and names recall@k numbers as the production gate, not vibes.

### 30. Choose between pgvector / Pinecone / Weaviate / Qdrant

- **Framing:** pgvector if data volume < 10M vectors and you already run Postgres. Managed (Pinecone) if ops budget is zero. Self-hosted (Qdrant / Weaviate) if data sovereignty matters.
- **Gotcha:** Senior framing names the **HNSW vs. IVF** tuning tradeoff and the impact on recall + latency + memory.

### 31. Design an LLM evaluation framework

- **Framing:** Eval datasets (golden + adversarial), LLM-as-judge cascade (cheap model first, expensive model on disagreement), regression gates in CI, drift detection in production.
- **Gotcha:** Senior framing names the **judge-bias problem** — single-judge cascades produce systematic bias. Multi-judge agreement + variance reduces it.

### 32. Build a feature store for online + offline serving

- **Framing:** Feast / Tecton / homegrown. Offline store (warehouse / lakehouse) backs training. Online store (Redis / DynamoDB) backs inference. Both materialized from the same feature definitions.
- **Gotcha:** Senior framing names the **training-serving skew** problem and the materialization pattern that prevents it (write once, materialize to both stores via a single pipeline).

### 33. Design a model deployment pipeline

- **Framing:** Train → register (MLflow) → eval gate (recall@k, F1, business KPI) → canary deploy (BentoML / KServe) → progressive rollout → rollback on regression.
- **Gotcha:** Senior framing knows that **the rollback is the hard part** — what happens to in-flight requests during a rollback? Drain or kill?

### 34. Implement model drift detection

- **Framing:** Baseline distribution (training data) + production distribution (inference inputs). KS test, PSI, or embedding-based drift. Alert on threshold breach.
- **Gotcha:** Senior framing distinguishes **input drift** (the data changed) from **concept drift** (the relationship changed) and names which is harder to detect.

### 35. Design vLLM-based LLM serving for cost + latency

- **Framing:** vLLM with continuous batching + PagedAttention. Multi-model routing (cheap model for easy queries, expensive model for hard). Semantic cache. Ray Serve autoscale.
- **Gotcha:** Senior framing benchmarks **cost-per-1K-tokens** on the actual workload, not synthetic. Cheap models save money but only if accuracy holds.

### 36. Build an agentic data pipeline

- **Framing:** LangGraph supervisor + worker agents, RBAC tool registry, checkpointing (Redis), HITL approval queue for write operations.
- **Gotcha:** Senior framing names the **tool-safety problem** — agents that can write to production need RBAC + audit log + approval gates. No exceptions.

### 37. Design a multi-tenant RAG with PII protection

- **Framing:** Postgres RLS for tenant isolation + Presidio PII detection in the ingest pipeline + redacted documents in the vector store + audit log on every query.
- **Gotcha:** Senior framing handles **prompt-injection** — what stops a user query from extracting another tenant's chunks? Defense: query-time tenant filter + retrieved-chunk RLS check.

### 38. Cost-optimize an LLM application

- **Framing:** Semantic cache (similarity-based hit) for repeated queries. Model routing — cheap model first, fall back to expensive on uncertainty. Token-tracking dashboard per feature.
- **Gotcha:** Senior framing knows the **fail-open vs. fail-closed** budget pattern — when the budget is exceeded, do you serve a degraded response or block? Defend the choice.

---

## V. Platform + scale (9 questions)

### 39. Design a multi-team data platform

- **Framing:** Centralized IaC (Terraform) + shared catalog + per-team workspaces + cost-attribution + policy-as-code (OPA / Privacera).
- **Gotcha:** Senior framing names the **governance-vs-velocity tradeoff** — too much governance kills teams; too little kills the platform.

### 40. Design Airflow at scale (10K+ DAGs)

- **Framing:** KubernetesExecutor or CeleryExecutor with autoscaling workers. Per-team DAG namespaces. Shared connection management. DAG-parsing budget enforcement.
- **Gotcha:** Senior framing knows that **DAG-parse time** is the silent killer at scale. A heavy DAG file slows the entire scheduler.

### 41. Migrate from Airflow to Kubernetes-native orchestration

- **Framing:** Argo Workflows or Dagster, gradual cutover by team, dual-write phase, decommission old DAGs only after parity verification.
- **Gotcha:** Senior framing names what **does NOT migrate cleanly** — Airflow's macros, custom operators, and connection pool patterns often require rewrites.

### 42. Design a CI/CD pipeline for dbt + Spark + Python

- **Framing:** GitHub Actions matrix (lint → unit tests → dbt parse → integration tests → deploy). Slim CI for dbt (only test changed models). Multi-stage Docker for Python + Spark.
- **Gotcha:** Senior framing names the **eval-gate pattern** — for AI pipelines, the CI must include a quality-eval step that blocks bad model promotions.

### 43. Implement data contracts at organization scale

- **Framing:** ODCS-style contracts in Git, CI gating on compatibility (backward / forward / full), drift detection in production, owner accountability via PR review.
- **Gotcha:** Senior framing knows that **contracts are a people problem** — without producer + consumer agreement, contracts become bureaucracy. Start with high-pain interfaces only.

### 44. Build column-level lineage across 1K+ models

- **Framing:** OpenLineage + Marquez or a managed (Atlan, Datafold). dbt + Spark + Airflow integrations. PII classification piggybacks on the lineage graph.
- **Gotcha:** Senior framing knows that **automated lineage misses ~20% of edges** (stored procs, custom UDFs). Plan for that.

### 45. Design an on-call rotation for a data platform

- **Framing:** Tiered alerting (P0 wakes you, P1 next-business-day), runbook per failure mode, blameless postmortems, on-call comp.
- **Gotcha:** Senior framing distinguishes **data outages** (a table is wrong) from **infra outages** (a service is down). Different response patterns.

### 46. Implement multi-region disaster recovery for a data platform

- **Framing:** Primary region + warm standby. Snapshot replication for warehouse, geo-replicated object storage, multi-region Kafka via MirrorMaker 2. RPO / RTO defined per data tier.
- **Gotcha:** Senior framing names what is **NOT replicated** — Airflow DAG state, in-flight Flink jobs, secrets. Plan for them.

### 47. Cost-optimize a data + AI platform under FinOps

- **Framing:** Per-team cost attribution dashboard, budget alerts, monthly reviews. Three levers: query optimization, compute right-sizing, storage tier strategy. AI workloads need token-cost engineering on top.
- **Gotcha:** Senior framing names the **FinOps adoption pattern** — measure → allocate → optimize. Skip steps and you optimize the wrong thing.

---

## Closing notes

- **Senior signal #1:** Asking clarifying questions before drawing. The interviewer expects you not to make assumptions.
- **Senior signal #2:** Naming the tradeoff explicitly. "I'd use X because Y, but at scale Z I'd switch to W."
- **Senior signal #3:** Talking about failure modes — not just the happy path. What happens when the source is delayed? When the model regresses? When the bill spikes?

If you want to go deeper on any of these, every curriculum on AI-DE (https://ai-de.net/learn) is built to teach you one of these architectures from first principles, with shipped starter kits and ADRs.

— The AI-DE team
