Ship a production-grade Spark + Delta Lake batch pipeline for ShopStream
Build the e-commerce batch pipeline you'd actually defend in a senior interview. Multi-format ingest with Spark, a 9x optimization sprint with documented before/after metrics, a Delta Lake lakehouse with ACID + time travel + SCD2, and a Kafka/K8s streaming addendum — all running locally on a 5.5 GB scale-up dataset.
Spark optimization is one of the most-asked topics in senior DE rounds at Netflix, Uber, Airbnb, and Databricks. After this, you can name the symptom (skew, OOM, shuffle), name the pattern (broadcast / AQE / salting), and name the fix.
- A working `shopstream_etl.py` ingesting CSV / JSON / Parquet into partitioned outputs
- A 4-pattern Spark optimization runbook (broadcast / cache / partition tune / salting) with documented 9x progression
- A Delta Lake lakehouse with ACID transactions, time travel, SCD Type 2, MERGE upserts, OPTIMIZE+ZORDER, and VACUUM
- A Kafka Structured Streaming pipeline with exactly-once checkpointing, watermarks, and deduplication
- A Spark-on-Kubernetes deployment: RBAC, ServiceMonitor, Prometheus metrics, build-and-push scripts
- A `scripts/scale_up_to_full.py` generator that grows the lean sample data to 5.5 GB
$PATH for the Spark JVM. Not a Spark-from-scratch tutorial: assumes you can read PySpark and want production patterns.
Spark + Delta Lake is still the default batch stack at scale.
Every senior DE role assumes you can profile a slow Spark job, fix data skew, and ship a Delta lakehouse without breaking ACID. The patterns in this project are the ones you'll re-use for every batch system you ever own.
Optimization is the #1 interview topic
Senior DE rounds reliably probe broadcast vs SMJ, AQE, partition tuning, and salting. This project ships all four with before/after metrics so you can talk through them confidently.
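As a rough illustration of the broadcast and AQE patterns named above, the sketch below collects the relevant Spark SQL settings and a broadcast-hinted join. The 10 MB threshold, the `product_id` join key, and the helper names are assumptions for illustration, not the project's actual scripts.

```python
# Spark SQL settings for adaptive query execution and automatic broadcast joins.
# The config keys are standard Spark SQL options; the values are assumed starting points.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Tables below this size (in bytes) are broadcast automatically: 10 MB here.
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),
}


def apply_conf(spark, conf=AQE_CONF):
    """Apply each setting to an existing SparkSession and return the session."""
    for key, value in conf.items():
        spark.conf.set(key, value)
    return spark


def join_with_broadcast(orders, products, key="product_id"):
    """Left-join a large DataFrame to a small one with an explicit broadcast hint.

    `key` is a hypothetical column name. The import is deferred so the
    helpers above can be used without PySpark installed.
    """
    from pyspark.sql.functions import broadcast

    return orders.join(broadcast(products), on=key, how="left")
```

The explicit hint is useful when the optimizer cannot estimate the small table's size; otherwise the threshold setting plus AQE usually picks the join strategy on its own.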
Delta Lake is the production default
ACID + time travel + SCD2 on object storage are now table-stakes. Companies migrating off Hive/Parquet expect engineers who can ship Delta from day one — not just describe it.
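To make the SCD Type 2 idea concrete, here is a plain-Python sketch of the row-versioning rule: close the current row, then append a new current row. In Delta Lake this logic is typically expressed as a MERGE; the code below only illustrates the rule itself. The `CustomerVersion` fields and the `tier` attribute are hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerVersion:
    customer_id: int
    tier: str                  # the tracked attribute (hypothetical)
    valid_from: date
    valid_to: Optional[date]   # None means the row is still open
    is_current: bool


def apply_scd2(history: List[CustomerVersion], customer_id: int,
               new_tier: str, effective: date) -> List[CustomerVersion]:
    """Return a new history with an SCD2 change applied for one customer."""
    result = []
    needs_insert = True
    for row in history:
        if row.customer_id == customer_id and row.is_current:
            if row.tier == new_tier:
                # No attribute change: keep the current row, insert nothing.
                needs_insert = False
                result.append(row)
            else:
                # Close out the old version instead of overwriting it.
                result.append(replace(row, valid_to=effective, is_current=False))
        else:
            result.append(row)
    if needs_insert:
        result.append(CustomerVersion(customer_id, new_tier, effective, None, True))
    return result
```

Because old versions are closed rather than updated in place, a query filtered on `valid_from` and `valid_to` can reconstruct the attribute value as of any date.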
Streaming overlay, not full rewrite
Most production stacks are batch-first with a streaming addendum. Part 4 mirrors that reality — exactly-once Kafka + K8s deployment that supplements the batch core, not a Lambda fantasy.
Local first, cluster optional
Parts 1-3 run on your laptop with a 5.5 GB synthetic dataset. Part 4 needs a real Kafka broker and K8s cluster — same as the production world. No managed-service hand-waving.
Module 01 is free. The rest unlocks with PRO.
Try the first 8-10 hours — stand up Spark locally, ingest the 5.5 GB ShopStream e-commerce dataset, write your first partitioned Parquet output. If the patterns click, upgrade to unlock the optimization sprint, the Delta lakehouse, and the Kafka/K8s deploy.
Apache Spark: Distributed Data Processing
This curriculum is the foundation for the project — every optimization pattern in module 02 has a deeper-dive lesson, and module 04 reuses the same Kubernetes manifests. PRO subscribers get full access.
Three sprints. Three checkpoints. One production batch pipeline.
Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are.
ShopStream e-commerce dataset (5.5 GB) ingested through `shopstream_etl.py`. Multi-format reader. RFM segmentation. Partitioned Parquet output verified.
- ✓ `shopstream_etl.py` orchestrator
- ✓ Multi-format ingest (CSV / JSON / Parquet)
- ✓ Partitioned Parquet output (year/month)
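A minimal sketch of what a multi-format reader with year/month partitioned output could look like. The function names, the `order_date` column, and the reader options are assumptions; the actual `shopstream_etl.py` may be organized differently.

```python
from pathlib import Path

# File suffix -> (Spark reader format, reader options). Options are assumed defaults.
# Note: Spark's JSON reader expects one object per line unless multiLine is set.
READERS = {
    ".csv": ("csv", {"header": "true"}),
    ".json": ("json", {}),
    ".parquet": ("parquet", {}),
}


def reader_for(path):
    """Pick the reader format and options from the file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unsupported input format: {path}")
    return READERS[suffix]


def read_any(spark, path, schema=None):
    """Read CSV, JSON, or Parquet; pass an explicit schema to skip inference."""
    fmt, options = reader_for(path)
    reader = spark.read.format(fmt).options(**options)
    if schema is not None:
        reader = reader.schema(schema)
    return reader.load(path)


def write_partitioned(df, out_path, date_col="order_date"):
    """Write Parquet partitioned by year and month derived from `date_col`."""
    # Deferred import so the helpers above work without PySpark installed.
    from pyspark.sql import functions as F

    (df.withColumn("year", F.year(F.col(date_col)))
       .withColumn("month", F.month(F.col(date_col)))
       .write.mode("overwrite")
       .partitionBy("year", "month")
       .parquet(out_path))
```

With this layout the output directory contains `year=.../month=...` subdirectories, so a filter on those columns lets Spark prune partitions instead of scanning every file.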
Baseline measured. Broadcast (1.8x) → caching → partition tuning → salting (2.5x) applied in sequence. Spark UI screenshots and before/after metrics for each.
- ✓ Baseline + broadcast optimization scripts
- ✓ Caching + partition-tuning scripts
- ✓ Skew detection + salting scripts
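The effect of salting can be shown without a cluster. The plain-Python simulation below is a sketch under stated assumptions: `zlib.crc32` stands in for Spark's hash partitioner, and the key names, 8 partitions, and 16 salts are made-up values. It salts a hot key, aggregates per salted key, then strips the salt and aggregates again.

```python
import random
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Deterministic stand-in for a hash partitioner (not Spark's actual hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def max_partition_load(keys, num_partitions):
    """Row count of the busiest partition, i.e. the straggler task."""
    loads = Counter(partition_of(k, num_partitions) for k in keys)
    return max(loads.values())


def salt_keys(keys, num_salts, seed=0):
    """Append a random salt suffix so one hot key becomes many distinct keys."""
    rng = random.Random(seed)
    return [f"{k}#{rng.randrange(num_salts)}" for k in keys]


def two_stage_count(salted_keys):
    """Stage 1: count per salted key. Stage 2: strip the salt and re-aggregate."""
    partial = Counter(salted_keys)
    final = Counter()
    for salted, n in partial.items():
        original, _, _ = salted.rpartition("#")
        final[original] += n
    return final


# 9,000 rows for one hot customer plus 1,000 rows spread over 100 others.
keys = ["customer_42"] * 9000 + [
    f"customer_{i}" for i in range(100, 200) for _ in range(10)
]
plain_max = max_partition_load(keys, 8)
salted = salt_keys(keys, 16)
salted_max = max_partition_load(salted, 8)
```

Comparing `plain_max` with `salted_max` shows the busiest partition shrinking, while `two_stage_count(salted)` returns the same totals as counting the original keys. The cost is an extra aggregation stage, so salting only pays off when one or a few keys dominate.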
Delta Lake configured. ACID + time travel + SCD2 + MERGE + ZORDER + VACUUM shipped. Kafka Structured Streaming with checkpointing and watermarks. Spark-on-K8s with RBAC + ServiceMonitor + Prometheus.
- ✓ 7 Delta operation scripts
- ✓ 5 streaming scripts (Kafka, watermark, dedup)
- ✓ Dockerfile + K8s manifests + deploy script
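The watermark and deduplication behavior listed for the streaming scripts can be reasoned about without a broker. The following plain-Python simulation is a simplified sketch, not Spark's implementation: it handles one event at a time instead of micro-batches, and `Event`, `dedup_with_watermark`, and the 60-second delay are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    event_id: str
    event_time: int  # seconds since epoch


def dedup_with_watermark(events: List[Event], delay_seconds: int) -> List[Event]:
    """Drop late events and duplicates, keeping dedup state bounded by the watermark."""
    seen: Dict[str, int] = {}   # event_id -> event_time (the dedup state)
    max_event_time = None
    emitted = []
    for ev in events:
        if max_event_time is None or ev.event_time > max_event_time:
            max_event_time = ev.event_time
        watermark = max_event_time - delay_seconds
        # Evict state older than the watermark so memory does not grow forever.
        seen = {k: t for k, t in seen.items() if t >= watermark}
        if ev.event_time < watermark:
            continue  # too late: arrived after the watermark passed it
        if ev.event_id in seen:
            continue  # duplicate within the watermark window
        seen[ev.event_id] = ev.event_time
        emitted.append(ev)
    return emitted
```

In Structured Streaming the corresponding calls are `withWatermark(...)` followed by `dropDuplicates([...])`. The simulation also shows the trade-off: a duplicate that arrives after its original has been evicted from state is emitted again, so the delay must cover the realistic redelivery window.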
One command. Local Spark + Delta Lake + scaled-up dataset.
Lean sample data ships in the zip (orders.csv 10K, products.json 1K, customers.parquet 1K, clickstream_logs.csv 50K, daily_order_updates 500). Run `scripts/scale_up_to_full.py --full` to grow it to 5.5 GB for the optimization sprint. Part 4 streaming + K8s requires a real Kafka broker and Kubernetes cluster.
What lives in the repo
Everything you need to run modules 01-03 locally on a 16 GB laptop, plus the streaming and Kubernetes manifests for module 04 once you have a cluster.
- part-1/run_all.py — ShopStream batch ETL orchestrator
- part-2/ — 6 optimization scripts (broadcast / cache / partition / salting)
- part-3/run_all.py — Delta Lake operations (ACID / time travel / SCD2 / MERGE / ZORDER / VACUUM)
- part-4/kubernetes/ — RBAC, deployment, ServiceMonitor manifests
- scripts/scale_up_to_full.py — Faker generator for the 5.5 GB scaled-up dataset
ShopStream Batch Pipeline Starter Kit
Pre-configured Python venv setup, lean sample data, run_all orchestrators for parts 1 & 3, scale-up generator, and the K8s manifests for part 4.
The same Spark job — but built for the 10x case.
Most Spark tutorials show you the read.csv(). This one shows what changes when the dataset is 100 GB+, the join is skewed, and the output table is being read by 5 downstream pipelines simultaneously.
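The `shuffle_partitions = data_GB * 5` rule of thumb from this comparison is easy to turn into a helper. This is a sketch of that heuristic only; the function names and the rounding choice are assumptions, and the right multiplier depends on executor memory and row width.

```python
import math


def shuffle_partitions_for(data_gb, per_gb=5, minimum=1):
    """Heuristic partition count: about `per_gb` shuffle partitions per GB of input."""
    return max(minimum, math.ceil(data_gb * per_gb))


def apply_shuffle_partitions(spark, data_gb):
    """Set spark.sql.shuffle.partitions on an existing session from the heuristic."""
    n = shuffle_partitions_for(data_gb)
    spark.conf.set("spark.sql.shuffle.partitions", str(n))
    return n
```

For the 5.5 GB dataset this gives 28 partitions instead of Spark's default of 200. With AQE partition coalescing enabled, the configured value acts more like a starting point that Spark can reduce at runtime.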
- inferSchema=True on every read
- OVERWRITE the whole partition
- local[*] for everything
- StructType with nullability declared
- broadcast() hint for tables <10 MB; AQE for the rest
- shuffle_partitions = data_GB * 5
- MERGE INTO ... WHEN MATCHED with conditional logic
- VACUUM 168h + OPTIMIZE ZORDER on a schedule
- spark-submit --master k8s://... with RBAC + ServiceMonitor
- groupBy().count(), fix with salting (2.5x)
Real review from senior engineers who shipped this stack.
Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.
4 reviews / month
Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — Spark/Delta for this project. Async, comments inline, average turnaround 31 hours.
2 office hours / month
Live 30-min sessions with a senior data engineer. Whiteboard a Spark optimization you're stuck on, mock a system-design interview, or talk through a real production incident.
One subscription. 15+ projects, all curriculum, code review.
PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.
Pick this if you’re shipping at scale, not learning to.
Data engineers leveling to senior
You've written Spark jobs that work but you can't always explain WHY they're slow. After this, you can read a Spark UI like a profiler and name the fix.
Platform / infra engineers
You own the cluster but don't always own the workloads. This project gives you the perspective from the application side — enough to push back on a bad job before it costs $5K.
Analytics engineers crossing into DE
You know SQL and dbt. You need the Spark + Delta mental model. Module 03 (Delta Lake) is the bridge — same SCD2 + MERGE patterns you already know, in PySpark.
Staff / tech leads sizing the migration
Your team has a 10-year-old Hive pipeline. You need to know what the Delta migration actually looks like, what breaks, and how long the optimization sprint really takes.
Going deeper? Three tracks back this project.
The Spark deep-dive is the spine. These three curriculums let you go deeper on the layers around it — Python ETL fundamentals, dimensional modeling, and the cloud + container infrastructure module 04 rides on.
Ready to ship a real Spark pipeline?
Start with module 01: free, no card. About 8-10 hours. By the end you'll have Spark running locally with the ShopStream e-commerce dataset, a working batch ETL, and partitioned Parquet output verified against row counts.