Skip to content
ai-de.net/Projects/P05 · ShopStream Spark Batch Pipeline
PRO · module 01 free previewBatch trackP05

Ship a
production-grade
Spark + Delta Lake batch pipeline for ShopStream

Build the e-commerce batch pipeline you'd actually defend in a senior interview. Multi-format ingest with Spark, a 9x optimization sprint with documented before/after metrics, a Delta Lake lakehouse with ACID + time travel + SCD2, and a Kafka/K8s streaming addendum — all running locally on a 5.5 GB scale-up dataset.

Timeline
38-46 hours
Difficulty
Intermediate+
Stack
Spark · Delta Lake · Kafka · K8s

Spark optimization is one of the most-asked questions in senior DE rounds at Netflix, Uber, Airbnb, Databricks. After this, you can name the symptom (skew, OOM, shuffle), name the pattern (broadcast / AQE / salting), and name the fix.

By the end you will have
  • A working `shopstream_etl.py` ingesting CSV / JSON / Parquet into partitioned outputs
  • A 4-pattern Spark optimization runbook (broadcast / cache / partition tune / salting) with documented 9x progression
  • A Delta Lake lakehouse with ACID transactions, time travel, SCD Type 2, MERGE upserts, OPTIMIZE+ZORDER, and VACUUM
  • A Kafka Structured Streaming pipeline with exactly-once checkpointing, watermarks, and deduplication
  • A Spark-on-Kubernetes deployment: RBAC, ServiceMonitor, Prometheus metrics, build-and-push scripts
  • A `scripts/scale_up_to_full.py` generator that grows the lean sample data to 5.5 GB
PREREQComfortable with Python (functions, classes), basic SQL, and 16 GB RAM available locally. Java 11 or 17 must be on $PATH for the Spark JVM. Not a Spark-from-scratch tutorial — assumes you can read PySpark and want production patterns.
shopstream.bronze.* · job v2.14
9x optimized
Ingest
Transform
Optimize
Serve
orders.csv10M rows · 1.5 GB
products.json500K · 200 MB
customers.parquet5M · 800 MB
clickstream.csv50M · 3 GB
shopstream_etl.py
SparkSessionAQE on · shuffle=200
join + groupByRFM segmentation
withColumnyear/month parts
write.parquetpartitionBy
PySpark 3.5
broadcast()1.8x · small joins
.cache()MEMORY_AND_DISK
partition tune100-200 MB sweet spot
salting2.5x · skew fix
baseline → 9x
Delta LakeACID · SCD2 · time travel
Kafka SSexactly-once · watermark
Spark on K8sRBAC · auto-scale
Prometheusmetrics · dashboards
lakehouse + streaming + k8s
# 9x via 4 documented patterns
broadcast (1.8x) · cache · partition tune
· salting (2.5x) → 9x cumulative
before/after metrics in module 02
→ patterns generalize to every batch system you own
● Lean → 5.5 GB scale-up
~1 MB ships in zip · runs locally
scale_up_to_full.py --full
10M orders + 50M clicks · 16 GB RAM
→ same code path, two data sizes
5.5 GB
Scale-up dataset
4
Optimization patterns
9x
Documented speedup
Why this matters in 2026

Spark + Delta Lake is still the default batch stack at scale.

Every senior DE role assumes you can profile a slow Spark job, fix data skew, and ship a Delta lakehouse without breaking ACID. The patterns in this project are the ones you'll re-use for every batch system you ever own.

Optimization is the #1 interview topic

Senior DE rounds reliably probe broadcast vs SMJ, AQE, partition tuning, and salting. This project ships all four with before/after metrics so you can talk through them confidently.

Delta Lake is the production default

ACID + time travel + SCD2 on object storage are now table-stakes. Companies migrating off Hive/Parquet expect engineers who can ship Delta from day one — not just describe it.

Streaming overlay, not full rewrite

Most production stacks are batch-first with a streaming addendum. Part 4 mirrors that reality — exactly-once Kafka + K8s deployment that supplements the batch core, not a Lambda fantasy.

Local first, cluster optional

Parts 1-3 run on your laptop with a 5.5 GB synthetic dataset. Part 4 needs a real Kafka broker and K8s cluster — same as the production world. No managed-service hand-waving.

Curriculum · 4 modules · 38-46 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 8-10 hours — stand up Spark locally, ingest the 5.5 GB ShopStream e-commerce dataset, write your first partitioned Parquet output. If the patterns click, upgrade to unlock the optimization sprint, the Delta lakehouse, and the Kafka/K8s deploy.

P05 · 38-46 hours · 4 modules
Free preview PRO required
Module 01 is free — no card required. Run a real Spark batch ETL before paying.
M01
Foundation: Spark batch ETL on the ShopStream dataset
Configure SparkSession (executor memory, shuffle partitions, AQE). Ingest multi-format sources (CSV, JSON, Parquet) into a unified DataFrame. Run RFM customer segmentation. Write partitioned Parquet by year/month with controlled file sizes.
8-10h6 lessonsFREE PREVIEW
Start →
M02
Performance: 4 documented optimization patterns
Establish a baseline with wall-clock timing. Apply broadcast joins for small tables (1.8x). Cache intermediate DataFrames. Tune partitions to the 100-200 MB sweet spot. Detect data skew and apply salting (2.5x). Read the Spark UI like a profiler.
10-12h6 lessonsPRO TIER
Unlock with PRO →
M03
Delta Lake: ACID, time travel, SCD2, MERGE, OPTIMIZE
Configure Delta + DeltaCatalog. Write the transaction log. Query historical snapshots by version and timestamp. Implement SCD Type 2 with effective-date tracking. MERGE upserts with conditional logic. OPTIMIZE with ZORDER for 5-10x query speedup. VACUUM with retention.
10-12h7 lessonsPRO TIER
Unlock with PRO →
M04
Streaming + Kubernetes: production deploy with Kafka and Prometheus
Structured Streaming from Kafka with exactly-once via checkpointing. Watermarking for late data. Deduplication with state stores. Containerize the job with Docker. Deploy to Kubernetes with RBAC, resource limits, and a Prometheus ServiceMonitor.
10-12h11 lessonsPRO TIER
Unlock with PRO →
3 modules locked · Unlock all PRO content for $29/mo
Upgrade to PRO →
Backed by curriculum

Apache Spark: Distributed Data Processing

14 modules·~110 hours·RDD / DataFrame internals·Optimization patterns·Delta Lake basics·Streaming·Kubernetes deployment
Open curriculum

This curriculum is the foundation for the project — every optimization pattern in module 02 has a deeper-dive lesson, and module 04 reuses the same Kubernetes manifests. PRO subscribers get full access.

The build, in 3 phases

Three sprints. Three checkpoints. One production batch pipeline.

Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are.

01~9h
Ship a working Spark batch ETL

ShopStream e-commerce dataset (5.5 GB) ingested through `shopstream_etl.py`. Multi-format reader. RFM segmentation. Partitioned Parquet output verified.

  • `shopstream_etl.py` orchestrator
  • Multi-format ingest (CSV / JSON / Parquet)
  • Partitioned Parquet output (year/month)
02~11h
Cut the runtime via 4 documented patterns

Baseline measured. Broadcast (1.8x) → caching → partition tuning → salting (2.5x) applied in sequence. Spark UI screenshots and before/after metrics for each.

  • Baseline + broadcast optimization scripts
  • Caching + partition-tuning scripts
  • Skew detection + salting scripts
03~22h
Add Delta lakehouse + streaming + K8s deploy

Delta Lake configured. ACID + time travel + SCD2 + MERGE + ZORDER + VACUUM shipped. Kafka Structured Streaming with checkpointing and watermarks. Spark-on-K8s with RBAC + ServiceMonitor + Prometheus.

  • 7 Delta operation scripts
  • 5 streaming scripts (Kafka, watermark, dedup)
  • Dockerfile + K8s manifests + deploy script
Project setup · 10 minutes

One command. Local Spark + Delta Lake + scaled-up dataset.

Lean sample data ships in the zip (orders.csv 10K, products.json 1K, customers.parquet 1K, clickstream_logs.csv 50K, daily_order_updates 500). Run `scripts/scale_up_to_full.py --full` to grow it to 5.5 GB for the optimization sprint. Part 4 streaming + K8s requires a real Kafka broker and Kubernetes cluster.

What lives in the repo

Everything you need to run modules 01-03 locally on a 16 GB laptop, plus the streaming and Kubernetes manifests for module 04 once you have a cluster.

  • part-1/run_all.py — ShopStream batch ETL orchestrator
  • part-2/ — 6 optimization scripts (broadcast / cache / partition / salting)
  • part-3/run_all.py — Delta Lake operations (ACID / time travel / SCD2 / MERGE / ZORDER / VACUUM)
  • part-4/kubernetes/ — RBAC, deployment, ServiceMonitor manifests
  • scripts/scale_up_to_full.py — Faker generator for the 5.5 GB scaled-up dataset
Download · Starter Kit

ShopStream Batch Pipeline Starter Kit

Pre-configured Python venv setup, lean sample data, run_all orchestrators for parts 1 & 3, scale-up generator, and the K8s manifests for part 4.

~1 MB · 50 files · 4 sample CSVs/JSON/Parquet · PRO required
~/projects/logistics-batch-pipeline — zsh
1. Unzip and create a Python venv
$ unzip logistics-batch-pipeline-starter.zip
$ cd logistics-batch-pipeline-starter
$ python3 -m venv .venv && source .venv/bin/activate
2. Install dependencies (PySpark, Delta, Faker)
$ pip install --upgrade pip && pip install -r requirements.txt
3. Grow lean data to the 5.5 GB tutorial scale (optional)
$ python scripts/scale_up_to_full.py --full
4. Run the part-1 batch ETL
$ python part-1/run_all.py
5. Run the part-3 Delta lakehouse build
$ python part-3/run_all.py
5.5 GB
Scale-up max
10M
Order rows (full)
4
Format types
1 MB
Lean zip ships
Production hardening

The same Spark job — but built for the 10x case.

Most Spark tutorials show you the read.csv(). This one shows what changes when the dataset is 100 GB+, the join is skewed, and the output table is being read by 5 downstream pipelines simultaneously.

Tutorial-gradeWhat you have today
×
Schema
inferSchema=True on every read
×
Joins
Default sort-merge join on every table
×
Shuffle partitions
Default 200, regardless of data size
×
Updates
OVERWRITE the whole partition
×
Delta retention
Default; old snapshots accumulate forever
×
Submit
local[*] for everything
×
Skew
Hope your join keys are uniform
Production-gradeModules 02-04
Schema
Explicit StructType with nullability declared
Joins
broadcast() hint for tables <10 MB; AQE for the rest
Shuffle partitions
Tuned to 100-200 MB per partition: shuffle_partitions = data_GB * 5
Updates
MERGE INTO ... WHEN MATCHED with conditional logic
Delta retention
VACUUM 168h + OPTIMIZE ZORDER on a schedule
Submit
spark-submit --master k8s://... with RBAC + ServiceMonitor
Skew
Detect via groupBy().count(), fix with salting (2.5x)
PRO benefit · code review

Real review from senior engineers who shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.

CR

4 reviews / month

Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — Spark/Delta for this project. Async, comments inline, average turnaround 31 hours.

31h
avg turnaround
9.2/10
helpfulness
94%
return next month
OH

2 office hours / month

Live 30-min sessions with a senior data engineer. Whiteboard a Spark optimization you're stuck on, mock a system-design interview, or talk through a real production incident.

30 min
per session
2 / mo
included
+ group
unlimited
What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.

What you getFREEPROEXPERT
Projects
Production-grade builds
2
15+
8
Curriculum modules
All 7 tracks
Phase 1 only
All
All + bonus
Code review credits
Senior engineer review
0
4 / month
Unlimited
Career path access
5 paths × full plans
1 path
All 5
All 5 + 1:1
Certificate
Verifiable on LinkedIn
Yes
Yes + portfolio review
Community
Discord + office hours
Read-only
Full + 2/mo
Full + 4/mo
$29/mo
billed monthly · cancel anytime
or annual
$249/yr save 28%
Upgrade to PRO
Who this is for

Pick this if you’re shipping at scale, not learning to.

DE

Data engineers leveling to senior

You've written Spark jobs that work but you can't always explain WHY they're slow. After this, you can read a Spark UI like a profiler and name the fix.

PE

Platform / infra engineers

You own the cluster but don't always own the workloads. This project gives you the perspective from the application side — enough to push back on a bad job before it costs $5K.

AE

Analytics engineers crossing into DE

You know SQL and dbt. You need the Spark + Delta mental model. Module 03 (Delta Lake) is the bridge — same SCD2 + MERGE patterns you already know, in PySpark.

ST

Staff / tech leads sizing the migration

Your team has a 10-year-old Hive pipeline. You need to know what the Delta migration actually looks like, what breaks, and how long the optimization sprint really takes.

FAQ

Quick answers.

The directory is named `logistics-batch-pipeline` for legacy reasons, but the actual code is a ShopStream e-commerce dataset (orders, products, customers, clickstream). Same Spark patterns apply to logistics, retail, fintech, or any domain — but be honest with yourself about what's in the tutorial.
No for modules 01-03 — they run locally on a 16 GB laptop. Yes for module 04 — Spark-on-K8s + Kafka Structured Streaming need a real cluster. We don't ship a one-click Kafka docker-compose because the production patterns assume a real broker. If you don't have one, modules 01-03 are still 28-34 hours of real production-grade work.
It's a documented pattern progression. The four optimizations (broadcast, caching, partition tuning, salting) each have a before/after metric in the tutorial code. The 9x is the cumulative ceiling of those patterns; your actual speedup depends on your dataset, cluster, and query shape. The interview-ready takeaway is the FOUR patterns, not the single number.
P04 is a lakehouse-deep-dive on Iceberg — table format mechanics, MERGE, compaction, snapshots. P05 (this) is a Spark + Delta sweep — multi-format ingest, optimization patterns, Delta operations, and a streaming/K8s overlay. Both are Batch track. P04 is depth; P05 is breadth. Doing both gives you Iceberg AND Delta side by side.
No. Module 04 ships a Kafka Structured Streaming pipeline with exactly-once semantics — but it's an addendum to the batch core, not a unified Lambda or Kappa runtime. CDC ingestion gets a dedicated project (P03 Agentic Data Pipeline).
All 15+ PRO projects, 4 code-review credits per month, 2 office-hours sessions, full curriculum across all 7 tracks (including the 110-hour Spark deep-dive), all 5 career paths, certificate of completion, and full community access. Cancel anytime.
Yes — Spark optimization is the single most-asked topic. After this you can name the symptom (skew, OOM, shuffle), name the pattern (broadcast / AQE / salting / partition tuning), and walk through the Spark UI to prove it. Plus you have a working Delta + K8s + Prometheus deploy you can reference in system-design rounds.

Ready to ship a real Spark pipeline?

Start with module 01 — free, no card. About 8-10 hours. By the end you'll have Spark running locally with the ShopStream e-commerce dataset, a working batch ETL, and partitioned Parquet output verified against rowcounts.

P05 · ShopStream Spark Batch Pipeline · PRO · module 01 freeUpgrade to PRO →
Press Cmd+K to open