ai-de.net/Projects/P05 · ShopStream Spark Batch Pipeline

PRO · module 01 free preview · Batch track · P05

Ship a production-grade Spark + Delta Lake batch pipeline for ShopStream

Build the e-commerce batch pipeline you'd actually defend in a senior interview: multi-format ingest with Spark, a 9x optimization sprint with documented before/after metrics, a Delta Lake lakehouse with ACID + time travel + SCD2, and a Kafka/K8s streaming addendum. Parts 1-3 run locally on a 5.5 GB scale-up dataset; Part 4 needs a real Kafka broker and Kubernetes cluster.

Timeline: 38-46 hours

Difficulty: Intermediate+

Stack: Spark · Delta Lake · Kafka · K8s


Spark optimization is one of the most frequently asked topics in senior DE rounds at Netflix, Uber, Airbnb, and Databricks. After this, you can name the symptom (skew, OOM, shuffle), name the pattern (broadcast / AQE / salting), and name the fix.

By the end you will have

A working `shopstream_etl.py` ingesting CSV / JSON / Parquet into partitioned outputs
A 4-pattern Spark optimization runbook (broadcast / cache / partition tune / salting) with documented 9x progression
A Delta Lake lakehouse with ACID transactions, time travel, SCD Type 2, MERGE upserts, OPTIMIZE+ZORDER, and VACUUM
A Kafka Structured Streaming pipeline with exactly-once checkpointing, watermarks, and deduplication
A Spark-on-Kubernetes deployment: RBAC, ServiceMonitor, Prometheus metrics, build-and-push scripts
A `scripts/scale_up_to_full.py` generator that grows the lean sample data to 5.5 GB

PREREQ: Comfortable with Python (functions, classes) and basic SQL, with 16 GB RAM available locally. Java 11 or 17 must be on $PATH for the Spark JVM. This is not a Spark-from-scratch tutorial; it assumes you can read PySpark and want production patterns.

Pipeline diagram: shopstream.bronze.* · job v2.14 · 9x optimized

Ingest
orders.csv: 10M rows · 1.5 GB
products.json: 500K · 200 MB
customers.parquet: 5M · 800 MB
clickstream.csv: 50M · 3 GB

Transform (shopstream_etl.py · PySpark 3.5)
SparkSession: AQE on · shuffle=200
join + groupBy: RFM segmentation
withColumn: year/month parts
write.parquet: partitionBy

Optimize (baseline → 9x)
broadcast(): 1.8x · small joins
.cache(): MEMORY_AND_DISK
partition tune: 100-200 MB sweet spot
salting: 2.5x · skew fix

Serve (lakehouse + streaming + k8s)
Delta Lake: ACID · SCD2 · time travel
Kafka SS: exactly-once · watermark
Spark on K8s: RBAC · auto-scale
Prometheus: metrics · dashboards

9x via 4 documented patterns
broadcast (1.8x) · cache · partition tune · salting (2.5x) → 9x cumulative
before/after metrics in module 02
→ patterns generalize to every batch system you own

Lean → 5.5 GB scale-up
~1 MB ships in zip · runs locally
scale_up_to_full.py --full
10M orders + 50M clicks · 16 GB RAM
→ same code path, two data sizes

5.5 GB scale-up dataset · 4 optimization patterns · 9x documented speedup

Why this matters in 2026

Spark + Delta Lake is still the default batch stack at scale.

Every senior DE role assumes you can profile a slow Spark job, fix data skew, and ship a Delta lakehouse without breaking ACID. The patterns in this project are the ones you'll re-use for every batch system you ever own.

Optimization is the #1 interview topic

Senior DE rounds reliably probe broadcast vs sort-merge joins (SMJ), AQE, caching, partition tuning, and salting. This project ships four documented patterns (broadcast, cache, partition tuning, salting) with before/after metrics so you can talk through them confidently.

Delta Lake is the production default

ACID + time travel + SCD2 on object storage are now table-stakes. Companies migrating off Hive/Parquet expect engineers who can ship Delta from day one — not just describe it.

Streaming overlay, not full rewrite

Most production stacks are batch-first with a streaming addendum. Part 4 mirrors that reality — exactly-once Kafka + K8s deployment that supplements the batch core, not a Lambda fantasy.
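The exactly-once behavior described for Part 4 leans on two ideas: a watermark that bounds how late an event may arrive, and deduplication against state that the watermark keeps bounded. The sketch below is a plain-Python simulation of that logic, not Spark code and not the project's implementation. The function name, the `(event_id, event_time)` tuples, and the per-event watermark update are assumptions made for illustration; Structured Streaming advances its watermark per micro-batch rather than per event.

```python
from datetime import datetime, timedelta


def dedupe_with_watermark(events, delay):
    """Simulate watermark-bounded deduplication over a stream in arrival order.

    events: iterable of (event_id, event_time) tuples (hypothetical shape).
    delay:  timedelta of lateness tolerated behind the max event time seen.
    """
    seen = {}              # event_id -> event_time; stands in for the state store
    max_event_time = None
    accepted = []
    for event_id, event_time in events:
        # Watermark = max event time seen so far minus the allowed delay.
        if max_event_time is not None and event_time < max_event_time - delay:
            continue       # older than the watermark: dropped as too late
        if event_id in seen:
            continue       # duplicate of an event still held in state
        seen[event_id] = event_time
        accepted.append(event_id)
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        # Evict state behind the watermark so memory stays bounded.
        cutoff = max_event_time - delay
        seen = {k: t for k, t in seen.items() if t >= cutoff}
    return accepted


t0 = datetime(2026, 1, 1, 12, 0)
stream = [
    ("a", t0),
    ("b", t0 + timedelta(minutes=5)),
    ("a", t0),                           # duplicate delivery
    ("c", t0 + timedelta(minutes=20)),
    ("d", t0 + timedelta(minutes=2)),    # arrives behind the watermark
    ("e", t0 + timedelta(minutes=15)),   # late, but within the allowed delay
]
result = dedupe_with_watermark(stream, timedelta(minutes=10))
```

The trade-off to notice is that a larger delay accepts more late data but keeps more state; an evicted id that reappears is still rejected because its event time now falls behind the watermark.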

Local first, cluster optional

Parts 1-3 run on your laptop with a 5.5 GB synthetic dataset. Part 4 needs a real Kafka broker and K8s cluster — same as the production world. No managed-service hand-waving.

Curriculum · 4 modules · 38-46 hours

Module 01 is free. The rest unlocks with PRO.

Try the first 8-10 hours — stand up Spark locally, ingest the 5.5 GB ShopStream e-commerce dataset, write your first partitioned Parquet output. If the patterns click, upgrade to unlock the optimization sprint, the Delta lakehouse, and the Kafka/K8s deploy.

P05 · 38-46 hours · 4 modules

Free preview · PRO required

Module 01 is free — no card required. Run a real Spark batch ETL before paying.

M01

✓ Foundation: Spark batch ETL on the ShopStream dataset

Configure SparkSession (executor memory, shuffle partitions, AQE). Ingest multi-format sources (CSV, JSON, Parquet) into a unified DataFrame. Run RFM customer segmentation. Write partitioned Parquet by year/month with controlled file sizes.

8-10h · 6 lessons · FREE PREVIEW


M02

⊘ Performance: 4 documented optimization patterns

Establish a baseline with wall-clock timing. Apply broadcast joins for small tables (1.8x). Cache intermediate DataFrames. Tune partitions to the 100-200 MB sweet spot. Detect data skew and apply salting (2.5x). Read the Spark UI like a profiler.

10-12h · 6 lessons · PRO TIER


M03

⊘ Delta Lake: ACID, time travel, SCD2, MERGE, OPTIMIZE

Configure Delta + DeltaCatalog. Write the transaction log. Query historical snapshots by version and timestamp. Implement SCD Type 2 with effective-date tracking. MERGE upserts with conditional logic. OPTIMIZE with ZORDER for 5-10x query speedup. VACUUM with retention.

10-12h · 7 lessons · PRO TIER


M04

⊘ Streaming + Kubernetes: production deploy with Kafka and Prometheus

Structured Streaming from Kafka with exactly-once via checkpointing. Watermarking for late data. Deduplication with state stores. Containerize the job with Docker. Deploy to Kubernetes with RBAC, resource limits, and a Prometheus ServiceMonitor.

10-12h · 11 lessons · PRO TIER


3 modules locked · Unlock all PRO content for $29/mo

Upgrade to PRO →

Backed by curriculum

Apache Spark: Distributed Data Processing

14 modules · ~110 hours · RDD / DataFrame internals · Optimization patterns · Delta Lake basics · Streaming · Kubernetes deployment

Open curriculum→

This curriculum is the foundation for the project — every optimization pattern in module 02 has a deeper-dive lesson, and module 04 reuses the same Kubernetes manifests. PRO subscribers get full access.

The build, in 3 phases

Three sprints. Three checkpoints. One production batch pipeline.

Each phase ends with a tagged commit and a working artifact. No ambiguity about where you are.

01 · ~9h

Ship a working Spark batch ETL

ShopStream e-commerce dataset (5.5 GB) ingested through `shopstream_etl.py`. Multi-format reader. RFM segmentation. Partitioned Parquet output verified.

✓ `shopstream_etl.py` orchestrator
✓ Multi-format ingest (CSV / JSON / Parquet)
✓ Partitioned Parquet output (year/month)
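As a rough picture of what this first phase involves, here is a minimal sketch of a session configuration plus a year/month partitioned write. It is not the contents of `shopstream_etl.py`: the helper names, the `4g` memory value, the `order_ts` column, and the output path are hypothetical, and only the `AQE on · shuffle=200` settings come from this page. PySpark is imported lazily so the config helper can be inspected without Spark installed.

```python
def shopstream_spark_conf(shuffle_partitions=200, executor_memory="4g"):
    """Return Spark settings as a plain dict (values must be strings)."""
    return {
        "spark.sql.adaptive.enabled": "true",  # AQE on
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        # Assumed value; in local[*] mode the driver JVM does the work,
        # so spark.driver.memory matters more than this setting.
        "spark.executor.memory": executor_memory,
    }


def build_session(app_name="shopstream_etl", master="local[*]", conf=None):
    """Create a SparkSession from the dict above (requires PySpark)."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).master(master)
    for key, value in (conf or shopstream_spark_conf()).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def write_partitioned(df, ts_col, out_path):
    """Derive year/month columns from a timestamp column and write Parquet."""
    from pyspark.sql import functions as F

    (
        df.withColumn("year", F.year(F.col(ts_col)))
        .withColumn("month", F.month(F.col(ts_col)))
        .write.mode("overwrite")
        .partitionBy("year", "month")
        .parquet(out_path)
    )


conf = shopstream_spark_conf()
```

A call such as `write_partitioned(orders_df, "order_ts", "output/orders")` would then produce one directory per year/month pair, assuming `orders_df` has a timestamp column with that name.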

02 · ~11h

Cut the runtime via 4 documented patterns

Baseline measured. Broadcast (1.8x) → caching → partition tuning → salting (2.5x) applied in sequence. Spark UI screenshots and before/after metrics for each.

✓ Baseline + broadcast optimization scripts
✓ Caching + partition-tuning scripts
✓ Skew detection + salting scripts
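Salting is the least intuitive of the four patterns, so a small illustration may help. The sketch below uses plain Python lists instead of DataFrames to show the mechanics: count rows per key to spot the hot key, append a random salt to the large side, replicate the small side once per salt, and join on the composite key. All names and the toy data are hypothetical, and this is not the project's script.

```python
import random
from collections import Counter


def salt_large_side(rows, num_salts, seed=0):
    """Attach a random salt to each key so one hot key spreads over buckets."""
    rng = random.Random(seed)
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_small_side(rows, num_salts):
    """Replicate each small-side row once per salt so every bucket can match."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """Inner join two lists of (key, value) pairs on key."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


# Toy data: one hot customer owns almost every order.
orders = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
customers = [("hot", "H"), ("cold", "C")]

# Skew detection: the plain-Python analogue of groupBy(key).count().
key_counts = Counter(key for key, _ in orders)

NUM_SALTS = 8
salted = hash_join(
    salt_large_side(orders, NUM_SALTS),
    explode_small_side(customers, NUM_SALTS),
)
plain = hash_join(orders, customers)

# Dropping the salt recovers exactly the unsalted join result.
unsalted = sorted((key, lv, rv) for (key, _salt), lv, rv in salted)
largest_bucket = max(Counter(key for key, _, _ in salted).values())
```

The cost is visible in `explode_small_side`: the small side grows by a factor of `NUM_SALTS`, which is why salting is usually reserved for joins where skew has actually been measured.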

03 · ~22h

Add Delta lakehouse + streaming + K8s deploy

Delta Lake configured. ACID + time travel + SCD2 + MERGE + ZORDER + VACUUM shipped. Kafka Structured Streaming with checkpointing and watermarks. Spark-on-K8s with RBAC + ServiceMonitor + Prometheus.

✓ 7 Delta operation scripts
✓ 5 streaming scripts (Kafka, watermark, dedup)
✓ Dockerfile + K8s manifests + deploy script
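Of the Delta operations in this phase, SCD Type 2 carries the most logic, so here is a minimal sketch of the effective-date bookkeeping in plain Python. On a Delta table this would typically be expressed as a MERGE; the row layout (`id`, `attrs`, `valid_from`, `valid_to`, `is_current`), the `9999-12-31` sentinel, and the function name are assumptions for illustration, not the project's schema.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed sentinel end date for current rows


def apply_scd2(dimension, updates, effective):
    """Return a new dimension with SCD Type 2 changes applied.

    dimension: list of dicts with id, attrs, valid_from, valid_to, is_current.
    updates:   dict mapping id -> attrs for changed or new records.
    effective: date on which the updates take effect.
    """
    result = []
    unchanged_current = set()
    for row in dimension:
        changed = (
            row["is_current"]
            and row["id"] in updates
            and updates[row["id"]] != row["attrs"]
        )
        if changed:
            # Close the old version instead of overwriting it.
            result.append({**row, "valid_to": effective, "is_current": False})
        else:
            result.append(dict(row))
            if row["is_current"]:
                unchanged_current.add(row["id"])
    for key, attrs in updates.items():
        if key not in unchanged_current:
            # New id, or a new version of an id that was just closed.
            result.append({
                "id": key, "attrs": attrs, "valid_from": effective,
                "valid_to": OPEN_END, "is_current": True,
            })
    return result


start = date(2025, 1, 1)
dim = [
    {"id": 1, "attrs": {"city": "Austin"}, "valid_from": start,
     "valid_to": OPEN_END, "is_current": True},
    {"id": 2, "attrs": {"city": "Boston"}, "valid_from": start,
     "valid_to": OPEN_END, "is_current": True},
]
incoming = {1: {"city": "Denver"}, 2: {"city": "Boston"}, 3: {"city": "Miami"}}
new_dim = apply_scd2(dim, incoming, date(2026, 1, 1))
```

Customer 1 ends up with two rows (a closed Austin version and a current Denver version), customer 2 is untouched because nothing changed, and customer 3 is inserted as a new current row.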

Project setup · 10 minutes

One command. Local Spark + Delta Lake + scaled-up dataset.

Lean sample data ships in the zip (orders.csv 10K, products.json 1K, customers.parquet 1K, clickstream_logs.csv 50K, daily_order_updates 500). Run `scripts/scale_up_to_full.py --full` to grow it to 5.5 GB for the optimization sprint. Part 4 streaming + K8s requires a real Kafka broker and Kubernetes cluster.

What lives in the repo

Everything you need to run modules 01-03 locally on a 16 GB laptop, plus the streaming and Kubernetes manifests for module 04 once you have a cluster.

part-1/run_all.py — ShopStream batch ETL orchestrator
part-2/ — 6 optimization scripts (broadcast / cache / partition / salting)
part-3/run_all.py — Delta Lake operations (ACID / time travel / SCD2 / MERGE / ZORDER / VACUUM)
part-4/kubernetes/ — RBAC, deployment, ServiceMonitor manifests
scripts/scale_up_to_full.py — Faker generator for the 5.5 GB scaled-up dataset

Download · Starter Kit

ShopStream Batch Pipeline Starter Kit

Pre-configured Python venv setup, lean sample data, run_all orchestrators for parts 1 & 3, scale-up generator, and the K8s manifests for part 4.

~1 MB · 50 files · 4 sample CSVs/JSON/Parquet · PRO required

~/projects/logistics-batch-pipeline — zsh

1. Unzip and create a Python venv

$ unzip logistics-batch-pipeline-starter.zip

$ cd logistics-batch-pipeline-starter

$ python3 -m venv .venv && source .venv/bin/activate

2. Install dependencies (PySpark, Delta, Faker)

$ pip install --upgrade pip && pip install -r requirements.txt

3. Grow lean data to the 5.5 GB tutorial scale (optional)

$ python scripts/scale_up_to_full.py --full

4. Run the part-1 batch ETL

$ python part-1/run_all.py

5. Run the part-3 Delta lakehouse build

$ python part-3/run_all.py

5.5 GB scale-up max · 10M order rows (full) · 3 format types · 1 MB lean zip ships

Production hardening

The same Spark job — but built for the 10x case.

Most Spark tutorials show you the read.csv(). This one shows what changes when the dataset is 100 GB+, the join is skewed, and the output table is being read by 5 downstream pipelines simultaneously.

Tutorial-grade · What you have today

Schema: inferSchema=True on every read
Joins: Default sort-merge join on every table
Shuffle partitions: Default 200, regardless of data size
Updates: OVERWRITE the whole partition
Delta retention: Default; old snapshots accumulate forever
Submit: local[*] for everything
Skew: Hope your join keys are uniform

Production-grade · Modules 02-04

✓ Schema: Explicit StructType with nullability declared
✓ Joins: broadcast() hint for tables <10 MB; AQE for the rest
✓ Shuffle partitions: Tuned to 100-200 MB per partition: shuffle_partitions = data_GB * 5
✓ Updates: MERGE INTO ... WHEN MATCHED with conditional logic
✓ Delta retention: VACUUM 168h + OPTIMIZE ZORDER on a schedule
✓ Submit: spark-submit --master k8s://... with RBAC + ServiceMonitor
✓ Skew: Detect via groupBy().count(), fix with salting (2.5x)
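The shuffle-partition rule in the comparison above is simple arithmetic, and working it through for the 5.5 GB dataset shows why the default of 200 is a poor fit. The helper names below are hypothetical, sizes use decimal units (1 GB = 1000 MB), and input size is used as a stand-in for shuffle size, which in a real job may differ.

```python
import math


def partitions_by_rule(data_gb):
    """The rule of thumb quoted above: shuffle_partitions = data_GB * 5."""
    return max(1, math.ceil(data_gb * 5))


def partitions_by_target(data_gb, target_mb=150):
    """Alternative: derive the count from a target partition size in MB."""
    return max(1, math.ceil(data_gb * 1000 / target_mb))


def partition_size_mb(data_gb, partitions):
    """Average partition size in MB for a given partition count."""
    return data_gb * 1000 / partitions


rule_count = partitions_by_rule(5.5)                 # 28 partitions
rule_size = partition_size_mb(5.5, rule_count)       # about 196 MB each
target_count = partitions_by_target(5.5)             # 37 partitions at ~150 MB
default_size = partition_size_mb(5.5, 200)           # 27.5 MB each
```

At this data size the default 200 partitions average 27.5 MB each, well below the 100-200 MB range, while the `data_GB * 5` rule lands at the upper edge of it.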

PRO benefit · code review

Real review from senior engineers who shipped this stack.

Submit your repo, get line-by-line feedback within 48 hours. The kind of review that's quietly worth thousands of dollars in time-to-staff.

4 reviews / month

Submit a repo, a PR, or a refactor proposal. Reviewer is matched to your domain — Spark/Delta for this project. Async, comments inline, average turnaround 31 hours.

31h avg turnaround · 9.2/10 helpfulness · 94% return next month

2 office hours / month

Live 30-min sessions with a senior data engineer. Whiteboard a Spark optimization you're stuck on, mock a system-design interview, or talk through a real production incident.

30 min per session · 2 / mo included · + group unlimited

What PRO unlocks

One subscription. 15+ projects, all curriculum, code review.

PRO is built for senior+ engineers who want production-grade builds and feedback loops — not more tutorials.

What you get · FREE · PRO · EXPERT

Projects (production-grade builds): PRO: 15+
Curriculum modules (all 7 tracks): FREE: Phase 1 only · PRO: All · EXPERT: All + bonus
Code review credits (senior engineer review): PRO: 4 / month · EXPERT: Unlimited
Career path access (5 paths × full plans): FREE: 1 path · PRO: All 5 · EXPERT: All 5 + 1:1
Certificate (verifiable on LinkedIn): FREE: none · PRO: Yes · EXPERT: Yes + portfolio review
Community (Discord + office hours): FREE: Read-only · PRO: Full + 2/mo · EXPERT: Full + 4/mo

$29/mo

billed monthly · cancel anytime

or annual

$249/yr save 28%

Upgrade to PRO →

Who this is for

Pick this if you’re shipping at scale, not learning to.

Data engineers leveling to senior

You've written Spark jobs that work but you can't always explain WHY they're slow. After this, you can read a Spark UI like a profiler and name the fix.

Platform / infra engineers

You own the cluster but don't always own the workloads. This project gives you the perspective from the application side — enough to push back on a bad job before it costs $5K.

Analytics engineers crossing into DE

You know SQL and dbt. You need the Spark + Delta mental model. Module 03 (Delta Lake) is the bridge — same SCD2 + MERGE patterns you already know, in PySpark.

Staff / tech leads sizing the migration

Your team has a 10-year-old Hive pipeline. You need to know what the Delta migration actually looks like, what breaks, and how long the optimization sprint really takes.

Related curriculum

Going deeper? Three tracks back this project.

The Spark deep-dive is the spine. These three curricula let you go deeper on the layers around it: Python ETL fundamentals, dimensional modeling, and the cloud + container infrastructure module 04 rides on.

FAQ

Quick answers.

Is the project really 'logistics' or e-commerce?

The directory is named `logistics-batch-pipeline` for legacy reasons, but the actual code processes a ShopStream e-commerce dataset (orders, products, customers, clickstream). The same Spark patterns apply to logistics, retail, fintech, or any domain, but be honest with yourself about what's in the tutorial.

Do I need a Kafka broker or a Kubernetes cluster to do this?

No for modules 01-03 — they run locally on a 16 GB laptop. Yes for module 04 — Spark-on-K8s + Kafka Structured Streaming need a real cluster. We don't ship a one-click Kafka docker-compose because the production patterns assume a real broker. If you don't have one, modules 01-03 are still 28-34 hours of real production-grade work.

Is the '9x speedup' a measured benchmark?

It's a documented pattern progression. The four optimizations (broadcast, caching, partition tuning, salting) each have a before/after metric in the tutorial code. The 9x is the cumulative ceiling of those patterns; your actual speedup depends on your dataset, cluster, and query shape. The interview-ready takeaway is the FOUR patterns, not the single number.
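A quick way to sanity-check the headline number is to multiply out the per-pattern factors. The sketch below assumes the speedups compound multiplicatively when applied in sequence, which is a simplification; only the 1.8x (broadcast), 2.5x (salting), and 9x figures come from this page, and the function names are hypothetical.

```python
def cumulative_speedup(factors):
    """Multiply per-pattern speedups, assuming they compound independently."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total


def implied_remaining(target, known_factors):
    """Speedup the unlisted patterns must contribute to reach the target."""
    return target / cumulative_speedup(known_factors)


known = cumulative_speedup([1.8, 2.5])        # broadcast x salting
remaining = implied_remaining(9.0, [1.8, 2.5])  # caching + partition tuning
```

Broadcast and salting alone multiply to 4.5x, so under this assumption caching and partition tuning together would need to contribute roughly another 2x to reach the 9x ceiling.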

How is this different from P04 Iceberg Lakehouse Foundations?

P04 is a lakehouse-deep-dive on Iceberg — table format mechanics, MERGE, compaction, snapshots. P05 (this) is a Spark + Delta sweep — multi-format ingest, optimization patterns, Delta operations, and a streaming/K8s overlay. Both are Batch track. P04 is depth; P05 is breadth. Doing both gives you Iceberg AND Delta side by side.

Does this include CDC or a streaming-first architecture?

No. Module 04 ships a Kafka Structured Streaming pipeline with exactly-once semantics — but it's an addendum to the batch core, not a unified Lambda or Kappa runtime. CDC ingestion gets a dedicated project (P03 Agentic Data Pipeline).

What does PRO actually unlock for $29/mo?

All 15+ PRO projects, 4 code-review credits per month, 2 office-hours sessions, full curriculum across all 7 tracks (including the 110-hour Spark deep-dive), all 5 career paths, certificate of completion, and full community access. Cancel anytime.

Will this help with senior+ DE interviews?

Yes. Spark optimization is one of the most frequently asked topics. After this you can name the symptom (skew, OOM, shuffle), name the pattern (broadcast / AQE / salting / partition tuning), and walk through the Spark UI to prove it. You will also have a working Delta + K8s + Prometheus deploy you can reference in system-design rounds.

Ready to ship a real Spark pipeline?

Start with module 01: free, no card, about 8-10 hours. By the end you'll have Spark running locally with the ShopStream e-commerce dataset, a working batch ETL, and partitioned Parquet output verified against row counts.



Related curriculum tracks: Python for Data Engineers · Advanced Data Modeling & Architecture · Cloud Data Infrastructure & FinOps
