Capstone Project · 35-40 hrs

Global Logistics Batch Pipeline

Process raw supply chain CSV and JSON dumps from S3 into clean, analytical tables using distributed processing.

4 Parts/8 Tools/50GB+ Dataset
shopstream / spark-pipeline
INGEST (CSV 1.5GB, JSON 200MB, Parquet 800MB, Logs 3GB)
→ TRANSFORM (Joins, Aggregations, Partitioning, UDFs)
→ OPTIMIZE (Broadcast, Caching, Shuffle, Kryo)
→ SERVE (Delta Lake, Streaming, K8s Deploy, Monitoring)

fig 1 — spark data processing pipeline

THROUGHPUT: 50GB+ raw e-commerce data
OPTIMIZATION: 9x latency reduction
INFRASTRUCTURE: K8s dynamic scaling
STORAGE: Delta, ACID lakehouse

What You'll Build

A complete big data platform for ShopStream, a high-growth e-commerce company processing millions of events daily.

Batch ETL Pipeline

Multi-format ingestion (CSV/JSON/Parquet), DataFrame transformations, partitioning, and schema validation

9x Performance Gain (45 min → 5 min)

Systematic optimization: broadcast joins, caching, partition tuning, shuffle minimization, Kryo serialization

Delta Lake Lakehouse

ACID transactions, time travel queries, schema evolution, SCD Type 2, Z-ordering, and merge operations

Streaming on K8s

Structured Streaming with Kafka integration, auto-scaling on Kubernetes, Prometheus monitoring

Curriculum

Each part builds on the previous. Start with batch processing, optimize, add Delta Lake, deploy to production.

Technical Standards

Production patterns you'll implement across all four parts.

Performance
9x speedup

Systematic optimization from 45 min to 5 min: broadcast joins, caching, and partition tuning

Scalability
50GB+ processed

Delta Lake ACID lakehouse with Kubernetes auto-scaling and real-time streaming

Architecture
Lambda pattern

Unified batch + streaming with Kafka integration and Prometheus monitoring

Environment Setup

Launch the Spark cluster locally with Docker Compose and submit your first job.

shopstream-spark
# Clone the project & generate e-commerce data
$ git clone https://github.com/aide-hub/shopstream-spark.git
$ cd shopstream-spark

# Launch Spark cluster + Kafka + Prometheus
$ docker-compose -f docker-compose.spark.yml up -d

# Submit first Spark job (Part 1: Batch ETL)
$ spark-submit --master spark://localhost:7077 \
    --conf spark.sql.shuffle.partitions=200 \
    jobs/batch_etl.py --input data/ --output output/
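Once the baseline job runs, the Part 2 tuning knobs are passed the same way; the partition count and broadcast threshold below are illustrative values to tune against your data, not prescribed settings:

```shell
spark-submit --master spark://localhost:7077 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=256m \
  --conf spark.sql.shuffle.partitions=64 \
  --conf spark.sql.autoBroadcastJoinThreshold=52428800 \
  jobs/batch_etl.py --input data/ --output output/
```

Kryo serializes shuffle data far more compactly than Java serialization, and raising the broadcast threshold (here to 50MB) lets Spark broadcast larger dimension tables automatically.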

Tech Stack

Apache Spark · PySpark · Delta Lake · Kubernetes · Kafka · Prometheus · Python 3.11 · Docker

Prerequisites

  • Python 3.11+ (functions, classes, list comprehensions)
  • Basic SQL (SELECT, WHERE, GROUP BY)
  • 16GB RAM laptop or cloud access (AWS/GCP/Azure)
  • Git basics (clone, commit, push)

Related Learning Path

Level up your Apache Spark skills with the companion skill toolkit covering core concepts, performance tuning, and production deployment.

Spark Learning Path

What is Apache Spark?

/guide/what-is-apache-spark — complete reference guide

Ready to build production Spark pipelines?

Start with Part 1: Foundation — Batch Data Processing
