Global Logistics Batch Pipeline
Process raw supply-chain CSV and JSON dumps from S3 into clean analytical tables using distributed processing.
Fig. 1 — Spark data processing pipeline
THROUGHPUT: 50GB+ raw e-commerce data
OPTIMIZATION: 9x latency reduction
INFRASTRUCTURE: K8s dynamic scaling
STORAGE: Delta ACID lakehouse
What You'll Build
A complete big data platform for ShopStream, a high-growth e-commerce company processing millions of events daily.
Batch ETL Pipeline
Multi-format ingestion (CSV/JSON/Parquet), DataFrame transformations, partitioning, and schema validation
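The schema-validation step above can be sketched in plain Python for clarity (the course itself uses Spark DataFrame schemas; the column names and the quarantine idea here are illustrative assumptions, not the project's actual schema):

```python
from datetime import datetime

# Hypothetical order schema: column name -> parser that raises on bad input.
ORDER_SCHEMA = {
    "order_id": int,
    "amount": float,
    "created_at": lambda s: datetime.strptime(s, "%Y-%m-%d"),
}

def validate_row(row: dict):
    """Return (parsed_row, error). Rows with an error would be routed
    to a quarantine location instead of the clean analytical table."""
    parsed = {}
    for col, parse in ORDER_SCHEMA.items():
        if col not in row:
            return row, f"missing column: {col}"
        try:
            parsed[col] = parse(row[col])
        except (ValueError, TypeError) as e:
            return row, f"bad value in {col}: {e}"
    return parsed, None

good, err = validate_row({"order_id": "42", "amount": "9.99",
                          "created_at": "2024-01-15"})
bad, err2 = validate_row({"order_id": "x", "amount": "9.99",
                          "created_at": "2024-01-15"})
```

In Spark the same contract is usually expressed by reading with an explicit `StructType` and a `badRecordsPath` or `columnNameOfCorruptRecord` option rather than row-by-row Python.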
9x Performance Gain
Systematic optimization: broadcast joins, caching, partition tuning, shuffle minimization, Kryo serialization
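The broadcast-join idea behind that optimization can be modeled in plain Python: ship the small dimension table to every partition of the large fact table so each partition joins locally and the large side is never shuffled. Table contents are made up for illustration:

```python
# Small dimension table: in Spark this is what gets broadcast to executors.
countries = {1: "US", 2: "DE", 3: "JP"}

# One partition of the large fact table.
orders = [
    {"order_id": 10, "country_id": 2, "amount": 50.0},
    {"order_id": 11, "country_id": 1, "amount": 12.5},
]

def map_side_join(partition, dim):
    # Each partition joins against its local copy of the dim table:
    # no shuffle of the large side is needed.
    return [{**row, "country": dim.get(row["country_id"])} for row in partition]

joined = map_side_join(orders, countries)
```

In PySpark the equivalent is `large_df.join(broadcast(small_df), "country_id")`; Spark also broadcasts automatically below the `spark.sql.autoBroadcastJoinThreshold` size.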
Delta Lake Lakehouse
ACID transactions, time travel queries, schema evolution, SCD Type 2, Z-ordering, and merge operations
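The SCD Type 2 pattern listed above can be sketched in plain Python: on a change, expire the current version and append a new one with fresh validity dates. In the course this is done with a Delta Lake `MERGE`; the record shape here is a hypothetical simplification:

```python
from datetime import date

def scd2_merge(current, updates, today):
    """Close changed rows and append new versions (SCD Type 2).
    Each row: {key, value, valid_from, valid_to, is_current}."""
    by_key = {r["key"]: r for r in current if r["is_current"]}
    out = list(current)
    for u in updates:
        old = by_key.get(u["key"])
        if old is None or old["value"] != u["value"]:
            if old is not None:          # expire the previous version
                old["valid_to"] = today
                old["is_current"] = False
            out.append({"key": u["key"], "value": u["value"],
                        "valid_from": today, "valid_to": None,
                        "is_current": True})
    return out

hist = [{"key": "cust-1", "value": "Berlin", "valid_from": date(2023, 1, 1),
         "valid_to": None, "is_current": True}]
hist = scd2_merge(hist, [{"key": "cust-1", "value": "Munich"}], date(2024, 6, 1))
```

A Delta `MERGE` expresses the same logic declaratively: `WHEN MATCHED AND value changed THEN UPDATE` (close the row) plus an insert of the new version.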
Streaming on K8s
Structured Streaming with Kafka integration, auto-scaling on Kubernetes, Prometheus monitoring
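The core of such a streaming job is usually a windowed aggregation. The tumbling-window counting logic behind a Structured Streaming `groupBy(window(...), key).count()` can be sketched in plain Python (event data is hypothetical):

```python
from collections import defaultdict

def tumbling_counts(events, window_sec):
    """Count events per (window_start, key) using tumbling windows,
    i.e. fixed, non-overlapping windows of `window_sec` seconds."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_sec)   # align timestamp to window
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "click"), (5, "click"), (12, "view"), (13, "click")]
result = tumbling_counts(events, 10)
```

Spark adds what this sketch omits: incremental state kept across micro-batches and watermarks to bound that state for late-arriving events.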
Curriculum
Each part builds on the previous. Start with batch processing, optimize, add Delta Lake, deploy to production.
Technical Standards
Production patterns you'll implement across all four parts.
- Systematic optimization from 45 min to 5 min: broadcast joins, caching, and partition tuning
- Delta Lake ACID lakehouse with Kubernetes auto-scaling and real-time streaming
- Unified batch + streaming with Kafka integration and Prometheus monitoring
Environment Setup
Launch the Spark cluster locally with Docker Compose and submit your first job.
# Clone the project & generate e-commerce data
$ git clone https://github.com/aide-hub/shopstream-spark.git
$ cd shopstream-spark

# Launch Spark cluster + Kafka + Prometheus
$ docker-compose -f docker-compose.spark.yml up -d

# Submit first Spark job (Part 1: Batch ETL)
$ spark-submit --master spark://localhost:7077 \
    --conf spark.sql.shuffle.partitions=200 \
    jobs/batch_etl.py --input data/ --output output/
Tech Stack
- Apache Spark (batch + Structured Streaming)
- Delta Lake
- Apache Kafka
- Kubernetes
- Prometheus
- Docker Compose
- Python 3.11+
Prerequisites
- Python 3.11+ (functions, classes, list comprehensions)
- Basic SQL (SELECT, WHERE, GROUP BY)
- 16GB RAM laptop or cloud access (AWS/GCP/Azure)
- Git basics (clone, commit, push)
Related Learning Path
Level up your Apache Spark skills with the companion skill toolkit covering core concepts, performance tuning, and production deployment.
Spark Learning Path
What is Apache Spark?
/guide/what-is-apache-spark — complete reference guide
Ready to build production Spark pipelines?
Start with Part 1: Foundation — Batch Data Processing