ShopStream Spark Project
Step-by-Step Walkthrough: Build a Production Spark Data Pipeline
What You'll Build
In this walkthrough, you'll build a production-grade Spark pipeline for ShopStream, an e-commerce platform processing millions of transactions daily. You'll learn to:
- Set up Apache Spark locally with Docker
- Ingest data from multiple formats (CSV, JSON, Parquet)
- Transform data using DataFrame operations
- Perform product analytics and customer segmentation
- Write optimized Parquet output with partitioning
Prerequisites
Set Up Spark Environment
Estimated time: 20 min

1.1 Create Project Directory
Create a directory structure for your Spark project:
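One possible layout (the directory names here are illustrative, not prescribed by the walkthrough):

```shell
# Project root with folders for input data, notebooks, and pipeline output
mkdir -p shopstream-spark/{data,notebooks,output}
cd shopstream-spark
```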
1.2 Create docker-compose.yml
Create a Docker Compose file to run Spark with Jupyter:
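A minimal compose file along these lines works. The image tag matches the one shown in step 1.4; the port mappings correspond to Jupyter (8888) and the Spark UI (4040) used later, while the volume mounts are assumptions based on the project layout:

```yaml
services:
  spark:
    image: jupyter/pyspark-notebook:spark-3.5.0
    ports:
      - "8888:8888"   # Jupyter
      - "4040:4040"   # Spark UI
    volumes:
      - ./data:/home/jovyan/data
      - ./notebooks:/home/jovyan/notebooks
      - ./output:/home/jovyan/output
```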
1.3 Download Sample Data
Create sample e-commerce data files:
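A small script like the following produces files in the formats used later. The product columns match the schema shown in step 2.1; the order fields and all values are illustrative assumptions and won't exactly reproduce the sample aggregation output shown in step 3.2:

```python
import csv
import json
import os

os.makedirs("data", exist_ok=True)

# Products CSV — columns match the schema printed in step 2.1
products = [
    {"product_id": 1, "name": "Laptop", "category": "Electronics", "price": 1199.99, "stock": 25},
    {"product_id": 2, "name": "Headphones", "category": "Electronics", "price": 159.98, "stock": 100},
    {"product_id": 3, "name": "Desk Chair", "category": "Furniture", "price": 199.99, "stock": 40},
]
with open("data/products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=products[0].keys())
    writer.writeheader()
    writer.writerows(products)

# Orders as JSON Lines — one record per line, the layout Spark's JSON reader expects
orders = [
    {"order_id": 101, "customer_id": "C001", "product_id": 1, "quantity": 1, "order_date": "2024-01-15"},
    {"order_id": 102, "customer_id": "C002", "product_id": 2, "quantity": 1, "order_date": "2024-01-16"},
    {"order_id": 103, "customer_id": "C001", "product_id": 3, "quantity": 1, "order_date": "2024-01-17"},
]
with open("data/orders.json", "w") as f:
    for o in orders:
        f.write(json.dumps(o) + "\n")
```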
1.4 Start Spark Environment
Start the environment with `docker compose up -d`, then confirm the container is running with `docker ps`. You should see something like:

abc123def456 jupyter/pyspark-notebook:spark-3.5.0 Up 10 seconds
1.5 Access Jupyter Notebook
Get the Jupyter token from the container logs (run `docker logs <container-id>` and look for the URL containing `?token=`), then access the notebook:
Open your browser to: http://localhost:8888
1.6 Verify Spark Installation
Create a new notebook and test Spark:
Spark UI: http://localhost:4040
- If the container fails to start, ensure Docker has at least 4GB RAM allocated
- If port 8888 is busy, change the port mapping in docker-compose.yml
- On Windows, use PowerShell or WSL2 for commands
Build Multi-Format Data Ingestion
Estimated time: 30 min

2.1 Read CSV Data (Products)
Create a Spark job to read product data:
|-- product_id: integer
|-- name: string
|-- category: string
|-- price: double
|-- stock: integer
2.2 Read JSON Data (Orders)
2.3 Data Quality Validation
Add validation checks before processing:
Tip: Use inferSchema during development, but define explicit schemas for production to avoid type inference overhead.

Create DataFrame Transformations
Estimated time: 25 min

3.1 Join Orders with Products
3.2 Product Analytics Aggregation
Calculate product performance metrics:
+-------------+------------+---------+---------------+
|     category|total_orders|  revenue|avg_order_value|
+-------------+------------+---------+---------------+
|  Electronics|           2|  1359.97|         679.99|
|    Furniture|           1|   199.99|         199.99|
+-------------+------------+---------+---------------+
3.3 Customer Segmentation
Write Optimized Output & Test
Estimated time: 15 min

4.1 Write Partitioned Parquet
Save processed data with optimizations:
4.2 Verify Output
4.3 Check Spark UI
Monitor your job execution:
- Open http://localhost:4040 in your browser
- Click on the "Jobs" tab to see all executed jobs
- Click on a job to see the DAG visualization
- Check the "Storage" tab to see cached DataFrames
- Review the "Executors" tab for resource usage
✓ DAG should show stages: read → transform → write
✓ No tasks should be marked as "Failed"
4.4 Performance Testing
A query that filters on the partition column reads only the segment=VIP partition, skipping the other data entirely (partition pruning).
- Out of memory errors: Reduce data size or increase Docker memory
- Slow performance: Check Spark UI for data skew in partitions
- File not found: Verify paths are relative to the Jupyter working directory
Walkthrough Complete!
You've successfully built a production Spark pipeline with multi-format ingestion, transformations, and optimized output. You're ready for Part 2!