How to Run a Spark Job
Install PySpark, create a SparkSession, read your data into a DataFrame, apply transformations, and write the output. The entire flow runs locally in minutes — no cluster required for development.
Install PySpark
# Requires Java 11 or 17
pip install pyspark==3.5.0
# Verify installation
python -c "import pyspark; print(pyspark.__version__)"Install Java first if you don't have it: brew install openjdk@17 on macOS or the OpenJDK package on Linux.
Create a SparkSession
# job.py
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.appName('my_first_job')\
.master('local[*]') # use all local CPUs
.getOrCreate()local[*] uses all available CPU cores on your machine. For a cluster, replace with the master URL e.g. spark://host:7077.
Read your data
# Read Parquet (preferred — schema-aware, compressed)
df = spark.read.parquet('data/events/')
# Read CSV with header + inferred schema
df = spark.read.csv(
'data/orders.csv',
header=True,
inferSchema=True # slow on large files — use StructType in prod
)
df.printSchema()
df.show(5)For production, define an explicit StructType schema instead of inferSchema=True — it avoids a full file scan to infer types.
Transform with DataFrame API
from pyspark.sql.functions import col, sum, count, when
result = df\
.filter(col('status') == 'completed')\
.withColumn('is_large', when(col('amount') > 1000, True).otherwise(False))\
.groupBy('user_id', 'is_large')\
.agg(
count('*').alias('order_count'),
sum('amount').alias('total_spend')
)Spark operations are lazy — nothing runs until you call an action like .show(), .count(), or .write. This lets Spark optimize the full query plan before executing.
Write the output and run
# Write as Parquet, partitioned by date
result.write\
.mode('overwrite')\
.partitionBy('date')\
.parquet('output/user_spend/')
# Run locally
python job.py
# Submit to a Spark cluster
spark-submit --master spark://host:7077 \
--executor-memory 4g \
job.pyAfter running, check the Spark UI at http://localhost:4040 to see stage timings, task distribution, and shuffle stats.
When to Use spark-submit vs python
- •Use python job.py for local development and testing — fast feedback loop, no overhead
- •Use spark-submit when deploying to a Spark standalone cluster, YARN, or Kubernetes
- •Use Databricks or EMR jobs for managed cloud execution — no spark-submit needed
- •Always test locally first with a sample dataset before submitting to the cluster
Common Issues
Java not found error
Spark requires Java 11 or 17. Run "java -version" to check. Install OpenJDK and set JAVA_HOME environment variable.
Out of memory on Driver
You called .collect() or .toPandas() on too much data. Write to Parquet instead and read a small sample for inspection.
Slow job with 200 shuffle partitions
The default spark.sql.shuffle.partitions=200 creates too many small tasks for small datasets. For local testing, set it to 4–8: .config("spark.sql.shuffle.partitions", "8")
Permission denied writing to S3
Set AWS credentials via environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or use an IAM role if running on EC2/EMR.
FAQ
- How do I run a Spark job locally?
- Install pyspark with pip, create a Python file with a SparkSession using .master("local[*]"), and run it with "python job.py". Spark will use your local machine as a single-node cluster.
- What is spark-submit?
- spark-submit is the CLI tool for submitting Spark jobs to a cluster. You specify the master URL, resources, and your Python or JAR file. Example: spark-submit --master spark://host:7077 --executor-memory 4g job.py
- How do I read a CSV file in PySpark?
- Use spark.read.csv("path", header=True, inferSchema=True). For production, define an explicit StructType schema to avoid the full file scan that inferSchema triggers.