Skip to content

How to Run a Spark Job

Install PySpark, create a SparkSession, read your data into a DataFrame, apply transformations, and write the output. The entire flow runs locally in minutes — no cluster required for development.

1

Install PySpark

# Requires Java 11 or 17
pip install pyspark==3.5.0

# Verify installation
python -c "import pyspark; print(pyspark.__version__)"

Install Java first if you don't have it: brew install openjdk@17 on macOS or the OpenJDK package on Linux.

2

Create a SparkSession

# job.py
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName('my_first_job')\
    .master('local[*]')  # use all local CPUs
    .getOrCreate()

local[*] uses all available CPU cores on your machine. For a cluster, replace with the master URL e.g. spark://host:7077.

3

Read your data

# Read Parquet (preferred — schema-aware, compressed)
df = spark.read.parquet('data/events/')

# Read CSV with header + inferred schema
df = spark.read.csv(
    'data/orders.csv',
    header=True,
    inferSchema=True  # slow on large files — use StructType in prod
)

df.printSchema()
df.show(5)

For production, define an explicit StructType schema instead of inferSchema=True — it avoids a full file scan to infer types.

4

Transform with DataFrame API

from pyspark.sql.functions import col, sum, count, when

result = df\
    .filter(col('status') == 'completed')\
    .withColumn('is_large', when(col('amount') > 1000, True).otherwise(False))\
    .groupBy('user_id', 'is_large')\
    .agg(
        count('*').alias('order_count'),
        sum('amount').alias('total_spend')
    )

Spark operations are lazy — nothing runs until you call an action like .show(), .count(), or .write. This lets Spark optimize the full query plan before executing.

5

Write the output and run

# Write as Parquet, partitioned by date
result.write\
    .mode('overwrite')\
    .partitionBy('date')\
    .parquet('output/user_spend/')

# Run locally
python job.py

# Submit to a Spark cluster
spark-submit --master spark://host:7077 \
  --executor-memory 4g \
  job.py

After running, check the Spark UI at http://localhost:4040 to see stage timings, task distribution, and shuffle stats.

When to Use spark-submit vs python

  • Use python job.py for local development and testing — fast feedback loop, no overhead
  • Use spark-submit when deploying to a Spark standalone cluster, YARN, or Kubernetes
  • Use Databricks or EMR jobs for managed cloud execution — no spark-submit needed
  • Always test locally first with a sample dataset before submitting to the cluster

Common Issues

Java not found error

Spark requires Java 11 or 17. Run "java -version" to check. Install OpenJDK and set JAVA_HOME environment variable.

Out of memory on Driver

You called .collect() or .toPandas() on too much data. Write to Parquet instead and read a small sample for inspection.

Slow job with 200 shuffle partitions

The default spark.sql.shuffle.partitions=200 creates too many small tasks for small datasets. For local testing, set it to 4–8: .config("spark.sql.shuffle.partitions", "8")

Permission denied writing to S3

Set AWS credentials via environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or use an IAM role if running on EC2/EMR.

FAQ

How do I run a Spark job locally?
Install pyspark with pip, create a Python file with a SparkSession using .master("local[*]"), and run it with "python job.py". Spark will use your local machine as a single-node cluster.
What is spark-submit?
spark-submit is the CLI tool for submitting Spark jobs to a cluster. You specify the master URL, resources, and your Python or JAR file. Example: spark-submit --master spark://host:7077 --executor-memory 4g job.py
How do I read a CSV file in PySpark?
Use spark.read.csv("path", header=True, inferSchema=True). For production, define an explicit StructType schema to avoid the full file scan that inferSchema triggers.

Related

Press Cmd+K to open