Spark RDDs Explained: What They Are and How They Work
An RDD (Resilient Distributed Dataset) is Spark's low-level data abstraction — an immutable, partitioned collection of records distributed across a cluster. In modern Spark, you should use DataFrames instead for almost everything. RDDs still matter for understanding how Spark works and for edge cases requiring fine-grained control.
RDD vs DataFrame: The Same Job
```python
# RDD approach (legacy — avoid in new code)
rdd = spark.sparkContext.textFile('data.csv')
result = (
    rdd.filter(lambda line: 'purchase' in line)
       .map(lambda line: line.split(',')[2])
       .collect()
)
```
```python
# DataFrame approach (preferred — faster, readable)
from pyspark.sql.functions import col

df = spark.read.csv('data.csv', header=True)
result = (
    df.filter(col('event_type') == 'purchase')
      .select('amount')
      .collect()  # action added so both versions return results to the driver
)
```
Core RDD Concepts
Resilient: Fault Tolerant
Spark tracks the lineage of every RDD. If a partition is lost, it recomputes only that partition from the original data — no full restart needed.
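Lineage-based recovery can be illustrated with a toy sketch in plain Python. This is not Spark's implementation — just a minimal model in which each dataset remembers its parent and the transformation that produced it, so any single lost partition can be rebuilt on demand:

```python
# Toy illustration of lineage-based recovery (not Spark's actual code):
# each "RDD" records (parent, transformation), so one lost partition can
# be recomputed from the parent without touching the others.
class ToyRDD:
    def __init__(self, partitions, lineage=None):
        self.partitions = partitions     # list of lists of records
        self.lineage = lineage           # (parent, function) or None

    def map(self, fn):
        return ToyRDD([[fn(x) for x in p] for p in self.partitions],
                      lineage=(self, fn))

    def recompute_partition(self, i):
        # Rebuild exactly one partition from the parent's data.
        parent, fn = self.lineage
        return [fn(x) for x in parent.partitions[i]]

base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[1] = None                             # simulate losing partition 1
doubled.partitions[1] = doubled.recompute_partition(1)   # rebuild only that one
print(doubled.partitions)   # [[2, 4], [6, 8]]
```

Note that recovery only works because the parent data (or its own lineage back to stable storage) is still available — which is exactly why Spark keeps the lineage graph for every RDD.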
Distributed: Partitioned Across Nodes
Each RDD is split into partitions. Each partition lives on a different worker node and is processed in parallel by that node's executor.
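The idea of partitioned, parallel processing can be sketched in plain Python (an assumed analogy, not Spark code): split the records into partitions, then let a pool of workers process the partitions concurrently, much like executors each working on their own partition:

```python
# Toy analogy of partitioned parallel processing (not Spark code).
from concurrent.futures import ThreadPoolExecutor

def partition(records, n):
    # Distribute records round-robin into n partitions.
    parts = [[] for _ in range(n)]
    for i, r in enumerate(records):
        parts[i % n].append(r)
    return parts

def process(part):
    # The per-partition work each "executor" performs.
    return [x * 10 for x in part]

parts = partition(range(8), n=4)          # 4 partitions of 2 records each
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, parts))
print(results)   # [[0, 40], [10, 50], [20, 60], [30, 70]]
```

In real Spark the partitions live on different machines and the results stay distributed; here everything runs in one process purely to show the shape of the computation.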
Dataset: Immutable Collection
RDDs are immutable — transformations create new RDDs rather than modifying the original. This enables safe parallel execution and reliable lineage tracking.
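The same contract is easy to demonstrate with Python's own immutable types (an illustration, not Spark API): a transformation yields a new collection while the source stays intact, which is what keeps parent-to-child lineage valid.

```python
# Immutability sketch: the transformation produces a NEW collection;
# the source is never modified, so lineage back to it remains valid.
source = (1, 2, 3)                    # tuples are immutable in Python
transformed = tuple(x + 1 for x in source)

print(source)        # (1, 2, 3)  -- original untouched
print(transformed)   # (2, 3, 4)  -- new collection produced
```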
RDD vs DataFrame vs Dataset
| Feature | RDD | DataFrame | Dataset (Scala/Java) |
|---|---|---|---|
| Schema awareness | ✗ | ✓ | ✓ |
| Catalyst optimizer | ✗ | ✓ | ✓ |
| Type safety | Runtime only | Runtime only | Compile time |
| Performance | Baseline | Faster (optimized) | Faster (optimized) |
| Python support | ✓ | ✓ | ✗ |
| Use today? | Rarely | ✓ Primary API | Scala/Java only |
Common Mistakes
Writing new code with RDDs instead of DataFrames
DataFrames are Spark's primary API for data engineering. They're faster (Catalyst optimizer), more readable, and better supported. Use RDDs only when DataFrames genuinely can't solve your problem.
Using Python lambda functions with RDDs
Python lambdas on RDDs require serializing Python objects across the cluster (pickle), which is slow and fragile. DataFrames with built-in pyspark.sql.functions avoid this entirely.
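The fragility is easy to see with the standard library alone: plain `pickle` stores functions by importable name, and a lambda has none, so it cannot be pickled at all. (Spark works around this by shipping closures with a cloudpickle-style serializer, but that still adds per-record Python overhead that built-in DataFrame functions avoid entirely.)

```python
# Plain pickle cannot serialize a lambda: functions are pickled by
# importable name, and a lambda's name is just '<lambda>'.
import pickle

double = lambda x: x * 2
try:
    pickle.dumps(double)
    serialized = True
except Exception:
    serialized = False
print(serialized)   # False: the lambda could not be pickled
```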
Calling .collect() after each transformation for inspection
Use .take(5) or .show() to sample data during development. .collect() pulls the full dataset into driver memory — fine for small final results, catastrophic for inspecting a large dataset mid-pipeline.
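A plain-Python analogy (not the Spark API) shows why sampling is cheap: pulling a few items from a lazy source materializes almost nothing, whereas collecting everything builds the entire dataset in one process's memory.

```python
# Analogy: take a small sample from a lazy "dataset" without ever
# materializing the whole thing (like rdd.take(5) vs rdd.collect()).
from itertools import islice

def big_dataset():
    # Lazily yields 10 million records; nothing is built up front.
    for i in range(10_000_000):
        yield i

sample = list(islice(big_dataset(), 5))   # cheap: only 5 records produced
print(sample)   # [0, 1, 2, 3, 4]
# list(big_dataset()) would be the .collect() analogue: all 10M at once.
```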
FAQ
- What is an RDD in Apache Spark?
- An RDD (Resilient Distributed Dataset) is Spark's foundational data abstraction — an immutable, partitioned collection of records distributed across a cluster. In modern Spark, most engineers use DataFrames instead, which benefit from automatic query optimization.
- What is the difference between RDD and DataFrame in Spark?
- RDDs are unstructured collections with no schema knowledge. DataFrames have named columns and types, and are optimized by Spark's Catalyst engine — making them significantly faster for structured data operations.
- Should I use RDDs or DataFrames in 2026?
- Use DataFrames for virtually all data engineering work. Only use RDDs for unstructured data that doesn't fit a schema, fine-grained partitioning control, or maintaining legacy Spark 1.x code.
- What does "resilient" mean in RDD?
- "Resilient" means fault-tolerant. Spark tracks the lineage of every RDD. If a partition is lost due to node failure, Spark recomputes only that partition from the original data without rerunning the entire job.