ADR-004: Orchestration is Ray + Airflow, not Spark / Dask / single-node cron | LLM Ingestion Pipeline

Context

The pipeline has two distinct orchestration concerns:

DAG-shape scheduling — daily eval runs, weekly ingest re-crawls, model retrain triggers. Needs cron / interval scheduling, retries, alerting, dependencies between tasks.
Fan-out compute — embedding generation across 100K+ docs, distributed dedup at scale, parallel inference for batch eval. Needs distributed task execution with GPU placement and shared state.

The classic options:

Spark / pyspark — distributed compute on JVM. Strong for batch data processing. Heavy footprint; JVM dependency in a Python pipeline.
Dask — Python-native distributed compute. Lighter than Spark. Strong DataFrame API but weaker GPU story.
Ray — Python-native distributed compute with first-class GPU support and @ray.remote(num_gpus=N) placement.
single-node cron — run everything on one beefy machine. Simplest but doesn't scale past ~10K docs/day.

For DAG scheduling separately:

Airflow — the de-facto industry default. DAGs as Python.
Prefect — lighter, modern. Smaller ecosystem.
Dagster — asset-first model. Different mental model.

Decision

Adopt Ray for fan-out compute + Airflow for DAG scheduling, with the boundary at "Airflow operators call Ray remote functions".

# dags/ingest_dag.py — Airflow DAG
from airflow import DAG
from airflow.decorators import task
from datetime import datetime, timedelta

with DAG(
    dag_id="weekly_ingest",
    schedule_interval="@weekly",
    start_date=datetime(2026, 5, 1),
    default_args={"retries": 3, "retry_delay": timedelta(minutes=15)},
) as dag:
    @task
    def crawl_seeds(): ...

    @task
    def dedup_corpus(crawl_path): ...

    @task
    def embed_docs(dedup_path):
        # Ray fan-out for the embedding step
        import ray
        @ray.remote(num_gpus=0.5)
        def embed_chunk(chunk): ...
        results = ray.get([embed_chunk.remote(c) for c in chunks])
        return results

    crawl_seeds() >> dedup_corpus() >> embed_docs()

# dags/eval_dag.py — daily regression eval
@task
def run_eval_suite(eval_dataset_path):
    import ray
    @ray.remote
    def judge_question(q, context): ...
    scores = ray.get([judge_question.remote(q, c) for q, c in pairs])
    return aggregate(scores)

Tradeoffs we accept

Lever	Spark	Dask	Ray (chosen)	Single-node cron
Python-native	No (JVM)	Yes	Yes	Yes
GPU placement	Manual (Spark RAPIDS)	Awkward	First-class (`num_gpus=N`)	None
DataFrame API	Strong	Strong	Weaker (Modin/Datasets)	pandas
Local-dev parity	JVM container	Pip install	Pip install	Native
Operational complexity	Spark cluster ops	Dask scheduler ops	Ray head + workers	None
Fault tolerance	Native (RDD lineage)	Task retries	Task retries + actor restarts	None
Memory model	Off-heap JVM	Distributed Pandas	Object store	Process memory
Cost (small cluster, 100K docs/day)	$400+/mo (cluster)	$200/mo	$200/mo	$50/mo

For DAG scheduling:

Lever	Airflow (chosen)	Prefect	Dagster
Industry adoption	Highest	Growing	Growing
Operator ecosystem	Largest	Medium	Medium
Hosted options	MWAA, Astronomer, Composer	Cloud	Cloud
Mental model	DAGs of tasks	Flow state machines	Asset-centric
Tutorial reproducibility	Local docker-compose works	Easy	Easy
Backfill story	First-class	First-class	First-class

We optimize for GPU-first fan-out (Ray) + industry-default scheduler (Airflow). The pipeline's bottleneck is GPU embedding generation at the 100K-doc step; Ray's num_gpus=0.5 placement (you can fractionally claim a GPU) is the deciding capability. Spark would work but the JVM dep and operational overhead aren't worth it for a predominantly Python pipeline.

Spark IS used in one optional Tier-2 path (distributed_dedup.py) because pyspark's DataFrame API is genuinely the cleanest for the specific case of distributed MinHash sharding. That's deliberate — ADR-002 documents the dedup decision; this ADR-004 documents the overall orchestration decision.

Consequences (positive)

Module 05 ships runnable Airflow DAGs that start with airflow standalone on a laptop. The dags/ingest_dag.py + dags/eval_dag.py files are the production shape — no toy code.
Ray's @ray.remote(num_gpus=0.5) lets a single g5.xlarge run two embedding jobs in parallel without GPU contention.
The Airflow + Ray boundary is clean: Airflow knows about retries and scheduling, Ray knows about GPU placement and fan-out. Neither needs to know about the other's concerns.
The CI gate (.github/workflows/eval.yml) calls the same run_eval_suite function the daily DAG calls — no parallel implementation.

Consequences (negative)

Two systems to operate. Airflow scheduler + Ray head node + Ray workers. ~$48/mo on EC2 for the Airflow scheduler vs $0 for cron. Mitigation: cost-model CSV documents this. Below 10K docs/day, single-node cron is the right answer; this ADR is for the post- scaling regime.
Ray learning curve. Async actor model is unfamiliar to teams used to plain function calls. Mitigation: Module 05 walks through @ray.remote step by step; Tier-1 path uses synchronous stubs.
Airflow cluster setup is non-trivial. Production Airflow needs a metadata DB, a scheduler, workers, a webserver. Mitigation: the tutorial uses airflow standalone (single binary, SQLite metadata) for local; Tier-2 documents MWAA/Astronomer for production.
No native data-quality gate primitives. Great Expectations or Soda would slot in here but aren't part of the orchestration ADR. Future ADR could add them.

Reversal plan

The orchestration interface is "task functions called by a scheduler". Replacement is bounded:

Spark for the entire compute layer — replace Ray's @ray.remote with Spark UDFs. Forces a JVM dep and SparkSession bootstrapping. ~2 engineer-weeks. Documented but not recommended.
Dask — replace @ray.remote with dask.delayed. ~3 engineer- days. Loses GPU placement; gains DataFrame API. Net negative for this project.
Prefect for the scheduler — replace Airflow DAGs with Prefect flows. ~1 engineer-week. Lighter footprint; smaller ecosystem.
Argo Workflows — k8s-native; replace Airflow if the team's already on k8s. ~1-2 engineer-weeks.

Estimated effort: 3 engineer-days (Dask) to 2 engineer-weeks (full Spark migration). All reversible.

References

dags/ingest_dag.py (weekly ingest DAG with Ray fan-out for embed)
dags/eval_dag.py (daily eval DAG with Ray fan-out for judge calls)
pipeline/cost_tracker.py (per-task cost tracking the DAG calls)
app/metrics.py (Prometheus metrics emitted by Airflow + Ray)
.github/workflows/eval.yml (CI gate that calls the same eval function)
ADR-002 (Spark IS used optionally in distributed_dedup.py — documented exception)