
Apache Airflow in Production

How industry leaders orchestrate data pipelines at massive scale

Why These Case Studies Matter

Apache Airflow has become the de facto standard for data pipeline orchestration, powering workflows at companies like Airbnb, Lyft, Twitter, and Spotify. These case studies reveal the real-world challenges, solutions, and lessons learned when operating Airflow at scale.

You'll see how these companies evolved from simple cron jobs to sophisticated orchestration platforms handling hundreds of thousands of tasks daily. More importantly, you'll learn the architectural patterns, best practices, and pitfalls to avoid when building your own data pipelines.

Learning Path: After reading these case studies, practice building your own Airflow pipeline with the StreamCart Airflow Project, then explore the step-by-step walkthrough.

Note on Metrics: These case studies are based on publicly available information from engineering blogs, conference talks, and open-source documentation. While we've verified core architectural patterns and technologies, some specific numbers (especially cost figures and exact scale metrics) are estimates for educational purposes. Where possible, we've updated unverified claims to reflect documented information or general ranges.

Featured Case Studies

Deep dives into how Airbnb and Lyft built production-grade Airflow platforms

Airbnb

Case Study #1

The Problem

Managing thousands of batch ETL jobs and ML workflows across multiple data sources became unmanageable with cron jobs. The company needed better visibility, dependency management, and failure recovery.
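To make the limitation concrete: cron fires each job on a fixed clock with no knowledge of what depends on what, while an orchestrator like Airflow executes tasks in dependency order and retries failures. A minimal sketch of that idea in plain Python (illustrative only, not Airbnb's code; the task names and `run_pipeline` helper are invented for this example):

```python
# Illustrative sketch: run tasks in dependency order with retries.
# This is what cron cannot do and what an orchestrator provides.

def run_pipeline(tasks, deps, max_retries=2):
    """Run tasks respecting deps (task -> set of upstream tasks).

    tasks: dict mapping task name to a no-arg callable.
    Returns the order in which tasks completed.
    """
    done, order = set(), []
    while len(done) < len(tasks):
        # A task is ready once all of its upstream tasks have finished.
        ready = [t for t in tasks if t not in done and deps.get(t, set()) <= done]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency")
        for t in ready:
            for attempt in range(max_retries + 1):
                try:
                    tasks[t]()          # execute the task callable
                    break
                except Exception:
                    if attempt == max_retries:
                        raise           # retries exhausted: surface the failure
            done.add(t)
            order.append(t)
    return order

# Usage: extract must finish before transform, transform before load.
ran = run_pipeline(
    {"extract": lambda: None, "transform": lambda: None, "load": lambda: None},
    {"transform": {"extract"}, "load": {"transform"}},
)
print(ran)  # ['extract', 'transform', 'load']
```

An Airflow DAG expresses the same dependency graph declaratively (`extract >> transform >> load`) and adds the visibility and per-task retry policies the cron setup lacked.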

Scale

Daily Tasks: 100,000+
DAGs: 2,000+
Data Processed: 10+ PB/day
Team Size: 500+ users
Clusters: 3 (prod/staging/dev)
Infrastructure: AWS + Kubernetes

Lyft

Case Study #2

The Problem

Managing complex data workflows for ride pricing, driver matching, and fraud detection required real-time orchestration across hundreds of microservices. Legacy cron-based system couldn't handle dependencies or provide observability.
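The observability gap shows up most clearly around latency SLAs: a cron-based system has no built-in way to flag a run that finished late. A minimal sketch of an SLA check, assuming the <15-minute figure cited below (illustrative only, not Lyft's code; the task names and `breached_sla` helper are invented for this example):

```python
from datetime import datetime, timedelta

# Illustrative sketch: flag task runs whose end-to-end latency
# exceeded a 15-minute SLA.
SLA = timedelta(minutes=15)

def breached_sla(runs, sla=SLA):
    """Return task names whose latency exceeded the SLA.

    runs: iterable of (task_name, scheduled_at, finished_at) tuples.
    """
    return [name for name, scheduled, finished in runs
            if finished - scheduled > sla]

runs = [
    ("price_update", datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 0, 5)),
    ("fraud_scores", datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 0, 20)),
]
print(breached_sla(runs))  # ['fraud_scores']
```

In Airflow this kind of check is built in: tasks can declare an `sla`, and the scheduler records and surfaces misses rather than requiring an external script.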

Scale

Tasks Executed: 200,000+/day
Active DAGs: 1,500+
Data Volume: 5 PB/day
Latency SLA: <15 min
ML Models: 300+ retrained daily
Infrastructure: GCP + Kubernetes