
Why We Migrated from Airflow to Kubernetes-Native Orchestration

Jordan Reeves · Mar 18, 2026 · 8 min read


The Breaking Point

Our Airflow setup served us well for the first two years. A managed Cloud Composer instance, 400 DAGs, and a team of 12 data engineers. Then the numbers started telling a different story.

Metric              | Airflow (Cloud Composer) | After Migration (Argo)
DAG parse time      | 90–120 seconds           | Eliminated
Scheduler RAM       | 8 GB                     | ~200 MB
Task start latency  | 45–90 seconds            | 5–15 seconds
Blast radius        | Entire scheduler         | Single workflow
Infrastructure cost | ~$4,200/month            | ~$380/month

A single misconfigured DAG could bring down the entire scheduler. That was the final straw.

The Architecture Problem

Airflow vs Argo Workflows — Architecture Comparison

Airflow (Cloud Composer):

  • Webserver
  • Scheduler (8 GB RAM)
  • Celery Workers (×N)
  • Metadata DB (Postgres)
  • Redis / RabbitMQ
  • KubernetesPodOperator → Pod

Six separate services to operate.

Argo Workflows (K8s-Native):

  • Kubernetes API Server
  • Argo Controller (~200 MB)
  • WorkflowTemplates (CRDs in Git)
  • Workflow Pods (auto-provisioned)
  • No separate metadata DB, Redis, or webserver

One controller plus Kubernetes, and you're done.

We were already using KubernetesPodOperator for every meaningful task — every task was a pod. Airflow had become a thin, expensive scheduling wrapper around Kubernetes, maintaining its own database, scheduler process, web server, and message broker for no incremental value.

Argo Workflows runs entirely as Kubernetes CRDs. No separate database, no scheduler process — the Kubernetes control plane is the orchestration layer.
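
Concretely, a run is just another Kubernetes object. A minimal sketch of a submission, assuming the transform-events WorkflowTemplate shown in full in the next section (the date value is illustrative):

yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: transform-events-   # Argo appends a unique suffix per run
  namespace: data-pipelines
spec:
  workflowTemplateRef:
    name: transform-events          # the template, stored as a CRD
  arguments:
    parameters:
      - name: date
        value: "2026-03-18"         # illustrative value

Running kubectl create -f on that manifest is the entire submission path: a new object for the controller to reconcile, not an RPC to a scheduler process.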

Side-by-Side: Airflow vs Argo

The same pipeline task in both systems shows exactly where the complexity lives.

Airflow KubernetesPodOperator:

python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

transform_task = KubernetesPodOperator(
    task_id="transform_events",
    namespace="data-pipelines",
    image="my-registry/transform:v1.2.3",
    cmds=["python", "transform.py"],
    arguments=["--date={{ ds }}"],
    env_vars={"SNOWFLAKE_CONN": "{{ var.value.snowflake_conn }}"},
    resources=k8s.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "1Gi"},
        limits={"cpu": "2", "memory": "4Gi"},
    ),
    is_delete_operator_pod=True,
    dag=dag,
)

Equivalent Argo WorkflowTemplate (pure YAML, lives in Git):

yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: transform-events
  namespace: data-pipelines
spec:
  entrypoint: transform
  arguments:
    parameters:
      - name: date
  templates:
    - name: transform
      container:
        image: my-registry/transform:v1.2.3
        command: [python, transform.py]
        args: ["--date={{workflow.parameters.date}}"]
        env:
          - name: SNOWFLAKE_CONN
            valueFrom:
              secretKeyRef:
                name: snowflake-secret
                key: connection_string
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: "2"
            memory: 4Gi

The Argo version is more verbose, but it's a plain Kubernetes manifest: Git-versioned, deployed via ArgoCD, no UI-managed variables, no Airflow metadata DB.
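
Scheduling moves out of Python too: what schedule_interval did in a DAG file becomes a CronWorkflow CRD referencing the same template. A sketch; the cron expression and concurrency policy here are our assumptions, not part of the original pipeline:

yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: transform-events-daily
  namespace: data-pipelines
spec:
  schedule: "0 6 * * *"        # assumption: daily at 06:00 UTC
  concurrencyPolicy: Forbid    # don't stack runs if one is slow
  workflowSpec:
    workflowTemplateRef:
      name: transform-events
    arguments:
      parameters:
        - name: date
          # Argo exposes the run's creation timestamp with strftime-style fields
          value: "{{workflow.creationTimestamp.Y}}-{{workflow.creationTimestamp.m}}-{{workflow.creationTimestamp.d}}"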

The Sensor Problem

The hardest migration challenge: Airflow sensors have no direct Argo equivalent. A sensor occupies a worker slot while it waits for an external condition; in Argo, you compose the same behavior from init containers.

Waiting for an upstream S3 partition in Argo:

yaml
templates:
  - name: wait-for-upstream
    initContainers:
      - name: poll-s3
        image: amazon/aws-cli:latest
        command: [sh, -c]
        args:
          - |
            until aws s3 ls s3://data-lake/events/{{workflow.parameters.date}}/; do
              echo "Partition not ready — retrying in 30s"
              sleep 30
            done
    container:
      image: my-registry/process:latest
      command: [python, process.py]
      args: ["--date={{workflow.parameters.date}}"]

More code than a sensor — but it's a standard Kubernetes init container: observable with kubectl, debuggable with logs, no Airflow scheduler holding a slot open.
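
One caveat: as written, the until loop polls forever if the partition never lands. Argo's activeDeadlineSeconds on the template bounds it, which is the closest analogue to a sensor timeout. A sketch, with the one-hour budget being our assumption:

yaml
templates:
  - name: wait-for-upstream
    # Assumed one-hour budget: the step fails instead of polling forever,
    # roughly what timeout= does on an Airflow sensor.
    activeDeadlineSeconds: 3600
    # ...initContainers and container unchanged from the example above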

The 60-Day Migration Protocol

1. Deploy Argo (Weeks 1–2)
2. New pipelines → Argo (Weeks 3–6)
3. Migrate existing pipelines (Weeks 7–10)
4. Decommission Airflow (Weeks 11–12)

Full cutover: Day 90. Both systems ran in parallel during Weeks 3–12 — zero forced cutover risk.

We never did a hard cutover. Running both systems in parallel for 60 days meant zero forced migration risk. The critical rule: migrate pipelines in reverse order of criticality — start with the lowest-stakes jobs first, build operational confidence, then touch revenue-critical pipelines.

Triggering Argo workflows programmatically:

python
import requests

def trigger_argo_workflow(date: str, namespace: str = "data-pipelines") -> str:
    """Submit an Argo workflow and return the workflow name."""
    payload = {
        "resourceKind": "WorkflowTemplate",
        "resourceName": "transform-events",
        "submitOptions": {
            "parameters": [f"date={date}"]
        }
    }
    resp = requests.post(
        f"https://argo.internal/api/v1/workflows/{namespace}/submit",
        json=payload,
        headers={"Authorization": f"Bearer {get_argo_token()}"},
        timeout=30,  # don't hang the caller if the Argo server is unreachable
    )
    resp.raise_for_status()
    return resp.json()["metadata"]["name"]
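
During the parallel-run window, one pattern this helper enables is letting a legacy DAG delegate its work to Argo. A sketch, assuming the standard PythonOperator and the dag object from the earlier Airflow example; note it's fire-and-forget, so Airflow reports success once the submission is accepted, not when the workflow finishes:

python
from airflow.operators.python import PythonOperator

# Hypothetical bridge task for the parallel-run window: the legacy DAG
# submits the Argo workflow instead of launching the pod itself.
transform_via_argo = PythonOperator(
    task_id="transform_events_via_argo",
    python_callable=trigger_argo_workflow,
    op_kwargs={"date": "{{ ds }}"},  # op_kwargs is templated, so {{ ds }} still renders
    dag=dag,
)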

What We Gained

  • DAG parse time: eliminated — WorkflowTemplates are CRDs, loaded at submit time
  • Scheduler RAM: 8 GB → ~200 MB (Argo controller is lightweight)
  • Blast radius: a broken template affects only that workflow, never the control plane
  • GitOps native: all workflows in Git, deployed via ArgoCD, full audit trail
  • Prometheus metrics built-in: workflow duration, failure rate, pod resource usage

What We Lost

  • The Airflow UI is genuinely better for non-engineers browsing pipeline history
  • XCom replacement: artifact passing goes through S3 — more config, more cost (a sketch follows this list)
  • Smaller ecosystem: fewer pre-built provider integrations than Airflow's 80+ providers
  • Steeper on-ramp: engineers need solid Kubernetes knowledge to debug workflow pods
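
To put the XCom point in context, here's a sketch of what artifact passing looks like in Argo, assuming a cluster-level S3 artifact repository is already configured; the extract step and its image are illustrative:

yaml
templates:
  - name: pipeline
    steps:
      - - name: extract
          template: extract
      - - name: transform
          template: transform
          arguments:
            artifacts:
              - name: events
                from: "{{steps.extract.outputs.artifacts.events}}"
  - name: extract
    container:
      image: my-registry/extract:latest   # illustrative image
      command: [python, extract.py]       # writes /tmp/events.parquet
    outputs:
      artifacts:
        - name: events
          path: /tmp/events.parquet       # uploaded to the S3 repository
  - name: transform
    inputs:
      artifacts:
        - name: events
          path: /tmp/events.parquet       # downloaded from S3 before start
    container:
      image: my-registry/transform:v1.2.3
      command: [python, transform.py]

Every handoff shuttles through the artifact repository, which is exactly where the extra configuration and S3 cost come from.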

Is This Right for Your Team?

Profile                                       | Recommendation
< 100 DAGs, mixed-technical stakeholders      | Stay on Airflow
Already deep in K8s, hitting scheduler limits | Migrate to Argo Workflows
Need GitOps-native, Git-as-source-of-truth    | Argo is the best fit
Heavy sensor usage, non-engineer users        | Airflow + KubernetesExecutor
New greenfield project on Kubernetes          | Start with Argo from day one

Build Your Kubernetes Foundation

The biggest barrier to this migration isn't the Argo YAML; it's Kubernetes itself. Pod scheduling, resource limits, RBAC, namespaces, persistent volumes, and debugging pods under failure are what separate engineers who execute this smoothly from those who hit walls.

Our Kubernetes for Data Engineers module covers exactly this: KubernetesExecutor setup, deploying Airflow on K8s, resource quotas, namespace isolation, secrets management, and GitOps deployment patterns with ArgoCD — everything you need to run this migration confidently in production.

Start the Kubernetes for Data Engineers module
