
Why We Migrated from Airflow to Kubernetes-Native Orchestration

Jordan Reeves · Mar 18, 2026 · 8 min read


The Breaking Point

Our Airflow setup served us well for the first two years. A managed Cloud Composer instance, 400 DAGs, and a team of 12 data engineers. Then the numbers started telling a different story.

Metric              | Airflow (Cloud Composer) | After Migration (Argo)
DAG parse time      | 90–120 seconds           | Eliminated
Scheduler RAM       | 8 GB                     | ~200 MB
Task start latency  | 45–90 seconds            | 5–15 seconds
Blast radius        | Entire scheduler         | Single workflow
Infrastructure cost | ~$4,200/month            | ~$380/month

A single misconfigured DAG could bring down the entire scheduler. That was the final straw.

The Architecture Problem

Airflow vs Argo Workflows — Architecture Comparison

Airflow (Cloud Composer):

  • Webserver
  • Scheduler (8 GB RAM)
  • Celery Workers (×N)
  • Metadata DB (Postgres)
  • Redis / RabbitMQ
  • KubernetesPodOperator → Pod

Six separate services to operate.

Argo Workflows (K8s-Native):

  • Kubernetes API Server
  • Argo Controller (~200 MB)
  • WorkflowTemplates (CRDs in Git)
  • Workflow Pods (auto-provisioned)
  • No separate metadata DB, Redis, or webserver

One controller plus Kubernetes, and you're done.

We were already using KubernetesPodOperator for every meaningful task — every task was a pod. Airflow had become a thin, expensive scheduling wrapper around Kubernetes, maintaining its own database, scheduler process, web server, and message broker for no incremental value.

Argo Workflows runs entirely as Kubernetes CRDs. No separate database, no scheduler process — the Kubernetes control plane is the orchestration layer.
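
Concretely, a run is just another Kubernetes object. A minimal sketch of a submission, assuming the transform-events WorkflowTemplate shown in full in the next section (the date value is illustrative):

yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: transform-events-   # Argo appends a unique suffix per run
  namespace: data-pipelines
spec:
  workflowTemplateRef:
    name: transform-events          # the template, stored as a CRD
  arguments:
    parameters:
      - name: date
        value: "2026-03-18"         # illustrative value

Running kubectl create -f on that manifest is the entire submission path: a new object for the controller to reconcile, not an RPC to a scheduler process.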

Side-by-Side: Airflow vs Argo

The same pipeline task in both systems shows exactly where the complexity lives.

Airflow KubernetesPodOperator:

python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

transform_task = KubernetesPodOperator(
    task_id="transform_events",
    namespace="data-pipelines",
    image="my-registry/transform:v1.2.3",
    cmds=["python", "transform.py"],
    arguments=["--date={{ ds }}"],
    env_vars={"SNOWFLAKE_CONN": "{{ var.value.snowflake_conn }}"},
    resources=k8s.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "1Gi"},
        limits={"cpu": "2", "memory": "4Gi"},
    ),
    is_delete_operator_pod=True,
    dag=dag,
)

Equivalent Argo WorkflowTemplate (pure YAML, lives in Git):

yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: transform-events
  namespace: data-pipelines
spec:
  entrypoint: transform
  arguments:
    parameters:
      - name: date
  templates:
    - name: transform
      container:
        image: my-registry/transform:v1.2.3
        command: [python, transform.py]
        args: ["--date={{workflow.parameters.date}}"]
        env:
          - name: SNOWFLAKE_CONN
            valueFrom:
              secretKeyRef:
                name: snowflake-secret
                key: connection_string
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: "2"
            memory: 4Gi

The Argo version is more verbose, but it's a plain Kubernetes manifest: Git-versioned, deployed via ArgoCD, no UI-managed variables, no Airflow metadata DB.
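
Scheduling moves out of Python too: what schedule_interval did in a DAG file becomes a CronWorkflow CRD referencing the same template. A sketch; the cron expression and concurrency policy here are our assumptions, not part of the original pipeline:

yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: transform-events-daily
  namespace: data-pipelines
spec:
  schedule: "0 6 * * *"        # assumption: daily at 06:00 UTC
  concurrencyPolicy: Forbid    # don't stack runs if one is slow
  workflowSpec:
    workflowTemplateRef:
      name: transform-events
    arguments:
      parameters:
        - name: date
          # Argo exposes the run's creation timestamp with strftime-style fields
          value: "{{workflow.creationTimestamp.Y}}-{{workflow.creationTimestamp.m}}-{{workflow.creationTimestamp.d}}"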

The Sensor Problem

The hardest migration challenge: Airflow sensors have no direct Argo equivalent. A sensor occupies a worker slot while it waits for an external condition; in Argo, you compose the same behavior from init containers.

Waiting for an upstream S3 partition in Argo:

yaml
templates:
  - name: wait-for-upstream
    initContainers:
      - name: poll-s3
        image: amazon/aws-cli:latest
        command: [sh, -c]
        args:
          - |
            until aws s3 ls s3://data-lake/events/{{workflow.parameters.date}}/; do
              echo "Partition not ready — retrying in 30s"
              sleep 30
            done
    container:
      image: my-registry/process:latest
      command: [python, process.py]
      args: ["--date={{workflow.parameters.date}}"]

More code than a sensor — but it's a standard Kubernetes init container: observable with kubectl, debuggable with logs, no Airflow scheduler holding a slot open.
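
One caveat: as written, the until loop polls forever if the partition never lands. Argo's activeDeadlineSeconds on the template bounds it, which is the closest analogue to a sensor timeout. A sketch, with the one-hour budget being our assumption:

yaml
templates:
  - name: wait-for-upstream
    # Assumed one-hour budget: the step fails instead of polling forever,
    # roughly what timeout= does on an Airflow sensor.
    activeDeadlineSeconds: 3600
    # ...initContainers and container unchanged from the example above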

The 60-Day Migration Protocol

1. Deploy Argo (Weeks 1–2)
2. New pipelines → Argo (Weeks 3–6)
3. Migrate existing pipelines (Weeks 7–10)
4. Decommission Airflow (Weeks 11–12)

Full cutover: Day 90. Both systems ran in parallel during Weeks 3–12 — zero forced cutover risk.

We never did a hard cutover. Running both systems in parallel for 60 days meant zero forced migration risk. The critical rule: migrate pipelines in reverse order of criticality — start with the lowest-stakes jobs first, build operational confidence, then touch revenue-critical pipelines.

Triggering Argo workflows programmatically:

python
import requests

def trigger_argo_workflow(date: str, namespace: str = "data-pipelines") -> str:
    """Submit an Argo workflow and return the workflow name."""
    payload = {
        "resourceKind": "WorkflowTemplate",
        "resourceName": "transform-events",
        "submitOptions": {
            "parameters": [f"date={date}"]
        }
    }
    resp = requests.post(
        f"https://argo.internal/api/v1/workflows/{namespace}/submit",
        json=payload,
        headers={"Authorization": f"Bearer {get_argo_token()}"},
        timeout=30,  # don't hang the caller if the Argo server is unreachable
    )
    resp.raise_for_status()
    return resp.json()["metadata"]["name"]
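
During the parallel-run window, one pattern this helper enables is letting a legacy DAG delegate its work to Argo. A sketch, assuming the standard PythonOperator and the dag object from the earlier Airflow example; note it's fire-and-forget, so Airflow reports success once the submission is accepted, not when the workflow finishes:

python
from airflow.operators.python import PythonOperator

# Hypothetical bridge task for the parallel-run window: the legacy DAG
# submits the Argo workflow instead of launching the pod itself.
transform_via_argo = PythonOperator(
    task_id="transform_events_via_argo",
    python_callable=trigger_argo_workflow,
    op_kwargs={"date": "{{ ds }}"},  # op_kwargs is templated, so {{ ds }} still renders
    dag=dag,
)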

What We Gained

  • DAG parse time: eliminated — WorkflowTemplates are CRDs, loaded at submit time
  • Scheduler RAM: 8 GB → ~200 MB (Argo controller is lightweight)
  • Blast radius: a broken template affects only that workflow, never the control plane
  • GitOps native: all workflows in Git, deployed via ArgoCD, full audit trail
  • Prometheus metrics built-in: workflow duration, failure rate, pod resource usage

What We Lost

  • The Airflow UI is genuinely better for non-engineers browsing pipeline history
  • XCom replacement: artifact passing goes through S3 — more config, more cost (a sketch follows this list)
  • Smaller ecosystem: fewer pre-built provider integrations than Airflow's 80+ providers
  • Steeper on-ramp: engineers need solid Kubernetes knowledge to debug workflow pods
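
To put the XCom point in context, here's a sketch of what artifact passing looks like in Argo, assuming a cluster-level S3 artifact repository is already configured; the extract step and its image are illustrative:

yaml
templates:
  - name: pipeline
    steps:
      - - name: extract
          template: extract
      - - name: transform
          template: transform
          arguments:
            artifacts:
              - name: events
                from: "{{steps.extract.outputs.artifacts.events}}"
  - name: extract
    container:
      image: my-registry/extract:latest   # illustrative image
      command: [python, extract.py]       # writes /tmp/events.parquet
    outputs:
      artifacts:
        - name: events
          path: /tmp/events.parquet       # uploaded to the S3 repository
  - name: transform
    inputs:
      artifacts:
        - name: events
          path: /tmp/events.parquet       # downloaded from S3 before start
    container:
      image: my-registry/transform:v1.2.3
      command: [python, transform.py]

Every handoff shuttles through the artifact repository, which is exactly where the extra configuration and S3 cost come from.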

Is This Right for Your Team?

Profile                                       | Recommendation
< 100 DAGs, mixed-technical stakeholders      | Stay on Airflow
Already deep in K8s, hitting scheduler limits | Migrate to Argo Workflows
Need GitOps-native, Git-as-source-of-truth    | Argo is the best fit
Heavy sensor usage, non-engineer users        | Airflow + KubernetesExecutor
New greenfield project on Kubernetes          | Start with Argo from day one

Build Your Kubernetes Foundation

The biggest barrier to this migration isn't the Argo YAML; it's Kubernetes itself. Pod scheduling, resource limits, RBAC, namespaces, persistent volumes, and debugging pods under failure are what separate engineers who execute this smoothly from those who hit walls.

Our Kubernetes for Data Engineers module covers exactly this: KubernetesExecutor setup, deploying Airflow on K8s, resource quotas, namespace isolation, secrets management, and GitOps deployment patterns with ArgoCD — everything you need to run this migration confidently in production.

Start the Kubernetes for Data Engineers module
