Why We Migrated from Airflow to Kubernetes-Native Orchestration
The Breaking Point
Our Airflow setup served us well for the first two years. A managed Cloud Composer instance, 400 DAGs, and a team of 12 data engineers. Then the numbers started telling a different story.
| Metric | Airflow (Cloud Composer) | After Migration (Argo) |
|---|---|---|
| DAG parse time | 90–120 seconds | Eliminated |
| Scheduler RAM | 8 GB | ~200 MB |
| Task start latency | 45–90 seconds | 5–15 seconds |
| Blast radius | Entire scheduler | Single workflow |
| Infrastructure cost | ~$4,200/month | ~$380/month |
A single misconfigured DAG could bring down the entire scheduler. That was the final straw.
The Architecture Problem
| | Airflow (Cloud Composer) | Argo Workflows (K8s-native) |
|---|---|---|
| Operational footprint | 6 separate services to operate | 1 controller, plus Kubernetes itself |
We were already using KubernetesPodOperator for every meaningful task — every task was a pod. Airflow had become a thin, expensive scheduling wrapper around Kubernetes, maintaining its own database, scheduler process, web server, and message broker for no incremental value.
Argo Workflows runs entirely as Kubernetes CRDs. No separate database, no scheduler process — the Kubernetes control plane is the orchestration layer.
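To make "everything is a CRD" concrete, here is a minimal toy sketch (not one of our pipelines): the entire workflow is a single Kubernetes object.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-     # generateName gives each run a unique name
  namespace: data-pipelines
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [echo, "scheduled by the Kubernetes control plane"]

You submit it with kubectl create -f (generateName requires create rather than apply), and kubectl get workflows reads run state straight from the CRD status; there is no metadata database to query.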
Side-by-Side: Airflow vs Argo
The same pipeline task in both systems shows exactly where the complexity lives.
Airflow KubernetesPodOperator:
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s
transform_task = KubernetesPodOperator(
    task_id="transform_events",
    namespace="data-pipelines",
    image="my-registry/transform:v1.2.3",
    cmds=["python", "transform.py"],
    arguments=["--date={{ ds }}"],
    env_vars={"SNOWFLAKE_CONN": "{{ var.value.snowflake_conn }}"},
    resources=k8s.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "1Gi"},
        limits={"cpu": "2", "memory": "4Gi"},
    ),
    is_delete_operator_pod=True,
    dag=dag,
)

Equivalent Argo WorkflowTemplate (pure YAML, lives in Git):
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: transform-events
  namespace: data-pipelines
spec:
  entrypoint: transform
  arguments:
    parameters:
      - name: date
  templates:
    - name: transform
      container:
        image: my-registry/transform:v1.2.3
        command: [python, transform.py]
        args: ["--date={{workflow.parameters.date}}"]
        env:
          - name: SNOWFLAKE_CONN
            valueFrom:
              secretKeyRef:
                name: snowflake-secret
                key: connection_string
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: "2"
            memory: 4Gi

The Argo version is more verbose but is a plain Kubernetes manifest: Git-versioned, deployed via ArgoCD, no UI-managed variables, no Airflow metadata DB.
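And what replaces Airflow's core job, time-based scheduling? That too is just another CRD. A minimal sketch assuming the WorkflowTemplate above (the name and schedule are illustrative, and the scheduledTime variable needs a reasonably recent Argo release):

apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: transform-events-nightly    # illustrative name
  namespace: data-pipelines
spec:
  schedule: "0 2 * * *"             # plain cron syntax, reconciled by the Argo controller
  concurrencyPolicy: Forbid         # skip a run if the previous one is still in flight
  workflowSpec:
    workflowTemplateRef:
      name: transform-events        # the template defined above
    arguments:
      parameters:
        - name: date
          value: "{{workflow.scheduledTime}}"  # RFC3339 scheduled time; reformat to a date inside the task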
The Sensor Problem
The hardest migration challenge: Airflow sensors have no direct Argo equivalent. Sensors block a slot waiting for an external condition. In Argo you compose them from init containers.
Waiting for an upstream S3 partition in Argo:
templates:
  - name: wait-for-upstream
    initContainers:
      - name: poll-s3
        image: amazon/aws-cli:latest
        command: [sh, -c]
        args:
          - |
            until aws s3 ls s3://data-lake/events/{{workflow.parameters.date}}/; do
              echo "Partition not ready — retrying in 30s"
              sleep 30
            done
    container:
      image: my-registry/process:latest
      command: [python, process.py]
      args: ["--date={{workflow.parameters.date}}"]

More code than a sensor — but it's a standard Kubernetes init container: observable with kubectl, debuggable with logs, no Airflow scheduler holding a slot open.
The 60-Day Migration Protocol
[Timeline: phased rollout across Weeks 1–2, 3–6, 7–10, and 11–12, plus a Day-90 review; both systems ran in parallel during Weeks 3–12 with no forced cutover]
We never did a hard cutover. Running both systems in parallel for 60 days meant zero forced migration risk. The critical rule: migrate pipelines in reverse order of criticality — start with the lowest-stakes jobs first, build operational confidence, then touch revenue-critical pipelines.
Triggering Argo workflows programmatically:
import requests

def trigger_argo_workflow(date: str, namespace: str = "data-pipelines") -> str:
    """Submit an Argo workflow and return the workflow name."""
    payload = {
        "resourceKind": "WorkflowTemplate",
        "resourceName": "transform-events",
        "submitOptions": {
            "parameters": [f"date={date}"]
        },
    }
    resp = requests.post(
        f"https://argo.internal/api/v1/workflows/{namespace}/submit",
        json=payload,
        # get_argo_token() is our internal helper for fetching a service-account token
        headers={"Authorization": f"Bearer {get_argo_token()}"},
        timeout=30,  # never let a pipeline hang on an unresponsive API server
    )
    resp.raise_for_status()
    return resp.json()["metadata"]["name"]

What We Gained
- No more DAG parsing or scheduler babysitting: a ~200 MB controller instead of an 8 GB scheduler, and task start latency down from 45–90 seconds to 5–15.
- Failure isolation: a broken workflow takes down that workflow, not the platform.
- Roughly 90% lower infrastructure cost (~$4,200/month down to ~$380/month).
- GitOps-native pipelines: every workflow is a plain manifest, versioned and deployed like any other Kubernetes resource.

What We Lost

- Airflow's UI conveniences: variables and connections now live in Git and Kubernetes Secrets rather than a web form.
- Built-in sensors: waiting on external conditions has to be composed by hand, as in the init-container pattern above.
- Approachability for non-engineers: YAML and kubectl are a harder sell to mixed-technical stakeholders than the Airflow UI.
Is This Right for Your Team?
| Profile | Recommendation |
|---|---|
| < 100 DAGs, mixed-technical stakeholders | Stay on Airflow |
| Already deep in K8s, hitting scheduler limits | Migrate to Argo Workflows |
| Need GitOps-native, Git-as-source-of-truth | Argo is the best fit |
| Heavy sensor usage, non-engineer users | Airflow + KubernetesExecutor |
| New greenfield project on Kubernetes | Start with Argo from day one |
Build Your Kubernetes Foundation
The biggest barrier to this migration isn't the Argo YAML; it's Kubernetes itself. Pod scheduling, resource limits, RBAC, namespaces, persistent volumes, and debugging pods under failure are what separate engineers who execute this smoothly from those who hit walls.
Our Kubernetes for Data Engineers module covers exactly this: KubernetesExecutor setup, deploying Airflow on K8s, resource quotas, namespace isolation, secrets management, and GitOps deployment patterns with ArgoCD — everything you need to run this migration confidently in production.
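To give one concrete taste of those guardrails (names and numbers are illustrative, not our production values): a ResourceQuota bounds what all pipeline pods in a namespace can collectively claim, so one runaway workflow can't starve the cluster.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: pipeline-quota          # illustrative name
  namespace: data-pipelines
spec:
  hard:
    requests.cpu: "20"          # total CPU requests across every pod in the namespace
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "200"                 # hard cap on concurrent pods, i.e. concurrent tasks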
Ready to go deeper?
Explore our full curriculum — hands-on skill toolkits built for production data engineering.