How to Write an Airflow DAG
Install Airflow, create a Python file in your dags/ directory, define tasks with @task, set dependencies by calling tasks in order, then run the scheduler. The full pipeline takes under 10 minutes to stand up locally.
Install Airflow
# Install with constraints (recommended)
pip install 'apache-airflow==2.9.0' \
--constraint 'https://raw.githubusercontent.com/apache/airflow/constraints-2.9.0/constraints-3.11.txt'
# Initialize the metadata database
airflow db initUse the official constraints file to avoid dependency conflicts. For teams, use the Docker Compose quickstart instead.
Create your DAG file
# dags/daily_etl.py
from airflow.decorators import dag, task
from datetime import datetime
@dag(
dag_id='daily_etl',
schedule='@daily',
start_date=datetime(2024, 1, 1),
catchup=False # Important!
)
def daily_etl():
...
daily_etl() # Instantiate the DAGSet catchup=False unless you explicitly want Airflow to backfill all historical intervals from start_date.
Define tasks with @task
@dag(schedule='@daily', start_date=datetime(2024, 1, 1), catchup=False)
def daily_etl():
@task()
def extract() -> list:
return fetch_from_api() # returns a list of records
@task()
def transform(records: list) -> list:
return [clean(r) for r in records]
@task()
def load(records: list) -> None:
write_to_warehouse(records)
TaskFlow automatically serializes return values as XCom. Keep returned data small — use file paths for large datasets.
Set task dependencies
# Inside the DAG function body:
# Call tasks in order — Airflow infers the graph
raw = extract()
cleaned = transform(raw)
load(cleaned)
# Full DAG instantiation at module level
daily_etl()The dependency graph is inferred from the Python call order. extract → transform → load will run in sequence.
Run and monitor
# Start the scheduler (in one terminal)
airflow scheduler
# Start the webserver (in another terminal)
airflow webserver --port 8080
# Or trigger a DAG manually from CLI
airflow dags trigger daily_etlNavigate to http://localhost:8080, find your DAG, toggle it ON, and monitor task runs in the Grid view.
What's Happening
When you save a Python file to the dags/ folder, the Airflow scheduler automatically parses it every 30 seconds. It reads the DAG definition, computes the task dependency graph, and schedules task runs based on the schedule parameter. The metadata database stores each task instance's state (queued → running → success/failed) and the web UI reads from that same database to show you the grid and logs in real time.
When to Use Custom DAGs
- •You have multi-step workflows where step B depends on step A completing successfully
- •You need automatic retries with configurable delay on failure
- •You want full audit history — what ran, when, and what the output was
- •You're orchestrating dbt models, Spark jobs, or ML training pipelines
Common Issues
DAG not appearing in the UI
Check for Python syntax errors — the scheduler silently skips DAG files that fail to parse. Run `python dags/your_dag.py` locally to catch errors.
XCom size errors
You're returning too much data from a @task. Return a file path or S3 URI instead of a DataFrame or large list.
Tasks running out of expected order
Make sure you're calling tasks inside the DAG function body in the right sequence. The graph is built from the call order, not the function definition order.
Catchup running hundreds of old runs
You forgot catchup=False. Set it in the @dag decorator or in your airflow.cfg under [scheduler] catchup_by_default = False.
FAQ
- What is a DAG in Airflow?
- A DAG (Directed Acyclic Graph) is a Python file that defines a workflow. It contains tasks as nodes and dependencies as edges. Airflow's scheduler reads DAG files and triggers task runs based on the schedule you define.
- What is the TaskFlow API in Airflow?
- The TaskFlow API (introduced in Airflow 2.0) lets you define tasks using Python decorators (@dag, @task) instead of manually instantiating operator classes. It automatically handles XCom passing between tasks.
- How do I pass data between Airflow tasks?
- With the TaskFlow API, return a value from one @task function and pass it as an argument to the next. Airflow automatically stores the value in XCom. For large data, return a file path or S3 URI instead.