Airflow Orchestration Basics
Fundamentals of Airflow DAG design, scheduling, retries, and operational patterns.
DAG Design Principles
Building Reliable DAGs
Apache Airflow orchestrates workflows as directed acyclic graphs (DAGs). Each node is a task; edges define execution order. Reliable DAG design prioritizes idempotency, clarity, and operability over cleverness.
Idempotency
An idempotent task produces the same outcome whether it runs once or five times (assuming the same inputs). This matters because Airflow retries failed tasks automatically. Non-idempotent tasks—appending duplicate rows, sending duplicate emails, incrementing counters—cause data corruption on retry.
Design patterns for idempotency:
- Use merge/upsert instead of blind append for loads
- Partition by execution date and overwrite partitions
- Store processing checkpoints in metadata tables
- Make external side effects conditional on success markers
Example: Instead of INSERT INTO target SELECT * FROM staging, use MERGE INTO target USING staging ON target.id = staging.id WHEN MATCHED THEN UPDATE ....
Explicit Dependencies
Define dependencies explicitly with >> and << operators or set_upstream/set_downstream. Avoid implicit ordering through shared external sensors unless necessary.
Use task groups to organize related steps without flattening readability. A extract_load_transform group containing extract, validate, load, and transform tasks keeps the DAG graph navigable.
Retries, Timeouts, and SLAs
Configure defaults at the DAG level and override per task:
default_args = {
"retries": 2,
"retry_delay": timedelta(minutes=5),
"execution_timeout": timedelta(hours=2),
}
SLAs (sla parameter) trigger alerts when tasks exceed expected duration—useful for pipelines with downstream business deadlines.
Distinguish retryable failures (transient network blip) from non-retryable (SQL syntax error). Use retries=0 for tasks where retry is unsafe.
Documentation in DAG Metadata
Set doc_md on DAGs with purpose, owner, runbook link, and expected runtime. Future on-call engineers should understand impact without reading every task.
with DAG(
dag_id="fct_orders_daily",
description="Refresh orders mart for finance reporting",
doc_md=__doc__,
...
) as dag:
Anti-Patterns
God DAGs: One DAG with two hundred tasks and no groups. Split by domain or schedule.
External task sensors everywhere: Creates brittle cross-DAG coupling. Prefer data-aware scheduling or explicit dataset dependencies (Airflow 2.4+ datasets).
Dynamic start dates: Use fixed start_date in UTC; avoid datetime.now().
Key Takeaways
- Design every task to be safely retryable.
- Use task groups and explicit dependencies for maintainability.
- Configure retries, timeouts, and SLAs appropriate to business impact.
- Document DAG purpose and runbooks in metadata.
Reflection
What makes a DAG "production ready" on your team? Which of these principles is most often skipped?
Scheduling, Sensors, and Dependencies
Scheduling Fundamentals
Airflow schedules DAGs using cron expressions or the timedelta schedule. Understand the difference between execution_date (logical date of the run) and start_date (when scheduling begins)—confusion here causes off-by-one partition bugs.
Cron and Timetables
A daily DAG at 6 AM UTC:
schedule="0 6 * * *"
Use data intervals (Airflow 2.2+) consciously. For daily batch processing order data for January 15, the logical date is January 15 even if the job runs January 16 at 6 AM.
Align partition filters with logical date:
partition = "{{ ds }}" # 2024-01-15
Catchup and Backfill
catchup=True schedules all missed runs between start_date and now. Useful for historical loads; dangerous when enabled accidentally on new DAGs—it can enqueue hundreds of runs instantly.
For backfills, use the CLI airflow dags backfill with explicit date ranges and monitor cluster capacity.
Sensors: When and How
Sensors wait for external conditions: file landing in S3, partition appearing in Hive, another DAG completing.
Prefer deferrable sensors (Airflow 2.2+) to free worker slots while waiting. Set poke_interval and timeout to avoid infinite waits.
Replace file sensors with object storage event triggers where possible—event-driven is more efficient than polling.
Dataset Dependencies (Modern Airflow)
Airflow datasets let DAGs declare data dependencies explicitly:
with DAG(..., schedule=[orders_dataset], ...) as dag:
...
Producer DAGs update datasets; consumer DAGs schedule when inputs change. This replaces fragile external task sensor chains.
Coordinating Cross-Team Pipelines
When your mart depends on an upstream team's export:
- Document the contract: format, landing time, SLA
- Validate on arrival before heavy processing
- Alert upstream on validation failure with actionable messages
- Avoid hard coupling via sensors when SLAs slip frequently—design graceful degradation
Key Takeaways
- Align partition logic with logical execution dates.
- Control catchup and backfill to prevent runaway job volume.
- Prefer deferrable sensors and dataset dependencies over polling.
- Document cross-team data contracts with SLAs.
Reflection
Have you ever had an off-by-one date bug in a scheduled pipeline? What scheduling concept was misunderstood?
Production Operations and On-Call
Operating Airflow in Production
Moving DAGs to production requires more than passing tests—it requires observability, alerting, and runbooks that enable on-call response without heroics.
Monitoring and Alerting
Monitor at three levels:
Platform health: Scheduler heartbeats, executor queue depth, worker availability, metadata database connections.
DAG run status: Failed runs on Tier 1 DAGs page immediately. Use callbacks (on_failure_callback) to post to Slack or PagerDuty with context.
Business SLAs: Even successful DAG runs can miss freshness SLAs if runtime drifts. Track end-to-end latency from source landing to mart availability.
Integrate Airflow metrics with your observability stack (Datadog, Prometheus, CloudWatch). Dashboards should show failure rate trends, not just point-in-time status.
Structured Failure Callbacks
def alert_on_failure(context):
task_instance = context["task_instance"]
send_slack_message(
channel="#data-alerts",
text=f"Task {task_instance.task_id} failed in DAG {task_instance.dag_id}",
attachments=[{"text": str(context.get("exception"))}],
)
Include links to logs, runbook, and recent deploys in alert messages.
Runbooks and Incident Response
Every Tier 1 DAG needs a runbook covering:
- Business impact if delayed or failed
- Upstream dependencies and owners
- Common failure modes and fixes
- Escalation path
- Safe re-run instructions
During incidents, prefer clearing failed tasks and retrying over manual data fixes when tasks are idempotent.
Deployment and Versioning
Store DAGs in version control. Deploy via CI/CD with staging validation. Pin provider package versions to avoid breaking changes from upgrades.
Use feature flags or is_paused_upon_creation=True for new DAGs until validated.
Security and Secrets
Never embed credentials in DAG files. Use Airflow connections and secrets backends (AWS Secrets Manager, HashiCorp Vault). Restrict connection access by role.
Audit who can trigger DAGs and clear tasks in production—manual triggers bypass scheduling safeguards.
Key Takeaways
- Alert on platform health, DAG failures, and business SLAs separately.
- Failure callbacks should include context and runbook links.
- Maintain runbooks for Tier 1 DAGs with safe re-run procedures.
- Manage deployments and secrets through standard engineering practices.
Reflection
If your most critical DAG failed at 3 AM, would on-call know the business impact and first steps without waking up the author?