Airflow Orchestration Basics

Fundamentals of Airflow DAG design, scheduling, retries, and operational patterns.

Login to track progress
Intermediate55 min · 3 lessons

DAG Design Principles

Building Reliable DAGs

Apache Airflow orchestrates workflows as directed acyclic graphs (DAGs). Each node is a task; edges define execution order. Reliable DAG design prioritizes idempotency, clarity, and operability over cleverness.

Idempotency

An idempotent task produces the same outcome whether it runs once or five times (assuming the same inputs). This matters because Airflow retries failed tasks automatically. Non-idempotent tasks—appending duplicate rows, sending duplicate emails, incrementing counters—cause data corruption on retry.

Design patterns for idempotency:

  • Use merge/upsert instead of blind append for loads
  • Partition by execution date and overwrite partitions
  • Store processing checkpoints in metadata tables
  • Make external side effects conditional on success markers

Example: Instead of INSERT INTO target SELECT * FROM staging, use MERGE INTO target USING staging ON target.id = staging.id WHEN MATCHED THEN UPDATE ....

Explicit Dependencies

Define dependencies explicitly with >> and << operators or set_upstream/set_downstream. Avoid implicit ordering through shared external sensors unless necessary.

Use task groups to organize related steps without flattening readability. A extract_load_transform group containing extract, validate, load, and transform tasks keeps the DAG graph navigable.

Retries, Timeouts, and SLAs

Configure defaults at the DAG level and override per task:

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(hours=2),
}

SLAs (sla parameter) trigger alerts when tasks exceed expected duration—useful for pipelines with downstream business deadlines.

Distinguish retryable failures (transient network blip) from non-retryable (SQL syntax error). Use retries=0 for tasks where retry is unsafe.

Documentation in DAG Metadata

Set doc_md on DAGs with purpose, owner, runbook link, and expected runtime. Future on-call engineers should understand impact without reading every task.

with DAG(
    dag_id="fct_orders_daily",
    description="Refresh orders mart for finance reporting",
    doc_md=__doc__,
    ...
) as dag:

Anti-Patterns

God DAGs: One DAG with two hundred tasks and no groups. Split by domain or schedule.

External task sensors everywhere: Creates brittle cross-DAG coupling. Prefer data-aware scheduling or explicit dataset dependencies (Airflow 2.4+ datasets).

Dynamic start dates: Use fixed start_date in UTC; avoid datetime.now().

Key Takeaways

  • Design every task to be safely retryable.
  • Use task groups and explicit dependencies for maintainability.
  • Configure retries, timeouts, and SLAs appropriate to business impact.
  • Document DAG purpose and runbooks in metadata.

Reflection

What makes a DAG "production ready" on your team? Which of these principles is most often skipped?

Scheduling, Sensors, and Dependencies

Scheduling Fundamentals

Airflow schedules DAGs using cron expressions or the timedelta schedule. Understand the difference between execution_date (logical date of the run) and start_date (when scheduling begins)—confusion here causes off-by-one partition bugs.

Cron and Timetables

A daily DAG at 6 AM UTC:

schedule="0 6 * * *"

Use data intervals (Airflow 2.2+) consciously. For daily batch processing order data for January 15, the logical date is January 15 even if the job runs January 16 at 6 AM.

Align partition filters with logical date:

partition = "{{ ds }}"  # 2024-01-15

Catchup and Backfill

catchup=True schedules all missed runs between start_date and now. Useful for historical loads; dangerous when enabled accidentally on new DAGs—it can enqueue hundreds of runs instantly.

For backfills, use the CLI airflow dags backfill with explicit date ranges and monitor cluster capacity.

Sensors: When and How

Sensors wait for external conditions: file landing in S3, partition appearing in Hive, another DAG completing.

Prefer deferrable sensors (Airflow 2.2+) to free worker slots while waiting. Set poke_interval and timeout to avoid infinite waits.

Replace file sensors with object storage event triggers where possible—event-driven is more efficient than polling.

Dataset Dependencies (Modern Airflow)

Airflow datasets let DAGs declare data dependencies explicitly:

with DAG(..., schedule=[orders_dataset], ...) as dag:
    ...

Producer DAGs update datasets; consumer DAGs schedule when inputs change. This replaces fragile external task sensor chains.

Coordinating Cross-Team Pipelines

When your mart depends on an upstream team's export:

  1. Document the contract: format, landing time, SLA
  2. Validate on arrival before heavy processing
  3. Alert upstream on validation failure with actionable messages
  4. Avoid hard coupling via sensors when SLAs slip frequently—design graceful degradation

Key Takeaways

  • Align partition logic with logical execution dates.
  • Control catchup and backfill to prevent runaway job volume.
  • Prefer deferrable sensors and dataset dependencies over polling.
  • Document cross-team data contracts with SLAs.

Reflection

Have you ever had an off-by-one date bug in a scheduled pipeline? What scheduling concept was misunderstood?

Production Operations and On-Call

Operating Airflow in Production

Moving DAGs to production requires more than passing tests—it requires observability, alerting, and runbooks that enable on-call response without heroics.

Monitoring and Alerting

Monitor at three levels:

Platform health: Scheduler heartbeats, executor queue depth, worker availability, metadata database connections.

DAG run status: Failed runs on Tier 1 DAGs page immediately. Use callbacks (on_failure_callback) to post to Slack or PagerDuty with context.

Business SLAs: Even successful DAG runs can miss freshness SLAs if runtime drifts. Track end-to-end latency from source landing to mart availability.

Integrate Airflow metrics with your observability stack (Datadog, Prometheus, CloudWatch). Dashboards should show failure rate trends, not just point-in-time status.

Structured Failure Callbacks

def alert_on_failure(context):
    task_instance = context["task_instance"]
    send_slack_message(
        channel="#data-alerts",
        text=f"Task {task_instance.task_id} failed in DAG {task_instance.dag_id}",
        attachments=[{"text": str(context.get("exception"))}],
    )

Include links to logs, runbook, and recent deploys in alert messages.

Runbooks and Incident Response

Every Tier 1 DAG needs a runbook covering:

  • Business impact if delayed or failed
  • Upstream dependencies and owners
  • Common failure modes and fixes
  • Escalation path
  • Safe re-run instructions

During incidents, prefer clearing failed tasks and retrying over manual data fixes when tasks are idempotent.

Deployment and Versioning

Store DAGs in version control. Deploy via CI/CD with staging validation. Pin provider package versions to avoid breaking changes from upgrades.

Use feature flags or is_paused_upon_creation=True for new DAGs until validated.

Security and Secrets

Never embed credentials in DAG files. Use Airflow connections and secrets backends (AWS Secrets Manager, HashiCorp Vault). Restrict connection access by role.

Audit who can trigger DAGs and clear tasks in production—manual triggers bypass scheduling safeguards.

Key Takeaways

  • Alert on platform health, DAG failures, and business SLAs separately.
  • Failure callbacks should include context and runbook links.
  • Maintain runbooks for Tier 1 DAGs with safe re-run procedures.
  • Manage deployments and secrets through standard engineering practices.

Reflection

If your most critical DAG failed at 3 AM, would on-call know the business impact and first steps without waking up the author?