Data Observability and Incident Management

Detect, triage, and resolve data incidents with observability practices and RCA workflows.

Intermediate · 50 min · 3 lessons

Observability Pillars for Data

Beyond Dashboards: Observability for Pipelines

Application observability tracks logs, metrics, and traces. Data observability adapts these concepts to datasets and pipelines: freshness, volume, schema, distribution, and lineage. The goal is detecting user-impacting issues before stakeholders discover them in a board meeting.

Freshness

Freshness measures how recently a table was updated relative to its SLA. A fct_orders table with an SLA of "data through yesterday available by 8 AM" should have max(created_at) within expected bounds.

Implement freshness checks as scheduled queries that compare the table's latest timestamp, for example max(updated_at), to current_timestamp. Alert when the lag exceeds the threshold implied by the SLA.

Account for intentional delays: weekly datasets need weekly SLAs, not hourly alerts.
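The comparison described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the table names other than fct_orders and the threshold values are assumptions chosen for the example, and in practice latest_update would come from a scheduled max(updated_at) query.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-table thresholds: a daily table and a weekly one.
# The values are illustrative, not taken from any real SLA.
MAX_LAG = {
    "fct_orders": timedelta(hours=32),
    "weekly_summary": timedelta(days=8),
}


def freshness_lag(latest_update: datetime, now: datetime) -> timedelta:
    """Return how far the newest row lags behind the current time."""
    return now - latest_update


def is_stale(latest_update: datetime, now: datetime, max_lag: timedelta) -> bool:
    """True when the lag exceeds the threshold derived from the table's SLA."""
    return freshness_lag(latest_update, now) > max_lag


if __name__ == "__main__":
    now = datetime(2024, 3, 2, 8, 0, tzinfo=timezone.utc)
    latest = datetime(2024, 3, 1, 23, 50, tzinfo=timezone.utc)
    print(is_stale(latest, now, MAX_LAG["fct_orders"]))
```

Keeping the threshold per table is what lets a weekly dataset carry a weekly tolerance instead of inheriting an hourly alert.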

Volume

Volume monitoring detects anomalous row counts: sudden drops suggest upstream failures; spikes suggest duplicated rows or repeated runs.

Use statistical baselines: compare today's count to the seven-day median with tolerance bands. Static thresholds ("must be > 1000") miss seasonal patterns.
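A median-with-tolerance check might look like the sketch below. The 30% tolerance and the function name are assumptions for illustration; a real band would be tuned per table.

```python
from statistics import median


def volume_status(today_count: int, last_seven_days: list[int],
                  tolerance: float = 0.3) -> str:
    """Classify today's row count against a band around the 7-day median.

    tolerance=0.3 (an assumed default) allows +/-30% around the baseline.
    """
    baseline = median(last_seven_days)
    lower = baseline * (1 - tolerance)
    upper = baseline * (1 + tolerance)
    if today_count < lower:
        return "drop"   # possible upstream failure
    if today_count > upper:
        return "spike"  # possible duplication or repeated run
    return "ok"


if __name__ == "__main__":
    history = [1000, 1100, 950, 1050, 990, 1020, 1010]
    print(volume_status(400, history))
```

The median is used rather than the mean so that a single bad day in the history does not shift the baseline.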

Schema

Schema changes break downstream SQL silently until runtime. Detect new columns, removed columns, and type changes by comparing current information schema to a stored snapshot.

dbt source freshness and schema tests help; dedicated schema change detectors (OpenMetadata, Monte Carlo, custom) compare across runs.
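For the custom route, a snapshot comparison can be very small. The sketch below assumes each schema has already been read into a dict of column name to type (for example from the information schema); the column names are hypothetical.

```python
def diff_schema(snapshot: dict[str, str],
                current: dict[str, str]) -> dict[str, list[str]]:
    """Compare a stored schema snapshot to the current schema.

    Both arguments map column name -> data type.
    """
    old_cols, new_cols = set(snapshot), set(current)
    return {
        "added": sorted(new_cols - old_cols),
        "removed": sorted(old_cols - new_cols),
        "type_changed": sorted(
            col for col in old_cols & new_cols if snapshot[col] != current[col]
        ),
    }


if __name__ == "__main__":
    # Hypothetical columns for illustration only.
    stored = {"order_id": "bigint", "amount": "numeric", "region": "text"}
    live = {"order_id": "text", "amount": "numeric", "channel": "text"}
    print(diff_schema(stored, live))
```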

Distribution and Quality Metrics

Track null rates, distinct counts, min/max for critical columns. A sudden jump in null customer_id indicates upstream extraction issues even when row volume looks normal.

Composite health scores roll up multiple signals into green/yellow/red status for executive visibility.
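As a rough sketch of both ideas, the code below computes a null rate, flags a jump against a baseline, and rolls several check results into a status color. The 5-point jump threshold and the roll-up rule (one failure is yellow, two or more is red) are assumptions, not a standard.

```python
from typing import Optional, Sequence


def null_rate(values: Sequence[Optional[object]]) -> float:
    """Fraction of values that are None; 0.0 for an empty column."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def null_rate_jumped(today: float, baseline: float,
                     max_increase: float = 0.05) -> bool:
    """True when the null rate rose by more than max_increase (assumed 0.05)."""
    return today - baseline > max_increase


def health_status(checks: dict[str, bool]) -> str:
    """Roll up pass/fail signals. The rule here is an illustrative choice."""
    failing = sum(not passed for passed in checks.values())
    if failing == 0:
        return "green"
    if failing == 1:
        return "yellow"
    return "red"


if __name__ == "__main__":
    customer_ids = [101, None, 103, None]
    rate = null_rate(customer_ids)
    checks = {
        "freshness": True,
        "volume": True,  # row volume looks normal
        "customer_id_nulls": not null_rate_jumped(rate, baseline=0.01),
    }
    print(rate, health_status(checks))
```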

Lineage-Aware Alerting

When a staging table fails freshness, downstream marts are implicitly stale. Lineage-aware tools suppress redundant alerts and show blast radius. Without lineage, five downstream alerts fire for one root cause.
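The suppression logic can be illustrated with a small graph traversal. This sketch assumes lineage is available as a mapping from each table to its direct downstream tables and that the graph is acyclic; all table names except fct_orders are hypothetical.

```python
from collections import deque

# Hypothetical lineage: table -> tables that read from it directly.
LINEAGE = {
    "stg_orders": ["fct_orders"],
    "fct_orders": ["mart_revenue", "mart_finance"],
    "mart_revenue": [],
    "mart_finance": [],
}


def downstream_of(table: str, lineage: dict[str, list[str]]) -> set[str]:
    """Blast radius: every table reachable downstream of `table`."""
    seen: set[str] = set()
    queue = deque(lineage.get(table, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(lineage.get(node, []))
    return seen


def root_alerts(alerting: set[str], lineage: dict[str, list[str]]) -> set[str]:
    """Keep only alerts that are not downstream of another alerting table."""
    suppressed: set[str] = set()
    for table in alerting:
        suppressed |= downstream_of(table, lineage) & alerting
    return alerting - suppressed


if __name__ == "__main__":
    firing = {"stg_orders", "fct_orders", "mart_revenue"}
    print(root_alerts(firing, LINEAGE))
    print(sorted(downstream_of("stg_orders", LINEAGE)))
```

With three tables alerting, only the staging table is surfaced as the root, and the traversal result doubles as the list of affected consumers.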

Key Takeaways

Monitor freshness, volume, schema, and distribution—not just job success.
Use statistical baselines for volume and quality metrics.
Schema change detection prevents silent downstream breakage.
Lineage-aware alerting reduces noise and clarifies impact.

Reflection

Which observability signal do you wish you had monitored last quarter? Would it have changed the incident outcome?

Incident Triage and Response

When Alerts Fire: Structured Response

Data incidents differ from application outages—pipelines may "succeed" while producing wrong or stale data. A structured triage process prevents panic and shortens resolution time.

Severity Classification

Define severities with business impact:

Severity	Impact	Response
SEV1	Wrong data in production dashboards or billing	Immediate, all-hands
SEV2	Stale data past SLA, no wrong values yet	Same business day
SEV3	Non-critical dataset delayed	Next business day
SEV4	Informational anomaly	Backlog

Map alert rules to severities upfront so on-call does not debate at 2 AM.
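One way to make that mapping explicit is to keep it in code or config and check it for gaps before deployment. The sketch below encodes the table above; the alert rule names are hypothetical.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


# Response expectations copied from the severity table.
RESPONSE = {
    Severity.SEV1: "Immediate, all-hands",
    Severity.SEV2: "Same business day",
    Severity.SEV3: "Next business day",
    Severity.SEV4: "Backlog",
}

# Hypothetical alert rule names, mapped upfront.
RULE_SEVERITY = {
    "billing_values_mismatch": Severity.SEV1,
    "fct_orders_freshness_sla": Severity.SEV2,
    "weekly_export_delayed": Severity.SEV3,
    "row_count_anomaly_info": Severity.SEV4,
}


def route(rule: str) -> tuple[Severity, str]:
    """Return the pre-agreed severity and response for a mapped alert rule."""
    severity = RULE_SEVERITY[rule]
    return severity, RESPONSE[severity]


def unmapped_rules(rule_names: list[str]) -> list[str]:
    """Rules with no severity yet; intended to be checked before deploy."""
    return sorted(name for name in rule_names if name not in RULE_SEVERITY)


if __name__ == "__main__":
    print(route("fct_orders_freshness_sla"))
    print(unmapped_rules(["fct_orders_freshness_sla", "new_schema_drift"]))
```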

Triage Checklist

When an alert fires:

Confirm impact: Which dashboards, reports, or models consume this dataset?
Identify scope: Is the issue isolated to one partition, region, or source?
Check recent changes: Deploys, schema changes, upstream outages
Communicate status: Post in incident channel with known impact and ETA
Mitigate first: Roll back, pause publish, or switch to yesterday's snapshot
Fix root cause: After mitigation stabilizes consumers

Avoid fixing forward silently: stakeholders may act on bad data while the investigation is still under way.

Communication Templates

Prepare templates for status updates:

[SEV2] fct_orders stale — finance dashboard showing data through T-2
Impact: Executive revenue report delayed
Mitigation: Pointing dashboard to backup table
ETA: 10:30 AM PT
Owner: @data-platform-oncall

Clear, factual updates reduce duplicate tickets and executive escalations.
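If updates are posted by a bot or script, the template can be filled by a small helper so every update has the same fields in the same order. This is only a sketch; the function name and parameters are assumptions, and the field labels follow the template above.

```python
def format_status(severity: str, headline: str, impact: str,
                  mitigation: str, eta: str, owner: str) -> str:
    """Render a status update with the same fields as the template."""
    return "\n".join([
        f"[{severity}] {headline}",
        f"Impact: {impact}",
        f"Mitigation: {mitigation}",
        f"ETA: {eta}",
        f"Owner: {owner}",
    ])


if __name__ == "__main__":
    print(format_status(
        severity="SEV2",
        headline="fct_orders stale",
        impact="Executive revenue report delayed",
        mitigation="Pointing dashboard to backup table",
        eta="10:30 AM PT",
        owner="@data-platform-oncall",
    ))
```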

Escalation Paths

Document when to escalate to upstream vendors, source system teams, or infrastructure. Data incidents often originate in external systems, so know the contacts before you need them.

Key Takeaways

Classify severity by business impact, not technical complexity.
Follow a triage checklist: confirm, scope, check recent changes, communicate, mitigate, fix.
Communicate early and often with structured status updates.
Know escalation paths for upstream dependencies.

Reflection

Think of your last data incident. Was severity agreed quickly? Was communication proactive or reactive?

Root Cause Analysis and Prevention

Closing the Loop with RCA

Fixing the immediate issue restores service; root cause analysis prevents recurrence. Treat data incidents as learning opportunities with documented outcomes and actionable follow-ups.

RCA Structure

A useful RCA answers:

What happened? Timeline of detection, impact, mitigation, resolution
Why did it happen? Root cause, not just proximate trigger
Why did we not detect it sooner? Observability gaps
What will we change? Preventive actions with owners and due dates

Use five-whys cautiously: dig down to systemic issues (a missing test, unclear ownership), not individual blame.

Timeline Documentation

Capture timestamps:

When data went bad (introduction time)
When consumers would first notice
When monitoring detected (if at all)
When incident declared and mitigated
When fully resolved

Compare introduction time to detection time—large gaps indicate observability debt.
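Once the timestamps are recorded, the gaps are simple subtractions. The sketch below uses hypothetical key names and made-up timestamps purely to show the arithmetic.

```python
from datetime import datetime, timedelta


def detection_lag(timeline: dict[str, datetime]) -> timedelta:
    """Time between bad data being introduced and monitoring detecting it."""
    return timeline["detected"] - timeline["introduced"]


def time_to_mitigate(timeline: dict[str, datetime]) -> timedelta:
    """Time between detection and mitigation."""
    return timeline["mitigated"] - timeline["detected"]


# Illustrative timestamps only; not from a real incident.
EXAMPLE_TIMELINE = {
    "introduced": datetime(2024, 1, 15, 2, 0),
    "detected": datetime(2024, 1, 15, 9, 30),
    "mitigated": datetime(2024, 1, 15, 10, 30),
    "resolved": datetime(2024, 1, 15, 15, 0),
}

if __name__ == "__main__":
    print("detection lag:", detection_lag(EXAMPLE_TIMELINE))
    print("time to mitigate:", time_to_mitigate(EXAMPLE_TIMELINE))
```

Tracking these two numbers across incidents gives a trend for observability debt rather than a single anecdote.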

Preventive Actions

Common action types:

Add dbt test or observability check (regression prevention)
Fix idempotency bug in pipeline code
Update runbook with new failure mode
Improve alert routing or severity
Negotiate a data contract with the source system team

Track actions in your backlog with the same priority as feature work. Unimplemented RCA actions guarantee repeat incidents.

Blameless Culture

Focus on systems and processes. Engineers who fear blame hide near-misses. Celebrate detections that prevented worse outcomes.

Share RCAs internally (sanitized) so other teams learn without experiencing the same failure.

Example RCA Summary

Incident: Duplicate rows in fct_orders on 2024-03-01 caused revenue overstatement in board deck.

Root cause: Airflow retry re-ran append-only load without partition overwrite after transient S3 timeout.

Detection gap: No uniqueness test on order_id in production schedule.

Actions: (1) Changed load to merge by order_id; (2) Added unique test with store_failures; (3) Updated runbook for retry behavior.
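The difference between the append-only load and the merge in action (1), along with the kind of check a uniqueness test performs, can be shown with in-memory rows. This is a simplified sketch: the rows and function names are hypothetical, and a real pipeline would express the merge in SQL or a dataframe library.

```python
from collections import Counter


def append_load(table: list[dict], batch: list[dict]) -> list[dict]:
    """Append-only load: re-running the same batch duplicates rows."""
    return table + batch


def merge_load(table: list[dict], batch: list[dict],
               key: str = "order_id") -> list[dict]:
    """Merge by key: re-running the same batch leaves one row per key."""
    merged = {row[key]: row for row in table}
    merged.update({row[key]: row for row in batch})
    return list(merged.values())


def duplicate_keys(table: list[dict], key: str = "order_id") -> list:
    """Keys that appear more than once, as a uniqueness test would report."""
    counts = Counter(row[key] for row in table)
    return sorted(k for k, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical batch; the second load simulates a retry of the same task.
    batch = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 25.0}]

    appended = append_load(append_load([], batch), batch)
    merged = merge_load(merge_load([], batch), batch)

    print(len(appended), duplicate_keys(appended))
    print(len(merged), duplicate_keys(merged))
```

The merge makes the load idempotent, so a retry is harmless; the uniqueness check is the safety net that would have caught the duplicates if the load regressed.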

Key Takeaways

RCAs document timeline, root cause, detection gaps, and preventive actions.
Measure detection lag and close observability gaps.
Track RCA actions to completion in the team backlog.
Maintain blameless culture to surface near-misses early.

Reflection

Does your team write RCAs for data incidents? What prevented the last one from recurring—or did it happen again?