Data Observability and Incident Management
Detect, triage, and resolve data incidents with observability practices and RCA workflows.
Observability Pillars for Data
Beyond Dashboards: Observability for Pipelines
Application observability tracks logs, metrics, and traces. Data observability adapts these concepts to datasets and pipelines: freshness, volume, schema, distribution, and lineage. The goal is detecting user-impacting issues before stakeholders discover them in a board meeting.
Freshness
Freshness measures how recently a table was updated relative to its SLA. A fct_orders table with an SLA of "data through yesterday available by 8 AM" should, at 8 AM, have a max(created_at) that reaches the end of yesterday.
Implement freshness checks as scheduled queries that compare max(updated_at) to current_timestamp, and alert when the lag exceeds the threshold.
Account for intentional delays: weekly datasets need weekly SLAs, not hourly alerts.
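The comparison behind a freshness check can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the table names, the thresholds in `MAX_LAG`, and the `is_stale` helper are hypothetical, and `latest` stands in for the result of a scheduled max(updated_at) query.

```python
from datetime import datetime, timedelta, timezone


def freshness_lag(latest_update: datetime, now: datetime) -> timedelta:
    """Return how far the newest row lags behind the current time."""
    return now - latest_update


def is_stale(latest_update: datetime, now: datetime, max_lag: timedelta) -> bool:
    """True when the lag exceeds the threshold allowed for this dataset."""
    return freshness_lag(latest_update, now) > max_lag


# Hypothetical per-table thresholds: a weekly dataset gets a weekly SLA,
# so it does not trigger alerts on an hourly or daily cadence.
MAX_LAG = {
    "fct_orders": timedelta(hours=32),
    "weekly_cohorts": timedelta(days=8),
}

now = datetime(2024, 3, 2, 8, 0, tzinfo=timezone.utc)
latest = datetime(2024, 3, 1, 23, 55, tzinfo=timezone.utc)  # stand-in for max(updated_at)

if is_stale(latest, now, MAX_LAG["fct_orders"]):
    print("ALERT: fct_orders is stale")
else:
    print("fct_orders is within its freshness SLA")
```

Keeping the threshold per table, rather than global, is what lets slow-moving datasets carry slow-moving SLAs.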
Volume
Volume monitoring detects anomalous row counts: sudden drops suggest upstream failures, while spikes suggest duplicated rows or repeated runs.
Use statistical baselines: compare today's count to the seven-day median with tolerance bands. Static thresholds ("must be > 1000") miss seasonal patterns.
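A median-with-tolerance check might look like the following Python sketch. The function name `classify_volume`, the 30% tolerance, and the sample counts are illustrative assumptions, not values from any particular tool.

```python
from statistics import median


def classify_volume(today_count: int, history: list[int], tolerance: float = 0.3) -> str:
    """Compare today's row count to the median of recent daily counts.

    `history` holds the previous days' counts (seven in this sketch) and
    must be non-empty. `tolerance` is the allowed fractional deviation.
    """
    baseline = median(history)
    lower = baseline * (1 - tolerance)
    upper = baseline * (1 + tolerance)
    if today_count < lower:
        return "drop"
    if today_count > upper:
        return "spike"
    return "ok"


# Hypothetical counts for the last seven days, including a quieter weekend.
last_seven_days = [10200, 9800, 10050, 9900, 10100, 7000, 6900]

for count in (4950, 10000, 20000):
    print(count, classify_volume(count, last_seven_days))
```

The median is less sensitive than the mean to a single bad day in the history window. For strongly weekly patterns, comparing against the same weekday in previous weeks is a reasonable alternative.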
Schema
Schema changes can break downstream SQL with no warning until a query fails at runtime. Detect new columns, removed columns, and type changes by comparing the current information_schema to a stored snapshot.
dbt source freshness and schema tests help; dedicated schema change detectors (OpenMetadata, Monte Carlo, custom) compare across runs.
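For the custom route, the snapshot comparison reduces to a diff of two column-to-type mappings. The sketch below assumes both mappings have already been loaded (for example from an information schema query and a stored JSON file); `diff_schema` and the sample columns are hypothetical.

```python
def diff_schema(snapshot: dict[str, str], current: dict[str, str]) -> dict[str, list]:
    """Compare two {column_name: data_type} mappings and report the changes."""
    shared = set(snapshot) & set(current)
    return {
        "added": sorted(set(current) - set(snapshot)),
        "removed": sorted(set(snapshot) - set(current)),
        "type_changed": sorted(
            (col, snapshot[col], current[col])
            for col in shared
            if snapshot[col] != current[col]
        ),
    }


# Hypothetical stored snapshot and current state of one table.
snapshot = {"order_id": "bigint", "amount": "numeric", "coupon": "varchar"}
current = {"order_id": "varchar", "amount": "numeric", "channel": "varchar"}

changes = diff_schema(snapshot, current)
for kind, items in changes.items():
    print(kind, items)
```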
Distribution and Quality Metrics
Track null rates, distinct counts, min/max for critical columns. A sudden jump in null customer_id indicates upstream extraction issues even when row volume looks normal.
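A null-rate check can be sketched as follows. The helper names and the five-percentage-point threshold are assumptions chosen for illustration; in practice the rate would usually be computed in SQL rather than over a Python list.

```python
def null_rate(values: list) -> float:
    """Fraction of values that are None; 0.0 for an empty column."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def null_rate_jumped(current_rate: float, baseline_rate: float, max_increase: float = 0.05) -> bool:
    """True when the null rate rose by more than `max_increase` (absolute)."""
    return current_rate - baseline_rate > max_increase


# Hypothetical sample: row volume looks normal, but half the keys are missing.
customer_ids = ["c1", None, "c2", None]
rate = null_rate(customer_ids)
print(rate, null_rate_jumped(rate, baseline_rate=0.01))
```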
Composite health scores roll up multiple signals into green/yellow/red status for executive visibility.
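One simple roll-up policy is "worst signal wins". The sketch below assumes that policy and hypothetical signal names; weighted scores are another option that this example does not cover.

```python
# Rank statuses so the worst one can be selected with max().
STATUS_RANK = {"green": 0, "yellow": 1, "red": 2}


def rollup_status(signals: dict[str, str]) -> str:
    """Return the worst status among the individual signals."""
    if not signals:
        raise ValueError("at least one signal is required")
    return max(signals.values(), key=lambda status: STATUS_RANK[status])


# Hypothetical per-signal statuses for one table.
fct_orders_signals = {
    "freshness": "green",
    "volume": "yellow",
    "schema": "green",
    "null_rate": "green",
}
print(rollup_status(fct_orders_signals))
```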
Lineage-Aware Alerting
When a staging table fails freshness, downstream marts are implicitly stale. Lineage-aware tools suppress redundant alerts and show blast radius. Without lineage, one root cause can fire a separate alert for every stale downstream table.
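The suppression logic can be illustrated with a small lineage graph. This is a sketch under assumptions: lineage is given as a hypothetical `parents` mapping from each table to its direct upstream tables, and the table names are invented.

```python
def upstream_of(table: str, parents: dict[str, list[str]]) -> set[str]:
    """All transitive upstream tables of `table`."""
    seen: set[str] = set()
    stack = list(parents.get(table, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def root_alerts(failing: set[str], parents: dict[str, list[str]]) -> set[str]:
    """Failing tables with no failing ancestor: the ones worth paging on."""
    return {t for t in failing if not (upstream_of(t, parents) & failing)}


def blast_radius(root: str, parents: dict[str, list[str]]) -> set[str]:
    """Every table that depends, directly or indirectly, on `root`."""
    return {t for t in parents if root in upstream_of(t, parents)}


# Hypothetical lineage: child table -> direct upstream tables.
parents = {
    "stg_orders": ["raw_orders"],
    "stg_customers": ["raw_customers"],
    "fct_orders": ["stg_orders"],
    "mart_revenue": ["fct_orders"],
    "mart_retention": ["fct_orders", "stg_customers"],
}
failing = {"stg_orders", "fct_orders", "mart_revenue", "mart_retention"}

print(sorted(root_alerts(failing, parents)))
print(sorted(blast_radius("stg_orders", parents)))
```

Four freshness failures collapse into one alert on the staging table, with the three downstream tables reported as impact rather than as separate incidents.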
Key Takeaways
- Monitor freshness, volume, schema, and distribution—not just job success.
- Use statistical baselines for volume and quality metrics.
- Schema change detection prevents silent downstream breakage.
- Lineage-aware alerting reduces noise and clarifies impact.
Reflection
Which observability signal do you wish you had monitored last quarter? Would it have changed the incident outcome?
Incident Triage and Response
When Alerts Fire: Structured Response
Data incidents differ from application outages—pipelines may "succeed" while producing wrong or stale data. A structured triage process prevents panic and shortens resolution time.
Severity Classification
Define severities with business impact:
| Severity | Impact | Response |
|---|---|---|
| SEV1 | Wrong data in production dashboards or billing | Immediate, all-hands |
| SEV2 | Stale data past SLA, no wrong values yet | Same business day |
| SEV3 | Non-critical dataset delayed | Next business day |
| SEV4 | Informational anomaly | Backlog |
Map alert rules to severities upfront so on-call does not debate at 2 AM.
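A rule-to-severity mapping can live in version-controlled configuration. The sketch below encodes the response column of the table above; the dataset and check names, and the choice to default unmapped alerts to SEV4, are assumptions for illustration.

```python
from enum import Enum


class Severity(Enum):
    """Severity levels with the expected response from the table above."""
    SEV1 = "Immediate, all-hands"
    SEV2 = "Same business day"
    SEV3 = "Next business day"
    SEV4 = "Backlog"


# Hypothetical mapping of (dataset, check) pairs to severities,
# agreed on ahead of time rather than during an incident.
ALERT_SEVERITY = {
    ("fct_orders", "uniqueness"): Severity.SEV1,
    ("fct_orders", "freshness"): Severity.SEV2,
    ("weekly_cohorts", "freshness"): Severity.SEV3,
}


def classify(dataset: str, check: str) -> Severity:
    """Look up the pre-agreed severity; unmapped alerts default to SEV4."""
    return ALERT_SEVERITY.get((dataset, check), Severity.SEV4)


severity = classify("fct_orders", "freshness")
print(severity.name, "->", severity.value)
```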
Triage Checklist
When an alert fires:
- Confirm impact: Which dashboards, reports, or models consume this dataset?
- Identify scope: Is the issue isolated to one partition, region, or source?
- Check recent changes: Deploys, schema changes, upstream outages
- Communicate status: Post in incident channel with known impact and ETA
- Mitigate first: Roll back, pause publish, or switch to yesterday's snapshot
- Fix root cause: After mitigation stabilizes consumers
Avoid fixing forward silently: stakeholders may act on bad data while you investigate.
Communication Templates
Prepare templates for status updates:
[SEV2] fct_orders stale — finance dashboard showing data through T-2
Impact: Executive revenue report delayed
Mitigation: Pointing dashboard to backup table
ETA: 10:30 AM PT
Owner: @data-platform-oncall
Clear, factual updates reduce duplicate tickets and executive escalations.
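A template like the one above can be rendered from structured fields so that no update goes out with a missing line. This sketch uses Python's standard `string.Template`; the field names and the `render_status` helper are hypothetical.

```python
from string import Template

STATUS_TEMPLATE = Template(
    "[$severity] $summary\n"
    "Impact: $impact\n"
    "Mitigation: $mitigation\n"
    "ETA: $eta\n"
    "Owner: $owner"
)


def render_status(**fields: str) -> str:
    """Fill the template; substitute() raises KeyError if a field is missing."""
    return STATUS_TEMPLATE.substitute(**fields)


message = render_status(
    severity="SEV2",
    summary="fct_orders stale: finance dashboard showing data through T-2",
    impact="Executive revenue report delayed",
    mitigation="Pointing dashboard to backup table",
    eta="10:30 AM PT",
    owner="@data-platform-oncall",
)
print(message)
```

Using `substitute` rather than `safe_substitute` is deliberate here: an incomplete update fails loudly instead of being posted with a placeholder.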
Escalation Paths
Document when to escalate to upstream vendors, source system teams, or infrastructure. Data incidents often originate in outside systems, so know the contacts before you need them.
Key Takeaways
- Classify severity by business impact, not technical complexity.
- Follow a triage checklist: confirm, scope, check recent changes, communicate, mitigate, fix.
- Communicate early and often with structured status updates.
- Know escalation paths for upstream dependencies.
Reflection
Think of your last data incident. Was severity agreed quickly? Was communication proactive or reactive?
Root Cause Analysis and Prevention
Closing the Loop with RCA
Fixing the immediate issue restores service; root cause analysis prevents recurrence. Treat data incidents as learning opportunities with documented outcomes and actionable follow-ups.
RCA Structure
A useful RCA answers:
- What happened? Timeline of detection, impact, mitigation, resolution
- Why did it happen? Root cause, not just proximate trigger
- Why did we not detect it sooner? Observability gaps
- What will we change? Preventive actions with owners and due dates
Use five-whys cautiously: dig toward systemic issues (a missing test, unclear ownership), not individual blame.
Timeline Documentation
Capture timestamps:
- When data went bad (introduction time)
- When consumers would first notice
- When monitoring detected (if at all)
- When incident declared and mitigated
- When fully resolved
Compare introduction time to detection time—large gaps indicate observability debt.
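Detection lag is a subtraction over the recorded timestamps. The sketch below uses an invented timeline and hypothetical key names; it returns `None` when monitoring never detected the problem, which is itself a finding for the RCA.

```python
from datetime import datetime, timedelta
from typing import Optional


def detection_lag(timeline: dict) -> Optional[timedelta]:
    """Time between bad data being introduced and monitoring detecting it.

    Returns None when monitoring never detected the issue.
    """
    detected = timeline.get("detected")
    if detected is None:
        return None
    return detected - timeline["introduced"]


# Hypothetical incident timeline.
timeline = {
    "introduced": datetime(2024, 3, 1, 2, 15),
    "detected": datetime(2024, 3, 1, 14, 40),
    "mitigated": datetime(2024, 3, 1, 15, 30),
    "resolved": datetime(2024, 3, 2, 9, 0),
}

print("detection lag:", detection_lag(timeline))
print("time to mitigate:", timeline["mitigated"] - timeline["detected"])
```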
Preventive Actions
Common action types:
- Add dbt test or observability check (regression prevention)
- Fix idempotency bug in pipeline code
- Update runbook with new failure mode
- Improve alert routing or severity
- Negotiate a data contract with the source system
Track actions in your backlog with the same priority as feature work. Unimplemented RCA actions leave the door open to repeat incidents.
Blameless Culture
Focus on systems and processes. Engineers who fear blame hide near-misses. Celebrate detections that prevented worse outcomes.
Share RCAs internally (sanitized) so other teams learn without experiencing the same failure.
Example RCA Summary
Incident: Duplicate rows in fct_orders on 2024-03-01 caused revenue overstatement in board deck.
Root cause: Airflow retry re-ran append-only load without partition overwrite after transient S3 timeout.
Detection gap: No uniqueness test on order_id in production schedule.
Actions: (1) Changed load to merge by order_id; (2) Added a dbt unique test on order_id with store_failures enabled; (3) Updated runbook for retry behavior.
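The failure mode and the first two actions in this summary can be illustrated with in-memory stand-ins. This is not the actual Airflow or warehouse code from the incident: `append_load`, `merge_load`, and `duplicate_order_ids` are hypothetical helpers, and a Python list and dict stand in for the target table.

```python
from collections import Counter
from typing import Iterable


def append_load(target: list[dict], batch: list[dict]) -> None:
    """Append-only load: running it twice writes every row twice."""
    target.extend(batch)


def merge_load(target: dict[int, dict], batch: list[dict]) -> None:
    """Merge keyed by order_id: re-running the same batch is idempotent."""
    for row in batch:
        target[row["order_id"]] = row


def duplicate_order_ids(rows: Iterable[dict]) -> list[int]:
    """Uniqueness check: return order_ids that appear more than once."""
    counts = Counter(row["order_id"] for row in rows)
    return sorted(order_id for order_id, n in counts.items() if n > 1)


batch = [{"order_id": 1, "amount": 40}, {"order_id": 2, "amount": 25}]

appended: list[dict] = []
append_load(appended, batch)
append_load(appended, batch)  # simulated retry after a transient failure

merged: dict[int, dict] = {}
merge_load(merged, batch)
merge_load(merged, batch)  # the same retry, now harmless

print("append duplicates:", duplicate_order_ids(appended))
print("merge duplicates:", duplicate_order_ids(merged.values()))
```

With the append-only load, the retry doubles the summed revenue; with the merge, the total is unchanged, and the uniqueness check would have flagged the append case before it reached a report.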
Key Takeaways
- RCAs document timeline, root cause, detection gaps, and preventive actions.
- Measure detection lag and close observability gaps.
- Track RCA actions to completion in the team backlog.
- Maintain blameless culture to surface near-misses early.
Reflection
Does your team write RCAs for data incidents? What prevented the last one from recurring—or did it happen again?