Data Governance for Practitioners

Hands-on governance for practitioners: ownership, classification, access, and catalog practices.

Intermediate · 50 min · 3 lessons

Governance That Ships

Lightweight Governance for Practitioners

Governance often conjures images of committees, multi-month approvals, and blocked projects. Effective practitioner governance is different: minimal process that prevents real risks while keeping teams moving. The goal is confidence in data, not compliance theater.

Named Ownership

Every Tier 1 (business-critical) dataset needs a named owner: a team or individual accountable for quality, documentation, and access decisions. Ownership appears in the catalog, not buried in a wiki.

Owners respond to incidents, approve access requests within SLA, and review breaking change notifications.

Without owners, datasets become orphaned and quality decays.

Data Contracts Between Producers and Consumers

A data contract specifies schema, grain, SLAs, and breaking change policy between upstream producers and downstream consumers.

Example contract clause: "Adding nullable columns is non-breaking; renaming columns requires 30-day deprecation notice and consumer acknowledgment."

Contracts can live as YAML in the repo alongside dbt sources, where CI can enforce them.
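As a rough illustration of what CI enforcement could look like, the sketch below models a contract in Python instead of YAML and checks a table's actual columns against it. The `DataContract` class, its field names, and `contract_violations` are hypothetical and not part of dbt or any catalog tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical in-memory form of a contract that could be loaded from YAML."""
    dataset: str
    grain: tuple[str, ...]          # columns that identify one row
    columns: dict[str, str]         # column name -> expected type
    freshness_sla_hours: int
    rename_notice_days: int = 30    # mirrors the example clause above


def contract_violations(contract: DataContract, actual_columns: dict[str, str]) -> list[str]:
    """Compare actual columns to the contract; extra columns are allowed."""
    problems = []
    for name, expected in contract.columns.items():
        if name not in actual_columns:
            problems.append(f"missing column: {name}")
        elif actual_columns[name] != expected:
            problems.append(
                f"type mismatch on {name}: expected {expected}, got {actual_columns[name]}"
            )
    return problems
```

Extra columns pass on purpose, matching the clause that treats added nullable columns as non-breaking. A CI job could fail the build whenever the returned list is non-empty.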

PR Checks for Breaking Changes

Automate detection of breaking schema changes in pull requests: dropped columns, type narrowing, grain changes. Block merge or require explicit approval from listed consumers.

Tools: dbt state comparison, schema diff actions, catalog integration.
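To make the idea of a schema diff concrete, here is a minimal sketch that compares two column-to-type mappings and reports dropped columns and type changes that are not known widenings. The `TYPE_RANK` table and the function name are assumptions for illustration; real tools use the warehouse's own type system, and this sketch does not detect grain changes.

```python
# Hypothetical widening order for a few integer types: a move to a
# lower rank is a narrowing and therefore treated as breaking.
TYPE_RANK = {"smallint": 1, "int": 2, "bigint": 3}


def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[tuple[str, str]]:
    """Return (kind, column) pairs for changes that could break consumers."""
    changes = []
    for column, old_type in old.items():
        if column not in new:
            # A rename also shows up here, as a drop plus an unrelated addition.
            changes.append(("dropped", column))
            continue
        new_type = new[column]
        if new_type == old_type:
            continue
        old_rank = TYPE_RANK.get(old_type)
        new_rank = TYPE_RANK.get(new_type)
        # Unknown type pairs are flagged conservatively; only known widenings pass.
        if old_rank is None or new_rank is None or new_rank < old_rank:
            changes.append(("type_changed", column))
    return changes
```

Added columns are ignored, so a PR that only appends columns yields an empty list and can merge without consumer approval.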

Governance That Accelerates

Good governance answers questions quickly: "Can I use this dataset?" "Who approves access?" "Is this column PII?" Teams ship faster when the answers are self-serve in the catalog rather than held as tribal knowledge.

Key Takeaways

  • Assign named owners to critical datasets with visible accountability.
  • Use data contracts to formalize producer-consumer expectations.
  • Automate breaking change detection in CI.
  • Design governance for self-serve speed, not committee delay.

Reflection

What governance rule actually speeds teams up instead of blocking them? What rule exists only on paper?

Classification and Access Control

Protecting Data Without Blocking Work

Classification tags data by sensitivity (public, internal, confidential, PII, regulated). Access control enforces who can read or write based on role, need, and classification. Together they reduce breach risk while enabling legitimate analytics.

Classification Taxonomy

Keep taxonomy small and actionable:

Level        | Examples             | Default access
Public       | Marketing aggregates | Broad read
Internal     | Operational metrics  | Employees
Confidential | Revenue detail       | Finance + leadership
PII          | Email, phone, SSN    | Masked or role-restricted
Regulated    | HIPAA, PCI           | Legal/compliance approval

Tag at column level where sensitivity varies within tables.
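Column-level tagging can be sketched with a small enum and a mapping from column to tag. The numeric ordering of the levels (in particular, ranking PII above Confidential) and the helper names are assumptions for this example, not something the taxonomy itself prescribes.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Taxonomy levels; the numeric order is an assumed ranking."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    PII = 3
    REGULATED = 4


def table_sensitivity(column_tags: dict[str, Sensitivity],
                      default: Sensitivity = Sensitivity.INTERNAL) -> Sensitivity:
    """A table is as sensitive as its most sensitive column."""
    return max(column_tags.values(), default=default)


def sensitive_columns(column_tags: dict[str, Sensitivity],
                      threshold: Sensitivity = Sensitivity.PII) -> list[str]:
    """Columns at or above the threshold, e.g. candidates for masking."""
    return sorted(name for name, tag in column_tags.items() if tag >= threshold)
```

Deriving the table-level label from the column tags keeps the two from drifting apart when a new sensitive column is added.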

Role-Based and Attribute-Based Access

RBAC assigns permissions to roles (analyst, finance_analyst, data_engineer). ABAC adds conditions (department, region, clearance).

Implement policies consistently across the warehouse (Snowflake RBAC, BigQuery IAM), the lake (Lake Formation LF-Tags), and BI tools.

Avoid duplicated or conflicting policies across layers; document which layer is the source of truth.
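The difference between RBAC and ABAC can be shown in a few lines: the role decides which classifications a user may read at all, and an attribute condition narrows that further. The role-to-classification mapping, the `User` fields, and the same-region rule below are invented for illustration and do not reflect any particular warehouse's policy model.

```python
from dataclasses import dataclass

# Hypothetical RBAC mapping: role -> classifications the role may read.
ROLE_GRANTS = {
    "analyst": {"public", "internal"},
    "finance_analyst": {"public", "internal", "confidential"},
}


@dataclass(frozen=True)
class User:
    name: str
    roles: frozenset[str]
    region: str


def can_read(user: User, classification: str, dataset_region: str) -> bool:
    """RBAC check first, then an ABAC condition on region."""
    role_ok = any(classification in ROLE_GRANTS.get(role, set()) for role in user.roles)
    if not role_ok:
        return False
    # Assumed ABAC rule: confidential data is readable only within the user's region.
    if classification == "confidential":
        return user.region == dataset_region
    return True
```

In practice the same decision would be expressed once, in whichever layer is the documented source of truth, and propagated to the others.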

Least Privilege and Access Reviews

Grant the minimum access needed. Run quarterly access reviews for PII and regulated data, in which managers confirm that their direct reports still require access.

Automate provisioning via request workflows with approval routing to data owners.
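A quarterly review needs a worklist. The sketch below picks out grants on PII or regulated data whose last review is older than 90 days; the grant record shape (`user`, `dataset`, `classification`, `last_reviewed`) and the 90-day interval are assumptions standing in for whatever the access system actually exports.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)       # assumed length of a "quarter"
REVIEWED_CLASSES = {"pii", "regulated"}    # classifications subject to review


def overdue_grants(grants: list[dict], today: date) -> list[dict]:
    """Return grants on sensitive data not reviewed within the interval."""
    return [
        grant for grant in grants
        if grant["classification"] in REVIEWED_CLASSES
        and today - grant["last_reviewed"] > REVIEW_INTERVAL
    ]
```

The result can be grouped by manager and sent out as the confirmation list for that review cycle.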

Masking and Tokenization

For PII columns, apply dynamic masking (partial email, hashed identifier) for roles without full access. Tokenization replaces sensitive values with reversible tokens for approved use cases.

Analysts explore patterns without seeing raw PII.
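The three techniques named above can be contrasted in a short sketch: partial masking, a keyed hash that preserves joinability, and a reversible token store. Warehouses usually provide masking as a built-in policy feature, so these functions only illustrate the behavior; `TokenVault` is a hypothetical in-memory stand-in for a real tokenization service.

```python
import hashlib
import hmac
import secrets


def mask_email(email: str) -> str:
    """Partial mask: keep the first character and the domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "***"
    return f"{local[0]}***@{domain}"


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed hash: the same input and key give the same output, so joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


class TokenVault:
    """Reversible tokenization; only holders of the vault can recover values."""

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}
        self._value_by_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token: str) -> str:
        return self._value_by_token[token]
```

Masking and keyed hashing are one-way from the analyst's point of view, while tokenization stays reversible for the approved use cases that have access to the vault.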

Key Takeaways

  • Use a simple classification taxonomy applied at column level.
  • Enforce consistent RBAC/ABAC across warehouse, lake, and BI.
  • Conduct periodic access reviews for sensitive data.
  • Apply masking so analysts work safely with PII-adjacent datasets.

Reflection

Can you identify which columns in your most-used mart are PII? Who has access today—and is that access reviewed?

Catalog and Metadata Practices

Making Governance Discoverable

A catalog without adoption is empty metadata. Practitioner governance embeds catalog usage into daily workflows so documentation and lineage stay current.

Minimum Viable Metadata

For every production dataset, require:

  • Description and grain
  • Owner and Slack channel
  • Classification tags
  • Freshness SLA
  • Link to dbt model or pipeline repo

Optional but valuable: sample queries, related dashboards, known limitations.
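The required fields listed above translate directly into a completeness check. This is a minimal sketch; the field keys (`slack_channel`, `pipeline_link`, and so on) are hypothetical names for a catalog entry and would need to match whatever the catalog actually exports.

```python
# Hypothetical keys for the required metadata of a production dataset.
REQUIRED_FIELDS = (
    "description",
    "grain",
    "owner",
    "slack_channel",
    "classification",
    "freshness_sla",
    "pipeline_link",
)


def missing_metadata(entry: dict) -> list[str]:
    """Return required fields that are absent, empty, or whitespace-only."""
    missing = []
    for field_name in REQUIRED_FIELDS:
        value = entry.get(field_name)
        if isinstance(value, str):
            value = value.strip()
        if not value:
            missing.append(field_name)
    return missing
```

An empty result means the entry meets the minimum; anything else is a concrete to-do list for the dataset owner.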

Lineage and Impact Analysis

Automated lineage from dbt, Spark, or ingestion tools shows upstream sources and downstream dashboards. Use lineage before schema changes to notify affected consumers proactively.

Impact analysis reduces surprise breakages and builds trust in platform changes.
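Impact analysis is, at its core, a graph traversal: start at the node being changed and collect everything reachable downstream. The sketch below assumes lineage is available as a simple mapping from each node to its direct downstream nodes; the node names in the usage example are made up.

```python
from collections import deque


def downstream(lineage: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first search over lineage edges; returns all affected nodes."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(start)  # guard against cycles that lead back to the start
    return seen


lineage = {
    "raw.orders": ["stg_orders"],
    "stg_orders": ["fct_orders"],
    "fct_orders": ["dash.revenue", "dash.ops"],
}
affected = downstream(lineage, "stg_orders")
```

Joining `affected` against catalog ownership gives the list of people to notify before the schema change ships.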

Search and Discovery UX

If analysts cannot find datasets in under two minutes, they revert to asking engineers. Improve search with consistent naming, business glossary terms synced to catalog, and curated "golden" dataset collections.

Data stewards curate domain collections; they do not write every description personally.

Governance Metrics

Track catalog health:

  • Percentage of Tier 1 datasets with complete metadata
  • Average age of stale descriptions (unchanged since source schema changed)
  • Access request turnaround time
  • Percentage of datasets with assigned owner

Review monthly; escalate gaps to domain leads.
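Two of the metrics above are simple ratios and can be computed from a catalog export. The record keys (`tier`, `owner`, `metadata_complete`) are assumptions about that export's shape, and the result keys are illustrative names.

```python
def percentage(part: int, whole: int) -> float:
    """Percentage rounded to one decimal; 0.0 when there is nothing to measure."""
    return round(100 * part / whole, 1) if whole else 0.0


def catalog_health(datasets: list[dict]) -> dict[str, float]:
    """Compute completeness metrics from catalog records."""
    tier1 = [d for d in datasets if d["tier"] == 1]
    return {
        "tier1_complete_pct": percentage(
            sum(1 for d in tier1 if d["metadata_complete"]), len(tier1)
        ),
        "owner_assigned_pct": percentage(
            sum(1 for d in datasets if d.get("owner")), len(datasets)
        ),
    }
```

Running this on a schedule and storing the results gives the trend line for the monthly review.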

Integrating Catalog with CI

Block dbt merges if new marts lack descriptions. Sync dbt schema.yml docs to the catalog automatically on deploy.

A single source of truth prevents doc drift between the repo and the catalog.
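A description gate for CI can be very small. The sketch below assumes a schema.yml file has already been parsed into a dictionary by a YAML library and follows dbt's `models` list with `name` and `description` keys; the function names and the exit-code convention are illustrative.

```python
def undocumented_models(schema: dict) -> list[str]:
    """Names of models whose description is missing or blank."""
    missing = []
    for model in schema.get("models", []):
        description = (model.get("description") or "").strip()
        if not description:
            missing.append(model["name"])
    return missing


def ci_exit_code(schema: dict) -> int:
    """Non-zero exit code fails the CI job and blocks the merge."""
    return 1 if undocumented_models(schema) else 0
```

A CI step could print the list from `undocumented_models` and pass the result of `ci_exit_code` to `sys.exit`, so the author sees exactly which models need descriptions.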

Key Takeaways

  • Require minimum metadata for production datasets: owner, grain, classification, SLA.
  • Use lineage for proactive consumer notification on changes.
  • Measure catalog completeness and fix gaps systematically.
  • Sync documentation from dbt CI to catalog as single source of truth.

Reflection

Is your catalog a living system integrated with CI, or a stale snapshot from a one-time crawl? What would increase analyst trust in catalog search?