dbt Best Practices

Production patterns for dbt project structure, testing, documentation, and CI/CD.

Intermediate60 min · 4 lessons

Project Structure and Layered Architecture

Why Structure Matters

A dbt project without conventions becomes unmaintainable quickly. After fifty models, nobody knows where to add a new transformation. After two hundred, circular dependencies and duplicate logic proliferate. A layered architecture gives every model a clear home and enforces a directed flow from raw sources to consumption-ready marts.

The goal is composability: small, testable building blocks that combine into larger outcomes. When each layer has a single responsibility, debugging simplifies. If a mart shows wrong revenue, you trace backward through intermediate joins to staging cleanses to source freshness.

The Staging → Intermediate → Marts Pattern

Staging Layer (`stg_`)

Staging models sit closest to sources. Their job is source-conformed cleaning, not business logic:

Rename columns to consistent conventions (snake_case)
Cast types (strings to dates, decimals)
Deduplicate source snapshots if needed
Filter soft-deleted records
One staging model per source table, typically

Example: stg_orders__shopify reads from the raw Shopify orders table, standardizes column names, and casts created_at to timestamp. It does not compute customer lifetime value or join to marketing data.

Keep staging thin. Heavy logic here makes reuse harder.

Intermediate Layer (`int_`)

Intermediate models encode reusable business logic that multiple marts need:

Joins across staging entities
Complex calculations (sessionization, attribution windows)
SCD handling
Aggregations at sub-mart grain

Example: int_orders_enriched joins stg_orders, stg_customers, and stg_products, applying currency normalization and discount logic. Both fct_orders and fct_order_lines can reference this model.

Use intermediate models to avoid copy-pasting the same ten-table join into five marts.

Marts Layer (`fct_` and `dim_`)

Marts are consumption-ready. They follow dimensional modeling conventions where appropriate:

Facts (fct_): measurable events at a defined grain (orders, page views, transactions)
Dimensions (dim_): descriptive entities (customers, products, dates)

Marts should be stable contracts for BI tools, ML features, and reverse ETL. Minimize breaking changes; version or deprecate thoughtfully.

Directory and Naming Conventions

Organize by layer and source system:

models/
  staging/
    shopify/
      stg_shopify__orders.sql
      stg_shopify__customers.sql
  intermediate/
    int_orders_enriched.sql
  marts/
    core/
      fct_orders.sql
      dim_customers.sql

Configure schema names in dbt_project.yml so staging lands in staging, marts in analytics, etc.

Naming conventions should be documented in the project README and enforced in code review.

Materialization Strategy

Default materializations by layer reduce decision fatigue:

Layer	Default	Rationale
Staging	view	Cheap rebuild, always fresh
Intermediate	ephemeral or view	Depends on reuse and cost
Marts	table or incremental	Performance for consumers

Use incremental models for large fact tables with append-heavy sources. Document unique keys and incremental strategies (merge, delete+insert, append) in model descriptions.

Key Takeaways

Use staging → intermediate → marts to separate concerns.
Keep staging thin; put shared logic in intermediate models.
Marts are stable contracts—design grains and keys deliberately.
Standardize naming, directories, and materializations across the project.

Reflection

How many layers does your team use today? Where does logic live that should move to a different layer?

Testing Strategy That Prevents Incidents

Tests That Matter

dbt ships with schema tests and supports custom generic and singular tests. The art is choosing tests that prevent real incidents without slowing development to a crawl. Prioritize tests on critical paths: primary keys, foreign keys, business invariants, and metrics that feed executive decisions.

Schema Tests in YAML

Define tests alongside models in schema.yml files:

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id

Co-locate tests with column definitions so reviewers see constraints in context.

Custom Generic Tests

When the same business rule applies across models, build a generic test. Example: assert_positive for amount columns, or assert_valid_status checking enum membership.

Generic tests live in tests/generic/ and accept arguments. They scale quality consistently across dozens of models.

Singular Tests for Complex Logic

Some rules require multi-table SQL: "refund amount never exceeds original order amount" or "monthly revenue matches sum of daily revenue." Write these as singular tests in tests/.

Use singular tests sparingly—they are harder to reuse but essential for cross-model invariants.

Unit Tests for Macros

Complex macros (currency conversion, attribution logic) deserve unit tests. dbt's unit testing feature (or external frameworks) lets you assert macro output for given inputs without building full models.

Test Selection by Tier

Align test rigor with dataset tier:

Tier 1 marts: unique + not_null on keys, relationship tests, custom business rules, freshness checks

Tier 2: keys and critical not_null constraints

Tier 3: best-effort, add tests when incidents occur

Store Failures for Debugging

Configure store_failures: true on selected tests so failing rows persist to audit tables. Analysts and on-call engineers can inspect exactly which records broke a rule without reproducing the failure manually.

Set retention policies on failure tables to avoid unbounded storage.

Testing in Development vs Production

Run tests locally before pushing. CI runs tests on modified subgraphs. Production runs full test suites on schedule or post-deploy.

Never disable a failing test to unblock a merge without understanding root cause. If the test is wrong, fix the test. If the data is wrong, fix the data.

Key Takeaways

Prioritize uniqueness, relationships, and business invariants on critical models.
Use generic tests for reusable rules; singular tests for complex cross-model logic.
Store failures to accelerate debugging.
Match test rigor to dataset tier.

Reflection

What test would have prevented your last data incident? Is that test in place today?

Documentation and Lineage

Documentation as a Product

Undocumented models are liabilities. The engineer who built int_attribution_v3 leaves; nobody knows which marketing touchpoints it credits or why a 30-day window was chosen. Documentation in dbt is not bureaucracy—it is the interface between producers and consumers.

Model and Column Descriptions

Every mart and dimension should have a model-level description explaining grain, business purpose, and known caveats. Every column consumed externally should describe meaning, source, and allowed values.

models:
  - name: fct_orders
    description: >
      One row per completed order. Excludes cancelled and test orders.
      Refreshed hourly. Primary consumer: finance revenue reporting.
    columns:
      - name: order_id
        description: Unique identifier from Shopify. Primary key.
      - name: net_revenue
        description: Order total minus tax, shipping, and refunds. USD.

Use the docs block in dbt docs to define reusable descriptions for common columns like created_at.

Exposures and Metrics

Model exposures link dbt models to downstream dashboards and applications. When a model changes, you know who to notify.

Define metrics in YAML (dbt Semantic Layer or metric definitions) so "monthly recurring revenue" has one canonical calculation, not twelve slightly different SQL versions in Looker, Tableau, and notebooks.

Lineage as Discovery

Auto-generated lineage in dbt docs helps new team members trace data flow. Encourage analysts to start debugging from the lineage graph rather than Slack threads.

Keep lineage clean by removing orphaned models and documenting sources thoroughly. Undocumented sources break trust in the graph.

Documentation Workflow

Make documentation a merge requirement for new marts. Pull request templates can include:

Model description added
Column descriptions for all exposed fields
Tests defined
Exposure linked if applicable

Review documentation in code review with the same rigor as SQL. Vague descriptions ("customer id") are worse than none—they imply completeness.

Key Takeaways

Document grain, purpose, and caveats at the model level.
Describe every externally consumed column.
Use exposures and metrics to connect models to downstream consumers.
Treat documentation updates as part of every mart PR.

Reflection

Open your most-used mart. Could a new analyst understand its grain and limitations from documentation alone? What is missing?

CI/CD and Deployment Patterns

Continuous Integration for Analytics

Analytics code deserves the same engineering discipline as application code. CI/CD for dbt reduces production incidents, speeds review cycles, and enables safe parallel development.

Pull Request CI Pipeline

A production-grade PR pipeline typically includes:

Linting: SQLFluff or similar for style consistency
Parse: dbt parse validates project structure
Slim CI run: dbt run --select state:modified+ against a staging target
Slim CI test: dbt test --select state:modified+
Optional: Compare row counts or checksums against production baseline

Use dbt defer to a production manifest so CI only builds what changed plus dependents, keeping feedback under ten minutes.

Environment Strategy

Maintain at least three environments:

Development: individual schemas (dbt_<user>_) for experimentation
Staging/CI: shared environment for merge validation
Production: scheduled or merge-triggered deploys

Never run experimental full-refresh against production without explicit approval and communication.

Deployment Orchestration

Common patterns:

Merge to main triggers production job: GitHub Actions or Airflow runs dbt build on production target after merge.

Scheduled production runs: Hourly or daily dbt run independent of merges, picking up merged code.

Blue-green for breaking changes: Build new tables with _v2 suffix, migrate consumers, deprecate old tables.

Coordinate deployments with downstream teams when column renames or grain changes affect dashboards.

Secrets and Targets

Store credentials in CI secret managers, not repository files. Use profiles.yml with environment variables for host, user, and password.

Separate targets in profiles for dev, ci, and prod with explicit role permissions. CI roles should not have production write access beyond staging schemas.

Handling Failures

When production dbt runs fail:

Alert on-call with model name and error
Do not silently retry without investigation
Document incident and add regression tests

Maintain run artifacts (logs, run results) for thirty days minimum to support debugging.

Key Takeaways

Run lint, parse, slim run, and slim test on every PR.
Separate dev, staging, and prod with appropriate permissions.
Coordinate breaking changes with downstream consumers.
Treat failed production runs as incidents, not routine noise.

Reflection

Does your team run dbt in CI today? What would break if you required passing tests before every merge to main?

dbt Best Practices

Project Structure and Layered Architecture

Why Structure Matters

The Staging → Intermediate → Marts Pattern

Staging Layer (stg_)

Intermediate Layer (int_)

Marts Layer (fct_ and dim_)

Directory and Naming Conventions

Materialization Strategy

Key Takeaways

Reflection

Testing Strategy That Prevents Incidents

Tests That Matter

Schema Tests in YAML

Custom Generic Tests

Singular Tests for Complex Logic

Unit Tests for Macros

Test Selection by Tier

Store Failures for Debugging

Testing in Development vs Production

Key Takeaways

Reflection

Documentation and Lineage

Documentation as a Product

Model and Column Descriptions

Exposures and Metrics

Lineage as Discovery

Documentation Workflow

Key Takeaways

Reflection

CI/CD and Deployment Patterns

Continuous Integration for Analytics

Pull Request CI Pipeline

Environment Strategy

Deployment Orchestration

Secrets and Targets

Handling Failures

Key Takeaways

Reflection

Staging Layer (`stg_`)

Intermediate Layer (`int_`)

Marts Layer (`fct_` and `dim_`)