dbt — Real World & Lab Guide

← Hub

What is dbt?

dbt (data build tool) is an open-source transformation framework that lets data teams write modular SQL SELECT statements and compile them into a full transformation DAG (directed acyclic graph) with lineage, tests, and documentation — all version-controlled in Git.

dbt sits in the T of ELT. You load raw data into your warehouse first (Fivetran, Airbyte, Snowpipe…), then dbt transforms it into clean, tested, documented tables.

One-line interview pitch: "dbt lets analysts write SQL like engineers — models are functions, refs create dependency graphs, tests enforce contracts, and the whole thing deploys like software."

dbt Core vs dbt Cloud

Featuredbt Core (OSS)dbt Cloud
CLI runs✓ Local/CI✓ Managed jobs
SchedulingExternal (Airflow, cron)Built-in scheduler + webhooks
IDEYour editorCloud IDE (browser)
Docs hostingSelf-hostOne-click
CostFreePaid (free tier exists)

Project structure

my_dbt_project/
├── dbt_project.yml        # project config: name, vars, model paths, materializations
├── profiles.yml           # connection config (usually ~/.dbt/profiles.yml)
├── models/
│   ├── staging/           # 1:1 with raw sources, light cleaning
│   │   ├── stg_orders.sql
│   │   └── _staging.yml   # source definitions + column tests
│   ├── intermediate/      # business logic, not exposed to BI
│   │   └── int_order_items_pivoted.sql
│   └── marts/             # final, consumer-ready models
│       ├── finance/
│       │   └── fct_revenue.sql
│       └── core/
│           └── dim_customers.sql
├── tests/                 # singular (custom) tests — one .sql per test
├── macros/                # reusable Jinja functions
├── seeds/                 # static CSV files loaded as tables
├── snapshots/             # SCD Type 2 history capture
├── analyses/              # ad-hoc SQL (compiled but not run by default)
└── docs/                  # markdown for doc blocks
Convention: staging/ = raw → clean rename; intermediate/ = joins/pivots; marts/ = fact/dim tables BI tools query.

Models

A model is a single .sql file containing one SELECT statement. dbt wraps it in CREATE TABLE AS or CREATE VIEW AS depending on the materialization.

Materializations

TypeWhat it doesWhen to use
view (default)Creates a SQL view — no data storedStaging, lightweight transforms
tableFull CTAS on every runSmall-medium models, frequently queried
incrementalINSERT/MERGE only new rowsLarge fact tables — avoid full rebuilds
ephemeralCompiled as a CTE, never materialisedIntermediate logic you don't want as a table
snapshotSCD Type 2 history tableSlowly-changing dimension tracking

Incremental model example

{{ config(materialized='incremental', unique_key='order_id') }}

SELECT
    order_id,
    customer_id,
    order_date,
    status,
    total_amount
FROM {{ source('raw', 'orders') }}

{% if is_incremental() %}
  -- only process rows newer than the latest row in the existing table
  WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
Watch out: Late-arriving data can be missed by incremental models. Add a buffer window (e.g. order_date > MAX(order_date) - INTERVAL '3 days') or use unique_key + merge strategy.

Sources & refs

{{ source() }} — raw data

References a table in your warehouse that dbt did not create. Define sources in a _sources.yml file for freshness checks and lineage.

-- models/staging/_staging.yml
version: 2
sources:
  - name: raw
    database: RAW_DB
    schema: salesforce
    tables:
      - name: orders
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
        loaded_at_field: _loaded_at
        columns:
          - name: order_id
            tests: [unique, not_null]

{{ ref() }} — dbt-managed models

References another model by name. dbt resolves the actual schema/table at compile time and builds the DAG automatically.

-- models/marts/fct_revenue.sql
SELECT
    o.order_id,
    c.customer_name,
    o.total_amount,
    o.order_date
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('dim_customers') }} c USING (customer_id)
Rule: Always use ref() for dbt models and source() for raw warehouse tables. Never hardcode schema names.

Tests

Generic tests (schema tests)

Defined in YAML, applied to columns. Built-in: unique, not_null, accepted_values, relationships.

-- models/marts/_marts.yml
version: 2
models:
  - name: fct_revenue
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'delivered', 'cancelled']
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id

Singular tests (custom SQL)

A .sql file in tests/ that returns rows when the test fails. Zero rows = pass.

-- tests/assert_revenue_positive.sql
SELECT order_id, total_amount
FROM {{ ref('fct_revenue') }}
WHERE total_amount <= 0

dbt-utils & dbt-expectations

Popular packages that add extra test types: not_empty_string, at_least_one, expect_column_values_to_be_between, and more.

-- packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]

Macros & Jinja

dbt uses Jinja2 templating inside SQL files. Macros are reusable functions defined in macros/.

Built-in Jinja

-- Use variables
{{ var('start_date', '2024-01-01') }}

-- Conditional logic
{% if target.name == 'prod' %}
  WHERE created_at >= '2020-01-01'
{% else %}
  WHERE created_at >= DATEADD('year', -1, CURRENT_DATE)
{% endif %}

-- Loop
{% set payment_methods = ['credit_card', 'bank_transfer', 'gift_card'] %}
SELECT
  order_id,
  {% for method in payment_methods %}
  SUM(CASE WHEN payment_method = '{{ method }}' THEN amount END) AS {{ method }}_amount
  {% if not loop.last %},{% endif %}
  {% endfor %}
FROM {{ ref('stg_payments') }}
GROUP BY 1

Custom macro example

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, decimal_places=2) %}
  ROUND({{ column_name }} / 100.0, {{ decimal_places }})
{% endmacro %}

-- Usage in a model:
SELECT {{ cents_to_dollars('price_cents') }} AS price_dollars FROM ...

Snapshots (SCD Type 2)

Snapshots capture how a row looked at different points in time — creating a history table with dbt_valid_from / dbt_valid_to columns.

-- snapshots/snp_customers.sql
{% snapshot snp_customers %}

{{
  config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}

SELECT * FROM {{ source('raw', 'customers') }}

{% endsnapshot %}
StrategyHow it detects changes
timestampCompares updated_at column — fast, requires reliable update timestamp
checkHashes specified columns — no timestamp needed, more expensive

Seeds

Small static CSV files committed to the repo and loaded into the warehouse as tables via dbt seed. Good for lookup tables: country codes, cost centres, exchange rates.

-- seeds/country_codes.csv
country_code,country_name
US,United States
GB,United Kingdom
AU,Australia
-- Reference a seed like any other model
SELECT * FROM {{ ref('country_codes') }}
Limit: Seeds are not for big data. Keep them under a few thousand rows. For larger lookups, use a warehouse table and source().

Snowflake integration

profiles.yml for Snowflake

my_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: myorg-myaccount
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: TRANSFORMER
      warehouse: DBT_WH
      database: DEV_DB
      schema: DBT_DEV
      threads: 4
    prod:
      type: snowflake
      account: myorg-myaccount
      authenticator: externalbrowser   # SSO
      role: TRANSFORMER
      warehouse: DBT_WH_LARGE
      database: PROD_DB
      schema: DBT_PROD
      threads: 16

Snowflake-specific config options

{{ config(
    materialized='table',
    transient=true,          -- skips Fail-Safe (saves storage cost)
    cluster_by=['order_date', 'region'],
    query_tag='dbt_{{ model.name }}',
    copy_grants=true
) }}

Snowflake + dbt patterns

Snowflake access roles pattern

RolePrivilegeUsed by
TRANSFORMERCREATE TABLE/VIEW in dbt schemasdbt service account
REPORTERSELECT on marts schemaBI tools, analysts
LOADERINSERT into raw schemaFivetran / Airbyte

Key dbt commands

CommandWhat it does
dbt runBuild all models (or use --select to target specific ones)
dbt testRun all tests; use --select for subset
dbt buildRun + test + seed + snapshot in dependency order (preferred in CI)
dbt seedLoad CSV seeds into the warehouse
dbt snapshotRun snapshot models
dbt compileCompile Jinja → pure SQL (inspect in target/compiled/)
dbt docs generateGenerate docs site JSON
dbt docs serveServe docs locally at localhost:8080
dbt source freshnessCheck source table staleness against thresholds
dbt run-operation <macro>Execute a macro directly (e.g. create_schema)

Node selection syntax

# Run a single model
dbt run --select fct_revenue

# Run a model and all its upstream dependencies
dbt run --select +fct_revenue

# Run a model and all its downstream dependents
dbt run --select fct_revenue+

# Run all models in the marts folder
dbt run --select marts.*

# Run only models changed since last commit (CI optimization)
dbt run --select state:modified+

# Run all failed models from the previous run
dbt run --select result:error --state ./target

Interview Q&A

What is the difference between ref() and source()?

source() points to raw data in the warehouse that dbt didn't create (loaded by Fivetran, Snowpipe, etc.). It enables freshness checks and lineage back to ingestion. ref() points to another dbt model — it resolves the correct schema at compile time and builds the DAG automatically. Never hardcode schema names; always use one of these two functions.

When would you use an incremental model vs a full table?

Use incremental when the table grows continuously and a full rebuild is too slow or expensive (large fact tables: events, transactions). Use table when the data is small-medium, the rebuild is fast, or correctness requires seeing all rows at once (e.g. aggregates that could change historically). Watch for late-arriving data with incremental — add a lookback buffer or use unique_key + merge strategy.

How do you handle slowly-changing dimensions in dbt?

Use dbt snapshots for SCD Type 2. Define a snapshot with a unique_key and either a timestamp strategy (compare updated_at) or a check strategy (hash specified columns). dbt adds dbt_scd_id, dbt_valid_from, and dbt_valid_to columns. For SCD Type 1 (overwrite), a regular incremental model with unique_key and merge works fine.

How would you structure a dbt project for a team of 10 analysts?
  • Layered folder structure: staging/intermediate/marts/. Staging is 1:1 with sources; marts are BI-ready.
  • Sub-folders by domain: marts/finance/, marts/marketing/ — ownership is clear.
  • Naming conventions: stg_, int_, fct_, dim_ prefixes. Enforced via dbt package dbt_project_evaluator.
  • One YAML file per sub-folder for sources/tests, not one giant schema.yml.
  • CI runs dbt build --select state:modified+ to test only changed lineage.
What is the dbt DAG and why does it matter?

The DAG (directed acyclic graph) is the dependency graph dbt derives from all your ref() calls. It determines build order — upstream models run first. It matters because: (1) dbt can run independent branches in parallel, (2) --select model+ lets you rebuild a model and everything downstream, (3) dbt docs visualises the full lineage so analysts understand data flow. A cycle in ref() calls is an error — dbt enforces acyclicity.

How do you test data quality in dbt?

Three layers: (1) Generic testsunique, not_null, accepted_values, relationships defined in YAML; fast and declarative. (2) Singular tests — custom SQL in tests/ that returns rows on failure; good for business-logic assertions (e.g. revenue should never be negative). (3) Packagesdbt_expectations adds statistical and range tests; dbt_utils adds cross-model tests. Run dbt test in CI after every dbt run, or use dbt build to run both in one command.

How do you optimise dbt models on Snowflake?
  • Cluster keys on large tables matching typical filter columns (date, region, customer_id).
  • Transient tables for staging/intermediate — skip Fail-Safe, save storage cost.
  • Incremental models for growing fact tables — avoid full rebuild cost.
  • Separate warehouses for dbt runs vs BI queries — avoid contention.
  • Thread count: increase threads in profiles.yml to run more models in parallel.
  • Avoid SELECT * from large tables in staging — project only needed columns.
  • Query tags: query_tag: dbt_{{ model.name }} — see cost breakdown per model in Snowflake's query history.
What is a dbt macro and when would you write one?

A macro is a Jinja function defined in macros/ and callable from any model, test, or other macro. Write one when you have: (1) repeated SQL patterns (e.g. converting cents to dollars, generating surrogate keys, pivoting columns); (2) environment-specific logic (target.name == 'prod'); (3) operations that don't fit in a model — dbt run-operation can execute macros for admin tasks like schema creation. Don't over-macro — if it's used in one place, keep it inline.

How does dbt fit with Airflow / orchestration?

dbt handles within-warehouse transformation ordering via the DAG; Airflow (or Prefect, Dagster) handles cross-system orchestration — triggering ingestion, then dbt, then downstream exports. Common patterns: (1) Shell operator — Airflow runs dbt run as a subprocess; simple but opaque. (2) dbt Cloud API — trigger a dbt Cloud job via HTTP; monitor status. (3) Cosmos (Astronomer) — parses dbt manifests and creates individual Airflow tasks per dbt model, giving full task-level visibility and retry.

Pair this with Snowflake Architect and SQL Reference Guide for full stack interview prep.