dbt — Real World & Lab Guide

What is dbt?

dbt (data build tool) is an open-source transformation framework that lets data teams write modular SQL SELECT statements and compile them into a full transformation DAG (directed acyclic graph) with lineage, tests, and documentation — all version-controlled in Git.

dbt sits in the T of ELT. You load raw data into your warehouse first (Fivetran, Airbyte, Snowpipe…), then dbt transforms it into clean, tested, documented tables.

One-line interview pitch: "dbt lets analysts write SQL like engineers — models are functions, refs create dependency graphs, tests enforce contracts, and the whole thing deploys like software."

dbt Core vs dbt Cloud

Feature	dbt Core (OSS)	dbt Cloud
CLI runs	✓ Local/CI	✓ Managed jobs
Scheduling	External (Airflow, cron)	Built-in scheduler + webhooks
IDE	Your editor	Cloud IDE (browser)
Docs hosting	Self-host	One-click
Cost	Free	Paid (free tier exists)

Project structure

my_dbt_project/
├── dbt_project.yml        # project config: name, vars, model paths, materializations
├── profiles.yml           # connection config (usually ~/.dbt/profiles.yml)
├── models/
│   ├── staging/           # 1:1 with raw sources, light cleaning
│   │   ├── stg_orders.sql
│   │   └── _staging.yml   # source definitions + column tests
│   ├── intermediate/      # business logic, not exposed to BI
│   │   └── int_order_items_pivoted.sql
│   └── marts/             # final, consumer-ready models
│       ├── finance/
│       │   └── fct_revenue.sql
│       └── core/
│           └── dim_customers.sql
├── tests/                 # singular (custom) tests — one .sql per test
├── macros/                # reusable Jinja functions
├── seeds/                 # static CSV files loaded as tables
├── snapshots/             # SCD Type 2 history capture
├── analyses/              # ad-hoc SQL (compiled but not run by default)
└── docs/                  # markdown for doc blocks

Convention: staging/ = raw → clean rename; intermediate/ = joins/pivots; marts/ = fact/dim tables BI tools query.

Models

A model is a single .sql file containing one SELECT statement. dbt wraps it in CREATE TABLE AS or CREATE VIEW AS depending on the materialization.

Materializations

Type	What it does	When to use
view (default)	Creates a SQL view — no data stored	Staging, lightweight transforms
table	Full CTAS on every run	Small-medium models, frequently queried
incremental	INSERT/MERGE only new rows	Large fact tables — avoid full rebuilds
ephemeral	Compiled as a CTE, never materialised	Intermediate logic you don't want as a table
snapshot	SCD Type 2 history table	Slowly-changing dimension tracking

Incremental model example

{{ config(materialized='incremental', unique_key='order_id') }}

SELECT
    order_id,
    customer_id,
    order_date,
    status,
    total_amount
FROM {{ source('raw', 'orders') }}

{% if is_incremental() %}
  -- only process rows newer than the latest row in the existing table
  WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}

Watch out: Late-arriving data can be missed by incremental models. Add a buffer window (e.g. order_date > MAX(order_date) - INTERVAL '3 days') or use unique_key + merge strategy.

Sources & refs

`{{ source() }}` — raw data

References a table in your warehouse that dbt did not create. Define sources in a _sources.yml file for freshness checks and lineage.

-- models/staging/_staging.yml
version: 2
sources:
  - name: raw
    database: RAW_DB
    schema: salesforce
    tables:
      - name: orders
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
        loaded_at_field: _loaded_at
        columns:
          - name: order_id
            tests: [unique, not_null]

`{{ ref() }}` — dbt-managed models

References another model by name. dbt resolves the actual schema/table at compile time and builds the DAG automatically.

-- models/marts/fct_revenue.sql
SELECT
    o.order_id,
    c.customer_name,
    o.total_amount,
    o.order_date
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('dim_customers') }} c USING (customer_id)

Rule: Always use ref() for dbt models and source() for raw warehouse tables. Never hardcode schema names.

Tests

Generic tests (schema tests)

Defined in YAML, applied to columns. Built-in: unique, not_null, accepted_values, relationships.

-- models/marts/_marts.yml
version: 2
models:
  - name: fct_revenue
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'delivered', 'cancelled']
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id

Singular tests (custom SQL)

A .sql file in tests/ that returns rows when the test fails. Zero rows = pass.

-- tests/assert_revenue_positive.sql
SELECT order_id, total_amount
FROM {{ ref('fct_revenue') }}
WHERE total_amount <= 0

dbt-utils & dbt-expectations

Popular packages that add extra test types: not_empty_string, at_least_one, expect_column_values_to_be_between, and more.

-- packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]

Macros & Jinja

dbt uses Jinja2 templating inside SQL files. Macros are reusable functions defined in macros/.

Built-in Jinja

-- Use variables
{{ var('start_date', '2024-01-01') }}

-- Conditional logic
{% if target.name == 'prod' %}
  WHERE created_at >= '2020-01-01'
{% else %}
  WHERE created_at >= DATEADD('year', -1, CURRENT_DATE)
{% endif %}

-- Loop
{% set payment_methods = ['credit_card', 'bank_transfer', 'gift_card'] %}
SELECT
  order_id,
  {% for method in payment_methods %}
  SUM(CASE WHEN payment_method = '{{ method }}' THEN amount END) AS {{ method }}_amount
  {% if not loop.last %},{% endif %}
  {% endfor %}
FROM {{ ref('stg_payments') }}
GROUP BY 1

Custom macro example

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, decimal_places=2) %}
  ROUND({{ column_name }} / 100.0, {{ decimal_places }})
{% endmacro %}

-- Usage in a model:
SELECT {{ cents_to_dollars('price_cents') }} AS price_dollars FROM ...

Snapshots (SCD Type 2)

Snapshots capture how a row looked at different points in time — creating a history table with dbt_valid_from / dbt_valid_to columns.

-- snapshots/snp_customers.sql
{% snapshot snp_customers %}

{{
  config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}

SELECT * FROM {{ source('raw', 'customers') }}

{% endsnapshot %}

Strategy	How it detects changes
timestamp	Compares `updated_at` column — fast, requires reliable update timestamp
check	Hashes specified columns — no timestamp needed, more expensive

Seeds

Small static CSV files committed to the repo and loaded into the warehouse as tables via dbt seed. Good for lookup tables: country codes, cost centres, exchange rates.

-- seeds/country_codes.csv
country_code,country_name
US,United States
GB,United Kingdom
AU,Australia

-- Reference a seed like any other model
SELECT * FROM {{ ref('country_codes') }}

Limit: Seeds are not for big data. Keep them under a few thousand rows. For larger lookups, use a warehouse table and source().

Snowflake integration

profiles.yml for Snowflake

my_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: myorg-myaccount
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: TRANSFORMER
      warehouse: DBT_WH
      database: DEV_DB
      schema: DBT_DEV
      threads: 4
    prod:
      type: snowflake
      account: myorg-myaccount
      authenticator: externalbrowser   # SSO
      role: TRANSFORMER
      warehouse: DBT_WH_LARGE
      database: PROD_DB
      schema: DBT_PROD
      threads: 16

Snowflake-specific config options

{{ config(
    materialized='table',
    transient=true,          -- skips Fail-Safe (saves storage cost)
    cluster_by=['order_date', 'region'],
    query_tag='dbt_{{ model.name }}',
    copy_grants=true
) }}

Snowflake + dbt patterns

Use transient tables for intermediate/ephemeral models — no Fail-Safe charges.
Cluster keys on large fact tables match your most common filter columns.
Warehouse sizing: small warehouse for dev/test; dedicated larger warehouse for prod full refresh runs.
Dynamic Tables (Snowflake feature): dbt can target Snowflake Dynamic Tables for near-real-time refresh — check dynamic_table materialization in dbt-snowflake adapter.
Query tags: Set query_tag config to trace dbt runs in Snowflake's query history.

Snowflake access roles pattern

Role	Privilege	Used by
TRANSFORMER	CREATE TABLE/VIEW in dbt schemas	dbt service account
REPORTER	SELECT on marts schema	BI tools, analysts
LOADER	INSERT into raw schema	Fivetran / Airbyte

Key dbt commands

Command	What it does
`dbt run`	Build all models (or use `--select` to target specific ones)
`dbt test`	Run all tests; use `--select` for subset
`dbt build`	Run + test + seed + snapshot in dependency order (preferred in CI)
`dbt seed`	Load CSV seeds into the warehouse
`dbt snapshot`	Run snapshot models
`dbt compile`	Compile Jinja → pure SQL (inspect in `target/compiled/`)
`dbt docs generate`	Generate docs site JSON
`dbt docs serve`	Serve docs locally at localhost:8080
`dbt source freshness`	Check source table staleness against thresholds
`dbt run-operation <macro>`	Execute a macro directly (e.g. `create_schema`)

Node selection syntax

# Run a single model
dbt run --select fct_revenue

# Run a model and all its upstream dependencies
dbt run --select +fct_revenue

# Run a model and all its downstream dependents
dbt run --select fct_revenue+

# Run all models in the marts folder
dbt run --select marts.*

# Run only models changed since last commit (CI optimization)
dbt run --select state:modified+

# Run all failed models from the previous run
dbt run --select result:error --state ./target

Interview Q&A

What is the difference between ref() and source()?

source() points to raw data in the warehouse that dbt didn't create (loaded by Fivetran, Snowpipe, etc.). It enables freshness checks and lineage back to ingestion. ref() points to another dbt model — it resolves the correct schema at compile time and builds the DAG automatically. Never hardcode schema names; always use one of these two functions.

When would you use an incremental model vs a full table?

Use incremental when the table grows continuously and a full rebuild is too slow or expensive (large fact tables: events, transactions). Use table when the data is small-medium, the rebuild is fast, or correctness requires seeing all rows at once (e.g. aggregates that could change historically). Watch for late-arriving data with incremental — add a lookback buffer or use unique_key + merge strategy.

How do you handle slowly-changing dimensions in dbt?

Use dbt snapshots for SCD Type 2. Define a snapshot with a unique_key and either a timestamp strategy (compare updated_at) or a check strategy (hash specified columns). dbt adds dbt_scd_id, dbt_valid_from, and dbt_valid_to columns. For SCD Type 1 (overwrite), a regular incremental model with unique_key and merge works fine.

How would you structure a dbt project for a team of 10 analysts?

Layered folder structure: staging/ → intermediate/ → marts/. Staging is 1:1 with sources; marts are BI-ready.
Sub-folders by domain: marts/finance/, marts/marketing/ — ownership is clear.
Naming conventions: stg_, int_, fct_, dim_ prefixes. Enforced via dbt package dbt_project_evaluator.
One YAML file per sub-folder for sources/tests, not one giant schema.yml.
CI runs dbt build --select state:modified+ to test only changed lineage.

What is the dbt DAG and why does it matter?

The DAG (directed acyclic graph) is the dependency graph dbt derives from all your ref() calls. It determines build order — upstream models run first. It matters because: (1) dbt can run independent branches in parallel, (2) --select model+ lets you rebuild a model and everything downstream, (3) dbt docs visualises the full lineage so analysts understand data flow. A cycle in ref() calls is an error — dbt enforces acyclicity.

How do you test data quality in dbt?

Three layers: (1) Generic tests — unique, not_null, accepted_values, relationships defined in YAML; fast and declarative. (2) Singular tests — custom SQL in tests/ that returns rows on failure; good for business-logic assertions (e.g. revenue should never be negative). (3) Packages — dbt_expectations adds statistical and range tests; dbt_utils adds cross-model tests. Run dbt test in CI after every dbt run, or use dbt build to run both in one command.

How do you optimise dbt models on Snowflake?

Cluster keys on large tables matching typical filter columns (date, region, customer_id).
Transient tables for staging/intermediate — skip Fail-Safe, save storage cost.
Incremental models for growing fact tables — avoid full rebuild cost.
Separate warehouses for dbt runs vs BI queries — avoid contention.
Thread count: increase threads in profiles.yml to run more models in parallel.
Avoid SELECT * from large tables in staging — project only needed columns.
Query tags: query_tag: dbt_{{ model.name }} — see cost breakdown per model in Snowflake's query history.

What is a dbt macro and when would you write one?

A macro is a Jinja function defined in macros/ and callable from any model, test, or other macro. Write one when you have: (1) repeated SQL patterns (e.g. converting cents to dollars, generating surrogate keys, pivoting columns); (2) environment-specific logic (target.name == 'prod'); (3) operations that don't fit in a model — dbt run-operation can execute macros for admin tasks like schema creation. Don't over-macro — if it's used in one place, keep it inline.

How does dbt fit with Airflow / orchestration?

dbt handles within-warehouse transformation ordering via the DAG; Airflow (or Prefect, Dagster) handles cross-system orchestration — triggering ingestion, then dbt, then downstream exports. Common patterns: (1) Shell operator — Airflow runs dbt run as a subprocess; simple but opaque. (2) dbt Cloud API — trigger a dbt Cloud job via HTTP; monitor status. (3) Cosmos (Astronomer) — parses dbt manifests and creates individual Airflow tasks per dbt model, giving full task-level visibility and retry.

Pair this with Snowflake Architect and SQL Reference Guide for full stack interview prep.

What is dbt?

dbt Core vs dbt Cloud

Project structure

Models

Materializations

Incremental model example

Sources & refs

{{ source() }} — raw data

{{ ref() }} — dbt-managed models

Tests

Generic tests (schema tests)

Singular tests (custom SQL)

dbt-utils & dbt-expectations

Macros & Jinja

Built-in Jinja

Custom macro example

Snapshots (SCD Type 2)

Seeds

Snowflake integration

profiles.yml for Snowflake

Snowflake-specific config options

Snowflake + dbt patterns

Snowflake access roles pattern

Key dbt commands

Node selection syntax

Interview Q&A

`{{ source() }}` — raw data

`{{ ref() }}` — dbt-managed models