dbt (data build tool) is an open-source transformation framework that lets data teams write modular SQL SELECT statements and compile them into a full transformation DAG (directed acyclic graph) with lineage, tests, and documentation — all version-controlled in Git.
dbt sits in the T of ELT. You load raw data into your warehouse first (Fivetran, Airbyte, Snowpipe…), then dbt transforms it into clean, tested, documented tables.
| Feature | dbt Core (OSS) | dbt Cloud |
|---|---|---|
| CLI runs | ✓ Local/CI | ✓ Managed jobs |
| Scheduling | External (Airflow, cron) | Built-in scheduler + webhooks |
| IDE | Your editor | Cloud IDE (browser) |
| Docs hosting | Self-host | One-click |
| Cost | Free | Paid (free tier exists) |
my_dbt_project/
├── dbt_project.yml # project config: name, vars, model paths, materializations
├── profiles.yml # connection config (usually ~/.dbt/profiles.yml)
├── models/
│ ├── staging/ # 1:1 with raw sources, light cleaning
│ │ ├── stg_orders.sql
│ │ └── _staging.yml # source definitions + column tests
│ ├── intermediate/ # business logic, not exposed to BI
│ │ └── int_order_items_pivoted.sql
│ └── marts/ # final, consumer-ready models
│ ├── finance/
│ │ └── fct_revenue.sql
│ └── core/
│ └── dim_customers.sql
├── tests/ # singular (custom) tests — one .sql per test
├── macros/ # reusable Jinja functions
├── seeds/ # static CSV files loaded as tables
├── snapshots/ # SCD Type 2 history capture
├── analyses/ # ad-hoc SQL (compiled but not run by default)
└── docs/ # markdown for doc blocks
staging/ = raw → clean rename; intermediate/ = joins/pivots; marts/ = fact/dim tables BI tools query.
A model is a single .sql file containing one SELECT statement. dbt wraps it in CREATE TABLE AS or CREATE VIEW AS depending on the materialization.
| Type | What it does | When to use |
|---|---|---|
| view (default) | Creates a SQL view — no data stored | Staging, lightweight transforms |
| table | Full CTAS on every run | Small-medium models, frequently queried |
| incremental | INSERT/MERGE only new rows | Large fact tables — avoid full rebuilds |
| ephemeral | Compiled as a CTE, never materialised | Intermediate logic you don't want as a table |
| snapshot | SCD Type 2 history table | Slowly-changing dimension tracking |
{{ config(materialized='incremental', unique_key='order_id') }}
SELECT
order_id,
customer_id,
order_date,
status,
total_amount
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
-- only process rows newer than the latest row in the existing table
WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
order_date > MAX(order_date) - INTERVAL '3 days') or use unique_key + merge strategy.
{{ source() }} — raw dataReferences a table in your warehouse that dbt did not create. Define sources in a _sources.yml file for freshness checks and lineage.
-- models/staging/_staging.yml
version: 2
sources:
- name: raw
database: RAW_DB
schema: salesforce
tables:
- name: orders
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
loaded_at_field: _loaded_at
columns:
- name: order_id
tests: [unique, not_null]
{{ ref() }} — dbt-managed modelsReferences another model by name. dbt resolves the actual schema/table at compile time and builds the DAG automatically.
-- models/marts/fct_revenue.sql
SELECT
o.order_id,
c.customer_name,
o.total_amount,
o.order_date
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('dim_customers') }} c USING (customer_id)
ref() for dbt models and source() for raw warehouse tables. Never hardcode schema names.
Defined in YAML, applied to columns. Built-in: unique, not_null, accepted_values, relationships.
-- models/marts/_marts.yml
version: 2
models:
- name: fct_revenue
columns:
- name: order_id
tests:
- unique
- not_null
- name: status
tests:
- accepted_values:
values: ['placed', 'shipped', 'delivered', 'cancelled']
- name: customer_id
tests:
- relationships:
to: ref('dim_customers')
field: customer_id
A .sql file in tests/ that returns rows when the test fails. Zero rows = pass.
-- tests/assert_revenue_positive.sql
SELECT order_id, total_amount
FROM {{ ref('fct_revenue') }}
WHERE total_amount <= 0
Popular packages that add extra test types: not_empty_string, at_least_one, expect_column_values_to_be_between, and more.
-- packages.yml
packages:
- package: dbt-labs/dbt_utils
version: [">=1.0.0", "<2.0.0"]
- package: calogica/dbt_expectations
version: [">=0.10.0", "<0.11.0"]
dbt uses Jinja2 templating inside SQL files. Macros are reusable functions defined in macros/.
-- Use variables
{{ var('start_date', '2024-01-01') }}
-- Conditional logic
{% if target.name == 'prod' %}
WHERE created_at >= '2020-01-01'
{% else %}
WHERE created_at >= DATEADD('year', -1, CURRENT_DATE)
{% endif %}
-- Loop
{% set payment_methods = ['credit_card', 'bank_transfer', 'gift_card'] %}
SELECT
order_id,
{% for method in payment_methods %}
SUM(CASE WHEN payment_method = '{{ method }}' THEN amount END) AS {{ method }}_amount
{% if not loop.last %},{% endif %}
{% endfor %}
FROM {{ ref('stg_payments') }}
GROUP BY 1
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, decimal_places=2) %}
ROUND({{ column_name }} / 100.0, {{ decimal_places }})
{% endmacro %}
-- Usage in a model:
SELECT {{ cents_to_dollars('price_cents') }} AS price_dollars FROM ...
Snapshots capture how a row looked at different points in time — creating a history table with dbt_valid_from / dbt_valid_to columns.
-- snapshots/snp_customers.sql
{% snapshot snp_customers %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at'
)
}}
SELECT * FROM {{ source('raw', 'customers') }}
{% endsnapshot %}
| Strategy | How it detects changes |
|---|---|
| timestamp | Compares updated_at column — fast, requires reliable update timestamp |
| check | Hashes specified columns — no timestamp needed, more expensive |
Small static CSV files committed to the repo and loaded into the warehouse as tables via dbt seed. Good for lookup tables: country codes, cost centres, exchange rates.
-- seeds/country_codes.csv
country_code,country_name
US,United States
GB,United Kingdom
AU,Australia
-- Reference a seed like any other model
SELECT * FROM {{ ref('country_codes') }}
source().
my_project:
target: dev
outputs:
dev:
type: snowflake
account: myorg-myaccount
user: "{{ env_var('DBT_USER') }}"
password: "{{ env_var('DBT_PASSWORD') }}"
role: TRANSFORMER
warehouse: DBT_WH
database: DEV_DB
schema: DBT_DEV
threads: 4
prod:
type: snowflake
account: myorg-myaccount
authenticator: externalbrowser # SSO
role: TRANSFORMER
warehouse: DBT_WH_LARGE
database: PROD_DB
schema: DBT_PROD
threads: 16
{{ config(
materialized='table',
transient=true, -- skips Fail-Safe (saves storage cost)
cluster_by=['order_date', 'region'],
query_tag='dbt_{{ model.name }}',
copy_grants=true
) }}
dynamic_table materialization in dbt-snowflake adapter.query_tag config to trace dbt runs in Snowflake's query history.| Role | Privilege | Used by |
|---|---|---|
| TRANSFORMER | CREATE TABLE/VIEW in dbt schemas | dbt service account |
| REPORTER | SELECT on marts schema | BI tools, analysts |
| LOADER | INSERT into raw schema | Fivetran / Airbyte |
| Command | What it does |
|---|---|
dbt run | Build all models (or use --select to target specific ones) |
dbt test | Run all tests; use --select for subset |
dbt build | Run + test + seed + snapshot in dependency order (preferred in CI) |
dbt seed | Load CSV seeds into the warehouse |
dbt snapshot | Run snapshot models |
dbt compile | Compile Jinja → pure SQL (inspect in target/compiled/) |
dbt docs generate | Generate docs site JSON |
dbt docs serve | Serve docs locally at localhost:8080 |
dbt source freshness | Check source table staleness against thresholds |
dbt run-operation <macro> | Execute a macro directly (e.g. create_schema) |
# Run a single model
dbt run --select fct_revenue
# Run a model and all its upstream dependencies
dbt run --select +fct_revenue
# Run a model and all its downstream dependents
dbt run --select fct_revenue+
# Run all models in the marts folder
dbt run --select marts.*
# Run only models changed since last commit (CI optimization)
dbt run --select state:modified+
# Run all failed models from the previous run
dbt run --select result:error --state ./target
ref() and source()?source() points to raw data in the warehouse that dbt didn't create (loaded by Fivetran, Snowpipe, etc.). It enables freshness checks and lineage back to ingestion. ref() points to another dbt model — it resolves the correct schema at compile time and builds the DAG automatically. Never hardcode schema names; always use one of these two functions.
Use incremental when the table grows continuously and a full rebuild is too slow or expensive (large fact tables: events, transactions). Use table when the data is small-medium, the rebuild is fast, or correctness requires seeing all rows at once (e.g. aggregates that could change historically). Watch for late-arriving data with incremental — add a lookback buffer or use unique_key + merge strategy.
Use dbt snapshots for SCD Type 2. Define a snapshot with a unique_key and either a timestamp strategy (compare updated_at) or a check strategy (hash specified columns). dbt adds dbt_scd_id, dbt_valid_from, and dbt_valid_to columns. For SCD Type 1 (overwrite), a regular incremental model with unique_key and merge works fine.
staging/ → intermediate/ → marts/. Staging is 1:1 with sources; marts are BI-ready.marts/finance/, marts/marketing/ — ownership is clear.stg_, int_, fct_, dim_ prefixes. Enforced via dbt package dbt_project_evaluator.dbt build --select state:modified+ to test only changed lineage.The DAG (directed acyclic graph) is the dependency graph dbt derives from all your ref() calls. It determines build order — upstream models run first. It matters because: (1) dbt can run independent branches in parallel, (2) --select model+ lets you rebuild a model and everything downstream, (3) dbt docs visualises the full lineage so analysts understand data flow. A cycle in ref() calls is an error — dbt enforces acyclicity.
Three layers: (1) Generic tests — unique, not_null, accepted_values, relationships defined in YAML; fast and declarative. (2) Singular tests — custom SQL in tests/ that returns rows on failure; good for business-logic assertions (e.g. revenue should never be negative). (3) Packages — dbt_expectations adds statistical and range tests; dbt_utils adds cross-model tests. Run dbt test in CI after every dbt run, or use dbt build to run both in one command.
threads in profiles.yml to run more models in parallel.SELECT * from large tables in staging — project only needed columns.query_tag: dbt_{{ model.name }} — see cost breakdown per model in Snowflake's query history.A macro is a Jinja function defined in macros/ and callable from any model, test, or other macro. Write one when you have: (1) repeated SQL patterns (e.g. converting cents to dollars, generating surrogate keys, pivoting columns); (2) environment-specific logic (target.name == 'prod'); (3) operations that don't fit in a model — dbt run-operation can execute macros for admin tasks like schema creation. Don't over-macro — if it's used in one place, keep it inline.
dbt handles within-warehouse transformation ordering via the DAG; Airflow (or Prefect, Dagster) handles cross-system orchestration — triggering ingestion, then dbt, then downstream exports. Common patterns: (1) Shell operator — Airflow runs dbt run as a subprocess; simple but opaque. (2) dbt Cloud API — trigger a dbt Cloud job via HTTP; monitor status. (3) Cosmos (Astronomer) — parses dbt manifests and creates individual Airflow tasks per dbt model, giving full task-level visibility and retry.
Pair this with Snowflake Architect and SQL Reference Guide for full stack interview prep.