Data Modeling

What is Data Modeling?

The practice of defining how data is structured, organized, and related in your warehouse — the difference between dashboards that are fast and trustworthy versus slow and wrong.

Quick Answer

Data modeling is the process of defining how data is structured, stored, and related within a database or data warehouse. It specifies tables, columns, data types, relationships, and — most critically — the grain (what one row represents). The most common approach for analytics is dimensional modeling: fact tables (measurements) joined to dimension tables (context), arranged in a star schema. Good data models make queries fast, results trustworthy, and analytics self-serve.

What is Data Modeling?

Every table in your warehouse is a data model. The question is whether it was designed intentionally or just dumped from a source system. A well-designed data model answers three questions:

  • What does one row represent? (the grain)
  • What can you measure? (fact columns: revenue, quantity, duration)
  • How can you slice it? (dimension columns: customer, product, date, region)

Data modeling sits between raw source data (staging) and the analytics layer (marts). In the modern stack it is typically implemented in dbt — SQL transformations that produce clean, tested, documented models in your warehouse.
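In practice the first modeling step is a staging model: a light rename-and-cast pass over a single source table. A minimal sketch in dbt (source, table, and column names here are illustrative, not from any particular schema):

```sql
-- models/staging/stg_orders.sql  (illustrative names)
SELECT
    id                      AS order_id,
    line_item_id            AS order_item_id,
    customer_id,
    product_id,
    created_at              AS ordered_at,
    qty                     AS quantity,
    price_cents / 100.0     AS unit_price,
    discount_cents / 100.0  AS discount_amount
FROM {{ source('shop', 'orders') }}
```

Staging models do no business logic; they just give downstream models consistent names and types to build on.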

Dimensional Modeling

  • Facts + dimensions in a star or snowflake schema
  • Optimized for analytical queries and aggregations
  • Business-friendly column names and grain
  • Best for: BI tools, dashboards, self-serve analytics

Data Vault

  • Hubs, Links, Satellites for full auditability
  • Handles many source systems with different history
  • Load-flexible — add new sources without refactoring
  • Best for: enterprise EDW, compliance, multi-source

Why Data Modeling Matters

Without Data Modeling

  • Analysts write 50-line queries to answer basic questions
  • Finance and marketing report different revenue numbers
  • Dashboards break when source schemas change
  • New analysts need weeks to understand the data
  • Joins produce duplicates — nobody knows why

With Data Modeling

  • Analysts query fct_orders and get answers in 3 lines
  • Single source of truth — one definition of revenue
  • Models tested and documented in dbt
  • New analysts onboard in hours, not weeks
  • Grain is declared — joins are safe by design

What You Can Build with Data Modeling

Revenue Reporting

A fct_orders fact table joined to dim_customers and dim_products — one source of truth for finance and growth.

Product Analytics

fct_events + dim_users enables funnel analysis, retention cohorts, and feature adoption metrics.

Marketing Attribution

fct_sessions and fct_conversions with dim_campaigns for multi-touch attribution across channels.

Inventory & Supply Chain

fct_inventory_snapshots with slowly changing dimensions for stock levels across warehouses.

Customer 360

A wide dim_customers table combining CRM, billing, and support data into a single customer record.

Financial Consolidation

fct_general_ledger with dim_accounts and dim_cost_centers for P&L and balance sheet rollups.

How Data Modeling Works

In the modern data stack, data modeling is a four-stage process: raw source data lands in staging tables, transformation logic produces intermediate models, final fact and dimension tables are materialized, and a semantic layer (metrics definitions) sits on top for BI tools.

  • STAGING → raw source tables
  • INTERMEDIATE → cleaned & joined
  • MARTS → fct_ + dim_ tables
  • SEMANTIC → metrics & measures
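In a dbt project these four stages usually map onto a folder layout like the following (a common convention, not a requirement; file names are illustrative):

```
models/
├── staging/        -- stg_orders.sql, stg_customers.sql
├── intermediate/   -- int_orders_joined.sql
├── marts/          -- fct_orders.sql, dim_customers.sql
└── semantic/       -- metric & measure definitions (YAML)
```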

fct_orders — grain: one row per order line item

-- models/marts/fct_orders.sql
SELECT
    -- grain: one row per order line item
    o.order_id,
    o.order_item_id,
    o.ordered_at,
    -- foreign keys → dimensions
    o.customer_id,
    o.product_id,
    d.date_day       AS order_date_key,
    -- measures (additive facts)
    o.quantity,
    o.unit_price,
    o.quantity * o.unit_price  AS gross_revenue,
    o.discount_amount,
    o.quantity * o.unit_price - o.discount_amount  AS net_revenue
FROM {{ ref('stg_orders') }} AS o
LEFT JOIN {{ ref('dim_date') }} AS d
    ON d.date_day = o.ordered_at::date

dim_customers — built from an SCD Type 2 snapshot (current records)

-- models/marts/dim_customers.sql
SELECT
    customer_id,
    full_name,
    email,
    country,
    customer_segment,
    first_order_date,
    lifetime_orders,
    lifetime_revenue,
    -- SCD Type 2 metadata
    dbt_valid_from,
    dbt_valid_to,
    dbt_scd_id
FROM {{ ref('customers_snapshot') }}  -- dbt snapshots are referenced with ref()
WHERE dbt_valid_to IS NULL  -- current records only

Data Modeling Approaches Compared

Star Schema vs Snowflake Schema

Star Schema

  • Flat, denormalized dimensions
  • Simple joins: fact → dim (one hop)
  • Faster queries, easier to understand
  • Slightly more storage

Snowflake Schema

  • Normalized dimensions in sub-tables
  • Multi-hop joins: fact → dim → sub-dim
  • Less storage, more complex queries
  • Harder to self-serve
Verdict: Star schema for analytics and BI. Storage is cheap in modern cloud warehouses — avoid snowflake schema unless disk cost is a genuine constraint.

Dimensional Modeling vs Data Vault

Dimensional Modeling

  • Analyst-friendly star schema output
  • Designed around business processes
  • Fast to build and query
  • Harder to extend with new sources

Data Vault

  • Hub/Link/Satellite for full auditability
  • Designed for multi-source integration
  • Load-flexible and historically complete
  • Complex to query — needs mart layer on top
Verdict: Dimensional modeling for most teams. Data vault for regulated industries (finance, healthcare) or large enterprises with 10+ source systems requiring full audit trails.

Normalized Models vs One Big Table (OBT)

Normalized (fct + dim)

  • Separate facts and dimensions
  • No redundancy — single definition
  • Joins required for analysis
  • Best for large warehouses

One Big Table (OBT)

  • Pre-joined wide table
  • Zero joins for analysts
  • Redundant data — harder to update
  • Best for small teams / ad-hoc
Verdict: Start with normalized fct + dim for maintainability. Add OBT shortcuts for high-traffic queries where join cost is measurable.
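One way to get both: materialize the OBT on top of the normalized model rather than instead of it. A sketch, assuming the fct_orders grain above plus hypothetical dim_products columns:

```sql
-- models/marts/obt_orders.sql  (illustrative OBT built on fct + dim)
SELECT
    f.order_id,
    f.order_item_id,
    f.ordered_at,
    f.quantity,
    f.net_revenue,
    c.full_name  AS customer_name,
    c.country    AS customer_country,
    p.product_name,
    p.category   AS product_category
FROM {{ ref('fct_orders') }} AS f
LEFT JOIN {{ ref('dim_customers') }} AS c ON c.customer_id = f.customer_id
LEFT JOIN {{ ref('dim_products') }}  AS p ON p.product_id  = f.product_id
```

Because the OBT is derived from the tested fct and dim models, the single definition of revenue is preserved even though the data is duplicated.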
Feature                  | Star Schema       | Data Vault                | OBT
Analyst friendliness     | ✓ high            | ✗ low (needs mart)        | ✓ highest
Query complexity         | Low (1–2 joins)   | High (many joins)         | None
Historical tracking      | ✓ SCD types       | ✓ native via Satellites   | ✗ snapshot only
Multi-source flexibility | ✗ hard to extend  | ✓ purpose-built           | ✗ very hard
Storage efficiency       | Medium            | Low (very redundant)      | Low
Best tooling             | dbt               | dbt + data vault packages | dbt / raw SQL

Common Mistakes

Mixing grains in a fact table

Putting order-level and order-line-level rows in the same fact table produces silent double-counting. Declare the grain explicitly (in a comment or dbt description) and enforce it in code. One fact table = one grain.
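One way to enforce a declared grain is a singular dbt test that returns rows (and therefore fails) whenever the grain key is duplicated. A sketch, assuming the order-line grain and fct_orders model named earlier:

```sql
-- tests/assert_fct_orders_grain.sql
-- Fails if any order line item appears more than once
SELECT
    order_item_id,
    COUNT(*) AS row_count
FROM {{ ref('fct_orders') }}
GROUP BY order_item_id
HAVING COUNT(*) > 1
```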

Modeling source tables, not business processes

Copying raw tables into a warehouse is not data modeling. Dimensional models are built around business processes (what happened: an order was placed) — not source system tables (what the OLTP schema looked like).

Not testing for duplicates and referential integrity

Every fact table should have a dbt unique + not_null test on its primary key. Every foreign key should have a relationships test against its dimension. Without these, silent data quality failures go undetected.
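In dbt these checks are declared in a schema YAML file. A sketch for the models named above (file name and layout follow common dbt conventions):

```yaml
# models/marts/_marts.yml
models:
  - name: fct_orders
    columns:
      - name: order_item_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```

These tests run on every `dbt test` invocation, so a duplicated grain key or an orphaned foreign key fails the build instead of silently corrupting dashboards.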

Skipping the semantic layer

If the definition of "active customer" or "net revenue" lives inside a BI tool query, every analyst will define it differently. Define metrics once in a semantic layer (dbt metrics, MetricFlow) and let tools query it.
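A metric defined once in MetricFlow YAML might look like this sketch (it assumes a net_revenue measure has already been declared on a semantic model; names are illustrative):

```yaml
# models/semantic/metrics.yml
metrics:
  - name: net_revenue
    label: Net Revenue
    description: Gross revenue minus discounts, at order line grain
    type: simple
    type_params:
      measure: net_revenue
```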

Who Should Learn Data Modeling?

Analytics Engineer

You write dbt models and want a principled framework for how to structure facts, dims, and marts. This is your core discipline.

Data Engineer

You build pipelines that land raw data in the warehouse. Understanding downstream data models helps you stage data in the right shape.

Data Analyst

You query the warehouse daily. Understanding how fct_ and dim_ tables are designed makes your SQL faster and your metrics more accurate.


FAQs

What is data modeling?
Data modeling is the process of defining how data is structured, organized, and related within a database or data warehouse. A data model specifies the tables (or entities), their columns, data types, relationships (joins), and the grain (what one row represents). Good data models make queries fast, results trustworthy, and downstream logic simple. Bad models create duplicates, ambiguous joins, and slow dashboards.
What is dimensional modeling?
Dimensional modeling is a data warehouse design technique invented by Ralph Kimball. It organizes data into fact tables (measurements like revenue, clicks, orders) and dimension tables (context like customers, products, dates). Fact tables are narrow and tall; dimension tables are wide and short. The result is a star schema optimized for analytical queries — fast aggregations, intuitive joins, and business-friendly column names.
What is the difference between a star schema and a snowflake schema?
A star schema has one layer of dimension tables directly joined to the fact table — simple, fast, and easy to query. A snowflake schema normalizes dimensions further into sub-dimensions (e.g., a country table hanging off a city table which hangs off a customer table) — reducing storage but increasing join complexity. Most modern analytics warehouses (BigQuery, Snowflake, Redshift) prefer star schema because storage is cheap and join performance is excellent.
What is data vault modeling?
Data vault is a modeling approach designed for enterprise data warehouses that need to integrate data from many source systems. It uses three entity types: Hubs (unique business keys), Links (relationships between hubs), and Satellites (descriptive attributes with full history). Data vault is highly auditable and load-flexible but complex to query — most teams add a mart layer (star schema) on top for analyst consumption.
What is grain in data modeling?
The grain of a fact table defines what one row represents — the most atomic level of detail captured. For example: one row per order line item (not per order), or one row per daily user-product impression (not per session). Mixing grains in a single fact table is the most common data modeling mistake — it produces incorrect aggregations and subtle double-counting bugs.

What You'll Build with AI-DE

The Commerce Data Warehouse project walks you through building a production-grade dimensional model for an e-commerce business — staging, intermediate models, fct_orders, dim_customers, and a semantic layer — all in dbt.

  • Star schema with fct_orders, fct_sessions, and 5 dimension tables
  • SCD Type 2 dim_customers tracking plan changes over time
  • dbt tests for grain, uniqueness, and referential integrity
  • Semantic layer with MetricFlow for self-serve metrics
  • Incremental materializations for large fact tables