Build an AI Tactical Analyst with NFL Data, dbt, and RAG: A Full Data Engineering Pipeline
Want to build this yourself?
This architecture is covered in our hands-on projects. Build it in the AI-DE sandbox.
Explore ProjectsTL;DR
While everyone else is arguing about the halftime show, we're building the scouting report. This tutorial walks through a full production-style data + AI pipeline on real NFL play-by-play data:
All code is embedded below and packaged as a runnable project on AI-DE.
Key takeaways:
Why NFL Data? (And Why It's a Better Teacher Than E-Commerce)
E-commerce tutorials are exhausted. Everyone has built a churn model on the same synthetic dataset.
NFL play-by-play is a better teacher because it has every property of a real production dataset:
If you can build this pipeline, you can build product analytics at any company.
Architecture Overview
NFL AI Tactical Analyst — Full Pipeline Architecture
nfl_data_py → DuckDB
~50k plays, 300+ columns
staging → intermediate → marts
EPA, CPOE, per-QB aggregates
uniqueness · not_null · range checks
quality gate — pipeline halts on failure
rolling 5-game EPA + CPOE
window functions over mart_qb_rolling_form
dbt marts → LLM context → decision
"Go for it. 68% success at this down/distance."
Every step is idempotent — dbt test failure halts the pipeline before the LLM sees dirty data
This is the three-plane model applied to sports data. Every layer has one job. Execution authority stays with the orchestrator. This is the same pattern you'd use to ship a real product analytics platform — we're just using football instead of funnels.
Step 1 — Project Setup and Ingestion
We use nfl_data_py to pull real play-by-play data and DuckDB as a zero-setup local warehouse.
import os
import subprocess
from pathlib import Path
PROJECT_NAME = "nfl_ai_bootcamp"
def create_project(base_path):
"""Scaffold the dbt + DuckDB project."""
base = Path(base_path) / PROJECT_NAME
base.mkdir(parents=True, exist_ok=True)
(base / "requirements.txt").write_text(
"dbt-duckdb>=1.7\n"
"nfl_data_py>=0.3\n"
"pandas\nnumpy\nopenai>=1.0\n"
)
subprocess.run(
["dbt", "init", PROJECT_NAME, "--skip-profile-setup"],
cwd=base, check=True,
)
return baseLoad NFL play-by-play into DuckDB
import duckdb
import nfl_data_py as nfl
# Pull 2024 regular season play-by-play
pbp = nfl.import_pbp_data([2024])
con = duckdb.connect("nfl_analytics.duckdb")
con.execute("CREATE SCHEMA IF NOT EXISTS raw")
con.register("pbp_df", pbp)
con.execute("CREATE OR REPLACE TABLE raw.plays AS SELECT * FROM pbp_df")
print(f"Loaded {con.execute('SELECT COUNT(*) FROM raw.plays').fetchone()[0]:,} plays")That single call gives you ~50,000 plays with 300+ columns — this is the messy, real-world data you'd get from any sports API.
Configure dbt profile
Save this as `~/.dbt/profiles.yml`. Use an absolute path for DuckDB — otherwise dbt and your Python scripts will end up pointing at different database files depending on which directory you run them from.
# ~/.dbt/profiles.yml
nfl_ai_bootcamp:
target: dev
outputs:
dev:
type: duckdb
path: /absolute/path/to/nfl_ai_bootcamp/nfl_analytics.duckdb
schema: analytics
threads: 4Install dbt packages
We use dbt_utils for range checks in the test layer:
# packages.yml
packages:
- package: dbt-labs/dbt_utils
version: [">=1.1.0", "<2.0.0"]dbt depsSkip this step and the quality-gate tests in Step 5 will fail with `'dbt_utils' is undefined`.
Step 2 — The dbt Staging Layer
Raw play-by-play data is the NFL equivalent of commentary text. The staging layer's job is to clean it and expose typed, named columns. Nothing fancy — just discipline.
-- models/staging/stg_plays.sql
with source as (
select * from {{ source('raw', 'plays') }}
),
renamed as (
select
play_id,
game_id,
posteam as possession_team,
defteam as defense_team,
qtr as quarter,
down,
ydstogo as yards_to_go,
yardline_100,
play_type,
passer_player_id as qb_id,
passer_player_name as qb_name,
passing_yards,
pass_attempt,
complete_pass,
epa,
cpoe,
success,
week
from source
where play_type in ('pass', 'run')
and down is not null
)
select * from renamed# models/staging/_sources.yml
version: 2
sources:
- name: raw
schema: raw
tables:
- name: plays
description: "Raw NFL play-by-play from nfl_data_py"Step 3 — Intermediate Layer: Per-QB Aggregation
Intermediate models are where business logic lives. Here we aggregate plays to QB-game level.
-- models/intermediate/int_qb_game_stats.sql
with plays as (
select * from {{ ref('stg_plays') }}
where pass_attempt = 1
and qb_id is not null
),
qb_game as (
select
qb_id,
qb_name,
game_id,
week,
possession_team,
count(*) as total_attempts,
sum(complete_pass) as completions,
sum(passing_yards) as passing_yards,
avg(epa) as avg_epa,
avg(cpoe) as avg_cpoe,
avg(success::int) as success_rate
from plays
group by 1, 2, 3, 4, 5
)
select * from qb_gameStep 4 — Marts Layer: EPA and CPOE (The Metrics That Matter)
This is where we compute the signals that actually separate elite QBs from replacements.
-- models/marts/mart_qb_performance.sql
with game_stats as (
select * from {{ ref('int_qb_game_stats') }}
),
season_agg as (
select
qb_id,
qb_name,
count(distinct game_id) as games_played,
sum(total_attempts) as attempts,
sum(completions) * 1.0 / sum(total_attempts) as completion_pct,
avg(avg_epa) as season_epa_per_play,
avg(avg_cpoe) as season_cpoe,
avg(success_rate) as success_rate
from game_stats
group by 1, 2
having sum(total_attempts) >= 100
)
select
*,
case
when season_epa_per_play >= 0.20 then 'Elite'
when season_epa_per_play >= 0.10 then 'Above Average'
when season_epa_per_play >= 0.00 then 'Average'
else 'Below Replacement'
end as tier
from season_agg
order by season_epa_per_play descEPA per play above 0.20 is elite territory — Mahomes, Allen, Burrow in a good year. CPOE above +3% means the QB completes passes at a rate meaningfully higher than league average given the difficulty of each throw. These are the same derived metrics front offices pay for.
Step 5 — The Defensive Line: dbt Tests
Before any of this reaches the AI, it has to pass quality gates. This is the CI/CD layer.
# models/marts/_marts.yml
version: 2
models:
- name: mart_qb_performance
description: "Season-level QB performance with EPA/CPOE tiers"
columns:
- name: qb_id
tests:
- unique
- not_null
- name: attempts
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
- name: completion_pct
tests:
- dbt_utils.accepted_range:
min_value: 0
max_value: 1dbt deps # one-time, installs dbt_utils
dbt run
dbt testIf a test fails, the pipeline stops. The AI never sees bad data. This is the difference between a demo and a system you can trust on a live broadcast.
Step 6 — Feature Engineering: Rolling Form
One-season averages aren't enough — the AI needs to know if a QB is hot or cold right now. A 5-game rolling EPA is the standard.
-- models/marts/mart_qb_rolling_form.sql
with game_stats as (
select * from {{ ref('int_qb_game_stats') }}
)
select
qb_id,
qb_name,
game_id,
week,
avg_epa,
avg(avg_epa) over (
partition by qb_id
order by week
rows between 4 preceding and current row
) as rolling_5_game_epa,
avg(avg_cpoe) over (
partition by qb_id
order by week
rows between 4 preceding and current row
) as rolling_5_game_cpoe
from game_statsRolling features are what every serious sports model uses. They're also exactly the kind of windowed aggregation that shows up in every real product analytics pipeline — same SQL pattern, different domain.
Step 7 — The Decision Layer: RAG Tactical Analyst
Now the payoff. The AI reads the dbt marts as context and answers tactical questions in natural language.
# ai/tactical_analyst.py
import os
import duckdb
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
con = duckdb.connect("nfl_analytics.duckdb")
def get_qb_context(qb_name: str) -> str:
"""Pull rolling form + season tier from dbt marts."""
rolling = con.execute("""
select week, rolling_5_game_epa, rolling_5_game_cpoe
from analytics.mart_qb_rolling_form
where qb_name = ?
order by week desc
limit 1
""", [qb_name]).fetchone()
season = con.execute("""
select season_epa_per_play, season_cpoe, tier
from analytics.mart_qb_performance
where qb_name = ?
""", [qb_name]).fetchone()
if not rolling or not season:
return f"No data found for {qb_name}."
return (
f"{qb_name} — Tier: {season[2]}. "
f"Season EPA/play: {season[0]:.3f}, CPOE: {season[1]:.2f}. "
f"Last 5-game rolling EPA: {rolling[1]:.3f}, "
f"rolling CPOE: {rolling[2]:.2f}."
)
def tactical_call(qb_name: str, down: int, yards_to_go: int, yardline: int) -> str:
context = get_qb_context(qb_name)
prompt = f"""
You are an NFL tactical analyst with access to real play-by-play data.
Context from dbt marts:
{context}
Situation:
- Down: {down}
- Yards to go: {yards_to_go}
- Yardline (distance to opponent end zone): {yardline}
Should the team go for it or punt/kick?
Justify using EPA and rolling form. Give a probability of success
based on historical patterns at this down/distance.
"""
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0.2,
)
return resp.choices[0].message.content
# 4th and 2, opponent 40, with Mahomes hot
print(tactical_call("P.Mahomes", down=4, yards_to_go=2, yardline=40))Sample output:
> Recommendation: Go for it. Mahomes is in Elite tier with a season EPA/play of 0.24 and a rolling 5-game EPA of 0.31 — he's trending up, not down. On 4th-and-2 inside opponent territory, historical conversion rate is ~68%. EPA math favors going for it by a wide margin versus a 57-yard field goal attempt. Punt only if you're protecting a 2-score lead with under 4 minutes left.
You didn't train a model. You gave the LLM a clean, trustworthy context pulled from a properly layered dbt project. That's RAG done right.
Step 8 — Orchestration: The Full Pipeline Runner
Tie it together with a single entry point. In production this would be an Airflow DAG; for local dev, a Python runner is fine.
# run_pipeline.py
import subprocess
import sys
def run_step(name, cmd):
print(f"\n▶ {name}")
r = subprocess.run(cmd, shell=True)
if r.returncode != 0:
print(f"✖ {name} failed")
sys.exit(1)
print(f"✓ {name} ok")
if __name__ == "__main__":
run_step("0. Install dbt packages", "dbt deps")
run_step("1. Ingest NFL data", "python ingest_nfl.py")
run_step("2. dbt run", "dbt run")
run_step("3. dbt test", "dbt test")
run_step("4. Tactical analyst", "python ai/tactical_analyst.py")Every step is idempotent. Every failure halts the pipeline. This is the bare-minimum production discipline — and it's what separates an engineer from someone who "knows the tools."
Common Mistakes (What NOT to Do)
Why This Matters for Your Career
This is the same architecture used in real product analytics orgs — just dressed up in football pads.
| Pattern in this project | Same pattern at product companies |
|---|---|
| Ingestion (nfl_data_py → DuckDB) | Event stream → data warehouse |
| EPA / CPOE per QB | LTV / session quality / conversion propensity |
| Rolling 5-game window | Rolling 30-day engagement feature |
| dbt contract + not_null test | SLA on feature availability for ML model |
| RAG over mart_qb_performance | RAG over customer feature table |
Ingestion → dbt layers → tests → features → decision layer is the canonical modern data stack. If you can explain this pipeline end-to-end in an interview, you're interviewing at the AI Data Engineer level.
What to Build Next
Natural extensions on top of this foundation:
These are exactly the progressions in the AI Data Engineer path on AI-DE.
Ready to go deeper?
Explore our full curriculum — hands-on skill toolkits built for production data engineering.