Learn what production data teams actually do.
31 skills, 7 tracks, one clear path from “what’s a DAG” to “staff platform engineer.” Every skill ends with a deployable project.
How tracks build on each other.
Take tracks in any order, but most learners land a job fastest by following Foundations → Batch → one specialty.
Your first three skills.
SQL Mastery for Data Engineers
Window functions, CTEs, execution plans — SQL you'd pass at a FAANG screen.
Python for Data Engineers
Production Python: typing, async, Pydantic, packaging — not a notebook tour.
dbt & Analytics Engineering
Incremental models, tests, macros, exposures — done the way prod requires.
Foundations
SQL Mastery for Data Engineers
Window functions, CTEs, execution plans — SQL you'd pass at a FAANG screen.
Python for Data Engineers
Production Python: typing, async, Pydantic, packaging — not a notebook tour.
Advanced Data Modeling
Dimensional, Data Vault, OBT — and when to reach for which.
Cloud Data Infra & FinOps
AWS/GCP for data teams — IAM, networking, cost, the parts that matter.
dbt & Analytics Engineering
Incremental models, tests, macros, exposures — done the way prod requires.
Batch pipelines
Apache Spark Deep Dive
Partitioning, shuffles, AQE, Iceberg — Spark for engineers who own the cluster.
Apache Airflow
Production orchestration: DAG patterns, idempotency, KubernetesExecutor.
Apache Iceberg & Lakehouse
Time travel, hidden partitioning, MERGE — the lakehouse format that won.
Warehouse Internals
MPP engines from the inside: query planning, distribution, spills.
Streaming
Real-Time Streaming Architecture
When to actually reach for streaming — and when you absolutely should not.
Kafka Streams
Topics, partitions, retention, consumer groups — Kafka with correct intuition.
Flink & Stream Processing
Stateful streaming, windows, watermarks, exactly-once with idempotent sinks.
Event Design & Contracts
Schemas that evolve safely. Protobuf + registry + CI compat tests.
Data quality
Data Observability & Quality
Contracts, Great Expectations, anomaly detection, lineage, and what's worth alerting on.
Governance & Data Contracts
Producer-side contracts, PII, access control, audit trails that hold up.
AI & vectors
RAG Learning Path
Hybrid retrieval, reranking, eval — production RAG, not a demo.
Vector Databases
pgvector, Qdrant, Lance — when to use each, and how indexing actually works.
LLM Data Pipelines
Batch enrichment with structured outputs, retries, cost budgets, at scale.
LLM Evaluation
Build a labeled eval set, run recall@k + LLM-judge, iterate without vibes.
Feature Stores for ML
Feast, point-in-time correctness, offline/online parity — no training/serving skew.
Agentic Workflows
Multi-step agent pipelines, tool use, trace eval, and when to actually use them.
AI Inference & Serving
Online serving, model routing, autoscaling, and latency budgets in production.
Dataset Engineering
Curate, clean, version, and pin datasets that survive review.
Enterprise GenAI & Security
Multi-tenant LLM, prompt injection, secrets, audit — the hard parts.
MLOps for Data Engineers
CI/CD for models, monitoring, retraining, drift — the parts DEs actually own.
Platform
DataOps: CI/CD & IaC
Terraform, GitHub Actions, Argo, dbt CI — pipeline platform plumbing.
Cost Optimization for DEs
Attribute, alert, and reduce. The boring work that funds the fun work.
API & External Integration
Auth, retries, rate limits, idempotency — the unglamorous integration tier.
Staff Engineering
Product Thinking for DEs
Metrics, semantic layers, user-of-data thinking — bridge to the business.
System Design for DEs
Trade-off framework for the design round — and the architecture review.
Staff Engineer Playbook
Architecture, RFCs, calibration — the non-coding parts of staff DE work.
Not sure which skill to open first? We’ll pick.
Two minutes. Five questions. A sequenced learning plan and your free first lesson in your inbox.