Data Engineering Guides

Production-grade reference guides to the data and AI engineering stack. Each guide is paired with a hands-on skill and capstone project.

LangGraph
What are Agentic Workflows?
The complete guide to agentic workflows — LLM agents, LangGraph supervisor patterns, tool design, Redis checkpointing, and when to go agentic vs traditional DAGs.
14 min read · Updated May 2026
LLM Data
What is an LLM Pipeline? The complete guide for data engineers
The complete guide to LLM data pipelines — data collection, deduplication, tokenization, sequence packing, and how to build production training datasets for language models.
15 min read · Updated May 2026
REST
What is API Data Ingestion?
The complete guide to API data ingestion — cursor pagination, rate-limit handling, OAuth 2.0 token refresh, watermark-based incremental sync, idempotent writes, and webhook vs polling patterns.
14 min read · Updated May 2026
FinOps
What is Data Cost Optimization?
The complete guide to data cost optimization — compute right-sizing, storage tiering, query efficiency, FinOps governance, chargeback models, and the four levers that reduce cloud data platform spend without breaking SLAs.
13 min read · Updated May 2026
Architecture
What is Data Engineering System Design?
The complete guide to data engineering system design — the 6-layer architecture model, lambda vs kappa vs medallion patterns, RFC and ADR structure, and the tradeoff reasoning that separates senior from staff-level engineers.
14 min read · Updated May 2026
5 pillars
What is Data Observability?
The complete guide to data observability — freshness monitoring, volume anomaly detection, lineage tracking, SLOs, and how it compares to data quality and data testing.
12 min read · Updated May 2026
DataOps
What is DataOps?
The complete guide to DataOps — CI/CD for data pipelines, automated quality testing, environment promotion, data contracts, and the 4-pillar model that turns fragile data workflows into production-grade platforms.
14 min read · Updated May 2026
dbt
What is dbt? The complete guide for data engineers
dbt (data build tool) is an open-source SQL transformation framework. Learn what dbt does, how it works, ELT vs ETL, dbt Core vs dbt Cloud, and when to use it.
11 min read · Updated Mar 2026
Roadmap
Data Engineer Roadmap 2026: From beginner to AI Systems engineer
The complete 2026 data engineering roadmap — SQL, Python, dbt, Spark, Kafka, and LLM pipelines — structured as a phase-by-phase learning journey with real projects at every stage.
10 min read · Updated Feb 2026
Skills
The complete data engineer skills checklist (2026)
The complete data engineer skills checklist — SQL, Python, dbt, Spark, Kafka, Iceberg, LLM pipelines, and vector databases. The exact tech stack hiring managers look for.
8 min read · Updated Feb 2026
Airflow
What is Apache Airflow?
The complete guide to Apache Airflow — DAGs, scheduling, executors, TaskFlow API, and how it compares to Prefect and cron.
12 min read · Updated Jan 2026
Spark
What is Apache Spark?
The complete guide to Apache Spark — DataFrames, RDDs, Spark SQL, streaming, and how it compares to pandas and MapReduce.
14 min read · Updated Jan 2026
Kafka
What is Apache Kafka?
The complete guide to Apache Kafka — topics, partitions, consumer groups, retention, and how it compares to RabbitMQ and Pulsar.
13 min read · Updated Jan 2026
RAG
What is RAG? The complete guide to Retrieval-Augmented Generation
The complete guide to Retrieval-Augmented Generation — chunking, embeddings, vector search, reranking, and how it compares to fine-tuning.
13 min read · Updated Jan 2026
MLOps
What is MLOps? The complete guide for data engineers
The complete guide to MLOps — experiment tracking, model registries, CI/CD deployment, drift monitoring, and how it compares to DevOps.
14 min read · Updated Jan 2026
Iceberg
What is Apache Iceberg?
The complete guide to Apache Iceberg — table format internals, hidden partitioning, time travel, schema evolution, and how it compares to Delta Lake and Hudi.
14 min read · Updated Dec 2025
Flink
What is Apache Flink?
The complete guide to Apache Flink — stateful stream processing, event-time windows, watermarks, exactly-once guarantees, and how it compares to Spark and Kafka Streams.
14 min read · Updated Dec 2025
Star schema
What is Data Modeling?
The complete guide to data modeling — dimensional modeling, star schema, data vault, grain, fact and dimension tables, SCDs, and how to implement them in dbt.
13 min read · Updated Dec 2025
Feast
What is a Feature Store?
The complete guide to feature stores — offline/online dual-store architecture, point-in-time correctness, training-serving skew, Feast, and when you need one.
13 min read · Updated Dec 2025
ODCS
What is a Data Contract?
The complete guide to data contracts — ODCS YAML format, schema versioning, CI/CD breaking change enforcement, PII classification, and how contracts differ from dbt tests and schemas.
13 min read · Updated Dec 2025
MinHash
What is Dataset Engineering?
The complete guide to dataset engineering — building training datasets for ML models, MinHash deduplication, quality filtering, dataset versioning with DVC, data cards, and data flywheel architectures.
15 min read · Updated Nov 2025
