Guides
In-depth guides and reference pages for data engineers.
The Top Data Engineer Skills (2026)
The complete data engineer skills checklist — SQL, Python, dbt, Spark, Kafka, Iceberg, LLM pipelines, and vector databases. The exact tech stack hiring managers look for.
Data Engineer Roadmap 2026
The complete 2026 data engineering roadmap — SQL, Python, dbt, Spark, Kafka, and LLM pipelines — structured as a phase-by-phase learning journey with real projects at every stage.
What is dbt?
The complete guide to dbt — what it does, how it works, ELT vs ETL, dbt Core vs dbt Cloud, and when to use it.
What is Apache Airflow?
The complete guide to Apache Airflow — DAGs, scheduling, executors, TaskFlow API, and how it compares to Prefect and cron.
What is Apache Spark?
The complete guide to Apache Spark — DataFrames, RDDs, Spark SQL, streaming, and how it compares to pandas and MapReduce.
What is Apache Kafka?
The complete guide to Apache Kafka — topics, partitions, consumer groups, retention, and how it compares to RabbitMQ and Pulsar.
What is RAG?
The complete guide to Retrieval-Augmented Generation — chunking, embeddings, vector search, reranking, and how it compares to fine-tuning.
What is MLOps?
The complete guide to MLOps — experiment tracking, model registries, CI/CD deployment, drift monitoring, and how it compares to DevOps.
What is Apache Iceberg?
The complete guide to Apache Iceberg — table format internals, hidden partitioning, time travel, schema evolution, and how it compares to Delta Lake and Hudi.
What is Apache Flink?
The complete guide to Apache Flink — stateful stream processing, event-time windows, watermarks, exactly-once guarantees, and how it compares to Spark and Kafka Streams.
What is Data Modeling?
The complete guide to data modeling — dimensional modeling, star schema, data vault, grain, fact and dimension tables, SCDs, and how to implement them in dbt.
What is an LLM Pipeline?
The complete guide to LLM data pipelines — data collection, deduplication, tokenization, sequence packing, and how to build production training datasets for language models.
What is a Feature Store?
The complete guide to feature stores — offline/online dual-store architecture, point-in-time correctness, training-serving skew, Feast, and when you need one.
What is a Data Contract?
The complete guide to data contracts — ODCS YAML format, schema versioning, breaking-change enforcement in CI/CD, PII classification, and how contracts differ from dbt tests and schemas.
What is Data Observability?
The complete guide to data observability — freshness monitoring, volume anomaly detection, lineage tracking, SLOs, and how it compares to data quality and data testing.
What is API Data Ingestion?
The complete guide to API data ingestion — cursor pagination, rate-limit handling with exponential backoff, OAuth 2.0 token refresh, watermark-based incremental sync, idempotent writes, and webhook vs polling patterns.
What is Dataset Engineering?
The complete guide to dataset engineering — building training datasets for ML models, MinHash deduplication, quality filtering, dataset versioning with DVC, data cards, and data flywheel architectures.
What is Data Cost Optimization?
The complete guide to data cost optimization — compute right-sizing, storage tiering, query efficiency, FinOps governance, chargeback models, and the four levers that reduce cloud data platform spend without breaking SLAs.
What is DataOps?
The complete guide to DataOps — CI/CD for data pipelines, automated quality testing, environment promotion, data contracts, and the 4-pillar model that turns fragile data workflows into production-grade platforms.
What is Data Engineering System Design?
The complete guide to data engineering system design — the 6-layer architecture model, lambda vs kappa vs medallion patterns, RFC and ADR structure, and the tradeoff reasoning that separates senior from staff-level engineers.
What are Agentic Workflows?
The complete guide to agentic workflows — LLM agents, LangGraph supervisor patterns, tool design, Redis checkpointing, and when to use agents instead of traditional DAGs.