Guides

In-depth guides and reference pages for data engineers.

Skills · 8 min read

The Top Data Engineer Skills (2026)

The complete data engineer skills checklist — SQL, Python, dbt, Spark, Kafka, Iceberg, LLM pipelines, and vector databases. The exact tech stack hiring managers look for.

Roadmap · 10 min read

Data Engineer Roadmap 2026

The complete 2026 data engineering roadmap — SQL, Python, dbt, Spark, Kafka, and LLM pipelines — structured as a phase-by-phase learning journey with real projects at every stage.

dbt · 10 min read

What is dbt?

The complete guide to dbt — what it does, how it works, ELT vs ETL, dbt Core vs dbt Cloud, and when to use it.

Airflow · 12 min read

What is Apache Airflow?

The complete guide to Apache Airflow — DAGs, scheduling, executors, TaskFlow API, and how it compares to Prefect and cron.

Spark · 14 min read

What is Apache Spark?

The complete guide to Apache Spark — DataFrames, RDDs, Spark SQL, streaming, and how it compares to pandas and MapReduce.

Kafka · 13 min read

What is Apache Kafka?

The complete guide to Apache Kafka — topics, partitions, consumer groups, retention, and how it compares to RabbitMQ and Pulsar.

RAG · 13 min read

What is RAG?

The complete guide to Retrieval-Augmented Generation — chunking, embeddings, vector search, reranking, and how it compares to fine-tuning.

MLOps · 14 min read

What is MLOps?

The complete guide to MLOps — experiment tracking, model registries, CI/CD deployment, drift monitoring, and how it compares to DevOps.

Iceberg · 14 min read

What is Apache Iceberg?

The complete guide to Apache Iceberg — table format internals, hidden partitioning, time travel, schema evolution, and how it compares to Delta Lake and Hudi.

Flink · 14 min read

What is Apache Flink?

The complete guide to Apache Flink — stateful stream processing, event-time windows, watermarks, exactly-once guarantees, and how it compares to Spark and Kafka Streams.

Data Modeling · 13 min read

What is Data Modeling?

The complete guide to data modeling — dimensional modeling, star schema, data vault, grain, fact and dimension tables, SCDs, and how to implement them in dbt.

LLM Pipeline · 15 min read

What is an LLM Pipeline?

The complete guide to LLM data pipelines — data collection, deduplication, tokenization, sequence packing, and how to build production training datasets for language models.

Feature Stores · 13 min read

What is a Feature Store?

The complete guide to feature stores — offline/online dual-store architecture, point-in-time correctness, training-serving skew, Feast, and when you need one.

Data Contracts · 13 min read

What is a Data Contract?

The complete guide to data contracts — ODCS YAML format, schema versioning, CI/CD breaking change enforcement, PII classification, and how contracts differ from dbt tests and schemas.

Data Observability · 12 min read

What is Data Observability?

The complete guide to data observability — freshness monitoring, volume anomaly detection, lineage tracking, SLOs, and how it compares to data quality and data testing.

API Ingestion · 14 min read

What is API Data Ingestion?

The complete guide to API data ingestion — cursor pagination, rate-limit handling with exponential backoff, OAuth 2.0 token refresh, watermark-based incremental sync, idempotent writes, and webhook vs polling patterns.

Dataset Engineering · 15 min read

What is Dataset Engineering?

The complete guide to dataset engineering — building training datasets for ML models, MinHash deduplication, quality filtering, dataset versioning with DVC, data cards, and data flywheel architectures.

Cost Optimization · 13 min read

What is Data Cost Optimization?

The complete guide to data cost optimization — compute right-sizing, storage tiering, query efficiency, FinOps governance, chargeback models, and the four levers that reduce cloud data platform spend without breaking SLAs.

DataOps · 14 min read

What is DataOps?

The complete guide to DataOps — CI/CD for data pipelines, automated quality testing, environment promotion, data contracts, and the 4-pillar model that turns fragile data workflows into production-grade platforms.

System Design · 14 min read

What is Data Engineering System Design?

The complete guide to data engineering system design — the 6-layer architecture model, lambda vs kappa vs medallion patterns, RFC and ADR structure, and the tradeoff reasoning that separates senior from staff-level engineers.

Agentic · 14 min read

What are Agentic Workflows?

The complete guide to agentic workflows — LLM agents, LangGraph supervisor patterns, tool design, Redis checkpointing, and when to go agentic vs traditional DAGs.
