Engineering Insights
Hard-won lessons, system design teardowns, and architecture guides from the frontlines of data engineering.
Why We Migrated from Airflow to Kubernetes-Native Orchestration
After three years running Airflow at scale, we hit the ceiling: resource contention, slow DAG parsing, and a scheduler that became a single point of failure. Here's the full story of how we rebuilt orchestration on Argo Workflows — what we gained, what we lost, and the lessons you can steal.
7 articles
The Reality of Streaming: When to Actually Use Apache Flink
Flink is extraordinarily powerful and extraordinarily complex. Most teams reach for it before they need it — and pay the operational price. Here's a framework for deciding when stream processing is justified, and when a micro-batch approach will serve you just as well.
Implementing Data Contracts in a dbt Monorepo
Data contracts promise to fix the silent breakage problem — upstream schema changes that quietly corrupt downstream reports. But the tooling is still maturing. Here's what actually worked for us: a lightweight contract layer built on dbt meta, JSON Schema, and a pre-merge CI check.
Building a Cost-Efficient RAG Pipeline with Pinecone
RAG pipelines get expensive fast: embedding, vector storage, and LLM inference costs all add up. We ran our internal knowledge base RAG in production for six months. Here's what we optimized to cut costs by 70% without sacrificing retrieval quality.
Stop Building Toy Pipelines: The 2026 Data Engineering Portfolio Guide (with Code)
Hiring managers see hundreds of GitHub repos with a Jupyter notebook and a README promising an "end-to-end pipeline." They pass on all of them. Here's how to build a PySpark + dbt + Airflow portfolio project that demonstrates production-grade thinking — with full code.
Snowflake vs BigQuery in 2026: A Cost Analysis
We ran the same workload — 8 TB scanned daily, mixed ad-hoc and scheduled queries, three BI tools — on both Snowflake and BigQuery for 30 days. The winner depends heavily on your query patterns. Here are the numbers.
How to Design a Modern Data + AI System: Control, Data, and Decision Planes
Most data teams build AI features by bolting an LLM onto their existing pipeline and calling it done. The systems that actually work in production separate concerns into three explicit planes: a Control Plane that orchestrates, a Data Plane that models, and a Decision Plane that decides. Here's the architecture.
Build an AI Tactical Analyst with NFL Data, dbt, and RAG: A Full Data Engineering Pipeline
While everyone else argues about the halftime show, we're building the scouting report. This tutorial walks through a full production-style data + AI pipeline on real NFL play-by-play data: ingestion via nfl_data_py, dbt staging → marts with EPA and CPOE, quality gates, rolling features, and a RAG-powered tactical analyst that answers "go for it or punt?", with all code included.