EXPERT-tier retrieval-quality RAG: 4-strategy chunking A/B (62/78/85%), hybrid BM25 + dense + RRF, cross-encoder reranker, RAGAS 4-metric canary, LLM gateway with fallback. 5 ADRs + cost-model CSV bundled.

FastAPI · OpenAI · Pinecone · Qdrant · Redis · Prometheus

AI & VECTORS · ~16h · ●●●

What hiring managers ask

The portfolio projects that actually get interviews in 2026.

Every data engineer with a GitHub has “a project.” Most of them are toy ETL pipelines that read a CSV from S3 and write a Parquet file to S3. They don’t survive a senior interview — and they don’t survive an AI-era one. The 30+ projects below are built around what hiring managers at top-tier data teams are actually testing in 2026.

What makes a project interview-ready

Three things separate a portfolio project that earns an on-site from one that gets filtered. One: the project ships a deployable artifact — a Docker Compose stack, a Kubernetes manifest, a Terraform module — not just a Jupyter notebook. Two: the project has measurable outcomes you can defend in an interview (“the hybrid retriever hit recall@10 = 0.92,” “the Spark job dropped from 47min to 5min,” “the Snowflake bill went from $300K to $120K”). Three: the project has an Architecture Decision Record (ADR) explaining the tradeoff — and ideally at least one Deprecated ADR showing a reversal, which is the single strongest signal of real-world experience.
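A figure like “recall@10 = 0.92” is only defensible if you can say exactly how it was computed. As a minimal sketch, assuming each evaluation query has a ranked list of retrieved document IDs and a hand-labelled set of relevant IDs (the function names and sample data below are illustrative, not taken from any project's starter kit):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores)


# Hypothetical evaluation set: two queries with labelled relevant documents.
sample_runs = [
    (["a", "b", "c"], {"a", "c"}),   # both relevant docs retrieved
    (["x", "y", "z"], {"x", "q"}),   # one of two relevant docs retrieved
]
sample_score = mean_recall_at_k(sample_runs, k=3)
```

Reporting the metric alongside the size of the labelled set and the value of k makes the number much easier to defend under questioning.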

Which project should you build first?

If you’re below mid-level, start with the free E-commerce Data Warehouse: Kimball dimensional modeling, dbt staging → intermediate → mart layers, incremental contracts. It’s the project every interviewer recognises, and it’s the cleanest foundation for the others. If you’re an experienced DE moving into AI, start with Enterprise RAG: hybrid BM25 + dense retrieval fused with RRF, a cross-encoder reranker, and a 4-strategy chunking A/B benchmark (62% / 78% / 85% accuracy across strategies). If you’re aiming at staff level, build Flink Fraud Detection: keyed state on RocksDB, exactly-once Kafka via two-phase commit, sub-100ms decision windows. Each of these is the kind of system an interviewer can probe for an hour with “what would you do if X failed?”
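The “BM25 + dense + RRF” step in Enterprise RAG refers to reciprocal rank fusion, which merges ranked lists using only rank positions, so BM25 and dense-vector scores never need to be put on a common scale. A minimal sketch, assuming each retriever returns an ordered list of document IDs; the constant k = 60 is the commonly used default rather than a value confirmed for this project, and the document IDs are made up:

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists: each doc scores the sum of 1 / (k + rank) over lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical outputs from a BM25 retriever and a dense retriever.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that rank well in both lists (here `d1` and `d3`) rise to the top, and a cross-encoder reranker would then rescore only the head of `fused_ids`.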

What ships with every project

Each project here is more than a tutorial. It ships with a starter kit (a Docker / Postgres / Redis / Spark / dbt stack that actually boots), a runnable cost-model CSV with cited provider list prices, committed ADRs that document the tradeoffs considered, seed data, and a working CI pipeline. Expert projects include 5 ADRs (one Deprecated) and the cost model bundled in the starter zip. That’s the kind of repo a hiring manager can open, read the ADRs in, and decide to invite you on-site without watching a screen-share.

The 7 project tracks

The 30+ projects span 7 tracks mapped to the curriculum: Foundations (warehouse + metrics layer), Batch (ingestion + orchestration), Streaming (Kafka + Flink), Lakehouse (Iceberg + multi-engine), Platform (CI/CD + governance + cost), AI Systems (RAG + agentic + inference serving), and Leadership (the staff-engineer playbook). Filter the catalog below by track, tier (Free / Professional / Expert), and difficulty to find yours.

Each project below shows its track, tier, estimated time-to-ship, and the specific architectures it teaches. Click in to see the full ADR list and what’s in the starter kit before you start.

Foundations

SQL, Python, data modeling, dbt, cloud basics. The ground every other track stands on.

2 projects

P03 · Foundations

free

Commerce data warehouse

Kimball star schema with 22 dbt models — atomic + event facts, SCD2 via snapshots, incremental processing, and a GitHub Actions Slim CI gate.

dbt · Postgres · GitHub Actions

~16h

P19 · Foundations

free

Ecommerce analytics modeling layer

Free 17-model dbt analytics modeling layer for ShopCo — star schema with documented grain, incremental + SCD2, RFM-scored LTV, and a dbt Cloud + Slim CI production spine.

dbt · Snowflake · GitHub Actions · dbt Cloud · SQL

~10h

Batch pipelines

Airflow, Spark, Iceberg: the production workhorse of every modern data team.

4 projects

Real-world production projects. Not toy tutorials.

The 3 projects that get you hired.

Flink fraud detection

Iceberg Lakehouse Foundations

Enterprise RAG — retrieval-quality build

Foundations

Commerce data warehouse

Ecommerce analytics modeling layer

Batch pipelines

Iceberg Lakehouse Foundations

ShopStream Spark Batch Pipeline

Airflow + dbt: production pipeline foundations

IceLake Commerce — end-to-end Iceberg tour

Streaming

Flink fraud detection

Real-time fraud detection on Kafka Streams

Schema evolution & data contracts

Real-time fraud feature store

Data quality

Data observability stack

Data governance & contracts

AI & vectors

Enterprise RAG — retrieval-quality build

PredictFlow — production MLOps platform with Feast + BentoML

LLM evaluation framework — multi-judge cascade + recall@k gate

AI cost optimization (CostGuard)

Agentic data pipeline — LangGraph supervisor + HITL + ADRs

AI retrieval platform — pgvector + hybrid + RRF + cross-encoder

AI serving platform — vLLM + Ray Serve under SLA

LLM training-data pipeline — crawl + dedup + RAG + LLMOps

Enterprise AI platform — multi-tenant governed RAG

Platform

CI/CD data platform

DataGuard reliability

Cloud cost optimization

Multi-source ingestion service

Staff Engineering

Uber Event Platform: Staff Design Portfolio

Full-stack AI platform — full RAG system + production hardening

Multi-cloud platform foundation

Experimentation platform on dbt + scipy

Staff+ leadership playbook
