Engineering Insights
AI/MLOps

Stop Building Toy Pipelines: The 2026 Data Engineering Portfolio Guide (with Code)

Larry · Mar 21, 2026 · 14 min read


Most data engineering portfolios look exactly the same.

A GitHub repo with a Jupyter notebook, a README.md that promises an "end-to-end data pipeline," and a DAG that reads a static CSV from a local folder and prints the row count.

Hiring managers see hundreds of these. They pass on all of them.

Data engineering demand has increased sharply over the last 18 months, but the bar has risen. Every company that rushed to ship an AI model in 2024 is now sitting on a pile of unreliable data infrastructure. They need engineers who can build and maintain production-grade pipelines, not engineers who can just name every tool in the modern data stack.

The gap hiring managers care about is technical credibility. Here is how you close it, complete with the exact code and architecture you need to build a portfolio project that actually moves the needle in 2026.

The Problem: The Unstructured AI Data Mess

Every company wants to use AI. But they're stuck on one massive problem: Their data is a complete mess.

Think about customer feedback: Emails. Chat logs. Support tickets. It is unstructured, it is noisy, and it is impossible for an LLM to use directly without costing a fortune in API tokens and hallucinating wildly.

If you want a portfolio project that guarantees an interview, build the pipeline that solves this problem. We are going to build a production-style architecture using PySpark, dbt, and Apache Airflow to transform 5,000,000 rows of raw, chaotic customer feedback into structured, AI-ready data.

The Architecture Blueprint

We are going to design this using a standard enterprise architecture across four distinct layers:

  • Raw Layer: Unprocessed omnichannel data straight from our source systems.
  • Intermediate Layer (PySpark via dbt): The heavy lifting. Cleaning text, removing HTML, tokenizing, and pre-computing signals across millions of rows without memory crashes.
  • Semantic Layer (SQL via dbt): Structured outputs perfectly prepped for analytics and AI embeddings.
  • Orchestration Layer (Airflow): The babysitter. Automating the pipeline with strict retries, SLA monitoring, and Slack alerting for when things inevitably break.
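Concretely, these four layers map onto a small repository. A plausible project layout (illustrative; only the file names that appear later in this guide are fixed):

```plaintext
.
├── dags/
│   └── omnichannel_ai_pipeline.py      # Orchestration Layer (Airflow)
├── data/
│   └── raw_omnichannel_events.csv      # Raw Layer landing zone
├── models/
│   ├── intermediate/
│   │   └── int_events_tokenized.py     # Intermediate Layer (PySpark via dbt)
│   └── semantic/
│       └── mrt_customer_insights.sql   # Semantic Layer (SQL via dbt)
└── dbt_project.yml
```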
Step 1: Ingest Real-World Chaos, Not Clean CSVs

A credible portfolio handles real-world failure modes. We will simulate a data lake landing zone that receives omnichannel data.

Look at this data. The support_chat payload is a nested JSON string. The email_ticket contains HTML tags. The product_review has emojis. This is what real data engineering looks like.

plaintext
# data/raw_omnichannel_events.csv
event_id,channel,raw_payload,created_at
evt_001,support_chat,"{""agent"": ""Hello"", ""user"": ""App crashed on iOS 17 after update""}",2026-03-20T10:00:00Z
evt_002,product_review,"⭐⭐⭐⭐⭐ The new dashboard is incredibly fast!",2026-03-20T10:05:00Z
evt_003,email_ticket,"<p>URGENT: Billing failed for invoice #9921. Refund?</p>",2026-03-20T10:15:00Z
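To make this landing zone reproducible, you can generate the file with a short standard-library script. A minimal sketch (the rows mirror the sample above; the path and helper name are my own):

```python
import csv

# Sample events mirroring the raw landing zone: nested JSON, emojis, HTML.
ROWS = [
    ("evt_001", "support_chat",
     '{"agent": "Hello", "user": "App crashed on iOS 17 after update"}',
     "2026-03-20T10:00:00Z"),
    ("evt_002", "product_review",
     "⭐⭐⭐⭐⭐ The new dashboard is incredibly fast!",
     "2026-03-20T10:05:00Z"),
    ("evt_003", "email_ticket",
     "<p>URGENT: Billing failed for invoice #9921. Refund?</p>",
     "2026-03-20T10:15:00Z"),
]

def write_landing_zone(path: str) -> None:
    """Write the raw omnichannel CSV; csv.writer handles the quote-escaping."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
        writer.writerow(["event_id", "channel", "raw_payload", "created_at"])
        writer.writerows(ROWS)
```

Calling `write_landing_zone("data/raw_omnichannel_events.csv")` produces exactly the file shown above, doubled quotes and all.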

Step 2: The PySpark Heavy Lifting (Intermediate Layer)

To process 5,000,000 rows daily, you cannot rewrite the entire table on every run (a full refresh). You will blow through your company's cloud compute budget and crash the cluster.

In this dbt Python model, we demonstrate senior-level thinking: incremental materialization (only processing new data) and Spark repartitioning (distributing the workload to prevent out-of-memory errors).

Furthermore, we pre-compute sentiment and issue tags. We do not need to call an expensive LLM to tell us that the word "refund" signals a billing issue. Pre-computing saves massive amounts of money.

python
# models/intermediate/int_events_tokenized.py
import pyspark.sql.functions as F

def model(dbt, session):
    # 1. PRODUCTION CONFIGURATION (FOR SCALE)
    dbt.config(
        materialized="incremental",
        unique_key="event_id",
        partition_by={"field": "event_date", "data_type": "date"}
    )

    df = dbt.ref("raw_omnichannel_events")

    # 2. INCREMENTAL PROCESSING LOGIC: only pull rows newer than what we already have
    if dbt.is_incremental:
        max_date_query = f"SELECT MAX(created_at) FROM {dbt.this}"
        max_date = session.sql(max_date_query).collect()[0][0]
        if max_date is not None:
            df = df.filter(F.col("created_at") > F.lit(max_date))

    # 3. COMPUTE OPTIMIZATION (prevent OOM on 5M rows)
    df = df.repartition(16)

    # 4. OMNICHANNEL PARSING (THE CHAOS HANDLER)
    extracted_df = df.withColumn(
        "extracted_text",
        F.when(
            F.col("channel") == "support_chat",
            F.get_json_object(F.col("raw_payload"), "$.user")
        )
        .when(
            F.col("channel") == "email_ticket",
            F.regexp_replace(F.col("raw_payload"), "<[^>]*>", "")  # Strip HTML tags
        )
        .otherwise(F.col("raw_payload"))
    )

    # Clean and normalize (raw strings so \w and \s reach the regex engine intact)
    cleaned_df = extracted_df.withColumn(
        "clean_text",
        F.lower(F.regexp_replace("extracted_text", r"[^\w\s]", ""))
    ).filter(F.col("clean_text").isNotNull() & (F.col("clean_text") != ""))

    # 5. TOKENIZATION & TAGGING (Save LLM Costs)
    tokenized_df = cleaned_df.withColumn(
        "tokens", F.split(F.col("clean_text"), r"\s+")
    ).withColumn("token_count", F.size(F.col("tokens")))

    final_df = tokenized_df.withColumn(
        "issue_type",
        F.when(F.col("clean_text").rlike("crash|error|bug|failed"), "technical_issue")
         .when(F.col("clean_text").rlike("refund|billing|invoice"), "billing_issue")
         .otherwise("general")
    ).withColumn("event_date", F.to_date(F.col("created_at")))

    return final_df.select(
        "event_id", "channel", "clean_text", "tokens",
        "token_count", "issue_type", "created_at", "event_date"
    )
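These Spark expressions are awkward to unit-test directly, so it is worth mirroring the parsing rules in plain Python before trusting them on 5M rows. A sketch (the helper names are mine; the regexes and tag precedence match the model, including the quirk that "failed" matches the technical branch before "billing" is checked):

```python
import json
import re

def extract_text(channel: str, raw_payload: str) -> str:
    """Mirror the per-channel extraction in the Spark model."""
    if channel == "support_chat":
        return json.loads(raw_payload).get("user", "")
    if channel == "email_ticket":
        return re.sub(r"<[^>]*>", "", raw_payload)  # strip HTML tags
    return raw_payload

def clean_and_tag(text: str) -> dict:
    """Lowercase, strip punctuation/emoji, tokenize, and tag the issue type."""
    clean = re.sub(r"[^\w\s]", "", text.lower()).strip()
    tokens = [t for t in re.split(r"\s+", clean) if t]
    if re.search(r"crash|error|bug|failed", clean):
        issue = "technical_issue"
    elif re.search(r"refund|billing|invoice", clean):
        issue = "billing_issue"
    else:
        issue = "general"
    return {"clean_text": clean, "token_count": len(tokens), "issue_type": issue}
```

Running the three sample events through these helpers is a five-line pytest file, and it will catch regex regressions long before a cluster run does.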

Step 3: The Semantic Layer

Now that the data is clean, tokenized, and tagged, we expose it for business intelligence and AI embedding pipelines using standard SQL.

sql
-- models/semantic/mrt_customer_insights.sql
{{ config(materialized='table') }}

SELECT
    event_id AS customer_event_id,
    channel,
    issue_type,
    token_count,
    clean_text AS ai_ready_context,
    created_at AS event_timestamp  -- avoid aliasing to the reserved word "timestamp"
FROM {{ ref('int_events_tokenized') }}
WHERE token_count > 2 -- Filter out empty noise
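Downstream, `ai_ready_context` is what gets embedded, and long records should be chunked before they hit an embedding API. A minimal token-window chunker you could run over this column (the function name is mine; the 128-token window and 16-token overlap are illustrative defaults, not recommendations):

```python
from typing import List

def chunk_tokens(tokens: List[str], max_tokens: int = 128, overlap: int = 16) -> List[str]:
    """Split a token list into overlapping windows sized for an embedding API."""
    if max_tokens <= overlap:
        raise ValueError("max_tokens must exceed overlap")
    chunks = []
    step = max_tokens - overlap  # slide forward, keeping `overlap` tokens of context
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(" ".join(window))
        if start + max_tokens >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

The overlap preserves context across chunk boundaries, which matters for retrieval quality; the `tokens` column from the intermediate model feeds straight into it.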

Step 4: The Orchestrator (Apache Airflow)

This is where most candidates fail. Every tutorial DAG runs perfectly. Production DAGs don't.

When you are processing 5,000,000 rows, transient network errors happen. Spark clusters temporarily run out of memory. APIs time out. If your portfolio just shows `BashOperator('dbt run')`, you are signaling that you have never operated a pipeline in the real world.

A senior engineer builds a DAG that assumes failure is inevitable. We implement retries, a retry_delay, explicit logging, and an on_failure_callback to alert the team when SLAs are breached.

python
# dags/omnichannel_ai_pipeline.py
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta
import logging

# 1. ALERTING: The Failure Callback
def slack_alert_on_failure(context):
    """Triggers a Slack webhook if the pipeline fails after all retries."""
    task_instance = context.get('task_instance')
    dag_id = task_instance.dag_id
    task_id = task_instance.task_id
    log_url = task_instance.log_url

    error_message = f"🚨 *DAG Failed!* 🚨\n*DAG:* {dag_id}\n*Task:* {task_id}\n*Logs:* <{log_url}|View Logs>"
    logging.error(f"Executing Slack Alert: {error_message}")
    # In production: requests.post(SLACK_WEBHOOK_URL, json={'text': error_message})

# 2. PRODUCTION DEFAULTS
default_args = {
    'owner': 'data_engineering_team',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 3,  # Never fail on the first transient network blip
    'retry_delay': timedelta(minutes=5),  # Wait 5 mins for the Spark cluster to recover
    'on_failure_callback': slack_alert_on_failure,  # Alert ONLY after 3 retries fail
}

with DAG(
    'omnichannel_ai_preparation_pipeline',
    default_args=default_args,
    description='Ingests, cleans, and prepares 5M+ daily customer records for LLM embeddings.',
    schedule='0 3 * * *',  # Run daily at 3:00 AM ("schedule_interval" and days_ago are deprecated)
    start_date=datetime(2026, 3, 1),
    catchup=False,
    tags=['production', 'ai_platform', 'pyspark'],
) as dag:

    # Task 1: Run the PySpark heavy lifting (Intermediate Layer)
    run_spark_transformations = BashOperator(
        task_id='run_pyspark_cleaning_layer',
        bash_command='dbt run --select models/intermediate/',
        # Specific override: PySpark jobs are heavy, give them a longer timeout
        execution_timeout=timedelta(hours=2)
    )

    # Task 2: Build the Semantic models (SQL)
    run_semantic_layer = BashOperator(
        task_id='run_sql_semantic_layer',
        bash_command='dbt run --select models/semantic/',
    )

    # Task 3: Data Quality Tests
    run_data_tests = BashOperator(
        task_id='run_dbt_data_tests',
        bash_command='dbt test',
    )

    # Define Dependencies
    run_spark_transformations >> run_semantic_layer >> run_data_tests
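The callback above only logs the alert; wiring it to a real webhook takes a few more lines. A hedged sketch using only the standard library (`SLACK_WEBHOOK_URL` is a placeholder environment variable, and the helper names are mine; the message format mirrors the callback):

```python
import json
import logging
import os
import urllib.request

def build_failure_message(dag_id: str, task_id: str, log_url: str) -> str:
    """Format the Slack mrkdwn body for a failed DAG run."""
    return (
        f"🚨 *DAG Failed!* 🚨\n"
        f"*DAG:* {dag_id}\n*Task:* {task_id}\n*Logs:* <{log_url}|View Logs>"
    )

def post_to_slack(message: str, timeout: int = 10) -> None:
    """POST the alert to an incoming webhook; alerting must never crash the callback."""
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")  # placeholder env var
    if not webhook_url:
        logging.warning("SLACK_WEBHOOK_URL not set; skipping alert")
        return
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=timeout)
    except OSError as exc:
        # Swallow network errors: a failed alert should not fail the task twice.
        logging.error("Slack alert failed: %s", exc)
```

The swallow-everything error handling is deliberate: an exception raised inside an Airflow failure callback only adds noise to an already-failing run.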

The Final Workflow

Instead of sending raw text directly to an LLM, we send clean, structured, chunked input. This drastically reduces token cost, latency, and noise.

The workflow becomes: Airflow Schedule → PySpark (Clean/Chunk) → dbt (Structure) → AI Model. Not: Raw data → AI.

If you are going into an interview, this is what you tell them. Every hiring manager looking at your work is asking one implicit question: Can this person build something I'd trust running in production? When you show them an incremental PySpark job wrapped in an Airflow DAG with explicit failure callbacks and retry logic, you instantly answer that question.

Ready to go deeper?

Explore our full curriculum: hands-on skill toolkits built for production data engineering.
