Engineering Insights
AI/MLOps

Stop Building Toy Pipelines: The 2026 Data Engineering Portfolio Guide (with Code)

Larry · Mar 21, 2026 · 14 min read


Most data engineering portfolios look exactly the same.

A GitHub repo with a Jupyter notebook, a README.md that promises an "end-to-end data pipeline," and a DAG that reads a static CSV from a local folder and prints the row count.

Hiring managers see hundreds of these. They pass on all of them.

Data engineering demand has increased sharply over the last 18 months, but the bar has risen. Every company that rushed to ship an AI model in 2024 is now sitting on a pile of unreliable data infrastructure. They need engineers who can build and maintain production-grade pipelines, not engineers who can just name every tool in the modern data stack.

The gap hiring managers care about is technical credibility. Here is how you close it, complete with the exact code and architecture you need to build a portfolio project that actually moves the needle in 2026.

The Problem: The Unstructured AI Data Mess

Every company wants to use AI. But they're stuck on one massive problem: Their data is a complete mess.

Think about customer feedback: Emails. Chat logs. Support tickets. It is unstructured, it is noisy, and it is impossible for an LLM to use directly without costing a fortune in API tokens and hallucinating wildly.

If you want a portfolio project that guarantees an interview, build the pipeline that solves this problem. We are going to build a production-style architecture using PySpark, dbt, and Apache Airflow to transform 5,000,000 rows of raw, chaotic customer feedback into structured, AI-ready data.

The Architecture Blueprint

We are going to design this using a standard enterprise architecture across four distinct layers:

  • Raw Layer: Unprocessed omnichannel data straight from our source systems.
  • Intermediate Layer (PySpark via dbt): The heavy lifting. Cleaning text, removing HTML, tokenizing, and pre-computing signals across millions of rows without memory crashes.
  • Semantic Layer (SQL via dbt): Structured outputs perfectly prepped for analytics and AI embeddings.
  • Orchestration Layer (Airflow): The babysitter. Automating the pipeline with strict retries, SLA monitoring, and Slack alerting for when things inevitably break.
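Concretely, these four layers map onto a small repository. A plausible project layout (illustrative; only the file names that appear later in this guide are fixed):

```plaintext
.
├── dags/
│   └── omnichannel_ai_pipeline.py      # Orchestration Layer (Airflow)
├── data/
│   └── raw_omnichannel_events.csv      # Raw Layer landing zone
├── models/
│   ├── intermediate/
│   │   └── int_events_tokenized.py     # Intermediate Layer (PySpark via dbt)
│   └── semantic/
│       └── mrt_customer_insights.sql   # Semantic Layer (SQL via dbt)
└── dbt_project.yml
```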
Step 1: Ingest Real-World Chaos, Not Clean CSVs

A credible portfolio handles real-world failure modes. We will simulate a data lake landing zone that receives omnichannel data.

Look at this data. The support_chat payload is a nested JSON string. The email_ticket contains HTML tags. The product_review has emojis. This is what real data engineering looks like.

plaintext
# data/raw_omnichannel_events.csv
event_id,channel,raw_payload,created_at
evt_001,support_chat,"{""agent"": ""Hello"", ""user"": ""App crashed on iOS 17 after update""}",2026-03-20T10:00:00Z
evt_002,product_review,"⭐⭐⭐⭐⭐ The new dashboard is incredibly fast!",2026-03-20T10:05:00Z
evt_003,email_ticket,"<p>URGENT: Billing failed for invoice #9921. Refund?</p>",2026-03-20T10:15:00Z
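To make this landing zone reproducible, you can generate the file with a short standard-library script. A minimal sketch (the rows mirror the sample above; the path and helper name are my own):

```python
import csv

# Sample events mirroring the raw landing zone: nested JSON, emojis, HTML.
ROWS = [
    ("evt_001", "support_chat",
     '{"agent": "Hello", "user": "App crashed on iOS 17 after update"}',
     "2026-03-20T10:00:00Z"),
    ("evt_002", "product_review",
     "⭐⭐⭐⭐⭐ The new dashboard is incredibly fast!",
     "2026-03-20T10:05:00Z"),
    ("evt_003", "email_ticket",
     "<p>URGENT: Billing failed for invoice #9921. Refund?</p>",
     "2026-03-20T10:15:00Z"),
]

def write_landing_zone(path: str) -> None:
    """Write the raw omnichannel CSV; csv.writer handles the quote-escaping."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
        writer.writerow(["event_id", "channel", "raw_payload", "created_at"])
        writer.writerows(ROWS)
```

Calling `write_landing_zone("data/raw_omnichannel_events.csv")` produces exactly the file shown above, doubled quotes and all.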

Step 2: The PySpark Heavy Lifting (Intermediate Layer)

To process 5,000,000 rows daily, you cannot rewrite the entire table on every run (a full refresh). You will blow through your company's cloud compute budget and crash the cluster.

In this dbt Python model, we demonstrate senior-level thinking: incremental materialization (only processing new data) and Spark repartitioning (distributing the workload to prevent out-of-memory errors).

Furthermore, we pre-compute sentiment and issue tags. We do not need to call an expensive LLM to tell us that the word "refund" signals a billing issue. Pre-computing saves massive amounts of money.

python
# models/intermediate/int_events_tokenized.py
import pyspark.sql.functions as F

def model(dbt, session):
    # 1. PRODUCTION CONFIGURATION (FOR SCALE)
    dbt.config(
        materialized="incremental",
        unique_key="event_id",
        partition_by={"field": "event_date", "data_type": "date"}
    )

    df = dbt.ref("raw_omnichannel_events")

    # 2. INCREMENTAL PROCESSING LOGIC: only pull rows newer than what we already have
    if dbt.is_incremental:
        max_date_query = f"SELECT MAX(created_at) FROM {dbt.this}"
        max_date = session.sql(max_date_query).collect()[0][0]
        if max_date is not None:
            df = df.filter(F.col("created_at") > F.lit(max_date))

    # 3. COMPUTE OPTIMIZATION (prevent OOM on 5M rows)
    df = df.repartition(16)

    # 4. OMNICHANNEL PARSING (THE CHAOS HANDLER)
    extracted_df = df.withColumn(
        "extracted_text",
        F.when(
            F.col("channel") == "support_chat",
            F.get_json_object(F.col("raw_payload"), "$.user")
        )
        .when(
            F.col("channel") == "email_ticket",
            F.regexp_replace(F.col("raw_payload"), "<[^>]*>", "")  # Strip HTML tags
        )
        .otherwise(F.col("raw_payload"))
    )

    # Clean and normalize (raw strings so \w and \s reach the regex engine intact)
    cleaned_df = extracted_df.withColumn(
        "clean_text",
        F.lower(F.regexp_replace("extracted_text", r"[^\w\s]", ""))
    ).filter(F.col("clean_text").isNotNull() & (F.col("clean_text") != ""))

    # 5. TOKENIZATION & TAGGING (Save LLM Costs)
    tokenized_df = cleaned_df.withColumn(
        "tokens", F.split(F.col("clean_text"), r"\s+")
    ).withColumn("token_count", F.size(F.col("tokens")))

    final_df = tokenized_df.withColumn(
        "issue_type",
        F.when(F.col("clean_text").rlike("crash|error|bug|failed"), "technical_issue")
         .when(F.col("clean_text").rlike("refund|billing|invoice"), "billing_issue")
         .otherwise("general")
    ).withColumn("event_date", F.to_date(F.col("created_at")))

    return final_df.select(
        "event_id", "channel", "clean_text", "tokens",
        "token_count", "issue_type", "created_at", "event_date"
    )
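These Spark expressions are awkward to unit-test directly, so it is worth mirroring the parsing rules in plain Python before trusting them on 5M rows. A sketch (the helper names are mine; the regexes and tag precedence match the model, including the quirk that "failed" matches the technical branch before "billing" is checked):

```python
import json
import re

def extract_text(channel: str, raw_payload: str) -> str:
    """Mirror the per-channel extraction in the Spark model."""
    if channel == "support_chat":
        return json.loads(raw_payload).get("user", "")
    if channel == "email_ticket":
        return re.sub(r"<[^>]*>", "", raw_payload)  # strip HTML tags
    return raw_payload

def clean_and_tag(text: str) -> dict:
    """Lowercase, strip punctuation/emoji, tokenize, and tag the issue type."""
    clean = re.sub(r"[^\w\s]", "", text.lower()).strip()
    tokens = [t for t in re.split(r"\s+", clean) if t]
    if re.search(r"crash|error|bug|failed", clean):
        issue = "technical_issue"
    elif re.search(r"refund|billing|invoice", clean):
        issue = "billing_issue"
    else:
        issue = "general"
    return {"clean_text": clean, "token_count": len(tokens), "issue_type": issue}
```

Running the three sample events through these helpers is a five-line pytest file, and it will catch regex regressions long before a cluster run does.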

Step 3: The Semantic Layer

Now that the data is clean, tokenized, and tagged, we expose it for business intelligence and AI embedding pipelines using standard SQL.

sql
-- models/semantic/mrt_customer_insights.sql
{{ config(materialized='table') }}

SELECT
    event_id AS customer_event_id,
    channel,
    issue_type,
    token_count,
    clean_text AS ai_ready_context,
    created_at AS event_timestamp  -- avoid aliasing to the reserved word "timestamp"
FROM {{ ref('int_events_tokenized') }}
WHERE token_count > 2 -- Filter out empty noise
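Downstream, `ai_ready_context` is what gets embedded, and long records should be chunked before they hit an embedding API. A minimal token-window chunker you could run over this column (the function name is mine; the 128-token window and 16-token overlap are illustrative defaults, not recommendations):

```python
from typing import List

def chunk_tokens(tokens: List[str], max_tokens: int = 128, overlap: int = 16) -> List[str]:
    """Split a token list into overlapping windows sized for an embedding API."""
    if max_tokens <= overlap:
        raise ValueError("max_tokens must exceed overlap")
    chunks = []
    step = max_tokens - overlap  # slide forward, keeping `overlap` tokens of context
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(" ".join(window))
        if start + max_tokens >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

The overlap preserves context across chunk boundaries, which matters for retrieval quality; the `tokens` column from the intermediate model feeds straight into it.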

Step 4: The Orchestrator (Apache Airflow)

This is where most candidates fail. Every tutorial DAG runs perfectly. Production DAGs don't.

When you are processing 5,000,000 rows, transient network errors happen. Spark clusters temporarily run out of memory. APIs time out. If your portfolio just shows `BashOperator('dbt run')`, you are signaling that you have never operated a pipeline in the real world.

A senior engineer builds a DAG that assumes failure is inevitable. We implement retries, a retry_delay, explicit logging, and an on_failure_callback to alert the team when SLAs are breached.

python
# dags/omnichannel_ai_pipeline.py
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta
import logging

# 1. ALERTING: The Failure Callback
def slack_alert_on_failure(context):
    """Triggers a Slack webhook if the pipeline fails after all retries."""
    task_instance = context.get('task_instance')
    dag_id = task_instance.dag_id
    task_id = task_instance.task_id
    log_url = task_instance.log_url

    error_message = f"🚨 *DAG Failed!* 🚨\n*DAG:* {dag_id}\n*Task:* {task_id}\n*Logs:* <{log_url}|View Logs>"
    logging.error(f"Executing Slack Alert: {error_message}")
    # In production: requests.post(SLACK_WEBHOOK_URL, json={'text': error_message})

# 2. PRODUCTION DEFAULTS
default_args = {
    'owner': 'data_engineering_team',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 3,  # Never fail on the first transient network blip
    'retry_delay': timedelta(minutes=5),  # Wait 5 mins for the Spark cluster to recover
    'on_failure_callback': slack_alert_on_failure,  # Alert ONLY after 3 retries fail
}

with DAG(
    'omnichannel_ai_preparation_pipeline',
    default_args=default_args,
    description='Ingests, cleans, and prepares 5M+ daily customer records for LLM embeddings.',
    schedule='0 3 * * *',  # Run daily at 3:00 AM ("schedule_interval" and days_ago are deprecated)
    start_date=datetime(2026, 3, 1),
    catchup=False,
    tags=['production', 'ai_platform', 'pyspark'],
) as dag:

    # Task 1: Run the PySpark heavy lifting (Intermediate Layer)
    run_spark_transformations = BashOperator(
        task_id='run_pyspark_cleaning_layer',
        bash_command='dbt run --select models/intermediate/',
        # Specific override: PySpark jobs are heavy, give them a longer timeout
        execution_timeout=timedelta(hours=2)
    )

    # Task 2: Build the Semantic models (SQL)
    run_semantic_layer = BashOperator(
        task_id='run_sql_semantic_layer',
        bash_command='dbt run --select models/semantic/',
    )

    # Task 3: Data Quality Tests
    run_data_tests = BashOperator(
        task_id='run_dbt_data_tests',
        bash_command='dbt test',
    )

    # Define Dependencies
    run_spark_transformations >> run_semantic_layer >> run_data_tests
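The callback above only logs the alert; wiring it to a real webhook takes a few more lines. A hedged sketch using only the standard library (`SLACK_WEBHOOK_URL` is a placeholder environment variable, and the helper names are mine; the message format mirrors the callback):

```python
import json
import logging
import os
import urllib.request

def build_failure_message(dag_id: str, task_id: str, log_url: str) -> str:
    """Format the Slack mrkdwn body for a failed DAG run."""
    return (
        f"🚨 *DAG Failed!* 🚨\n"
        f"*DAG:* {dag_id}\n*Task:* {task_id}\n*Logs:* <{log_url}|View Logs>"
    )

def post_to_slack(message: str, timeout: int = 10) -> None:
    """POST the alert to an incoming webhook; alerting must never crash the callback."""
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")  # placeholder env var
    if not webhook_url:
        logging.warning("SLACK_WEBHOOK_URL not set; skipping alert")
        return
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=timeout)
    except OSError as exc:
        # Swallow network errors: a failed alert should not fail the task twice.
        logging.error("Slack alert failed: %s", exc)
```

The swallow-everything error handling is deliberate: an exception raised inside an Airflow failure callback only adds noise to an already-failing run.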

The Final Workflow

Instead of sending raw text directly to an LLM, we send clean, structured, chunked input. This drastically reduces token cost, latency, and noise.

The workflow becomes: Airflow Schedule → PySpark (Clean/Chunk) → dbt (Structure) → AI Model. Not: Raw data → AI.

If you are going into an interview, this is what you tell them. Every hiring manager looking at your work is asking one implicit question: Can this person build something I'd trust running in production? When you show them an incremental PySpark job wrapped in an Airflow DAG with explicit failure callbacks and retry logic, you instantly answer that question.

Ready to go deeper?

Explore our full curriculum: hands-on skill toolkits built for production data engineering.
