Engineering Insights · Streaming

The Reality of Streaming: When to Actually Use Apache Flink

Maya Chen · Mar 14, 2026 · 6 min read

The Streaming Tax

Every streaming system comes with a tax: operational complexity, stateful debugging, exactly-once semantics that are harder than they look, and engineers who need to deeply understand watermarks and event time. Before picking Flink, be honest about whether you're paying that tax for a real reason.

| Technology | Latency | Ops Complexity | Best For |
|---|---|---|---|
| dbt / batch Spark | Minutes–hours | Low | Historical analytics, reporting |
| Spark Structured Streaming | 10 s – 60 s | Medium | Near-real-time dashboards, most use cases |
| Kafka Streams | 100 ms – 5 s | Medium | Simple stateless/stateful transforms |
| Apache Flink | < 100 ms | High | CEP, large stateful joins, CDC pipelines |

Start Here: The Decision Framework

Streaming Technology Decision Framework: what latency does your use case require?

  • Minutes–hours → Batch (dbt · Spark batch · Airflow). Simple, cheap, reliable.
  • 10 s – 60 s → Micro-batch (Spark Structured Streaming, dbt + Materialize). A good default for most teams.
  • < 1 second → ask next: do you need stateful joins, CEP, or large state?
      • No → Kafka Streams
      • Yes → Apache Flink, for when you've exhausted simpler options

The single most important question is latency. Most "real-time" dashboards executives request need 30-second freshness, not 100ms. Micro-batch gets you there at a fraction of the cost.

Use Flink only when you can answer yes to at least one of these:

  • Do you need sub-second latency end-to-end?
  • Do you need stateful joins across streams with unbounded or very large state?
  • Do you need Complex Event Processing (CEP) — detecting patterns across sequences of events?
  • Are you building high-volume CDC pipelines, where Debezium + Flink is the proven stack?

What Flink Actually Looks Like

Flink Streaming Pipeline Architecture (high level):

  • Sources: Kafka topics, CDC (Debezium), HTTP events
  • Flink operators: map/filter, keyed state, windows (tumbling/session), CEP pattern matching, async I/O lookups, watermark strategy
  • State: RocksDB, with checkpoints written to S3
  • Sinks: Kafka (results), Postgres/OLTP, S3/Iceberg, Prometheus

Checkpoints create consistent snapshots across all operators, enabling exactly-once semantics end-to-end.

The Same Pipeline: Spark vs Flink

Here's a fraud-score aggregation in Spark Structured Streaming (micro-batch, 30-second trigger):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, sum as _sum, count, to_json, struct

spark = SparkSession.builder.appName("fraud-score").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "payment-events")
    .load()
)
# NOTE: parsing the Kafka value into typed columns (user_id, amount,
# event_time) is omitted here for brevity.

aggregated = (
    events
    .withWatermark("event_time", "2 minutes")
    .groupBy(
        window("event_time", "5 minutes", "30 seconds"),
        "user_id"
    )
    .agg(
        _sum("amount").alias("total_amount"),
        count("*").alias("tx_count")
    )
)

query = (
    aggregated
    .select(to_json(struct("*")).alias("value"))  # Kafka sink reads the value column
    .writeStream
    .trigger(processingTime="30 seconds")   # micro-batch
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "fraud-scores")
    .option("checkpointLocation", "/tmp/checkpoints/fraud-score")
    .start()
)
```

And the same pipeline in Flink's DataStream API (true event-time streaming):

```python
from pyflink.common.serialization import SimpleStringSchema
from pyflink.common.time import Duration, Time
from pyflink.common.watermark_strategy import WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaSource, KafkaSink, KafkaRecordSerializationSchema
)
from pyflink.datastream.window import TumblingEventTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(30_000)  # checkpoint every 30 s → exactly-once

source = (
    KafkaSource.builder()
    .set_bootstrap_servers("broker:9092")
    .set_topics("payment-events")
    .set_value_only_deserializer(PaymentSchema())  # user-defined deserializer
    .build()
)

watermark_strategy = (
    WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(5))
    .with_timestamp_assigner(PaymentTimestampAssigner())  # user-defined
)

stream = env.from_source(source, watermark_strategy, "payments")

result = (
    stream
    .key_by(lambda e: e.user_id)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(FraudScoreAggregate())  # user-defined AggregateFunction
)

sink = (
    KafkaSink.builder()
    .set_bootstrap_servers("broker:9092")
    .set_record_serializer(
        KafkaRecordSerializationSchema.builder()
        .set_topic("fraud-scores")
        .set_value_serialization_schema(SimpleStringSchema())
        .build()
    )
    .build()
)
result.sink_to(sink)
env.execute("fraud-score-pipeline")
```

The Flink version processes on true event time with watermarks, not wall-clock triggers — meaning late-arriving events (network delays, mobile clients) are handled correctly rather than dropped.
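The watermark mechanics are worth internalizing before touching the API. Here's a minimal pure-Python sketch (not Flink code; names are illustrative) of the bounded-out-of-orderness strategy used above: the watermark trails the largest event time seen by the configured delay, and only events that fall behind the watermark count as late.

```python
def bounded_out_of_orderness(events, max_delay_s=5):
    """Replay (timestamp, payload) events and tag each as on-time or late.

    Mirrors the idea behind for_bounded_out_of_orderness: the watermark is
    (max event time seen so far) minus max_delay_s, and an event is late
    only when its timestamp is already behind the current watermark.
    """
    max_ts = float("-inf")
    tagged = []
    for ts, payload in events:
        watermark = max_ts - max_delay_s
        tagged.append((payload, "late" if ts < watermark else "on-time"))
        max_ts = max(max_ts, ts)
    return tagged

# Event "c" arrives 3 s out of order: still within the 5 s bound, so it
# is processed. Event "d" is 9 s behind the front and falls past the
# watermark, so it is tagged late.
out = bounded_out_of_orderness([(100, "a"), (110, "b"), (107, "c"), (101, "d")])
```

This is also why a tighter delay means fresher results but more dropped events: the trade-off is entirely in that one subtraction.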

CEP: Flink's Killer Feature

Complex Event Processing is where Flink has no real competitor. Detecting sequences of events — "3 failed logins within 60 seconds followed by a password reset" — is trivial with Flink CEP:

```java
Pattern<LoginEvent, ?> suspiciousPattern = Pattern
    .<LoginEvent>begin("failed_logins")
    .where(SimpleCondition.of(e -> e.getStatus().equals("FAILED")))
    .timesOrMore(3)
    .followedBy("password_reset")
    .where(SimpleCondition.of(e -> e.getType().equals("PASSWORD_RESET")))
    .within(Time.seconds(60));  // the 60 s window applies to the whole sequence

PatternStream<LoginEvent> patternStream =
    CEP.pattern(loginStream.keyBy(LoginEvent::getUserId), suspiciousPattern);

patternStream.select(match -> {
    List<LoginEvent> failedLogins = match.get("failed_logins");
    LoginEvent resetEvent = match.get("password_reset").get(0);
    return new SecurityAlert(resetEvent.getUserId(), failedLogins.size());
});
```

Replicating this in Spark requires maintaining your own per-user state machine. In Flink it's 10 lines.
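For contrast, here's roughly the per-user state machine you'd have to hand-roll. This is an illustrative pure-Python sketch (the event shapes and thresholds are assumptions), and it already ignores everything Flink gives you for free: fault tolerance, state TTL, and event-time ordering.

```python
from collections import defaultdict

WINDOW_S = 60   # pattern window: failures within 60 s of the reset
THRESHOLD = 3   # at least 3 failed logins

# Per-user state: timestamps of recent failed logins.
failed = defaultdict(list)

def on_event(user_id, event_type, ts, alerts):
    """Hand-rolled version of the CEP pattern: >= 3 FAILED logins within
    60 s, followed by a PASSWORD_RESET, raises an alert."""
    window = failed[user_id]
    if event_type == "FAILED":
        window.append(ts)
    elif event_type == "PASSWORD_RESET":
        # Keep only failures inside the 60 s window ending at this reset.
        recent = [t for t in window if ts - t <= WINDOW_S]
        if len(recent) >= THRESHOLD:
            alerts.append((user_id, len(recent)))
        window.clear()

alerts = []
for user, etype, ts in [
    ("u1", "FAILED", 1), ("u1", "FAILED", 10), ("u1", "FAILED", 20),
    ("u1", "PASSWORD_RESET", 30),               # 3 failures in window → alert
    ("u2", "FAILED", 5), ("u2", "PASSWORD_RESET", 6),  # only 1 → no alert
]:
    on_event(user, etype, ts, alerts)
```

And this version still has no answer for a crash: the `failed` dict simply vanishes, which is exactly the state-management problem checkpointing solves.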

Exactly-Once: How Checkpointing Works

Flink's exactly-once guarantee relies on periodic checkpoints that snapshot all operator state to durable storage (typically S3):

```python
from pyflink.datastream import (
    CheckpointingMode, EmbeddedRocksDBStateBackend, StreamExecutionEnvironment
)
from pyflink.datastream.checkpoint_config import ExternalizedCheckpointCleanup

# Flink checkpoint configuration — production settings
env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # every 60 seconds

checkpoint_config = env.get_checkpoint_config()
checkpoint_config.set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
checkpoint_config.set_min_pause_between_checkpoints(30_000)
checkpoint_config.set_checkpoint_timeout(120_000)
checkpoint_config.set_tolerable_checkpoint_failure_number(3)
checkpoint_config.enable_externalized_checkpoints(
    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
)

# Use RocksDB for large state (more than a few GB)
env.set_state_backend(EmbeddedRocksDBStateBackend())
```

If your job crashes mid-window, Flink replays from the last checkpoint. Kafka offsets are committed as part of the checkpoint — so you get end-to-end exactly-once from source to sink with no duplicates.
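The recovery semantics are easy to see in miniature. This pure-Python sketch (not Flink code) checkpoints the source offset and the operator state as one atomic pair, so a crash rewinds both together and the final result contains no duplicates:

```python
def run(events, checkpoint_every=3, crash_at=None):
    """Process a stream of (user, amount) pairs, summing per user.

    A checkpoint atomically saves (next offset, state). On a crash we
    restore both and replay from the checkpointed offset, so every
    event is reflected in the final state exactly once.
    """
    checkpoint = (0, {})           # (next offset to read, operator state)
    offset, state = 0, {}
    while offset < len(events):
        if offset == crash_at:
            crash_at = None        # crash once, then recover
            offset, state = checkpoint[0], dict(checkpoint[1])
            continue
        user, amount = events[offset]
        state[user] = state.get(user, 0) + amount
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, dict(state))
    return state

events = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
# With or without a crash mid-stream, the final state is identical.
assert run(events) == run(events, crash_at=4) == {"a": 9, "b": 6}
```

Note that the event at the crashed offset gets processed twice, but the second pass starts from the checkpointed state, which had already discarded the first attempt's effects. That rollback-and-replay is the essence of Flink's exactly-once guarantee.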

The Real Cost: Operational Complexity

Before you ship to production, understand what you're signing up for:

| Concern | What It Means in Practice |
|---|---|
| Watermarks | Late events after the watermark are dropped — tune allowedLateness carefully |
| RocksDB tuning | Large state requires block-cache sizing, compaction tuning, and SSD-backed volumes |
| Backpressure | Slow sinks propagate pressure upstream — monitor source lag and operator throughput |
| Checkpoint failures | A missed checkpoint doesn't fail the job, but recovery replays further back |
| JVM overhead | Flink runs on the JVM — GC pauses at high throughput require heap and G1GC tuning |

None of these are deal-breakers, but each requires an engineer who understands Flink internals. Budget for that learning curve before committing.

When to Stay on Spark Structured Streaming

  • Your latency requirement is > 10 seconds
  • Your team already knows Spark and PySpark
  • You want to share code between batch and streaming jobs
  • Your state is bounded and simple (no complex joins or CEP)
  • You want easier local development and debugging

Spark Structured Streaming gets you to 10-second latency with far less operational burden. For most analytics use cases, that's the right call.

Build the Foundation First

The biggest operational risk with Flink isn't the API — it's going to production without a deep understanding of event-time semantics, watermarks, and state management. A watermark tuned too aggressively will silently drop late events. Unbounded state that is never cleaned up will fill your RocksDB volumes and crash the job.

Our Apache Flink learning path covers exactly what you need before going live: time semantics and watermarks, state backends and RocksDB tuning, windowing patterns, Kafka integration with CDC, and production deployment — from first pipeline to fault-tolerant production job.

Start the Apache Flink & Stream Processing path

Ready to go deeper?

Explore our full curriculum — hands-on skill toolkits built for production data engineering.
