The Reality of Streaming: When to Actually Use Apache Flink
The Streaming Tax
Every streaming system comes with a tax: operational complexity, stateful debugging, exactly-once semantics that are harder than they look, and engineers who need to deeply understand watermarks and event time. Before picking Flink, be honest about whether you're paying that tax for a real reason.
| Technology | Latency | Ops Complexity | Best For |
|---|---|---|---|
| dbt / Batch Spark | Minutes–Hours | Low | Historical analytics, reporting |
| Spark Structured Streaming | 10 s – 60 s | Medium | Near-real-time dashboards, most use cases |
| Kafka Streams | 100 ms – 5 s | Medium | Simple stateless/stateful transforms |
| Apache Flink | < 100 ms | High | CEP, large stateful joins, CDC pipelines |
Start Here: The Decision Framework
The decision flow, condensed:

- Batch (dbt · Spark batch · Airflow): simple, cheap, reliable
- Micro-batch (Spark Structured Streaming, or dbt + Materialize): a good default for most teams
- True streaming: do you need stateful joins, CEP, or large state?
  - No → Kafka Streams
  - Yes → Apache Flink, once you've exhausted the simpler options
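The flow above can be condensed into a small decision helper. This is a sketch only: the function name and thresholds are illustrative, chosen to mirror the latency tiers in the comparison table.

```python
def pick_streaming_stack(freshness_slo_s: float,
                         needs_cep_or_large_state: bool) -> str:
    """Map a freshness SLO and state requirements to a technology tier.

    Thresholds mirror the comparison table: minutes or more favors batch,
    roughly 10-60 s favors micro-batch, anything tighter needs true streaming.
    """
    if freshness_slo_s >= 300:
        return "batch (dbt / Spark batch / Airflow)"
    if freshness_slo_s >= 10:
        return "micro-batch (Spark Structured Streaming)"
    # Sub-10-second SLO: true streaming required
    if needs_cep_or_large_state:
        return "Apache Flink"
    return "Kafka Streams"

print(pick_streaming_stack(3600, False))  # batch tier
print(pick_streaming_stack(30, False))    # micro-batch tier
print(pick_streaming_stack(0.1, True))    # Flink territory
```

The point of writing it down is the ordering: latency eliminates most candidates before the state question even comes up.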
The single most important question is latency. Most "real-time" dashboards executives request need 30-second freshness, not 100ms. Micro-batch gets you there at a fraction of the cost.
Use Flink only when you can answer yes to at least one of these:

- You need sub-100 ms end-to-end latency, not 10–60 second freshness.
- You need complex event processing: detecting patterns across sequences of events.
- You need large stateful joins or long-lived keyed state.
- You're building CDC pipelines that require true event-time semantics.
What Flink Actually Looks Like
Flink Streaming Pipeline Architecture: data flows from sources through Flink operators (each holding local state) to sinks. Checkpoints create consistent snapshots across all operators — enabling exactly-once semantics end-to-end.
The Same Pipeline: Spark vs Flink
Here's a fraud score aggregation in Spark Structured Streaming (micro-batch, 30-second trigger):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, count, from_json, struct, to_json, window, sum as _sum
)
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("fraud-score").getOrCreate()

payment_schema = (
    StructType()
    .add("user_id", StringType())
    .add("amount", DoubleType())
    .add("event_time", TimestampType())
)

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "payment-events")
    .load()
    # Kafka delivers raw bytes: parse the JSON payload into typed columns
    .select(from_json(col("value").cast("string"), payment_schema).alias("e"))
    .select("e.*")
)

aggregated = (
    events
    .withWatermark("event_time", "2 minutes")
    .groupBy(
        window("event_time", "5 minutes", "30 seconds"),
        "user_id",
    )
    .agg(
        _sum("amount").alias("total_amount"),
        count("*").alias("tx_count"),
    )
)

query = (
    aggregated
    # The Kafka sink expects a string or binary "value" column
    .select(to_json(struct("*")).alias("value"))
    .writeStream
    .trigger(processingTime="30 seconds")  # micro-batch
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "fraud-scores")
    .option("checkpointLocation", "/tmp/fraud-score-chk")
    .start()
)
```

And the same pipeline in Flink's DataStream API (true event-time streaming):
```python
from pyflink.common import Duration, Time
from pyflink.common.watermark_strategy import WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaRecordSerializationSchema, KafkaSink, KafkaSource
)
from pyflink.datastream.window import TumblingEventTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(30_000)  # checkpoint every 30s → exactly-once

source = (
    KafkaSource.builder()
    .set_bootstrap_servers("broker:9092")
    .set_topics("payment-events")
    .set_value_only_deserializer(PaymentSchema())  # your DeserializationSchema
    .build()
)

watermark_strategy = (
    WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(5))
    .with_timestamp_assigner(PaymentTimestampAssigner())  # your TimestampAssigner
)

stream = env.from_source(source, watermark_strategy, "payments")

result = (
    stream
    .key_by(lambda e: e.user_id)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(FraudScoreAggregate())  # your AggregateFunction
)

sink = (
    KafkaSink.builder()
    .set_bootstrap_servers("broker:9092")
    .set_record_serializer(
        KafkaRecordSerializationSchema.builder()
        .set_topic("fraud-scores")
        .set_value_serialization_schema(FraudScoreSchema())  # your SerializationSchema
        .build()
    )
    .build()
)
result.sink_to(sink)
env.execute("fraud-score-pipeline")
```

The Flink version processes on true event time with watermarks, not wall-clock triggers — meaning late-arriving events (network delays, mobile clients) are handled correctly rather than dropped.
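To make the watermark mechanics concrete, here is a framework-free sketch of bounded out-of-orderness: the watermark trails the maximum event time seen so far by the configured delay, and an event counts as late when its timestamp falls at or below the current watermark. This is illustrative only; Flink's real implementation lives inside its `WatermarkStrategy` machinery.

```python
BOUNDED_DELAY_S = 5  # mirrors Duration.of_seconds(5) in the pipeline above

def classify(event_times):
    """Label each event on-time or late under a bounded-out-of-orderness watermark."""
    max_event_time = float("-inf")
    results = []
    for ts in event_times:
        # Watermark = max event time seen so far, minus the allowed delay
        watermark = max_event_time - BOUNDED_DELAY_S
        results.append((ts, "late" if ts <= watermark else "on-time"))
        max_event_time = max(max_event_time, ts)
    return results

# The final event arrives 10s behind the max, outside the 5s bound
print(classify([100, 103, 101, 110, 100]))
```

Note that the event at 101 is out of order yet still on time: it trails the running maximum (103) by less than the 5-second bound. Only the final event, 10 seconds behind, crosses the watermark.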
CEP: Flink's Killer Feature
Complex Event Processing is where Flink has no real competitor. Detecting sequences of events — "3 failed logins within 60 seconds followed by a password reset" — is trivial with Flink CEP:
```java
Pattern<LoginEvent, ?> suspiciousPattern = Pattern
    .<LoginEvent>begin("failed_logins")
    .where(SimpleCondition.of(e -> e.getStatus().equals("FAILED")))
    .timesOrMore(3)
    .followedBy("password_reset")
    .where(SimpleCondition.of(e -> e.getType().equals("PASSWORD_RESET")))
    .within(Time.seconds(60));  // the whole sequence must complete within 60s

PatternStream<LoginEvent> patternStream =
    CEP.pattern(loginStream.keyBy(LoginEvent::getUserId), suspiciousPattern);

patternStream.select(match -> {
    List<LoginEvent> failedLogins = match.get("failed_logins");
    LoginEvent resetEvent = match.get("password_reset").get(0);
    return new SecurityAlert(resetEvent.getUserId(), failedLogins.size());
});
```

Replicating this in Spark requires maintaining your own per-user state machine. In Flink it's 10 lines.
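For contrast, here is what that hand-rolled per-user state machine looks like in plain Python — just the detection logic, without any of the persistence, checkpointing, or key expiry a real engine would also force you to build yourself:

```python
from collections import defaultdict, deque

WINDOW_S = 60          # "within 60 seconds"
FAILED_THRESHOLD = 3   # "3 or more failed logins"

# Per-user ring of recent failed-login timestamps: this is the state
# you would have to persist, checkpoint, and expire on your own
failed_logins = defaultdict(deque)

def on_event(user_id, event_type, ts):
    """Return (user_id, failure_count) when the suspicious sequence completes, else None."""
    window = failed_logins[user_id]
    # Expire failures older than the 60-second window
    while window and ts - window[0] > WINDOW_S:
        window.popleft()
    if event_type == "FAILED":
        window.append(ts)
        return None
    if event_type == "PASSWORD_RESET" and len(window) >= FAILED_THRESHOLD:
        alert = (user_id, len(window))
        window.clear()
        return alert
    return None
```

The logic is short, but everything around it (durability across restarts, per-key TTL, out-of-order timestamps) is exactly what Flink CEP handles for free.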
Exactly-Once: How Checkpointing Works
Flink's exactly-once guarantee relies on periodic checkpoints that snapshot all operator state to durable storage (typically S3):
```python
from pyflink.datastream import CheckpointingMode, ExternalizedCheckpointCleanup
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

# Flink checkpoint configuration — production settings
env.enable_checkpointing(60_000)  # every 60 seconds
checkpoint_config = env.get_checkpoint_config()
checkpoint_config.set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
checkpoint_config.set_min_pause_between_checkpoints(30_000)
checkpoint_config.set_checkpoint_timeout(120_000)
checkpoint_config.set_tolerable_checkpoint_failure_number(3)
# Keep checkpoints after a manual cancel so the job can be restored later
checkpoint_config.enable_externalized_checkpoints(
    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
)
# Use RocksDB for large state (> a few GB)
env.set_state_backend(EmbeddedRocksDBStateBackend())
```

If your job crashes mid-window, Flink replays from the last checkpoint. Kafka offsets are committed as part of the checkpoint — so you get end-to-end exactly-once from source to sink, provided the sink is transactional (the Kafka sink is, when configured with an exactly-once delivery guarantee).
The Real Cost: Operational Complexity
Before you ship to production, understand what you're signing up for:
| Concern | What It Means in Practice |
|---|---|
| Watermarks | Late events after the watermark are dropped — tune allowedLateness carefully |
| RocksDB tuning | Large state requires block cache sizing, compaction tuning, and SSD-backed volumes |
| Backpressure | Slow sinks propagate pressure upstream — monitor source lag and operator throughput |
| Checkpoint failures | A missed checkpoint doesn't fail the job, but recovery replays further back |
| JVM overhead | Flink runs on the JVM — GC pauses at high throughput require heap + G1GC tuning |
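The watermark concern in the table above has two knobs worth knowing: a window can keep accepting late events for an extra grace period (allowed lateness), and events that arrive later still can be diverted to a side output instead of being silently dropped. A hedged PyFlink sketch — `stream` and `FraudScoreAggregate` are stand-ins from the earlier pipeline, and the tag name is illustrative:

```python
from pyflink.common import Time
from pyflink.common.typeinfo import Types
from pyflink.datastream import OutputTag
from pyflink.datastream.window import TumblingEventTimeWindows

late_tag = OutputTag("late-payments", Types.PICKLED_BYTE_ARRAY())

windowed = (
    stream
    .key_by(lambda e: e.user_id)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .allowed_lateness(2 * 60 * 1000)   # keep windows open 2 min past the watermark
    .side_output_late_data(late_tag)   # later events go here instead of being dropped
    .aggregate(FraudScoreAggregate())
)

# Sink the stragglers separately, e.g. to a dead-letter topic for reconciliation
late_stream = windowed.get_side_output(late_tag)
```

Allowed lateness trades correctness for state: every open window holds its state until the grace period expires, so set it deliberately rather than defensively.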
None of these are deal-breakers, but each requires an engineer who understands Flink internals. Budget for that learning curve before committing.
When to Stay on Spark Structured Streaming
Spark Structured Streaming gets you to 10-second latency with far less operational burden. For most analytics use cases, that's the right call.
Build the Foundation First
The biggest operational risk with Flink isn't the API: it's going to production without a deep understanding of event-time semantics, watermarks, and state management. A watermark tuned too aggressively will silently drop late events. Unbounded keyed state that is never expired will fill your RocksDB volumes and crash the job.
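State TTL is the standard guard against that unbounded-state failure mode: entries that haven't been written within the TTL are expired instead of accumulating forever. A sketch of the wiring, with an assumed 24-hour TTL and an illustrative descriptor name:

```python
from pyflink.common import Time
from pyflink.common.typeinfo import Types
from pyflink.datastream.state import StateTtlConfig, ValueStateDescriptor

# Expire per-user fraud state 24h after the last write, so abandoned
# keys cannot grow RocksDB without bound
ttl_config = (
    StateTtlConfig
    .new_builder(Time.hours(24))
    .set_update_type(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .set_state_visibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .build()
)

descriptor = ValueStateDescriptor("fraud-score", Types.DOUBLE())
descriptor.enable_time_to_live(ttl_config)
# In a KeyedProcessFunction's open(): self.score = runtime_context.get_state(descriptor)
```

Pick the TTL from your domain (how long can a key plausibly stay relevant?), not from disk capacity.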
Our Apache Flink learning path covers exactly what you need before going live: time semantics and watermarks, state backends and RocksDB tuning, windowing patterns, Kafka integration with CDC, and production deployment — from first pipeline to fault-tolerant production job.
Ready to go deeper?
Explore our full curriculum — hands-on skill toolkits built for production data engineering.