The Reality of Streaming: When to Actually Use Apache Flink
The Streaming Tax
Every streaming system comes with a tax: operational complexity, stateful debugging, exactly-once semantics that are harder than they look, and engineers who need to deeply understand watermarks and event time. Before picking Flink, be honest about whether you're paying that tax for a real reason.
| Technology | Latency | Ops Complexity | Best For |
|---|---|---|---|
| dbt / Batch Spark | Minutes–Hours | Low | Historical analytics, reporting |
| Spark Structured Streaming | 10 s – 60 s | Medium | Near-real-time dashboards, most use cases |
| Kafka Streams | 100 ms – 5 s | Medium | Simple stateless/stateful transforms |
| Apache Flink | < 100 ms | High | CEP, large stateful joins, CDC pipelines |
Start Here: The Decision Framework
The decision flow, condensed:

- Batch (dbt · Spark batch · Airflow): simple, cheap, reliable
- Micro-batch (Spark Structured Streaming, or dbt + Materialize): a good default for most teams
- True streaming: do you need stateful joins, CEP, or large state?
  - No → Kafka Streams
  - Yes → Apache Flink, once you've exhausted the simpler options
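The flow above can be condensed into a small decision helper. This is a sketch only: the function name and thresholds are illustrative, chosen to mirror the latency tiers in the comparison table.

```python
def pick_streaming_stack(freshness_slo_s: float,
                         needs_cep_or_large_state: bool) -> str:
    """Map a freshness SLO and state requirements to a technology tier.

    Thresholds mirror the comparison table: minutes or more favors batch,
    roughly 10-60 s favors micro-batch, anything tighter needs true streaming.
    """
    if freshness_slo_s >= 300:
        return "batch (dbt / Spark batch / Airflow)"
    if freshness_slo_s >= 10:
        return "micro-batch (Spark Structured Streaming)"
    # Sub-10-second SLO: true streaming required
    if needs_cep_or_large_state:
        return "Apache Flink"
    return "Kafka Streams"

print(pick_streaming_stack(3600, False))  # batch tier
print(pick_streaming_stack(30, False))    # micro-batch tier
print(pick_streaming_stack(0.1, True))    # Flink territory
```

The point of writing it down is the ordering: latency eliminates most candidates before the state question even comes up.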
The single most important question is latency. Most "real-time" dashboards executives request need 30-second freshness, not 100ms. Micro-batch gets you there at a fraction of the cost.
Use Flink only when you can answer yes to at least one of these:

- You need sub-100 ms end-to-end latency, not 10–60 second freshness.
- You need complex event processing: detecting patterns across sequences of events.
- You need large stateful joins or long-lived keyed state.
- You're building CDC pipelines that require true event-time semantics.
What Flink Actually Looks Like
Flink Streaming Pipeline Architecture: data flows from sources through Flink operators (each holding local state) to sinks. Checkpoints create consistent snapshots across all operators — enabling exactly-once semantics end-to-end.
The Same Pipeline: Spark vs Flink
Here's a fraud score aggregation in Spark Structured Streaming (micro-batch, 30-second trigger):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, count, from_json, struct, to_json, window, sum as _sum
)
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("fraud-score").getOrCreate()

payment_schema = (
    StructType()
    .add("user_id", StringType())
    .add("amount", DoubleType())
    .add("event_time", TimestampType())
)

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "payment-events")
    .load()
    # Kafka delivers raw bytes: parse the JSON payload into typed columns
    .select(from_json(col("value").cast("string"), payment_schema).alias("e"))
    .select("e.*")
)

aggregated = (
    events
    .withWatermark("event_time", "2 minutes")
    .groupBy(
        window("event_time", "5 minutes", "30 seconds"),
        "user_id",
    )
    .agg(
        _sum("amount").alias("total_amount"),
        count("*").alias("tx_count"),
    )
)

query = (
    aggregated
    # The Kafka sink expects a string or binary "value" column
    .select(to_json(struct("*")).alias("value"))
    .writeStream
    .trigger(processingTime="30 seconds")  # micro-batch
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "fraud-scores")
    .option("checkpointLocation", "/tmp/fraud-score-chk")
    .start()
)
```

And the same pipeline in Flink's DataStream API (true event-time streaming):
```python
from pyflink.common import Duration, Time
from pyflink.common.watermark_strategy import WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaRecordSerializationSchema, KafkaSink, KafkaSource
)
from pyflink.datastream.window import TumblingEventTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(30_000)  # checkpoint every 30s → exactly-once

source = (
    KafkaSource.builder()
    .set_bootstrap_servers("broker:9092")
    .set_topics("payment-events")
    .set_value_only_deserializer(PaymentSchema())  # your DeserializationSchema
    .build()
)

watermark_strategy = (
    WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(5))
    .with_timestamp_assigner(PaymentTimestampAssigner())  # your TimestampAssigner
)

stream = env.from_source(source, watermark_strategy, "payments")

result = (
    stream
    .key_by(lambda e: e.user_id)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(FraudScoreAggregate())  # your AggregateFunction
)

sink = (
    KafkaSink.builder()
    .set_bootstrap_servers("broker:9092")
    .set_record_serializer(
        KafkaRecordSerializationSchema.builder()
        .set_topic("fraud-scores")
        .set_value_serialization_schema(FraudScoreSchema())  # your SerializationSchema
        .build()
    )
    .build()
)
result.sink_to(sink)
env.execute("fraud-score-pipeline")
```

The Flink version processes on true event time with watermarks, not wall-clock triggers — meaning late-arriving events (network delays, mobile clients) are handled correctly rather than dropped.
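To make the watermark mechanics concrete, here is a framework-free sketch of bounded out-of-orderness: the watermark trails the maximum event time seen so far by the configured delay, and an event counts as late when its timestamp falls at or below the current watermark. This is illustrative only; Flink's real implementation lives inside its `WatermarkStrategy` machinery.

```python
BOUNDED_DELAY_S = 5  # mirrors Duration.of_seconds(5) in the pipeline above

def classify(event_times):
    """Label each event on-time or late under a bounded-out-of-orderness watermark."""
    max_event_time = float("-inf")
    results = []
    for ts in event_times:
        # Watermark = max event time seen so far, minus the allowed delay
        watermark = max_event_time - BOUNDED_DELAY_S
        results.append((ts, "late" if ts <= watermark else "on-time"))
        max_event_time = max(max_event_time, ts)
    return results

# The final event arrives 10s behind the max, outside the 5s bound
print(classify([100, 103, 101, 110, 100]))
```

Note that the event at 101 is out of order yet still on time: it trails the running maximum (103) by less than the 5-second bound. Only the final event, 10 seconds behind, crosses the watermark.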
CEP: Flink's Killer Feature
Complex Event Processing is where Flink has no real competitor. Detecting sequences of events — "3 failed logins within 60 seconds followed by a password reset" — is trivial with Flink CEP:
```java
Pattern<LoginEvent, ?> suspiciousPattern = Pattern
    .<LoginEvent>begin("failed_logins")
    .where(SimpleCondition.of(e -> e.getStatus().equals("FAILED")))
    .timesOrMore(3)
    .followedBy("password_reset")
    .where(SimpleCondition.of(e -> e.getType().equals("PASSWORD_RESET")))
    .within(Time.seconds(60));  // the whole sequence must complete within 60s

PatternStream<LoginEvent> patternStream =
    CEP.pattern(loginStream.keyBy(LoginEvent::getUserId), suspiciousPattern);

patternStream.select(match -> {
    List<LoginEvent> failedLogins = match.get("failed_logins");
    LoginEvent resetEvent = match.get("password_reset").get(0);
    return new SecurityAlert(resetEvent.getUserId(), failedLogins.size());
});
```

Replicating this in Spark requires maintaining your own per-user state machine. In Flink it's 10 lines.
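For contrast, here is what that hand-rolled per-user state machine looks like in plain Python — just the detection logic, without any of the persistence, checkpointing, or key expiry a real engine would also force you to build yourself:

```python
from collections import defaultdict, deque

WINDOW_S = 60          # "within 60 seconds"
FAILED_THRESHOLD = 3   # "3 or more failed logins"

# Per-user ring of recent failed-login timestamps: this is the state
# you would have to persist, checkpoint, and expire on your own
failed_logins = defaultdict(deque)

def on_event(user_id, event_type, ts):
    """Return (user_id, failure_count) when the suspicious sequence completes, else None."""
    window = failed_logins[user_id]
    # Expire failures older than the 60-second window
    while window and ts - window[0] > WINDOW_S:
        window.popleft()
    if event_type == "FAILED":
        window.append(ts)
        return None
    if event_type == "PASSWORD_RESET" and len(window) >= FAILED_THRESHOLD:
        alert = (user_id, len(window))
        window.clear()
        return alert
    return None
```

The logic is short, but everything around it (durability across restarts, per-key TTL, out-of-order timestamps) is exactly what Flink CEP handles for free.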
Exactly-Once: How Checkpointing Works
Flink's exactly-once guarantee relies on periodic checkpoints that snapshot all operator state to durable storage (typically S3):
```python
from pyflink.datastream import CheckpointingMode, ExternalizedCheckpointCleanup
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

# Flink checkpoint configuration — production settings
env.enable_checkpointing(60_000)  # every 60 seconds
checkpoint_config = env.get_checkpoint_config()
checkpoint_config.set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
checkpoint_config.set_min_pause_between_checkpoints(30_000)
checkpoint_config.set_checkpoint_timeout(120_000)
checkpoint_config.set_tolerable_checkpoint_failure_number(3)
# Keep checkpoints after a manual cancel so the job can be restored later
checkpoint_config.enable_externalized_checkpoints(
    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
)
# Use RocksDB for large state (> a few GB)
env.set_state_backend(EmbeddedRocksDBStateBackend())
```

If your job crashes mid-window, Flink replays from the last checkpoint. Kafka offsets are committed as part of the checkpoint — so you get end-to-end exactly-once from source to sink, provided the sink is transactional (the Kafka sink is, when configured with an exactly-once delivery guarantee).
The Real Cost: Operational Complexity
Before you ship to production, understand what you're signing up for:
| Concern | What It Means in Practice |
|---|---|
| Watermarks | Late events after the watermark are dropped — tune allowedLateness carefully |
| RocksDB tuning | Large state requires block cache sizing, compaction tuning, and SSD-backed volumes |
| Backpressure | Slow sinks propagate pressure upstream — monitor source lag and operator throughput |
| Checkpoint failures | A missed checkpoint doesn't fail the job, but recovery replays further back |
| JVM overhead | Flink runs on the JVM — GC pauses at high throughput require heap + G1GC tuning |
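The watermark concern in the table above has two knobs worth knowing: a window can keep accepting late events for an extra grace period (allowed lateness), and events that arrive later still can be diverted to a side output instead of being silently dropped. A hedged PyFlink sketch — `stream` and `FraudScoreAggregate` are stand-ins from the earlier pipeline, and the tag name is illustrative:

```python
from pyflink.common import Time
from pyflink.common.typeinfo import Types
from pyflink.datastream import OutputTag
from pyflink.datastream.window import TumblingEventTimeWindows

late_tag = OutputTag("late-payments", Types.PICKLED_BYTE_ARRAY())

windowed = (
    stream
    .key_by(lambda e: e.user_id)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .allowed_lateness(2 * 60 * 1000)   # keep windows open 2 min past the watermark
    .side_output_late_data(late_tag)   # later events go here instead of being dropped
    .aggregate(FraudScoreAggregate())
)

# Sink the stragglers separately, e.g. to a dead-letter topic for reconciliation
late_stream = windowed.get_side_output(late_tag)
```

Allowed lateness trades correctness for state: every open window holds its state until the grace period expires, so set it deliberately rather than defensively.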
None of these are deal-breakers, but each requires an engineer who understands Flink internals. Budget for that learning curve before committing.
When to Stay on Spark Structured Streaming
Spark Structured Streaming gets you to 10-second latency with far less operational burden. For most analytics use cases, that's the right call.
Build the Foundation First
The biggest operational risk with Flink isn't the API: it's going to production without a deep understanding of event-time semantics, watermarks, and state management. A watermark tuned too aggressively will silently drop late events. Unbounded keyed state that is never expired will fill your RocksDB volumes and crash the job.
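State TTL is the standard guard against that unbounded-state failure mode: entries that haven't been written within the TTL are expired instead of accumulating forever. A sketch of the wiring, with an assumed 24-hour TTL and an illustrative descriptor name:

```python
from pyflink.common import Time
from pyflink.common.typeinfo import Types
from pyflink.datastream.state import StateTtlConfig, ValueStateDescriptor

# Expire per-user fraud state 24h after the last write, so abandoned
# keys cannot grow RocksDB without bound
ttl_config = (
    StateTtlConfig
    .new_builder(Time.hours(24))
    .set_update_type(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .set_state_visibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .build()
)

descriptor = ValueStateDescriptor("fraud-score", Types.DOUBLE())
descriptor.enable_time_to_live(ttl_config)
# In a KeyedProcessFunction's open(): self.score = runtime_context.get_state(descriptor)
```

Pick the TTL from your domain (how long can a key plausibly stay relevant?), not from disk capacity.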
Our Apache Flink learning path covers exactly what you need before going live: time semantics and watermarks, state backends and RocksDB tuning, windowing patterns, Kafka integration with CDC, and production deployment — from first pipeline to fault-tolerant production job.
Ready to go deeper?
Explore our full curriculum — hands-on skill toolkits built for production data engineering.