What is Apache Flink?
The open-source distributed stream processing engine for stateful, real-time data pipelines — with exactly-once guarantees and millisecond latency.
Quick Answer
Apache Flink is a distributed stream processing engine that processes real-time data with stateful computations, event-time semantics, and exactly-once guarantees. Unlike batch systems, Flink processes each event as it arrives — enabling sub-second latency for fraud detection, real-time analytics, and event-driven pipelines. It stores and checkpoints state (counts, aggregations, joins) to durable storage, so jobs recover automatically from failures without data loss.
What is Apache Flink?
Apache Flink was created at TU Berlin in 2010 (then called Stratosphere) and donated to the Apache Software Foundation in 2014. Today it's maintained by companies including Alibaba, Ververica, Amazon, and Google, and is the dominant engine for stateful stream processing at scale.
Flink processes two types of data: unbounded streams (real-time: Kafka topics, CDC feeds, clickstreams) and bounded datasets (batch: files, database tables). The same API handles both, making it a unified processing engine.
DataStream API
- Low-level stream transformations
- Full control over state and time
- map, filter, keyBy, process, window
- Best for complex event processing
Table API / Flink SQL
- SQL-first declarative queries on streams
- Continuous queries over Kafka topics
- TUMBLE, HOP, SESSION window functions
- Best for analytics and BI use cases
Why Flink Matters
Before Flink
- Fraud detected minutes after the transaction
- Batch jobs run hourly — dashboards always stale
- Complex stateful logic required custom databases
- Job failures meant reprocessing hours of data
- Event-time ordering impossible without massive buffers
With Flink
- Fraud rules fire within milliseconds of the event
- Real-time dashboards updated continuously
- Built-in keyed state replaces external state stores
- Checkpoints restore jobs to exact pre-failure position
- Watermarks handle late data without custom logic
What You Can Build with Flink
Fraud Detection
Score transactions in real time using velocity rules, behavioral windows, and ML model inference.
Real-Time Analytics
Count, aggregate, and join event streams for live dashboards and product metrics.
CDC Pipelines
Capture database changes from Postgres/MySQL and propagate to downstream sinks instantly.
Event-Driven Microservices
Trigger downstream actions (notifications, inventory updates) from Kafka events.
Stream-Batch Unification
Run the same Flink SQL query on live Kafka topics and historical S3 data.
Anomaly Detection
Detect metric spikes, SLO violations, and IoT sensor anomalies with sliding windows.
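The sliding windows used for anomaly detection can be reasoned about with simple arithmetic: an event belongs to every window whose range covers its timestamp. Here is a pure-Python sketch (not PyFlink code) of that assignment logic, mirroring what Flink's SlidingEventTimeWindows computes internally:

```python
# Pure-Python sketch of sliding-window assignment (not the Flink API).
# Windows of length `size` start every `slide` time units; an event
# belongs to every window whose [start, end) range covers its timestamp.

def assign_sliding_windows(ts: int, size: int, slide: int) -> list[tuple[int, int]]:
    """Return the [start, end) windows containing timestamp `ts`."""
    windows = []
    start = ts - (ts % slide)        # latest window start that can contain ts
    while start > ts - size:         # walk back until windows no longer cover ts
        windows.append((start, start + size))
        start -= slide
    return windows

# A 5-minute window sliding every 1 minute: each event lands in 5 windows.
print(assign_sliding_windows(ts=330, size=300, slide=60))
# → [(300, 600), (240, 540), (180, 480), (120, 420), (60, 360)]
```

When `size == slide` this degenerates to a tumbling window: each event falls in exactly one bucket.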
How Flink Works
A Flink job has three logical layers: a JobManager that coordinates execution and manages checkpoints, TaskManagers that run the parallel pipeline operators, and a State Backend (in-memory or RocksDB) that stores keyed state between events.
SOURCE (Kafka / Kinesis / File) → TRANSFORM (map / filter / keyBy) → WINDOW (tumbling / sliding / session) → SINK (Kafka / Postgres / ES)
DataStream job: count transactions per customer in a 1-minute window
# Python DataStream API (PyFlink)
from pyflink.common import Duration, WatermarkStrategy
from pyflink.common.time import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingEventTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()

# Read from Kafka (kafka_source, kafka_sink, and SumAggregateFunction
# are defined elsewhere in the job)
transactions = env.from_source(
    source=kafka_source,
    watermark_strategy=WatermarkStrategy
        .for_bounded_out_of_orderness(Duration.of_seconds(5)),
    source_name="transactions")

# Key by customer → tumbling 1-min window → sum amount
result = (
    transactions
    .key_by(lambda t: t.customer_id)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(SumAggregateFunction())
)

result.sink_to(kafka_sink)
env.execute("tx-volume-per-customer")
Flink SQL: continuous query over a Kafka topic
-- Create a Kafka-backed streaming table
CREATE TABLE transactions (
  customer_id STRING,
  amount DECIMAL(10, 2),
  ts TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka'
  ...
);

-- Continuous 1-minute tumbling window aggregation
SELECT customer_id,
       TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
       SUM(amount) AS total_amount,
       COUNT(*) AS tx_count
FROM transactions
GROUP BY customer_id,
         TUMBLE(ts, INTERVAL '1' MINUTE);
Flink vs Other Tools
Flink vs Apache Spark
Apache Flink
- True per-event streaming (sub-second latency)
- Native event-time + watermarks
- Rich stateful APIs (ValueState, MapState)
- Exactly-once via distributed snapshots
Apache Spark
- Micro-batch streaming (seconds latency)
- Larger ML/analytics ecosystem
- Better SQL and DataFrame support
- Easier to operate and debug
Flink vs Kafka Streams
Apache Flink
- Distributed cluster — scales to any throughput
- Multi-source: Kafka, Kinesis, CDC, files
- Flink SQL for declarative queries
- Rich window semantics + watermarks
Kafka Streams
- Embedded library — runs inside your app
- Kafka-only source and sink
- No separate cluster to manage
- Simpler ops for Kafka-native pipelines
Flink vs Faust (Python)
Apache Flink (PyFlink)
- Production-grade at petabyte scale
- Full window + exactly-once semantics
- Kubernetes Operator for deployment
- Higher learning curve
Faust (Python)
- Pure Python, runs as async app
- Kafka-only, lighter weight
- Easier to integrate with Python ML stack
- Limited production adoption
| Feature | Flink | Spark Streaming | Kafka Streams |
|---|---|---|---|
| Streaming model | True event-by-event | Micro-batch | True event-by-event |
| Latency | Sub-second | 1–30 seconds | Sub-second |
| Stateful processing | ✓ rich state APIs | ✓ limited | ✓ KTable |
| Event-time / watermarks | ✓ native | ✓ with limits | ✓ native |
| Exactly-once | ✓ distributed snapshots | ✓ micro-batch | ✓ via Kafka txn |
| Non-Kafka sources | ✓ Kinesis, CDC, files | ✓ many | ✗ Kafka only |
| SQL support | ✓ Flink SQL | ✓ Spark SQL | ✗ basic |
| Cluster required | ✓ yes | ✓ yes | ✗ embedded library |
Common Mistakes
Using processing time instead of event time
Processing time uses the wall clock when Flink receives the event — not when it actually happened. For any use case where event ordering matters (fraud, session analysis, joins), always use event time with watermarks.
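To make the distinction concrete, here is a pure-Python sketch (not the Flink API) of the bounded-out-of-orderness watermark rule: the watermark trails the maximum event time seen by a fixed bound, and events at or behind the watermark are treated as late:

```python
# Pure-Python sketch of a bounded-out-of-orderness watermark (not Flink API).
class BoundedOutOfOrdernessWatermark:
    def __init__(self, max_out_of_orderness: int):
        self.bound = max_out_of_orderness
        self.max_ts = None                       # highest event time seen so far

    def on_event(self, event_ts: int) -> bool:
        """Record an event; return True if it arrives behind the watermark (late)."""
        late = self.max_ts is not None and event_ts <= self.watermark()
        if self.max_ts is None or event_ts > self.max_ts:
            self.max_ts = event_ts
        return late

    def watermark(self) -> int:
        return (self.max_ts or 0) - self.bound   # watermark trails max event time

wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness=5)
wm.on_event(100)                 # watermark advances to 95
print(wm.on_event(97))           # → False: within the 5-unit lateness bound
print(wm.on_event(94))           # → True: behind the watermark, counted late
```

With processing time there is no such bound: whatever order events arrive in is the order they are counted, so the out-of-order event at 94 would silently land in the wrong window.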
Forgetting to set a state TTL
Flink state grows unbounded unless you configure a Time-To-Live (TTL). Accumulating state for inactive keys will eventually exhaust memory or RocksDB disk space. For long-running jobs, always build a StateTtlConfig and enable it on each state descriptor via enableTimeToLive().
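In PyFlink, a TTL is attached to a state descriptor through StateTtlConfig. A minimal sketch (the descriptor name and 7-day TTL are illustrative, not a recommendation):

```python
from pyflink.common import Time
from pyflink.common.typeinfo import Types
from pyflink.datastream.state import StateTtlConfig, ValueStateDescriptor

# Expire entries 7 days after they were created or last written
ttl_config = (StateTtlConfig
    .new_builder(Time.days(7))
    .set_update_type(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .set_state_visibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .build())

# Attach the TTL to the descriptor used inside a keyed process function
descriptor = ValueStateDescriptor("tx_count", Types.LONG())
descriptor.enable_time_to_live(ttl_config)
```

Expired entries are then cleaned up lazily on access and during background cleanup, so inactive keys no longer accumulate forever.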
Setting checkpoint interval too short
Frequent checkpoints reduce recovery time but add constant overhead. A 30-second checkpoint interval is a good default. Too short (1 second) means checkpointing never finishes before the next one starts.
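A reasonable starting configuration in PyFlink looks like the sketch below (intervals are the illustrative defaults discussed above, not universal values):

```python
from pyflink.datastream import CheckpointingMode, StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Checkpoint every 30 s with exactly-once semantics
env.enable_checkpointing(30_000, CheckpointingMode.EXACTLY_ONCE)

# Force a pause between checkpoints so they can never overlap
env.get_checkpoint_config().set_min_pause_between_checkpoints(10_000)

# Abort a checkpoint that takes longer than 2 minutes
env.get_checkpoint_config().set_checkpoint_timeout(120_000)
```

The minimum pause is the key guard: even if a checkpoint runs long, the next one cannot start until the pipeline has had breathing room.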
Ignoring operator chaining and parallelism
Flink chains compatible operators into one task by default (good). But if you set parallelism=1 for the entire job, you lose Flink's horizontal scaling. Set per-operator parallelism intentionally based on throughput needs.
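Per-operator parallelism in PyFlink is a one-liner on each operator. A hypothetical fragment (events, parse_event, EnrichFunction, and kafka_sink are assumed to be defined elsewhere in the job):

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)  # job-wide default

# events, parse_event, EnrichFunction, and kafka_sink are assumed
# to be defined elsewhere in the job
(events
    .map(parse_event)                                # inherits parallelism 4
    .key_by(lambda e: e.user_id)
    .process(EnrichFunction()).set_parallelism(8)    # scale up the hot operator
    .sink_to(kafka_sink).set_parallelism(2))         # fewer concurrent writers
```

Operators with matching parallelism and compatible shuffles stay chained in one task; bumping a single hot operator breaks only that chain, not the whole pipeline.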
Who Should Learn Flink?
Junior DE
You know Python and SQL and want to add stream processing to your skill set. Start with Flink SQL before diving into the DataStream API.
Senior DE
You build batch pipelines and need sub-second latency for fraud, anomaly detection, or real-time metrics. Flink replaces your cron jobs.
Staff / Architect
You're choosing a streaming platform for your organization. Flink's stateful processing and exactly-once guarantees make it the enterprise default.
Related Concepts
FAQs
- What is Apache Flink?
- Apache Flink is an open-source distributed stream processing engine designed for stateful, real-time data pipelines. It processes unbounded (streaming) and bounded (batch) data with exactly-once guarantees, event-time processing, and millisecond latency. Flink is used by companies like Alibaba, Netflix, Uber, and LinkedIn for fraud detection, real-time analytics, and event-driven applications.
- What is the difference between Flink and Spark?
- Flink is a true streaming engine — it processes each event as it arrives (low latency, sub-second). Spark Structured Streaming is micro-batch — it buffers events into small batches (higher latency, simpler ops). Flink wins on latency and stateful processing. Spark wins on batch workloads, SQL analytics, and ecosystem maturity. Choose Flink when you need sub-second latency or complex event-time joins; choose Spark when batch and streaming share the same codebase.
- What is stateful stream processing in Flink?
- Stateful processing means Flink can remember information across events — not just transform one event at a time. For example, counting transactions per customer in the last 5 minutes requires state (the running count). Flink stores this state in memory (HashMapStateBackend) or RocksDB for large state, checkpoints it to durable storage (S3/HDFS), and restores it automatically on failure — providing exactly-once guarantees.
- What are Flink windows?
- Windows divide an infinite stream into finite buckets for aggregation. Tumbling windows: non-overlapping fixed intervals (e.g., count fraud per 1-minute window). Sliding windows: overlapping intervals (e.g., 5-min window every 1 min). Session windows: activity-based gaps (e.g., user session ends after 30 min of inactivity). Flink supports both processing-time and event-time windows, with watermarks to handle late-arriving data.
- What is exactly-once processing in Flink?
- Exactly-once means each event affects application state exactly once, even if the job restarts. Flink achieves this via distributed snapshots (Chandy-Lamport algorithm): it periodically checkpoints state to durable storage and, on failure, restores from the last checkpoint. For end-to-end exactly-once (source to sink), Flink uses two-phase commit with Kafka transactional producers and idempotent sinks.
What You'll Build with AI-DE
The Flink Fraud Detection project takes you from a local Flink + Kafka setup to a production fraud detection system deployed on Kubernetes — exactly-once guarantees included.
- DataStream API pipeline with keyed state and tumbling windows
- Watermarks and event-time fraud velocity rules
- Kafka transactional sink for exactly-once delivery
- RocksDB state backend with incremental checkpointing to S3
- Flink Kubernetes Operator deployment with Prometheus monitoring