What is Apache Flink?
The open-source distributed stream processing engine for stateful, real-time data pipelines — with exactly-once guarantees and millisecond latency.
Quick Answer
Apache Flink is a distributed stream processing engine that processes real-time data with stateful computations, event-time semantics, and exactly-once guarantees. Unlike batch systems, Flink processes each event as it arrives — enabling sub-second latency for fraud detection, real-time analytics, and event-driven pipelines. It stores and checkpoints state (counts, aggregations, joins) to durable storage, so jobs recover automatically from failures without data loss.
What is Apache Flink?
Apache Flink was created at TU Berlin in 2010 (then called Stratosphere) and donated to the Apache Software Foundation in 2014. Today it's maintained by companies including Alibaba, Ververica, Amazon, and Google, and is the dominant engine for stateful stream processing at scale.
Flink processes two types of data: unbounded streams (real-time: Kafka topics, CDC feeds, clickstreams) and bounded datasets (batch: files, database tables). The same API handles both, making it a unified processing engine.
DataStream API
- Low-level stream transformations
- Full control over state and time
- map, filter, keyBy, process, window
- Best for complex event processing
Table API / Flink SQL
- SQL-first declarative queries on streams
- Continuous queries over Kafka topics
- TUMBLE, HOP, SESSION window functions
- Best for analytics and BI use cases
Why Flink Matters
Before Flink
- Fraud detected minutes after the transaction
- Batch jobs run hourly — dashboards always stale
- Complex stateful logic required custom databases
- Job failures meant reprocessing hours of data
- Event-time ordering impossible without massive buffers
With Flink
- Fraud rules fire within milliseconds of the event
- Real-time dashboards updated continuously
- Built-in keyed state replaces external state stores
- Checkpoints restore jobs to exact pre-failure position
- Watermarks handle late data without custom logic
What You Can Build with Flink
Fraud Detection
Score transactions in real time using velocity rules, behavioral windows, and ML model inference.
Real-Time Analytics
Count, aggregate, and join event streams for live dashboards and product metrics.
CDC Pipelines
Capture database changes from Postgres/MySQL and propagate to downstream sinks instantly.
Event-Driven Microservices
Trigger downstream actions (notifications, inventory updates) from Kafka events.
Stream-Batch Unification
Run the same Flink SQL query on live Kafka topics and historical S3 data.
Anomaly Detection
Detect metric spikes, SLO violations, and IoT sensor anomalies with sliding windows.
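The sliding windows used for anomaly detection can be reasoned about with simple arithmetic: an event belongs to every window whose range covers its timestamp. Here is a pure-Python sketch (not PyFlink code) of that assignment logic, mirroring what Flink's SlidingEventTimeWindows computes internally:

```python
# Pure-Python sketch of sliding-window assignment (not the Flink API).
# Windows of length `size` start every `slide` time units; an event
# belongs to every window whose [start, end) range covers its timestamp.

def assign_sliding_windows(ts: int, size: int, slide: int) -> list[tuple[int, int]]:
    """Return the [start, end) windows containing timestamp `ts`."""
    windows = []
    start = ts - (ts % slide)        # latest window start that can contain ts
    while start > ts - size:         # walk back until windows no longer cover ts
        windows.append((start, start + size))
        start -= slide
    return windows

# A 5-minute window sliding every 1 minute: each event lands in 5 windows.
print(assign_sliding_windows(ts=330, size=300, slide=60))
# → [(300, 600), (240, 540), (180, 480), (120, 420), (60, 360)]
```

When `size == slide` this degenerates to a tumbling window: each event falls in exactly one bucket.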
How Flink Works
A Flink job has three logical layers: a JobManager that coordinates execution and manages checkpoints, TaskManagers that run the parallel pipeline operators, and a State Backend (in-memory or RocksDB) that stores keyed state between events.
SOURCE (Kafka / Kinesis / File) → TRANSFORM (map / filter / keyBy) → WINDOW (tumbling / sliding / session) → SINK (Kafka / Postgres / ES)
DataStream job: count transactions per customer in a 1-minute window
# Python DataStream API (PyFlink)
from pyflink.common import Duration, WatermarkStrategy
from pyflink.common.time import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingEventTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()

# Read from Kafka (kafka_source, kafka_sink, and SumAggregateFunction
# are defined elsewhere in the job)
transactions = env.from_source(
    source=kafka_source,
    watermark_strategy=WatermarkStrategy
        .for_bounded_out_of_orderness(Duration.of_seconds(5)),
    source_name="transactions")

# Key by customer → tumbling 1-min window → sum amount
result = (
    transactions
    .key_by(lambda t: t.customer_id)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(SumAggregateFunction())
)

result.sink_to(kafka_sink)
env.execute("tx-volume-per-customer")
Flink SQL: continuous query over a Kafka topic
-- Create a Kafka-backed streaming table
CREATE TABLE transactions (
  customer_id STRING,
  amount DECIMAL(10, 2),
  ts TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka'
  ...
);

-- Continuous 1-minute tumbling window aggregation
SELECT customer_id,
       TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
       SUM(amount) AS total_amount,
       COUNT(*) AS tx_count
FROM transactions
GROUP BY customer_id,
         TUMBLE(ts, INTERVAL '1' MINUTE);
Flink vs Other Tools
Flink vs Apache Spark
Apache Flink
- True per-event streaming (sub-second latency)
- Native event-time + watermarks
- Rich stateful APIs (ValueState, MapState)
- Exactly-once via distributed snapshots
Apache Spark
- Micro-batch streaming (seconds latency)
- Larger ML/analytics ecosystem
- Better SQL and DataFrame support
- Easier to operate and debug
Flink vs Kafka Streams
Apache Flink
- Distributed cluster — scales to any throughput
- Multi-source: Kafka, Kinesis, CDC, files
- Flink SQL for declarative queries
- Rich window semantics + watermarks
Kafka Streams
- Embedded library — runs inside your app
- Kafka-only source and sink
- No separate cluster to manage
- Simpler ops for Kafka-native pipelines
Flink vs Faust (Python)
Apache Flink (PyFlink)
- Production-grade at petabyte scale
- Full window + exactly-once semantics
- Kubernetes Operator for deployment
- Higher learning curve
Faust (Python)
- Pure Python, runs as async app
- Kafka-only, lighter weight
- Easier to integrate with Python ML stack
- Limited production adoption
| Feature | Flink | Spark Streaming | Kafka Streams |
|---|---|---|---|
| Streaming model | True event-by-event | Micro-batch | True event-by-event |
| Latency | Sub-second | 1–30 seconds | Sub-second |
| Stateful processing | ✓ rich state APIs | ✓ limited | ✓ KTable |
| Event-time / watermarks | ✓ native | ✓ with limits | ✓ native |
| Exactly-once | ✓ distributed snapshots | ✓ micro-batch | ✓ via Kafka txn |
| Non-Kafka sources | ✓ Kinesis, CDC, files | ✓ many | ✗ Kafka only |
| SQL support | ✓ Flink SQL | ✓ Spark SQL | ✗ basic |
| Cluster required | ✓ yes | ✓ yes | ✗ embedded library |
Common Mistakes
Using processing time instead of event time
Processing time uses the wall clock when Flink receives the event — not when it actually happened. For any use case where event ordering matters (fraud, session analysis, joins), always use event time with watermarks.
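To make the distinction concrete, here is a pure-Python sketch (not the Flink API) of the bounded-out-of-orderness watermark rule: the watermark trails the maximum event time seen by a fixed bound, and events at or behind the watermark are treated as late:

```python
# Pure-Python sketch of a bounded-out-of-orderness watermark (not Flink API).
class BoundedOutOfOrdernessWatermark:
    def __init__(self, max_out_of_orderness: int):
        self.bound = max_out_of_orderness
        self.max_ts = None                       # highest event time seen so far

    def on_event(self, event_ts: int) -> bool:
        """Record an event; return True if it arrives behind the watermark (late)."""
        late = self.max_ts is not None and event_ts <= self.watermark()
        if self.max_ts is None or event_ts > self.max_ts:
            self.max_ts = event_ts
        return late

    def watermark(self) -> int:
        return (self.max_ts or 0) - self.bound   # watermark trails max event time

wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness=5)
wm.on_event(100)                 # watermark advances to 95
print(wm.on_event(97))           # → False: within the 5-unit lateness bound
print(wm.on_event(94))           # → True: behind the watermark, counted late
```

With processing time there is no such bound: whatever order events arrive in is the order they are counted, so the out-of-order event at 94 would silently land in the wrong window.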
Forgetting to set a state TTL
Flink state grows unbounded unless you configure a Time-To-Live (TTL). Accumulating state for inactive keys will eventually exhaust memory or RocksDB disk space. For long-running jobs, always build a StateTtlConfig and enable it on each state descriptor via enableTimeToLive().
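In PyFlink, a TTL is attached to a state descriptor through StateTtlConfig. A minimal sketch (the descriptor name and 7-day TTL are illustrative, not a recommendation):

```python
from pyflink.common import Time
from pyflink.common.typeinfo import Types
from pyflink.datastream.state import StateTtlConfig, ValueStateDescriptor

# Expire entries 7 days after they were created or last written
ttl_config = (StateTtlConfig
    .new_builder(Time.days(7))
    .set_update_type(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .set_state_visibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .build())

# Attach the TTL to the descriptor used inside a keyed process function
descriptor = ValueStateDescriptor("tx_count", Types.LONG())
descriptor.enable_time_to_live(ttl_config)
```

Expired entries are then cleaned up lazily on access and during background cleanup, so inactive keys no longer accumulate forever.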
Setting checkpoint interval too short
Frequent checkpoints reduce recovery time but add constant overhead. A 30-second checkpoint interval is a good default. Too short (1 second) means checkpointing never finishes before the next one starts.
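A reasonable starting configuration in PyFlink looks like the sketch below (intervals are the illustrative defaults discussed above, not universal values):

```python
from pyflink.datastream import CheckpointingMode, StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Checkpoint every 30 s with exactly-once semantics
env.enable_checkpointing(30_000, CheckpointingMode.EXACTLY_ONCE)

# Force a pause between checkpoints so they can never overlap
env.get_checkpoint_config().set_min_pause_between_checkpoints(10_000)

# Abort a checkpoint that takes longer than 2 minutes
env.get_checkpoint_config().set_checkpoint_timeout(120_000)
```

The minimum pause is the key guard: even if a checkpoint runs long, the next one cannot start until the pipeline has had breathing room.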
Ignoring operator chaining and parallelism
Flink chains compatible operators into one task by default (good). But if you set parallelism=1 for the entire job, you lose Flink's horizontal scaling. Set per-operator parallelism intentionally based on throughput needs.
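Per-operator parallelism in PyFlink is a one-liner on each operator. A hypothetical fragment (events, parse_event, EnrichFunction, and kafka_sink are assumed to be defined elsewhere in the job):

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)  # job-wide default

# events, parse_event, EnrichFunction, and kafka_sink are assumed
# to be defined elsewhere in the job
(events
    .map(parse_event)                                # inherits parallelism 4
    .key_by(lambda e: e.user_id)
    .process(EnrichFunction()).set_parallelism(8)    # scale up the hot operator
    .sink_to(kafka_sink).set_parallelism(2))         # fewer concurrent writers
```

Operators with matching parallelism and compatible shuffles stay chained in one task; bumping a single hot operator breaks only that chain, not the whole pipeline.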
Who Should Learn Flink?
Junior DE
You know Python and SQL and want to add stream processing to your skill set. Start with Flink SQL before diving into the DataStream API.
Senior DE
You build batch pipelines and need sub-second latency for fraud, anomaly detection, or real-time metrics. Flink replaces your cron jobs.
Staff / Architect
You're choosing a streaming platform for your organization. Flink's stateful processing and exactly-once guarantees make it the enterprise default.
Related Concepts
FAQs
- What is Apache Flink?
- Apache Flink is an open-source distributed stream processing engine designed for stateful, real-time data pipelines. It processes unbounded (streaming) and bounded (batch) data with exactly-once guarantees, event-time processing, and millisecond latency. Flink is used by companies like Alibaba, Netflix, Uber, and LinkedIn for fraud detection, real-time analytics, and event-driven applications.
- What is the difference between Flink and Spark?
- Flink is a true streaming engine — it processes each event as it arrives (low latency, sub-second). Spark Structured Streaming is micro-batch — it buffers events into small batches (higher latency, simpler ops). Flink wins on latency and stateful processing. Spark wins on batch workloads, SQL analytics, and ecosystem maturity. Choose Flink when you need sub-second latency or complex event-time joins; choose Spark when batch and streaming share the same codebase.
- What is stateful stream processing in Flink?
- Stateful processing means Flink can remember information across events — not just transform one event at a time. For example, counting transactions per customer in the last 5 minutes requires state (the running count). Flink stores this state in memory (HashMapStateBackend) or RocksDB for large state, checkpoints it to durable storage (S3/HDFS), and restores it automatically on failure — providing exactly-once guarantees.
- What are Flink windows?
- Windows divide an infinite stream into finite buckets for aggregation. Tumbling windows: non-overlapping fixed intervals (e.g., count fraud per 1-minute window). Sliding windows: overlapping intervals (e.g., 5-min window every 1 min). Session windows: activity-based gaps (e.g., user session ends after 30 min of inactivity). Flink supports both processing-time and event-time windows, with watermarks to handle late-arriving data.
- What is exactly-once processing in Flink?
- Exactly-once means each event affects application state exactly once, even if the job restarts. Flink achieves this via distributed snapshots (Chandy-Lamport algorithm): it periodically checkpoints state to durable storage and, on failure, restores from the last checkpoint. For end-to-end exactly-once (source to sink), Flink uses two-phase commit with Kafka transactional producers and idempotent sinks.
What You'll Build with AI-DE
The Flink Fraud Detection project takes you from a local Flink + Kafka setup to a production fraud detection system deployed on Kubernetes — exactly-once guarantees included.
- DataStream API pipeline with keyed state and tumbling windows
- Watermarks and event-time fraud velocity rules
- Kafka transactional sink for exactly-once delivery
- RocksDB state backend with incremental checkpointing to S3
- Flink Kubernetes Operator deployment with Prometheus monitoring