
What is Apache Flink?

The open-source distributed stream processing engine for stateful, real-time data pipelines — with exactly-once guarantees and millisecond latency.

Quick Answer

Apache Flink is a distributed stream processing engine that processes real-time data with stateful computations, event-time semantics, and exactly-once guarantees. Unlike batch systems, Flink processes each event as it arrives — enabling sub-second latency for fraud detection, real-time analytics, and event-driven pipelines. It stores and checkpoints state (counts, aggregations, joins) to durable storage, so jobs recover automatically from failures without data loss.

What is Apache Flink?

Apache Flink was created at TU Berlin in 2010 (then called Stratosphere) and donated to the Apache Software Foundation in 2014. Today it's maintained by companies including Alibaba, Ververica, Amazon, and Google, and is the dominant engine for stateful stream processing at scale.

Flink processes two types of data: unbounded streams (real-time: Kafka topics, CDC feeds, clickstreams) and bounded datasets (batch: files, database tables). The same API handles both, making it a unified processing engine.

DataStream API

  • Low-level stream transformations
  • Full control over state and time
  • map, filter, keyBy, process, window
  • Best for complex event processing

Table API / Flink SQL

  • SQL-first declarative queries on streams
  • Continuous queries over Kafka topics
  • TUMBLE, HOP, SESSION window functions
  • Best for analytics and BI use cases

Why Flink Matters

Before Flink

  • Fraud detected minutes after the transaction
  • Batch jobs run hourly — dashboards always stale
  • Complex stateful logic required custom databases
  • Job failures meant reprocessing hours of data
  • Event-time ordering impossible without massive buffers

With Flink

  • Fraud rules fire within milliseconds of the event
  • Real-time dashboards updated continuously
  • Built-in keyed state replaces external state stores
  • Checkpoints restore jobs to exact pre-failure position
  • Watermarks handle late data without custom logic

What You Can Build with Flink

Fraud Detection

Score transactions in real time using velocity rules, behavioral windows, and ML model inference.

Real-Time Analytics

Count, aggregate, and join event streams for live dashboards and product metrics.

CDC Pipelines

Capture database changes from Postgres/MySQL and propagate to downstream sinks instantly.

Event-Driven Microservices

Trigger downstream actions (notifications, inventory updates) from Kafka events.

Stream-Batch Unification

Run the same Flink SQL query on live Kafka topics and historical S3 data.

Anomaly Detection

Detect metric spikes, SLO violations, and IoT sensor anomalies with sliding windows.

How Flink Works

A Flink job has three logical layers: a JobManager that coordinates execution and manages checkpoints, TaskManagers that run the parallel pipeline operators, and a State Backend (in-memory or RocksDB) that stores keyed state between events.
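
The recovery story above can be sketched as a toy simulation in plain Python — a deliberate simplification, not the Flink API, assuming a single operator with synchronous snapshots:

```python
# Toy simulation of Flink-style keyed state with checkpoint/restore.
# Real Flink snapshots state asynchronously via the JobManager
# to a durable store (S3/HDFS); this just shows the idea.
import copy

class KeyedCounter:
    def __init__(self):
        self.state = {}          # keyed state: customer_id -> count
        self.checkpoint = {}     # last completed snapshot

    def process(self, customer_id):
        self.state[customer_id] = self.state.get(customer_id, 0) + 1

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)

    def recover(self):
        # On failure, state rewinds to the last completed checkpoint;
        # the source replays events from that point (exactly-once effect).
        self.state = copy.deepcopy(self.checkpoint)

job = KeyedCounter()
for cid in ["a", "b", "a"]:
    job.process(cid)
job.take_checkpoint()          # snapshot: {"a": 2, "b": 1}
job.process("a")               # progress not yet checkpointed
job.recover()                  # crash: rewind to the snapshot
print(job.state["a"])          # 2 — the uncheckpointed event gets replayed, not lost
```

The key point: recovery restores state *and* the source offset together, so replayed events update state exactly once.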

SOURCE (Kafka / Kinesis / File) → TRANSFORM (map / filter / keyBy) → WINDOW (tumbling / sliding / session) → SINK (Kafka / Postgres / ES)

DataStream job: count transactions per customer in a 1-minute window

# Python DataStream API (PyFlink)
from pyflink.common import Duration, Time, WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingEventTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()

# Read from Kafka (kafka_source: a KafkaSource configured elsewhere)
transactions = env.from_source(
    source=kafka_source,
    watermark_strategy=WatermarkStrategy
        .for_bounded_out_of_orderness(Duration.of_seconds(5)),
    source_name="transactions")

# Key by customer → tumbling 1-min window → sum amount
# (SumAggregateFunction: a user-defined AggregateFunction)
result = (
    transactions
    .key_by(lambda t: t.customer_id)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(SumAggregateFunction())
)

result.sink_to(kafka_sink)  # kafka_sink: a KafkaSink configured elsewhere
env.execute("tx-volume-per-customer")
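
The SumAggregateFunction above is left undefined; here is a minimal sketch of the contract it would implement, written as plain Python mirroring the create_accumulator / add / get_result / merge shape of PyFlink's AggregateFunction (the record field `.amount` is an assumption about the event schema):

```python
# Sketch of the aggregate contract used above. In a real job this class
# would subclass pyflink.datastream.functions.AggregateFunction.
class SumAggregateFunction:
    def create_accumulator(self):
        return 0.0                       # running sum starts at zero

    def add(self, value, accumulator):
        # value is assumed to expose .amount (hypothetical record shape)
        return accumulator + value.amount

    def get_result(self, accumulator):
        return accumulator               # emitted when the window fires

    def merge(self, acc_a, acc_b):
        return acc_a + acc_b             # combine partial window results

# Quick local check with a stand-in record type
from collections import namedtuple
Tx = namedtuple("Tx", ["customer_id", "amount"])
agg = SumAggregateFunction()
acc = agg.create_accumulator()
for tx in [Tx("a", 10.0), Tx("a", 2.5)]:
    acc = agg.add(tx, acc)
print(agg.get_result(acc))  # 12.5
```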

Flink SQL: continuous query over a Kafka topic

-- Create a Kafka-backed streaming table
CREATE TABLE transactions (
    customer_id  STRING,
    amount       DECIMAL(10,2),
    ts           TIMESTAMP(3),
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
)
WITH (
    'connector' = 'kafka'...
);

-- Continuous 1-minute tumbling window aggregation
SELECT
    customer_id,
    TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
    SUM(amount)  AS total_amount,
    COUNT(*)     AS tx_count
FROM transactions
GROUP BY
    customer_id,
    TUMBLE(ts, INTERVAL '1' MINUTE);

Flink vs Other Tools

Flink vs Apache Spark

Apache Flink

  • True per-event streaming (sub-second latency)
  • Native event-time + watermarks
  • Rich stateful APIs (ValueState, MapState)
  • Exactly-once via distributed snapshots

Apache Spark

  • Micro-batch streaming (seconds latency)
  • Larger ML/analytics ecosystem
  • Better SQL and DataFrame support
  • Easier to operate and debug

Verdict: Flink for sub-second streaming and complex state. Spark for batch-heavy workloads, SQL analytics, or when team familiarity matters.

Flink vs Kafka Streams

Apache Flink

  • Distributed cluster — scales to any throughput
  • Multi-source: Kafka, Kinesis, CDC, files
  • Flink SQL for declarative queries
  • Rich window semantics + watermarks

Kafka Streams

  • Embedded library — runs inside your app
  • Kafka-only source and sink
  • No separate cluster to manage
  • Simpler ops for Kafka-native pipelines

Verdict: Kafka Streams for lightweight Kafka-only microservices. Flink for complex topologies, non-Kafka sources, or when SQL and operational tooling matter.

Flink vs Faust (Python)

Apache Flink (PyFlink)

  • Production-grade at petabyte scale
  • Full window + exactly-once semantics
  • Kubernetes Operator for deployment
  • Higher learning curve

Faust (Python)

  • Pure Python, runs as async app
  • Kafka-only, lighter weight
  • Easier to integrate with Python ML stack
  • Limited production adoption

Verdict: Faust for fast Python prototyping. Flink (PyFlink or Java) for production workloads that need reliability guarantees and scale.

| Feature | Flink | Spark Streaming | Kafka Streams |
|---|---|---|---|
| Streaming model | True event-by-event | Micro-batch | True event-by-event |
| Latency | Sub-second | 1–30 seconds | Sub-second |
| Stateful processing | ✓ rich state APIs | ✓ limited | ✓ KTable |
| Event-time / watermarks | ✓ native | ✓ with limits | ✓ native |
| Exactly-once | ✓ distributed snapshots | ✓ micro-batch | ✓ via Kafka txn |
| Non-Kafka sources | ✓ Kinesis, CDC, files | ✓ many | ✗ Kafka only |
| SQL support | ✓ Flink SQL | ✓ Spark SQL | ✗ basic |
| Cluster required | ✓ yes | ✓ yes | ✗ embedded library |

Common Mistakes

Using processing time instead of event time

Processing time uses the wall clock when Flink receives the event — not when it actually happened. For any use case where event ordering matters (fraud, session analysis, joins), always use event time with watermarks.
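
The bounded-out-of-orderness strategy used in the PyFlink example above can be sketched in a few lines of plain Python — a toy model of the idea, not Flink's implementation:

```python
# Toy bounded-out-of-orderness watermark: the watermark trails the highest
# event timestamp seen so far by a fixed bound; events that arrive behind
# the watermark are considered "late".
class BoundedOutOfOrderness:
    def __init__(self, bound_ms):
        self.bound_ms = bound_ms
        self.max_ts = float("-inf")

    def on_event(self, event_ts_ms):
        self.max_ts = max(self.max_ts, event_ts_ms)

    def current_watermark(self):
        return self.max_ts - self.bound_ms

    def is_late(self, event_ts_ms):
        return event_ts_ms < self.current_watermark()

wm = BoundedOutOfOrderness(bound_ms=5_000)
wm.on_event(10_000)
wm.on_event(12_000)
print(wm.current_watermark())   # 7000 — trails the max timestamp by 5s
print(wm.is_late(6_000))        # True: more than 5s out of order
print(wm.is_late(8_000))        # False: within the 5s bound
```

When the watermark passes a window's end, Flink knows the window is (probably) complete and fires it.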

Forgetting to set a state TTL

Flink state grows unbounded unless you configure a Time-To-Live (TTL). Accumulating state for inactive keys will eventually exhaust memory or RocksDB disk space. Always attach a StateTtlConfig to your state descriptors in long-running jobs.
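
What a TTL buys you can be sketched with a plain dict — illustrative only; in a real job you would attach a StateTtlConfig to the state descriptor rather than roll your own expiry:

```python
import time

class TtlState:
    """Toy keyed state that expires entries after ttl_s seconds without a write."""
    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.entries = {}   # key -> (value, last_write_time)

    def put(self, key, value):
        self.entries[key] = (value, self.clock())

    def get(self, key):
        item = self.entries.get(key)
        if item is None:
            return None
        value, written = item
        if self.clock() - written > self.ttl_s:
            del self.entries[key]      # expired: reclaim the state
            return None
        return value

# Deterministic demo with a fake clock
now = [0.0]
state = TtlState(ttl_s=60, clock=lambda: now[0])
state.put("inactive-customer", 42)
now[0] = 61.0                          # a minute passes with no activity
print(state.get("inactive-customer"))  # None — state was reclaimed
```

Without the expiry check, every customer that ever appeared would occupy state forever.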

Setting checkpoint interval too short

Frequent checkpoints reduce recovery time but add constant overhead. A 30-second checkpoint interval is a good default. Too short (1 second) means checkpointing never finishes before the next one starts.

Ignoring operator chaining and parallelism

Flink chains compatible operators into one task by default (good). But if you set parallelism=1 for the entire job, you lose Flink's horizontal scaling. Set per-operator parallelism intentionally based on throughput needs.
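
Why keyed parallelism scales cleanly can be sketched with the key-to-instance routing idea — a simplification of Flink's key groups:

```python
# Toy key routing: every event for a given key lands on the same parallel
# instance, so per-key state never needs cross-instance coordination.
def route(key, parallelism):
    return hash(key) % parallelism

parallelism = 4
a1 = route("customer-42", parallelism)
a2 = route("customer-42", parallelism)
print(a1 == a2)  # True — a key is always handled by the same instance
```

Raising per-operator parallelism spreads keys across more instances; an operator pinned at parallelism=1 funnels every key through one task.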

Who Should Learn Flink?

Junior DE

You know Python and SQL and want to add stream processing to your skill set. Start with Flink SQL before diving into the DataStream API.

Senior DE

You build batch pipelines and need sub-second latency for fraud, anomaly detection, or real-time metrics. Flink replaces your cron jobs.

Staff / Architect

You're choosing a streaming platform for your organization. Flink's stateful processing and exactly-once guarantees make it the enterprise default.


FAQs

What is Apache Flink?
Apache Flink is an open-source distributed stream processing engine designed for stateful, real-time data pipelines. It processes unbounded (streaming) and bounded (batch) data with exactly-once guarantees, event-time processing, and millisecond latency. Flink is used by companies like Alibaba, Netflix, Uber, and LinkedIn for fraud detection, real-time analytics, and event-driven applications.
What is the difference between Flink and Spark?
Flink is a true streaming engine — it processes each event as it arrives (low latency, sub-second). Spark Structured Streaming is micro-batch — it buffers events into small batches (higher latency, simpler ops). Flink wins on latency and stateful processing. Spark wins on batch workloads, SQL analytics, and ecosystem maturity. Choose Flink when you need sub-second latency or complex event-time joins; choose Spark when batch and streaming share the same codebase.
What is stateful stream processing in Flink?
Stateful processing means Flink can remember information across events — not just transform one event at a time. For example, counting transactions per customer in the last 5 minutes requires state (the running count). Flink stores this state in memory (HashMapStateBackend) or RocksDB for large state, checkpoints it to durable storage (S3/HDFS), and restores it automatically on failure — providing exactly-once guarantees.
What are Flink windows?
Windows divide an infinite stream into finite buckets for aggregation. Tumbling windows: non-overlapping fixed intervals (e.g., count fraud per 1-minute window). Sliding windows: overlapping intervals (e.g., 5-min window every 1 min). Session windows: activity-based gaps (e.g., user session ends after 30 min of inactivity). Flink supports both processing-time and event-time windows, with watermarks to handle late-arriving data.
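
Window assignment is just timestamp arithmetic; a toy sketch of how tumbling and sliding buckets are computed (not Flink's internal code):

```python
def tumbling_window(ts_ms, size_ms):
    # Each timestamp falls into exactly one non-overlapping bucket.
    start = ts_ms - (ts_ms % size_ms)
    return (start, start + size_ms)

def sliding_windows(ts_ms, size_ms, slide_ms):
    # Overlapping buckets: each timestamp belongs to size/slide windows.
    windows = []
    start = ts_ms - (ts_ms % slide_ms)
    while start > ts_ms - size_ms:
        windows.append((start, start + size_ms))
        start -= slide_ms
    return windows

print(tumbling_window(90_000, 60_000))                # (60000, 120000): the second minute
print(len(sliding_windows(90_000, 300_000, 60_000)))  # 5: a 5-min window sliding every minute
```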
What is exactly-once processing in Flink?
Exactly-once means each event affects application state exactly once, even if the job restarts. Flink achieves this via distributed snapshots (Chandy-Lamport algorithm): it periodically checkpoints state to durable storage and, on failure, restores from the last checkpoint. For end-to-end exactly-once (source to sink), Flink uses two-phase commit with Kafka transactional producers and idempotent sinks.

What You'll Build with AI-DE

The Flink Fraud Detection project takes you from a local Flink + Kafka setup to a production fraud detection system deployed on Kubernetes — exactly-once guarantees included.

  • DataStream API pipeline with keyed state and tumbling windows
  • Watermarks and event-time fraud velocity rules
  • Kafka transactional sink for exactly-once delivery
  • RocksDB state backend with incremental checkpointing to S3
  • Flink Kubernetes Operator deployment with Prometheus monitoring