
What is Apache Kafka?

The distributed event streaming platform that powers real-time data pipelines at LinkedIn, Netflix, Uber, and thousands of companies — ingesting millions of events per second with fault-tolerant, replay-able storage.

Quick Answer

Apache Kafka is a distributed event streaming platform. Producers publish events to topics; consumers subscribe and read them independently. Kafka stores events in a durable, ordered log — consumers can replay historical events at any time. Built for millions of events per second with horizontal scalability and sub-second latency.

What is Apache Kafka?

Apache Kafka was created at LinkedIn to handle their activity stream — tracking every click, view, and interaction across the platform. It was open-sourced in 2011 and became an Apache top-level project in 2012. Today it processes trillions of events daily across the world's largest data platforms.

Unlike traditional message queues that delete messages after delivery, Kafka retains all events in an append-only log. This makes it both a messaging system and a storage system — consumers can read live events or replay historical ones from any point in time.
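The append-only log model can be sketched in a few lines of plain Python. This is an illustrative toy, not Kafka's actual storage engine (real Kafka persists segment files on disk across replicated brokers), but it shows why reads never destroy data and why any consumer can replay from any offset:

```python
from dataclasses import dataclass, field

@dataclass
class TopicLog:
    """Toy model of one Kafka partition: an append-only, offset-addressed log."""
    events: list = field(default_factory=list)

    def append(self, event) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset assigned to the new event

    def read_from(self, offset: int):
        # Reading never deletes anything: consumers can replay from any offset.
        return self.events[offset:]

log = TopicLog()
log.append({"action": "view"})
log.append({"action": "purchase"})

print(log.read_from(0))  # full replay from the beginning
print(log.read_from(1))  # resume mid-stream
```

A traditional queue would have to choose one reader and delete on delivery; here the log is the source of truth and each reader just remembers its own position.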

Producers

Applications that write events to Kafka topics. Decoupled from consumers — producers don't know or care who reads the data.

Brokers

Kafka servers that store topic partitions and serve reads/writes. A Kafka cluster has multiple brokers for fault tolerance and parallelism.

Consumers

Applications that read events from topics. Consumer groups share partitions for parallel processing. Each group tracks its own offset independently.
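Independent offset tracking is the key property of consumer groups. A minimal sketch (group IDs and event values are made up for illustration) shows two groups reading the same partition without interfering with each other:

```python
# Toy model: two consumer groups reading the same partition independently.
# Each group owns its committed offset; neither affects the other.
events = ["e0", "e1", "e2", "e3"]
committed = {"analytics": 0, "billing": 0}  # group_id -> next offset to read

def poll(group: str, max_records: int):
    start = committed[group]
    batch = events[start:start + max_records]
    committed[group] = start + len(batch)  # commit after the batch
    return batch

assert poll("analytics", 2) == ["e0", "e1"]
assert poll("billing", 4) == ["e0", "e1", "e2", "e3"]  # billing is unaffected
assert poll("analytics", 2) == ["e2", "e3"]            # analytics resumes at 2
```

In real Kafka the committed offsets live in an internal topic (`__consumer_offsets`) rather than a dict, but the contract is the same: the broker stores the events, each group stores only a cursor.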

Why Kafka Matters

Without Kafka

  • Direct service-to-service API calls create tight coupling
  • Downstream failures cascade and take down producers
  • No event replay — lost data is gone forever
  • Can't add new consumers without modifying producers
  • Batch jobs only — no real-time event processing

With Kafka

  • Fully decoupled — producers and consumers evolve independently
  • Consumers fail and restart without losing events (offsets track position)
  • Replay any historical window from the retained log
  • New consumers added with zero producer changes
  • Real-time pipelines processing events within milliseconds

What You Can Do with Kafka

Real-Time Event Streaming

Stream clickstream, user activity, IoT sensor data, and application logs with millisecond latency.

Change Data Capture (CDC)

Capture database changes from Postgres, MySQL, or MongoDB in real time and propagate to downstream systems.

Microservices Decoupling

Replace synchronous API calls between services with async event publishing — producers and consumers evolve independently.

Real-Time Analytics

Feed data warehouses and OLAP stores (ClickHouse, Druid) with live event streams for sub-second dashboards.

Log Aggregation

Collect application and infrastructure logs from thousands of services into a single, searchable stream.

Stream Processing

Connect Kafka to Flink or Spark Streaming for real-time joins, aggregations, fraud detection, and ML inference.

How Kafka Works

Events flow from producers → brokers (partitioned topics) → consumer groups. Each partition is an ordered, immutable log:

Producer (writes events) → Topic / Partition (durable ordered log) → Consumer Group (reads at own offset) → Downstream (DB, stream processor, lake)

Producing and consuming events with the Python kafka-python client:

from kafka import KafkaProducer, KafkaConsumer
import json

# Producer: publish events to a topic
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode()
)
producer.send('user-events', {'user_id': 42, 'action': 'purchase'})
producer.flush()  # send() is asynchronous; flush before exit to guarantee delivery

# Consumer: read events from a topic
consumer = KafkaConsumer(
    'user-events',
    bootstrap_servers=['localhost:9092'],
    group_id='analytics-service',
    auto_offset_reset='earliest'  # replay from beginning
)
for msg in consumer:
    process_event(json.loads(msg.value))

Kafka vs Other Tools

Kafka vs RabbitMQ

Apache Kafka

  • Log-based: events are retained and replayable
  • Millions of events/sec with horizontal scaling
  • Multiple consumer groups read the same events independently
  • Designed for streaming and event sourcing

RabbitMQ

  • Queue-based: messages deleted after acknowledgement
  • Simpler setup, better for low-volume task queues
  • Flexible routing with exchanges and bindings
  • Better for RPC-style request/response patterns

Verdict: Kafka for high-throughput event streaming with replay. RabbitMQ for task queues, job processing, and low-volume async messaging.

Kafka vs Apache Pulsar

Apache Kafka

  • Massive ecosystem and community (Confluent, MSK, Aiven)
  • Battle-tested at extreme scale (LinkedIn, Netflix)
  • KRaft mode removes ZooKeeper dependency
  • Simpler architecture, easier to operate

Apache Pulsar

  • Native multi-tenancy built in from the start
  • Tiered storage (BookKeeper + object storage) natively
  • Geo-replication out of the box
  • Younger ecosystem, fewer managed services

Verdict: Kafka is the safer default in 2026 — larger community, more managed services, proven at scale. Choose Pulsar if multi-tenancy or geo-replication is a primary requirement.

Kafka vs Redis Pub/Sub

Apache Kafka

  • Durable: events stored on disk, retained indefinitely
  • Consumer offset tracking — resume after failure
  • Scales to millions of events/sec across brokers
  • Built for production-grade streaming workloads

Redis Pub/Sub

  • In-memory: messages lost if consumer is offline
  • No offset tracking — fire-and-forget delivery
  • Extremely fast for ephemeral notifications
  • Zero setup overhead for simple use cases

Verdict: Use Redis Pub/Sub for ephemeral notifications (live feed updates, presence). Use Kafka when you need durability, replay, or consumer offset tracking.
| Feature            | Kafka         | RabbitMQ      | Redis Pub/Sub   |
| ------------------ | ------------- | ------------- | --------------- |
| Event durability   | ✓ (disk)      | ✓ (queue)     | ✗ (memory only) |
| Event replay       | ✓             | ✗             | ✗               |
| Multiple consumers | ✓ (groups)    | Limited       | ✓ (broadcast)   |
| Throughput         | Millions/sec  | Thousands/sec | Millions/sec    |
| Ordering           | Per partition | Per queue     |                 |
| Setup complexity   | Medium        | Low           | None            |

Common Kafka Mistakes

Using too few partitions

Partitions are the unit of parallelism. If you have 1 partition, only 1 consumer in a group can process events. Start with partitions = max consumers you ever expect, and over-partition rather than under-partition.
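The idle-consumer effect is easy to see in a toy assignment function. This is a round-robin sketch, not Kafka's actual RangeAssignor or cooperative-sticky assignor, and the consumer names are made up:

```python
# Toy round-robin assignment: partitions divided among consumers in one group.
# With fewer partitions than consumers, the extra consumers get nothing.
def assign(partitions: int, consumers: list):
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

print(assign(1, ["c1", "c2", "c3"]))  # only c1 gets work; c2 and c3 sit idle
print(assign(6, ["c1", "c2"]))        # six partitions keep both consumers busy
```

This is why partition count caps consumer parallelism: you can always run fewer consumers than partitions, but never usefully run more.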

Not setting retention policy for your use case

Default Kafka retention is 7 days. For event sourcing or audit logs you may need indefinite retention. For ephemeral event routing you may want 1 hour. Set retention explicitly per topic (the retention.ms topic config) rather than relying on the broker-wide log.retention.hours default.
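A per-topic override can be applied with the kafka-configs tool; the broker address and topic name below are illustrative:

```shell
# Set 1-hour retention (in milliseconds) on a single topic,
# overriding the broker-wide default.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name user-events \
  --add-config retention.ms=3600000
```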

Using Kafka as a database

Kafka is not a query engine. You can't do ad-hoc queries, joins, or aggregations directly on topics. Use Kafka to feed real databases, data warehouses, or stream processors (Flink, ksqlDB) for query workloads.

Committing offsets before processing

If you commit the offset before successfully processing the message, a consumer crash will skip that message permanently. Commit after successful processing, not before.
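The two failure modes can be sketched with a toy consumer loop (pure Python, no broker): committing before processing gives at-most-once delivery (loss on crash), committing after gives at-least-once:

```python
# Toy simulation: commit-first vs process-first around a consumer crash.
def run(messages, commit_first: bool, crash_at: int):
    committed, processed = 0, []
    for i, msg in enumerate(messages):
        if commit_first:
            committed = i + 1          # offset committed before processing
        if i == crash_at:
            break                      # simulated crash before processing
        processed.append(msg)
        if not commit_first:
            committed = i + 1          # offset committed after processing
    # restart: the consumer resumes from the last committed offset
    processed += messages[committed:]
    return processed

msgs = ["m0", "m1", "m2"]
assert run(msgs, commit_first=True, crash_at=1) == ["m0", "m2"]          # m1 lost
assert run(msgs, commit_first=False, crash_at=1) == ["m0", "m1", "m2"]   # no loss
```

The flip side: in the process-first case, a crash after processing but before the commit replays the message instead of losing it, which is why at-least-once consumers should be idempotent.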

One topic for everything

Putting all event types into a single topic makes schema evolution, access control, and consumer filtering difficult. Use separate topics per event type or domain entity.

Who Should Learn Kafka?

Junior DE

You're building event-driven pipelines for the first time. Learning Kafka topics, producers, and consumer groups gives you the foundation for any real-time data architecture.

Senior DE

You own streaming reliability. Partition strategies, exactly-once semantics, schema registry, consumer lag alerting, and CDC patterns are your toolkit for production Kafka.

Staff DE

You design the event platform. Multi-cluster topology, tiered storage, cross-datacenter replication, and capacity planning for 1B+ events/day is where staff-level impact lives.


Frequently Asked Questions

What is Apache Kafka?
Apache Kafka is a distributed event streaming platform built at LinkedIn and open-sourced in 2011. It acts as a durable, high-throughput log that producers write events to and consumers read from. Kafka decouples data producers from consumers, enabling real-time pipelines that can handle millions of events per second.
What is a Kafka topic?
A Kafka topic is a named category or feed where events are published. Topics are split into partitions for parallelism — each partition is an ordered, immutable log of events. Consumers read from topics by tracking their offset (position) in each partition. Topics can retain events for hours, days, or indefinitely.
What is the difference between Kafka and a traditional message queue?
Traditional message queues (RabbitMQ, ActiveMQ) delete messages after they are consumed. Kafka retains all events in an ordered log for a configurable retention period. This means multiple consumer groups can independently read the same events, you can replay historical data, and events are never lost on consumer failure.
What is Kafka used for in data engineering?
Kafka is used for real-time event streaming, change data capture (CDC) from databases, decoupling microservices, log aggregation, clickstream analytics, fraud detection pipelines, and feeding data lakes with real-time events. It is the backbone of most modern streaming data architectures.
What is the difference between Kafka and Spark Streaming?
Kafka is the transport layer — it ingests, stores, and delivers events. Spark Streaming (or Flink) is the processing layer — it reads from Kafka and applies transformations, aggregations, and joins. They are complementary: Kafka is the queue, Spark/Flink is the compute engine.

What You'll Build with AI-DE

In the Kafka Event Routing project, you'll design Uber's event platform from first principles:

  • Decompose requirements and run capacity estimation (10K → 1B events/day)
  • Design Kafka topic strategy, delivery guarantees, and storage tiering
  • Architect the serving layer for real-time driver ETA and surge pricing
  • Design observability stack, disaster recovery, and SLA framework
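The capacity-estimation step starts with simple arithmetic; the 10x peak-to-average burst factor below is an assumption for illustration, not a figure from the project:

```python
# Back-of-envelope throughput for the stated 10K -> 1B events/day range.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

for events_per_day in (10_000, 1_000_000_000):
    avg = events_per_day / SECONDS_PER_DAY
    peak = avg * 10  # assumed 10x burst factor for sizing headroom
    print(f"{events_per_day:>13,}/day -> {avg:,.0f}/s avg, {peak:,.0f}/s peak")
```

Even at 1B events/day the average is only ~11.6K events/sec, well within a modest cluster's range; it is the peaks, message sizes, and replication factor that drive the real sizing.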
View the Kafka project →