What is Apache Kafka?
The distributed event streaming platform that powers real-time data pipelines at LinkedIn, Netflix, Uber, and thousands of companies — ingesting millions of events per second with fault-tolerant, replay-able storage.
Quick Answer
Apache Kafka is a distributed event streaming platform. Producers publish events to topics; consumers subscribe and read them independently. Kafka stores events in a durable, ordered log — consumers can replay historical events at any time. Built for millions of events per second with horizontal scalability and sub-second latency.
What is Apache Kafka?
Apache Kafka was created at LinkedIn to handle their activity stream — tracking every click, view, and interaction across the platform. It was open-sourced in 2011 and became an Apache top-level project in 2012. Today it processes trillions of events daily across the world's largest data platforms.
Unlike traditional message queues that delete messages after delivery, Kafka retains all events in an append-only log. This makes it both a messaging system and a storage system — consumers can read live events or replay historical ones from any point in time.
Producers
Applications that write events to Kafka topics. Decoupled from consumers — producers don't know or care who reads the data.
Brokers
Kafka servers that store topic partitions and serve reads/writes. A Kafka cluster has multiple brokers for fault tolerance and parallelism.
Consumers
Applications that read events from topics. Consumer groups share partitions for parallel processing. Each group tracks its own offset independently.
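Events with the same key always land on the same partition, which is what preserves per-key ordering across producers and consumers. A simplified sketch of that routing (Kafka's real default partitioner uses murmur2 hashing; the md5-based hash here is only for a stable illustration):

```python
# Simplified sketch of key-based partition routing.
# Kafka's default partitioner uses murmur2; md5 here is just a stable stand-in.
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition deterministically."""
    digest = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return digest % num_partitions

# Every event for user 42 lands on the same partition,
# so that user's events are consumed in order.
assert partition_for(b"user-42", 6) == partition_for(b"user-42", 6)
```

Because routing is deterministic, ordering holds per key without any coordination between producers.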
Why Kafka Matters
Without Kafka
- ✗ Direct service-to-service API calls create tight coupling
- ✗ Downstream failures cascade and take down producers
- ✗ No event replay — lost data is gone forever
- ✗ Can't add new consumers without modifying producers
- ✗ Batch jobs only — no real-time event processing
With Kafka
- ✓ Fully decoupled — producers and consumers evolve independently
- ✓ Consumers fail and restart without losing events (offsets track position)
- ✓ Replay any historical window from the retained log
- ✓ New consumers added with zero producer changes
- ✓ Real-time pipelines processing events within milliseconds
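The replay point is where Kafka departs most from a queue: because events stay in the log, a consumer can seek back to any retained timestamp. A hedged sketch with the kafka-python client (the topic name and one-hour window are illustrative, and `replay_window` needs a reachable broker):

```python
# Sketch: rewind a consumer to a point in time with kafka-python.
# Topic name and window size are illustrative assumptions.
import time

def start_of_window_ms(hours_back, now_ms=None):
    """Millisecond timestamp marking the start of the replay window."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    return now_ms - hours_back * 3600 * 1000

def replay_window(topic='user-events', hours_back=1):
    # Needs a live broker; kafka-python is imported lazily so the
    # pure helper above works without one.
    from kafka import KafkaConsumer, TopicPartition
    consumer = KafkaConsumer(bootstrap_servers=['localhost:9092'])
    partitions = [TopicPartition(topic, p)
                  for p in consumer.partitions_for_topic(topic)]
    consumer.assign(partitions)
    # Find the earliest offset at or after the cutoff on each partition...
    cutoff = start_of_window_ms(hours_back)
    offsets = consumer.offsets_for_times({tp: cutoff for tp in partitions})
    # ...and rewind each partition to it before consuming.
    for tp, ot in offsets.items():
        if ot is not None:
            consumer.seek(tp, ot.offset)
    return consumer
```

A restarted consumer could call `replay_window(hours_back=24)` to reprocess yesterday's events without any producer involvement.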
What You Can Do with Kafka
Real-Time Event Streaming
Stream clickstream, user activity, IoT sensor data, and application logs with millisecond latency.
Change Data Capture (CDC)
Capture database changes from Postgres, MySQL, or MongoDB in real time and propagate to downstream systems.
Microservices Decoupling
Replace synchronous API calls between services with async event publishing — producers and consumers evolve independently.
Real-Time Analytics
Feed data warehouses and OLAP stores (ClickHouse, Druid) with live event streams for sub-second dashboards.
Log Aggregation
Collect application and infrastructure logs from thousands of services into a single, searchable stream.
Stream Processing
Connect Kafka to Flink or Spark Streaming for real-time joins, aggregations, fraud detection, and ML inference.
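Of the patterns above, CDC is the most structured: change events typically arrive in an envelope describing the operation and the row state. A hedged sketch that applies a Debezium-style payload to an in-memory replica (the `op`/`before`/`after` field names follow Debezium's envelope convention; the events themselves would come from a connector, not this code):

```python
# Sketch: apply Debezium-style CDC envelopes to a keyed replica.
# Field names follow Debezium's 'op'/'before'/'after' convention.

def apply_change(event: dict, table: dict) -> None:
    """Apply one CDC event to an in-memory replica keyed by id."""
    op = event["op"]
    if op in ("c", "u", "r"):   # create, update, snapshot read
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":             # delete: only 'before' carries the key
        table.pop(event["before"]["id"], None)

replica = {}
apply_change({"op": "c", "after": {"id": 1, "email": "a@x.io"}}, replica)
apply_change({"op": "u", "after": {"id": 1, "email": "b@x.io"}}, replica)
```

Replaying the topic from offset zero rebuilds the replica from scratch — the same replay property that makes Kafka suitable for event sourcing.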
How Kafka Works
Events flow from producers → brokers (partitioned topics) → consumer groups. Each partition is an ordered, immutable log:
- Producer: writes events
- Topic / Partition: durable, ordered log
- Consumer Group: reads at its own offset
- Downstream: DB, stream processor, data lake
Producing and consuming events with the Python kafka-python client:
```python
from kafka import KafkaProducer, KafkaConsumer
import json

# Producer: publish events to a topic
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode()
)
producer.send('user-events', {'user_id': 42, 'action': 'purchase'})
producer.flush()  # block until buffered events are actually delivered

# Consumer: read events from a topic
consumer = KafkaConsumer(
    'user-events',
    bootstrap_servers=['localhost:9092'],
    group_id='analytics-service',
    auto_offset_reset='earliest'  # replay from the beginning if no committed offset
)
for msg in consumer:
    process_event(json.loads(msg.value))
```

Kafka vs Other Tools
Kafka vs RabbitMQ
Apache Kafka
- Log-based: events are retained and replayable
- Millions of events/sec with horizontal scaling
- Multiple consumer groups read the same events independently
- Designed for streaming and event sourcing
RabbitMQ
- Queue-based: messages deleted after acknowledgement
- Simpler setup, better for low-volume task queues
- Flexible routing with exchanges and bindings
- Better for RPC-style request/response patterns
Kafka vs Apache Pulsar
Apache Kafka
- Massive ecosystem and community (Confluent, MSK, Aiven)
- Battle-tested at extreme scale (LinkedIn, Netflix)
- KRaft mode removes ZooKeeper dependency
- Simpler architecture, easier to operate
Apache Pulsar
- Native multi-tenancy built in from the start
- Tiered storage (BookKeeper + object storage) natively
- Geo-replication out of the box
- Younger ecosystem, fewer managed services
Kafka vs Redis Pub/Sub
Apache Kafka
- Durable: events stored on disk, retained indefinitely
- Consumer offset tracking — resume after failure
- Scales to millions of events/sec across brokers
- Built for production-grade streaming workloads
Redis Pub/Sub
- In-memory: messages lost if consumer is offline
- No offset tracking — fire-and-forget delivery
- Extremely fast for ephemeral notifications
- Zero setup overhead for simple use cases
| Feature | Kafka | RabbitMQ | Redis Pub/Sub |
|---|---|---|---|
| Event durability | ✓ (disk) | ✓ (queue) | ✗ (memory only) |
| Event replay | ✓ | ✗ | ✗ |
| Multiple consumers | ✓ (groups) | Limited | ✓ (broadcast) |
| Throughput | Millions/sec | Thousands/sec | Millions/sec |
| Ordering | Per partition | Per queue | ✗ |
| Setup complexity | Medium | Low | None |
Common Kafka Mistakes
Using too few partitions
Partitions are the unit of parallelism. If you have 1 partition, only 1 consumer in a group can process events. Start with partitions = max consumers you ever expect, and over-partition rather than under-partition.
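Partition count is fixed at topic creation in the sense that it can be increased later but never decreased, so the headroom has to be chosen up front. A sketch with kafka-python's admin client (names, counts, and the ×2 headroom rule are illustrative assumptions):

```python
# Sketch: create a topic with headroom over the expected consumer count.
# Names, counts, and the x2 headroom factor are illustrative assumptions.

def partition_count(max_expected_consumers, headroom=2):
    """Over-partition so the consumer group can scale out later."""
    return max_expected_consumers * headroom

def create_topic(name='user-events', expected_consumers=12):
    # Needs a live broker; kafka-python admin client imported lazily
    from kafka.admin import KafkaAdminClient, NewTopic
    admin = KafkaAdminClient(bootstrap_servers=['localhost:9092'])
    admin.create_topics([NewTopic(
        name=name,
        num_partitions=partition_count(expected_consumers),
        replication_factor=3,
    )])
```

With 12 expected consumers this creates 24 partitions, so a doubled consumer group still has one partition each rather than idle members.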
Not setting retention policy for your use case
Default Kafka retention is 7 days. For event sourcing or audit logs you may need indefinite retention. For ephemeral event routing you may want 1 hour. Set the broker-wide default with log.retention.hours, and override it per topic with the topic-level retention.ms config.
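Per-topic overrides go through the topic-level `retention.ms` config. A sketch using kafka-python's admin client (topic names and windows are illustrative; `-1` means retain indefinitely):

```python
# Sketch: override retention per topic via 'retention.ms'.
# Topic names and retention windows are illustrative assumptions.

def retention_ms(hours):
    """Kafka expects retention.ms as a string of milliseconds."""
    return str(int(hours * 3600 * 1000))

def set_retention():
    # Needs a live broker; kafka-python admin client imported lazily
    from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType
    admin = KafkaAdminClient(bootstrap_servers=['localhost:9092'])
    admin.alter_configs([
        # Ephemeral routing topic: keep events for 1 hour
        ConfigResource(ConfigResourceType.TOPIC, 'click-routing',
                       configs={'retention.ms': retention_ms(1)}),
        # Audit topic: -1 retains events indefinitely
        ConfigResource(ConfigResourceType.TOPIC, 'audit-log',
                       configs={'retention.ms': '-1'}),
    ])
```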
Using Kafka as a database
Kafka is not a query engine. You can't do ad-hoc queries, joins, or aggregations directly on topics. Use Kafka to feed real databases, data warehouses, or stream processors (Flink, ksqlDB) for query workloads.
Committing offsets before processing
If you commit the offset before successfully processing the message, a consumer crash will skip that message permanently. Commit after successful processing, not before.
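With kafka-python this means disabling auto-commit and committing only after the handler returns. A sketch (the topic, group id, and handler are illustrative):

```python
# Sketch: at-least-once consumption — commit only after processing
# succeeds. Topic/group names and the handler are illustrative.
import json

def process_event(event: dict) -> dict:
    """Placeholder handler; a crash here must not lose the event."""
    return {**event, "processed": True}

def consume_forever():
    # Needs a live broker; kafka-python imported lazily
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(
        'user-events',
        bootstrap_servers=['localhost:9092'],
        group_id='analytics-service',
        enable_auto_commit=False,  # commit manually, after processing
    )
    for msg in consumer:
        process_event(json.loads(msg.value))
        consumer.commit()  # offset advances only after success
```

If the process crashes between `process_event` and `commit`, the event is redelivered on restart — duplicate-safe handlers (idempotency) are the usual companion to this pattern.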
One topic for everything
Putting all event types into a single topic makes schema evolution, access control, and consumer filtering difficult. Use separate topics per event type or domain entity.
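A lightweight way to enforce the per-domain split is a shared topic-naming helper. A minimal sketch — the `domain.event-type` convention here is an assumption, not a Kafka requirement:

```python
# Sketch: one topic per domain entity + event type, instead of a
# single catch-all topic. The naming convention is an assumption.

def topic_for(domain: str, event_type: str) -> str:
    """e.g. 'orders.payment-captured' rather than a global 'events' topic."""
    return f"{domain}.{event_type}".lower().replace(" ", "-")

assert topic_for("orders", "payment captured") == "orders.payment-captured"
```

Separate topics let you grant ACLs per domain and evolve each event schema independently.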
Who Should Learn Kafka?
Junior DE
You're building event-driven pipelines for the first time. Learning Kafka topics, producers, and consumer groups gives you the foundation for any real-time data architecture.
Senior DE
You own streaming reliability. Partition strategies, exactly-once semantics, schema registry, consumer lag alerting, and CDC patterns are your toolkit for production Kafka.
Staff DE
You design the event platform. Multi-cluster topology, tiered storage, cross-datacenter replication, and capacity planning for 1B+ events/day is where staff-level impact lives.
Related Concepts
Frequently Asked Questions
- What is Apache Kafka?
- Apache Kafka is a distributed event streaming platform originally built at LinkedIn and open-sourced in 2011. It acts as a durable, high-throughput log that producers write events to and consumers read from. Kafka decouples data producers from consumers, enabling real-time pipelines that can handle millions of events per second.
- What is a Kafka topic?
- A Kafka topic is a named category or feed where events are published. Topics are split into partitions for parallelism — each partition is an ordered, immutable log of events. Consumers read from topics by tracking their offset (position) in each partition. Topics can retain events for hours, days, or indefinitely.
- What is the difference between Kafka and a traditional message queue?
- Traditional message queues (RabbitMQ, ActiveMQ) delete messages after they are consumed. Kafka retains all events in an ordered log for a configurable retention period. This means multiple consumer groups can independently read the same events, you can replay historical data, and events are never lost on consumer failure.
- What is Kafka used for in data engineering?
- Kafka is used for real-time event streaming, change data capture (CDC) from databases, decoupling microservices, log aggregation, clickstream analytics, fraud detection pipelines, and feeding data lakes with real-time events. It is the backbone of most modern streaming data architectures.
- What is the difference between Kafka and Spark Streaming?
- Kafka is the transport layer — it ingests, stores, and delivers events. Spark Streaming (or Flink) is the processing layer — it reads from Kafka and applies transformations, aggregations, and joins. They are complementary: Kafka is the queue, Spark/Flink is the compute engine.
What You'll Build with AI-DE
In the Kafka Event Routing project, you'll design Uber's event platform from first principles:
- Decompose requirements and run capacity estimation (10K → 1B events/day)
- Design Kafka topic strategy, delivery guarantees, and storage tiering
- Architect the serving layer for real-time driver ETA and surge pricing
- Design observability stack, disaster recovery, and SLA framework