What Is Kafka and Why Is It Used? The Streaming Backbone Powering Modern Data
Around 2010, LinkedIn had a problem. Their data pipelines were a mess of point-to-point integrations. Every new system meant another custom connector. Engineers spent more time gluing things together than building features. So they built Kafka and open-sourced it in 2011.
I've been running Kafka in production since 2017. At SIVARO, we've built systems processing 200K events/second on Kafka clusters. I've seen it scale. I've seen it break. And I've watched a generation of engineers discover Kafka — both the man and the software — with near-religious obsession.
Let me clear something up: this article is about the distributed streaming platform. Not the novelist. But the confusion is telling. Both Franz Kafka and Apache Kafka share something — they make you confront systems that don't make sense until you understand the mechanics underneath. (Reddit on Gen Z's Kafka obsession)
So what is Kafka and why is it used? Simple answer: it's a distributed log that lets you move data between systems in real time. Practical answer: it's the closest thing we have to a universal data bus for modern infrastructure. And Gen Z's fascination with the author Franz Kafka? That's a different kind of pipeline — one processing existential dread instead of JSON payloads. (Why GenZ is addicted to this author)
Let me show you what years of running Kafka in production taught me.
The Core Idea: A Commit Log That Never Sleeps
Most people think Kafka is a message queue. It's not. At least, not exactly.
A traditional message queue delivers a message, waits for the acknowledgement, then deletes it. Kafka keeps everything for as long as your retention config says, typically days or weeks. It's a distributed commit log: think of it as a database optimized for append-only writes, with only a sparse offset index instead of general-purpose indexes, and the ability to replay history.
Here's the mental model I use:
Producer → [Kafka Topic (immutable log)] ← Consumer
Producers write events to the end of the log. Consumers read from wherever they want. If a consumer crashes and comes back 3 days later? It picks up where it left off. As long as the retention window hasn't expired, the data is still there.
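Here's that mental model as a minimal in-memory sketch. `CommitLog` and its methods are hypothetical names for illustration, not a Kafka API, and a real partition lives on disk across several brokers. The point is only that reads never remove data, and that each consumer tracks its own position.

```python
class CommitLog:
    """In-memory stand-in for a single topic partition (illustrative only)."""

    def __init__(self):
        self._records = []    # append-only list; list index == offset
        self._committed = {}  # consumer name -> next offset to read

    def append(self, value):
        """Append a record and return the offset it was written at."""
        self._records.append(value)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return records starting at the consumer's committed offset."""
        start = self._committed.get(consumer, 0)
        return self._records[start:start + max_records]

    def commit(self, consumer, next_offset):
        """Remember where this consumer should resume after a restart."""
        self._committed[consumer] = next_offset

    def seek(self, consumer, offset):
        """Rewind (or fast-forward) a consumer to replay history."""
        self._committed[consumer] = offset


log = CommitLog()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)

first_batch = log.poll("billing", max_records=2)
log.commit("billing", 2)

# "billing" crashes here; more events arrive while it is down
log.append("order-4")

# After the restart it resumes at offset 2; nothing was lost
after_restart = log.poll("billing")

# Replay from the beginning, e.g. to rebuild state after a bug fix
log.seek("billing", 0)
replayed = log.poll("billing")
```

Reading never deletes anything, which is why a second consumer can read the same log at its own pace.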
This changes everything. In a traditional queue, once a message has been acknowledged and deleted, it's gone; there's nothing to replay. In Kafka, the message waits. (Kafka Wikipedia: yes, the Wikipedia page for the author gets more traffic than the tech documentation some months)
Why Gen Z Is Obsessed with Kafka (Both of Them)
Let me break the fourth wall for a second.
If you search "what is Kafka and why is it used?" on YouTube, you'll get two completely different sets of results. One about Apache Kafka for engineers. One about Franz Kafka for literature students.
And somehow, both audiences are growing. (Gen-Z's obsession with Kafka & Dostoevsky)
I think the literary Kafka resonates because his work captures something about modern life — bureaucracy, alienation, systems that operate with no regard for the individual. Sound familiar? That's exactly what distributed systems feel like when they break.
Here's what I tell my junior engineers who ask "is Kafka good or evil?": The software is amoral. It's a tool. But it does force you to confront uncomfortable truths about state, consistency, and failure. Just like reading The Trial.
Redis, RabbitMQ, Kinesis: each solves a subset of the problem. But Kafka has become the standard. Why? Because it built its whole design around one stance: "keep everything, figure out consumption later."
That's the Gen Z appeal in a nutshell. (Why is Gen Z obsessed with Kafka?)
What Was Kafka Known For? (The Tech Answer)
If you asked an engineer in 2015 "what was Kafka known for?", they'd say "log aggregation." LinkedIn used it to track user activity. Twitter used it for analytics. That was the use case.
Then something shifted.
Companies started realizing that Kafka wasn't just for logs. It was for everything. At SIVARO, we've used Kafka for:
- Event sourcing — every state change is an event, stored forever
- Stream processing — joining, filtering, aggregating data in real time
- Commit log for microservices — services publish facts, others subscribe
- Database change data capture (CDC) — capturing every INSERT/UPDATE/DELETE from Postgres or MySQL
- Metrics and monitoring — 200K events/second, 24/7
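The killer feature? Exactly-once semantics. In a world where data loss costs millions, Kafka's idempotent producers and transactions let a read-process-write pipeline avoid duplicating or dropping messages inside Kafka. Not without trade-offs: it's slower than at-least-once, and writes to external systems still have to be idempotent on your side. But for financial transactions or inventory systems, it's non-negotiable.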
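To make the exactly-once idea less abstract, here is a simplified sketch of the deduplication behind idempotent producers: each record carries a producer ID and a sequence number, and the log ignores a sequence number it has already accepted. `PartitionLog` is a hypothetical class for illustration. The real broker logic also tracks producer epochs and rejects out-of-order sequences, so treat this as the core idea only.

```python
class PartitionLog:
    """Toy partition that deduplicates producer retries (illustrative only)."""

    def __init__(self):
        self.records = []
        self._last_seq = {}  # producer_id -> highest sequence number accepted

    def append(self, producer_id, seq, value):
        """Append unless (producer_id, seq) was already written.

        Returns True if the record was appended, False if it was a duplicate.
        """
        if seq <= self._last_seq.get(producer_id, -1):
            return False  # retry of a record we already have: ack, don't rewrite
        self.records.append(value)
        self._last_seq[producer_id] = seq
        return True


ledger = PartitionLog()
ledger.append("producer-a", 0, "debit $10")
ledger.append("producer-a", 1, "debit $25")

# Network timeout: the producer never saw the ack for seq 1, so it retries
retry_written = ledger.append("producer-a", 1, "debit $25")
```

Without the sequence check, the retry would have written the $25 debit twice, which is exactly the at-least-once failure mode.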
Architecture: What's Under the Hood
I'm going to skip the 10,000-foot overview and give you the ground truth.
Topics and Partitions
A topic is a category. Think "orders" or "page_views". Each topic is split into partitions. Partitions are where the parallelism lives.
Topic: orders
Partition 0 → [order-1, order-4, order-7]
Partition 1 → [order-2, order-5, order-8]
Partition 2 → [order-3, order-6, order-9]
More partitions = more throughput. But also more file handles, more memory, more complexity. We run clusters with 200+ partitions per topic. Some teams run 1000+. At some point, you hit diminishing returns.
Producers and Consumers
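Producers choose which partition to write to. You can hash on a key (order_id, user_id) so that every event for that key lands in the same partition, which guarantees their relative order. Or you can round-robin for even load, with no ordering across partitions. Each has trade-offs.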
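```python
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(bootstrap_servers='localhost:9092')

# Send with a key: records with the same key go to the same partition,
# so they keep their relative order
producer.send(
    'orders',
    key=b'user_123',
    value=b'{"order_id": 456, "amount": 99.99}',
)
# Block until buffered records have been sent
producer.flush()
```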
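Key-based partitioning is easy to picture with a few lines of plain Python. This sketch uses `zlib.crc32` as a stand-in hash; that is my assumption for illustration, since real Kafka clients use their own hash for keyed records (murmur2 in the Java client). `choose_partition` is a hypothetical helper, not a client API.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in for the client's real hash."""
    return zlib.crc32(key) % num_partitions


events = [
    (b"user_123", "cart_created"),
    (b"user_456", "cart_created"),
    (b"user_123", "item_added"),
    (b"user_123", "checkout"),
]

# Route each event to a partition based on its key
partitions = {p: [] for p in range(3)}
for key, value in events:
    partitions[choose_partition(key, 3)].append((key, value))

# All of user_123's events sit in one partition, in the order they were sent
user_123_partition = choose_partition(b"user_123", 3)
user_123_events = [v for k, v in partitions[user_123_partition] if k == b"user_123"]
```

One consequence worth knowing: the mapping depends on the partition count, so adding partitions to a topic later can send an existing key to a different partition.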
Consumers read from partitions. A consumer group lets you parallelize reading. Each partition is consumed by exactly one consumer in the group. No conflicts.
```python
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    group_id='order_processor',
    # Start from the oldest record if the group has no committed offset
    auto_offset_reset='earliest',
)

# Iterating blocks and yields records as they arrive
for message in consumer:
    print(f"Partition: {message.partition}, Offset: {message.offset}")
    print(f"Value: {message.value.decode('utf-8')}")
```
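The "one partition, one consumer per group" rule can be sketched as a simple round-robin assignment. `assign_partitions` is a hypothetical helper; Kafka's real assignors (range, round-robin, sticky, cooperative) handle rebalancing and multiple topics, so this only shows the invariant.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin.

    Every partition gets exactly one owner within the group.
    """
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


# 3 partitions, 2 consumers: one consumer takes two partitions
two_consumers = assign_partitions([0, 1, 2], ["c1", "c2"])

# 3 partitions, 4 consumers: the extra consumer gets nothing to do
four_consumers = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"])
idle = [c for c, owned in four_consumers.items() if not owned]
```

That idle consumer is the practical reason partition count caps parallelism: adding consumers beyond the number of partitions buys you nothing.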
Brokers and Replication
Kafka runs on a cluster of servers (brokers). Data is replicated across brokers. In our production setup, we use replication factor 3, so your data exists on 3 machines. If one crashes, Kafka elects a new leader from the in-sync replicas.
The dirty secret? Leader election is not instant. We've seen 15-second blackouts during broker failures. If your system can't tolerate that, you need to design around it. Or use a different tool.
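One way to design around an election window is to retry with exponential backoff instead of failing on the first error. The sketch below is generic Python, not a Kafka client API: `LeaderNotAvailableError` and `send_with_retry` are hypothetical names, and real clients already retry internally based on their own settings, so this mostly matters at the application boundary.

```python
import time


class LeaderNotAvailableError(Exception):
    """Hypothetical error raised while a partition has no leader."""


def send_with_retry(send, record, retries=5, base_delay=0.1, sleep=time.sleep):
    """Call send(record), backing off exponentially while no leader is available."""
    for attempt in range(retries + 1):
        try:
            return send(record)
        except LeaderNotAvailableError:
            if attempt == retries:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)


# Demo: a fake send that fails twice (simulated election), then succeeds
failures_left = [2]


def flaky_send(record):
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise LeaderNotAvailableError()
    return f"acked:{record}"


delays = []  # record the backoff instead of actually sleeping
result = send_with_retry(flaky_send, "order-1", sleep=delays.append)
```

With `retries=5` and `base_delay=0.1` the total wait tops out at about 3 seconds, so covering a 15-second blackout means raising one of the two.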
Why Not Just Use RabbitMQ or Redis?
I get this question constantly.
RabbitMQ is great for work queues: send a task, process it, delete it. But its classic queues delete a message once it's acknowledged, so there's nothing to replay (RabbitMQ Streams, added later, is a separate feature). And it doesn't scale to 200K events/second without serious pain.
Redis is fast. Blazing fast. But it's an in-memory store first. When memory fills up, it evicts data or rejects writes, depending on the eviction policy, and persistence is optional. You can't rely on a cache-style Redis for durable, replayable event storage. We learned this the hard way in 2019, when we lost 4 hours of events during a Redis failover. Never again.
Kafka trades latency for durability and throughput. A well-tuned cluster can handle millions of writes per second; per-broker numbers depend heavily on message size, batching, and hardware. The trade-off: latency is typically in the 2-10ms range, not microseconds. (100 years after his death, Gen Z loves Franz Kafka: yes, this article is about the author, but the parallels to data persistence are uncanny)
Stream Processing: Where Kafka Gets Interesting
Raw event streaming is useful. But stream processing is where you unlock value.
Kafka Streams is a Java library. You write code that transforms, joins, and aggregates data as it flows through Kafka. No external cluster. No SQL. Just Java code running in your application.
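```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> orders = builder.stream("orders");
// Count orders per key; the result is kept in the "order-counts" state store
KTable<String, Long> orderCounts = orders
    .groupByKey()
    .count(Materialized.as("order-counts"));
// Counts are Longs, so set the value serde explicitly for the output topic
orderCounts.toStream().to("order-counts-output", Produced.with(Serdes.String(), Serdes.Long()));
```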
The alternative is ksqlDB. It's SQL for streams. You write:
```sql
CREATE STREAM orders (order_id INT, user_id INT, amount DOUBLE)
  WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');

CREATE TABLE order_stats AS
  SELECT user_id, COUNT(*) AS order_count, SUM(amount) AS total_spent
  FROM orders
  WINDOW TUMBLING (SIZE 1 HOUR)
  GROUP BY user_id
  EMIT CHANGES;
```
We use ksqlDB for ad-hoc analysis. Kafka Streams for production pipelines. The SQL approach is easier to maintain. The Java approach is more flexible. Choose based on your team.
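If tumbling windows are new to you, this plain-Python sketch shows roughly what the hourly aggregation above computes: each event falls into exactly one fixed, non-overlapping window based on its timestamp. The function name, the tuple layout, and the sample data are made up for illustration. A real stream processor also deals with late events and state stores, which this ignores.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # 1-hour tumbling windows


def tumbling_stats(orders):
    """Aggregate (timestamp_seconds, user_id, amount) tuples per window and user."""
    stats = defaultdict(lambda: {"order_count": 0, "total_spent": 0.0})
    for ts, user_id, amount in orders:
        # Every timestamp belongs to exactly one window: the hour it falls in
        window_start = ts - (ts % WINDOW_SECONDS)
        bucket = stats[(window_start, user_id)]
        bucket["order_count"] += 1
        bucket["total_spent"] += amount
    return dict(stats)


sample_orders = [
    (0, 1, 10.0),     # user 1, first hour
    (1800, 1, 5.0),   # user 1, first hour
    (3600, 1, 7.5),   # user 1, second hour
    (4000, 2, 20.0),  # user 2, second hour
]
hourly = tumbling_stats(sample_orders)
```

Here `hourly` is keyed by `(window_start, user_id)`, which mirrors how a windowed table is keyed by window and grouping column.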
The Dark Side: Kafka Pain I've Seen
Let me be honest. Kafka is not easy.
Operational Complexity
You need ZooKeeper on older clusters, or KRaft, the built-in consensus protocol that replaces it (ZooKeeper mode was removed in Kafka 4.0). You need to tune JVM heap, page cache, and disk I/O. If you get it wrong, performance degrades silently.
I once spent 3 days debugging a cluster where producers were timing out. Turned out the network MTU was set wrong. Kafka doesn't tell you this. It just throws timeout exceptions.
Consumer Lag
If your consumers can't keep up with producers, messages pile up. Kafka storage fills up. Retention kicks in. Data gets deleted. This happened to a fintech company I consulted for. They lost 2 hours of transaction data because a consumer group was misconfigured.
The fix: monitor consumer lag religiously. We use Burrow (LinkedIn's lag checker). It saved us multiple times.
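Lag itself is simple arithmetic: for each partition, the log end offset minus the group's committed offset. The sketch below uses hard-coded offsets and hypothetical helper names. In a real setup those numbers come from the brokers or from a tool like Burrow.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition = log end offset - committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def lagging_partitions(lag, threshold):
    """Partitions whose lag exceeds the alert threshold."""
    return sorted(p for p, behind in lag.items() if behind > threshold)


# Sample numbers: partition 2 has fallen far behind
end_offsets = {0: 1500, 1: 1480, 2: 9000}
committed = {0: 1495, 1: 1480, 2: 2000}

lag = partition_lag(end_offsets, committed)
alerts = lagging_partitions(lag, threshold=1000)
```

The absolute number matters less than the trend: lag that keeps growing means the consumer will eventually fall off the end of the retention window.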
Schema Management
Kafka doesn't enforce schemas. A producer can write a JSON object with a field named price while a consumer expects price_in_cents, and the pipeline breaks silently.
Use Schema Registry. Avro or Protobuf. Enforce schema evolution. We learned this after a 3-hour production incident where a producer started sending amount instead of value. No warnings. Just corrupted data downstream.
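To show the kind of mistake a schema check catches, here is a tiny hand-rolled validator. It is not how Schema Registry works internally (that relies on Avro, Protobuf, or JSON Schema plus compatibility rules); `ORDER_SCHEMA` and `validate` are hypothetical names used only to illustrate rejecting a bad record before it is produced.

```python
# Hypothetical schema: field name -> expected Python type
ORDER_SCHEMA = {"order_id": int, "value": float}


def validate(record, schema):
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


good = validate({"order_id": 456, "value": 99.99}, ORDER_SCHEMA)

# The incident above: the producer renamed "value" to "amount"
bad = validate({"order_id": 456, "amount": 99.99}, ORDER_SCHEMA)
```

Run at produce time, a check like this turns silent downstream corruption into a loud failure at the source.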
Real-World Patterns We Use at SIVARO
Here's what works for us.
Pattern 1: Event Sourcing with Compaction
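Every change to a user's profile is an event. We store them in a Kafka topic with log compaction enabled. Compaction guarantees that Kafka keeps at least the latest value for each key and eventually discards the older ones, so the topic acts as a durable snapshot of current state rather than a full history. If a consumer needs to rebuild state, it reads the compacted topic from the beginning.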
This replaces database triggers. It replaces complex SQL queries. It's simpler and faster.
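Compaction is easier to reason about with a toy version. This sketch keeps only the last record per key and drops keys whose latest value is `None`, a stand-in for Kafka's tombstone records. It is a simplification: the real log cleaner runs in the background, leaves the active segment alone, and keeps tombstones around for a while before removing them. The keys and values are made up.

```python
def compact(log):
    """Keep only the newest record per key, preserving offset order.

    log is a list of (key, value) tuples; a value of None marks a delete.
    """
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later offsets overwrite earlier ones
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset and value is not None
    ]


profile_log = [
    ("user_1", {"email": "a@old.example"}),
    ("user_2", {"email": "b@example.com"}),
    ("user_1", {"email": "a@new.example"}),  # supersedes the first record
    ("user_3", {"email": "c@example.com"}),
    ("user_3", None),                        # tombstone: user_3 was deleted
]

compacted = compact(profile_log)

# Rebuilding state is just replaying the compacted log into a dict
state = dict(compacted)
```

Note what is gone afterwards: user_1's old email. If you need the full history for audit or replay, keep a separate non-compacted topic.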
Pattern 2: Database CDC with Debezium
Debezium reads Postgres write-ahead logs and publishes changes to Kafka. We capture every INSERT, UPDATE, and DELETE in real time. Downstream systems get live data without polling.
Debezium connector config (the user and password values are placeholders):

```json
{
  "name": "postgres-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "<user>",
    "database.password": "<password>",
    "database.dbname": "orders",
    "topic.prefix": "dbserver1"
  }
}
```
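On the consuming side, each Debezium change event carries an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read) plus `before` and `after` row images. A minimal sketch of applying such events to an in-memory replica follows. It assumes the events are already unwrapped to that envelope, that the table has a primary key column named `id`, and that `before` is populated on deletes; all three are assumptions for illustration, and `apply_change` is a hypothetical helper.

```python
def apply_change(table, event):
    """Apply one change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        row = event["after"]
        table[row["id"]] = row  # insert, or overwrite with the new row image
    elif op == "d":
        table.pop(event["before"]["id"], None)  # remove the deleted row
    return table


# Sample events for a hypothetical orders table
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
```

Because every step is an overwrite or a delete by key, replaying the same events a second time leaves the replica unchanged, which fits at-least-once delivery well.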
Pattern 3: Microservices Event Bus
We don't do synchronous REST between microservices. Each service publishes events to Kafka. Other services subscribe. If one service is down, the events queue up. When it recovers, it catches up.
This decouples everything. We deploy services independently. No more cascading failures.
Kafka vs. The Alternatives: What I Tell Clients
| System | Best For | Worst For |
|---|---|---|
| Kafka | Event streaming, CDC, log aggregation | Low-latency (<1ms) messaging |
| RabbitMQ | Work queues, task distribution | Large-scale event storage |
| Redis | Caching, real-time counters | Durable event persistence |
| Kinesis | AWS-native streaming | Multi-cloud setups, cost-sensitive workloads |
| Pulsar | Multi-tenant, geo-replication | Teams that need a mature ecosystem and simple operations |
My rule of thumb: if you're building a system that needs to replay data, use Kafka. If you're routing tasks to workers, use RabbitMQ. If you're just caching, use Redis.
FAQ: What People Actually Ask Me
Is Kafka hard to learn?
Yes. The concepts are simple. The operational reality is not. Plan for 2-3 weeks before your team is productive. Plan for 6 months before you're comfortable with production operations.
What size cluster should I start with?
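Start with 3 brokers. Replication factor 2 at minimum; use 3, which is what we run in production, if you can't afford to lose data during a broker failure. Use managed Kafka (Confluent Cloud, MSK, Aiven) until you hit 50K events/second. Self-managed below that threshold wastes engineering time.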
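Disk is usually the first sizing question, and the arithmetic is short. The sketch below assumes an average event size of 1 KB and 7 days of retention. Both are placeholder assumptions, so plug in your own numbers. It uses the 50K events/second threshold from above and replication factor 3, and ignores compression and index overhead.

```python
def retained_bytes(events_per_second, avg_event_bytes, retention_days, replication_factor):
    """Raw disk needed across the cluster to hold the full retention window."""
    seconds = retention_days * 24 * 60 * 60
    return events_per_second * avg_event_bytes * seconds * replication_factor


# Assumed workload: 50K events/s, 1 KB each, 7 days retention, RF 3
total_bytes = retained_bytes(50_000, 1024, 7, 3)

# Spread evenly across 3 brokers, in decimal terabytes
per_broker_tb = total_bytes / 3 / 1e12
```

Under those assumptions each of the 3 brokers holds roughly 31 TB before any headroom, which is a useful sanity check before picking disk sizes.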
Can Kafka replace my database?
No. But it can replace your message queue, your ETL pipelines, and your event bus. Don't store master data in Kafka. Store events. Derive state from events in a database.
What's the best Kafka client library?
For Python: kafka-python or confluent-kafka (the confluent-kafka-python project; faster, built on the C library librdkafka). For Java: use the official Kafka client. For Go: franz-go or confluent-kafka-go.
Should I use Kafka for small applications?
Probably not. Kafka has operational overhead. For simple pub-sub with < 1K messages/day, use Redis Pub/Sub or a simple HTTP webhook. Kafka shines at scale and durability.
What was Kafka known for originally?
Log aggregation from LinkedIn's data infrastructure. The first production use was tracking user activity across the site. It was open-sourced in 2011. By 2015, Netflix, Uber, and Twitter were using it for core infrastructure.
Why is Gen Z obsessed with Kafka (the author)?
Because Franz Kafka wrote about feeling trapped by systems you can't control. Gen Z feels that in a gig economy, algorithmic content feeds, and bureaucratic institutions. The parallels are obvious. (Why GenZ is secretly obsessed with this author)
The Bottom Line
Kafka is the closest thing we have to a universal data bus. It's not perfect. It's not easy. But when you need to move 200K events per second between 50 microservices, across 3 data centers, with exactly-once semantics and replay capability, there are very few real alternatives.
The question "what is Kafka and why is it used?" has a short answer and a long answer. The short answer: it's a distributed commit log. The long answer fills books.
But here's what I know after 7 years in production: every system I've built with Kafka has been more resilient, more scalable, and more maintainable than the alternatives. Not because Kafka is magic. Because it forces you to think about data differently.
You don't send messages. You log events. You don't delete data. You retain and compact. You don't couple services. You stream facts.
That's the mental shift. Once you make it, you can't go back.
And maybe that's also why Gen Z reads Kafka — the author, not the platform. Because once you see the absurdity of the system, you can't unsee it. (Franz Kafka's personal story)
Nishaant Dixit — Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec.