Why Is Apache Kafka So Popular? A Practitioner’s Guide

I’ve been building data infrastructure since 2018. Started SIVARO to help companies stop treating data like a side project. And I’ve lost count of how many times someone asked me: "Why is Apache Kafka so popular?" They see it everywhere—Netflix, Uber, LinkedIn, Goldman Sachs, even the startup down the street with three engineers and a dream. They assume it’s hype.

It’s not hype. But the real answer surprises most people.

Kafka isn’t popular because it’s fast. It’s not popular because it’s simple. It’s popular because it solved a problem nobody could solve cleanly before: moving data reliably between systems at scale, with a contract that doesn’t break when the data volume changes.

Let me show you what I mean.

In 2019, a client of mine—let’s call them FinFlow—had a pipeline that processed 50,000 transactions a day. They used RabbitMQ. Worked fine. Then they hit 200,000. RabbitMQ started dropping messages under backpressure. Their fraud detection system missed patterns. They lost money. They rewrote the whole stack around Kafka. Six months later, they’re doing 2 million transactions a day. Same team. Same hardware.

That’s why Kafka is popular. Not because it’s trendy. Because it works when other things break.

The One Thing Kafka Got Right That Everyone Else Missed

Most people think Kafka is a message queue. It’s not. It’s a distributed commit log.

That distinction is everything.

A message queue (RabbitMQ, ActiveMQ, SQS) is built around the idea of temporary storage. A producer sends a message, a consumer picks it up, and it’s gone. If the consumer crashes before processing, the message might be lost. If the producer sends messages faster than the consumer can handle, the queue fills up, backpressure kicks in, and things slow down or drop.

Kafka flips that model. Messages are stored in a log. Not a queue. A log is append-only. Immutable. Ordered. And it keeps messages around for a configurable retention period—days, weeks, forever if you want.

Here’s the key insight: Kafka doesn’t care if consumers are slow. It keeps the data. Consumers read at their own pace. They can rewind. They can replay. They can catch up after hours of downtime.

This single design decision makes Kafka fundamentally different from every message broker that came before it.

Why That Design Decision Matters: Three Real Scenarios

Scenario 1: The microservices disaster

You have 12 microservices communicating via HTTP. Service A calls B, B calls C, C calls D. One day, D starts taking 30 seconds to respond. B’s thread pool fills up. A’s thread pool fills up. Everything slows to a crawl. You have cascading failure.

With Kafka: Service A publishes an event. B, C, D subscribe independently. If D is slow, it reads at its own pace. B and C don’t care. No cascading failure. No thread pool exhaustion.

Scenario 2: The auditable system

Your compliance team shows up. “We need all transaction logs from the last 90 days.” You built everything on databases with TTL-based cleanup. You’re screwed. Your logs rolled off.

With Kafka: You set retention to 90 days. You replay the topic. You hand them a file. Done.

Scenario 3: The real-time + batch split

Your ML team wants real-time predictions. Your analytics team wants hourly batches. Traditional approach: run two different pipelines. Double the infrastructure. Double the bugs.

With Kafka: One topic. Stream processor reads from it in real time. Flink job reads from it in micro-batches. Data doesn’t move twice. Consistency is guaranteed.

This isn’t theoretical. We’ve deployed this pattern at SIVARO for a fintech client handling $400M in daily transaction volume. One pipeline. Two processing modes. Zero data loss.

The Ecosystem That Built Itself

Most people think Kafka’s popularity comes from its core design. And I agree—partially.

But there’s another reason.

Apache Kafka spawned an entire ecosystem. Not because the creators planned it. Because the core design was so solid that people naturally built on top of it.

Look at what exists today:

Kafka Connect — pre-built connectors to databases, S3, Elasticsearch, etc. You don’t write custom code to move data between systems. You configure a connector. Done.
Kafka Streams — lightweight stream processing library. No need for a separate cluster (like Spark Streaming or Flink). Just a Java library you import.
KSQL — SQL interface for stream processing. Analysts love it. Engineers hate it at first, then love it when they see non-engineers can self-serve.
Schema Registry — enforces Avro, Protobuf, or JSON Schema on topics. Prevents that moment when someone changes a field name and your downstream consumers silently parse garbage data.

This ecosystem means you don’t just get a message broker. You get a data integration platform.

In 2021, I watched a team of four engineers replace a 12-person data engineering team using Kafka Connect + Schema Registry + a few deploy scripts. That’s why CFOs approve Kafka budgets.

But It’s Not All Sunshine — The Real Trade-offs

I’m not here to sell you. Kafka has real problems. Let me list them honestly.

Problem 1: Operational complexity

Kafka is a distributed system. Zookeeper-based clusters (Kafka < 2.8) are notorious for hard failures. KRaft mode (Kafka 2.8+) eliminates Zookeeper, but it’s still maturing.

You need to understand:

Partition rebalancing
Leader election
ISR (in-sync replicas)
Consumer group coordination
Log compaction vs cleanup

This is not “deploy and forget” territory. You need someone who understands the internals.

Problem 2: Scaling bottlenecks

Kafka scales horizontally—mostly. But partitions have limits. A single partition can handle maybe 10-20 MB/s of throughput before things degrade. If you need 1 GB/s, you need 100 partitions. And consumers reading from 100 partitions have memory overhead.

We tested this at SIVARO in 2022. A single node with 100 partitions, 64 GB RAM, and 8 cores started showing consumer lag at around 15 MB/s. We added two more nodes. Problem went away. But it wasn’t linear—scaling from 1 to 3 nodes gave us 2.5x throughput, not 3x. Coordination overhead.

Problem 3: Exactly-once semantics are expensive

Kafka claims exactly-once semantics (EOS). Technically true. Practically? You pay for it.

Enable EOS on a producer and your throughput drops by 20-40%. Enable it on consumers and your lag monitoring becomes less reliable. Transactional writes require separate coordination.

Most teams we work with at SIVARO use at-least-once semantics with idempotent consumers. It’s simpler. Faster. And in practice, “at least once with dedup” achieves the same results as “exactly once” with less operational pain.

Problem 4: Storage cost

Kafka keeps data on disk. If you have high throughput and long retention, expect to provision a lot of storage.

In 2023, a client needed to keep 30 days of data for 200 topics averaging 5 MB/s each. Total: 864 TB of disk. We used tiered storage (hot: SSDs, cold: HDDs). Helped, but wasn’t cheap.

Three Code Examples That Show How Kafka Works

Let me give you concrete examples. These are patterns we use daily at SIVARO.

Example 1: Producing messages with idempotent writes

python
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['kafka1:9092', 'kafka2:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    enable_idempotence=True,  # prevents duplicates on retry
    acks='all',               # wait for all replicas to confirm
    retries=5
)

# Send event with dedup ID (idempotent key)
for i in range(10_000):
    event = {
        'user_id': i,
        'action': 'click',
        'timestamp': 1700000000 + i,
        'dedup_id': f'{i}-{int(time.time())}'
    }
    producer.send('user_events', key=str(i).encode(), value=event)

producer.flush()
producer.close()

Why this matters: Without enable_idempotence=True, network errors during write can produce duplicates. With it, Kafka ensures exactly-one write per retry. We’ve seen production issues where this flag alone saved teams from corrupted data.

Example 2: Consumer group with manual offset commit

python
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'user_events',
    bootstrap_servers=['kafka1:9092'],
    group_id='analytics_group',
    enable_auto_commit=False,  # we control when offsets are saved
    auto_offset_reset='earliest',
    max_poll_records=1000
)

while True:
    records = consumer.poll(timeout_ms=5000)
    if not records:
        continue
    
    processed_messages = 0
    for topic_partition, messages in records.items():
        for message in messages:
            event = json.loads(message.value.decode('utf-8'))
            # Process event (write to database, call API, etc.)
            db_write_event(event)
            processed_messages += 1
    
    # Commit offsets only after successful processing
    if processed_messages > 0:
        consumer.commit()
        logging.info(f'Committed {processed_messages} messages')

Why this matters: Auto-commit in Kafka is dangerous. If your consumer crashes after reading but before processing, auto-commit means those messages are lost. Manual commit ensures you only advance the offset after the work is done. We’ve seen startups lose weeks of data because they relied on auto-commit.

Example 3: Kafka Streams stateful aggregation (Java)

java
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.*;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-aggregator");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> clicks = builder.stream("click-stream");

// Aggregate clicks per user over 5-minute windows
KTable<Windowed<String>, Long> clickCounts = clicks
    .groupBy((key, value) -> value)  // value contains user_id
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .count();

// Send aggregated counts to output topic
clickCounts.toStream().foreach((key, count) -> {
    System.out.println("User: " + key.key() + 
                       ", Window Start: " + key.window().start() + 
                       ", Count: " + count);
});

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

Why this matters: This runs in your application JVM. No separate cluster needed. We’ve used this pattern for real-time dashboarding at SIVARO—processing 50,000 events/second on a single 8-core machine. You don’t need Flink or Spark for this.

Who Should (and Shouldn’t) Use Kafka

Let me be blunt.

You should use Kafka if:

You have multiple consumers needing the same data stream (microservices, analytics, ML)
You need to guarantee message ordering within a partition
You need replayability (audit, reprocessing, disaster recovery)
You have throughput above 10,000 messages/second
Your team already knows JVM languages (Java, Scala) or is willing to learn

You should NOT use Kafka if:

You have a single consumer and low throughput (use Redis or RabbitMQ)
You need strict exactly-once delivery without accepting throughput hit (use Pulsar)
Your team has no Kubernetes or Docker experience and you can’t hire someone (Kafka on bare metal is a nightmare)
Your data volume is below 100 messages/second (use a database with CDC)
You can’t afford to lose a node (Kafka clusters need at least 3 brokers for HA)

I’ve seen teams adopt Kafka because “it’s what Netflix uses.” That’s terrible reasoning. I have killed Kafka projects when I saw teams with 5 topics and 2 partitions wondering why they had operational pain. Use the right tool.

The Real Reason Why Is Apache Kafka So Popular?

Let me answer the question directly.

Apache Kafka is popular because it was the first system to combine three things that previously couldn’t coexist:

High throughput (millions of messages/second per cluster)
Durability (messages survive broker failures)
Replayability (consumers can read old data)

Before Kafka, you had to choose. High throughput? You used a fast but unreliable queue. Durability? You used a database, but it was slow. Replayability? You wrote a custom log system.

Kafka gave you all three. And the ecosystem grew around it.

In 2024, we’re seeing a shift. Pulsar is gaining traction for multi-datacenter deployments. Redpanda is challenging Kafka with a simpler architecture (no JVM, no Zookeeper). Even NATS is getting serious.

But Kafka remains the default. Not because it’s perfect. Because it’s the standard.

The Apache Software Foundation reported in 2023 that Kafka is used in over 80% of Fortune 500 companies. That’s not hype. That’s infrastructure gravity.

Frequently Asked Questions

Is Kafka better than RabbitMQ?

Depends on use case. RabbitMQ is simpler for point-to-point messaging, supports priority queues natively, and has lower latency for small messages (< 10KB). Kafka wins at throughput, durability, and multi-consumer scenarios. I use RabbitMQ for real-time notifications (sub-10ms delivery). I use Kafka for event streaming and batch processing. Not a competition—different tools.

Do I need Zookeeper?

If you’re on Kafka < 2.8, yes. If you’re on Kafka 2.8+ and using KRaft mode, Zookeeper is optional. KRaft replaces Zookeeper with Kafka’s own consensus protocol. It’s simpler but still maturing—we’ve seen KRaft clusters with 5+ nodes have longer leader election times. For production, I still recommend Zookeeper for critical deployments unless you’ve tested KRaft thoroughly.

Can Kafka lose data?

Yes—if your configuration is wrong. Common causes: acks=0 or acks=1 (not waiting for all replicas), setting min.insync.replicas too low, unclean leader election, or replication factor of 1. Properly configured (replication factor 3, min.insync.replicas=2, acks=all), Kafka guarantees no data loss in most failure scenarios. We’ve run clusters for 3 years without data loss.

What’s the maximum throughput?

Depends on hardware, configuration, and message size. We’ve benchmarked a 3-node cluster with SSDs doing 1.2 million messages/second (100 bytes each). With 10KB messages, it drops to 200,000. LinkedIn reported a 1.3 trillion messages/day cluster in 2020. Theoretical max for a single partition is around 10-20 MB/s. Scale by adding partitions and brokers.

Is Kafka hard to learn?

Yes—if you don’t know distributed systems. The jargon is heavy: offsets, partitions, brokers, ISR, consumer groups, rebalancing. But the core concepts are simple. Spend time understanding the log abstraction—once that clicks, everything else makes sense. I tell new engineers: “Kafka is a log file. Everything else is optimization.” Takes most teams 2-4 weeks to become productive, 3-6 months to master operations.

What’s the cost of running Kafka?

Surprisingly low for small deployments. A 3-node cluster on cloud VMs (4 vCPUs, 16GB RAM, 500GB SSD) costs about $500-800/month in 2024. Enterprise-scale clusters (20+ brokers, 100+ TB storage) can run $10,000-$50,000/month. The real cost is operational—you need at least one part-time Kafka admin for production clusters. Managed services (Confluent Cloud, AWS MSK) are 2-3x more expensive but save on ops.

When should I consider alternatives to Kafka?

Pulsar: Multi-datacenter replication, geo-replication, separate storage/compute layers. We used it for a client with 5 data centers across continents.
Redpanda: Simpler (no JVM, no Zookeeper), lower latency, better resource utilization. Good for teams wanting Kafka API with less ops.
Kinesis: If you’re 100% AWS and don’t want to manage infrastructure. But you lose portability and control.
NATS: Lightweight messaging. Great for IoT or edge deployments with limited resources. Not for long-term storage or replay.

Conclusion

So why is Apache Kafka so popular?

Because it solved a real problem at the right time. Because its design is fundamentally better for stream processing than any queue-based system. Because the ecosystem made it easy to integrate with everything. Because companies like Uber, Netflix, and LinkedIn proved it works at massive scale.

But don’t adopt it because it’s popular.

Adopt it because you have data that needs to move between systems reliably. Adopt it because you need temporal decoupling between producers and consumers. Adopt it because you want to reprocess data without rebuilding pipelines.

That’s why I still recommend Kafka in 2025. Not because it’s perfect. Because it’s proven.

Nishaant Dixit — Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec.