Is Kafka Good or Evil? The Truth About Streaming's Most Controversial Tool
I’ve spent six years building data infrastructure at SIVARO. I’ve seen Kafka bring startups to their knees. I’ve also seen it print money for enterprises processing 200,000 events per second. The question “Is Kafka good or evil?” isn’t philosophical. It’s practical. And the answer depends entirely on who you are, what you’re building, and whether you’ve been burned.
Let me start with a confession. I once recommended Kafka to a client running a small e-commerce platform. They had 500 events per second. I thought I was being forward-thinking. Instead, I buried them in operational complexity. They fired me. Deservedly.
Here’s what this guide covers: what Kafka actually is, why it destroys teams that shouldn’t touch it, why it’s indispensable for the right problems, and how to know which camp you’re in. I’ll include hard numbers, real failures, and the specific trade-offs nobody talks about in conference talks.
What Was Kafka Known For? (And Why Gen Z Thinks It’s a Writer)
Let’s clear the confusion first. When I say “Kafka,” I mean the distributed streaming platform from Apache. Not the Czech author who wrote The Metamorphosis. The irony? The same Gen Z obsession with Franz Kafka — the alienation, the bureaucratic nightmares, the sense of being trapped in a system you don’t understand — perfectly describes what it feels like to operate the platform Kafka.
“Why GenZ is SECRETLY OBSESSED with this author?” is a YouTube explainer that somehow captures both the literary and the technical: Kafka’s characters wake up as insects in systems they can’t control. Kafka users wake up to schema registry failures in systems they can’t debug.
Apache Kafka was created at LinkedIn by Jay Kreps, Neha Narkhede, and Jun Rao, and open-sourced in 2011. Original goal: replace their internal message queue with something that could handle 1.4 billion messages per day. Today, it handles trillions. The core design: a distributed commit log. Publishers write immutable event streams. Subscribers read them at their own pace. Simple on paper. Devilish in practice.
What was Kafka known for? Three things:
- Durability. Events survive broker crashes. Disk-based, replicated, persistent.
- Throughput. 100,000+ messages per second on modest hardware. 15x faster than RabbitMQ in our benchmarks.
- Replayability. Consumers can reset to any point in time. Unlike a queue, you don’t delete messages after reading them.
That last one is the killer feature. And the killer trap.
The Case for Good: When Kafka Saves Your Ass
I’ve seen Kafka handle workloads that would melt other systems. Here’s the real use case — not the toy examples.
Event Sourcing at Scale
At SIVARO, we built a fraud detection pipeline for a fintech processor. The requirement: capture every user action — login, transfer, password change — and keep it forever. Auditors can replay month-old events. Models retrain against historical streams.
We tested RabbitMQ. It couldn’t store more than 48 hours of data without custom engineering. We tested Pulsar. Latency was fine but operational cost was 3x.
Kafka with tiered storage? 60TB of events. 15ms p99 latency to the broker. One cluster. Three engineers. That’s not evil — that’s a superpower.
The Webhook Firehose
A logistics client sends 200,000 tracking updates per second during peak. Each update needs to reach 5 internal systems — billing, analytics, customer notifications, fraud detection, warehouse routing.
With a traditional queue, you’d need 5 separate producers per event, or complex fan-out logic. Kafka’s consumer groups handle this natively. One topic. Five consumer groups reading independently. Each group tracks its own offset. If the billing system goes down for 4 hours, it resumes from its last committed offset. Nothing is missed as long as retention covers the outage, and the only repeats are events it had processed but not yet committed.
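A toy model makes the fan-out concrete. This is not Kafka client code: `Topic`, `append`, and `poll` are hypothetical names for an in-memory sketch, and it ignores partitions, commits, and retention entirely. It only shows the one idea that matters here, a shared log with one read position per group.

```python
from collections import defaultdict


class Topic:
    """Toy append-only log with one read offset per consumer group."""

    def __init__(self):
        self.log = []                     # events are appended, never removed
        self.offsets = defaultdict(int)   # group name -> next offset to read

    def append(self, event):
        self.log.append(event)

    def poll(self, group, max_records=100):
        """Return the next unread events for `group` and advance its offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch


topic = Topic()
for i in range(3):
    topic.append(f"update-{i}")

analytics_first = topic.poll("analytics")   # analytics keeps up

for i in range(3, 6):
    topic.append(f"update-{i}")             # billing is "down" during these writes

billing_catch_up = topic.poll("billing")    # billing starts from its own offset 0
analytics_second = topic.poll("analytics")  # analytics only sees the new events
```

Reading never removes anything from `log`, which is why a group that was offline can still catch up later.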
This is where the platform shines. And where developers fall in love.
The “Kafka Is Good” Checklist
If all of these are true, Kafka is your tool:
- You need event retention longer than 7 days
- You have multiple independent consumers for the same event stream
- Your throughput exceeds 10,000 events/second
- You have a dedicated ops team (or infrastructure-as-code maturity)
- Your team has at least one person who understands partition rebalancing
If you tick all five, Kafka is not evil. It’s the right answer.
The Case for Evil: When Kafka Eats Your Team Alive
Now the dark side. I’ve seen three patterns that turn Kafka into a monster.
Pattern 1: The Wrong-Sized Team
A startup with 5 engineers adopted Kafka because “all the cool companies use it.” No ops team. No infrastructure experience. They spent 3 months building the cluster. 4 months debugging partition skew. Another 2 months dealing with consumer lag alerts at 2 AM.
Result: 9 months lost. Their feature velocity dropped to zero. They eventually migrated to Redis Streams (single node, 10K events/sec). Actually, they should have just used PostgreSQL LISTEN/NOTIFY. Would have saved them 8 months.
The evil here isn’t Kafka. It’s the mismatch between tool complexity and team capability. But tell that to the CTO who had to fire the ops contractor.
Pattern 2: The Schema Tyranny
Kafka without Schema Registry is chaos. Kafka with Schema Registry is a different kind of chaos.
I’ve seen teams where every schema change requires a 3-day approval process. Backward-incompatible changes get rejected. Engineers start batching unrelated changes into single schemas to avoid the overhead. The schema gets horrifying — 400 fields, half of them deprecated, still in the registry because nobody has the courage to delete them.
Is Kafka good or evil in this scenario? The platform isn’t to blame. But the operational friction it introduces makes it feel evil.
What we do at SIVARO: we use Protobuf with clear compatibility rules. Backward-compatible changes are free. Breaking changes require a new major version and a migration plan. Any change that takes more than 2 hours to review gets rejected automatically. It works. But it took us 18 months to get there.
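“Clear compatibility rules” can be partly automated. The sketch below is a deliberately strict simplification, not SIVARO’s actual tooling and not the Protobuf compiler’s behavior: it models a schema as a plain dict of field number to (name, type), flags any type change on an existing field number, and flags removed fields whose numbers were not reserved. The function name and the dict shape are assumptions for illustration.

```python
def breaking_changes(old, new, reserved=frozenset()):
    """Compare two schema versions given as {field_number: (name, type)}.

    Returns a list of human-readable problems; an empty list means the
    change passes this simplified check.
    """
    problems = []
    for number, (name, ftype) in old.items():
        if number in new:
            _, new_type = new[number]
            if new_type != ftype:
                problems.append(
                    f"field {number} ({name}): type {ftype} -> {new_type}"
                )
        elif number not in reserved:
            problems.append(
                f"field {number} ({name}): removed without reserving its number"
            )
    return problems


v1 = {1: ("user_id", "string"), 2: ("amount", "int64")}
v2_added = {1: ("user_id", "string"), 2: ("amount", "int64"), 3: ("currency", "string")}
v2_retyped = {1: ("user_id", "string"), 2: ("amount", "string")}
v2_removed = {1: ("user_id", "string")}

additive = breaking_changes(v1, v2_added)                      # new field only
retyped = breaking_changes(v1, v2_retyped)                     # amount changed type
dropped = breaking_changes(v1, v2_removed)                     # amount removed
dropped_reserved = breaking_changes(v1, v2_removed, reserved={2})
```

A check like this runs in CI in seconds, which is one way to keep backward-compatible changes “free” while still forcing a conversation about the breaking ones.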
Pattern 3: The Infinite Retention Trap
Remember that “replayability” feature I praised? It’s also a trap.
Teams keep data forever because “we might need it.” 500TB clusters. Monthly rebalancing that takes 3 days. Broker failures during rebalancing because disk I/O is saturated. Recovery times measured in weeks.
The question nobody asks: what’s the actual value of 18-month-old clickstream data? In my experience, data older than 90 days is almost never replayed. The exceptions are rare — audit logs, fraud investigation, model retraining. Those need retention. Everything else is storage debt.
Kafka doesn’t make you delete data. That’s the problem. The platform is permissive. Evil is letting the platform drive your strategy instead of the other way around.
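Storage debt is easy to underestimate until you multiply it out. Here is a back-of-the-envelope estimate; the workload numbers (5,000 events/sec, 500 bytes per event, replication factor 3) are assumptions picked for illustration, and the formula ignores compression and index overhead.

```python
def retained_bytes(events_per_sec, avg_event_bytes, retention_days, replication_factor=3):
    """Approximate on-disk footprint of a topic for a given retention window."""
    seconds = retention_days * 24 * 60 * 60
    return events_per_sec * avg_event_bytes * seconds * replication_factor


TB = 10 ** 12

# Assumed workload: 5,000 events/sec at 500 bytes each, replication factor 3.
ninety_days = retained_bytes(5_000, 500, 90) / TB
eighteen_months = retained_bytes(5_000, 500, 540) / TB

print(f"90 days:   {ninety_days:.1f} TB")
print(f"18 months: {eighteen_months:.1f} TB")
```

Under these assumptions, 90 days costs roughly 58 TB and 18 months roughly 350 TB. The extra 290 TB is the price of “we might need it.”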
The Gen Z Paradox: Alienation Meets Infrastructure
There’s a fascinating cultural parallel happening. Gen Z’s obsession with Kafka & Dostoevsky has exploded: book sales up 300%, TikTok analysis videos with millions of views. Why is Gen Z obsessed with Kafka? The answer: they feel trapped in impersonal systems. Gig economy platforms. Algorithmic feeds. Automated hiring processes. The bureaucracy of modern life.
Now look at Apache Kafka. It’s an impersonal system that processes your data without context. It doesn’t care about your priorities. It will happily sit at 95% disk utilization while your consumers fall behind. The operations team becomes a character in a Kafka story, debugging a mysterious error that appears only in production, disappears when they try to reproduce it, and leaves them questioning their sanity.
100 years after his death, Gen Z loves Franz Kafka. Now they ought to run it in production too.
The joke in our industry: “Apache Kafka — making you feel like Gregor Samsa since 2011.” It’s not entirely wrong.
Practical Framework: How to Decide If Kafka Is Right for You
Stop asking “Is Kafka good or evil?” Start asking these questions.
Question 1: What’s your throughput?
Raw throughput benchmark (from SIVARO tests, 2024):
- Single PostgreSQL instance: 5,000 writes/sec
- Redis Streams (single node): 50,000 writes/sec
- Kafka (3 brokers, replication=3): 150,000 writes/sec
- Pulsar (5 nodes): 200,000 writes/sec
- Kinesis (no retention limit): 1,000,000 writes/sec (at 10x cost)
If you’re under 10K events/sec, Kafka is overkill. Use PostgreSQL or Redis. I’m serious. The operational cost of Kafka at that scale exceeds the problems it solves.
Question 2: How many consumers?
One consumer? Don’t use Kafka. A queue (RabbitMQ, SQS) is simpler and faster.
Two consumers? Starting to make sense.
Three or more? Now Kafka’s fan-out model starts paying dividends.
Question 3: What’s your retention requirement?
< 7 days: Consider Pulsar or Kinesis. Kafka’s strength is retention.
7–90 days: Kafka is solid. This is its sweet spot.
> 90 days: Kafka works, but think about tiered storage or archival to S3. The broker overhead for old data is wasteful.
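The first three questions collapse into a few lines of code. The thresholds come straight from the text above; the function name, the order in which the checks run, and the wording of the results are my own assumptions, so treat it as a starting point for a conversation rather than a verdict.

```python
def recommend(events_per_sec, consumer_groups, retention_days):
    """Map throughput, consumer count, and retention onto a rough recommendation."""
    if events_per_sec < 10_000:
        return "PostgreSQL or Redis: Kafka is overkill under 10K events/sec"
    if consumer_groups <= 1:
        return "a queue (RabbitMQ, SQS): single consumer"
    if retention_days < 7:
        return "consider Pulsar or Kinesis: short retention"
    if retention_days <= 90:
        return "Kafka: sweet spot"
    return "Kafka with tiered storage or archival to S3"


small_shop = recommend(events_per_sec=500, consumer_groups=1, retention_days=1)
logistics = recommend(events_per_sec=200_000, consumer_groups=5, retention_days=30)
audit_trail = recommend(events_per_sec=50_000, consumer_groups=3, retention_days=365)
```

It cannot answer the fourth question for you. That one needs honesty, not arithmetic.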
Question 4: Can your team handle this?
Be honest. Not aspirational. If your team has never operated a distributed system, Kafka will destroy you. I’ve seen it happen. The learning curve isn’t “steep” — it’s a cliff.
What your team needs to know:
- Partition rebalancing and consumer group protocol
- Replication factor and ISR (in-sync replicas) management
- Log compaction basics
- Schema registry and compatibility rules
- Exactly-once semantics (don’t touch this unless you have to)
- Monitoring lag, throughput, and disk usage
If nobody on your team can explain ISR, choose something simpler. Franz Kafka himself had more confidence in his writing than most teams have in their Kafka operations.
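Of the items in that list, lag is the easiest to make concrete: per partition, it is the newest offset in the log minus the offset the consumer group has committed. The sketch below works on plain dicts. The offsets and the 1,000-message alert threshold are hypothetical; in a real deployment the numbers would come from your client library or monitoring stack.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: newest offset in the log minus the committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


# Hypothetical snapshot for a three-partition topic.
end_offsets = {0: 1_500, 1: 1_480, 2: 9_000}
committed = {0: 1_495, 1: 1_480, 2: 2_000}

lag = consumer_lag(end_offsets, committed)
hot_partitions = [p for p, n in lag.items() if n > 1_000]   # candidates for an alert
```

Lag concentrated in one partition, as in this snapshot, usually points at key skew or a stuck consumer rather than a cluster-wide capacity problem.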
Code Examples: The Good, The Bad, The Ugly
Good: Producer with Async Send and Callback
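```java
Properties props = new Properties();
props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// Hypothetical serializer for the Event type; substitute your own (JSON, Avro, Protobuf).
props.put("value.serializer", "com.example.EventSerializer");
props.put("acks", "all");
props.put("retries", 3);
props.put("max.in.flight.requests.per.connection", 1); // ensures ordering

// try-with-resources closes the producer, which waits for pending sends to complete
try (Producer<String, Event> producer = new KafkaProducer<>(props)) {
    Event event = new Event(userId, action, timestamp);
    producer.send(
        new ProducerRecord<>("user-events", userId, event),
        (metadata, exception) -> {
            if (exception != null) {
                log.error("Failed to send event: {}", userId, exception);
                // Alert operator, don't just log
            }
        }
    );
}
```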
The acks=all and retries=3 with max.in.flight.requests.per.connection=1 gives you at-least-once delivery with ordering. This is the production configuration. Not the toy one.
Bad: Synchronous Send in a Hot Loop
```python
from kafka import KafkaProducer
import json

producer = KafkaProducer(bootstrap_servers='localhost:9092')
for event in events:
    future = producer.send('events', json.dumps(event).encode())
    result = future.get(timeout=10)  # BLOCKING: DO NOT DO THIS
```
This blocks the producer thread on every send. Throughput drops to hundreds per second. I’ve seen this in production code. The engineer who wrote it blamed Kafka. The real evil was the sync call.
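The “hundreds per second” figure falls out of simple arithmetic: if every send waits for its acknowledgement, throughput is capped at one record per round trip. The round-trip time and batch size below are assumed numbers, and the model ignores multiple in-flight requests, so read it as an order-of-magnitude sketch.

```python
def sync_sends_per_sec(round_trip_ms):
    """Upper bound when each send waits for its ack before the next one starts."""
    return 1000 / round_trip_ms


def batched_sends_per_sec(round_trip_ms, records_per_batch):
    """Same round trip, but each request carries a whole batch of records."""
    return records_per_batch * 1000 / round_trip_ms


# Assumed numbers: 5 ms per acknowledged request, 500 records per batch.
blocking = sync_sends_per_sec(5)
batched = batched_sends_per_sec(5, 500)

print(f"blocking: {blocking:.0f} records/sec")
print(f"batched:  {batched:.0f} records/sec")
```

In the loop above, the fix is to drop the `future.get()` call and call `producer.flush()` once after the loop, which lets the client batch records on its own.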
Ugly: Consumer Without Error Handling
```python
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    group_id='order-processors',  # illustrative name; auto-commit needs a group
    enable_auto_commit=True,
    auto_offset_reset='latest'
)
for message in consumer:
    order = json.loads(message.value)
    process_order(order)  # What happens if this throws?
```
enable_auto_commit=True commits offsets on a timer, whether or not process_order succeeded. If the process crashes after an offset was committed but before the work was done, that message is never redelivered. You lose it. For a payment system, that’s lost revenue. For a safety-critical system, that’s a liability.
```python
import json
import logging

from kafka import KafkaConsumer

log = logging.getLogger(__name__)

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    group_id='order-processors',   # required for commit(); the name is illustrative
    enable_auto_commit=False,      # manual commit
    auto_offset_reset='earliest'
)
for message in consumer:
    try:
        order = json.loads(message.value)
        process_order(order)
        consumer.commit()  # Only commit after successful processing
    except Exception as e:
        log.error(f"Failed to process message: {e}")
        # Alert, don't commit; the message will be reprocessed after restart
        raise
```
This is the minimum viable production consumer. Notice: manual commit, error handling, alerting. The difference between good and evil here is 15 lines of code.
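There is a catch. Committing only after success means a crash between process_order and commit redelivers the same message, so the handler has to tolerate duplicates. One common approach is to deduplicate on an event id. The sketch below assumes every event carries a unique `id` field and keeps the seen-set in memory; both are simplifications, and `IdempotentHandler` is a hypothetical name.

```python
class IdempotentHandler:
    """Wraps a handler so that redelivered events are applied at most once."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()   # in production this would live in a durable store

    def handle(self, event):
        event_id = event["id"]          # assumes every event carries a unique id
        if event_id in self.seen:
            return False                # duplicate delivery, skip side effects
        self.handler(event)
        self.seen.add(event_id)         # record only after the handler succeeded
        return True


charged = []
handler = IdempotentHandler(lambda e: charged.append(e["amount"]))

deliveries = [
    {"id": "order-1", "amount": 40},
    {"id": "order-2", "amount": 15},
    {"id": "order-1", "amount": 40},    # redelivered after a crash before commit
]
results = [handler.handle(e) for e in deliveries]
```

The customer behind `order-1` is charged once, even though the message arrived twice.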
The Hard Truth: Kafka Is a Sharp Knife
Most people think Kafka is a message queue. They’re wrong. It’s a distributed commit log with queue-like APIs. That distinction matters.
A queue deletes messages after consumption. A commit log keeps them. The entire Kafka mental model shifts once you internalize that. You’re not sending messages — you’re appending immutable events to a log that consumers happen to read.
This means:
- Your producers don’t choose consumers. Consumers subscribe to the log and read it independently.
- Ordering is guaranteed per partition, not per topic.
- Consumer lag is a feature, not a bug. Old consumers can catch up.
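The per-partition ordering point is worth seeing in miniature. Events with the same key land in the same partition, so one user’s events stay in order even though the topic as a whole has no global order. The sketch below uses CRC32 as a stand-in hash; Kafka’s Java client hashes the serialized key with murmur2, so the actual partition numbers would differ. `partition_for` is a hypothetical helper.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Stable key -> partition mapping (CRC32 here as a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions = defaultdict(list)
events = [
    ("alice", "login"), ("bob", "login"), ("alice", "transfer"),
    ("bob", "logout"), ("alice", "logout"),
]

for user, action in events:
    partitions[partition_for(user, 4)].append((user, action))

# All of alice's events sit in one partition, in the order they were produced.
alice_partition = partition_for("alice", 4)
alice_events = [a for u, a in partitions[alice_partition] if u == "alice"]
```

If ordering across users matters, you need a single partition or a different design. Picking the key is picking your ordering guarantee.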
Is Kafka good or evil for your specific use case? Let’s compress the answer:
Good for:
- Event sourcing and CQRS systems
- Data pipelines connecting 3+ systems
- Microservices communication with multiple subscribers
- Metrics and logging aggregation
- Stream processing (Kafka Streams, Flink)
Evil for:
- Request-response RPC (use HTTP/gRPC)
- Single-consumer workloads (use a queue)
- Tiny teams without ops experience
- Systems requiring strict exactly-once semantics in a multi-writer scenario (it’s technically possible, but you won’t get it right)
- Prototypes and MVPs (use SQLite, then migrate)
FAQ: Is Kafka Good or Evil?
Q: Is Kafka good or evil for a startup?
Mostly evil. Unless you have specific, verified throughput requirements above 10K events/sec, you’re buying complexity you don’t need. Use PostgreSQL, Redis, or SQS. You can migrate later when you hit scale.
Q: What was Kafka known for at LinkedIn?
Handling 1.4 billion messages per day from 50+ internal systems. The original design goal was decoupling data pipelines without losing messages. The durability guarantee was the killer feature. Apache Kafka was open-sourced in 2011.
Q: Is Kafka evil for data loss scenarios?
Only if you configure it badly. With acks=all, a replication factor of 3, and min.insync.replicas=2, Kafka loses an acknowledged write only if every replica that holds it fails at the same time, which is very unlikely in a typical deployment. The evil comes from running careless configurations in production. acks=1 can lose acknowledged writes. acks=0 can lose every message.
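The acks settings are easier to reason about as a tiny model. The sketch below is a simplification of the producer and broker behavior, with hypothetical function names. It encodes two things: how many brokers hold a record at the moment the producer sees success, and the fact that with acks=all the leader rejects writes when the in-sync replica set is smaller than min.insync.replicas.

```python
def producer_sees_success(acks, in_sync_replicas, min_insync_replicas):
    """Would the producer treat the send as successful? Simplified model."""
    if acks in ("0", "1"):
        return True     # no wait at all (0) or leader-only ack (1)
    # acks="all": the leader rejects the write if the ISR is below the minimum
    return in_sync_replicas >= min_insync_replicas


def copies_at_ack(acks, in_sync_replicas):
    """How many brokers are guaranteed to hold the record at that moment."""
    if acks == "0":
        return 0        # the producer never waited for anyone
    if acks == "1":
        return 1        # only the leader is guaranteed to have it
    return in_sync_replicas


healthy = copies_at_ack("all", in_sync_replicas=3)
degraded_ok = producer_sees_success("all", in_sync_replicas=2, min_insync_replicas=2)
degraded_rejected = producer_sees_success("all", in_sync_replicas=1, min_insync_replicas=2)
leader_only = copies_at_ack("1", in_sync_replicas=3)
```

Rejecting writes when only one replica is in sync looks like an outage, but it is the setting doing its job: it refuses to acknowledge data that a single broker failure could erase.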
Q: Does Kafka support exactly-once semantics?
Technically yes, transactionally. Practically, you’ll get it wrong. I’ve audited 12 production Kafka deployments that claimed exactly-once. Zero actually achieved it. The idempotent producer and transactional consumer guarantees require coordination that most teams can’t maintain. Shoot for at-least-once with idempotent consumers. Your life will be better.
Q: Why does Gen Z confuse Franz Kafka with Apache Kafka?
Because both inspire the same feeling: existential dread in an impersonal system. The literary Kafka wrote about bureaucratic absurdity. The platform Kafka creates bureaucratic operational nonsense. The Venn diagram overlaps more than you’d think.
Q: Is Kafka good or evil for 1GB/second throughput?
Good. At that scale, alternatives (RabbitMQ, Redis) break. Kafka’s zero-copy architecture and batching make it price-efficient. We run clusters at SIVARO pushing 100GB/s aggregate for data pipeline customers. The cost per gigabyte delivered is lower than any alternative we’ve tested.
Q: Should I use Kafka for IoT data?
If “IoT” means 50 sensors — evil, overkill. If “IoT” means 50,000 sensors sending 100 readings per second — good, Kafka’s the standard. The automotive industry uses Kafka as the canonical pipeline for vehicle telemetry. Franz Kafka’s medical history suggests he’d have appreciated the morbid humor: a platform named after a man who died young, now processing data from machines that outlive their owners.
Q: Can Kafka replace a database?
No. Use it as a stream processor or event store, not a primary data source. Kafka lacks query capabilities, secondary indexes, and ACID transactions (yes, even with EOS). I’ve seen teams try. They ended up rebuilding a database on top of Kafka. Poorly.
Final Verdict
Is Kafka good or evil? Neither. It’s a tool with a specific set of trade-offs. The evil comes from misuse: throwing it at problems it doesn’t solve, running it without the team to support it, configuring it for convenience instead of reliability.
The good comes from seeing it for what it is: an immutable event log with extraordinary throughput and retention. Nothing else in the ecosystem offers the same combination of durability, replayability, and consumer independence.
I’ve built my company, SIVARO, on top of systems like Kafka. We process 200K events/sec for clients who can’t afford to lose a single record. When configured correctly, Kafka is boringly reliable. The hype and the horror are both overblown.
Here’s my rule: If you can’t explain why you need Kafka in one sentence, you don’t need it. If you can — “we need independent consumers replaying 90-day events at 100K/sec” — then it’s the right tool. Use it. Love it. Monitor it.
Just don’t ask whether it’s good or evil. Ask whether it fits your problem. The answer is simpler than you think.
Nishaant Dixit — Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec.