When to Bring in Kafka
Requirements that force the move — How we identified the triggers that made SQS insufficient and migrated to managed Kafka.
- Data Loss Incidents: 0
- Msg/sec Peak: 187K
- P95 Latency: 48 ms
- Add New Consumer: 3 hr
A US-based SaaS company (Series B, 4 years old) ran 15 microservices handling 1.2 billion events per month. The architecture initially relied on SQS plus HTTP calls. By Series B, scaling cracks appeared.
Lost Events
SQS's at-least-once delivery offered no replay: once a message was deleted or aged out of the queue, it was gone. 2.3 million events were lost during an analytics crash.
Coupling Nightmare
6 consumers required 6 separate publishers. Adding a 7th touched 5 upstream services.
Ordering Failures
SQS FIFO was capped at 300 msg/sec. Black Friday hit 12,000 events/sec, and events were processed out of order.
- Compliance: SOC2, no data loss
- Downtime: zero during migration
- Team: 5 engineers, no Kafka experience
- Cost: reduce cost per event
- Scale: 5B events/mo by 2026
Kafka Maturity Curve
- Level 1: Experiments
- Level 2: Multiple teams
- Level 3: Tactical use (client was here)
- Level 4: Platform
- Level 5: Critical infra
Queues (SQS)
- Message consumed once, then gone
- Ideal for background jobs
- No replayability
Streams (Kafka)
- Messages persist, can be replayed
- Multiple consumer groups
- Enables event sourcing
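The difference between the two models can be sketched with a toy in-memory queue and log. This is an illustration of the semantics only, not of SQS or Kafka themselves; the `Queue` and `Log` classes and the group names are hypothetical.

```python
from collections import deque


class Queue:
    """Queue semantics: a message is handed to one consumer, then removed."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns None when empty; a received message cannot be read again.
        return self._messages.popleft() if self._messages else None


class Log:
    """Stream semantics: records stay in an append-only log and each
    consumer group tracks its own read offset."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer group -> next offset to read

    def append(self, record):
        self._records.append(record)

    def poll(self, group):
        offset = self._offsets.get(group, 0)
        batch = self._records[offset:]
        self._offsets[group] = len(self._records)
        return batch

    def seek(self, group, offset):
        # Rewinding a group's offset is what makes replay possible.
        self._offsets[group] = offset


queue = Queue()
log = Log()
for event in ["signup", "upgrade", "cancel"]:
    queue.send(event)
    log.append(event)

queue_first = queue.receive()       # consumed once, now gone from the queue
billing = log.poll("billing")       # reads all three events
analytics = log.poll("analytics")   # reads the same events independently
log.seek("analytics", 0)            # e.g. after a crash: rewind and replay
replayed = log.poll("analytics")
```

Here `billing` and `analytics` each see every event, and the rewind on the `analytics` group returns the full history again, which the queue cannot do.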
Rejected: SQS (optimized)
FIFO capped at 300 msg/sec. No replay. Can't scale beyond simple queues.
Rejected: Self-managed Kafka
Team lacked production experience. "Two of five brokers down and they didn't know."
Selected: Managed Kafka
Confluent Cloud or Amazon MSK. Reduces operational burden, with governance built in.
Strangler Pattern Migration
Dual-write for two weeks, SQS queues intact during transition
Total Timeline: 12 Weeks
Foundation
Managed Kafka, 3-zone, RF3. Topic naming convention. Retention 7-30 days.
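A foundation like this is often captured as a small, validated topic specification. The sketch below is one way it might look; the naming pattern, partition count, and `min.insync.replicas` value are assumptions for illustration, not details from the case study. The `retention.ms`, `min.insync.replicas`, and `cleanup.policy` keys are standard Kafka topic settings.

```python
import re

# Hypothetical convention: <domain>.<entity>.<event>.v<version>
TOPIC_NAME = re.compile(r"[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){2}\.v[0-9]+")

DAY_MS = 24 * 60 * 60 * 1000


def topic_spec(name, retention_days, partitions=12):
    """Build a topic definition with RF3 and 7 to 30 days of retention."""
    if not TOPIC_NAME.fullmatch(name):
        raise ValueError(f"topic name {name!r} breaks the naming convention")
    if not 7 <= retention_days <= 30:
        raise ValueError("retention must be between 7 and 30 days")
    return {
        "name": name,
        "partitions": partitions,   # assumed default, not from the source
        "replication_factor": 3,    # RF3 across three zones
        "config": {
            "retention.ms": str(retention_days * DAY_MS),
            "min.insync.replicas": "2",  # assumption: a common pairing with RF3
            "cleanup.policy": "delete",
        },
    }


spec = topic_spec("billing.invoice.created.v1", retention_days=7)
```

Validating names and retention in one place keeps every team's topics inside the agreed limits before anything is created on the cluster.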
Publisher Migration
Shared library, idempotent producers, deployed to 4 upstream services.
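A shared publisher library of this kind usually centralizes two things: producer settings and how records are keyed. The sketch below assumes a librdkafka-based client such as `confluent-kafka` (the source does not name the client), and the helper names and the choice of entity ID as the key are hypothetical. `enable.idempotence` and `acks` are real producer settings.

```python
import json
import zlib


def producer_config(bootstrap_servers):
    """Producer settings for idempotent, fully acknowledged writes."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "enable.idempotence": True,  # broker de-duplicates producer retries
        "acks": "all",               # wait for all in-sync replicas
    }


def build_record(topic, entity_id, event_type, payload):
    """Key by entity ID so all events for one entity share a partition,
    which is what preserves their relative order."""
    body = {"type": event_type, "payload": payload}
    return {
        "topic": topic,
        "key": entity_id.encode("utf-8"),
        "value": json.dumps(body, sort_keys=True).encode("utf-8"),
    }


def illustrate_partition(key, num_partitions):
    # Illustration only: real clients pick the partition with their own
    # hash function. The point is that equal keys map to equal partitions.
    return zlib.crc32(key) % num_partitions


config = producer_config("broker-1:9092")  # hypothetical address
placed = build_record("orders.order.placed.v1", "order-42", "placed", {"total": 99})
paid = build_record("orders.order.placed.v1", "order-42", "paid", {"total": 99})
same_partition = (
    illustrate_partition(placed["key"], 12) == illustrate_partition(paid["key"], 12)
)
```

With `confluent-kafka`, the dictionary from `producer_config` would be passed to `Producer(config)` and each record sent with `produce(topic, key=..., value=...)`.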
Consumer Migration
Rewrote 3 consumers to Kafka. Dead letter topics for failed events.
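The dead letter pattern can be sketched without a broker: retry a failing record a bounded number of times, then publish it to a side topic so the rest of the partition keeps moving. The `.dlt` suffix, the retry limit, and the `publish` callback are assumptions for illustration.

```python
def process_with_dead_letter(records, handler, publish, max_attempts=3):
    """Process each record; after max_attempts failures, route it to a
    dead letter topic instead of blocking the records behind it."""
    dead_lettered = 0
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(record)
                break  # processed successfully, move to the next record
            except Exception as exc:
                if attempt == max_attempts:
                    publish(
                        record["topic"] + ".dlt",
                        {"original": record, "error": repr(exc), "attempts": attempt},
                    )
                    dead_lettered += 1
    return dead_lettered


def handler(record):
    if record["value"] == "bad":
        raise ValueError("cannot parse event")


sent = []
records = [
    {"topic": "orders", "value": "ok"},
    {"topic": "orders", "value": "bad"},
    {"topic": "orders", "value": "ok"},
]
count = process_with_dead_letter(records, handler, lambda topic, msg: sent.append((topic, msg)))
```

In this run only the unparseable record is routed to `orders.dlt`; the record after it is still processed.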
Cutover
Dual-wrote for 2 weeks. Verified lag under 1 s. Decommissioned SQS.
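A cutover gate for a dual-write period might compare what each path delivered and check the observed lag. The sketch below assumes each event carries a unique ID and that produce and consume timestamps are sampled; the function names and the way lag is measured are hypothetical, with only the 1 second threshold taken from the text.

```python
def compare_dual_write(sqs_event_ids, kafka_event_ids):
    """Report events seen on one path but not the other."""
    sqs_ids, kafka_ids = set(sqs_event_ids), set(kafka_event_ids)
    return {
        "missing_in_kafka": sorted(sqs_ids - kafka_ids),
        "missing_in_sqs": sorted(kafka_ids - sqs_ids),
    }


def max_lag_seconds(samples):
    """samples: (produced_at, consumed_at) timestamp pairs in seconds."""
    return max((consumed - produced for produced, consumed in samples), default=0.0)


def ready_for_cutover(sqs_event_ids, kafka_event_ids, samples, lag_limit=1.0):
    """Kafka must have everything SQS delivered, with lag under the limit."""
    diff = compare_dual_write(sqs_event_ids, kafka_event_ids)
    return not diff["missing_in_kafka"] and max_lag_seconds(samples) < lag_limit


ok = ready_for_cutover(["a", "b"], ["a", "b"], [(0.0, 0.2), (1.0, 1.4)])
gap = ready_for_cutover(["a", "b", "c"], ["a", "b"], [(0.0, 0.2)])
slow = ready_for_cutover(["a"], ["a"], [(0.0, 2.5)])
```

Only the first case passes: the second is missing an event on the Kafka path, and the third exceeds the lag limit.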
Final Architecture
| Metric | Before (SQS) | After (Kafka) |
|---|---|---|
| Data loss incidents (6 mo) | 3 | 0 |
| Event replay | None | 7-day retention |
| Add new consumer | 3 days | 3 hours |
| Ordering violations | 12 | 0 |
| Peak throughput | 8,000/sec | 187,000/sec |
| Cost per million events | $0.42 | $0.18 |
| P95 event latency | 2.3 sec | 48 ms |
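The relative improvements implied by the table can be worked out directly. The monthly figures below assume the cost per million events stays flat as volume grows, which the source does not state.

```python
before = {"cost_per_million": 0.42, "peak_per_sec": 8_000, "p95_ms": 2_300}
after = {"cost_per_million": 0.18, "peak_per_sec": 187_000, "p95_ms": 48}

# Relative changes between the two columns of the table.
cost_reduction_pct = (1 - after["cost_per_million"] / before["cost_per_million"]) * 100
throughput_gain = after["peak_per_sec"] / before["peak_per_sec"]
latency_gain = before["p95_ms"] / after["p95_ms"]


def monthly_cost(events_per_month, cost_per_million):
    """Event cost in dollars for one month at a flat per-million rate."""
    return events_per_month / 1_000_000 * cost_per_million


current_before = monthly_cost(1_200_000_000, before["cost_per_million"])
current_after = monthly_cost(1_200_000_000, after["cost_per_million"])
projected_after = monthly_cost(5_000_000_000, after["cost_per_million"])

print(f"cost per event: -{cost_reduction_pct:.1f}%")
print(f"peak throughput: {throughput_gain:.1f}x")
print(f"P95 latency: {latency_gain:.1f}x lower")
print(f"1.2B events/mo: ${current_before:.0f} -> ${current_after:.0f}")
print(f"5B events/mo on Kafka: ${projected_after:.0f}")
```

That works out to roughly a 57% lower cost per event, about 23x the peak throughput, and a P95 latency about 48x lower.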
- Black Friday: 4.2B events over 3 days, zero ordering failures
- SOC2 Ready: append-only log accepted as audit trail
- Engineering Time: 35 hrs/mo saved on firefighting
Kafka becomes necessary when you need event replay, multiple independent consumers, or ordering at scale.
The decision isn't triggered by volume alone: SQS handles high throughput for simple queues. The real triggers are architectural requirements: data loss is unacceptable (SOC2), coupling costs explode (more than 3 consumers), or ordering is required beyond 300 msg/sec.
| Trigger | Why Kafka | When It Appears |
|---|---|---|
| Event replay | Log-based retention | After first data loss |
| Multiple consumers | Consumer groups | When consumer count > 3 |
| Ordering at scale | Partition-level ordering | Beyond 300 msg/sec |
| Service decoupling | Independent subscription | Coordination > 20% eng. time |
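The table's thresholds can be read as a simple checklist. The sketch below encodes them as a function; the parameter names are hypothetical, and the thresholds are this case study's rules of thumb rather than general limits.

```python
def kafka_triggers(needs_replay, consumer_count, ordered_msgs_per_sec, coordination_share):
    """Return the triggers from the table that apply to a system.

    coordination_share is the fraction of engineering time spent
    coordinating between services (0.0 to 1.0).
    """
    triggers = []
    if needs_replay:
        triggers.append("event replay")
    if consumer_count > 3:
        triggers.append("multiple consumers")
    if ordered_msgs_per_sec > 300:
        triggers.append("ordering at scale")
    if coordination_share > 0.20:
        triggers.append("service decoupling")
    return triggers


case_study = kafka_triggers(
    needs_replay=True, consumer_count=6, ordered_msgs_per_sec=12_000, coordination_share=0.10
)
simple_job_queue = kafka_triggers(
    needs_replay=False, consumer_count=1, ordered_msgs_per_sec=50, coordination_share=0.05
)
```

An empty result, as for `simple_job_queue`, suggests a plain queue is still the better fit. The `coordination_share` value for the case study is a placeholder, since the source gives no figure for it.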
When Kafka is Premature
If you only need background job processing, retry logic, or simple pub/sub with one consumer, SQS or RabbitMQ are simpler. "Adding a complex system like Kafka for simple job queues was a common regret."