Case Study

When to Bring in Kafka

Requirements that force the move: how we identified the triggers that made SQS insufficient and migrated to managed Kafka.

0

Data Loss Incidents

187K

Msg/sec Peak

48ms

P95 Latency

3hr

Add New Consumer

01 / Context

A US-based SaaS company (Series B, 4 years old) built a microservices architecture handling 1.2 billion events per month across 15 microservices. It initially relied on SQS plus direct HTTP calls between services. By Series B, cracks appeared as the system scaled.

02 / Problem

Lost Events

SQS provides at-least-once delivery but no replay: once a message was deleted or aged out of the queue, it could not be recovered. 2.3 million events were lost during an analytics crash.

2.3M

Coupling Nightmare

6 consumers required 6 separate publishers. Adding a 7th consumer meant changing 5 upstream services.

6 publishers

Ordering Failures

SQS FIFO capped at 300 msg/sec (the default limit without batching). Black Friday hit 12,000 events/sec, causing events to be processed in the wrong order.

12K/sec

03 / Constraints

Compliance

SOC2: no data loss

Downtime

Zero during migration

Team

5 engineers, no Kafka exp.

Cost

Reduce cost/event

Scale

5B events/mo by 2026

04 / Approach

Kafka Maturity Curve

Level 1

Experiments

Level 2

Multiple teams

Level 3

Tactical use (client was here)

Level 4

Platform

Level 5

Critical infra

Queues (SQS)

  • Message consumed once, then gone
  • Ideal for background jobs
  • No replayability

Streams (Kafka)

  • Messages persist, can be replayed
  • Multiple consumer groups
  • Enables event sourcing
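The difference between the two models can be sketched in a few lines. This is an in-memory illustration only, not SQS or Kafka client code; the `Queue` and `Log` classes and the group names are hypothetical.

```python
from collections import deque


class Queue:
    """Queue semantics: each message is handed to one consumer, then gone."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Removing on receive stands in for an SQS receive followed by delete.
        return self._messages.popleft() if self._messages else None


class Log:
    """Stream semantics: an append-only log with one offset per consumer group."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer group -> next offset to read

    def append(self, record):
        self._records.append(record)

    def poll(self, group):
        offset = self._offsets.get(group, 0)
        if offset >= len(self._records):
            return None
        self._offsets[group] = offset + 1
        return self._records[offset]

    def seek(self, group, offset):
        # Replay: rewind one group's offset; the records are still in the log.
        self._offsets[group] = offset


queue, log = Queue(), Log()
for event in ["signup", "purchase"]:
    queue.send(event)
    log.append(event)

queue_first = queue.receive()            # removed from the queue for good
analytics_first = log.poll("analytics")
billing_first = log.poll("billing")      # billing reads independently
log.seek("analytics", 0)                 # analytics rewinds to the start
analytics_replayed = log.poll("analytics")
```

Both groups read the same first record, and rewinding one group does not affect the other. That independence is what lets a new consumer be added without touching the publishers.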

Rejected: SQS (optimized)

FIFO capped at 300 msg/sec. No replay. Can't scale beyond simple queues.

Rejected: Self-managed Kafka

Team lacked production experience. "Two of five brokers down and they didn't know."

Selected: Managed Kafka

Confluent Cloud or AWS MSK. Reduces operational burden, governance built-in.

05 / Implementation

Strangler Pattern Migration

Dual-write for two weeks, SQS queues intact during transition

12 Weeks

Total Timeline

Week 1-3

Foundation

Managed Kafka across 3 availability zones with replication factor 3 (RF3). Topic naming convention. Retention of 7-30 days per topic.
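A minimal sketch of how these foundation choices might be captured as a topic definition. The case study does not state the actual naming convention, partition count, or `min.insync.replicas` value, so those are assumptions here. `retention.ms` and `min.insync.replicas` are standard Kafka topic settings.

```python
import re

MS_PER_DAY = 24 * 60 * 60 * 1000

# Hypothetical convention: <domain>.<event>.v<version>, e.g. "orders.created.v1".
TOPIC_NAME = re.compile(r"^[a-z][a-z0-9-]*\.[a-z][a-z0-9-]*\.v[0-9]+$")


def topic_spec(name, retention_days, partitions=12):
    """Build a topic definition following the Week 1-3 foundation choices."""
    if not TOPIC_NAME.match(name):
        raise ValueError(f"topic name {name!r} does not match the convention")
    if not 7 <= retention_days <= 30:
        raise ValueError("retention must be between 7 and 30 days")
    return {
        "name": name,
        "partitions": partitions,     # assumed value, not from the case study
        "replication_factor": 3,      # RF3 across three zones
        "config": {
            "retention.ms": str(retention_days * MS_PER_DAY),
            # Assumed setting: with RF3, requiring 2 in-sync replicas lets
            # writes continue when one broker is unavailable.
            "min.insync.replicas": "2",
        },
    }


spec = topic_spec("orders.created.v1", retention_days=7)
```

Validating names and retention in one helper keeps every team's topics inside the agreed limits before anything is created on the cluster.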

Week 4-6

Publisher Migration

Shared library, idempotent producers, deployed to 4 upstream services.
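A shared publishing library of this kind might look roughly like the sketch below. The function names, topic, and key are hypothetical. `enable.idempotence` and `acks` are standard Kafka producer settings. The hash is a stand-in, since real clients ship their own partitioners.

```python
import json
import zlib


def producer_config(bootstrap_servers):
    """Settings a shared publishing library might apply to every producer."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "enable.idempotence": True,  # broker discards duplicates caused by retries
        "acks": "all",               # wait for all in-sync replicas
    }


def partition_for(key, num_partitions):
    # Stand-in hash: real Kafka clients use their own partitioners, but the
    # property shown is the same: equal keys always map to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_record(topic, key, event):
    """Serialize an event and attach the key that determines its partition."""
    return {
        "topic": topic,
        "key": key.encode("utf-8"),
        "value": json.dumps(event, sort_keys=True).encode("utf-8"),
    }


record = build_record("orders.created.v1", "customer-42", {"order_id": 1001})
same_partition = partition_for("customer-42", 12) == partition_for("customer-42", 12)
```

Keying by an entity such as a customer ID is what gives per-entity ordering: all events for one key land on one partition, while different keys spread across partitions for throughput.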

Week 7-9

Consumer Migration

Rewrote 3 consumers to Kafka. Dead letter topics for failed events.
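The dead letter topic pattern can be sketched as follows. This is an in-memory illustration with injected `handler` and `publish` callables. The retry count and topic name are assumptions, not details from the migration.

```python
def consume(records, handler, publish, dead_letter_topic, max_attempts=3):
    """Process records in order; route records that keep failing to a DLT."""
    processed, dead_lettered = 0, 0
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(record)
                processed += 1
                break
            except Exception as error:
                if attempt == max_attempts:
                    # Keep the payload and the failure reason for later replay.
                    publish(dead_letter_topic, {"record": record, "error": str(error)})
                    dead_lettered += 1
    return processed, dead_lettered


dead_letters = []


def handler(record):
    if "user_id" not in record:
        raise ValueError("missing user_id")


counts = consume(
    [{"user_id": 1}, {"oops": True}, {"user_id": 2}],
    handler,
    publish=lambda topic, message: dead_letters.append((topic, message)),
    dead_letter_topic="analytics.events.dlt",  # hypothetical topic name
)
```

The malformed record is set aside after the final attempt and the consumer moves on, so one bad event does not block the partition or get silently dropped.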

Week 10-12

Cutover

Dual-write 2 weeks. Verified <1s lag. Decommissioned SQS.
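A rough sketch of the dual-write period and the check before decommissioning. It assumes SQS remains the source of truth until cutover, which the case study implies but does not spell out. The class, function names, and sample lag values are hypothetical.

```python
class DualWriter:
    """Publish each event to the legacy queue and to Kafka during the transition."""

    def __init__(self, send_to_sqs, send_to_kafka):
        self._send_to_sqs = send_to_sqs
        self._send_to_kafka = send_to_kafka

    def publish(self, event):
        # Assumption: SQS stays authoritative until cutover, so it is written
        # first and a Kafka failure is reported rather than raised.
        self._send_to_sqs(event)
        try:
            self._send_to_kafka(event)
            return True
        except Exception:
            return False


def ready_for_cutover(sqs_ids, kafka_ids, lags, max_lag=1.0):
    """Both paths saw the same events and every observed lag is under the limit."""
    return set(sqs_ids) == set(kafka_ids) and all(lag < max_lag for lag in lags)


sqs_log, kafka_log = [], []
writer = DualWriter(sqs_log.append, kafka_log.append)
for event_id in ("e1", "e2", "e3"):
    writer.publish(event_id)

# Sample lag values in seconds, made up for illustration.
cutover_ok = ready_for_cutover(sqs_log, kafka_log, lags=[0.12, 0.40, 0.08])
```

Comparing the two paths event by event is what makes it safe to switch off the old queues: any gap shows up as a mismatch before SQS is removed.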

Final Architecture

Microservices → Kafka → Consumer Groups

06 / Results

Metric                     | Before (SQS) | After (Kafka)
Data loss incidents (6 mo) | 3            | 0
Event replay               | None         | 7-day retention
Add new consumer           | 3 days       | 3 hours
Ordering violations        | 12           | 0
Peak throughput            | 8,000/sec    | 187,000/sec
Cost per million events    | $0.42        | $0.18
P95 event latency          | 2.3 sec      | 48 ms
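To put the cost row in monthly terms, the arithmetic below applies the per-million rates to the volumes stated elsewhere in this case study. It assumes the rate holds uniformly at every volume, which real pricing may not. The function name is illustrative.

```python
from decimal import Decimal


def monthly_cost(events_per_month, cost_per_million):
    """Monthly spend in dollars at a flat cost per million events."""
    return Decimal(events_per_month) / Decimal(1_000_000) * Decimal(cost_per_million)


current_volume = 1_200_000_000   # 1.2 billion events per month (Context)
target_volume = 5_000_000_000    # 5B events per month by 2026 (Constraints)

sqs_now = monthly_cost(current_volume, "0.42")
kafka_now = monthly_cost(current_volume, "0.18")
sqs_target = monthly_cost(target_volume, "0.42")
kafka_target = monthly_cost(target_volume, "0.18")
```

Under that assumption, current volume works out to $504 per month before and $216 after. At the 5B target it would be $2,100 versus $900.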

Black Friday

4.2B events

3 days, zero ordering failures

SOC2 Ready

Audit trail

Append-only log accepted

Engineering Time

35 hrs/mo

Saved on firefighting

07 / Key Insight

Kafka becomes necessary when you need event replay, multiple independent consumers, or ordering at scale.

The decision isn't triggered by volume alone: SQS handles high throughput for simple queues. The real triggers are architectural requirements: data loss is unacceptable (SOC2), coupling costs explode (more than 3 consumers), or ordering is required beyond 300 msg/sec.

Trigger            | Why Kafka                | When It Appears
Event replay       | Log-based retention      | After first data loss
Multiple consumers | Consumer groups          | When consumer count > 3
Ordering at scale  | Partition-level ordering | Beyond 300 msg/sec
Service decoupling | Independent subscription | Coordination > 20% eng. time
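The triggers in the table can be expressed as a simple checklist. The thresholds come from the table. The `Workload` type, its field names, and the coordination figures in the examples are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_replay: bool
    independent_consumers: int
    ordered_msgs_per_sec: int
    coordination_share: float  # fraction of engineering time spent coordinating


def kafka_triggers(workload):
    """Return the triggers from the table that this workload has hit."""
    triggers = []
    if workload.needs_replay:
        triggers.append("event replay")
    if workload.independent_consumers > 3:
        triggers.append("multiple consumers")
    if workload.ordered_msgs_per_sec > 300:
        triggers.append("ordering at scale")
    if workload.coordination_share > 0.20:
        triggers.append("service decoupling")
    return triggers


# Roughly the client's situation; the coordination share is a made-up value.
client = kafka_triggers(Workload(True, 6, 12_000, 0.10))
# A plain background job queue with a single consumer.
job_queue = kafka_triggers(Workload(False, 1, 50, 0.05))
```

The client's workload hits three triggers, while the job queue hits none. The second case is the one where staying on a simpler queue is the better choice.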

When Kafka is Premature

If you only need background job processing, retry logic, or simple pub/sub with one consumer, SQS or RabbitMQ are simpler. "Adding a complex system like Kafka for simple job queues was a common regret."
