INFRASTRUCTURE CASE STUDY

Kafka Event Streaming Architecture for High-Throughput SaaS

SQS couldn't deliver the same events to multiple independent consumers, which led to message loss and tight coupling between services.

Deployed managed Kafka with topic partitioning, consumer groups, and schema registry. Implemented dual-write migration from SQS with zero data loss.

Data Loss: eliminated

Msg/sec: 187K

New Consumer: 3hr

Context

Floqer, a Canada-based CRM data enrichment platform processing 1.2 billion events per month, needed to migrate from SQS to Kafka to get a scalable event-streaming architecture.

Problem

SQS FIFO queues couldn't serve multiple consumers that each needed to process the same events independently. Every consuming service required its own queue, which duplicated messages and added cost and operational overhead. Services were tightly coupled through direct SQS integrations, which prevented independent deployments and created single points of failure. When downstream consumers fell behind, messages aged past the queue's retention limit and were lost, which is unacceptable for a system powering CRM enrichment pipelines.

Constraints

The migration had to lose no data: the system processed 1.2B events monthly with no tolerance for gaps. New consumers had to integrate within hours, not weeks. The team had no prior Kafka experience, so comprehensive knowledge transfer was required. The budget required costs to go down, not up.

Approach

An SQS-to-Kafka migration isn't just a technology swap; it requires rethinking event design and consumer patterns. We chose managed Kafka for operational simplicity but implemented custom partitioning and consumer group strategies. The key was designing events for multiple independent consumers from day one, rather than retrofitting SQS patterns onto Kafka.
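The difference between a queue and a log that this approach depends on can be sketched in a few lines. What follows is a toy in-memory model, not Kafka client code: the `EventLog` class, the group names, and the event fields are all illustrative assumptions. It shows only the core idea that each consumer group keeps its own read position over a shared, retained log.

```python
from collections import defaultdict


class EventLog:
    """Toy append-only log: reads never delete events, unlike a queue."""

    def __init__(self):
        self._events = []                 # retained, ordered events
        self._offsets = defaultdict(int)  # next offset to read, per consumer group

    def append(self, event):
        self._events.append(event)

    def poll(self, group, max_events=10):
        """Return the next batch for `group` and advance only that group's offset."""
        start = self._offsets[group]
        batch = self._events[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch


log = EventLog()
for i in range(5):
    log.append({"seq": i, "type": "contact.updated"})  # hypothetical event shape

# "analytics" reads everything; "enrichment" is slow and reads only two events.
analytics_seen = log.poll("analytics")
enrichment_seen = log.poll("enrichment", max_events=2)

# A group added later starts at offset 0 and replays the full history.
notifications_seen = log.poll("notifications")
```

Because reading does not remove events, the slow `enrichment` group has no effect on `analytics`, and `notifications` can be added afterwards and still see all five events.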

Implementation

Topics were partitioned by customer_id to preserve event ordering per customer while enabling parallel processing across partitions. A consumer group was configured for each downstream service (analytics, enrichment, notifications), allowing independent consumption without coordination. A schema registry enforced event contracts, preventing breaking changes from propagating to consumers. The dual-write migration sent events to both SQS and Kafka for 30 days, with traffic gradually shifted to Kafka once all consumers had validated data consistency. For critical enrichment pipelines we enabled idempotent producers as the basis for exactly-once semantics; idempotence removes duplicates caused by producer retries, while end-to-end exactly-once processing also depends on transactions or idempotent handling on the consumer side.
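Why keying by customer_id preserves per-customer ordering can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the project's configuration: the partition count, customer IDs, and action names are hypothetical, and `zlib.crc32` stands in for the murmur2 hash that Kafka's default partitioner applies to keys.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 6  # hypothetical; the case study does not state a partition count


def partition_for(customer_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition deterministically.

    Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32.
    """
    return zlib.crc32(customer_id.encode("utf-8")) % num_partitions


# Each partition is an ordered list; events are appended in arrival order.
partitions = defaultdict(list)
events = [
    ("cust-17", "created"),
    ("cust-42", "created"),
    ("cust-17", "enriched"),
    ("cust-42", "enriched"),
    ("cust-17", "notified"),
]
for customer_id, action in events:
    partitions[partition_for(customer_id)].append((customer_id, action))


def history(customer_id: str) -> list:
    """Actions for one customer, in the order its partition stores them."""
    partition = partitions[partition_for(customer_id)]
    return [action for cid, action in partition if cid == customer_id]
```

Every event for a given customer lands in the same partition, so a single consumer reads that customer's events in order, while different customers can be spread across partitions and processed in parallel.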

Results

Data loss was eliminated entirely, with every message delivered. Throughput scaled to 187,000 messages per second, handling burst traffic without degradation. New consumers now onboard in 3 hours instead of 3 weeks, because Kafka's replay capability lets them consume historical data independently. Cost per million messages dropped 60% compared to SQS at this scale.
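To put the throughput figure in context, the monthly volume can be converted to an average rate. The arithmetic below assumes a 30-day month, that "events" and "messages" are the same unit, and that 187,000 messages per second is a burst peak rather than a sustained rate; none of these assumptions is stated in the figures above.

```python
EVENTS_PER_MONTH = 1_200_000_000       # from the case study
PEAK_MSGS_PER_SEC = 187_000            # from the case study
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month

# Average rate if the monthly volume were spread evenly over time.
average_per_sec = EVENTS_PER_MONTH / SECONDS_PER_MONTH

# How far the reported peak sits above that average.
peak_to_average = PEAK_MSGS_PER_SEC / average_per_sec

print(f"average: {average_per_sec:.0f} events/sec")
print(f"peak-to-average ratio: {peak_to_average:.0f}x")
```

Under these assumptions the average works out to roughly 463 events per second, so the reported peak is about 404 times the average, which fits the description of absorbing bursts without degradation.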

Key Insight

The bottleneck wasn't Kafka; it was event design. SQS encourages a simple fire-and-forget model, while Kafka demands intentional schema evolution and consumer independence. Teams that treat Kafka as a drop-in replacement for queues miss 80% of its value.
