EKS vs. Serverless
When to leave Lambda behind: how migrating from Lambda + Step Functions to EKS on Fargate cut monthly compute costs by 42% and improved reliability.
3.8min
P95 Job Time
42%
Cost Reduction
93%
Reliability Gain
4x
Deploy Frequency
A US-based B2B SaaS startup (Series A, 18 months old) built a data-enrichment platform. Customers uploaded CSV files with hundreds of thousands of records; the platform enriched each record by calling external APIs, geocoding addresses, and applying ML models. The service processed 5 million enrichment requests per day, with up to 200 upload jobs running concurrently at peak.
Cold Start Latency
The enrichment function carried heavy dependencies (pandas, scikit-learn, geocoding). Cold starts averaged 2.8 seconds.
Execution Time Limits
15-minute Lambda timeout forced complex chunking. Step Functions grew to 200+ states.
Cost Inefficiency
Lambda costs reached $11,000/month + $2,500/month for Step Functions.
Error Recovery
Failed invocations required re-executing entire batches from scratch.
Zero Downtime
24/7 customer uploads
Team
3 backend engineers, no K8s experience
Budget
$25K/mo and growing
Compliance
Region-specific data requirements
Rejected: Optimize Serverless
SnapStart wasn't available for Python at the time, and it wouldn't have solved the timeout or cost issues.
Rejected: ECS on EC2
Team wanted to avoid managing scaling, patching, capacity planning.
Selected: EKS + Fargate
No node management, containers for long-running jobs, Kubernetes Jobs for orchestration.
Why EKS + Fargate?
Long-Running Jobs
Containers have no execution time limit, so an entire file can be processed in one pod
Cost Model
Fargate bills per second for the vCPU and memory a pod requests. 30-40% cheaper than Lambda for sustained workloads
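The comparison behind a claim like this can be sketched as a small cost model. This is a rough illustration, not the team's actual calculation: the rate constants below are assumed placeholder values (replace them with current published AWS prices for your region), and the function names are hypothetical. Where the break-even falls depends heavily on job duration, memory size, and how much idle time each pod accumulates.

```python
# Assumed, illustrative rates only. Replace with current AWS pricing for your region.
LAMBDA_USD_PER_GB_SECOND = 0.0000166667
LAMBDA_USD_PER_REQUEST = 0.20 / 1_000_000
FARGATE_USD_PER_VCPU_HOUR = 0.04048
FARGATE_USD_PER_GB_HOUR = 0.004445


def lambda_cost(invocations: int, avg_seconds: float, memory_gb: float) -> float:
    """Lambda: pay per request plus per GB-second of configured memory."""
    compute = invocations * avg_seconds * memory_gb * LAMBDA_USD_PER_GB_SECOND
    requests = invocations * LAMBDA_USD_PER_REQUEST
    return compute + requests


def fargate_cost(task_seconds: float, vcpu: float, memory_gb: float) -> float:
    """Fargate: pay per second for the vCPU and memory the pod requests."""
    hours = task_seconds / 3600
    return hours * (vcpu * FARGATE_USD_PER_VCPU_HOUR + memory_gb * FARGATE_USD_PER_GB_HOUR)


if __name__ == "__main__":
    # Hypothetical workload: the same total compute time on both platforms.
    total_seconds = 5_000_000 * 2.0  # 5M invocations/day at an assumed 2 s each
    print(f"Lambda/day:  ${lambda_cost(5_000_000, 2.0, 2.0):,.2f}")
    print(f"Fargate/day: ${fargate_cost(total_seconds, 1.0, 2.0):,.2f}")
```

The model ignores Step Functions state-transition charges, data transfer, and pod startup overhead, all of which should be added before drawing conclusions for a real workload.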
Orchestration
Kubernetes Jobs replaced the 200+ state Step Functions workflow. GitOps for deployments.
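To make the "one Job per uploaded file" idea concrete, here is a minimal sketch of a function that builds a `batch/v1` Job manifest as a plain dictionary. The function name, label, container arguments, and default resource sizes are all hypothetical; the case study does not describe the actual manifest. Only standard Job fields (`backoffLimit`, `ttlSecondsAfterFinished`, `restartPolicy`, `resources`) are used.

```python
import json


def build_enrichment_job(job_id: str, image: str, input_uri: str,
                         cpu: str = "1", memory: str = "2Gi") -> dict:
    """Build a Kubernetes Job manifest for processing one uploaded file.

    job_id must be a valid DNS label (lowercase alphanumerics and '-').
    All names and sizes here are illustrative assumptions.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"enrich-{job_id}",
            "labels": {"app": "enrichment"},
        },
        "spec": {
            "backoffLimit": 2,                # retry a failed pod at most twice
            "ttlSecondsAfterFinished": 3600,  # clean up finished Jobs after an hour
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        "args": ["--input", input_uri],
                        "resources": {
                            "requests": {"cpu": cpu, "memory": memory},
                            "limits": {"cpu": cpu, "memory": memory},
                        },
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = build_enrichment_job("abc123", "example.com/enrich:1.0",
                                    "s3://example-bucket/upload.csv")
    print(json.dumps(manifest, indent=2))
```

A manifest like this can be piped to `kubectl apply -f -` as JSON, or passed as the `body` to `BatchV1Api.create_namespaced_job` in the official Kubernetes Python client.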
Operational Overhead
Fargate eliminates node management. 4-week learning investment.
Strangler Pattern Migration
8-week parallel run, both architectures serving traffic
8 Weeks
Total Timeline
Containerization
Docker image with FastAPI, EKS cluster with Fargate, internal Service
Orchestration Layer
Lambda creates Kubernetes Jobs, OpenTelemetry sidecar for observability
Traffic Shift
Feature flag: 10% → 50% → 100%. Lambda path as fallback.
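A percentage rollout like this is commonly implemented with deterministic bucketing, so that a given customer stays on the same path as the percentage grows. The sketch below is one way to do it, not necessarily what the team built: `routes_to_eks` and `dispatch` are hypothetical names, bucketing by customer ID is an assumption, and the automatic fallback on error is one possible reading of "Lambda path as fallback".

```python
import hashlib
from typing import Callable, TypeVar

T = TypeVar("T")


def routes_to_eks(customer_id: str, rollout_percent: int) -> bool:
    """Map a customer to a stable bucket in [0, 100) and compare to the flag value."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


def dispatch(customer_id: str, rollout_percent: int,
             run_on_eks: Callable[[], T], run_on_lambda: Callable[[], T]) -> T:
    """Send the job to EKS if the customer is in the rollout; fall back to Lambda on error."""
    if routes_to_eks(customer_id, rollout_percent):
        try:
            return run_on_eks()
        except Exception:
            # Assumed behavior: the legacy path stays available as a safety net.
            return run_on_lambda()
    return run_on_lambda()


if __name__ == "__main__":
    for percent in (10, 50, 100):
        share = sum(routes_to_eks(f"customer-{i}", percent) for i in range(10_000)) / 100
        print(f"flag={percent}% -> {share:.1f}% of customers routed to EKS")
```

Because the bucket depends only on the customer ID, every customer included at 10% is still included at 50% and 100%, which keeps behavior consistent for each customer during the parallel run.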
Optimization
Resource tuning, HPA based on queue length, graceful termination
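Scaling on queue length follows the same arithmetic the Horizontal Pod Autoscaler applies to an average-value target: divide the total backlog by the target backlog per pod, round up, and clamp to the configured bounds. The sketch below reproduces that calculation in isolation; the function name and the example numbers are assumptions, and the real HPA also applies tolerances and stabilization windows that are omitted here.

```python
import math


def desired_replicas(queue_length: int, target_per_pod: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so that each pod handles about target_per_pod queued items."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    wanted = math.ceil(queue_length / target_per_pod)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    print(desired_replicas(95, 10, 1, 50))      # 10: ceil(95 / 10)
    print(desired_replicas(0, 10, 1, 50))       # 1: never below the minimum
    print(desired_replicas(10_000, 10, 1, 50))  # 50: capped at the maximum
```

Pairing this with graceful termination matters: when the backlog shrinks and pods are scaled in, each worker needs enough time after SIGTERM to finish or checkpoint its current chunk.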
Final Architecture
| Metric | Before (Lambda) | After (EKS) |
|---|---|---|
| Job completion (p95) | 14.2 min | 3.8 min |
| Failed jobs per 10k | 47 | 3 |
| Monthly compute cost | $13,500 | $7,800 |
| Cold starts | 2.8s | None |
| Deploys per week | 3 | 12 |
| Eng time on orchestration bugs | 40% | 5% |
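
The headline figures follow directly from the table above. As a quick check, the short script below recomputes them from the before/after values; the helper name is just for illustration.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100


cost_saved_usd = 13_500 - 7_800                    # 5,700 per month
cost_reduction = percent_reduction(13_500, 7_800)  # about 42.2%
failure_reduction = percent_reduction(47, 3)       # about 93.6%
deploy_multiplier = 12 / 3                         # 4.0x
p95_speedup = 14.2 / 3.8                           # about 3.7x

if __name__ == "__main__":
    print(f"Cost: ${cost_saved_usd:,}/mo saved ({cost_reduction:.1f}% reduction)")
    print(f"Failed jobs: {failure_reduction:.1f}% fewer")
    print(f"Deploys: {deploy_multiplier:.0f}x more per week")
    print(f"P95 job time: {p95_speedup:.1f}x faster")
```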
Cost Savings
$5,700/mo
42% reduction
Reliability
93%
Fewer failed jobs
Velocity
4x
More deployments
Serverless is ideal for an MVP. Containers are ideal for scale.
Once a workload becomes predictable, high-volume, or long-running, Lambda's per-invocation pricing and 15-minute limit turn into cost and reliability burdens. EKS + Fargate offers a middle ground: operational simplicity with container flexibility.
Stay with Serverless
- Workload is spiky or unpredictable
- Execution under 15 minutes
- Cold starts acceptable
- No container experience
Consider EKS
- Workloads are sustained, predictable
- Lambda costs exceed container equivalent
- Need fine-grained runtime control
- Team can invest 2-4 weeks in K8s