Case Study

EKS vs. Serverless

When to leave Lambda behind: how migrating from Lambda + Step Functions to EKS reduced costs by 42% and improved reliability.

3.8min

P95 Job Time

42%

Cost Reduction

93%

Reliability Gain

4x

Deploy Frequency

01 / Context

A US-based B2B SaaS startup (Series A, 18 months) built a data-enrichment platform. Customers uploaded CSV files with hundreds of thousands of records; the platform enriched each record by calling external APIs, geocoding, and applying ML models. The service processed 5 million enrichment requests per day, with peak concurrency of 200 simultaneous upload jobs.

02 / Problem

Cold Start Latency

The enrichment function carried heavy dependencies (Pandas, scikit-learn, geocoding libraries). Cold starts averaged 2.8 seconds.

2.8s

Execution Time Limits

Lambda's 15-minute timeout forced large files to be split into chunks. The Step Functions state machine grew to 200+ states.

15min limit
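
To illustrate why the timeout drives chunking, here is a minimal sketch of the arithmetic. The per-record processing time and the safety margin are hypothetical values chosen for illustration; the case study does not state them.

```python
import math

LAMBDA_TIMEOUT_S = 15 * 60  # Lambda's maximum execution time in seconds


def chunks_needed(total_records: int, seconds_per_record: float,
                  safety_margin: float = 0.8) -> int:
    """Estimate how many Lambda invocations one file needs.

    safety_margin (an assumption) keeps each chunk under the hard timeout
    so slow external API calls do not kill the invocation mid-chunk.
    """
    budget_s = LAMBDA_TIMEOUT_S * safety_margin
    records_per_chunk = max(1, int(budget_s // seconds_per_record))
    return math.ceil(total_records / records_per_chunk)


# Hypothetical file: 500,000 records at 0.5 s per record.
print(chunks_needed(500_000, 0.5))
```

Every chunk needs its own fan-out, retry, and merge handling in the state machine, which is one plausible way the state count climbs as files get larger.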

Cost Inefficiency

Lambda costs reached $11,000/month + $2,500/month for Step Functions.

$13.5K/mo

Error Recovery

Failed invocations required re-executing entire batches from scratch.

No granular retry

40%

of engineering time spent fire-fighting Step Functions and Lambda concurrency limits

03 / Constraints

Zero Downtime

24/7 customer uploads

Team

3 backend engineers, no K8s experience

Budget

$25K/mo and growing

Compliance

Region-specific data requirements

04 / Approach

Rejected: Optimize Serverless

SnapStart wasn't available for Python at the time, and it wouldn't have solved the timeout or cost issues.

Rejected: ECS on EC2

Team wanted to avoid managing scaling, patching, capacity planning.

Selected: EKS + Fargate

No node management, containers for long-running jobs, Kubernetes Jobs for orchestration.

Why EKS + Fargate?

Long-Running Jobs

Containers have no execution time limit, so an entire file can be processed in one pod

Cost Model

Fargate bills per vCPU-second and per GB-second of memory. Roughly 30-40% cheaper than Lambda for sustained workloads

Orchestration

Kubernetes Jobs replaced 200-state Step Functions. GitOps for deployments.

Operational Overhead

Fargate eliminates node management. 4-week learning investment.

05 / Implementation

Strangler Pattern Migration

8-week parallel run, both architectures serving traffic

8 Weeks

Total Timeline

Week 1-2

Containerization

Docker image with FastAPI, EKS cluster with Fargate, internal Service

Week 3-4

Orchestration Layer

Lambda creates Kubernetes Jobs, OpenTelemetry sidecar for observability
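
As a rough sketch of what the trigger Lambda might submit, the function below builds a `batch/v1` Job manifest as a plain dictionary. The job name prefix, namespace, labels, environment variable, and resource requests are all hypothetical; the case study does not publish the actual manifest.

```python
import json
import re


def build_enrichment_job(upload_id: str, s3_key: str, image: str) -> dict:
    """Return a batch/v1 Job manifest for one uploaded file (illustrative names)."""
    # Kubernetes object names must be lowercase DNS labels, so sanitize the ID.
    safe_id = re.sub(r"[^a-z0-9-]", "-", upload_id.lower())[:40].strip("-")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"enrich-{safe_id}",
            "namespace": "enrichment",  # assumed to match a Fargate profile selector
            "labels": {"app": "enrichment-worker"},
        },
        "spec": {
            "backoffLimit": 3,                # retry a failed pod up to 3 times
            "ttlSecondsAfterFinished": 3600,  # garbage-collect finished Jobs after 1 hour
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",
                            "image": image,
                            "env": [{"name": "INPUT_S3_KEY", "value": s3_key}],
                            "resources": {
                                "requests": {"cpu": "1", "memory": "2Gi"}
                            },
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    job = build_enrichment_job("Upload_42", "uploads/a.csv", "example/worker:1")
    print(json.dumps(job, indent=2))
```

With the official Kubernetes Python client, a dictionary like this can be passed as the `body` argument of `BatchV1Api.create_namespaced_job`; authentication from Lambda to the EKS API is left out of the sketch.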

Week 5-6

Traffic Shift

Feature flag: 10% → 50% → 100%. Lambda path as fallback.
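
One common way to implement a percentage rollout like this is deterministic hash bucketing, sketched below. The function name and the choice of upload ID as the routing key are assumptions, not details from the project.

```python
import hashlib


def routes_to_eks(upload_id: str, rollout_percent: int) -> bool:
    """Decide whether an upload goes to the EKS path or stays on Lambda.

    Hashing the ID gives each upload a stable bucket in [0, 100), so an
    upload routed to EKS at 10% is still routed to EKS at 50% and 100%.
    """
    digest = hashlib.sha256(upload_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


if __name__ == "__main__":
    for percent in (10, 50, 100):
        routed = sum(routes_to_eks(f"upload-{i}", percent) for i in range(10_000))
        print(f"{percent}% flag -> {routed} of 10000 uploads on EKS")
```

Because the bucket is stable, raising the flag only moves additional uploads onto EKS, and setting it back to 0 sends everything to the Lambda fallback.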

Week 7-8

Optimization

Resource tuning, HPA based on queue length, graceful termination
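
For queue-based scaling, the Horizontal Pod Autoscaler targets an average metric value per pod and rounds up. The sketch below mirrors that calculation; the target of 10 queued items per pod and the replica bounds are hypothetical, and the real autoscaler also applies tolerances and stabilization windows that are omitted here.

```python
import math


def desired_replicas(queue_length: int, target_per_pod: int,
                     min_replicas: int = 1, max_replicas: int = 200) -> int:
    """Approximate the HPA result for a queue-length metric.

    desired = ceil(queue_length / target_per_pod), clamped to the
    configured bounds (the bounds here are assumptions).
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    raw = math.ceil(queue_length / target_per_pod)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    for backlog in (0, 45, 5000):
        print(backlog, "queued ->", desired_replicas(backlog, 10), "replicas")
```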

Final Architecture

Upload

API Gateway

Lambda (Trigger)

K8s Job

S3 + DynamoDB

06 / Results

Metric	Before (Lambda)	After (EKS)
Job completion (p95)	14.2 min	3.8 min
Failed jobs per 10k	47	3
Monthly compute cost	$13,500	$7,800
Cold starts	2.8s	None
Deploys per week	3	12
Eng time on orchestration bugs	40%	5%
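
The summary figures can be rederived from the table with a few lines of arithmetic, shown here as a quick check using only the table's own numbers.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100


cost_saved = 13_500 - 7_800                       # dollars per month
cost_reduction = percent_reduction(13_500, 7_800)
failure_reduction = percent_reduction(47, 3)      # failed jobs per 10k
p95_reduction = percent_reduction(14.2, 3.8)      # minutes
deploy_multiple = 12 / 3                          # deploys per week

print(f"Cost: ${cost_saved}/mo saved, {cost_reduction:.1f}% reduction")
print(f"Failed jobs: {failure_reduction:.1f}% fewer")
print(f"P95 job time: {p95_reduction:.1f}% faster")
print(f"Deploy frequency: {deploy_multiple:.0f}x")
```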

Cost Savings

$5,700/mo

42% reduction

Reliability

93%

Fewer failed jobs

Velocity

4x

More deployments

07 / Key Insight

Serverless is ideal for MVP. Containers are ideal for scale.

Once a workload becomes predictable, high-volume, or long-running, Lambda's per-invocation and per-duration pricing and its 15-minute limit become cost and reliability burdens. EKS + Fargate offers a middle ground: operational simplicity with container flexibility.

Stay with Serverless

• Workload is spiky or unpredictable
• Execution under 15 minutes
• Cold starts acceptable
• No container experience

Consider EKS

• Workloads are sustained, predictable
• Lambda costs exceed container equivalent
• Need fine-grained runtime control
• Team can invest 2-4 weeks in K8s
