Case Study

EKS vs. Serverless

When to leave Lambda behind: how migrating from Lambda + Step Functions to EKS reduced compute costs by 42% and cut failed jobs by 93%.

3.8 min

P95 Job Time

42%

Cost Reduction

93%

Fewer Failed Jobs

4x

Deploy Frequency

01 / Context

A US-based B2B SaaS startup (Series A, 18 months) built a data-enrichment platform. Customers uploaded CSV files with hundreds of thousands of records; the platform enriched each record by calling external APIs, geocoding, and applying ML models. The service processed 5 million enrichment requests per day, with peak concurrency of 200 simultaneous upload jobs.

02 / Problem

Cold Start Latency

The enrichment function carried heavy dependencies (pandas, scikit-learn, geocoding libraries). Cold starts averaged 2.8 seconds.

2.8s

Execution Time Limits

The 15-minute Lambda timeout forced complex chunking of each file. The Step Functions state machine grew to 200+ states.

15min limit

Cost Inefficiency

Lambda costs reached $11,000/month, plus $2,500/month for Step Functions.

$13.5K/mo

Error Recovery

Failed invocations required re-executing entire batches from scratch.

No granular retry

40%

of engineering time spent fire-fighting Step Functions and Lambda concurrency limits

03 / Constraints

Zero Downtime

24/7 customer uploads

Team

3 backend engineers, no K8s experience

Budget

$25K/mo and growing

Compliance

Region-specific data requirements

04 / Approach

Rejected: Optimize Serverless

SnapStart wasn't available for Python at the time, and it would not have solved the timeout or cost issues.

Rejected: ECS on EC2

Team wanted to avoid managing scaling, patching, capacity planning.

Selected: EKS + Fargate

No node management, containers for long-running jobs, Kubernetes Jobs for orchestration.

Why EKS + Fargate?

Long-Running Jobs

Containers have no execution time limit, so an entire file can be processed in one pod.

Cost Model

Fargate bills per second for the vCPU and memory a pod requests. 30-40% cheaper than Lambda for sustained workloads.

Orchestration

Kubernetes Jobs replaced 200-state Step Functions. GitOps for deployments.

Operational Overhead

Fargate eliminates node management. 4-week learning investment.
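
The cost-model comparison above can be sketched as two small formulas. This is a minimal illustration, not the team's actual calculation: Lambda charges for requests plus GB-seconds of configured memory, while Fargate charges for the vCPU and memory a pod requests for as long as it runs. All rate values are inputs, and the numbers in the usage section are placeholders rather than AWS prices; real rates vary by region and architecture and should be taken from the current pricing pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LambdaRates:
    per_gb_second: float         # price per GB-second of configured memory
    per_million_requests: float  # price per one million invocations


@dataclass(frozen=True)
class FargateRates:
    per_vcpu_second: float  # price per vCPU-second requested by the pod
    per_gb_second: float    # price per GB-second of memory requested by the pod


def lambda_monthly_cost(invocations, avg_seconds, memory_gb, rates):
    """Duration charge plus request charge for one month of invocations."""
    duration = invocations * avg_seconds * memory_gb * rates.per_gb_second
    requests = invocations / 1_000_000 * rates.per_million_requests
    return duration + requests


def fargate_monthly_cost(pod_seconds, vcpu, memory_gb, rates):
    """Charge for the total pod run time in a month at a given pod size."""
    per_second = vcpu * rates.per_vcpu_second + memory_gb * rates.per_gb_second
    return pod_seconds * per_second


if __name__ == "__main__":
    # Placeholder rates and volumes for illustration only (not AWS prices).
    lam = LambdaRates(per_gb_second=0.00001, per_million_requests=0.2)
    far = FargateRates(per_vcpu_second=0.00001, per_gb_second=0.000001)
    print(lambda_monthly_cost(1_000_000, 1.0, 1.0, lam))
    print(fargate_monthly_cost(1_000, 1, 2, far))
```

The comparison only favors containers when pods stay busy: Fargate bills for idle seconds inside a running pod, whereas Lambda bills only while a function executes.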

05 / Implementation

Strangler Pattern Migration

8-week parallel run, both architectures serving traffic

8 Weeks

Total Timeline

Week 1-2

Containerization

Docker image with FastAPI, EKS cluster with Fargate, internal Service

Week 3-4

Orchestration Layer

Lambda creates Kubernetes Jobs, OpenTelemetry sidecar for observability
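
As a rough sketch of what the trigger Lambda might build, the function below assembles a `batch/v1` Job manifest for one uploaded file. The names (`build_enrichment_job`, the `enrichment` namespace, the `INPUT_URI` variable, the resource sizes and retry settings) are assumptions for illustration; the case study does not describe the actual manifest.

```python
import json


def build_enrichment_job(job_id, s3_uri, image, namespace="enrichment"):
    """Return a batch/v1 Job manifest that processes one uploaded file.

    job_id must be a valid DNS label (lowercase alphanumerics and '-').
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"enrich-{job_id}",
            "namespace": namespace,
            "labels": {"app": "enrichment", "job-id": job_id},
        },
        "spec": {
            "backoffLimit": 2,                # assumed: retry a failed pod up to twice
            "ttlSecondsAfterFinished": 3600,  # assumed: delete finished Jobs after an hour
            "template": {
                "metadata": {"labels": {"app": "enrichment"}},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        # Hypothetical variable telling the worker which file to read.
                        "env": [{"name": "INPUT_URI", "value": s3_uri}],
                        "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = build_enrichment_job(
        "abc123", "s3://example-bucket/uploads/abc123.csv", "example/enricher:latest"
    )
    print(json.dumps(manifest, indent=2))
```

The manifest could then be submitted with the official Kubernetes Python client (`BatchV1Api.create_namespaced_job`). On EKS, the pod only lands on Fargate if a Fargate profile selects its namespace and labels.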

Week 5-6

Traffic Shift

Feature flag: 10% → 50% → 100%. Lambda path as fallback.
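
One common way to implement a staged flag like this is deterministic hashing, so a given customer stays on the same path as the percentage rises. The sketch below assumes routing by customer ID; the function names and the fallback wrapper are hypothetical, not taken from the team's code.

```python
import hashlib


def rollout_bucket(customer_id: str) -> int:
    """Map a customer to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def choose_backend(customer_id: str, eks_percent: int) -> str:
    """Route to EKS when the customer's bucket falls below the rollout percentage."""
    return "eks" if rollout_bucket(customer_id) < eks_percent else "lambda"


def run_with_fallback(job, customer_id, eks_percent, run_on_eks, run_on_lambda):
    """Try the EKS path for selected customers; fall back to Lambda on failure."""
    if choose_backend(customer_id, eks_percent) == "eks":
        try:
            return run_on_eks(job)
        except Exception:
            # The Lambda path stays available as the fallback during the parallel run.
            return run_on_lambda(job)
    return run_on_lambda(job)
```

Because the bucket depends only on the customer ID, everyone routed to EKS at 10% is still routed there at 50% and 100%, which keeps the rollout monotonic.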

Week 7-8

Optimization

Resource tuning, HPA based on queue length, graceful termination
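
Graceful termination in a pod usually means reacting to SIGTERM, which Kubernetes sends before force-killing the container once the grace period expires. The sketch below shows one possible shape: finish the current record, save a checkpoint, and exit so a replacement pod can resume. The `GracefulWorker` class and the checkpoint callback are hypothetical; the case study does not say how progress was stored.

```python
import signal


class GracefulWorker:
    """Processes records in order and stops cleanly when a stop is requested."""

    def __init__(self, save_checkpoint):
        # save_checkpoint: callable taking the index of the next unprocessed record.
        self.save_checkpoint = save_checkpoint
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        """Signal-handler-compatible hook; the loop exits after the current record."""
        self.stop_requested = True

    def run(self, records, process, start=0):
        """Process records[start:] and return the index of the next unprocessed record."""
        index = start
        while index < len(records) and not self.stop_requested:
            process(records[index])
            index += 1
        self.save_checkpoint(index)
        return index


if __name__ == "__main__":
    checkpoints = []
    worker = GracefulWorker(checkpoints.append)
    signal.signal(signal.SIGTERM, worker.request_stop)
    worker.run(["a", "b", "c"], print)
```

A restarted pod would read the last checkpoint and pass it as `start`, which avoids re-executing the whole batch after an interruption.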

Final Architecture

Upload → API Gateway → Lambda (Trigger) → K8s Job → S3 + DynamoDB

06 / Results

| Metric | Before (Lambda) | After (EKS) |
| Job completion (p95) | 14.2 min | 3.8 min |
| Failed jobs per 10k | 47 | 3 |
| Monthly compute cost | $13,500 | $7,800 |
| Cold starts | 2.8s | None |
| Deploys per week | 3 | 12 |
| Eng time on orchestration bugs | 40% | 5% |

Cost Savings

$5,700/mo

42% reduction

Reliability

93%

Fewer failed jobs

Velocity

4x

More deployments
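
The headline figures follow directly from the before/after numbers in the results table. A short check of the arithmetic, using only values stated in this case study:

```python
def percent_reduction(before, after):
    """Relative drop from before to after, in percent."""
    return (before - after) / before * 100


before_cost = 11_000 + 2_500   # Lambda + Step Functions, USD per month
after_cost = 7_800             # EKS, USD per month

cost_saving = before_cost - after_cost
cost_cut = percent_reduction(before_cost, after_cost)
failed_cut = percent_reduction(47, 3)      # failed jobs per 10k
p95_cut = percent_reduction(14.2, 3.8)     # p95 job completion, minutes
deploy_multiplier = 12 / 3                 # deploys per week

print(f"Monthly saving: ${cost_saving:,}")           # Monthly saving: $5,700
print(f"Cost reduction: {cost_cut:.1f}%")            # Cost reduction: 42.2%
print(f"Fewer failed jobs: {failed_cut:.1f}%")       # Fewer failed jobs: 93.6%
print(f"Faster p95 completion: {p95_cut:.1f}%")      # Faster p95 completion: 73.2%
print(f"Deploy frequency: {deploy_multiplier:.0f}x") # Deploy frequency: 4x
```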

07 / Key Insight

Serverless is ideal for MVP. Containers are ideal for scale.

Once a workload becomes predictable, high-volume, or long-running, Lambda's per-invocation pricing and 15-minute limit become cost and reliability burdens. EKS + Fargate offers a middle ground: operational simplicity with container flexibility.

Stay with Serverless

  • Workload is spiky or unpredictable
  • Execution under 15 minutes
  • Cold starts acceptable
  • No container experience

Consider EKS

  • Workloads are sustained, predictable
  • Lambda costs exceed container equivalent
  • Need fine-grained runtime control
  • Team can invest 2-4 weeks in K8s
