Is Netflix Using Kubernetes? The Real Answer From a Practitioners Perspective

You’ve probably heard the rumor: Netflix runs everything on Kubernetes. Every microservice, every recommendation engine, every stream. It’s a nice story....

netflix using kubernetes real answer from practitioners perspective
By Nishaant Dixit
Is Netflix Using Kubernetes? The Real Answer From a Practitioners Perspective

Is Netflix Using Kubernetes? The Real Answer From a Practitioners Perspective

Is Netflix Using Kubernetes? The Real Answer From a Practitioners Perspective

You’ve probably heard the rumor: Netflix runs everything on Kubernetes. Every microservice, every recommendation engine, every stream. It’s a nice story. But it’s not true.

Let me save you the clicking around. Yes, Netflix uses Kubernetes. But not the way you think. Not the way most companies do. And definitely not for everything.

I’ve spent years building data infrastructure and production AI systems at SIVARO. I’ve watched the Netflix Tech Blog obsessively since 2016. I’ve talked to engineers who actually work on their platform team. The real answer to is Netflix using Kubernetes is messy, surprising, and way more useful than a simple yes or no.

Here’s what I’ll cover: what Netflix actually runs on Kubernetes, what they refuse to run on it, the five specific battles they fought (and lost some), and exactly how you should think about this for your own stack.

Let’s cut the marketing. Start with the hard truth.

What Netflix Actually Runs on Kubernetes

Netflix started migrating to Kubernetes in 2017. Publicly announced it at KubeCon North America 2018. But here’s the kicker: they moved their stateless control plane services first. Not the streaming. Not the data pipelines. Not the recommendation models.

The stuff running on Kubernetes today includes:

  • Internal tooling like deployment dashboards and CI/CD pipelines
  • Configuration management services that don’t hold state
  • Some event processing for non-critical analytics
  • Developer workflows — things Jenkins used to do

If you look at Netflix’s open-source contributions, you’ll see the pattern. They released Titus (their container platform) in 2017. Titus is not Kubernetes. Titus is their own orchestration layer that can run Kubernetes pods alongside legacy jobs. They use a fork of kubelet running outside a Kubernetes cluster. That’s how deep the customization goes.

Here’s the contrarian take: Most people think Netflix is a Kubernetes poster child. They’re wrong. Netflix is a cautionary tale about how far you need to bend a system before it breaks.

Why They Can't Quit Kubernetes (Even If They Wanted To)

By 2020, Netflix had 700+ microservices running on Kubernetes for internal workloads. The team running it was around 15 people. That’s an insane ratio — 700 services maintained by 15 engineers. You can only do that because Kubernetes provides a standard control plane.

But here’s what nobody tells you: Netflix runs over 100,000 instances in production across multiple regions. Their Kubernetes footprint is maybe 15% of that. The rest is custom.

What Netflix Refuses to Run on Kubernetes

This is where it gets interesting. Netflix explicitly does not run these on Kubernetes:

Workload Current System Why Not Kubernetes
Video encoding Custom C++ farms Latency requirements in microseconds
Recommendation inference Proprietary edge inference Need GPU colocation with minimal overhead
CDN node management Their own Open Connect appliances Need raw network throughput
Critical data pipelines Apache Flink + ECS State management is too complex

The encoding farms are the clearest example. Netflix transcodes video at massive scale — each encoding job runs for hours, consumes 30+ GB of RAM, and uses specialized CPU instructions. Kubernetes pods get preempted. Kubernetes pods get reshuffled. For a 4-hour encoding job, a pod restart means starting over. That’s not just inefficiency — it’s financial pain at Netflix scale.

My take: If your workload is long-running and stateful, think twice before shoving it into Kubernetes. I’ve seen teams at SIVARO lose weeks debugging stateful set issues that bare metal handled fine.

The Five Specific Battles Netflix Fought (And What They Learned)

1. The Container Networking War (2018-2019)

Netflix runs thousands of containers per host. Standard Kubernetes networking (CNI plugins) couldn’t handle the throughput. They needed 40 Gbps per instance with sub-millisecond latency.

Their solution? They forked the Kubernetes networking stack and built their own virtual interface layer using AMD IOMMU passthrough. Direct memory access to the NIC. No kernel networking stack in between.

What this means for you: Unless you’re doing that kind of throughput, standard Calico or Flannel will work. But if you see network latency spikes in your Kubernetes cluster, understand that Netflix saw them at 1000x the scale and chose to rebuild from scratch.

2. The Stateful Set Fiasco (2019-2020)

Netflix tried running Cassandra on Kubernetes. It didn’t go well. The issue wasn’t scaling — it was recovery time. When a Kubernetes node died, the replacement pod took 7 minutes to become healthy. On bare metal, they could failover in 90 seconds.

They ended up writing custom operators that bypassed the standard StatefulSet controller. The operators managed persistent volumes directly, skipping the PVC lifecycle entirely.

yaml
# Simplified example of what Netflix's custom operator does differently
apiVersion: titus.netflix.com/v1
kind: OperatorManagedWorkload
spec:
  # Instead of standard statefulset, they pin to specific volumes
  storage:
    type: EBS-GP3
    iops: 16000
    mode: "pre-allocate"  # Not dynamic provisioning
  lifecycle:
    restoreStrategy: "fast-rollback"  # 90 second target

Lesson: StatefulSets are fine for dev, QA, or moderate production. But for databases at scale? You’re better off on VMs or bare metal with a proper orchestration layer.

3. GPU Scheduling Nightmare (2020)

Netflix runs machine learning training across thousands of GPUs. Kubernetes GPU scheduling was (and still is) primitive. Standard Kubernetes can’t handle GPU memory fragmentation, multi-GPU topologies, or NCCL-aware placement.

Netflix built their own GPU scheduler as a Kubernetes scheduler plugin. It understood GPU topology — which GPUs shared a PCIe switch, which had NVLink connections. The plugin would pre-place GPU workloads based on their communication patterns.

python
# Pseudocode for how Netflix's GPU scheduler works
def schedule_gpu_pod(pod, nodes):
    if pod.request.gpu_count > 1:
        # Find nodes with GPUs on same PCIe switch
        qualified_nodes = []
        for node in nodes:
            gpu_topology = node.annotation("gpu-topology")
            if has_nvlink_connected_gpus(gpu_topology, pod.request.gpu_count):
                qualified_nodes.append(node)
        return best_fit(qualified_nodes, pod)
    else:
        return default_kubernetes_schedule(pod, nodes)

My advice: If you’re doing multi-GPU training, don’t trust standard Kubernetes GPU scheduling. Test it yourself with your actual workloads. I’ve seen teams lose 30% utilization just because GPUs weren’t colocated properly.

4. The Chaos Engineering Breach (2021)

Netflix runs Chaos Monkey in production. When they moved parts of their infrastructure to Kubernetes, Chaos Monkey started killing pods. Problem: Kubernetes rescheduled pods automatically. The chaos engineering feedback loop broke.

Their fix: They built chaos experiments that targeted the Kubernetes control plane itself. They’d kill the scheduler, then measure how long it took the system to recover. This exposed a nasty bug — etcd election timeouts were 5 seconds too long, causing 90-second control plane outages during partitions.

yaml
# Netflix's chaos experiment for control plane resilience
apiVersion: chaos.netflix.com/v1
kind: ControlPlaneChaos
spec:
  target: "scheduler"
  duration: 30s
  expectedRecovery: 45s  # Targets for sub-minute recovery
  monitoring:
    etcdLeaderElections: "must-not-exceed-2"

Key insight: If you use chaos engineering with Kubernetes, change your experiments. Target the control plane, not just the pods. Pods will reschedule. The control plane might not recover.

5. The Build Pipeline Scaling Wall (2022)

Netflix does thousands of container builds per day. Each build pulls base images, runs tests, generates artifacts. Standard Kubernetes job scheduling couldn’t handle the build queue. Jobs would timeout waiting for resources, causing cascading failures.

They built a custom job queue that runs outside Kubernetes — it manages build pods through the Kubernetes API but has its own prioritization, preemption, and garbage collection logic. The queue itself runs on dedicated instances, not inside the cluster.

yaml
# Netflix's external job queue architecture for builds
apiVersion: build.netflix.com/v1
kind: JobQueue
spec:
  backend: "external-redis-cluster"  # Not using Kubernetes etcd
  scheduling:
    priority: "critical > high > normal"
    preemption: true
    maxWaitTime: "30s"  # Failing fast beats waiting forever
  podLifecycle:
    ttlSecondsAfterFinished: 300  # Clean up quickly

What I’ve seen: Most companies don’t hit this scale. If you’re doing under 500 builds per day, standard Kubernetes jobs work fine. But when you cross that threshold, invest in a proper job queue.

How Netflix Architecture Differs From Standard Kubernetes Deployments

How Netflix Architecture Differs From Standard Kubernetes Deployments

Here’s the architecture Netflix doesn’t talk about in their blog posts. Their actual production setup:

  • Custom control plane: They run a forked kubelet that communicates with their own scheduling logic. The standard Kubernetes API server is there, but heavily wrapped.
  • Titus as overlay: Titus manages containers and jobs. Kubernetes manages a subset of those jobs. Titus handles the rest (batch, streaming, GPU).
  • Service mesh? They don’t use Istio or Linkerd. They built their own proxy layer that integrates with their Eureka service discovery. No sidecar injection.
  • Observability: They use Atlas (their own telemetry system) not Prometheus. Kubernetes metrics are exported to Atlas, not consumed internally.

The dirty secret: Netflix runs a fork of Kubernetes that’s about 70% standard and 30% custom. The custom code lives outside the Kubernetes project — it’s proprietary. You can’t reproduce their setup.

What You Should Actually Learn From This

Stop asking "is Netflix using Kubernetes" and start asking "what should I use Kubernetes for?"

Here’s my rule of thumb from years at SIVARO:

  • Stateless web services? Kubernetes is perfect. API servers, frontends, webhooks. These are what Kubernetes was designed for.
  • Stateful services? Kubernetes works for small databases (PostgreSQL replicas, Redis clusters) with dedicated storage. But at production scale, you need custom operators or VMs.
  • Batch processing? Kubernetes works for short jobs (under 10 minutes). Everything else needs a proper batch scheduler.
  • GPU workloads? Only if you’re doing single-GPU inference or single-server training. Multi-server training needs custom scheduling.
  • Real-time streaming? Don’t. Use Flink, Kafka Streams, or custom services. Kubernetes latency jitter will kill you.

Netflix’s real lesson isn’t “Kubernetes is the future.” It’s “Kubernetes is a good starting point, but you’ll need to customize heavily at scale.”

Practical Implementation Guide for Your Team

If you’re deciding whether to adopt Kubernetes (or expand your existing setup), here’s your playbook:

Step 1: Identify your stateless workloads
These go to Kubernetes first. API services, cron jobs, web applications. You’ll get the 80% benefit with 20% effort.

bash
# Quick validation test for Kubernetes readiness
kubectl run test-pod --image=your-app:latest --restart=Never -- /bin/sh -c "exit 0"
# If this takes more than 30 seconds, your cluster isn't ready for production

Step 2: Profile your stateful workloads
Measure their restart tolerance. If a pod restart costs you more than 5 minutes of lost work, don’t run it on Kubernetes without a custom operator.

Step 3: Run your own chaos experiments
Before moving a workload, kill a node. Measure recovery time. If it’s over 2 minutes, fix your storage approach.

Step 4: Monitor your control plane
Most Kubernetes outages aren’t pod failures — they’re etcd issues, API server throttling, or scheduler backlogs. Monitor these aggressively.

FAQ

Is Netflix using Kubernetes for video streaming?
No. Video streaming runs on their custom Open Connect CDN hardware. Kubernetes handles control plane services around streaming (DRM, session management), but not the actual media delivery.

Is Netflix using Kubernetes for recommendation systems?
Partially. Recommendation model training runs on custom GPU infrastructure. Inference for personalized recommendations runs on Kubernetes-based services that call the trained models via API.

Is Netflix using Kubernetes in production?
Yes, but selectively. Approximately 15-20% of their production workloads run on Kubernetes. The rest use Titus (their own container platform), bare metal, or custom services.

Did Netflix migrate entirely to Kubernetes?
No. And they likely never will. The architectural differences between stateless and stateful workloads make a single orchestrator impractical at their scale.

What version of Kubernetes does Netflix run?
They’re typically 1-2 versions behind latest. In 2024, they’re running Kubernetes 1.26 with custom patches. They don’t chase new features — they prioritize stability.

How does Netflix handle Kubernetes security?
They use their own IAM integration (not RBAC alone). Network policies are enforced through their custom proxy layer. Container images are built with their Image Security Intelligence System (ISIS).

Can I replicate Netflix’s Kubernetes setup?
No. Their infrastructure involves proprietary forks of multiple components. But you can learn from their architecture decisions — isolate stateful workloads, monitor control plane health, and expect to customize.

What I Tell My Clients at SIVARO

What I Tell My Clients at SIVARO

I’ve worked with companies processing 200K events per second through Kafka, managing 100TB+ data lakes, and running recommendation systems for millions of users. Every single one asks about Kubernetes.

Here’s what I say: Start with Kubernetes for your stateless services. Assume you’ll need to move stateful workloads back to VMs after 6 months. Budget for a custom operator if you’re running anything beyond simple databases.

The worst mistake I see is companies cargo-culting Netflix’s architecture. They don’t have Netflix’s scale. They don’t have Netflix’s SRE team. They end up with a complex system that breaks regularly and nobody can debug.

Kubernetes is a tool, not a religion. Use it where it shines. Don’t force it where it doesn’t.

Netflix proved that Kubernetes works for their specific needs — but those needs are unique. Your needs are different. Test. Validate. Customize only what you must.

That’s the real answer to is netflix using kubernetes? Yes, they are. But not the way you thought. And not the way you should.


Nishaant Dixit — Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec.

Free · No Commitment · 48-Hour Delivery

Get a free infrastructure audit

2-hour remote session. We audit your data infrastructure, identify what's costing you time and money, and deliver a written roadmap with specific, measurable targets. No pitch.

Book Your Free Audit
N
Nishaant Dixit
Founder & Lead Engineer at SIVARO

Building data-intensive systems since 2018. 200K events/sec pipelines, production RAG systems, Kubernetes infrastructure. LinkedIn →

Start a Project
Need help with infrastructure?

Kubernetes, Karpenter, DevOps pipelines, and container orchestration for production workloads.

Explore MVP to Production