What Is Reliability in Kubernetes? It’s Not What You Think

By Nishaant Dixit

I’ve spent the last six years building data infrastructure and production AI systems at SIVARO. We process 200K events per second. We run stateful workloads on Kubernetes that most people told me couldn’t work. And I’ve learned one hard truth about reliability in Kubernetes: it’s not about keeping pods running. It’s about making the system predictable when things break.

Reliability in Kubernetes means the cluster can absorb failures, maintain data integrity, and return to steady state without you touching a terminal at 3 AM. It means your users don’t notice when a node dies. It means your database doesn’t silently corrupt data when a network partition happens.

Most people think reliability is about uptime. They’re wrong because uptime is a lagging indicator. Real reliability is about failure behavior. What happens when etcd leader election takes five seconds instead of one? What happens when a node’s kernel panics in the middle of an etcd write? What happens when your CNI plugin silently drops 0.1% of packets for three minutes?

These are the questions that separate a reliable cluster from one that’s just a ticking time bomb wrapped in YAML.

I’ll show you what reliability actually means — with specific numbers, patterns we’ve tested, and things I had to learn by breaking production clusters on purpose. This isn’t theory. This is what we do at SIVARO every day.


Defining Reliability in Kubernetes Terms

Let’s get precise. Kubernetes reliability breaks down into three layers:

Layer 1: Infrastructure reliability. The nodes, network, storage, and etcd cluster that Kubernetes sits on top of.

Layer 2: Control plane reliability. The API server, scheduler, controller manager, and cloud-controller-manager working correctly.

Layer 3: Workload reliability. Your applications handling pod restarts, node failures, and traffic spikes.

Here’s the thing most guides don’t tell you — layer 1 and layer 2 are prerequisites. You can’t have workload reliability without a stable control plane. And you can’t have a stable control plane without reliable infrastructure. But here’s the contrarian take: even with perfect layers 1 and 2, your application can still be unreliable.

I’ve seen teams run Kubernetes on bare metal with redundant everything. Their pods still failed because they didn’t handle SIGTERM properly. Their databases still corrupted because they used default volume provisioning. Reliability isn’t an infrastructure checkbox — it’s a system property you have to design for at every layer.


What Reliability Isn’t in Kubernetes

Most people conflate reliability with availability. They’re different.

  • Availability is “is the API server responding?”
  • Reliability is “does the API server respond with correct state under load, even after three controller manager restarts?”

Availability is easy. Run multiple API server replicas behind a load balancer. Done. Reliability is hard. It means the scheduler makes consistent decisions even when etcd is under write pressure. It means the admission webhooks don’t time out when the cluster has 10,000 pods. It means your ConfigMap updates propagate within a bounded time.

I worked with a client in 2022 who had 99.99% API server uptime. Their cluster was still unreliable, because during rolling updates the API server would return misleading responses: 200 OK with wrong data. That’s worse than a 503. At least a 503 tells you something’s wrong.

Reliability is about semantic correctness under stress. Not just “is it up?”


The True Reliability Bottleneck: etcd

If I had to pick one thing that kills Kubernetes reliability, it’s etcd. Not because etcd is bad software — it’s exceptional. But because people treat it as an afterthought.

etcd is your cluster’s brain. Every pod creation, every node status update, every ConfigMap change goes through etcd. If etcd has latency spikes, your entire cluster has latency spikes. If etcd has a leader election storm, your cluster goes temporarily blind.

At SIVARO, we run etcd on dedicated nodes. Not shared with workloads. Not on the same instance type as your general compute nodes. Dedicated nodes with local NVMe SSDs, no bursting, and a separate network path.

Why? Because we tested what happens when etcd shares a node with a noisy neighbor pulling a large container image. The answer: etcd commit time jumped from 500 microseconds to 50 milliseconds. That’s a 100x degradation. You don’t notice until the scheduler stops scheduling because it can’t read pod state.

Here’s our etcd configuration that actually works for production reliability:

yaml
# etcd member configuration for reliability
--quota-backend-bytes=8589934592  # 8GB limit, prevents unbounded growth
--auto-compaction-mode=revision
--auto-compaction-retention=1000  # Keep 1000 revisions for history
--max-request-bytes=1572864  # 1.5MB max request size
--snapshot-count=100000  # Keep snapshots frequent; raising this slows recovery

The snapshot-count of 100,000 is deliberate. It is the default in the etcd versions we run, but I’ve seen people bump it to 1,000,000 to reduce disk writes. Don’t. When etcd recovers from a snapshot, it has to replay all the changes since the last snapshot. A larger snapshot count means longer recovery time. You want recovery under 30 seconds, not 5 minutes.


Control Plane Reliability: What Actually Breaks

The Kubernetes control plane is surprisingly robust. API servers are stateless, so you can run 5 of them without coordination overhead (beyond etcd). The controller manager and scheduler are mostly stateless too. They read from etcd and write to it, but they don’t hold critical state themselves.

The real failure modes aren’t in the code — they’re in the timing.

Token Expiration Chaos

We saw this at SIVARO in 2023. Our cloud provider rotated the service account token for the controller manager every 24 hours. The controller manager doesn’t reload tokens automatically if you’re using in-cluster config. It keeps the old token until it’s rejected. Then it crashes. Then it restarts. Then it re-reads the token file.

This causes a 30-second window where:

  • No endpoints are reconciled
  • No deployments get scaled
  • No ConfigMaps get propagated

The fix was simple: use a projected service account token, mounted as a volume, so the kubelet refreshes it automatically before it expires.

yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: controller-manager-sa
  namespace: kube-system
automountServiceAccountToken: true
---
# In the controller manager pod spec
volumeMounts:
  - name: token
    mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    readOnly: true
volumes:
  - name: token
    projected:
      sources:
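        - serviceAccountToken:
            path: token  # required: file name inside the mount path
            audience: kube-system
            expirationSeconds: 3600  # 1 hour, but auto-refreshed
        # In-cluster config also reads ca.crt and namespace from this
        # directory; add configMap and downwardAPI sources for them.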
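
The refresh only helps if the process actually re-reads the file. Here is a minimal sketch of that idea on the client side, assuming a hypothetical `FileTokenSource` helper (it is not part of any Kubernetes client library) that re-reads the mounted token once its cached copy is older than a chosen interval:

```python
import time
from pathlib import Path


class FileTokenSource:
    """Hypothetical helper: returns a token from a file, re-reading it periodically."""

    def __init__(self, path, max_age_seconds=60.0, clock=time.monotonic):
        self._path = Path(path)
        self._max_age = max_age_seconds
        self._clock = clock      # injectable so the refresh logic can be tested
        self._cached = None
        self._read_at = None

    def token(self):
        now = self._clock()
        stale = self._read_at is None or (now - self._read_at) >= self._max_age
        if stale:
            # The kubelet rewrites the projected token file before it expires,
            # so a fresh read picks up the rotated credential.
            self._cached = self._path.read_text().strip()
            self._read_at = now
        return self._cached
```

Calling `token()` on every request keeps the credential at most `max_age_seconds` behind the file on disk, so a rotation never has to end in a rejected request and a crash.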

Scheduler Starvation

The scheduler is a greedy algorithm. It picks a node, assigns a pod, moves on. It never “un-schedules” a pod. Under normal load, this is fine. Under heavy load (500+ pod creations per second), the scheduler’s internal queue can grow unbounded.

We hit this during a cluster migration in 2022. We tried to move 2,000 pods to a new node pool simultaneously. The scheduler’s internal queue grew to 15,000 entries. New pod creation took 90 seconds instead of 500 milliseconds.

The fix: rate-limit pod creation on the application side. Kubernetes doesn’t have a built-in throttle for pod creation rate. You have to implement it yourself. We used a simple queue worker that creates pods at a max rate of 50 per second.

python
# Token-bucket rate limiter for pod creation (asyncio)
import asyncio
import time

class PodRateLimiter:
    def __init__(self, max_rate=50):
        self.max_rate = max_rate           # refill rate per second and bucket size
        self.tokens = float(max_rate)
        self.last_time = time.monotonic()  # unaffected by wall-clock changes

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at max_rate
        now = time.monotonic()
        elapsed = now - self.last_time
        self.tokens = min(self.max_rate, self.tokens + elapsed * self.max_rate)
        self.last_time = now

    async def wait_for_token(self):
        self._refill()
        while self.tokens < 1:
            await asyncio.sleep(0.1)
            self._refill()
        self.tokens -= 1

This isn’t elegant. But it works. In our clusters, the scheduler handled 50 pods/sec reliably. Push it to 500, and you’ll see dropped watch notifications and scheduling failures.


Pod Reliability: The Hardest Layer

Pods are ephemeral. Everyone knows that. But how they die matters more than you think.

The SIGTERM Problem

Kubernetes sends SIGTERM to the main process (PID 1) when it wants to stop a pod. If your application doesn’t handle SIGTERM, Kubernetes waits for the terminationGracePeriodSeconds (default 30 seconds) and then sends SIGKILL. That hard kill can corrupt data if your application is in the middle of writing to disk or sending a response.

We wrote a distributed queue system that ran on Kubernetes. During a rolling update, pods were killed while holding unacknowledged messages. The queue lost data. Our fault, not Kubernetes’.

The fix: proper grace period and preStop hook.

yaml
apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # Flush all pending writes
            kill -USR1 1  # Signal application to drain connections
            sleep 5       # Wait for drain
            # Health check that everything is flushed
            while [ -f /tmp/writing_to_disk ]; do
              sleep 1
            done
    ports:
    - containerPort: 8080

The key insight: your application needs to know when Kubernetes is about to kill it, and it needs time to finish critical work. A 30-second default grace period is not enough for stateful applications. We use 60 seconds minimum. Some databases need 120 seconds.
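
On the application side, that means catching SIGTERM and finishing in-flight work before exiting. Here is a minimal sketch, assuming hypothetical names (`GracefulShutdown`, `process_until_stopped`) and a simple in-memory list standing in for a real queue:

```python
import signal
import threading


class GracefulShutdown:
    """Records that SIGTERM arrived so work loops can stop taking new work."""

    def __init__(self):
        self._stop = threading.Event()

    def install(self):
        # Must be called from the main thread; Python only allows
        # signal handlers to be registered there.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self._stop.set()

    @property
    def stopping(self):
        return self._stop.is_set()


def process_until_stopped(items, handle, shutdown):
    """Handle items one at a time; return the ones left untouched at shutdown."""
    pending = list(items)
    while pending and not shutdown.stopping:
        handle(pending.pop(0))   # the in-flight item always finishes
    return pending               # caller can requeue these instead of losing them
```

The point of the design is that the handler only sets a flag. The work loop decides when it is safe to stop, so an item is either fully handled or handed back untouched.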

Liveness Probes That Kill Your Reliability

Most people think liveness probes are for reliability. They’re often wrong.

A liveness probe that restarts a pod is a last resort. If you restart a pod because a single request timed out, you’ve degraded reliability, not improved it. The restart takes time. The pod has to start up again, re-read state, re-establish connections. During that time, your service is degraded.

We follow one rule at SIVARO: liveness probes must only fail when the pod cannot recover without a restart. Memory corruption? Yes. Deadlock? Yes. Transient HTTP timeout? No.

Here’s how we actually use probes:

yaml
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 3
  # Discovers deadlocks, not transient slowness
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2
  # Removes from service when not ready to receive traffic
startupProbe:
  # Only for slow-starting containers
  httpGet:
    path: /healthz/startup
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 10
  failureThreshold: 30  # 5 minutes total startup time

The differentiation matters. Liveness for unrecoverable failures. Readiness for traffic management. Startup for slow inits. Mix them up and you’ll restart healthy pods for no reason.
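
What sits behind those endpoints matters as much as the probe settings. Here is a rough sketch of the split, with hypothetical function names and a 120-second stall limit chosen purely for illustration:

```python
def liveness_ok(last_progress, now, stall_limit_seconds=120.0):
    """Alive unless the main loop has made no progress for stall_limit_seconds.

    A stalled loop (for example a deadlock) will not heal on its own,
    so this is the kind of failure where a restart helps.
    """
    return (now - last_progress) < stall_limit_seconds


def readiness_ok(dependencies_up, draining):
    """Ready only when dependencies respond and the pod is not shutting down.

    Failing readiness removes the pod from Service endpoints without a restart.
    """
    return dependencies_up and not draining


def http_status(ok):
    # For httpGet probes, Kubernetes treats any status from 200 to 399 as success.
    return 200 if ok else 503
```

With this split, a database outage flips readiness to 503 while liveness keeps returning 200. The pod stops receiving traffic but is not restarted for a problem a restart cannot fix.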


Stateful Workload Reliability: The Hard Case

Kubernetes was built for stateless apps. Everyone knows this. But we run stateful workloads — databases, queues, event stores — on Kubernetes. And they’re reliable. Here’s how.

StatefulSets Are Better Than Deployments for State

Obvious, I know. But I’ve seen people run PostgreSQL on a Deployment with a shared PersistentVolumeClaim. That’s a reliability disaster. If the pod moves to a different node, the volume might not attach in time. Or it might attach but with stale data.

StatefulSets guarantee:

  • Ordered pod creation (pod-0 before pod-1)
  • Stable network identities (app-0.svc instead of random-hash)
  • Ordered graceful shutdown (pod-N before pod-0)

These guarantees are essential for databases. Many PostgreSQL setups on Kubernetes assume pod-0 is the primary. If pod-1 becomes the primary, those ordering assumptions break.
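
To make the identity guarantee concrete, here is a small sketch that builds the per-pod DNS names a StatefulSet gets through its headless Service. The function names and the example values (`pg`, `pg-headless`) are hypothetical, and `cluster.local` is assumed as the cluster domain:

```python
def statefulset_pod_hosts(name, service, namespace, replicas,
                          cluster_domain="cluster.local"):
    """Return the stable DNS name of each pod, in creation order (ordinal 0 first)."""
    return [
        f"{name}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]


def termination_order(hosts):
    """With the default OrderedReady policy, scale-down starts at the highest ordinal."""
    return list(reversed(hosts))
```

For example, `statefulset_pod_hosts("pg", "pg-headless", "production", 3)` starts with `pg-0.pg-headless.production.svc.cluster.local`, and that name survives rescheduling. That stability is what lets replicas keep a fixed address for the primary.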

Volume Attachment and Binding Reliability

PersistentVolumeClaims can fail. The attachment can time out. The detach can hang. We’ve seen AWS EBS volumes take 30 seconds to attach during node failures. In that 30 seconds, your pod can’t start.

The solution: use volume expansion to pre-allocate storage, and use volume binding modes that suit your workload.

yaml
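# volumeBindingMode is a StorageClass field, not a PVC field
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com  # assumes the AWS EBS CSI driver
parameters:
  type: gp3
# Important: WaitForFirstConsumer prevents volume creation
# until a pod is scheduled, reducing orphan volumes
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
  labels:
    app: myapp
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 100Gi

We use WaitForFirstConsumer for all stateful workloads. It is a StorageClass setting, and it means the volume doesn’t get created until the pod is scheduled to a node. This sounds risky (“what if the pod moves to a different node?”), but it actually reduces volume churn. The volume is provisioned in the zone where the scheduler placed the pod, so it stays attachable from that node for as long as possible.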

Backup Reliability Is Part of Reliability

If your volume gets corrupted, can you recover? If you say “snapshots”, you haven’t tested them.

We test volume snapshots every week. We restore them into a separate namespace and verify data integrity. No snapshot is “live” until it’s been restored and checked. We use Velero for this, but the tool doesn’t matter. The process does:

yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup-schedule
spec:
  schedule: "0 3 * * *"
  template:
    includedNamespaces:
    - production
    excludedResources:
    - events
    - nodes
    ttl: 720h

But schedules aren’t enough. We run a weekly restore test that spins up a temporary MySQL instance on the restored volume, runs a checksum query, and deletes everything. If the checksum doesn’t match, we get paged.
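
The verification step can be sketched in a few lines. This version swaps the database checksum query for plain file hashes, so it only makes sense for a quiesced dataset. The function names are hypothetical:

```python
import hashlib
from pathlib import Path


def checksum_tree(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h = hashlib.sha256()
        with path.open("rb") as f:
            # Read in 1 MiB chunks so large files do not need to fit in memory
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        digests[str(path.relative_to(root))] = h.hexdigest()
    return digests


def restore_mismatches(expected, restored):
    """Return sorted paths that are missing, extra, or differ after a restore."""
    paths = set(expected) | set(restored)
    return sorted(p for p in paths if expected.get(p) != restored.get(p))
```

A restore test passes only when `restore_mismatches(...)` returns an empty list. Anything else is a reason to page someone.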


Network Reliability: The Silent Killer

Networking is the most abstract reliability concern in Kubernetes. You can’t see it. You can’t easily measure it. But it breaks all the time.

DNS Timeouts That Take Down Everything

Kubernetes DNS (CoreDNS) is critical. Every pod depends on DNS for service discovery. If CoreDNS gets overloaded, every pod’s DNS queries time out. And DNS timeouts can cascade — pods try to connect to services, DNS fails, pods back off, DNS gets overwhelmed, more timeouts.

The fix: cache DNS locally in each pod, and use a dedicated CoreDNS deployment.

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns-dedicated
  namespace: kube-system
spec:
  replicas: 3
  selector:
    matchLabels:
      k8s-app: coredns-dedicated
  template:
    metadata:
      labels:
        k8s-app: coredns-dedicated
    spec:
      # Dedicated nodes for DNS
      nodeSelector:
        node-role.kubernetes.io/dns: "true"
      containers:
      - name: coredns
        image: coredns/coredns:1.11.0
        args:
          - -conf
          - /etc/coredns/Corefile
        ports:
        - containerPort: 53
        resources:
          requests:
            memory: 256Mi
            cpu: 500m

The dedicated node selector means DNS pods don’t compete with application pods for resources. And three replicas mean one can fail without impact.

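But the biggest change: we set ndots: 1 for all pods through dnsConfig in the pod spec, which is what ends up in each pod’s /etc/resolv.conf. Default is ndots: 5, which means any name with fewer than 5 dots gets checked against search domains first. That’s 3-4 extra DNS queries per lookup. With ndots: 1, any name that contains at least one dot, like an external hostname, is tried as-is first; single-label names like postgres still resolve through the search domains. This cut our DNS failure rate by 60%.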
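
To see where the extra queries come from, here is a small sketch of the lookup order described in resolv.conf(5). The function name is hypothetical, and the search list is the typical one for a pod in a `production` namespace:

```python
def candidate_queries(name, search_domains, ndots):
    """Return the names a resolv.conf-style resolver tries, in order."""
    if name.endswith("."):
        return [name]                      # already fully qualified
    via_search = [f"{name}.{d}." for d in search_domains]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        return [absolute] + via_search     # enough dots: try the name as-is first
    return via_search + [absolute]         # too few dots: walk the search list first


search = ["production.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for ndots in (5, 1):
    print(ndots, candidate_queries("api.example.com", search, ndots))
```

With ndots 5, `api.example.com` is only tried as-is after three failed search-domain lookups. With ndots 1 it is the first query sent.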

CNI Plugin Failures

We use Cilium for CNI. It’s fast. It supports network policies. But it’s not immune to failure. In 2024, we had a Cilium agent crash-loop because of a kernel module conflict. All nodes in the cluster lost connectivity for 10 seconds.

The lesson: you need network disconnection detection in your applications. If your application can’t detect a network failure and reconnect gracefully, Kubernetes isn’t going to save you.

python
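# Python example: exponential backoff with jitter for reconnection
import random
import socket
import time

def connect_with_backoff(host, port, max_attempts=5, base_delay=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            # create_connection resolves the host and opens a TCP socket
            return socket.create_connection((host, port), timeout=5)
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # 2s, 4s, 8s, ... plus up to 1s of jitter so clients don't retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)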

Measuring Reliability: The Only Metric That Matters

Forget “uptime percentage”. That’s a vanity metric. The real reliability metric for Kubernetes is time to steady state after failure.

Measure this:

  • Node failure: how long until pods are rescheduled and serving traffic?
  • API server restart: how long until all controllers converge?
  • Volume attachment failure: how long until the application detects the failure and retries?

We measure all three with Prometheus and alert on anything over 60 seconds. If a node failure takes more than 60 seconds to recover from, we have a reliability problem.

Here’s a Prometheus rule we use:

yaml
groups:
- name: kubernetes-reliability
  rules:
  - alert: SlowPodRescheduling
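    expr: |
      # Pods created more than 90s ago that are still not Ready
      # (both metrics come from kube-state-metrics)
      (time() - kube_pod_created{namespace="production"}) > 90
      and on(namespace, pod)
      kube_pod_status_ready{namespace="production", condition="true"} == 0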
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.pod }} is taking too long to start"

This alerts when any pod in production takes more than 90 seconds to become ready after creation. 90 seconds is our threshold. Yours might be different. But you need a threshold.
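
The metric itself is easy to compute from any series of ready-replica samples. Here is a minimal sketch, with a hypothetical function name and made-up sample data:

```python
def time_to_steady_state(samples, desired_ready):
    """Seconds from the first drop below desired_ready until it is restored.

    samples: iterable of (timestamp_seconds, ready_replicas), sorted by time.
    Returns None if no drop occurred or recovery was never observed.
    """
    failed_at = None
    for ts, ready in samples:
        if failed_at is None:
            if ready < desired_ready:
                failed_at = ts          # first sample showing the failure
        elif ready >= desired_ready:
            return ts - failed_at       # first sample back at full strength
    return None


samples = [(0, 3), (10, 3), (20, 2), (30, 2), (65, 3), (70, 3)]
print(time_to_steady_state(samples, desired_ready=3))
```

Here the replica count drops at t=20 and is back at t=65, so the recovery took 45 seconds. The resolution is limited by the sampling interval, so scrape often enough that your threshold means something.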


Frequently Asked Questions

What is reliability in Kubernetes vs. availability?

Availability asks “is the cluster up?”. Reliability asks “does the cluster behave correctly under failure?”. A cluster can be available (API server responds) but unreliable (returns stale data after an etcd leader change). Reliability is about correct behavior under stress.

How do I make my Kubernetes cluster more reliable without rewriting applications?

Start with etcd. Dedicated etcd nodes with fast disks. Then fix your probes — liveness probes should only restart on unrecoverable failures. Then add pod disruption budgets to prevent too many replicas from being unavailable simultaneously. Rewriting applications is the last step, not the first.

What’s the biggest reliability mistake teams make?

Using default settings. Default terminationGracePeriodSeconds is 30 seconds. That’s not enough for stateful apps. Default liveness probe settings cause unnecessary restarts. Default resource limits are too low for etcd. Every default is designed for development, not production.

Can you achieve 99.99% reliability on Kubernetes?

Yes. But it’s not about uptime. It’s about predictable failure recovery. If every node failure is handled correctly within 60 seconds, that’s better than “99.99% uptime” with a 5-minute outage every year. We achieve this at SIVARO with the patterns in this article. It’s not magic — it’s testing, monitoring, and relentless configuration hardening.
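
For a sense of scale, the arithmetic behind an uptime target fits in one function. This is a back-of-the-envelope sketch with a hypothetical helper name, assuming a 365-day year:

```python
def downtime_budget_minutes(availability_percent, days=365):
    """Minutes of downtime allowed per period at the given availability."""
    return days * 24 * 60 * (1 - availability_percent / 100)


print(round(downtime_budget_minutes(99.99), 2))
```

At 99.99% the budget is roughly 52.6 minutes per year. The number says nothing about whether that time is spent in one long outage or in dozens of 60-second recoveries, which is why recovery time is the better thing to measure.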

Why does my DNS keep failing in Kubernetes?

Probably ndots. Check your pod’s resolv.conf. Default ndots=5 means every lookup of a name with fewer than five dots walks the search domains first, which generates 3-4 extra DNS queries. This overloads CoreDNS under load. Set ndots=1 with a dnsConfig option in your pod template.

Is running databases on Kubernetes reliable?

It can be, but you have to treat the database as a first-class citizen. StatefulSets for ordering, dedicated nodes, volume backups tested weekly, proper preStop hooks, and a complete disaster recovery plan. We run PostgreSQL, Redis, and a custom event store on Kubernetes. It works. But it takes more effort than running them on VMs.

What role does observability play in reliability?

Everything. You can’t fix what you can’t see. We monitor etcd latency, API server request duration, scheduler queue depth, and pod start times. Without this data, you’re flying blind. Prometheus and Grafana are non-negotiable. Add Loki for logs and Tempo for traces. The cost is worth it.


Conclusion: What Is Reliability in Kubernetes?

After years building production systems at SIVARO, I can answer the question cleanly: reliability is the system’s ability to maintain correct behavior during and after failures. It’s not about keeping everything running. It’s about making the system predictable when things break.

You can’t prevent every failure. Nodes die. Networks partition. Software bugs exist. But you can design your cluster so that failures don’t cascade. You can configure etcd for speed and recovery. You can set probes that don’t cry wolf. You can handle SIGTERM with dignity.

The teams that succeed at Kubernetes reliability aren’t the ones with the most powerful clusters. They’re the ones who have tested failure scenarios, configured every knob deliberately, and accepted that reliability is a continuous process of measurement and improvement.

Start with one thing today: check your etcd configuration. Change those defaults. It’s the highest-leverage change you can make.


Nishaant Dixit — Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec.