What Is Reliability in Kubernetes? It’s Not What You Think
I’ve spent the last six years building data infrastructure and production AI systems at SIVARO. We process 200K events per second. We run stateful workloads on Kubernetes that most people told me couldn’t work. And I’ve learned one hard truth about reliability in Kubernetes: it’s not about keeping pods running. It’s about making the system predictable when things break.
Reliability in Kubernetes means the cluster can absorb failures, maintain data integrity, and return to steady state without you touching a terminal at 3 AM. It means your users don’t notice when a node dies. It means your database doesn’t silently corrupt data when a network partition happens.
Most people think reliability is about uptime. They’re wrong because uptime is a lagging indicator. Real reliability is about failure behavior. What happens when etcd leader election takes five seconds instead of one? What happens when a node’s kernel panics in the middle of an etcd write? What happens when your CNI plugin silently drops 0.1% of packets for three minutes?
These are the questions that separate a “reliable” cluster from one that’s just a ticking time bomb wrapped in YAML.
I’ll show you what reliability actually means — with specific numbers, patterns we’ve tested, and things I had to learn by breaking production clusters on purpose. This isn’t theory. This is what we do at SIVARO every day.
Defining Reliability in Kubernetes Terms
Let’s get precise. Kubernetes reliability breaks down into three layers:
Layer 1: Infrastructure reliability. The nodes, network, storage, and etcd cluster that Kubernetes sits on top of.
Layer 2: Control plane reliability. The API server, scheduler, controller manager, and cloud-controller-manager working correctly.
Layer 3: Workload reliability. Your applications handling pod restarts, node failures, and traffic spikes.
Here’s the thing most guides don’t tell you — layer 1 and layer 2 are prerequisites. You can’t have workload reliability without a stable control plane. And you can’t have a stable control plane without reliable infrastructure. But here’s the contrarian take: even with perfect layers 1 and 2, your application can still be unreliable.
I’ve seen teams run Kubernetes on bare metal with redundant everything. Their pods still failed because they didn’t handle SIGTERM properly. Their databases still corrupted because they used default volume provisioning. Reliability isn’t an infrastructure checkbox — it’s a system property you have to design for at every layer.
What Reliability Isn’t in Kubernetes
Most people conflate reliability with availability. They’re different.
- Availability is “is the API server responding?”
- Reliability is “does the API server respond with correct state under load, even after three controller manager restarts?”
Availability is easy. Run multiple API server replicas behind a load balancer. Done. Reliability is hard. It means the scheduler makes consistent decisions even when etcd is under write pressure. It means the admission webhooks don’t time out when the cluster has 10,000 pods. It means your ConfigMap updates propagate within a bounded time.
I worked with a client in 2022 who had 99.99% API server uptime. Their cluster was still unreliable, because the API server would return incorrect responses during rolling updates: 200 OK with wrong data. That’s worse than a 503. At least a 503 tells you something’s wrong.
Reliability is about semantic correctness under stress. Not just “is it up?”
The True Reliability Bottleneck: etcd
If I had to pick one thing that kills Kubernetes reliability, it’s etcd. Not because etcd is bad software — it’s exceptional. But because people treat it as an afterthought.
etcd is your cluster’s brain. Every pod creation, every node status update, every ConfigMap change goes through etcd. If etcd has latency spikes, your entire cluster has latency spikes. If etcd has a leader election storm, your cluster goes temporarily blind.
At SIVARO, we run etcd on dedicated nodes. Not shared with workloads. Not on the same instance type as your general compute nodes. Dedicated nodes with local NVMe SSDs, no bursting, and a separate network path.
Why? Because we tested what happens when etcd shares a node with a noisy neighbor pulling a large container image. The answer: etcd commit time jumped from 500 microseconds to 50 milliseconds. That’s a 100x degradation. You don’t notice until the scheduler stops scheduling because it can’t read pod state.
Here’s our etcd configuration that actually works for production reliability:
```text
# etcd member flags for reliability
--quota-backend-bytes=8589934592   # 8 GiB limit, prevents unbounded growth
--auto-compaction-mode=revision
--auto-compaction-retention=1000   # Keep 1000 revisions for history
--max-request-bytes=1572864        # 1.5 MiB max request size
--snapshot-count=100000            # Low enough that recovery replay stays short
```
The snapshot-count of 100,000 is critical. That has been the default in many etcd 3.x releases, but I’ve seen people bump it to 1,000,000 to reduce disk writes. Don’t. When etcd restarts, it loads the latest snapshot and then replays every change written since that snapshot. A larger snapshot count means more entries to replay and a longer recovery time. You want recovery under 30 seconds, not 5 minutes.
Control Plane Reliability: What Actually Breaks
The Kubernetes control plane is surprisingly robust. API servers are stateless: you can run 5 of them without coordination overhead (beyond etcd). The controller manager and scheduler are mostly stateless too. They read from etcd and write to it, but they don’t hold critical state themselves.
The real failure modes aren’t in the code — they’re in the timing.
Token Expiration Chaos
We saw this at SIVARO in 2023. Our cloud provider rotated the service account token for the controller manager every 24 hours. The controller manager doesn’t reload tokens automatically if you’re using in-cluster config. It keeps the old token until it’s rejected. Then it crashes. Then it restarts. Then it re-reads the token file.
This causes a 30-second window where:
- No endpoints are reconciled
- No deployments get scaled
- No ConfigMaps get propagated
The fix was simple: request a projected service account token with an explicit lifetime, and mount it as a volume so the kubelet refreshes the token file automatically before it expires.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: controller-manager-sa
  namespace: kube-system
automountServiceAccountToken: true
---
# In the controller manager pod spec
# Note: clients using in-cluster config also expect ca.crt and namespace
# files under this mountPath.
volumeMounts:
- name: token
  mountPath: /var/run/secrets/kubernetes.io/serviceaccount
  readOnly: true
volumes:
- name: token
  projected:
    sources:
    - serviceAccountToken:
        path: token              # required: file name under mountPath
        audience: kube-system
        expirationSeconds: 3600  # 1 hour, but auto-refreshed by the kubelet
```
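The refresh only helps if the process re-reads the token file instead of caching the first value forever. Here is a minimal Python sketch of that idea. `FileTokenSource` is a hypothetical helper written for illustration, not part of any Kubernetes client library, and the default path is simply the mount path from the manifest above.

```python
import os


class FileTokenSource:
    """Hypothetical helper: returns the current token, re-reading on change."""

    def __init__(self, path="/var/run/secrets/kubernetes.io/serviceaccount/token"):
        self.path = path
        self._mtime = None
        self._token = None

    def token(self):
        # The kubelet rewrites the projected token file before it expires,
        # so a changed modification time means a fresh token is available.
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:
            with open(self.path, encoding="utf-8") as f:
                self._token = f.read().strip()
            self._mtime = mtime
        return self._token
```

Calling `token()` before every request costs one `stat` call and avoids the crash-and-restart cycle described above.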
Scheduler Starvation
The scheduler is a greedy algorithm. It picks a node, assigns a pod, moves on. It never “un-schedules” a pod. Under normal load, this is fine. Under heavy load (500+ pod creations per second), the scheduler’s internal queue can grow unbounded.
We hit this during a cluster migration in 2022. We tried to move 2,000 pods to a new node pool simultaneously. The scheduler’s internal queue grew to 15,000 entries. New pod creation took 90 seconds instead of 500 milliseconds.
The fix: rate-limit pod creation on the application side. Kubernetes doesn’t have a built-in throttle for pod creation rate. You have to implement it yourself. We used a simple queue worker that creates pods at a max rate of 50 per second.
```python
# Token-bucket rate limiter for pod creation
import asyncio
import time


class PodRateLimiter:
    """Caps pod creation at max_rate per second, allowing short bursts."""

    def __init__(self, max_rate=50):
        self.max_rate = max_rate
        self.tokens = float(max_rate)
        self.last_time = time.monotonic()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at max_rate.
        now = time.monotonic()
        elapsed = now - self.last_time
        self.tokens = min(self.max_rate, self.tokens + elapsed * self.max_rate)
        self.last_time = now

    async def wait_for_token(self):
        # Call this before every pod creation request.
        self._refill()
        while self.tokens < 1:
            await asyncio.sleep(0.1)
            self._refill()
        self.tokens -= 1
```
This isn’t elegant. But it works. In our clusters, the scheduler handles 50 pods/sec reliably. Push it to 500, and you’ll see dropped watch notifications and scheduling failures.
Pod Reliability: The Hardest Layer
Pods are ephemeral. Everyone knows that. But how they die matters more than you think.
The SIGTERM Problem
Kubernetes sends SIGTERM to the main process (PID 1) when it wants to stop a pod. If your application doesn’t handle SIGTERM, Kubernetes waits for the terminationGracePeriodSeconds (default 30 seconds) and then sends SIGKILL. That hard kill can corrupt data if your application is in the middle of writing to disk or sending a response.
We wrote a distributed queue system that ran on Kubernetes. During a rolling update, pods were killed while holding unacknowledged messages. The queue lost data. Our fault, not Kubernetes’.
The fix: proper grace period and preStop hook. The preStop hook runs before SIGTERM is sent, and its runtime counts against the grace period.
```yaml
apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # Flush all pending writes
            kill -USR1 1  # Signal application to drain connections
            sleep 5       # Wait for drain
            # Wait until everything is flushed
            while [ -f /tmp/writing_to_disk ]; do
              sleep 1
            done
    ports:
    - containerPort: 8080
```
The key insight: your application needs to know when Kubernetes is about to kill it, and it needs time to finish critical work. A 30-second default grace period is not enough for stateful applications. We use 60 seconds minimum. Some databases need 120 seconds.
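On the application side, handling SIGTERM usually means recording the request and letting the work loop finish what it has in flight. The sketch below shows that shape in Python. `GracefulShutdown` and `run_worker` are illustrative names, and the in-memory list stands in for whatever queue client you actually use.

```python
import signal
import threading


class GracefulShutdown:
    """Records that a stop was requested so work loops can drain."""

    def __init__(self):
        self.stop_requested = threading.Event()

    def handle(self, signum, frame):
        # Only record the request; the work loop decides when it is safe to exit.
        self.stop_requested.set()

    def install(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.handle)


def run_worker(pending, process, shutdown):
    """Process items one at a time and stop taking new work after SIGTERM.

    Returns the items that were never started, so the caller can hand them
    back to the queue instead of losing them.
    """
    while pending and not shutdown.stop_requested.is_set():
        item = pending.pop(0)
        process(item)  # finishes (and acknowledges) the in-flight item
    return pending
```

The in-flight item always completes, and nothing new is picked up once the signal arrives. That is the behavior the grace period is buying time for.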
Liveness Probes That Kill Your Reliability
Most people think liveness probes are for reliability. They’re often wrong.
A liveness probe that restarts a pod is a last resort. If you restart a pod because a single request timed out, you’ve degraded reliability, not improved it. The restart takes time. The pod has to start up again, re-read state, re-establish connections. During that time, your service is degraded.
We follow one rule at SIVARO: liveness probes must only fail when the pod cannot recover without a restart. Memory corruption? Yes. Deadlock? Yes. Transient HTTP timeout? No.
Here’s how we actually use probes:
```yaml
# Detects deadlocks, not transient slowness
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 3
# Removes the pod from service when it is not ready to receive traffic
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2
# Only for slow-starting containers
startupProbe:
  httpGet:
    path: /healthz/startup
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 10
  failureThreshold: 30  # 30 x 10s = 5 minutes total startup time
```
The differentiation matters. Liveness for unrecoverable failures. Readiness for traffic management. Startup for slow inits. Mix them up and you’ll restart healthy pods for no reason.
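What the two endpoints check inside the application matters as much as the probe timings. One possible split, sketched in Python: liveness fails only when the main loop stops making progress, and readiness also fails on transient dependency problems. `HealthState`, the heartbeat timeout, and the `dependencies_ok` flag are assumptions for illustration, not a prescribed design.

```python
import time


class HealthState:
    """Backs separate /healthz/live and /healthz/ready handlers."""

    def __init__(self, heartbeat_timeout=60.0, clock=time.monotonic):
        self.clock = clock
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat = clock()
        self.dependencies_ok = False

    def beat(self):
        # Called by the main work loop on every iteration.
        self.last_heartbeat = self.clock()

    def live(self):
        # False only if the work loop has stopped making progress (e.g. deadlock).
        return self.clock() - self.last_heartbeat < self.heartbeat_timeout

    def ready(self):
        # Also False on transient problems; the pod leaves the Service
        # endpoints but is not restarted.
        return self.live() and self.dependencies_ok
```

A slow database flips `ready()` to false and traffic drains away. Only a stalled loop flips `live()`, and that is the one case where a restart actually helps.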
Stateful Workload Reliability: The Hard Case
Kubernetes was built for stateless apps. Everyone knows this. But we run stateful workloads — databases, queues, event stores — on Kubernetes. And they’re reliable. Here’s how.
StatefulSets Are Better Than Deployments for State
Obvious, I know. But I’ve seen people run PostgreSQL on a Deployment with a shared PersistentVolumeClaim. That’s a reliability disaster. If the pod moves to a different node, the volume might not attach in time. Or it might attach but with stale data.
StatefulSets guarantee:
- Ordered pod creation (pod-0 before pod-1)
- Stable network identities (app-0.<service-name> instead of a random hash suffix)
- Ordered graceful shutdown (pod-N before pod-0)
These guarantees are essential for databases. Many PostgreSQL setups on Kubernetes bootstrap pod-0 as the primary and the higher ordinals as replicas. If pods came up in arbitrary order, those ordering assumptions would break.
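Because the pod name is stable, a container can work out its own ordinal from its hostname at startup. A small Python sketch, assuming the usual `<statefulset-name>-<ordinal>` naming. The function names are illustrative, and "ordinal 0 bootstraps as primary" is an assumption of this sketch, not something PostgreSQL enforces.

```python
import re


def statefulset_ordinal(hostname):
    """Return the ordinal from a StatefulSet pod hostname such as 'app-2'."""
    match = re.fullmatch(r"(.+)-(\d+)", hostname)
    if match is None:
        raise ValueError(f"not a StatefulSet-style hostname: {hostname!r}")
    return int(match.group(2))


def is_initial_primary(hostname):
    # Assumption for this sketch: ordinal 0 bootstraps as the primary.
    return statefulset_ordinal(hostname) == 0
```

This only covers first boot. After a failover the primary may be any ordinal, so ongoing role decisions belong to your replication manager or operator, not to the hostname.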
Volume Snapshot Reliability
PersistentVolumeClaims can fail. The attachment can time out. The detach can hang. We’ve seen AWS EBS volumes take 30 seconds to attach during node failures. In that 30 seconds, your pod can’t start.
The mitigation: provision enough storage up front so you are not resizing under pressure, and use a volume binding mode that suits your workload. The binding mode is set on the StorageClass, not on the claim.
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
# Important: WaitForFirstConsumer prevents volume creation
# until a pod is scheduled, reducing orphan volumes
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
  labels:
    app: myapp
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 100Gi
```
We use WaitForFirstConsumer for all stateful workloads. It means the volume doesn’t get provisioned until a pod using the claim is scheduled to a node, so the volume is created in the same zone as that node. This sounds risky (“what if the pod moves to a different node?”), but it actually reduces volume churn: you don’t create volumes for claims no pod uses, and you don’t end up with a pod that can’t start because its volume lives in another zone.
Backup Reliability Is Part of Reliability
If your volume gets corrupted, can you recover? If you say “snapshots”, you haven’t tested them.
We test volume snapshots every week. We restore them into a separate namespace and verify data integrity. No snapshot is “live” until it’s been restored and checked. We use Velero for this, but the tool doesn’t matter. The process does:
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup-schedule
spec:
  schedule: "0 3 * * *"
  template:
    includedNamespaces:
    - production
    excludedResources:
    - events
    - nodes
    ttl: 720h
```
But schedules aren’t enough. We run a weekly restore test that spins up a temporary MySQL instance on the restored volume, runs a checksum query, and deletes everything. If the checksum doesn’t match, we get paged.
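The comparison step of a restore test can be as simple as hashing the same query result from the source and from the restored instance. A minimal Python sketch follows. `table_checksum` and `restore_matches` are illustrative names, and the rows are assumed to come from whatever database client you use.

```python
import hashlib


def table_checksum(rows):
    """Order-independent SHA-256 over an iterable of row tuples."""
    digest = hashlib.sha256()
    # Sort the serialized rows so the result does not depend on query order.
    for row in sorted(repr(tuple(r)) for r in rows):
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def restore_matches(source_rows, restored_rows):
    # True only if both sides contain exactly the same rows.
    return table_checksum(source_rows) == table_checksum(restored_rows)
```

For large tables you would more likely compute the checksum inside the database, but the pass/fail logic stays the same: any mismatch pages someone.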
Network Reliability: The Silent Killer
Networking is the most abstract reliability concern in Kubernetes. You can’t see it. You can’t easily measure it. But it breaks all the time.
DNS Timeouts That Take Down Everything
Kubernetes DNS (CoreDNS) is critical. Every pod depends on DNS for service discovery. If CoreDNS gets overloaded, every pod’s DNS queries time out. And DNS timeouts can cascade — pods try to connect to services, DNS fails, pods back off, DNS gets overwhelmed, more timeouts.
The fix: cache DNS locally in each pod, and use a dedicated CoreDNS deployment.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns-dedicated
  namespace: kube-system
spec:
  replicas: 3
  selector:
    matchLabels:
      k8s-app: coredns-dedicated
  template:
    metadata:
      labels:
        k8s-app: coredns-dedicated
    spec:
      # Dedicated nodes for DNS
      nodeSelector:
        node-role.kubernetes.io/dns: "true"
      containers:
      - name: coredns
        image: coredns/coredns:1.11.0
        args:
        - -conf
        - /etc/coredns/Corefile
        # Corefile ConfigMap volume and mount omitted for brevity
        ports:
        - containerPort: 53
          protocol: UDP
        - containerPort: 53
          protocol: TCP
        resources:
          requests:
            memory: 256Mi
            cpu: 500m
          limits:
            memory: 256Mi
```
The dedicated node selector means DNS pods don’t compete with application pods for resources. And three replicas mean one can fail without impact.
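But the biggest change: we set ndots: 1 for all pods through dnsConfig, which ends up in each pod’s /etc/resolv.conf. The Kubernetes default is ndots: 5, which means any name with fewer than 5 dots is tried against every search domain before it is tried as written. For an external name like api.example.com, that’s 3-4 extra DNS queries per lookup. With ndots: 1, any name that contains a dot is tried as written first. Single-label names like postgres still go through the search list, which is how they resolve to the Service in the pod’s namespace. This cut our DNS failure rate by 60%.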
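To see where the extra queries come from, it helps to list the names a resolver tries, in order. The Python sketch below models glibc-style behavior. `lookup_candidates` is an illustrative function, and the search list is a typical one for a pod in a `production` namespace. Other resolvers, such as musl, can behave differently.

```python
def lookup_candidates(name, search, ndots):
    """Order in which a glibc-style stub resolver tries names."""
    if name.endswith("."):
        return [name]  # trailing dot: fully qualified, search list skipped
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as written first.
        return [absolute] + searched
    # Too few dots: walk the search list first, the literal name last.
    return searched + [absolute]


search = ["production.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in lookup_candidates("api.example.com", search, ndots=5):
    print(candidate)
```

With `ndots=5` the real name comes last, after three lookups that can only fail. With `ndots=1` it comes first, and the resolver stops as soon as it gets an answer.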
CNI Plugin Failures
We use Cilium for CNI. It’s fast. It supports network policies. But it’s not immune to failure. In 2024, we had a Cilium agent crash-loop because of a kernel module conflict. All nodes in the cluster lost connectivity for 10 seconds.
The lesson: you need network disconnection detection in your applications. If your application can’t detect a network failure and reconnect gracefully, Kubernetes isn’t going to save you.
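```python
# Python example: exponential backoff with jitter for reconnection
import random
import socket
import time


def connect_with_backoff(host, port, max_attempts=5, base_delay=1.0):
    """Open a TCP connection, retrying with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return socket.create_connection((host, port), timeout=5)
        except OSError:
            # Covers refused connections, timeouts, and unreachable networks.
            if attempt == max_attempts:
                raise
            # Delay doubles each attempt; jitter keeps clients from retrying in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```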
Measuring Reliability: The Only Metric That Matters
Forget “uptime percentage”. That’s a vanity metric. The real reliability metric for Kubernetes is time to steady state after failure.
Measure this:
- Node failure: how long until pods are rescheduled and serving traffic?
- API server restart: how long until all controllers converge?
- Volume attachment failure: how long until the application detects the failure and retries?
We measure all three with Prometheus and alert on anything over 60 seconds. If a node failure takes more than 60 seconds to recover from, we have a reliability problem.
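The metric itself is simple arithmetic once you have timestamps. A sketch in Python, assuming you can collect the time of the failure and the times at which pods became Ready (for example from pod conditions). `time_to_steady_state` is an illustrative helper, not a Prometheus or Kubernetes API.

```python
def time_to_steady_state(failure_at, ready_at, expected_replicas):
    """Seconds from a failure until expected_replicas pods are Ready again.

    failure_at and ready_at are timestamps in seconds. Returns None if the
    workload never got back to full strength in the observed window.
    """
    # Only Ready transitions that happened after the failure count as recovery.
    recovered = sorted(t for t in ready_at if t >= failure_at)
    if len(recovered) < expected_replicas:
        return None
    return recovered[expected_replicas - 1] - failure_at
```

For a failure at t=100 with replacement pods Ready at 112, 131, and 145, a three-replica workload reached steady state in 45 seconds. A `None` result is the case worth paging on.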
Here’s a Prometheus rule we use:
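```yaml
groups:
- name: kubernetes-reliability
  rules:
  - alert: SlowPodRescheduling
    # Pods created more than 90s ago that are still not Ready,
    # ignoring pods that have already finished (Succeeded or Failed).
    # Uses kube-state-metrics series.
    expr: |
      (time() - kube_pod_created{namespace="production"}) > 90
      and on (namespace, pod)
      (kube_pod_status_ready{namespace="production", condition="true"} == 0)
      unless on (namespace, pod)
      (kube_pod_status_phase{namespace="production", phase=~"Succeeded|Failed"} == 1)
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.pod }} is taking too long to start"
```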
This alerts when any pod in production takes more than 90 seconds to become ready after creation. 90 seconds is our threshold. Yours might be different. But you need a threshold.
Frequently Asked Questions
What is reliability in kubernetes vs availability?
Availability asks “is the cluster up?”. Reliability asks “does the cluster behave correctly under failure?”. A cluster can be available (API server responds) but unreliable (returns stale data after an etcd leader change). Reliability is about correct behavior under stress.
How do I make my Kubernetes cluster more reliable without rewriting applications?
Start with etcd. Dedicated etcd nodes with fast disks. Then fix your probes — liveness probes should only restart on unrecoverable failures. Then add pod disruption budgets to prevent too many replicas from being unavailable simultaneously. Rewriting applications is the last step, not the first.
What’s the biggest reliability mistake teams make?
Using default settings. Default terminationGracePeriodSeconds is 30 seconds. That’s not enough for stateful apps. Probes left at their default timings (a 1-second timeout and a failure threshold of 3) cause unnecessary restarts. etcd’s default 2 GB backend quota is too low for busy clusters. Most defaults are designed for getting started, not production.
Can you achieve 99.99% reliability on Kubernetes?
Yes. But it’s not about uptime. It’s about predictable failure recovery. If every node failure is handled correctly within 60 seconds, that’s better than “99.99% uptime” spent on a single 50-minute outage every year. We achieve this at SIVARO with the patterns in this article. It’s not magic. It’s testing, monitoring, and relentless configuration hardening.
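For reference, an uptime percentage converts to a yearly downtime budget with one line of arithmetic. The sketch below is plain Python with an illustrative function name and a 365-day year.

```python
def downtime_budget_minutes(availability_percent, days=365):
    """Minutes of downtime per period allowed by an availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes/year")
```

At 99.99% the budget is roughly 52.6 minutes a year. The number says nothing about whether that time is one long outage or fifty short, well-handled recoveries, which is the distinction this article cares about.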
Why does my DNS keep failing in Kubernetes?
Probably ndots. Check your pod’s resolv.conf. Default ndots=5 means every lookup of a name with fewer than five dots can generate 3-4 extra DNS queries for search domains. This overloads CoreDNS under load. Set ndots=1 with a dnsConfig block in your pod template.
Is running databases on Kubernetes reliable?
It can be, but you have to treat the database as a first-class citizen. StatefulSets for ordering, dedicated nodes, volume backups tested weekly, proper preStop hooks, and a complete disaster recovery plan. We run PostgreSQL, Redis, and a custom event store on Kubernetes. It works. But it takes more effort than running them on VMs.
What role does observability play in reliability?
Everything. You can’t fix what you can’t see. We monitor etcd latency, API server request duration, scheduler queue depth, and pod start times. Without this data, you’re flying blind. Prometheus and Grafana are non-negotiable. Add Loki for logs and Tempo for traces. The cost is worth it.
Conclusion: What Is Reliability in Kubernetes?
After years building production systems at SIVARO, I can answer the question cleanly: reliability is the system’s ability to maintain correct behavior during and after failures. It’s not about keeping everything running. It’s about making the system predictable when things break.
You can’t prevent every failure. Nodes die. Networks partition. Software bugs exist. But you can design your cluster so that failures don’t cascade. You can configure etcd for speed and recovery. You can set probes that don’t cry wolf. You can handle SIGTERM with dignity.
The teams that succeed at Kubernetes reliability aren’t the ones with the most powerful clusters. They’re the ones who have tested failure scenarios, configured every knob deliberately, and accepted that reliability is a continuous process of measurement and improvement.
Start with one thing today: check your etcd configuration. Change those defaults. It’s the highest-leverage change you can make.
Nishaant Dixit — Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec.