What Is Disaggregated Inference? The Architect's Guide

By Nishaant Dixit

I spent three months in 2022 trying to cram a 175B parameter model onto a single GPU node. It was stupid. We burned $80K on HGX boxes before I admitted the emperor had no clothes.

Monolithic inference — running your entire model on one machine — is dead. Not dying. Dead for production workloads above 7B parameters. I'll show you exactly what I mean by "disaggregated inference" and why it's the only sane way to serve large models today.

What is disaggregated inference? It's the architectural pattern where you split an LLM's inference pipeline across separate pools of nodes instead of running every phase on the same machines. Prefill happens on one set of machines. Decoding happens on another. KV cache management? That's a third service. You stop treating inference as a single black box and start treating it as a pipelined distributed system.

This guide covers the architecture, the trade-offs, the tooling, and the hard lessons I learned getting this to work in production.


The One Graph, Two Phases Problem

Every transformer inference has two phases that hate each other:

Prefill — You pack a prompt (say 2,048 tokens) and compute all their attention in one shot. This is compute-bound. Saturates your GPU math units. Great for H100s.

Decoding — You generate tokens one at a time. Memory-bound. You're fetching KV cache entries and doing tiny matrix-vector multiplies. Your GPU sits idle waiting on memory bandwidth.

I benchmarked this on an A100-80GB. Prefill throughput: 1,500 tokens/sec. Decode throughput: 35 tokens/sec. Same card. Same model. One phase is 43x faster because of fundamentally different bottlenecks.
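One way to see why the two phases diverge is arithmetic intensity: FLOPs performed per byte of weights read. The sketch below is a rough roofline estimate, not a benchmark. It assumes FP16 weights, a single square weight matrix, and approximate A100-80GB spec-sheet figures (312 TFLOPS dense FP16, about 2.0TB/s of HBM bandwidth). The function names are illustrative, and activation and KV cache traffic are ignored.

```python
# Rough roofline estimate: is a matmul compute-bound or memory-bound?
# Assumed (approximate) A100-80GB spec figures, not measurements from this article.
PEAK_FLOPS = 312e12       # dense FP16 tensor-core FLOP/s
PEAK_BANDWIDTH = 2.0e12   # HBM bytes/s


def arithmetic_intensity(tokens: int, d_model: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights read for a (tokens x d) @ (d x d) matmul."""
    flops = 2 * tokens * d_model * d_model               # one multiply + one add per weight per token
    weight_bytes = d_model * d_model * bytes_per_weight  # weights are read once per pass
    return flops / weight_bytes


def bound(tokens: int, d_model: int = 8192) -> str:
    ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs/byte where the two limits meet
    if arithmetic_intensity(tokens, d_model) >= ridge:
        return "compute-bound"
    return "memory-bound"


print(bound(2048))  # prefill: 2,048 prompt tokens share one pass over the weights
print(bound(1))     # decode: one token per pass over the weights
```

With these assumed numbers the ridge point is about 156 FLOPs per byte. A 2,048-token prefill sits far above it; a single-token decode step sits far below it.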

Most people think you need to optimize the model. Wrong. You need to separate the phases onto different hardware.


How Disaggregation Actually Works

Here's the architecture we run at SIVARO for production inference:

[Prompt] → Router → Prefill Pool (H100s, batch=64, short timeout)
                         ↓
                    [KV Cache transferred via RDMA]
                         ↓
                   Decode Pool (L40s, batch=1, long timeout)
                         ↓
                    Token stream back to client

Three key services:

  1. Prefill Service — Receives full prompts. Computes all attention layers, caches the KV tensors. Returns a context handle (pointer to where the cache lives).

  2. KV Cache Store — Distributed memory-mapped store. We use Redis with custom RDMA extensions. Splits KV blocks across nodes based on layer depth.

  3. Decode Service — Reads cached KV blocks for each decode step. Runs the single-token forward pass. Appends new KV entries back to the store.

The router decides which prefill node to use based on prompt length. Short prompts (<1K tokens) go to cheaper GPUs. Long ones (>8K tokens) need H100 compute.
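A length-based routing rule like this can be sketched in a few lines. This is a simplified illustration, not the production router: the `PrefillNode` fields, the tier labels, and the handling of mid-length prompts (which the thresholds above leave open) are all assumptions.

```python
from dataclasses import dataclass

SHORT_PROMPT_MAX = 1_024   # "<1K tokens" threshold from the text
LONG_PROMPT_MIN = 8_192    # ">8K tokens" threshold from the text


@dataclass
class PrefillNode:
    name: str
    tier: str            # "budget" or "h100" (hypothetical labels)
    queued_tokens: int   # tokens currently waiting for prefill on this node


def pick_prefill_node(prompt_tokens: int, nodes: list[PrefillNode]) -> PrefillNode:
    if prompt_tokens < SHORT_PROMPT_MAX:
        wanted = {"budget"}
    elif prompt_tokens > LONG_PROMPT_MIN:
        wanted = {"h100"}
    else:
        # Assumption: mid-length prompts may land on either tier.
        wanted = {"budget", "h100"}
    # Fall back to every node if the preferred tier has none registered.
    candidates = [n for n in nodes if n.tier in wanted] or nodes
    # Within the chosen tier, take the node with the shortest token queue.
    return min(candidates, key=lambda n: n.queued_tokens)
```

Queue depth in tokens (rather than requests) is used as the tie-breaker because prefill cost grows with prompt length.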

We tested this at 32 concurrent users. Monolithic setup: 2.3s time-to-first-token (TTFT), 120 tokens/sec generation. Disaggregated: 0.8s TTFT, 340 tokens/sec generation. Same total GPU count.


Why This Isn't Just "Model Parallelism"

Model parallelism splits a model across GPUs for a single request. Disaggregation is different — it's about workload partitioning.

In model parallelism (tensor/pipeline), every GPU still runs both prefill and decode. You scale horizontally but your utilization curve stays lumpy. Compute units sit idle during long decode sequences. Decode steps get stuck waiting behind incoming prefill batches.

Disaggregation lets you scale each phase independently. Need more context windows? Add prefill nodes. More concurrent users? Add decode nodes. Need faster KV cache lookups? Scale the store horizontally.

We hit a wall at 64 concurrent users with pipeline parallelism. Hit 1,200 concurrent users with disaggregation before hitting bottlenecks. Same cluster size. The difference is you're not wasting compute on phase mismatch.


The KV Cache Nightmare (and How We Solved It)

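The KV cache is the nasty secret of autoregressive inference. Take a 70B-class model with 128 layers and sequence length 4,096. Assuming an 8,192-wide hidden state, full multi-head attention, and FP16, that's about 4MB of cache per token, or roughly 17GB per active sequence. At 50 concurrent sequences that's more than 800GB. You can't fit that on GPU memory. You can't even fit it on one node's DRAM.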
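Per-sequence cache size follows directly from the model shape, so it is worth computing for your own configuration before sizing anything. A minimal sketch: the 64 heads of dimension 128 (hidden size 8,192), the 8-KV-head grouped-query variant, and FP16 storage are illustrative assumptions, not the exact shape of the deployment described here.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # FP16
) -> int:
    """KV cache size for one sequence: keys and values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


GB = 1e9

# Full multi-head attention: every attention head keeps its own keys and values.
mha = kv_cache_bytes(layers=128, kv_heads=64, head_dim=128, seq_len=4096)
# Grouped-query attention: 8 KV heads shared across the 64 query heads.
gqa = kv_cache_bytes(layers=128, kv_heads=8, head_dim=128, seq_len=4096)

print(f"full MHA:        {mha / GB:.1f} GB per sequence")
print(f"GQA, 8 KV heads: {gqa / GB:.1f} GB per sequence")
```

Grouped-query attention cuts the cache by the ratio of query heads to KV heads, which is why the same sequence length can cost very different amounts of memory on different models.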

Disaggregation forces you to confront this head-on.

First approach: Centralized cache on CPU RAM. We tried this. Latency spiked to 45ms per decode step. The PCIe round-trips killed us.

Second approach: Distributed cache with GPU Direct RDMA. Better. We used vLLM's PagedAttention ideas but split the pages across a cluster of A100s acting as cache-only nodes. Each node holds 30-40% of one cache. When a decode step runs, it fetches missing pages over NVLink.

Third approach (what we use now): Three-tier cache. Hot cache on decode GPU (last 2,048 tokens). Warm cache on a local NUMA node via CXL memory-expander. Cold cache in distributed store via RDMA. We prefetch the next 1,024 tokens' worth of KV blocks during each decode step.

This cut our average decode step from 22ms to 8ms. Prefetching hides the RDMA latency.
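The hot/warm/cold lookup with prefetching can be modeled with plain dictionaries. This is a toy sketch of the idea only: the class and method names are hypothetical, block IDs stand in for KV page addresses, and the cold dictionary stands in for an RDMA fetch from the distributed store.

```python
from collections import OrderedDict


class TieredKVCache:
    """Hot (GPU) -> warm (local memory) -> cold (remote store) lookup, modeled with dicts."""

    def __init__(self, hot_capacity: int, cold_store: dict[int, bytes]):
        self.hot: OrderedDict[int, bytes] = OrderedDict()  # LRU order, oldest first
        self.warm: dict[int, bytes] = {}
        self.cold = cold_store
        self.hot_capacity = hot_capacity

    def _promote(self, block_id: int, data: bytes) -> None:
        self.hot[block_id] = data
        self.hot.move_to_end(block_id)
        while len(self.hot) > self.hot_capacity:
            old_id, old_data = self.hot.popitem(last=False)  # evict least recently used
            self.warm[old_id] = old_data                      # demote instead of dropping

    def get(self, block_id: int) -> tuple[bytes, str]:
        """Return the block and the name of the tier that served it."""
        if block_id in self.hot:
            self.hot.move_to_end(block_id)
            return self.hot[block_id], "hot"
        if block_id in self.warm:
            data = self.warm.pop(block_id)
            self._promote(block_id, data)
            return data, "warm"
        data = self.cold[block_id]  # stands in for a remote RDMA fetch
        self._promote(block_id, data)
        return data, "cold"

    def prefetch(self, block_ids: list[int]) -> None:
        """Pull upcoming blocks from cold into warm before the decode step needs them."""
        for block_id in block_ids:
            if block_id not in self.hot and block_id not in self.warm:
                self.warm[block_id] = self.cold[block_id]
```

A decode loop would call `prefetch` for the blocks it expects to touch next, so that by the time `get` runs, the slow cold path has already been paid off the critical path.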


The Router Problem People Ignore

Everyone talks about the inference server. No one talks about the router that decides where a request goes.

Your router needs to know:

  • Current load on each prefill node
  • Current load on each decode node
  • KV cache fragmentation across the store
  • Estimated prompt length (before attention is computed)
  • Which decode nodes have warm caches for related conversations

We built a custom router using eBPF for zero-copy packet inspection. It reads the first 64 bytes of the request (contains token count and conversation ID) and makes a routing decision in <50 microseconds.

Early version used round-robin. Worked until we hit 500 concurrent users. Then we saw 30% of requests get routed to nodes whose KV cache store had already evicted the context. Complete re-prefill required. Latency went from 2s to 15s.

Lesson: Route on cache location, not load. A loaded node with warm cache beats an idle node with cold cache every time.
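Cache-first routing comes down to a two-step filter: prefer nodes that already hold the conversation's KV cache, and only then look at load. A simplified sketch, with a hypothetical `DecodeNode` record standing in for whatever state the real router tracks:

```python
from dataclasses import dataclass, field


@dataclass
class DecodeNode:
    name: str
    active_sequences: int
    max_sequences: int
    warm_conversations: set[str] = field(default_factory=set)


def pick_decode_node(conversation_id: str, nodes: list[DecodeNode]) -> DecodeNode:
    # Only nodes with a free slot are eligible at all.
    available = [n for n in nodes if n.active_sequences < n.max_sequences]
    if not available:
        raise RuntimeError("no decode capacity")
    # Cache location first: a warm node wins even if it is busier.
    warm = [n for n in available if conversation_id in n.warm_conversations]
    pool = warm or available
    # Load is only the tie-breaker within the chosen pool.
    return min(pool, key=lambda n: n.active_sequences)
```

Round-robin is the degenerate case where the `warm` filter is skipped, which is exactly what produces the forced re-prefills described above.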


When Disaggregation Fails

I'm going to tell you when this architecture blows up.

Short prompts with frequent context switches. If everyone sends single-turn 50-token questions, your prefill nodes do almost nothing (quick compute) and your decode nodes sit idle. The overhead of transferring KV cache across nodes outweighs any benefit. For this workload, monolithic is better.

Models under 7B parameters. The disaggregation overhead (router, KV cache store, RDMA setup) adds ~20ms to latency. For a 3B model that generates at 100 tokens/sec, that's 2 tokens of overhead. Not worth it.

Hardware heterogeneity mismatch. We tried pairing H100 prefill nodes with A10 decode nodes. The decode nodes were 4x slower than prefill could feed them. We ended up with a prefill pipeline that finished in 200ms but each decode step took 45ms. The user-perceived latency was dominated by the slowest decode node.

I fell for this. Burned two weeks tuning. The fix: match prefill throughput to decode throughput. Prefill should complete in <30% of the average decode sequence time. Adjust batch sizes and GPU count accordingly.
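Matching the two pools is mostly Little's law: work in flight equals arrival rate times service time. The sketch below is a back-of-the-envelope sizing helper, not a capacity planner. The arrival rate, output length, and slots-per-node values are made-up inputs, and it assumes steady load and perfect batching.

```python
import math


def pool_sizes(
    arrival_rate: float,      # requests per second (assumed input)
    prefill_ms: float,        # time to prefill one batch
    prefill_batch: int,       # prompts processed per prefill batch
    decode_step_ms: float,    # time per generated token
    avg_output_tokens: int,   # tokens generated per request (assumed input)
    decode_slots: int,        # concurrent sequences per decode node (assumed input)
) -> tuple[int, int]:
    """Return (prefill_nodes, decode_nodes) needed to keep up with the arrival rate."""
    prefill_s = prefill_ms / 1000
    decode_seq_s = decode_step_ms * avg_output_tokens / 1000
    # Little's law: requests in flight = arrival rate x time spent in the phase.
    prefill_in_flight = arrival_rate * prefill_s
    decode_in_flight = arrival_rate * decode_seq_s
    prefill_nodes = max(1, math.ceil(prefill_in_flight / prefill_batch))
    decode_nodes = max(1, math.ceil(decode_in_flight / decode_slots))
    return prefill_nodes, decode_nodes


# 200ms prefill and 45ms decode steps, as in the mismatch described above.
print(pool_sizes(10, 200, 64, 45, 256, 32))
# Same load with 8ms decode steps.
print(pool_sizes(10, 200, 64, 8, 256, 32))
```

The decode pool grows linearly with step time, so a slow decode GPU has to be paid for in node count or it becomes the bottleneck.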


Code Example: The Disaggregated Inference Client

Here's what a client looks like when talking to a disaggregated system. Simplified from our Python SDK.

```python
import asyncio
from typing import AsyncGenerator

import aiohttp

TOKEN_BYTES = 4  # Assumed wire format: one big-endian 4-byte integer per token


class DisaggregatedInferenceClient:
    def __init__(self, prefill_url: str, decode_url: str, cache_url: str):
        self.prefill = prefill_url
        self.decode = decode_url
        self.cache = cache_url  # Not used by generate(); kept for cache-management calls

    async def generate(
        self,
        prompt: list[int],
        max_new_tokens: int = 256,
    ) -> AsyncGenerator[list[int], None]:
        # One session for both calls so connections are pooled and closed once
        async with aiohttp.ClientSession() as session:
            # Step 1: Send prompt to prefill service
            async with session.post(
                f"{self.prefill}/prefill",
                json={"tokens": prompt},
                timeout=aiohttp.ClientTimeout(total=30.0),
            ) as prefill_resp:
                prefill_resp.raise_for_status()
                prefill_data = await prefill_resp.json()
            # cache_handle includes: node_id, kv_block_ranges, seq_id
            cache_handle = prefill_data["cache_handle"]

            # Step 2: Stream tokens from decode service
            async with session.post(
                f"{self.decode}/decode_stream",
                json={"cache_handle": cache_handle, "max_tokens": max_new_tokens},
            ) as decode_resp:
                decode_resp.raise_for_status()
                while True:
                    try:
                        # Read exactly one token. Iterating .content directly
                        # would split the byte stream on newlines instead.
                        chunk = await decode_resp.content.readexactly(TOKEN_BYTES)
                    except asyncio.IncompleteReadError:
                        break  # Stream ended
                    yield [int.from_bytes(chunk, byteorder="big")]
```

This client doesn't know about GPUs. Doesn't know about cache placement. The router and decode nodes handle all that. That's the point — you should be able to swap out the inference backend without changing client code.


The Rust-Based KV Cache Store

We rewrote our KV cache store in Rust for a reason. Python's GIL and async overhead killed us at 10K requests/sec.

```rust
// Simplified KV cache store node.
// `fetch_via_rdma`, `current_timestamp`, and `CacheError` are defined elsewhere.
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

use half::f16; // Stable Rust has no built-in f16; this comes from the `half` crate
use rdma::MemoryRegion;
use tokio::sync::RwLock;

// (node_id, layer, seq_id) -> cached tensor blocks
type BlockKey = (String, u32, u64);

pub struct KVBlockStore {
    blocks: RwLock<HashMap<BlockKey, CachedBlock>>,
    // RDMA registered memory for zero-copy transfer
    rdma_region: MemoryRegion,
}

pub struct CachedBlock {
    data: Vec<f16>,           // The actual KV tensors
    last_access: AtomicU64,   // For LRU eviction; atomic so hits can update it under a read lock
    access_count: AtomicU64,  // For hot/warm classification
    seq_length: u32,          // How many tokens this block spans
}

impl KVBlockStore {
    pub async fn get_or_prefetch(
        &self,
        node_id: &str,
        layer: u32,
        seq_id: u64,
        target_seq_length: u32,
    ) -> Result<Vec<f16>, CacheError> {
        let key: BlockKey = (node_id.to_string(), layer, seq_id);

        // Try local first (fast path). The read guard is dropped at the end of this scope.
        {
            let blocks = self.blocks.read().await;
            if let Some(block) = blocks.get(&key) {
                if block.seq_length >= target_seq_length {
                    // Record the hit so LRU eviction and hot/warm stats stay accurate
                    block.last_access.store(current_timestamp(), Ordering::Relaxed);
                    block.access_count.fetch_add(1, Ordering::Relaxed);
                    return Ok(block.data.clone());
                }
            }
        }

        // Miss -> fetch from peer via RDMA
        let peer_block = self.fetch_via_rdma(node_id, layer, seq_id).await?;

        // Update local cache (LRU eviction handled separately)
        let mut blocks = self.blocks.write().await;
        blocks.insert(
            key,
            CachedBlock {
                data: peer_block.clone(),
                last_access: AtomicU64::new(current_timestamp()),
                access_count: AtomicU64::new(1),
                seq_length: target_seq_length,
            },
        );

        Ok(peer_block)
    }
}
```

The RDMA path runs at ~2µs per 4KB block. PCIe Gen5 can do faster, but RDMA with GPU direct memory access is where real-time inference lives.


What About Fault Tolerance?

Your disaggregated system has more moving parts. More things break.

A prefill node crashes mid-request. The KV cache it computed is gone. You need to re-prefill on a different node. That costs 200-800ms depending on prompt length.

A decode node crashes during streaming. The client gets an incomplete response. You need to resume from the last checkpoint.

How we handle it:

  • KV cache is replicated across two cache nodes. Write to both. If one dies, read from the other.
  • Decode nodes checkpoint their state every 32 tokens. On failure, the router redirects to a new decode node with the checkpoint.
  • Prefill nodes are stateless — we don't checkpoint them. If one fails, the router picks a new one and re-prefills.

This adds 15% overhead on cache writes. Worth it for 99.9% uptime.
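The write-to-both, read-from-either rule for the cache can be sketched with in-memory stand-ins. Everything here is hypothetical scaffolding: `CacheNode` fakes a cache server, and letting a write succeed when only one replica is reachable is a design choice made for this example, not necessarily how the production store behaves.

```python
class CacheNodeDown(Exception):
    pass


class CacheNode:
    """In-memory stand-in for one KV cache node."""

    def __init__(self, name: str):
        self.name = name
        self.blocks: dict[str, bytes] = {}
        self.alive = True

    def put(self, key: str, data: bytes) -> None:
        if not self.alive:
            raise CacheNodeDown(self.name)
        self.blocks[key] = data

    def get(self, key: str) -> bytes:
        if not self.alive:
            raise CacheNodeDown(self.name)
        return self.blocks[key]


class ReplicatedKVCache:
    def __init__(self, primary: CacheNode, replica: CacheNode):
        self.nodes = [primary, replica]

    def put(self, key: str, data: bytes) -> None:
        # Write to both nodes; tolerate one being down (assumed policy).
        written = 0
        for node in self.nodes:
            try:
                node.put(key, data)
                written += 1
            except CacheNodeDown:
                continue
        if written == 0:
            raise CacheNodeDown("all replicas down")

    def get(self, key: str) -> bytes:
        # Read from the first node that is up and has the block.
        for node in self.nodes:
            try:
                return node.get(key)
            except (CacheNodeDown, KeyError):
                continue
        raise KeyError(key)  # Nothing left to read: the caller has to re-prefill
```

The final `KeyError` is the signal to fall back to the stateless path: pick a prefill node and recompute the cache.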


The Tooling Stack We Use

| Component | Our Choice | Why |
| --- | --- | --- |
| Orchestrator | Ray Serve with custom autoscaler | Handles node pools, health checks, rolling updates |
| KV Cache Store | Custom Rust binary + Redis Cluster | Redis for metadata, Rust for tensor storage and RDMA |
| Router | eBPF-based (custom) | <50µs decision time |
| Prefill/Decode nodes | vLLM forks (patch for KV cache externalization) | Best batch scheduler for phase-specific workloads |
| Monitoring | Grafana + Thanos | 15-second granularity on all metrics |

We tried Kubernetes native scaling. Too slow. Ray Serve's autoscaler spins up new nodes in 3-5 seconds. That's fast enough for most load spikes.


Performance Numbers You Can Expect

From our production deployment serving a 70B-class model:

| Metric | Monolithic (HGX A100) | Disaggregated (same GPU count) |
| --- | --- | --- |
| Time-to-first-token (avg) | 1,400ms | 420ms |
| Generation speed (avg) | 85 tok/s | 240 tok/s |
| GPU utilization | 38% | 78% |
| Max concurrent users | 96 | 1,200 |
| 95th percentile latency | 3.2s | 1.1s |

The utilization gain alone pays for the engineering overhead. We run about half the GPU hardware for the same throughput.


FAQ

What is disaggregated inference? It's the architecture where inference is split into separate services for prefill, KV cache storage, and decoding, each running on different hardware and scaled independently.

Does disaggregation work with any model? Works best with models over 7B parameters. Below that, the overhead outweighs the benefits. Tested on Llama-2, Mixtral, Qwen-2, and GPT-like architectures.

How much faster is TTFT? We see 3-4x improvement on time-to-first-token compared to monolithic setups, because prefill runs on optimized hardware without decode competition.

What's the hardest part to implement? The KV cache store. Getting RDMA and GPU-direct memory working reliably across PCIe topologies took 6 weeks alone. Don't underestimate it.

Can I use this with existing frameworks like vLLM? vLLM supports PagedAttention and some disaggregation features. We had to fork it to add KV cache externalization to a separate service.

What about latency during KV cache transfer? With RDMA, ~2µs per 4KB block. For an 8K token context with 128 layers, total transfer is ~8ms. Prefetching hides most of this.

Is disaggregation cheaper than monolithic? Yes, if you have high concurrency (>50 users). The GPU utilization gains reduce hardware needs by 30-50%. For low concurrency, stick with monolithic.

What happens if a decode node fails mid-stream? The router redirects to a new node with the last checkpoint (every 32 tokens). Client gets a retry signal. We see this in <2% of long-running conversations.


Where This Goes Next

I'm watching two trends that make disaggregation inevitable:

Memory bandwidth scaling. Decode nodes need more bandwidth per GPU. H100 SXM parts have about 3.35TB/s. B200s are specced at 8TB/s. But prefill needs math units, not bandwidth. The hardware specialization gap widens every generation.

Speculative decoding. Running small draft models alongside large target models. Disaggregation lets you run the draft model on cheap GPUs and the target model on expensive ones, with KV cache shared across tiers.

My bet: within 18 months, every production inference system serving models over 30B parameters will be disaggregated. The monolithic approach becomes a niche for single-user, offline, or tiny model workloads.


Nishaant Dixit — Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec.
