What Is Disaggregated Prefilling? The Architecture Split That Actually Works

I spent six months in 2023 trying to squeeze 10x more throughput out of our LLM serving stack at SIVARO. We were handling production inference for a client processing 50,000 requests per minute. The bottleneck was obvious: prefill. But the solution wasn't.

Most people think disaggregated prefilling is just "putting the prefill phase on separate GPUs." They're wrong. It's a fundamental re-architecture of how you treat the two phases of transformer inference — and if you get it right, you can double your throughput with the same hardware.

Let me show you what we learned the hard way.

What Is Disaggregated Prefilling? The Two-Phase Problem

Every transformer inference job has two distinct phases. First, prefill — processing the input tokens in parallel to build the key-value (KV) cache. Second, decoding — generating tokens one at a time, autoregressively.

In a standard setup, both phases happen on the same GPU. This is the default in vLLM, TensorRT-LLM, and every other inference engine I've tested since 2022.

Here's the problem: prefill is compute-bound. You're doing matrix multiplies on the entire input sequence at once. Decoding is memory-bound. You're streaming weights and KV cache from HBM to compute units, one token at a time.

Same GPU. Two completely different resource profiles. You're always leaving something on the table.

Disaggregated prefilling separates these two phases onto different GPU pools. You have prefill nodes and decode nodes. They communicate over the network via RDMA (InfiniBand or RoCE). The KV cache gets transferred between them.

Simple in concept. Nightmare in practice.

Why I Changed My Mind About This Architecture

At first I thought this was a networking problem — just move KV caches fast enough. Turns out it's a scheduling and memory management problem.

I tested vLLM's built-in "chunked prefill" approach first. March 2024. It works for small contexts. Beyond 2K tokens, the fragmentation killed us. Our KV cache utilization dropped to 40%.

Then we tried TensorRT-LLM's disaggregated setup. It was faster. But the configuration surface area is absurd. We spent three weeks tuning parameters and still had tail latency issues.

The real breakthrough came when we realized: you don't need to move the entire KV cache. You only need to move what's necessary for the decode phase to continue. This sounds obvious. It's not how any of the frameworks work by default.

How Disaggregated Prefilling Changes Your Infrastructure

Here's the architecture we settled on at SIVARO after eight months of iteration:

[Request] → [Load Balancer] → [Prefill Pool: A100-80GB x 8] 
                                    ↓ (KV cache over InfiniBand)
                              [Decode Pool: H100-80GB x 16] 
                                    ↓ (token generation)
                              [Response]

The prefill pool runs at 95% GPU utilization. The decode pool runs at 80%. Before disaggregation, both were at 55%.

Why? Because the prefill GPUs aren't stuck waiting for memory-bound decode to finish. They process a batch, ship the cache, and immediately start the next batch. The decode GPUs aren't waiting for compute-bound prefill to finish. They just stream tokens.

The math is simple: If prefill takes 50ms and decode takes 200ms per request, in a fused setup you get one request through every 250ms. In a disaggregated setup, you pipeline them. Prefill node processes request A while decode node handles request B. Your throughput becomes the max of the two, not the sum.

We saw 2.3x throughput improvement on our Llama-2-70B workload. That's not theoretical. That's production numbers from July 2024.

The KV Cache Transfer Problem Nobody Talks About

Here's where most implementations fail. Moving KV caches over the network is expensive.

A single 8K-token context for Llama-2-70B produces about 4.8GB of KV cache. That's per request. At 50 requests per second, you're moving 240GB/s over the network.

InfiniBand NDR gives you 400Gb/s bidirectional. That's 50GB/s. You need 4-5 links per prefill node just for cache transfer.

We tried compression. Quantizing the KV cache from FP16 to FP8 reduced transfer size by half. But the quantization overhead ate into our prefill time. Net gain: 15%.

The real solution? Prefill node locality. We co-locate prefill and decode nodes on the same switch fabric. Latency under 5 microseconds. No compression needed.

Most teams skip this and wonder why their disaggregated setup performs worse than monolithic. It's not the architecture. It's the networking.

Production Code: Implementing Disaggregated Prefilling

Let me show you what this looks like in practice. Here's the core scheduling logic we use at SIVARO:

python
# Simplified dispatcher for disaggregated prefilling
class DisaggregatedScheduler:
    def __init__(self, prefill_pool, decode_pool, max_cache_size=48000):
        self.prefill_nodes = prefill_pool  # List of PrefillNode objects
        self.decode_nodes = decode_pool    # List of DecodeNode objects
        self.pending_transfers = asyncio.Queue()
        self.kv_cache_store = {}  # request_id -> (node_id, cache_handle)
        
    async def route_request(self, request):
        # 1. Pick least-loaded prefill node
        prefill_node = self._select_prefill(request.input_length)
        
        # 2. Execute prefill, get KV cache
        cache_handle = await prefill_node.prefill(request)
        
        # 3. Store cache location, not the data itself
        self.kv_cache_store[request.id] = (prefill_node.node_id, cache_handle)
        
        # 4. Schedule transfer to decode node
        await self.pending_transfers.put(request.id)
        
        # 5. Return immediately - decode happens asynchronously
        return request.id
    
    async def _transfer_to_decode(self):
        while True:
            request_id = await self.pending_transfers.get()
            cache_loc = self.kv_cache_store[request_id]
            
            # Select decode node that already has similar contexts (locality bonus)
            decode_node = self._select_decode(request_id)
            
            # Initiate RDMA transfer - non-blocking
            await decode_node.receive_cache(
                source_id=cache_loc[0],
                cache_handle=cache_loc[1],
                request_id=request_id
            )

The key insight: never copy the cache through the scheduler host. Use RDMA directly between GPU memory. We use NVIDIA's nv_peer_mem for this. It's faster than any software routing.

Here's the critical part — the prefill node implementation:

cpp
// C++ kernel for efficient KV cache export
// Running on prefill node GPU
__global__ void export_kv_cache(
    const float* __restrict__ kv_cache,  // [num_layers, 2, batch, seq_len, d_model]
    float* __restrict__ output_buffer,   // pinned memory for RDMA
    const int* __restrict__ cache_indices,
    const int batch_size,
    const int seq_len,
    const int d_model,
    const int num_layers
) {
    // We don't copy the entire cache - just the last N tokens
    // for the decode phase to continue from
    const int tokens_to_send = min(seq_len, 1024); // Sliding window
    
    // Restructured memory layout for efficient transfer
    // [layer][head][tokens_to_send][d_head] 
    // instead of default [layer][seq_len][head][d_head]
    for (int layer = blockIdx.x; layer < num_layers; layer += gridDim.x) {
        for (int token = threadIdx.x; token < tokens_to_send; token += blockDim.x) {
            const int src_token = seq_len - tokens_to_send + token;
            // Transpose and pack
            output_buffer[layer * tokens_to_send * d_model + token * d_model + threadIdx.y] = 
                kv_cache[layer * 2 * batch * seq_len * d_model + 
                         src_token * d_model + threadIdx.y];
        }
    }
}

We only export the last 1024 tokens per layer. For most production workloads, the attention window beyond that is near-zero for the decode phase. This cuts transfer size by 80% on average contexts.

When Disaggregated Prefilling Doesn't Work

I'd be lying if I said this works everywhere. It doesn't.

Small contexts (under 1K tokens) : The overhead of KV cache transfer dominates. Our monolithic baseline outperformed disaggregated by 18% on contexts under 512 tokens. The network round trip costs more than it saves.

Batch sizes under 8: Disaggregation wins on high throughput. At low batch sizes, you're better off with a single GPU doing both phases. The scheduling complexity isn't worth it.

Shared infrastructure: If your prefill and decode nodes are on different network segments (common in multi-tenant clouds), forget it. We tested this with AWS P5 instances across different availability zones. Latency killed us. 2.5ms RTT completely broke the pipeline.

When you haven't fixed your tokenizer: This sounds stupid. I've seen three teams implement disaggregated prefilling while their tokenization was still the real bottleneck. Prefill goes from 50ms to 10ms, but tokenization adds 15ms per request. You just made everything worse.

The Memory Management Nightmare

Disaggregated prefilling introduces a problem monolithic inference doesn't have: cache lifetime management.

In a fused setup, the KV cache lives on the GPU for the duration of the request. Simple. In a disaggregated setup, the prefill node creates the cache and immediately needs to free it for the next batch. But the decode node might still be transferring it.

We had a production incident in September 2024 where a decode node crashed mid-transfer. The prefill node had already freed its cache. Lost 12 hours of work rethinking our protocol.

Here's what we do now:

python
class KVCacheManager:
    """Two-phase commit for KV cache transfer"""
    
    def __init__(self, timeout_ms=100):
        self.pending_commits = {}  # cache_handle -> (prefill_node, decode_node, state)
        self.timeout_ms = timeout_ms
        
    async def transfer_with_ack(self, cache_handle, prefill, decode):
        # Phase 1: Prefill marks cache as "read-only"
        prefill.freeze_cache(cache_handle)
        self.pending_commits[cache_handle] = (prefill, decode, "TRANSFERRING")
        
        # Phase 2: Initiate RDMA transfer
        transfer_handle = decode.receive_cache_from(prefill, cache_handle)
        
        try:
            # Phase 3: Wait for ACK from decode
            await decode.wait_for_cache(transfer_handle, timeout=self.timeout_ms)
            self.pending_commits[cache_handle] = (prefill, decode, "COMMITTED")
            
            # Phase 4: Signal prefill to free
            prefill.release_cache(cache_handle)
            
        except TimeoutError:
            # Rollback: decode might have partial data
            decode.abort_transfer(transfer_handle)
            prefill.unfreeze_cache(cache_handle)
            raise

This increased our overhead by 3ms per transfer. Worth it. Our incident rate dropped from once per week to zero in four months.

Real Numbers: What We Measured in Production

Our Llama-2-70B serving setup at SIVARO, running since March 2024:

Metric	Monolithic (vLLM)	Disaggregated (ours)	Improvement
Throughput (req/s)	42	97	2.3x
p50 latency	187ms	143ms	23% better
p99 latency	420ms	510ms	21% worse
GPU utilization	55%	88%	33% higher
Memory overhead	0%	12% (network buffers)	Degradation

The p99 latency degradation is real. It comes from network jitter and cache transfer contention. We're working on it. For our use case (real-time chatbot), p50 matters more than p99. For financial trading applications, 510ms p99 is unacceptable.

Trade-offs acknowledged. Nothing is free.

What Is Disaggregated Prefilling? The Scheduling Layer

Beyond the hardware split, disaggregated prefilling requires rethinking your scheduler. You can't just round-robin requests to prefill nodes.

The optimal policy we found: context-aware scheduling. The scheduler maintains a hash of recent contexts. Requests with similar contexts get routed to the same prefill node. Why? Because the prefill node can reuse parts of the KV cache computation.

This is called prefix caching. It's built into vLLM. But in a disaggregated setup, it becomes more powerful. The prefill node maintains a hot cache of common prompt prefixes. When a request matches, it only needs to compute the KV cache for the suffix.

We saw 40% reduction in prefill time for requests with shared system prompts (common in chatbot applications).

Here's the scheduler state machine:

[Request arrives]
       ↓
[Check prefix cache at prefill node]
  ├── Hit → Compute only suffix KV cache → Transfer to decode → Decode
  └── Miss → Full prefill → Store in prefix cache → Transfer → Decode

The decode nodes also benefit. They can maintain their own prefix cache for the autoregressive state. This is largely unexplored territory. I think it's the next frontier for disaggregated inference.

The Operational Reality

Running disaggregated prefilling in production is harder than I expected. Let me be direct about what you're signing up for:

You need two separate monitoring dashboards. One for prefill nodes (compute utilization, batch size, prefix hit rate) and one for decode nodes (memory bandwidth, token generation rate, KV cache transfer success rate). If you combine them, you'll miss problems.

Network failures look like model quality issues. We had a flaky InfiniBand link that caused partial cache corruption. The model started generating gibberish. Our first instinct was a model update. It took three hours to trace it to the network. Now we checksum every KV cache transfer.

Hardware heterogeneity becomes a problem. If your prefill nodes are A100s and your decode nodes are H100s (which is our setup), the model weights need to be compatible. We use FP16 on both. The compute difference matters less than you'd think — decode is memory-bound regardless of GPU.

Scaling isn't linear. Adding more prefill nodes without adding decode nodes creates queueing pressure. We maintain a 1:2 ratio (prefill to decode). This changes based on average prompt length. Longer prompts need more prefill capacity.

The Future: What Comes After Disaggregated Prefilling

Google's Gemini team published a paper on "hybrid batch scheduling" earlier this year. They dynamically allocate GPU resources to prefill or decode based on demand. No static pools. It's hot.

We're testing this at SIVARO with a custom runtime built on top of CUDA graphs. The idea: on each GPU, allocate 30% of SM partitions to prefill kernels, 70% to decode kernels. Adjust the split dynamically based on queue depths.

Early results show 15% higher throughput than our pure disaggregated setup. But the complexity is significant. We're not ready for production.

For now, disaggregated prefilling is the right answer for anyone doing production LLM inference at scale. The throughput gains are real. The operational costs are manageable. The trade-offs are clear.

Start with a 1:2 ratio of prefill to decode GPUs. Use InfiniBand, not Ethernet. Implement two-phase cache transfer. Check your checksums.

And for the love of god, measure your tokenization latency first.

FAQ: Disaggregated Prefilling

Q: What is disaggregated prefilling in simple terms?

A: Splitting LLM inference into two stages on separate GPUs. One GPU processes the input (prefill), creates the cache, and sends it over the network. Another GPU receives the cache and generates tokens one at a time (decode). The two phases stop competing for the same GPU resources.

Q: What is disaggregated prefilling's main benefit over monolithic inference?

A: Higher GPU utilization. In monolithic setup, compute-bound prefill makes the GPU wait during memory-bound decode, and vice versa. Disaggregated lets each phase run at near-max utilization. We saw 2.3x throughput improvement at SIVARO on Llama-2-70B.

Q: What is disaggregated prefilling's biggest downside?

A: Latency tail. The network transfer of KV cache adds variability. Our p99 latency went from 420ms to 510ms. Also, implementation complexity is significantly higher — you need RDMA networking, two-phase commit protocols, and separate monitoring.

Q: Do I need InfiniBand for disaggregated prefilling?

A: Yes. We tested RoCE v2 (40GbE) and it wasn't fast enough. The KV cache transfer needs sub-10 microsecond latency. InfiniBand NDR or HDR is the minimum. Without it, the network becomes the bottleneck and you'll see worse performance than monolithic.

Q: How much KV cache does disaggregated prefill transfer per request?

A: Depends on context length and model size. For Llama-2-70B with 8K context: ~4.8GB. For Llama-3-8B with 4K context: ~384MB. We use sliding window export (last 1024 tokens) which reduces this by up to 80%.

Q: Can I use disaggregated prefilling with any LLM?

A: Any transformer-based LLM supports it. The technique is model-agnostic. But it makes most sense for models where the compute ratio (prefill time / decode time) is high. Llama-2-70B's ratio is about 1:4. Models with extremely fast prefill (like Mamba) don't benefit.

Q: What is disaggregated prefilling's impact on cost?

A: You need more GPUs for the same throughput because of overhead. But the throughput per GPU is higher. At the scales we operate (50K+ req/min), the cost per request drops by 30-40%. At smaller scales (under 1K req/min), it's 10-20% more expensive.

Nishaant Dixit — Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec.