Introduction

Key-Value cache (KV cache) is one of the cruicial techniques for efficient deployment of LLMs. In a nutshell, KV caching means storing the key and value tensors produced by the model’s attention layers for past tokens, so that the model can reuse them when computing attention for new tokens avoiding redundant computations and significantly speeding up the generation of each subsequent token [^1]. In practical terms, KV caching transforms the naive $O(n^2)$ scaling of transformer attention (per token) into a linear scaling by reusing past work. This post’s focus is to explain optimizing inference performance for deployment in a way thats simple yet accurate.

The Problem: Autoregressive Inference and Redundant Computation

Transformers generate text by attending to all prior tokens at each step, repeating the calculations for the first $n$ tokens that it already did to predict the $n+1$th token. As an example, imagine that the model has generated the text “The quick brown fox jumped” and we want it to generate the next word. Without caching, the model will feed the whole sequence “The quick brown fox jumped” through all layers to produce the new word “over”. Then to generate the word after “over”, say “the”, it will again feed “The quick brown fox jumped” through the network. The “The quick brown fox jumped” computation was done twice, and was unnecessary. The model is essentially recomputing the same intermediate representations for the prefix repeatedly.

The Solution

KV caching solves this inefficiency by remembering those intermediate representations. At each transformer attention layer, the model creates key ($K$) and value ($V$) vectors for each token position. If we cache those $K$ and $V$ vectors after they are first computed, we can reuse them in later inference steps instead of recomputing them. So, in the above example, after processing “The quick brown fox jumped”, we store the keys and values for those five tokens. When generating the next word “over”, the model only needs to compute the $K$ and $V$ for the new token “over” and can retrieve the cached $K,V$ for “The”, “quick”, “brown”, “fox”, “jumped” from memory. This caching mechanism “spares the recomputation of key and value tensors of past tokens at each generation step by storing them in memory as they get computed. It’s a classic time-memory trade-off: we trade increased memory usage for a huge gain in compute efficiency [^2].

Implementing a KV Cache

Implementing KV caching is straightforward. At each self-attention layer in the transformer, we maintain a cache for keys and values. Pseudocode for a single layer’s caching logic might look like:

# Within a transformer layer (simplified pseudocode)
if use_cache:
    # Concatenate new keys/values with those from previous tokens
    if self.cache_k is not None:
        keys = torch.cat([self.cache_k, new_keys], dim=1)
        values = torch.cat([self.cache_v, new_values], dim=1)
    else:
        keys = new_keys
        values = new_values
    # Update the cache for next time
    self.cache_k = keys
    self.cache_v = values
else:
    # If not using cache, just use the freshly computed keys/values
    keys = new_keys
    values = new_values

# ... proceed to compute attention output using 'keys' and 'values' ...

Most transformer libraries already implement this pattern under the hood. For instance, Hugging Face’s transformers library has KV caching enabled by default for text generation (use_cache=True). When you call the model’s generate() method iteratively, it returns a past_key_values (or similar) object which contains the stacked keys and values for all past tokens. You then feed that back into the model on the next call, so it appends new entries. This is abstracted away, so you as a user typically don’t have to manually concatenate anything – you just ensure use_cache is on and handle the model outputs properly. But it’s valuable to understand what’s happening internally.

During training, note, KV caching is usually not used. In training we often process full sequences in parallel (using teacher forcing for next-token prediction), so we don’t generate token by token. Also, backpropagation through a long sequence benefits from treating it as one big computation rather than sequentially appending (plus, the training objective usually needs the full sequence context anyway). So, KV caching is primarily an inference-time optimization. Indeed it adds complexity and memory usage, which is acceptable for serving but not necessary during training.

Benefits: Speeding Up Inference

If a model has L layers and we generate N tokens, using a KV cache ensures that each layer only processes each token once (when that token is first generated).

  • Without KV cache: Generating a sequence of length $N$ incurs roughly $O(N^2)$ attention work per layer (because the first generation sees 1 token, the next sees 2 tokens, …, the last sees N tokens, summing to $1 + 2 + … + N \approx O(N^2)$ operations per layer). Across L layers this is $O(L \cdot N^2)$, and if you consider each token generation as separate calls, the overhead can look cubic $O(N^3)$ overall.

  • With KV cache: Each new token’s attention at each layer only does work proportional to the current sequence length (which grows from 1 to N). But crucially, the work done for each prior token is not repeated across generations. In total, each layer does on the order of $1 + 1 + … + 1$ (N times) for each new token’s key/value (plus the attention score computation which still looks at all keys, but that’s just a matrix-vector multiply that is much cheaper than re-encoding through the whole layer). Thus the total operations remain $O(L \cdot N^2)$, but significantly smaller constant factors come into play, and we avoid the blow-up of re-encoding entire sequences. Empirically, this yields large speedups. A simple experiment using a 1.7B parameter model on a GPU showed about 5.2Ă— faster text generation with KV caching compared to without. [^3]

Another way to think about this is that without caching, the incremental cost per token grows with the sequence length. Caching stabilizes the per-token latnecy and allows scaling to long sequences with much less slowdown. This becaomes crucial for maintaining snappy response time in chatbots or handling long user prompts without timing out.

Downsides: Memory Cost

KV Cache consumes memory. Each cached key or value vector is typically the size of the model’s hidden dimension (divided by number of heads, etc.) and these accumulate for every token and every layer. Memory usage grows linearly with the number of tokens generated (and with the number of layers). This can quickly exhaust available GPU memory, especially in long-context scenarios. In scenarios with very long contexts (thousands of tokens) or multiple simultaneous sequences (batch decoding), the KV cache can become a significant memory bottleneck [^4]. If memory is limited, there are a few strategies to manage it.

Methods for Balancing the Tradeoff

  • Sliding Window / Cache Truncation: One straightforward approach is to only cache up to a fixed number of recent tokens as a rolling window. It prevents the cache from growing without bound, at the cost of the model no longer “remembering” unlimited history. If older context is less relevant or your application can tolerate forgetting, this method keeps memory in check. The complexity is that you must also ensure the model’s position encodings or attention mechanisms can handle the wrap-around, but in practice frameworks handle the position indexing for you if you manage the cache size. In summary, truncating the KV cache is a brute-force but effective way to cap memory growth [^1].

  • Cache Compression/Downcasting: Storing keys and values in lower precision (e.g., FP16 or even int8) can halve or quarter the memory footprint at some loss of precision. Some inference engines support 8-bit KV caches with negligible impact on model quality [^2].

  • Layer-wise caching choices: An advanced idea is to not cache every layer’s keys and values. Recent research proposes caching only a subset of layers – for example, only every few layers or just the last few layers – to save memory while still reusing most computations. One such method, Layer-Condensed KV Cache, found that caching a small number of layers can still achieve near-full speed benefits at a fraction of the memory cost, even improving throughput by up to 26Ă— in some cases [^5].

  • Multi-Query / Grouped-Query Attention: A more elegant solution at the architecture level is to reduce how much is cached per token. In multi-head attention (MHA), each head has its own distinct key and value vectors for each token. This means if there are, say, 32 heads, each token contributes 32 key vectors and 32 value vectors to the cache (one per head). Multi-Query Attention (MQA) is a modification first proposed by Shazeer [^6] that uses a single shared key and value across all attention heads. In other words, the heads still have separate query projections (so they attend differently), but they share the same $K$ and $V$ for each token. For example, if a model had 32 heads, going to 1 key/value head would reduce cache size by 32Ă—. Many large models adopt either MQA or a hybrid called Grouped-Query Attention (GQA), where heads are partitioned into groups that share keys/values. GQA lets you choose a middle ground (4 groups means 4 shared KV sets). Notably, models like Google’s PaLM, Meta’s LLaMA-2 70B, Falcon, and Mistral use MQA/GQA to curb memory usage [^2].

  • Better Memory Allocation (PagedAttention): A subtle issue in naive KV cache implementation is memory fragmentation and waste. If we don’t allocate memory smartly, we might reserve a large buffer for the maximum possible cache size of a request, even if the request ends up much shorter. Or we might allocate new arrays every time we concatenate a new token’s worth of data, leading to a lot of small allocations. The PagedAttention algorithm addresses this by managing the cache in fixed-size pages or blocks that can be shared across requests [^7]. It’s analogous to a virtual memory system: instead of one monolithic array per request, you have a pool of pages that store tokens’ keys/values, and you link them as needed. This allows dynamic sharing – e.g., if two requests have the same prefix, they can point to the same pages for that prefix rather than storing duplicates. it alleviates internal fragmentation: you allocate more pages on demand, and free them when a request is done, rather than preallocating a huge chunk that might mostly stay unused. The result reported by the vLLM system using PagedAttention is <4% memory waste (versus up to 80% waste in traditional systems) [^2]. PagedAttention doesn’t change the fundamentals of KV caching; it’s an infrastructure improvement that makes cache memory usage flexible and efficient. It has quickly been adopted in many serving frameworks given its benefits.

  • Offloading: Imagine a chatbot where a user has a conversation, but then goes idle for an hour. The GPU might still be holding that session’s KV cache in memory, wasting space. Offloading would mean transferring those tensors to CPU memory after some idle time. If the user resumes, the cached data is moved back to GPU to continue the conversation without recomputation. This way, you can support more concurrent sessions than GPU memory alone would allow, by using system memory as an extension. There is a performance hit when moving data back and forth (PCIe or NVLink transfer is not as fast as GPU RAM), but if idle times are large, it’s worth it. NVIDIA reported that using an offloading strategy, they achieved up to 14Ă— faster time-to-first-token in scenarios with long prompts, compared to recomputing everything from scratch each time[^4].

Despite the memory costs, KV caching remains indispensable for deployment because without it, many production use-cases would be infeasible due to latency. Engineers typically treat the cache as part of the model’s memory requirement and provision hardware accordingly (e.g., using GPUs with larger memory or adding CPU memory offloading). The general guideline is that the memory trade-off is worth the speed gain, especially when serving users or applications that demand responsive outputs.

Conclusion

KV caching revolutionized the feasibility of deploying large transformers by eliminating redundant computation and turned what would be an exorbitantly slow process into something that can run interactively. As LLM applications push toward longer contexts and more users, efficiency innovations have shifted to managing the KV cache itself. Techniques like multi-query attention, KV cache quantization, smarter memory management (PagedAttention), and hierarchical cache offloading are empowering us to serve bigger models and longer texts without hitting a wall in memory or latency.

References

[^1] https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms

[^2] https://medium.com/@plienhar/llm-inference-series-4-kv-caching-a-deeper-look-4ba9a77746c8

[^3] https://huggingface.co/blog/not-lain/kv-caching#:~:text=with%20KV%20Caching%20Standard%20Inference,5.21x%20times%20faster

[^4] https://bentoml.com/llm/inference-optimization/kv-cache-offloading#:~:text=As%20context%20windows%20increase%2C%20the,applications%20that%20require%20extended%20context

[^5] https://arxiv.org/abs/2405.10637#:~:text=,times%24%20higher%20throughput%20than%20standard

[^6] https://arxiv.org/abs/1911.02150

[^7] https://arxiv.org/abs/2309.06180