8000 Fix typo (#137) · lm-sys/lm-sys.github.io@180a410 · GitHub
[go: up one dir, main page]

Skip to content

Commit 180a410

Browse files
authored
Fix typo (#137)
1 parent e242e04 commit 180a410

File tree

1 file changed

+1
-1
lines changed

1 file changed

+1
-1
lines changed

blog/2024-01-17-sglang.md

+1-1
Original file line numberDiff line numberDiff line change
@@ -37,7 +37,7 @@ While some systems are capable of handling KV cache reuse in certain scenarios,
3737

3838
To systematically exploit these reuse opportunities, we introduce RadixAttention, a novel technique for automatic KV cache reuse during runtime. Instead of discarding the KV cache after finishing a generation request, our approach retains the KV cache for both prompts and generation results in a radix tree. This data structure enables efficient prefix search, insertion, and eviction. We implement a Least Recently Used (LRU) eviction policy, complemented by a cache-aware scheduling policy, to enhance the cache hit rate.
3939

40-
A radix tree is a data structure that serves as a space-efficient alternative to a trie (prefix tree). Unlike typical trees, the edges of a radix tree can be labeled with not just single elements, but also with sequences of elements of varying lengths. This feature boosts the efficiency of radix trees. In our system, we utilize a radix tree to manage a mapping. This mapping is between sequences of tokens, which act as the keys, and their corresponding KV cache tensors, which serve as the values. These KV cache tensors are stored on the GPU in a paged layout, where the size of each page is equivalent to one token. Considering the limited capacity of GPU memory, we cannot retrain infinite KV cache tensors, which necessitates an eviction policy. To tackle this, we implement an LRU eviction policy that recursively evicts leaf nodes.
40+
A radix tree is a data structure that serves as a space-efficient alternative to a trie (prefix tree). Unlike typical trees, the edges of a radix tree can be labeled with not just single elements, but also with sequences of elements of varying lengths. This feature boosts the efficiency of radix trees. In our system, we utilize a radix tree to manage a mapping. This mapping is between sequences of tokens, which act as the keys, and their corresponding KV cache tensors, which serve as the values. These KV cache tensors are stored on the GPU in a paged layout, where the size of each page is equivalent to one token. Considering the limited capacity of GPU memory, we cannot retain infinite KV cache tensors, which necessitates an eviction policy. To tackle this, we implement an LRU eviction policy that recursively evicts leaf nodes.
4141
Furthermore, RadixAttention is compatible with existing techniques like continuous batching and paged attention.
4242
For multi-modal models, the RadixAttention can be easily extended to handle image tokens.
4343

0 commit comments

Comments
 (0)
0