8000 kv-cache · GitHub Topics · GitHub

#

kv-cache

Here are 51 public repositories matching this topic...

LMCache / LMCache

Supercharge Your LLM with the Fastest KV Cache Layer

fast amd cuda inference pytorch speed rocm kv-cache llm vllm

Updated Nov 24, 2025
Python

HDT3213 / godis

A Golang implemented Redis Server and Cluster. Go 语言实现的 Redis 服务器和分布式集群

go redis golang cluster redis-server redis-cluster godis kv-cache

Updated Sep 14, 2025
Go

Zefan-Cai / KVCache-Factory

Unified KV Cache Compression Methods for Auto-Regressive Models

kv-cache llm kv-cache-compression

Updated Jan 4, 2025
Python

harleyszhang / llm_note

LLM notes, including model inference, transformer model structure, and llm framework code analysis notes.

cuda-programming transformer-models kv-cache llm vllm llm-inference triton-kernels

Updated Sep 16, 2025
Python

NVIDIA / kvpress

LLM KV cache compression made easy

python transformers inference pytorch kv-cache large-language-models llm long-context kv-cache-compression

Updated Nov 21, 2025
Python

therealoliver / Deepdive-llama3-from-scratch

Achieve the llama3 inference step-by-step, grasp the core concepts, master the process derivation, implement the code.

Updated Feb 24, 2025
Jupyter Notebook

raymin0223 / mixture_of_recursions

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation (NeurIPS 2025)

router early-exiting adaptive-computation kv-cache llm recursive-transformers

Updated Sep 26, 2025
Python

FMInference / H2O

[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.

sparsity high-throughput heavy-hitters kv-cache gpt-3 large-language-models

Updated Aug 1, 2024
Python

Zefan-Cai / Awesome-LLM-KV-Cache

Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes.

kv-cache llm kv-cache-quantization kv-cache-compression

Updated Mar 3, 2025

dipampaul17 / KVSplit

Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.

metal optimization quantization m2 m3 m1 memory-optimization kv-cache apple-silicon llm generative-ai llama-cpp

Updated May 21, 2025
Python

NVIDIA-Merlin / HierarchicalKV

HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature-embeddings on high-bandwidth memory (HBM) of GPUs and in host memory. It also can be used as a generic key-value storage.

gpu cuda recommender-system hashtable key-value-store kv-cache dynamic-embedding embedding-storage

Updated Nov 2, 2025
Cuda

itsnamgyu / block-transformer

Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)

kv-cache llm llm-inference llm-architecture kv-cache-compression

Updated Apr 13, 2025
Python

jjiantong / Awesome-KV-Cache-Optimization

[Survey] Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization

machine-learning ai system computer-architecture neural-language-processing mlsys kv-cache serving-ml llm llm-serving llm-inference

Updated Nov 10, 2025
Python

FastMAS / KVCOMM

[NeurIPS'25] KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

multi-agent-systems kv-cache neurips-2025

Updated Nov 3, 2025
Python

kddubey / cappr

Completion After Prompt Probability. Make your LLM make a choice

text-classification probability zero-shot huggingface kv-cache prompt-engineering llamacpp llm-inference

Updated Nov 2, 2024
Python

aju22 / LLaMA2

This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process. The code is restructured and heavily commented to facilitate easy understanding of the key parts of the architecture.

natural-language-processing transformer attention llama gpt rope kv-cache llm llama2 rms-norm

Updated Oct 1, 2023
Python

hkproj / pytorch-llama-notes

Notes about LLaMA 2 model

study-notes rmsprop attention-is-all-you-need kv-cache rotary-position-encoding llama2

Updated Aug 30, 2023
Python

DRSY / EasyKV

Easy control for Key-Value Constrained Generative LLM Inference(https://arxiv.org/abs/2402.06262)

cache-management kv-cache llm cache-eviction

Updated Feb 13, 2024
Python

NoakLiu / PiKV

PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System]

distributed-systems parallel-computing moe mixture-model management-system mixture-of-experts mlsystem kv-cache kvcache

Updated Oct 19, 2025
Python

MaxBelitsky / cache-steering

KV Cache Steering for Inducing Reasoning in Small Language Models

reasoning kv-cache large-language-models llm representation-engineering activation-steering reasoning-language-models cache-steering

Updated Jul 24, 2025
Python

Improve this page

Add a description, image, and links to the kv-cache topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the kv-cache topic, visit your repo's landing page and select "manage topics."

0