Roadmap / Project status / Manifesto / ggml
Inference of Meta's LLaMA model (and others) in pure C/C++
> [!IMPORTANT]
> New `llama.cpp` package location: ggml-org/llama.cpp
>
> Update your container URLs to: ghcr.io/ggml-org/llama.cpp
>
> More info: ggml-org#11801
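For reference, pulling a container image under the new location looks roughly like this. The `server` tag is an assumption for illustration only; use whichever tag you were pulling before the move.

```sh
# Hedged sketch: pull a llama.cpp image from the new ghcr.io/ggml-org namespace.
# The ":server" tag is assumed for illustration; substitute the tag you actually use.
docker pull ghcr.io/ggml-org/llama.cpp:server
```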
- How to use MTLResidencySet to keep the GPU memory active? ggml-org#11427
- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
- Universal tool call support in `llama-server` ggml-org#9639 (see the sketch after this list)
- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
- Introducing GGUF-my-LoRA ggml-org#10123
- Hugging Face Inference Endpoints now support GGUF out of the box! ggml-org#9669
- Hugging Face GGUF editor: discussion | tool
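As a rough illustration of the tool call feature, here is a hedged sketch of a request to llama-server's OpenAI-compatible chat endpoint. The local address, the `get_weather` function, and the assumption that the loaded model supports tool calling are all illustrative, not part of the project documentation.

```sh
# Hedged sketch: send an OpenAI-style tool definition to a locally running llama-server
# (assumes the default http://localhost:8080 address and a tool-capable model).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather like in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Return the current weather for a given city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }]
  }'
```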
The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
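To make the "minimal setup" point concrete, a typical invocation looks roughly like the sketch below. The model path is a placeholder and the exact flag set can vary between releases, so treat this as illustrative rather than canonical usage.

```sh
# Hedged sketch: run a prompt through a local GGUF model with llama-cli.
#   -m    path to a quantized GGUF model file (placeholder path)
#   -p    prompt text
#   -n    number of tokens to generate
#   -ngl  number of layers to offload to the GPU (the CPU+GPU hybrid mode
#         described above; 0 keeps everything on the CPU)
./llama-cli -m ./models/model-q4_k_m.gguf -p "Explain what llama.cpp is." -n 128 -ngl 32
```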
The `llama.cpp` project is the main playground for developing new features for the ggml library.
Models
Typically, finetunes of the base models below are supported as well.
Instructions for adding support for new models: HOWTO-add-model.md