Roadmap / Project status / Manifesto / ggml
Inference of Meta's LLaMA model (and others) in pure C/C++
> [!IMPORTANT]
> New `llama.cpp` package location: ggml-org/llama.cpp
>
> Update your container URLs to: ghcr.io/ggml-org/llama.cpp
>
> More info: ggml-org#11801
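For reference, pulling a container image under the new location looks roughly like this. The `server` tag is an assumption for illustration only; use whichever tag you were pulling before the move.

```sh
# Hedged sketch: pull a llama.cpp image from the new ghcr.io/ggml-org namespace.
# The ":server" tag is assumed for illustration; substitute the tag you actually use.
docker pull ghcr.io/ggml-org/llama.cpp:server
```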
- How to use MTLResidencySet to keep the GPU memory active? ggml-org#11427
- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
- Universal tool call support in `llama-server` ggml-org#9639 (see the sketch after this list)
- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
- Introducing GGUF-my-LoRA ggml-org#10123
- Hugging Face Inference Endpoints now support GGUF out of the box! ggml-org#9669
- Hugging Face GGUF editor: discussion | tool
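As a rough illustration of the tool call feature, here is a hedged sketch of a request to llama-server's OpenAI-compatible chat endpoint. The local address, the `get_weather` function, and the assumption that the loaded model supports tool calling are all illustrative, not part of the project documentation.

```sh
# Hedged sketch: send an OpenAI-style tool definition to a locally running llama-server
# (assumes the default http://localhost:8080 address and a tool-capable model).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather like in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Return the current weather for a given city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }]
  }'
```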
The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
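To make the "minimal setup" point concrete, a typical invocation looks roughly like the sketch below. The model path is a placeholder and the exact flag set can vary between releases, so treat this as illustrative rather than canonical usage.

```sh
# Hedged sketch: run a prompt through a local GGUF model with llama-cli.
#   -m    path to a quantized GGUF model file (placeholder path)
#   -p    prompt text
#   -n    number of tokens to generate
#   -ngl  number of layers to offload to the GPU (the CPU+GPU hybrid mode
#         described above; 0 keeps everything on the CPU)
./llama-cli -m ./models/model-q4_k_m.gguf -p "Explain what llama.cpp is." -n 128 -ngl 32
```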
The `llama.cpp` project is the main playground for developing new features for the ggml library.
Models
Typically, finetunes of the base models below are supported as well.
Instructions for adding support for new models: HOWTO-add-model.md