🦙 Python Bindings for llama.cpp
Simple Python bindings for @ggerganov's llama.cpp library.
This package provides:
- Low-level access to the C API via the `ctypes` interface
- High-level Python API for text completion
- OpenAI-like API
- LangChain compatibility
- LlamaIndex compatibility
- OpenAI compatible web server
Documentation is available at https://llama-cpp-python.readthedocs.io/en/latest.
Requirements:
- Python 3.8+
- C compiler
  - Linux: gcc or clang
  - Windows: Visual Studio or MinGW
  - MacOS: Xcode
To install the package, run:

```bash
pip install llama-cpp-python
```

This will also build llama.cpp from source and install it alongside this Python package.
If this fails, add `--verbose` to the `pip install` command to see the full cmake build log.
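For example:

```bash
pip install llama-cpp-python --verbose
```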
llama.cpp supports a number of hardware acceleration backends to speed up inference, as well as backend-specific options. See the llama.cpp README for a full list.

All llama.cpp cmake build options can be set via the `CMAKE_ARGS` environment variable or via the `--config-settings / -C` CLI flag during installation.
Environment Variables
```bash
# Linux and Mac
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" \
  pip install llama-cpp-python
```

```powershell
# Windows
$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
pip install llama-cpp-python
```
CLI / requirements.txt
They can also be set via the `pip install -C / --config-settings` command and saved to a `requirements.txt` file:
```bash
pip install --upgrade pip # ensure pip is up to date
pip install llama-cpp-python \
  -C cmake.args="-DLLAMA_BLAS=ON;-DLLAMA_BLAS_VENDOR=OpenBLAS"
```

```txt
# requirements.txt

llama-cpp-python -C cmake.args="-DLLAMA_BLAS=ON;-DLLAMA_BLAS_VENDOR=OpenBLAS"
```
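With a `requirements.txt` like the one above, the pinned build options are applied when installing from the file (recent pip versions support per-requirement `--config-settings` in requirements files):

```bash
pip install -r requirements.txt
```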
Below are some common backends, their build commands and any additional environment variables required.
OpenBLAS (CPU)
To install with OpenBLAS, set the `LLAMA_BLAS` and `LLAMA_BLAS_VENDOR` CMake arguments before installing:

```bash
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
```
cuBLAS (CUDA)
To install with cuBLAS, set the `LLAMA_CUBLAS=on` CMake argument before installing:

```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
```
Metal
To install with Metal (MPS), set the `LLAMA_METAL=on` CMake argument before installing:

```bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```
CLBlast (OpenCL)
To install with CLBlast, set the `LLAMA_CLBLAST=on` CMake argument before installing:

```bash
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
```
hipBLAS (ROCm)
To install with hipBLAS / ROCm support for AMD cards, set the `LLAMA_HIPBLAS=on` CMake argument before installing:

```bash
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
```
Vulkan
To install with Vulkan support, set the `LLAMA_VULKAN=on` CMake argument before installing:

```bash
CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
```
Kompute
To install with Kompute support, set the `LLAMA_KOMPUTE=on` CMake argument before installing:

```bash
CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python
```
SYCL
To install with SYCL support, set the `LLAMA_SYCL=on` CMake argument before installing:

```bash
source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python
```
Error: Can't find 'nmake' or 'CMAKE_C_COMPILER'
If you run into issues where it complains it can't find `'nmake'` or `CMAKE_C_COMPILER`, you can extract w64devkit as mentioned in the llama.cpp repo and add those manually to `CMAKE_ARGS` before running `pip install`:

```powershell
$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DLLAMA_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"
```
See the above instructions and set `CMAKE_ARGS` to the BLAS backend you want to use.
Detailed MacOS Metal GPU install documentation is available at docs/install/macos.md
M1 Mac Performance Issue
Note: If you are using an Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports the arm64 architecture. For example:

```bash
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
```

Otherwise, the install will build the x86_64 version of llama.cpp, which will be 10x slower on an Apple Silicon (M1) Mac.
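A quick way to check which architecture your Python interpreter was built for (a standard-library one-liner, not specific to this package):

```bash
python3 -c "import platform; print(platform.machine())"
# 'arm64' indicates a native Apple Silicon build; 'x86_64' indicates an Intel/Rosetta build
```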
M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`
Try installing with:

```bash
CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DLLAMA_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python
```
To upgrade and rebuild llama-cpp-python, add the `--upgrade --force-reinstall --no-cache-dir` flags to the `pip install` command to ensure the package is rebuilt from source.
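For example, to force a rebuild against a specific backend (cuBLAS shown here; substitute the `CMAKE_ARGS` for your backend):

```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python
```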
The high-level API provides a simple managed interface through the `Llama` class.

Below is a short example demonstrating how to use the high-level API for basic text completion:
```python
>>> from llama_cpp import Llama
>>> llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
>>> output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
>>> print(output)
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}
```
Text completion is available through the `__call__` and `create_completion` methods of the `Llama` class.
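As an illustrative sketch (reusing the model path from the example above), `create_completion` accepts the same parameters as the call above, and both methods accept `stream=True` to yield partial results as they are generated:

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/llama-model.gguf")

# Stream a completion; each chunk mirrors the response format shown above
for chunk in llm.create_completion(
    "Q: Name the planets in the solar system? A: ",
    max_tokens=32,
    stop=["Q:", "\n"],
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```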
You can download `Llama` models in `gguf` format directly from Hugging Face using the `from_pretrained` method. You'll need to install the `huggingface-hub` package to use this feature (`pip install huggingface-hub`).
```python
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen1.5-0.5B-Chat-GGUF",
    filename="*q8_0.gguf",
    verbose=False
)
```
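Assuming the download succeeds, the returned object is a regular `Llama` instance; for a chat-tuned model like this one, a minimal sketch using `create_chat_completion` (the message content is illustrative only):

```python
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name the planets in the solar system."},
    ]
)
print(response["choices"][0]["message"]["content"])
```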