🦙 Python Bindings for llama.cpp
Simple Python bindings for @ggerganov's llama.cpp library.
This package provides:
- Low-level access to C API via
ctypesinterface. - High-level Python API for text completion
- OpenAI-like API
- LangChain compatibility
- LlamaIndex compatibility
- OpenAI compatible web server
Documentation is available at https://llama-cpp-python.readthedocs.io/en/latest.
Requirements:
- Python 3.8+
- C compiler
- Linux: gcc or clang
- Windows: Visual Studio or MinGW
- MacOS: Xcode
To install the package, run:
pip install llama-cpp-pythonThis will also build llama.cpp from source and install it alongside this python package.
If this fails, add --verbose to the pip install see the full cmake build log.
Pre-built Wheel (New)
It is also possible to install a pre-built wheel with basic CPU support.
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpullama.cpp supports a number of hardware acceleration backends to speed up inference as well as backend specific options. See the llama.cpp README for a full list.
All llama.cpp cmake build options can be set via the CMAKE_ARGS environment variable or via the --config-settings / -C cli flag during installation.
Environment Variables
# Linux and Mac
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" \
pip install llama-cpp-python# Windows
$env:CMAKE_ARGS = "-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
pip install llama-cpp-pythonCLI / requirements.txt
They can also be set via pip install -C / --config-settings command and saved to a requirements.txt file:
pip install --upgrade pip # ensure pip is up to date
pip install llama-cpp-python \
-C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"# requirements.txt
llama-cpp-python -C cmake.args="-DGGML_BLAS=ON;-DGGML_BLAS_VENDOR=OpenBLAS"Below are some common backends, their build commands and any additional environment variables required.
OpenBLAS (CPU)
To install with OpenBLAS, set the GGML_BLAS and GGML_BLAS_VENDOR environment variables before installing:
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-pythonCUDA
To install with CUDA support, set the GGML_CUDA=on environment variable before installing:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-pythonPre-built Wheel (New)
It is also possible to install a pre-built wheel with CUDA support. As long as your system meets some requirements:
- CUDA Version is 12.1, 12.2, 12.3, or 12.4
- Python Version is 3.10, 3.11 or 3.12
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>Where <cuda-version> is one of the following:
cu121: CUDA 12.1cu122: CUDA 12.2cu123: CUDA 12.3cu124: CUDA 12.4
For example, to install the CUDA 12.1 wheel:
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121Metal
To install with Metal (MPS), set the GGML_METAL=on environment variable before installing:
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-pythonPre-built Wheel (New)
It is also possible to install a pre-built wheel with Metal support. As long as your system meets some requirements:
- MacOS Version is 11.0 or later
- Python Version is 3.10, 3.11 or 3.12
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metalhipBLAS (ROCm)
To install with hipBLAS / ROCm support for AMD cards, set the GGML_HIPBLAS=on environment variable before installing:
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-pythonVulkan
To install with Vulkan support, set the GGML_VULKAN=on environment variable before installing:
CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-pythonSYCL
To install with SYCL support, set the GGML_SYCL=on environment variable before installing:
source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-pythonRPC
To install with RPC support, set the GGML_RPC=on environment variable before installing:
source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_RPC=on" pip install llama-cpp-pythonError: Can't find 'nmake' or 'CMAKE_C_COMPILER'
If you run into issues where it complains it can't find 'nmake' '?' or CMAKE_C_COMPILER, you can extract w64devkit as mentioned in llama.cpp repo and add those manually to CMAKE_ARGS before running pip install:
$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DGGML_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"See the above instructions and set CMAKE_ARGS to the BLAS backend you want to use.
Detailed MacOS Metal GPU install documentation is available at docs/install/macos.md
M1 Mac Performance Issue
Note: If you are using Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports arm64 architecture. For example:
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.shOtherwise, while installing it will build the llama.cpp x86 version which will be 10x slower on Apple Silicon (M1) Mac.
M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`
Try installing with
CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DGGML_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-pythonTo upgrade and rebuild llama-cpp-python add --upgrade --force-reinstall --no-cache-dir flags to the pip install command to ensure the package is rebuilt from source.
The high-level API provides a simple managed interface through the Llama class.
Below is a short example demonstrating how to use the high-level API to for basic text completion:
from llama_cpp import Llama
llm = Llama(
model_path="./models/7B/llama-model.gguf",
# n_gpu_layers=-1, # Uncomment to use GPU acceleration
# seed=1337, # Uncomment to set a specific seed
# n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
"Q: Name the planets in the solar system? A: ", # Prompt
max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)By default llama-cpp-python generates completions in an OpenAI compatible format:
{
"id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"object": "text_completion",
"created": 1679561337,
"model": "./models/7B/llama-model.gguf",
"choices": [
{
"text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
"index": 0,
"logprobs": None,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 14,
"completion_tokens": 28,
"total_tokens": 42
}
}Text completion is available through the __call__ and create_completion methods of the Llama class.
You can download Llama models in gguf format directly from Hugging Face using the from_pretrained method.
You'll need to install the huggingface-hub package to use this feature (pip install huggingface-hub).
llm = Llama.from_pretrained(
repo_id="Qwen/Qwen2-0.5B-Instruct-GGUF",
filename="*q8_0.gguf",
verbose=False
)By default from_pretrained will download the model to the huggingface cache directory, you can then manage installed model files with the huggingface-cli tool.
The high-level API also provides a simple interface for chat completion.
Chat completion requires that the model knows how to format the messages into a single prompt.
The Llama class does this using pre-registered chat formats (ie. chatml, llama-2, gemma, etc) or by providing a custom chat handler object.
The model will will format the messages into a single prompt using the following order of precedence:
- Use the
chat_handlerif provided - Use the
chat_formatif provided - Use the
tokenizer.chat_templatefrom theggufmodel's metadata (should work for most new models, older models may not have this) - else, fallback to the
llama-2chat format
Set verbose=True to see the selected chat format.
from llama_cpp import Llama
llm = Llama(
model_path="path/to/llama-2/llama-model.gguf",
chat_format="llama-2"
)
llm.create_chat_completion(
messages = [
{"role": "system", "content": "You are an assistant who perfectly describes images."},
{
"role": "user",
"content": "Describe this image in detail please."
}
]
)Chat completion is available through the create_chat_completion method of the Llama class.
For OpenAI API v1 compatibility, you use the create_chat_completion_openai_v1 method which will return pydantic models instead of dicts.
To constrain chat responses to only valid JSON or a specific JSON Schema use the response_format argument in create_chat_completion.
The following example will constrain the response to valid JSON strings only.
from llama_cpp import Llama
llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
messages=[
{
"role": "system",
"content": "You are a helpful assistant that outputs in JSON.",
},
{"role": "user", "content": "Who won the world series in 2020"},
],
response_format={
"type": "json_object",
},
temperature=0.7,
)To constrain the response further to a specific JSON Schema add the schema to the schema property of the response_format argument.
from llama_cpp import Llama
llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
messages=[
{
"role": "system",
"content": "You are a helpful assistant that outputs in JSON.",
},
{"role": "user", "content": "Who won the world series in 2020"},
],
response_format={
"type": "json_object",
"schema": {
"type": "object",
"properties": {"team_name": {"type": "string"}},
"required": ["team_name"],
},
},
temperature=0.7,
)The high-level API supports OpenAI compatible function and tool calling. This is possible through the functionary pre-trained models chat format or through the generic chatml-function-calling chat format.
from llama_cpp import Llama
llm = Llama(model_path="path/to/chatml/llama-model.gguf", chat_format="chatml-function-calling")
llm.create_chat_completion(
messages = [
{
"role": "system",
"content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary"
},
{
"role": "user",
"content": "Extract Jason is 25 years old"
}
],
tools=[{
"type": "function",
"function": {
"name": "UserDetail",
"parameters": {
"type": "object",
"title": "UserDetail",
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"age": {
"title": "Age",
"type": "integer"
}
},
"required": [ "name", "age" ]
}
}
}],
tool_choice={
"type": "function",
"function": {
"name": "UserDetail"
}
}
)Functionary v2
The various gguf-converted files for this set of models can be found here. Functionary is able to intelligently call functions and also analyze any provided function outputs to generate coherent responses. All v2 models of functionary supports parallel function calling. You can provide either functionary-v1 or functionary-v2 for the chat_format when initializing the Llama class.
Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is required to provide HF Tokenizer for functionary. The LlamaHFTokenizer class can be initialized and passed into the Llama class. This will override the default llama.cpp tokenizer used in Llama class. The tokenizer files are already included in the respective HF repositories hosting the gguf files.
from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer
llm = Llama.from_pretrained(
repo_id="meetkai/functionary-small-v2.2-GGUF",
filename="functionary-small-v2.2.q4_0.gguf",
chat_format="functionary-v2",
tokenizer=LlamaHFTokenizer.from_pretrained("meetkai/functionary-small-v2.2-GGUF")
)