🦙 Python Bindings for llama.cpp
Simple Python bindings for @ggerganov's llama.cpp library.
This package provides:
- Low-level access to the C API via the `ctypes` interface
- High-level Python API for text completion
- OpenAI-like API
- LangChain compatibility
- LlamaIndex compatibility
- OpenAI compatible web server
Documentation is available at https://llama-cpp-python.readthedocs.io/en/latest.
Requirements:
- Python 3.8+
- C compiler
  - Linux: gcc or clang
  - Windows: Visual Studio or MinGW
  - MacOS: Xcode
To install the package, run:

```bash
pip install llama-cpp-python
```

This will also build llama.cpp from source and install it alongside this Python package.
If this fails, add `--verbose` to the `pip install` command to see the full cmake build log.
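For example:

```bash
pip install llama-cpp-python --verbose
```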
llama.cpp supports a number of hardware acceleration backends to speed up inference, as well as backend-specific options. See the llama.cpp README for a full list.

All llama.cpp cmake build options can be set via the `CMAKE_ARGS` environment variable or via the `--config-settings / -C` CLI flag during installation.
Environment Variables
```bash
# Linux and Mac
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" \
  pip install llama-cpp-python
```

```powershell
# Windows
$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
pip install llama-cpp-python
```
CLI / requirements.txt
They can also be set via the `pip install -C / --config-settings` command and saved to a `requirements.txt` file:
```bash
pip install --upgrade pip # ensure pip is up to date
pip install llama-cpp-python \
  -C cmake.args="-DLLAMA_BLAS=ON;-DLLAMA_BLAS_VENDOR=OpenBLAS"
```

```txt
# requirements.txt

llama-cpp-python -C cmake.args="-DLLAMA_BLAS=ON;-DLLAMA_BLAS_VENDOR=OpenBLAS"
```
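With a `requirements.txt` like the one above, the pinned build options are applied when installing from the file (recent pip versions support per-requirement `--config-settings` in requirements files):

```bash
pip install -r requirements.txt
```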
Below are some common backends, their build commands and any additional environment variables required.
OpenBLAS (CPU)
To install with OpenBLAS, set the `LLAMA_BLAS` and `LLAMA_BLAS_VENDOR` CMake arguments before installing:

```bash
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
```
cuBLAS (CUDA)
To install with cuBLAS, set the `LLAMA_CUBLAS=on` CMake argument before installing:

```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
```
Metal
To install with Metal (MPS), set the `LLAMA_METAL=on` CMake argument before installing:

```bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```
CLBlast (OpenCL)
To install with CLBlast, set the `LLAMA_CLBLAST=on` CMake argument before installing:

```bash
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
```
hipBLAS (ROCm)
To install with hipBLAS / ROCm support for AMD cards, set the `LLAMA_HIPBLAS=on` CMake argument before installing:

```bash
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
```
Vulkan
To install with Vulkan support, set the `LLAMA_VULKAN=on` CMake argument before installing:

```bash
CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
```
Kompute
To install with Kompute support, set the `LLAMA_KOMPUTE=on` CMake argument before installing:

```bash
CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python
```
SYCL
To install with SYCL support, set the `LLAMA_SYCL=on` CMake argument before installing:

```bash
source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python
```
Error: Can't find 'nmake' or 'CMAKE_C_COMPILER'
If you run into issues where it complains it can't find `'nmake'` or `CMAKE_C_COMPILER`, you can extract w64devkit as mentioned in the llama.cpp repo and add those manually to `CMAKE_ARGS` before running `pip install`:

```powershell
$env:CMAKE_GENERATOR = "MinGW Makefiles"
$env:CMAKE_ARGS = "-DLLAMA_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe"
```
See the above instructions and set `CMAKE_ARGS` to the BLAS backend you want to use.
Detailed MacOS Metal GPU install documentation is available at docs/install/macos.md
M1 Mac Performance Issue
Note: If you are using an Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports the arm64 architecture. For example:

```bash
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
```

Otherwise, the install will build the x86_64 version of llama.cpp, which will be 10x slower on an Apple Silicon (M1) Mac.
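A quick way to check which architecture your Python interpreter was built for (a standard-library one-liner, not specific to this package):

```bash
python3 -c "import platform; print(platform.machine())"
# 'arm64' indicates a native Apple Silicon build; 'x86_64' indicates an Intel/Rosetta build
```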
M Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`
Try installing with:

```bash
CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DLLAMA_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python
```
To upgrade and rebuild llama-cpp-python, add the `--upgrade --force-reinstall --no-cache-dir` flags to the `pip install` command to ensure the package is rebuilt from source.
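For example, to force a rebuild against a specific backend (cuBLAS shown here; substitute the `CMAKE_ARGS` for your backend):

```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python
```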
The high-level API provides a simple managed interface through the `Llama` class.

Below is a short example demonstrating how to use the high-level API for basic text completion:
```python
>>> from llama_cpp import Llama
>>> llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
>>> output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
>>> print(output)
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}
```
Text completion is available through the `__call__` and `create_completion` methods of the `Llama` class.
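As an illustrative sketch (reusing the model path from the example above), `create_completion` accepts the same parameters as the call above, and both methods accept `stream=True` to yield partial results as they are generated:

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/llama-model.gguf")

# Stream a completion; each chunk mirrors the response format shown above
for chunk in llm.create_completion(
    "Q: Name the planets in the solar system? A: ",
    max_tokens=32,
    stop=["Q:", "\n"],
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```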
You can download `Llama` models in `gguf` format directly from Hugging Face using the `from_pretrained` method. You'll need to install the `huggingface-hub` package to use this feature (`pip install huggingface-hub`).
```python
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen1.5-0.5B-Chat-GGUF",
    filename="*q8_0.gguf",
    verbose=False
)
```
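Assuming the download succeeds, the returned object is a regular `Llama` instance; for a chat-tuned model like this one, a minimal sketch using `create_chat_completion` (the message content is illustrative only):

```python
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name the planets in the solar system."},
    ]
)
print(response["choices"][0]["message"]["content"])
```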