llama.cpp

Roadmap / Project status / Manifesto / ggml

Inference of Meta's LLaMA model (and others) in pure C/C++

Recent API changes

[2024 Apr 21] llama_token_to_piece can now optionally render special tokens ggml-org#6807
[2024 Apr 4] State and session file functions reorganized under llama_state_* ggml-org#6341
[2024 Mar 26] Logits and embeddings API updated for compactness ggml-org#6122
[2024 Mar 13] Add llama_synchronize() + llama_context_params.n_ubatch ggml-org#6017
[2024 Mar 8] llama_kv_cache_seq_rm() returns a bool instead of void, and new llama_n_seq_max() returns the upper limit of acceptable seq_id in batches (relevant when dealing with multiple sequences) ggml-org#5328
[2024 Mar 4] Embeddings API updated ggml-org#5796
[2024 Mar 3] struct llama_context_params ggml-org#5849

Hot topics

BPE pre-tokenization support has been added: ggml-org#6920
MoE memory layout has been updated - reconvert models for mmap support and regenerate imatrix ggml-org#6387
Model sharding instructions using gguf-split ggml-org#6404
Fix major bug in Metal batched inference ggml-org#6225
Multi-GPU pipeline parallelism support ggml-org#6017
Looking for contributions to add Deepseek support: ggml-org#5981
Quantization blind testing: ggml-org#5962
Initial Mamba support has been added: ggml-org#5328

Table of Contents

Description
Usage
Contributing
Coding guidelines
Docs

Description

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

Plain C/C++ implementation without any dependencies
Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
AVX, AVX2 and AVX512 support for x86 architectures
1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
Vulkan, SYCL, and (partial) OpenCL backend support
CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

Since its inception, the project has improved significantly thanks to many contributions. It is the main playground for developing new features for the ggml library.

Supported platforms:

Supported models:

Typically finetunes of the base models below are supported as well.

(instructions for supporting more models: HOWTO-add-model.md)

Multimodal models:

HTTP server

llama.cpp web server is a lightweight OpenAI API compatible HTTP server that can be used to serve local models and easily connect them to existing clients.

Bindings:

Python: abetlen/llama-cpp-python
Go: go-skynet/go-llama.cpp
Node.js: withcatai/node-llama-cpp
JS/TS (llama.cpp server client): lgrammel/modelfusion
JavaScript/Wasm (works in browser): tangledgroup/llama-cpp-wasm
Typescript/Wasm (nicer API, available on npm): ngxson/wllama
Ruby: yoshoku/llama_cpp.rb
Rust (more features): edgenai/llama_cpp-rs
Rust (nicer API): mdrokz/rust-llama.cpp
Rust (more direct bindings): utilityai/llama-cpp-rs
C#/.NET: SciSharp/LLamaSharp
Scala 3: donderom/llm4s
Clojure: phronmophobic/llama.clj
React Native: mybigday/llama.rn
Java: kherud/java-llama.cpp
Zig: deins/llama.cpp.zig
Flutter/Dart: netdur/llama_cpp_dart
PHP (API bindings and features built on top of llama.cpp): distantmagic/resonance (more info)

UI:

Unless otherwise noted these projects are open-source with permissive licensing:

iohub/collama
janhq/jan (AGPL)
nat/openplayground
Faraday (proprietary)
LMStudio (proprietary)
LocalAI (MIT)
LostRuins/koboldcpp (AGPL)
Mozilla-Ocho/llamafile
nomic-ai/gpt4all
ollama/ollama
oobabooga/text-generation-webui (AGPL)
psugihara/FreeChat
cztomsik/ava (MIT)
ptsochantaris/emeltal
pythops/tenere (AGPL)
RecurseChat (proprietary)
semperai/amica
withcatai/catai
Mobile-Artificial-Intelligence/maid (MIT)
Msty (proprietary)
LLMFarm (MIT)
KanTV(Apachev2.0 or later)
Dot (GPL)
MindMac (proprietary)
KodiBot (GPL)
eva (MIT)
AI Sublime Text plugin (MIT)

(to have a project listed here, it should clearly state that it depends on llama.cpp)

Here is a typical run using LLaMA v2 13B on M2 Ultra:

$ make -j && ./main -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.            -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./common -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: build = 1041 (cf658ad)
main: seed  = 1692823051
llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from models/llama-13b-v2/ggml-model-q4_0.gguf (version GGUF V1 (latest))
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_0:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_print_meta: format         = GGUF V1 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 13824
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = mostly Q4_0
llm_load_print_meta: model size     = 13.02 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: mem required  = 7024.01 MB (+  400.00 MB per state)
...................................................................................................
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size =   75.41 MB

system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 Building a website can be done in 10 simple steps:
Step 1: Find the right website platform.
Step 2: Choose your domain name and hosting plan.
Step 3: Design your website layout.
Step 4: Write your website content and add images.
Step 5: Install security features to protect your site from hackers or spammers
Step 6: Test your website on multiple browsers, mobile devices, operating systems etc…
Step 7: Test it again with people who are not related to you personally – friends or family members will work just fine!
Step 8: Start marketing and promoting the website via social media channels or paid ads
Step 9: Analyze how many visitors have come to your site so far, what type of people visit more often than others (e.g., men vs women) etc…
Step 10: Continue to improve upon all aspects mentioned above by following trends in web design and staying up-to-date on new technologies that can enhance user experience even further!
How does a Website Work?
A website works by having pages, which are made of HTML code. This code tells your computer how to display the content on each page you visit – whether it’s an image or text file (like PDFs). In order for someone else’s browser not only be able but also want those same results when accessing any given URL; some additional steps need taken by way of programming scripts that will add functionality such as making links clickable!
The most common type is called static HTML pages because they remain unchanged over time unless modified manually (either through editing files directly or using an interface such as WordPress). They are usually served up via HTTP protocols – this means anyone can access them without having any special privileges like being part of a group who is allowed into restricted areas online; however, there may still exist some limitations depending upon where one lives geographically speaking.
How to
llama_print_timings:        load time =   576.45 ms
llama_print_timings:      sample time =   283.10 ms /   400 runs   (    0.71 ms per token,  1412.91 tokens per second)
llama_print_timings: prompt eval time =   599.83 ms /    19 tokens (   31.57 ms per token,    31.68 tokens per second)
llama_print_timings:        eval time = 24513.59 ms /   399 runs   (   61.44 ms per token,    16.28 tokens per second)
llama_print_timings:       total time = 25431.49 ms

And here is another demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook:

whisper-llama-lq.mp4

Usage

Here are the end-to-end binary build and model conversion steps for most supported models.

Get the Code

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Build

In order to build llama.cpp you have three different options.

Using make:
- On Linux or MacOS:
```
make
```
  Note: for Debug builds, run make LLAMA_DEBUG=1
- On Windows:
  1. Download the latest fortran version of w64devkit.
  2. Extract w64devkit on your pc.
  3. Run w64devkit.exe.
  4. Use the cd command to reach the llama.cpp folder.
  5. From here you can run:
```
make
```
Using CMake:
```
cmake -B build
cmake --build build --config Release
```
Note: for Debug builds, there are two cases:
- Single-config generators (e.g. default = Unix Makefiles; note that they just ignore the --config flag):
```
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
```
- Multi-config generators (-G param set to Visual Studio, XCode...):
```
cmake -B build -G "Xcode"
cmake --build build --config Debug
```
Using Zig (version 0.11 or later):

Building for optimization levels and CPU features can be accomplished using standard build arguments, for example AVX2, FMA, F16C, it's also possible to cross compile for other operating systems and architectures:
```
zig build -Doptimize=ReleaseFast -Dtarget=x86_64-windows-gnu -Dcpu=x86_64+avx2+fma+f16c
```
The zig targets command will give you valid options to use.
Using gmake (FreeBSD):
1. Install and activate DRM in FreeBSD
2. Add your user to video group
3. Install compilation dependencies.
```
sudo pkg install gmake automake autoconf pkgconf llvm15 clinfo clover \
    opencl clblast openblas

gmake CC=/usr/local/bin/clang15 CXX=/usr/local/bin/clang++15 -j4
```
Notes: With this packages you can build llama.cpp with OPENBLAS and CLBLAST support for use OpenCL GPU acceleration in FreeBSD. Please read the instructions for use and activate this options in this document below.

Metal Build

On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU. To disable the Metal build at compile time use the LLAMA_NO_METAL=1 flag or the LLAMA_METAL=OFF cmake option.

When built with Metal support, you can explicitly disable GPU inference with the --n-gpu-layers|-ngl 0 command-line argument.

MPI Build

MPI lets you distribute the computation over a cluster of machines. Because of the serial nature of LLM prediction, this won't yield any end-to-end speed-ups, but it will let you run larger models than would otherwise fit into RAM on a single machine.

First you will need MPI libraries installed on your system. The two most popular (only?) options are MPICH and OpenMPI. Either can be installed with a package manager (apt, Homebrew, MacPorts, etc).

Next you will need to build the project with LLAMA_MPI set to true on all machines; if you're building with make, you will also need to specify an MPI-capable compiler (when building with CMake, this is configured automatically):

Using make:
```
make CC=mpicc CXX=mpicxx LLAMA_MPI=1
```
Using CMake:
```
cmake -S . -B build -DLLAMA_MPI=ON
```

Once the programs are built, download/convert the weights on all of the machines in your cluster. The paths to the weights and programs should be identical on all machines.

Next, ensure password-less SSH access to each machine from the primary host, and create a hostfile with a list of the hostnames and their relative "weights" (slots). If you want to use localhost for computation, use its local subnet IP address rather than the loopback address or "localhost".

Here is an example hostfile:

192.168.0.1:2
malvolio.local:1

The above will distribute the computation across 2 processes on the first host and 1 process on the second host. Each process will use roughly an equal amount of RAM. Try to keep these numbers small, as inter-process (intra-host) communication is expensive.

Finally, you're ready to run a computation using mpirun:

mpirun -hostfile hostfile -n 3 ./main -m ./models/7B/ggml-model-q4_0.gguf -n 128

BLAS Build

Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). Support with CPU-only BLAS implementations doesn't affect the normal generation performance. We may see generation performance improvements with GPU-involved BLAS implementations, e.g. cuBLAS, hipBLAS and CLBlast. There are currently several different BLAS implementations available for build and use:

Accelerate Framework:

This is only available on Mac PCs and it's enabled by default. You can just build using the normal instructions.
OpenBLAS:

This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.
- Using make:
  - On Linux:
```
make LLAMA_OPENBLAS=1
```
  - On Windows:
    1. Download the latest fortran version of w64devkit.
    2. Download the latest version of OpenBLAS for Windows.
    3. Extract w64devkit on your pc.
    4. From the OpenBLAS zip that you just downloaded copy libopenblas.a, located inside the lib folder, inside w64devkit\x86_64-w64-mingw32\lib.
    5. From the same OpenBLAS zip copy the content of the include folder inside w64devkit\x86_64-w64-mingw32\include.
    6. Run w64devkit.exe.
    7. Use the cd command to reach the llama.cpp folder.
    8. From here you can run:
      make LLAMA_OPENBLAS=1
- Using CMake on Linux:
```
cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
```
BLIS

Check BLIS.md for more information.
SYCL

SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators.

llama.cpp based on SYCL is used to support Intel GPU (Data Center Max series, Flex series, Arc series, Built-in GPU and iGPU).

For detailed info, please refer to llama.cpp for SYCL.
Intel oneMKL

Building through oneAPI compilers will make avx_vnni instruction set available for intel processors that do not support avx512 and avx512_vnni. Please note that this build config does not support Intel GPU. For Intel GPU support, please refer to llama.cpp for SYCL.
- Using manual oneAPI installation: By default, LLAMA_BLAS_VENDOR is set to Generic, so if you already sourced intel environment script and assign -DLLAMA_BLAS=ON in cmake, the mkl version of Blas will automatically been selected. Otherwise please install oneAPI and follow the below steps:
```
source /opt/intel/oneapi/setvars.sh # You can skip this step if  in oneapi-basekit docker image, only required for manual installation
cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_NATIVE=ON
cmake --build build --config Release
```
- Using oneAPI docker image: If you do not want to source the environment vars and install oneAPI manually, you can also build the code using intel docker container: oneAPI-basekit. Then, you can use the commands given above.
Check Optimizing and Running LLaMA2 on Intel® CPU for more information.

CUDA

This provides GPU acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. apt install nvidia-cuda-toolkit) or from here: CUDA Toolkit.

For Jetson user, if you have Jetson Orin, you can try this: Offical Support. If you are using an old model(nano/TX2), need some additional operations before compiling.

Using make:
```
make LLAMA_CUDA=1
```

Using CMake:

cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release

The environment variable CUDA_VISIBLE_DEVICES can be used to specify which GPU(s) will be used. The following compilation options are also available to tweak performance:

Option	Legal values	Default	Description
LLAMA_CUDA_FORCE_DMMV	Boolean	false	Force the use of dequantization + matrix vector multiplication kernels instead of using kernels that do matrix vector multiplication on quantized data. By default the decision is made based on compute capability (MMVQ for 6.1/Pascal/GTX 1000 or higher). Does not affect k-quants.
LLAMA_CUDA_DMMV_X	Positive integer >= 32	32	Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants.
LLAMA_CUDA_MMV_Y	Positive integer	1	Block size in y direction for the CUDA mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended.
LLAMA_CUDA_F16	Boolean	false	If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs.
LLAMA_CUDA_KQUANTS_ITER	1 or 2	2	Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs.
LLAMA_CUDA_PEER_MAX_BATCH_SIZE	Positive integer	128	Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial.

hipBLAS

This provides BLAS acceleration on HIP-supported AMD GPUs. Make sure to have ROCm installed. You can download it from your Linux distro's package manager or from here: ROCm Quick Start (Linux).

Using make:
```
make LLAMA_HIPBLAS=1
```
Using CMake for Linux (assuming a gfx1030-compatible AMD GPU):
```
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ \
    cmake -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16
```
On Linux it is also possible to use unified memory architecture (UMA) to share main memory between the CPU and integrated GPU by setting -DLLAMA_HIP_UMA=ON". However, this hurts performance for non-integrated GPUs (but enables working with integrated GPUs).
Using make (example for target gfx1030, build with 16 CPU threads):
```
make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1030
```
Using CMake for Windows (using x64 Native Tools Command Prompt for VS, and assuming a gfx1100-compatible AMD GPU):
```
set PATH=%HIP_PATH%\bin;%PATH%
mkdir build
cd build
cmake -G Ninja -DAMDGPU_TARGETS=gfx1100 -DLLAMA_HIPBLAS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
```
Make sure that AMDGPU_TARGETS is set to the GPU arch you want to compile for. The above example uses gfx1100 that corresponds to Radeon RX 7900XTX/XT/GRE. You can find a list of targets here Find your gpu version string by matching the most significant version information from rocminfo | grep gfx | head -1 | awk '{print $2}' with the list of processors, e.g. gfx1035 maps to gfx1030.

The environment variable HIP_VISIBLE_DEVICES can be used to specify which GPU(s) will be used. If your GPU is not officially supported you can use the environment variable [HSA_OVERRIDE_GFX_VERSION] set to a similar GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3. The following compilation options are also available to tweak performance (yes, they refer to CUDA, not HIP, because it uses the same code as the cuBLAS version above):

Option	Legal values	Default	Description
LLAMA_CUDA_DMMV_X	Positive integer >= 32	32	Number of values in x direction processed by the HIP dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants.
LLAMA_CUDA_MMV_Y	Positive integer	1	Block size in y direction for the HIP mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants.
LLAMA_CUDA_KQUANTS_ITER	1 or 2	2	Number of values processed per iteration and per HIP thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs.

CLBlast

OpenCL acceleration is provided by the matrix multiplication kernels from the CLBlast project and custom kernels for ggml that can generate tokens on the GPU.

You will need the OpenCL SDK.
- For Ubuntu, Debian, and Fedora the packages opencl-headers, ocl-icd may be needed.
- For Windows, a pre-built SDK is available on the OpenCL Releases page.
- Installing the OpenCL SDK from source
```
git clone --recurse-submodules https://github.com/KhronosGroup/OpenCL-SDK.git
cd OpenCL-SDK
cmake -B build -DBUILD_DOCS=OFF \
  -DBUILD_EXAMPLES=OFF \
  -DBUILD_TESTING=OFF \
  -DOPENCL_SDK_BUILD_SAMPLES=OFF \
  -DOPENCL_SDK_TEST_SAMPLES=OFF
cmake --build build
cmake --install build --prefix /some/path
```
Installing CLBlast

Pre-built CLBlast binaries may be found on the CLBlast Releases page. For Unix variants, it may also be found in your operating system's packages.

Linux packaging: Fedora Linux:
```
sudo dnf install clblast
```
Alternatively, they may be built from source.
- Windows:
```
set OPENCL_SDK_ROOT="C:/OpenCL-SDK-v2023.04.17-Win-x64"
git clone https://github.com/CNugteren/CLBlast.git
cd CLBlast
cmake -B build -DBUILD_SHARED_LIBS=OFF -DOVERRIDE_MSVC_FLAGS_TO_MT=OFF -DTUNERS=OFF -DOPENCL_ROOT=%OPENCL_SDK_ROOT% -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release
cmake --install build --prefix C:/CLBlast
```
  (note: --config Release at build time is the default and only relevant for Visual Studio builds - or multi-config Ninja builds)
- Unix:
```
git clone https://github.com/CNugteren/CLBlast.git
cd CLBlast
cmake -B build -DBUILD_SHARED_LIBS=OFF -DTUNERS=OFF
cmake --build build --config Release
cmake --install build --prefix /some/path
```
  Where /some/path is where the built library will be installed (default is /usr/local).
Building Llama with CLBlast
- Build with make: < 1CF5 div class="highlight highlight-source-shell notranslate position-relative overflow-auto" dir="auto" data-snippet-clipboard-copy-content="make LLAMA_CLBLAST=1">
```
make LLAMA_CLBLAST=1
```

Name		Name
Latest commit History 2,768 Commits
.devops		.devops
.github		.github
ci		ci
cmake		cmake
common		common
docs		docs
examples		examples
ggml-cuda		ggml-cuda
gguf-py		gguf-py
grammars		grammars
kompute @ 4565194		kompute @ 4565194
kompute-shaders		kompute-shaders
media		media
models		models
pocs		pocs
prompts		prompts
requirements		requirements
scripts		scripts
spm-headers		spm-headers
tests		tests
.clang-tidy		.clang-tidy
.dockerignore		.dockerignore
.ecrc	.ecrc
.editorconfig		.editorconfig
.flake8		.flake8
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
AUTHORS		AUTHORS
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
Makefile		Makefile
Package.swift		Package.swift
README-sycl.md		README-sycl.md
README.md		README.md
SECURITY.md		SECURITY.md
build.zig		build.zig
codecov.yml		codecov.yml
convert-hf-to-gguf-update.py		convert-hf-to-gguf-update.py
convert-hf-to-gguf.py		convert-hf-to-gguf.py
convert-llama-ggml-to-gguf.py		convert-llama-ggml-to-gguf.py
convert-lora-to-ggml.py		convert-lora-to-ggml.py
convert-persimmon-to-gguf.py		convert-persimmon-to-gguf.py
convert.py		convert.py
flake.lock		flake.lock
flake.nix		flake.nix
ggml-alloc.c		ggml-alloc.c
ggml-alloc.h		ggml-alloc.h
ggml-backend-impl.h		ggml-backend-impl.h
ggml-backend.c		ggml-backend.c
ggml-backend.h		ggml-backend.h
ggml-common.h		ggml-common.h
ggml-cuda.cu		ggml-cuda.cu
ggml-cuda.h		ggml-cuda.h
ggml-impl.h		ggml-impl.h
ggml-kompute.cpp		ggml-kompute.cpp
ggml-kompute.h		ggml-kompute.h
ggml-metal.h		ggml-metal.h
ggml-metal.m		ggml-metal.m
ggml-metal.metal		ggml-metal.metal
ggml-mpi.c		ggml-mpi.c
ggml-mpi.h		ggml-mpi.h
ggml-opencl.cpp		ggml-opencl.cpp
ggml-opencl.h		ggml-opencl.h
ggml-quants.c		ggml-quants.c
ggml-quants.h		ggml-quants.h
ggml-sycl.cpp		ggml-sycl.cpp
ggml-sycl.h		ggml-sycl.h
ggml-vulkan-shaders.hpp		ggml-vulkan-shaders.hpp
ggml-vulkan.cpp		ggml-vulkan.cpp
ggml-vulkan.h		ggml-vulkan.h
ggml.c		ggml.c
ggml.h		ggml.h
ggml_vk_generate_shaders.py		ggml_vk_generate_shaders.py
llama.cpp		llama.cpp
llama.h		llama.h
mypy.ini		mypy.ini
requirements.txt		requirements.txt
sgemm.cpp		sgemm.cpp
sgemm.h		sgemm.h
unicode-data.cpp		unicode-data.cpp
unicode-data.h		unicode-data.h
unicode.cpp		unicode.cpp
unicode.h		unicode.h

Model	Original size	Quantized size (Q4_0)
7B	13 GB	3.9 GB
13B	24 GB	7.8 GB
30B	60 GB	19.5 GB
65B	120 GB	38.5 GB

Model	Measure	F16	Q4_0	Q4_1	Q5_0	Q5_1	Q8_0
7B	perplexity	5.9066	6.1565	6.0912	5.9862	5.9481	5.9070
7B	file size	13.0G	3.5G	3.9G	4.3G	4.7G	6.7G
7B	ms/tok @ 4th	127	55	54	76	83	72
7B	ms/tok @ 8th	122	43	45	52	56	67
7B	bits/weight	16.0	4.5	5.0	5.5	6.0	8.5
13B	perplexity	5.2543	5.3860	5.3608	5.2856	5.2706	5.2548
13B	file size	25.0G	6.8G	7.6G	8.3G	9.1G	13G
13B	ms/tok @ 4th	-	103	105	148	160	131
13B	ms/tok @ 8th	-	73	82	98	105	128
13B	bits/weight	16.0	4.5	5.0	5.5	6.0	8.5

License

tanaydin/llama.cpp

Folders and files

Latest commit

History

Repository files navigation

llama.cpp

Recent API changes

Hot topics

Description

Usage

Get the Code

Build

Metal Build

MPI Build

BLAS Build

Accelerate Framework:

OpenBLAS:

BLIS

SYCL

Intel oneMKL

CUDA

hipBLAS

CLBlast

Installing CLBlast

Building Llama with CLBlast

Running Llama with CLBlast

Vulkan

Prepare and Quantize

Run the quantized model

Running on Windows with prebuilt binaries

Memory/Disk Requirements

Quantization

Perplexity (measuring model quality)

How to run

Interactive mode

Persistent Interaction

Constrained output with grammars

Instruct mode

Obtaining and using the Facebook LLaMA 2 model

Seminal papers and background on the models

Android

Building the Project using Android NDK

Building the Project using Termux (F-Droid)

Docker

Prerequisites

Images

Usage

Docker With CUDA

Building Locally

Usage

Contributing

Coding guidelines

Docs

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages