Description
Multi-GPU inference is essential for GPUs with small VRAM: a 13B LLaMA model cannot fit on a single RTX 3090 without quantization.
llama.cpp merged its multi-GPU branch yesterday, which lets us deploy LLMs across several small-VRAM GPUs:
ggml-org/llama.cpp#1703
I hope llama-cpp-python can support multi-GPU inference in the future.
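
For reference, a rough sketch of how this might look on the Python side, assuming a hypothetical `tensor_split` parameter that mirrors llama.cpp's new `--tensor-split` / `--n-gpu-layers` options from the linked PR (the exact API name and signature are just an illustration, not an existing feature):

```python
# Hypothetical sketch: splitting a 13B model across two GPUs via llama-cpp-python.
# `tensor_split` is assumed here for illustration, modeled on llama.cpp's
# --tensor-split option; it is not (yet) part of the bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-13b/ggml-model-f16.bin",
    n_gpu_layers=40,          # offload transformer layers to the GPU(s)
    tensor_split=[0.5, 0.5],  # hypothetical: distribute tensors evenly across two GPUs
)

out = llm("Q: Why is multi-GPU inference useful? A:", max_tokens=64)
print(out["choices"][0]["text"])
```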
Many thanks!!!