llama : try loading tensors with pre-computed hashes #13106
Conversation
Steps to try this out:

$ bin/llama-gguf-hash --fnv gemma-3-4b-it-q4_0.gguf

This will create gemma-3-4b-it-q4_0.gguf.rpc with pre-computed FNV-1a hashes for all of the tensors.
I don't think this function needs to be part of the base ggml interface; it is just never going to be used outside of the RPC backend. It would be better to use a custom function that is obtained via [...].

I also do not think that a 64-bit hash is enough to uniquely identify a tensor. The chance of collisions is too high for this to be acceptable.
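For context, a rough birthday-bound estimate of the silent-collision risk for a b-bit hash over n distinct cached tensors (my numbers, not from the thread):

P_{\text{collision}} \approx \frac{n(n-1)}{2^{b+1}}

With b = 64 and n = 10^6 cached tensors this is roughly 2.7e-8: small, but it is a silent data-corruption failure mode. With b = 128 it drops to about 1.5e-27, which is in line with the switch to 128-bit hashes discussed below.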
I think this is the best way to manage large models using RPC. Performance gains will start at 2x for 2 RPC servers and go up from there (3x for 3, 4x for 4, etc.), predominantly when loading models over USB3 HDDs, which will be a very typical use case (I don't store models on SSDs to avoid degrading their lifetime, particularly big models). Since RPC is a core feature, and arguably one of the best features llama.cpp offers for running big models on limited-VRAM commodity GPUs, I think it makes sense to alter APIs or add hooks as necessary to let it be as efficient as possible. Once the model is cached in system RAM the gains will be smaller, but for iterating on versions of big models (lately I have been testing hybrid layer quants on Llama Scout, for example) it will remove a huge amount of time to load and test new versions.
Ah, right, I forgot about this feature, thanks. Will rework the patch with [...].
I guess we can switch to the 128-bit version of the FNV hash and bump the version of the RPC protocol. I don't think we need anything bigger than 128 bits here. As for the hash function, I think that a non-cryptographic hash like FNV is fine since we use it for caching purposes. If we go with 128-bit hashes, then I guess we should store them as hex strings in the GGUF metadata, right?
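For reference, a minimal sketch of the 64-bit FNV-1a variant being discussed (illustrative only, not the exact code in llama-gguf-hash). The 128-bit variant follows the same pattern with wider constants, and a 128-bit digest maps naturally onto a 32-character hex string if stored in GGUF metadata:

#include <cstddef>
#include <cstdint>

// 64-bit FNV-1a over a byte buffer (e.g. a tensor's raw data)
static uint64_t fnv1a_64(const uint8_t * data, size_t len) {
    uint64_t hash = 0xcbf29ce484222325ULL; // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        hash ^= data[i];
        hash *= 0x100000001b3ULL;          // FNV prime
    }
    return hash;
}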
(force-pushed from 7792798 to a3f2520)
I have updated the patch with [...].
(force-pushed from a3f2520 to dc89d36)
I was thinking that it would be nice if this were formulated in a way that other backends could also benefit from, for example to implement DirectStorage or cuFile support for faster loading times. But for that the function signature would need to be changed significantly to instead take a file name and an offset within the file from which to load the tensor data. This wouldn't really fit the current design of the RPC backend, since it assumes one hash per tensor, and instead it would require one hash per file.
The current proposal uses the following function prototype:

bool load_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor, size_t offset, uint64_t hash);

The implementation is supposed to load the data from the backend's local cache using the provided hash. Why is this not going to work for CUDA with DirectStorage/cuFile? I understand this approach requires more disk space (2x), but I think it's a fair trade-off. Moreover, we can do some tricks with packing/unpacking GGUF files to a local cache to save space if needed.
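To make the intent concrete, here is a rough sketch of what the RPC-side implementation of such a hook could look like: a lookup in a local cache directory keyed by the hash, falling back to the normal path when the blob is not cached. The directory layout and file naming below are my assumptions, not the PR's actual cache format:

#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>
#include "ggml-backend.h"

// Illustrative sketch (not the PR's code): try to satisfy a load_tensor(...)
// request from a local cache directory keyed by the tensor hash.
static bool load_tensor_from_cache(ggml_tensor * tensor, size_t offset, uint64_t hash,
                                   const std::string & cache_dir) {
    if (offset != 0) {
        return false; // keep the sketch simple: whole-tensor loads only
    }
    char name[17];
    snprintf(name, sizeof(name), "%016" PRIx64, hash);
    FILE * f = fopen((cache_dir + "/" + name).c_str(), "rb");
    if (!f) {
        return false; // not cached: the caller falls back to sending the raw data
    }
    const size_t size = ggml_nbytes(tensor);
    std::vector<uint8_t> buf(size);
    const bool ok = fread(buf.data(), 1, size, f) == size;
    fclose(f);
    if (ok) {
        // copy the cached bytes into the (possibly device-resident) tensor
        ggml_backend_tensor_set(tensor, buf.data(), 0, size);
    }
    return ok;
}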
The idea would be to add a more generic function to load data from a file into a tensor, something like [...]. I am not saying that it is necessary to do it this way, but if it is viable, it could be a more generic way to support this that would simplify the model loading code.
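For illustration, the file-based hook could have roughly the shape below; this mirrors the prototype that the loader code later in this thread ends up using, but the typedef name is mine:

#include "ggml-backend.h"

// A backend-provided hook that loads a tensor's bytes straight from a model file,
// so the host never has to read them: file_offset locates the data inside the
// file, tensor_offset/size describe the destination range within the tensor.
typedef bool (*ggml_backend_load_tensor_t)(
        ggml_backend_buffer_t buffer, ggml_tensor * tensor, const char * path,
        size_t file_offset, size_t tensor_offset, size_t size);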
The whole point of this PR is to avoid reading tensor data on the main host and to use pre-computed hashes, because this brings a massive reduction in load time for the RPC backend. Going with something like [...].

Once we have support for loading tensor data with a pre-computed hash, we can start working on making it asynchronous, which will reduce load time even more when multiple RPC servers are being used. I think that would be a game changer for distributed LLM inference.
The only difference is that you would need to store a hash per-file instead of per-tensor. It would require more changes to the existing code, but it could be done.
I agree with the approach you are taking. When working with huge GGUFs over multiple RPC servers it's often necessary to change the tensor split multiple times to optimize VRAM distribution, and the per-tensor hash approach minimizes the amount of new data that needs to be sent to the RPC servers on reloads when adjusting tensor splits. It's easy to automatically launch all remote RPC servers over ssh and set the cache directory to a short name for the model while doing so, which keeps the cache directory nicely organized and easy to manage, and loading the model is just one command on the master machine, independent of the number of RPC servers. Async would be amazing too: another speedup proportional to the number of RPC servers, by allowing each one to load its tensor data in parallel at the same time.
This feature would be incredibly useful given the trend of huge MoE models coming out now.
Would it be better to store only the hashes in the gguf.rpc file, and have it sit alongside the original .gguf?
No. The hashes need to be in the specific GGUF they were created for.
I have reworked the patch to use a single hash for the entire GGUF and to load tensors with the function prototype suggested by @slaren. The RPC backend reads the model hash in [...]. With this implementation I can load [...]. There are still some rough edges, but I'd like to get some feedback on the general approach.
This approach is going to be much less efficient than the original scheme when working with big GGUFs over multiple RPC servers, because you are now forcing the big GGUF to be sent to all the RPC servers prior to loading the model, which completely defeats the efficiency (in fact makes it much worse) when prototyping new GGUFs. In particular, in my recent workflow on hybrid quants with Llama Scout I create tons of variations of the GGUF and test them; that whole process would be slowed to a crawl if I had to first send the entire GGUF to each machine before running the model. Maybe there would be a way to just fall back to the old scheme if there is no GGUF passed to the RPC server? Passing a GGUF to the RPC server and avoiding the cache is clean but incredibly inefficient when prototyping new GGUF quants.
The local cache of the RPC server is not going away. If the model file doesn't have a pre-computed hash, everything will work as before. |
Do you know why that is? I would expect the performance to be the same. |
I suspect that seeking into a large file doesn't come for free, but I am not 100% sure. The old implementation is here.
The same seek should be necessary in both cases. From what I can tell the previous implementation did the same thing, using C++ iostream instead of fopen, but I don't think that should make a significant difference either. It may just be measurement error.

Currently this is not used when mmap is disabled, but that may be the most efficient way to use this feature, because with mmap the entire file is prefetched first. Without mmap, only the tensors that are used in the local device would be loaded from disk, which should be faster if the file is not already in the system cache.

@gapeleon's suggestion to store the hashes separately in a [...]
Yes, I understand that, and wasn't suggesting you could use them for different GGUFs, just that they don't need to be embedded in the file with the weights. For example, if you have model-Q4_K_M.gguf [...] and then [...]. This way we can keep the pristine originals (for distribution, use with other tools, etc.) without having to double the storage requirements. It would also decrease the storage requirements for producing the .rpc file (particularly important with all these >200 GB models coming out now).
I don't agree with this approach. There is nothing that locks the .rpc file to the GGUF with the tensors. The hashes should be stored in the metadata of the GGUF they were created for. It's much cleaner without another clutter file to worry about sending around and keeping synced. If the hashes are in the GGUF, they are always in sync with that GGUF by design.
When iterating on GGUFs it would just be necessary to not use the hashes at all and have it work as master works now, without the hash mechanism. This will result in the minimum necessary amount of data being sent to the RPC servers for new GGUFs. However, when the tensor split inevitably crashes with out-of-memory and a reload is necessary, the reload will be just as slow as it is now. The existing design only requires the necessary tensors to be shipped over the network to each RPC server, and after a tensor-split overload crash the next load will be much faster since they are already on the machine (this is not hypothetical; I have been running this flow extensively over the last couple of weeks and the RPC hash is saving a huge amount of time on tensor overload crashes, which always happen). There is no way to get that operational efficiency if the entire GGUF needs to be transferred to each RPC server before the hash mechanism can work.
Turns out it depends on whether the model is in the page cache, i.e. if this is the first run of the server with a particular model. I have used [...]
I think the current approach with one hash per model is good enough in terms of performance.
I think we can support both methods for storing hashes. We can try to load the hash from [...]. I have been using [...]
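For the embedded-metadata variant, reading the hash back via ggml's public gguf API could look roughly like this; the key name "general.hash" is a placeholder of mine, not something this PR defines:

#include <string>
#include "gguf.h"

// Illustrative sketch: read a hex-encoded model hash from GGUF metadata.
static bool read_model_hash(const char * gguf_path, std::string & hash_hex) {
    gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    gguf_context * ctx = gguf_init_from_file(gguf_path, params);
    if (!ctx) {
        return false;
    }
    bool found = false;
    const int64_t key_id = gguf_find_key(ctx, "general.hash"); // placeholder key name
    if (key_id >= 0 && gguf_get_kv_type(ctx, key_id) == GGUF_TYPE_STRING) {
        hash_hex = gguf_get_val_str(ctx, key_id);
        found = true;
    }
    gguf_free(ctx);
    return found;
}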
You can still use the local cache of the RPC server during the experimentation phase, and once you have the final GGUF you can embed a hash into it and transfer it to your RPC nodes.
This would be perfect. Your first design approach with per-tensor hashes was optimal for many reasons. The RPC machines only need to store the tensors they will be using. So, for instance, if I create a monster 200 GB Maverick file and want to ship it around to five or six RPC boxes, I do not want to have to replicate that monster file on every RPC box; I only want the specific tensors each RPC box will be running. I can also offload just the attention tensors to a specific machine to expand the KV cache memory range without having to download the huge FFN tensors, which are not needed for that use case. The per-tensor hash approach is flexible enough to handle any use case optimally.

I honestly don't see any use case for ever transferring the whole GGUF to the remote RPC node. The per-tensor hash already covers every use case I can see optimally. All that is needed to improve the existing design is to pre-compute the hashes and store them in the GGUF, so the master machine is not forced to read the whole file, just the tensors it will be using, and to ship the pre-computed tensor hashes to the RPC nodes.
Here are the results of fully offloading [...]
src/llama-model-loader.cpp
bool llama_model_loader::load_tensor(ggml_tensor * cur, const char * path, size_t file_offset, size_t tensor_offset, size_t size) {
    if (!rpc_load_tensor_fn) {
        return false;
    }
    // resolve the buffer that actually owns the data (views share their source's buffer)
    ggml_backend_buffer_t buf = cur->view_src ? cur->view_src->buffer : cur->buffer;
    const char * buf_name = ggml_backend_buffer_name(buf);
    // only RPC buffers can load tensor data from a file path on the remote side
    if (strncmp(buf_name, "RPC", 3) != 0) {
        return false;
    }
    return rpc_load_tensor_fn(buf, cur, path, file_offset, tensor_offset, size);
}
@slaren Now that the signature of the function has been updated as per #13106 (comment), does it make sense for the function to become part of the backend interface? I think your idea was to make it like this so that the changes in llama_model_loader would be very minimal, there would be no special RPC-only logic, and in the future other backends would be able to implement it as well.
(force-pushed from 84c8cef to e672602)
I have also implemented [...]. When GDS is not available, cuFile APIs can still be used in compatibility mode. While this is great for testing, it is slightly slower than what we already do in [...]. Unfortunately I don't have a GDS environment for testing and I am not able to provide any performance results. I may try to set one up on AWS but it looks really complicated.
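For readers who haven't used cuFile, the call sequence such an implementation typically follows looks roughly like this (condensed, error handling trimmed, and not the exact code in this PR; cuFileDriverOpen() is assumed to have been called once at startup):

#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <cufile.h>

// Read `size` bytes at `file_offset` from `path` directly into device memory at
// dst + dst_offset. With GDS the DMA goes straight to the GPU; in compatibility
// mode cuFile falls back to an internal bounce buffer.
static bool cuda_load_from_file(void * dst, size_t dst_offset,
                                const char * path, size_t file_offset, size_t size) {
    const int fd = open(path, O_RDONLY | O_DIRECT); // O_DIRECT is preferred by GDS
    if (fd < 0) {
        return false;
    }
    CUfileDescr_t descr;
    memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t fh;
    if (cuFileHandleRegister(&fh, &descr).err != CU_FILE_SUCCESS) {
        close(fd);
        return false;
    }
    const ssize_t n = cuFileRead(fh, dst, size, (off_t) file_offset, (off_t) dst_offset);

    cuFileHandleDeregister(fh);
    close(fd);
    return n == (ssize_t) size;
}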
@JohannesGaessler I am trying to implement a new backend API [...]
@@ -3410,6 +3418,68 @@ static ggml_backend_feature * ggml_backend_cuda_get_features(ggml_backend_reg_t
    GGML_UNUSED(reg);
}

static bool ggml_backend_cuda_buffer_load_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor, const char * path, size_t file_offset, size_t tensor_offset, size_t size) {
If I understand this function correctly it loads a single tensor from a file. The driver is being re-used but the file handle is not - I don't know how much overhead that causes though.
Yes, file handles are re-created every time and I am not sure how to address this with the current design. Adding a cache is going to leak both handles and descriptors which is not great.
From what I can tell, you are never explicitly registering the buffers that you load data into.
We can register the buffer in ggml_backend_cuda_buffer_type_alloc_buffer(), but it'd be nice to have some performance data with the internal buffers first. The doc also suggests that registering the buffer is needed for small IO sizes, which is generally not the case here.
Sorry, unfortunately I don't have a suitable machine either; my machines are using either consumer GPUs or SATA SSDs. Using an RTX 4090, a Lexar NM790 4 TB SSD, and the following command:

export model_name=llama_3-8b && export quantization=f16
time ./cli --model models/opt/${model_name}-${quantization}.gguf --n-gpu-layers 99 -p "" -n 1 --no-warmup -fa --no-mmap

[...]
This patch proposes a new backend API for loading a tensor's data with a pre-computed hash stored in the model KV metadata. The main use case is to allow faster model load times with the RPC backend when multiple hosts are used for distributed LLM inference.

When the model is being loaded with mmap and there is a pre-computed hash available for the current tensor, we can skip reading the actual data and ask the backend to load the data with the specified hash. The RPC backend already supports this with the RPC_CMD_SET_TENSOR_HASH command (which may be renamed to RPC_CMD_LOAD_TENSOR).

In this PoC I have modified llama-gguf-hash to generate a new GGUF file with a .rpc suffix which has FNV-1a hashes for all of the tensors.

With the rpc-server running on localhost I am seeing huge improvements in model load time, e.g.:

gemma-3-4b-it-q4_0.gguf     - 5052.50 ms
gemma-3-4b-it-q4_0.gguf.rpc - 2148.51 ms

I will do more testing with a local LAN network soon.

I understand that making API changes for one particular use case is generally not a good idea, but there is a significant performance gain here. Let me know what you think.