llama : try loading tensors with pre-computed hashes by rgerganov · Pull Request #13106 · ggml-org/llama.cpp · GitHub

llama : try loading tensors with pre-computed hashes #13106

Open · wants to merge 4 commits into master

Conversation

rgerganov (Collaborator)

This patch proposes a new backend API for loading tensor data using a precomputed hash stored in the model KV metadata.
The main use case is to allow faster model load time with the RPC backend when multiple hosts are used for distributed LLM inference.
When the model is being loaded with mmap and there is a precomputed hash available for the current tensor, we can skip reading the actual data and ask the backend to load the data with the specified hash. The RPC backend already supports this with the RPC_CMD_SET_TENSOR_HASH command (which may be renamed to RPC_CMD_LOAD_TENSOR).
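
As a rough illustration of this flow (a sketch only: llm_kv_tensor_hash and backend_load_tensor_hash below are hypothetical stand-ins for the KV lookup and the proposed backend call), the loader-side check could look like:

#include "ggml-backend.h"

// Hypothetical helpers, for illustration only: look up the pre-computed hash
// stored in the model KV for a tensor name, and ask the backend to load the
// tensor data identified by that hash.
bool llm_kv_tensor_hash(const char * tensor_name, uint64_t * hash_out);
bool backend_load_tensor_hash(ggml_backend_buffer_t buf, ggml_tensor * t, size_t offset, uint64_t hash);

// If a pre-computed hash is available, skip reading the data from the mmap'd
// file and let the backend (e.g. RPC) materialize it; on failure the caller
// falls back to the regular read + set_tensor path.
static bool try_load_with_hash(ggml_backend_buffer_t buf, ggml_tensor * t) {
    uint64_t hash;
    if (!llm_kv_tensor_hash(ggml_get_name(t), &hash)) {
        return false;
    }
    return backend_load_tensor_hash(buf, t, /*offset=*/0, hash);
}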

In this PoC I have modified llama-gguf-hash to generate a new GGUF file with a .rpc suffix which has FNV-1a hashes for all of the tensors.

With the rpc-server running on localhost I am seeing a huge improvement in model load time, e.g.:

  • gemma-3-4b-it-q4_0.gguf - 5052.50 ms
  • gemma-3-4b-it-q4_0.gguf.rpc - 2148.51 ms

I will do more testing over a local LAN soon.

I understand that making API changes for one particular use case is generally not a good idea, but there is a significant performance gain here.
Let me know what you think.

@rgerganov rgerganov requested a review from slaren April 25, 2025 11:00
@github-actions bot added labels: examples, ggml (changes relating to the ggml tensor library for machine learning) on Apr 25, 2025
@rgerganov (Collaborator, Author)

Steps to try this out:

  1. Generate a GGUF with precomputed hashes using llama-gguf-hash:

     $ bin/llama-gguf-hash --fnv gemma-3-4b-it-q4_0.gguf

     This will create gemma-3-4b-it-q4_0.gguf.rpc, which is the same as gemma-3-4b-it-q4_0.gguf plus precomputed hashes for all tensors.

  2. Start rpc-server with a local cache:

     $ bin/rpc-server -c

  3. Run bin/llama-cli -m gemma-3-4b-it-q4_0.gguf.rpc --rpc localhost:50052 -ngl 99 .... The first run will populate the RPC cache and the load time for the second run should be much smaller.

@github-actions bot added labels: Nvidia GPU, Vulkan, SYCL, Apple Metal, Kompute on Apr 25, 2025
@slaren (Member) commented Apr 25, 2025

I don't think this function needs to be part of the base ggml interface; it is just never going to be used outside of the RPC backend. It would be better to use a custom function that is obtained via ggml_backend_reg_get_proc_address.

I also do not think that a 64-bit hash is enough to uniquely identify a tensor. The chance of collisions is too high for this to be acceptable.
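
For reference, a minimal sketch of the proc-address approach (the loader signature and the symbol name "ggml_backend_rpc_load_tensor" are assumptions here, not the actual exported API):

#include "ggml-backend.h"

// Assumed loader signature, for illustration only.
typedef bool (*rpc_load_tensor_t)(ggml_backend_buffer_t buffer, ggml_tensor * tensor,
                                  size_t offset, uint64_t hash);

static rpc_load_tensor_t get_rpc_load_tensor_fn(void) {
    ggml_backend_reg_t reg = ggml_backend_reg_by_name("RPC");
    if (!reg) {
        return nullptr;
    }
    // returns NULL if the backend does not export this symbol,
    // so non-RPC backends are handled transparently
    return (rpc_load_tensor_t) ggml_backend_reg_get_proc_address(reg, "ggml_backend_rpc_load_tensor");
}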

@steampunque

I understand that making API changes for one particular use case is generally not a good idea, but there is a significant performance gain here. Let me know what you think.

I think this is the best way to manage large models using RPC. Performance gains will start at 2x for 2 RPC servers and go up from there (3x for 3, 4x for 4, etc.), predominantly when loading models over USB3 HDDs, which will be a very typical use case (I don't store models on SSDs to avoid degrading their lifetime, particularly big models). Since RPC is a core feature and arguably one of the best features llama.cpp offers for running big models on limited VRAM commodity GPUs, I think it makes sense to alter APIs or add hooks as necessary to let it be as efficient as possible. Once the model is cached into system RAM the gains will be smaller, but for iterating versions of big models (lately I have been testing hybrid layer quants on Llama Scout, for example) it will save a huge amount of time spent loading and testing new versions.

@rgerganov (Collaborator, Author)

I don't think this function needs to be part of the base ggml interface, it is just never going to be used outside of the RPC backend. It would be better to use a custom function that is obtained via ggml_backend_reg_get_proc_address.

Ah, right, I forgot about this feature, thanks. I will rework the patch with ggml_backend_reg_get_proc_address.

I also do not think that a 64-bit hash is enough to uniquely identify a tensor. The chance of collisions is too high for this to be acceptable.

I guess we can switch to the 128-bit version of the FNV hash and bump the version of the RPC protocol. I don't think we need anything bigger than 128 bits here. As for the hash function, I think that a non-cryptographic hash like FNV is fine since we use it for caching purposes.

If we go with 128-bit hashes then I guess we should store them as hex strings in the GGUF metadata, right?
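
For context, the 64-bit FNV-1a used here boils down to the standard algorithm below (a sketch; a 128-bit variant has the same structure with wider constants and, as discussed, could be stored as a hex string in the GGUF KV):

#include <stdint.h>
#include <stddef.h>

// 64-bit FNV-1a over a byte buffer, using the standard offset basis and prime.
static uint64_t fnv1a_64(const uint8_t * data, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;   // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;            // FNV prime
    }
    return h;
}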

@rgerganov (Collaborator, Author)

I have updated the patch with ggml_backend_reg_get_proc_address()

@rgerganov rgerganov changed the title ggml-backend : add load_tensor() to backend API llama : try loading tensors with pre-computed hashes Apr 29, 2025
@slaren (Member) commented Apr 29, 2025

I was thinking that it would be nice if this were formulated in a way that other backends could also benefit from, for example to implement DirectStorage or cuFile support for faster loading times. But for that the function signature would need to change significantly to instead take a file name and an offset within the file from which to load the tensor data. This wouldn't really fit the current design for the RPC backend, since that assumes one hash per tensor, whereas this would require one hash per file.

@rgerganov (Collaborator, Author)

The current proposal uses the following function prototype:

bool load_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor, size_t offset, uint64_t hash);

The implementation is supposed to load the data from $LLAMA_CACHE/hash and this is already implemented for the RPC backend.

Why is this not going to work for CUDA with DirectStorage/cuFile?

I understand this approach requires more disk space (2x) but I think it's a fair trade-off. Moreover, we can do some tricks with packing/unpacking GGUF files to the local cache to save space if needed.
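
To make the cache lookup concrete, here is a minimal sketch of "load from $LLAMA_CACHE/hash" on the serving side (the path layout and helper are assumptions, not the actual rpc-server code):

#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

// Read `size` bytes of tensor data identified by its FNV hash from the local cache.
static bool load_cached_tensor(uint64_t hash, void * dst, size_t size) {
    const char * cache_dir = getenv("LLAMA_CACHE");
    if (!cache_dir) {
        return false;
    }
    char path[4096];
    snprintf(path, sizeof(path), "%s/%016" PRIx64, cache_dir, hash);
    FILE * f = fopen(path, "rb");
    if (!f) {
        return false;   // cache miss -> fall back to the regular set_tensor path
    }
    bool ok = fread(dst, 1, size, f) == size;
    fclose(f);
    return ok;
}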

@slaren (Member) commented Apr 30, 2025

The idea would be to add a more generic function to load data from a file into a tensor, something like bool load_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor, const char * path, size_t file_offset, size_t tensor_offset, size_t size). This would allow other backends to implement accelerated file loading functions, such as DirectStorage to read from SSD to VRAM directly. The RPC backend would need to implement this function probably by obtaining or calculating a hash of the file, and sending that hash to the RPC server to see if it has that file.

I am not saying that it is necessary to do it this way, but if it is viable it could be a more generic way to support this that would simplify the model loading code.
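
To make the proposed semantics concrete, a naive reference implementation of that signature (no DirectStorage/cuFile acceleration and no RPC hashing, just the plain file path) might look like:

#include <cstdint>
#include <cstdio>
#include <vector>
#include "ggml-backend.h"

// Read `size` bytes at `file_offset` from `path` and copy them into the tensor
// at `tensor_offset` through the regular set_tensor path. An accelerated
// backend would instead hand the (path, offset, size) triple to DirectStorage,
// cuFile, or the RPC server.
static bool load_tensor_fallback(ggml_backend_buffer_t buffer, ggml_tensor * tensor,
                                 const char * path, size_t file_offset,
                                 size_t tensor_offset, size_t size) {
    (void) buffer;
    std::FILE * f = std::fopen(path, "rb");
    if (!f) {
        return false;
    }
    std::vector<uint8_t> data(size);
    bool ok = std::fseek(f, (long) file_offset, SEEK_SET) == 0 &&
              std::fread(data.data(), 1, size, f) == size;
    std::fclose(f);
    if (ok) {
        ggml_backend_tensor_set(tensor, data.data(), tensor_offset, size);
    }
    return ok;
}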

@rgerganov (Collaborator, Author)

The whole point of this PR is to avoid reading tensor data on the main host and use pre-computed hashes, because this brings a massive reduction in load time for the RPC backend. Going with something like bool load_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor, const char * path, size_t file_offset, size_t tensor_offset, size_t size) brings no practical benefit for RPC because we already do this in ggml_backend_rpc_buffer_set_tensor (i.e. hashing the data and sending the hash to the server).

Once we have support for loading tensor data with a pre-computed hash, we can start working on making it asynchronous, which will reduce load time even more when multiple RPC servers are used. I think that would be a game changer for distributed LLM inference.

@slaren (Member) commented May 1, 2025

The only difference is that you would need to store a hash per-file instead of per-tensor. It would require more changes to the existing code, but it could be done.

@steampunque

The whole point of this PR is to avoid reading tensor data on the main host and use pre-computed hashes, because this brings a massive reduction in load time for the RPC backend. Going with something like bool load_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor, const char * path, size_t file_offset, size_t tensor_offset, size_t size) brings no practical benefit for RPC because we already do this in ggml_backend_rpc_buffer_set_tensor (i.e. hashing the data and sending the hash to the server).

Once we have support for loading tensor data with a pre-computed hash, we can start working on making it asynchronous, which will reduce load time even more when multiple RPC servers are used. I think that would be a game changer for distributed LLM inference.

I agree with the approach you are taking. When working with huge GGUFs over multiple RPC servers it's often necessary to change the tensor split multiple times to optimize VRAM distribution, and the per-tensor hash approach minimizes the amount of new data that needs to be sent to the RPC servers on reloads when adjusting tensor splits. It's easy to automatically launch all remote RPC servers over ssh and set the cache directory to a short name for the model while doing so, which keeps the cache directory nicely organized and easy to manage; loading the model is then just one command on the master machine, independent of the number of RPC servers.

Async would be amazing too: another speedup proportional to the number of RPC servers, by allowing each one to load its tensor data in parallel.

@gapeleon commented May 3, 2025

This feature would be incredibly useful given the trend of huge MoE models coming out now.

This will create gemma-3-4b-it-q4_0.gguf.rpc, which is the same as gemma-3-4b-it-q4_0.gguf plus precomputed hashes for all tensors.

Would it be better to store only the hashes in the gguf.rpc file, and have it sit alongside the original .gguf? This would make distribution and storage a lot easier.

@steampunque commented May 5, 2025

Would it be better to store only the hashes in the gguf.rpc file, and have it sit alongside the original .gguf? This would make distribution and storage a lot easier.

No. The hashes need to be in the specific gguf they were created for.

@rgerganov (Collaborator, Author)

I have reworked the patch to use a single hash for the entire GGUF and load tensors with the function prototype suggested by @slaren. The RPC backend reads the model hash in ggml_backend_rpc_buffer_load_tensor() and keeps a local cache of model path -> hash. The RPC server now accepts a model file and reads tensor data from it when RPC_CMD_LOAD_TENSOR comes in.

With this implementation I can load gemma-3-4b-it-q4_0.gguf.rpc in 3018.22 ms with RPC on localhost, which is a bit slower than my original implementation. However, it is still much faster than master.

There are still some rough edges but I'd like to get some feedback on the general approach.

@steampunque

I have reworked the patch to use a single hash for the entire GGUF and load tensors with the function prototype suggested by @slaren. The RPC backend reads the model hash in ggml_backend_rpc_buffer_load_tensor() and keeps a local cache of model path -> hash. The RPC server now accepts a model file and reads tensor data from it when RPC_CMD_LOAD_TENSOR comes in.

With this implementation I can load gemma-3-4b-it-q4_0.gguf.rpc in 3018.22 ms with RPC on localhost, which is a bit slower than my original implementation. However, it is still much faster than master.

There are still some rough edges but I'd like to get some feedback on the general approach.

This approach is going to be much less efficient than the original scheme when working with big GGUFs over multiple RPC servers, because you are now forcing the big GGUF to be sent to all the RPC servers prior to loading the model, which completely defeats the efficiency (in fact makes it much worse) when prototyping new GGUFs. In particular, in my recent workflow on hybrid quants with Llama Scout I create tons of variations of the GGUF and test them; that whole process would be slowed to a crawl if I had to first send the entire GGUF to each machine before running the model. Maybe there would be a way to just fall back to the old scheme if there is no GGUF passed to the RPC server? Passing a GGUF to the RPC server and avoiding the cache is clean but incredibly inefficient when prototyping new GGUF quants.

@rgerganov (Collaborator, Author)

Maybe there would be a way to just fall back to the old scheme if there is no GGUF passed to the RPC server? Passing a GGUF to the RPC server and avoiding the cache is clean but incredibly inefficient when prototyping new GGUF quants.

The local cache of the RPC server is not going away. If the model file doesn't have a pre-computed hash, everything will work as before.

@slaren (Member) commented May 7, 2025

With this implementation I can load gemma-3-4b-it-q4_0.gguf.rpc in 3018.22 ms with RPC on localhost, which is a bit slower than my original implementation. However, it is still much faster than master.

Do you know why that is? I would expect the performance to be the same.

@rgerganov (Collaborator, Author)

Do you know why that is? I would expect the performance to be the same.

I suspect that seeking into a large file doesn't come for free, but I am not 100% sure. The old implementation is here.

@slaren (Member) commented May 7, 2025

The same seek should be necessary in both cases. From what I can tell the previous implementation did the same thing, using C++ iostream instead of fopen, but I don't think that should make a significant difference either. It may be just measurement error.

Currently this is not used when mmap is disabled, but that may be the most efficient way to use this feature, because with mmap the entire file is prefetched first. Without mmap, only the tensors that are used in the local device would be loaded from disk, which should be faster if the file is not already in the system cache.

@gapeleon's suggestion to store the hashes separately in a .rpc file would IMO result in a better experience for most users. You could just generate the hashes automatically the first time a model file is used in the RPC server. Unless people start distributing GGUFs with the hashes, having to create a copy of the model with the hash metadata is going to become very annoying quickly.

@gapeleon commented May 7, 2025

@steampunque

No. The hashes need to be in the specific gguf they were created for.

Yes I understand that, and wasn't suggesting you could use them for different ggufs. Just that they don't need to be embedded in the file with the weights. For example, if you have

model-Q4_K_M.gguf
model-Q4_K_M.gguf.rpc (containing hashes for Q4_K_M version)

And then
model-Q8_0.gguf
model-Q8_0.gguf.rpc (containing hashes for Q8_0 version)

This way we can keep the pristine originals (for distribution, use with other tools, etc) without having to double the storage requirements.

It would also decrease the storage requirements for producing the .rpc file (particularly important with all these >200GB models coming out now).

@steampunque

@steampunque

No. The hashes need to be in the specific gguf they were created for.

Yes I understand that, and wasn't suggesting you could use them for different ggufs. Just that they don't need to be embedded in the file with the weights. For example, if you have

model-Q4_K_M.gguf model-Q4_K_M.gguf.rpc (containing hashes for Q4_K_M version)

And then model-Q8_0.gguf model-Q8_0.gguf.rpc (containing hashes for Q8_0 version)

This way we can keep the pristine originals (for distribution, use with other tools, etc) without having to double the storage requirements.

It would also decrease the storage requirements for producing the .rpc file (particularly important with all these >200GB models coming out now).

I don't agree with this approach. There is nothing that locks the .rpc file to the GGUF with the tensors. The hashes should be stored in the metadata of the GGUF they were created for. It's much cleaner without another clutter file to worry about sending around and keeping synced. If the hashes are in the GGUF they are always synced with that GGUF by design.

@steampunque commented May 7, 2025

Maybe there would be a way to just fall back to the old scheme if there is no GGUF passed to the RPC server? Passing a GGUF to the RPC server and avoiding the cache is clean but incredibly inefficient when prototyping new GGUF quants.

The local cache of the RPC server is not going away. If the model file doesn't have a pre-computed hash, everything will work as before.

When iterating on GGUFs it would just be necessary to not use the hashes at all and have it work as master works now, without the hash mechanism. This will result in the minimum necessary amount of data being sent to the RPC servers for new GGUFs. However, when the tensor split inevitably crashes with out of memory and a reload is necessary, the reload will be just as slow as it is now. The existing design only requires the necessary tensors to be shipped over the network to each RPC server, and after a tensor-split overload crash the next load will be much faster since the data is already on the machine (this is not hypothetical: I have been running this flow extensively over the last couple of weeks and the RPC hash is saving a huge amount of time on tensor overload crashes, which always happen). There is no way to get that operational efficiency if the entire GGUF needs to be transferred to each RPC server before the hash mechanism can work.

@rgerganov (Collaborator, Author)

Do you know why that is? I would expect the performance to be the same.

Turns out it depends on whether the model is in the page cache, i.e. whether this is the first run of the server with a particular model. I have used echo 3 > /proc/sys/vm/drop_caches to simulate cold runs and these are the average results for gemma-3-4b-it-q4_0.gguf on my machine with the RPC server running on localhost:

| impl            | load time (cold) ms | load time (warm) ms |
|-----------------|---------------------|---------------------|
| master          | 11500               | 4700                |
| hash-per-tensor | 5200                | 2200                |
| hash-per-model  | 3300                | 3100                |

I think the current approach with one hash per model is good enough in terms of performance.

@rgerganov (Collaborator, Author)

I don't agree with this approach. There is nothing that locks the .rpc file to the GGUF with the tensors. The hashes should be stored in the metadata of the GGUF they were created for. It's much cleaner without another clutter file to worry about sending around and keeping synced. If the hashes are in the GGUF they are always synced with that GGUF by design.

I think we can support both methods for storing hashes. We can try to load the hash from model_path + .rpc first, and if that fails we can try to load it from the KV metadata.
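
A small sketch of that lookup order (the helpers below are hypothetical, for illustration only):

#include <string>

// Hypothetical helpers: read the model hash from a sidecar .rpc file, or from
// the GGUF KV metadata.
bool read_hash_from_rpc_file(const std::string & rpc_path, std::string & hash_out);
bool read_hash_from_gguf_kv(const std::string & model_path, std::string & hash_out);

// Prefer the sidecar <model>.rpc file, then fall back to the embedded KV hash.
static bool get_model_hash(const std::string & model_path, std::string & hash_out) {
    if (read_hash_from_rpc_file(model_path + ".rpc", hash_out)) {
        return true;
    }
    return read_hash_from_gguf_kv(model_path, hash_out);
}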

I have been using llama-gguf-hash as a quick and dirty way for embedding hashes but I am not sure if this is the best approach. Now I am thinking of implementing @slaren's proposal of generating the .rpc file in rpc-server if the model doesn't have an embedded hash, and in addition to that having a standalone python script for embedding hashes in a GGUF.

I have been running this flow extensively over the last couple of weeks and the RPC hash is saving a huge amount of time on tensor overload crashes, which always happen. There is no way to get that operational efficiency if the entire GGUF needs to be transferred to each RPC server before the hash mechanism can work.

You can still use the local cache of the RPC server during the experimentation phase, and once you have the final GGUF you can embed a hash into it and transfer it to your RPC nodes.

@steampunque commented May 9, 2025

I don't agree with this approach. There is nothing that locks the .rpc file to the GGUF with the tensors. The hashes should be stored in the metadata of the GGUF they were created for. It's much cleaner without another clutter file to worry about sending around and keeping synced. If the hashes are in the GGUF they are always synced with that GGUF by design.

I think we can support both methods for storing hashes. We can try to load the hash from model_path + .rpc first, and if that fails we can try to load it from the KV metadata.

This would be perfect. Your first design approach with per-tensor hashes was optimal for many reasons. The RPC machines only need to store the tensors they will be using. So, for instance, if I create a monster 200G Maverick file and want to ship that around to five or six RPC boxes, I do not want to have to replicate that monster file on every RPC box; I only want the specific tensors each RPC box will be running. I can also just offload attention tensors to a specific machine to expand the KV cache memory range without having to download the huge FFN tensors which are not needed for that use case. The per-tensor hash approach is flexible enough to handle any use case optimally.

I have been using llama-gguf-hash as a quick and dirty way for embedding hashes but I am not sure if this is the best approach. Now I am thinking of implementing @slaren's proposal of generating the .rpc file in rpc-server if the model doesn't have an embedded hash, and in addition to that having a standalone python script for embedding hashes in a GGUF.

I have been running this flow extensively over the last couple of weeks and the RPC hash is saving a huge amount of time on tensor overload crashes, which always happen. There is no way to get that operational efficiency if the entire GGUF needs to be transferred to each RPC server before the hash mechanism can work.

You can still use the local cache of the RPC server during the experimentation phase, and once you have the final GGUF you can embed a hash into it and transfer it to your RPC nodes.

I honestly don't see any use case for ever transferring the whole GGUF to the remote RPC node. The per-tensor hash already covers every use case I can see, optimally. All that is needed to improve your existing design is to pre-compute the hashes, store them in the GGUF so the master machine is not forced to read the whole file (just the tensors it will be using), and ship the pre-computed tensor hashes to the RPC nodes.

@rgerganov rgerganov marked this pull request as ready for review May 9, 2025 12:21
@rgerganov (Collaborator, Author)

Here are the results of fully offloading gemma-3-4b-it-q4_0.gguf (3GB) over a 1 Gbps network connection:

| cmd                                      | load time on main host (ms) |
|------------------------------------------|-----------------------------|
| rpc-server                               | 31200                       |
| rpc-server -c                            | 8600                        |
| rpc-server -c -m gemma-3-4b-it-q4_0.gguf | 3300                        |

Comment on lines 888 to 918
bool llama_model_loader::load_tensor(ggml_tensor * cur, const char * path, size_t file_offset, size_t tensor_offset, size_t size) {
    if (!rpc_load_tensor_fn) {
        return false;
    }
    // resolve the buffer that actually owns the tensor data
    ggml_backend_buffer_t buf = cur->view_src ? cur->view_src->buffer : cur->buffer;
    const char * buf_name = ggml_backend_buffer_name(buf);
    // only RPC buffers are handled through this hook
    if (strncmp(buf_name, "RPC", 3) != 0) {
        return false;
    }
    return rpc_load_tensor_fn(buf, cur, path, file_offset, tensor_offset, size);
}

Member:

@slaren Now that the signature of the function has been updated as per #13106 (comment), does it make sense for the function to become part of the backend interface? I think your idea was to make it like this so that the changes in llama_model_loader would be very minimal, there won't be special RPC-only logic, and in the future it could be implemented in other backends.

@rgerganov (Collaborator, Author)

I have also implemented load_tensor with cuFile APIs in the CUDA backend. This will leverage GPU Direct Storage (GDS) when available and transfer tensor data directly from disk to GPU memory.

When GDS is not available, cuFile APIs can still be used in compatibility mode. While this is great for testing, it is slightly slower than what we already do in llama-model-loader.cpp (mmap + tensor_set). So I put explicit checks to disable load_tensor when the driver is running in compat mode.

Unfortunately I don't have a GDS environment for testing and I am not able to provide any performance results. I may try to set up one on AWS but it looks really complicated. If anyone can try this out and report perf results, it'd be great!
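
For readers unfamiliar with cuFile, the load path roughly follows the pattern below (a simplified sketch under the assumption of a GDS-capable setup, not the exact code in this PR; it assumes cuFileDriverOpen() has already been called once and error handling is trimmed):

#include <fcntl.h>
#include <unistd.h>
#include <cufile.h>

// Read `size` bytes at `file_offset` from `path` directly into device memory
// at dev_ptr + dev_offset via cuFile (GPU Direct Storage when available,
// compat mode otherwise).
static bool cuda_load_tensor_gds(void * dev_ptr, size_t dev_offset,
                                 const char * path, size_t file_offset, size_t size) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        return false;
    }
    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t fh;
    if (cuFileHandleRegister(&fh, &descr).err != CU_FILE_SUCCESS) {
        close(fd);
        return false;
    }
    ssize_t n = cuFileRead(fh, dev_ptr, size, (off_t) file_offset, (off_t) dev_offset);
    cuFileHandleDeregister(fh);
    close(fd);
    return n == (ssize_t) size;
}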

@rgerganov (Collaborator, Author)

@JohannesGaessler I am trying to implement a new backend API load_tensor which uses pre-computed hashes with the RPC backend and cuFile with the CUDA backend. Could you please review the CUDA part? If you happen to have a GDS testbed with nvidia_fs.ko driver loaded, perf results are very much appreciated.

@@ -3410,6 +3418,68 @@ static ggml_backend_feature * ggml_backend_cuda_get_features(ggml_backend_reg_t
GGML_UNUSED(reg);
}

static bool ggml_backend_cuda_buffer_load_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor, const char * path, size_t file_offset, size_t tensor_offset, size_t size) {

Collaborator:

If I understand this function correctly it loads a single tensor from a file. The driver is being re-used but the file handle is not; I don't know how much overhead that causes, though.

Collaborator Author:

Yes, file handles are re-created every time and I am not sure how to address this with the current design. Adding a cache is going to leak both handles and descriptors, which is not great.

Collaborator:

From what I can tell, you are never explicitly registering the buffers that you load data into.

Collaborator Author:

We can register the buffer in ggml_backend_cuda_buffer_type_alloc_buffer(), but it'd be nice to have some performance data with the internal buffers first. The docs also suggest that registering the buffer is needed for small IO sizes, which is generally not the case here.

@JohannesGaessler (Collaborator)

If you happen to have a GDS testbed with nvidia_fs.ko driver loaded, perf results are very much appreciated.

Sorry, unfortunately I don't have a suitable machine either. My machines have either consumer GPUs or SATA SSDs. Using an RTX 4090, a Lexar NM790 4TB SSD, and the following command:

export model_name=llama_3-8b && export quantization=f16
time ./cli --model models/opt/${model_name}-${quantization}.gguf --n-gpu-layers 99 -p "" -n 1 --no-warmup -fa --no-mmap
| Software version | mmap? | Runtime cold [s] | Runtime hot [s] |
|------------------|-------|------------------|-----------------|
| master           | Yes   | 9.990            | 3.985           |
| master           | No    | 9.097            | 3.541           |
| PR               | Yes   | 14.796           | 8.800           |
| PR               | No    | 9.240            | 3.574           |

Labels: Apple Metal, examples, ggml, Kompute, Nvidia GPU, SYCL, Vulkan
Projects: None yet
6 participants