
convert_hf : faster lazy safetensors #8482

Merged
merged 3 commits on Jul 16, 2024

Conversation

@compilade (Collaborator) commented on Jul 14, 2024

Currently, with lazy conversion, a relatively big portion of the model files is read before the output file is even started, and then, if the disk cache is smaller than the model, the data is read from disk again during the actual conversion.

Most of the time in the initial read is spent on model_part.get_tensor(name) (at least when using safetensors).

It turns out safetensors has the much faster .get_slice(name), which doesn't read the tensor data until it's needed, while still giving access to the shape and dtype of each tensor.
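
As a rough sketch of the difference (illustrative only, not the convert_hf_to_gguf.py code; the file name is made up):

    from safetensors import safe_open

    with safe_open("model-00001-of-00002.safetensors", framework="pt") as f:
        for name in f.keys():
            # f.get_tensor(name) would read and materialize the whole tensor here.
            # get_slice() only parses metadata, so the shape and dtype are
            # available without touching the tensor bytes.
            s = f.get_slice(name)
            print(name, s.get_shape(), s.get_dtype())
            # The data is only read from disk when the slice is indexed:
            # data = s[:]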

As a nice result, this makes convert_hf_to_gguf.py --dry-run much, much faster than before for slow disks and/or big models (seconds instead of minutes). Normal lazy conversion is also faster, since the initial metadata reading step doesn't unnecessarily read all the data anymore.
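
For instance, a dry run over a local model directory looks like this (the path is illustrative):

    $ python convert_hf_to_gguf.py --dry-run /path/to/model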

Note that I've also removed some unused code in gguf-py/gguf/tensor_mapping.py related to the number of experts. xid hasn't existed in the mappings since stacked experts were implemented, so .format(xid = xid) is a no-op (see the short illustration below).
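
To see why, note that str.format silently ignores keyword arguments that have no matching placeholder (the mapping string below is hypothetical):

    # Extra keyword arguments without a matching placeholder are ignored,
    # so passing xid to a mapping string without {xid} changes nothing.
    template = "model.layers.{bid}.mlp.gate"  # hypothetical mapping string
    assert template.format(bid=0, xid=7) == template.format(bid=0)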

Testing

After fixing the problem found in #8482 (comment), I've run some more tests.

Files tagged -no-slices- were converted on master at commit 97bdd26, while -slices-recurse- files were converted after the memory leak fix in b971122.

  • TinyMistral MoE (https://huggingface.co/jtatman/TinyMistral-248m-v2.5-4x-Moe)
    $ sha256sum tinymistral-moe-no-slices-q8_0.gguf tinymistral-moe-slices-recurse-q8_0.gguf 
    19f86b44d7bc10a2053d0afb3f7815d7901450755b715fa4c04e323c5e0ea036  tinymistral-moe-no-slices-q8_0.gguf
    19f86b44d7bc10a2053d0afb3f7815d7901450755b715fa4c04e323c5e0ea036  tinymistral-moe-slices-recurse-q8_0.gguf
  • CosMoEm 4x1b (https://huggingface.co/Lambent/cosmoem-4x1b)
    $ sha256sum cosmoem-4x1b-no-slices-q8_0.gguf cosmoem-4x1b-slices-recurse-q8_0.gguf 
    98bff7846527e103468045d498e1a856dc70e555ecabdc4fedc9ae6087c35926  cosmoem-4x1b-no-slices-q8_0.gguf
    98bff7846527e103468045d498e1a856dc70e555ecabdc4fedc9ae6087c35926  cosmoem-4x1b-slices-recurse-q8_0.gguf
  • InternLM2 (https://huggingface.co/internlm/internlm2-chat-1_8b)
    $ sha256sum internlm2-1.8B-no-slices-q8_0.gguf internlm2-1.8B-slices-recurse-q8_0.gguf 
    ddb5d5ecb7c12c61f197a513f1de407809b48d13ffb18ed78a336cb12173ef28  internlm2-1.8B-no-slices-q8_0.gguf
    ddb5d5ecb7c12c61f197a513f1de407809b48d13ffb18ed78a336cb12173ef28  internlm2-1.8B-slices-recurse-q8_0.gguf
  • Mamba (https://huggingface.co/state-spaces/mamba-130m-hf)
    $ sha256sum mamba-130m-no-slices-q8_0.gguf mamba-130m-slices-recurse-q8_0.gguf 
    ababd062531d9f961db3681b170dd31937a5a98d774baf6a6eb9a8c8caf5edcd  mamba-130m-no-slices-q8_0.gguf
    ababd062531d9f961db3681b170dd31937a5a98d774baf6a6eb9a8c8caf5edcd  mamba-130m-slices-recurse-q8_0.gguf
  • Bloom LoRA (https://huggingface.co/player1537/Bloom-560m-LoRA-trained-on-Dolphin with base https://huggingface.co/bigscience/bloom-560m)
    $ sha256sum lora-bloom-dolphin-no-slices-f16.gguf lora-bloom-dolphin-slices-recurse-f16.gguf 
    2a176f1df5892af05475cce9659dfbf801bcdae1ffc2b0007ca1d88970c8f252  lora-bloom-dolphin-no-slices-f16.gguf
    2a176f1df5892af05475cce9659dfbf801bcdae1ffc2b0007ca1d88970c8f252  lora-bloom-dolphin-slices-recurse-f16.gguf

@github-actions bot added the "python" label (python script changes) on Jul 14, 2024
@compilade added the "Review Complexity : Low" label on Jul 15, 2024
@compilade added the "merge ready" label on Jul 15, 2024
@compilade (Collaborator, Author) commented on Jul 15, 2024

This seems to cause a memory usage regression when lazily converting MoE models. Looks a lot like a memory leak. I'll try to fix this before merging.

EDIT: In my tests it's only happening with MoE models, which suggests it's not a safetensors issue, although it's weird if it only affects torch.stack. I'll try to understand and report back.

EDIT2: Seems like it might be related to a reference cycle in LazyBase, I assume in the _lazy deque. Manually running gc.collect() "fixes" the problem. I wonder why this problem didn't manifest itself before? I think the real solution will be to remove that queue and use recursion for evaluation instead. That queue was used to avoid deep recursion, but it has unintuitive behaviors (with correct results) in some cases like when concatenating tensors, and I don't think there's deep enough conversion chains in any of the model subclasses for the recursion limit (1000 by default) to be a problem, especially since complex operations like quantization are abstracted away with meta_noop.

EDIT3: implemented the recursive solution in b971122.
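
As a rough sketch of that recursive approach (illustrative only; the names are made up, and this is not the actual LazyBase code in gguf-py):

    from typing import Any, Callable

    class LazyValue:
        def __init__(self, func: Callable[..., Any], args: tuple = ()):
            self._func = func    # how to compute this value
            self._args = args    # inputs, which may themselves be LazyValue
            self._result: Any = None
            self._done = False

        def eval(self) -> Any:
            # Resolve dependencies by recursion; with no shared deque there is
            # no self-referential cycle keeping old objects alive until the
            # garbage collector happens to run.
            if not self._done:
                args = tuple(a.eval() if isinstance(a, LazyValue) else a
                             for a in self._args)
                self._result = self._func(*args)
                self._done = True
            return self._result

    # A small chain of deferred operations:
    a = LazyValue(lambda: [1, 2, 3])
    b = LazyValue(lambda x: [v * 2 for v in x], (a,))
    print(b.eval())  # [2, 4, 6]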

@compilade removed the "merge ready" label on Jul 15, 2024
Commit b971122: convert_hf : fix memory leak in lazy MoE conversion
The '_lazy' queue was sometimes self-referential, which caused reference cycles of objects old enough to avoid garbage collection until potential memory exhaustion.
@compilade added the "merge ready" label on Jul 16, 2024
@compilade merged commit 7acfd4e into master on Jul 16, 2024
12 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* convert_hf : faster lazy safetensors

This makes '--dry-run' much, much faster.

* convert_hf : fix memory leak in lazy MoE conversion

The '_lazy' queue was sometimes self-referential,
which caused reference cycles of objects old enough
to avoid garbage collection until potential memory exhaustion.