
convert_hf : faster lazy safetensors #8482

Merged
merged 3 commits on Jul 16, 2024

Conversation

@compilade (Collaborator) commented on Jul 14, 2024

Currently, with lazy conversion, a relatively big portion of the model files is read before the output file is even started, and then, if the disk cache is smaller than the model, the data is read from disk again during the actual conversion.

Most of the time in the initial read is spent on model_part.get_tensor(name) (at least when using safetensors).

It turns out safetensors has the much faster .get_slice(name), which doesn't read the tensor data until it's needed, while still giving access to the shape and dtype of each tensor.
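
As a rough sketch of the difference (illustrative only, not the convert_hf_to_gguf.py code; the file name is made up):

    from safetensors import safe_open

    with safe_open("model-00001-of-00002.safetensors", framework="pt") as f:
        for name in f.keys():
            # f.get_tensor(name) would read and materialize the whole tensor here.
            # get_slice() only parses metadata, so the shape and dtype are
            # available without touching the tensor bytes.
            s = f.get_slice(name)
            print(name, s.get_shape(), s.get_dtype())
            # The data is only read from disk when the slice is indexed:
            # data = s[:]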

As a nice result, this makes convert_hf_to_gguf.py --dry-run much, much faster than before for slow disks and/or big models (seconds instead of minutes). Normal lazy conversion is also faster, since the initial metadata reading step doesn't unnecessarily read all the data anymore.
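
For instance, a dry run over a local model directory looks like this (the path is illustrative):

    $ python convert_hf_to_gguf.py --dry-run /path/to/model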

Note that I've also removed some unused code in gguf-py/gguf/tensor_mapping.py related to the number of experts. xid hasn't existed in the mappings since stacked experts were implemented, so .format(xid = xid) is a no-op (see the short illustration below).
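
To see why, note that str.format silently ignores keyword arguments that have no matching placeholder (the mapping string below is hypothetical):

    # Extra keyword arguments without a matching placeholder are ignored,
    # so passing xid to a mapping string without {xid} changes nothing.
    template = "model.layers.{bid}.mlp.gate"  # hypothetical mapping string
    assert template.format(bid=0, xid=7) == template.format(bid=0)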

Testing

After fixing the problem found in #8482 (comment), I've run some more tests.

Files tagged -no-slices- were converted on master at commit 97bdd26, while -slices-recurse- files were converted after the memory leak fix in b971122.

  • TinyMistral MoE (https://huggingface.co/jtatman/TinyMistral-248m-v2.5-4x-Moe)
    $ sha256sum tinymistral-moe-no-slices-q8_0.gguf tinymistral-moe-slices-recurse-q8_0.gguf 
    19f86b44d7bc10a2053d0afb3f7815d7901450755b715fa4c04e323c5e0ea036  tinymistral-moe-no-slices-q8_0.gguf
    19f86b44d7bc10a2053d0afb3f7815d7901450755b715fa4c04e323c5e0ea036  tinymistral-moe-slices-recurse-q8_0.gguf
  • CosMoEm 4x1b (https://huggingface.co/Lambent/cosmoem-4x1b)
    $ sha256sum cosmoem-4x1b-no-slices-q8_0.gguf cosmoem-4x1b-slices-recurse-q8_0.gguf 
    98bff7846527e103468045d498e1a856dc70e555ecabdc4fedc9ae6087c35926  cosmoem-4x1b-no-slices-q8_0.gguf
    98bff7846527e103468045d498e1a856dc70e555ecabdc4fedc9ae6087c35926  cosmoem-4x1b-slices-recurse-q8_0.gguf
  • InternLM2 (https://huggingface.co/internlm/internlm2-chat-1_8b)
    $ sha256sum internlm2-1.8B-no-slices-q8_0.gguf internlm2-1.8B-slices-recurse-q8_0.gguf 
    ddb5d5ecb7c12c61f197a513f1de407809b48d13ffb18ed78a336cb12173ef28  internlm2-1.8B-no-slices-q8_0.gguf
    ddb5d5ecb7c12c61f197a513f1de407809b48d13ffb18ed78a336cb12173ef28  internlm2-1.8B-slices-recurse-q8_0.gguf
  • Mamba (https://huggingface.co/state-spaces/mamba-130m-hf)
    $ sha256sum mamba-130m-no-slices-q8_0.gguf mamba-130m-slices-recurse-q8_0.gguf 
    ababd062531d9f961db3681b170dd31937a5a98d774baf6a6eb9a8c8caf5edcd  mamba-130m-no-slices-q8_0.gguf
    ababd062531d9f961db3681b170dd31937a5a98d774baf6a6eb9a8c8caf5edcd  mamba-130m-slices-recurse-q8_0.gguf
  • Bloom LoRA (https://huggingface.co/player1537/Bloom-560m-LoRA-trained-on-Dolphin with base https://huggingface.co/bigscience/bloom-560m)
    $ sha256sum lora-bloom-dolphin-no-slices-f16.gguf lora-bloom-dolphin-slices-recurse-f16.gguf 
    2a176f1df5892af05475cce9659dfbf801bcdae1ffc2b0007ca1d88970c8f252  lora-bloom-dolphin-no-slices-f16.gguf
    2a176f1df5892af05475cce9659dfbf801bcdae1ffc2b0007ca1d88970c8f252  lora-bloom-dolphin-slices-recurse-f16.gguf

@github-actions bot added the "python" label (python script changes) on Jul 14, 2024
@compilade added the "Review Complexity : Low" label on Jul 15, 2024
@compilade added the "merge ready" label on Jul 15, 2024
@compilade (Collaborator, Author) commented on Jul 15, 2024

This seems to cause a memory usage regression when lazily converting MoE models. Looks a lot like a memory leak. I'll try to fix this before merging.

EDIT: In my tests it's only happening with MoE models, which suggests it's not a safetensors issue, although it's weird if it only affects torch.stack. I'll try to understand and report back.

EDIT2: Seems like it might be related to a reference cycle in LazyBase, I assume in the _lazy deque. Manually running gc.collect() "fixes" the problem. I wonder why this problem didn't manifest itself before? I think the real solution will be to remove that queue and use recursion for evaluation instead. That queue was used to avoid deep recursion, but it has unintuitive behaviors (with correct results) in some cases like when concatenating tensors, and I don't think there's deep enough conversion chains in any of the model subclasses for the recursion limit (1000 by default) to be a problem, especially since complex operations like quantization are abstracted away with meta_noop.

EDIT3: implemented the recursive solution in b971122.
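
As a rough sketch of that recursive approach (illustrative only; the names are made up, and this is not the actual LazyBase code in gguf-py):

    from typing import Any, Callable

    class LazyValue:
        def __init__(self, func: Callable[..., Any], args: tuple = ()):
            self._func = func    # how to compute this value
            self._args = args    # inputs, which may themselves be LazyValue
            self._result: Any = None
            self._done = False

        def eval(self) -> Any:
            # Resolve dependencies by recursion; with no shared deque there is
            # no self-referential cycle keeping old objects alive until the
            # garbage collector happens to run.
            if not self._done:
                args = tuple(a.eval() if isinstance(a, LazyValue) else a
                             for a in self._args)
                self._result = self._func(*args)
                self._done = True
            return self._result

    # A small chain of deferred operations:
    a = LazyValue(lambda: [1, 2, 3])
    b = LazyValue(lambda x: [v * 2 for v in x], (a,))
    print(b.eval())  # [2, 4, 6]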

@compilade removed the "merge ready" label on Jul 15, 2024
Commit b971122: convert_hf : fix memory leak in lazy MoE conversion
The '_lazy' queue was sometimes self-referential, which caused reference cycles of objects old enough to avoid garbage collection until potential memory exhaustion.
@compilade added the "merge ready" label on Jul 16, 2024
@compilade merged commit 7acfd4e into master on Jul 16, 2024
12 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* convert_hf : faster lazy safetensors

This makes '--dry-run' much, much faster.

* convert_hf : fix memory leak in lazy MoE conversion

The '_lazy' queue was sometimes self-referential,
which caused reference cycles of objects old enough
to avoid garbage collection until potential memory exhaustion.