convert : allow partial update to the chkhsh pre-tokenizer list #13847

ngxson · 2025-05-28T09:23:36Z

This is a QoL change that makes the convert_hf_to_gguf_update script easier for contributors to use.

Previously, every run updated all models, which is not ideal when the user does not have access to some of them. With this change, only newly added models are updated.

The old behavior can still be triggered with the --full flag.
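
A minimal sketch of the partial-update idea described above, not the actual code from this PR. The names `select_models`, `known_names`, and the two model entries are hypothetical; only the `--full` flag comes from the description.

```python
import argparse


def select_models(all_models, known_names, full=False):
    """Return the models whose pre-tokenizer hash should be (re)computed.

    all_models  : list of dicts with at least a "name" key (hypothetical shape)
    known_names : names that already have a chkhsh entry
    full        : when True, fall back to the old "update everything" behavior
    """
    if full:
        return list(all_models)
    return [m for m in all_models if m["name"] not in known_names]


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="partial chkhsh update (sketch)")
    parser.add_argument("--full", action="store_true",
                        help="recompute every model instead of only new ones")
    return parser.parse_args(argv)


# Hypothetical model list: one entry is already known, one was just added.
models = [
    {"name": "already-listed"},
    {"name": "newly-added"},
]
known = {"already-listed"}

partial = select_models(models, known, full=parse_args([]).full)
everything = select_models(models, known, full=parse_args(["--full"]).full)

print([m["name"] for m in partial])
print([m["name"] for m in everything])
```

With this shape, a contributor who only adds one model never has to download the tokenizers of the models already in the list.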

CISC · 2025-05-28T09:37:20Z

Nice, but can we merge #11600 first (since you are adding some missing inp/out files anyway)? :)

ngxson · 2025-05-28T13:45:34Z

@CISC yes sounds ok for me

CISC · 2025-05-28T13:50:39Z

@ngxson Done, now you have to update all the .out files. :)

CISC · 2025-05-28T14:14:18Z

@ngxson Uhm, it shouldn't be necessary to add inp/out files for those we don't have GGUFs for.

CISC · 2025-05-28T14:15:39Z

convert_hf_to_gguf.py

-        if chkhsh == "1431a23e583c97432bc230bff598d103ddb5a1f89960c8f1d1051aaa944d0b35":
+        if chkhsh == "68fa7e0a33050885cc10a2acfa4df354042188f0afa03b809f7a71c4cde6e373":
            # ref: https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0
            res = "minerva-7b"


What happened here, was the model updated?

They updated tokenizer.json, and removed

{ "type": "Digits", "individual_digits": true }

Might warrant an updated regex?

Yep, we need to preserve the old hash as minerva-7b, using this regex:

llama.cpp/src/llama-vocab.cpp, lines 337 to 341 in c3a2624:

    case LLAMA_VOCAB_PRE_TYPE_MINERVA:
        regex_exprs = {
            "\\p{N}",
            "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        };

And add a new name for the new hash, using this regex:

llama.cpp/src/llama-vocab.cpp, lines 347 to 351 in c3a2624:

    case LLAMA_VOCAB_PRE_TYPE_TRILLION:
        regex_exprs = {
            "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        };
        break;

I brought back the old hash, so nothing changes for this model. Tbh I think no one is actually using it, so it's not worth the time to fix.
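
To illustrate why removing the Digits pre-tokenizer changes the hash, here is a rough sketch under simplifying assumptions. Python's `re` module has no `\p{L}` or `\p{N}`, so the patterns below are ASCII-only stand-ins for the two regex sets quoted above. The helpers `split_with`, `pre_tokenize`, and `fingerprint` are hypothetical. As far as I can tell, the convert script hashes the token IDs produced for a fixed check text; this sketch hashes the pre-tokenized pieces instead.

```python
import re
from hashlib import sha256

# ASCII-only approximations of the patterns quoted above (assumption):
# \d stands in for \p{N}, [A-Za-z] stands in for \p{L}.
DIGIT = r"\d"
GPT2 = r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+(?!\S)"


def split_with(pattern, pieces):
    """Split every piece into regex matches plus the unmatched gaps between them."""
    out = []
    for piece in pieces:
        pos = 0
        for m in re.finditer(pattern, piece):
            if m.start() > pos:
                out.append(piece[pos:m.start()])
            out.append(m.group())
            pos = m.end()
        if pos < len(piece):
            out.append(piece[pos:])
    return out


def pre_tokenize(text, patterns):
    """Apply the patterns one after another, each on the previous pass's pieces."""
    pieces = [text]
    for pattern in patterns:
        pieces = split_with(pattern, pieces)
    return pieces


def fingerprint(pieces):
    """Hash the string form of the piece list (stand-in for hashing token IDs)."""
    return sha256(str(pieces).encode()).hexdigest()


text = "in 2025"
old_style = pre_tokenize(text, [DIGIT, GPT2])  # digits split individually first
new_style = pre_tokenize(text, [GPT2])         # digit runs kept together

print(old_style, fingerprint(old_style)[:12])
print(new_style, fingerprint(new_style)[:12])
```

Because the two pre-tokenizations disagree on any text that contains digits, the resulting hash differs, which is why one hash cannot stand for both versions of the tokenizer.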

ngxson · 2025-05-28T14:17:35Z

@ngxson Uhm, it shouldn't be necessary to add inp/out files for those we don't have GGUFs for.

I have no idea; I thought the update script would handle that case. I can add a condition in the Python script to skip such models.

ngxson · 2025-05-28T14:21:59Z

They should be removed now

ngxson · 2025-05-28T14:29:12Z

Hmm not sure why some files are deleted inadvertently. Seems like some are missing from the list in python code

Edit: I think it's fine though, if we don't have the original tokenizer in python code, it's impossible to generate the .out file

CISC · 2025-05-28T14:44:09Z

convert_hf_to_gguf_update.py

+if hf_token is None:
+    logger.error("HF token is required. Please provide it as an argument or set it in ~/.cache/huggingface/token")
+    sys.exit(1)


Since this will now be used for mostly public models I don't think we should require a token.

All of them are public, but some are gated, so a token is still needed.

For example: gemma, llama, dbrx, command-r, etc

Yeah, but isn't the point that regular people don't have to download them now?
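
One possible way to reconcile the two positions above, sketched under assumptions and not taken from this PR: look up the token from the argument or the cache file, and only skip gated models when no token is found. The `gated` field and both helper names are hypothetical; the `~/.cache/huggingface/token` path comes from the error message in the diff.

```python
from pathlib import Path
import tempfile

DEFAULT_TOKEN_FILE = Path.home() / ".cache" / "huggingface" / "token"


def resolve_hf_token(cli_token=None, token_file=DEFAULT_TOKEN_FILE):
    """Return the token from the argument, else from the cache file, else None."""
    if cli_token:
        return cli_token
    try:
        return Path(token_file).read_text(encoding="utf-8").strip() or None
    except OSError:
        # A missing or unreadable file simply means "no token available".
        return None


def can_download(model, token):
    """Public models need no token; gated ones do (hypothetical 'gated' flag)."""
    return token is not None or not model.get("gated", False)


with tempfile.TemporaryDirectory() as tmp:
    token_path = Path(tmp) / "token"

    # No token anywhere: only the gated model is skipped.
    token = resolve_hf_token(None, token_path)
    models = [
        {"name": "public-model", "gated": False},
        {"name": "gated-model", "gated": True},
    ]
    skipped = [m["name"] for m in models if not can_download(m, token)]

    # Once the cache file exists, the token is picked up from it.
    token_path.write_text("hf_dummy\n", encoding="utf-8")
    token_from_file = resolve_hf_token(None, token_path)

print(skipped, token_from_file)
```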

CISC · 2025-05-28T14:54:33Z

Hmm not sure why some files are deleted inadvertently. Seems like some are missing from the list in python code

Edit: I think it's fine though, if we don't have the original tokenizer in python code, it's impossible to generate the .out file

Not quite, at least nomic-bert-moe is used in tests.

ngxson · 2025-05-28T15:06:38Z

Not quite, at least nomic-bert-moe is used in tests.

Not sure how I can fix this. As the .inp file changed, we need a way to regenerate the .out file. Any ideas?

CISC · 2025-05-28T15:14:21Z

Not quite, at least nomic-bert-moe is used in tests.

Not sure how I can fix this. As the .inp file changed, we need a way to regenerate the .out file. Any ideas?

Well, we can remove it from the test then and add the files under ggml-org/vocabs/UGM.

ngxson · 2025-05-28T15:20:09Z

OK, I commented the test out of CMakeLists.

CISC · 2025-05-28T17:10:30Z

OK, I commented the test out of CMakeLists.

You can remove the comment, I just recently added it.

CISC · 2025-05-28T17:17:20Z

There's no point in keeping the orphaned GGUFs (even though they will haunt git forever :) ) I guess?

ggml-vocab-aquila.gguf
ggml-vocab-baichuan.gguf
ggml-vocab-gpt-neox.gguf
ggml-vocab-nomic-bert-moe.gguf
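
A list like the one above could be produced with a small check such as the following sketch. It assumes the test inputs and outputs sit next to each vocab file as `<name>.gguf.inp` and `<name>.gguf.out`; the function name and the demo files are hypothetical.

```python
from pathlib import Path
import tempfile


def find_orphaned_vocabs(models_dir):
    """Return vocab GGUF file names that lack an .inp or .out sibling."""
    orphans = []
    for gguf in sorted(Path(models_dir).glob("ggml-vocab-*.gguf")):
        inp = gguf.with_name(gguf.name + ".inp")
        out = gguf.with_name(gguf.name + ".out")
        if not (inp.exists() and out.exists()):
            orphans.append(gguf.name)
    return orphans


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # One vocab with both test files, one without any.
    for name in ("ggml-vocab-example.gguf",
                 "ggml-vocab-example.gguf.inp",
                 "ggml-vocab-example.gguf.out",
                 "ggml-vocab-aquila.gguf"):
        (root / name).touch()
    orphans = find_orphaned_vocabs(root)

print(orphans)
```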

ngxson · 2025-05-28T17:19:41Z

Keeping them, maybe someone will add back the inp/out file in the future

convert_hf_to_gguf_update.py

CISC

Fix minerva, otherwise LGTM.

convert : allow partial update to the chkhsh pre-tokenizer list

7697161

ngxson requested a review from ggerganov May 28, 2025 09:23

ggerganov approved these changes May 28, 2025

github-actions bot added the python python script changes label May 28, 2025

ngxson added 3 commits May 28, 2025 15:53

Merge branch 'master' into xsn/convert_update_qol

8bab97a

code style

16247c4

update tokenizer out

ac5449b

ngxson requested review from ggerganov and CISC May 28, 2025 14:08

ngxson mentioned this pull request May 28, 2025

convert: handle when model's tokenization method relies on Mecab #13830

Closed

rm inp/out files for models not having gguf

eb23a95

fixed hash for glm

787c36d

CISC reviewed May 28, 2025

skip nomic-bert-moe test

85e3350

github-actions bot added the testing Everything test related label May 28, 2025

CISC reviewed May 29, 2025

Update convert_hf_to_gguf_update.py

5774158

ngxson requested a review from CISC May 29, 2025 12:19

CISC approved these changes May 29, 2025

ngxson added 2 commits May 30, 2025 11:42

fix minerva-7b hash

defa0df

rm redundant import

714d403

ngxson merged commit 07e4351 into ggml-org:master May 30, 2025
49 checks passed
