8000 convert : allow partial update to the chkhsh pre-tokenizer list by ngxson · Pull Request #13847 · ggml-org/llama.cpp · GitHub
[go: up one dir, main page]

Skip to content

convert : allow partial update to the chkhsh pre-tokenizer list #13847

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 30, 2025

Conversation

ngxson
Copy link
Collaborator
@ngxson ngxson commented May 28, 2025

This is a QoL change which makes the convert_hf_to_gguf_update script easier to used by contributors.

Traditionally, each time it's run, we update all models, which may not be ideal if user doesn't have access to it. With this change, we only update added models.

The old behavior can still be triggered using --full flag

@ngxson ngxson requested a review from ggerganov May 28, 2025 09:23
@CISC
Copy link
Collaborator
CISC commented May 28, 2025

Nice, but can we merge #11600 first (since you are adding some missing inp/out files anyway)? :)

@github-actions github-actions bot added the python python script changes label May 28, 2025
@ngxson
Copy link
Collaborator Author
ngxson commented May 28, 2025

@CISC yes sounds ok for me

@CISC
Copy link
Collaborator
CISC commented May 28, 2025

@ngxson Done, now you have to update all the .out files. :)

@ngxson ngxson requested review from ggerganov and CISC May 28, 2025 14:08
@CISC
Copy link
Collaborator
CISC commented May 28, 2025

@ngxson Uhm, it shouldn't be necessary to add inp/out files for those we don't have GGUFs for.

CISC
Comment on lines 767 to 766
if chkhsh == "1431a23e583c97432bc230bff598d103ddb5a1f89960c8f1d1051aaa944d0b35":
if chkhsh == "68fa7e0a33050885cc10a2acfa4df354042188f0afa03b809f7a71c4cde6e373":
# ref: https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0
res = "minerva-7b"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened here, was the model updated?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They updated tokenizer.json, and removed

{
    "type": "Digits",
    "individual_digits": true
}

Might warrant an updated regex?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, we need to preserve the old hash as minerva-7b, using this regex:

case LLAMA_VOCAB_PRE_TYPE_MINERVA:
regex_exprs = {
"\\p{N}",
"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
};

And add a new name for the new hash, using this regex:

case LLAMA_VOCAB_PRE_TYPE_TRILLION:
regex_exprs = {
"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
};
break;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I brought back the old hash, so nothing change for this model. Tbh I think no one is actually using it, so not worth the time to fix it.

@ngxson
Copy link
Collaborator Author
ngxson commented May 28, 2025

@ngxson Uhm, it shouldn't be necessary to add inp/out files for those we don't have GGUFs for.

I have no idea, I thought the update script should handle that case. I can add a condition in the python script to skip such models

@ngxson
Copy link
Collaborator Author
ngxson commented May 28, 2025

They should be removed now

@ngxson
Copy link
Collaborator Author
ngxson commented May 28, 2025

Hmm not sure why some files are deleted inadvertently. Seems like some are missing from the list in python code

Edit: I think it's fine though, if we don't have the original tokenizer in python code, it's impossible to generate the .out file

Comment on lines +72 to +74
if hf_token is None:
logger.error("HF token is required. Please provide it as an argument or set it in ~/.cache/huggingface/token")
sys.exit(1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this will now be used for mostly public models I don't think we should require a token.

Copy link
Collaborator Author
@ngxson ngxson May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of them are public, but some are gated, so token is still needed

For example: gemma, llama, dbrx, command-r, etc

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but isn't the point that regular people don't have to download them now?

@CISC
Copy link
Collaborator
CISC commented May 28, 2025

Hmm not sure why some files are deleted inadvertently. Seems like some are missing from the list in python code

Edit: I think it's fine though, if we don't have the original tokenizer in python code, it's impossible to generate the .out file

Not quite, at least nomic-bert-moe is used in tests.

@ngxson
Copy link
Collaborator Author
ngxson commented May 28, 2025

Not quite, at least nomic-bert-moe is used in tests.

Not sure how I can fix this. As .inp file change, we need to find a way to re-generate .out file. Any ideas?

@CISC
Copy link
Collaborator
CISC commented May 28, 2025

Not quite, at least nomic-bert-moe is used in tests.

Not sure how I can fix this. As .inp file change, we need to find a way to re-generate .out file. Any ideas?

Well, we can remove it from the test then and add the files under ggml-org/vocabs/UGM.

@ngxson
Copy link
Collaborator Author
ngxson commented May 28, 2025

Ok I comment the test out from cmakelists

@github-actions github-actions bot added the testing Everything test related label May 28, 2025
@CISC
Copy link
Collaborator
CISC commented May 28, 2025

Ok I comment the test out from cmakelists

You can remove the comment, I just recently added it.

@CISC
Copy link
Collaborator
CISC commented May 28, 2025

There's no point in keeping the orphaned GGUFs (even though they will haunt git forever :) ) I guess?

ggml-vocab-aquila.gguf
ggml-vocab-baichuan.gguf
ggml-vocab-gpt-neox.gguf
ggml-vocab-nomic-bert-moe.gguf

@ngxson
Copy link
Collaborator Author
ngxson commented May 28, 2025

Keeping them, maybe someone will add back the inp/out file in the future

@ngxson ngxson requested a review from CISC May 29, 2025 12:19
Copy link
Collaborator
@CISC CISC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix minerva, otherwise LGTM.

@ngxson ngxson merged commit 07e4351 into ggml-org:master May 30, 2025
49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python python script changes testing Everything test related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0