convert : allow partial update to the chkhsh pre-tokenizer list #13847
Conversation
Nice, but can we merge #11600 first (since you are adding some missing inp/out files anyway)? :)
@CISC yes, sounds OK to me
@ngxson Done, now you have to update all the .out files. :)
@ngxson Uhm, it shouldn't be necessary to add inp/out files for those we don't have GGUFs for.
convert_hf_to_gguf.py (outdated diff)
if chkhsh == "1431a23e583c97432bc230bff598d103ddb5a1f89960c8f1d1051aaa944d0b35": | ||
if chkhsh == "68fa7e0a33050885cc10a2acfa4df354042188f0afa03b809f7a71c4cde6e373": | ||
# ref: https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0 | ||
res = "minerva-7b" |
What happened here, was the model updated?
They updated tokenizer.json, and removed
{
"type": "Digits",
"individual_digits": true
}
Might warrant an updated regex?
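A quick way to spot such a change is to inspect the pre_tokenizer section of the downloaded tokenizer.json; this sketch assumes a local copy of the file and just lists the step types:

```python
import json

# Assumes tokenizer.json has been downloaded from the model repo into the working directory.
with open("tokenizer.json", "r", encoding="utf-8") as f:
    tok = json.load(f)

pre = tok.get("pre_tokenizer") or {}
# A "Sequence" pre-tokenizer nests its steps under "pretokenizers";
# a single-step pre-tokenizer is the step itself.
steps = pre.get("pretokenizers", [pre]) if pre else []
for step in steps:
    print(step.get("type"))  # e.g. "Digits", "Split", "ByteLevel"
```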
Yep, we need to preserve the old hash as minerva-7b, using this regex:
Lines 337 to 341 in c3a2624
            case LLAMA_VOCAB_PRE_TYPE_MINERVA:
                regex_exprs = {
                    "\\p{N}",
                    "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                };
And add a new name for the new hash, using this regex:
Lines 347 to 351 in c3a2624
            case LLAMA_VOCAB_PRE_TYPE_TRILLION:
                regex_exprs = {
                    "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                };
                break;
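For illustration, the two branches in convert_hf_to_gguf.py could then look roughly like this; the hashes are the ones from the diff above, while the new res name is hypothetical and not part of this PR:

```python
if chkhsh == "1431a23e583c97432bc230bff598d103ddb5a1f89960c8f1d1051aaa944d0b35":
    # ref: https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0 (old tokenizer.json, Digits step present)
    res = "minerva-7b"
if chkhsh == "68fa7e0a33050885cc10a2acfa4df354042188f0afa03b809f7a71c4cde6e373":
    # ref: https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0 (updated tokenizer.json, Digits step removed)
    res = "minerva-7b-v2"  # hypothetical name that would map to the TRILLION-style regex
```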
I brought back the old hash, so nothing changes for this model. Tbh I think no one is actually using it, so it's not worth the time to fix it.
I have no idea, I thought the update script would handle that case. I can add a condition in the Python script to skip such models.
They should be removed now.
Hmm, not sure why some files are deleted inadvertently. Seems like some are missing from the list in the Python code. Edit: I think it's fine though; if we don't have the original tokenizer in the Python code, it's impossible to generate the inp/out files.
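A rough sketch of that skip condition, assuming a simplified model list; the download helper is a placeholder, not the script's real logic:

```python
import logging

logger = logging.getLogger(__name__)

models = [
    {"name": "minerva-7b", "repo": "https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0"},
]

def fetch_tokenizer(repo: str):
    # placeholder for the real download logic; may raise for gated or missing repos
    raise NotImplementedError

for model in models:
    try:
        tokenizer = fetch_tokenizer(model["repo"])
    except Exception as e:
        # e.g. gated repo without a token, or tokenizer not present in the Python list:
        # skip instead of failing the whole partial update
        logger.warning("skipping %s: %s", model["name"], e)
        continue
    # ... generate the inp/out files for this model ...
```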
if hf_token is None:
    logger.error("HF token is required. Please provide it as an argument or set it in ~/.cache/huggingface/token")
    sys.exit(1)
Since this will now be used mostly for public models, I don't think we should require a token.
All of them are public, but some are gated, so a token is still needed.
For example: gemma, llama, dbrx, command-r, etc.
Yeah, but isn't the point that regular people don't have to download them now?
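A small sketch of how the requirement could be relaxed along those lines, warning instead of exiting so public models still work without a token (an assumption, not what this PR implements):

```python
if hf_token is None:
    logger.warning(
        "No HF token provided; gated models (e.g. gemma, llama, dbrx, command-r) will be skipped. "
        "Provide a token as an argument or via ~/.cache/huggingface/token to include them."
    )
```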
Not quite, at least nomic-bert-moe is used in tests.
Not sure how I can fix this. As
Well, we can remove it from the test then and add the files under
OK, I commented the test out in CMakeLists.
You can remove the comment, I just recently added it.
There's no point in keeping the orphaned GGUFs (even though they will haunt git forever :) ) I guess?
Keeping them, maybe someone will add back the inp/out files in the future.
Fix minerva, otherwise LGTM.
This is a QoL change which makes the convert_hf_to_gguf_update script easier for contributors to use. Traditionally, each time it is run, it updates all models, which may not be ideal if the user doesn't have access to all of them. With this change, only newly added models are updated.

The old behavior can still be triggered with the --full flag.
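For illustration, an invocation could look like this (the token argument mirrors the script's error message above; the exact argument handling may differ):

```sh
# Default after this change: only newly added models are downloaded and re-hashed
python convert_hf_to_gguf_update.py <hf_token>

# Old behavior: regenerate hashes for every model in the list
python convert_hf_to_gguf_update.py --full <hf_token>
```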