TGI 2.0.3 fails to serve CodeLlama models that 2.0.1 supports #1969
Comments
Can you try with the latest available public container?
I still get a CUDA error with the latest available container.
Just on a hunch, but I've had some stability issues that I believe have to do with inter-GPU communication in these newest versions (I haven't had time to dig into it enough to rule out something on our end or to reproduce it consistently). The issues went away completely when I disabled cuda graphs (--cuda-graphs 0), so maybe that's worth a try? It's really just a shot in the dark at this point.
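For the docker-based setup used in this thread, disabling cuda graphs would look something like the sketch below; the --cuda-graphs flag is the one mentioned above, while the CUDA_GRAPHS environment variable is my assumption of its env-var form:

# launcher form, using the flag from the suggestion above
text-generation-launcher --model-id codellama/CodeLlama-70b-Python-hf --num-shard 8 --cuda-graphs 0
# or, for a docker run invocation, add the (assumed) env-var equivalent:
# -e CUDA_GRAPHS=0 \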
Hi @stefanobranco, thanks for the suggestion. The model can be loaded with cuda graphs disabled (--cuda-graphs 0).
Hi @KCFindstr, I've just tried to reproduce the issue on a device with 8 NVIDIA A10G GPUs, which have the same total VRAM (8x24.1GB = 192.8GB). I can confirm the issue; I tested with the command below.

I don't think this is a regression, but rather a limitation of the hardware + model + configuration (using cuda graphs). However, I think we need better messaging/smarter defaults to avoid this case. Regarding reducing GPU usage: we're always trying to get the most out of the limited GPU space, and I'm going to look more into ways that we can reduce overhead in general to allow for more optimizations.

Command used:

MODEL=codellama/CodeLlama-70b-Python-hf
IMAGE=ghcr.io/huggingface/text-generation-inference:2.0.0
docker run \
-m 320G \
--shm-size=40G \
-v /nvme0n1/Models/:/data \
-e HUGGINGFACE_HUB_CACHE=/data \
-e NVIDIA_VISIBLE_DEVICES=all \
-e MODEL_ID=$MODEL \
-e NUM_SHARD=8 \
-e MAX_INPUT_TOKENS=1024 \
-e MAX_TOTAL_TOKENS=2048 \
-e MAX_BATCH_PREFILL_TOKENS=2048 \
-e TRUST_REMOTE_CODE=true \
-e JSON_OUTPUT=true \
-e
-e PORT=8080 \
-p 7080:8080 \
--runtime=nvidia \
$IMAGE
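Once the launcher logs report the model is ready, a quick smoke test against the mapped host port (7080 in the command above) can confirm the server responds; a minimal example against TGI's /generate route:

curl http://localhost:7080/generate \
-X POST \
-H 'Content-Type: application/json' \
-d '{"inputs": "def fibonacci(n):", "parameters": {"max_new_tokens": 32}}'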
@drbh Thanks for the investigation! However, I retried with TGI 2.0.1 and the model can still be loaded successfully, which is different from your observation.
From the logs, it looks like CUDA graphs are enabled. I also captured the VRAM usage while the model is loaded and being served.

The TGI 2.0.1 image I used is the one hosted on ghcr.io.

Do you know if there are any other changes from 2.0.1 -> 2.0.3 that might lead to increased VRAM usage?
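A simple way to capture this kind of per-GPU snapshot while the server is running (assuming nvidia-smi is available on the host) is something like:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv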
@KCFindstr that's very interesting! Would you be able to share all of the logs up to the point where the server is ready to receive requests? Also, does the model run on 2.0.2? I'm not aware of anything major at the moment, but I will take another look soon for anything that would impact the VRAM.
@drbh I tested the containers hosted on ghcr.io: the model loads with 2.0.1 but fails with 2.0.2 and 2.0.3.

The logs from the failing runs are attached.
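If anyone wants to repeat the comparison, the version tag can be swapped in the docker command above and the startup logs captured with standard docker tooling; a sketch (the 2.0.2 tag name is assumed to follow the same pattern as the others):

IMAGE=ghcr.io/huggingface/text-generation-inference:2.0.2
# after starting the container with the docker run command above,
# follow the startup logs of the most recently created container
docker logs -f "$(docker ps -lq)" 2>&1 | tee tgi-2.0.2.log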
Thank you for sharing. I've started to look through the changes between the two versions, and the issue is possibly related to a change in how we are masking the frequency penalty. However, I cannot confirm this yet and am going to investigate further tomorrow. Will post an update soon.
Hi @KCFindstr, I've continued to debug the issue and cannot find a specific change within TGI that would use more memory in 2.0.2, although I have some findings/recommendations/ideas below.

Where the OOM happens and how to avoid it: during warmup we attempt to allocate as many blocks of kv_cache memory as possible. This step checks the available memory and estimates a number of blocks; the percentage of the total hardware memory used for this can be configured with --cuda-memory-fraction. Continuing the warmup process, after the blocks are allocated, cuda graphs are initialized. At this point, if there is not enough memory left for the graphs (because of the block allocation), TGI will OOM. Depending on the model, the cuda graphs step will allocate more memory when initializing; in the case of this model, the graphs need more memory than is left after the kv_cache allocation.

The solution is to either decrease the optimistic kv_cache allocation with --cuda-memory-fraction or disable cuda graphs. Personally, I'd recommend decreasing the memory fraction:

text-generation-launcher \
--model-id codellama/CodeLlama-70b-Python-hf \
--num-shard 8 \
--cuda-memory-fraction .93

In terms of the origin of the change in 2.0.2, I haven't pinned down a single cause yet. Would you kindly try reducing the memory fraction to fit the model on your hardware with cuda graphs? I'm going to continue to explore the memory allocation and follow up with better error messages in the future; however, I believe reducing this value should resolve this loading issue.
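Since the original report runs the docker image rather than the launcher directly, the same setting can presumably be passed as an environment variable; a minimal sketch, assuming CUDA_MEMORY_FRACTION is the env-var form of --cuda-memory-fraction (as with the other launcher options used earlier in the thread):

MODEL=codellama/CodeLlama-70b-Python-hf
IMAGE=ghcr.io/huggingface/text-generation-inference:2.0.3
docker run \
--shm-size=40G \
-e MODEL_ID=$MODEL \
-e NUM_SHARD=8 \
-e CUDA_MEMORY_FRACTION=0.93 \
-e PORT=8080 \
-p 7080:8080 \
--runtime=nvidia \
$IMAGE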
Thanks @drbh! Setting --cuda-memory-fraction to 0.93 resolved the loading issue for me.
It may also be worth mentioning that torch 2.3.1 contains a few fixes for various memory leaks, e.g. pytorch/pytorch#124238. From my understanding this mainly affects torch.compile, so it might not be relevant, but I'm also not knowledgeable enough to really estimate the impact of these fixes. It might be worth checking whether it has any effect, though.
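As a quick way to check which torch version ships in each image, the container entrypoint can be overridden (a sketch, assuming python3 is on the image's PATH):

docker run --rm --entrypoint python3 \
ghcr.io/huggingface/text-generation-inference:2.0.3 \
-c "import torch; print(torch.__version__)"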
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
System Info
Running a TGI 2.0.3 docker container on a VM with 8 NVIDIA L4 GPUs.
Command: the docker run command quoted earlier in the thread, where $IMAGE is a TGI 2.0.3 docker image.

Information
Tasks
Reproduction
Run the provided command. meta-llama/Meta-Llama-3-70B also failed with a similar error; I'm not sure about other models.

I get the following error:
Is this a CUDA OOM error?
Expected behavior
The model is loaded successfully on 8 L4s with the same command and a TGI 2.0.1 container. Are there any changes to default settings that might have caused increased GPU RAM usage?