Closed
Labels
dynamo-triage-jan2025, module: dynamo, oncall: pt2, triaged, vllm-compile
Description
Profiling vLLM produces a trace like the one above, containing a large number of small "dict getitem" calls in the middle.
This does not appear to be representative. In the profile above, these calls account for roughly 40% of the overall time, but in reality they likely take far less; much of the reported time is per-call profiler overhead.
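To illustrate the overhead effect, here is a stdlib-only sketch (not the vLLM repro; the function names and sizes are made up for illustration) showing how a tracing profiler inflates the apparent cost of many tiny calls, such as repeated dict lookups:

```python
import cProfile
import time

def tiny_lookup(d, k):
    # A trivially cheap call; under a tracing profiler, each invocation
    # also pays the cost of the profiler's call/return hooks.
    return d[k]

def workload(d, keys):
    return sum(tiny_lookup(d, k) for k in keys)

d = {i: i for i in range(10_000)}
keys = list(range(10_000)) * 10  # 100k tiny calls

# Wall time without profiling.
t0 = time.perf_counter()
workload(d, keys)
unprofiled = time.perf_counter() - t0

# Wall time with profiling: the extra time lands on the tiny calls,
# inflating their share of the profile.
prof = cProfile.Profile()
t0 = time.perf_counter()
prof.runcall(workload, d, keys)
profiled = time.perf_counter() - t0

print(f"unprofiled: {unprofiled:.4f}s, profiled: {profiled:.4f}s")
```

The profiled run is typically several times slower, which is the same mechanism that can make the "dict getitem" region look like 40% of the trace.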
We should figure out what is emitting these events (this is likely a more general torch.compile x profiler problem, since this uses the PyTorch profiler) and see whether we can group them all together into a single "dynamo bytecode" region.
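A minimal sketch of the grouping idea, using the existing `torch.profiler.record_function` annotation API: wrapping a span of many tiny operations in one named region makes them show up as a single block in the trace. The region name `"dynamo_bytecode_region"` and the dict-lookup workload here are hypothetical stand-ins, not the actual Dynamo integration point.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def many_small_lookups(table, keys):
    # Stand-in for the many tiny "dict getitem"-style operations
    # that currently appear as individual events in the trace.
    return [table[k] for k in keys]

table = {i: i for i in range(1000)}
keys = list(range(1000))

with profile(activities=[ProfilerActivity.CPU]) as prof:
    # Attribute the whole span to one user-annotated region, analogous
    # to a hypothetical single "dynamo bytecode" region in the trace.
    with record_function("dynamo_bytecode_region"):
        many_small_lookups(table, keys)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

The actual fix would presumably place such an annotation around Dynamo's generated-bytecode execution path rather than in user code.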
To repro:
- use https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/simple_profiling.py#L24
- change "model" to "meta-llama/Llama-3.1-8B-Instruct"
- run the script
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames