[release] CPU perf benchmark latency increase for 2.6->2.7 on c5.24xlarge and A100 instances #151037

Open
atalman opened this issue Apr 10, 2025 · 8 comments
Assignees
Labels
module: cpu (CPU specific problem, e.g., perf, algorithm)
module: intel (Specific to x86 architecture)
module: performance (Issues related to performance, either of kernel code or framework glue)
oncall: releng (In support of CI and Release Engineering)
topic: performance (topic category)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

atalman (Contributor) commented Apr 10, 2025

Running the torchbench userbenchmarks for CPU, I see the following results from different runs:

On c5.24xlarge, CPU latency increases by 10-30%. Please note that we allow up to ~8% for run-to-run noise; however, the signal we are getting looks clear.

Running workflow:
https://github.com/pytorch/benchmark/blob/perf-release-2.7/.github/workflows/userbenchmark-c5-24xlarge.yml
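
For context, the sketch below shows the kind of end-to-end CPU latency measurement these userbenchmarks report. It is illustrative only, not the actual torchbench harness; the model, batch size, and iteration counts are placeholders.

```python
import time

import torch
import torch.nn as nn

# Placeholder MNIST-style CNN, used only to illustrate the shape of the measurement.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3), nn.ReLU(),
    nn.Conv2d(32, 64, 3), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 24 * 24, 10),
)
criterion = nn.CrossEntropyLoss()
x = torch.randn(64, 1, 28, 28)
target = torch.randint(0, 10, (64,))

# Warm up, then time a fixed number of forward/backward steps on CPU.
for _ in range(3):
    criterion(model(x), target).backward()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    model.zero_grad()
    criterion(model(x), target).backward()
latency_ms = (time.perf_counter() - start) / iters * 1000
print(f"per-iteration CPU latency: {latency_ms:.2f} ms")
```

Running the same script in a 2.6.0 environment and a 2.7.0 environment gives a rough per-version latency to compare.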

C5.24xlarge

Run 1: 
Benchmark;pytorch-2.6.0-cuda-12.6;pytorch-2.7.0-cuda-12.6
mnist-cpu_memory;847.316;866.195
mnist-gpu_memory;0.0;0.0
mnist-latency;74.46;93.13                                 ->  25% increase

Run 2:
Benchmark;pytorch-2.6.0-cuda-12.6;pytorch-2.7.0-cuda-12.6
mnist_hogwild-cpu_memory;616.598;634.773
mnist_hogwild-gpu_memory;0.0;0.0
mnist_hogwild-latency;46.64;48.20
wlm_cpu_lstm-cpu_memory;946.34;960.879
wlm_cpu_lstm-gpu_memory;0.0;0.0
wlm_cpu_lstm-latency;821.30;934.65              -> 14% increase
wlm_cpu_trans-cpu_memory;976.066;975.133
wlm_cpu_trans-gpu_memory;0.0;0.0
wlm_cpu_trans-latency;818.44;910.90               -> 11% increase

Run 3:
Benchmark;pytorch-2.6.0-cuda-12.6;pytorch-2.7.0-cuda-12.6
mnist-cpu_memory;1034.32;1030.47
mnist-gpu_memory;0.0;0.0
mnist-latency;66.29;92.53                                -> 39% increase
mnist_hogwild-cpu_memory;615.805;629.711
mnist_hogwild-gpu_memory;0.0;0.0
mnist_hogwild-latency;45.94;45.15
wlm_cpu_lstm-cpu_memory;959.113;951.457
wlm_cpu_lstm-gpu_memory;0.0;0.0
wlm_cpu_lstm-latency;832.84;977.05                -> 17% increase
wlm_cpu_trans-cpu_memory;953.859;980.09
wlm_cpu_trans-gpu_memory;0.0;0.0
wlm_cpu_trans-latency;822.26;918.15


Run 4:
Benchmark,pytorch-2.6.0-cuda-12.6,pytorch-2.7.0-cuda-12.6
mnist-cpu_memory,993.281,1113.65
mnist-gpu_memory,0.0,0.0
mnist-latency,70.27,90.28                                 -> 28% increase
mnist_hogwild-cpu_memory,614.816,629.562
mnist_hogwild-gpu_memory,0.0,0.0
mnist_hogwild-latency,44.73,46.65
wlm_cpu_lstm-cpu_memory,946.188,964.023
wlm_cpu_lstm-gpu_memory,0.0,0.0
wlm_cpu_lstm-latency,811.83,954.04             -> 17% increase
wlm_cpu_trans-cpu_memory,973.684,954.387
wlm_cpu_trans-gpu_memory,0.0,0.0
wlm_cpu_trans-latency,801.86,918.64            -> 14% increase
wlm_gpu_lstm-cpu_memory,482.219,488.016
wlm_gpu_lstm-gpu_memory,0.0,0.0
wlm_gpu_lstm-latency,3.18,3.30
wlm_gpu_trans-cpu_memory,482.23,488.312
wlm_gpu_trans-gpu_memory,0.0,0.0
wlm_gpu_trans-latency,3.23,3.24

Run 5: 2.5 vs 2.6 - no increase in latency
Benchmark;pytorch-2.5.1-cuda-12.4;pytorch-2.6.0-cuda-12.4
mnist-cpu_memory;1016.57;726.523
mnist-gpu_memory;0.0;0.0
mnist-latency;73.15;72.89

Run 6: 2.5 vs 2.6 - no increase in latency
Benchmark;pytorch-2.5.1-cuda-12.4;pytorch-2.6.0-cuda-12.4
mnist_hogwild-cpu_memory;596.617;591.684
mnist_hogwild-gpu_memory;0.0;0.0
mnist_hogwild-latency;47.48;46.37
wlm_cpu_lstm-cpu_memory;935.559;926.969
wlm_cpu_lstm-gpu_memory;0.0;0.0
wlm_cpu_lstm-latency;881.19;831.11
wlm_cpu_trans-cpu_memory;946.93;956.215
wlm_cpu_trans-gpu_memory;0.0;0.0
wlm_cpu_trans-latency;815.29;838.54

A100 (please note: A100 CPU results are not reliable):

Run 1:
Benchmark;pytorch-2.6.0-cuda-12.6;pytorch-2.7.0-cuda-12.6
mnist-cpu_memory;1201.49;1262.79
mnist-gpu_memory;1093.0;1093.0
mnist-latency;38.46;37.13
mnist_hogwild-cpu_memory;613.566;631.934
mnist_hogwild-gpu_memory;4.0;4.0
mnist_hogwild-latency;600.82;567.12
wlm_cpu_lstm-cpu_memory;920.457;880.457
wlm_cpu_lstm-gpu_memory;4.0;4.0
wlm_cpu_lstm-latency;888.66;1007.18
wlm_cpu_trans-cpu_memory;927.922;886.551
wlm_cpu_trans-gpu_memory;4.0;4.0
wlm_cpu_trans-latency;938.17;1078.43    -> 16% increase in CPU latency
wlm_gpu_lstm-cpu_memory;1016.54;1044.67
wlm_gpu_lstm-gpu_memory;903.0;903.0
wlm_gpu_lstm-latency;52.99;52.87
wlm_gpu_trans-cpu_memory;1029.73;1100.92
wlm_gpu_trans-gpu_memory;911.0;911.0
wlm_gpu_trans-latency;55.06;55.15

Run 2:
Benchmark;pytorch-2.6.0-cuda-12.6;pytorch-2.7.0-cuda-12.6
mnist-cpu_memory;1201.49;1262.79
mnist-gpu_memory;1093.0;1093.0
mnist-latency;38.46;37.13
mnist_hogwild-cpu_memory;613.566;631.934
mnist_hogwild-gpu_memory;4.0;4.0
mnist_hogwild-latency;600.82;567.12
wlm_cpu_lstm-cpu_memory;920.457;880.457
wlm_cpu_lstm-gpu_memory;4.0;4.0
wlm_cpu_lstm-latency;888.66;1007.18
wlm_cpu_trans-cpu_memory;927.922;886.551
wlm_cpu_trans-gpu_memory;4.0;4.0
wlm_cpu_trans-latency;938.17;1078.43    -> 14% increase in CPU latency
wlm_gpu_lstm-cpu_memory;1016.54;1044.67
wlm_gpu_lstm-gpu_memory;903.0;903.0
wlm_gpu_lstm-latency;52.99;52.87
wlm_gpu_trans-cpu_memory;1029.73;1100.92
wlm_gpu_trans-gpu_memory;911.0;911.0
wlm_gpu_trans-latency;55.06;55.15
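
For reference, here is a minimal sketch (assuming the semicolon-delimited format above; latency rows copied from Runs 1 and 2) of how the percent change and the ~8% noise allowance are applied:

```python
import csv
import io

# Latency rows copied from Runs 1 and 2 above (semicolon-delimited, milliseconds).
RESULTS = """\
Benchmark;pytorch-2.6.0-cuda-12.6;pytorch-2.7.0-cuda-12.6
mnist-latency;74.46;93.13
wlm_cpu_lstm-latency;821.30;934.65
wlm_cpu_trans-latency;818.44;910.90"""

NOISE_THRESHOLD = 0.08  # allow roughly 8% run-to-run noise

reader = csv.reader(io.StringIO(RESULTS), delimiter=";")
next(reader)  # skip the header row
for name, v26, v27 in reader:
    change = (float(v27) - float(v26)) / float(v26)
    status = "regression" if change > NOISE_THRESHOLD else "within noise"
    print(f"{name:25s} 2.6={v26:>8s} 2.7={v27:>8s} {change:+6.1%}  {status}")
```

On these rows it reports roughly +25%, +14%, and +11%, which matches the annotations above.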

cc @msaroufim @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @frank-wei

atalman (Contributor, Author) commented Apr 10, 2025

atalman transferred this issue from pytorch/benchmark on Apr 10, 2025
atalman added the module: performance, module: cpu, and topic: performance labels on Apr 10, 2025
malfet added the oncall: releng and triaged labels on Apr 10, 2025
malfet added the module: intel label on Apr 10, 2025
malfet (Contributor) commented Apr 10, 2025

Do you know how we measure latency? And wouldn't something like that be observable in the dashboard?

atalman added this to the 2.7.1 milestone on Apr 10, 2025
atalman (Contributor, Author) commented Apr 10, 2025

mingfeima (Collaborator) commented:

@CaoE please follow up on this one!

LifengWang (Contributor) commented:

Hi @CaoE, the commit responsible for this issue is 5ed5793.

CaoE (Collaborator) commented Apr 15, 2025

@malfet @atalman This regression is caused by disabling the MKL generator. #151218 will hopefully fix this issue.
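
For a quick local check, here is a minimal sketch, under the assumption (mine, based on the comment above) that a change of CPU generator backend shows up in RNG-heavy ops; sizes and iteration counts are arbitrary. Run it on both a 2.6.0 and a 2.7.0 build and compare:

```python
import time

import torch

print(torch.__version__)
torch.manual_seed(0)

# Time CPU random number generation, where a change of generator backend would show up.
shape = (1024, 1024)
for _ in range(5):  # warm-up
    torch.randn(shape)

iters = 100
start = time.perf_counter()
for _ in range(iters):
    torch.randn(shape)
elapsed_ms = (time.perf_counter() - start) / iters * 1000
print(f"torch.randn{shape}: {elapsed_ms:.3f} ms per call")
```

A noticeably higher per-call time on 2.7.0 would be consistent with the RNG path being the cause.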

malfet (Contributor) commented Apr 15, 2025

> This regression is caused by disabling the MKL generator. #151218 will hopefully fix this issue.

Can you check if re-enabling it will resolve the latency problem?

CaoE (Collaborator) commented Apr 16, 2025

> Can you check if re-enabling it will resolve the latency problem?

cc @LifengWang, who has tested this PR and confirmed that it resolves the problem.

malfet removed their assignment on Apr 17, 2025
atalman removed this from the 2.7.1 milestone on May 15, 2025