[release] CPU perf benchmark latency increase for 2.6->2.7 on c5.24xlarge and A100 instances #151037

Open
atalman opened this issue Apr 10, 2025 · 8 comments
Assignees
Labels
module: cpu (CPU specific problem, e.g., perf, algorithm)
module: intel (Specific to x86 architecture)
module: performance (Issues related to performance, either of kernel code or framework glue)
oncall: releng (In support of CI and Release Engineering)
topic: performance (topic category)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

atalman (Contributor) commented Apr 10, 2025

Running the torchbench userbenchmarks for CPU, I see the following results from different runs:

On c5.24xlarge, CPU latency increases by 10-30%. Please note that we allow up to ~8% for run-to-run noise; however, the signal we are getting looks clear.

Running workflow:
https://github.com/pytorch/benchmark/blob/perf-release-2.7/.github/workflows/userbenchmark-c5-24xlarge.yml
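
For context, the sketch below shows the kind of end-to-end CPU latency measurement these userbenchmarks report. It is illustrative only, not the actual torchbench harness; the model, batch size, and iteration counts are placeholders.

```python
import time

import torch
import torch.nn as nn

# Placeholder MNIST-style CNN, used only to illustrate the shape of the measurement.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3), nn.ReLU(),
    nn.Conv2d(32, 64, 3), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 24 * 24, 10),
)
criterion = nn.CrossEntropyLoss()
x = torch.randn(64, 1, 28, 28)
target = torch.randint(0, 10, (64,))

# Warm up, then time a fixed number of forward/backward steps on CPU.
for _ in range(3):
    criterion(model(x), target).backward()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    model.zero_grad()
    criterion(model(x), target).backward()
latency_ms = (time.perf_counter() - start) / iters * 1000
print(f"per-iteration CPU latency: {latency_ms:.2f} ms")
```

Running the same script in a 2.6.0 environment and a 2.7.0 environment gives a rough per-version latency to compare.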

C5.24xlarge

Run 1: 
Benchmark;pytorch-2.6.0-cuda-12.6;pytorch-2.7.0-cuda-12.6
mnist-cpu_memory;847.316;866.195
mnist-gpu_memory;0.0;0.0
mnist-latency;74.46;93.13                                 ->  25% increase

Run 2:
Benchmark;pytorch-2.6.0-cuda-12.6;pytorch-2.7.0-cuda-12.6
mnist_hogwild-cpu_memory;616.598;634.773
mnist_hogwild-gpu_memory;0.0;0.0
mnist_hogwild-latency;46.64;48.20
wlm_cpu_lstm-cpu_memory;946.34;960.879
wlm_cpu_lstm-gpu_memory;0.0;0.0
wlm_cpu_lstm-latency;821.30;934.65              -> 14% increase
wlm_cpu_trans-cpu_memory;976.066;975.133
wlm_cpu_trans-gpu_memory;0.0;0.0
wlm_cpu_trans-latency;818.44;910.90               -> 11% increase

Run 3:
Benchmark;pytorch-2.6.0-cuda-12.6;pytorch-2.7.0-cuda-12.6
mnist-cpu_memory;1034.32;1030.47
mnist-gpu_memory;0.0;0.0
mnist-latency;66.29;92.53                                -> 39% increase
mnist_hogwild-cpu_memory;615.805;629.711
mnist_hogwild-gpu_memory;0.0;0.0
mnist_hogwild-latency;45.94;45.15
wlm_cpu_lstm-cpu_memory;959.113;951.457
wlm_cpu_lstm-gpu_memory;0.0;0.0
wlm_cpu_lstm-latency;832.84;977.05                -> 17% increase
wlm_cpu_trans-cpu_memory;953.859;980.09
wlm_cpu_trans-gpu_memory;0.0;0.0
wlm_cpu_trans-latency;822.26;918.15


Run 4:
Benchmark,pytorch-2.6.0-cuda-12.6,pytorch-2.7.0-cuda-12.6
mnist-cpu_memory,993.281,1113.65
mnist-gpu_memory,0.0,0.0
mnist-latency,70.27,90.28                                 -> 28% increase
mnist_hogwild-cpu_memory,614.816,629.562
mnist_hogwild-gpu_memory,0.0,0.0
mnist_hogwild-latency,44.73,46.65
wlm_cpu_lstm-cpu_memory,946.188,964.023
wlm_cpu_lstm-gpu_memory,0.0,0.0
wlm_cpu_lstm-latency,811.83,954.04             -> 17% increase
wlm_cpu_trans-cpu_memory,973.684,954.387
wlm_cpu_trans-gpu_memory,0.0,0.0
wlm_cpu_trans-latency,801.86,918.64            -> 14% increase
wlm_gpu_lstm-cpu_memory,482.219,488.016
wlm_gpu_lstm-gpu_memory,0.0,0.0
wlm_gpu_lstm-latency,3.18,3.30
wlm_gpu_trans-cpu_memory,482.23,488.312
wlm_gpu_trans-gpu_memory,0.0,0.0
wlm_gpu_trans-latency,3.23,3.24

Run 5: 2.5 vs 2.6 - no increase in latency
Benchmark;pytorch-2.5.1-cuda-12.4;pytorch-2.6.0-cuda-12.4
mnist-cpu_memory;1016.57;726.523
mnist-gpu_memory;0.0;0.0
mnist-latency;73.15;72.89

Run 6: 2.5 vs 2.6 - no increase in latency
Benchmark;pytorch-2.5.1-cuda-12.4;pytorch-2.6.0-cuda-12.4
mnist_hogwild-cpu_memory;596.617;591.684
mnist_hogwild-gpu_memory;0.0;0.0
mnist_hogwild-latency;47.48;46.37
wlm_cpu_lstm-cpu_memory;935.559;926.969
wlm_cpu_lstm-gpu_memory;0.0;0.0
wlm_cpu_lstm-latency;881.19;831.11
wlm_cpu_trans-cpu_memory;946.93;956.215
wlm_cpu_trans-gpu_memory;0.0;0.0
wlm_cpu_trans-latency;815.29;838.54

A100 (please note: A100 CPU results are not reliable):

Run 1:
Benchmark;pytorch-2.6.0-cuda-12.6;pytorch-2.7.0-cuda-12.6
mnist-cpu_memory;1201.49;1262.79
mnist-gpu_memory;1093.0;1093.0
mnist-latency;38.46;37.13
mnist_hogwild-cpu_memory;613.566;631.934
mnist_hogwild-gpu_memory;4.0;4.0
mnist_hogwild-latency;600.82;567.12
wlm_cpu_lstm-cpu_memory;920.457;880.457
wlm_cpu_lstm-gpu_memory;4.0;4.0
wlm_cpu_lstm-latency;888.66;1007.18
wlm_cpu_trans-cpu_memory;927.922;886.551
wlm_cpu_trans-gpu_memory;4.0;4.0
wlm_cpu_trans-latency;938.17;1078.43    -> 16% increase in CPU latency
wlm_gpu_lstm-cpu_memory;1016.54;1044.67
wlm_gpu_lstm-gpu_memory;903.0;903.0
wlm_gpu_lstm-latency;52.99;52.87
wlm_gpu_trans-cpu_memory;1029.73;1100.92
wlm_gpu_trans-gpu_memory;911.0;911.0
wlm_gpu_trans-latency;55.06;55.15

Run 2:
Benchmark;pytorch-2.6.0-cuda-12.6;pytorch-2.7.0-cuda-12.6
mnist-cpu_memory;1201.49;1262.79
mnist-gpu_memory;1093.0;1093.0
mnist-latency;38.46;37.13
mnist_hogwild-cpu_memory;613.566;631.934
mnist_hogwild-gpu_memory;4.0;4.0
mnist_hogwild-latency;600.82;567.12
wlm_cpu_lstm-cpu_memory;920.457;880.457
wlm_cpu_lstm-gpu_memory;4.0;4.0
wlm_cpu_lstm-latency;888.66;1007.18
wlm_cpu_trans-cpu_memory;927.922;886.551
wlm_cpu_trans-gpu_memory;4.0;4.0
wlm_cpu_trans-latency;938.17;1078.43    -> 14% increase in CPU latency
wlm_gpu_lstm-cpu_memory;1016.54;1044.67
wlm_gpu_lstm-gpu_memory;903.0;903.0
wlm_gpu_lstm-latency;52.99;52.87
wlm_gpu_trans-cpu_memory;1029.73;1100.92
wlm_gpu_trans-gpu_memory;911.0;911.0
wlm_gpu_trans-latency;55.06;55.15
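
For reference, here is a minimal sketch (assuming the semicolon-delimited format above; latency rows copied from Runs 1 and 2) of how the percent change and the ~8% noise allowance are applied:

```python
import csv
import io

# Latency rows copied from Runs 1 and 2 above (semicolon-delimited, milliseconds).
RESULTS = """\
Benchmark;pytorch-2.6.0-cuda-12.6;pytorch-2.7.0-cuda-12.6
mnist-latency;74.46;93.13
wlm_cpu_lstm-latency;821.30;934.65
wlm_cpu_trans-latency;818.44;910.90"""

NOISE_THRESHOLD = 0.08  # allow roughly 8% run-to-run noise

reader = csv.reader(io.StringIO(RESULTS), delimiter=";")
next(reader)  # skip the header row
for name, v26, v27 in reader:
    change = (float(v27) - float(v26)) / float(v26)
    status = "regression" if change > NOISE_THRESHOLD else "within noise"
    print(f"{name:25s} 2.6={v26:>8s} 2.7={v27:>8s} {change:+6.1%}  {status}")
```

On these rows it reports roughly +25%, +14%, and +11%, which matches the annotations above.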

cc @msaroufim @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @frank-wei

atalman (Contributor, Author) commented Apr 10, 2025

atalman transferred this issue from pytorch/benchmark on Apr 10, 2025
atalman added the module: performance, module: cpu, and topic: performance labels on Apr 10, 2025
malfet added the oncall: releng and triaged labels on Apr 10, 2025
malfet added the module: intel label on Apr 10, 2025
malfet (Contributor) commented Apr 10, 2025

Do you know how we measure latency? And wouldn't something like that be observable in the dashboard?

atalman added this to the 2.7.1 milestone on Apr 10, 2025
atalman (Contributor, Author) commented Apr 10, 2025

mingfeima (Collaborator) commented:

@CaoE please follow up on this one!

LifengWang (Contributor) commented:

Hi @CaoE, the commit responsible for this issue is 5ed5793.

CaoE (Collaborator) commented Apr 15, 2025

@malfet @atalman This regression is caused by disabling the MKL generator. #151218 will hopefully fix this issue.
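
For a quick local check, here is a minimal sketch, under the assumption (mine, based on the comment above) that a change of CPU generator backend shows up in RNG-heavy ops; sizes and iteration counts are arbitrary. Run it on both a 2.6.0 and a 2.7.0 build and compare:

```python
import time

import torch

print(torch.__version__)
torch.manual_seed(0)

# Time CPU random number generation, where a change of generator backend would show up.
shape = (1024, 1024)
for _ in range(5):  # warm-up
    torch.randn(shape)

iters = 100
start = time.perf_counter()
for _ in range(iters):
    torch.randn(shape)
elapsed_ms = (time.perf_counter() - start) / iters * 1000
print(f"torch.randn{shape}: {elapsed_ms:.3f} ms per call")
```

A noticeably higher per-call time on 2.7.0 would be consistent with the RNG path being the cause.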

malfet (Contributor) commented Apr 15, 2025

> This regression is caused by disabling the MKL generator. #151218 will hopefully fix this issue.

Can you check if re-enabling it will resolve the latency problem?

CaoE (Collaborator) commented Apr 16, 2025

> Can you check if re-enabling it will resolve the latency problem?

cc @LifengWang, who has tested this PR and confirmed that it resolves the problem.

malfet removed their assignment on Apr 17, 2025
atalman removed this from the 2.7.1 milestone on May 15, 2025