8000 [inductor][cpu]performance regression in 2025-03-10 nightly release · Issue #149116 · pytorch/pytorch · GitHub

Closed

zxd1997066 opened this issue Mar 13, 2025 · 2 comments

Labels: oncall: cpu inductor (CPU Inductor issues for Intel team to triage), oncall: pt2

Comments

zxd1997066 (Contributor) commented on Mar 13, 2025

🐛 Describe the bug

fp32 static shape cpp wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torchbench | lennard_jones | multiple | 1000 | 0.899138 | 0.000328722 | 0.000295566441636 | 4.907769 | 1000 | 1.311658 | 0.000220639 | 0.00028940290946200003 | 4.916 | 0.69 | 0.98 | 0.67 | 1.0 |

fp32 dynamic shape cpp wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torchbench | lennard_jones | multiple | 1000 | 0.7305 | 0.000399273 | 0.0002916689265 | 4.941489 | 1000 | 1.041987 | 0.000281327 | 0.000293139076749 | 4.938169 | 0.7 | 1.01 | 0.7 | 1.0 |

amp static shape default wrapper max autotune

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torchbench | hf_Albert | single | 1 | 1.308119 | 0.431678501 | 0.564686849049619 | 95.22961 | 1 | 1.597903 | 0.35871472 | 0.57319132723216 | 95.290552 | 0.82 | 1.02 | 0.83 | 1.0 |
| torchbench | hf_GPT2_large | single | 1 | 1.433638 | 6.326373821 | 9.069729911990798 | 141.727643 | 1 | 1.631708 | 5.627456199 | 9.182365299557892 | 129.7057 | 0.88 | 1.01 | 0.89 | 0.92 |

amp static shape default wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torchbench | hf_Albert | single | 1 | 1.388928 | 0.412458401 | 0.572875021984128 | 54.402956 | 1 | 1.544128 | 0.36844618100000004 | 0.5689280645751681 | 53.764985 | 0.9 | 0.99 | 0.89 | 0.99 |
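The four trailing ratio columns are derived from the raw measurements in each row. A minimal sketch (a hypothetical helper, not part of the benchmark harness) reproducing them from the hf_Albert amp static shape row above:

```python
def derived_ratios(speedup_new, speedup_old,
                   inductor_new, inductor_old,
                   eager_new, eager_old,
                   latency_new, latency_old):
    # Each derived column in the report is a plain ratio, rounded to two decimals.
    return {
        "Ratio Speedup(New/old)": round(speedup_new / speedup_old, 2),
        "Eager Ratio(old/new)": round(eager_old / eager_new, 2),
        "Inductor Ratio(old/new)": round(inductor_old / inductor_new, 2),
        "Compilation_latency_Ratio(old/new)": round(latency_old / latency_new, 2),
    }

# Values from the hf_Albert amp static shape, default wrapper row.
ratios = derived_ratios(1.388928, 1.544128,
                        0.412458401, 0.36844618100000004,
                        0.572875021984128, 0.5689280645751681,
                        54.402956, 53.764985)
print(ratios)  # → 0.9, 0.99, 0.89, 0.99, matching the table
```

A Ratio Speedup(New/old) below 1 (here 0.9, with the eager ratio near 1.0) indicates the inductor path itself slowed down, which is the regression reported.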

The bad commit: 165e335

```
/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench hf_Albert amp
Testing with inductor.
single-thread testing....
loading model: 0it [00:01, ?it/s]
cpu  eval  hf_Albert
running benchmark: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:47<00:00,  1.04it/s]
1.361x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,hf_Albert,1,1.361182,405.056800,41.416172,0.852280,111.418163,130.729574,438,1,0,0,0,0,1
```

The last good commit: 118a165

```
/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench hf_Albert amp
Testing with inductor.
single-thread testing....
loading model: 0it [00:01, ?it/s]
cpu  eval  hf_Albert
running benchmark: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:45<00:00,  1.09it/s]
1.513x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,hf_Albert,1,1.512885,364.084261,45.160686,0.905931,111.499264,123.077018,438,1,0,0,0,0,1
```

Versions

SW info

| name | target_branch | target_commit | refer_branch | refer_commit |
| --- | --- | --- | --- | --- |
| torchbench | main | 373ffb19 | main | 373ffb19 |
| torch | main | 5245304 | main | ce2f680 |
| torchvision | main | 0.19.0a0+d23a6e1 | main | 0.19.0a0+d23a6e1 |
| torchtext | main | 0.16.0a0+b0ebddc | main | 0.16.0a0+b0ebddc |
| torchaudio | main | 2.6.0a0+c670ad8 | main | 2.6.0a0+c670ad8 |
| torchdata | main | 0.7.0a0+11bb5b8 | main | 0.7.0a0+11bb5b8 |
| dynamo_benchmarks | main | nightly | main | nightly |

Repro (inductor_single_run.sh):

```
bash inductor_single_run.sh single inference performance torchbench hf_Albert amp
```

Suspected guilty commit: 165e335
torchbench-hf_Albert-inference-amp-static-default-single-performance-drop_guilty_commit.log
cc @chauhang @penguinwu @chuanqi129 @leslie-fang-intel

leslie-fang-intel self-assigned this on Mar 13, 2025
leslie-fang-intel added the oncall: cpu inductor label (CPU Inductor issues for Intel team to triage) on Mar 13, 2025
leslie-fang-intel (Collaborator) commented:

Thanks for reporting this issue and comparing the generated code before and after the regression.
The performance drop is caused by a switch in the tanh implementation in the vectorized kernel. Previously, we used an approximate calculation, 2/(1+exp(-2*x)) - 1, which had numerical accuracy issues, as reported in #148241. To improve numerical accuracy, we switched to the SLEEF implementation. It's a trade-off between numerical accuracy and performance.
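For context, the accuracy problem with the old formula can be demonstrated with a minimal sketch in plain Python. This runs in float64 rather than the kernel's float32 SIMD path, but the mechanism is the same: 2/(1+exp(-2*x)) - 1 is algebraically equal to tanh(x), yet evaluating it in floating point subtracts 1 from a value near 1, cancelling the leading bits whenever the result is small.

```python
import math

def approx_tanh(x):
    # tanh via 2/(1 + exp(-2x)) - 1: algebraically exact, but the final
    # "- 1" cancels leading bits when the result is tiny, so the relative
    # error blows up near x = 0.
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

x = 1e-9  # tanh(x) ~ x to within O(x^2) relative error here
rel_err_formula = abs(approx_tanh(x) - x) / x   # dominated by cancellation
rel_err_library = abs(math.tanh(x) - x) / x     # library tanh: ~1 ulp
print(rel_err_formula, rel_err_library)
```

Near zero the formula's relative error is many orders of magnitude larger than the library tanh's; a correctly rounded implementation like SLEEF's avoids this at the cost of more expensive polynomial evaluation, which is the speed/accuracy trade-off described above.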

leslie-fang-intel (Collaborator) commented:

Closing this issue, as I think better numerical accuracy is more important.

3 participants