/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench hf_Albert amp
Testing with inductor.
single-thread testing....
loading model: 0it [00:01, ?it/s]
cpu eval hf_Albert
running benchmark: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:47<00:00, 1.04it/s]
1.361x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,hf_Albert,1,1.361182,405.056800,41.416172,0.852280,111.418163,130.729574,438,1,0,0,0,0,1
/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench hf_Albert amp
Testing with inductor.
single-thread testing....
loading model: 0it [00:01, ?it/s]
cpu eval hf_Albert
running benchmark: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:45<00:00, 1.09it/s]
1.513x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,hf_Albert,1,1.512885,364.084261,45.160686,0.905931,111.499264,123.077018,438,1,0,0,0,0,1
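For reference, the absolute latencies in the two CSV rows above can be compared directly (assuming the first run, at 1.361x, is the regressed commit and the second, at 1.513x, is the good one):

```python
# Latencies (ms) taken from the abs_latency column of the two runs above.
# Assumption: the 1.361x run is the bad commit and the 1.513x run is the good one.
bad_latency = 405.056800
good_latency = 364.084261

slowdown = bad_latency / good_latency
print(f"latency regression: {slowdown:.3f}x (~{(slowdown - 1) * 100:.0f}% slower)")
```

This matches the drop in reported speedup (1.513x down to 1.361x, roughly an 11% regression).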
Thanks for reporting this issue and comparing the generated code before and after the regression.
The performance drop is caused by a switch in the tanh implementation in the vectorized kernel. Previously, we used the approximate calculation 2/(1+exp(-2*x)) - 1, which had numerical accuracy issues, as reported in #148241. To improve numerical accuracy, we switched to the SLEEF implementation, so this is a trade-off between numerical accuracy and performance.
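For illustration, the old approximation is the exp-based identity mentioned above. A minimal sketch (in plain Python double precision, where the identity holds almost exactly; the accuracy problems show up in low-precision vectorized evaluation, e.g. fp32/bf16, where exp(-2x) can overflow or lose precision for large |x|):

```python
import math

def tanh_approx(x):
    # Former approximation used in the vectorized kernel:
    # tanh(x) = 2 / (1 + exp(-2*x)) - 1
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    ref = math.tanh(x)          # stand-in for the accurate (SLEEF-style) result
    approx = tanh_approx(x)
    print(f"x={x:+.1f}  tanh={ref:+.12f}  approx={approx:+.12f}  err={abs(ref - approx):.2e}")
```

The faster exp-based form trades a few ULPs of accuracy (and overflow safety at the tails) for speed, which is exactly the trade-off described above.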
🐛 Describe the bug
fp32 static shape cpp wrapper
fp32 dynamic shape cpp wrapper
amp static shape default wrapper max autotune
amp static shape default wrapper
the bad commit: 165e335
the last good commit: 118a165
Versions
SW info
Repro:
inductor_single_run.sh
bash inductor_single_run.sh single inference performance torchbench hf_Albert amp
Suspected guilty commit: 165e335
torchbench-hf_Albert-inference-amp-static-default-single-performance-drop_guilty_commit.log
cc @chauhang @penguinwu @chuanqi129 @leslie-fang-intel