/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench hf_Albert amp
Testing with inductor.
single-thread testing....
loading model: 0it [00:01, ?it/s]
cpu eval hf_Albert
running benchmark: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:47<00:00, 1.04it/s]
1.361x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,hf_Albert,1,1.361182,405.056800,41.416172,0.852280,111.418163,130.729574,438,1,0,0,0,0,1
/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench hf_Albert amp
Testing with inductor.
single-thread testing....
loading model: 0it [00:01, ?it/s]
cpu eval hf_Albert
running benchmark: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:45<00:00, 1.09it/s]
1.513x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,hf_Albert,1,1.512885,364.084261,45.160686,0.905931,111.499264,123.077018,438,1,0,0,0,0,1
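For reference, the absolute latencies in the two CSV rows above can be compared directly (assuming the first run, at 1.361x, is the regressed commit and the second, at 1.513x, is the good one):

```python
# Latencies (ms) taken from the abs_latency column of the two runs above.
# Assumption: the 1.361x run is the bad commit and the 1.513x run is the good one.
bad_latency = 405.056800
good_latency = 364.084261

slowdown = bad_latency / good_latency
print(f"latency regression: {slowdown:.3f}x (~{(slowdown - 1) * 100:.0f}% slower)")
```

This matches the drop in reported speedup (1.513x down to 1.361x, roughly an 11% regression).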
Thanks for reporting this issue and comparing the generated code before and after the regression.
The performance drop is caused by a switch in the tanh implementation in the vectorized kernel. Previously, we used the approximate calculation 2/(1+exp(-2*x)) - 1, which had numerical accuracy issues, as reported in #148241. To improve numerical accuracy, we switched to the SLEEF implementation, so this is a trade-off between numerical accuracy and performance.
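For illustration, the old approximation is the exp-based identity mentioned above. A minimal sketch (in plain Python double precision, where the identity holds almost exactly; the accuracy problems show up in low-precision vectorized evaluation, e.g. fp32/bf16, where exp(-2x) can overflow or lose precision for large |x|):

```python
import math

def tanh_approx(x):
    # Former approximation used in the vectorized kernel:
    # tanh(x) = 2 / (1 + exp(-2*x)) - 1
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    ref = math.tanh(x)          # stand-in for the accurate (SLEEF-style) result
    approx = tanh_approx(x)
    print(f"x={x:+.1f}  tanh={ref:+.12f}  approx={approx:+.12f}  err={abs(ref - approx):.2e}")
```

The faster exp-based form trades a few ULPs of accuracy (and overflow safety at the tails) for speed, which is exactly the trade-off described above.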
🐛 Describe the bug
fp32 static shape cpp wrapper
fp32 dynamic shape cpp wrapper
amp static shape default wrapper max autotune
amp static shape default wrapper
the bad commit: 165e335
the last good commit: 118a165
Versions
SW info
Repro:
inductor_single_run.sh
bash inductor_single_run.sh single inference performance torchbench hf_Albert amp
Suspected guilty commit: 165e335
torchbench-hf_Albert-inference-amp-static-default-single-performance-drop_guilty_commit.log
cc @chauhang @penguinwu @chuanqi129 @leslie-fang-intel