8000 ENH: Add SIMD sin/cos implementation with numpy-simd-routines by seiko2plus · Pull Request #29699 · numpy/numpy · GitHub

Conversation

@seiko2plus (Member) commented Sep 6, 2025

numpy-simd-routines is added as a subrepo in the Meson subprojects
directory. The current FP configuration is static: ~1 ulp for double precision
and ~4 ulp for single precision, with floating-point error handling,
extended-precision special cases for large arguments, and subnormal support
enabled by default.

numpy-simd-routines supports all SIMD extensions supported by Google Highway,
including non-FMA extensions, and is fully independent of libm to guarantee
consistent results across all compilers and platforms.

Note that there was previously no SIMD optimization for double-precision
sin/cos, only for single precision.
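
For reference, a rough Python-level spot-check of the single-precision bound quoted above (not part of this PR; the sample range, size, and use of a rounded float64 reference are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x32 = rng.uniform(-1e4, 1e4, size=1 << 16).astype(np.float32)

for ufunc in (np.sin, np.cos):
    # float64 evaluation of the same float32 inputs, rounded back to float32
    ref = ufunc(x32.astype(np.float64)).astype(np.float32)
    np.testing.assert_array_max_ulp(ufunc(x32), ref, maxulp=4)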

Benchmark

X86 Platform

The following benchmark was run with GCC 14.3.0 on an x86 CPU (Ryzen 7 7700X).

Environment
> uname -a
Linux seiko-pc 6.12.60 #1-NixOS SMP PREEMPT_DYNAMIC Mon Dec  1 10:43:41 UTC 2025 x86_64 GNU/Linux
> gcc --version
gcc (GCC) 14.3.0
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> python --version
Python 3.12.12
> lscpu
Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             48 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      16
  On-line CPU(s) list:       0-15
Vendor ID:                   AuthenticAMD
  Model name:                AMD Ryzen 7 7700X 8-Core Processor
    CPU family:              25
    Model:                   97
    Thread(s) per core:      2
    Core(s) per socket:      8
    Socket(s):               1
    Stepping:                2
    Frequency boost:         enabled
    CPU(s) scaling MHz:      72%
    CPU max MHz:             5573.0000
    CPU min MHz:             545.0000
    BogoMIPS:                8982.62
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl 
                             xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_leg
                             acy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp
                              ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec
                              xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_cl
                             ean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpo
                             pcntdq rdpid overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
x86-64-v4
export NPY_DISABLE_CPU_FEATURES=""
spin bench --compare parent/main --cpu-affinity 7 -t "\(<ufunc '(cos|sin)'>"
Change Before [b70fc77] <brings_npsr~2> After [c5fc842] <brings_npsr> Ratio Benchmark (Parameter)
- 689±6μs 640±30μs 0.93 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f')
- 683±0.9μs 602±2μs 0.88 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')
- 682±1μs 603±1μs 0.88 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f')
- 695±0.7μs 604±1μs 0.87 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f')
- 354±0.7μs 231±0.4μs 0.65 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')
- 363±0.2μs 234±4μs 0.64 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')
- 1.04±0.02ms 632±20μs 0.61 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f')
- 1.02±0ms 612±10μs 0.6 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f')
- 1.32±0.01ms 795±2μs 0.6 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f')
- 1.72±0.03ms 806±3μs 0.47 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f')
- 1.44±0.02ms 603±0.3μs 0.42 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f')
- 5.05±0.1ms 2.07±0.01ms 0.41 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd')
- 4.98±0.01ms 1.88±0.03ms 0.38 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd')
- 5.95±0.02ms 2.26±0.07ms 0.38 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd')
- 2.18±0.05ms 792±1μs 0.36 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f')
- 1.78±0.03ms 608±4μs 0.34 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f')
- 5.65±0.02ms 1.87±0.2ms 0.33 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')
- 5.65±0.01ms 1.87±0.01ms 0.33 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')
- 5.77±0.01ms 1.90±0.01ms 0.33 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')
- 6.00±0.04ms 1.98±0.02ms 0.33 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd')
- 1.36±0.01ms 443±2μs 0.32 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f')
- 5.75±0ms 1.68±0.04ms 0.29 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')
- 4.97±0ms 1.46±0.04ms 0.29 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd')
- 2.42±0.2ms 609±2μs 0.25 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f')
- 5.62±0.01ms 1.25±0.02ms 0.22 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')
- 5.72±0ms 1.25±0.01ms 0.22 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')
- 4.97±0ms 1.08±0ms 0.22 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd')
- 7.08±0.01ms 1.58±0.02ms 0.22 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd')
- 7.07±0ms 1.20±0.01ms 0.17 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd')
- 1.49±0ms 233±10μs 0.16 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f')
- 5.61±0ms 828±0.4μs 0.15 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')
- 5.73±0ms 859±8μs 0.15 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')
x86-64-v3
export NPY_DISABLE_CPU_FEATURES="X86_V4"
spin bench --compare parent/main --cpu-affinity 7 -t "\(<ufunc '(cos|sin)'>"
Change Before [b70fc77] <brings_npsr~2> After [c5fc842] <brings_npsr> Ratio Benchmark (Parameter)
+ 844±1μs 939±50μs 1.11 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')
- 4.98±0ms 4.57±0.1ms 0.92 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd')
- 4.98±0ms 4.20±0.01ms 0.84 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd')
- 6.03±0.06ms 5.00±0.06ms 0.83 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd')
- 6.06±0.06ms 4.78±0.04ms 0.79 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd')
- 1.37±0.01ms 913±40μs 0.67 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f')
- 1.37±0.01ms 901±10μs 0.66 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f')
- 1.45±0ms 941±2μs 0.65 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f')
- 830±0.3μs 511±2μs 0.62 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')
- 7.07±0.01ms 4.35±0.1ms 0.61 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd')
- 844±2μs 496±5μs 0.59 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')
- 1.46±0.01ms 864±9μs 0.59 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f')
- 3.21±0.03ms 1.83±0ms 0.57 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f')
- 7.08±0.01ms 3.98±0.06ms 0.56 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd')
- 3.63±0.09ms 1.84±0.01ms 0.51 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f')
- 5.65±0.01ms 2.82±0.1ms 0.5 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')
- 5.64±0ms 2.74±0.3ms 0.49 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')
- 5.75±0ms 2.76±0.1ms 0.48 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')
- 3.00±0.09ms 1.44±0ms 0.48 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f')
- 3.78±0.1ms 1.82±0.02ms 0.48 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f')
- 5.77±0.01ms 2.68±0.02ms 0.46 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')
- 5.73±0ms 2.11±0.03ms 0.37 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')
- 5.62±0ms 2.03±0.01ms 0.36 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')
- 5.62±0.02ms 1.67±0ms 0.3 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')
- 5.72±0ms 1.73±0ms 0.3 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')
- 3.23±0.02ms 866±10μs 0.27 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f')
- 3.48±0.1ms 856±3μs 0.25 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f')
- 3.68±0.06ms 851±2μs 0.23 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f')
- 3.03±0.08ms 497±4μs 0.16 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f')
x86-64-v2
export NPY_DISABLE_CPU_FEATURES="X86_V4 X86_V3"
spin bench --compare parent/main --cpu-affinity 7 -t "\(<ufunc '(cos|sin)'>"
Change Before [b70fc77] <brings_npsr~2> After [c5fc842] <brings_npsr> Ratio Benchmark (Parameter)
+ 4.98±0ms 7.04±0.1ms 1.41 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd')
+ 2.12±0ms 2.88±0.05ms 1.36 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f')
+ 4.97±0.01ms 6.68±0.02ms 1.34 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd')
+ 2.12±0ms 2.84±0.01ms 1.34 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f')
+ 2.13±0ms 2.82±0ms 1.32 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f')
+ 4.98±0.01ms 6.19±0.05ms 1.24 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd')
+ 4.97±0ms 5.77±0.05ms 1.16 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd')
+ 5.96±0.03ms 6.90±0.2ms 1.16 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd')
+ 2.13±0ms 2.45±0ms 1.15 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f')
+ 5.95±0.06ms 6.75±0.03ms 1.13 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd')
- 7.07±0.01ms 6.31±0.02ms 0.89 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd')
- 7.08±0ms 5.86±0.03ms 0.83 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd')
- 5.64±0.01ms 4.47±0.3ms 0.79 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')
- 5.76±0.01ms 4.15±0.1ms 0.72 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')
- 5.64±0.02ms 3.94±0.06ms 0.7 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')
- 5.76±0.01ms 4.01±0.1ms 0.7 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')
- 2.01±0.01ms 1.36±0.02ms 0.67 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f')
- 2.02±0.02ms 1.35±0.03ms 0.67 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f')
- 2.02±0ms 1.30±0ms 0.64 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f')
- 5.62±0.01ms 3.54±0.08ms 0.63 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')
- 5.73±0.01ms 3.58±0.06ms 0.62 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')
- 5.62±0.01ms 3.11±0.07ms 0.55 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')
- 5.72±0.01ms 3.12±0.05ms 0.54 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')
- 2.01±0ms 926±3μs 0.46 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f')
- 3.29±0ms 1.45±0.1ms 0.44 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f')
- 3.31±0ms 1.41±0.04ms 0.42 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f')
- 3.31±0ms 1.36±0.01ms 0.41 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f')
- 3.28±0ms 1.30±0ms 0.4 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f')
- 3.29±0ms 1.32±0.03ms 0.4 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f')
- 3.30±0ms 1.32±0.03ms 0.4 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')
- 3.30±0ms 947±4μs 0.29 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')
- 3.28±0ms 925±2μs 0.28 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')

Binary size change: _multiarray_umath.cpython-312-x86_64-linux-gnu.so

stripped size (bytes)
before 9,895,936
after 10,014,720
diff +118,784

@seiko2plus force-pushed the brings_npsr branch 4 times, most recently from 09414c8 to af5d98a on September 7, 2025 22:33
@rgommers added the component: SIMD label Sep 10, 2025
@seiko2plus added this to the 2.4.0 release milestone Sep 13, 2025
@seiko2plus force-pushed the brings_npsr branch 2 times, most recently from 2f4c0a6 to 9b9a438 on October 8, 2025 02:30
@charris modified the milestones: 2.4.0 release, 2.5.0 Release Nov 20, 2025
@seiko2plus marked this pull request as ready for review December 8, 2025 15:46
  numpy-simd-routines is added as a subrepo in the Meson subprojects
  directory. The current FP configuration is static: ~1 ulp for double
  precision and ~4 ulp for single precision, with floating-point error
  handling, extended-precision special cases for large arguments, and
  subnormal support enabled by default.

  numpy-simd-routines supports all SIMD extensions supported by Google
  Highway, including non-FMA extensions, and is fully independent of libm
  to guarantee consistent results across all compilers and platforms.

  Full benchmarks will be provided in the pull request; the following
  benchmark was run with clang-19 on an x86 CPU (Ryzen 7 7700X)
  with AVX-512 enabled.

  Note that there was previously no SIMD optimization for double-precision
  sin/cos, only for single precision.

  | Before        | After       |  Ratio | Benchmark (Parameter)                    |
  |---------------|-------------|--------|------------------------------------------|
  | 713±6μs       | 633±6μs     |   0.89 | UnaryFP(<ufunc 'cos'>, 1, 2, 'f')        |
  | 717±9μs       | 637±6μs     |   0.89 | UnaryFP(<ufunc 'cos'>, 4, 1, 'f')        |
  | 705±3μs       | 607±10μs    |   0.86 | UnaryFP(<ufunc 'sin'>, 4, 1, 'f')        |
  | 714±10μs      | 595±0.5μs   |   0.83 | UnaryFP(<ufunc 'sin'>, 1, 2, 'f')        |
  | 370±0.3μs     | 277±4μs     |   0.75 | UnaryFP(<ufunc 'cos'>, 1, 1, 'f')        |
  | 373±2μs       | 236±0.6μs   |   0.63 | UnaryFP(<ufunc 'sin'>, 1, 1, 'f')        |
  | 1.06±0.01ms   | 648±3μs     |   0.61 | UnaryFP(<ufunc 'cos'>, 4, 2, 'f')        |
  | 1.06±0.01ms   | 617±30μs    |   0.58 | UnaryFP(<ufunc 'sin'>, 4, 2, 'f')        |
  | 5.06±0.06ms   | 2.61±0.3ms  |   0.52 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'd') |
  | 1.48±0ms      | 715±5μs     |   0.48 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'f') |
  | 1.50±0.01ms   | 639±6μs     |   0.43 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'f') |
  | 5.15±0.1ms    | 1.96±0.01ms |   0.38 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'd') |
  | 5.72±0.02ms   | 2.09±0.1ms  |   0.37 | UnaryFP(<ufunc 'cos'>, 4, 2, 'd')        |
  | 5.76±0.01ms   | 2.03±0.08ms |   0.35 | UnaryFP(<ufunc 'sin'>, 4, 2, 'd')        |
  | 5.07±0.08ms   | 1.76±0.2ms  |   0.35 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'd') |
  | 6.04±0.04ms   | 2.05±0.09ms |   0.34 | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'd') |
  | 5.79±0.03ms   | 1.90±0.2ms  |   0.33 | UnaryFP(<ufunc 'sin'>, 4, 1, 'd')        |
  | 2.29±0.1ms    | 762±40μs    |   0.33 | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'f') |
  | 5.72±0.1ms    | 1.75±0.07ms |   0.31 | UnaryFP(<ufunc 'cos'>, 4, 1, 'd')        |
  | 6.04±0.03ms   | 1.82±0.2ms  |   0.3  | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'd') |
  | 2.49±0.1ms    | 748±30μs    |   0.3  | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'f') |
  | 2.23±0.1ms    | 634±6μs     |   0.28 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'f') |
  | 1.31±0.03ms   | 367±5μs     |   0.28 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'f') |
  | 2.55±0.09ms   | 654±30μs    |   0.26 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'f') |
  | 4.97±0.03ms   | 1.14±0ms    |   0.23 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'd') |
  | 5.67±0.01ms   | 1.22±0.03ms |   0.22 | UnaryFP(<ufunc 'cos'>, 1, 2, 'd')        |
  | 5.76±0.03ms   | 1.28±0.06ms |   0.22 | UnaryFP(<ufunc 'sin'>, 1, 2, 'd')        |
  | 1.26±0.01ms   | 272±2μs     |   0.22 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'f') |
  | 7.03±0.02ms   | 1.31±0.01ms |   0.19 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'd') |
  | 5.67±0.01ms   | 810±9μs     |   0.14 | UnaryFP(<ufunc 'cos'>, 1, 1, 'd')        |
  | 5.71±0.01ms   | 817±40μs    |   0.14 | UnaryFP(<ufunc 'sin'>, 1, 1, 'd')        |
  | 7.05±0.03ms   | 915±4μs     |   0.13 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'd') |
Allow up to 3 ULP error for float32 sin/cos when native
FMA is not available.
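
One way a test could pick the float32 bound at run time to match this (a sketch only, not the PR's actual test code; it relies on the private __cpu_features__ dict that NumPy's own test suite uses, and the with-FMA value of 2 is a placeholder):

import numpy as np
from numpy._core._multiarray_umath import __cpu_features__  # private API

# The 3 comes from the commit message above; the with-FMA bound of 2 is a
# placeholder, not necessarily the value used by the PR's tests.
maxulp_f32 = 2 if __cpu_features__.get("FMA3", False) else 3

x = np.linspace(-100, 100, 10_000, dtype=np.float32)
ref = np.sin(x.astype(np.float64)).astype(np.float32)
np.testing.assert_array_max_ulp(np.sin(x), ref, maxulp=maxulp_f32)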
@seiko2plus (Member, Author) commented Dec 9, 2025

I have updated the benchmark to use GCC instead of Clang, as GCC is commonly the default compiler for wheels on Linux. At the moment I only have access to bare-metal x86 hardware; perhaps I should try cloud services (e.g., AWS) for testing on other architectures. The ufunc inner implementation is optimized for size as much as possible and only increases the stripped binary by 118,784 bytes on x86/GCC.

cc: @charris, @rgommers, @mattip, @seberg, @r-devulap, @Mousius

@seiko2plus (Member, Author) commented Dec 9, 2025

Ah, regarding the reverted #23399, as discussed in the mailing-list thread (thread/C6EYZZSR4EWGVKHAZXLE7IBILRMNVK7L/): the default precision for f64 is set to ~1 ULP, including for large arguments, with the ability to add a build-time option later to change this behavior:

namespace detail {

using PreciseHigh = decltype(npsr::Precise{});
using PreciseLow = decltype(npsr::Precise{npsr::kLowAccuracy});

struct PresiceDummy {};

template <typename T>
struct PreciseByType {};

template <>
struct PreciseByType<float> { using Type = PreciseLow; };

template <>
struct PreciseByType<double> {
#if NPY_HWY_F64
    using Type = PreciseHigh;
#else
    // If float64 SIMD isn't available, use a dummy type.
    // The scalar path will run, but `Type` must still be defined.
    // The dummy is never passed; it only satisfies interfaces.
    // This also avoids spurious FP exceptions during RAII.
    using Type = PresiceDummy;
#endif
};

} // namespace detail

template <typename T>
using Precise = typename detail::PreciseByType<T>::Type;

cc: @mhvk
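
As a sanity check of the "including large arguments" part, one could compare a few double-precision values against a high-precision reference. A minimal sketch, assuming mpmath is available and allowing a little slack for rounding the reference itself:

import numpy as np
import mpmath

mpmath.mp.prec = 200  # far more precision than float64 needs

def ulp_error(x):
    got = np.sin(np.float64(x))
    ref = np.float64(float(mpmath.sin(mpmath.mpf(x))))
    return abs(got - ref) / np.spacing(abs(ref))

# Arguments large enough that naive range reduction would lose accuracy.
for x in (1e3, 1e8, 1e15, 1e22):
    assert ulp_error(x) <= 1.5, (x, ulp_error(x))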

@rgommers (Member) commented Dec 9, 2025

At the moment I only have access to bare-metal x86 hardware; perhaps I should try cloud services (e.g., AWS) for testing on other architectures.

No more GCC compile farm access? AWS seems okay of course. If you want me to run this on macOS arm64 (M1), I can.

The ufunc inner implementation is optimized for size as much as possible and only increases the stripped binary by 118,784 bytes on x86/GCC.

That's impressively small, glad to see that.

@seberg (Member) commented Dec 9, 2025

@seiko2plus thanks a lot for this hard work! I don't want to discuss it here on the PR, but maybe you can send a very brief mail summarizing the precision changes for the float32 version? (I am slightly worried we might get into the same dance as with the 64-bit version and SVML before, so we should at least increase awareness.)

@mhvk (Contributor) left a comment

@seiko2plus - this looks very impressive! A few small comments inline, with the one about the iterator perhaps the most important (if it is possible; it can be a follow-up). I also made a small comment over at your new npsr routines.

// Note: the intrinsics of NumPy SIMD routines are inlined by default.
if (NPY_UNLIKELY(is_mem_overlap(args[0], sin, args[1], sout, len) ||
sin != sizeof(T) || sout != sizeof(T))) {
// this for non-contiguous or overlapping case
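
Illustrative only, not from the PR: the kinds of Python-level calls that make the guarded condition above true or false (assuming same-dtype calls that reach this inner loop directly):

import numpy as np

a = np.linspace(0.0, 10.0, 1024, dtype=np.float32)
b = np.empty_like(a)

np.sin(a, out=b)        # unit strides, no overlap: contiguous path
np.sin(a, out=a)        # args[0] == args[1]: is_mem_overlap is true
np.sin(a[::4])          # input stride != sizeof(T): strided branch
np.sin(a, out=b[::-1])  # negative output stride: strided branch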
Contributor:

I always have trouble knowing exactly what one can ask the iterator, but is it not possible to set some flag that ensures that, for non-contiguous or overlapping data, a copy is made to a contiguous buffer? I ask since, if so, we would not have to do this inside the loop (with similar duplicated code presumably elsewhere).

@seberg (Member), Dec 9, 2025:

This is already the case, I believe. But these paths get used by .accumulate unfortunately. Although, it may be that we can simplify the check based on that knowledge?

EDIT: Sorry, to be clear: I think that is the case, but I am not 100% sure.

Contributor:

I don't think it can be used by .accumulate since this ufunc has nin=1... If a flag is indeed passed, maybe it is worth replacing this with an assert on the stride and seeing if anything breaks? Logically, if we are passing the iterator a flag saying that we require contiguous input and output, then we should never get anything else...

@seberg (Member), Dec 9, 2025:

Ah, there is no flag to enforce a contiguous loop, but we could probably add one; if we need this more often, I think that makes sense. The buffer machinery unfortunately has a lot of overhead, but overall it is likely still better.

For overlap, I believe the rule should be (or am I missing something?):

  • If memory overlap exists, it must be exactly the same memory (i.e. args[0] == args[1] and identical steps).
  • Except for .accumulate which, as you rightly say, doesn't apply here.

Contributor:

Ah, if there is no "needs contiguous" flag, it is out of scope here. But I think it does make sense to add the option to request buffering into a contiguous and aligned chunk to the iterator. I would think the overhead cannot be much worse than it is here, since for non-contiguous data one would typically already be inside the iterator anyway. And it would certainly be nice to keep the underlying loops simple...

@seberg (Member), Dec 9, 2025:

I would think the overhead cannot be that much worse than it is here

Well, the buffering has a "huge" amount of one-time overhead. For very large arrays, yes, the overhead may well be lower in practice (or at least nicer, as it is repeated fewer times).

Contributor:

@seberg - you mentioned there is no flag, but is one needed? If sin/cos were defined as an array method, would things not work automatically if one filled only the contiguous_loop slot? Or is that at the moment treated only as an optimization that is used, if present, when the data are contiguous?

Contributor:

I would think the overhead cannot be that much worse than it is here

Well, the buffering has a "huge" amount of one time overhead. For very large arrays, yes the overhead may well be lower in practice (or at least nicer, as it's repeated fewer times).

The other thing is that if we centralize this, there will be more of an incentive to speed it up. Isn't the buffer pre-allocated by the compiler here? It would make more sense to have just one scratch buffer in the iterator, perhaps using your mechanism that allocates if it is too small.

Member:

@mhvk no, not yet; the contiguous loop is only an optimization. Only with that flag or a "non-contig to contig wrapper" could it be more. I don't know what is better, centralizing or not; I mostly like it because it removes unnecessary repetition in both code and binary, whether it is slower or faster.
(But yeah, we shouldn't worry about that here.)

Contributor:

So, basically it would be some way to tell the ufunc (in this case) that the NPY_ITER_CONTIG flag should be passed on to the iterator... Anyway, let's take the discussion out of this PR -- see #30413.
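
For intuition, the Python-level nditer already exposes the kind of buffering discussed here: with the 'buffered' and 'external_loop' flags, a strided operand is handed out as contiguous chunks. A sketch of the concept only, not of the C machinery the ufunc inner loops would use:

import numpy as np

a = np.arange(64.0)[::4]                 # non-contiguous view (stride of 4 elements)
it = np.nditer(a, flags=['external_loop', 'buffered'],
               op_flags=[['readonly']], buffersize=8)
for chunk in it:
    # Each chunk is a contiguous copy of up to `buffersize` elements.
    assert chunk.strides == (chunk.itemsize,)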

/// npsr is tag-free by design, so we only include it within the main namespace (np::simd)
namespace sr = npsr::HWY_NAMESPACE;

/// Default precision configurations for NumPy SIMD Routines
Contributor:

It might be good here to explain what low and high imply (or at least where to look it up).

assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=2)
assert_array_max_ulp(np.sin(x_f32, out=x_f32), np.float32(np.sin(x_f64)), maxulp=maxulp_f32)
assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=maxulp_f32)

Contributor:

It may be good to add an explicit test that cos(0) == 1 for all float dtypes, as that is the thing that threw our tests off in a previous iteration (and I think it is something you now explicitly capture).
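
Something along these lines, perhaps (a sketch only; the test name and array length are illustrative):

import numpy as np
import pytest
from numpy.testing import assert_equal

@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_sincos_zero_identity(dtype):
    # Long enough to exercise the SIMD path, not just the scalar tail.
    z = np.zeros(1024, dtype=dtype)
    assert_equal(np.cos(z), np.ones_like(z))  # cos(0) must be exactly 1
    assert_equal(np.sin(z), z)                # sin(0) must be exactly 0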

- Extend the C++ doc scope to better explain precision control, which is chosen based on the data type.
- Add test cases for sine/cosine facts—for example, `cos(0) == 1`.

Labels: 01 - Enhancement, component: SIMD
