8000 ENH: Add SIMD sin/cos implementation with numpy-simd-routines by seiko2plus · Pull Request #29699 · numpy/numpy · GitHub

Conversation

@seiko2plus (Member) commented Sep 6, 2025

numpy-simd-routines is added as a subrepo in the Meson subprojects
directory. The current FP configuration is static: ~1 ulp for double precision
and ~4 ulp for single precision, with floating-point error handling,
extended-precision special cases for large arguments, and subnormal support
enabled by default.

numpy-simd-routines supports all SIMD extensions supported by Google Highway,
including non-FMA extensions, and is fully independent of libm to guarantee
consistent results across all compilers and platforms.

Note that there was previously no SIMD optimization for double-precision
sin/cos, only for single precision.
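
For reference, a rough Python-level spot-check of the single-precision bound quoted above (not part of this PR; the sample range, size, and use of a rounded float64 reference are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x32 = rng.uniform(-1e4, 1e4, size=1 << 16).astype(np.float32)

for ufunc in (np.sin, np.cos):
    # float64 evaluation of the same float32 inputs, rounded back to float32
    ref = ufunc(x32.astype(np.float64)).astype(np.float32)
    np.testing.assert_array_max_ulp(ufunc(x32), ref, maxulp=4)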

Benchmark

X86 Platform

The following benchmark was run with GCC 14.3.0 on an x86 CPU (Ryzen 7 7700X).

Environment
> uname -a
Linux seiko-pc 6.12.60 #1-NixOS SMP PREEMPT_DYNAMIC Mon Dec  1 10:43:41 UTC 2025 x86_64 GNU/Linux
> gcc --version
gcc (GCC) 14.3.0
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> python --version
Python 3.12.12
> lscpu
Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             48 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      16
  On-line CPU(s) list:       0-15
Vendor ID:                   AuthenticAMD
  Model name:                AMD Ryzen 7 7700X 8-Core Processor
    CPU family:              25
    Model:                   97
    Thread(s) per core:      2
    Core(s) per socket:      8
    Socket(s):               1
    Stepping:                2
    Frequency boost:         enabled
    CPU(s) scaling MHz:      72%
    CPU max MHz:             5573.0000
    CPU min MHz:             545.0000
    BogoMIPS:                8982.62
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl 
                             xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_leg
                             acy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp
                              ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec
                              xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_cl
                             ean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpo
                             pcntdq rdpid overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
x86-64-v4
export NPY_DISABLE_CPU_FEATURES=""
spin bench --compare parent/main --cpu-affinity 7 -t "\(<ufunc '(cos|sin)'>"
Change Before [b70fc77] <brings_npsr~2> After [c5fc842] <brings_npsr> Ratio Benchmark (Parameter)
- 689±6μs 640±30μs 0.93 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f')
- 683±0.9μs 602±2μs 0.88 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')
- 682±1μs 603±1μs 0.88 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f')
- 695±0.7μs 604±1μs 0.87 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f')
- 354±0.7μs 231±0.4μs 0.65 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')
- 363±0.2μs 234±4μs 0.64 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')
- 1.04±0.02ms 632±20μs 0.61 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f')
- 1.02±0ms 612±10μs 0.6 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f')
- 1.32±0.01ms 795±2μs 0.6 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f')
- 1.72±0.03ms 806±3μs 0.47 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f')
- 1.44±0.02ms 603±0.3μs 0.42 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f')
- 5.05±0.1ms 2.07±0.01ms 0.41 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd')
- 4.98±0.01ms 1.88±0.03ms 0.38 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd')
- 5.95±0.02ms 2.26±0.07ms 0.38 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd')
- 2.18±0.05ms 792±1μs 0.36 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f')
- 1.78±0.03ms 608±4μs 0.34 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f')
- 5.65±0.02ms 1.87±0.2ms 0.33 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')
- 5.65±0.01ms 1.87±0.01ms 0.33 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')
- 5.77±0.01ms 1.90±0.01ms 0.33 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')
- 6.00±0.04ms 1.98±0.02ms 0.33 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd')
- 1.36±0.01ms 443±2μs 0.32 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f')
- 5.75±0ms 1.68±0.04ms 0.29 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')
- 4.97±0ms 1.46±0.04ms 0.29 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd')
- 2.42±0.2ms 609±2μs 0.25 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f')
- 5.62±0.01ms 1.25±0.02ms 0.22 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')
- 5.72±0ms 1.25±0.01ms 0.22 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')
- 4.97±0ms 1.08±0ms 0.22 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd')
- 7.08±0.01ms 1.58±0.02ms 0.22 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd')
- 7.07±0ms 1.20±0.01ms 0.17 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd')
- 1.49±0ms 233±10μs 0.16 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f')
- 5.61±0ms 828±0.4μs 0.15 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')
- 5.73±0ms 859±8μs 0.15 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')
x86-64-v3
export NPY_DISABLE_CPU_FEATURES="X86_V4"
spin bench --compare parent/main --cpu-affinity 7 -t "\(<ufunc '(cos|sin)'>"
Change Before [b70fc77] <brings_npsr~2> After [c5fc842] <brings_npsr> Ratio Benchmark (Parameter)
+ 844±1μs 939±50μs 1.11 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')
- 4.98±0ms 4.57±0.1ms 0.92 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd')
- 4.98±0ms 4.20±0.01ms 0.84 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd')
- 6.03±0.06ms 5.00±0.06ms 0.83 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd')
- 6.06±0.06ms 4.78±0.04ms 0.79 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd')
- 1.37±0.01ms 913±40μs 0.67 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f')
- 1.37±0.01ms 901±10μs 0.66 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f')
- 1.45±0ms 941±2μs 0.65 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f')
- 830±0.3μs 511±2μs 0.62 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')
- 7.07±0.01ms 4.35±0.1ms 0.61 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd')
- 844±2μs 496±5μs 0.59 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')
- 1.46±0.01ms 864±9μs 0.59 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f')
- 3.21±0.03ms 1.83±0ms 0.57 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f')
- 7.08±0.01ms 3.98±0.06ms 0.56 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd')
- 3.63±0.09ms 1.84±0.01ms 0.51 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f')
- 5.65±0.01ms 2.82±0.1ms 0.5 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')
- 5.64±0ms 2.74±0.3ms 0.49 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')
- 5.75±0ms 2.76±0.1ms 0.48 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')
- 3.00±0.09ms 1.44±0ms 0.48 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f')
- 3.78±0.1ms 1.82±0.02ms 0.48 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f')
- 5.77±0.01ms 2.68±0.02ms 0.46 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')
- 5.73±0ms 2.11±0.03ms 0.37 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')
- 5.62±0ms 2.03±0.01ms 0.36 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')
- 5.62±0.02ms 1.67±0ms 0.3 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')
- 5.72±0ms 1.73±0ms 0.3 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')
- 3.23±0.02ms 866±10μs 0.27 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f')
- 3.48±0.1ms 856±3μs 0.25 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f')
- 3.68±0.06ms 851±2μs 0.23 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f')
- 3.03±0.08ms 497±4μs 0.16 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f')
x86-64-v2
export NPY_DISABLE_CPU_FEATURES="X86_V4 X86_V3"
spin bench --compare parent/main --cpu-affinity 7 -t "\(<ufunc '(cos|sin)'>"
Change Before [b70fc77] <brings_npsr~2> After [c5fc842] <brings_npsr> Ratio Benchmark (Parameter)
+ 4.98±0ms 7.04±0.1ms 1.41 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd')
+ 2.12±0ms 2.88±0.05ms 1.36 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f')
+ 4.97±0.01ms 6.68±0.02ms 1.34 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd')
+ 2.12±0ms 2.84±0.01ms 1.34 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f')
+ 2.13±0ms 2.82±0ms 1.32 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f')
+ 4.98±0.01ms 6.19±0.05ms 1.24 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd')
+ 4.97±0ms 5.77±0.05ms 1.16 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd')
+ 5.96±0.03ms 6.90±0.2ms 1.16 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd')
+ 2.13±0ms 2.45±0ms 1.15 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f')
+ 5.95±0.06ms 6.75±0.03ms 1.13 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd')
- 7.07±0.01ms 6.31±0.02ms 0.89 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd')
- 7.08±0ms 5.86±0.03ms 0.83 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd')
- 5.64±0.01ms 4.47±0.3ms 0.79 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')
- 5.76±0.01ms 4.15±0.1ms 0.72 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')
- 5.64±0.02ms 3.94±0.06ms 0.7 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')
- 5.76±0.01ms 4.01±0.1ms 0.7 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')
- 2.01±0.01ms 1.36±0.02ms 0.67 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f')
- 2.02±0.02ms 1.35±0.03ms 0.67 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f')
- 2.02±0ms 1.30±0ms 0.64 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f')
- 5.62±0.01ms 3.54±0.08ms 0.63 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')
- 5.73±0.01ms 3.58±0.06ms 0.62 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')
- 5.62±0.01ms 3.11±0.07ms 0.55 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')
- 5.72±0.01ms 3.12±0.05ms 0.54 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')
- 2.01±0ms 926±3μs 0.46 bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f')
- 3.29±0ms 1.45±0.1ms 0.44 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f')
- 3.31±0ms 1.41±0.04ms 0.42 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f')
- 3.31±0ms 1.36±0.01ms 0.41 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f')
- 3.28±0ms 1.30±0ms 0.4 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f')
- 3.29±0ms 1.32±0.03ms 0.4 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f')
- 3.30±0ms 1.32±0.03ms 0.4 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')
- 3.30±0ms 947±4μs 0.29 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')
- 3.28±0ms 925±2μs 0.28 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')

Binary size change: _multiarray_umath.cpython-312-x86_64-linux-gnu.so

stripped size (bytes)
before 9,895,936
after 10,014,720
diff +118,784

@seiko2plus force-pushed the brings_npsr branch 4 times, most recently from 09414c8 to af5d98a on September 7, 2025 22:33
@rgommers added the component: SIMD label Sep 10, 2025
@seiko2plus added this to the 2.4.0 release milestone Sep 13, 2025
@seiko2plus force-pushed the brings_npsr branch 2 times, most recently from 2f4c0a6 to 9b9a438 on October 8, 2025 02:30
@charris modified the milestones: 2.4.0 release, 2.5.0 Release Nov 20, 2025
@seiko2plus marked this pull request as ready for review December 8, 2025 15:46
  numpy-simd-routines is added as a subrepo in the Meson subprojects
  directory. The current FP configuration is static: ~1 ulp for double
  precision and ~4 ulp for single precision, with floating-point error
  handling, extended-precision special cases for large arguments, and
  subnormal support enabled by default.

  numpy-simd-routines supports all SIMD extensions supported by Google
  Highway, including non-FMA extensions, and is fully independent of libm
  to guarantee consistent results across all compilers and platforms.

  Full benchmarks will be provided in the pull request; the following
  benchmark was run with clang-19 on an x86 CPU (Ryzen 7 7700X)
  with AVX-512 enabled.

  Note that there was previously no SIMD optimization for double-precision
  sin/cos, only for single precision.

  | Before        | After       |  Ratio | Benchmark (Parameter)                    |
  |---------------|-------------|--------|------------------------------------------|
  | 713±6μs       | 633±6μs     |   0.89 | UnaryFP(<ufunc 'cos'>, 1, 2, 'f')        |
  | 717±9μs       | 637±6μs     |   0.89 | UnaryFP(<ufunc 'cos'>, 4, 1, 'f')        |
  | 705±3μs       | 607±10μs    |   0.86 | UnaryFP(<ufunc 'sin'>, 4, 1, 'f')        |
  | 714±10μs      | 595±0.5μs   |   0.83 | UnaryFP(<ufunc 'sin'>, 1, 2, 'f')        |
  | 370±0.3μs     | 277±4μs     |   0.75 | UnaryFP(<ufunc 'cos'>, 1, 1, 'f')        |
  | 373±2μs       | 236±0.6μs   |   0.63 | UnaryFP(<ufunc 'sin'>, 1, 1, 'f')        |
  | 1.06±0.01ms   | 648±3μs     |   0.61 | UnaryFP(<ufunc 'cos'>, 4, 2, 'f')        |
  | 1.06±0.01ms   | 617±30μs    |   0.58 | UnaryFP(<ufunc 'sin'>, 4, 2, 'f')        |
  | 5.06±0.06ms   | 2.61±0.3ms  |   0.52 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'd') |
  | 1.48±0ms      | 715±5μs     |   0.48 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'f') |
  | 1.50±0.01ms   | 639±6μs     |   0.43 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'f') |
  | 5.15±0.1ms    | 1.96±0.01ms |   0.38 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'd') |
  | 5.72±0.02ms   | 2.09±0.1ms  |   0.37 | UnaryFP(<ufunc 'cos'>, 4, 2, 'd')        |
  | 5.76±0.01ms   | 2.03±0.08ms |   0.35 | UnaryFP(<ufunc 'sin'>, 4, 2, 'd')        |
  | 5.07±0.08ms   | 1.76±0.2ms  |   0.35 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'd') |
  | 6.04±0.04ms   | 2.05±0.09ms |   0.34 | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'd') |
  | 5.79±0.03ms   | 1.90±0.2ms  |   0.33 | UnaryFP(<ufunc 'sin'>, 4, 1, 'd')        |
  | 2.29±0.1ms    | 762±40μs    |   0.33 | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'f') |
  | 5.72±0.1ms    | 1.75±0.07ms |   0.31 | UnaryFP(<ufunc 'cos'>, 4, 1, 'd')        |
  | 6.04±0.03ms   | 1.82±0.2ms  |   0.3  | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'd') |
  | 2.49±0.1ms    | 748±30μs    |   0.3  | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'f') |
  | 2.23±0.1ms    | 634±6μs     |   0.28 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'f') |
  | 1.31±0.03ms   | 367±5μs     |   0.28 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'f') |
  | 2.55±0.09ms   | 654±30μs    |   0.26 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'f') |
  | 4.97±0.03ms   | 1.14±0ms    |   0.23 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'd') |
  | 5.67±0.01ms   | 1.22±0.03ms |   0.22 | UnaryFP(<ufunc 'cos'>, 1, 2, 'd')        |
  | 5.76±0.03ms   | 1.28±0.06ms |   0.22 | UnaryFP(<ufunc 'sin'>, 1, 2, 'd')        |
  | 1.26±0.01ms   | 272±2μs     |   0.22 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'f') |
  | 7.03±0.02ms   | 1.31±0.01ms |   0.19 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'd') |
  | 5.67±0.01ms   | 810±9μs     |   0.14 | UnaryFP(<ufunc 'cos'>, 1, 1, 'd')        |
  | 5.71±0.01ms   | 817±40μs    |   0.14 | UnaryFP(<ufunc 'sin'>, 1, 1, 'd')        |
  | 7.05±0.03ms   | 915±4μs     |   0.13 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'd') |
Allow up to 3 ULP error for float32 sin/cos when native
FMA is not available.
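
One way a test could pick the float32 bound at run time to match this (a sketch only, not the PR's actual test code; it relies on the private __cpu_features__ dict that NumPy's own test suite uses, and the with-FMA value of 2 is a placeholder):

import numpy as np
from numpy._core._multiarray_umath import __cpu_features__  # private API

# The 3 comes from the commit message above; the with-FMA bound of 2 is a
# placeholder, not necessarily the value used by the PR's tests.
maxulp_f32 = 2 if __cpu_features__.get("FMA3", False) else 3

x = np.linspace(-100, 100, 10_000, dtype=np.float32)
ref = np.sin(x.astype(np.float64)).astype(np.float32)
np.testing.assert_array_max_ulp(np.sin(x), ref, maxulp=maxulp_f32)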
@seiko2plus (Member, Author) commented Dec 9, 2025

I have updated the benchmark to use GCC instead of Clang, as GCC is commonly the default compiler for wheels on Linux. At the moment I only have access to bare-metal x86 hardware; perhaps I should try cloud services (e.g., AWS) for testing on other architectures. The ufunc inner implementation is optimized for size as much as possible and only increases the stripped binary by 118,784 bytes on x86/GCC.

cc: @charris, @rgommers, @mattip, @seberg, @r-devulap, @Mousius

@seiko2plus (Member, Author) commented Dec 9, 2025

Ah, regarding the reverted #23399, as discussed in the mailing-list thread (thread/C6EYZZSR4EWGVKHAZXLE7IBILRMNVK7L/): the default precision for f64 is set to ~1 ULP, including for large arguments, with the ability to add a build-time option later to change this behavior:

namespace detail {

using PreciseHigh = decltype(npsr::Precise{});
using PreciseLow = decltype(npsr::Precise{npsr::kLowAccuracy});

struct PresiceDummy {};

template <typename T>
struct PreciseByType {};

template <>
struct PreciseByType<float> { using Type = PreciseLow; };

template <>
struct PreciseByType<double> {
#if NPY_HWY_F64
    using Type = PreciseHigh;
#else
    // If float64 SIMD isn't available, use a dummy type.
    // The scalar path will run, but `Type` must still be defined.
    // The dummy is never passed; it only satisfies interfaces.
    // This also avoids spurious FP exceptions during RAII.
    using Type = PresiceDummy;
#endif
};

} // namespace detail

template <typename T>
using Precise = typename detail::PreciseByType<T>::Type;

cc: @mhvk
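
As a sanity check of the "including large arguments" part, one could compare a few double-precision values against a high-precision reference. A minimal sketch, assuming mpmath is available and allowing a little slack for rounding the reference itself:

import numpy as np
import mpmath

mpmath.mp.prec = 200  # far more precision than float64 needs

def ulp_error(x):
    got = np.sin(np.float64(x))
    ref = np.float64(float(mpmath.sin(mpmath.mpf(x))))
    return abs(got - ref) / np.spacing(abs(ref))

# Arguments large enough that naive range reduction would lose accuracy.
for x in (1e3, 1e8, 1e15, 1e22):
    assert ulp_error(x) <= 1.5, (x, ulp_error(x))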

@rgommers (Member) commented Dec 9, 2025

At the moment I only have access to bare-metal x86 hardware; perhaps I should try cloud services (e.g., AWS) for testing on other architectures.

No more GCC compile farm access? AWS seems okay of course. If you want me to run this on macOS arm64 (M1), I can.

The ufunc inner implementation is optimized for size as much as possible and only increases the stripped binary by 118,784 bytes on x86/GCC.

That's impressively small, glad to see that.

@seberg (Member) commented Dec 9, 2025

@seiko2plus thanks a lot for this hard work! I don't want to discuss it here on the PR, but maybe you can send a very brief mail summarizing the precision changes for the float32 version? (I am slightly worried we might get into the same dance as with the 64-bit version and SVML before, so we should at least increase awareness.)

@mhvk (Contributor) left a comment

@seiko2plus - this looks very impressive! A few small comments inline, with the one about the iterator perhaps the most important (if it is possible; it can be a follow-up). I also made a small comment over at your new npsr routines.

// Note: the intrinsics of NumPy SIMD routines are inlined by default.
if (NPY_UNLIKELY(is_mem_overlap(args[0], sin, args[1], sout, len) ||
sin != sizeof(T) || sout != sizeof(T))) {
// this for non-contiguous or overlapping case
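
Illustrative only, not from the PR: the kinds of Python-level calls that make the guarded condition above true or false (assuming same-dtype calls that reach this inner loop directly):

import numpy as np

a = np.linspace(0.0, 10.0, 1024, dtype=np.float32)
b = np.empty_like(a)

np.sin(a, out=b)        # unit strides, no overlap: contiguous path
np.sin(a, out=a)        # args[0] == args[1]: is_mem_overlap is true
np.sin(a[::4])          # input stride != sizeof(T): strided branch
np.sin(a, out=b[::-1])  # negative output stride: strided branch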
Contributor:

I always have trouble knowing exactly what one can ask the iterator, but is it not possible to set some flag that ensures that, for non-contiguous or overlapping data, a copy is made to a contiguous buffer? I ask since, if so, we would not have to do this inside the loop (with similar duplicated code presumably elsewhere).

@seberg (Member), Dec 9, 2025:

This is already the case, I believe. But these paths get used by .accumulate unfortunately. Although, it may be that we can simplify the check based on that knowledge?

EDIT: Sorry, to be clear: I think that is the case, but I am not 100% sure.

Contributor:

I don't think it can be used by .accumulate since this ufunc has nin=1... If a flag is indeed passed, maybe it is worth replacing this with an assert on the stride and seeing if anything breaks? Logically, if we are passing the iterator a flag saying that we require contiguous input and output, then we should never get anything else...

@seberg (Member), Dec 9, 2025:

Ah, there is no flag to enforce a contiguous loop, but we could probably add one; if we need this more often, I think that makes sense. The buffer machinery unfortunately has a lot of overhead, but overall it is likely still better.

For overlap, I believe the rule should be (or am I missing something?):

  • If memory overlap exists, it must be exactly the same memory (i.e. args[0] == args[1] and identical steps).
  • Except for .accumulate which, as you rightly say, doesn't apply here.

Contributor:

Ah, if there is no "needs contiguous" flag, it is out of scope here. But I think it does make sense to add the option to request buffering into a contiguous and aligned chunk to the iterator. I would think the overhead cannot be much worse than it is here, since for non-contiguous data one would typically already be inside the iterator anyway. And it would certainly be nice to keep the underlying loops simple...

@seberg (Member), Dec 9, 2025:

I would think the overhead cannot be that much worse than it is here

Well, the buffering has a "huge" amount of one-time overhead. For very large arrays, yes, the overhead may well be lower in practice (or at least nicer, as it is repeated fewer times).

Contributor:

@seberg - you mentioned there is no flag, but is one needed? If sin/cos were defined as an array method, would things not work automatically if one filled only the contiguous_loop slot? Or is that at the moment treated only as an optimization that is used, if present, when the data are contiguous?

Contributor:

I would think the overhead cannot be that much worse than it is here

Well, the buffering has a "huge" amount of one time overhead. For very large arrays, yes the overhead may well be lower in practice (or at least nicer, as it's repeated fewer times).

The other thing is that if we centralize this, there will be more of an incentive to speed it up. Isn't the buffer pre-allocated by the compiler here? It would make more sense to have just one scratch buffer in the iterator, perhaps using your mechanism that allocates if it is too small.

Member:

@mhvk no, not yet; the contiguous loop is only an optimization. Only with that flag or a "non-contig to contig wrapper" could it be more. I don't know what is better, centralizing or not; I mostly like it because it removes unnecessary repetition in both code and binary, whether it is slower or faster.
(But yeah, we shouldn't worry about that here.)

Contributor:

So, basically it would be some way to tell the ufunc (in this case) that the NPY_ITER_CONTIG flag should be passed on to the iterator... Anyway, let's take the discussion out of this PR -- see #30413.
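
For intuition, the Python-level nditer already exposes the kind of buffering discussed here: with the 'buffered' and 'external_loop' flags, a strided operand is handed out as contiguous chunks. A sketch of the concept only, not of the C machinery the ufunc inner loops would use:

import numpy as np

a = np.arange(64.0)[::4]                 # non-contiguous view (stride of 4 elements)
it = np.nditer(a, flags=['external_loop', 'buffered'],
               op_flags=[['readonly']], buffersize=8)
for chunk in it:
    # Each chunk is a contiguous copy of up to `buffersize` elements.
    assert chunk.strides == (chunk.itemsize,)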

/// npsr is tag-free by design, so we only include it within the main namespace (np::simd)
namespace sr = npsr::HWY_NAMESPACE;

/// Default precision configurations for NumPy SIMD Routines
Contributor:

It might be good here to explain what low and high imply (or at least where to look it up).

assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=2)
assert_array_max_ulp(np.sin(x_f32, out=x_f32), np.float32(np.sin(x_f64)), maxulp=maxulp_f32)
assert_array_max_ulp(np.cos(tx_f32, out=tx_f32), np.float32(np.cos(x_f64)), maxulp=maxulp_f32)

Contributor:

It may be good to add an explicit test that cos(0) == 1 for all float dtypes, as that is the thing that threw our tests off in a previous iteration (and I think it is something you now explicitly capture).
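
Something along these lines, perhaps (a sketch only; the test name and array length are illustrative):

import numpy as np
import pytest
from numpy.testing import assert_equal

@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_sincos_zero_identity(dtype):
    # Long enough to exercise the SIMD path, not just the scalar tail.
    z = np.zeros(1024, dtype=dtype)
    assert_equal(np.cos(z), np.ones_like(z))  # cos(0) must be exactly 1
    assert_equal(np.sin(z), z)                # sin(0) must be exactly 0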

- Extend the C++ doc scope to better explain precision control, which is chosen based on the data type.
- Add test cases for sine/cosine facts—for example, `cos(0) == 1`.

Labels: 01 - Enhancement, component: SIMD
