add avx512 reduce sum comments · numpy/numpy@2e713b0 · GitHub

Commit 2e713b0

add avx512 reduce sum comments
1 parent ae53e35 commit 2e713b0

1 file changed: +12 -0 lines changed

numpy/core/src/common/simd/avx512/arithmetic.h

Lines changed: 12 additions & 0 deletions
@@ -115,6 +115,18 @@ NPY_FINLINE __m512i npyv_mul_u8(__m512i a, __m512i b)
 // TODO: emulate integer division
 #define npyv_div_f32 _mm512_div_ps
 #define npyv_div_f64 _mm512_div_pd
+
+/***************************
+ * Reduce Sum
+ * There are three ways to implement reduce sum for AVX512:
+ * 1- split(256) /add /split(128) /add /hadd /hadd /extract
+ * 2- shuff(cross) /add /shuff(cross) /add /shuff /add /shuff /add /extract
+ * 3- _mm512_reduce_add_ps/pd
+ * The first one is widely used by many projects, while the second one is used by the Intel Compiler and here.
+ * The Intel Compiler may prefer the second one because the latency of hadd increased by (2-3)
+ * starting from Skylake-X, which makes two extra (non-cross) shuffles cheaper. See https://godbolt.org/z/s3G9Er for more clarification.
+ * The third one is almost the same as the second one but only works with the Intel Compiler, GCC 7.1, and Clang 4.
+ ***************************/
 NPY_FINLINE float npyv_sum_f32(npyv_f32 a)
 {
     __m512 h64 = _mm512_shuffle_f32x4(a, a, _MM_SHUFFLE(3, 2, 3, 2));
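The first two strategies named in the comment can be sketched side by side as follows. This is a minimal illustration, not the code from `arithmetic.h`: the function names `sum_split_hadd`, `sum_shuffle`, and `sum16_f32` are hypothetical. The shuffle immediates are chosen here to form a symmetric butterfly, so they differ from the `_MM_SHUFFLE(3, 2, 3, 2)` visible in the diff. The intrinsic paths are compiled only when the compiler defines `__AVX512F__` (for example with `-mavx512f`); otherwise the wrapper falls back to a scalar loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__AVX512F__)
#include <immintrin.h>

/* Approach 1 (hypothetical name):
 * split(256) / add / split(128) / add / hadd / hadd / extract */
float sum_split_hadd(__m512 a)
{
    /* Split the 512-bit vector into two 256-bit halves and add them. */
    __m256 lo256 = _mm512_castps512_ps256(a);
    __m256 hi256 = _mm256_castpd_ps(
        _mm512_extractf64x4_pd(_mm512_castps_pd(a), 1));
    __m256 s256 = _mm256_add_ps(lo256, hi256);
    /* Split again into two 128-bit halves and add them. */
    __m128 s128 = _mm_add_ps(_mm256_castps256_ps128(s256),
                             _mm256_extractf128_ps(s256, 1));
    /* Two horizontal adds collapse the remaining four floats. */
    s128 = _mm_hadd_ps(s128, s128);
    s128 = _mm_hadd_ps(s128, s128);
    return _mm_cvtss_f32(s128);
}

/* Approach 2 (hypothetical name):
 * shuff(cross) / add / shuff(cross) / add / shuff / add / shuff / add / extract */
float sum_shuffle(__m512 a)
{
    /* Cross-lane: swap the two 256-bit halves, then add. */
    __m512 t = _mm512_shuffle_f32x4(a, a, _MM_SHUFFLE(1, 0, 3, 2));
    __m512 s = _mm512_add_ps(a, t);
    /* Cross-lane: swap adjacent 128-bit lanes, then add. */
    t = _mm512_shuffle_f32x4(s, s, _MM_SHUFFLE(2, 3, 0, 1));
    s = _mm512_add_ps(s, t);
    /* In-lane: swap the 64-bit halves of each 128-bit lane, then add. */
    t = _mm512_permute_ps(s, _MM_SHUFFLE(1, 0, 3, 2));
    s = _mm512_add_ps(s, t);
    /* In-lane: swap adjacent 32-bit elements, then add. */
    t = _mm512_permute_ps(s, _MM_SHUFFLE(2, 3, 0, 1));
    s = _mm512_add_ps(s, t);
    /* Every element now holds the full sum; extract element 0. */
    return _mm_cvtss_f32(_mm512_castps512_ps128(s));
}
#endif /* __AVX512F__ */

/* Hypothetical wrapper: sums exactly 16 floats read from v.
 * Uses approach 2 when AVX-512F is enabled at compile time,
 * and a plain scalar loop otherwise. */
float sum16_f32(const float *v)
{
    assert(v != NULL);
#if defined(__AVX512F__)
    return sum_shuffle(_mm512_loadu_ps(v));
#else
    float s = 0.0f;
    for (size_t i = 0; i < 16; ++i) {
        s += v[i];
    }
    return s;
#endif
}
```

Calling `sum16_f32` on the values 1 through 16 yields 136 on either path, since every partial sum is a small integer that a float represents exactly. For general inputs the two approaches add in different orders, so their results can differ in the last bits.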
