Commit 6f36599
BUG: Resolve Divide by Zero on Apple silicon + test failures (numpy#19926)
* Resolve divide by zero in reciprocal numpy#18555

clang has an optimization bug where a vector that is only partially loaded / stored will get narrowed to only the lanes used, which can be fine in some cases. However, in numpy's `reciprocal` function a partial load is explicitly extended to the full width of the register (filled with '1's) to avoid divide-by-zero. clang's optimization ignores the explicit filling with '1's.

The changes here insert a dummy `volatile` variable. This convinces clang not to narrow the load and ignore the explicit filling of '1's. `volatile` can be expensive since it forces loads / stores to refresh contents whenever the variable is used. numpy has its own template / macro system that'll expand the loop function below for sqrt, absolute, square, and reciprocal. Additionally, the loop can be called on a full array if there's overlap between src and dst. Consequently, we try to limit the scope in which `volatile` is applied: the intention is that it should only be needed when compiling with clang, targeting Apple arm64, and only for the `reciprocal` function. Moreover, `volatile` is only needed when a vector is partially loaded.

Testing: Beyond fixing the cases mentioned in the GitHub issue, the changes here also resolve several failures in numpy's test suite.

Before:

```
FAILED numpy/core/tests/test_scalarmath.py::TestBaseMath::test_blocked - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_ufunc.py::TestUfuncGenericLoops::test_unary_PyUFunc_O_O_method_full[reciprocal] - AssertionError: FloatingPointError not raised
FAILED numpy/core/tests/test_umath.py::TestPower::test_power_float - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
FAILED numpy/core/tests/test_umath.py::TestAVXUfuncs::test_avx_based_ufunc - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormDouble::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormSingle::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormInt64::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
8 failed, 14759 passed, 204 skipped, 1268 deselected, 34 xfailed in 69.90s (0:01:09)
```

After:

```
FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
1 failed, 14766 passed, 204 skipped, 1268 deselected, 34 xfailed in 70.37s (0:01:10)
```

* Enhancement on top of workaround for clang bug in reciprocal (numpy#18555)

Numpy's FP unary loops use a partial load / store on every iteration. The partial load / store helpers each insert a switch statement to decide how many elements to handle, which puts a lot of unnecessary branches into the loops. The partial load / store is only needed on the final iteration, and only if it isn't a full vector. The changes here break out the final iteration to use the partial load / stores; the main loop has been changed to use full load / stores. Additionally, this means we don't need to conditionalize the `volatile` workaround inside the loop.
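As a concrete illustration of the pattern both notes above describe, here is a minimal scalar sketch in plain C. The names (`LANES`, `load_till`, `keep_all_lanes`) are illustrative only and are not NumPy's actual `npyv_*` SIMD API: the tail of the array is loaded into a full-width buffer whose unused lanes are filled with 1.0 so the reciprocal cannot trap, and a `volatile` copy keeps the compiler from deciding the filled lane is dead and discarding the fill.

```c
#include <stdio.h>

#define LANES 4  /* pretend vector width */

/* Partial load: copy `len` (< LANES) elements, fill the rest with `fill`
 * so that 1.0f / lane is safe for every lane. */
static void load_till(float vec[LANES], const float *src, int len, float fill)
{
    for (int i = 0; i < LANES; i++)
        vec[i] = (i < len) ? src[i] : fill;
}

int main(void)
{
    const float src[3] = {2.0f, 4.0f, 8.0f};  /* partial: 3 of 4 lanes */
    float vec[LANES];
    float out[3];

    load_till(vec, src, 3, 1.0f);

    /* The workaround: a volatile dummy makes every lane observable, so the
     * compiler cannot prove the filled lane is unused and drop the fill. */
    volatile float keep_all_lanes[LANES];
    for (int i = 0; i < LANES; i++)
        keep_all_lanes[i] = vec[i];

    for (int i = 0; i < LANES; i++)       /* "reciprocal" on all lanes */
        vec[i] = 1.0f / vec[i];

    for (int i = 0; i < 3; i++)           /* partial store: only 3 lanes */
        out[i] = vec[i];

    printf("%g %g %g\n", out[0], out[1], out[2]);
    return 0;
}
```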
* Address Azure CI failures with older versions of clang

- `-ftrapping-math` is enabled by default for Numpy builds, but clang's support for it is mainly on x86_64
- Apple Clang and upstream Clang have different, but overlapping, version numbers
- The non-Apple Clang version cutoffs come from looking at when clang started supporting `-ftrapping-math` for x86_64

Testing was done against Apple Clang versions:

- v11 / x86_64 - failed previously, now passes (Azure failure)
- v12+ / x86_64 - passes before and after
- v13 / arm64 - failed before the initial patch, passes after
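The version gating works because both toolchains define `__clang__`, while only Apple's defines `__apple_build_version__`, and the two use different version schemes. A small sketch of how a build can tell them apart (illustrative only, not part of this commit):

```c
#include <stdio.h>

int main(void)
{
#if defined(__clang__)
  #if defined(__apple_build_version__)
    /* Apple Clang: __apple_build_version__ is a large integer; the commit's
     * gate treats anything below 12000000 as "before v12". */
    printf("Apple Clang, __apple_build_version__ = %ld\n",
           (long)__apple_build_version__);
  #else
    /* Upstream Clang: versioned via __clang_major__ / __clang_minor__. */
    printf("Clang %d.%d\n", __clang_major__, __clang_minor__);
  #endif
#else
    printf("not a clang compiler\n");
#endif
    return 0;
}
```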
1 parent 52f894c commit 6f36599

File tree

1 file changed: +83, -1 lines

numpy/core/src/umath/loops_unary_fp.dispatch.c.src

Lines changed: 83 additions & 1 deletion
```diff
@@ -77,6 +77,56 @@ NPY_FINLINE double c_square_f64(double a)
  */
 #define CONTIG 0
 #define NCONTIG 1
+
+/*
+ * clang has a bug that's present at -O1 or greater. When partially loading a
+ * vector register for a reciprocal operation, the remaining elements are set
+ * to 1 to avoid divide-by-zero. The partial load is paired with a partial
+ * store after the reciprocal operation. clang notices that the entire register
+ * is not needed for the store and optimizes out the fill of 1 to the remaining
+ * elements. This causes either a divide-by-zero or 0/0 with invalid exception
+ * that we were trying to avoid by filling.
+ *
+ * Using a dummy variable marked 'volatile' convinces clang not to ignore
+ * the explicit fill of remaining elements. If `-ftrapping-math` is
+ * supported, then it'll also avoid the bug. `-ftrapping-math` is supported
+ * on Apple clang v12+ for x86_64. It is not currently supported for arm64.
+ * `-ftrapping-math` is set by default of Numpy builds in
+ * numpy/distutils/ccompiler.py.
+ *
+ * Note: Apple clang and clang upstream have different versions that overlap
+ */
+#if defined(__clang__)
+    #if defined(__apple_build_version__)
+    // Apple Clang
+        #if __apple_build_version__ < 12000000
+        // Apple Clang before v12
+            #define WORKAROUND_CLANG_RECIPROCAL_BUG 1
+        #elif defined(NPY_CPU_X86) || defined(NPY_CPU_AMD64)
+        // Apple Clang after v12, targeting i386 or x86_64
+            #define WORKAROUND_CLANG_RECIPROCAL_BUG 0
+        #else
+        // Apple Clang after v12, not targeting i386 or x86_64
+            #define WORKAROUND_CLANG_RECIPROCAL_BUG 1
+        #endif
+    #else
+    // Clang, not Apple Clang
+        #if __clang_major__ < 10
+        // Clang before v10
+            #define WORKAROUND_CLANG_RECIPROCAL_BUG 1
+        #elif defined(NPY_CPU_X86) || defined(NPY_CPU_AMD64)
+        // Clang v10+, targeting i386 or x86_64
+            #define WORKAROUND_CLANG_RECIPROCAL_BUG 0
+        #else
+        // Clang v10+, not targeting i386 or x86_64
+            #define WORKAROUND_CLANG_RECIPROCAL_BUG 1
+        #endif
+    #endif
+#else
+// Not a Clang compiler
+#define WORKAROUND_CLANG_RECIPROCAL_BUG 0
+#endif
+
 /**begin repeat
  * #TYPE = FLOAT, DOUBLE#
  * #sfx = f32, f64#
```
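One property of the gate above worth noting: every branch defines `WORKAROUND_CLANG_RECIPROCAL_BUG` to 0 or 1, so downstream code can test it with `#if`. A hypothetical guard (not in the commit) would make that invariant explicit if a future branch forgot the definition:

```c
/* Hypothetical sanity check, not part of the commit: fail the build loudly
 * if some branch of the gate above leaves the macro undefined. */
#ifndef WORKAROUND_CLANG_RECIPROCAL_BUG
    #error "WORKAROUND_CLANG_RECIPROCAL_BUG must be defined to 0 or 1"
#endif
```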
```diff
@@ -87,6 +137,7 @@ NPY_FINLINE double c_square_f64(double a)
  * #kind = sqrt, absolute, square, reciprocal#
  * #intr = sqrt, abs, square, recip#
  * #repl_0w1 = 0, 0, 0, 1#
+ * #RECIP_WORKAROUND = 0, 0, 0, WORKAROUND_CLANG_RECIPROCAL_BUG#
  */
 /**begin repeat2
  * #STYPE = CONTIG, NCONTIG, CONTIG, NCONTIG#
```
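This hunk threads the macro through NumPy's `.c.src` template system: within a repeat block, each `@NAME@` is substituted once per entry of `#NAME = ...#`, so `@RECIP_WORKAROUND@` expands to `0` for sqrt, absolute, and square, and to `WORKAROUND_CLANG_RECIPROCAL_BUG` only for reciprocal; for the first three kinds the guarded block becomes `#if 0` and is compiled out. A self-contained sketch of roughly what the reciprocal / f32 instantiation looks like (the typedef and function are stand-ins, not NumPy's real definitions):

```c
/* Hypothetical stand-ins so the fragment compiles in isolation; in the real
 * file these come from NumPy's universal SIMD layer. */
typedef struct { float lanes[4]; } npyv_f32;
#define WORKAROUND_CLANG_RECIPROCAL_BUG 1

void tail_example(npyv_f32 v_src0)
{
#if WORKAROUND_CLANG_RECIPROCAL_BUG
    /* what the @RECIP_WORKAROUND@ guard expands to for reciprocal/f32 */
    volatile npyv_f32 unused_but_workaround_bug = v_src0;
#endif
    (void)v_src0;
}
```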
```diff
@@ -101,6 +152,8 @@ static void simd_@TYPE@_@kind@_@STYPE@_@DTYPE@
 
     const int vstep = npyv_nlanes_@sfx@;
     const int wstep = vstep * @unroll@;
+
+    // unrolled iterations
     for (; len >= wstep; len -= wstep, src += ssrc*wstep, dst += sdst*wstep) {
         /**begin repeat3
          * #N = 0, 1, 2, 3#
```
```diff
@@ -126,7 +179,24 @@ static void simd_@TYPE@_@kind@_@STYPE@_@DTYPE@
         #endif
     /**end repeat3**/
     }
-    for (; len > 0; len -= vstep, src += ssrc*vstep, dst += sdst*vstep) {
+
+    // vector-sized iterations
+    for (; len >= vstep; len -= vstep, src += ssrc*vstep, dst += sdst*vstep) {
+    #if @STYPE@ == CONTIG
+        npyv_@sfx@ v_src0 = npyv_load_@sfx@(src);
+    #else
+        npyv_@sfx@ v_src0 = npyv_loadn_@sfx@(src, ssrc);
+    #endif
+        npyv_@sfx@ v_unary0 = npyv_@intr@_@sfx@(v_src0);
+    #if @DTYPE@ == CONTIG
+        npyv_store_@sfx@(dst, v_unary0);
+    #else
+        npyv_storen_@sfx@(dst, sdst, v_unary0);
+    #endif
+    }
+
+    // last partial iteration, if needed
+    if(len > 0){
     #if @STYPE@ == CONTIG
     #if @repl_0w1@
         npyv_@sfx@ v_src0 = npyv_load_till_@sfx@(src, len, 1);
```
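The restructuring in this hunk is easiest to see in scalar form. A minimal sketch in plain C of the resulting three-tier loop, where `LANES` and `UNROLL` are hypothetical stand-ins for `vstep` and `@unroll@`:

```c
#include <stdio.h>

#define LANES  4   /* stand-in for vstep (npyv_nlanes_@sfx@) */
#define UNROLL 4   /* stand-in for @unroll@ */

/* Contiguous reciprocal loop mirroring the hunk's structure. */
void recip_sketch(float *dst, const float *src, int len)
{
    /* unrolled iterations: UNROLL full vectors per pass */
    for (; len >= LANES * UNROLL;
           len -= LANES * UNROLL, src += LANES * UNROLL, dst += LANES * UNROLL) {
        for (int i = 0; i < LANES * UNROLL; i++)
            dst[i] = 1.0f / src[i];
    }
    /* vector-sized iterations: full load / store, no lane masking */
    for (; len >= LANES; len -= LANES, src += LANES, dst += LANES) {
        for (int i = 0; i < LANES; i++)
            dst[i] = 1.0f / src[i];
    }
    /* last partial iteration, if needed: the only place that pays for a
     * partial load / store (and, under clang on arm64, the workaround) */
    for (int i = 0; i < len; i++)
        dst[i] = 1.0f / src[i];
}

int main(void)
{
    float src[6] = {1, 2, 4, 8, 16, 32}, dst[6];
    recip_sketch(dst, src, 6);    /* one full vector plus a 2-lane tail */
    for (int i = 0; i < 6; i++)
        printf("%g ", dst[i]);
    printf("\n");
    return 0;
}
```

Keeping the hot loops on full load / stores removes the per-iteration switch on the element count; only the tail, executed at most once, takes the masked path.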
```diff
@@ -140,20 +210,32 @@ static void simd_@TYPE@_@kind@_@STYPE@_@DTYPE@
         npyv_@sfx@ v_src0 = npyv_loadn_tillz_@sfx@(src, ssrc, len);
     #endif
     #endif
+    #if @RECIP_WORKAROUND@
+        /*
+         * Workaround clang bug. We use a dummy variable marked 'volatile'
+         * to convince clang that the entire vector is needed. We only
+         * want to do this for the last iteration / partial load-store of
+         * the loop since 'volatile' forces a refresh of the contents.
+         */
+        volatile npyv_@sfx@ unused_but_workaround_bug = v_src0;
+    #endif // @RECIP_WORKAROUND@
         npyv_@sfx@ v_unary0 = npyv_@intr@_@sfx@(v_src0);
     #if @DTYPE@ == CONTIG
         npyv_store_till_@sfx@(dst, len, v_unary0);
     #else
         npyv_storen_till_@sfx@(dst, sdst, len, v_unary0);
     #endif
     }
+
     npyv_cleanup();
 }
 /**end repeat2**/
 /**end repeat1**/
 #endif // @VCHK@
 /**end repeat**/
 
+#undef WORKAROUND_CLANG_RECIPROCAL_BUG
+
 /********************************************************************************
  ** Defining ufunc inner functions
  ********************************************************************************/
```

0 commit comments
