BUG: Invalid value in inplace log #17761


Closed
larsoner opened this issue Nov 12, 2020 · 23 comments · Fixed by #17763

Comments

@larsoner
Contributor

Reproducing code example:

I cannot reproduce on my system (yet).

Error message:

Our Travis Linux builds hit a RuntimeWarning they did not hit previously (with no relevant code changes on our end):

mne/filter.py:387: in _construct_fir_filter
    h = minimum_phase(h)
../../../virtualenv/python3.8.1/lib/python3.8/site-packages/scipy/signal/fir_filter_design.py:1246: in minimum_phase
    np.log(h_temp, out=h_temp)
E   RuntimeWarning: invalid value encountered in log

The error above is in a SciPy call, but it appears to happen anywhere we do np.log(x, out=x); here is another instance:

mne/preprocessing/nirs/_optical_density.py:43: in optical_density
    np.log(raw._data[ii], out=raw._data[ii])
E   RuntimeWarning: invalid value encountered in log

If it's not clear from looking at the past few days of NumPy merges what the culprit might be, I can try to get more information out of Travis about what's actually in the offending arrays.

NumPy/Python version information:

This is on numpy-1.20.0.dev0+0ffaaf8 / 0ffaaf8 from Thursday (today) on pypi.anaconda.org/scipy-wheels-nightly. Previous builds on numpy-1.20.0.dev0+5f071c6 / 5f071c6 from Monday didn't have this problem.

@mattip
Member
mattip commented Nov 12, 2020

Do you happen to know the dtype of the input? Is it always the same dtype? We did change many of the SIMD-enhanced ufuncs over the past couple of months, but if I recall correctly float_log is not one of them.

@larsoner
Contributor Author

It should be float64 in both cases.

@larsoner
Contributor Author

... and in the SciPy code it even protects against anything close to zero:

        h_temp = np.abs(fft(h, n_fft))
        # take 0.25*log(|H|**2) = 0.5*log(|H|)
        h_temp += 1e-7 * h_temp[h_temp > 0].min()  # don't let log blow up
        np.log(h_temp, out=h_temp)  # <-- the problematic call

@mattip
Member
mattip commented Nov 12, 2020

It may be that the error state is getting set somewhere else and only getting checked after the call to np.log.
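One way to test that hypothesis (a hypothetical debugging aid, not something suggested in the thread) is to promote the warning to an exception around just the suspect call, so the traceback points at the exact ufunc that sets the flag:

import numpy as np

x = np.abs(np.random.randn(16)) + 1e-7   # strictly positive stand-in, so log itself should be clean
with np.errstate(invalid='raise'):        # turn "invalid value" into FloatingPointError for this call only
    np.log(x, out=x)                      # if this raises, the flag comes from this very call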

@larsoner
Contributor Author

In the SciPy and MNE code there are x += and x /= operations, respectively, immediately preceding the np.log(x, out=x) call. Locally with some toy arrays I cannot replicate, but it could be that the SIMD extensions on my machine don't match those of Travis, which I guess could matter here.

@larsoner
Contributor Author

And actually it looks like this is causing failures in SciPy directly as well:

https://travis-ci.org/github/scipy/scipy/jobs/743059506#L19563-L19579

That is a build for a PR that did not change the minimum_phase function or anything related to it.

@mattip
Member
mattip commented Nov 12, 2020

Between 5f071c6 and 0ffaaf8 it seems that PR gh-16247 might be the most relevant. It touched sqrt, abs, recip, square.

@larsoner
Contributor Author

One option would be to open a SciPy PR that, in one of the NumPy-nightly Travis runs, reverts that PR to see if it fixes the minimum_phase test. Do you want to give it a shot? If not, I can try to do it.

@larsoner
Contributor Author

(This would at least tell us if it's the culprit)

@charris charris added this to the 1.20.0 release milestone Nov 12, 2020
@seberg
Member
seberg commented Nov 12, 2020

@mattip we clear the floating point flags before each ufunc call, so I do not think the flags can be set anywhere else (unless there is casting involved, which would require strange input like integers, if it is possible at all).

@mattip
Member
mattip commented Nov 12, 2020

Sure, if it is not too much bother, go ahead. I think we will need a compact reproducer anyway, so it would be best to reproduce locally and cut the test down to a minimum. If it is that PR, we need an AVX512 machine.

@larsoner
Contributor Author

I think we will need a compact reproducer anyway, so it would be best to reproduce locally and cut the test down to a minimum.

Agreed, that makes sense. But perhaps it would be worth reverting gh-16247, if it does indeed fix the problem, so that downstream libraries don't have to live with the bug. Then, once a compact reproducer exists, gh-16247 can be revived and fixed?

If it is that PR, we need an AVX512 machine.

FWIW my machine has AVX512 I think:

NumPy version 1.20.0.dev0+44d66b8
NumPy relaxed strides checking option: True
NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F? AVX512CD? AVX512_KNL? AVX512_KNM? AVX512_SKX? AVX512_CLX? AVX512_CNL? AVX512_ICL?

And I cannot reproduce. I compile with extra_compile_args = -march=native if it matters.

@larsoner
Contributor Author

... and I cannot reproduce when installing that pip version either.

@mattip
Member
mattip commented Nov 12, 2020

Can you see what NumPy CPU features: are reported for the failing builds? You might need to add

from numpy._pytesttester import _show_numpy_info
_show_numpy_info()

@mattip
Member
mattip commented Nov 12, 2020

(This needs to be improved for the release; it is part of PR gh-17737.)

@seberg
Member
seberg commented Nov 12, 2020

Sorry for the noise.

Could you try to save the h_temp array as an npz (maybe also before the guard line)? One silly thing I notice: h_temp += 1e-7 * h_temp[h_temp > 0].min()  # don't let log blow up might not work if you have subnormal numbers, since the minimum could underflow to zero when multiplied by 1e-7. So at least in theory that line could do nothing.

EDIT: Well, probably more noise; even if there were a 0 in there, the warning should be "divide by zero", not "invalid value".
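For what it's worth, the underflow scenario is easy to show with a toy array (hypothetical values, not the real h_temp): if the smallest positive entry is subnormal, the guard adds exactly zero and a true zero still reaches the log:

import numpy as np

tiny = np.nextafter(0.0, 1.0)                  # smallest positive subnormal double (~5e-324)
print(1e-7 * tiny)                             # 0.0 -- the product underflows
h_temp = np.array([0.0, tiny, 1.0])            # toy stand-in for the SciPy array
h_temp += 1e-7 * h_temp[h_temp > 0].min()      # adds 0.0, so the guard does nothing here
print(np.log(h_temp))                          # log(0) -> -inf with a "divide by zero" warning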

@seberg
Member
seberg commented Nov 12, 2020

One last bit of noise, but I think we should ping @r-devulap.

I am a bit dubious about this code:

divide_by_zero_mask = divide_by_zero_mask | (zero_mask & load_mask);
invalid_mask = invalid_mask | negx_mask;

since the negx_mask here does not have the & load_mask (I just can't see why it should differ from the earlier version). And x, at least in principle (e.g. if the array is short), can be uninitialized. So my best guess is that we change it to:

 divide_by_zero_mask = divide_by_zero_mask | (zero_mask & load_mask); 
 invalid_mask = invalid_mask | (negx_mask & load_mask); 

Of course I wasn't able to trigger that (e.g. with a short array or similar).

@larsoner
Contributor Author

Thank goodness I could reproduce on CircleCI, where I can run an interactive SSH session to poke around afterward. In theory others can, too. In any case, the output there is:

NumPy version 1.20.0.dev0+0ffaaf8
NumPy relaxed strides checking option: True
NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F* AVX512CD* AVX512_KNL? AVX512_KNM? AVX512_SKX* AVX512_CLX? AVX512_CNL? AVX512_ICL?

I then shortened one of our tests, added before = raw._data[ii].copy(), and got this in interactive debug:

mne/preprocessing/tests/test_optical_density.py:29: in test_optical_density
    raw = optical_density(raw)
mne/preprocessing/nirs/_optical_density.py:47: in optical_density
    np.log(raw._data[ii], out=raw._data[ii])
E   RuntimeWarning: invalid value encountered in log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB post_mortem (IO-capturing turned off) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> /home/circleci/project/mne/preprocessing/nirs/_optical_density.py(47)optical_density()
-> np.log(raw._data[ii], out=raw._data[ii])
(Pdb) p before
array([0.98274385, 0.97102981, 1.00724412, 0.89445025, 1.04150874,
       1.11441121, 0.85187457, 0.97669253, 0.95033699, 1.08958501,
       0.9713925 , 1.17180138, 0.97692903])
(Pdb) p raw._data[ii]
array([        nan,         nan, -4.93117598, -0.11154599, -3.2022553 ,
        0.10832621, -0.16031598,         nan,         nan,  0.0857969 ,
               nan,  0.15854221,         nan])

After some more whittling, even this does it:

$ python -c "import numpy as np; x = np.array([0.98274385, 0.97102981]); np.log(x, out=x)"
<string>:1: RuntimeWarning: invalid value encountered in log

If there is only one element, there is no error. If out=x is omitted, there is no error.

I then went a step further: sudo apt-get install libopenblas-dev, cloned numpy, pip install cython, then installed it with pip install --no-use-pep517 -ve ., and replicated the error. Then I ran git bisect, iterating over my pip install and python -c lines, and found 34d2672 to be the culprit. Hopefully this, together with knowing the conditions (>= 2 elements, in-place log), is enough to fix the problem. Or if not, enough to justify reverting 34d2672 until it can be figured out.

@larsoner
Contributor Author

... and I can confirm that a git revert 34d2672 on top of latest master makes things work again.

I am out of my depth on this, but in that commit message, this line in particular seems suspicious, due to the SIMD and in-place requirements for the failure:

- fix SIMD memory overlap check for aliasing(same ptr & stride)

@seberg
Member
seberg commented Nov 12, 2020

OK, I am 99% sure I found the issue. The glibc fallback loop reuses the previous values if they are close to 1 (where the fast algorithm misbehaves).

The values that end up invalid are exactly those: the m1 & m2 mask will use the already-updated values (basically running the whole thing a second time), because the new commit made the fast loops get used for in-place operations as well.

So what is needed here is a load_mask = load_mask & glibc_mask to ensure those values are not overwritten.

To be honest, there are more bugs here: glibc_mask must also be masked with load_mask, or it may read beyond the end of the array. And the invalid_mask I mentioned above may also be wrong...
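A rough NumPy-level sketch of that failure mode (my own illustration, not code from the PR): for inputs just below 1, the fallback effectively takes the log of the already-overwritten output, and the log of a small negative intermediate is invalid, which matches the before/after arrays from the debug session above:

import numpy as np

before = np.array([0.98274385, 1.00724412])    # values taken from the debug session above
once = np.log(before)                          # [-0.0174...,  0.00722...]
twice = np.log(once)                           # [nan, -4.9311...] -- "invalid value" for the negative entry
print(once, twice)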

@mattip
Member
mattip commented Nov 12, 2020

ping @seiko2plus

@seiko2plus
Member
seiko2plus commented Nov 12, 2020

@larsoner, my bad, this bug is indeed caused by #16247 allowing vectorized aliasing (same pointer and stride), which leads to what @seberg mentioned: the previous values are reused during the fallback to libc.

I will release a fix for it. As a quick workaround, you can disable AVX512 at runtime via the environment variable NPY_DISABLE_CPU_FEATURES:

export NPY_DISABLE_CPU_FEATURES="AVX512F AVX512_SKX"
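If that route is taken, a quick way to confirm the features were actually disabled (assuming the 1.20 dev build exposes the __cpu_features__ dict) is:

# Run in the same shell where NPY_DISABLE_CPU_FEATURES was exported.
from numpy.core._multiarray_umath import __cpu_features__
print(__cpu_features__['AVX512F'], __cpu_features__['AVX512_SKX'])   # expect: False False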

@charris charris added the 09 - Backport-Candidate PRs tagged should be backported label Nov 12, 2020
@charris charris removed this from the 1.20.0 release milestone Nov 12, 2020
@charris charris added this to the 1.19.5 release milestone Nov 12, 2020
@r-devulap
Member

@larsoner thanks for narrowing down the commit id!
