BUG: `np.nonzero` outputs too large indices for boolean matrices · Issue #23196 · numpy/numpy · GitHub
BUG: np.nonzero outputs too large indices for boolean matrices #23196


Closed
rank-and-files opened this issue Feb 10, 2023 · 11 comments

@rank-and-files
rank-and-files commented Feb 10, 2023

Describe the issue:

When calling np.nonzero on some boolean matrices, the returned indices are sometimes larger than should be possible
(for example 1152921504606852819, when they should be smaller than 12000).

The code example below reproduces this bug.
In the script below, note that one can set FORK to False and the bug still occurs, but it takes longer (for me it took at most around 17000 iterations for a similar script).
I tried this on two different machines, both running 20.04.1-Ubuntu with around 32 GB of RAM.
It happens with different numpy versions, including the latest one (1.24.2) and different ways of installing numpy (conda, pip and poetry).

With FORK = True in the script below, the error shows up for me after only 20 iterations (output 20 in stdout). If it does not, pressing Ctrl + C and trying again has proven effective.

Reproduce the code example:

import os

import numpy as np

FORK = True


def main():

    np.random.seed(4321)

    if FORK:
        pid = os.fork()
        np.random.seed(4321)
        if pid > 0:
            np.random.seed(1234)
            pid = os.fork()
            if pid > 0:
                np.random.seed(123)
                pid = os.fork()
                if pid > 0:
                    np.random.seed(321)
                    pid = os.fork()
                    if pid > 0:
                        np.random.seed(12)
                        pid = os.fork()
                        if pid > 0:
                            np.random.seed(21)
                            pid = os.fork()
                            if pid > 0:
                                np.random.seed(1)

    count = 0
    while True:
        count += 1
        if count % 10 == 0:
            print(count)
        random_num_one = np.random.randint(6000, 8000)
        random_num_two = np.random.randint(10000, 12000)

        self_offsets = np.zeros((random_num_one, random_num_two, 2))

        random_arr = np.random.random((random_num_one, random_num_two))
        mask = random_arr >= 0.5
        ys_rel, xs_rel = np.nonzero(mask)

        if np.max(xs_rel) > random_num_two:
            raise Exception(f"This should not happen: {np.max(xs_rel)} > {random_num_two}")

        if np.max(ys_rel) > random_num_one:
            raise Exception(f"This should not happen: {np.max(ys_rel)} > {random_num_one}")


if __name__ == "__main__":
    main()

Error message:

Traceback (most recent call last):
  File "scripts/minimal_new.py", line 52, in <module>
    main()
  File "scripts/minimal_new.py", line 45, in main
    raise Exception(f"numpy does not work: {np.max(xs)} > {random_num_two}")
Exception: This should not happen: 1152921504606852819 > 10326

Runtime information:

Output of import sys, numpy; print(numpy.__version__); print(sys.version)

1.24.2
3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0]

Output of print(numpy.show_runtime())

[{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2'],
                      'not_found': ['AVX512F',
                                    'AVX512CD',
                                    'AVX512_KNL',
                                    'AVX512_KNM',
                                    'AVX512_SKX',
                                    'AVX512_CLX',
                                    'AVX512_CNL',
                                    'AVX512_ICL']}},
 {'architecture': 'Haswell',
  'filepath': '/home/benr/.cache/pypoetry/virtualenvs/projectname-6Jmumlav-py3.8/lib/python3.8/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so',
  'internal_api': 'openblas',
  'num_threads': 16,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.21'}]

Operating system

Linux 5.15.0-58-generic #64~20.04.1-Ubuntu

Context for the issue:

The usage is in the context of data loading for the analysis of images (semantic segmentation).
I cannot work without getting the indices where these kinds of matrices are nonzero, so I'm forced to use workarounds.

@rank-and-files rank-and-files changed the title BUG: <np.nonzero outputs too large indices for boolean matrices> BUG: np.nonzero outputs too large indices for boolean matrices Feb 10, 2023
@rkern
Member
rkern commented Feb 10, 2023

I cannot reproduce this. Try printing out the seed in each process each time you print out the count, so we can figure out which seed is producing the problematic array within 20 iterations and avoid all this forking.

    if FORK:
        pid = os.fork()
        seed = 4321
        np.random.seed(seed)
        if pid > 0:
            ...

    count = 0
    while True:
        count += 1
        if count % 10 == 0:
            print(f"{seed=} {count=}")
        ...

You can also try saving the data at the end so we can try to reproduce in the case that it's a problem with the actual values in the array.

        if np.max(xs_rel) > random_num_two:
            np.savez('error.npz', random_arr=random_arr, mask=mask, xs_rel=xs_rel, ys_rel=ys_rel)
            raise Exception(f"This should not happen: {np.max(xs_rel)} > {random_num_two}")

        if np.max(ys_rel) > random_num_one:
            np.savez('error.npz', random_arr=random_arr, mask=mask, xs_rel=xs_rel, ys_rel=ys_rel)
            raise Exception(f"This should not happen: {np.max(ys_rel)} > {random_num_one}")

You mention that it sometimes doesn't happen ("it has proven effective to Ctrl + C and try again."), so I do wonder how reproducible this is going to be even if we do narrow down the values in the array.

@rank-and-files
Author
rank-and-files commented Feb 13, 2023

I tested some more and found that the randomness is not necessary, and neither is the array being boolean.
Consider the following script.

import numpy as np


def main():

    count = 0
    while True:
        count += 1
        print(count)

        num_one = 8000
        num_two = 6000

        arr = np.ones((num_one, num_two))
        ys, xs = np.nonzero(arr)

        if np.max(xs) > num_two:
            raise Exception(f"This should not happen: {np.max(xs)} > {num_two}")

        if np.max(ys) > num_one:
            raise Exception(f"This should not happen: {np.max(ys)} > {num_one}")


if __name__ == "__main__":
    main()

This throws an error almost instantly on both machines I tested on.
Output machine one:

1
Traceback (most recent call last):
  File "scripts/minimal_new.py", line 25, in <module>
    main()
  File "scripts/minimal_new.py", line 18, in main
    raise Exception(f"This should not happen: {np.max(xs)} > {num_two}")
Exception: This should not happen: 1152921504606848384 > 6000

Output machine two:

1
2
3
Traceback (most recent call last):
  File "minimal_new.py", line 25, in <module>
    main()
  File "minimal_new.py", line 18, in main
    raise Exception(f"This should not happen: {np.max(xs)} > {num_two}")
Exception: This should not happen: 549755814222 > 6000

@seberg
Member
seberg commented Feb 13, 2023

Unfortunately, I also cannot reproduce this on either of my computers, even trying Python 3.8 (and running within valgrind). The hardware on one looks rather equivalent to yours (same instruction sets in show_runtime()).

Is there anything else noteworthy about your setup, for example use of a virtual machine?

Since you are on Ubuntu, maybe you can try running PYTHONMALLOC=malloc valgrind python3 test.py and see if it gives any output beyond Warning: set address range perms? (If it does, we will have something to work with.)
EDIT: You will probably have to install valgrind, but that should hopefully be very straightforward.

Additionally or alternatively, can you also check the result of np.count_nonzero, to narrow down whether the result calculation is wrong or whether counting the number of nonzero elements already fails?
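That suggested cross-check could be sketched like this (a minimal sketch, using a much smaller array than in the report):

```python
import numpy as np

# Minimal sketch of the suggested cross-check (smaller array than in the
# report): count_nonzero should agree with the lengths of the index arrays
# returned by nonzero, and every returned index must be in bounds.
arr = np.ones((800, 600))
ys, xs = np.nonzero(arr)

n = np.count_nonzero(arr)
assert len(xs) == n and len(ys) == n   # counting pass agrees
assert xs.max() < arr.shape[1]         # indices in bounds
assert ys.max() < arr.shape[0]
print("nonzero results consistent:", n)
```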

@rank-and-files
Author
rank-and-files commented Feb 13, 2023

valgrind
Thank you for your message. Using valgrind, the only other output is:

==326118== Conditional jump or move depends on uninitialised value(s)         
==326118==    at 0x58B40F: PyUnicode_Decode (in /usr/bin/python3.8)                             
==326118==    by 0x58B764: PyUnicode_FromEncodedObject (in /usr/bin/python3.8)                                                                                                                  
==326118==    by 0x577A7A: ??? (in /usr/bin/python3.8)                                                                                                                                          
==326118==    by 0x5F65F2: _PyObject_MakeTpCall (in /usr/bin/python3.8)                         
==326118==    by 0x570D33: _PyEval_EvalFrameDefault (in /usr/bin/python3.8)                     
==326118==    by 0x5F5EE5: _PyFunction_Vectorcall (in /usr/bin/python3.8)                      
==326118==    by 0x59C093: ??? (in /usr/bin/python3.8)                                          
==326118==    by 0x5F666E: _PyObject_MakeTpCall (in /usr/bin/python3.8)                         
==326118==    by 0x570D33: _PyEval_EvalFrameDefault (in /usr/bin/python3.8)                     
==326118==    by 0x569D89: _PyEval_EvalCodeWithName (in /usr/bin/python3.8)                     
==326118==    by 0x5F60C2: _PyFunction_Vectorcall (in /usr/bin/python3.8)                       
==326118==    by 0x570B81: _PyEval_EvalFrameDefault (in /usr/bin/python3.8)

This is output even before the first printed number, so it should be before the loop.

same problem in pytorch

I also note that I get the same problem using similar code in pytorch. I'm not sure if pytorch calls numpy functions under the hood, but if it doesn't, I think the issue is likely not related to numpy and we can close it.

setup

As far as I know there are no noteworthy things. I don't think I'm running on a VM, but I will check with my co-workers.

We will reinstall the OS and try again.

np.count_nonzero

np.count_nonzero returns the right number, and the shapes of the arrays xs and ys are also correct.
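Given that the counts and shapes come out right, a next step could be to locate exactly which positions in the returned index arrays hold out-of-range values. A minimal sketch (the small random mask here is only a stand-in for the failing array):

```python
import numpy as np

# Sketch for localizing corruption: since count_nonzero and the result
# shapes are correct, check which positions in xs/ys hold out-of-range
# values ('mask' here is a small stand-in for the failing boolean array).
rng = np.random.default_rng(0)
mask = rng.random((80, 60)) >= 0.5
ys, xs = np.nonzero(mask)

bad = np.flatnonzero((xs >= mask.shape[1]) | (ys >= mask.shape[0]))
print(f"{bad.size} out-of-range entries")  # 0 on a healthy machine
```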

@seberg
Member
seberg commented Feb 13, 2023

Yeah, that warning is probably something harmless happening at import time (a bit surprised that I don't see it right now, but warnings like these are pretty familiar at startup).

A similar error in pytorch seems surprising, but I also do not know if pytorch might use NumPy (I would be surprised, but no idea).

One other thing to try is disabling the SIMD extensions in use, starting with the most advanced one found, by setting the environment variable NPY_DISABLE_CPU_FEATURES=AVX2 and adding features until you are left with only the baseline (compare the np.show_runtime() output). See here: https://numpy.org/devdocs/user/troubleshooting-importerror.html#segfaults-or-crashes

Are the two machines you tried identical? Could you give the full CPU details (not sure it helps, but until someone can reproduce the issue, any information seems good)?
We had one weird case where someone updated the CPU microcode and a computer crash went away... In any case, I hope that NPY_DISABLE_CPU_FEATURES might give us an idea, although I am not 100% sure it is relevant for this exact code.

@WarrenWeckesser
Member

The bit patterns of the bad values are interesting (but maybe irrelevant):

Exception: This should not happen: 1152921504606852819 > 10326

In [58]: bin(1152921504606852819)
Out[58]: '0b1000000000000000000000000000000000000000000000001011011010011'

Exception: This should not happen: 1152921504606848384 > 6000

In [60]: bin(1152921504606848384)
Out[60]: '0b1000000000000000000000000000000000000000000000000010110000000'

Exception: This should not happen: 549755814222 > 6000

In [64]: bin(549755814222)
Out[64]: '0b1000000000000000000000000000000101001110'
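Extending that observation (my own arithmetic, not a conclusion from the thread): clearing the single stray top bit of each bad value leaves an index that is in range for the corresponding array, which fits a flipped bit better than an off-by-N software error:

```python
# Each reported "impossible" value is a single stray high bit plus an
# otherwise plausible index (my own arithmetic extending the bit-pattern
# observation above, not a conclusion from the thread).
bad_values = [
    (1152921504606852819, 10326),  # first report
    (1152921504606848384, 6000),   # machine one
    (549755814222, 6000),          # machine two
]
for value, limit in bad_values:
    top = 1 << (value.bit_length() - 1)  # the stray high bit (2**60 or 2**39)
    rest = value ^ top                   # clear just that one bit
    print(f"{value} = 2**{value.bit_length() - 1} + {rest}; in range: {rest < limit}")
```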

@rank-and-files
Author
rank-and-files commented Feb 13, 2023

Thank you all very much for your help, we reinstalled the OS and the error is not observable any more.
So I think the issue is probably not connected to numpy and I will close it.
Of course, if you feel further investigation is needed, reach out and I will provide more details.

@seberg
Member
seberg commented Feb 13, 2023

Thank you all very much for your help, we reinstalled the OS and the error is not observable any more.

If you have any information about the bad setup, that would be great, though I don't know what exactly we would look at (maybe the CPU), especially since a software update apparently helped... Just in case, so we can narrow it down for the next person stumbling into a hard-to-find issue.

Issues that were fixed by platform updates (in one case CPU microcode) have been reported in the recent past, which is a bit scary...

@rank-and-files
Author
rank-and-files commented Feb 15, 2023

After using the machine for some time, we saw that the problem is still there, but only for higher parameters num_one and num_two, e.g. 18000 and 16000 respectively.
I still don't think it is strictly related to numpy and will not reopen the issue (it occurred for similar methods in OpenCV and torch as well).
The CPU on both machines is Intel(R) Core(TM) i7-10700K.
We tested on other machines with different CPUs and could not reproduce the error.
We also tested on another machine with the same CPU, and the error occurred there as well.

@seberg
Member
seberg commented Feb 15, 2023

@bsen the only idea I have is the old one of disabling the use of some advanced SIMD instructions, e.g. starting with NPY_DISABLE_CPU_FEATURES=AVX2.
Updating the microcode might also be an option in principle; in gh-23082 it supposedly helped, but I admit this is fishing.

Anyway, thanks for following up!

@WarrenWeckesser
Member

Shot in the dark: try running a rigorous RAM check on the machines where the problem occurs. It seems unlikely that two or three machines would start having bad memory at the same time, but at least you could rule out that potential source of problems.
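A crude userspace spot-check could look like the sketch below. It is no substitute for a proper offline tool such as memtest86+, and the buffer size here is an arbitrary choice:

```python
import numpy as np

# Crude userspace spot-check (no substitute for a proper offline memory
# test): fill a buffer with a known pattern and verify it reads back
# unchanged. Increase N to cover more RAM.
N = 1 << 20
pattern = np.uint64(0xA5A5A5A5A5A5A5A5)
buf = np.full(N, pattern, dtype=np.uint64)

bad = np.flatnonzero(buf != pattern)
print(f"{bad.size} corrupted words out of {N}")  # expect 0
```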
