8000 BUG: Fix incorrect/missing reference cleanups found using valgrind by seberg · Pull Request #12624 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

BUG: Fix incorrect/missing reference cleanups found using valgrind #12624

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jan 2, 2019

Conversation

seberg
Copy link
Member
@seberg seberg commented Dec 27, 2018

I may add more fixes, but right now the output is pretty spammed by some errors for not cleaned up dtypes and I am not sure where that comes from. (compare also gh-12615)


Edit, track things that still leak. I do not think we should wait for all of these to be tracked down, some of them are probably very difficult to find.

  • Array objects seem are being leaked in
    • test_nditer.py
    • test_arrayprint.py (or maybe it were only dtypes to begin with)
  • Rare python float being leaked in masked array tests, created during dtype with fields copy or similar.
  • (at least mostly done, rerun of tests necessary) dtypes leak in some places, it is plausible that many are for the same reasons:
    • test_arrayprint.py
    • test_defchararray.py
    • test_ufuncs.py
    • test_memoverlap.py.
  • Tuples, for example in test_recfunctions.py (maybe some tuple stored but not cleaned up by dtypes?) and test_ufuncs.py.
  • test_ufunc.pyleaks https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L2666
    (probably just the free missing).
  • Won't fix now: Most /many (not test_array_from_pyobj.py maybe more) f2py tests leak PyFortranObjects
  • A few leaks datetime related, e.g. test_datetime.py
  • ... Probably some I forgot
  • Won't investigate I see a python leak always, may check python-dbg build to see where it comes from.

Some tests are needed:

  • [ ] Object dtype subarray reference count test.
  • … please add more if you want.

Hack I am using right now is to run the following script (after cleaning up and running pytest to cache the build with debugging):

#!/usr/bin/env python

import sys
import glob
from subprocess import call

if len(sys.argv) > 1:
    tests = sys.argv[1:]
else:
    tests = glob.glob("numpy/**/tests/test*.py")

for test in tests:
    print("Running test:", test)
    call("PYTHONMALLOC=malloc valgrind --show-leak-kinds=definite --leak-check=full "
         "python runtests.py -t {} -- --continue-on-collection-errors &> results/{}".format(
              test, test.replace("/", "_")), shell=True)

to get at least test file specific output of the leaks (of course you can bisect to find a test that triggers manually then, but it is pretty time consuming), would be need to have a better way, but probably not important enough.

/*
* Note that the buffer info is cached as pyints making them appear like
* unreachable lost memory to valgrind.
*/
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly, right now I do not understand the logic of the cache, but I did not try to either and I guess they are very small objects.

@seberg seberg force-pushed the valgrind-testing-leaks branch from c2873d9 to 0e753dc Compare December 28, 2018 00:42
@seberg
Copy link
Member Author
seberg commented Dec 28, 2018

I guess some of these could get tests...

@@ -3255,6 +3255,7 @@ array_datetime_data(PyObject *NPY_UNUSED(dummy), PyObject *args)
}

meta = get_datetime_metadata_from_dtype(dtype);
Py_DECREF(dtype);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably one of the few ones that could hurt in practice, if someone prints a lot of datetime objects for some reason.

@seberg
Copy link
Member Author
seberg commented Dec 28, 2018

I wonder if there is a nice way to hook pytest to run some special code for each test or simply more specific tests (after maybe just listing them all). To be able to do these runs without a manual bisect of the tests in the file.

@seberg
Copy link
Member Author
seberg commented Dec 28, 2018

Hmmm, the following code leaks a dtype:

def test_it():
    # dtype=np.string_ makes no difference
    np.array(np.array('abc1', dtype='c').view(np.chararray), copy=False)

Where the np.array call causes the leak and the view is required to cause the leak. EDIT: subok has to be False, so it will not use the fast path. I guess the bug may be in the fast path.

EDIT: OK found the issue.

@seberg seberg force-pushed the valgrind-testing-leaks branch from 1d82476 to eae11d4 Compare December 28, 2018 16:34
@charris
Copy link
Member
charris commented Dec 28, 2018

@seberg Would you recommend backporting this at some point?

@seberg
Copy link
Member Author
seberg commented Dec 28, 2018

Yeah, I would say so. Most of them are small or very small fish though, so if we want to keep things minimal we could pick out the bigger ones maybe.

@seberg
Copy link
Member Author
seberg commented Dec 28, 2018

Frankly, a lot of these are even irrelevant except to keep the output clean, such as errors in the tests.

@seberg
Copy link
Member Author
seberg commented Dec 28, 2018

The bug in the memory overlap tests worries me the most right now. Because it seems to me like we may be losing whole arrays due to memory overlap checks sometimes.

EDIT: We were, for reductions with overlap.

@seberg
Copy link
Member Author
seberg commented Dec 28, 2018

Ah, a few of the odder/bigger errors are due to circular references that are not manually resolved (and numpy's lack of support).

@seberg
Copy link
Member 8000 Author
seberg commented Dec 29, 2018

So, a few of the leaks could in principle be non-harmless, though I guess rarely. Since I do not remember all fixes that were found while searching for other things, we should use the code coverage to judge if new tests are needed anywhere or maybe if not to backport something.

Py_DECREF(args);
if (item == NULL) {
return 0;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code above does not seem super pretty, but I am not quite sure that subclasses are not supposed to get a chance here, so PyObject_GetItem is no behaviour change I think. Oddly the datetime code looks a bit different than the timedelta one above.

Copy link
Contributor
@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seberg - I only looked in detail at umath, but those certainly are good (apologies for the one I introduced...).

@seberg
Copy link
Member Author
seberg commented Dec 29, 2018

Things are still running, but the current status is probably:

  • There are still leaks of Python float objects in ma/tests/test_core.py which sounds a lot like a bug.
  • The buffer_info is leaked, hopefully only because of the cache (so not a true leak)
  • f2py tests leak a lot, some of which is PyFortranObject_NewAsAttr some of it may be unrelated to f2py, but that seems unlikely. And the non PyFotranObject stuff may be worse.
  • All files show a leak which I think is due to python. Some show a leak during some module/so loading, I hope this is python related as well, such as a local import statement or so, but did not attempt to narrow it down (not sure I will)

EDIT: add last point.

@seberg
Copy link
Member Author
seberg commented Dec 29, 2018

That last one (not counting f2py and probably python related ones) is seriously strange, because I can't seem to bisect the test that fails. It is as if there is some crosstalk between multiple tests. Hmmm, lets poke it a bit more, but I may just give up.

@seberg
Copy link
Member Author
seberg commented Dec 30, 2018

I have cooked down ma.test_core to this:

import numpy as np
from numpy.ma.core import masked_array

def test_fancy_printoptions():
    t_2d0 = masked_array(data = (0, [[0.0, 0.0, 0.0],
                                    [0.0, 0.0, 0.0]],
                                0.0),
                         mask = (False, [[True, False, True],
                                         [False, False, True]],
                                 False),
                         dtype = "int, (2,3)float, float")
    str(t_2d0)  # repr also works, str(t_2d0.data) does *not*.

    # Removing a single item from the below list fixes it!
    # Changing the types between float/int does not change anything
    np.array([1., 0., 1., 2., 1., 0., 1., 3.])

and yes the np.array call is in a way a leftover from a different test (although somewhat changed). Changing the masked array to a normal one does not help, trying to simplify the dtype does not really seem to help either. The str call of course is important.

EDIT: "does not help" meaning "does not work to get a simpler example reproducing the lost behaviour". What is lost here are actually float objects, some of which seem created by a numpy cast from object to float. But I am not sure that is guaranteed.

So, anyone genius enough to squirt at this and see something that might lead anywhere? It seems completely absurd to me right now, especially why the last array call and the list length can be important.

@seberg
Copy link
Member Author
seberg commented Dec 30, 2018

Sqirting at the cpython code, pyfloats have a list of float objects flying around, probably there are some shenanigans going on there.

@seberg
Copy link
Member Author
seberg commented Dec 30, 2018

OK, can't say I fully understand it, but my current guess is that this is because of pythons list of discarded float objects to avoid memory allocations when creating new ones. Since this abuses the slots of the object for linking, I am not sure valgrind would find those as "reachable".

On the other hand.... if that is (which I assume), why do we not see a similar thing constantly?

@ahaldane
Copy link
Member

I took a look since I was touching the float/ma code. Here's a shorter test where I see the leak:

import numpy as np

res = np.array((0, [[0., 0., 0.], [0., 0., 0.]], 0),
               dtype=[('f0', 'O'), ('f1', 'O', (2, 3)), ('f2', 'O')])
del res
np.array([1., 0., 1., 2., 1., 0., 1., 3.])

I'd bet is has to do with the deallocation of the subarray, which has to decref a bunch of floats.

@ahaldane
Copy link
Member

Yup, PyArray_Item_XDECREF is definitely leaking because it doesn't account for subarrays at all. Not sure why allocating another array right after silences the leak.

@ahaldane
Copy link
Member

Rough possible code, adapted from dtype_transfer.c. The INCREF function above also fails to incref subarray elements so needs similar updates.

    if (PyDataType_HASSUBARRAY(dtype)) {
        PyArray_Dims shape = {NULL, -1};
        npy_intp size = 1;

        if (!(PyArray_IntpConverter(dtype->subarray->shape, &shape))) {
            PyErr_SetString(PyExc_ValueError,
            "invalid subarray shape");
            return;
        }
        size = PyArray_MultiplyList(shape.ptr, shape.len);

        for (i = 0; i < size; i++){
            NPY_COPY_PYOBJECT_PTR(&temp, data + i*sizeof(PyObject*));
            Py_XDECREF(temp);
        }
    }

That's as far as I'll get today.

@seberg
Copy link
Member Author
seberg commented Dec 30, 2018

Ah, very nice cooking down and tracking, sounds like it should be spot-on. Thanks @ahaldane, I guess the print only caused the object cast, makes sense. Possibly the randomness is because of the small array cache needing to be cleared, so we have to create a new array of the right size.

@seberg
Copy link
Member Author
seberg commented Dec 30, 2018

So knowing this, it is much easier to skip the valgrind part, and just put in a known float 6 times to see its refcount never decrease on array deletion.

@seberg
Copy link
Member Author
seberg commented Dec 30, 2018

@ahaldane thanks for that code snippet. I simplified it a bit and copied it in, and it seems to work. We really need some tests though. It is also a shame (and praise to masked arrays) that this is only found by the masked array test suit and not the rest of numpy.

So, with this, I think this effort is finalized, with the exception of new tests, especially for the last issue and possibly some others based on code coverage or review request (that would be nice anyway).

The f2py leaks remain, and may be substantial, but I am not quite willing to dig into it. This hunt has already been time consuming.

@seberg seberg force-pushed the valgrind-testing-leaks branch from a00d04a to 370044b Compare December 30, 2018 13:09
@mhvk
Copy link
Contributor
mhvk commented Dec 30, 2018

@seberg, @ahaldane - 🎆 from the sideline.

@seberg
Copy link
Member Author
seberg commented Jan 2, 2019

OK, I moved the subarray reference count fix to gh-12650. That should make this PR pretty manageable review wise.

@mattip
Copy link
Member
mattip commented Jan 2, 2019

LGTM, I will squash-merge in a bit (in case someone else wants to take a look).

The XDECREF is not required, the type cannot be NULL.
@seberg seberg force-pushed the valgrind-testing-leaks branch from 496cab8 to 4732c1b Compare January 2, 2019 19:17
@seberg
Copy link
Member Author
seberg commented Jan 2, 2019

@mattip added a fixup to change those lines to DECREF (and not init to NULL), but can remove that again. The DescrConverter cannot return NULL I think, that is what DescrConvert2 does.

@mattip
Copy link
Member
mattip commented Jan 2, 2019

All tests passed, including the flaky one.

@mattip mattip merged commit 0ea890a into numpy:master Jan 2, 2019
@mattip
Copy link
Member
mattip commented Jan 2, 2019

Thanks @seberg

@seberg seberg deleted the valgrind-testing-leaks branch January 2, 2019 22:23
@seberg seberg restored the valgrind-testing-leaks branch January 2, 2019 22:24
@charris
Copy link
Member
charris commented Jan 2, 2019

Nice work.

seberg added a commit to seberg/numpy that referenced this pull request Jan 9, 2019
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Jan 17, 2019
@charris charris removed this from the 1.16.1 release milestone Jan 17, 2019
charris pushed a commit to charris/numpy that referenced this pull request Jan 17, 2019
charris pushed a commit to charris/numpy that referenced this pull request Jan 17, 2019
@seberg seberg deleted the valgrind-testing-leaks branch April 13, 2019 21:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0