Creating and argsorting an array with a structured dtype leaks memory · Issue #7860 · numpy/numpy · GitHub

Closed
dpitch40 opened this issue Jul 21, 2016 · 10 comments

@dpitch40
Contributor
dpitch40 commented Jul 21, 2016

The following snippet of code consumes steadily increasing amounts of memory when run (even after the calls to gc.collect every 5000th iteration), as observed from the task manager. I have confirmed that changing the sort kind does not affect the issue.

import gc
import numpy as np

n = 0
while True:
    names = ["POH"] + \
        ["Head %d Really Long Data Field Name With Many Words" % i for i in xrange(200)]
    formats = [">f"] + [">I"] * 200
    offsets = range(0, 804, 4)

    dt = np.dtype({"names": names, "formats": formats, "offsets": offsets})
    array = np.zeros((1000,), dt)

    sortedIndices = np.argsort(array, order=["POH"])

    n += 1
    if n % 1000 == 0:
        print n
        if n % 5000 == 0:
            gc.collect()

I am running numpy 1.11.0 with Python 2.7.11 on 64-bit Windows 7.

@jaimefrio
Member

Can you debug the cause of the leak? I.e. I'm betting on either the dtype creation or the argsort itself, but it could be the array creation. So if you could move the dtype and array creation in and out of the loop and report in which of the four possible cases the leak persists it would be very helpful. Thanks!
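One way to run the experiment requested above is sketched below. It is only an illustration, written for Python 3 rather than the Python 2 of the original report. The helper names `make_dtype` and `measure_growth` are hypothetical, the dtype is shrunk to 20 fields to keep the run short, and `tracemalloc` is assumed to see the relevant allocations, which may not hold on every NumPy build. The fourth combination (dtype in the loop, array outside it) is skipped because the array would simply keep its old dtype.

```python
import gc
import tracemalloc

import numpy as np


def make_dtype(n_fields=20):
    # Same layout as the report (one ">f" field, then ">I" fields), but smaller.
    names = ["POH"] + ["Head %d Long Field Name" % i for i in range(n_fields)]
    formats = [">f"] + [">I"] * n_fields
    offsets = list(range(0, 4 * (n_fields + 1), 4))
    return np.dtype({"names": names, "formats": formats, "offsets": offsets})


def measure_growth(dtype_in_loop, array_in_loop, iterations=200):
    """Return the change in traced memory (bytes) over `iterations` argsorts."""
    dt = make_dtype()
    array = np.zeros((100,), dt)
    gc.collect()
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        if dtype_in_loop:
            dt = make_dtype()
        if array_in_loop:
            array = np.zeros((100,), dt)
        np.argsort(array, order=["POH"])
    gc.collect()
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


# (dtype_in_loop, array_in_loop) for the three meaningful combinations.
cases = [(False, False), (False, True), (True, True)]
results = {case: measure_growth(*case) for case in cases}
for (d, a), growth in results.items():
    print("dtype in loop=%s, array in loop=%s: %d bytes" % (d, a, growth))
```

On a build with the leak, the last case would be expected to grow roughly in proportion to `iterations`; small nonzero numbers in the other cases are likely just allocator noise.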

@dpitch40
Contributor Author

Removing the dtype creation from the loop appears to stop the leak (as does removing the array creation as well). I don't think removing the array creation from the loop but not the dtype creation is possible. So the only case when it leaks is when they are both in the loop, and when I argsort the array.

@seberg
Member
seberg commented Jul 22, 2016

That makes sense. argsort probably has a reference leak on the array dtype. In the other cases you described, the leak would only increase the reference count of the same dtype by one each time and thus not leak actual memory, since only that one dtype survives.

@dpitch40
Contributor Author

So do you think it could be that argsort is leaking a reference to the copy of the dtype each time? It also doesn't appear to happen if I use a simple dtype (i.e. one without named fields to argsort on).

@seberg
Member
seberg commented Jul 22, 2016

Yes, a simple dtype will just find an existing dtype and return it from a table; I believe np.dtype('i') is np.dtype('i'). It all adds up, we just need to hunt down where in the code the reference is leaked. Almost certainly a missing DECREF somewhere.
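The difference between the two kinds of dtype can be checked directly. A minimal sketch follows; the two-field dtype is an illustrative stand-in for the 201-field one in the report.

```python
import numpy as np

# Built-in dtypes are looked up from a table, so repeated calls
# hand back the same object.
simple_is_shared = np.dtype("i") is np.dtype("i")

# Structured dtypes are built fresh on each call: the two objects
# compare equal but are distinct, so each one can be leaked separately.
a = np.dtype([("POH", ">f4"), ("Head 0", ">u4")])
b = np.dtype([("POH", ">f4"), ("Head 0", ">u4")])

print(simple_is_shared)
print(a == b, a is b)
```

This is why the leak only shows up as growing memory when a structured dtype is rebuilt inside the loop: every iteration produces a new object for the leaked reference to keep alive.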

@njsmith
Member
njsmith commented Jul 22, 2016

To confirm that it's the dtype being leaked, you could put the dtype outside the loop and then call sys.getrefcount on it each time through - if that number grows without bound, then we're definitely leaking references to it.
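The check suggested above might look like the following sketch, written for Python 3 with a small stand-in dtype. The absolute value from `sys.getrefcount` is not meaningful on its own (it includes the temporary reference held by the call itself); only a steady increase across iterations would point to a leaked reference.

```python
import sys

import numpy as np

# The dtype is created once, outside the loop.
dt = np.dtype([("POH", ">f4"), ("Head 0", ">u4")])

counts = []
for _ in range(5):
    array = np.zeros((1000,), dt)
    sorted_indices = np.argsort(array, order=["POH"])
    # Drop the array and the result so that only leaked references remain.
    del array, sorted_indices
    counts.append(sys.getrefcount(dt))

print(counts)
```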

@dpitch40
Contributor Author

No, it doesn't appear to; it stays constant at 3. I tried moving the creation of names, formats, and offsets outside the loop while leaving the creation of dt inside. This also stops the leak, and the reference count of names stays constant at 2.

@seberg
Member
seberg commented Jul 25, 2016

Yes, right. What is leaking is not the dtype itself but the "POH" string.

@dpitch40
Contributor Author

Okay, now I see what you mean. Even declaring names outside the loop, the sys.getrefcount of (apparently) each entry of names increases by 1 with each iteration of the loop. But this stops if I declare dt outside the loop. So it seems like argsort is leaking each name the first time it is called with a given dtype. I'm guessing it happens faster if names is declared in the loop because then the actual objects are being leaked and not just redundant references to the same object.
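The observation above can be reproduced with a sketch along these lines, written for Python 3 with a reduced field count. The helper `name_refcounts` is hypothetical, and the names are built at run time on the assumption that this avoids the special reference counting some Python versions apply to string literals.

```python
import sys

import numpy as np

n_fields = 3
# Build every name at run time so each is an ordinary str object.
names = ["".join(["PO", "H"])] + ["Head %d" % i for i in range(n_fields)]
formats = [">f"] + [">I"] * n_fields
offsets = list(range(0, 4 * (n_fields + 1), 4))


def name_refcounts():
    # One reference count per field name, measured the same way each time.
    return [sys.getrefcount(name) for name in names]


baseline = name_refcounts()
for _ in range(5):
    # names stays outside the loop; only the dtype is rebuilt each time.
    dt = np.dtype({"names": names, "formats": formats, "offsets": offsets})
    array = np.zeros((10,), dt)
    np.argsort(array, order=[names[0]])
    del array, dt

growth = [after - before for before, after in zip(baseline, name_refcounts())]
print(growth)
```

On a NumPy build with the leak, each entry of `growth` would be expected to rise with the number of iterations; on a fixed build the entries should stay at or near zero.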

emfree added a commit to emfree/numpy that referenced this issue Aug 16, 2016
@seberg
Member
seberg commented Jan 2, 2019

This bug has been fixed in NumPy master as part of gh-12624.

EDIT: Actually, maybe for argsort it was fixed even earlier, would have to check

seberg closed this as completed Jan 2, 2019