mae DecisionTree does not free used memory after refit · Issue #7811 · scikit-learn/scikit-learn · GitHub

Closed
mbochk opened this issue Nov 2, 2016 · 24 comments
@mbochk commented Nov 2, 2016

Description

DecisionTree with the mae criterion does not free used memory.
As a result, such trees are unusable in loops, in forests, and in cross-validation.
(I originally encountered this problem while trying to use grid search with an mae forest.)

Steps/Code to Reproduce

from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeRegressor

params = dict(max_depth=range(1, 10))
params = ParameterGrid(params)
reg = DecisionTreeRegressor(criterion='mae')

# repeated several times in a Jupyter cell; xxx, yyy are the training data (not shown here)
for p in params:
    print(p)
    for i in range(100):
        reg.set_params(**p)
        reg.fit(xxx, yyy)

Expected Results

No additional memory usage after the first completion of the loop.
The cell with the 'for' loop can be run an unlimited number of times.

Replacing 'mae' with 'mse' does give the expected result.

Actual Results

Additional memory is used after each cell execution.
Deleting the tree with 'del' and forcing 'gc.collect()' does not help.
When there is no memory left, the OS (Windows) shuts down the Python process.

My guess is that there is a memory leak in the Cython code.

Versions

Windows-7-6.1.7601-SP1
Python 2.7.12 | packaged by conda-forge | (default, Aug 12 2016, 18:53:38) [MSC v.1500 64 bit (AMD64)]
NumPy 1.11.2
SciPy 0.18.0
Scikit-Learn 0.18

@lesteve (Member) commented Nov 2, 2016

What kind of data are you fitting on, i.e. what are xxx and yyy (shape, dtype mostly)? If it's not practical to share the data, it would be great if you could create a stand-alone reproducible example with some randomly generated data.

@mbochk (Author) commented Nov 2, 2016

Good point, I forgot about the data. Originally I found it with (dense) dataframes;

xxx = np.random.randn(1000, 11)
yyy = np.random.randn(1000)

works as well.

I may test it on sparse data, if you wish.
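(For reference, a minimal sketch of what such a sparse-data check could look like, assuming CSR input built with scipy.sparse.random and the same repeated-fit loop as above; illustrative only, not tested here:)

import numpy as np
import scipy.sparse as sp
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_sparse = sp.random(1000, 11, density=0.1, format='csr', random_state=rng)
y = rng.randn(1000)

# Refit repeatedly and watch process memory externally (e.g. with psutil).
reg = DecisionTreeRegressor(criterion='mae')
for i in range(100):
    reg.fit(X_sparse, y)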

@lesteve (Member) commented Nov 2, 2016

I seem to be able to reproduce on my Ubuntu laptop, although I haven't looked into it in detail. Here is a snippet:

import numpy as np

from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeRegressor

import psutil

rng = np.random.RandomState(0)
xxx = rng.randn(1000, 11)
yyy = rng.randn(1000)

params = dict(max_depth=range(1, 10))
params = ParameterGrid(params)
reg = DecisionTreeRegressor(criterion='mae')

this_process = psutil.Process()

for p in params:
    print(p)
    print(this_process.memory_info().rss / 1e6)
    for i in range(100):
        reg.set_params(**p)
        reg.fit(xxx, yyy)

Output:

{'max_depth': 1}
81.936384
{'max_depth': 2}
91.181056
{'max_depth': 3}
113.348608
{'max_depth': 4}
159.8464
{'max_depth': 5}
243.65056
{'max_depth': 6}
383.414272
{'max_depth': 7}
615.903232
{'max_depth': 8}
972.746752
{'max_depth': 9}
1471.787008

You don't see a memory usage increase if you use the default criterion.
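(For reference, a minimal variation of the snippet above that makes this comparison explicit by reporting the RSS growth per criterion over 100 fits on the same randomly generated data; a sketch only:)

import numpy as np
import psutil
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
xxx = rng.randn(1000, 11)
yyy = rng.randn(1000)
proc = psutil.Process()

for criterion in ('mse', 'mae'):
    reg = DecisionTreeRegressor(criterion=criterion, max_depth=9)
    before = proc.memory_info().rss
    for i in range(100):
        reg.fit(xxx, yyy)
    growth = (proc.memory_info().rss - before) / 1e6
    print("%s: RSS grew by %.1f MB over 100 fits" % (criterion, growth))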

@glouppe (Contributor) commented Nov 2, 2016

Just a hunch, but that certainly comes from something that isn't freed in _criterion.pyx:MAE or _utils.pyx:WeightedMedianSomething.

@nelson-liu (Contributor) commented Nov 2, 2016

I don't have the time at the moment to test whether the error is reproducible, but I'd imagine the issue stems from here: I use two numpy arrays to store objects that efficiently calculate the median for all outputs. I'd imagine Cython/Python would automatically free them, seeing as they are Python objects... if it doesn't, that could certainly cause problems.

@nelson-liu (Contributor)
yeah I agree with @glouppe, will look into it in a bit.

@lucidyan
The bug has also been reproduced by me and by others on this forum thread:
https://www.kaggle.com/c/allstate-claims-severity/forums/t/24293/sklearn-randomforestregressor-mae-criterion

@nelson-liu (Contributor)
thanks for the info @lucidyan. I think a rewrite at the Cython level is probably in order; I'd imagine there's something funky going on with how the numpy arrays of Cython objects are treated. It's expected that MAE takes longer than MSE and uses more memory, but not to this extent (not freeing the memory is obviously not intended).

To solve the memory issues, I think a reasonable change would be either to do all the computations inline without any external objects (since those seem to be causing the leak right now) or to create a struct to facilitate it.
To solve the speed issues, @jmschrei and I were talking about using a bit-masking approach to minimize the amount of expensive copying/shifting of data between the left and right candidate branches, but I haven't had the time to write up our ideas as Cython code (a rough illustration of the idea is sketched below).

I'm prohibitively busy until at least mid-January, so I wouldn't be able to get to the above until then; if some other contributor wants to tackle this, though, I'd be happy to provide guidance.
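(Purely as an illustration of the bit-masking idea mentioned above, and not scikit-learn's actual Cython implementation: moving a sample between the left and right candidate branches becomes a single flag flip on a boolean membership array instead of shifting data between buffers. A rough NumPy sketch, with all names hypothetical:)

import numpy as np

rng = np.random.RandomState(0)
y = rng.randn(1000)

# One boolean entry per sample: True = left candidate branch, False = right.
in_left = np.zeros(y.shape[0], dtype=bool)

def branch_medians(values, mask):
    # Recompute the per-branch medians from the mask; a real implementation
    # would update a weighted-median structure incrementally instead.
    left, right = values[mask], values[~mask]
    return (np.median(left) if left.size else None,
            np.median(right) if right.size else None)

# Moving the 300 smallest samples to the left branch is just flipping bits.
in_left[np.argsort(y)[:300]] = True
print(branch_medians(y, in_left))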

@lesteve (Member) commented Nov 28, 2016

@nelson-liu just curious, what kind of tools are available to investigate memory leaks within Cython code?

@nelson-liu (Contributor) commented Nov 28, 2016

if by investigate you mean debug, I think it's possible to use valgrind. I'm not super familiar with debugging memory issues in Cython, but I think it's fairly intuitive where the leak is happening here (the numpy array of Cython objects is not being freed properly)...

If you find anything that seems helpful, please let me know!

@raghavrv (Member)
I think I stumbled upon this issue of MAE not releasing memory in one of my PRs when I incorrectly rebased... And yes, valgrind is the tool for the task... I remember there was a rule set or something to triage the output and filter out the Python-interpreter-specific error messages...

raghavrv added a commit that referenced this issue Jan 18, 2017
…en raising; Propagate all errors to python interpreter level (#7811) (#8002)

* FIX MAE reg. criterion: Use safe_realloc to avoid memory leak

* Release GIL in safe_realloc and clean up scaffolding

* As gil is released in safe_realloc, no need of a with gil block

* Use except * to propagate error in all cdef functions

* Don't use except * for functions that return python objects

* Don't use except * for the comparison function passed to qsort

* Omissions and Errors

* Use safe_realloc now that gil is released there

* Fix realloc size

* Acquire GIL only if we need to raise

* Use except * more judiciously; Release gil only when raising; Add comments to clarify

* Actually that was unneeded; realloc will also allocate for the first time

* StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc

* Use except -1 to propagate exceptions. This should avoid overheads

* Fix docstrings and add return 0 to reset methods

* TYPO

* REVIEW Remove redundant MemoryError raising calls
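(The commit messages above point at C-level allocations via safe_realloc rather than Python objects. Purely as an illustration of why 'del' plus gc.collect() could not reclaim that memory, a hedged sketch using ctypes and glibc's malloc; it assumes Linux/glibc and has nothing to do with the actual tree code:)

import ctypes
import gc
import psutil

# Illustration only: memory obtained through the C allocator is invisible to
# Python's garbage collector, so del + gc.collect() cannot reclaim it.
libc = ctypes.CDLL("libc.so.6")          # assumption: Linux/glibc
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]

proc = psutil.Process()
size = 50 * 1024 * 1024                  # 50 MB per iteration, never free()d
for i in range(5):
    buf = libc.malloc(size)
    ctypes.memset(buf, 0, size)          # touch the pages so they count towards RSS
    del buf                              # only drops the Python wrapper around the pointer
    gc.collect()
    print("RSS after iteration %d: %.0f MB" % (i, proc.memory_info().rss / 1e6))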
sergeyf pushed a commit to sergeyf/scikit-learn that referenced this issue Feb 28, 2017
@akansal1
I am also facing the issue mentioned above with the default 'mse' criterion; will the fix mentioned above close this as well?

@raghavrv (Member)
Could you share sample code demonstrating such a leak? The last time I checked, MSE does not have any such problem, as all allocated memory is freed up correctly...

@akansal1 commented Mar 14, 2017

So, the following are my observations:

Options used to fit the dataset:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', 
                            max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, 
                            min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10000, 
                            n_jobs=3, oob_score=False, random_state=0, verbose=0, warm_start=False)

When I iterate over options, I see an increase in memory load:

I am able to train on my data once without issue, but a second iteration of training over the same data increases the memory load. For example, if the first iteration ends at 5 GB of memory consumption, a second iteration over the same data takes memory consumption to >8 GB; ideally it should flush the memory and stay at ~5 GB.

PS: Training the same dataset with a gradient boosting regressor works fine; I am not sure if this is related to the issue mentioned above.

@jnothman (Member) commented Mar 14, 2017 via email

@akansal1
Sorry, but I did not understand:

is your estimator deleted between calls to fit?

I did reduce model size later, but the behavior mentioned above happened during prototyping, and it was strange.

@Przemo10 mentioned this issue Mar 17, 2017
@jnothman (Member) commented Mar 19, 2017 via email

@akansal1
Yes, I was trying something similar, but it was not the expected behavior.

@jnothman (Member) commented Mar 21, 2017 via email

@akansal1 commented Mar 21, 2017

I mean the following changes reduce the problem:

est = RandomForestRegressor(...)  # some parameters
est.fit(X, y)
est = RandomForestRegressor(...)  # changed parameters, re-creating the estimator
est.fit(X, y)

But the memory freed up by doing the above is not the same as what was gained in the first run, and that is not the expected behavior.

This is causing extreme memory overhead in CV, which I do not experience with GBRT on the same dataset.

Hence, I feel there is a memory leak somewhere; however, I am not using RandomForestRegressor in my current approach.


@lesteve (Member) commented Mar 22, 2017

@akansal1 it would help tremendously if you posted a fully stand-alone snippet that we can just copy and paste into an IPython session to check whether we can reproduce.

@akansal1 commented Mar 22, 2017

@lesteve on the type of machine I run my analysis on it takes a lot of time, but I did manage to run some code for ExtraTreesRegressor, over the number of estimators I was trying, adapted from issue 8623.

For RandomForestRegressor the run was too slow and was not using enough RAM for this data; I will try to put my code into a reproducible structure and share it here. Sorry.

import gc
import os

import numpy as np
import psutil
from sklearn.ensemble import ExtraTreesRegressor

p = psutil.Process(os.getpid())
X = np.random.normal(size=(20000, 50))
Y = np.random.binomial(1, 0.5, size=(20000, ))


def print_mem():
    print("{:.0f}MB".format(p.memory_info().rss / 1e6))

print_mem()

for i in range(5):
    et = ExtraTreesRegressor(n_estimators=10000,
                             n_jobs=3).fit(X, Y)
    del et
    gc.collect()
    print_mem()

Following is the screenshot of the run:
[screenshot: memory usage printed after each iteration]

PS: I am using Ubuntu 14.04, with Python 2.7 & latest scikit-learn stable release

@akansal1
@lesteve I was able to reproduce it for RandomForestRegressor as well:

import gc
import os

import numpy as np
import psutil
from sklearn.ensemble import RandomForestRegressor

p = psutil.Process(os.getpid())
X = np.random.normal(size=(11000, 12))
Y = np.random.binomial(1, 0.5, size=(11000, ))


def print_mem():
    print("{:.0f}MB".format(p.memory_info().rss / 1e6))

print_mem()

for i in range(5):
    et = RandomForestRegressor(n_estimators=10000,
                               n_jobs=3).fit(X, Y)
    del et
    gc.collect()
    print_mem()

[screenshot: memory usage printed after each iteration]

@raghavrv (Member)
I don't see any increasing memory usage when I run the code snippet...

import os, time, gc, psutil

import numpy as np
from sklearn.ensemble import RandomForestRegressor

p = psutil.Process()
X = np.random.normal(size=(10000, 12))
y = np.random.randint(-1, 3, size=(10000, ))

def sleep_and_print_mem(title, sleep=3):
    time.sleep(sleep)
    print("\n" + title + " : " +  "%0.2f MB" % (p.memory_info().rss / 1e6)
         + "   " + "=" * 49)

sleep_and_print_mem("Initial memory")

for i in range(6):
    et = RandomForestRegressor(n_estimators=20, criterion='mse',
                               n_jobs=7).fit(X, y)
    del et
    gc.collect()

    sleep_and_print_mem("After iteration %d" % i)

The residual memory (as per psutil) after each run + del + gc.collect() is roughly the same after each iteration...
This residual memory is not a memory leak... It has to do with the OS's memory caching... Refer to this SO link - http://stackoverflow.com/a/35121766

@lesteve I tried it with ExtraTreesRegressor too; I can't see a steadily increasing memory usage. There is random fluctuation (both increases and decreases) in the residual memory, but no steady increase...

Initial memory : 72.15 MB   =================================================
After iteration 0 : 84.47 MB   =================================================
After iteration 1 : 158.83 MB   ================================================= 
After iteration 2 : 214.52 MB   =================================================
After iteration 3 : 204.03 MB   =================================================
After iteration 4 : 216.31 MB   =================================================
After iteration 5 : 253.84 MB   =================================================
After iteration 6 : 207.64 MB   =================================================
After iteration 7 : 226.03 MB   =================================================
After iteration 8 : 212.75 MB   ===============================================
After iteration 9 : 186.47 MB   =================================================

(In contrast, if you run the above on scikit-learn 0.18.0 with criterion='mae' you'd get:)

Initial memory : 67MB   ==================================================
After iteration 0 : 318MB   ==================================================
After iteration 1 : 562MB   ==================================================
After iteration 2 : 810MB   ==================================================
After iteration 3 : 1061MB   ==================================================
After iteration 4 : 1311MB   ==================================================
After iteration 5 : 1558MB   ==================================================
After iteration 6 : 1803MB   ==================================================

I'm +1 for closing #8623
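(Since the distinction between a true leak and allocator/OS caching comes up above, one hedged way to reduce allocator noise before sampling RSS on glibc systems is malloc_trim; a sketch only, assuming Linux/glibc, and not something the snippets above did:)

import ctypes
import gc
import psutil

def rss_after_trim():
    # Ask glibc to return freed arena memory to the OS before sampling RSS,
    # so allocator caching is not mistaken for a leak (assumes Linux/glibc).
    gc.collect()
    ctypes.CDLL("libc.so.6").malloc_trim(0)
    return psutil.Process().memory_info().rss / 1e6

print("%.0f MB" % rss_after_trim())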

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this issue Jun 14, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this issue Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this issue Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this issue Nov 15, 2017