mae DecisionTree does not free used memory after refit #7811
Comments
What kind of data are you fitting on, i.e. what is ...
Good point, I forgot about the data. Originally I found it for (dense) dataframes; it works as well. I may test it on sparse data, if you wish.
I seem to be able to reproduce on my Ubuntu laptop, although I haven't looked into it in detail. Here is a snippet:

import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeRegressor
import psutil

rng = np.random.RandomState(0)
xxx = rng.randn(1000, 11)
yyy = rng.randn(1000)

params = dict(max_depth=range(1, 10))
params = ParameterGrid(params)
reg = DecisionTreeRegressor(criterion='mae')
this_process = psutil.Process()

for p in params:
    print(p)
    print(this_process.memory_info().rss / 1e6)
    for i in range(100):
        reg.set_params(**p)
        reg.fit(xxx, yyy)

Output: [omitted]
You don't see a memory usage increase if you use the default criterion.
Just a hunch, but that certainly comes from something that isn't freed in
I don't have the time at the moment to test and see if the error is reproducible, but I'd imagine that the issue stems from here. I use two numpy arrays to store objects to efficiently calculate the median for all outputs. I'd imagine Cython/Python would automatically free them, seeing as they are Python objects... if it doesn't, that could certainly cause problems.
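For readers following the description above, here is a minimal pure-Python sketch of that pattern; the class and variable names are hypothetical, not the actual Cython code in sklearn/tree. One helper object per output is stored in numpy object arrays; in pure Python these are reference-counted and collected, and the report suggests their Cython counterparts were not released on refit.

import numpy as np

class MedianCalculator:
    # Toy stand-in (hypothetical) for the per-output median helper.
    def __init__(self):
        self.values = []

    def push(self, v):
        self.values.append(v)

    def median(self):
        return float(np.median(self.values))

n_outputs = 2
# Two object arrays, one helper per output, mirroring the description above.
left_medians = np.empty(n_outputs, dtype=object)
right_medians = np.empty(n_outputs, dtype=object)
for k in range(n_outputs):
    left_medians[k] = MedianCalculator()
    right_medians[k] = MedianCalculator()

left_medians[0].push(3.0)
left_medians[0].push(1.0)
print(left_medians[0].median())  # 2.0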
yeah I agree with @glouppe, will look into it in a bit.
The bug has been reproduced by me and by the guys from the forum.
Thanks for the info @lucidyan. I think a rewrite is probably in order at the Cython level; I'd imagine there's something funky going on with how the numpy arrays of Cython objects are treated. It's expected that MAE would take longer than MSE and use more memory, but not to this extent (not freeing the memory is obviously not intended). To solve the memory issues, I think a reasonable change would be to either do all the computations inline without any external objects (since this is what seems to be causing the leak right now) or create a struct to facilitate it. I'm prohibitively busy until at least mid-January, so I wouldn't be able to get to the above until then; if another contributor wants to tackle this, though, I'd be happy to provide guidance.
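As a rough illustration of the first option (computing the weighted median inline, with no external helper objects), here is a minimal NumPy sketch; the function name and signature are illustrative only, and the real Cython code would need an incrementally updatable variant rather than a full re-sort per split:

import numpy as np

def weighted_median(values, weights):
    # Sort values and accumulate weights; the weighted median is the
    # smallest value whose cumulative weight reaches half the total weight.
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

print(weighted_median([1.0, 2.0, 10.0], [1, 1, 1]))  # 2.0
print(weighted_median([1.0, 2.0], [3, 1]))           # 1.0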
@nelson-liu just curious, what kind of tools are available to investigate memory leaks within Cython code?
If by investigate you mean debug, I think it's possible to use valgrind. I'm not super familiar with debugging memory issues in Cython, but I think it's fairly intuitive where the leak is happening here (the numpy array of Cython objects is improperly freed)... If you find anything that seems helpful, please let me know!
I think I stumbled upon this issue of MAE not releasing the memory in one of my PRs when I incorrectly rebased... And yes, valgrind is the tool for the task... I remember there was a suppression rule set or something to triage the output and filter out the Python-interpreter-specific error messages...
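For reference, a typical invocation looks like the following; CPython ships a suppression file (Misc/valgrind-python.supp in the CPython sources) for filtering interpreter-internal reports, and repro.py stands for one of the snippets in this thread:

valgrind --tool=memcheck --suppressions=valgrind-python.supp python repro.py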
…en raising; Propagate all errors to python interpreter level (#7811) (#8002)

* FIX MAE reg. criterion: Use safe_realloc to avoid memory leak
* Release GIL in safe_realloc and clean up scaffolding
* As gil is released in safe_realloc, no need of a with gil block
* Use except * to propagate error in all cdef functions
* Don't use except * for functions that return python objects
* Don't use except * for the comparison function passed to qsort
* Omissions and Errors
* Use safe_realloc now that gil is released there
* Fix realloc size
* Acquire GIL only if we need to raise
* Use except * more judiciously; Release gil only when raising; Add comments to clarify
* Actually that was unneeded; realloc will also allocate for the first time
* StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc
* Use except -1 to propagate exceptions. This should avoid overheads
* Fix docstrings and add return 0 to reset methods
* TYPO
* REVIEW Remove redundant MemoryError raising calls
I am facing the issue mentioned above also with the default 'mse' criterion. Will the fix mentioned above close this as well?
Could you share a sample code snippet demonstrating such a leak? The last time I checked, MSE does not have any such problems, as all allocated memory is freed up correctly...
So, the following are my observations. These options are used to fit the dataset:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto',
                      max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10000,
                      n_jobs=3, oob_score=False, random_state=0, verbose=0, warm_start=False)

When I iterate over options, I get an increase in my memory load. I am able to train on my data once without issue, but a second iteration of training over the data increases the memory load. For example, if the first iteration ends at 5 GB of memory consumption, a second iteration over the same data takes memory consumption to >8 GB; ideally it should flush memory and stay at ~5 GB.

PS: Training the same dataset with GradientBoostingRegressor works fine; not sure if this is related to the issue mentioned above.
Is your estimator deleted between calls to fit? Is there a reason you don't want to adjust parameters to reduce model size?
Sorry, but I did not understand: "is your estimator deleted between calls to fit?" I did reduce the model size later, but the behavior mentioned above happened during prototyping, and it was strange.
I mean that if you write:

est.fit(X, y)
est.fit(X, y)

it's possible that while it is building the second model, it still has the first one stored, hence doubling the memory.
Yes, I was trying something similar, but it was not the expected behavior.
I'm not sure whether that's "yes, I did not delete between fits and that seems to solve the problem" or otherwise. It might not be so good to keep the previous model while fitting a new one, and we are probably inconsistent about it across estimators.
I mean the following changes reduce the problem:

est = RandomForestRegressor(some parameters)
est.fit(X, y)
est = RandomForestRegressor(changed parameters)
est.fit(X, y)

But the memory freed up by doing the above is not the same as what is gained in the first run, and that is not expected behavior. This is causing extreme memory overhead in CV, which I do not experience with GBRT on the same dataset. Hence, I feel there is a memory leak somewhere; however, I am not using RandomForestRegressor currently in my approach.
@akansal1 it would help tremendously if you posted a fully stand-alone snippet that we can just copy and paste into an IPython session to check whether we can reproduce.
@lesteve On the type of machine I run my analysis on, it takes a lot of time, but I did manage to run some code for ExtraTreesRegressor, over the number of estimators I was trying, taken from issue #8623. For RandomForestRegressor, the run was too slow and was not using enough RAM for this data; I will try to put my code into a reproducible structure to share it here. Sorry.

import os, gc
import psutil
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

p = psutil.Process(os.getpid())
X = np.random.normal(size=(20000, 50))
Y = np.random.binomial(1, 0.5, size=(20000, ))

def print_mem():
    print("{:.0f}MB".format(p.memory_info().rss / 1e6))

print_mem()
for i in range(5):
    et = ExtraTreesRegressor(n_estimators=10000, n_jobs=3).fit(X, Y)
    del et
    gc.collect()
    print_mem()

Following is the screenshot for the run: [screenshot omitted]

PS: I am using Ubuntu 14.04, with Python 2.7 and the latest scikit-learn stable release.
@lesteve I was able to generate it for RandomForestRegressor as well:

import os, gc
import psutil
import numpy as np
from sklearn.ensemble import RandomForestRegressor

p = psutil.Process(os.getpid())
X = np.random.normal(size=(11000, 12))
Y = np.random.binomial(1, 0.5, size=(11000, ))

def print_mem():
    print("{:.0f}MB".format(p.memory_info().rss / 1e6))

print_mem()
for i in range(5):
    et = RandomForestRegressor(n_estimators=10000, n_jobs=3).fit(X, Y)
    del et
    gc.collect()
    print_mem()
I don't see any increasing memory usage when I run the code snippet...

import os, time, gc, psutil
import numpy as np
from sklearn.ensemble import RandomForestRegressor

p = psutil.Process()
X = np.random.normal(size=(10000, 12))
y = np.random.randint(-1, 3, size=(10000, ))

def sleep_and_print_mem(title, sleep=3):
    time.sleep(sleep)
    print("\n" + title + " : " + "%0.2f MB" % (p.memory_info().rss / 1e6)
          + " " + "=" * 49)

sleep_and_print_mem("Initial memory")
for i in range(6):
    et = RandomForestRegressor(n_estimators=20, criterion='mse',
                               n_jobs=7).fit(X, y)
    del et
    gc.collect()
    sleep_and_print_mem("After iteration %d" % i)

The residual memory (as per psutil) after each run: [output omitted]

@lesteve I tried it with

(In contrast with this, if you run the above on scikit-learn 0.18.0 with

I'm +1 for closing #8623
Description

DecisionTree with the mae criterion does not free used memory.
As a result, this kind of tree is unusable in cycles, forests, and cross-validation.
(Originally I encountered this problem while trying to use grid search with an mae forest.)
Steps/Code to Reproduce
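A minimal sketch consistent with the description and with the snippet posted earlier in the thread, refitting an mae tree in a cycle:

import gc
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.randn(1000, 11)
y = np.random.randn(1000)

# Re-running this cycle repeatedly keeps growing the process RSS
# when criterion='mae'; with 'mse' the memory stays flat.
for i in range(100):
    reg = DecisionTreeRegressor(criterion='mae')
    reg.fit(X, y)
    del reg
    gc.collect()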
Expected Results

No additional memory usage after the first completion of the cycle.
The cell with the 'for' cycle is runnable an unlimited number of times.
Replacing 'mae' with 'mse' does give the expected result.

Actual Results

Additional memory is used after each cell execution.
Deleting the tree with 'del' and enforcing 'gc.collect()' does not help.
When there is no memory left, the OS (Windows) shuts down the Python process.
My guess is that there is a memory leak in the Cython code.
Versions
Windows-7-6.1.7601-SP1
('Python', '2.7.12 | packaged by conda-forge | (default, Aug 12 2016, 18:53:38) [MSC v.1500 64 bit (AMD64)]')
('NumPy', '1.11.2')
('SciPy', '0.18.0')
('Scikit-Learn', '0.18')