Isolation forest - decision_function & average_path_length method are memory inefficient #12040
Comments
Thank you for the report and the detailed analysis. A pull request to improve the memory usage in `IsolationForest` would be welcome. Also, if possible, please use code formatting in GitHub comments -- it really helps readability (I edited your comment above) -- and it would be good to use some format other than .docx for sharing documents (as it's difficult to open on Linux). Thanks!
Thanks for the very prompt response. I first want to confirm that this is a valid issue and that the memory consumption can be reduced as I described. The attached document contains images, hence the .docx. Any preference as to what format to use for future sharing?
If I understand correctly, the issue is that in `decision_function`, intermediate score arrays for all trees are held in memory at once. I agree this can probably be optimized as you propose. The other alternative could be to just chunk X row-wise and then concatenate the results (see the related discussion in #10280). If you make a Pull Request with the proposed changes (see the Contributing docs), even if it's a work in progress, it will be easier to discuss other possible solutions there while looking at the code diff. Edit: PDF might be simpler to open, or just post the results in a comment on GitHub if it's not too much content. You can also hide some content with the details tag.
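For illustration, here is a minimal sketch of the row-wise chunking alternative, assuming an already fitted estimator `clf`; the function name and chunk size are illustrative, not an existing API:

```python
import numpy as np

def chunked_decision_function(clf, X, chunk_size=10_000):
    # Score X in row-wise slices and concatenate, so peak memory
    # scales with chunk_size rather than with the full n_samples.
    scores = [clf.decision_function(X[i:i + chunk_size])
              for i in range(0, X.shape[0], chunk_size)]
    return np.concatenate(scores)
```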
Hello, yes, that is exactly the issue with isolation forest. The dataset is indeed large: 257K samples with 35 numerical features. It will need to grow beyond that for my needs, so I am looking for memory efficiency in addition to speed. I have gone through the links and they are quite useful for my use cases (I was also facing memory issues with silhouette score and the brute algorithm). I will first work on handling the data in chunks, and probably in the coming weeks I will make the PR for the isolation forest modification, as I have to go through the research paper on the algorithm too. I am also looking at how packages/languages other than sklearn implement isolation forest.
Working on this for the sprint. So to avoid arrays of shape (n_samples, n_estimators), we could either accumulate the scores one tree at a time or chunk X row-wise.
We can also do both options I guess. |
Description
Isolation forest consumes too much memory due to a memory-inefficient implementation of the anomaly score calculation. This also hurts parallelization with n_jobs, since the anomaly score cannot be computed in parallel for each tree.
Steps/Code to Reproduce
Run a simple IsolationForest with n_estimators=10 and n_estimators=50 respectively.
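A minimal sketch of such a run with memory profiling, assuming the third-party memory_profiler package is installed; the dataset size is illustrative:

```python
import numpy as np
from memory_profiler import memory_usage
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.randn(100_000, 35)  # illustrative size, similar scale to the report

def fit_and_score(n_estimators):
    clf = IsolationForest(n_estimators=n_estimators, random_state=0)
    clf.fit(X)
    return clf.decision_function(X)

for n in (10, 50):
    # memory_usage samples memory while the function runs; max gives the peak
    peak = max(memory_usage((fit_and_score, (n,))))
    print("n_estimators=%d: peak memory ~%.0f MiB" % (n, peak))
```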
Memory profiling shows that building each tree does not take much memory, but a lot of memory is consumed at the end, where a for loop iterates over all trees, computes the anomaly scores of all trees together, and then averages them (iforest.py, lines 267-281).
Due to this, with a larger number of estimators (e.g. 1000), the memory consumption is quite high.
Expected Results
Possible Solution:
The above for loop should only average the anomaly scores from each estimator instead of computing them all together. The logic of the isolation forest anomaly score calculation could be moved to the base estimator class so it is done per tree (I guess in the bagging.py file, similar to other methods available after fitting). A sketch of this direction follows.
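A minimal sketch of the proposed accumulation, assuming a fitted forest; it keeps only one (n_samples,)-shaped array per tree alive at a time, and it omits the average_path_length() normalization from the paper, so it is not the actual implementation:

```python
import numpy as np

def incremental_mean_depth(forest, X):
    # Accumulate path lengths tree by tree instead of stacking an
    # (n_samples, n_estimators) array for all trees at once.
    depths = np.zeros(X.shape[0])
    for tree in forest.estimators_:
        # decision_path returns a sparse (n_samples, n_nodes) indicator;
        # each row sum is the number of nodes on that sample's path.
        path = tree.decision_path(X)
        depths += np.ravel(path.sum(axis=1))
        # the per-tree sparse matrix is freed here before the next iteration
    return depths / len(forest.estimators_)
```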
Actual Results
The memory consumption grows sharply as the number of estimators increases.
The fit method calls decision_function and the average anomaly score calculation, which take quite a lot of memory.
The memory spike is highest at the very end, in the final call to the average_path_length() method.

Versions
isoForest_memoryConsumption.docx