8000 Isolation forest final stage very slow and single threaded · Issue #13295 · scikit-learn/scikit-learn · GitHub
Isolation forest final stage very slow and single threaded #13295

Closed
ibackus opened this issue Feb 27, 2019 · 8 comments

@ibackus
ibackus commented Feb 27, 2019

Description

Isolation forest final stage very slow and single threaded.

This is an issue I hit quite frequently. I'll train an isolation forest on a decently large data set (say on the order of 1M to 100M records, around 50 features), and it will run rapidly and in parallel with nearly 100% CPU utilization. I'll get output like the following:

[Parallel(n_jobs=30)]: Using backend LokyBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done   3 out of  30 | elapsed:   17.9s remaining:  2.7min
[Parallel(n_jobs=30)]: Done   7 out of  30 | elapsed:   18.5s remaining:  1.0min
[Parallel(n_jobs=30)]: Done  11 out of  30 | elapsed:   19.4s remaining:   33.5s
[Parallel(n_jobs=30)]: Done  15 out of  30 | elapsed:   19.7s remaining:   19.7s
[Parallel(n_jobs=30)]: Done  19 out of  30 | elapsed:   20.0s remaining:   11.6s
[Parallel(n_jobs=30)]: Done  23 out of  30 | elapsed:   20.2s remaining:    6.2s
[Parallel(n_jobs=30)]: Done  27 out of  30 | elapsed:   20.9s remaining:    2.3s
[Parallel(n_jobs=30)]: Done  30 out of  30 | elapsed:   21.5s finished

And then it will run for a very long time (10x as long? more?) on a single core, and eventually finalize. Often I'll get progress statements all printed simultaneously at the end when the task completes:

Building estimator 1 of 3 for this parallel run (total 100)...
Building estimator 2 of 3 for this parallel run (total 100)...
Building estimator 3 of 3 for this parallel run (total 100)...
...

I presume that's from parallel processes or threads printing to stdout without flushing.

I create the isolation forest with:

from sklearn.ensemble import IsolationForest
model_kwargs = {
    'n_estimators': 100,
    'n_jobs': 30,
    'verbose': 10,
    'max_samples': 1000,
    'behaviour': "new"
}
clf = IsolationForest(**model_kwargs)

Versions

System:
python: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0]
executable: /home/ibackus/anaconda3/bin/python
machine: Linux-4.15.0-1032-aws-x86_64-with-debian-buster-sid

BLAS:
macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
lib_dirs: /home/ibackus/anaconda3/lib
cblas_libs: mkl_rt, pthread

Python deps:
pip: 18.1
setuptools: 40.6.3
sklearn: 0.20.2
numpy: 1.15.4
scipy: 1.2.1
Cython: 0.29.2
pandas: 0.24.1

@jnothman
Member

Perhaps related: #13260?

@ngoix
Contributor
ngoix commented Feb 27, 2019

I think it is related indeed. During training, _fit is called first (a fast process), then score_samples is called to get the threshold (a slow process). score_samples is being fixed in #13260 and #13283

@ibackus
Author
ibackus commented Feb 27, 2019

@ngoix is it possible to run without calling score_samples? For my purposes I actually just need the decision function, not the full predict method.

@ngoix
Contributor
ngoix commented Feb 28, 2019

Unfortunately it is not: decision_function, score_samples, and predict basically do the same computation.

@ibackus
Author
ibackus commented Feb 28, 2019

I phrased that incorrectly: I'm just wondering if it's possible to skip calling score_samples during fit. It's okay if I call decision_function later on; I just want to speed up training. I can trivially parallelize the decision_function() call myself.
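(For reference, chunking the scoring step by hand is straightforward with joblib. A sketch of what that could look like; the helper name, chunk counts, and random data below are illustrative, not from this thread:)

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import IsolationForest

def chunked_decision_function(clf, X, n_jobs=2, n_chunks=4):
    """Score X in chunks across worker processes (illustrative helper)."""
    chunks = np.array_split(X, n_chunks)
    parts = Parallel(n_jobs=n_jobs)(
        delayed(clf.decision_function)(chunk) for chunk in chunks
    )
    return np.concatenate(parts)

# Placeholder data and a small forest, just to exercise the helper.
rng = np.random.RandomState(0)
X = rng.randn(2000, 5)
clf = IsolationForest(n_estimators=20, random_state=0).fit(X)

scores = chunked_decision_function(clf, X)
```

Since each chunk is scored independently, the concatenated result matches a single serial decision_function call on the full array.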

@ngoix
Contributor
ngoix commented Mar 1, 2019

You can achieve this by setting contamination="auto" and behaviour="new"
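(A minimal sketch of this setting. The random data and the try/except guard are my additions, not from the thread; the behaviour parameter was deprecated and later removed in newer scikit-learn releases, where the "new" behaviour is the only one:)

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(5000, 5)  # placeholder data, just for illustration

params = dict(n_estimators=100, max_samples=1000, contamination="auto", n_jobs=-1)
try:
    # scikit-learn 0.20/0.21: behaviour="new" with contamination="auto"
    # lets fit() use a fixed offset instead of a scoring pass.
    clf = IsolationForest(behaviour="new", **params)
except TypeError:
    # behaviour was removed in later releases; "new" is the default there.
    clf = IsolationForest(**params)

clf.fit(X)
# With contamination="auto" the offset is the fixed -0.5, so training only
# builds the trees; scoring can happen later, on your own schedule:
scores = clf.decision_function(X[:100])
```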

@ibackus
Author
ibackus commented Mar 1, 2019

Perfect. Thanks!

@jnothman
Member
jnothman commented Mar 4, 2019

I'll close this, given the solutions presented, the work merged since, and the work under way... let me know if that is a mistake.

@jnothman jnothman closed this as completed Mar 4, 2019