IsolationForest degenerates with uniform training data · Issue #7141 · scikit-learn/scikit-learn · GitHub

IsolationForest degenerates with uniform training data #7141


Closed
bspeice opened this issue Aug 4, 2016 · 6 comments · Fixed by #14771

Comments

bspeice commented Aug 4, 2016

Description

First off, thank you so much for incorporating the Isolation Forest algorithm. It has been absolutely amazing already; I've been doing a lot with it.

I found an interesting edge case with the IsolationForest: if the data being passed in is identical for each feature (i.e. each column consists of a single value across all rows), the model degenerates and does two interesting things:

  1. All training points, if run through prediction, are classified as anomalies.
  2. All points predicted in the future, no matter what their values are, are classified as anomalies.

That is, if the model is fitted on either a single row or an array where every column contains nothing but one repeated value, then everything is always considered an anomaly during prediction.

Now, there's no real way to fix this, but I wanted to make sure this is a known issue, and to ask: should a warning be raised to detect this behavior?

Steps/Code to Reproduce

from sklearn.ensemble import IsolationForest
import numpy as np

# 2-d array of all 1s
X = np.ones((100, 10))
iforest = IsolationForest()
iforest.fit(X)

assert all(iforest.predict(X) == -1)
assert all(iforest.predict(np.random.randn(100, 10)) == -1)
assert all(iforest.predict(X + 1) == -1)
assert all(iforest.predict(X - 1) == -1)

# 2-d array where columns contain the same value across rows
X = np.repeat(np.random.randn(1, 10), 100, 0)
iforest = IsolationForest()
iforest.fit(X)

assert all(iforest.predict(X) == -1)
assert all(iforest.predict(np.random.randn(100, 10)) == -1)
assert all(iforest.predict(np.ones((100, 10))) == -1)

# Single row
X = np.random.randn(1, 10)
iforest = IsolationForest()
iforest.fit(X)

assert all(iforest.predict(X) == -1)
assert all(iforest.predict(np.random.randn(100, 10)) == -1)
assert all(iforest.predict(np.ones((100, 10))) == -1)

Versions (though the version shouldn't actually matter)

Windows-7-6.1.7601-SP1
('Python', '2.7.12 |Anaconda 4.1.1 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)]')
('NumPy', '1.11.1')
('SciPy', '0.17.1')
('Scikit-Learn', '0.18.dev0')

Thanks, hope you all have a great day.

@raghavrv
Member
raghavrv commented Aug 4, 2016

@ngoix This is expected behavior, correct?

@ngoix
Contributor
ngoix commented Aug 5, 2016

Yes it is. In each of the three cases, the number of distinct samples is 1. The trees are expected to have a single node (no split) because we don't want to create empty nodes, so the decision_function is constant. I don't know whether we should raise a warning. @agramfort what do you think?
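
To make this concrete, here is a minimal sketch (not part of the original thread) that inspects the fitted estimators; it only uses standard scikit-learn attributes (estimators_, tree_.node_count, decision_function):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.ones((100, 10))
iforest = IsolationForest(random_state=0).fit(X)

# Every fitted tree should be a single root node (no splits), i.e. node_count == 1.
print([est.tree_.node_count for est in iforest.estimators_])

# And the anomaly score is the same constant for any input, whatever its values.
print(np.unique(np.round(iforest.decision_function(np.random.randn(5, 10)), 6)))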

@agramfort
Member

We could even raise an error, since the results will all be wrong in this case.

Is there a cheap way to detect the problem?

@ngoix
Contributor
ngoix commented Jun 16, 2017

I don't see anything except testing whether there are at least two distinct observations in the training set, which could be expensive on sparse matrices.
BTW the title of this issue is not quite accurate, as iforest does not actually degenerate with uniform samples.
@agramfort can you add a "need contributor" tag?
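
For illustration only (my sketch, not code from the thread), a dense-array version of the check being discussed, comparing every row against the first; how to do this cheaply for sparse matrices is the open question:

import numpy as np

def has_two_distinct_rows(X):
    """Return True if the dense array X contains at least two different rows."""
    # A single element differing from the first row is enough.
    return bool(np.any(X != X[0]))

has_two_distinct_rows(np.ones((100, 10)))      # False: degenerate training set
has_two_distinct_rows(np.random.randn(100, 10))  # True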

@agramfort
Member

done

@jayybhatt
Contributor

I can take this up (NYC WiMLDS sprint).

matwey added a commit to matwey/scikit-learn that referenced this issue Apr 18, 2020
…rn#14771)"

This reverts commit bcaf381.

The test in the reverted commit is useless and doesn't rely on the code
implementation. The commit claims to fix scikit-learn#7141, where the isolation forest is
trained on identical values, leading to degenerate trees.

Under the described circumstances, one may check that the exact score value for
every point in the parameter space is zero (or 0.5, depending on whether we are
talking about the original paper or the scikit-learn implementation).
However, there is no special code for this in the existing implementation, and the score
value is subject to rounding errors. So, for instance, for 100 identical input
samples we have a forest predicting everything as inliers, but for 101 input
samples we have a forest predicting everything as outliers. The decision is
based only on the floating-point rounding error.

One may check this by changing the number of input samples:

    X = np.ones((100, 10))

to

    X = np.ones((101, 10))

or something else.
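
As a hedged illustration of this rounding-error argument (my sketch, not part of the commit): on identical training data the decision_function sits essentially at 0.0, so the +1/-1 prediction hinges on floating-point noise and may flip with the number of training samples:

import numpy as np
from sklearn.ensemble import IsolationForest

for n in (100, 101):
    X = np.ones((n, 10))
    clf = IsolationForest(random_state=0).fit(X)
    # Scores should be ~0.0; the sign (and hence the predicted label) is noise.
    print(n, clf.decision_function(X)[0], np.unique(clf.predict(X)))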
matwey added a commit to matwey/scikit-learn that referenced this issue Apr 20, 2020
Allow small discrepancy for decision function as suggested in scikit-learn#16721.
matwey mentioned this issue Apr 20, 2020