IsolationForest degenerates with uniform training data · Issue #7141 · scikit-learn/scikit-learn · GitHub

IsolationForest degenerates with uniform training data #7141


Closed
bspeice opened this issue Aug 4, 2016 · 6 comments · Fixed by #14771

Comments

bspeice commented Aug 4, 2016

Description

First off, thank you so much for incorporating the Isolation Forest algorithm. It has been absolutely amazing already; I've been doing a lot with it.

I found an interesting edge case with the IsolationForest: if the data being passed in is identical for each feature (i.e. each column consists of a single value across all rows), the model degenerates and does two interesting things:

  1. All training points, if run through prediction, are classified as anomalies.
  2. All points predicted in the future, no matter what their values are, are classified as anomalies.

That is, if the model is fitted on either a single row or an array where every column contains nothing but one repeated value, then everything is always considered an anomaly during prediction.

Now, there's no real way to fix this, but I wanted to make sure this is a known issue, and to ask: should a warning be raised to detect this behavior?

Steps/Code to Reproduce

from sklearn.ensemble import IsolationForest
import numpy as np

# 2-d array of all 1s
X = np.ones((100, 10))
iforest = IsolationForest()
iforest.fit(X)

assert all(iforest.predict(X) == -1)
assert all(iforest.predict(np.random.randn(100, 10)) == -1)
assert all(iforest.predict(X + 1) == -1)
assert all(iforest.predict(X - 1) == -1)

# 2-d array where columns contain the same value across rows
X = np.repeat(np.random.randn(1, 10), 100, 0)
iforest = IsolationForest()
iforest.fit(X)

assert all(iforest.predict(X) == -1)
assert all(iforest.predict(np.random.randn(100, 10)) == -1)
assert all(iforest.predict(np.ones((100, 10))) == -1)

# Single row
X = np.random.randn(1, 10)
iforest = IsolationForest()
iforest.fit(X)

assert all(iforest.predict(X) == -1)
assert all(iforest.predict(np.random.randn(100, 10)) == -1)
assert all(iforest.predict(np.ones((100, 10))) == -1)

Versions (though the version shouldn't actually matter)

Windows-7-6.1.7601-SP1
('Python', '2.7.12 |Anaconda 4.1.1 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)]')
('NumPy', '1.11.1')
('SciPy', '0.17.1')
('Scikit-Learn', '0.18.dev0')

Thanks, hope you all have a great day.

@raghavrv
Member
raghavrv commented Aug 4, 2016

@ngoix This is expected behavior, correct?

@ngoix
Contributor
ngoix commented Aug 5, 2016

Yes it is. In each of the three cases, the number of distinct samples is 1. The trees are expected to have a single node (no split) because we don't want to create empty nodes, so the decision_function is constant. I don't know whether we should raise a warning. @agramfort what do you think?
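
To make this concrete, here is a minimal sketch (not part of the original thread) that inspects the fitted estimators; it only uses standard scikit-learn attributes (estimators_, tree_.node_count, decision_function):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.ones((100, 10))
iforest = IsolationForest(random_state=0).fit(X)

# Every fitted tree should be a single root node (no splits), i.e. node_count == 1.
print([est.tree_.node_count for est in iforest.estimators_])

# And the anomaly score is the same constant for any input, whatever its values.
print(np.unique(np.round(iforest.decision_function(np.random.randn(5, 10)), 6)))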

@agramfort
Member

We could even raise an error, since the results will all be wrong in this case.

Is there a cheap way to detect the problem?

@ngoix
Contributor
ngoix commented Jun 16, 2017

I don't see anything except testing whether there are at least two distinct observations in the training set, which could be expensive on sparse matrices.
BTW the title of this issue is not quite accurate, as iforest does not actually degenerate with uniform samples.
@agramfort can you add a "need contributor" tag?
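
For illustration only (my sketch, not code from the thread), a dense-array version of the check being discussed, comparing every row against the first; how to do this cheaply for sparse matrices is the open question:

import numpy as np

def has_two_distinct_rows(X):
    """Return True if the dense array X contains at least two different rows."""
    # A single element differing from the first row is enough.
    return bool(np.any(X != X[0]))

has_two_distinct_rows(np.ones((100, 10)))      # False: degenerate training set
has_two_distinct_rows(np.random.randn(100, 10))  # True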

@agramfort
Member

done

@jayybhatt
Contributor

I can take this up (NYC WiMLDS sprint).

matwey added a commit to matwey/scikit-learn that referenced this issue Apr 18, 2020
…rn#14771)"

This reverts commit bcaf381.

The test in the reverted commit is useless and doesn't rely on the code
implementation. The commit claims to fix scikit-learn#7141, where the isolation forest is
trained on identical values, leading to degenerate trees.

Under the described circumstances, one may check that the exact score value for
every point in the parameter space is zero (or 0.5, depending on whether we are
talking about the original paper or the scikit-learn implementation).
However, there is no special code for this in the existing implementation, and the score
value is subject to rounding errors. So, for instance, for 100 identical input
samples we have a forest predicting everything as inliers, but for 101 input
samples we have a forest predicting everything as outliers. The decision is
based only on the floating-point rounding error.

One may check this by changing the number of input samples:

    X = np.ones((100, 10))

to

    X = np.ones((101, 10))

or something else.
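
As a hedged illustration of this rounding-error argument (my sketch, not part of the commit): on identical training data the decision_function sits essentially at 0.0, so the +1/-1 prediction hinges on floating-point noise and may flip with the number of training samples:

import numpy as np
from sklearn.ensemble import IsolationForest

for n in (100, 101):
    X = np.ones((n, 10))
    clf = IsolationForest(random_state=0).fit(X)
    # Scores should be ~0.0; the sign (and hence the predicted label) is noise.
    print(n, clf.decision_function(X)[0], np.unique(clf.predict(X)))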
matwey added a commit to matwey/scikit-learn that referenced this issue Apr 20, 2020
Allow small discrepancy for decision function as suggested in scikit-learn#16721.
matwey mentioned this issue Apr 20, 2020