IsolationForest degenerates with uniform training data #7141
Comments
@ngoix This is expected behavior, correct?
Yes it is. In each of the three cases, the number of distinct samples is 1. The trees are expected to have a single node (no split) because we don't want to create void nodes. So the …
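One way to see this single-node behavior directly (a minimal sketch, not from the original thread; it assumes scikit-learn's `IsolationForest` with its `estimators_` and `tree_` attributes) is to fit on constant data and count the nodes of each tree:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Every feature is constant, so no split can separate the samples.
X = np.ones((100, 10))
clf = IsolationForest(random_state=0).fit(X)

# Each base estimator exposes its fitted tree structure via `tree_`.
node_counts = [est.tree_.node_count for est in clf.estimators_]
print(node_counts)  # expected: all 1, i.e. every tree is a single leaf
```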
We could even raise an error, as results will be all wrong in this case. Is there a cheap way to detect the problem?
I don't see anything except testing whether there are at least two distinct observations in the training set, which will be expensive on sparse matrices?
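One naive check along these lines (a hypothetical helper, not part of scikit-learn; it densifies one row at a time, which also illustrates why such a test can be costly on large sparse inputs) might look like this:

```python
import numpy as np
from scipy import sparse

def has_two_distinct_rows(X):
    """Return True if the training set contains at least two different rows."""
    if sparse.issparse(X):
        X = X.tocsr()
        first = X[0].toarray()
        for i in range(1, X.shape[0]):
            # Densify a single row at a time and compare it to the first row.
            if not np.array_equal(X[i].toarray(), first):
                return True
        return False
    X = np.asarray(X)
    return bool(np.any(X != X[0]))
```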
done
I can take this up. (NYC WiMLDS sprint)
…rn#14771)" This reverts commit bcaf381. The test in the reverted commit is useless and doesn't rely on the code implementation. The commit claims to fix scikit-learn#7141, where the isolation forest is trained on identical values, leading to degenerate trees. Under the described circumstances, one may check that the exact score value for every point in the parameter space is zero (or 0.5, depending on whether we are talking about the original paper or the scikit-learn implementation). However, there is no special-case code in the existing implementation, and the score value is subject to rounding errors. So, for instance, for 100 identical input samples we get a forest predicting everything as inliers, but for 101 input samples we get a forest predicting everything as outliers. The decision is based only on the floating-point rounding error. One may check this by changing the number of input samples from X = np.ones((100, 10)) to X = np.ones((101, 10)), or something else.
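The experiment described in that commit message can be tried with a short script (a sketch only; the exact scores and labels depend on the scikit-learn version and on floating-point rounding, which is precisely the point being made):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

for n_samples in (100, 101):
    # All samples are identical, so every tree degenerates to a single leaf.
    X = np.ones((n_samples, 10))
    clf = IsolationForest(random_state=0).fit(X)
    # decision_function should be ~0 everywhere; predict() then reduces to the
    # sign of a value dominated by rounding error, so the labels can flip
    # between all-inliers and all-outliers just by changing n_samples.
    print(n_samples, clf.decision_function(X[:1]), clf.predict(X[:1]))
```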
Allow small discrepancy for decision function as suggested in scikit-learn#16721.
Description
First off, thank you so much for incorporating the Isolation Forest algorithm. It has been absolutely amazing already; I've been doing a lot with it.
I found an interesting edge case with IsolationForest: if the training data is uniform for every feature (i.e. each column consists of a single repeated value for all rows), the model degenerates. That is, if the model is fitted on either a single row, or an array where each column contains nothing but the same value, then everything is always considered an anomaly during prediction.
Now, there's no way to really fix this, but I wanted to make sure that this is a known issue, and to ask: should a warning be raised to detect this behavior?
Steps/Code to Reproduce
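The original snippet is not preserved here; a minimal sketch consistent with the description above (uniform training data, then predicting on the same data) would be:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Each column contains a single repeated value for all rows.
X = np.ones((100, 10))
clf = IsolationForest(random_state=0).fit(X)

# The issue reports that every sample is flagged as an anomaly (-1).
print(clf.predict(X))
print(clf.decision_function(X))
```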
Versions (though the version shouldn't actually matter)
Windows-7-6.1.7601-SP1
('Python', '2.7.12 |Anaconda 4.1.1 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)]')
('NumPy', '1.11.1')
('SciPy', '0.17.1')
('Scikit-Learn', '0.18.dev0')
Thanks, hope you all have a great day.