sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839
Comments
Here is one alternative version of the function that seems to work in case it is helpful:
Maybe you will create a pull request?
Sure, I'll do that -- just didn't want to be presumptuous and didn't know the right protocol for this kind of thing. Thanks!
joshuakennethjones added a commit to joshuakennethjones/scikit-learn that referenced this issue on Sep 14, 2018: Fix Issue scikit-learn#11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.
albertcthomas pushed a commit to albertcthomas/scikit-learn that referenced this issue on Feb 25, 2019: Fix Issue scikit-learn#11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.
Description
When the input value to _average_path_length() is 0 or 1, the return value should be 0, not 1 as in the existing implementation. When the input value is 2, the return value should be 1, not 0.15... as in the current implementation.
These results should be expected for two reasons. First, the 2012 iForest paper explicitly states that c(n) (the value computed by _average_path_length) takes the value 0 for n in {0,1} and the value 1 for n=2. (The original paper indicated a zero value for terminal nodes to which < 2 training examples had been sorted, in the first paragraph of section 4.2, but left this vague in the algorithm/equation specifications and did not specify a unique value for nodes to which exactly two training examples had been sorted.) Second, from a rational perspective, we want these values to increase monotonically with n, and in the current implementation they do not.
This is a pretty easy fix, I think: alter the existing cases for inputs in {0,1} to return 0 instead of 1 (the value is already hard-coded for these cases), and add a case that returns 1 for an input value of 2. Since I have not contributed in the past, I felt it best to relay the issue this way rather than make my own pull request. This issue will affect anomaly scores in a subtle but potentially meaningful way.
Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.
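The proposed behaviour can be sketched as follows. This is only an illustration of the special cases described above, not the scikit-learn code or the eventual patch: the function name `average_path_length_sketch` is hypothetical, plain Python is used instead of NumPy to keep it dependency-free, and for n > 2 it assumes the usual closed form c(n) = 2H(n-1) - 2(n-1)/n with H(i) approximated by ln(i) plus the Euler-Mascheroni constant.

```python
import math

# Euler-Mascheroni constant, used in the harmonic-number approximation.
EULER_GAMMA = 0.5772156649015329


def average_path_length_sketch(n_samples_leaf):
    """Hypothetical sketch of c(n) with the special cases proposed in this issue."""
    result = []
    for n in n_samples_leaf:
        if n <= 1:
            # Proposed: nodes holding 0 or 1 samples contribute 0.
            result.append(0.0)
        elif n == 2:
            # Proposed: nodes holding exactly 2 samples contribute 1.
            result.append(1.0)
        else:
            # Assumed closed form: 2*H(n-1) - 2*(n-1)/n, H(i) ~ ln(i) + gamma.
            harmonic = math.log(n - 1.0) + EULER_GAMMA
            result.append(2.0 * harmonic - 2.0 * (n - 1.0) / n)
    return result


values = average_path_length_sketch([0, 1, 2, 3])
print(values)
```

With these cases in place the values are non-decreasing in n, which is the monotonicity property argued for above.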
Steps/Code to Reproduce
Expected Results
array([0. , 0. , 1., 1.20739236])
Actual Results
array([1. , 1. , 0.15443133, 1.20739236])
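The 0.15443133 reported for an input of 2 is consistent with applying the closed-form expression at n=2 without a special case. The short check below assumes that expression is c(n) = 2H(n-1) - 2(n-1)/n with H(i) approximated by ln(i) plus the Euler-Mascheroni constant; the helper name `c_closed_form` is hypothetical and is not taken from scikit-learn.

```python
import math

# Euler-Mascheroni constant, used in the harmonic-number approximation.
EULER_GAMMA = 0.5772156649015329


def c_closed_form(n):
    """Assumed closed form 2*H(n-1) - 2*(n-1)/n, with no special cases."""
    harmonic = math.log(n - 1.0) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1.0) / n


# At n=2, ln(1) = 0, so the value is 2*gamma - 1, roughly 0.1544.
print(round(c_closed_form(2), 8))
# At n=3 the closed form already agrees with the expected value.
print(round(c_closed_form(3), 8))
```

Because ln(1) = 0, the n=2 value reduces to twice the constant minus 1, which falls below the value returned for n in {0,1} in the actual output and breaks monotonicity.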
Versions