[MRG] sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839 #12085
Conversation
Fix Issue scikit-learn#11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.
Please add a non-regression test
Thanks for the PR @joshuakennethjones. I agree with the description of the suggested change but I have not reviewed the code yet. Will do it soon I hope.
Changed existing test to reflect correct values now produced by _average_path_length(), and added checks to ensure non-regression on all "base case" values in {0,1,2}.
Cool, thanks @albertcthomas -- happy to be able to help. @jnothman, I've just updated test_average_path_length to expect the corrected values for all "base case" inputs to _average_path_length in {0,1,2}.
Tests are failing.
Apart from the failing tests and the minor comments below, LGTM.
Made recommended enhancements to comments, and changed assert_almost_equal to assert_equal where constants should be returned.
I'm hoping to find some more time to dig into the test I can see failing in the check logs. One comment is that it looks to me like that test is demanding something that isn't a guaranteed feature of the algorithm -- specifically, that exactly 30 of 300 input instances will fall beyond the chosen threshold when a corruption level of 10% is set. Yes, there is tolerance built into the test, but it is too small to allow any deviation from exactly 30 examples being marked as outliers, as far as I can tell. In fact, it appears that 29 examples are now being marked as outliers, and this is rejected. Notice that if we sort the input examples by anomaly score, and the 30th and 31st examples have equal scores, there is no way to choose a threshold such that exactly 30 examples lie beyond it.
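A small numpy sketch of the tie problem described above (the scores here are made up, and the 10-sample / 20% contamination setup is just a scaled-down illustration, not taken from the failing test):

```python
import numpy as np

# Made-up anomaly scores for 10 samples (lower = more anomalous) and a 20%
# contamination level, so we would like exactly 2 samples flagged as outliers.
scores = np.array([-0.3, -0.2, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5])

threshold = np.percentile(scores, 20)   # lands on the tied value -0.2
print(np.sum(scores < threshold))       # 1 outlier
print(np.sum(scores <= threshold))      # 3 outliers

# Because the 2nd and 3rd smallest scores are tied, no threshold can flag
# exactly 2 samples: the count jumps straight from 1 to 3.
```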
Thanks @joshuakennethjones. I also had a look at the failing tests and I agree that they are not completely satisfying because, as you explain, we cannot expect to have the exact proportion of outliers. A fix might be to accept an error corresponding to one sample of the data set (i.e. a tolerance of 1/n_samples). Maybe @jnothman has a better suggestion? Here it would be good to at least check why it is failing now and not before this change (because of a tie?)
@albertcthomas -- thanks for taking a look! I'm looking into the test further. I'm working to verify that the current behavior is due to a tie in anomaly scores. I'm pretty sure that has to be the reason, but I agree that it makes sense to double check. I also have an idea about how to fix the test in a way that won't be subject to capturing "good luck" in the random seed values, which I believe was essentially the case when it previously succeeded. Of course, the change we are discussing here is going to change the anomaly scores produced, so no surprise there. Practically speaking, increasing the tolerance would probably make the test pass, because the odds of creating a random dataset with three examples in a critical position that have equal values should be quite low -- but it's still nonzero. If I'm going to fix the test I'd rather do something that deterministically succeeds regardless of random chance, to eliminate the possibility of this kind of thing biting someone in the future. Will update again when I have something ready.
Why not modify the test to check that some tie-breaking policy is followed consistently, i.e. that …
@jnothman yep, that's pretty close to what I'm thinking (although given that the current failure is due to returning 29 outliers, I'm guessing the tie-breaker currently in use goes in the other direction). I don't want to codify something that isn't actually a feature of the algorithm, though, and I don't believe the relevant papers specify a way that ties should be broken. So, I was going to allow the test to pass regardless of the "direction" in which the tie is broken, when there is a tie in a crucial region (here, around the 30th example).
I will try to double check the way we threshold the scores using …
Got set up to dump some debug statements and was able to confirm my suspicions:
I'll update the test to avoid this pitfall. It's a little tricky because the code is reused for testing several different algorithms, not all of which seem to support 'decision_function', which is needed to get the underlying score values -- I learned this during my debug session. So there will have to be a bit of awkwardness in the test function to ensure we don't try to call 'decision_function' for an algorithm other than iForests for which it is not available. Hopefully I can get that coded up later today or early tomorrow.
Thanks for the details @joshuakennethjones. The tests that are failing are common tests: they are supposed to pass for all outlier detection estimators.
Yes, I understand -- this is what is going to make the code somewhat awkward, as it will need conditionals to just do the old (possibly wrong, definitely wrong in the case of iForests, but passing) thing for the set of estimators that don't support the 'decision_function' method as a way of retrieving underlying anomaly scores. Given that it will just default to doing the existing thing if the method isn't supported, the tests should still pass for those estimators, but I would guess that someone familiar with those other algorithms may want to take a look at some point and consider whether they are truly correct vs. just randomly happening to pass.
I will have a closer look but I won’t have much time before next week.
Change assert_equal to assert ... == to adhere to latest conventions, and change test to properly deal with anomaly score ties in critical regions if 'decision_function' method is supported by the estimator in question, or default to the old behavior if not.
I have altered the tests in question so that, in cases where the proportion of examples beyond the decision threshold does not match the corruption parameter exactly, we will explicitly check for anomaly score ties across the critical region if the 'decision_function' method is supported by the estimator. If the 'decision_function' method is not supported by the estimator under test, the behavior will remain the same as under the original test. It all worked for me locally, so hopefully the checks will pass here.
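Roughly, the idea is something like the following sketch (the helper name and signature are illustrative, not the exact code added by the PR): a mismatch between observed and expected outlier counts is accepted only if every score in the "critical region" of the sorted decision-function values is tied.

```python
import numpy as np

def check_ties_in_critical_region(decision, num_outliers, expected_outliers):
    # The critical region spans the sorted positions between the observed and
    # the expected number of outliers (e.g. positions 29..30 when 29 samples
    # were flagged but 30 were expected).
    start = min(num_outliers, expected_outliers)
    end = max(num_outliers, expected_outliers) + 1
    sorted_decision = np.sort(decision)
    # The mismatch is only acceptable if all scores in that region are equal,
    # i.e. no threshold could have flagged exactly `expected_outliers` samples.
    assert len(np.unique(sorted_decision[start:end])) == 1
```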
Refactoring and adding more tests to try and get coverage to an acceptable level.
I am not completely satisfied with the test checking that we have approximately the correct proportion of outliers, but I do not have a better solution. I would like to take a bit more time to think about it, but I don't know when I will be able to do it. This issue is not related to what this PR does, so I do not want to delay this fix because of these tests. Could you just check that the examples involving IsolationForest are not affected by the change of this PR? Especially the ones of the outlier detection section of the doc? https://scikit-learn.org/stable/modules/outlier_detection.html#novelty-and-outlier-detection
I had a look at the examples in the documentation link you sent. Any shifts in the pictured boundaries, etc., are going to be more or less imperceptible to the eye, so in my mind that documentation should not need to change for this PR.
Thanks for checking the examples. I have been thinking a bit about the tests, and this indeed seems to be the best solution to check the proportion of outliers. However, this check should only be done for estimators with decision_function. See my comment below.
@@ -1495,8 +1495,38 @@ def check_classifiers_train(name, classifier_orig, readonly_memmap=False):
        assert_array_equal(np.argsort(y_log_prob), np.argsort(y_prob))


def check_outlier_corruption(estimator, num_outliers, expected_outliers, X):
Depending on the data set this check can fail for LocalOutlierFactor because this estimator has no decision_function. The reason it is not failing right now is that in this case (given data set, LocalOutlierFactor with default parameters) we have num_outliers == expected_outliers. I think we should only run this check for estimators that have a decision_function. The only estimator that does not have a decision_function is LocalOutlierFactor, and we can have a specific check in test_lof.py.
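In the common test, the gating suggested above could look roughly like the following sketch (function and variable names are hypothetical, and it reuses the illustrative check_ties_in_critical_region helper sketched earlier in this thread):

```python
import numpy as np

def check_outliers_contamination(estimator, X, expected_outliers):
    # Fit the outlier detector and count the samples it flags (-1 = outlier).
    y_pred = estimator.fit(X).predict(X)
    num_outliers = np.sum(y_pred == -1)
    if num_outliers != expected_outliers and hasattr(estimator, "decision_function"):
        # Run the tie-tolerant check only for estimators exposing
        # decision_function; LocalOutlierFactor, which does not expose it,
        # would instead get a dedicated check in test_lof.py.
        decision = estimator.decision_function(X)
        check_ties_in_critical_region(decision, num_outliers, expected_outliers)
```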
@joshuakennethjones do you have time to finish this PR?
I'm pretty tied up with work right now, so it may be a bit before I can look at the next set of requested changes. If you know what you want to do and feel comfortable making the remaining updates you would like, that's totally cool with me.
I took over to finish this PR during the sprint, see #13251.
Reference Issues/PRs
Fixes #11839
(sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.)
What does this implement/fix? Explain your changes.
When an input value to _average_path_length() is in {0,1}, the return value should be zero, not one as in the existing implementation. Likewise, when the input value is 2, the return value should be 1, not the 0.15... produced by the current implementation. These results should be expected for two reasons. First, the 2012 iForest paper explicitly states that c(n) (the value computed by _average_path_length) should take the value zero for n in {0,1} and the value 1 for n=2. (The original paper indicated a zero value for terminal nodes to which fewer than 2 training examples had been sorted, in the first paragraph of section 4.2, but left this vague in the algorithm/equation specifications and did not specify a unique value for nodes to which exactly two training examples had been sorted.) Second, from a rational perspective, we want these values to increase monotonically with n, and in the current implementation this is not the case. This is a pretty straightforward fix -- I have altered the existing cases for inputs in {0,1} to return zero instead of 1 (which was already hard-coded for these cases) and added a case for an input value of 2 to return 1.
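As a worked illustration of the corrected behavior (a sketch of the piecewise definition from the paper, not the code as it appears in scikit-learn), c(n) can be written as:

```python
import numpy as np

def average_path_length(n):
    """c(n): average path length of an unsuccessful BST search over n points,
    following Liu, Ting & Zhou (2012). Illustrative sketch only."""
    n = np.asarray(n, dtype=float)
    out = np.zeros_like(n)
    # Base cases: nothing can be traversed below a node holding 0 or 1 samples,
    # and a single split separates exactly 2 samples.
    out[n <= 1] = 0.0
    out[n == 2] = 1.0
    mask = n > 2
    # General case: 2 * H(n - 1) - 2 * (n - 1) / n, with the harmonic number
    # H(n - 1) approximated by ln(n - 1) plus the Euler-Mascheroni constant.
    out[mask] = (2.0 * (np.log(n[mask] - 1.0) + np.euler_gamma)
                 - 2.0 * (n[mask] - 1.0) / n[mask])
    return out

print(average_path_length([0, 1, 2, 3]))  # [0.  0.  1.  1.20739236]
```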
Thanks!
Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.
Steps to Test
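A call along the following lines reproduces the two arrays below; the private import path is an assumption based on the scikit-learn layout at the time of this PR and may differ in other versions.

```python
import numpy as np
from sklearn.ensemble.iforest import _average_path_length  # assumed import path

# Evaluate c(n) at the base cases 0, 1, 2 and at n = 3.
print(_average_path_length(np.array([0, 1, 2, 3])))
```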
Results from Submitted Code
array([0. , 0. , 1., 1.20739236])
Results from Existing Code
array([1. , 1. , 0.15443133, 1.20739236])