
[MRG] sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839 #12085


Conversation

joshuakennethjones

Reference Issues/PRs

Fixes #11839
(sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.)

What does this implement/fix? Explain your changes.

When the input to _average_path_length() is 0 or 1, the return value should be 0, not 1 as in the existing implementation. Likewise, when the input is 2, the return value should be 1, not 0.15... as in the current implementation. These results are expected for two reasons. First, the 2012 iForest paper explicitly states that c(n) (the quantity computed by _average_path_length) takes the value 0 for n in {0,1} and the value 1 for n = 2 (the original paper indicated a zero value for terminal nodes to which fewer than two training examples had been sorted, in the first paragraph of Section 4.2, but left this vague in the algorithm/equation specifications and did not specify a unique value for nodes to which exactly two training examples had been sorted). Second, from a rational perspective, we want these values to increase monotonically with n, which is not the case in the current implementation. The fix is straightforward: I have altered the existing cases for inputs in {0,1} to return 0 instead of the previously hard-coded 1, and added a case for an input of 2 to return 1.
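
For reference, here is a minimal sketch of the corrected c(n) described above (the function name and vectorization are illustrative, not the exact scikit-learn code; it assumes the standard approximation c(n) = 2*(ln(n-1) + Euler-Mascheroni constant) - 2*(n-1)/n for n > 2):

import numpy as np

def average_path_length(n_samples_leaf):
    # Sketch of c(n), the average path length of an unsuccessful BST search,
    # with the base cases this PR fixes: c(0) = c(1) = 0 and c(2) = 1.
    n = np.asarray(n_samples_leaf, dtype=float)
    out = np.zeros_like(n)                     # c(n) = 0 for n in {0, 1}
    out[n == 2] = 1.0                          # c(2) = 1
    mask = n > 2
    out[mask] = (2.0 * (np.log(n[mask] - 1.0) + np.euler_gamma)
                 - 2.0 * (n[mask] - 1.0) / n[mask])
    return out

average_path_length(np.array([0, 1, 2, 3]))   # -> array([0., 0., 1., 1.20739236])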

Thanks!

Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.

Steps to Test

import numpy as np
from sklearn.ensemble.iforest import _average_path_length  # private helper (import path valid at the time of this PR)
_average_path_length(np.array([0, 1, 2, 3]))  # exercises the base cases 0, 1, 2 and the first regular case, 3

Results from Submitted Code

array([0. , 0. , 1., 1.20739236])

Results From Existing Code

array([1. , 1. , 0.15443133, 1.20739236])

Fix Issue scikit-learn#11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.
@jnothman
Member
jnothman commented Sep 15, 2018 via email

@albertcthomas
Contributor
albertcthomas commented Sep 17, 2018

Thanks for the PR @joshuakennethjones. I agree with the description of the suggested change but I have not reviewed the code yet. Will do it soon, I hope.

Changed existing test to reflect correct values now produced by _average_path_length(), and added checks to ensure non-regression on all "base case" values in {0,1,2}.
@joshuakennethjones
Author

Cool, thanks @albertcthomas -- happy to be able to help. @jnothman, I've just updated test_average_path_length to expect the corrected values for all "base case" inputs to _average_path_length in {0,1,2}.

@jnothman
Member

Tests are failing

@albertcthomas left a comment
Contributor

Apart from the failing tests and the minor comments below, LGTM.

Made recommended enhancements to comments, and changed assert_almost_equal to assert_equal where constants should be returned.
@joshuakennethjones
Author

I'm hoping to find some more time to dig into the test I can see failing in the check logs. One comment: it looks to me like that test demands something that isn't a guaranteed property of the algorithm -- specifically, that exactly 30 of 300 input instances will fall beyond the chosen threshold when a corruption level of 10% is set. There is tolerance built into the test, but as far as I can tell it is too small to allow any deviation from exactly 30 examples being marked as outliers. In fact, it appears that 29 examples are now being marked as outliers, and this is rejected. Notice that if we sort the input examples by anomaly score and the 30th and 31st examples have equal scores, there is no way to choose a threshold such that exactly 30 examples lie beyond it.
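
To make the tie scenario concrete, here is a small hypothetical illustration (the scores and the percentile-based cutoff below are made up for the example, not taken from the actual test):

import numpy as np

# 300 hypothetical anomaly scores in which the 30th and 31st lowest are tied.
scores = np.r_[np.full(29, -1.0), [0.0, 0.0], np.full(269, 1.0)]
threshold = np.percentile(scores, 10)   # cutoff targeting 10% outliers
print((scores < threshold).sum())       # prints 29, not the expected 30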

@albertcthomas
Contributor

Thanks @joshuakennethjones. I also had a look at the failing tests and I agree that they are not completely satisfying because, as you explain, we cannot expect to get the exact proportion of outliers. A fix might be to accept an error corresponding to one sample of the data set (i.e. a tolerance of 1/n_samples). Maybe @jnothman has a better suggestion?

Here it would be good to at least check why it is failing now and not before this change (because of a tie?)

@joshuakennethjones
Author

@albertcthomas -- thanks for taking a look! I'm looking into the test further. I'm working to verify that the current behavior is due to a tie in anomaly scores. I'm pretty sure that has to be the reason, but I agree that it makes sense to double check. I also have an idea about how to fix the test in a way that won't depend on capturing "good luck" in the random seed values, which I believe was essentially the case when it previously succeeded. Of course, the change we are discussing here alters the anomaly scores produced, so no surprise there. Practically speaking, increasing the tolerance would probably make the test pass, because the odds of creating a random dataset with three examples in a critical position that have equal values should be quite low -- but they're still nonzero. If I'm going to fix the test, I'd rather do something that deterministically succeeds regardless of random chance, to eliminate the possibility of this kind of thing biting someone in the future. Will update again when I have something ready.

@jnothman
Member

Why not modify the test to check that some tie-breaking policy is followed consistently, i.e. that k > 30 outliers are returned, that the k - 30 + 1 lowest outlier scores are all equal...?

@joshuakennethjones
Author

@jnothman yep, that's pretty close to what I'm thinking (although given that the current failure is due to returning 29 outliers, I'm guessing the tie-breaker currently in use goes in the other direction). I don't want to codify something that isn't actually a feature of the algorithm, though, and I don't believe the relevant papers specify a way that ties should be broken. So, I was going to allow the test to pass regardless of the "direction" in which the tie is broken, when there is a tie in a crucial region (here, around the 30th example).

@albertcthomas
Contributor

I will try to double check the way we threshold the scores using np.percentile to be sure that this gives what we expect in terms of quantile. This will also depend on the interpolation method of the quantile function that we use (np.percentile uses 'linear' by default).
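
As a toy illustration of the interpolation issue (the data below is hypothetical, and newer NumPy versions rename the interpolation keyword to method):

import numpy as np

scores = np.array([0.1, 0.2, 0.3, 0.4])
# The default 'linear' interpolation can return a threshold lying between
# two observed scores, while 'lower' snaps to an observed score.
print(np.percentile(scores, 10))                         # 0.13
print(np.percentile(scores, 10, interpolation='lower'))  # 0.1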

@joshuakennethjones
Author

Got set up to dump some debug statements and was able to confirm my suspicions:

Before:

Position 26 score: -0.580305488488
Position 27 score: -0.453050432374
Position 28 score: -0.301445385117
Position 29 score: -0.106451546269
Position 30 score: 0.0118279495854
Position 31 score: 0.060199072667
Position 32 score: 0.110967885477
Position 33 score: 0.254962388828


After:

Position 26 score: -0.00229702188937
Position 27 score: -0.000207712944615
Position 28 score: -0.000207712944615
Position 29 score: 0.0
Position 30 score: 0.0
Position 31 score: 0.00760246724465
Position 32 score: 0.00928720850602
Position 33 score: 0.01266619782

I'll update the test to avoid this pitfall. It's a little tricky because the code is reused for testing several different algorithms, not all of which seem to support 'decision_function', which is needed to get the underlying score values -- I learned this during my debug session. So there will have to be a bit of awkwardness in the test function to ensure we don't try to call 'decision_function' on an estimator that doesn't provide it. Hopefully I can get that coded up later today or early tomorrow.

@albertcthomas
Contributor

Thanks for the details @joshuakennethjones

The tests that are failing are common tests: they are supposed to pass for all outlier detection estimators.

@joshuakennethjones
Author

Yes, I understand -- that's what will make the code somewhat awkward: it will need conditionals to fall back to the old behavior (possibly wrong, and definitely wrong in the case of iForests, but passing) for estimators that don't support the 'decision_function' method as a way of retrieving the underlying anomaly scores. Since it defaults to the existing behavior when the method isn't supported, the tests should still pass for those estimators, but someone familiar with those other algorithms may want to take a look at some point and consider whether they are truly correct or just happen to pass by chance.

@albertcthomas
Contributor

I will have a closer look but I won’t have much time before next week.

Changed assert_equal to assert ... == to adhere to the latest conventions, and changed the test to properly deal with anomaly score ties in critical regions when the 'decision_function' method is supported by the estimator in question, defaulting to the old behavior otherwise.
@joshuakennethjones
Author

I have altered the tests in question so that, in cases where the proportion of examples beyond the decision threshold does not match the corruption parameter exactly, we will explicitly check for anomaly score ties across the critical region if the 'decision_function' method is supported by the estimator. If the 'decision_function' method is not supported by the estimator under test, the behavior will remain the same as under the original test. It all worked for me locally, so hopefully the checks will pass here.
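
A minimal sketch of the tie check described above (the helper name, indexing, and data are illustrative and do not reproduce the exact code in the common tests):

import numpy as np

def ties_explain_mismatch(decision, num_outliers, expected_outliers):
    # If the observed outlier count differs from the expected one, require
    # that every score in the "critical region" spanning the gap around the
    # cutoff is tied; otherwise the discrepancy cannot be blamed on ties.
    start = min(num_outliers, expected_outliers)
    end = max(num_outliers, expected_outliers) + 1
    sorted_decision = np.sort(decision)
    return len(np.unique(sorted_decision[start:end])) == 1

# Reusing the hypothetical tied scores from the earlier illustration:
decision = np.r_[np.full(29, -1.0), [0.0, 0.0], np.full(269, 1.0)]
assert ties_explain_mismatch(decision, num_outliers=29, expected_outliers=30)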

Refactoring and adding more tests to try and get coverage to an acceptable level.
@albertcthomas
Contributor

I am not completely satisfied with the test checking that we have approximately the correct proportion of outliers, but I do not have a better solution. I would like to take a bit more time to think about it, but I don't know when I will be able to do so. This issue is not related to what this PR does, so I do not want to delay this fix because of these tests.

Could you just check that the examples involving IsolationForest are not affected by the change in this PR? Especially the ones in the outlier detection section of the docs: https://scikit-learn.org/stable/modules/outlier_detection.html#novelty-and-outlier-detection

@joshuakennethjones
Author

I had a look at the examples in the documentation link you sent. Any shifts in the pictured boundaries, etc, are going to be more or less imperceptible to the eye, so in my mind that documentation should not need to change for this PR.

@albertcthomas left a comment
Contributor

Thanks for checking the examples. I have been thinking a bit about the tests, and this indeed seems to be the best solution for checking the proportion of outliers. However, this check should only be done for estimators with decision_function. See my comment below.

@@ -1495,8 +1495,38 @@ def check_classifiers_train(name, classifier_orig, readonly_memmap=False):
assert_array_equal(np.argsort(y_log_prob), np.argsort(y_prob))


def check_outlier_corruption(estimator, num_outliers, expected_outliers, X):
@albertcthomas
Contributor

Depending on the data set, this check can fail for LocalOutlierFactor because this estimator has no decision_function. The reason it is not failing right now is that in this case (the given data set, LocalOutlierFactor with default parameters) we have num_outliers == expected_outliers. I think we should only run this check for estimators that have a decision_function. The only estimator that does not have one is LocalOutlierFactor, and we can add a specific check in test_lof.py.

@albertcthomas
Contributor

@joshuakennethjones do you have time to finish this PR?

@joshuakennethjones
Author

I'm pretty tied up with work right now, so it may be a bit before I can look at the next set of requested changes. If you know what you want to do and feel comfortable making the remaining updates you would like, that's totally cool with me.

@albertcthomas
Contributor

I took over to finish this PR during the sprint, see #13251.
