[MRG] sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839 #12085
Conversation
Fix Issue scikit-learn#11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.
Please add a non-regression test
Thanks for the PR @joshuakennethjones. I agree with the description of the suggested change but I have not reviewed the code yet. Will do it soon I hope.
Changed existing test to reflect correct values now produced by _average_path_length(), and added checks to ensure non-regression on all "base case" values in {0,1,2}.
Cool, thanks @albertcthomas -- happy to be able to help. @jnothman, I've just updated test_average_path_length to expect the corrected values for all "base case" inputs to _average_path_length in {0,1,2}.
Tests are failing.
Apart from the failing tests and the minor comments below, LGTM.
Made recommended enhancements to comments, and changed assert_almost_equal to assert_equal where constants should be returned.
I'm hoping to find some more time to dig into the test I can see failing in the check logs. One comment is that it looks to me like that test is demanding something that isn't a guaranteed feature of the algorithm -- specifically, that exactly 30 of 300 input instances will fall beyond the chosen threshold when a corruption level of 10% is set. Yes, there is tolerance built into the test, but it is too small to allow any deviation from exactly 30 examples being marked as outliers, as far as I can tell. In fact, it appears that 29 examples are now being marked as outliers, and this is rejected. Notice that if we sort the input examples by anomaly score, and the 30th and 31st examples have equal scores, there is no way to choose a threshold such that exactly 30 examples lie beyond it.
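A small numpy sketch of the tie problem described above (the scores here are made up, and the 10-sample / 20% contamination setup is just a scaled-down illustration, not taken from the failing test):

```python
import numpy as np

# Made-up anomaly scores for 10 samples (lower = more anomalous) and a 20%
# contamination level, so we would like exactly 2 samples flagged as outliers.
scores = np.array([-0.3, -0.2, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5])

threshold = np.percentile(scores, 20)   # lands on the tied value -0.2
print(np.sum(scores < threshold))       # 1 outlier
print(np.sum(scores <= threshold))      # 3 outliers

# Because the 2nd and 3rd smallest scores are tied, no threshold can flag
# exactly 2 samples: the count jumps straight from 1 to 3.
```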
Thanks @joshuakennethjones. I also had a look at the failing tests and I agree that they are not completely satisfying because, as you explain, we cannot expect to have the exact proportion of outliers. A fix might be to accept an error corresponding to one sample of the data set (i.e. a tolerance of 1/n_samples). Maybe @jnothman has a better suggestion? Here it would be good to at least check why it is failing now and not before this change (because of a tie?)
@albertcthomas -- thanks for taking a look! I'm looking into the test further. I'm working to verify that the current behavior is due to a tie in anomaly scores. I'm pretty sure that has to be the reason, but I agree that it makes sense to double check. I also have an idea about how to fix the test in a way that won't be subject to capturing "good luck" in the random seed values, which I believe was essentially the case when it previously succeeded. Of course, the change we are discussing here is going to change the anomaly scores produced, so no surprise there. Practically speaking, increasing the tolerance would probably make the test pass, because the odds of creating a random dataset with three examples in a critical position that have equal values should be quite low -- but it's still nonzero. If I'm going to fix the test I'd rather do something that deterministically succeeds regardless of random chance, to eliminate the possibility of this kind of thing biting someone in the future. Will update again when I have something ready.
Why not modify the test to check that some tie-breaking policy is followed consistently, i.e. that …
@jnothman yep, that's pretty close to what I'm thinking (although given that the current failure is due to returning 29 outliers, I'm guessing the tie-breaker currently in use goes in the other direction). I don't want to codify something that isn't actually a feature of the algorithm, though, and I don't believe the relevant papers specify a way that ties should be broken. So, I was going to allow the test to pass regardless of the "direction" in which the tie is broken, when there is a tie in a crucial region (here, around the 30th example).
I will try to double check the way we threshold the scores using …
Got set up to dump some debug statements and was able to confirm my suspicions:
I'll update the test to avoid this pitfall. It's a little tricky because the code is reused for testing several different algorithms, not all of which seem to support 'decision_function', which is needed to get the underlying score values -- I learned this during my debug session. So there will have to be a bit of awkwardness in the test function to ensure we don't try to call 'decision_function' for an algorithm other than iForests for which it is not available. Hopefully I can get that coded up later today or early tomorrow.
Thanks for the details @joshuakennethjones. The tests that are failing are common tests: they are supposed to pass for all outlier detection estimators.
Yes, I understand -- this is what is going to make the code somewhat awkward, as it will need conditionals to just do the old (possibly wrong, definitely wrong in the case of iForests, but passing) thing for the set of estimators that don't support the 'decision_function' method as a way of retrieving underlying anomaly scores. Given that it will just default to doing the existing thing if the method isn't supported, the tests should still pass for those estimators, but I would guess that someone familiar with those other algorithms may want to take a look at some point and consider whether they are truly correct vs. just randomly happening to pass.
I will have a closer look but I won’t have much time before next week.
Change assert_equal to assert ... == to adhere to latest conventions, and change test to properly deal with anomaly score ties in critical regions if 'decision_function' method is supported by the estimator in question, or default to the old behavior if not.
I have altered the tests in question so that, in cases where the proportion of examples beyond the decision threshold does not match the corruption parameter exactly, we will explicitly check for anomaly score ties across the critical region if the 'decision_function' method is supported by the estimator. If the 'decision_function' method is not supported by the estimator under test, the behavior will remain the same as under the original test. It all worked for me locally, so hopefully the checks will pass here.
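Roughly, the idea is something like the following sketch (the helper name and signature are illustrative, not the exact code added by the PR): a mismatch between observed and expected outlier counts is accepted only if every score in the "critical region" of the sorted decision-function values is tied.

```python
import numpy as np

def check_ties_in_critical_region(decision, num_outliers, expected_outliers):
    # The critical region spans the sorted positions between the observed and
    # the expected number of outliers (e.g. positions 29..30 when 29 samples
    # were flagged but 30 were expected).
    start = min(num_outliers, expected_outliers)
    end = max(num_outliers, expected_outliers) + 1
    sorted_decision = np.sort(decision)
    # The mismatch is only acceptable if all scores in that region are equal,
    # i.e. no threshold could have flagged exactly `expected_outliers` samples.
    assert len(np.unique(sorted_decision[start:end])) == 1
```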
Refactoring and adding more tests to try and get coverage to an acceptable level.
I am not completely satisfied with the test checking that we have approximately the correct proportion of outliers, but I do not have a better solution. I would like to take a bit more time to think about it, but I don't know when I will be able to do it. This issue is not related to what this PR does, so I do not want to delay this fix because of these tests. Could you just check that the examples involving IsolationForest are not affected by the change of this PR? Especially the ones of the outlier detection section of the doc? https://scikit-learn.org/stable/modules/outlier_detection.html#novelty-and-outlier-detection
I had a look at the examples in the documentation link you sent. Any shifts in the pictured boundaries, etc., are going to be more or less imperceptible to the eye, so in my mind that documentation should not need to change for this PR.
Thanks for checking the examples. I have been thinking a bit about the tests, and this indeed seems to be the best solution to check the proportion of outliers. However, this check should only be done for estimators with decision_function. See my comment below.
@@ -1495,8 +1495,38 @@ def check_classifiers_train(name, classifier_orig, readonly_memmap=False):
        assert_array_equal(np.argsort(y_log_prob), np.argsort(y_prob))


def check_outlier_corruption(estimator, num_outliers, expected_outliers, X):
Depending on the data set this check can fail for LocalOutlierFactor because this estimator has no decision_function. The reason it is not failing right now is that in this case (given data set, LocalOutlierFactor with default parameters) we have num_outliers == expected_outliers. I think we should only run this check for estimators that have a decision_function. The only estimator that does not have a decision_function is LocalOutlierFactor, and we can have a specific check in test_lof.py.
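In the common test, the gating suggested above could look roughly like the following sketch (function and variable names are hypothetical, and it reuses the illustrative check_ties_in_critical_region helper sketched earlier in this thread):

```python
import numpy as np

def check_outliers_contamination(estimator, X, expected_outliers):
    # Fit the outlier detector and count the samples it flags (-1 = outlier).
    y_pred = estimator.fit(X).predict(X)
    num_outliers = np.sum(y_pred == -1)
    if num_outliers != expected_outliers and hasattr(estimator, "decision_function"):
        # Run the tie-tolerant check only for estimators exposing
        # decision_function; LocalOutlierFactor, which does not expose it,
        # would instead get a dedicated check in test_lof.py.
        decision = estimator.decision_function(X)
        check_ties_in_critical_region(decision, num_outliers, expected_outliers)
```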
@joshuakennethjones do you have time to finish this PR?
I'm pretty tied up with work right now, so it may be a bit before I can look at the next set of requested changes. If you know what you want to do and feel comfortable making the remaining updates you would like, that's totally cool with me.
I took over to finish this PR during the sprint, see #13251.
Reference Issues/PRs
Fixes #11839
(sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.)
What does this implement/fix? Explain your changes.
When an input value to _average_path_length() is in {0,1}, the return value should be zero, not one as in the existing implementation. Likewise, when the input value is 2, the return value should be 1, not the 0.15... produced by the current implementation. These results should be expected for two reasons. First, the 2012 iForest paper explicitly states that c(n) (the value computed by _average_path_length) should take the value zero for n in {0,1} and the value 1 for n=2. (The original paper indicated a zero value for terminal nodes to which fewer than 2 training examples had been sorted, in the first paragraph of section 4.2, but left this vague in the algorithm/equation specifications and did not specify a unique value for nodes to which exactly two training examples had been sorted.) Second, from a rational perspective, we want these values to increase monotonically with n, and in the current implementation this is not the case. This is a pretty straightforward fix -- I have altered the existing cases for inputs in {0,1} to return zero instead of 1 (which was already hard-coded for these cases) and added a case for an input value of 2 to return 1.
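As a worked illustration of the corrected behavior (a sketch of the piecewise definition from the paper, not the code as it appears in scikit-learn), c(n) can be written as:

```python
import numpy as np

def average_path_length(n):
    """c(n): average path length of an unsuccessful BST search over n points,
    following Liu, Ting & Zhou (2012). Illustrative sketch only."""
    n = np.asarray(n, dtype=float)
    out = np.zeros_like(n)
    # Base cases: nothing can be traversed below a node holding 0 or 1 samples,
    # and a single split separates exactly 2 samples.
    out[n <= 1] = 0.0
    out[n == 2] = 1.0
    mask = n > 2
    # General case: 2 * H(n - 1) - 2 * (n - 1) / n, with the harmonic number
    # H(n - 1) approximated by ln(n - 1) plus the Euler-Mascheroni constant.
    out[mask] = (2.0 * (np.log(n[mask] - 1.0) + np.euler_gamma)
                 - 2.0 * (n[mask] - 1.0) / n[mask])
    return out

print(average_path_length([0, 1, 2, 3]))  # [0.  0.  1.  1.20739236]
```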
Thanks!
Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.
Steps to Test
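A call along the following lines reproduces the two arrays below; the private import path is an assumption based on the scikit-learn layout at the time of this PR and may differ in other versions.

```python
import numpy as np
from sklearn.ensemble.iforest import _average_path_length  # assumed import path

# Evaluate c(n) at the base cases 0, 1, 2 and at n = 3.
print(_average_path_length(np.array([0, 1, 2, 3])))
```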
Results from Submitted Code
array([0. , 0. , 1., 1.20739236])
Results from Existing Code
array([1. , 1. , 0.15443133, 1.20739236])