Improve IsolationForest average depth evaluation #16721
Conversation
Force-pushed from a8f8b19 to 0869944
@matwey could you check why some of the tests failed?
Sorry for the delay. I'll do that.
Force-pushed from 0869944 to 00d7bbc
We should compare the implementation with the exact equation instead of the approximation.
Force-pushed from 00d7bbc to 504cf2e
@albertcthomas I still have an issue with
…rn#14771)" This reverts commit bcaf381. The test in reverted commit is useless and doesn't rely on the code implementation. The commit claims to fix scikit-learn#7141, where the isolation forest is trained on the identical values leading to the degenerated trees. Under described circumstances, one may check that the exact score value for every point in the parameter space is zero (or 0.5 depending on if we are talking about initial paper or scikit-learn implementation). However, there is no special code in existing implementation, and the score value is a subject of rounding erros. So, for instance, for 100 identical input samples, we have a forest predicting everything as inliners, but for 101 input samples, we have a forest predicting everything as outliers. The decision is taken only based on floating point rounding error value. One may check this by changing the number of input samples: X = np.ones((100, 10)) to X = np.ones((101, 10)) or something else.
The equation in the original paper is not fully simplified:

    c(n) = 2 H(n-1) - 2 (n-1) / n, where H(k) ~ ln(k) + gamma

The definition of the harmonic number is:

    H(n) = sum_{j=1}^{n} (1 / j)

then it follows that:

    H(n) = H(n-1) + 1 / n

and then:

    c(n) = 2 H(n-1) - 2 (n-1) / n = 2 (H(n-1) - 1 + 1/n) = 2 (H(n) - 1)

Here we use the simpler equation to save calculations.
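As a quick sanity check of the derivation (illustrative only, not part of the commit), the two forms can be compared numerically using exact harmonic sums:

import numpy as np

def harmonic(n):
    # exact harmonic number H(n) = sum_{j=1}^{n} 1/j
    return np.sum(1.0 / np.arange(1, n + 1))

for n in (2, 5, 10, 1000):
    paper = 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n
    simplified = 2.0 * (harmonic(n) - 1.0)
    assert np.isclose(paper, simplified)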
Force-pushed from 504cf2e to 5fd5ec7
Thanks for the PR @matwey
The math looks good to me. I don't expect this to bring any sort of performance improvement, but if anything, this PR highlights some instability in the decision function.
Maybe we should go from
is_inlier[self.decision_function(X) < 0] = -1
to
is_inlier[self.decision_function(X) < -1e-15] = -1
or something like that.
@ngoix @glemaitre you guys have worked on this right? Thoughts?
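For illustration only (a rough sketch, not a change proposed in this PR), the tolerance-based thresholding could look roughly like this; predict_with_tolerance is a hypothetical helper, not scikit-learn API:

import numpy as np

def predict_with_tolerance(iforest, X, tol=1e-15):
    # Treat scores within `tol` of zero as inliers, so tiny negative values
    # produced by rounding error do not flip the predicted label.
    is_inlier = np.ones(X.shape[0], dtype=int)
    is_inlier[iforest.decision_function(X) < -tol] = -1
    return is_inlier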
average_path_length[not_mask] = 2.0 * (np.log(n_samples_leaf[not_mask])
                                       + np.euler_gamma - 1.0)
Suggested change:
average_path_length[not_mask] = 2.0 * (np.log(n_samples_leaf[not_mask])
                                       + np.euler_gamma - 1.0)
average_path_length[not_mask] = 2 * (np.log(n_samples_leaf[not_mask])
                                     + np.euler_gamma - 1)
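For reference (illustrative, not part of the diff): the ln(n) + euler_gamma approximation in this expression differs from the exact harmonic number by roughly 1/(2n), which is the kind of small error behind the decision-function threshold discussion above:

import numpy as np

def harmonic(n):
    # exact harmonic number H(n)
    return np.sum(1.0 / np.arange(1, n + 1))

for n in (10, 100, 1000):
    exact = 2.0 * (harmonic(n) - 1.0)                   # 2 * (H(n) - 1)
    approx = 2.0 * (np.log(n) + np.euler_gamma - 1.0)   # formula used above
    print(n, exact - approx)                            # roughly 1 / n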
@@ -322,37 +325,3 @@ def test_iforest_deprecation():
    warn_msg = "'behaviour' is deprecated in 0.22 and will be removed in 0.24"
    with pytest.warns(FutureWarning, match=warn_msg):
        iforest.fit(iris.data)

def test_iforest_with_uniform_data(): |
We definitely don't want to remove this one.
I don't insist on removing this, but #7141 should be resolved in a proper way then.
thanks for opening this PR, can you elaborate on why #7141 was not resolved in a proper way?
tbh I can't find what/where this issue was fixed, but this test seems to show it was fixed
sorry, just saw your previous comment - so you're saying it was fixed because the approximation error basically flipped sides. I think we should go with @NicolasHug's suggestion then - is_inlier[self.decision_function(X) < -1e-15] = -1
instead of is_inlier[self.decision_function(X) < 0] = -1
@ngoix Do you expect me to do it in this PR or in a separate PR?
another PR would be the proper way, but then we'd have to merge this one only afterwards because we definitely shouldn't remove this test.
I'd be ok to do it in this PR if it is just this one-line change
Allow a small discrepancy in the decision function, as suggested in scikit-learn#16721.
The equation in the original paper is not fully simplified:

    c(n) = 2 H(n-1) - 2 (n-1) / n, where H(k) ~ ln(k) + gamma

The definition of the harmonic number is:

    H(n) = sum_{j=1}^{n} (1 / j)

then it follows that:

    H(n) = H(n-1) + 1 / n

and then:

    c(n) = 2 H(n-1) - 2 (n-1) / n = 2 (H(n-1) - 1 + 1/n) = 2 (H(n) - 1)

Here we use the simpler equation to save calculations.