
Improve IsolationForest average depth evaluation #16721


Open

matwey wants to merge 3 commits into scikit-learn:main from matwey:fix/isoforest_avg_path
Conversation

@matwey commented Mar 19, 2020

The equation in the original paper is not fully simplified:

    c(n) = 2 H(n-1) - 2 (n-1) / n,

where H(k) ~ (ln(k) + gamma).

The definition of the harmonic number is:

    H(n) = sum_{j=1}^{n} (1 / j)

from which it follows that:

    H(n) = H(n-1) + 1 / n

and therefore:

    c(n) = 2 H(n-1) - 2 (n-1) / n = 2 ( H(n-1) - 1 + 1 / n ) =
         = 2 (H(n) - 1)

Here we use the simpler equation to save calculations.
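As a quick sanity check of this identity (a verification sketch, not part of the PR, using exact rational arithmetic so no floating-point error enters):

    from fractions import Fraction

    def harmonic(n):
        # Exact harmonic number H(n) = sum_{j=1}^{n} 1/j
        return sum(Fraction(1, j) for j in range(1, n + 1))

    for n in range(2, 10):
        lhs = 2 * harmonic(n - 1) - Fraction(2 * (n - 1), n)
        rhs = 2 * (harmonic(n) - 1)
        assert lhs == rhs  # 2 H(n-1) - 2 (n-1)/n == 2 (H(n) - 1)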


@albertcthomas (Contributor):

@matwey could you check why some of the tests failed?

@matwey (Author) commented Apr 9, 2020

> @matwey could you check why some of the tests failed?

Sorry for the delay. I'll do that.

@matwey force-pushed the fix/isoforest_avg_path branch from 0869944 to 00d7bbc on April 15, 2020

    We should compare the implementation with the exact equation instead of the approximation.

@matwey force-pushed the fix/isoforest_avg_path branch from 00d7bbc to 504cf2e on April 15, 2020
@matwey (Author) commented Apr 15, 2020

@albertcthomas I still have an issue with test_iforest_with_uniform_data(). I've found that the test itself relies on the floating-point arithmetic implementation. Previously, iforest.score_samples(X) evaluated to approximately +epsilon and was then compared to 0; with my commit it evaluates to -epsilon. One can show that the exact iforest.score_samples(X) should be exactly 0 if all the calculations are done by hand on paper.
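A minimal sketch of the sensitivity being described (assuming the current public IsolationForest API; the exact sign of the epsilon is platform- and version-dependent):

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # With identical samples no split is possible, so every tree is a single
    # leaf and the exact decision function should be 0 for every point.
    X = np.ones((100, 10))
    iforest = IsolationForest(random_state=0).fit(X)

    print(iforest.decision_function(X)[:3])  # an epsilon away from 0
    print(iforest.predict(X)[:3])            # sign of that epsilon picks +1 or -1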

matwey added 2 commits April 18, 2020 11:31
…rn#14771)"

This reverts commit bcaf381.

The test in the reverted commit is useless: its outcome does not really reflect the code implementation. The commit claims to fix scikit-learn#7141, where an isolation forest trained on identical values leads to degenerate trees.

Under the described circumstances, one may check that the exact score value for every point in the parameter space is zero (or 0.5, depending on whether we are talking about the initial paper or the scikit-learn implementation). However, there is no special handling in the existing implementation, and the score value is subject to rounding errors. So, for instance, for 100 identical input samples we get a forest predicting everything as inliers, but for 101 input samples we get a forest predicting everything as outliers. The decision is taken based only on the floating-point rounding error.

One may check this by changing the number of input samples:

    X = np.ones((100, 10))

to

    X = np.ones((101, 10))

or something else.
The equation in the original paper is not fully simplified:

    c(n) = 2 H(n-1) - 2 (n-1) / n,

where H(k) ~ (ln(k) + gamma).

The definition of the harmonic number is:

    H(n) = sum_{j=1}^{n} (1 / j)

from which it follows that:

    H(n) = H(n-1) + 1 / n

and therefore:

    c(n) = 2 H(n-1) - 2 (n-1) / n = 2 ( H(n-1) - 1 + 1 / n ) =
         = 2 (H(n) - 1)

Here we use the simpler equation to save calculations.
@matwey force-pushed the fix/isoforest_avg_path branch from 504cf2e to 5fd5ec7 on April 18, 2020
@NicolasHug (Member) left a comment:

Thanks for the PR @matwey

The math looks good to me. I don't expect this to bring any sort of performance improvement, but if anything, this PR highlights some instability in the decision function.

Maybe we should go from

        is_inlier[self.decision_function(X) < 0] = -1

to

        is_inlier[self.decision_function(X) < -1e-15] = -1

or something like that.

@ngoix @glemaitre you guys have worked on this right? Thoughts?
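For illustration, a sketch of what such a tolerance-based predict could look like (the helper name and tol value are hypothetical, not scikit-learn API):

    import numpy as np

    def predict_with_tolerance(iforest, X, tol=1e-15):
        # Hypothetical variant of IsolationForest.predict: scores within
        # tol of the decision boundary are treated as inliers.
        is_inlier = np.ones(X.shape[0], dtype=int)
        is_inlier[iforest.decision_function(X) < -tol] = -1
        return is_inlier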

Comment on lines +504 to +505

    average_path_length[not_mask] = 2.0 * (np.log(n_samples_leaf[not_mask])
                                           + np.euler_gamma - 1.0)

Member:

Suggested change

    -average_path_length[not_mask] = 2.0 * (np.log(n_samples_leaf[not_mask])
    -                                       + np.euler_gamma - 1.0)
    +average_path_length[not_mask] = 2 * (np.log(n_samples_leaf[not_mask])
    +                                     + np.euler_gamma - 1)

    @@ -322,37 +325,3 @@ def test_iforest_deprecation():
         warn_msg = "'behaviour' is deprecated in 0.22 and will be removed in 0.24"
         with pytest.warns(FutureWarning, match=warn_msg):
             iforest.fit(iris.data)


    -def test_iforest_with_uniform_data():
Member:
We definitely don't want to remove this one.

Author:
I don't insist on removing this, but #7141 should be resolved in a proper way then.

Contributor:
Thanks for opening this PR. Can you expand on why #7141 was not resolved in a proper way? TBH I can't find what/where this issue was fixed, but this test seems to show it was.

Contributor:
Sorry, just saw your previous comment - so you're saying it was fixed because the approximation error flipped side, basically. I think we should go with @NicolasHug's suggestion then: is_inlier[self.decision_function(X) < -1e-15] = -1 instead of is_inlier[self.decision_function(X) < 0] = -1.

Author:
@ngoix Do you expect me to do it in this PR or in a separate PR?


Contributor:
Another PR would be the proper way, but then we'd have to merge this one only afterwards, because we definitely shouldn't remove this test. I'd be OK doing it in this PR if it is just this one-line change.

matwey added a commit to matwey/scikit-learn that referenced this pull request Apr 20, 2020

    Allow a small discrepancy for the decision function as suggested in scikit-learn#16721.

@matwey mentioned this pull request Apr 20, 2020

matwey added a commit to matwey/scikit-learn that referenced this pull request Apr 20, 2020

    Allow a small discrepancy for the decision function as suggested in scikit-learn#16721.
@pjgao commented Dec 30, 2020

$c(n) = 2 H(n-1) - 2 (n-1) / n$ and $c(n) = 2 (H(n) - 1)$ will give different results when we use $H(i) \approx \ln i + 0.5772156649$ to approximate the harmonic number, because the approximations of $H(n)$ and $H(n-1)$ differ when $n$ is small, as described in #15724.
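A quick numerical sketch of that point (np.euler_gamma is the gamma constant; both forms converge as n grows but disagree for small n):

    import numpy as np

    def c_old(n):
        # 2 H(n-1) - 2 (n-1)/n with H(k) approximated by ln(k) + gamma
        return 2.0 * (np.log(n - 1) + np.euler_gamma) - 2.0 * (n - 1) / n

    def c_new(n):
        # 2 (H(n) - 1) with H(n) approximated by ln(n) + gamma
        return 2.0 * (np.log(n) + np.euler_gamma - 1.0)

    for n in [2, 3, 5, 10, 100, 1000]:
        print(n, c_old(n), c_new(n))  # noticeably different at n = 2, near-equal by n = 1000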
