Average path length in iForest is inaccurate for small sizes · Issue #15724 · scikit-learn/scikit-learn · GitHub
Average path length in iForest is inaccurate for small sizes #15724


Open
Konrad0 opened this issue Nov 27, 2019 · 9 comments · May be fixed by #19087
Comments

Konrad0 commented Nov 27, 2019

Description

The routine _average_path_length() for isolation forests (sklearn.ensemble.IsolationForest) gives quite inaccurate results for small sizes (n_samples_leaf), which propagates to the anomaly scores. For n_samples_leaf=3 the error is around 30%. This is because a very rough approximation from the original paper is used, which only reaches reasonable accuracy for very large sample counts. I therefore suggest modifying the current routine as follows, combining a lookup table for small values with an improved asymptotic series for large values, which should be accurate to double precision. Please let me know what you think. (This might be considered an improvement on #13251.)

Steps/Code to Fix

import numpy as np
from sklearn.utils import check_array


def _average_path_length_small(n_samples_leaf):
    # Exact values of c(n) = 2*(H(n) - 1) for n = 0..51, precomputed
    # to double precision; c(0) and c(1) are defined as 0.0.
    c = (0.0, 0.0, 1.0, 1.6666666666666666667, 
         2.1666666666666666667, 2.5666666666666666667, 2.9000000000000000000, 3.1857142857142857143, 
         3.4357142857142857143, 3.6579365079365079365, 3.8579365079365079365, 4.0397546897546897547, 
         4.2064213564213564214, 4.3602675102675102675, 4.5031246531246531247, 4.6364579864579864580, 
         4.7614579864579864580, 4.8791050452815158698, 4.9902161563926269809, 5.0954793142873638230, 
         5.1954793142873638230, 5.2907174095254590611, 5.3816265004345499702, 5.4685830221736804049, 
         5.5519163555070137383, 5.6319163555070137383, 5.7088394324300906613, 5.7829135065041647354, 
         5.8543420779327361640, 5.9233075951741154743, 5.9899742618407821410, 6.0544903908730402055, 
         6.1169903908730402055, 6.1775964514791008116, 6.2364199808908655175, 6.2935628380337226603, 
         6.3491183935892782159, 6.4031724476433322699, 6.4558040265907006910, 6.5070860778727519730, 
         6.5570860778727519730, 6.6058665656776300218, 6.6534856132966776409, 6.6999972412036543850, 
         6.7454517866581998396, 6.7898962311026442840, 6.8333744919722095014, 6.8759276834615712036, 
         6.9175943501282378702, 6.9584106766588501151, 6.9984106766588501151, 7.0376263629333599190)
    # Clamp negative values to index 0 to guard against invalid input.
    return np.array(c)[np.maximum(0, n_samples_leaf)]

def _average_path_length(n_samples_leaf):
    """
    The average path length in a n_samples iTree, which is equal to
    the average path length of an unsuccessful BST search since the
    latter has the same structure as an isolation tree.
    Parameters
    ----------
    n_samples_leaf : array-like, shape (n_samples,).
        The number of training samples in each test sample leaf, for
        each estimators.

    Returns
    -------
    average_path_length : array, same shape as n_samples_leaf
    """

    n_samples_leaf = check_array(n_samples_leaf, ensure_2d=False)

    n_samples_leaf_shape = n_samples_leaf.shape
    n_samples_leaf = n_samples_leaf.reshape((1, -1))
    average_path_length = np.zeros(n_samples_leaf.shape)

    mask_small = n_samples_leaf < 52
    not_mask = ~mask_small

    average_path_length[mask_small] = _average_path_length_small(n_samples_leaf[mask_small])

    tmp = 1.0/np.square(n_samples_leaf[not_mask])
    average_path_length[not_mask] = (
        2.0 * (np.log(n_samples_leaf[not_mask]) - 1.0 + np.euler_gamma)
        + 1.0/n_samples_leaf[not_mask] - tmp*(1.0/6 - tmp*(1.0/60 - tmp/126))
    )

    return average_path_length.reshape(n_samples_leaf_shape)
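
As a quick sanity check (not part of the proposed patch), the series branch can be compared against exact values computed with rational arithmetic; `c_exact` and `c_series` are illustrative helper names:

```python
import math
from fractions import Fraction

def c_exact(n):
    # c(n) = 2*(H(n) - 1), with H(n) the n-th harmonic number,
    # computed exactly as a rational and converted to float at the end.
    return float(2 * (sum(Fraction(1, k) for k in range(1, n + 1)) - 1))

def c_series(n):
    # The asymptotic series used above for n >= 52:
    # 2*(ln n - 1 + gamma) + 1/n - 1/(6 n^2) + 1/(60 n^4) - 1/(126 n^6)
    t = 1.0 / n**2
    return (2.0 * (math.log(n) - 1.0 + 0.5772156649015329)
            + 1.0 / n - t * (1.0 / 6 - t * (1.0 / 60 - t / 126)))

for n in (52, 100, 1000):
    # Agreement to a few ulp at and beyond the crossover point n = 52.
    assert abs(c_series(n) - c_exact(n)) <= 1e-13 * c_exact(n)
```

The crossover at 52 is chosen so the truncated series is already accurate to roughly machine precision at the smallest n that uses it.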

Versions

System:
python: 3.8.0 (default, Oct 23 2019, 18:51:26) [GCC 9.2.0]
executable: /usr/bin/python
machine: Linux-5.3.13-arch1-1-x86_64-with-glibc2.2.5

Python deps:
pip: 19.2.3
setuptools: 41.6.0
sklearn: 0.21.3
numpy: 1.17.4
scipy: 1.3.1
Cython: 0.29.14
pandas: 0.25.3

albertcthomas (Contributor):

Thanks @Konrad0.

For n_samples_leaf=3 the error is around 30%. This is due to the fact that a very rough approximation mentioned in the original paper is used, which only reaches reasonable accuracy for very large numbers.

Could you elaborate a bit more? Do you have an example showing this?

Konrad0 (Author) commented Dec 11, 2019

Hi @albertcthomas ,
sure, happy to elaborate. To quote the original paper (Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation forest.” Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on.)

Given a data set of n instances, Section 10.3.3 of [9] gives the average path length of unsuccessful search in BST as:
c(n) = 2H(n − 1) − (2(n − 1)/n),
where H(i) is the harmonic number and it can be estimated by ln(i) + 0.5772156649 (Euler’s constant).

Now we can calculate the average path length exactly (e.g. with sympy) or with the suggested logarithmic approximation (as used in the isolation forest implementation), and compare their relative difference:

| n | exact | approximate | 1 − approx/exact |
|---|---|---|---|
| 3 | 1.66666666666667 | 1.207392357589623 | 0.275564585446226 |
| 10 | 3.85793650793651 | 3.748880484475505 | 0.0282679674060613 |
| 100 | 8.37475503527924 | 8.364671030072245 | 0.00120409554243883 |
| 1000 | 12.9709417211007 | 12.969940887100174 | 7.71597022047876e-5 |
| 10000 | 17.5752120720888 | 17.575112063754766 | 5.69030595976017e-6 |
| 100000 | 22.1802922597269 | 22.180282259643522 | 4.50854443976707e-7 |
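
The first rows of this comparison can be reproduced with a short script (illustrative only; `harmonic`, `c_exact`, and `c_approx` are ad-hoc helpers, and the paper's truncated value of Euler's constant is used for the approximation):

```python
import math
from fractions import Fraction

def harmonic(i):
    # Exact harmonic number H(i) as a rational.
    return sum(Fraction(1, k) for k in range(1, i + 1))

def c_exact(n):
    # c(n) = 2*H(n-1) - 2*(n-1)/n, evaluated exactly.
    return float(2 * harmonic(n - 1) - Fraction(2 * (n - 1), n))

def c_approx(n):
    # The paper's approximation: H(i) ~ ln(i) + 0.5772156649.
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

for n in (3, 10, 100, 1000):
    ex, ap = c_exact(n), c_approx(n)
    print(f"{n}\t{ex:.14g}\t{ap:.14g}\t{1 - ap / ex:.3g}")
```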

Since there are more accurate approximations for the harmonic numbers (see Wikipedia), there is no need to put up with these deviations; hence my improved implementation, which should be accurate to within a few machine epsilons in double precision. (Incidentally, the original authors don't claim that their approximation is applicable everywhere, nor do they state its precision.)

(As a side note, using the definition of the harmonic numbers, the average path length can be simplified to c(n) = 2(H(n) - 1), but that's not important at the moment. Also, if we wanted to get rid of most of the code for the average path length, we could just use the relation between the harmonic numbers and the digamma function H(n) = ψ(n+1) + γ, but that's not very efficient.)

jnothman (Member) commented Dec 11, 2019 via email

albertcthomas (Contributor):

highlighting the current error in anomaly scores?

Yes, it would be good to know the impact of these errors on the anomaly scores.

Konrad0 (Author) commented Dec 12, 2019

Yeah, I'll try to construct a test case; give me a few days as I'm a bit busy.
In theory the impact should be rather straightforward, since the anomaly score is defined as s = -2**(-E(h)/c(n)) (positive sign in the original paper), where E(h) is the path length of a given point averaged over all trees in the forest. For instance, s_exact = -0.660 and s_approximate = -0.563 for E(h)=1 and n=3, a relative difference of about 15%.
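
That back-of-the-envelope number checks out with plain arithmetic (using the exact c(3) = 5/3 and the approximate value quoted earlier in the thread; `score` is an illustrative helper):

```python
def score(e_h, c_n):
    # Anomaly score as used in sklearn: s = -2**(-E(h)/c(n)).
    return -2.0 ** (-e_h / c_n)

c3_exact = 5.0 / 3.0             # exact c(3)
c3_approx = 1.207392357589623    # logarithmic approximation of c(3)

s_exact = score(1.0, c3_exact)    # about -0.660
s_approx = score(1.0, c3_approx)  # about -0.563
print(1.0 - s_approx / s_exact)   # about 0.146, i.e. ~15%
```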

albertcthomas (Contributor):

Thanks. It will also be important to document this, as it makes the implementation a bit different from what is described in the original paper.

pjgao commented Dec 26, 2020

This modification sounds good. Is there any progress?

Konrad0 added a commit to Konrad0/scikit-learn that referenced this issue Jan 1, 2021
…#15724)

An improved approximation for the average path length in isolation
forests is introduced, based on a more accurate harmonic number
calculation. The routine _average_path_length() is mostly replaced.
Konrad0 (Author) commented Jan 1, 2021

Hi all,

thanks for the reminder that I never submitted the PR; I've just picked up the issue again. I pushed the improved code and will try to construct a suitable test soon. As expected, some of the existing tests now fail; I'll take care of that too and let you all know when the PR seems stable.
Thanks @pjgao for noticing that this issue may be related to #16721; maybe the people working on that issue can take a look to see whether this improvement has any impact on their problem.

Konrad0 linked a pull request Jan 1, 2021 that will close this issue
Konrad0 (Author) commented Jan 2, 2021

Hi @albertcthomas and @jnothman,
could you take a look at the PR (#19087) when you find some time? The build fails but I suspect that at least some of the failures are unrelated to these changes.
As suggested above, a couple of the tests fail without this improvement but pass now (their expected values had to be adjusted).

Also, to mitigate the numerical instability issue outlined in #16721 and #16967, I've implemented a threshold value of -1.0e-15 as suggested in the aforementioned PRs, which were never merged into master. Maybe @NicolasHug, @ngoix, @glemaitre, or @matwey would like to comment. (Sidenote: IMO this topic could use more investigation but as a separate issue.)
Thank you all for your contributions!
