Average path length in iForest is inaccurate for small sizes · Issue #15724 · scikit-learn/scikit-learn · GitHub
Average path length in iForest is inaccurate for small sizes #15724


Open
Konrad0 opened this issue Nov 27, 2019 · 9 comments · May be fixed by #19087
Comments

Konrad0 commented Nov 27, 2019

Description

The routine _average_path_length() for isolation forests (sklearn.ensemble.IsolationForest) gives quite inaccurate results for small sizes (n_samples_leaf), which propagates to the anomaly scores. For n_samples_leaf=3 the error is around 30%. This is because a very rough approximation from the original paper is used, which only reaches reasonable accuracy for very large sample counts. I therefore suggest modifying the current routine as follows, combining a lookup table for small values with an improved asymptotic series for large values, which should be accurate to double precision. Please let me know what you think. (This might be considered an improvement on #13251.)

Steps/Code to Fix

import numpy as np
from sklearn.utils import check_array


def _average_path_length_small(n_samples_leaf):
    # Exact values of c(n) = 2*(H(n) - 1) for n = 0..51, precomputed
    # to double precision; c(0) and c(1) are defined as 0.0.
    c = (0.0, 0.0, 1.0, 1.6666666666666666667, 
         2.1666666666666666667, 2.5666666666666666667, 2.9000000000000000000, 3.1857142857142857143, 
         3.4357142857142857143, 3.6579365079365079365, 3.8579365079365079365, 4.0397546897546897547, 
         4.2064213564213564214, 4.3602675102675102675, 4.5031246531246531247, 4.6364579864579864580, 
         4.7614579864579864580, 4.8791050452815158698, 4.9902161563926269809, 5.0954793142873638230, 
         5.1954793142873638230, 5.2907174095254590611, 5.3816265004345499702, 5.4685830221736804049, 
         5.5519163555070137383, 5.6319163555070137383, 5.7088394324300906613, 5.7829135065041647354, 
         5.8543420779327361640, 5.9233075951741154743, 5.9899742618407821410, 6.0544903908730402055, 
         6.1169903908730402055, 6.1775964514791008116, 6.2364199808908655175, 6.2935628380337226603, 
         6.3491183935892782159, 6.4031724476433322699, 6.4558040265907006910, 6.5070860778727519730, 
         6.5570860778727519730, 6.6058665656776300218, 6.6534856132966776409, 6.6999972412036543850, 
         6.7454517866581998396, 6.7898962311026442840, 6.8333744919722095014, 6.8759276834615712036, 
         6.9175943501282378702, 6.9584106766588501151, 6.9984106766588501151, 7.0376263629333599190)
    # Clamp negative values to index 0 to guard against invalid input.
    return np.array(c)[np.maximum(0, n_samples_leaf)]

def _average_path_length(n_samples_leaf):
    """
    The average path length in a n_samples iTree, which is equal to
    the average path length of an unsuccessful BST search since the
    latter has the same structure as an isolation tree.
    Parameters
    ----------
    n_samples_leaf : array-like, shape (n_samples,).
        The number of training samples in each test sample leaf, for
        each estimators.

    Returns
    -------
    average_path_length : array, same shape as n_samples_leaf
    """

    n_samples_leaf = check_array(n_samples_leaf, ensure_2d=False)

    n_samples_leaf_shape = n_samples_leaf.shape
    n_samples_leaf = n_samples_leaf.reshape((1, -1))
    average_path_length = np.zeros(n_samples_leaf.shape)

    mask_small = n_samples_leaf < 52
    not_mask = ~mask_small

    average_path_length[mask_small] = _average_path_length_small(n_samples_leaf[mask_small])

    tmp = 1.0/np.square(n_samples_leaf[not_mask])
    average_path_length[not_mask] = (
        2.0 * (np.log(n_samples_leaf[not_mask]) - 1.0 + np.euler_gamma)
        + 1.0/n_samples_leaf[not_mask] - tmp*(1.0/6 - tmp*(1.0/60 - tmp/126))
    )

    return average_path_length.reshape(n_samples_leaf_shape)
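
As a quick sanity check (not part of the proposed patch), the series branch can be compared against exact values computed with rational arithmetic; `c_exact` and `c_series` are illustrative helper names:

```python
import math
from fractions import Fraction

def c_exact(n):
    # c(n) = 2*(H(n) - 1), with H(n) the n-th harmonic number,
    # computed exactly as a rational and converted to float at the end.
    return float(2 * (sum(Fraction(1, k) for k in range(1, n + 1)) - 1))

def c_series(n):
    # The asymptotic series used above for n >= 52:
    # 2*(ln n - 1 + gamma) + 1/n - 1/(6 n^2) + 1/(60 n^4) - 1/(126 n^6)
    t = 1.0 / n**2
    return (2.0 * (math.log(n) - 1.0 + 0.5772156649015329)
            + 1.0 / n - t * (1.0 / 6 - t * (1.0 / 60 - t / 126)))

for n in (52, 100, 1000):
    # Agreement to a few ulp at and beyond the crossover point n = 52.
    assert abs(c_series(n) - c_exact(n)) <= 1e-13 * c_exact(n)
```

The crossover at 52 is chosen so the truncated series is already accurate to roughly machine precision at the smallest n that uses it.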

Versions

System:
python: 3.8.0 (default, Oct 23 2019, 18:51:26) [GCC 9.2.0]
executable: /usr/bin/python
machine: Linux-5.3.13-arch1-1-x86_64-with-glibc2.2.5

Python deps:
pip: 19.2.3
setuptools: 41.6.0
sklearn: 0.21.3
numpy: 1.17.4
scipy: 1.3.1
Cython: 0.29.14
pandas: 0.25.3

albertcthomas (Contributor):

Thanks @Konrad0.

For n_samples_leaf=3 the error is around 30%. This is due to the fact that a very rough approximation mentioned in the original paper is used, which only reaches reasonable accuracy for very large numbers.

Could you elaborate a bit more? Do you have an example showing this?

Konrad0 (Author) commented Dec 11, 2019

Hi @albertcthomas ,
sure, happy to elaborate. To quote the original paper (Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation forest.” Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on.)

Given a data set of n instances, Section 10.3.3 of [9] gives the average path length of unsuccessful search in BST as:
c(n) = 2H(n − 1) − (2(n − 1)/n),
where H(i) is the harmonic number and it can be estimated by ln(i) + 0.5772156649 (Euler’s constant).

Now we can calculate the average path length exactly (e.g. with sympy) or with the suggested logarithmic approximation (as used in the isolation forest implementation), and compare their relative difference:

| n | exact | approximate | 1 − approx/exact |
|---|---|---|---|
| 3 | 1.66666666666667 | 1.207392357589623 | 0.275564585446226 |
| 10 | 3.85793650793651 | 3.748880484475505 | 0.0282679674060613 |
| 100 | 8.37475503527924 | 8.364671030072245 | 0.00120409554243883 |
| 1000 | 12.9709417211007 | 12.969940887100174 | 7.71597022047876e-5 |
| 10000 | 17.5752120720888 | 17.575112063754766 | 5.69030595976017e-6 |
| 100000 | 22.1802922597269 | 22.180282259643522 | 4.50854443976707e-7 |
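
The first rows of this comparison can be reproduced with a short script (illustrative only; `harmonic`, `c_exact`, and `c_approx` are ad-hoc helpers, and the paper's truncated value of Euler's constant is used for the approximation):

```python
import math
from fractions import Fraction

def harmonic(i):
    # Exact harmonic number H(i) as a rational.
    return sum(Fraction(1, k) for k in range(1, i + 1))

def c_exact(n):
    # c(n) = 2*H(n-1) - 2*(n-1)/n, evaluated exactly.
    return float(2 * harmonic(n - 1) - Fraction(2 * (n - 1), n))

def c_approx(n):
    # The paper's approximation: H(i) ~ ln(i) + 0.5772156649.
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

for n in (3, 10, 100, 1000):
    ex, ap = c_exact(n), c_approx(n)
    print(f"{n}\t{ex:.14g}\t{ap:.14g}\t{1 - ap / ex:.3g}")
```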

Since there are more accurate approximations for the harmonic numbers (see Wikipedia), there is no need to put up with these deviations; hence my improved implementation, which should be accurate to within a few machine epsilons in double precision. (Incidentally, the original authors don't claim that their approximation is applicable everywhere, nor do they state its precision.)

(As a side note, using the definition of the harmonic numbers, the average path length can be simplified to c(n) = 2(H(n) - 1), but that's not important at the moment. Also, if we wanted to get rid of most of the code for the average path length, we could just use the relation between the harmonic numbers and the digamma function H(n) = ψ(n+1) + γ, but that's not very efficient.)

jnothman (Member) commented Dec 11, 2019 via email

albertcthomas (Contributor):

highlighting the current error in anomaly scores?

Yes, it would be good to know the impact of these errors on the anomaly scores.

Konrad0 (Author) commented Dec 12, 2019

Yeah, I'll try to construct a test case; give me a few days as I'm a bit busy.
In theory the impact should be rather straightforward, since the anomaly score is defined as s = -2**(-E(h)/c(n)) (positive sign in the original paper), where E(h) is the path length of a given point averaged over all trees in the forest. For instance, s_exact = -0.660 and s_approximate = -0.563 for E(h)=1 and n=3, a relative difference of about 15%.
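
That back-of-the-envelope number checks out with plain arithmetic (using the exact c(3) = 5/3 and the approximate value quoted earlier in the thread; `score` is an illustrative helper):

```python
def score(e_h, c_n):
    # Anomaly score as used in sklearn: s = -2**(-E(h)/c(n)).
    return -2.0 ** (-e_h / c_n)

c3_exact = 5.0 / 3.0             # exact c(3)
c3_approx = 1.207392357589623    # logarithmic approximation of c(3)

s_exact = score(1.0, c3_exact)    # about -0.660
s_approx = score(1.0, c3_approx)  # about -0.563
print(1.0 - s_approx / s_exact)   # about 0.146, i.e. ~15%
```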

albertcthomas (Contributor):

Thanks. It will also be important to document this, as it makes the implementation a bit different from what is described in the original paper.

pjgao commented Dec 26, 2020

This modification sounds good. Is there any progress?

Konrad0 added a commit to Konrad0/scikit-learn that referenced this issue Jan 1, 2021
…#15724)

An improved approximation for the average path length in isolation
forests is introduced, based on a more accurate harmonic number
calculation. The routine _average_path_length() is mostly replaced.
Konrad0 (Author) commented Jan 1, 2021

Hi all,

thanks for the reminder that I never submitted the PR; I've just picked up the issue again. I pushed the improved code and will try to construct a suitable test soon. As expected, some of the existing tests now fail; I'll take care of that too and let you all know when the PR seems stable.
Thanks @pjgao for noticing that this issue may be related to #16721; maybe the people working on that issue can take a look to see whether this improvement has any impact on their problem.

Konrad0 linked a pull request Jan 1, 2021 that will close this issue
Konrad0 (Author) commented Jan 2, 2021

Hi @albertcthomas and @jnothman,
could you take a look at the PR (#19087) when you find some time? The build fails but I suspect that at least some of the failures are unrelated to these changes.
As suggested above, a couple of the tests fail without this improvement but pass now (their expected values had to be adjusted).

Also, to mitigate the numerical instability issue outlined in #16721 and #16967, I've implemented a threshold value of -1.0e-15 as suggested in the aforementioned PRs, which were never merged into master. Maybe @NicolasHug, @ngoix, @glemaitre, or @matwey would like to comment. (Sidenote: IMO this topic could use more investigation but as a separate issue.)
Thank you all for your contributions!
