-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
Average path length in iForest is inaccurate for small sizes #15724
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks @Konrad0.
could you elaborate a bit more? do you have an example showing this? |
Hi @albertcthomas ,
Now we can calculate the average path length exactly (e.g. from sympy) or using the suggested logarithmic approximation (used in the isolation forests) and compare by their relative difference:
Since there are more accurate approximations for the harmonic numbers (see Wikipedia), it seems unnecessary to put up with these deviations, therefore my improved implementation which should be accurate down to a couple of machine epsilon in double precision. (BTW the original authors don't claim that their approximation is applicable everywhere or its precision.) (As a side note, using the definition of the harmonic numbers, the average path length can be simplified to c(n) = 2(H(n) - 1), but that's not important at the moment. Also, if we wanted to get rid of most of the code for the average path length, we could just use the relation between the harmonic numbers and the digamma function H(n) = ψ(n+1) + γ, but that's not very efficient.) |
This sounds reasonable. Can you please submit a Pull Request? Would you be
able to construct a test case that fails at master, highlighting the
current error in anomaly scores?
|
yes it would be good to know the impact of these errors on the anomaly scores. |
Yeah, I'll try to construct a test case, give me a few days as I'm a bit busy. |
thanks. it will also be important to document this as this makes the implementation a bit different to what is described in the original paper. |
This modification sounds good, Is there any progress? |
…#15724) An improved approximation for the average path length in isolation forests is introduced, based on more a accurate harmonic number calculation. The routine _average_path_length() is mostly replaced.
Hi all, thanks for the reminder that I never submitted the PR, I just picked up the issue again. I just pushed the improved code and will try to construct a suitable test soon. Right now some of the existing tests fail as expected, I'll take care of that too and let you all know when the PR seems stable. |
Hi @albertcthomas and @jnothman, Also, to mitigate the numerical instability issue outlined in #16721 and #16967, I've implemented a threshold value of |
Description
The routine
_average_path_length()
for isolation forests (sklearn.ensemble.IsolationForest) gives quite inaccurate results for small sizes (n_samples_leaf
), which propagates to the anomaly scores. Forn_samples_leaf=3
the error is around 30%. This is due to the fact that a very rough approximation mentioned in the original paper is used, which only reaches reasonable accuracy for very large numbers. Therefore I would suggest to modify the current routine as follows with a combination of a lookup table for small values and an improved asymptotic series for large values, which should be accurate to double precision. Please let me know what you think. (This might be considered an improvement on #13251.)Steps/Code to Fix
Versions
System:
python: 3.8.0 (default, Oct 23 2019, 18:51:26) [GCC 9.2.0]
executable: /usr/bin/python
machine: Linux-5.3.13-arch1-1-x86_64-with-glibc2.2.5
Python deps:
pip: 19.2.3
setuptools: 41.6.0
sklearn: 0.21.3
numpy: 1.17.4
scipy: 1.3.1
Cython: 0.29.14
pandas: 0.25.3
The text was updated successfully, but these errors were encountered: