8000 sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. · Issue #11839 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
joshuakennethjones opened this issue Aug 16, 2018 · 3 comments · Fixed by #13251

Comments

@joshuakennethjones
Copy link

Description

When an input value to _average_path_length() is in {0,1}, the return value should be zero, not one, as in the existing implementation. Also, when the input value is 2, the return value should be 1, not 0.15... as in the current implementation. These results should be expected for two reasons: first, based on the 2012 iForest paper (the original paper indicated a zero value for terminal nodes to which < 2 training examples had been sorted in the first paragraph of section 4.2, but left this vague in the algorithm/equation specifications, and did not specify a unique value for nodes to which exactly two training examples had been sorted), where it is explicitly stated that c(n) (the value computed by _average_path_length) should take the value zero for n in {0,1} and takes the value 1 for n=2. Also, from a rational perspective, we want these values to monotonically increase with n, and in the current implementation this is not the case. This is a pretty easy fix, I think -- just alter the existing cases for inputs in {0,1} to return zero instead of 1 (already hard-coded for these cases) and add a case for an input value of 2 to return 1. Since I have not contributed in the past, I felt it best to relay the issue this way vs. making my own pull request. This issue will impact anomaly scores in a subtle but potentially meaningful way.

Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.

Steps/Code to Reproduce

import numpy as np
from sklearn.ensemble.iforest import _average_path_length
_average_path_length(np.array([0,1,2,3]))

Expected Results

array([0. , 0. , 1., 1.20739236])

Actual Results

array([1. , 1. , 0.15443133, 1.20739236])

Versions

Linux-4.14.47-56.37.amzn1.x86_64-x86_64-with-glibc2.2.5
('Python', '2.7.14 (default, May  2 2018, 18:31:34) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]')
('NumPy', '1.14.5')
('SciPy', '1.1.0')
('Scikit-Learn', '0.19.1')

@joshuakennethjones
Copy link
Author

Here is one alternative version of the function that seems to work in case it is helpful:

def _average_path_length_fixed(n_samples_leaf):
    """ The average path length in a n_samples iTree, which is equal to
    the average path length of an unsuccessful BST search since the
    latter has the same structure as an isolation tree.
    Parameters
    ----------
    n_samples_leaf : array-like of shape (n_samples, n_estimators), or int.
        The number of training samples in each test sample leaf, for
        each estimators.
    Returns
    -------
    average_path_length : array, same shape as n_samples_leaf
    """
    if isinstance(n_samples_leaf, INTEGER_TYPES):
        if n_samples_leaf <= 1:
            return 0.
        if n_samples_leaf <= 2:
            return 1.
        else:
            return 2. * (np.log(n_samples_leaf - 1.) + euler_gamma) - 2. * (
                n_samples_leaf - 1.) / n_samples_leaf

    else:

        n_samples_leaf_shape = n_samples_leaf.shape
        n_samples_leaf = n_samples_leaf.reshape((1, -1))
        average_path_length = np.zeros(n_samples_leaf.shape)

        mask_1 = (n_samples_leaf <= 1)
        mask_2 = (n_samples_leaf == 2)
        not_mask = np.logical_not(np.logical_or(mask_1, mask_2))

        average_path_length[mask_1] = 0.
        average_path_length[mask_2] = 1.
        average_path_length[not_mask] = 2. * (
            np.log(n_samples_leaf[not_mask] - 1.) + euler_gamma) - 2. * (
                n_samples_leaf[not_mask] - 1.) / n_samples_leaf[not_mask]

        return average_path_length.reshape(n_samples_leaf_shape)

@artsiomkaltovich
Copy link

Maybe you will create a pull request?

@joshuakennethjones
Copy link
Author

Sure, I'll do that -- just didn't want to be presumptuous and didn't know the right protocol for this kind of thing. Thanks!

joshuakennethjones added a commit to joshuakennethjones/scikit-learn that referenced this issue Sep 14, 2018
Fix Issue scikit-learn#11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.
albertcthomas pushed a commit to albertcthomas/scikit-learn that referenced this issue Feb 25, 2019
Fix Issue scikit-learn#11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants
0