sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839

joshuakennethjones · 2018-08-16T16:34:01Z

Description

When an input value to _average_path_length() is in {0,1}, the return value should be zero, not one, as in the existing implementation. Also, when the input value is 2, the return value should be 1, not 0.15... as in the current implementation. These results should be expected for two reasons: first, based on the 2012 iForest paper (the original paper indicated a zero value for terminal nodes to which < 2 training examples had been sorted in the first paragraph of section 4.2, but left this vague in the algorithm/equation specifications, and did not specify a unique value for nodes to which exactly two training examples had been sorted), where it is explicitly stated that c(n) (the value computed by _average_path_length) should take the value zero for n in {0,1} and takes the value 1 for n=2. Also, from a rational perspective, we want these values to monotonically increase with n, and in the current implementation this is not the case. This is a pretty easy fix, I think -- just alter the existing cases for inputs in {0,1} to return zero instead of 1 (already hard-coded for these cases) and add a case for an input value of 2 to return 1. Since I have not contributed in the past, I felt it best to relay the issue this way vs. making my own pull request. This issue will impact anomaly scores in a subtle but potentially meaningful way.

Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.

Steps/Code to Reproduce

import numpy as np
from sklearn.ensemble.iforest import _average_path_length
_average_path_length(np.array([0,1,2,3]))

Expected Results

array([0. , 0. , 1., 1.20739236])

Actual Results

array([1. , 1. , 0.15443133, 1.20739236])

Versions

Linux-4.14.47-56.37.amzn1.x86_64-x86_64-with-glibc2.2.5
('Python', '2.7.14 (default, May  2 2018, 18:31:34) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]')
('NumPy', '1.14.5')
('SciPy', '1.1.0')
('Scikit-Learn', '0.19.1')

The text was updated successfully, but these errors were encountered:

joshuakennethjones · 2018-08-16T17:47:16Z

Here is one alternative version of the function that seems to work in case it is helpful:

def _average_path_length_fixed(n_samples_leaf):
    """ The average path length in a n_samples iTree, which is equal to
    the average path length of an unsuccessful BST search since the
    latter has the same structure as an isolation tree.
    Parameters
    ----------
    n_samples_leaf : array-like of shape (n_samples, n_estimators), or int.
        The number of training samples in each test sample leaf, for
        each estimators.
    Returns
    -------
    average_path_length : array, same shape as n_samples_leaf
    """
    if isinstance(n_samples_leaf, INTEGER_TYPES):
        if n_samples_leaf <= 1:
            return 0.
        if n_samples_leaf <= 2:
            return 1.
        else:
            return 2. * (np.log(n_samples_leaf - 1.) + euler_gamma) - 2. * (
                n_samples_leaf - 1.) / n_samples_leaf

    else:

        n_samples_leaf_shape = n_samples_leaf.shape
        n_samples_leaf = n_samples_leaf.reshape((1, -1))
        average_path_length = np.zeros(n_samples_leaf.shape)

        mask_1 = (n_samples_leaf <= 1)
        mask_2 = (n_samples_leaf == 2)
        not_mask = np.logical_not(np.logical_or(mask_1, mask_2))

        average_path_length[mask_1] = 0.
        average_path_length[mask_2] = 1.
        average_path_length[not_mask] = 2. * (
            np.log(n_samples_leaf[not_mask] - 1.) + euler_gamma) - 2. * (
                n_samples_leaf[not_mask] - 1.) / n_samples_leaf[not_mask]

        return average_path_length.reshape(n_samples_leaf_shape)

artsiomkaltovich · 2018-09-13T07:17:56Z

Maybe you will create a pull request?

joshuakennethjones · 2018-09-14T19:17:13Z

Sure, I'll do that -- just didn't want to be presumptuous and didn't know the right protocol for this kind of thing. Thanks!

Fix Issue scikit-learn#11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.

joshuakennethjones added a commit to joshuakennethjones/scikit-learn that referenced this issue Sep 14, 2018

Fix issue scikit-learn#11839

2bdb36b

Fix Issue scikit-learn#11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.

joshuakennethjones mentioned this issue Sep 14, 2018

[MRG] sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839 #12085

Closed

albertcthomas pushed a commit to albertcthomas/scikit-learn that referenced this issue Feb 25, 2019

Fix issue scikit-learn#11839

6E00 8fddcb8

Fix Issue scikit-learn#11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.

albertcthomas mentioned this issue Feb 25, 2019

[MRG+1] Fix iforest average path length #13251

Merged

glemaitre closed this as completed in #13251 Feb 26, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839

sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839

sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839

Comments

Description

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

Uh oh!

Uh oh!

Uh oh!