[MRG+1] FIX Correct depth formula in iforest (#8576) · scikit-learn/scikit-learn@4ab99c7 · GitHub

Commit 4ab99c7
Peter Wang authored and raghavrv committed
[MRG+1] FIX Correct depth formula in iforest (#8576)
* Fixed depth formula in iforest
* Added non-regression test for issue #8549
* reverted some whitespace changes
* Made changes to what's new and whitespace changes
* Update whats_new.rst
* Update whats_new.rst
* fixed faulty whitespace
* faulty whitespace fix and change to whats new
* added constants to iforest average_path_length and the according non regression test
* COSMIT
* Update whats_new.rst
* Corrected IsolationForest average path formula and added integer array equiv test
* changed line to under 80 char
* Update whats_new.rst
* Update whats_new.rst
* reran tests
* redefine np.euler_gamma
* added import statement for euler_gammma in iforest and test_iforest
* changed np.euler_gamma to euler_gamma
* fix small formatting issue
* fix small formatting issue
* modified average_path_length tests
* formatting fix + removed redundant tests
* fix import error
* retry remote server error
* retry remote server error
* retry remote server error
* re-added some iforest tests
* re-added some iforest tests
1 parent 29597ca commit 4ab99c7

4 files changed: +34 -11 lines changed

doc/whats_new.rst

Lines changed: 13 additions & 9 deletions
@@ -18,7 +18,7 @@ parameters, may produce different models from the previous version. This often
 occurs due to changes in the modelling logic (bug fixes or enhancements), or in
 random sampling procedures.

-* *to be listed*
+* :class:`sklearn.ensemble.IsolationForest` (bug fix)

 Details are listed in the changelog below.

@@ -156,7 +156,11 @@ Enhancements

 Bug fixes
 .........
-- Fixed a bug where :class:`sklearn.cluster.DBSCAN` gives incorrect
+- Fixed a bug where :class:`sklearn.ensemble.IsolationForest` uses an
+  an incorrect formula for the average path length
+  :issue:`8549` by `Peter Wang <https://github.com/PTRWang>`_.
+
+- Fixed a bug where :class:`sklearn.cluster.DBSCAN` gives incorrect
   result when input is a precomputed sparse matrix with initial
   rows all zero.
   :issue:`8306` by :user:`Akshay Gupta <Akshay0724>`
@@ -167,7 +171,7 @@ Bug fixes

 - Fixed a bug where :func:`sklearn.model_selection.BaseSearchCV.inverse_transform`
   returns self.best_estimator_.transform() instead of self.best_estimator_.inverse_transform()
-  :issue:`8344` by :user:`Akshay Gupta <Akshay0724>`
+  :issue:`8344` by :user:`Akshay Gupta <Akshay0724>`

 - Fixed a bug where :class:`sklearn.linear_model.RandomizedLasso` and
   :class:`sklearn.linear_model.RandomizedLogisticRegression` breaks for
@@ -274,13 +278,13 @@ API changes summary
   selection classes to be used with tools such as
   :func:`sklearn.model_selection.cross_val_predict`.
   :issue:`2879` by :user:`Stephen Hoover <stephen-hoover>`.
-
-- Estimators with both methods ``decision_function`` and ``predict_proba``
-  are now required to have a monotonic relation between them. The
-  method ``check_decision_proba_consistency`` has been added in
-  **sklearn.utils.estimator_checks** to check their consistency.
+
+- Estimators with both methods ``decision_function`` and ``predict_proba``
+  are now required to have a monotonic relation between them. The
+  method ``check_decision_proba_consistency`` has been added in
+  **sklearn.utils.estimator_checks** to check their consistency.
   :issue:`7578` by :user:`Shubham Bhardwaj <shubham0704>`
-
+

 .. _changes_0_18_1:

sklearn/ensemble/iforest.py

Lines changed: 3 additions & 2 deletions
@@ -7,6 +7,7 @@
 import numpy as np
 import scipy as sp
 from warnings import warn
+from sklearn.utils.fixes import euler_gamma

 from scipy.sparse import issparse

@@ -300,7 +301,7 @@ def _average_path_length(n_samples_leaf):
         if n_samples_leaf <= 1:
             return 1.
         else:
-            return 2. * (np.log(n_samples_leaf) + 0.5772156649) - 2. * (
+            return 2. * (np.log(n_samples_leaf - 1.) + euler_gamma) - 2. * (
                 n_samples_leaf - 1.) / n_samples_leaf

     else:
@@ -314,7 +315,7 @@ def _average_path_length(n_samples_leaf):

         average_path_length[mask] = 1.
         average_path_length[not_mask] = 2. * (
-            np.log(n_samples_leaf[not_mask]) + 0.5772156649) - 2. * (
+            np.log(n_samples_leaf[not_mask] - 1.) + euler_gamma) - 2. * (
                 n_samples_leaf[not_mask] - 1.) / n_samples_leaf[not_mask]

         return average_path_length.reshape(n_samples_leaf_shape)
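
The two hunks above change only the argument of the logarithm, from `n_samples_leaf` to `n_samples_leaf - 1`, and replace the truncated constant with `euler_gamma`. A minimal standalone sketch of the scalar case follows; the function name, the `corrected` flag, and the `EULER_GAMMA` constant are hypothetical and are not part of scikit-learn.

```python
import math

# Euler-Mascheroni constant, written out so the sketch needs only the stdlib.
EULER_GAMMA = 0.5772156649015329


def average_path_length(n_samples_leaf, corrected=True):
    """Approximate average path length for a leaf holding n_samples_leaf points.

    corrected=True  -> 2 * (log(n - 1) + gamma) - 2 * (n - 1) / n  (after the fix)
    corrected=False -> 2 * (log(n)     + gamma) - 2 * (n - 1) / n  (before the fix)
    """
    n = float(n_samples_leaf)
    if n <= 1:
        # Same convention as the diff: leaves with at most one sample return 1.
        return 1.0
    log_arg = n - 1.0 if corrected else n
    return 2.0 * (math.log(log_arg) + EULER_GAMMA) - 2.0 * (n - 1.0) / n


for n in (2, 5, 999):
    before = average_path_length(n, corrected=False)
    after = average_path_length(n, corrected=True)
    # The two versions differ by exactly 2 * log(n / (n - 1)).
    print(n, round(before, 6), round(after, 6))
```

Because the two versions differ by `2 * log(n / (n - 1))`, the change matters most for small leaves and shrinks as `n` grows.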

sklearn/ensemble/tests/test_iforest.py

Lines changed: 16 additions & 0 deletions
@@ -8,6 +8,8 @@

 import numpy as np

+from sklearn.utils.fixes import euler_gamma
+from sklearn.utils.testing import assert_almost_equal
 from sklearn.utils.testing import assert_array_equal
 from sklearn.utils.testing import assert_array_almost_equal
 from sklearn.utils.testing import assert_raises
@@ -19,6 +21,7 @@

 from sklearn.model_selection import ParameterGrid
 from sklearn.ensemble import IsolationForest
+from sklearn.ensemble.iforest import _average_path_length
 from sklearn.model_selection import train_test_split
 from sklearn.datasets import load_boston, load_iris
 from sklearn.utils import check_random_state
@@ -211,3 +214,16 @@ def test_iforest_subsampled_features():
     clf = IsolationForest(max_features=0.8)
     clf.fit(X_train, y_train)
     clf.predict(X_test)
+
+
+def test_iforest_average_path_length():
+    # It tests non-regression for #8549 which used the wrong formula
+    # for average path length, strictly for the integer case
+
+    result_one = 2. * (np.log(4.) + euler_gamma) - 2. * 4. / 5.
+    result_two = 2. * (np.log(998.) + euler_gamma) - 2. * 998. / 999.
+    assert_almost_equal(_average_path_length(1), 1., decimal=10)
+    assert_almost_equal(_average_path_length(5), result_one, decimal=10)
+    assert_almost_equal(_average_path_length(999), result_two, decimal=10)
+    assert_array_almost_equal(_average_path_length(np.array([1, 5, 999])),
+                              [1., result_one, result_two], decimal=10)
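
As a worked check of `result_one` in the test above, the expected value for a leaf of 5 samples can be evaluated by hand. The symbol $c(n)$ is introduced here only as shorthand for the average path length and does not appear in the diff; the figures are rounded to six decimals.

```latex
c(5) = 2\,(\ln 4 + \gamma) - \frac{2 \cdot 4}{5}
     \approx 2\,(1.386294 + 0.577216) - 1.6
     \approx 2.327020
```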

sklearn/utils/fixes.py

Lines changed: 2 additions & 0 deletions
@@ -36,6 +36,8 @@ def _parse_version(version_string):
             version.append(x)
     return tuple(version)

+euler_gamma = getattr(np, 'euler_gamma',
+                      0.577215664901532860606512090082402431)

 np_version = _parse_version(np.__version__)
 sp_version = _parse_version(scipy.__version__)
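
The `getattr` call above takes the constant from `np` when the attribute exists and otherwise falls back to a literal, presumably to cover NumPy builds that do not expose `euler_gamma`. The sketch below illustrates that lookup pattern with stand-in namespace objects instead of real NumPy versions; `resolve_euler_gamma`, `module_with_attr`, and `module_without_attr` are hypothetical names used only for this illustration.

```python
import math
import types

# Literal fallback copied from the diff; Python rounds it to the nearest double.
FALLBACK_EULER_GAMMA = 0.577215664901532860606512090082402431


def resolve_euler_gamma(module):
    """Return module.euler_gamma if present, else the literal fallback."""
    return getattr(module, "euler_gamma", FALLBACK_EULER_GAMMA)


# Stand-ins for a module that defines the constant and one that does not.
# The first value is deliberately truncated so the two cases are distinguishable.
module_with_attr = types.SimpleNamespace(euler_gamma=0.5772)
module_without_attr = types.SimpleNamespace()

from_attr = resolve_euler_gamma(module_with_attr)         # attribute wins
from_fallback = resolve_euler_gamma(module_without_attr)  # literal is used

print(from_attr, from_fallback)
```

The design keeps a single import site (`sklearn.utils.fixes`) for the constant, so callers such as `iforest.py` do not need to know which source supplied it.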
