Unbiased MDI-like feature importance measure for random forests by GaetandeCast · Pull Request #31279 · scikit-learn/scikit-learn · GitHub

Unbiased MDI-like feature importance measure for random forests #31279


Draft · wants to merge 64 commits into base: main
Commits (64)
c0e22ea
First working implementation of UFI, does not support multi output, h…
GaetandeCast Apr 14, 2025
b1e9df8
Removed the normalization inherited from the old MDI to avoid instabi…
GaetandeCast Apr 15, 2025
2a694b6
added multi output support
GaetandeCast Apr 15, 2025
fd0abfb
removed redundant cross_impurity computations
GaetandeCast Apr 15, 2025
ef9f48d
added mdi_oob
GaetandeCast Apr 16, 2025
a225a42
redesigned ufi for better memory management
GaetandeCast Apr 17, 2025
83f3880
removed a debug import
GaetandeCast Apr 17, 2025
27618db
added mdi_oob, cleaned the code
GaetandeCast Apr 18, 2025
5ad9636
better unified the code between ufi and mdi_oob
GaetandeCast Apr 18, 2025
21d2e04
fixed a call oversight
GaetandeCast Apr 18, 2025
8194d6e
fixed an error in mdi_oob computations
GaetandeCast Apr 18, 2025
9e16a09
changed tests on feature_importances_ to use unbiased FI too
GaetandeCast Apr 22, 2025
8991d79
add tests to check that the added methods coincide with the papers an…
GaetandeCast Apr 23, 2025
a9d2983
added support for regression (only MSE split)
GaetandeCast Apr 24, 2025
710d42c
added warning for unbiased feature importance in classification witho…
GaetandeCast Apr 24, 2025
ddedf27
merged test_non_OOB_unbiased_feature_importances_class & _reg
GaetandeCast Apr 24, 2025
1de98fc
Fixed a few mistake so that ufi-regression matches feature_importance…
GaetandeCast Apr 25, 2025
c7c5d76
Extended the tests on matching the paper values to regression
GaetandeCast Apr 25, 2025
a44084d
re added tests on oob_score for dense X. They fail
GaetandeCast Apr 25, 2025
082206c
revert a small change to a test
GaetandeCast Apr 28, 2025
b028cb9
raise an error when calling unbiased feature importance with criterio…
GaetandeCast Apr 28, 2025
dcb3106
adapted the tests to the previous commit
GaetandeCast Apr 29, 2025
c61c8dc
Added log_loss ufi
GaetandeCast Apr 29, 2025
d198f20
fixed the oob_score_ issue, simplified the self.value accesses
GaetandeCast Apr 29, 2025
f2acf5f
updated api and tests for ufi with 'log_loss'
GaetandeCast Apr 30, 2025
f41cf3f
divide by 2 ufi 'log_loss' and improve tests
GaetandeCast Apr 30, 2025
af785d6
fix some linting
GaetandeCast Apr 30, 2025
ccd4f18
fixed Cython linting
GaetandeCast Apr 30, 2025
ac36aaa
added inline function for clarity and comments on available criteria
GaetandeCast Apr 30, 2025
fda8349
Merge branch 'main' into main
ogrisel Apr 30, 2025
5f1beed
Merge branch 'main' into unbiased-feature-importance
GaetandeCast May 7, 2025
f10721e
add sample weight support
GaetandeCast May 7, 2025
6966147
add test reg mse 1hot == classi gini
GaetandeCast May 9, 2025
50d47a8
fix bug in previous commit and simplify test
GaetandeCast May 9, 2025
ce52159
add support for methods in gradient boosting, when
GaetandeCast May 12, 2025
2b9099b
support and test degenerate case (ensemble of single node trees)
GaetandeCast May 12, 2025
d54cf0f
improve sample weight support and test
GaetandeCast May 12, 2025
8a09b39
move gradient boosting changes to gradient-boosting branch
GaetandeCast May 14, 2025
9fbebb5
finish removing gradient-boosting changes
GaetandeCast May 14, 2025
6fcf61c
move previous sample weight test to tree level
GaetandeCast May 14, 2025
241de66
add sample weight tests
GaetandeCast May 14, 2025
4c40c43
add convergence test between biased and unbiased fi
GaetandeCast May 14, 2025
8329b3b
add support for scipy sparse matrices
GaetandeCast May 15, 2025
0b48af4
update importances test to test with sparse data
GaetandeCast May 15, 2025
229cc4d
Update doc example on permutation and MDI importance
GaetandeCast May 16, 2025
6e4703b
remove unused code
GaetandeCast May 21, 2025
e8b06dc
improve test_importances
GaetandeCast May 21, 2025
452019d
remove unnecessary validation on X
GaetandeCast May 22, 2025
6fd9304
make functions private to avoid docstring test fail
GaetandeCast May 22, 2025
bbbfb38
add TODO on the public aspect of the method
GaetandeCast May 22, 2025
77bc017
add sparse conversion to csr that was done by the removed validate_X_…
GaetandeCast May 22, 2025
a77fd2e
changed tree oob sample weight test to be more simple and understandable
GaetandeCast May 22, 2025
dd59f9c
Update sklearn/tree/tests/test_tree.py
GaetandeCast May 22, 2025
8ba000e
add global_random_seed to match_paper tests
GaetandeCast May 23, 2025
9a01dbe
add skip if joblib version <1.2, remove sample weight test that was m…
GaetandeCast May 23, 2025
0c5cab5
fix regex match
GaetandeCast May 23, 2025
9e78f6d
drop the return_as parameter, remove joblib version skip in tests
GaetandeCast May 23, 2025
77676e7
add non normalized feature importance in private attribute
GaetandeCast May 27, 2025
474839f
add changelog entries for tree and ensemble
GaetandeCast May 30, 2025
ce9ebe0
fix coverage warnings
GaetandeCast Jun 2, 2025
be7c99d
divide importances by weighted_n_sample to avoid large unnormalized v…
GaetandeCast Jun 4, 2025
41729c6
Merge branch 'main' into unbiased-feature-importance
GaetandeCast Jun 6, 2025
88c30eb
remove mdi_oob, remove normalization for ufi
GaetandeCast Jun 6, 2025
eb3316b
made ufi only available with gini, mse and friedman_mse
GaetandeCast Jun 10, 2025
@@ -0,0 +1,7 @@
- Forest estimators such as :class:`ensemble.RandomForestClassifier` and
  :class:`ensemble.ExtraTreesRegressor` now have a new attribute for unbiased
  impurity feature importance: `unbiased_feature_importances_`.
  This attribute leverages out-of-bag samples to correct the known bias of MDI
  importance. It is automatically computed during training when
  `oob_score=True`.
  By :user:`Gaétan de Castellane <GaetandeCast>`.
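For illustration only, a minimal usage sketch of the new attribute described above (the toy dataset and hyperparameters below are placeholder choices, not part of this changelog entry):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data; any classification task works for this illustration.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

# oob_score=True is required: the out-of-bag samples collected during bagging
# are what the oob-corrected importance computation relies on.
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)

print(clf.feature_importances_)           # classic (biased) MDI importances
print(clf.unbiased_feature_importances_)  # new oob-corrected importances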
@@ -0,0 +1,7 @@
- :class:`tree.Tree` now has a method that allows passing test samples
  to compute a test score and feature importance measure.
  The private method `_compute_unbiased_feature_importance_and_oob_predictions`
  is used by forest estimators to provide an unbiased feature importance by
  using oob samples, but could be made public to let users pass their own
  test data.
  By :user:`Gaétan de Castellane <GaetandeCast>`.
54 changes: 46 additions & 8 deletions examples/inspection/plot_permutation_importance.py
@@ -16,12 +16,17 @@
variable, as long as the model has the capacity to use them to overfit.

This example shows how to use Permutation Importances as an alternative that
can mitigate those limitations.
can mitigate those limitations. It also introduces a new method from the recent
literature on random forests that removes the aforementioned biases from MDI
while keeping its computational efficiency.

.. rubric:: References

* :doi:`L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32,
2001. <10.1023/A:1010933404324>`
2001. <10.1023/A:1010933404324>
* :doi:`Li, X., Wang, Y., Basu, S., Kumbier, K., & Yu, B., "A debiased MDI
feature importance measure for random forests". Proceedings of the 33rd Conference on
Neural Information Processing Systems (NeurIPS 2019). <10.48550/arXiv.1906.10845>`

"""

@@ -87,7 +92,7 @@
rf = Pipeline(
    [
        ("preprocess", preprocessing),
        ("classifier", RandomForestClassifier(random_state=42)),
        ("classifier", RandomForestClassifier(random_state=42, oob_score=True)),
    ]
)
rf.fit(X_train, y_train)
@@ -98,9 +103,16 @@
# Before inspecting the feature importances, it is important to check that
# the model predictive performance is high enough. Indeed, there would be little
# interest in inspecting the important features of a non-predictive model.
#
# By default, random forests train each tree on a bootstrap sample of the dataset, a
# procedure known as bagging, leaving aside the remaining "out-of-bag" (oob) samples.
# When the parameter `oob_score=True` is set, these samples are used to compute an
# accuracy score independently of the training samples.
# This score should be close to the test score.

print(f"RF train accuracy: {rf.score(X_train, y_train):.3f}")
print(f"RF test accuracy: {rf.score(X_test, y_test):.3f}")
print(f"RF out-of-bag accuracy: {rf[-1].oob_score_:.3f}")

# %%
# Here, one can observe that the train accuracy is very high (the forest model
@@ -140,17 +152,24 @@
#
# The fact that we use training set statistics explains why both the
# `random_num` and `random_cat` features have a non-null importance.
#
# The attribute `unbiased_feature_importances_`, available as soon as `oob_score` is set
# to `True`, uses the out-of-bag samples of each tree to correct these biases.
# It successfully detects the uninformative features by assigning them a near-zero
# (here slightly negative) importance value.
import pandas as pd

feature_names = rf[:-1].get_feature_names_out()

mdi_importances = pd.Series(
    rf[-1].feature_importances_, index=feature_names
).sort_values(ascending=True)
mdi_importances = pd.DataFrame(index=feature_names)
mdi_importances.loc[:, "unbiased mdi"] = rf[-1].unbiased_feature_importances_
mdi_importances.loc[:, "mdi"] = rf[-1].feature_importances_
mdi_importances = mdi_importances.sort_values(ascending=True, by="mdi")

# %%
ax = mdi_importances.plot.barh()
ax.set_title("Random Forest Feature Importances (MDI)")
ax.axvline(x=0, color="k", linestyle="--")
ax.figure.tight_layout()

# %%
@@ -232,15 +251,34 @@
)

# %%
import matplotlib.pyplot as plt

for name, importances in zip(["train", "test"], [train_importances, test_importances]):
    ax = importances.plot.box(vert=False, whis=10)
    ax.set_title(f"Permutation Importances ({name} set)")
    ax.set_xlabel("Decrease in accuracy score")
    ax.axvline(x=0, color="k", linestyle="--")
    ax.figure.tight_layout()

plt.figure()
umdi_importances = pd.Series(
    rf[-1].unbiased_feature_importances_[sorted_importances_idx],
    index=feature_names[sorted_importances_idx],
)
ax = umdi_importances.plot.barh()
ax.set_title("Debiased MDI")
ax.axvline(x=0, color="k", linestyle="--")
ax.figure.tight_layout()
# %%
# Now, we can observe that on both sets, the `random_num` and `random_cat`
# features have a lower importance compared to the overfitting random forest.
# However, the conclusions regarding the importance of the other features are
# features have a lower permutation importance compared to the overfitting random
# forest. However, the conclusions regarding the importance of the other features are
# still valid.
#
# These accurate permutation importances match the results obtained with oob-based
# impurity methods on this new random forest.
#
# Do note that permutation importances are costly, as they require re-scoring the
# model on a permuted copy of the data for every repetition of each feature. When
# working on large datasets with random forests, it may be preferable to use debiased
# impurity-based feature importance measures.
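As a rough sketch of the cost argument above (the dataset size, `n_repeats`, and other settings below are illustrative assumptions, not measurements from this PR): permutation importance repeatedly re-scores the fitted model, while the oob-corrected importances are a by-product of `fit`.

from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fitting with oob_score=True computes the debiased importances as part of training.
rf = RandomForestClassifier(oob_score=True, random_state=0).fit(X_train, y_train)
print(rf.unbiased_feature_importances_)

# Permutation importance re-scores the model n_repeats times per feature.
tic = perf_counter()
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(f"permutation importance took {perf_counter() - tic:.2f}s")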