ENH Add get_feature_names_out for RandomTreesEmbedding module #21762

MaxwellLZH · 2021-11-23T15:21:38Z

Reference Issues/PRs

Part of issue #21308. This is the same PR as #21459 (I ran into some git issues with the last PR).

What does this implement/fix? Explain your changes.

Implementing get_feature_names_out for RandomTreesEmbedding, VotingClassifier, VotingRegressor, StackingClassifier and StackingRegressor, with corresponding test cases.

thomasjpfan

Thank you for the PR @MaxwellLZH! I recommend breaking this PR into 3 PRs:

Keep this one for RandomTreesEmbedding.
Another for Voting*
Another for Stacking* (Do not start this one until Voting* is complete since they are related and will have the same discussions)

This is because I think there is an argument for generating more descriptive names for each of the cases above.

sklearn/ensemble/tests/test_forest.py

sklearn/ensemble/_forest.py

thomasjpfan · 2021-11-26T19:34:52Z

sklearn/ensemble/tests/test_forest.py

+    assert_array_equal(
+        [f"randomtreesembedding{i}" for i in range(hasher._n_features_out)], names
+    )


I think it is better to explicitly test the public API. We can transform and get the number of features out:

    n_features_out = hasher.transform(X).shape[1]
    assert_array_equal(
        [f"randomtreesembedding{i}" for i in range(n_features_out)], names
    )

There is an argument for using something like randomtreesembedding_3_10, where 3 represents the tree that generated the leaf, and 10 is the leaf index.

Shall I leave the naming as it is for now? If we decide to go for randomtreesembedding_{i}_{j}, we can change the test cases accordingly later.

We need to make a decision in this PR. I like using randomtreesembedding_{i}_{j}; can we update this PR to use this format and see what other reviewers think?

This means a custom get_feature_names_out for tree embedding.

I've added a custom get_feature_names_out for tree embedding, where i is the tree index (starting from 1) and j is the leaf index, as suggested.
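For readers following the thread, here is a minimal sketch of the randomtreesembedding_{i}_{j} naming scheme under discussion. It is not the code from this PR: the helper `embedding_feature_names`, its parameters, and the leaf indices are hypothetical, chosen only to show how one name per (tree, leaf) pair could be generated. The tree index base is left as a parameter because a later commit in this PR ("keep tree index starts with 0") moved away from 1-based indexing.

```python
def embedding_feature_names(
    leaf_ids_per_tree, prefix="randomtreesembedding", first_tree_index=0
):
    """Return one name per (tree, leaf) pair as ``{prefix}_{tree}_{leaf}``.

    Hypothetical helper for illustration; ``leaf_ids_per_tree`` holds, for
    each tree, the node indices of its leaves.
    """
    names = []
    for tree_idx, leaf_ids in enumerate(leaf_ids_per_tree, start=first_tree_index):
        # Sort so the names follow a stable column order within each tree.
        for leaf_id in sorted(leaf_ids):
            names.append(f"{prefix}_{tree_idx}_{leaf_id}")
    return names


# Made-up leaf node indices for two small trees. Leaf indices need not be
# contiguous, because internal (split) nodes also consume node indices.
names = embedding_feature_names([[2, 3, 4], [1, 3, 4]])
print(names)
```

Calling the helper with `first_tree_index=1` would reproduce the 1-based variant described in the comment above.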

sklearn/ensemble/tests/test_voting.py

sklearn/ensemble/_forest.py

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

thomasjpfan

Please add an entry to the change log at doc/whats_new/v1.1.rst with tag |Enhancement|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

sklearn/ensemble/_forest.py

sklearn/ensemble/tests/test_forest.py

sklearn/ensemble/_forest.py

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

thomasjpfan

Thanks for the update! I am happy with the naming for the features here. Let's see what other reviewers think.

ogrisel

LGTM once the suggestions below are taken into account. I found the feature names surprising, so I think it is necessary to make the docstring more explicit and add an inline comment in the test.

sklearn/ensemble/_forest.py

sklearn/ensemble/tests/test_forest.py

doc/whats_new/v1.1.rst

thomasjpfan · 2022-03-07T16:00:18Z

I found the feature names surprising, so I think it is necessary to make the docstring more explicit and add an inline comment in the test.

Do you think it is better to ignore the internal indices from the trees and use 0, 1, 2, etc for the leaf indices?

I am okay with that option as well.

ogrisel · 2022-03-07T16:06:40Z

Do you think it is better to ignore the internal indices from the trees and use 0, 1, 2, etc for the leaf indices?

The current implementation is more informative but potentially a bit confusing. Using contiguous, leaf-only indexing would make the feature names less dependent on the internal tree data structure. In a way, though, this internal detail is already part of the public API, because those are the indices returned by the public apply method.

So +0 for keeping the current indexing / naming scheme.

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

jeremiedbb · 2022-03-08T17:40:33Z

Do you think it is better to ignore the internal indices from the trees and use 0, 1, 2, etc for the leaf indices?

I don't have a strong preference. I'm fine with reflecting the tree structure with the additional comments from Olivier. It might make it easier to debug if needed as well.

jeremiedbb

LGTM. Thanks @MaxwellLZH

…-learn#21762) Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

MaxwellLZH added 2 commits November 23, 2021 23:00

add feature_names_out for ensemble module

8931db7

bug fix

7dca042

github-actions bot added the module:ensemble label Nov 23, 2021

thomasjpfan reviewed Nov 26, 2021

thomasjpfan mentioned this pull request Jan 3, 2022

Implement get_feature_names_out for all estimators #21308

Closed

14 tasks

MaxwellLZH and others added 8 commits February 10, 2022 11:14

move Mixins to the left

da0b311

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

fix typo

9b4379f

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

change fit logic and test cases for RandomTreesEmbedding

1203f48

black formatting

b1aac98

bug fix

862d2ff

revert changes in voting and stacking

f05ec1f

use randomtreesembedding_{i}_{j} for feature_names_out

11b5be7

fix failed docstring test

ebd255d

thomasjpfan reviewed Feb 11, 2022

sklearn/ensemble/_forest.py (outdated, resolved)

MaxwellLZH and others added 9 commits February 14, 2022 11:26

validate input_features

2b83260

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

keep tree index starts with 0

302a178

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

better variable naming

be6d5ba

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

remove _ClassNamePrefixFeaturesOutMixin and include TreansformerMixin

1d9033a

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

return ndarray of object dtype

41c55ad

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

update documentation

4c5b329

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

fix import error

05bcf72

Update test case & apply black formatting

2a6108e

Add entry in whatsnew

6726028

thomasjpfan approved these changes Feb 15, 2022

thomasjpfan changed the title ~~ENH Add get_feature_names_out for ensemble module~~ ENH Add get_feature_names_out for RandomTreesEmbedding module Feb 15, 2022

ogrisel approved these changes Mar 7, 2022

sklearn/ensemble/_forest.py (outdated, resolved)

sklearn/ensemble/tests/test_forest.py (resolved)

doc/whats_new/v1.1.rst (outdated, resolved)

Update sklearn/ensemble/tests/test_forest.py

d8df622

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

jeremiedbb and others added 3 commits March 8, 2022 18:32

Update sklearn/ensemble/_forest.py

26d54d4

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Update doc/whats_new/v1.1.rst

9dfe863

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Merge branch 'main' into fet/ensemble-feature-names-out

a373726

jeremiedbb approved these changes Mar 8, 2022

jeremiedbb merged commit 26f5b26 into scikit-learn:main Mar 8, 2022
