ENH Add get_feature_names_out for RandomTreesEmbedding module by MaxwellLZH · Pull Request #21762 · scikit-learn/scikit-learn · GitHub

ENH Add get_feature_names_out for RandomTreesEmbedding module #21762

Merged

MaxwellLZH
Contributor

Reference Issues/PRs

Part of issue #21308. This is the same PR as #21459 (I ran into some git issues with the last PR).

What does this implement/fix? Explain your changes.

Implements get_feature_names_out for RandomTreesEmbedding, VotingClassifier, VotingRegressor, StackingClassifier and StackingRegressor, with corresponding test cases.

Member
@thomasjpfan thomasjpfan left a comment

Thank you for the PR @MaxwellLZH ! I recommend breaking this PR into 3 PRs:

  1. Keep this one for RandomTreesEmbedding.
  2. Another for Voting*
  3. Another for Stacking* (Do not start this one until Voting* is complete since they are related and will have the same discussions)

This is because I think there is an argument for generating more descriptive names for each of the cases above.

Comment on lines 1807 to 1809:

    assert_array_equal(
        [f"randomtreesembedding{i}" for i in range(hasher._n_features_out)], names
    )

Member

I think it is better to explicitly test the public API. We can transform and get the number of features out:

    n_features_out = hasher.transform(X).shape[1]
    assert_array_equal(
        [f"randomtreesembedding{i}" for i in range(n_features_out)], names
    )

There is an argument for using something like randomtreesembedding_3_10, where 3 represents the tree used to generate the leaf, and 10 is the leaf index.
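
As a rough illustration of the naming idea above, the sketch below builds one name per (tree, leaf) pair. The helper `make_embedding_feature_names` and its input are hypothetical, not part of scikit-learn, and the sketch assumes trees are counted from 0 and that the leaf indices of each tree are already known.

```python
from typing import List, Sequence


def make_embedding_feature_names(
    leaves_per_tree: Sequence[Sequence[int]],
    prefix: str = "randomtreesembedding",
) -> List[str]:
    """Build one name per (tree, leaf) pair, in tree-major order.

    `leaves_per_tree[i]` is assumed to hold the leaf indices of tree `i`
    (hypothetical input; a fitted estimator would supply these itself).
    """
    return [
        f"{prefix}_{tree}_{leaf}"
        for tree, leaves in enumerate(leaves_per_tree)  # tree counter starts at 0
        for leaf in leaves
    ]


# Two toy trees: the first has leaves 2, 3, 4; the second has leaves 1, 2.
names = make_embedding_feature_names([[2, 3, 4], [1, 2]])
print(names)
```

With names of this form, each output column can be traced back to a specific tree and leaf, which a flat counter such as randomtreesembedding7 does not allow.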

Contributor Author

Shall I leave the naming as it is for now? If we decide to go for randomtreesembedding_{i}_{j}, we can change the test cases accordingly later.

Member

We need to make a decision in this PR. I like using randomtreesembedding_{i}_{j}. Can we update this PR to use this format and see what other reviewers think?

This means a custom get_feature_names_out for tree embedding.

Contributor Author

I've added a custom get_feature_names_out for tree embedding, where i is the tree index starting from 1 and j is the leaf index, as suggested.

Member
@thomasjpfan thomasjpfan left a comment

Please add an entry to the change log at doc/whats_new/v1.1.rst with tag |Enhancement|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

MaxwellLZH and others added 9 commits February 14, 2022 11:26
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Member
@thomasjpfan thomasjpfan left a comment

Thanks for the update! I am happy with the naming for the features here. Let's see what other reviewers think.

@thomasjpfan thomasjpfan changed the title from "ENH Add get_feature_names_out for ensemble module" to "ENH Add get_feature_names_out for RandomTreesEmbedding module" on Feb 15, 2022
Member
@ogrisel ogrisel left a comment

LGTM once the suggestions below are taken into account. I found the feature names surprising, so I think it is necessary to make the docstring more explicit and add an inline comment in the test.

@thomasjpfan
Member

I found the feature names surprising, so I think it is necessary to make the docstring more explicit and add an inline comment in the test.

Do you think it is better to ignore the internal indices from the trees and use 0, 1, 2, etc for the leaf indices?

I am okay with that option as well.

@ogrisel
Member
ogrisel commented Mar 7, 2022

Do you think it is better to ignore the internal indices from the trees and use 0, 1, 2, etc for the leaf indices?

The current implementation is more informative but potentially a bit confusing. Using contiguous, leaf-only indexing would make the feature names less dependent on the internal tree data structure, but in a way this internal detail is already part of the public API, because those are the indices returned by the public apply method.

So +0 for keeping the current indexing / naming scheme.
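
To make the two indexing options in this exchange concrete, here is a small sketch of how node-based leaf ids can differ from a contiguous 0, 1, 2 numbering. The toy arrays and the helper `leaf_node_ids` are hypothetical and only loosely modeled on a tree that stores child pointers per node, with -1 assumed to mean "no child". It is not a copy of scikit-learn internals.

```python
from typing import List

TREE_LEAF = -1  # assumed sentinel for "no child" in this toy layout


def leaf_node_ids(children_left: List[int], children_right: List[int]) -> List[int]:
    """Return ids of nodes with no children, in increasing node-id order."""
    return [
        node
        for node, (left, right) in enumerate(zip(children_left, children_right))
        if left == TREE_LEAF and right == TREE_LEAF
    ]


# Toy tree with 5 nodes, numbered depth-first:
#   node 0 (root) -> node 1 (leaf), node 2
#   node 2        -> node 3 (leaf), node 4 (leaf)
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]

node_ids = leaf_node_ids(children_left, children_right)
# Contiguous alternative: rank each leaf among the leaves only.
contiguous = {node: rank for rank, node in enumerate(node_ids)}

print(node_ids)    # [1, 3, 4]
print(contiguous)  # {1: 0, 3: 1, 4: 2}
```

Names built from the node ids (1, 3, 4) skip numbers because internal nodes are counted too, which is likely what makes them look surprising at first. Names built from the contiguous ranks hide that detail but no longer match node-based indices.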

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
jeremiedbb and others added 3 commits March 8, 2022 18:32
@jeremiedbb
Member

Do you think it is better to ignore the internal indices from the trees and use 0, 1, 2, etc for the leaf indices?

I don't have a strong preference. I'm fine with reflecting the tree structure with the additional comments from Olivier. It might make it easier to debug if needed as well.

Member
@jeremiedbb jeremiedbb left a comment

LGTM. Thanks @MaxwellLZH

@jeremiedbb jeremiedbb merged commit 26f5b26 into scikit-learn:main Mar 8, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022
…-learn#21762)

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>