-
-
Notifications
You must be signed in to change notification settings - Fork 26k
ENH Add get_feature_names_out for RandomTreesEmbedding module #21762
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Add get_feature_names_out for RandomTreesEmbedding module #21762
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @MaxwellLZH ! I recommend breaking this PR into 3 PRs:
- Keep this one for
RandomTreesEmbedding
. - Another for
Voting*
- Another for
Stacking*
(Do not start this one untilVoting*
is complete since they are related and will have the same discussions)
This is because I think there is an argument for generating more descriptive names for each of the cases above.
assert_array_equal( | ||
[f"randomtreesembedding{i}" for i in range(hasher._n_features_out)], names | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to explicitly test the public API. We can transform and get the number of features out:
n_features_out = hasher.transform(X).shape[1]
assert_array_equal(
[f"randomtreesembedding{i}" for i in range(n_features_out)], names
)
There is an argument for using something like randomtreesembedding_3_10
, where 3
is represents the tree that used to generate the leaf, and 10 is the leaf index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall I leave the naming as it is for now? if we decided to go for randomtreesembedding_{i}_{j}
then we can change the test cases accordingly later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to make a decision in this PR. I like using randomtreesembedding_{i}_{j}
, can we update this PR to use this formatting and see what other reviewers think?
This means a custom get_feature_names_out
for tree embedding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a custom get_feature_names_out
for tree embedding, where i
is tree index starting from 1 and j
is leaf index as suggested.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add an entry to the change log at doc/whats_new/v1.1.rst
with tag |Enhacement|. Like the other entries there, please reference this pull request with :pr:
and credit yourself (and other contributors if applicable) with :user:
.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update! I am happy with the naming for the features here. Let's see what other reviewers think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM once the suggestions below as taken into account. I found the feature names surprising so I think it's was necessary to make the docstring more explicit and add an inline comment in the test.
Do you think it is better to ignore the internal indices from the trees and use 0, 1, 2, etc for the leaf indices? I am okay with that option as well. |
The current implementation is more informative but potentially a bit confusing. Using contiguous, leaf-only indexing would make the feature names less dependent on the internal tree data-structure but in a way this internal detail is already part of the public API because those are the indices returned by the So +0 for keeping the current indexing / naming scheme. |
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
I don't have a strong preference. I'm fine with reflecting the tree structure with the additional comments from Olivier. It might make it easier to debug if needed as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @MaxwellLZH
…-learn#21762) Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
Part of issue #21308 . This is the same PR as #21459 (I got into some git issue with the last PR )
What does this implement/fix? Explain your changes.
Implementing
get_feature_names_out
forRandomForestEmbedding
,VotingClassifier
,VotingRegressor
,StackingClassifier
andStackingRegressor
, with corresponding test cases.