10000 use get_feature_names from last step in pipeline if it is available by andaag · Pull Request #5172 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

use get_feature_names from last step in pipeline if it is available #5172

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

andaag
Copy link
Contributor
@andaag andaag commented Aug 27, 2015

In the case:
make_pipeline(a,b,CountVectorizer()) you can now call get_feature_names() and get the result from the last step in the pipeline's get_feature_names function.

@jnothman
Copy link
Member

Related to #2007. Yes, I think I'd support this as an initial solution.

@@ -131,6 +131,22 @@ def named_steps(self):
def _final_estimator(self):
return self.steps[-1][1]

def get_feature_names(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should use if_delegate_has_method as below instead of the AttributeError here.

@jnothman
Copy link
Member

please add a test.

@GaelVaroquaux
Copy link
Member

I really cringe at this: there is not bound to how many methods we will add the the pipeline to solve all related usecases. At the end of the day, this is 4 lines of code to implement this without a method.

I am -1 for merge.

@amueller
Copy link
Member

I'm not sure this is the right solution, but I'm for adding get_feature_names to pipelines.
I was thinking about it more recently, and more consistent feature names support is currently the number 1 thing users ask for.

@amueller
Copy link
Member

@GaelVaroquaux the questions is not necessarily how many lines of code something is, but how much of the internals of scikit-learn someone needs to know to do something that should be obvious.
If this is inside a grid-search, it gets even worse.
I just wrote grid.best_estimator_.named_steps['CountVectorizer'].get_feature_names().
yeah that is just one line. But a line that is really hard to find out.

@amueller
Copy link
Member

I feel a much more common usecase is that it is not the last step of the pipeline that has the get_feature_names. I'm not entirely sure how to best handle that :-/
Maybe we should try to propagate feature names in the pipeline when possible, and just discard them if we find a pca or other decomposition?

@GaelVaroquaux
Copy link
Member
GaelVaroquaux commented Aug 27, 2015 via email

@jnothman
Copy link
Member

@amueller see the discussion at #2007. I think we could change it to transform_feature_names for broader application.

I see what you mean, @GaelVaroquaux, but get_feature_names atm is about feature extractors. In the context of a FeatureUnion made up of feature extraction Pipelines (which is a common construct for heterogeneous data pending more specific solutions to that problem), I think this would be very convenient.

@amueller
Copy link
Member

@GaelVaroquaux I'm pretty sure we can solve the problem in the interesting cases without being fragile. We can totally support it in gridsearch and we don't have to support it in PCA (or can return first principle component etc).

I agree with most points made in #2007.
I feel that in terms of user interface, if I do make_pipeline(DictVectorizer(), StandardScaler(), SelectKBest()) or even make_pipeline(DictVectorizer(), StandardScaler(), SelectKBest(), LinearSVC()) then get_feature_names() on the pipeline is unambiguous and I shouldn't need to provide any features to transform.

To provide this capability, we probably need to implement some transform_feature_names function, but I feel that should be private.

[The LinearSVC() one is kind of tricky as that currently is a feature selector for better or worse, so it is not unambiguous if we want it to apply the feature selection. But that problem is handled in #4242]

@jnothman
Copy link
Member

Private? Why shouldn't I be able to write my own transformer that selects
or combines features?

On 28 August 2015 at 13:38, Andreas Mueller notifications@github.com
wrote:

@GaelVaroquaux https://github.com/GaelVaroquaux I'm pretty sure we can
solve the problem in the interesting cases without being fragile. We can
totally support it in gridsearch and we don't have to support it in PCA (or
can return first principle component etc).

I agree with most points made in #2007
#2007.
I feel that in terms of user interface, if I do make_pipeline(DictVectorizer(),
StandardScaler(), SelectKBest()) or even make_pipeline(DictVectorizer(),
StandardScaler(), SelectKBest(), LinearSVC()) then get_feature_names() on
the pipeline is unambiguous and I shouldn't need to provide any features to
transform.

To provide this capability, we probably need to implement some
transform_feature_names function, but I feel that should be private.


Reply to this email directly or view it on GitHub
#5172 (comment)
.

@andaag
Copy link
Contributor Author
andaag commented Aug 28, 2015

Is piping down features from the layer above as a default that bad?

In the case of make_pipeline(DictVectorizer(), StandardScaler(), SelectKBest()) this will work perfectly fine. In the case of using PCA for example get_feature_names could return pca_1, pca_2 etc. It would then be fairly obvious where the features came from.

I'm gonna hold off on updating the PR with tests and such until the discussion on how this should be done is getting closer to a conclusion :)

@amueller
Copy link
Member

@jnothman well you can but I wouldn't want to clutter the public API with something that is quite specific for pipelines.

@jnothman
Copy link
Member

IMO the need to pipeline affects a lot of the design of our API.

On 1 September 2015 at 01:31, Andreas Mueller notifications@github.com
wrote:

@jnothman https://github.com/jnothman well you can but I wouldn't want
to clutter the public API with something that is quite specific for
pipelines.


Reply to this email directly or view it on GitHub
#5172 (comment)
.

@amueller
Copy link
Member
amueller commented Sep 1, 2015

true. my current feeling is that adding another public function to the API is not really warranted. I see your point, though.

@jnothman jnothman mentioned this pull request Oct 24, 2015
@amueller amueller added the Superseded PR has been replace by a newer PR label Aug 5, 2019
Base automatically changed from master to main January 22, 2021 10:48
@thomasjpfan
Copy link
Member

On main, all transformers now have get_feature_names_out. Pipeline also defines a get_feature_names_out which passes the feature names through the steps. With that in mind, I am closing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:pipeline Superseded PR has been replace by a newer PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0