use get_feature_names from last step in pipeline if it is available #5172
Conversation
Related to #2007. Yes, I think I'd support this as an initial solution.
@@ -131,6 +131,22 @@ def named_steps(self):
    def _final_estimator(self):
        return self.steps[-1][1]

    def get_feature_names(self):
I guess we should use if_delegate_has_method as below instead of the AttributeError here. Please add a test.
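The reviewer's suggestion is that the method should only be visible when the final step can actually answer it. The real decorator lives in sklearn.utils.metaestimators (later superseded by available_if), so the following is a self-contained sketch of the delegation idea it implements; MiniPipeline and Vec are hypothetical stand-ins, not scikit-learn classes.

```python
from functools import partial

class _DelegateDescriptor:
    """Attribute that only 'exists' on the metaestimator when the
    delegate sub-estimator actually provides a method of this name."""
    def __init__(self, fn, delegate):
        self.fn = fn
        self.delegate = delegate

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        # Raises AttributeError (so hasattr() is False) when the
        # delegate has no method of this name.
        getattr(getattr(obj, self.delegate), self.fn.__name__)
        return partial(self.fn, obj)

def if_delegate_has_method(delegate):
    """Sketch of the decorator the reviewer refers to."""
    return lambda fn: _DelegateDescriptor(fn, delegate)

class MiniPipeline:
    """Toy stand-in for sklearn.pipeline.Pipeline."""
    def __init__(self, steps):
        self.steps = steps

    @property
    def _final_estimator(self):
        return self.steps[-1][1]

    @if_delegate_has_method(delegate="_final_estimator")
    def get_feature_names(self):
        # Delegate to the last step, as the PR proposes.
        return self._final_estimator.get_feature_names()

class Vec:
    """Stand-in for a vectorizer with feature names."""
    def get_feature_names(self):
        return ["token_a", "token_b"]

print(MiniPipeline([("vec", Vec())]).get_feature_names())
# -> ['token_a', 'token_b']
print(hasattr(MiniPipeline([("clf", object())]), "get_feature_names"))
# -> False: the method "disappears" when the last step lacks it
```

The payoff of the descriptor trick is that hasattr() on the pipeline gives the honest answer, instead of the method existing but raising AttributeError when called.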
I really cringe at this: there is no bound on how many methods we will add to the pipeline to solve all related use cases. At the end of the day, this is 4 lines of code to implement this without a method. I am -1 for merge.
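For context, the "4 lines of code without a method" alternative would be to do the delegation at the call site. A minimal sketch, using hypothetical stand-in classes (feature_names_of, FakeVectorizer and FakePipe are illustrative names, not scikit-learn API):

```python
def feature_names_of(pipe):
    """Ask the last step of a fitted pipeline for its feature names."""
    final = pipe.steps[-1][1]  # steps is a list of (name, estimator) pairs
    if not hasattr(final, "get_feature_names"):
        raise AttributeError("final step exposes no feature names")
    return final.get_feature_names()

class FakeVectorizer:
    """Stand-in for e.g. CountVectorizer."""
    def get_feature_names(self):
        return ["word_a", "word_b"]

class FakePipe:
    """Stand-in for a fitted Pipeline ending in a vectorizer."""
    steps = [("scale", object()), ("vec", FakeVectorizer())]

print(feature_names_of(FakePipe()))  # -> ['word_a', 'word_b']
```

The counter-argument in the thread is that this requires knowing that pipe.steps[-1][1] is the final estimator, i.e. knowledge of Pipeline internals.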
I'm not sure this is the right solution, but I'm for adding
@GaelVaroquaux the question is not necessarily how many lines of code something is, but how much of the internals of scikit-learn someone needs to know to do something that should be obvious.
I feel a much more common use case is that it is not the last step of the pipeline that has the
I understand that users require that, but it's going to be fragile and break and eventually not fix the users' problems (PCA, grid search...). So our job is to not try to solve a problem that we will not be able to solve reliably.
When a problem cannot be solved reliably, it's more useful to write good documentation than a function.
@amueller see the discussion at #2007. I think we could change it to

I see what you mean, @GaelVaroquaux, but
@GaelVaroquaux I'm pretty sure we can solve the problem in the interesting cases without being fragile. We can totally support it in grid search, and we don't have to support it in PCA (or can return the first principal component etc). I agree with most points made in #2007. To provide this capability, we probably need to implement some

[The LinearSVC() one is kind of tricky, as that currently is a feature selector for better or worse, so it is ambiguous whether we want it to apply the feature selection. But that problem is handled in #4242]
Private? Why shouldn't I be able to write my own transformer that selects

On 28 August 2015 at 13:38, Andreas Mueller notifications@github.com
Is piping down features from the layer above as a default that bad? In the case of make_pipeline(DictVectorizer(), StandardScaler(), SelectKBest()) this will work perfectly fine. In the case of using PCA, for example, get_feature_names could return pca_1, pca_2 etc. It would then be fairly obvious where the features came from.

I'm gonna hold off on updating the PR with tests and such until the discussion on how this should be done is getting closer to a conclusion :)
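The "pipe names down from the layer above" idea can be sketched as follows. This assumes a hypothetical get_feature_names(input_features) signature on each step (not the API of the time), with stand-in classes for the transformers; steps that know nothing about names are assumed to preserve them, and PCA-like steps invent step_1, step_2, ... as suggested above.

```python
def propagate_feature_names(steps, input_features):
    """Thread feature names through a list of (name, estimator) pairs."""
    names = input_features
    for step_name, est in steps:
        if hasattr(est, "get_feature_names"):
            # Hypothetical signature: the step maps incoming names to
            # outgoing names (e.g. a feature selector keeps a subset).
            names = est.get_feature_names(names)
        elif hasattr(est, "n_components_"):
            # PCA-like step: original names are gone, so make up
            # pca_1, pca_2, ... from the step's own name.
            names = [f"{step_name}_{i + 1}" for i in range(est.n_components_)]
        # Otherwise (scaler etc.) assume the step preserves names.
    return names

class Selector:
    """SelectKBest-like stand-in: keeps features at these indices."""
    support_ = [0, 2]
    def get_feature_names(self, input_features):
        return [input_features[i] for i in self.support_]

class Scaler:
    """StandardScaler-like stand-in: name-preserving."""

class Reducer:
    """PCA-like stand-in."""
    n_components_ = 2

print(propagate_feature_names(
    [("scale", Scaler()), ("select", Selector())],
    ["age", "height", "weight"]))
# -> ['age', 'weight']
print(propagate_feature_names([("pca", Reducer())], ["a", "b", "c"]))
# -> ['pca_1', 'pca_2']
```

This is essentially the scheme scikit-learn later adopted for get_feature_names_out, but here it is only a sketch of the proposal under discussion.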
@jnothman well you can, but I wouldn't want to clutter the public API with something that is quite specific to pipelines.
IMO the need to pipeline affects a lot of the design of our API.

On 1 September 2015 at 01:31, Andreas Mueller notifications@github.com
True. My current feeling is that adding another public function to the API is not really warranted. I see your point, though.
In the case of make_pipeline(a, b, CountVectorizer()), you can now call get_feature_names() on the pipeline and get the result from the last step's get_feature_names function.
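The delegation behaviour described above can be illustrated with a self-contained sketch; TinyCountVectorizer and TinyPipeline are hypothetical stand-ins for CountVectorizer and Pipeline, kept minimal so the shape of the delegation is visible.

```python
class TinyCountVectorizer:
    """Stand-in for CountVectorizer: learns a sorted vocabulary."""
    def fit(self, docs):
        self.vocabulary_ = sorted({w for d in docs for w in d.split()})
        return self

    def get_feature_names(self):
        return self.vocabulary_

class TinyPipeline:
    """Stand-in for Pipeline with the behaviour this PR adds."""
    def __init__(self, *estimators):
        self.steps = [(type(e).__name__.lower(), e) for e in estimators]

    def get_feature_names(self):
        # Delegate to the last step in the pipeline.
        return self.steps[-1][1].get_feature_names()

pipe = TinyPipeline(TinyCountVectorizer().fit(["red fox", "blue fox"]))
print(pipe.get_feature_names())  # -> ['blue', 'fox', 'red']
```

In the real PR the last step would typically be fitted by the pipeline's own fit; here the vectorizer is fitted up front purely to keep the example short.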