-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG] ENH Add get_feature_names for various transformers #6431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] ENH Add get_feature_names for various transformers #6431
Conversation
@jnothman Sure! |
Hello @jnothman and @nelson-liu , |
|
||
""" | ||
if input_features is not None: | ||
input_features = list(map(lambda input_feature: 'f(' + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason function name is not used here (self.func.__name__
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't consider it very stable...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, there can be lambdas, etc, but that's still the best information available. It allow to e.g. get a more inspectable feature names if several FunctionTransformers are used in a FeatureUnion. Is this use case uncommon?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, repr is unstable, but func.__name__
is more deterministic:
>>> def foo(x): pass
>>> repr(foo)
'<function foo at 0x106683950>'
>>> foo.__name__
'foo'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've got to balance a few things here:
- a machine-readable description of a feature's provenance (ideally a structured representation)
- a human readable (in limited memory!) description of a feature as distinct from a neighbour
- backwards compatibility issues (although I am willing to label some of these things as experimental and subject to change)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good points.
a machine-readable description of a feature's provenance (ideally a structured representation)
Are you saying that string formatting is not powerful enough for these use cases, and there is a need for more structured representation (a dict, a special object)? What additional information are you thinking of?
a human readable (in limited memory!) description of a feature as distinct from a neighbour
In eli5 library instead of feature name lists we used a special object which generates feature names on a fly if they are autogenerated ('pca__x0' as opposed to 'petal width'), in order to avoid keeping it all in memory, and to save CPU time when only a subset of feature names is needed. For example, a common task is to get feature names which correspond to non-zero coefficients in a particular document, or feature names for top N features with largest coefficients. CPU was even more important than memory because there were noticeable delays in interactive usage.
Yes, I can see the appeal in that approach, but still don't have a clear I can also see that __ might be an appropriate way to designate function On 16 November 2016 at 19:01, Mikhail Korobov notifications@github.com
|
output_feature_names : list of string, length len(input_features) | ||
|
||
""" | ||
if input_features is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this logic is incorrect. func
is not applied to each feature independently, but to X as a whole. The only way to get sensible feature names out of a FunctionTransformer
is to have another parameter specify get_feature_names
, essentially, or else to count the number of columns output from transform
and make up default names on that basis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think a string template (passed to FunctionTransformer constructor?) is enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, something like FunctionTransformer(lambda X_df: X_df['col_a', 'col_b'])
should have feature names ['col_a', 'col_b']
.
output_feature_names : list of string, length len(input_features) | ||
|
||
""" | ||
return input_features |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if some of the features had only nan
in them? Shouldn't this method clip them from input_features, as Imputer
does to the column itself?
This is a PR for #6425 .
I've added
get_feature_names
for scalers, normalizers and imputers.Is the purpose of this function to maintain compatibility when
get_feature_names()
is implemented in Pipeline?Can @jnothman please have a look?