8000 [MRG] ENH Add get_feature_names for various transformers by yenchenlin · Pull Request #6431 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG] ENH Add get_feature_names for various transformers #6431

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

yenchenlin
Copy link
Contributor

This is a PR for #6425 .
I've added get_feature_names for scalers, normalizers and imputers.

Is the purpose of this function to maintain compatibility when get_feature_names() is implemented in Pipeline?

Can @jnothman please have a look?

@jnothman
Copy link
Member

Would I be able to get you to merge together all these trivial cases - #6431, #6432, #6436 - into one PR? Sorry for not suggesting that in the first place. Whatever decisions we make about API and testing we'll need to make consistently.

@yenchenlin
Copy link
Contributor Author

@jnothman Sure!

@yenchenlin yenchenlin changed the title [MRG] ENH Add get_feature_names for normalizers, scalers and imputers [MRG] ENH Add get_feature_names for various transformers Feb 24, 2016
@yenchenlin
Copy link
Contributor Author

Hello @jnothman and @nelson-liu ,
I've cherry-pick the commits from #6432 and #6436 .
I'll close #6432 once you guys confirm.


"""
if input_features is not None:
input_features = list(map(lambda input_feature: 'f(' +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason function name is not used here (self.func.__name__)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't consider it very stable...?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, there can be lambdas, etc, but that's still the best information available. It allow to e.g. get a more inspectable feature names if several FunctionTransformers are used in a FeatureUnion. Is this use case uncommon?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, repr is unstable, but func.__name__ is more deterministic:

>>> def foo(x): pass
>>> repr(foo)
'<function foo at 0x106683950>'
>>> foo.__name__
'foo'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've got to balance a few things here:

  • a machine-readable description of a feature's provenance (ideally a structured representation)
  • a human readable (in limited memory!) description of a feature as distinct from a neighbour
  • backwards compatibility issues (although I am willing to label some of these things as experimental and subject to change)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good points.

a machine-readable description of a feature's provenance (ideally a structured representation)

Are you saying that string formatting is not powerful enough for these use cases, and there is a need for more structured representation (a dict, a special object)? What additional information are you thinking of?

a human readable (in limited memory!) description of a feature as distinct from a neighbour

In eli5 library instead of feature name lists we used a special object which generates feature names on a fly if they are autogenerated ('pca__x0' as opposed to 'petal width'), in order to avoid keeping it all in memory, and to save CPU time when only a subset of feature names is needed. For example, a common task is to get feature names which correspond to non-zero coefficients in a particular document, or feature names for top N features with largest coefficients. CPU was even more important than memory because there were noticeable delays in interactive usage.

@jnothman
Copy link
Member

Yes, I can see the appeal in that approach, but still don't have a clear
resolution here.

I can also see that __ might be an appropriate way to designate function
application. Hmm.

On 16 November 2016 at 19:01, Mikhail Korobov notifications@github.com
wrote:

@kmike commented on this pull request.

In sklearn/preprocessing/_function_transformer.py
#6431:

  •    """
    
  •    Return feature names for output features
    
  •    Parameters
    

  •    input_features : list of string, length len(input_features), optional
    
  •        String names for input features if available. By default,
    
  •        None is used.
    
  •    Returns
    

  •    output_feature_names : list of string, length len(input_features)
    
  •    """
    
  •    if input_features is not None:
    
  •        input_features = list(map(lambda input_feature: 'f(' +
    

Yeah, good points.

a machine-readable description of a feature's provenance (ideally a
structured representation)

Are you saying that string formatting is not powerful enough for these use
cases, and there is a need for more structured representation (a dict, a
special object)? What additional information are you thinking of?

a human readable (in limited memory!) description of a feature as distinct
from a neighbour

In eli5 https://github.com/TeamHG-Memex/eli5 library instead of feature
name lists we used a special object which generates feature names on a fly
if they are autogenerated ('pca__x0' as opposed to 'petal width'), in order
to avoid keeping it all in memory, and to save CPU time when only a subset
of feature names is needed. For example, a common task is to get feature
names which correspond to non-zero coefficients in a particular document,
or feature names for top N features with largest coefficients. CPU was even
more important than memory because there were noticeable delays in
interactive usage.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#6431, or mute the
thread
https://github.com/notifications/unsubscribe-auth/AAEz609Q5J2Y0_N728GGBSZ5qnRwQABCks5q-rhagaJpZM4Hgp7U
.

output_feature_names : list of string, length len(input_features)

"""
if input_features is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this logic is incorrect. func is not applied to each feature independently, but to X as a whole. The only way to get sensible feature names out of a FunctionTransformer is to have another parameter specify get_feature_names, essentially, or else to count the number of columns output from transform and make up default names on that basis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think a string template (passed to FunctionTransformer constructor?) is enough?

Copy link
Member
@jnothman jnothman Jan 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, something like FunctionTransformer(lambda X_df: X_df['col_a', 'col_b']) should have feature names ['col_a', 'col_b'].

output_feature_names : list of string, length len(input_features)

"""
return input_features
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if some of the features had only nan in them? Shouldn't this method clip them from input_features, as Imputer does to the column itself?

@amueller amueller added the Superseded PR has been replace by a newer PR label Aug 5, 2019
Base automatically changed from master to main January 22, 2021 10:49
@thomasjpfan
Copy link
Member

Closing because we have moved forward with SLEP007 and transformers will adopt get_feature_names_out.

Tracking issue: #21308

@thomasjpfan thomasjpfan closed this Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:preprocessing Superseded PR has been replace by a newer PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants
0