Cannot get feature names after ColumnTransformer · Issue #12525 · scikit-learn/scikit-learn · GitHub
Closed
pjgao opened this issue Nov 6, 2018 · 20 comments

@pjgao
pjgao commented Nov 6, 2018

When I use ColumnTransformer to preprocess different kinds of columns (numeric, categorical, and text) with pipelines, I cannot get the feature names of the final transformed data, which makes debugging hard.

Here is the code:

import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')

data = pd.read_csv(titanic_url)

target = data.pop('survived')

numeric_columns = ['age', 'sibsp', 'parch']
category_columns = ['pclass', 'sex', 'embarked']
text_columns = ['name', 'home.dest']

numeric_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
category_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
text_transformer = Pipeline(steps=[
    ('cntvec', CountVectorizer())
])

preprocesser = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_columns),
    ('category', category_transformer, category_columns),
    ('text', text_transformer, text_columns[0])
])

preprocesser.fit_transform(data)
  1. preprocesser.get_feature_names() raises an error:
    AttributeError: Transformer numeric (type Pipeline) does not provide get_feature_names.
  2. In ColumnTransformer, text_transformer can only process a single column name (e.g. text_columns[0]), not a list of strings like text_columns.
@jnothman
Member
jnothman commented Nov 6, 2018

This is not an issue about ColumnTransformer.

  1. is about Pipeline. Note that eli5 implements a feature names function that can support Pipeline.

Re 2. perhaps you're right that it's unfriendly that we don't have a clean way to apply a text vectorizer to each column. I'm not sure how that can be cleanly achieved, unless we simply start supporting multiple columns of input in CountVectorizer etc.

@pjgao
Author
pjgao commented Nov 6, 2018

Thanks for your kind reply!
As far as I know, when I preprocess a column with a transformer that can turn one column into several (such as OneHotEncoder or CountVectorizer), I can get the new column names from the pipeline's last step via its get_feature_names function; for transformers that do not create new columns, I can just keep the raw column names.

import numpy as np
from sklearn.pipeline import Pipeline

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    # the last entry in transformers_ is the ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1], Pipeline):
            # for a Pipeline, take the last step's transformer
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names()
        except AttributeError:  # no get_feature_names: fall back to the raw column name(s)
            names = raw_col_name
        if isinstance(names, np.ndarray):  # e.g. OneHotEncoder returns an ndarray of names
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)
    return col_name

Using the code above, I can get my preprocesser's column names.
Does this code solve the question?
As for eli5, I couldn't find that function; can you give me a link to an explicit example or the relevant eli5 API?
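
A usage sketch of the helper above, assuming the preprocesser and data from the original post have been defined (the output is sparse because of CountVectorizer, hence the toarray() fallback):

import pandas as pd

# fit first so that preprocesser.transformers_ exists, then recover the names
transformed = preprocesser.fit_transform(data)
feature_names = get_column_names_from_ColumnTransformer(preprocesser)
df_out = pd.DataFrame(
    transformed.toarray() if hasattr(transformed, 'toarray') else transformed,
    columns=feature_names,
)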

@jnothman
Member
jnothman commented Nov 6, 2018 via email

@amueller
Member
amueller commented Nov 7, 2018

1 is a duplicate of #6425, right? I want to write a SLEP on that.
I think supporting multiple text columns is pretty easy with ColumnTransformer. It's not the prettiest code, but you could just add a CountVectorizer for each text column.

And your snippet doesn't really solve the issue, because the lack of get_feature_names doesn't mean you can just use the input column names.
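
For illustration, a minimal sketch of that per-column approach, reusing the text columns from the original post (the transformer names 'text_name' and 'text_dest' are made up, and columns with missing values would need filling first):

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# one CountVectorizer per text column; each receives a single column of strings
text_preprocess = ColumnTransformer(transformers=[
    ('text_name', CountVectorizer(), 'name'),
    ('text_dest', CountVectorizer(), 'home.dest'),
])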

@pjgao
Author
pjgao commented Nov 7, 2018

Yes, once a pandas DataFrame is fed into a preprocessing pipeline, it would be better to be able to get the feature names, so that you can know exactly what happened just from the generated data.

@amueller
Member
amueller commented Nov 7, 2018

ok, closing as duplicate.

@amueller amueller closed this as completed Nov 7, 2018
@miemiekurisu
miemiekurisu commented May 21, 2020

I made a tiny enhancement to @pjgao's function to get back names in the form rawname_value for one-hot encoded columns:

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    # the last entry in transformers_ is the ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        raw_col_name_reverse = raw_col_name[::-1]  # popping from the end yields the original order
        if isinstance(transformer_in_columns[1], Pipeline):
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names()
            # generated names look like 'x0_value'; map the 'x0', 'x1', ... prefixes back to raw names
            exchange_name = [name.split("_", 1) for name in names]
            last_pre_name = ""
            last_raw_name = ""
            for pre_name, value in exchange_name:
                if pre_name != last_pre_name:
                    last_pre_name = pre_name
                    last_raw_name = raw_col_name_reverse.pop()
                col_name.append(last_raw_name + "_" + value)
            continue  # names already appended; skip the generic handling below
        except AttributeError:  # no get_feature_names: fall back to the raw column name(s)
            names = raw_col_name
        if isinstance(names, np.ndarray):
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)
    return col_name

@nickcorona

What if you apply SimpleImputer with add_indicator=True in a pipeline? @pjgao's approach above won't work in that case.
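
For context, a small sketch of why add_indicator changes the output width: the indicator columns have no counterpart in the input names, so falling back to the raw column names yields too few names.

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])
imputer = SimpleImputer(strategy='median', add_indicator=True)
print(imputer.fit_transform(X).shape)  # (3, 2): the imputed column plus one missing-indicator column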

@kylegilde

It would be nice to have a get_feature_names method for this configuration.

@kylegilde
kylegilde commented Jun 8, 2020

Here is my contribution to the short-term solution. It coerces all the different array types to lists, and it handles the case of SimpleImputer(add_indicator=True). It's also a little more verbose.

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []

    # the last entry in transformers_ is the ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        print('\n\ntransformer: ', transformer_in_columns[0])

        raw_col_name = list(transformer_in_columns[2])

        if isinstance(transformer_in_columns[1], Pipeline):
            # if pipeline, get the last transformer
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]

        try:
            if isinstance(transformer, OneHotEncoder):
                names = list(transformer.get_feature_names(raw_col_name))

            elif isinstance(transformer, SimpleImputer) and transformer.add_indicator:
                missing_indicator_indices = transformer.indicator_.features_
                missing_indicators = [raw_col_name[idx] + '_missing_flag'
                                      for idx in missing_indicator_indices]
                names = raw_col_name + missing_indicators

            else:
                names = list(transformer.get_feature_names())

        except AttributeError:
            names = raw_col_name

        print(names)

        col_name.extend(names)

    return col_name

@kylegilde
kylegilde commented Sep 10, 2020

FYI, I wrote some code and a blog about how to extract the feature names from complex Pipelines & ColumnTransformers. The code is an improvement over my previous post. https://towardsdatascience.com/extracting-plotting-feature-names-importance-from-scikit-learn-pipelines-eb5bfa6a31f4

@jobvisser03

@kylegilde Great article and thanks for the code. Works like a charm. For global explanations I had been wrestling with KernelSHAP and alibi for some hours, but I couldn't get my one-hot transformer working without handle_unknown='ignore'.

@roma-glushko

Here is another version of @pjgao's snippet that also includes the columns handled by the remainder:

def get_columns_from_transformer(column_transformer, input_columns):
    col_name = []

    # the last entry in transformers_ is the ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1], Pipeline):
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names(raw_col_name)
        except AttributeError:  # no get_feature_names: fall back to the raw column name(s)
            names = raw_col_name
        if isinstance(names, np.ndarray):  # e.g. OneHotEncoder returns an ndarray of names
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)

    # remainder columns are given as indices into the input columns
    [_, _, remainder_columns] = column_transformer.transformers_[-1]

    for col_idx in remainder_columns:
        col_name.append(input_columns[col_idx])

    return col_name

What do you think about adding a similar function to the core codebase?

@inigohidalgo

Hey @roma-glushko, thanks for your code!

For anyone else having an issue with this, I had to make a very minor change to his code:

def get_columns_from_transformer(column_transformer, input_columns):
    col_name = []

    # the last entry in transformers_ is the ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1], Pipeline):
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names([raw_col_name])
        except AttributeError:  # no get_feature_names: fall back to the raw column name(s)
            names = raw_col_name
        if isinstance(names, np.ndarray):
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)

    [_, _, remainder_columns] = column_transformer.transformers_[-1]

    for col_idx in remainder_columns:
        col_name.append(input_columns[col_idx])

    return col_name

This block was missing square brackets around raw_col_name, since get_feature_names expects a list of feature names:

try:
    names = transformer.get_feature_names([raw_col_name])
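
For reference, this is what the input_features argument does on a fitted OneHotEncoder (get_feature_names exists on older scikit-learn releases; on 1.0+ the equivalent is get_feature_names_out):

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit([['male'], ['female']])
print(ohe.get_feature_names(['sex']))  # ['sex_female' 'sex_male'] on older releases
# on scikit-learn >= 1.0: ohe.get_feature_names_out(['sex'])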

@chengyineng38
chengyineng38 commented Mar 8, 2021

I was able to get it using col_transformer.named_transformers_["ohe"].get_feature_names()

col_transformer = ColumnTransformer(transformers=[("ohe", OneHotEncoder(handle_unknown="ignore"), index_list)], remainder=Normalizer())
col_transformer.named_transformers_["ohe"].get_feature_names()
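
When the ColumnTransformer entry is itself a Pipeline, as in the original post, the same idea works by also going through named_steps; the names 'category' and 'ohe' below are the ones defined at the top of this thread:

# reach the OneHotEncoder nested inside the 'category' pipeline of the fitted preprocesser
ohe = preprocesser.named_transformers_['category'].named_steps['ohe']
print(ohe.get_feature_names(category_columns))  # or get_feature_names_out(...) on scikit-learn >= 1.0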

@tjhallum
tjhallum commented Oct 7, 2021

👆 @chengyineng38 this worked perfectly 👌 for me. Thank you 🙏!!!

@zahs123
zahs123 commented Oct 28, 2021

With the new version of sklearn, the fix I had does not work anymore, namely:

def get_feature_names(coltra):
    """Get feature names from all transformers.
    Returns
    -------
    feature_names : list of strings
        Names of the features produced by transform.
    """
    import inspect
    from sklearn.utils.validation import check_is_fitted
    from numpy import arange

    check_is_fitted(coltra)
    feature_names = []
    for name, trans, column, _ in coltra._iter(fitted=True):
        if trans == 'drop' or (
                hasattr(column, '__len__') and not len(column)):
            continue
        if trans == 'passthrough':
            if hasattr(coltra, '_df_columns'):
                if ((not isinstance(column, slice))
                        and all(isinstance(col, str) for col in column)):
                    feature_names.extend(column)
                else:
                    feature_names.extend(coltra._df_columns[column])
            else:
                indices = arange(coltra._n_features)
                feature_names.extend(['x%d' % i for i in indices[column]])
            continue
        if not hasattr(trans, 'get_feature_names'):
            # ADDED SECTION A
            if hasattr(coltra, '_df_columns'):
                if ((not isinstance(column, slice))
                        and all(isinstance(col, str) for col in column)):
                    feature_names.extend(f'{name}_{col}' for col in column)
                else:
                    feature_names.extend(
                        f'{name}_{col}'
                        for col in coltra._df_columns[column]
                        )
            else:
                indices = arange(coltra._n_features)
                feature_names.extend(['x%d' % i for i in indices[column]])
            continue
            # END SECTION A
        # ADDED SECTION B
        gfn_args = inspect.getfullargspec(trans.get_feature_names).args
        args_to_send = []
        if ('input_features' in gfn_args) and \
                not isinstance(column, slice):
            args_to_send = [column]
        feature_names.extend([name + "__" + f for f in
                                trans.get_feature_names(*args_to_send)])
        # END SECTION B
    return feature_names

I get the error:

feature_names.extend(['x%d' % i for i in indices[column]])
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

which I don't get when I use an older version of sklearn and xgboost.

I run get_feature_names(best_model['preprocessor']), where best_model is best_grid.best_estimator_.

How can I fix this?
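
As a side note, on recent scikit-learn releases the built-in get_feature_names_out largely replaces this kind of workaround (1.0 added it to many transformers, and later releases extend it to pipelines and the remaining transformers). A minimal sketch, assuming a recent scikit-learn and that best_model is a fitted Pipeline whose first step is named 'preprocessor', as in the comment above:

# works when every transformer inside the ColumnTransformer implements get_feature_names_out
feature_names = best_model['preprocessor'].get_feature_names_out()
print(len(feature_names))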

@paulmattheww

I don't want to throw a wrench in things, but this class worked fine as long as I used ColumnTransformer:

https://towardsdatascience.com/extracting-plotting-feature-names-importance-from-scikit-learn-pipelines-eb5bfa6a31f4

@melgazar9
melgazar9 commented Mar 17, 2022

For certain versions of Python, using @kylegilde's get_column_names_from_ColumnTransformer I get the error TypeError: decoding str is not supported, and feature names like x_0, x_1 for some versions of sklearn. I've shared below a slightly modified version of Kyle's great function.

High-level fixes:

  1. Use .format or an f-string in the print statements instead of commas
  2. Add a few if statements based on the version of sklearn (e.g. get_feature_names vs get_feature_names_out)
  3. If you aren't using at least Python 3.7, set clean_column_names=False, since the skimpy package isn't available for earlier versions.
import pandas as pd

from skimpy import clean_columns
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_is_fitted

def get_column_names_from_ColumnTransformer(column_transformer, clean_column_names=True, verbose=True):

    """
    Reference: Kyle Gilde: https://github.com/kylegilde/Kaggle-Notebooks/blob/master/Extracting-and-Plotting-Scikit-Feature-Names-and-Importances/feature_importance.py
    Description: Get the column names from a ColumnTransformer containing transformers & pipelines
    Parameters
    ----------
    column_transformer: a fitted ColumnTransformer
    clean_column_names: Bool indicating whether to clean the new names with skimpy. Default set to True.
    verbose: Bool indicating whether to print summaries. Default set to True.
    Returns
    -------
    a list of the correct feature names
    Note:
    If the ColumnTransformer contains Pipelines and if one of the transformers in the Pipeline is adding completely new columns,
    it must come last in the pipeline. For example, OneHotEncoder, MissingIndicator & SimpleImputer(add_indicator=True) add columns
    to the dataset that didn't exist before, so they should come last in the Pipeline.
    Inspiration: https://github.com/scikit-learn/scikit-learn/issues/12525
    """

    assert isinstance(column_transformer, ColumnTransformer), "Input isn't a ColumnTransformer"

    check_is_fitted(column_transformer)

    new_feature_names, transformer_list = [], []

    for i, transformer_item in enumerate(column_transformer.transformers_):
        transformer_name, transformer, orig_feature_names = transformer_item
        orig_feature_names = list(orig_feature_names)

        if len(orig_feature_names) == 0:
            continue

        if verbose:
            print(f"\n\n{i}. Transformer/Pipeline: {transformer_name} {transformer.__class__.__name__}\n")
            print(f"\tn_orig_feature_names:{len(orig_feature_names)}")

        if transformer == 'drop':
            continue

        if isinstance(transformer, Pipeline):
            # if pipeline, get the last transformer in the Pipeline
            transformer = transformer.steps[-1][1]

        if hasattr(transformer, 'get_feature_names_out'):
            if 'input_features' in transformer.get_feature_names_out.__code__.co_varnames:
                names = list(transformer.get_feature_names_out(orig_feature_names))
            else:
                names = list(transformer.get_feature_names_out())

        elif hasattr(transformer, 'get_feature_names'):
            if 'input_features' in transformer.get_feature_names.__code__.co_varnames:
                names = list(transformer.get_feature_names(orig_feature_names))
            else:
                names = list(transformer.get_feature_names())

        elif hasattr(transformer, 'indicator_') and transformer.add_indicator:
            # is this transformer one of the imputers & did it call the MissingIndicator?
            missing_indicator_indices = transformer.indicator_.features_
            missing_indicators = [orig_feature_names[idx] + '_missing_flag'
                                  for idx in missing_indicator_indices]
            names = orig_feature_names + missing_indicators

        elif hasattr(transformer, 'features_'):
            # is this a MissingIndicator class? its output is one indicator column per feature in features_
            missing_indicator_indices = transformer.features_
            missing_indicators = [orig_feature_names[idx] + '_missing_flag'
                                  for idx in missing_indicator_indices]
            names = missing_indicators

        else:
            names = orig_feature_names

        if verbose:
            print(f"\tn_new_features:{len(names)}")
            print(f"\tnew_features: {names}\n")

        new_feature_names.extend(names)
        transformer_list.extend([transformer_name] * len(names))

    if clean_column_names:
        new_feature_names = list(clean_columns(pd.DataFrame(columns=new_feature_names)).columns)

    return new_feature_names
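
A usage sketch, assuming the fitted preprocesser from the original post; set clean_column_names=False if skimpy isn't available:

names = get_column_names_from_ColumnTransformer(preprocesser, clean_column_names=False, verbose=False)
print(len(names))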

@kylegilde

@melgazar9, good to know! Hopefully these workarounds all go away whenever #21308 is merged.
