Cannot get feature names after ColumnTransformer #12525
Comments
This is not an issue about ColumnTransformer.
Re 2. perhaps you're right that it's unfriendly that we don't have a clean way to apply a text vectorizer to each column. I'm not sure how that can be cleanly achieved, unless we simply start supporting multiple columns of input in CountVectorizer etc. |
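One possible workaround for point 2, sketched under assumptions: give every text column its own entry in the ColumnTransformer and select it with a single string, so each vectorizer receives a 1-D column of documents. The column names `title` and `body` and the toy data are invented for illustration, and this is not necessarily the clean solution the maintainers had in mind.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "title": ["red apple", "green pear"],
    "body": ["sweet fruit", "sour fruit"],
})
text_columns = ["title", "body"]

# One (name, transformer, column) entry per text column. Passing the column as
# a plain string (not a list) hands the vectorizer a 1-D array of documents.
per_column_vectorizers = ColumnTransformer(
    [(f"text_{col}", CountVectorizer(), col) for col in text_columns]
)

X = per_column_vectorizers.fit_transform(df)
print(X.shape)  # one row per document, one column per vocabulary term
```

Each column gets its own vocabulary, so the same word in `title` and `body` ends up in two separate output features.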
Thanks for your kind reply!

import numpy as np
from sklearn.pipeline import Pipeline

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    # the last transformer is ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1], Pipeline):
            # for a Pipeline, use its last step
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names()
        except AttributeError:  # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        if isinstance(names, np.ndarray):  # NumPy array of names
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)
    return col_name

Using the above code, I can get my |
With respect to eli5, see transform_feature_names (used by explain_weights)
|
1 is a duplicate of #6425, right? I want to write a SLEP on that. And your snippet doesn't really solve the issue because no |
Yes. After a pandas DataFrame is fed into a preprocessing pipeline, it would be better to be able to get the feature names, so that one can tell exactly what happened just from the generated data. |
ok, closing as duplicate. |
I made a tiny enhancement to get back names like rawname_value for one-hot encoded columns:

import numpy as np
from sklearn.pipeline import Pipeline

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    # the last transformer is ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        raw_col_name_reverse = raw_col_name[::-1]
        if isinstance(transformer_in_columns[1], Pipeline):
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names()
            # split only on the first underscore so category values may contain '_'
            exchange_name = [name.split("_", 1) for name in names]
            last_pre_name = ""
            last_raw_name = ""
            for pre_name, value in exchange_name:
                if pre_name != last_pre_name:
                    # new prefix: move on to the next raw column name
                    last_pre_name = pre_name
                    last_raw_name = raw_col_name_reverse.pop()
                col_name.append(last_raw_name + "_" + value)
        except AttributeError:  # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
            if isinstance(names, np.ndarray):  # NumPy array of names
                col_name += names.tolist()
            elif isinstance(names, list):
                col_name += names
            elif isinstance(names, str):
                col_name.append(names)
    return col_name |
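The renaming idea above can also be isolated into a small pure-Python step. The sketch below assumes the encoder's default names follow the pattern `x<index>_<category>`; `rename_onehot_features` is a hypothetical helper, not part of scikit-learn, and names that do not match the pattern are passed through unchanged.

```python
import re

def rename_onehot_features(encoded_names, raw_col_names):
    """Map names like 'x0_male' to 'Sex_male' using the raw column names.

    Assumes the 'x<index>_<category>' pattern; anything else is returned as-is.
    """
    pattern = re.compile(r"^x(\d+)_(.*)$", re.DOTALL)
    renamed = []
    for name in encoded_names:
        match = pattern.match(name)
        if match and int(match.group(1)) < len(raw_col_names):
            index, category = int(match.group(1)), match.group(2)
            renamed.append(f"{raw_col_names[index]}_{category}")
        else:
            renamed.append(name)
    return renamed

# Hypothetical encoder output for two raw columns.
example = rename_onehot_features(
    ["x0_female", "x0_male", "x1_first_class", "x1_third_class"],
    ["Sex", "Pclass"],
)
print(example)
# → ['Sex_female', 'Sex_male', 'Pclass_first_class', 'Pclass_third_class']
```

Because the index is read from the name itself, this does not depend on the names arriving grouped by column, and category values containing underscores survive intact.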
What if you apply SimpleImputer with add_indicator in a pipeline? This approach won't work. |
It would be nice to have a get_feature_names method for this configuration. |
Here is my contribution to the short-term solution. It coerces all the different array types to lists, and it handles the case of SimpleImputer(add_indicator=True). It's also a little more verbose.

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    # the last transformer is ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        print('\n\ntransformer: ', transformer_in_columns[0])
        raw_col_name = list(transformer_in_columns[2])
        if isinstance(transformer_in_columns[1], Pipeline):
            # if pipeline, get the last transformer
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            if isinstance(transformer, OneHotEncoder):
                names = list(transformer.get_feature_names(raw_col_name))
            elif isinstance(transformer, SimpleImputer) and transformer.add_indicator:
                # columns that received a missing-value indicator
                missing_indicator_indices = transformer.indicator_.features_
                missing_indicators = [raw_col_name[idx] + '_missing_flag' for idx in missing_indicator_indices]
                names = raw_col_name + missing_indicators
            else:
                names = list(transformer.get_feature_names())
        except AttributeError as error:
            # no 'get_feature_names' method: fall back to the raw column names
            names = raw_col_name
        print(names)
        col_name.extend(names)
    return col_name |
FYI, I wrote some code and a blog about how to extract the feature names from complex Pipelines & ColumnTransformers. The code is an improvement over my previous post. https://towardsdatascience.com/extracting-plotting-feature-names-importance-from-scikit-learn-pipelines-eb5bfa6a31f4 |
@kylegilde Great article and thanks for the code. Works like a charm. For global explanations I had been wrestling with KernelSHAP and alibi for some hours but didn't get my onehot transformer working without |
Here is another version of @pjgao's snippet that includes the columns from the remainder:

import numpy as np
from sklearn.pipeline import Pipeline

def get_columns_from_transformer(column_transformer, input_colums):
    col_name = []
    # the last transformer is ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1], Pipeline):
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names(raw_col_name)
        except AttributeError:  # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        if isinstance(names, np.ndarray):  # NumPy array of names
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)
    # the remainder entry stores column indices; map them back to input names
    [_, _, reminder_columns] = column_transformer.transformers_[-1]
    for col_idx in reminder_columns:
        col_name.append(input_colums[col_idx])
    return col_name

What do you think about adding a similar function to the core codebase? |
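The remainder handling has a few edge cases worth spelling out: the remainder may be set to "drop", the last entry of `transformers_` may not be a remainder entry at all, and the stored columns may be names instead of indices. The following is a minimal sketch under those assumptions; `remainder_column_names` is a hypothetical helper that works on a plain `(name, transformer, columns)` tuple and is not part of scikit-learn.

```python
def remainder_column_names(last_entry, input_columns):
    """Return the names contributed by a ColumnTransformer remainder entry.

    `last_entry` is assumed to be a (name, transformer, columns) tuple such as
    the last item of `transformers_`. Returns [] when the entry is not the
    remainder or when the remainder is dropped.
    """
    name, transformer, columns = last_entry
    if name != "remainder":
        return []
    if isinstance(transformer, str) and transformer == "drop":
        return []
    names = []
    for col in columns:
        # Columns may be stored as integer positions or as names.
        names.append(col if isinstance(col, str) else input_columns[col])
    return names

input_columns = ["Age", "Sex", "Fare", "Cabin"]
kept = remainder_column_names(("remainder", "passthrough", [2, 3]), input_columns)
dropped = remainder_column_names(("remainder", "drop", [2, 3]), input_columns)
print(kept, dropped)
# → ['Fare', 'Cabin'] []
```

With a helper like this, the loop over the regular transformers would need to iterate over every entry whose name is not "remainder", not blindly over `transformers_[:-1]`.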
Hey @roma-glushko, thanks for your code! For anyone else having an issue with this, I had to make a very minor change to his code:
This block was missing square brackets around raw_col_name since
|
I was able to get it using
|
👆 @chengyineng38 this worked perfect 👌 for me. Thank you 🙏!!! |
With the new version of sklearn, the fix I had no longer works, namely:
I get the error: feature_names.extend(['x%d' % i for i in indices[column]]), which I don't get when I use an older version of sklearn and xgboost. I run
How can I fix this? |
I don't want to throw a wrench in things, but this class worked fine so long as I used |
For certain versions of python, using @kylegilde High level fixes:
|
@melgazar9, good to know! Hopefully, these workarounds all go away whenever #21308 is merged. |
When I use ColumnTransformer to preprocess different columns (including numeric, categorical, and text columns) with a Pipeline, I cannot get the feature names of the final transformed data, which makes debugging hard.
Here is the code:
1. preprocesser.get_feature_names() raises an error: AttributeError: Transformer numeric (type Pipeline) does not provide get_feature_names.
2. In ColumnTransformer, text_transformer can only process a single string (e.g. 'Sex'), but not a list of strings such as text_columns.