-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
Allowed trans='passthrough' to handle scalar column input. #14495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Currently the following raises an error: ``` from sklearn.compose import ColumnTransformer import numpy as np X = np.array([[1, 2, 3], [4, 5, 6]]).T ct = ColumnTransformer([('trans', 'passthrough', 0)]) ct.fit_transform(X) ``` If the transformer is 'passthrough', then scalar column names are not accepted, only something like [0] works. This PR changes the FunctionTransformer() being used when transformer is 'passthrough' to one which reshapes the data to be 2D, instead of just passing it through unchanged.
Somehow I remember intentionally not supporting this. If you change I think there's no danger in supporting scalar values in all cases though. |
But then something like |
I think that is exactly the reason why we didn't support it. I'm not opposed to allowing scalars to |
I don't understand why this is the right fix, though. Shouldn't the fix be in the stacking? The identity function is the right function, right? |
Before the stacking is done, |
That makes sense. In an ideal world, should that be the default for |
As of 0.22 |
to ensure the correct shape. | ||
""" | ||
if not getattr(X, 'ndim', 0) == 2: | ||
if hasattr(X, 'values'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't simply X = np.asarray(X)
be better here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's better but it's unclear to me whether either is good ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X = np.asarray(X)
would work on anything that has __array__
(NamedArray
for instance), but values
is specific to pandas
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also values
does a copy.
I think Can you please provide an example for your use-case that motivated this? |
I suppose that an example would be if the you were doing a gridsearch with an estimator that can take in non-numeric columns e.g. catboost, and wanted to try processing the categorical columns with either an sklearn vectorizer, or using the built in processing of the estimator. |
Would use catboost on columns that you'd use DictVectorizer or CountVectorizer on? I would assume that would be more a job for OneHotEncoder?
Arguably you should not do that, and I don't really see a reason why you wouldn't accept
The real cost of any feature is not implementing it but maintaining it. Just because something is "easy to do" is really not a good reason to implement it. |
That is fair enough, makes a lot of sense. If 'passthrough' is only valid for non-scalar columns, would it be worth mentioning this in the docs, or at least raising more helpful error message than |
Yes, both of these sound like very good ideas! |
okay, I will close this pull request and open a new one with those changes, cheers |
Currently the following raises an error:
If the transformer is 'passthrough', then scalar column names are not
accepted, only something like [0] works.
This PR changes the FunctionTransformer() being used when transformer
is 'passthrough' to one which reshapes the data to be 2D, instead of
just passing it through unchanged.
Reference Issues/PRs
This PR needs merging before #14048, as tests need to be added for get_feature_names with passthrough being scalar.