Allowed trans='passthrough' to handle scalar column input.#14495
Allowed trans='passthrough' to handle scalar column input.#14495lrjball wants to merge 4 commits intoscikit-learn:masterfrom lrjball:passthrough_trans
Conversation
Currently the following raises an error:
```
from sklearn.compose import ColumnTransformer
import numpy as np
X = np.array([[1, 2, 3], [4, 5, 6]]).T
ct = ColumnTransformer([('trans', 'passthrough', 0)])
ct.fit_transform(X)
```
If the transformer is 'passthrough', then scalar column names are not
accepted, only something like [0] works.
This PR changes the FunctionTransformer() being used when transformer
is 'passthrough' to one which reshapes the data to be 2D, instead of
just passing it through unchanged.
|
Somehow I remember intentionally not supporting this. If you change I think there's no danger in supporting scalar values in all cases though. |
|
But then something like |
|
I think that is exactly the reason why we didn't support it. I'm not opposed to allowing scalars to |
|
I don't understand why this is the right fix, though. Shouldn't the fix be in the stacking? The identity function is the right function, right? |
|
Before the stacking is done, |
|
That makes sense. In an ideal world, should that be the default for |
|
As of 0.22 |
| to ensure the correct shape. | ||
| """ | ||
| if not getattr(X, 'ndim', 0) == 2: | ||
| if hasattr(X, 'values'): |
There was a problem hiding this comment.
wouldn't simply X = np.asarray(X) be better here?
There was a problem hiding this comment.
it's better but it's unclear to me whether either is good ;)
There was a problem hiding this comment.
X = np.asarray(X) would work on anything that has __array__ (NamedArray for instance), but values is specific to pandas.
|
I think Can you please provide an example for your use-case that motivated this? |
|
I suppose that an example would be if the you were doing a gridsearch with an estimator that can take in non-numeric columns e.g. catboost, and wanted to try processing the categorical columns with either an sklearn vectorizer, or using the built in processing of the estimator. |
Would use catboost on columns that you'd use DictVectorizer or CountVectorizer on? I would assume that would be more a job for OneHotEncoder?
Arguably you should not do that, and I don't really see a reason why you wouldn't accept
The real cost of any feature is not implementing it but maintaining it. Just because something is "easy to do" is really not a good reason to implement it. |
|
That is fair enough, makes a lot of sense. If 'passthrough' is only valid for non-scalar columns, would it be worth mentioning this in the docs, or at least raising more helpful error message than |
|
Yes, both of these sound like very good ideas! |
|
okay, I will close this pull request and open a new one with those changes, cheers |
Currently the following raises an error:
If the transformer is 'passthrough', then scalar column names are not
accepted, only something like [0] works.
This PR changes the FunctionTransformer() being used when transformer
is 'passthrough' to one which reshapes the data to be 2D, instead of
just passing it through unchanged.
Reference Issues/PRs
This PR needs merging before #14048, as tests need to be added for get_feature_names with passthrough being scalar.