Allowed trans='passthrough' to handle scalar column input. by lrjball · Pull Request #14495 · scikit-learn/scikit-learn · GitHub
Allowed trans='passthrough' to handle scalar column input. #14495


Closed
wants to merge 4 commits into from

Conversation

Contributor
@lrjball lrjball commented Jul 28, 2019

Currently the following raises an error:

```
from sklearn.compose import ColumnTransformer
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]]).T
ct = ColumnTransformer([('trans', 'passthrough', 0)])
ct.fit_transform(X)
```

If the transformer is 'passthrough', scalar column indices are not
accepted; only a list such as [0] works.

This PR changes the FunctionTransformer() used when the transformer
is 'passthrough' to one that reshapes the data to be 2D, instead of
just passing it through unchanged.

Reference Issues/PRs

This PR needs merging before #14048, as tests need to be added for get_feature_names with passthrough being scalar.

lrjball added 4 commits July 28, 2019 13:22
@adrinjalali
Member

Somehow I remember intentionally not supporting this. If you change passthrough to a StandardScaler, it still fails with a scalar feature index.

I think there's no danger in supporting scalar values in all cases though.

@lrjball
Contributor Author
lrjball commented Jul 29, 2019

But then something like DictVectorizer will only allow a scalar index, and will fail on a list.
Is there a way to support scalar and list features for all transformers, e.g. if you pass a scalar to StandardScaler then it will be converted to a list of length 1, and vice versa for the DictVectorizer?
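For example (a sketch of the DictVectorizer case described here; the toy data is made up):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import DictVectorizer

# One object column holding dicts: the scalar column index 0 hands
# DictVectorizer a 1d iterable of dicts, which is what it expects.
# A list index [0] would hand it a 2d selection and fail.
X = np.array([[{'a': 1}], [{'b': 2}]])
ct = ColumnTransformer([('vec', DictVectorizer(), 0)])
Xt = ct.fit_transform(X)
```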

@amueller
Member

I think that is exactly the reason why we didn't support it.
Right now there's very clear semantics: if you provide a scalar index, it'll be passed as a 1d indexable, otherwise as a 2d indexable.

I'm not opposed to allowing scalars to passthrough though.
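The semantics just described can be seen with plain NumPy indexing:

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]]).T  # shape (3, 2)

# Scalar index -> 1d selection; list index -> 2d selection.
a = X[:, 0]    # shape (3,)
b = X[:, [0]]  # shape (3, 1)
```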

@amueller
Member

I don't understand why this is the right fix, though. Shouldn't the fix be in the stacking? The identity function is the right function, right?

@lrjball
Contributor Author
lrjball commented Jul 29, 2019

Before the stacking is done, _validate_output checks that the output of each transformer is 2d, so if the identity function is used without any reshaping it will raise a ValueError here. Using the identity function but just with reshaping to be 2d seemed like the easiest fix for me, happy to change it though if there is a better solution.

@amueller
Member

That makes sense. In an ideal world, should that be the default for FunctionTransformer?
This just seems an odd place for the logic.
Will the current version work if X is a list of lists?

@lrjball
Contributor Author
lrjball commented Jul 29, 2019

As of 0.22 FunctionTransformer().fit_transform(X) will just return X without any checks for type/shape etc. Am I right in thinking that this is the only Transformer that does not return a 2D array? I suppose behaving like lambda X: X is a sensible default.
I agree that it seems a bit awkward to add this logic in here. Maybe a parameter return_2d could be added to FunctionTransformer, which would default to False, but could be set to True here.
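The identity behaviour mentioned above can be checked directly (validate=False is passed explicitly here, to avoid depending on the version-specific default):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

col = np.array([1, 2, 3])  # a 1d column selection

# With func=None, FunctionTransformer behaves like lambda X: X,
# so 1d input stays 1d -- not the 2D output ColumnTransformer validates for.
ft = FunctionTransformer(validate=False)
out = ft.fit_transform(col)
```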

The review comments below refer to this excerpt from the diff:

```
    to ensure the correct shape.
    """
    if not getattr(X, 'ndim', 0) == 2:
        if hasattr(X, 'values'):
```
Member
wouldn't simply X = np.asarray(X) be better here?

Member
it's better but it's unclear to me whether either is good ;)

Member
X = np.asarray(X) would work on anything that has __array__ (NamedArray for instance), but values is specific to pandas.

Member
also values does a copy.
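To illustrate the difference being discussed (a small sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

arr1 = np.asarray(df)  # works for anything implementing __array__
arr2 = df.values       # pandas-specific attribute
```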

@amueller
Member
amueller commented Aug 7, 2019

I think FunctionTransformer is indeed the only transformer (that's not a meta-estimator) that returns something that's not 2d. Some other transformers return 2d sparse matrices but it's always 2d.

Can you please provide an example for your use-case that motivated this?
It sounds to me like you're trying to replace a vectorizer by a function transformer which doesn't really make a lot of sense in the context of sklearn (unless you do the proper vectorization there).

@lrjball
Contributor Author
lrjball commented Aug 10, 2019

I suppose that an example would be if you were doing a gridsearch with an estimator that can take in non-numeric columns, e.g. catboost, and wanted to try processing the categorical columns with either an sklearn vectorizer or the built-in processing of the estimator.
Also, if you were using a custom 'transformer' which only took in 1d data, then it would be useful to be able to set it to 'passthrough' in a gridsearch.
I know that most of the time 'passthrough' will only be used in place of a transformer, but it doesn't seem like much work to allow it to support all functions.

@amueller
Member

I suppose that an example would be if you were doing a gridsearch with an estimator that can take in non-numeric columns, e.g. catboost, and wanted to try processing the categorical columns with either an sklearn vectorizer or the built-in processing of the estimator.

Would you use catboost on columns that you'd use DictVectorizer or CountVectorizer on? I would assume that would be more a job for OneHotEncoder?

Also, if you were using a custom 'transformer' which only took in 1d data, then it would be useful to be able to set it to passthrough in a gridsearch.

Arguably you should not do that, and I don't really see a reason why you wouldn't accept (n_samples, 1) in this case.
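If a step really only handles 1d input, one way to make it accept the (n_samples, 1) shape ColumnTransformer provides is a small adapter (a hypothetical sketch; wrap_1d_step is made up, not a scikit-learn API):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def wrap_1d_step(func):
    # Flatten the (n_samples, 1) column to 1d for a function that only
    # handles 1d input, then restore the 2d single-column shape.
    return FunctionTransformer(
        lambda X: np.asarray(func(np.asarray(X).ravel())).reshape(-1, 1)
    )

doubler = wrap_1d_step(lambda x: x * 2)  # stand-in for a 1d-only function
```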

I know that most of this time 'passthrough' will only be used in place of a transformer, but it doesn't seem like much work to allow it to support all functions.

The real cost of any feature is not implementing it but maintaining it. Just because something is "easy to do" is really not a good reason to implement it.
You're adding code in a place where you say it's awkward. Sounds like a code-smell and a bunch of bugs waiting to happen, for a feature for which we can't really identify a use-case.
If there's a real use-case we want to fix I would consider that, but I think the current API is consistent and the reason it's not supported and why it's awkward to support is because it doesn't make sense.

@lrjball
Contributor Author
lrjball commented Aug 13, 2019

That is fair enough, makes a lot of sense. If 'passthrough' is only valid for non-scalar columns, would it be worth mentioning this in the docs, or at least raising a more helpful error message than ValueError: The output of the 'trans' transformer should be 2D (scipy matrix, array, or pandas DataFrame).?

@amueller
Member

Yes, both of these sound like very good ideas!

@lrjball
Contributor Author
lrjball commented Aug 18, 2019

okay, I will close this pull request and open a new one with those changes, cheers

@lrjball lrjball closed this Aug 18, 2019