-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[WIP] Column selector functions for ColumnTransformer #11301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
As highlighted by @amueller earlier, this would not be robust to occurrences like unexpected float16. I will think about a solution for this; one quick thought would be: hardcode [np.float16,..] and rest of variations to be applied by default when float is passed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document and test passing a callable.
I'm really not sure that the factory needs to be in the same pr. Let's go with it for now, but anticipate that we might pull that into a separate contribution. Focus on getting the generic, relatively uncontroversial interface enhancement right first
@@ -522,7 +525,10 @@ def _get_column(X, key): | |||
if column_names: | |||
if hasattr(X, 'loc'): | |||
# pandas dataframes | |||
return X.loc[:, key] | |||
if not callable(key): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we do this before the ifs, setting key = key(X)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will do
@@ -0,0 +1,69 @@ | |||
""" | |||
======================= | |||
Select Column Functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the existing examples would benefit from this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will drop this new example then, and enhance the already existing one for ColumnTransformer
.
@@ -597,6 +608,50 @@ def _get_transformer_list(estimators): | |||
return transformer_list | |||
|
|||
|
|||
def select_types(dtypes): | |||
"""Generate a column selector callable (type-based) to be passed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pep257: a short summary should be alone on the first line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
@@ -597,6 +608,50 @@ def _get_transformer_list(estimators): | |||
return transformer_list | |||
|
|||
|
|||
def select_types(dtypes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add to doc/modules/classes.rst
|
||
Parameters | ||
---------- | ||
dtypes : list of column dtypes to be selected from the dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numpydoc: type on this line, semantics on the next
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make the same factory also able to select columns by nane. Please add a param to do so and rename the function
""" | ||
def apply_dtype_mask(X, dtype): | ||
if hasattr(X, 'dtypes'): | ||
return X.dtypes == dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to use issubdtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I am returning a boolean mask, issubdtype can't help me here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well you can use np.asarray([issubdtype(xtype, dtypes) for xtype in X.dtypes], dtype=bool)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Not to be insistent; we prefer to use np.issubdtype
rather than ==
directly for consistency in the module?
masks = [apply_dtype_mask(X, t) for t in dtypes] | ||
return masks | ||
|
||
return lambda X: np.any(get_dtype_masks(X), axis=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot pickle closures. Please implement this as a class with __call__
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep; also using a class with __call__
will also make the code a lot cleaner.
issubdtype is preferred to allow users to give broad dtypes rather than a
very specific one. most users don't care about float size or endianness to
distinguish one type of feature from another.
|
@jnothman would we be expecting users to pass as dtypes: Doing Would we maybe end up doing some user input preprocessing? (e.g. if user passes |
Hmmm... this is tricky. Pandas dtypes (CategoricalDtype at least) aren't dtypes and raise an exception with np.issubdtype. object should be treated as object_ because generic is too broad. But I'm not sure that float should be treated as float64, which it is in my numpy. We could consider something like: |
Didn't look yet at the implementation / discussion, but the relevant piece of pandas functionality to look at is |
Thanks. Perhaps we should assume this helper is only for DataFrames, and use that directly, i.e. use |
On the short term, maybe we should start with a PR with only the actual change to |
I agree
|
sounds good to me. |
I think the main issue might be that lambdas don't pickle, right? So users will need to actually define a function in some python file (not sure if defining a function in a notebook is sufficient?) |
@jorisvandenbossche thank you very much for the ping! I will develop with an eye on that PR, sorry for the delays .. :( |
Reference Issues
Closes #11190
What does this implement?
This PR introduces functions to generate column selector callables that can be passed to
make_column_transformer
andColumnTransformer
in place of the actual column selections.For now, I have implemen 8000 ted a selector for dtypes. I am working on a name selector.
(As discussed in #11190)
Other comments:
This PR is still incomplete. I have created it early in order to receive feedback.
The example I have included is just a quick refactorization of #11197. I will care further about the formatting when we are happy with the functions.