[WIP] Column selector functions for ColumnTransformer #11301

partmor · 2018-06-16T22:37:12Z

Reference Issues

What does this implement?

This PR introduces functions to generate column selector callables that can be passed to make_column_transformer and ColumnTransformer in place of the actual column selections.

For now, I have implemented a selector for dtypes. I am working on a name selector.
(As discussed in #11190)
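
As a rough illustration of the intended usage, the sketch below passes a plain function as the column specifier. It assumes a scikit-learn release in which ColumnTransformer accepts callables (the behaviour this thread is working towards); `numeric_columns` and the toy DataFrame are hypothetical and not part of this PR.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler


def numeric_columns(X):
    """Hypothetical selector: return the names of the numeric columns of X."""
    return list(X.select_dtypes(include="number").columns)


# Toy data, made up for illustration only.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [1000, 2000, 3000],
    "city": ["a", "b", "c"],
})

# The callable is invoked on the input at fit time to resolve the columns.
ct = ColumnTransformer([("num", StandardScaler(), numeric_columns)],
                       remainder="drop")
scaled = ct.fit_transform(df)
```

Defining the selector as a module-level function rather than a lambda keeps the fitted transformer picklable.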

Other comments:

This PR is still incomplete. I have created it early in order to receive feedback.
The example I have included is just a quick refactorization of #11197. I will care further about the formatting when we are happy with the functions.

partmor · 2018-06-16T22:46:47Z

As @amueller highlighted earlier, this would not be robust to unexpected dtypes such as float16. I will think about a solution; one quick thought: hardcode [np.float16,..] and the remaining variations, and apply them by default when float is passed?

jnothman

Please document and test passing a callable.

I'm really not sure that the factory needs to be in the same PR. Let's go with it for now, but anticipate that we might pull that into a separate contribution. Focus on getting the generic, relatively uncontroversial interface enhancement right first.

jnothman · 2018-06-16T23:11:09Z

sklearn/compose/_column_transformer.py

@@ -522,7 +525,10 @@ def _get_column(X, key):
    if column_names:
        if hasattr(X, 'loc'):
            # pandas dataframes
-            return X.loc[:, key]
+            if not callable(key):


Can't we do this before the ifs, setting key = key(X)?

Yes, will do
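
A minimal sketch of that suggestion, assuming a much simplified column getter: `get_column` below is a hypothetical stand-in, not scikit-learn's actual `_get_column`. The callable is resolved once, before any branching on the container type.

```python
import numpy as np


def get_column(X, key):
    """Simplified stand-in for a column getter (not scikit-learn's helper)."""
    # Resolve a callable key once, up front, so the branches below
    # only ever see a concrete column specification.
    if callable(key):
        key = key(X)
    if hasattr(X, "loc"):
        # pandas DataFrame: label-based or boolean-mask selection
        return X.loc[:, key]
    # NumPy array: positional or boolean-mask selection
    return X[:, key]


X = np.arange(12).reshape(3, 4)
# The callable receives the data and returns the column positions to keep.
first_and_last = get_column(X, lambda data: [0, data.shape[1] - 1])
```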

jnothman · 2018-06-16T23:15:33Z

examples/compose/column_selector_callables.py

@@ -0,0 +1,69 @@
+"""
+=======================
+Select Column Functions


I think the existing examples would benefit from this

I will drop this new example then, and enhance the already existing one for ColumnTransformer.

jnothman · 2018-06-16T23:16:38Z

sklearn/compose/_column_transformer.py

@@ -597,6 +608,50 @@ def _get_transformer_list(estimators):
    return transformer_list


+def select_types(dtypes):
+    """Generate a column selector callable (type-based) to be passed to


Pep257: a short summary should be alone on the first line

jnothman · 2018-06-16T23:17:27Z

sklearn/compose/_column_transformer.py

@@ -597,6 +608,50 @@ def _get_transformer_list(estimators):
    return transformer_list


+def select_types(dtypes):


Please add to doc/modules/classes.rst

jnothman · 2018-06-16T23:31:51Z

sklearn/compose/_column_transformer.py

+
+    Parameters
+    ----------
+    dtypes : list of column dtypes to be selected from the dataset.


Numpydoc: type on this line, semantics on the next

I think we should make the same factory also able to select columns by name. Please add a param to do so and rename the function

jnothman · 2018-06-16T23:33:12Z

sklearn/compose/_column_transformer.py

+    """
+    def apply_dtype_mask(X, dtype):
+        if hasattr(X, 'dtypes'):
+            return X.dtypes == dtype


I think we want to use issubdtype

Here I am returning a boolean mask, issubdtype can't help me here.

Well you can use np.asarray([issubdtype(xtype, dtypes) for xtype in X.dtypes], dtype=bool)

Got it. Not to be insistent, but do we prefer np.issubdtype over a direct == for consistency within the module?
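
To make the difference concrete, here is a small sketch contrasting the two approaches on a plain list of dtypes. `dtype_mask` is a hypothetical helper name, and the list of dtypes stands in for `X.dtypes`.

```python
import numpy as np


def dtype_mask(column_dtypes, kinds):
    """Return a boolean mask: True where a column dtype falls under any kind."""
    return np.asarray(
        [any(np.issubdtype(dt, kind) for kind in kinds) for dt in column_dtypes],
        dtype=bool,
    )


# Stand-in for X.dtypes of a four-column dataset.
column_dtypes = [np.dtype("float16"), np.dtype("int64"),
                 np.dtype("O"), np.dtype("float64")]

# Abstract kind: matches every float width.
broad = dtype_mask(column_dtypes, [np.floating])
# Exact comparison: matches float64 only, so the float16 column is missed.
exact = np.asarray([dt == np.float64 for dt in column_dtypes], dtype=bool)
```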

jnothman · 2018-06-16T23:36:18Z

sklearn/compose/_column_transformer.py

+        masks = [apply_dtype_mask(X, t) for t in dtypes]
+        return masks
+
+    return lambda X: np.any(get_dtype_masks(X), axis=0)


We cannot pickle closures. Please implement this as a class with __call__ instead

Yep; using a class with __call__ will also make the code a lot cleaner.
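
A sketch of the pickling concern, using only the standard library. `NameSelector`, `make_closure_selector` and `can_pickle` are hypothetical names chosen for illustration; the selector works on a list of column names to keep the example free of dependencies.

```python
import pickle


class NameSelector:
    """Hypothetical picklable selector: keeps column names with a given prefix."""

    def __init__(self, prefix):
        self.prefix = prefix

    def __call__(self, column_names):
        return [name for name in column_names if name.startswith(self.prefix)]


def make_closure_selector(prefix):
    # Equivalent behaviour, but the returned lambda is a local object
    # that the standard pickle module cannot serialize.
    return lambda column_names: [n for n in column_names if n.startswith(prefix)]


def can_pickle(obj):
    """Report whether obj survives pickle.dumps without raising."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


selector = NameSelector("num_")
# The instance round-trips because its class is importable by name.
restored = pickle.loads(pickle.dumps(selector))
```

The class stores its configuration as plain attributes, so the state is visible and serializable, whereas the closure hides it in a cell that pickle cannot reach.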

jnothman · 2018-06-18T08:47:04Z

issubdtype is preferred because it lets users give broad dtypes rather than a very specific one. Most users don't care about float size or endianness when distinguishing one type of feature from another.

partmor · 2018-06-18T09:36:57Z

@jnothman would we be expecting users to pass as dtypes: np.floating, np.integer, np.number, np.object_ and so on? It worries me especially in the case of categorical columns (pandas dtype object).

Doing np.issubdtype(categorical_column_dtype, 'object') returns True, but with a FutureWarning (which eventually makes my code crash when used inside the functions). np.issubdtype(categorical_column_dtype, np.object_) does the job without warnings. The same applies to np.issubdtype(np.float16, float) vs np.issubdtype(np.float16, np.floating) and so on.

Would we maybe end up doing some user input preprocessing? (e.g. if user passes float, take np.floating and so on...)

jnothman · 2018-06-18T10:07:15Z

Hmmm... this is tricky. Pandas dtypes (CategoricalDtype at least) aren't dtypes and raise an exception with np.issubdtype.

object should be treated as object_ because generic is too broad.

But I'm not sure that float should be treated as float64, which it is in my numpy.

We could consider something like:
select_columns(categorical=T/F, object=T/F, numeric=T/F, float=T/F, integer=T/F, fixedwidth_string=T/F, datetime=T/F, timedelta=T/F, ...) and raise errors on redundant specifications. But we should probably talk to pandas folks before reinventing wheels. @jorisvandenbossche, any ideas?

jorisvandenbossche · 2018-06-18T10:17:56Z

Didn't look yet at the implementation / discussion, but the relevant piece of pandas functionality to look at is DataFrame.select_dtypes method.

jnothman · 2018-06-18T10:48:14Z

Thanks. Perhaps we should assume this helper is only for DataFrames, and use that directly, i.e. use X.iloc[:1].select_dtypes(include=include, exclude=exclude).columns as our mask.
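
A minimal sketch of that idea, assuming the input is always a pandas DataFrame. `columns_by_dtype` is a hypothetical helper name and the data is made up; only `DataFrame.iloc` and `DataFrame.select_dtypes` are real pandas API.

```python
import pandas as pd


def columns_by_dtype(X, include=None, exclude=None):
    """Hypothetical helper: names of the DataFrame columns matching the filters."""
    # Slicing to one row keeps the dtypes while copying very little data.
    return list(X.iloc[:1].select_dtypes(include=include, exclude=exclude).columns)


# Toy data, made up for illustration only.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [1000, 2000, 3000],
    "city": ["a", "b", "c"],
})

numeric = columns_by_dtype(df, include="number")
other = columns_by_dtype(df, exclude="number")
```

Delegating to select_dtypes leaves the handling of pandas-specific dtypes to pandas itself instead of reimplementing it with np.issubdtype.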

jorisvandenbossche · 2018-06-27T22:14:26Z

On the short term, maybe we should start with a PR with only the actual change to ColumnTransformer to accept functions, and leave defining such functions as sklearn API for later (in light of getting something in for the release)?

jnothman · 2018-06-28T00:27:01Z

I agree

amueller · 2018-06-28T13:52:28Z

sounds good to me.

amueller · 2018-06-28T14:23:13Z

I think the main issue might be that lambdas don't pickle, right? So users will need to actually define a function in some Python file (not sure if defining a function in a notebook is sufficient?)

jorisvandenbossche · 2018-07-17T14:48:36Z

@partmor I opened #11592 with only the part to add the actual functionality (we are sprinting with the core devs, and would like to try to get this into the release). We can afterwards further use this PR to add the factory functions (and will make sure you get proper credit on the other PR).

partmor · 2018-07-17T15:06:04Z

@jorisvandenbossche thank you very much for the ping! I will develop with an eye on that PR, sorry for the delays .. :(

partmor added 3 commits June 16, 2018 23:51

define select_types callable factory

0a7749b

include dtype column selector example

75daae1

support selector function for remainder=passthrough (already worked f…

b4081ee

…or drop)

add docstring to example file. apparently sphinx fails if missing

c8e6fd7

jnothman reviewed Jun 16, 2018

amueller mentioned this pull request Jun 28, 2018

Support standard data science use-case #10603

Open

jorisvandenbossche mentioned this pull request Jul 17, 2018

[MRG + 1] ENH: allow to pass callable as column specifier in ColumnTransformer #11592

Merged

jorisvandenbossche mentioned this pull request Oct 5, 2018

ColumnTransformer should be able to use a function to select columns #11190

Closed

amueller mentioned this pull request Oct 5, 2018

Provide factory functions for selecting columns in ColumnTransformer #12303

Closed

amueller added the Superseded PR has been replace by a newer PR label Aug 5, 2019

thomasjpfan mentioned this pull request Oct 28, 2019

FEA Add make_column_selector for ColumnTransformer #12371

Merged

NicolasHug closed this in #12371 Nov 5, 2019
