FEA Add make_column_selector for ColumnTransformer #12371

thomasjpfan · 2018-10-12T17:50:41Z

Reference Issues/PRs

Addresses #12303 for data type selection.

What does this implement/fix? Explain your changes.

Adds a factory method, make_select_dtypes, to create a callable for ColumnTransformer to select columns based on dtypes.

Closes #12303
Closes #11301

thomasjpfan · 2018-10-13T14:58:41Z

As a reference, I wanted the following example in the docstring of make_select_dtypes:

    >>> from sklearn.compose import ColumnTransformer, make_select_dtypes
    >>> from sklearn.preprocessing import Normalizer, OneHotEncoder
    >>> import pandas as pd
    >>> import numpy as np
    >>> df = pd.DataFrame({
    ...      'a': [0, 2],
    ...      'b': [-2.0, 2.0],
    ...      'c': ['one', 'two']
    ... })
    >>> norm = Normalizer(norm='l1')
    >>> ohe = OneHotEncoder()
    >>> ct = ColumnTransformer(
    ...     [("norm", norm, make_select_dtypes([np.number])),
    ...      ("ohe", ohe, make_select_dtypes([np.object]))])
    >>> ct.fit_transform(df)    # doctest: +NORMALIZE_WHITESPACE
    array([[ 0., -1.,  1.,  0.],
           [0.5, 0.5,  0.,  1 ]])

There may be an issue with using import pandas as pd in the doctest on Python 3.7. It raises a deprecation warning, which will be fixed in pandas 0.24.0.

jnothman · 2018-10-14T06:04:50Z

I don't think the dtype selector needs to be separate from a name selector

thomasjpfan · 2018-10-14T13:31:21Z

When you mentioned pattern matching, did you mean using regular expressions? For example:

make_select_columns(regex="$hello_[bw]orld", dtype_include=[np.number])

jnothman · 2018-10-14T21:12:03Z

Yes, that would work for me. I'd prefer pattern to regex as a name

jnothman

Oh... I forgot you'd opened this! And I forgot that I had pending comments!!

jnothman · 2018-10-15T07:28:28Z

sklearn/compose/_column_transformer.py

@@ -772,3 +774,37 @@ def make_column_transformer(*transformers, **kwargs):
    return ColumnTransformer(transformer_list, n_jobs=n_jobs,
                             remainder=remainder,
                             sparse_threshold=sparse_threshold)
+
+
+def make_select_columns(pattern=None, dtype_include=None, dtype_exclude=None):


This should be a class. A closure is not picklable. (Ideally that should be tested)
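The pickling concern can be reproduced with the standard library alone. The sketch below is illustrative only: make_prefix_selector and PrefixSelector are hypothetical names, not part of this PR, and they select names by prefix rather than by dtype or regex.

```python
import pickle


def make_prefix_selector(prefix):
    """Hypothetical factory returning a closure (a nested function)."""
    def selector(names):
        return [name for name in names if name.startswith(prefix)]
    return selector


class PrefixSelector:
    """Hypothetical callable class with the same behaviour."""

    def __init__(self, prefix):
        self.prefix = prefix

    def __call__(self, names):
        return [name for name in names if name.startswith(self.prefix)]


names = ["col_int", "col_float", "other"]

# pickle stores functions by qualified name, and a nested function
# cannot be looked up that way, so pickling the closure fails.
try:
    pickle.dumps(make_prefix_selector("col_"))
    closure_pickles = True
except (pickle.PicklingError, AttributeError, TypeError):
    closure_pickles = False

# An instance of a module-level class round-trips and keeps its state.
restored = pickle.loads(pickle.dumps(PrefixSelector("col_")))
print(closure_pickles, restored(names))
```

This matters for ColumnTransformer because a fitted estimator holding the selector is typically persisted with pickle or joblib.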

jnothman · 2018-10-15T07:29:19Z

sklearn/compose/_column_transformer.py

+
+
+def make_select_columns(pattern=None, dtype_include=None, dtype_exclude=None):
+    """Creates a column selector callable which can be passed to


Pep257: a one-line summary followed by a blank line before further description

jnothman

Please add an example, at least to the docstring of ColumnTransformer

jnothman · 2019-10-17T21:41:04Z

sklearn/compose/_column_transformer.py

@@ -759,3 +761,54 @@ def is_neg(x): return isinstance(x, numbers.Integral) and x < 0
    elif _determine_key_type(key) == 'int':
        return np.any(np.asarray(key) < 0)
    return False
+
+
+class _ColumnSelector:


This can just be make_column_selector directly, or even public ColumnSelector

jnothman · 2019-10-17T21:41:17Z

sklearn/compose/_column_transformer.py

+        A selection of dtypes to include.
+
+    dtype_exclude : column dtype or list of column dtypes, default=None
+        A selection of dtypes to include.


Suggested change

A selection of dtypes to include.

A selection of dtypes to exclude.

jnothman · 2019-10-17T21:43:35Z

sklearn/compose/_column_transformer.py

+        will not be selected based on pattern.
+
+    dtype_include : column dtype or list of column dtypes, default=None
+        A selection of dtypes to include.


Please reference pandas documentation for select_dtypes

jnothman · 2019-10-17T21:46:18Z

sklearn/compose/_column_transformer.py

+    Parameters
+    ----------
+    pattern : str, default=None
+        Regular expression used to select columns. If None, column selection


Not clear that any column containing this pattern, rather than entirely matching, will be returned

Not clear that columns will only be returned if they match all criteria
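To make the two points concrete, here is a small standard-library sketch of "contains" versus "entire match" semantics, and of combining criteria with a logical AND. The column names and the dtypes mapping are made up for illustration, and this is not the selector's implementation.

```python
import re

names = ["col_int", "col_float", "my_col_str"]
# Hypothetical name -> dtype mapping standing in for a DataFrame.
dtypes = {"col_int": "int64", "col_float": "float64", "my_col_str": "object"}
numeric = {"int64", "float64"}

# "Contains" semantics: the pattern may match anywhere in the name.
contains = [n for n in names if re.search("col_", n)]

# "Entire match" semantics: the whole name must match the pattern.
full = [n for n in names if re.fullmatch("col_", n)]

# Anchoring with ^ restricts a "contains" search to the start of the name.
anchored = [n for n in names if re.search("^col_", n)]

# Combining criteria: a column is kept only if it satisfies all of them.
both = [n for n in names
        if re.search("float|str", n) and dtypes[n] in numeric]

print(contains, full, anchored, both)
```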

jnothman

Thanks @thomasjpfan!

This hasn't yet addressed my request for an example. It's probably also worth covering this in the user guide.

jnothman · 2019-10-19T12:21:12Z

sklearn/compose/_column_transformer.py

+    """Create a callable to select columns to be used with
+    :func:`make_column_transformer` or :class:`ColumnTransformer`.
+
+    :func:`make_column_selector` can select columsn based on a columns datatype


Suggested change

:func:`make_column_selector` can select columsn based on a columns datatype

:func:`make_column_selector` can select columns based on a column's datatype

jnothman · 2019-10-19T12:21:34Z

sklearn/compose/_column_transformer.py

+    Parameters
+    ----------
+    pattern : str, default=None
+        Columns containing this regex pattern will be included. If None, column


Clarify that it is the name of the column

jnothman · 2019-10-19T12:24:55Z

sklearn/compose/_column_transformer.py

+            raise ValueError("make_column_selector can only be applied to "
+                             "pandas dataframes")
+        if self.dtype_include is not None or self.dtype_exclude is not None:
+            cols = (df.iloc[:1].select_dtypes(include=self.dtype_include,


Perhaps just transform df here, and get .columns after the conditional

jnothman · 2019-10-19T12:25:57Z

sklearn/compose/tests/test_column_transformer.py

+    (['col_str'], None, [np.int], '^col_s'),
+    (['col_int', 'col_float', 'col_str'], [np.number, np.object], None, None),
+])
+def test_make_column_selector_with_select_dtypes(cols, include,


Put pattern before include and exclude to match the function signature

jnothman

Otherwise LGTM

jnothman · 2019-10-23T01:33:09Z

doc/whats_new/v0.22.rst

@@ -101,6 +101,10 @@ Changelog
 :mod:`sklearn.compose`
 ......................

+- |Feature|  Adds :func:`compose.make_column_selector` which is used with
+  :class:`compose.ColumnTransformer` to select columns based on data types.


"to select DataFrame columns on the basis of name and dtype"

NicolasHug

Mostly nits but LGTM

In the UG there is this line:

Apart from a scalar or a single item list, the column selection can be specified as a list of multiple items, an integer array, a slice, or a boolean mask

We should also mention the column_selector here, with e.g. appending "(see also the column_selector detailed below)".

NicolasHug · 2019-10-27T22:18:10Z

sklearn/compose/_column_transformer.py

+                                          exclude=self.dtype_exclude)
+        cols = df_row.columns
+        if self.pattern is not None:
+            cols = cols[cols.str.contains(self.pattern)]


Not sure whether we should pass regex=True just in case

Can not hurt in this case. (Just in case pandas changes their default)
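For reference, a minimal sketch of the two pandas calls discussed in this hunk, assuming pandas and NumPy are installed. The DataFrame is made up for illustration; only the select_dtypes and str.contains calls mirror the diff.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col_int": [1, 2],
    "col_float": [1.0, 2.0],
    "col_str": ["one", "two"],
})

# Filter by dtype on a single row; only the column labels are needed.
df_row = df.iloc[:1].select_dtypes(include=[np.number])
cols = df_row.columns

# regex=True is already the pandas default; passing it states the intent.
selected = cols[cols.str.contains("^col_f", regex=True)].tolist()

# A pattern that matches no column name gives an empty selection.
nothing = cols[cols.str.contains("no_such_column", regex=True)].tolist()

print(selected, nothing)
```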

NicolasHug · 2019-10-27T22:18:43Z

sklearn/compose/_column_transformer.py

+
+class make_column_selector:
+    """Create a callable to select columns to be used with
+    :func:`make_column_transformer` or :class:`ColumnTransformer`.


Suggested change

:func:`make_column_transformer` or :class:`ColumnTransformer`.

a :class:`ColumnTransformer`.

Should be enough

NicolasHug · 2019-10-27T22:21:27Z

doc/modules/compose.rst

@@ -515,6 +515,19 @@ above example would be::
                                  ('countvectorizer', CountVectorizer(),
                                   'title')])

+scikit-learn provides a :func:`~sklearn.compose.make_column_selector` to help
+select columns based on data type::


Suggested change

select columns based on data type::

select columns based on data type or column name::

NicolasHug · 2019-10-27T22:24:08Z

sklearn/compose/_column_transformer.py

+    selector : callable
+        Callable for column selection.
+
+    See also


We need another see also in the ColumnTransformer class to link there.

NicolasHug · 2019-10-27T22:26:30Z

sklearn/compose/_column_transformer.py

@@ -69,7 +71,8 @@ class ColumnTransformer(TransformerMixin, _BaseComposition):
            ``transformer`` expects X to be a 1d array-like (vector),
            otherwise a 2d array will be passed to the transformer.
            A callable is passed the input data `X` and can return any of the
-            above.
+            above. To select by name or dtype, use


Suggested change

above. To select by name or dtype, use

above. To select multiple columns by name or dtype, you can use

NicolasHug · 2019-10-27T22:27:49Z

sklearn/compose/_column_transformer.py

+    ...                   'expert_rating': [5, 3, 4, 5]})
+    >>> ct = make_column_transformer(
+    ...       (StandardScaler(),
+    ...        make_column_selector(dtype_include=np.number)),


Suggested change

... make_column_selector(dtype_include=np.number)),

... make_column_selector(dtype_include=np.number)), # rating

NicolasHug · 2019-10-27T22:29:11Z

sklearn/compose/tests/test_column_transformer.py

@@ -1180,3 +1184,83 @@ def test_column_transformer_mask_indexing(array_type):
    )
    X_trans = column_transformer.fit_transform(X)
    assert X_trans.shape == (3, 2)
+
+
+@pytest.mark.parametrize('cols,pattern,include,exclude', [


not sure why you hate spaces in these strings ^^ @thomasjpfan

NicolasHug · 2019-10-27T22:32:31Z

sklearn/compose/tests/test_column_transformer.py

+        selector(X)
+
+
+def test_make_column_selector_pickle():


Why is this test needed?

#12371 (comment)

NicolasHug · 2019-10-27T22:35:52Z

sklearn/compose/tests/test_column_transformer.py

+    (['col_int'], '^col_int', [np.number], None),
+    (['col_float', 'col_str'], 'float|str', None, None),
+    (['col_str'], '^col_s', None, [np.int]),
+    (['col_int', 'col_float', 'col_str'], None, [np.number, np.object], None),


Add a test where none of the conditions are met? That should just return [] right?

NicolasHug · 2019-10-27T22:36:35Z

sklearn/compose/tests/test_column_transformer.py

+    assert_array_equal(selector(X_df), sorted(cols))
+
+
+def test_column_transformer_mixed_dtypes():


Short comment please? e.g.

Functional test for column transformer + column selector

(which I assume is what the test is about)

jorisvandenbossche

Looks good to me as well!
Added a minor comment

doc/modules/compose.rst

jorisvandenbossche · 2019-10-28T12:35:47Z

sklearn/compose/_column_transformer.py

+        cols = df_row.columns
+        if self.pattern is not None:
+            cols = cols[cols.str.contains(self.pattern)]
+        return sorted(cols)


Is there a reason to do the sorted here?
Just converting it to a list would preserve the order of the columns in the original DataFrame ?

Agreed that sorted is not needed here. PR updated to return a list.
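The difference is easy to see with plain Python lists. The column names below are invented for illustration.

```python
# Column labels in the order they appear in a (hypothetical) DataFrame.
cols = ["width", "height", "area"]

# sorted() reorders the labels alphabetically.
alphabetical = sorted(cols)

# list() keeps the original order, so the selected columns line up
# with the layout of the input data.
preserved = list(cols)

print(alphabetical)
print(preserved)
```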

NicolasHug · 2019-10-29T13:18:04Z

@thomasjpfan the comment about the UG, to better expose this new tool, wasn't addressed: #12371 (review)

jnothman · 2019-11-05T00:36:09Z

CI failures... Merge master?

jnothman · 2019-11-05T00:36:57Z

doc/modules/compose.rst

+
+>>> from sklearn.preprocessing import StandardScaler
+>>> from sklearn.compose import make_column_selector
+>>> ct = make_column_transformer(


Actually, it's a real test failure: make_column_transformer is not imported.

NicolasHug · 2019-11-05T14:21:37Z

Thanks thomas

amueller · 2019-11-05T17:22:57Z

sweet!!

amueller · 2019-11-05T17:26:48Z

This is great because it helps us get around pickling issues. But it doesn't make the 80% use case much shorter :-/ I'm not sure if it's worth adding shortcuts like select_object or select_categorical...
We could have a dtype selector just on categorical dtype and that would encourage people to use that dtype.

jnothman · 2019-11-05T20:18:23Z

I think the first intention of this is to help users consider a generic solution rather than specify columns directly by name... Not everyone would have written a callable comfortably. Let's see how it's taken up rather than worrying about efficiency of syntax.
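As a rough illustration of the generic style described here, the sketch below uses the merged make_column_selector with make_column_transformer. It assumes scikit-learn 0.22 or later, pandas, and NumPy are installed. The DataFrame is made up, and selecting the non-numeric columns with dtype_exclude is just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "city": ["London", "London", "Paris", "Sallisaw"],
    "rating": [5, 3, 4, 5],
})

# Select columns by dtype instead of listing their names.
numeric = make_column_selector(dtype_include=np.number)
non_numeric = make_column_selector(dtype_exclude=np.number)

# Select columns whose name contains a regex match.
by_name = make_column_selector(pattern="^rat")

ct = make_column_transformer(
    (StandardScaler(), numeric),       # scales "rating"
    (OneHotEncoder(), non_numeric),    # one-hot encodes "city"
)
X = ct.fit_transform(df)

print(numeric(df), non_numeric(df), by_name(df), X.shape)
```

Each selector is called with the DataFrame at fit time, so the same pipeline can be reused on data with different column names as long as the dtypes or naming pattern hold.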

thomasjpfan added 3 commits October 12, 2018 13:50

ENH: Adds make_select_dtypes

3c2f8f4

STY: Flake8

ff2c7e7

DOC: Remove example

cbd1169

Merge remote-tracking branch 'upstream/master' into select_dtypes_fac…

dca0726

…tory

thomasjpfan changed the title ~~[MRG] Adds factory method: make_select_dtypes for ColumnTransformer~~ [MRG] Adds factory method: make_select_columns for ColumnTransformer Oct 14, 2018

ENH: Adds string regex matching

fac1f55

thomasjpfan force-pushed the select_dtypes_factory branch from 7b8e523 to fac1f55 Compare October 16, 2018 23:03

amueller added the Stalled label Aug 5, 2019

thomasjpfan added 2 commits October 17, 2019 09:02

Merge remote-tracking branch 'upstream/master' into select_dtypes_fac…

ae8ed2f

…tory

CLN Cleans up tests and docstrings

90e3bde

jnothman reviewed Oct 17, 2019

thomasjpfan added 2 commits October 17, 2019 17:26

CLN Makes callable picklable

ff93079

CLN Only uses callable class

71b5bf5

jnothman reviewed Oct 17, 2019

DOC Address joels comments

56c68e9

jnothman reviewed Oct 19, 2019

thomasjpfan and others added 5 commits October 22, 2019 12:43

Merge remote-tracking branch 'upstream/master' into select_dtypes_fac…

4ea1f79

…tory

DOC Adds examples

81a6ccb

DOC Less imports

149b96a

BUG Fix in CI

4b69b61

Reference make_column_transformer when describing columns

2863fab

jnothman approved these changes Oct 23, 2019

CLN Address comments

59f281e

thomasjpfan changed the title ~~[MRG] Adds factory method: make_select_columns for ColumnTransformer~~ [MRG] Adds factory method: make_column_selector for ColumnTransformer Oct 25, 2019

NicolasHug approved these changes Oct 27, 2019

jorisvandenbossche reviewed Oct 28, 2019

thomasjpfan added 2 commits October 28, 2019 10:57

CLN Address comments

d535b66

CLN Convert to list

7b56e46

adrinjalali added the High Priority label and removed the Stalled label Oct 28, 2019

adrinjalali added this to the 0.22 milestone Oct 28, 2019

thomasjpfan added 4 commits October 30, 2019 00:46

DOC Move make_column_selector up in user guide

5893a78

DOC Skip doctest

3f9ce99

Merge remote-tracking branch 'upstream/master' into select_dtypes_fac…

3fa1f6d

…tory

TST Fix tests

f4feaea

jnothman reviewed Nov 5, 2019

thomasjpfan added 2 commits November 4, 2019 20:02

DOC Fix

d970fd2

Merge remote-tracking branch 'upstream/master' into select_dtypes_fac…

4d651c1

…tory

NicolasHug changed the title ~~[MRG] Adds factory method: make_column_selector for ColumnTransformer~~ FEA Add make_column_selector for ColumnTransformer Nov 5, 2019

NicolasHug merged commit 37ac3fd into scikit-learn:master Nov 5, 2019


