FEA Add make_column_selector for ColumnTransformer by thomasjpfan · Pull Request #12371 · scikit-learn/scikit-learn · GitHub

FEA Add make_column_selector for ColumnTransformer #12371

Merged
merged 24 commits into from
Nov 5, 2019

Conversation

thomasjpfan
Member
@thomasjpfan thomasjpfan commented Oct 12, 2018

Reference Issues/PRs

Addresses #12303 for data type selection.

What does this implement/fix? Explain your changes.

Adds a factory method, make_select_dtypes, to create a callable for ColumnTransformer to select columns based on dtypes.

Closes #12303
Closes #11301

@thomasjpfan
Member Author
thomasjpfan commented Oct 13, 2018

As a reference, I wanted the following example in the docstring of make_select_dtypes:

    >>> from sklearn.compose import ColumnTransformer, make_select_dtypes
    >>> from sklearn.preprocessing import Normalizer, OneHotEncoder
    >>> import pandas as pd
    >>> import numpy as np
    >>> df = pd.DataFrame({
    ...      'a': [0, 2],
    ...      'b': [-2.0, 2.0],
    ...      'c': ['one', 'two']
    ... })
    >>> norm = Normalizer(norm='l1')
    >>> ohe = OneHotEncoder()
    >>> ct = ColumnTransformer(
    ...     [("norm", norm, make_select_dtypes([np.number])),
    ...      ("ohe", ohe, make_select_dtypes([np.object]))])
    >>> ct.fit_transform(df)    # doctest: +NORMALIZE_WHITESPACE
    array([[ 0., -1.,  1.,  0.],
           [0.5, 0.5,  0.,  1 ]])

There may be an issue with using import pandas as pd in the doctest on Python 3.7: it raises a deprecation warning, which will be fixed in pandas 0.24.0.
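For reference, here is a sketch of the same example against the API as it was eventually merged, where the selector is named make_column_selector and takes pattern, dtype_include and dtype_exclude (per the later review in this thread). Selecting the string column with dtype_exclude instead of np.object, and passing sparse_threshold=0 to force a dense result, are choices made here for illustration and are not part of the original docstring.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import Normalizer, OneHotEncoder

df = pd.DataFrame({
    "a": [0, 2],
    "b": [-2.0, 2.0],
    "c": ["one", "two"],
})

# Each selector is a callable; ColumnTransformer calls it on the input
# DataFrame at fit time to obtain the list of column names.
numeric = make_column_selector(dtype_include=np.number)
non_numeric = make_column_selector(dtype_exclude=np.number)

ct = ColumnTransformer(
    [("norm", Normalizer(norm="l1"), numeric),
     ("ohe", OneHotEncoder(), non_numeric)],
    sparse_threshold=0,  # always return a dense array
)
result = ct.fit_transform(df)
```

Calling a selector directly on the DataFrame, for example numeric(df), is a quick way to check which columns it would pick before wiring it into a transformer.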

@jnothman
Member

I don't think the dtype selector needs to be separate from a name selector

@thomasjpfan
Member Author

When you mentioned pattern matching, did you mean using regular expressions? For example:

make_select_columns(regex="$hello_[bw]orld", dtype_include=[np.number])

@jnothman
Member
jnothman commented Oct 14, 2018 via email

@thomasjpfan thomasjpfan changed the title [MRG] Adds factory method: make_select_dtypes for ColumnTransformer [MRG] Adds factory method: make_select_columns for ColumnTransformer Oct 14, 2018
Member
@jnothman jnothman left a comment

Oh... I forgot you'd opened this! And I forgot that I had pending comments!!

@@ -772,3 +774,37 @@ def make_column_transformer(*transformers, **kwargs):
    return ColumnTransformer(transformer_list, n_jobs=n_jobs,
                             remainder=remainder,
                             sparse_threshold=sparse_threshold)


def make_select_columns(pattern=None, dtype_include=None, dtype_exclude=None):
Member

This should be a class. A closure is not picklable. (Ideally that should be tested)
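A minimal sketch of the pickling concern, using only the standard library. The names make_closure_selector, ClassSelector and can_pickle are hypothetical and exist only for this illustration; they are not the code under review. The sketch selects from a plain list of names, not from a DataFrame.

```python
import pickle
import re


def make_closure_selector(pattern):
    """Return a selector implemented as a closure (a nested function)."""
    def select(names):
        return [n for n in names if re.search(pattern, n)]
    return select


class ClassSelector:
    """The same selector as a module-level class with a __call__ method."""

    def __init__(self, pattern):
        self.pattern = pattern

    def __call__(self, names):
        return [n for n in names if re.search(self.pattern, n)]


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


names = ["col_int", "col_float", "other"]
closure_selector = make_closure_selector("^col_")
class_selector = ClassSelector("^col_")

# Both callables select the same names ...
assert closure_selector(names) == class_selector(names)

# ... but the nested function cannot be pickled, because pickle stores
# functions by qualified name and cannot look up a local one.
closure_ok = can_pickle(closure_selector)
```

When the class is defined at module level, as it is when this runs as a script, pickle.dumps(class_selector) succeeds, because only the class reference and the instance attributes need to be stored.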



def make_select_columns(pattern=None, dtype_include=None, dtype_exclude=None):
"""Creates a column selector callable which can be passed to
Member

Pep257: a one-line summary followed by a blank line before further description

Member
@jnothman jnothman left a comment

Please add an example, at least to the docstring of ColumnTransformer

@@ -759,3 +761,54 @@ def is_neg(x): return isinstance(x, numbers.Integral) and x < 0
elif _determine_key_type(key) == 'int':
return np.any(np.asarray(key) < 0)
return False


class _ColumnSelector:
Member

This can just be make_column_selector directly, or even public ColumnSelector

A selection of dtypes to include.

dtype_exclude : column dtype or list of column dtypes, default=None
A selection of dtypes to include.
Member

Suggested change
- A selection of dtypes to include.
+ A selection of dtypes to exclude.

will not be selected based on pattern.

dtype_include : column dtype or list of column dtypes, default=None
A selection of dtypes to include.
Member

Please reference pandas documentation for select_dtypes

Parameters
----------
pattern : str, default=None
Regular expression used to select columns. If None, column selection
Member

Not clear that any column containing this pattern, rather than entirely matching, will be returned
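The distinction the reviewer draws can be sketched with the standard re module; the column names and pattern below are made up for illustration. re.search corresponds to "contains" semantics and re.fullmatch to "entirely matching".

```python
import re

columns = ["col_int", "col_float", "my_col_str"]
pattern = "col_s"

# "Contains" semantics: the pattern may match anywhere in the name.
containing = [c for c in columns if re.search(pattern, c)]

# "Entirely matching" semantics: the whole name must match the pattern.
entire = [c for c in columns if re.fullmatch(pattern, c)]

# Anchors narrow a "contains" match without switching semantics.
anchored = [c for c in columns if re.search("^col_", c)]
```

Here containing is ["my_col_str"], entire is empty, and anchored is ["col_int", "col_float"], which is why the docstring needs to say which behavior applies.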

Member

Not clear that columns will only be returned if they match all criteria
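A rough pure-Python sketch of the "match all criteria" behavior the reviewer wants documented. The function select_columns and the use of plain strings as dtype labels are assumptions made for this illustration; the code under review works on pandas DataFrames via select_dtypes.

```python
import re


def select_columns(dtypes, pattern=None, dtype_include=None,
                   dtype_exclude=None):
    """Keep a column only if it passes every criterion that is given.

    `dtypes` maps column name -> dtype label (plain strings in this sketch).
    """
    selected = []
    for name, dtype in dtypes.items():
        if dtype_include is not None and dtype not in dtype_include:
            continue
        if dtype_exclude is not None and dtype in dtype_exclude:
            continue
        if pattern is not None and not re.search(pattern, name):
            continue
        selected.append(name)
    return selected


dtypes = {"col_int": "int", "col_float": "float", "col_str": "object"}

# Name pattern AND dtype filter must both hold.
both = select_columns(dtypes, pattern="^col_", dtype_include=["int", "float"])

# No column satisfies both criteria, so nothing is selected.
none_match = select_columns(dtypes, pattern="^col_s", dtype_include=["int"])
```

In this sketch both is ["col_int", "col_float"] and none_match is an empty list: the criteria are combined with a logical AND, never OR.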

Member
@jnothman jnothman left a comment

Thanks @thomasjpfan!

You haven't addressed my request for an example. It's probably also worth covering this in the user guide.

"""Create a callable to select columns to be used with
:func:`make_column_transformer` or :class:`ColumnTransformer`.

:func:`make_column_selector` can select columsn based on a columns datatype
Member

Suggested change
- :func:`make_column_selector` can select columsn based on a columns datatype
+ :func:`make_column_selector` can select columns based on a column's datatype

Parameters
----------
pattern : str, default=None
Columns containing this regex pattern will be included. If None, column
Member

Clarify that it is the name of the column

raise ValueError("make_column_selector can only be applied to "
"pandas dataframes")
if self.dtype_include is not None or self.dtype_exclude is not None:
cols = (df.iloc[:1].select_dtypes(include=self.dtype_include,
Member

Perhaps just transform df here, and get .columns after the conditional

(['col_str'], None, [np.int], '^col_s'),
(['col_int', 'col_float', 'col_str'], [np.number, np.object], None, None),
])
def test_make_column_selector_with_select_dtypes(cols, include,
Member

Put pattern before include and exclude to match the function signature

Member
@jnothman jnothman left a comment

Otherwise LGTM

@@ -101,6 +101,10 @@ Changelog
:mod:`sklearn.compose`
......................

- |Feature| Adds :func:`compose.make_column_selector` which is used with
:class:`compose.ColumnTransformer` to select columns based on data types.
Member

"to select DataFrame columns on the basis of name and dtype"

@thomasjpfan thomasjpfan changed the title [MRG] Adds factory method: make_select_columns for ColumnTransformer [MRG] Adds factory method: make_column_selector for ColumnTransformer Oct 25, 2019
Member
@NicolasHug NicolasHug left a comment

Mostly nits but LGTM

In the UG there is this line:

Apart from a scalar or a single item list, the column selection can be specified as a list of multiple items, an integer array, a slice, or a boolean mask

We should also mention the column_selector here, with e.g. appending "(see also the column_selector detailed below)".

exclude=self.dtype_exclude)
cols = df_row.columns
if self.pattern is not None:
cols = cols[cols.str.contains(self.pattern)]
Member

Not sure whether we should pass regex=True just in case

Member Author

Can not hurt in this case. (Just in case pandas changes their default)
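To illustrate why the regex flag matters, here is a standard-library analogy (not pandas itself): the same pattern string gives different selections depending on whether it is read as a regular expression or as literal text. The column names are taken from the tests in this PR; the comparison itself is only a sketch.

```python
import re

columns = ["col_int", "col_float", "col_str"]
pattern = "float|str"

# Interpreted as a regular expression, "|" means alternation.
as_regex = [c for c in columns if re.search(pattern, c)]

# Interpreted literally, a name would need to contain the text "float|str".
as_literal = [c for c in columns if pattern in c]
```

The regex reading selects col_float and col_str, while the literal reading selects nothing, so stating the flag explicitly guards against a silent change in results.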


class make_column_selector:
"""Create a callable to select columns to be used with
:func:`make_column_transformer` or :class:`ColumnTransformer`.
Member

Suggested change
- :func:`make_column_transformer` or :class:`ColumnTransformer`.
+ a :class:`ColumnTransformer`.

Should be enough

@@ -515,6 +515,19 @@ above example would be::
('countvectorizer', CountVectorizer(),
'title')])

scikit-learn provides a :func:`~sklearn.compose.make_column_selector` to help
select columns based on data type::
Member

Suggested change
- select columns based on data type::
+ select columns based on data type or column name::

selector : callable
Callable for column selection.

See also
Member

We need another see also in the ColumnTransformer class to link there.

@@ -69,7 +71,8 @@ class ColumnTransformer(TransformerMixin, _BaseComposition):
``transformer`` expects X to be a 1d array-like (vector),
otherwise a 2d array will be passed to the transformer.
A callable is passed the input data `X` and can return any of the
- above.
+ above. To select by name or dtype, use
Member

Suggested change
- above. To select by name or dtype, use
+ above. To select multiple columns by name or dtype, you can use

... 'expert_rating': [5, 3, 4, 5]})
>>> ct = make_column_transformer(
... (StandardScaler(),
... make_column_selector(dtype_include=np.number)),
Member

Suggested change
- ... make_column_selector(dtype_include=np.number)),
+ ... make_column_selector(dtype_include=np.number)), # rating

@@ -1180,3 +1184,83 @@ def test_column_transformer_mask_indexing(array_type):
)
X_trans = column_transformer.fit_transform(X)
assert X_trans.shape == (3, 2)


@pytest.mark.parametrize('cols,pattern,include,exclude', [
Member

not sure why you hate spaces in these strings ^^ @thomasjpfan

selector(X)


def test_make_column_selector_pickle():
Member

Why is this test needed?

Member Author

(['col_int'], '^col_int', [np.number], None),
(['col_float', 'col_str'], 'float|str', None, None),
(['col_str'], '^col_s', None, [np.int]),
(['col_int', 'col_float', 'col_str'], None, [np.number, np.object], None),
Member

Add a test where none of the conditions are met? That should just return [] right?

assert_array_equal(selector(X_df), sorted(cols))


def test_column_transformer_mixed_dtypes():
Member

Short comment please? e.g.

Functional test for column transformer + column selector

(which I assume is what the test is about)

Member
@jorisvandenbossche jorisvandenbossche left a comment

Looks good to me as well!
Added a minor comment

cols = df_row.columns
if self.pattern is not None:
cols = cols[cols.str.contains(self.pattern)]
return sorted(cols)
Member

Is there a reason to do the sorted here?
Just converting it to a list would preserve the order of the columns in the original DataFrame ?

Member Author

Agreed that sorted is not needed here. PR updated to return a list.
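The difference between the two return styles, sketched with a plain list standing in for the DataFrame's column index (the column names are made up for illustration):

```python
columns = ["width", "area", "height"]  # order as it appears in the DataFrame

as_sorted = sorted(columns)  # alphabetical; the original order is lost
as_list = list(columns)      # same order as the original DataFrame
```

Returning a list keeps the selected columns in the order a user sees in their data, which presumably makes the transformer's output easier to line up with the input.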

@adrinjalali adrinjalali added High Priority High priority issues and pull requests and removed Stalled labels Oct 28, 2019
@adrinjalali adrinjalali added this to the 0.22 milestone Oct 28, 2019
@NicolasHug
Member

@thomasjpfan this wasn't addressed about the UG to better expose this new tool #12371 (review)

@jnothman
Member
jnothman commented Nov 5, 2019

CI failures... Merge master?


>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.compose import make_column_selector
>>> ct = make_column_transformer(
Member

Actually, it's a real test failure: make_column_transformer is not imported.

@NicolasHug
Member

Thanks thomas

@NicolasHug NicolasHug changed the title [MRG] Adds factory method: make_column_selector for ColumnTransformer FEA Add make_column_selector for ColumnTransformer Nov 5, 2019
@NicolasHug NicolasHug merged commit 37ac3fd into scikit-learn:master Nov 5, 2019
@amueller
Member
amueller commented Nov 5, 2019

sweet!!

@amueller
Member
amueller commented Nov 5, 2019

This is great because it helps us get around pickling issues. But it doesn't make the 80% usecase much shorter :-/ I'm not sure if it's worth adding shortcuts like select_object or select_categorical...
We could have a dtype selector just on categorical dtype and that would encourage people to use that dtype.

@jnothman
Member
jnothman commented Nov 5, 2019 via email

Labels
High Priority High priority issues and pull requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide factory functions for selecting columns in ColumnTransformer
6 participants