FEA Add make_column_selector for ColumnTransformer #12371
As a reference, I wanted the following example in the docstring of
>>> from sklearn.compose import ColumnTransformer, make_select_dtypes
>>> from sklearn.preprocessing import Normalizer, OneHotEncoder
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
... 'a': [0, 2],
... 'b': [-2.0, 2.0],
... 'c': ['one', 'two']
... })
>>> norm = Normalizer(norm='l1')
>>> ohe = OneHotEncoder()
>>> ct = ColumnTransformer(
... [("norm", norm, make_select_dtypes([np.number])),
... ("ohe", ohe, make_select_dtypes([np.object]))])
>>> ct.fit_transform(df) # doctest: +NORMALIZE_WHITESPACE
array([[ 0., -1., 1., 0.],
       [0.5, 0.5, 0., 1. ]])
There may be an issue with using |
I don't think the dtype selector needs to be separate from a name selector |
When you mentioned pattern matching, did you mean using regular expressions? For example: make_select_columns(regex="$hello_[bw]orld", dtype_include=[np.number]) |
Yes, that would work for me. I'd prefer pattern to regex as a name
|
7b8e523
to
fac1f55
Compare
Oh... I forgot you'd opened this! And I forgot that I had pending comments!!
@@ -772,3 +774,37 @@ def make_column_transformer(*transformers, **kwargs): | |||
return ColumnTransformer(transformer_list, n_jobs=n_jobs, | |||
remainder=remainder, | |||
sparse_threshold=sparse_threshold) | |||
|
|||
|
|||
def make_select_columns(pattern=None, dtype_include=None, dtype_exclude=None): |
This should be a class. A closure is not picklable. (Ideally that should be tested)
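To illustrate the pickling concern, here is a minimal sketch using only the standard library. The names `make_pattern_selector`, `PatternSelector` and `is_picklable` are hypothetical and are not part of this PR; they only contrast a closure-based factory with a class-based callable.

```python
import pickle
import re


def make_pattern_selector(pattern):
    """Closure-based factory (hypothetical): returns a nested function."""
    def select(column_names):
        return [name for name in column_names if re.search(pattern, name)]
    return select


class PatternSelector:
    """Class-based equivalent (hypothetical): state lives on the instance."""

    def __init__(self, pattern):
        self.pattern = pattern

    def __call__(self, column_names):
        return [name for name in column_names
                if re.search(self.pattern, name)]


def is_picklable(obj):
    """Return True if pickle.dumps succeeds on obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


closure_selector = make_pattern_selector("^col_")
class_selector = PatternSelector("^col_")

print(is_picklable(closure_selector))  # nested function: pickling fails
print(is_picklable(class_selector))    # instance of a module-level class

# The class-based selector survives a round trip and still works.
restored = pickle.loads(pickle.dumps(class_selector))
print(restored(["col_a", "other", "col_b"]))
```

The standard `pickle` module stores functions by qualified name, so a function defined inside another function cannot be looked up again on load. An instance of a module-level class only needs its attributes serialized.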
|
||
|
||
def make_select_columns(pattern=None, dtype_include=None, dtype_exclude=None): | ||
"""Creates a column selector callable which can be passed to |
PEP 257: a one-line summary followed by a blank line before further description
Please add an example, at least to the docstring of ColumnTransformer
@@ -759,3 +761,54 @@ def is_neg(x): return isinstance(x, numbers.Integral) and x < 0 | |||
elif _determine_key_type(key) == 'int': | |||
return np.any(np.asarray(key) < 0) | |||
return False | |||
|
|||
|
|||
class _ColumnSelector: |
This can just be make_column_selector directly, or even public ColumnSelector
A selection of dtypes to include. | ||
|
||
dtype_exclude : column dtype or list of column dtypes, default=None | ||
A selection of dtypes to include. |
A selection of dtypes to include. | |
A selection of dtypes to exclude. |
will not be selected based on pattern. | ||
|
||
dtype_include : column dtype or list of column dtypes, default=None | ||
A selection of dtypes to include. |
Please reference pandas documentation for select_dtypes
Parameters | ||
---------- | ||
pattern : str, default=None | ||
Regular expression used to select columns. If None, column selection |
Not clear that any column containing this pattern, rather than entirely matching, will be returned
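A small sketch of the distinction, using the standard `re` module as a stand-in: pandas' `str.contains` reports a match when the pattern is found anywhere in the string, which corresponds to `re.search` rather than `re.fullmatch`. The column names below are made up for illustration.

```python
import re

# Hypothetical column names, chosen only to show the difference.
column_names = ["col_int", "col_float", "my_col_str", "other"]
pattern = "col_"

# "Contains" semantics: the pattern may appear anywhere in the name.
contains = [n for n in column_names if re.search(pattern, n)]

# Anchoring the pattern restricts matches to the start of the name.
anchored = [n for n in column_names if re.search("^" + pattern, n)]

# Full-match semantics: the whole name must match the pattern.
full = [n for n in column_names if re.fullmatch(pattern, n)]

print(contains)  # ['col_int', 'col_float', 'my_col_str']
print(anchored)  # ['col_int', 'col_float']
print(full)      # []
```

A user who wants whole-name matching can still get it under "contains" semantics by anchoring the pattern with `^` and `$`.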
Not clear that columns will only be returned if they match all criteria
Thanks @thomasjpfan!
This hasn't addressed my request for an example. It's probably also worth covering this in the user guide.
"""Create a callable to select columns to be used with | ||
:func:`make_column_transformer` or :class:`ColumnTransformer`. | ||
|
||
:func:`make_column_selector` can select columsn based on a columns datatype |
:func:`make_column_selector` can select columsn based on a columns datatype | |
:func:`make_column_selector` can select columns based on a column's datatype |
Parameters | ||
---------- | ||
pattern : str, default=None | ||
Columns containing this regex pattern will be included. If None, column |
Clarify that it is the name of the column
raise ValueError("make_column_selector can only be applied to " | ||
"pandas dataframes") | ||
if self.dtype_include is not None or self.dtype_exclude is not None: | ||
cols = (df.iloc[:1].select_dtypes(include=self.dtype_include, |
Perhaps just transform df here, and get .columns after the conditional
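One way the suggested restructuring could look, pieced together from the diff fragments quoted in this thread. This is a sketch, not the merged implementation: the class name `ColumnSelectorSketch` and the `hasattr(df, "iloc")` check are assumptions, since the actual DataFrame check is not visible in the quoted hunks.

```python
import numpy as np
import pandas as pd


class ColumnSelectorSketch:
    """Hypothetical stand-in for the selector under review."""

    def __init__(self, pattern=None, dtype_include=None, dtype_exclude=None):
        self.pattern = pattern
        self.dtype_include = dtype_include
        self.dtype_exclude = dtype_exclude

    def __call__(self, df):
        # Assumed check; the real condition is not shown in the diff.
        if not hasattr(df, "iloc"):
            raise ValueError("make_column_selector can only be applied to "
                             "pandas dataframes")
        # One row is enough: only column labels and dtypes are needed.
        df_row = df.iloc[:1]
        if self.dtype_include is not None or self.dtype_exclude is not None:
            df_row = df_row.select_dtypes(include=self.dtype_include,
                                          exclude=self.dtype_exclude)
        # Take .columns once, after the conditional.
        cols = df_row.columns
        if self.pattern is not None:
            # The pattern is applied to the names that survived the dtype
            # filter, so a column must satisfy every criterion given.
            cols = cols[cols.str.contains(self.pattern, regex=True)]
        return cols.tolist()


df = pd.DataFrame({
    "col_int": [0, 2],
    "col_float": [-2.0, 2.0],
    "col_str": ["one", "two"],
})

print(ColumnSelectorSketch(dtype_include=np.number)(df))
print(ColumnSelectorSketch(pattern="float|str")(df))
# No numeric column name starts with "col_s", so nothing is selected.
print(ColumnSelectorSketch(pattern="^col_s", dtype_include=np.number)(df))
```

Because the dtype filter runs first and the pattern only narrows its result, the criteria combine as a logical AND, and a selector whose criteria exclude everything returns an empty list.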
(['col_str'], None, [np.int], '^col_s'), | ||
(['col_int', 'col_float', 'col_str'], [np.number, np.object], None, None), | ||
]) | ||
def test_make_column_selector_with_select_dtypes(cols, include, |
Put pattern before include and exclude to match the function signature
Otherwise LGTM
doc/whats_new/v0.22.rst
Outdated
@@ -101,6 +101,10 @@ Changelog | |||
:mod:`sklearn.compose` | |||
...................... | |||
|
|||
- |Feature| Adds :func:`compose.make_column_selector` which is used with | |||
:class:`compose.ColumnTransformer` to select columns based on data types. |
"to select DataFrame columns on the basis of name and dtype"
Mostly nits but LGTM
In the UG there is this line:
Apart from a scalar or a single item list, the column selection can be specified as a list of multiple items, an integer array, a slice, or a boolean mask
We should also mention the column_selector here, with e.g. appending "(see also the column_selector detailed below)".
exclude=self.dtype_exclude) | ||
cols = df_row.columns | ||
if self.pattern is not None: | ||
cols = cols[cols.str.contains(self.pattern)] |
Not sure whether we should pass regex=True just in case
It can't hurt in this case (just in case pandas changes its default).
|
||
class make_column_selector: | ||
"""Create a callable to select columns to be used with | ||
:func:`make_column_transformer` or :class:`ColumnTransformer`. |
:func:`make_column_transformer` or :class:`ColumnTransformer`. | |
a :class:`ColumnTransformer`. |
Should be enough
doc/modules/compose.rst
Outdated
@@ -515,6 +515,19 @@ above example would be:: | |||
('countvectorizer', CountVectorizer(), | |||
'title')]) | |||
|
|||
scikit-learn provides a :func:`~sklearn.compose.make_column_selector` to help | |||
select columns based on data type:: |
select columns based on data type:: | |
select columns based on data type or column name:: |
selector : callable | |
Callable for column selection. | ||
|
||
See also |
We need another see also in the ColumnTransformer class to link there.
@@ -69,7 +71,8 @@ class ColumnTransformer(TransformerMixin, _BaseComposition): | |||
``transformer`` expects X to be a 1d array-like (vector), | |||
otherwise a 2d array will be passed to the transformer. | |||
A callable is passed the input data `X` and can return any of the | |||
above. | |||
above. To select by name or dtype, use |
above. To select by name or dtype, use | |
above. To select multiple columns by name or dtype, you can use |
... 'expert_rating': [5, 3, 4, 5]}) | ||
>>> ct = make_column_transformer( | ||
... (StandardScaler(), | ||
... make_column_selector(dtype_include=np.number)), |
... make_column_selector(dtype_include=np.number)), | |
... make_column_selector(dtype_include=np.number)), # rating |
@@ -1180,3 +1184,83 @@ def test_column_transformer_mask_indexing(array_type): | |||
) | |||
X_trans = column_transformer.fit_transform(X) | |||
assert X_trans.shape == (3, 2) | |||
|
|||
|
|||
@pytest.mark.parametrize('cols,pattern,include,exclude', [ |
not sure why you hate spaces in these strings ^^ @thomasjpfan
selector(X) | ||
|
||
|
||
def test_make_column_selector_pickle(): |
Why is this test needed?
(['col_int'], '^col_int', [np.number], None), | ||
(['col_float', 'col_str'], 'float|str', None, None), | ||
(['col_str'], '^col_s', None, [np.int]), | ||
(['col_int', 'col_float', 'col_str'], None, [np.number, np.object], None), |
Add a test where none of the conditions are met? That should just return [] right?
assert_array_equal(selector(X_df), sorted(cols)) | ||
|
||
|
||
def test_column_transformer_mixed_dtypes(): |
Short comment please? e.g.
Functional test for column transformer + column selector
(which I assume is what the test is about)
Looks good to me as well!
Added a minor comment
cols = df_row.columns | ||
if self.pattern is not None: | ||
cols = cols[cols.str.contains(self.pattern)] | ||
return sorted(cols) |
Is there a reason to do the sorted here?
Just converting it to a list would preserve the order of the columns in the original DataFrame?
Agreed that sorted is not needed here. PR updated to return a list.
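For reference, a tiny sketch of the difference being discussed, with made-up column names: `sorted` reorders the selected names alphabetically, while `list` keeps them in the order they appear in the DataFrame.

```python
# Hypothetical column order, as it might appear in a DataFrame.
column_names = ["zip_code", "age", "city"]

# Pretend the selector kept every column except "city".
selected = [name for name in column_names if name != "city"]

print(sorted(selected))  # ['age', 'zip_code']  alphabetical order
print(list(selected))    # ['zip_code', 'age']  original order preserved
```

Keeping the original order means the selected columns reach the transformer in the same order the user sees in their data.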
@thomasjpfan the request to cover this in the UG, to better expose this new tool, wasn't addressed: #12371 (review) |
CI failures... Merge master? |
doc/modules/compose.rst
Outdated
|
||
>>> from sklearn.preprocessing import StandardScaler | ||
>>> from sklearn.compose import make_column_selector | ||
>>> ct = make_column_transformer( |
Actually, it's a real test failure: make_column_transformer is not imported.
Thanks thomas |
sweet!! |
This is great because it helps us get around pickling issues. But it doesn't make the 80% usecase much shorter :-/ I'm not sure if it's worth adding shortcuts like |
I think the first intention of this is to help users consider a generic
solution rather than specify columns directly by name... Not everyone would
have written a callable comfortably.
Let's see how it's taken up rather than worrying about efficiency of syntax.
|
Reference Issues/PRs
Addresses #12303 for data type selection.
What does this implement/fix? Explain your changes.
Adds a factory method, make_select_dtypes, to create a callable for ColumnTransformer to select columns based on dtypes.
Closes #12303
Closes #11301