[MRG] FIX ColumnTransformer: raise error on reordered columns with remainder #14237

schuderer · 2019-07-02T18:11:20Z

Reference Issues/PRs

Fixes #14223.

What does this implement/fix? Explain your changes.

ColumnTransformer silently passes the wrong columns along as remainder when all the following conditions are met:

specifying columns by name (strings),
using the remainder option, and
using DataFrames where column ordering differs between fit and transform
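
To make the failure mode concrete, here is a minimal pure-Python sketch (no scikit-learn or pandas involved). It assumes, as the `remaining_idx` diff further down in this thread suggests, that the remainder is remembered as positional indices at fit time while the named columns are looked up by name. The helper `fit_remainder_idx` and the column names are hypothetical and only serve the illustration.

```python
def fit_remainder_idx(fit_columns, named):
    """Return the positions of all columns not claimed by name at fit time."""
    claimed = {fit_columns.index(name) for name in named}
    return sorted(set(range(len(fit_columns))) - claimed)


fit_columns = ["a", "b", "c"]
named = ["a"]  # column handled by an explicit transformer

# Positions 1 and 2 ("b" and "c") are recorded as the remainder.
remainder_idx = fit_remainder_idx(fit_columns, named)

# At transform time the same data arrives with a different column order.
transform_columns = ["c", "a", "b"]

# Reusing the stored positions now selects the wrong columns:
# "a" leaks into the remainder and "c" is silently dropped.
passed_through = [transform_columns[i] for i in remainder_idx]
print(passed_through)
```

The named column is still found correctly, which is why nothing fails loudly; only the positional remainder goes wrong.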

Changes:

Raise a ValueError if column ordering changes while using remainder combined with named columns
Added a regression test
Class doc, param remainder: stated that named columns, remainder and changing column order don't mix.
Class doc, 'Note' section: added text stating that the ability to reference columns by name does not mean that consistent column order is irrelevant.

Any other comments?

Although the nested if looks clumsy, it is there for clarity: the outer condition decides whether the check is relevant, and the inner one performs the check itself.

@amueller I can't say I am completely satisfied with how I put the columns check squarely in the middle of the transform method. I looked for a good alternative spot for a long time, but one existing function did not know self, the other did not know X, and I did not want to do any restructuring of existing code just yet. Your opinion on this is appreciated!

(This being my first contribution to a major open source project, I appreciate tips on what I should do differently or should have done better.)

schuderer · 2019-07-03T08:59:49Z

I have trouble understanding what exactly is behind the failed CI tests. I apparently have worsened test coverage in my fix -- although the regression test I added enters the conditional branches I added. In addition to that some azure-pipelines tests fail after 15 minutes with Cmd.exe/Bash exited with code '1', and I can't find any pointer in the logs about what exactly went wrong (except that some workers failed to return coverage data).

Appreciate any tips/hints on where to look and how to fix this.

thomasjpfan · 2019-07-03T12:25:22Z

Merge with master should resolve the CI issue.

amueller · 2019-07-03T18:12:19Z

There's a stream of consciousness on the bigger issues I see in this realm here lol:
#14251

I think we should consider all the possible cases and then think about the behavior we want. I think it's very weird that reordering columns is fine in some cases but not in others, that's not very transparent.

jnothman

This is nice, thanks. Just needs tweaks

jnothman · 2019-07-05T00:03:43Z

sklearn/compose/_column_transformer.py

@@ -80,6 +81,9 @@ class ColumnTransformer(_BaseComposition, TransformerMixin):
        By setting ``remainder`` to be an estimator, the remaining
        non-specified columns will use the ``remainder`` estimator. The
        estimator must support :term:`fit` and :term:`transform`.
+        If columns in `transformers` are provided as strings, the ordering


How about "Note that using this feature requires that the DataFrame columns input at fit and transform have identical order."

Much clearer, will replace.

jnothman · 2019-07-05T00:04:29Z

sklearn/compose/_column_transformer.py

@@ -135,6 +139,10 @@ class ColumnTransformer(_BaseComposition, TransformerMixin):
    dropped from the resulting transformed feature matrix, unless specified
    in the `passthrough` keyword. Those columns specified with `passthrough`
    are added at the right to the output of the transformers.
+    Although this, as well as the option to specify column names,


Does this comment pertain to passthrough?

I intended this note to be a general (not passthrough-specific) heads-up to not draw incorrect conclusions from the ability to reference columns by name. If you are asking whether column reordering also causes problems with the remainder='passthrough' option, then this to my understanding is the case, too. Does this answer your question? If the text I added is superfluous or unclear, I can remove it.

Let's leave this text out for now, if that's alright. I think it could be clearer, but would rather fix the issue at hand than worry about it.

jnothman · 2019-07-05T00:04:53Z

sklearn/compose/_column_transformer.py

+        # No column reordering allowed for named cols combined with remainder
+        if self._remainder[2] is not None and hasattr(self, '_df_columns'):
+            if len(X.columns) != len(self._df_columns) \
+               or any(X.columns != self._df_columns):


Should we also allow for appended columns to maintain backwards compatibility?

No problem with that from my side. The current check is stricter than really necessary for this fix anyway. I will add a commit to change it and reflect it in the unit tests.
BTW: who commonly clicks "resolve conversation", the reviewer or the PR author?

I don't regularly use the "resolve conversation" feature
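
A minimal sketch of the relaxed check discussed above, assuming "allow appended columns" means that the columns seen at fit must form an unchanged prefix of the columns seen at transform. The function name `check_column_order` is hypothetical; the error text reuses only the phrase asserted in the regression test.

```python
def check_column_order(fit_columns, transform_columns):
    """Raise unless the fit-time columns are an unchanged prefix at transform."""
    n_fit = len(fit_columns)
    too_few = len(transform_columns) < n_fit
    reordered = list(transform_columns[:n_fit]) != list(fit_columns)
    if too_few or reordered:
        raise ValueError("Column ordering must be equal")


# Same order plus an appended column: accepted.
check_column_order(["a", "b"], ["a", "b", "extra"])

# Reordered columns: rejected.
try:
    check_column_order(["a", "b"], ["b", "a"])
except ValueError as exc:
    print("rejected:", exc)
```

Comparing a slice rather than the full list is what keeps the check backwards compatible with inputs that gain extra trailing columns.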

jnothman

Otherwise lgtm

Please add an entry to the change log for 0.21.3 at doc/whats_new/v0.21.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

jnothman · 2019-07-07T11:46:51Z

sklearn/compose/tests/test_column_transformer.py

+                         'Column ordering must be equal',
+                         tf.transform, X_trans_df)
+
+    # No for added columns if ordering is equal


Suggested change

-    # No for added columns if ordering is equal
+    # No error for added columns if ordering is identical

schuderer · 2019-07-07T16:48:32Z

Thanks for your suggestions and help. Feel free to squash my commits.

jnothman · 2019-07-07T21:53:35Z

As a rule, we squash when merging

jnothman

Thank you! Another reviewer to help towards release?

thomasjpfan · 2019-07-07T22:02:08Z

sklearn/compose/_column_transformer.py

+        # No column reordering allowed for named cols combined with remainder
+        if self._remainder[2] is not None and hasattr(self, '_df_columns'):
+            n_cols_fit = len(self._df_columns)
+            n_cols_transform = len(X.columns)


This will raise an error if X is a numpy array. Check if X is a pandas DataFrame before heading into this path?

In a future PR, we can use _df_columns to signal that X was a DataFrame during fit and raise an error if X is not a DataFrame during transform.

A good point, @thomasjpfan

Good catch, thanks! Duck-typing check ok?

if self._remainder[2] is not None and hasattr(self, '_df_columns') and hasattr(X, 'columns'):

Yes, though ideally we should also be checking the number of columns when X is not a DataFrame. Not sure if that needs to be in this pull request
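
The effect of the duck-typing guard can be sketched without pandas. `FrameLike` and `needs_order_check` are hypothetical stand-ins, assuming that the only thing the guard relies on is whether the input exposes a `columns` attribute.

```python
class FrameLike:
    """Hypothetical stand-in for a DataFrame: it only exposes `.columns`."""

    def __init__(self, columns):
        self.columns = list(columns)


def needs_order_check(has_remainder, fit_columns, X):
    """Mirror the three-part guard: remainder in use, column names
    recorded at fit, and the transform input has a `columns` attribute."""
    return has_remainder and fit_columns is not None and hasattr(X, "columns")


# A frame-like input goes through the ordering check.
print(needs_order_check(True, ["a", "b"], FrameLike(["a", "b"])))

# A plain nested list (standing in for an array) skips it instead of
# failing with an AttributeError on `X.columns`.
print(needs_order_check(True, ["a", "b"], [[1, 2], [3, 4]]))
```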

Interpreting the last suggestion loosely, I looked into also checking column order for structured numpy arrays (np arrays with named columns) in an analogous way as now for DataFrames and I found that this would change the PR a lot. ColumnTransformer currently explicitly does not allow named columns for non-DataFrames and raises an error if one tries. I now think that this was not the intended suggestion, but wanted to clearly state my (maybe misguided) interpretations. If this train of thought is relevant, we can continue this discussion in RFC #14251.

When reading the suggestion literally, i.e. only checking the number of columns for non-DataFrames -- this condition is only in the code originally to avoid an exception when referencing X.columns. If this check is a goal of the PR itself, I can change the conditionals (and it would become a somewhat separate check with a separate error message as _df_columns only exists for DataFrames combined with column names). I can gladly do this if it's found useful.

For now, I'll only commit the suggested code change from my previous comment (plus test assertion), but that does not mean that I wouldn't do the other one at all. I'll wait for a yay or nay on the numpy-num-of-cols-check. :)

PS to the commit: Note that when passing an np array to transform, an error will still be raised -- not here, but in the correct location (ColumnTransformer checks if named columns are used with non-DataFrames and raises a ValueError if this is the case).

We explicitly decided not to support struct arrays. The proposal here was to make sure we continue to allow fitting on DataFrame and transforming on numpy array, if that is currently supported.

The check for the number of columns in an array is because we provide that security in every other estimator, raising an error if the input has changed shape. Here it is also applicable but not currently done. It's not really necessary for this pull request

Thanks @jnothman! I interpret your last sentence as you leaving it for me to decide. :)

I added a general self.n_features_ check in analogy to the one here

scikit-learn/sklearn/ensemble/bagging.py

Line 682 in 7b8cbc8

if self.n_features_ != X.shape[1]:

but tweaked it to allow for added columns (as discussed above). I expanded the tests to be explicit about accepting more columns and rejecting fewer columns.
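
As a rough illustration of the difference to the strict `!=` comparison quoted above, the lenient variant can be sketched like this. The function name and the error wording are assumptions made for the example, not the text used in the PR.

```python
def check_n_features(n_features_fit, n_features_transform):
    """Lenient shape check: extra columns are accepted, missing ones are not."""
    if n_features_transform < n_features_fit:
        raise ValueError(
            "Input has {} features, but the transformer was fitted on {}."
            .format(n_features_transform, n_features_fit)
        )


check_n_features(3, 3)  # same number of columns: accepted
check_n_features(3, 4)  # one added column: accepted

try:
    check_n_features(3, 2)  # a column is missing: rejected
except ValueError as exc:
    print(exc)
```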

Referencing relevant PR #13603 as ColumnTransformer now also has n_features_

> but tweaked it to allow for added columns (as discussed above)

Things get a little hairy here. I feel like for numpy arrays we should not be that lenient, since they are arrays not columnar frames...

But I can accept this for the version 0.20-21 fix.

> Referencing relevant PR #13603 as ColumnTransformer now also has n_features_

Similarly, since I'm trying to include this in a patch release, make this _n_features to keep the patch API compatible. This also avoids making more deprecation work in #13603 or subsequent to it.

> > but tweaked it to allow for added columns (as discussed above)
>
> Things get a little hairy here. I feel like for numpy arrays we should not be that lenient, since they are arrays not columnar frames...
>
> But I can accept this for the version 0.20-21 fix.

Ok, if it is acceptable as it is for this fix, I'm leaving it unchanged (both arrays as well as DFs are checked leniently, accepting added columns, but not removed columns). That way I'm also confident that this PR doesn't break any relied-on behavior. Looking forward to other PRs like #14251 and #13603 for helping in making checks more obvious for first-time contributors like me (and users, of course). I found this PR to be quite the unexpected rabbit hole: every time I thought I understood how the pieces work together, there was still more to discover. :D I learned a lot, though. 👍 Hope that this PR is still a net win for the maintainers regarding time spent vs contribution, despite the communication overhead and my occasional chattiness.

> > Referencing relevant PR #13603 as ColumnTransformer now also has n_features_
>
> Similarly, since I'm trying to include this in a patch release, make this _n_features to keep the patch API compatible. This also avoids making more deprecation work in #13603 or subsequent to it.

I see, thanks! I renamed it to _n_features. This is the only change in the latest commit.

jnothman

Still not entirely comfortable with allowing added columns in a numpy array.

jnothman · 2019-07-11T10:36:17Z

sklearn/compose/_column_transformer.py

        cols = []
        for columns in self._columns:
            cols.extend(_get_column_indices(X, columns))
-        remaining_idx = sorted(list(set(range(n_columns)) - set(cols))) or None
+        remaining_idx = \


Please avoid \ for line continuation.

remaining_idx = (sorted(list(set(range(self._n_features)) - set(cols))) or None)

is one option
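
For readers unfamiliar with the trailing `or None`, here is a small standalone sketch of what that expression evaluates to. The function `remaining_indices` is a hypothetical wrapper; the inner `list(...)` from the original line is dropped because `sorted` already returns a list.

```python
def remaining_indices(n_features, claimed):
    """Positions not claimed by any transformer, or None if nothing is left."""
    # An empty list is falsy, so `or None` turns "no remainder" into None.
    return sorted(set(range(n_features)) - set(claimed)) or None


print(remaining_indices(4, [0, 2]))  # some columns are left over
print(remaining_indices(2, [0, 1]))  # every column is claimed
```

Wrapping the whole right-hand side in parentheses, as suggested, lets the expression span several lines without a backslash.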

schuderer · 2019-07-11T11:43:37Z

@jnothman Could you please make explicit whether you mean this as a request for me to change something (and if so, what exactly the change should be), or as an RFC to the other devs? I've been trying to adapt everything to satisfy the comments on this PR, but it seems that this has not been working out so well; I feel that I've not been that good at determining the correct requested changes from some of the comments.

jnothman · 2019-07-11T12:40:01Z

Fair request. It's more an RFC for other devs. I've given my approval either way.

glemaitre

Some style changes

sklearn/compose/_column_transformer.py

sklearn/compose/tests/test_column_transformer.py

adrinjalali · 2019-07-15T15:10:09Z

Question to @jnothman and @amueller :

it seems to me that the consensus in #14251 is to implement option (1) of #14251 (comment)

Should this PR do that instead? I'm not sure if it's a good idea to potentially include this PR, and then implement the strict policy proposed in #14251.

amueller · 2019-07-17T17:05:27Z

I kind of agree with @adrinjalali, but option (1) is backward incompatible. Maybe we should fix the bug and then deprecate the behavior?

glemaitre · 2019-07-18T07:27:31Z

> Maybe we should fix the bug and then deprecate the behavior?

This is more in line with what we usually do, ColumnTransformer no longer being experimental.
We raise the deprecation warning which will become an error.

I see @jnothman originally put it for 0.20.4, I am not really sure if the deprecation should start from 0.20 however. This seems weird.

glemaitre

In the meanwhile, I am fine with the changes that have been done.

adrinjalali · 2019-07-18T08:35:22Z

Fair enough, we can fix in 0.20.4, and deprecate in 0.21.

jnothman · 2019-07-18T23:59:13Z

Deprecate in 0.22 I think, @adrinjalali

jnothman · 2019-07-18T23:59:54Z

Or we can cut it how we want as experimental

adrinjalali · 2019-07-19T06:50:58Z

> Or we can cut it how we want as experimental

I would prefer this. Although last time we changed something in ColumnTransformer, we didn't really treat it as experimental.

jnothman · 2019-07-22T13:16:37Z

Are we accepting this for the patch releases?

adrinjalali · 2019-07-22T13:37:07Z

thanks @schuderer

amueller · 2019-07-22T13:57:36Z

@jnothman I would say so?

scikit-learn#14237)

* FIX Raise error on reordered columns in ColumnTransformer with remainder
* FIX Check for different length of X.columns to avoid exception
* FIX linter, line too long
* FIX import _check_key_type from its new location utils
* ENH Adjust doc, allow added columns
* Fix comment typo as suggested, remove non-essential exposition in doc
* Add PR 14237 to what's new
* Avoid AttributeError in favor of ValueError "column names only for DF"
* ENH Add check for n_features_ for array-likes and DataFrames
* Rename self.n_features to self._n_features
* Replaced backslash line continuation with parenthesis
* Style changes

schuderer added 3 commits July 2, 2019 18:55

FIX Raise error on reordered columns in ColumnTransformer with remainder

6cfc722

FIX Check for different length of X.columns to avoid exception

d06b27a

FIX linter, line too long

16cef75

jnothman added this to the 0.20.4 milestone Jul 2, 2019

Merge remote-tracking branch 'upstream/master' into coltrans_fix_rema…

4feb366

…ining_reordered_cols

amueller mentioned this pull request Jul 3, 2019

RFC ColumnTransformer input validation and requirements #14251

Closed

FIX import _check_key_type from its new location utils

6421cb1

jnothman reviewed Jul 5, 2019

schuderer added 3 commits July 5, 2019 15:09

Merge remote-tracking branch 'upstream/master' into coltrans_fix_rema…

8d2a18d

…ining_reordered_cols

ENH Adjust doc, allow added columns

d709be1

Merge remote-tracking branch 'upstream/master' into coltrans_fix_rema…

096b58d

…ining_reordered_cols

jnothman approved these changes Jul 7, 2019

schuderer added 3 commits July 7, 2019 16:46

Fix comment typo as suggested, remove non-essential exposition in doc

8bfc3ce

Merge remote-tracking branch 'upstream/master' into coltrans_fix_rema…

d4ae600

…ining_reordered_cols

Add PR 14237 to what's new

47b645d

jnothman approved these changes Jul 7, 2019

thomasjpfan reviewed Jul 7, 2019

schuderer added 4 commits July 9, 2019 17:18

Merge remote-tracking branch 'upstream/master' into coltrans_fix_rema…

39360de

…ining_reordered_cols

Avoid AttributeError in favor of ValueError "column names only for DF"

3b404d8

ENH Add check for n_features_ for array-likes and DataFrames

011a2a2

Rename self.n_features to self._n_features

c61143c

jnothman approved these changes Jul 11, 2019

jnothman mentioned this pull request Jul 11, 2019

ENH ColumnTransformer.get_feature_names() handles passthrough #14048

Merged

Replaced backslash line continuation with parenthesis

5f0b92b

Merge remote-tracking branch 'upstream/master' into coltrans_fix_rema…

d8e1fc0

…ining_reordered_cols

glemaitre requested changes Jul 12, 2019

schuderer added 2 commits July 13, 2019 09:05

Merge remote-tracking branch 'upstream/master' into coltrans_fix_rema…

e1434c2

…ining_reordered_cols

Style changes

6c28e4d

glemaitre reviewed Jul 18, 2019

glemaitre self-requested a review July 18, 2019 07:29

adrinjalali merged commit 9115ab0 into scikit-learn:master Jul 22, 2019

This was referenced Aug 2, 2019

[MRG] Add n_features_in_ attribute to BaseEstimator #13603

Closed

ColumnTransformer input feature name and count validation #14544

Merged

madhuracj mentioned this pull request Aug 25, 2020

FIX Enforce strict column name order/count in ColumnTransformer #18256

Merged
