8000 [MRG] FIX ColumnTransformer: raise error on reordered columns with remainder by schuderer · Pull Request #14237 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG] FIX ColumnTransformer: raise error on reordered columns with remainder #14237

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

schuderer
Copy link
Contributor
@schuderer schuderer commented Jul 2, 2019

Reference Issues/PRs

Fixes #14223.

What does this implement/fix? Explain your changes.

ColumnTransformer silently passes the wrong columns along as remainder when all the following conditions are met:

  • specifying columns by name (strings),
  • using the remainder option, and using
  • DataFrames where column ordering differs between fit and transform

Changes:

  • Raise a ValueError if column ordering changes while using remainder combined with named columns
  • Added a regression test
  • Class doc, param remainder: stated that named columns, remainder and changing column orde 8000 r don't mix.
  • Class doc, 'Note' section: added text that named columns don't imply that consistent column order does not matter.

Any other comments?

Although the conditional if inside another if looks clumsy, it is there for clarity to distinguish the relevance of the check from the check itself.

@amueller I can't say I am completely satisfied with how I put the columns check squarely in the middle of the transform method. I looked for a good alternative spot for a long time, but one existing function did not know self, the other did not know X, and I did not want to do any restructuring of existing code just yet. Your opinion on this is appreciated!

(This being my first contribution to a major open source project, I appreciate tips on what I should do differently or should have done better.)

@jnothman jnothman added this to the 0.20.4 milestone Jul 2, 2019
@schuderer
Copy link
Contributor Author
schuderer commented Jul 3, 2019

I have trouble understanding what exactly is behind the failed CI tests. I apparently have worsened test coverage in my fix -- although the regression test I added enters the conditional branches I added. In addition to that some azure-pipelines tests fail after 15 minutes with Cmd.exe/Bash exited with code '1', and I can't find any pointer in the logs about what exactly went wrong (except that some workers failed to return coverage data).

Appreciate any tips/hints on where to look and how to fix this.

@thomasjpfan
Copy link
Member

Merge with master should resolve the CI issue.

@amueller
Copy link
Member
amueller commented Jul 3, 2019

There's a stream of consciousness on the bigger issues I see in this realm here lol:
#14251

I think we should consider all the possible cases and then think about the behavior we want. I think it's very weird that reordering columns is fine in some cases but not in others, that's not very transparent.

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice, thanks. Just needs tweaks

@@ -80,6 +81,9 @@ class ColumnTransformer(_BaseComposition, TransformerMixin):
By setting ``remainder`` to be an estimator, the remaining
non-specified columns will use the ``remainder`` estimator. The
estimator must support :term:`fit` and :term:`transform`.
If columns in `transformers` are provided as strings, the ordering
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "Note that using this feature requires that the DataFrame columns input at fit and transform have identical order."

Copy link
Contributor Author
@schuderer schuderer Jul 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much clearer, will replace.

@@ -135,6 +139,10 @@ class ColumnTransformer(_BaseComposition, TransformerMixin):
dropped from the resulting transformed feature matrix, unless specified
in the `passthrough` keyword. Those columns specified with `passthrough`
are added at the right to the output of the transformers.
Although this, as well as the option to specify column names,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this comment pertain to passthrough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intended this note to be a general (not passthrough-specific) heads-up to not draw incorrect conclusions from the ability to reference columns by name. If you are asking whether column reordering also causes problems with the remainder='passthrough' option, then this to my understanding is the case, too. Does this answer your question? If the text I added is superfluous or unclear, I can remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave this text out for now, if that's alright. I think it could be clearer, but would rather fix the issue at hand than worry about it.

# No column reordering allowed for named cols combined with remainder
if self._remainder[2] is not None and hasattr(self, '_df_columns'):
if len(X.columns) != len(self._df_columns) \
or any(X.columns != self._df_columns):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also allow for appended columns to maintain backwards compatibility?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem with that from my side. The current check is stricter than really necessary for this fix anyway. I will add a commit to change it and reflect it in the unit tests.
BTW: who commonly clicks "resolve conversation", the reviewer or the PR author?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't regularly use the "resolve conversation" feature

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise lgtm

Please add an entry to the change log for 0.21.3 at doc/whats_new/v0.21.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

'Column ordering must be equal',
tf.transform, X_trans_df)

# No for added columns if ordering is equal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# No for added columns if ordering is equal
# No error for added columns if ordering is identical

@schuderer
Copy link
Contributor Author

Thanks for your suggestions and help. Feel free to squash my commits.

@jnothman
Copy link
Member
jnothman commented Jul 7, 2019

As a rule, we squash when merging

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Another reviewer to help towards release?

# No column reordering allowed for named cols combined with remainder
if self._remainder[2] is not None and hasattr(self, '_df_columns'):
n_cols_fit = len(self._df_columns)
n_cols_transform = len(X.columns)
Copy link
Member
@thomasjpfan thomasjpfan Jul 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will raise an error if X is a numpy array. Check if X is a pandas array before heading into this path?

In future PR, we can use _df_columns to signal that X was a dataframe during fit and raise an error, if during transform X is not a dataframe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good point, @thomasjpfan

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks! Duck-typing check ok?

if self._remainder[2] is not None and hasattr(self, '_df_columns') and hasattr(X, 'columns'):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, though ideally we should also be checking the number of columns when X is not a DataFrame. Not sure if that needs to be in this pull request

Copy link
Contributor Author
@schuderer schuderer Jul 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interpreting the last suggestion loosely, I looked into also checking column order for structured numpy arrays (np arrays with named columns) in an analogous way as now for DataFrames and I found that this would change the PR a lot. ColumnTransformer currently explicitly does not allow named columns for non-DataFrames and raises an error if one tries. I now think that this was not the intended suggestion, but wanted to clearly state my (maybe misguided) interpretations. If this train of thought is relevant, we can continue this discussion in RFC #14251.

When reading the suggestion literally, i.e. only checking the number of columns for non-DataFrames -- this condition is only in the code originally to avoid an exception when referencing X.columns. If this check is a goal of the PR itself, I can change the conditionals (and it would become a somewhat separate check with a separate error message as _df_columns only exists for DataFrames combined with column names). I can gladly do this if it's found useful.

For now, I'll only commit the suggested code change from my previous comment (plus test assertion), but that does not mean that I wouldn't do the other one at all. I'll wait for a yay or nay on the numpy-num-of-cols-check. :)

Copy link
Contributor Author
@schuderer schuderer Jul 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS to the commit: Note that when passing an np array to transform, an error will still be raised -- not here, but in the correct location (ColumnTransformer checks if named columns are used with non-DataFrames and raises a ValueError if this is not the case).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We explicitly decided not to support struct arrays. The proposal here was to make sure we continue to allow fitting on DataFrame and transforming on numpy array, if that is currently supported.

The check for the number of columns in an array is because we provide that security in every other estimator, raising an error if the input has changed shape. Here it is also applicable but not currently done. It's not really necessary for this pull request

Copy link
Contributor Author
@schuderer schuderer Jul 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jnothman! I interpret your last sentence as you leaving it for me to decide. :)

I added a general self.n_features_ check in analogy to the one here

if self.n_features_ != X.shape[1]:
but tweaked it to allow for added columns (as discussed above). I expanded the tests to be explicit about accepting more columns and rejecting fewer columns.

Referencing relevant PR #13603 as ColumnTransformer now also has n_features_

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but tweaked it to allow for added columns (as discussed above)

Things get a little hairy here. I feel like for numpy arrays we should not be that lenient, since they are arrays not columnar frames...

But I can accept this for the version 0.20-21 fix.

Referencing relevant PR #13603 as ColumnTransformer now also has n_features_

Similarly, since I'm trying to include this in a patch release, make this _n_features to keep the patch API compatible. This also avoids making more deprecation work in #13603 or subsequent to it.

Copy link
Contributor Author
@schuderer schuderer Jul 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but tweaked it to allow for added columns (as discussed above)

Things get a little hairy here. I feel like for numpy arrays we should not be that lenient, since they are arrays not columnar frames...

But I can accept this for the version 0.20-21 fix.

Ok, if it is acceptable as it is for this fix, I'm leaving it unchanged (both arrays as well as DFs are checked leniently, accepting added columns, but not removed columns). That way I'm also confident that this PR doesn't break any relied-on behavior. Looking forward to other PRs like #14251 and #13603 for helping in making checks more obvious for first-time contributors like me (and users, of course). I found this PR to be quite the unexpected rabbit hole: every time I thought I understood how the pieces work together, there was still more to discover. :D I learned a lot, though. 👍 Hope than this PR is still a net win for the maintainers regarding time spent vs contribution, despite the communication overhead and my occasional chattiness.

Referencing relevant PR #13603 as ColumnTransformer now also has n_features_

Similarly, since I'm trying to include this in a patch release, make this _n_features to keep the patch API compatible. This also avoids making more deprecation work in #13603 or subsequent to it.

I see, thanks! I renamed it to _n_features. This is the only change in the latest commit.

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not entirely comfortable with allowing added columns in a numpy array.

cols = []
for columns in self._columns:
cols.extend(_get_column_indices(X, columns))
remaining_idx = sorted(list(set(range(n_columns)) - set(cols))) or None
remaining_idx = \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid \ for line continuation.

        remaining_idx = (sorted(list(set(range(self._n_features)) -
                                     set(cols)))
                         or None)

is one option

@schuderer
Copy link
Contributor Author
schuderer commented Jul 11, 2019

@jnothman Could you please make explicit whether you mean this as a request for me to change something (and if so, what should be the change exactly), or a RFC to the other devs? I've been trying to adapt everything to satisfy the comments to this PR, but it seems that this has not been working out so well, I feel that I've not been that good at determining the correct requested changes from some of the comments.

@jnothman
Copy link
Member
jnothman commented Jul 11, 2019 via email

Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some style changes

@adrinjalali
Copy link
Member

Question to @jnothman and @amueller :

it seems to me that the consensus in #14251 is to implement option (1) of #14251 (comment)

Should this PR do that instead? I'm not sure if it's a good idea to potentially include this PR, and then implement the strict policy proposed in #14251.

@amueller
Copy link
Member

I kind of agree with @adrinjalali, but option (1) is backward incompatible. Maybe we should fix the bug and then deprecate the behavior?

@glemaitre
Copy link
Member

Maybe we should fix the bug and then deprecate the behavior?

This is more in line with what we do usually, ColumnTransformer not being any more experimental.
We raise the deprecation warning which will become an error.

I see @jnothman originally put it for 0.20.4, I am not really sure if the deprecation should start from 0.20 however. This seems weird.

Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the meanwhile, I am fine with the changes that have been done.

@glemaitre glemaitre self-requested a review July 18, 2019 07:29
@adrinjalali
Copy link
Member

Fair enough, we can fix in 0.20.4, and deprecate in 0.21.

@jnothman
Copy link
Member

Deprecate in 0.22 I think, @adrinjalali

@jnothman
Copy link
Member

Or we can cut it how we want as experimental

@adrinjalali
Copy link
Member

Or we can cut it how we want as experimental

I would prefer this. Although last time we changed something in ColumnTransformer, we didn't really treat it as experimental.

@jnothman
Copy link
Member

Are we accepting this for the patch releases?

@adrinjalali adrinjalali merged commit 9115ab0 into scikit-learn:master Jul 22, 2019
@adrinjalali
Copy link
Member

thanks @schuderer

@amueller
Copy link
Member

@jnothman I would say so?

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Jul 23, 2019
scikit-learn#14237)

* FIX Raise error on reordered columns in ColumnTransformer with remainder

* FIX Check for different length of X.columns to avoid exception

* FIX linter, line too long

* FIX import _check_key_type from its new location utils

* ENH Adjust doc, allow added columns

* Fix comment typo as suggested, remove non-essential exposition in doc

* Add PR 14237 to what's new

* Avoid AttributeError in favor of ValueError "column names only for DF"

* ENH Add check for n_features_ for array-likes and DataFrames

* Rename self.n_features to self._n_features

* Replaced backslash line continuation with parenthesis

* Style changes
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Jul 23, 2019
scikit-learn#14237)

* FIX Raise error on reordered columns in ColumnTransformer with remainder

* FIX Check for different length of X.columns to avoid exception

* FIX linter, line too long

* FIX import _check_key_type from its new location utils

* ENH Adjust doc, allow added columns

* Fix comment typo as suggested, remove non-essential exposition in doc

* Add PR 14237 to what's new

* Avoid AttributeError in favor of ValueError "column names only for DF"

* ENH Add check for n_features_ for array-likes and DataFrames

* Rename self.n_features to self._n_features

* Replaced backslash line continuation with parenthesis

* Style changes
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Jul 23, 2019
scikit-learn#14237)

* FIX Raise error on reordered columns in ColumnTransformer with remainder

* FIX Check for different length of X.columns to avoid exception

* FIX linter, line too long

* FIX import _check_key_type from its new location utils

* ENH Adjust doc, allow added columns

* Fix comment typo as suggested, remove non-essential exposition in doc

* Add PR 14237 to what's new

* Avoid AttributeError in favor of ValueError "column names only for DF"

* ENH Add check for n_features_ for array-likes and DataFrames

* Rename self.n_features to self._n_features

* Replaced backslash line continuation with parenthesis

* Style changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Named col indexing fails with ColumnTransformer remainder on changing DataFrame column ordering
6 participants
0