-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG] FIX ColumnTransformer: raise error on reordered columns with remainder #14237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] FIX ColumnTransformer: raise error on reordered columns with remainder #14237
Conversation
I have trouble understanding what exactly is behind the failed CI tests. I apparently have worsened test coverage in my fix -- although the regression test I added enters the conditional branches I added. In addition to that some azure-pipelines tests fail after 15 minutes with Cmd.exe/Bash exited with code '1', and I can't find any pointer in the logs about what exactly went wrong (except that some workers failed to return coverage data). Appreciate any tips/hints on where to look and how to fix this. |
Merge with master should resolve the CI issue. |
…ining_reordered_cols
There's a stream of consciousness on the bigger issues I see in this realm here lol: I think we should consider all the possible cases and then think about the behavior we want. I think it's very weird that reordering columns is fine in some cases but not in others, that's not very transparent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice, thanks. Just needs tweaks
@@ -80,6 +81,9 @@ class ColumnTransformer(_BaseComposition, TransformerMixin): | |||
By setting ``remainder`` to be an estimator, the remaining | |||
non-specified columns will use the ``remainder`` estimator. The | |||
estimator must support :term:`fit` and :term:`transform`. | |||
If columns in `transformers` are provided as strings, the ordering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about "Note that using this feature requires that the DataFrame columns input at fit and transform have identical order."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much clearer, will replace.
@@ -135,6 +139,10 @@ class ColumnTransformer(_BaseComposition, TransformerMixin): | |||
dropped from the resulting transformed feature matrix, unless specified | |||
in the `passthrough` keyword. Those columns specified with `passthrough` | |||
are added at the right to the output of the transformers. | |||
Although this, as well as the option to specify column names, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this comment pertain to passthrough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intended this note to be a general (not passthrough-specific) heads-up to not draw incorrect conclusions from the ability to reference columns by name. If you are asking whether column reordering also causes problems with the remainder='passthrough' option, then this to my understanding is the case, too. Does this answer your question? If the text I added is superfluous or unclear, I can remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave this text out for now, if that's alright. I think it could be clearer, but would rather fix the issue at hand than worry about it.
# No column reordering allowed for named cols combined with remainder | ||
if self._remainder[2] is not None and hasattr(self, '_df_columns'): | ||
if len(X.columns) != len(self._df_columns) \ | ||
or any(X.columns != self._df_columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also allow for appended columns to maintain backwards compatibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem with that from my side. The current check is stricter than really necessary for this fix anyway. I will add a commit to change it and reflect it in the unit tests.
BTW: who commonly clicks "resolve conversation", the reviewer or the PR author?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't regularly use the "resolve conversation" feature
…ining_reordered_cols
…ining_reordered_cols
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise lgtm
Please add an entry to the change log for 0.21.3 at doc/whats_new/v0.21.rst
. Like the other entries there, please reference this pull request with :pr:
and credit yourself (and other contributors if applicable) with :user:
'Column ordering must be equal', | ||
tf.transform, X_trans_df) | ||
|
||
# No for added columns if ordering is equal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# No for added columns if ordering is equal | |
# No error for added columns if ordering is identical |
Thanks for your suggestions and help. Feel free to squash my commits. |
As a rule, we squash when merging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Another reviewer to help towards release?
# No column reordering allowed for named cols combined with remainder | ||
if self._remainder[2] is not None and hasattr(self, '_df_columns'): | ||
n_cols_fit = len(self._df_columns) | ||
n_cols_transform = len(X.columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will raise an error if X
is a numpy array. Check if X
is a pandas array before heading into this path?
In future PR, we can use _df_columns
to signal that X
was a dataframe during fit
and raise an error, if during transform
X
is not a dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good point, @thomasjpfan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks! Duck-typing check ok?
if self._remainder[2] is not None and hasattr(self, '_df_columns') and hasattr(X, 'columns'):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, though ideally we should also be checking the number of columns when X is not a DataFrame. Not sure if that needs to be in this pull request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interpreting the last suggestion loosely, I looked into also checking column order for structured numpy arrays (np arrays with named columns) in an analogous way as now for DataFrames and I found that this would change the PR a lot. ColumnTransformer currently explicitly does not allow named columns for non-DataFrames and raises an error if one tries. I now think that this was not the intended suggestion, but wanted to clearly state my (maybe misguided) interpretations. If this train of thought is relevant, we can continue this discussion in RFC #14251.
When reading the suggestion literally, i.e. only checking the number of columns for non-DataFrames -- this condition is only in the code originally to avoid an exception when referencing X.columns. If this check is a goal of the PR itself, I can change the conditionals (and it would become a somewhat separate check with a separate error message as _df_columns
only exists for DataFrames combined with column names). I can gladly do this if it's found useful.
For now, I'll only commit the suggested code change from my previous comment (plus test assertion), but that does not mean that I wouldn't do the other one at all. I'll wait for a yay or nay on the numpy-num-of-cols-check. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS to the commit: Note that when passing an np array to transform, an error will still be raised -- not here, but in the correct location (ColumnTransformer checks if named columns are used with non-DataFrames and raises a ValueError if this is not the case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We explicitly decided not to support struct arrays. The proposal here was to make sure we continue to allow fitting on DataFrame and transforming on numpy array, if that is currently supported.
The check for the number of columns in an array is because we provide that security in every other estimator, raising an error if the input has changed shape. Here it is also applicable but not currently done. It's not really necessary for this pull request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jnothman! I interpret your last sentence as you leaving it for me to decide. :)
I added a general self.n_features_
check in analogy to the one here
scikit-learn/sklearn/ensemble/bagging.py
Line 682 in 7b8cbc8
if self.n_features_ != X.shape[1]: |
Referencing relevant PR #13603 as ColumnTransformer
now also has n_features_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but tweaked it to allow for added columns (as discussed above)
Things get a little hairy here. I feel like for numpy arrays we should not be that lenient, since they are arrays not columnar frames...
But I can accept this for the version 0.20-21 fix.
Referencing relevant PR #13603 as ColumnTransformer now also has n_features_
Similarly, since I'm trying to include this in a patch release, make this _n_features
to keep the patch API compatible. This also avoids making more deprecation work in #13603 or subsequent to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but tweaked it to allow for added columns (as discussed above)
Things get a little hairy here. I feel like for numpy arrays we should not be that lenient, since they are arrays not columnar frames...
But I can accept this for the version 0.20-21 fix.
Ok, if it is acceptable as it is for this fix, I'm leaving it unchanged (both arrays as well as DFs are checked leniently, accepting added columns, but not removed columns). That way I'm also confident that this PR doesn't break any relied-on behavior. Looking forward to other PRs like #14251 and #13603 for helping in making checks more obvious for first-time contributors like me (and users, of course). I found this PR to be quite the unexpected rabbit hole: every time I thought I understood how the pieces work together, there was still more to discover. :D I learned a lot, though. 👍 Hope than this PR is still a net win for the maintainers regarding time spent vs contribution, despite the communication overhead and my occasional chattiness.
Referencing relevant PR #13603 as ColumnTransformer now also has n_features_
Similarly, since I'm trying to include this in a patch release, make this
_n_features
to keep the patch API compatible. This also avoids making more deprecation work in #13603 or subsequent to it.
I see, thanks! I renamed it to _n_features
. This is the only change in the latest commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not entirely comfortable with allowing added columns in a numpy array.
cols = [] | ||
for columns in self._columns: | ||
cols.extend(_get_column_indices(X, columns)) | ||
remaining_idx = sorted(list(set(range(n_columns)) - set(cols))) or None | ||
remaining_idx = \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid \
for line continuation.
remaining_idx = (sorted(list(set(range(self._n_features)) -
set(cols)))
or None)
is one option
@jnothman Could you please make explicit whether you mean this as a request for me to change something (and if so, what should be the change exactly), or a RFC to the other devs? I've been trying to adapt everything to satisfy the comments to this PR, but it seems that this has not been working out so well, I feel that I've not been that good at determining the correct requested changes from some of the comments. |
Fair request. It's more an RFC for other devs. I've given my approval
either way.
|
…ining_reordered_cols
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some style changes
…ining_reordered_cols
Question to @jnothman and @amueller : it seems to me that the consensus in #14251 is to implement option (1) of #14251 (comment) Should this PR do that instead? I'm not sure if it's a good idea to potentially include this PR, and then implement the strict policy proposed in #14251. |
I kind of agree with @adrinjalali, but option (1) is backward incompatible. Maybe we should fix the bug and then deprecate the behavior? |
This is more in line with what we do usually, I see @jnothman originally put it for 0.20.4, I am not really sure if the deprecation should start from 0.20 however. This seems weird. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the meanwhile, I am fine with the changes that have been done.
Fair enough, we can fix in 0.20.4, and deprecate in 0.21. |
Deprecate in 0.22 I think, @adrinjalali |
Or we can cut it how we want as experimental |
I would prefer this. Although last time we changed something in ColumnTransformer, we didn't really treat it as experimental. |
Are we accepting this for the patch releases? |
thanks @schuderer |
@jnothman I would say so? |
scikit-learn#14237) * FIX Raise error on reordered columns in ColumnTransformer with remainder * FIX Check for different length of X.columns to avoid exception * FIX linter, line too long * FIX import _check_key_type from its new location utils * ENH Adjust doc, allow added columns * Fix comment typo as suggested, remove non-essential exposition in doc * Add PR 14237 to what's new * Avoid AttributeError in favor of ValueError "column names only for DF" * ENH Add check for n_features_ for array-likes and DataFrames * Rename self.n_features to self._n_features * Replaced backslash line continuation with parenthesis * Style changes
scikit-learn#14237) * FIX Raise error on reordered columns in ColumnTransformer with remainder * FIX Check for different length of X.columns to avoid exception * FIX linter, line too long * FIX import _check_key_type from its new location utils * ENH Adjust doc, allow added columns * Fix comment typo as suggested, remove non-essential exposition in doc * Add PR 14237 to what's new * Avoid AttributeError in favor of ValueError "column names only for DF" * ENH Add check for n_features_ for array-likes and DataFrames * Rename self.n_features to self._n_features * Replaced backslash line continuation with parenthesis * Style changes
scikit-learn#14237) * FIX Raise error on reordered columns in ColumnTransformer with remainder * FIX Check for different length of X.columns to avoid exception * FIX linter, line too long * FIX import _check_key_type from its new location utils * ENH Adjust doc, allow added columns * Fix comment typo as suggested, remove non-essential exposition in doc * Add PR 14237 to what's new * Avoid AttributeError in favor of ValueError "column names only for DF" * ENH Add check for n_features_ for array-likes and DataFrames * Rename self.n_features to self._n_features * Replaced backslash line continuation with parenthesis * Style changes
Reference Issues/PRs
Fixes #14223.
What does this implement/fix? Explain your changes.
ColumnTransformer silently passes the wrong columns along as
remainder
when all the following conditions are met:remainder
option, and usingfit
andtransform
Changes:
remainder
: stated that named columns, remainder and changing column orde 8000 r don't mix.Any other comments?
Although the conditional
if
inside anotherif
looks clumsy, it is there for clarity to distinguish the relevance of the check from the check itself.@amueller I can't say I am completely satisfied with how I put the columns check squarely in the middle of the
transform
method. I looked for a good alternative spot for a long time, but one existing function did not know self, the other did not know X, and I did not want to do any restructuring of existing code just yet. Your opinion on this is appreciated!(This being my first contribution to a major open source project, I appreciate tips on what I should do differently or should have done better.)