MAINT test globally setting output via context manager #24932
Conversation
While it is quite trivial to fix the issue for our transformer, we might start to break third-party transformers that have nested transformer(s) without any warning. This could be quite problematic.
sklearn/impute/_iterative.py (outdated)

```diff
@@ -707,7 +722,7 @@ def fit_transform(self, X, y=None):
     Xt, estimator = self._impute_one_feature(
         Xt,
         mask_missing_values,
-        feat_idx,
+        int(feat_idx),
```
We need to understand why `feat_idx` is not an integer here.
The `feat_idx` is coming out of `_get_ordered_idx`, where a few different things are done based on the `imputation_order`. Do you know if the issue happened in general, or only for a specific `imputation_order`?
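For context, index arrays produced by NumPy ordering routines such as `np.argsort` yield NumPy integer scalars when iterated, not built-in Python ints — a plausible explanation for why `feat_idx` is not an `int` here. A minimal illustration (not taken from `_get_ordered_idx` itself):

```python
import numpy as np

# Indexing into (or iterating over) an integer index array yields NumPy
# integer scalars such as np.int64, not built-in Python ints.
ordered_idx = np.argsort(np.array([0.3, 0.1, 0.2]))
feat_idx = ordered_idx[0]

print(feat_idx == 1)                      # True: 0.1 is the smallest value
print(isinstance(feat_idx, int))          # False: not a Python int
print(isinstance(feat_idx, np.integer))   # True: a NumPy integer scalar
```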
I assume that it was the default one. I will have a closer look at this function.
OK, the reason for getting a NumPy array was the following (scikit-learn/sklearn/utils/__init__.py, lines 189 to 193 at 75db1bc):

```python
if hasattr(key, "shape"):
    # Work-around for indexing with read-only key in pandas
    # FIXME: solved in pandas 0.25
    key = np.asarray(key)
    key = key if key.flags.writeable else key.copy()
```
We can remove it.
`np.int64(10)` will expose a `shape` attribute :)
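A quick check (illustrative only) confirms that NumPy scalars pass a bare `hasattr(key, "shape")` test, which is why the extra `np.isscalar` guard matters:

```python
import numpy as np

key = np.int64(10)
print(key.shape)              # (): NumPy scalars expose an empty shape
print(hasattr(key, "shape"))  # True, so a bare hasattr check misfires
print(np.isscalar(key))       # True, which the added guard relies on
```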
Thank you for the PR!
sklearn/impute/_iterative.py (outdated)

```diff
-Xt[~mask_missing_values] = X[~mask_missing_values]
+for feat_idx, feat_mask in enumerate(mask_missing_values.T):
```
Does this introduce a performance regression when everything is an `ndarray`?
It looks like a slowdown of about 1.5x.
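To make the comparison concrete, here is a small sketch (hypothetical shapes and data, not the benchmark actually run) showing that the column-wise loop computes the same result as the single vectorized assignment; the ~1.5x figure refers to the loop's overhead on plain ndarrays:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
mask_missing_values = rng.random(X.shape) < 0.1

# Single vectorized assignment (ndarray-only fast path):
Xt_vec = np.zeros_like(X)
Xt_vec[~mask_missing_values] = X[~mask_missing_values]

# Column-wise loop (also works when Xt is a DataFrame, but slower on ndarrays):
Xt_loop = np.zeros_like(X)
for feat_idx, feat_mask in enumerate(mask_missing_values.T):
    Xt_loop[~feat_mask, feat_idx] = X[~feat_mask, feat_idx]

# Both strategies produce identical results.
assert np.array_equal(Xt_vec, Xt_loop)
```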
Could we add a disjunction based on whether `mask_missing_values` is a `np.array` or a `pd.DataFrame`?
Let's write a small helper function, because the same pattern also appears in `transform`.
The problem is not `mask_missing_values` being an array or a dataframe: it is always a 2D NumPy array here. The problem is that `_safe_indexing` is not written to handle 2D arrays. Maybe we could make it possible to pass `axis=None`, in which case we would expect a 2D array of the same shape as the original one. It might only work for NumPy arrays though.
I made a distinction based on whether `Xt` is a dataframe or not. Also, pandas has a method for this specific use case. I put the logic in a function named `_assign_where`.
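A sketch of what such a helper could look like — the exact signature and behavior of scikit-learn's `_assign_where` may differ; the pandas method alluded to is presumably `DataFrame.mask`, which replaces values where a condition holds:

```python
import numpy as np
import pandas as pd

def assign_where(X1, X2, cond):
    """Assign X2's values to X1 in place where `cond` is True.

    Sketch only: dispatches on whether X1 is a DataFrame or an ndarray.
    """
    if hasattr(X1, "mask"):  # pandas DataFrame path: use the dedicated method
        X1.mask(cond=cond, other=X2, inplace=True)
    else:  # plain ndarray path: boolean indexing
        X1[cond] = X2[cond]

cond = np.array([[True, False], [False, True]])

df1 = pd.DataFrame(np.zeros((2, 2)))
df2 = pd.DataFrame(np.ones((2, 2)))
assign_where(df1, df2, cond)   # df1 now holds 1.0 where cond is True

a1 = np.zeros((2, 2))
assign_where(a1, np.ones((2, 2)), cond)  # same result for ndarrays
```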
LGTM. Thank you, @glemaitre.
I just have a few comments and remarks.
sklearn/utils/__init__.py (outdated)

```diff
@@ -186,7 +186,9 @@ def _array_indexing(array, key, key_dtype, axis):

 def _pandas_indexing(X, key, key_dtype, axis):
     """Index a pandas dataframe or a series."""
-    if hasattr(key, "shape"):
+    if hasattr(key, "shape") and not np.isscalar(key):
```
Could we use `_is_arraylike_not_scalar` instead?
scikit-learn/sklearn/utils/validation.py, lines 257 to 264 at 74ddf01:

```python
def _is_arraylike(x):
    """Returns whether the input is array-like."""
    return hasattr(x, "__len__") or hasattr(x, "shape") or hasattr(x, "__array__")


def _is_arraylike_not_scalar(array):
    """Return True if array is array-like and not a scalar"""
    return _is_arraylike(array) and not np.isscalar(array)
```
At first we could not, because a tuple is array-like and there is different logic for the tuple case right after. But it turns out that we can get rid of the pandas < 0.25 workaround, and there is no reason to make the key a list when it is a tuple and an array when it is not: we can always make it an array.

So in the end I think we can use `_is_arraylike_not_scalar` and simplify the whole logic. Done in a642ec9.
@glemaitre I directly pushed the requested changes and simplified the logic of some parts. LGTM.

I think all the comments have been addressed. The performance regression has been avoided. Let's roll.
closes #24931
closes #24923

Requires a merge of #24930 to test all estimators.

This PR uses the `config_context` to set globally the output of estimators.

TODO:
- `IterativeImputer`: `fit` and `transform`
- `transform`