MAINT test globally setting output via context manager by glemaitre · Pull Request #24932 · scikit-learn/scikit-learn · GitHub

Merged: 15 commits, Nov 24, 2022

Conversation

@glemaitre (Member) commented Nov 15, 2022:

closes #24931
closes #24923

Requires #24930 to be merged to test all estimators.

This PR uses config_context to globally set the output container of estimators.

TODO:

@glemaitre glemaitre marked this pull request as draft November 15, 2022 11:29
@glemaitre glemaitre added this to the 1.2 milestone Nov 15, 2022
@glemaitre (Member Author):
While it is quite trivial to fix the issue for our own transformers, we might start breaking, without any warning, third-party transformers that contain nested transformers. This could be quite problematic.

@glemaitre glemaitre marked this pull request as ready for review November 17, 2022 14:27
@@ -707,7 +722,7 @@ def fit_transform(self, X, y=None):
             Xt, estimator = self._impute_one_feature(
                 Xt,
                 mask_missing_values,
-                feat_idx,
+                int(feat_idx),
glemaitre (Member Author):
We need to understand why feat_idx is not an integer here.

Member:

feat_idx comes out of _get_ordered_idx, where a few different things are done depending on the imputation_order. Do you know if the issue happens in general, or only for a specific imputation_order?

glemaitre (Member Author):

I assume that it was the default one. I will take a closer look at this function.

glemaitre (Member Author):

OK, the reason for getting a numpy array was the following workaround:

if hasattr(key, "shape"):
    # Work-around for indexing with read-only key in pandas
    # FIXME: solved in pandas 0.25
    key = np.asarray(key)
    key = key if key.flags.writeable else key.copy()

We can remove it.

glemaitre (Member Author):

np.int64(10) will expose a shape attribute :)
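This is easy to check: a numpy scalar passes the hasattr(key, "shape") test even though it is not array-like in the intended sense, which is why the scalar check matters:

```python
import numpy as np

i = np.int64(10)
print(hasattr(i, "shape"))  # True: numpy scalars expose a 0-d shape
print(i.shape)              # ()
print(np.isscalar(i))       # True, which is what the extra check catches
```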

@thomasjpfan (Member) left a comment:

Thank you for the PR!

Comment on lines 746 to 769
Xt[~mask_missing_values] = X[~mask_missing_values]
for feat_idx, feat_mask in enumerate(mask_missing_values.T):
Member:

Does this introduce a performance regression when everything is a ndarray?

glemaitre (Member Author):

It looks like a slowdown of about 1.5x.
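For context, here is a rough micro-benchmark sketch of the two assignment patterns being compared (the shapes and missing rate are hypothetical; this is not the benchmark actually run in the PR):

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 50))
mask = rng.random(X.shape) < 0.1  # ~10% "missing" entries

def assign_vectorized(Xt, X, mask):
    # Single boolean-mask assignment over the whole 2D array.
    Xt[~mask] = X[~mask]

def assign_per_column(Xt, X, mask):
    # Column-by-column assignment, as in the looped version.
    for j, col_mask in enumerate(mask.T):
        Xt[~col_mask, j] = X[~col_mask, j]

Xt = np.zeros_like(X)
print("vectorized:", timeit.timeit(lambda: assign_vectorized(Xt, X, mask), number=20))
print("per column:", timeit.timeit(lambda: assign_per_column(Xt, X, mask), number=20))
```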

Member:

Could we add a disjunction based on whether mask_missing_values is a np.array or a pd.DataFrame?

Member:

Let's write a small helper function because the same pattern also appears in transform.

glemaitre (Member Author):

The problem is not mask_missing_values being an array or a dataframe: it is always a 2D numpy array here.

The problem is that _safe_indexing is not written to handle 2D arrays. Maybe we could make it possible to pass axis=None and, in that case, expect a 2D array of the same shape as the original one. It might only work for NumPy arrays though.

Member:

I made a distinction for whether Xt is a dataframe or not. Also, pandas has a method for this specific use case. I put the logic in a function named _assign_where.
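A minimal sketch of such a helper (hypothetical; the actual _assign_where in the PR may differ in its details), using DataFrame.mask for the pandas path:

```python
import numpy as np
import pandas as pd

def assign_where(X1, X2, cond):
    """Assign X2's values into X1 where cond is True, in place.

    Handles both pandas DataFrames and numpy ndarrays.
    """
    if hasattr(X1, "mask"):
        # DataFrame.mask replaces entries where cond is True with the
        # corresponding values from X2.
        X1.mask(cond=cond, other=X2, inplace=True)
    else:
        X1[cond] = X2[cond]
```

The hasattr check dispatches on the presence of the pandas mask method, so plain ndarrays fall through to boolean-mask assignment.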

@jjerphan (Member) left a comment:

LGTM. Thank you, @glemaitre.

I just have a few comments and remarks.


@@ -186,7 +186,9 @@ def _array_indexing(array, key, key_dtype, axis):

 def _pandas_indexing(X, key, key_dtype, axis):
     """Index a pandas dataframe or a series."""
-    if hasattr(key, "shape"):
+    if hasattr(key, "shape") and not np.isscalar(key):
Member:

Could we use _is_arraylike_not_scalar instead?

def _is_arraylike(x):
    """Returns whether the input is array-like."""
    return hasattr(x, "__len__") or hasattr(x, "shape") or hasattr(x, "__array__")

def _is_arraylike_not_scalar(array):
    """Return True if array is array-like and not a scalar."""
    return _is_arraylike(array) and not np.isscalar(array)

Member:

At first we could not, because a tuple is array-like and there is different logic for the tuple case right after. But it turns out that we can get rid of the pandas < 0.25 workaround, and there is no reason to make the key a list when it is a tuple and an array when it is not: we can always make it an array.

So in the end I think we can use _is_arraylike_not_scalar and simplify the whole logic. Done in a642ec9

@jeremiedbb (Member) left a comment:

@glemaitre I directly pushed the requested changes and simplified the logic of some parts. LGTM

@jeremiedbb (Member):

I think all the comments have been addressed. The performance regression has been avoided. Let's roll.
