-
-
Notifications
You must be signed in to change notification settings - Fork 26.4k
Add array API support for Nystroem approximation #29661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
While working on this I kept running into this error @OmarManzoor do you happen to have any pointer on how to get around this when you worked on #29433 ? Thanks! |
|
There is something fishy that happens right after sklearn/metrics/tests/test_common.py:1979: in check_array_api_metric_pairwise
check_array_api_metric(
X_np = array([[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6]], dtype=float32)
Y_np = array([[0.2, 0.3, 0.4],
[0.5, 0.6, 0.7]], dtype=float32)
array_namespace = 'torch'
device = 'cpu'
dtype_name = 'float32'
metric = <function rbf_kernel at 0x125b99b20>
metric_kwargs = {}
sklearn/metrics/tests/test_common.py:1773: in check_array_api_metric
metric_xp = metric(a_xp, b_xp, **metric_kwargs)
a_np = array([[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6]], dtype=float32)
a_xp = tensor([0.1000, 0.2000, 0.3000, 0.4000, 0.5000, 0.6000])
array_namespace = 'torch'
b_np = array([[0.2, 0.3, 0.4],
[0.5, 0.6, 0.7]], dtype=float32)
b_xp = tensor([0.2000, 0.3000, 0.4000, 0.5000, 0.6000, 0.7000])
|
|
Hi @EmilyXinyi I checked out your branch locally and ran the Also @ogrisel I think it might be useful to merge the PR #29475 as the current PR makes use of the other kernels. What do you think? |
|
Let me update the branch in this PR to check if the errors in the CI can be reproduced or if they were a transient event. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some feedback. In general, I think you should have a look at a merge PR that adds array API support to another unsupervised transformer (e.g. PCA):
This should give guidance on how to upgrade this estimator and test it.
|
@EmilyXinyi here are the instructions to update the changelog for this PR: https://github.com/scikit-learn/scikit-learn/tree/main/doc/whats_new/upcoming_changes#readme |
|
BTW @EmilyXinyi, once you have managed to fix the changelog problem of this PR, feel free to open another PR to improve the message of the failed changelog check to make it more informative. This is done in the |
|
Another reason why the checks are failing: Looking at implementing sparse matrix support for this kernel |
… round trip in _linear_kernel
…it-learn into array_API_nystroem
| if X.ndim == 1 and Y.ndim == 1: | ||
| return X @ Y | ||
| elif X.ndim == 2 and Y.ndim == 2: | ||
| return X @ Y.T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
codecov is complaining about this case not being covered, but i don't think the ndim == 2 thing will ever happen because of how we defined _pairwise_callable... maybe we should just remove the conditional check altogether?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, let's simplify this code since it's a private test utility function. Maybe we could rename X and Y to x and y to avoid implying that they can be 2D arrays.
sklearn/kernel_approximation.py
Outdated
| X = np.asarray(X) | ||
| if Y is None: | ||
| Y_dtype = X.dtype | ||
| elif not sp.issparse(Y) and not isinstance(Y, np.ndarray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
codecov is complaining about this case not being covered, and I pulled this straight out of pairwise.py.... so I wonder if we should add this in sklearn/utils/_array_api.py?
|
@EmilyXinyi I pushed d028da9 to help debug what's wrong with the CuPy failures on the CUDA CI. Please pull that commit locally before continuing your work on this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a comment from the preview that was not addressed:
https://github.com/scikit-learn/scikit-learn/pull/29661/files#r2321428320
And here is another pass of feedback:
|
|
||
| K = rbf_kernel(X_np, gamma=gamma) | ||
|
|
||
| assert_array_almost_equal(K, np.dot(X_xp_transformed_np, X_xp_transformed_np.T)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use assert_allclose instead of assert_array_almost_equal in newly written or updated tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert_allclose results in an error larger than the tolerance. This is the case for the non array API case too (the test above this). Should we keep assert_array_almost_equal for both, or find a way to reduce the error in the approximate itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's adjust the tolerance (rtol and/or atol) so that the assertion passes with NumPy.
| xp, _ = get_namespace(X, Y) | ||
| if not issparse(X) and not isinstance(X, np.ndarray): | ||
| X = np.asarray(X) | ||
| X = xp.asarray(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the feeling that this line is not needed. Can you try to remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without this line X doesn't have a dtype attribute and the tests will fail with
FAILED sklearn/tests/test_common.py::test_estimators[BisectingKMeans(max_iter=5,n_clusters=2,n_init=2)-check_transformer_data_not_an_array] - AttributeError: '_NotAnArray' object has no attribute 'dtype'
which comes from this block
else:
Y_dtype = Y.dtype
> if X.dtype == Y_dtype == np.float32:
^^^^^^^
E AttributeError: '_NotAnArray' object has no attribute 'dtype'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a pass of feedback. I will accept the dtype_name suggestion to be able to see the CI results once the tests are fixed in that respect.
| xp, _ = get_namespace(X, Y) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xp is already retrieved 2 lines above. The call check_pairwise array should not change the namespace of X and Y. If it does, this is probably a bug and should be fixed.
| xp, _ = get_namespace(X, Y) |
| if X.ndim == 1 and Y.ndim == 1: | ||
| return X @ Y | ||
| elif X.ndim == 2 and Y.ndim == 2: | ||
| return X @ Y.T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, let's simplify this code since it's a private test utility function. Maybe we could rename X and Y to x and y to avoid implying that they can be 2D arrays.
|
|
||
| K = rbf_kernel(X_np, gamma=gamma) | ||
|
|
||
| assert_array_almost_equal(K, np.dot(X_xp_transformed_np, X_xp_transformed_np.T)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's adjust the tolerance (rtol and/or atol) so that the assertion passes with NumPy.
|
@EmilyXinyi don't forgot to pull the above commit to your local branch before continuing your work on addressing the remaining CI test failures and review comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a look at failing tests and just some small comments.
An addition to the tests that checks that the output namespace, device and dtype are expected might be nice.
| elif not issparse(Y) and not isinstance(Y, np.ndarray): | ||
| Y = np.asarray(Y) | ||
| Y_dtype = Y.dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question, why delete this?
This is the cause of the AttributeError: 'list' object has no attribute 'dtype' and AttributeError: 'tuple' object has no attribute 'dtype' errors, for when Y is not an array.
| @pytest.mark.parametrize( | ||
| "array_namespace, device, dtype_name", yield_namespace_device_dtype_combinations() | ||
| ) | ||
| def test_nystroem_approximation_array_api(array_namespace, device, dtype_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is failing for [torch-cpu-float32] and [array_api_strict-device2-float32] and seems do be due to float32 precision. Do you know why..?
| # output data type during `transform`. | ||
| self.random_weights_ = self.random_weights_.astype(X.dtype, copy=False) | ||
| self.random_offset_ = self.random_offset_.astype(X.dtype, copy=False) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just leave this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's lint doing the spaces and formatting 🫠
| self : object | ||
| Returns the instance itself. | ||
| """ | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this implement/fix? Explain your changes.
Make Nystroem approximation array API compatible
To Do