-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
Add array API support for Nystroem approximation #29661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
While working on this I kept running into this error @OmarManzoor do you happen to have any pointer on how to get around this when you worked on #29433 ? Thanks!
|
There is something fishy that happens right after sklearn/metrics/tests/test_common.py:1979: in check_array_api_metric_pairwise
check_array_api_metric(
X_np = array([[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6]], dtype=float32)
Y_np = array([[0.2, 0.3, 0.4],
[0.5, 0.6, 0.7]], dtype=float32)
array_namespace = 'torch'
device = 'cpu'
dtype_name = 'float32'
metric = <function rbf_kernel at 0x125b99b20>
metric_kwargs = {}
sklearn/metrics/tests/test_common.py:1773: in check_array_api_metric
metric_xp = metric(a_xp, b_xp, **metric_kwargs)
a_np = array([[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6]], dtype=float32)
a_xp = tensor([0.1000, 0.2000, 0.3000, 0.4000, 0.5000, 0.6000])
array_namespace = 'torch'
b_np = array([[0.2, 0.3, 0.4],
[0.5, 0.6, 0.7]], dtype=float32)
b_xp = tensor([0.2000, 0.3000, 0.4000, 0.5000, 0.6000, 0.7000])
|
Hi @EmilyXinyi I checked out your branch locally and ran the Also @ogrisel I think it might be useful to merge the PR #29475 as the current PR makes use of the other kernels. What do you think? |
Let me update the branch in this PR to check if the errors in the CI can be reproduced or if they were a transient event. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some feedback. In general, I think you should have a look at a merge PR that adds array API support to another unsupervised transformer (e.g. PCA):
This should give guidance on how to upgrade this estimator and test it.
@EmilyXinyi here are the instructions to update the changelog for this PR: https://github.com/scikit-learn/scikit-learn/tree/main/doc/whats_new/upcoming_changes#readme |
BTW @EmilyXinyi, once you have managed to fix the changelog problem of this PR, feel free to open another PR to improve the message of the failed changelog check to make it more informative. This is done in the |
Another reason why the checks are failing: Looking at implementing sparse matrix support for this kernel |
…it-learn into array_API_nystroem
sklearn/kernel_approximation.py
Outdated
@@ -31,6 +35,34 @@ | |||
validate_data, | |||
) | |||
|
|||
# Utility Functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pulled these functions straight out of here (pairwise.py
)
I feel like these functions may get used more and more as we make more estimators and metric array API compatible. Should we consider moving these functions into sklearn/utils/_array_api.py
?
K = polynomial_kernel(X_xp, degree=2, coef0=0.1) | ||
nystroem = Nystroem(kernel="precomputed", n_components=X_xp.shape[0]) | ||
X_xp_transformed = nystroem.fit_transform(K) | ||
X_xp_transformed_np = _convert_to_numpy(X_xp_transformed, xp=xp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the CI pipeline is complaining about this line when the array_namespace
is torch
with this error AttributeError: 'numpy.ndarray' object has no attribute 'cpu'
trying to reproduce this error on google colab...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the build log this is happening with torch-cpu
so you don't need to be on Google Colab, you can probably reproduce locally by installing pytorch-cpu
with conda or pip3 install torch --index-url https://download.pytorch.org/whl/cpu
with pip according to PyTorch install page.
My guess is that somewhere there is a lingering np
somewhere so that the torch array is turned into an numpy array silently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirm I can reproduce locally with:
pytest sklearn/tests/test_kernel_approximation.py -k 'test_nystroem_precomputed_kernel_array_api and torch'
I think one of the reason is that you forgot to enable array API for the .fit_transform
:
with config_context(array_api_dispatch=True):
but adding it doesn't seem to be enough locally, so there is more debugging left to do 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually looking a bit closer adding the config_context
fixes it, I pushed a commit with a fix.
The laplacian kernel is not supported for non-numpy namespaces, the array-api test should skip this kernel. |
xp, _ = get_namespace(X) | ||
X_np = _convert_to_numpy(X, xp=xp) | ||
Y_np = _convert_to_numpy(Y, xp=xp) | ||
transformed = np.dot(X_np, Y_np.T) | ||
transformed_xp = xp.asarray(transformed) | ||
return transformed_xp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be possible to avoid the numpy conversion here. A dot product is just a matrix multiplication if the 2 inputs are 2D arrays.
xp, _ = get_namespace(X) | |
X_np = _convert_to_numpy(X, xp=xp) | |
Y_np = _convert_to_numpy(Y, xp=xp) | |
transformed = np.dot(X_np, Y_np.T) | |
transformed_xp = xp.asarray(transformed) | |
return transformed_xp | |
return X @ Y.T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Array API wants X @ Y if both X and Y are 1D (it refuses to transpose 1D arrays) so I did the check dim conditions here
|
||
# test that available kernels fit and transform | ||
kernels_available = kernel_metrics() | ||
for kern in kernels_available: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should skip the laplacian kernel here as well.
… round trip in _linear_kernel
…it-learn into array_API_nystroem
if X.ndim == 1 and Y.ndim == 1: | ||
return X @ Y | ||
elif X.ndim == 2 and Y.ndim == 2: | ||
return X @ Y.T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
codecov is complaining about this case not being covered, but i don't think the ndim == 2 thing
will ever happen because of how we defined _pairwise_callable... maybe we should just remove the conditional check altogether?
X = np.asarray(X) | ||
if Y is None: | ||
Y_dtype = X.dtype | ||
elif not sp.issparse(Y) and not isinstance(Y, np.ndarray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
codecov is complaining about this case not being covered, and I pulled this straight out of pairwise.py.... so I wonder if we should add this in sklearn/utils/_array_api.py
?
@EmilyXinyi I pushed d028da9 to help debug what's wrong with the CuPy failures on the CUDA CI. Please pull that commit locally before continuing your work on this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a comment from the preview that was not addressed:
https://github.com/scikit-learn/scikit-learn/pull/29661/files#r2321428320
And here is another pass of feedback:
if X.ndim == 1 and Y.ndim == 1: | ||
return X @ Y | ||
elif X.ndim == 2 and Y.ndim == 2: | ||
return X @ Y.T | ||
else: | ||
raise ValueError("Incompatible shapes for linear kernel") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't the following work in all cases?
if X.ndim == 1 and Y.ndim == 1: | |
return X @ Y | |
elif X.ndim == 2 and Y.ndim == 2: | |
return X @ Y.T | |
else: | |
raise ValueError("Incompatible shapes for linear kernel") | |
return X @ Y.T |
@@ -1013,6 +1047,8 @@ def fit(self, X, y=None): | |||
self : object | |||
Returns the instance itself. | |||
""" | |||
|
|||
xp, _ = get_namespace(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should retrieve the input data device with get_namespace_and_device
here.
@@ -1031,8 +1067,11 @@ def fit(self, X, y=None): | |||
n_components = self.n_components | |||
n_components = min(n_samples, n_components) | |||
inds = rnd.permutation(n_samples) | |||
basis_inds = inds[:n_components] | |||
basis = X[basis_inds] | |||
basis_inds = xp.asarray(inds[:n_components], dtype=xp.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and then always pass the device argument when calling xp.asarray
to continue working on the same device.
|
||
K = rbf_kernel(X_np, gamma=gamma) | ||
|
||
assert_array_almost_equal(K, np.dot(X_xp_transformed_np, X_xp_transformed_np.T)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use assert_allclose
instead of assert_array_almost_equal
in newly written or updated tests.
|
||
K = rbf_kernel(X_np, gamma=gamma) | ||
|
||
assert_array_almost_equal(K, np.dot(X_xp_transformed_np, X_xp_transformed_np.T)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: I think we almost always can and should use the @
operator instead of np.dot
in new code.
# some basic tests | ||
rnd = np.random.RandomState(0) | ||
X_np = rnd.uniform(size=(10, 4)) | ||
X_xp = xp.asarray(X_np) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update all the tests to make sure that we actually use the device
returned by yield_namespace_device_dtype_combinations
:
X_xp = xp.asarray(X_np) | |
X_xp = xp.asarray(X_np, device=device) |
return X, Y, dtype | ||
|
||
|
||
def _find_floating_dtype_allow_sparse(X, Y, xp=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of duplicating the _return_float_dtype
function of sklearn/metrics/pairwise.py
here, I would rather move _find_floating_dtype_allow_sparse
to sklearn/metrics/pairwise.py
because it is only used once there.
if not issparse(X) and not isinstance(X, np.ndarray): | ||
X = np.asarray(X) | ||
X = xp.asarray(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the feeling that this line is not needed. Can you try to remove it?
|
||
if Y is None: | ||
Y_dtype = X.dtype | ||
elif not issparse(Y) and not isinstance(Y, np.ndarray): | ||
Y = np.asarray(Y) | ||
Y = xp.asarray(Y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the feeling that this line is not needed. Can you try to remove it? Then we can simplify the if / else structure of this code.
What does this implement/fix? Explain your changes.
Make Nystroem approximation array API compatible
To Do