8000 TST Extend tests for `scipy.sparse.*array` in `test_pairwise.py` by StefanieSenger · Pull Request #27288 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

TST Extend tests for scipy.sparse.*array in test_pairwise.py #27288

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

StefanieSenger
Copy link
Contributor

Reference Issues/PRs

Towards #27090.

What does this implement/fix? Explain your changes.

This PR substitutes scipy sparse matrices with the scipy containers introduced in #27095 in the sklearn/metrics/tests/test_pairwise.py test file.

@github-actions
Copy link
github-actions bot commented Sep 4, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 90c1c0b. Link to the linter CI: here

@StefanieSenger
Copy link
Contributor Author
StefanieSenger commented Sep 4, 2023
 ERROR collecting metrics/tests/test_pairwise.py ________________
In test_euclidean_distances_known_result: 2 parameter sets specified, with different number of ids: 3
----------------- generated xml file: /temp_dir/test-data.xml ------------------

The failing test from the CI (test_euclidean_distances_known_result) passes in pytest with the correct number of test cases (9)... Do I need to do something?

@Charlie-XIAO
Copy link
Contributor
Charlie-XIAO commented Sep 4, 2023
 ERROR collecting metrics/tests/test_pairwise.py ________________
In test_euclidean_distances_known_result: 2 parameter sets specified, with different number of ids: 3
----------------- generated xml file: /temp_dir/test-data.xml ------------------

The failing test from the CI (test_euclidean_distances_known_result) passes in pytest with the correct number of test cases (9)... Do I need to do something?

I'm not maintainer but I think you can simply remove ids. The thing is XXX_CONTAINERS may have 1 or 2 elements based on scipy version, so when XXX_CONTAINERS only have 1 element the errored test case would have 2 parameter sets with 3 ids. (Well, unless you specify ids based on the length of XXX_CONTAINERS but I think it is not necessary since as far as I know, ids are mostly used for collecting specific test cases via the -k option).

@StefanieSenger
Copy link
Contributor Author
StefanieSenger commented Sep 6, 2023

Thank you, @Charlie-XIAO I will try this.

@OmarManzoor
Copy link
Contributor
OmarManzoor commented Sep 6, 2023

Hi @StefanieSenger,

I think the ids are basically used for identification purposes, otherwise some defaults will be used depending on the data structure. For example with ids the results are like:

test_pairwise.py::test_euclidean_distances_known_result[dense-dense] PASSED [ 11%]
test_pairwise.py::test_euclidean_distances_known_result[dense-sparse_matrix] PASSED [ 22%]
test_pairwise.py::test_euclidean_distances_known_result[sparse_matrix-dense] PASSED [ 44%]
test_pairwise.py::test_euclidean_distances_known_result[sparse_matrix-sparse_array] PASSED [ 66%]

without ids the results are

test_pairwise.py::test_euclidean_distances_known_result[array-array] PASSED [ 11%]
test_pairwise.py::test_euclidean_distances_known_result[array-csr_matrix] PASSED [ 22%]
test_pairwise.py::test_euclidean_distances_known_result[csr_matrix-array] PASSED [ 44%]
test_pairwise.py::test_euclidean_distances_known_result[csr_matrix-csr_array] PASSED [ 66%]

So we can either remove these ids or conditionally set the ids based on the container length.
Since they were present originally if we consider removing them, then I think we should also take another maintainer's opinion.

@StefanieSenger
Copy link
Contributor Author
StefanieSenger commented Sep 7, 2023

Hi, @OmarManzoor, I see and to be on the secure side I did now set the ids conditionally based on whether the container contained more than one element.
Though I'd prefer the simplest possible solution, if another maintainer confirms it's okay to delete the ids.

It seems that without passing the ids, these are named differently, but the combinations tested are the same.

=====================with passing IDs
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[dense-dense] PASSED                                                              [ 11%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[dense-sparse_matrix] PASSED                                                      [ 22%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[dense-sparse_array] PASSED                                                       [ 33%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[sparse_matrix-dense] PASSED                                                      [ 44%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[sparse_matrix-sparse_matrix] PASSED                                              [ 55%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[sparse_matrix-sparse_array] PASSED                                               [ 66%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[sparse_array-dense] PASSED                                                       [ 77%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[sparse_array-sparse_matrix] PASSED                                               [ 88%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[sparse_array-sparse_array] PASSED  


==================without passing IDs

sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[array-array] PASSED                                                              [ 11%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[array-csr_matrix] PASSED                                                         [ 22%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[array-csr_array] PASSED                                                          [ 33%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[csr_matrix-array] PASSED                                                         [ 44%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[csr_matrix-csr_matrix] PASSED                                                    [ 55%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[csr_matrix-csr_array] PASSED                                                     [ 66%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[csr_array-array] PASSED                                                          [ 77%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[csr_array-csr_matrix] PASSED                                                     [ 88%]
sklearn/metrics/tests/test_pairwise.py::test_euclidean_distances_known_result[csr_array-csr_array] PASSED                                                      [100%]

"dense", "sparse_matrix", "sparse_array" translates directly to , "array", "csr_matrix", "csr_array".

OmarManzoor
OmarManzoor previously approved these changes Sep 8, 2023
Copy link
Contributor
@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @StefanieSenger. LGTM.

@OmarManzoor OmarManzoor added the Waiting for Second Reviewer First reviewer is done, need a second one! label Sep 8, 2023
Copy link
Contributor
@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are a number of other instances as well where we might needs to set ids conditionally. Since they all involve CSR_CONTAINERS we can simply have a variable at the top like

container_ids = (
        ["dense", "sparse_matrix"] if len(CSR_CONTAINERS) == 1
        else ["dense", "sparse_matrix", "sparse_array"]
)

and then use this in all the required places.

Or you can remove the ids if you prefer. Since I added the second reviewer tag, if another maintainer feels we need them, then they can be added.

@StefanieSenger
Copy link
Contributor Author

I think there are a number of other instances as well where we might needs to set ids conditionally. Since they all involve CSR_CONTAINERS we can simply have a variable at the top like

container_ids = (
["dense", "sparse_matrix"] if len(CSR_CONTAINERS) == 1
else ["dense", "sparse_matrix", "sparse_array"]
)

and then use this in all the required places.

Or you can remove the ids if you prefer. Since I added the second reviewer tag, if another maintainer feels we need them, then they can be added.

Thank you, @OmarManzoor.

I've taken a look into the pytests documentation. It seems that ids are strings, that pytest automatically assigns for identifying each test case, but we can also explicitly name them, if we want, for clarity or debugging.

Since the ids are not further used in the code, I will omit them.

@Charlie-XIAO
Copy link
Contributor

@StefanieSenger You may want to look at this one: #27262 (comment)

@StefanieSenger
Copy link
Contributor Author

@StefanieSenger You may want to look at this one: #27262 (comment)

Ah, thank you, @Charlie-XIAO. I will keep them then. Do you know why?

@Charlie-XIAO
Copy link
Contributor

@StefanieSenger You may want to look at this one: #27262 (comment)

Ah, thank you, @Charlie-XIAO. I will keep them then. Do you know why?

I think it's still for compatibility, for instance, one may have a script that selects tests to run based on names. But not sure, maybe @glemaitre?

@glemaitre
Copy link
Member

The ids allow to have the explicit parameter used during the parametrization in case it give some mistic names. I would say that they are not necessary if the set of parameters makes it already explicit and readable when running the output of pytest.

@StefanieSenger
Copy link
Contributor Author

It's easier to trace back failing tests then, I see. Thank you @glemaitre .

Copy link
Contributor
@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @StefanieSenger

@glemaitre glemaitre self-requested a review September 14, 2023 16:17
Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM for the other tests.

@@ -146,8 +152,8 @@ def test_pairwise_distances(global_dtype):

# Test with sparse X and Y,
# currently only supported for Euclidean, L1 and cosine.
X_sparse = csr_matrix(X)
Y_sparse = csr_matrix(Y)
X_sparse = csr_container(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we isolate the code that test with the sparse matrices.
It would avoid to repeat the tests for the dense case multiple times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glemaitre, do you mean to turn this test into two separate tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly.

Copy link
Contributor Author
@StefanieSenger StefanieSenger Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I will try that. :)

Edit: It's done. :)

X_sparse = csr_matrix(X)
Y_sparse = csr_matrix(Y)
X_sparse = csr_container(X)
Y_sparse = csr_container(Y)

S = pairwise_distances(X_sparse, Y_sparse, metric="euclidean")
S2 = euclidean_distances(X_sparse, Y_sparse)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In l. 168-169, I see the following:

S = pairwise_distances(X_sparse, Y_sparse.tocsc(), metric="manhattan")
S2 = manhattan_distances(X_sparse.tobsr(), Y_sparse.tocoo())

It means that somehow, we expect to try with other containers as well, BSR, COO, and CSC.
I assume that we just decorate the new test by parametrizing for all combinations. I don't think that we need to mix the type of X and y, but try the same type for both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think I know what you mean here.

Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @StefanieSenger

@glemaitre glemaitre merged commit df60c75 into scikit-learn:main Sep 29, 2023
@StefanieSenger StefanieSenger deleted the sparse_array_test_pairwise branch September 30, 2023 05:29
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0