8000 TST Extend tests for `scipy.sparse.*array` in `sklearn/utils/tests/test_param_validation.py` by Charlie-XIAO · Pull Request #27950 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

TST Extend tests for scipy.sparse.*array in sklearn/utils/tests/test_param_validation.py #27950

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

Charlie-XIAO
Copy link
Contributor
@Charlie-XIAO Charlie-XIAO commented Dec 12, 2023

Reference Issues/PRs

Towards #27090.

What does this implement/fix? Explain your changes.

Extend validation for spa 8000 rse containers. This would require changing "sparse matrix" to "sparse container" when necessary. Previous discussion at #27317.

The following are the changes I have made (because tests are failing):

Details

sklearn.cluster
    cluster.dbscan
        - sklearn/cluster/tests/test_dbscan.py::test_dbscan_input_not_modified
        - sklearn/cluster/tests/test_dbscan.py::test_dbscan_input_not_modified_precomputed_sparse_nodiag
    cluster.compute_optics_graph
        - sklearn/cluster/tests/test_optics.py::test_correct_number_of_clusters (*)
        - sklearn/cluster/tests/test_optics.py::test_dbscan_optics_parity (*)
        - sklearn/cluster/tests/test_optics.py::test_min_cluster_size_invalid2 (*)
        - sklearn/cluster/tests/test_optics.py::test_precomputed_dists (*)
sklearn.neighbors
    neighbors.sort_graph_by_row_values
        - sklearn/cluster/tests/test_dbscan.py::test_dbscan_precomputed_metric_with_initial_rows_zero (*)
        - sklearn/neighbors/tests/test_neighbors.py::test_sort_graph_by_row_values
        - sklearn/neighbors/tests/test_neighbors.py::test_sort_graph_by_row_values_copy
        - sklearn/neighbors/tests/test_neighbors.py::test_sort_graph_by_row_values_warning
        - sklearn/neighbors/tests/test_neighbors.py::test_sort_graph_by_row_values_bad_sparse_format
        - sklearn/neighbors/tests/test_neighbors.py::test_precomputed_sparse_invalid
        - sklearn/neighbors/tests/test_neighbors.py::test_radius_neighbors_returns_array_of_objects
        - sklearn/neighbors/tests/test_neighbors.py::test_kneighbors_regressor_sparse
sklearn.utils
    utils.extmath.randomized_svd
        - sklearn/utils/tests/test_extmath.py::test_randomized_svd_low_rank_all_dtypes
        - sklearn/utils/tests/test_extmath.py::test_randomized_svd_sparse_warnings

The following are the places where "sparse matrix" option is used in parameter validation:

Details

  • cluster.ward_tree
  • cluster.kmeans_plusplus
  • cluster.k_means
  • cluster.spectral_clustering
  • datasets.dump_svmlight_file
  • decomposition.non_negative_factorization
  • feature_selection.mutual_info_regression
  • feature_selection.mutual_info_classif
  • feature_selection.f_classif
  • feature_selection.chi2
  • feature_selection.r_regression
  • feature_selection.f_regression
  • inspection.partial_dependence
  • linear_model.lasso_path
  • linear_model.enet_path
  • linear_model.ridge_regression
  • manifold.trustworthiness
  • metrics.accuracy_score
  • metrics.multilabel_confusion_matrix
  • metrics.jaccard_score
  • metrics.zero_one_loss
  • metrics.f1_score
  • metrics.fbeta_score
  • metrics.precision_recall_fscore_support
  • metrics.class_likelihood_ratios
  • metrics.precision_score
  • metrics.recall_score
  • metrics.classification_report
  • metrics.hamming_loss
  • metrics.label_ranking_average_precision_score
  • metrics.label_ranking_loss
  • metrics.pairwise.euclidean_distances
  • metrics.pairwise_distances_argmin_min
  • metrics.pairwise_distances_argmin
  • metrics.pairwise.haversine_distances
  • metrics.pairwise.manhattan_distances
  • metrics.pairwise.cosine_distances
  • metrics.pairwise.paired_euclidean_distances
  • metrics.pairwise.paired_manhattan_distances
  • metrics.pairwise.paired_cosine_distances
  • metrics.pairwise.linear_kernel
  • metrics.pairwise.polynomial_kernel
  • metrics.pairwise.sigmoid_kernel
  • metrics.pairwise.rbf_kernel
  • metrics.pairwise.laplacian_kernel
  • metrics.pairwise.cosine_similarity
  • metrics.pairwise_distances_chunked
  • metrics.pairwise_distances
  • metrics.pairwise.pairwise_kernels
  • metrics.mutual_info_score
  • metrics.silhouette_score
  • metrics.silhouette_samples
  • model_selection.cross_validate
  • model_selection.cross_val_score
  • model_selection.cross_val_predict
  • model_selection.permutation_test_score
  • model_selection.learning_curve
  • model_selection.validation_curve
  • preprocessing.scale
  • preprocessing.maxabs_scale
  • preprocessing.robust_scale
  • preprocessing.normalize
  • preprocessing.binarize
  • preprocessing.add_dummy_feature
  • preprocessing.quantile_transform
  • svm.l1_min_c
  • utils.safe_mask
  • utils.class_weight.compute_sample_weight
  • utils.graph.single_source_shortest_path_length

One major reason why so many things are tested yet so few errors are raised without changing to "sparse array" is that, "array-like" seems to be covering sparse matrices and sparse arrays since they have the shape and __len__ attributes. I'm thinking... should "array-like" not include these sparse containers (for instance, by adding and not sparse.issparse(val))? @jjerphan

Also @glemaitre who posted a comment #27317 (comment) in the original PR and @jeremiedbb who is the main author of _param_validation.py.

@Charlie-XIAO Charlie-XIAO changed the title Tst sp param val TST Extend tests for scipy.sparse.*array in sklearn/utils/tests/test_param_validation.py Dec 12, 2023
Copy link
github-actions bot commented Dec 12, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 63dbdb3. Link to the linter CI: here

Copy link
Member
@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @Charlie-XIAO.

@@ -438,7 +438,7 @@ def _compute_core_distances_(X, neighbors, min_samples, working_memory):

@validate_params(
{
"X": [np.ndarray, "sparse matrix"],
"X": [np.ndarray, "sparse container"],
Copy link
Member
@jjerphan jjerphan Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would keep such strings unchanged and consider their change in another PR since they touch the public API. I would wait for Jeremy, because he is the one who has the authority on it.

This applies here, and in other places as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjerphan In fact tests would fail without these changes, unless I do not change the validation for "sparse matrix". This involves some confusing issues and let me make things a bit clearer. I think there are two possibilities for this PR:

  1. Let "sparse matrix" and "sparse container" have the same validation, i.e., sparse.issparse() instead of changing the validation of "sparse matrix" to sparse.isspmatrix(). In this case I can leave the parameter validations unchanged.

  2. Change the validation of "sparse matrix" to sparse.isspmatrix() and add the option "sparse container". This is, however, much more complicated, because we would have to touch the public API to make tests pass. As a matter of fact, I would expect much more tests to fail with this change (meaning that more parameter validations need to be changed to make things pass). However, only a few tests failed for the following reason: "array-like" is covering sparse matrices and sparse arrays, so that if a function/class accepts both "array-like" and "sparse matrix" then there will be no error. In other words, with the current validation system, ["array-like", "sparse matrix"] is the same as ["array-like"], which I do not think is reasonable. To take this option, we may need to: (1) change "array-like" to not accept sparse stuff; (2) with the updated "sparse matrix" validation, find out all tests that are failing (I would expect a lot), and from those failing tests I think we can know which functions/classes have been tests on sparse arrays, and we can change the constraints from "sparse matrix" to "sparse container".

I'm actually not sure which option to take. Option 1 is somehow doing nothing, in which case I think we can just discard this PR. Option 2 is more reasonable, i.e., this PR is not only to extend tests but also to change the public API and fix parameter validations, but perhaps it is too much. Well, perhaps we can first make a PR to fix the problem of "array-like" covering sparse matrices and sparse arrays (if this is indeed not the desired behavior), then do the rest of Option 2 in this PR with the title changed. I do want to know what maintainers think, also @jeremiedbb.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think sparse containers are considered to be array-like. So option 2 works for me, but I would wait for the opinion of another maintainer.

@Charlie-XIAO
Copy link
Contributor Author
Charlie-XIAO commented Jan 10, 2024

I am trying to draw attention from maintainers and move the discussion to a more visible place, so I opened #28099.

@jeremiedbb
Copy link
Member

As explained in #28101 (comment), I think we should not introduce a separate sparse container constraint. Let's keep a single sparse matrix constraint for now (like we left sparse matrix in all our docstrings for now) and maybe at some point we'll rename it sparse container everywhere.

I extracted the addition of test for sparse containers and added it to #28101 so I think we can close this PR.

cc/ @glemaitre

@Charlie-XIAO
Copy link
Contributor Author
Charlie-XIAO commented Feb 28, 2024

Just to add some thoughts: I think it's reasonable to close, in the sense that: though having two constraints can help us find which functions/classes have been tested against sparse arrays and which are not, it may convey a wrong message to users that "sparse array" is intentionally not supported (while the fact is that it is simply not tested).

@jeremiedbb
Copy link
Member
jeremiedbb commented Feb 28, 2024

though having two constraints can help us find which functions/classes have been tested against sparse arrays and which are not,

That's interesting. I think one day we'll probably want to deprecate sparse matrices and keep only sparse arrays. I expect that it will reveal everything that was not properly tested against sparse arrays. And the things we'll miss will probably be the functions that we already don't test against sparse matrices although we should, but having 2 constraints can't help us there.

@glemaitre
Copy link
Member

I agree that we can close this PR and revisit the concept of sparse array when they actually behave as numpy array.

@glemaitre glemaitre closed this Feb 28, 2024
@Charlie-XIAO Charlie-XIAO deleted the tst-sp-param-val branch February 28, 2024 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0