-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
TST Extend tests for scipy.sparse.*array
in sklearn/utils/tests/test_param_validation.py
#27950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
scipy.sparse.*array
in sklearn/utils/tests/test_param_validation.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @Charlie-XIAO.
@@ -438,7 +438,7 @@ def _compute_core_distances_(X, neighbors, min_samples, working_memory): | |||
|
|||
@validate_params( | |||
{ | |||
"X": [np.ndarray, "sparse matrix"], | |||
"X": [np.ndarray, "sparse container"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would keep such strings unchanged and consider their change in another PR since they touch the public API. I would wait for Jeremy, because he is the one who has the authority on it.
This applies here, and in other places as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jjerphan In fact tests would fail without these changes, unless I do not change the validation for "sparse matrix". This involves some confusing issues and let me make things a bit clearer. I think there are two possibilities for this PR:
-
Let "sparse matrix" and "sparse container" have the same validation, i.e.,
sparse.issparse()
instead of changing the validation of "sparse matrix" tosparse.isspmatrix()
. In this case I can leave the parameter validations unchanged. -
Change the validation of "sparse matrix" to
sparse.isspmatrix()
and add the option "sparse container". This is, however, much more complicated, because we would have to touch the public API to make tests pass. As a matter of fact, I would expect much more tests to fail with this change (meaning that more parameter validations need to be changed to make things pass). However, only a few tests failed for the following reason: "array-like" is covering sparse matrices and sparse arrays, so that if a function/class accepts both "array-like" and "sparse matrix" then there will be no error. In other words, with the current validation system, ["array-like", "sparse matrix"] is the same as ["array-like"], which I do not think is reasonable. To take this option, we may need to: (1) change "array-like" to not accept sparse stuff; (2) with the updated "sparse matrix" validation, find out all tests that are failing (I would expect a lot), and from those failing tests I think we can know which functions/classes have been tests on sparse arrays, and we can change the constraints from "sparse matrix" to "sparse container".
I'm actually not sure which option to take. Option 1 is somehow doing nothing, in which case I think we can just discard this PR. Option 2 is more reasonable, i.e., this PR is not only to extend tests but also to change the public API and fix parameter validations, but perhaps it is too much. Well, perhaps we can first make a PR to fix the problem of "array-like" covering sparse matrices and sparse arrays (if this is indeed not the desired behavior), then do the rest of Option 2 in this PR with the title changed. I do want to know what maintainers think, also @jeremiedbb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think sparse containers are considered to be array-like. So option 2 works for me, but I would wait for the opinion of another maintainer.
I am trying to draw attention from maintainers and move the discussion to a more visible place, so I opened #28099. |
As explained in #28101 (comment), I think we should not introduce a separate I extracted the addition of test for sparse containers and added it to #28101 so I think we can close this PR. cc/ @glemaitre |
Just to add some thoughts: I think it's reasonable to close, in the sense that: though having two constraints can help us find which functions/classes have been tested against sparse arrays and which are not, it may convey a wrong message to users that "sparse array" is intentionally not supported (while the fact is that it is simply not tested). |
That's interesting. I think one day we'll probably want to deprecate sparse matrices and keep only sparse arrays. I expect that it will reveal everything that was not properly tested against sparse arrays. And the things we'll miss will probably be the functions that we already don't test against sparse matrices although we should, but having 2 constraints can't help us there. |
I agree that we can close this PR and revisit the concept of sparse array when they actually behave as numpy array. |
Reference Issues/PRs
Towards #27090.
What does this implement/fix? Explain your changes.
Extend validation for spa 8000 rse containers. This would require changing "sparse matrix" to "sparse container" when necessary. Previous discussion at #27317.
The following are the changes I have made (because tests are failing):
Details
The following are the places where "sparse matrix" option is used in parameter validation:
Details
cluster.ward_tree
cluster.kmeans_plusplus
cluster.k_means
cluster.spectral_clustering
datasets.dump_svmlight_file
decomposition.non_negative_factorization
feature_selection.mutual_info_regression
feature_selection.mutual_info_classif
feature_selection.f_classif
feature_selection.chi2
feature_selection.r_regression
feature_selection.f_regression
inspection.partial_dependence
linear_model.lasso_path
linear_model.enet_path
linear_model.ridge_regression
manifold.trustworthiness
metrics.accuracy_score
metrics.multilabel_confusion_matrix
metrics.jaccard_score
metrics.zero_one_loss
metrics.f1_score
metrics.fbeta_score
metrics.precision_recall_fscore_support
metrics.class_likelihood_ratios
metrics.precision_score
metrics.recall_score
metrics.classification_report
metrics.hamming_loss
metrics.label_ranking_average_precision_score
metrics.label_ranking_loss
metrics.pairwise.euclidean_distances
metrics.pairwise_distances_argmin_min
metrics.pairwise_distances_argmin
metrics.pairwise.haversine_distances
metrics.pairwise.manhattan_distances
metrics.pairwise.cosine_distances
metrics.pairwise.paired_euclidean_distances
metrics.pairwise.paired_manhattan_distances
metrics.pairwise.paired_cosine_distances
metrics.pairwise.linear_kernel
metrics.pairwise.polynomial_kernel
metrics.pairwise.sigmoid_kernel
metrics.pairwise.rbf_kernel
metrics.pairwise.laplacian_kernel
metrics.pairwise.cosine_similarity
metrics.pairwise_distances_chunked
metrics.pairwise_distances
metrics.pairwise.pairwise_kernels
metrics.mutual_info_score
metrics.silhouette_score
metrics.silhouette_samples
model_selection.cross_validate
model_selection.cross_val_score
model_selection.cross_val_predict
model_selection.permutation_test_score
model_selection.learning_curve
model_selection.validation_curve
preprocessing.scale
preprocessing.maxabs_scale
preprocessing.robust_scale
preprocessing.normalize
preprocessing.binarize
preprocessing.add_dummy_feature
preprocessing.quantile_transform
svm.l1_min_c
utils.safe_mask
utils.class_weight.compute_sample_weight
utils.graph.single_source_shortest_path_length
One major reason why so many things are tested yet so few errors are raised without changing to "sparse array" is that, "array-like" seems to be covering sparse matrices and sparse arrays since they have the
shape
and__len__
attributes. I'm thinking... should "array-like" not include these sparse containers (for instance, by addingand not sparse.issparse(val)
)? @jjerphanAlso @glemaitre who posted a comment #27317 (comment) in the original PR and @jeremiedbb who is the main author of
_param_validation.py
.