TST Extend tests for scipy.sparse.*array in sklearn/tests/test_random_projection.py #27314
Conversation
The CI is failing with
because of my attempt to fix the independent function call like this:
I explain this to myself as the CI testing this file with an older scikit-learn version that depends on a scipy version older than 1.8 (where sparse arrays were introduced)? Please let me know if I should call this function with a …

data, data_csr = make_sparse_random_data(
    sp.coo_array, n_samples, n_features, n_nonzeros
)
Let's remove this data generation. It means that we need to modify the following tests:
test_try_to_transform_before_fit
test_SparseRandomProj_output_representation
test_correct_RandomProjection_dimensions_embedding
test_random_projection_feature_names_out
You can call make_sparse_random_data in the test itself and we will parametrize outside. For the first test (i.e., test_try_to_transform_before_fit), we don't have the parametrize. I think we can also avoid the parametrization on feature_names_out.
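As a rough sketch of what that restructuring could look like (the helper body, the constants, and the placeholder test body below are simplified assumptions for illustration, not the actual file contents), the data generation moves inside the unparametrized test:

```python
import numpy as np
import scipy.sparse as sp

# Illustrative stand-ins for the module-level constants in the test file.
n_samples, n_features = 10, 100
n_nonzeros = int(n_samples * n_features / 100)


def make_sparse_random_data(coo_container, n_samples, n_features, n_nonzeros,
                            random_state=0):
    # Simplified sketch of the helper: uniformly located non-zero entries
    # with Gaussian distributed values.
    rng = np.random.RandomState(random_state)
    flat = rng.choice(n_samples * n_features, n_nonzeros, replace=False)
    rows, cols = np.divmod(flat, n_features)
    values = rng.normal(size=n_nonzeros)
    X = coo_container((values, (rows, cols)), shape=(n_samples, n_features))
    return X.toarray(), X.tocsr()


def test_try_to_transform_before_fit():
    # Data is generated inside the test with a fixed container, so no
    # parametrization is needed; the real assertions (transforming before
    # fit raises) are elided here, a shape check stands in for them.
    data, _ = make_sparse_random_data(sp.coo_matrix, n_samples, n_features,
                                      n_nonzeros)
    assert data.shape == (n_samples, n_features)
```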
Yes, now I see that the generated data is used in some tests, which I had missed before.
I have implemented it as you hinted, @glemaitre, not parametrizing the first and the last test, but I couldn't figure out why testing the sparse matrix on them is unnecessary.
Otherwise, the rest is fine.
@KartikeyBartwal This is already a pull request and not an issue. Please refer to the original issue #27090 for further information and check the remaining work to be done.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Our CI always tests the development code (that is, the development branch). The oldest supported versions of our dependencies are defined in …
# Make some random data with uniformly located non zero entries with
# Gaussian distributed values
def make_sparse_random_data(n_samples, n_features, n_nonzeros, random_state=0):
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
This is not a test function: it does not start with test_, so pytest.mark.* annotations do nothing on this kind of helper function. You have to instead parametrize the test functions that call make_sparse_random_data and pass the coo_container as an argument to this function.
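A minimal sketch of this fix (the helper body and the COO_CONTAINERS list are simplified assumptions, not the actual file contents): the mark moves onto the test function, which forwards the container to the helper.

```python
import numpy as np
import pytest
import scipy.sparse as sp

COO_CONTAINERS = [sp.coo_matrix]  # sp.coo_array is added on scipy >= 1.8


def make_sparse_random_data(coo_container, n_samples, n_features, n_nonzeros,
                            random_state=0):
    # Helper, not a test: a @pytest.mark.parametrize placed here is a no-op.
    rng = np.random.RandomState(random_state)
    flat = rng.choice(n_samples * n_features, n_nonzeros, replace=False)
    rows, cols = np.divmod(flat, n_features)
    values = rng.normal(size=n_nonzeros)
    X = coo_container((values, (rows, cols)), shape=(n_samples, n_features))
    return X.toarray(), X.tocsr()


# The mark goes on the test, which passes the container through.
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
def test_data_generation(coo_container):
    dense, data_csr = make_sparse_random_data(coo_container, 10, 50, 5)
    assert dense.shape == (10, 50)
    assert data_csr.format == "csr"
```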
I see. Thanks for the hint.
I believe I made a mistake in test_SparseRandomProj_output_representation before, which I have now fixed. This test makes data using the datatypes from the coo_container and then converts the data into the datatypes from the csr_container to also do some checks.
# Make some random data with uniformly located non zero entries with
# Gaussian distributed values
def make_sparse_random_data(n_samples, n_features, n_nonzeros, random_state=0):
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
@@ -367,6 +380,7 @@ def test_johnson_lindenstrauss_min_dim():

@pytest.mark.parametrize("random_projection_cls", all_RandomProjection)
def test_random_projection_feature_names_out(random_projection_cls):
    data, _ = make_sparse_random_data(sp.coo_array, n_samples, n_features, n_nonzeros)
In particular this call needs to be updated.
I did parametrize this test now. In this test, the data is created with both datatypes from the COO container, fitted, and then _feature_names_out are compared to the expected feature names.
Though, @glemaitre was suggesting maybe not to parametrize here; did I get this right? If so, should I put make_sparse_random_data(sp.coo_matrix, n_samples, n_features, n_nonzeros) instead?
@@ -365,8 +378,10 @@ def test_johnson_lindenstrauss_min_dim():
    assert johnson_lindenstrauss_min_dim(100, eps=1e-5) == 368416070986

@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
Indeed, I was thinking not to parametrize, since the test is not really intended for that. But since not all the CIs support arrays, we cannot use them by default, so it is best to parametrize as you did. Sorry for the confusion.
No worries, I'm grateful for your support.
@@ -222,13 +224,15 @@ def test_random_projection_transformer_invalid_input():

def test_try_to_transform_before_fit():
    data, _ = make_sparse_random_data(sp.coo_matrix, n_samples, n_features, n_nonzeros)
Here, this is a bit the same as for get_feature_names_out: we should probably parametrize even if it's not really needed.
Because there will be a time when sp.coo_matrix won't be supported by all the dependencies in the CI tests, even though currently sparse matrices are still supported by all of them. Got it.
n_features = 20
data, _ = make_sparse_random_data(5, n_features, int(n_features / 4))
data, _ = make_sparse_random_data(coo_container, 5, n_features, int(n_features / 4))

for RandomProjection in all_RandomProjection:
    rp_dense = RandomProjection(n_components=3, random_state=1).fit(data)
This is weird here since data is not dense. I would expect instead:

rp_dense = RandomProjection(n_components=3, random_state=1).fit(data.toarray())
rp_sparse = RandomProjection(n_components=3, random_state=1).fit(data)

This would be in line with what the names of the variables mean. If this is the case, then we don't need to have both CSR and COO but only one of them.
data here is dense.
This is because make_sparse_random_data returns the data in two types (dense, and sparse in CSR format). And of all the 9 tests where it is used, only test_inverse_transform uses the sparse data option, too.
This part in make_sparse_random_data:

data_coo = sp.coo_matrix((a, (b, c)), shape=(n_samples, n_features))
return data_coo.toarray(), data_coo.tocsr()

I wonder why it turns a coo_matrix into a csr_matrix. Is there a reason for this? If not, can I simplify the function?
I'd also like to rename data, _ = make_sparse_random_data(...) into something like what was used in test_inverse_transform:

X_dense, X_csr = make_sparse_random_data(...)

It's clearer to read.
What do you think, @glemaitre?
Indeed, the name is a bit ambiguous for this function.
I wonder why it turns a coo_matrix into a csr_matrix. Is there a reason for this? If not, can I simplify the function?
Most of the algorithms in scikit-learn use CSR matrices and will convert COO -> CSR. So returning a CSR matrix avoids this conversion.
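A small standalone scipy illustration of this point (not code from the PR): COO is the natural format to build from (row, col, value) triplets, while CSR is what most estimators consume, so converting once in the helper saves every consumer from redoing it.

```python
import numpy as np
import scipy.sparse as sp

# COO is easy to construct from (row, col, value) triplets.
rows = np.array([0, 1, 2])
cols = np.array([1, 0, 2])
values = np.array([1.0, 2.0, 3.0])
X_coo = sp.coo_matrix((values, (rows, cols)), shape=(3, 3))

# Most estimators operate on CSR internally; converting once up front
# avoids a repeated COO -> CSR conversion inside every estimator.
X_csr = X_coo.tocsr()
```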
X_dense, X_csr = make_sparse_random_data(...)

Indeed, this is already better because it informs the reader that the function returns both the dense and the sparse format.
Another option (but I don't know how much refactoring it requires) is to always return only a single type. Instead of coo_container, we could have a sparse_format: if None, we return a dense matrix, otherwise we pass the type of container that we desire. A bit similar to the changes in #27438.
However, if it changes many lines, then your original proposal is already good enough, @StefanieSenger.
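A sketch of that single-return-type variant; the signature and the sparse_format semantics here are assumptions inferred from this discussion and #27438, not the merged code.

```python
import numpy as np
import scipy.sparse as sp


def make_sparse_random_data(n_samples, n_features, n_nonzeros,
                            random_state=0, sparse_format=None):
    # If sparse_format is None, return a dense ndarray; otherwise return
    # the data in the requested scipy.sparse format, e.g. "csr" or "coo".
    rng = np.random.RandomState(random_state)
    flat = rng.choice(n_samples * n_features, n_nonzeros, replace=False)
    rows, cols = np.divmod(flat, n_features)
    values = rng.normal(size=n_nonzeros)
    X = sp.coo_matrix((values, (rows, cols)), shape=(n_samples, n_features))
    if sparse_format is None:
        return X.toarray()
    return X.asformat(sparse_format)
```

With this shape, calling the helper twice with the same random_state yields the same data in both representations, which is what the later suggestion about building sparse_data relies on.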
Another option (but I don't know how much refactoring it requires) is to always return only a single type. Instead of coo_container, we could have a sparse_format: if None, we return a dense matrix, otherwise we pass the type of container that we desire. A bit similar to the changes in #27438.
@glemaitre I will push a proposal in a minute, but it doesn't get rid of coo_container as a param. I have tried to substitute coo_container with sparse_format, but now I believe that we have to keep coo_container, because we need to parametrize those datatypes in all of the tests that use make_sparse_random_data. Please let me know if this makes sense.
At least, the function now is much more explicit about what type it returns.
Ah, and there's a function def make_sparse_random_data that is exactly the same in benchmarks/bench_random_projections.py. Should this one be changed the same way?
Ah, and there's a function def make_sparse_random_data that is exactly the same in benchmarks/bench_random_projections.py. Should this one be changed the same way?
You can leave the benchmark as it is.
Thanks. I have only some last changes following your improvements. After that it looks fine to me.
n_samples,
n_features,
n_nonzeros,
random_state=0,
This is also unusual in our tests:

random_state=0,
random_state=None,

Could you then pass random_state=0 in the call from each test function?
I've tried to use global_random_seed for this, but this led to failing tests for test_random_projection_embedding_quality. Does this mean this test is sensitive to the random seed it had used before?
for SparseRandomProj in all_SparseRandomProjection:
    # when using sparse input, the projected data can be forced to be a
    # dense numpy array
    rp = SparseRandomProj(n_components=10, dense_output=True, random_state=0)
    rp.fit(data)
    assert isinstance(rp.transform(data), np.ndarray)

    sparse_data = sp.csr_matrix(data)
    sparse_data = csr_container(data)
Instead of calling the csr_container here, could you create sparse_data at the beginning by calling make_sparse_random_data twice, once with sparse_format=None and once with sparse_format="csr"? We don't need to parametrize for both COO and CSR containers. Using the same random_state=0 in both calls should lead to the same data.
Thanks for the suggestion, I did it. I have used global_random_seed here, because I believe that assert isinstance(...) or sp.issparse would pass even if the data was different?
Should I change that to use random_state=0?
Yes, let's keep the random_state at 0. We can revisit this test in a later PR to see how flaky it is by changing the random_state.
Thank you, @glemaitre, I have changed these two data generations to use random_state=0.
for RandomProjection in all_RandomProjection:
    rp_dense = RandomProjection(n_components=3, random_state=1).fit(data)
    rp_sparse = RandomProjection(n_components=3, random_state=1).fit(
        sp.csr_matrix(data)
        csr_container(data)
Here, I would as well call make_sparse_random_data twice and create the same data and data_sparse from the beginning. No need for the product of all possibilities of arrays and matrices.
Good catch, thanks for the hint. Fewer test cases again. :)
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
LGTM. Thanks @StefanieSenger
LGTM!
Thank you, @StefanieSenger!
Just minor nitpicks for getting to a self-documenting code (this modifies the state of the file before your contribution).
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Thank you, @jjerphan, it looks much cleaner now.
Hi Stefanie, one last thing: could you add an entry in this section for 1.4's changelog? :) (scikit-learn/doc/whats_new/v1.4.rst, lines 132 to 182 at 8db3aac)

Hey @jjerphan, just did it. Thanks for your support. :)
TST Extend tests for scipy.sparse.*array in sklearn/tests/test_random_projection.py (scikit-learn#27314) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Reference Issues/PRs
Towards #27090.
What does this implement/fix? Explain your changes.
This PR substitutes scipy sparse matrices with the scipy containers introduced in #27095 in the sklearn/tests/test_random_projection.py test file.
Comment
It was a bit tricky, because one of the parametrized functions was re-used outside of the test functions (see below). I am not sure why, since it doesn't seem to be re-used afterwards. I came up with a solution.
n_samples, n_features = (10, 1000)
n_nonzeros = int(n_samples * n_features / 100.0)
data, data_csr = make_sparse_random_data(n_samples, n_features, n_nonzeros)