TST Extend tests for `scipy.sparse.*array` in `sklearn/tests/test_random_projection.py` by StefanieSenger · Pull Request #27314 · scikit-learn/scikit-learn · GitHub


Merged

Conversation

@StefanieSenger (Contributor) commented Sep 7, 2023

Reference Issues/PRs

Towards #27090.

What does this implement/fix? Explain your changes.

This PR substitutes scipy sparse matrices with the scipy containers introduced in #27095 in the sklearn/tests/test_random_projection.py test file.

Comment

It was a bit tricky, because one of the parametrized functions was also used at module level, outside of any test function (see below). I am not sure why, since the result doesn't seem to be re-used afterwards. I came up with a solution.

n_samples, n_features = (10, 1000)
n_nonzeros = int(n_samples * n_features / 100.0)
data, data_csr = make_sparse_random_data(n_samples, n_features, n_nonzeros)

@github-actions bot commented Sep 7, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6a0971d.

@StefanieSenger (Contributor, Author) commented:

The CI is failing with

/io/sklearn/tests/test_random_projection.py:66: in <module>
    sp.coo_array, n_samples, n_features, n_nonzeros
E   AttributeError: module 'scipy.sparse' has no attribute 'coo_array'

because of my attempt to fix the independent function call like this:

n_samples, n_features = (10, 1000)
n_nonzeros = int(n_samples * n_features / 100.0)
data, data_csr = make_sparse_random_data(
    sp.coo_array, n_samples, n_features, n_nonzeros
)

I explain this to myself as the CI testing this file with an older scikit-learn version that depends on a SciPy version older than 1.8 (where sparse arrays were introduced). (?)

Please let me know if I should call this function with coo_container=sp.coo_matrix, or if those four lines can be removed since they're outside of any test function, or if there's another way to fix this.
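For context, one version-tolerant way to pick a COO container at import time would be a `getattr` fallback. This is a sketch of the general idea, not what the PR ended up doing (the merged code uses the `COO_CONTAINERS` list from `sklearn/utils/fixes.py` instead):

```python
import scipy.sparse as sp

# Hypothetical fallback: use coo_array when the installed SciPy (>= 1.8)
# provides it, otherwise fall back to the older coo_matrix constructor.
coo_container = getattr(sp, "coo_array", sp.coo_matrix)
```

Because the attribute lookup is guarded, this never raises the `AttributeError` shown in the traceback above on older SciPy versions.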

Comment on lines 65 to 67
data, data_csr = make_sparse_random_data(
sp.coo_array, n_samples, n_features, n_nonzeros
)
Member commented:

Let's remove this data generation. It means that we need to modify the following tests:

  • test_try_to_transform_before_fit
  • test_SparseRandomProj_output_representation
  • test_correct_RandomProjection_dimensions_embedding
  • test_random_projection_feature_names_out

You can call make_sparse_random_data in the test itself, and we will parametrize from the outside. For the first test (i.e., test_try_to_transform_before_fit), we don't have the parametrization. I think we can also avoid the parametrization on feature_names_out.

StefanieSenger (Contributor, Author) replied:

Yes, now I see that the generated data is used in some tests, which I had missed before.

I have implemented it the way you hinted, @glemaitre, not parametrizing the first and the last test, but I couldn't figure out why testing the sparse matrix on them is unnecessary.

@glemaitre (Member) left a comment:

Otherwise, the rest is fine.

@glemaitre (Member) commented:
@KartikeyBartwal This is already a pull-request and not an issue. Please refer to the original issue #27090 for further information and check the remaining work to be done.

@glemaitre glemaitre self-requested a review September 14, 2023 16:37
StefanieSenger and others added 2 commits September 18, 2023 11:46
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@ogrisel (Member) commented Sep 18, 2023

I explain this to myself as the CI testing this file with an older scikit-learn version that depends on a SciPy version older than 1.8 (where sparse arrays were introduced). (?)

Our CI always tests the development code of scikit-learn (that is, the main branch or a PR targeting main) against different versions of Python, NumPy and SciPy, and in the presence or absence of optional dependencies such as pandas and matplotlib.

The oldest supported versions of our dependencies are defined in sklearn/_min_dependencies.py and indeed, at the time of writing we want scikit-learn to work against SciPy 1.5.0, that is, before the introduction of scipy.sparse.coo_array (in 1.8). Therefore we should write code that does not assume the existence of this constructor. This is why COO_CONTAINERS is defined with a conditional statement on the SciPy version in sklearn/utils/fixes.py at the time this module is imported (that is, at runtime, on the user's machine). On older SciPy versions it only contains coo_matrix, while on versions from 1.8 on it contains both coo_matrix and coo_array.
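The idea can be sketched like this (an illustrative reconstruction, not the actual code in sklearn/utils/fixes.py):

```python
import scipy
import scipy.sparse as sp

# Parse only the major/minor components of the installed SciPy version string.
sp_version = tuple(int(part) for part in scipy.__version__.split(".")[:2])

# coo_matrix always exists; coo_array only exists from SciPy 1.8 on.
COO_CONTAINERS = [sp.coo_matrix]
if sp_version >= (1, 8):
    COO_CONTAINERS.append(sp.coo_array)
```

Tests parametrized over this list then automatically cover whichever containers the installed SciPy actually provides.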

# Make some random data with uniformly located non zero entries with
# Gaussian distributed values
def make_sparse_random_data(n_samples, n_features, n_nonzeros, random_state=0):
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
Member commented:

This is not a test function: it does not start with test_, so pytest.mark.* annotations do nothing on this kind of helper function.

You have to instead parametrize the test functions that call this make_sparse_random_data and pass the coo_container as argument to this function.
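The suggested pattern looks roughly like the following sketch. Both the COO_CONTAINERS list and the helper body here are simplified stand-ins, not the code that was merged:

```python
import numpy as np
import pytest
import scipy.sparse as sp

COO_CONTAINERS = [sp.coo_matrix]  # assumption: sp.coo_array is appended on SciPy >= 1.8


def make_sparse_random_data(coo_container, n_samples, n_features, n_nonzeros, random_state=0):
    # Random coordinates with Gaussian values; COO sums any duplicate entries.
    rng = np.random.RandomState(random_state)
    data_coo = coo_container(
        (
            rng.randn(n_nonzeros),
            (
                rng.randint(n_samples, size=n_nonzeros),
                rng.randint(n_features, size=n_nonzeros),
            ),
        ),
        shape=(n_samples, n_features),
    )
    return data_coo.toarray(), data_coo.tocsr()


# The parametrization lives on the test function, not on the helper.
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
def test_make_sparse_random_data_shapes(coo_container):
    dense, sparse_csr = make_sparse_random_data(coo_container, 10, 100, 50)
    assert dense.shape == (10, 100)
    assert sp.issparse(sparse_csr) and sparse_csr.format == "csr"
```

This way pytest collects one test case per container, and the helper itself stays a plain function.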

StefanieSenger (Contributor, Author) replied:

I see, thanks for the hint.

I believe I made a mistake in test_SparseRandomProj_output_representation before, which I have now fixed. This test makes data using the datatype from the coo_container and then converts the data into the datatype from the csr_container to also do some checks.

# Make some random data with uniformly located non zero entries with
# Gaussian distributed values
def make_sparse_random_data(n_samples, n_features, n_nonzeros, random_state=0):
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
Member commented:

Suggested change
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)

@@ -367,6 +380,7 @@ def test_johnson_lindenstrauss_min_dim():

@pytest.mark.parametrize("random_projection_cls", all_RandomProjection)
def test_random_projection_feature_names_out(random_projection_cls):
data, _ = make_sparse_random_data(sp.coo_array, n_samples, n_features, n_nonzeros)
Member commented:

In particular this call needs to be updated.

StefanieSenger (Contributor, Author) replied:

I did parametrize this test now. In this test, data is created with both datatypes from the COO containers and fitted, and then _feature_names_out are compared to the expected feature names.

Though, @glemaitre was suggesting maybe not to parametrize here; did I get this right? If so, should I put make_sparse_random_data(sp.coo_matrix, n_samples, n_features, n_nonzeros) instead?

@@ -365,8 +378,10 @@ def test_johnson_lindenstrauss_min_dim():
assert johnson_lindenstrauss_min_dim(100, eps=1e-5) == 368416070986


@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
Member commented:

Indeed, I was thinking not to parametrize since the test is not really intended for that. But since not all the CIs support arrays, we cannot use them by default, so it is best to parametrize as you did. Sorry for the confusion.

StefanieSenger (Contributor, Author) replied:

No worries, I'm grateful for your support.

@@ -222,13 +224,15 @@ def test_random_projection_transformer_invalid_input():


def test_try_to_transform_before_fit():
data, _ = make_sparse_random_data(sp.coo_matrix, n_samples, n_features, n_nonzeros)
Member commented:

Here, this is a bit the same as for get_feature_names_out; we should probably parametrize even if it is not really needed.

StefanieSenger (Contributor, Author) replied:

Because there will be a time when sp.coo_matrix won't be supported by all the dependencies in the CI tests, even though sparse matrices are currently still supported by all of them. Got it.

n_features = 20
data, _ = make_sparse_random_data(5, n_features, int(n_features / 4))
data, _ = make_sparse_random_data(coo_container, 5, n_features, int(n_features / 4))

for RandomProjection in all_RandomProjection:
rp_dense = RandomProjection(n_components=3, random_state=1).fit(data)
Member commented:

This is weird here since data is not dense. I would expect instead:

rp_dense = RandomProjection(n_components=3, random_state=1).fit(data.toarray())
rp_sparse = RandomProjection(n_components=3, random_state=1).fit(data)

This would be in line with what the variable names mean. If this is the case, then we don't need to have both CSR and COO but only one of them.

StefanieSenger (Contributor, Author) replied:

data here is dense.

This is because make_sparse_random_data returns the data in two types (dense, and sparse in CSR format). And of all the 9 tests where it is used, only test_inverse_transform also uses the sparse variant.

This part in make_sparse_random_data:

data_coo = sp.coo_matrix(a, (b, c), shape=(n_samples, n_features))
return data_coo.toarray(), data_coo.tocsr()

I wonder why it turns a coo_matrix into a csr_matrix. Is there a reason for this? If not, can I simplify the function?

I'd also like to rename data, _ = make_sparse_random_data(...)
into something like what was used in test_inverse_transform:
X_dense, X_csr = make_sparse_random_data(...)
It's clearer to read.

What do you think, @glemaitre ?

Member replied:

Indeed, the name is a bit ambiguous for this function.

I wonder why it turns a coo_matrix into a scr_matrix. Is there a reason for this? If not, can I simplify the function?

Most of the algorithms in scikit-learn use CSR matrices and will convert COO -> CSR. So returning a CSR matrix avoids this conversion.
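A small illustration of why CSR is the convenient interchange format (generic SciPy usage, not code from this PR):

```python
import scipy.sparse as sp

# COO is easy to build from (values, (rows, cols)) triplets...
coo = sp.coo_matrix(([1.0, 2.0], ([0, 2], [1, 3])), shape=(4, 5))

# ...but estimators typically need efficient row access and arithmetic,
# which CSR provides, so inputs tend to get converted to CSR anyway.
csr = coo.tocsr()
row = csr[2].toarray()  # cheap row slicing on CSR
```

Generating CSR directly in the helper therefore saves every consuming test a conversion.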

X_dense, X_csr = make_sparse_random_data(...)

Indeed, this is already better because it informs the reader that the function is returning both dense and sparse format.

Another option (but I don't know how much refactoring it requires) is to always return only a single type. Instead of coo_container, we could have a sparse_format parameter: if None, we return a dense matrix, otherwise we return the sparse format that we desire. A bit similar to the changes in #27438.

However, if it changes many lines, then your original proposal is already a good enough proposal @StefanieSenger

StefanieSenger (Contributor, Author) commented Sep 27, 2023:

Another option (but I don't know how much refactoring it requires) is to always return only a single type. Instead of coo_container, we can have a sparse_format: if None, we return a dense matrix, otherwise we pass the type of container that we desired. A bit similar to the changes in #27438.

@glemaitre
I will push a proposal in a minute, but it doesn't get rid of coo_container as a parameter. I have tried to substitute coo_container with sparse_format, but now I believe that we have to keep coo_container, because we need to parametrize those datatypes in all of the tests that use make_sparse_random_data. Please let me know if this makes sense.

At least, the function now is much more explicit about what type it returns.

Ah, and there's a function def make_sparse_random_data that is exactly the same in benchmarks/bench_random_projections.py. Should that one be changed in the same way?

Member replied:

Ah, and there's a function def make_sparse_random_data that is exactly equal in benchmarks/bench_random_projections.py. Should this one be changed the same way?

You can leave the benchmark as it is.

@glemaitre glemaitre self-requested a review September 28, 2023 08:55
@glemaitre (Member) left a comment:

Thanks. I have only some last changes following your improvements. After that it looks fine to me.

n_samples,
n_features,
n_nonzeros,
random_state=0,
Member commented:

This is also unusual in our tests:

Suggested change
random_state=0,
random_state=None,

Could you then pass random_state=0 in the call from each test function?

StefanieSenger (Contributor, Author) replied:

I've tried to use global_random_seed for this, but it led to failing tests for test_random_projection_embedding_quality. Does this mean this test is sensitive to the random seed it had used before?

for SparseRandomProj in all_SparseRandomProjection:
# when using sparse input, the projected data can be forced to be a
# dense numpy array
rp = SparseRandomProj(n_components=10, dense_output=True, random_state=0)
rp.fit(data)
assert isinstance(rp.transform(data), np.ndarray)

sparse_data = sp.csr_matrix(data)
sparse_data = csr_container(data)
Member commented:

Instead of calling the csr_container here, could you create sparse_data at the beginning by calling make_sparse_random_data twice, once with sparse_format=None and once with sparse_format="csr"? We don't need to parametrize for both COO and CSR containers.

Using the same random_state=0 in both calls should lead to the same data.
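The seeding argument can be checked directly. This is a toy illustration with a hypothetical generator, not the helper from the PR:

```python
import numpy as np
import scipy.sparse as sp


def random_coo(random_state, shape=(10, 100), n_nonzeros=50):
    # Identical seeds drive identical draws, so repeated calls build the same matrix.
    rng = np.random.RandomState(random_state)
    rows = rng.randint(shape[0], size=n_nonzeros)
    cols = rng.randint(shape[1], size=n_nonzeros)
    vals = rng.randn(n_nonzeros)
    return sp.coo_matrix((vals, (rows, cols)), shape=shape)


dense = random_coo(0).toarray()     # the "sparse_format=None" flavour
sparse_csr = random_coo(0).tocsr()  # the "sparse_format='csr'" flavour
assert np.allclose(dense, sparse_csr.toarray())
```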

StefanieSenger (Contributor, Author) replied:

Thanks for the suggestion, I did it. I have used global_random_seed here, because I believe that assert isinstance(...) or sp.issparse would pass even if the data was different?
Should I change that to use random_state=0?

Member replied:

Yes, let's keep the random_state at 0. We can revisit this test in a later PR to see how flaky it is by changing the random_state.

StefanieSenger (Contributor, Author) commented Sep 30, 2023:

Thank you, @glemaitre, I have changed these two data generations to use random_state=0.


for RandomProjection in all_RandomProjection:
rp_dense = RandomProjection(n_components=3, random_state=1).fit(data)
rp_sparse = RandomProjection(n_components=3, random_state=1).fit(
sp.csr_matrix(data)
csr_container(data)
Member commented:

Here, I would as well call make_sparse_random_data twice and create the same data and data_sparse from the beginning. No need for the product of all possible arrays/matrices.


StefanieSenger (Contributor, Author) replied:

Good that you found that, thanks for the hint. Fewer test cases again. :)

StefanieSenger and others added 2 commits September 29, 2023 11:57
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
StefanieSenger and others added 12 commits September 29, 2023 12:00
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@glemaitre (Member) left a comment:

LGTM. Thanks @StefanieSenger

@glemaitre glemaitre added the Waiting for Second Reviewer First reviewer is done, need a second one! label Oct 31, 2023
@jjerphan (Member) left a comment:

LGTM!

Thank you, @StefanieSenger!

Just minor nitpicks to get to self-documenting code (this modifies the state of the file from before your contribution).

StefanieSenger and others added 7 commits November 11, 2023 22:10
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@StefanieSenger (Contributor, Author) commented:

Thank you, @jjerphan, it looks much cleaner now.

@jjerphan (Member) commented:

Hi Stefanie,

One last thing: could you add an entry in this section for 1.4's changelog? :)

Support for SciPy sparse arrays
-------------------------------
Several estimators are now supporting SciPy sparse arrays. The following functions
and classes are impacted:
**Functions:**
- :func:`cluster.compute_optics_graph` in :pr:`27104` by
:user:`Maren Westermann <marenwestermann>` and in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :func:`cluster.kmeans_plusplus` in :pr:`27179` by :user:`Nurseit Kamchyev <Bncer>`;
- :func:`decomposition.non_negative_factorization` in :pr:`27100` by
:user:`Isaac Virshup <ivirshup>`;
- :func:`feature_selection.f_regression` in :pr:`27239` by
:user:`Yaroslav Korobko <Tialo>`;
- :func:`feature_selection.r_regression` in :pr:`27239` by
:user:`Yaroslav Korobko <Tialo>`;
- :func:`manifold.trustworthiness` in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :func:`manifold.spectral_embedding` in :pr:`27240` by :user:`Yao Xiao <Charlie-XIAO>`;
- :func:`metrics.pairwise_distances` in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :func:`metrics.pairwise_distances_chunked` in :pr:`27250` by
:user:`Yao Xiao <Charlie-XIAO>`;
- :func:`metrics.pairwise.pairwise_kernels` in :pr:`27250` by
:user:`Yao Xiao <Charlie-XIAO>`;
- :func:`sklearn.utils.multiclass.type_of_target` in :pr:`27274` by
:user:`Yao Xiao <Charlie-XIAO>`.
**Classes:**
- :class:`cluster.HDBSCAN` in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :class:`cluster.KMeans` in :pr:`27179` by :user:`Nurseit Kamchyev <Bncer>`;
- :class:`cluster.MiniBatchKMeans` in :pr:`27179` by :user:`Nurseit Kamchyev <Bncer>`;
- :class:`cluster.OPTICS` in :pr:`27104` by
:user:`Maren Westermann <marenwestermann>` and in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :class:`decomposition.NMF` in :pr:`27100` by :user:`Isaac Virshup <ivirshup>`;
- :class:`decomposition.MiniBatchNMF` in :pr:`27100` by
:user:`Isaac Virshup <ivirshup>`;
- :class:`feature_extraction.text.TfidfTransformer` in :pr:`27219` by
:user:`Yao Xiao <Charlie-XIAO>`;
- :class:`cluster.Isomap` in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :func:`manifold.SpectralEmbedding` in :pr:`27240` by :user:`Yao Xiao <Charlie-XIAO>`;
- :class:`manifold.TSNE` in :pr:`27250` by :user:`Yao Xiao <Charlie-XIAO>`;
- :class:`impute.SimpleImputer` in :pr:`27277` by :user:`Yao Xiao <Charlie-XIAO>`;
- :class:`impute.IterativeImputer` in :pr:`27277` by :user:`Yao Xiao <Charlie-XIAO>`;
- :class:`impute.KNNImputer` in :pr:`27277` by :user:`Yao Xiao <Charlie-XIAO>`;
- :class:`kernel_approximation.PolynomialCountSketch` in :pr:`27301` by
:user:`Lohit SundaramahaLingam <lohitslohit>`;
- :class:`neural_network.BernoulliRBM` in :pr:`27252` by
:user:`Yao Xiao <Charlie-XIAO>`;
- :class:`preprocessing.PolynomialFeatures` in :pr:`27166` by
:user:`Mohit Joshi <work-mohit>`.

@StefanieSenger
Copy link
Contributor Author

Hey @jjerphan, just did it. Thanks for your support. :)

@jjerphan jjerphan enabled auto-merge (squash) November 12, 2023 11:51
@jjerphan jjerphan merged commit eb08740 into scikit-learn:main Nov 12, 2023
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…dom_projection.py` (scikit-learn#27314)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@StefanieSenger StefanieSenger deleted the sparse_array_test_random_projection branch April 18, 2024 11:00
Labels: No Changelog Needed · Waiting for Second Reviewer (First reviewer is done, need a second one!)
4 participants