10000 ENH: sparse: Add csr 1d sparse arrays and test_arithmetic1d by dschult · Pull Request #19833 · scipy/scipy · GitHub
[go: up one dir, main page]

Skip to content

ENH: sparse: Add csr 1d sparse arrays and test_arithmetic1d #19833

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 7, 2024

Conversation

dschult
Copy link
Contributor
@dschult dschult commented Jan 8, 2024

Adding sparse array support within the csr code allows us to use the 2d sparsetools routines for csr with 1d sparse arrays. While 1d arrays do not have a compressed index, we can construct indptr = np.array([0 nnz]) and use the sparsetools functionality as if the 1d array were a 2d array. Reshaping the result where needed works well. The other importance of csr format is that many features in other formats are funneled to csr because that handles most everything. So we should be able to start enabling some of the features for sparse arrays that are waiting for 1d support after this PR. I envision additional formats as a special feature rather than an essential part of the 1d support process. We are getting close to full functionality.

This PR also introduces tests of 1d arithmetic taken from the test_base.py arithmetic tests for 2d, converted to 1d and modernized to use fixtures and pytest parametrize rather than subclassing test classes for the different formats. This PR includes the tests of arithmetic without including the tests of various data dtypes, as those are presumably tested well in 2d and the code is essentially the same for 1d and for 1xN 2d.

Indexing beyond integers (and tests of such) are not implemented for csr yet (or dok in #19715).

It is not clear whether it makes sense to use csc-1d. A 1d array is essentially a row. Using csc with this formulation means that indptr has N+1 elements. So, unless the indptr dtype is quit small, the array takes more RAM than a dense array. We could instead treat the csc-1d as a single column. But this breaks the usual symmetries between CSR and CSC. Converting between csr and csc would simply be a copy of the data and change of class. Another alternate implementation would be to make a 1d format that combines features of coo, csc, csr and leaves out support for 2d.
But I think this is likely to be the best way forward. So see what you think...

This PR is built on #19853 (base-1d). So it is independent of the dok-1d branch and the 3 PRs should be easier to review. Looking at it now will show files _base.py and test_common1d.py which are actually included in the base-1d PR.

@github-actions github-actions bot added scipy.sparse Meson Items related to the introduction of Meson as the new build system for SciPy enhancement A new feature or improvement labels Jan 8, 2024
@lucascolley lucascolley removed the Meson Items related to the introduction of Meson as the new build system for SciPy label Jan 8, 2024
@dschult dschult changed the title ENH: sparse: Add csr/csc 1d sparse arrays ENH: sparse: Add csr 1d sparse arrays and test_arithmetic1d Jan 11, 2024
@dschult dschult force-pushed the csr-1d branch 3 times, most recently from b8b4f0d to 1ab08c1 Compare February 8, 2024 02:00
Copy link
Member
@stefanv stefanv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve pending sorting out multiplication semantics.

10000 Copy link
Contributor Author
@dschult dschult left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes made as per review:

  • construct matmul to follow array api -- in particular to produce 1d or 0d arrays as appropriate. 1d arrays are sparse. 0d arrays are ndarrays.
  • adjust tests for affected code
  • adjusted and updated comments
  • made code more readable, and logic more clear

@dschult dschult force-pushed the csr-1d branch 2 times, most recently from 77c8586 to e817933 Compare February 19, 2024 05:28
@dschult
Copy link
Contributor Author
dschult commented Feb 19, 2024

A1d @ A1d in NumPy returns a scalar. But the array-api says it should return a 0-dim array.

Which should sparse arrays return?
I think they should match the array-api (and presumably numpy will eventually change to). I've made this PR return 0-dim arrays in that case.

While adding tests, I found a broken behavior for this PRs 1d sparse arrays when they appear on the right of @. Their shape_as_2d is (N, 1) and correctly indicates a single column, but they were still storing a row in csr_array format. I have corrected this PR to correctly handle the "column" by converting the "row" to 2d and transposing (which makes it a csc_array). But that means we have to construct a csr_array for a column vector, which can explode memory for large column vectors as indptr will need M+1 elements when "@" multiplying a CSR row by a CSC column.

This is a problem for the 2d case (spmatrix or sparray) also. So I'm not going to fix that here.

To fix this we need a CSR @ CSC sparsetools function (we currently have CSR @ CSR). I have found an algorithm for that and will make a separate PR. We should be able to multiply sparse 1d arrays with length N while avoiding a csr column that requires an (N+1)-vector for indptr. Similar issues for 2d arrays that are very wide multiplying arrays that are very tall.

@rgommers
Copy link
Member

Array scalars kinda duck-type 0-D arrays. The discussion at numpy/numpy#25542 (comment) clarifies that returning array scalars therefore does not actually break array API compatibility. NumPy may or may not remove scalars in the future; it'd be a big operation and if this happens it'll be another major release, so at least several years away.

I'd say 0-D arrays are preferred for new code. For existing code, we should pay attention to API consistency as well - if there's 10 functions/methods and 9 already return scalars, then it's better for the 10th one to do the same.

@dschult dschult requested a review from perimosocordiae March 28, 2024 19:44
@perimosocordiae perimosocordiae merged commit 811b134 into scipy:main May 7, 2024
@perimosocordiae
Copy link
Member

Merged! 🎉
Thanks for your hard work on this, @dschult.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.sparse
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0