8000 ENH: Replace `_lazywhere` with `xpx.apply_where` by crusaderky · Pull Request #22557 · scipy/scipy · GitHub
[go: up one dir, main page]

Skip to content

ENH: Replace _lazywhere with xpx.apply_where #22557

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 35 commits into from
Mar 22, 2025
Merged

Conversation

crusaderky
Copy link
Contributor
@crusaderky crusaderky commented Feb 19, 2025

@crusaderky crusaderky changed the title DNM Replace _lazy_apply with xpx.apply_where DNM Replace _lazywhere with xpx.apply_where Feb 19, 2025
@lucascolley lucascolley requested review from mdhaber and removed request for andyfaff, steppi and person142 February 19, 2025 15:38
@lucascolley lucascolley added maintenance Items related to regular maintenance tasks array types Items related to array API support and input array validation (see gh-18286) labels Feb 19, 2025
@mdhaber
Copy link
Contributor
mdhaber commented Feb 19, 2025

Note that marray tests do not run in CI

They should be running in the array API job because we install marray there. Is that not happening?

@mdhaber
Copy link
Contributor
mdhaber commented Feb 19, 2025

To reduce the diff and potential for regressions, can we move the existing _lazywhere to stats._distn_infrastructure and use it there (with a note to use it only in the old distribution infrastructure and old distributions)? I don't foresee that code being updated for array API. I might give old distribution private methods array API support so they can be used with the new infrastructure, and at that time, I would swap out _lazywhere with apply_where.

8000

@@ -670,7 +668,7 @@ def tmean(a, limits=None, inclusive=(True, True), axis=None):
# explicit dtype specification required due to data-apis/array-api-compat#152
sum = xp.sum(a, axis=axis, dtype=a.dtype)
n = xp.sum(xp.asarray(~mask, dtype=a.dtype), axis=axis, dtype=a.dtype)
mean = _lazywhere(n != 0, (sum, n), xp.divide, xp.nan)
mean = xpx.apply_where(n != 0, operator.truediv, (sum, n), fill_value=xp.nan)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor
@mdhaber mdhaber Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't studied the reason yet, but change looks fine.

@crusaderky
Copy link
Contributor Author

Note that marray tests do not run in CI

They should be running in the array API job because we install marray there. Is that not happening?

You're correct, my bad.

@crusaderky
Copy link
Contributor Author

To reduce the diff and potential for regressions, can we move the existing _lazywhere to stats._distn_infrastructure and use it there

To clarify, are you asking me to revert to _lazywhere in

  • scipy.stats._continuous_distns
  • scipy.stats._discrete_distns
  • scipy.stats._distn_infrastructure

and leave xpx.apply_where in the rest of scipy.stats?

  • scipy.stats._levy_stable
  • scipy.stats._multivariate
  • scipy.stats._stats_py
  • scipy.stats._wilcoxon

(with a note to use it only in the old distribution infrastructure and old distributions)?

Sorry I'm not familiar with what you mean with "old" versus "new" infrastructure - could you explain, give me a pointer to the documentation, or give me the exact verbiage?

@crusaderky crusaderky marked this pull request as draft February 19, 2025 16:21
@mdhaber
Copy link
Contributor
mdhaber commented Feb 19, 2025

Thanks for asking. Yeah - very close. scipy.stats._continuous_distns, scipy.stats._discrete_distns, scipy.stats._distn_infrastructure, and scipy.stats._levy_stable. and scipy.stats._multivariate are "old" in contrast to scipy.stats._distribution_infrastructure, scipy.stats_probability_distribution, and scipy.stats._new_distributions (which need to be renamed at some point). So originally I would have asked for those to be left alone.

But you know what - since you've already done the work, let's just revert the changes in scipy.stats._distn_infrastructure, move _lazywhere there, and restrict use of _lazywhere to that. I'll review the rest of the changes here.

Never mind. Let me review more closely. If it's all very systematic, we can just change it all. I thought there were more changes in _distn_infrastructure, which is the one I really wanted to avoid, but there are only a few.

Comment on lines 872 to 877
(r != 0) | (k != 0),
lambda k, M, n, r:
(-betaln(k+1, r) + betaln(k+r, 1)
- betaln(n-k+1, M-r-n+1) + betaln(M-r-k+1, 1)
+ betaln(n+1, M-n+1) - betaln(M+1, 1)),
(k, M, n, r), fill_value=0.0)
Copy link
Contributor
@mdhaber mdhaber Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that is confusing me while reviewing these is that the third positional argument can be the arguments when fill_value is used, but the third positional argument is f2 elsewhere. In these case, please consider https://github.com/data-apis/array-api-extra/pull/141/files#r1962054546 or using keywords.

Copy link
Contributor
@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed most files and it all looks pretty good, but I'd like to address https://github.com/scipy/scipy/pull/22557#discussion_r1962042105/https://github.com/data-apis/array-api-extra/pull/141/files#r1962054546 before finishing up. Also, I'd feel more comfortable with the change if the test with hypothesis were run on apply_where (e.g. if it were brought over to array-api-extra). That was written pretty carefully IIRC.

Copy link
Member
@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does look quite good to me. One minor question inline. I agree with @mdhaber's review - can be merged once xpx.apply_where is in.

Slightly reworked, hopefully improved UI

Yep I think so!

@@ -692,11 +689,11 @@ def tmean(a, limits=None, inclusive=(True, True), axis=None):
# explicit dtype specification required due to data-apis/array-api-compat#152
sum = xp.sum(a, axis=axis, dtype=a.dtype)
n = xp.sum(xp.asarray(~mask, dtype=a.dtype), axis=axis, dtype=a.dtype)
mean = _lazywhere(n != 0, (sum, n), xp.divide, xp.nan)
mean = xpx.apply_where(n != 0, (sum, n), operator.truediv, fill_value=xp.nan)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did xp.divide not work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See data-apis/array-api-extra#160:

If you forget about the meta-namespace and just use xp in the lambdas, at the moment most things will keep working.
This is because accidentally several functions in the dask.array, numpy, and cupy namespaces are interoperable or even the same function. However you will find cases where this doesn't hold true and you need the correct namespace.

This will become a much bigger source of headaches in the future when dask around generic Array API compatible namespaces will become commonplace (note: Dask does NOT support them today).

This pattern repeats itself many, many times in scipy. At the moment there are only a handful of cases that are array API-aware, and they all use xp.divide, so the problem can be worked around by replacing it with operator.truediv. But if you look at scipy.stats in #22557 you'll find a miriad of calls to np. functions inside the lambdas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I remember that now. Given that the code now looks a little harder to understand and this trick won't actually work in other places, I'd probably keep it unchanged and have an issue for solving this problem - and just fail the Dask test in the meantime.

If that's too much churn and this one-line change is more pragmatic, then fine with me as well of course. Dask just has a bunch of issues with namespacing, and this feels like a hack that happens to work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK operator.truediv and xp.divide are identical though? It doesn't feel harder to read to me before or after the change?

@crusaderky crusaderky changed the title [DNM] ENH: Replace _lazywhere with xpx.apply_where ENH: Replace _lazywhere with xpx.apply_where Mar 18, 2025
@crusaderky crusaderky marked this pull request as ready for review March 18, 2025 23:22
@crusaderky
Copy link
Contributor Author

Test failures are unrelated. @mdhaber this is ready to merge!

@lucascolley lucascolley added this to the 1.16.0 milestone Mar 20, 2025
@mdhaber mdhaber merged commit 0a31577 into scipy:main Mar 22, 2025
39 of 41 checks passed
@crusaderky crusaderky deleted the apply_where branch March 23, 2025 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
array types Items related to array API support and input array validation (see gh-18286) maintenance Items related to regular maintenance tasks scipy._lib scipy.optimize scipy.special scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0