BUG: fix histogram binning's consistency with numpy for outer edges#17391
BUG: fix histogram binning's consistency with numpy for outer edges#17391neutrinoceros wants to merge 2 commits intoastropy:mainfrom
Conversation
|
Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.
|
|
👋 Thank you for your draft pull request! Do you know that you can use |
f70867f to
eda890d
Compare
|
|
||
| # fmt: off | ||
| assert_allclose(bins, [-2.32503077, -1.89228844, -1.4595461, -1.02680377, -0.59406143, | ||
| assert_allclose(bins, [X_min, -1.89228844, -1.4595461, -1.02680377, -0.59406143, |
There was a problem hiding this comment.
I cannot tell from the diff. How much of a difference are we actually talking about? Maybe we shouldn't backport. Might introduce subtle and hard to track differences downstream.
There was a problem hiding this comment.
I'm testing X_min for the sake of clarity here but it's not a change with respect to the previous version, only the last edge (X_max) is incorrect on main.
There was a problem hiding this comment.
About backporting: I suppose this is technically an API change, but also a fix if we intend to be consistent with numpy, so I think a major release would be the ideal time to make that change.
There was a problem hiding this comment.
Well, RC is already out. So will we need RC2?
There was a problem hiding this comment.
Seems more a behavior change (hence API changes) than a bugfix. Was that causing a bug somehow ?
And btw, since I spent a few minutes to understand what the PR does, I'm not convinced by the need for this change: for the example here bins had the same size the modification, changing the size of bins might lead to subtle bugs for people that are not aware of this.
There was a problem hiding this comment.
The end goal is to delegate what we can to numpy, so here I'm "fixing" the inconsistency in our behavior as compared to theirs. But yeah, #8011 can arguably be triaged as a "bug" or a "feature request". It's apparent from the original request that there were some ambiguity there for @mwcraig too.
If it's too late for 7.0, maybe we could try to get this in early in 8.0's development ? If that's still too controversial somehow, I don't see a way forward and I guess we should close #8011 as won't fix.
There was a problem hiding this comment.
Same methods already gives different results it seems, but I'm no expert in this and maybe that's not a problem. But I'm not sure that "fixing" the bins afterwards is the best way to go and I don't know why we duplicates things from numpy (might be that there are some differences on purpose). I think @astrofrog implement some histogram methods here and in numpy so let's ping him 🙏
In [9]: counts, bins = histogram(X, bins="scott")
In [10]: counts2, bins2 = np.histogram(X, bins="scott")
In [11]: counts
Out[11]: array([ 2, 14, 27, 25, 16, 16])
In [12]: counts2
Out[12]: array([ 2, 14, 26, 26, 15, 17])
In [13]: bins
Out[13]:
array([-2.32503077, -1.59953424, -0.87403771, -0.14854117, 0.57695536,
1.3024519 , 2.02794843])
In [14]: bins2
Out[14]:
array([-2.32503077, -1.60379355, -0.88255632, -0.1613191 , 0.55991813,
1.28115536, 2.00239258])
In [15]: counts, bins = histogram(X, bins="freedman")
In [16]: counts2, bins2 = np.histogram(X, bins="fd")
In [17]: counts
Out[17]: array([ 2, 11, 16, 18, 22, 14, 13, 4])
In [18]: counts2
Out[18]: array([ 2, 8, 11, 21, 22, 14, 12, 10])
In [19]: bins
Out[19]:
array([-2.32503077, -1.74087192, -1.15671306, -0.5725542 , 0.01160465,
0.59576351, 1.17992237, 1.76408122, 2.34824008])
In [20]: bins2
Out[20]:
array([-2.32503077, -1.78410285, -1.24317494, -0.70224702, -0.1613191 ,
0.37960882, 0.92053674, 1.46146466, 2.00239258])
| # guarantee that outer edges are consistent with numpy's approach, | ||
| # it only ensures that data outside the range is excluded from | ||
| # calculation of the bin widths. | ||
| bins[0], bins[-1] = _get_outer_edges(a, range) |
There was a problem hiding this comment.
I thought the request in #8011 was to make the edges consistent with numpy. When does this function give different results from numpy?
There was a problem hiding this comment.
The last edge, as reported in #8011, and as output straight from estimators, may be larger than the maximum data point even when no explicit range is given. In contrast, Numpy guarantees that the last edge will be either the maximum data point or the right edge of the user provided range, and never anything else.
There was a problem hiding this comment.
Should we always use numpy's version? That could be a breaking change, but we can also consider it to be a bug fix.
There was a problem hiding this comment.
I'm not addressing the part of #8011 that proposes to delegate computing all edges to numpy here, because I think it'll be a more substantial change, but this is a first step in that direction.
|
Hi humans 👋 - this pull request hasn't had any new commits for approximately 4 months. I plan to close this in 30 days if the pull request doesn't have any new commits by then. In lieu of a stalled pull request, please consider closing this and open an issue instead if a reminder is needed to revisit in the future. Maintainers may also choose to add keep-open label to keep this PR open but it is discouraged unless absolutely necessary. If this PR still needs to be reviewed, as an author, you can rebase it to reset the clock. < 7DBD p dir="auto">If you believe I commented on this pull request incorrectly, please report this here. |
|
I had completely forgotten I opened #8011! Given how long we have gone without being consistent with what numpy does, I think this should count as an API change if it is implemented. I would lean towards going all-in with using numpy wherever possible in Astropy 8 as long we understand what differences that will cause. |
|
Hi humans 👋 - this pull request hasn't had any new commits for approximately 5 months. I plan to close this in 30 days if the pull request doesn't have any new commits by then. In lieu of a stalled pull request, please consider closing this and open an issue instead if a reminder is needed to revisit in the future. Maintainers may also choose to add keep-open label to keep this PR open but it is discouraged unless absolutely necessary. If this PR still needs to be reviewed, as an author, you can rebase it to reset the clock. If you believe I commented on this pull request incorrectly, please report this here. |
Description
This addresses the bug part from #8011
Since the functionality we need to imitate from numpy already exists as a function, and it's not public, it seemed to me that the most robust option would be to vendor it from there. Maybe we want to import it instead, though that would require a fallback mechanism too so I'm keeping it simple for the first round of review at least.