8000 BUG: fix histogram binning's consistency with numpy for outer edges by neutrinoceros · Pull Request #17391 · astropy/astropy · GitHub
[go: up one dir, main page]

Skip to content

BUG: fix histogram binning's consistency with numpy for outer edges#17391

Open
neutrinoceros wants to merge 2 commits intoastropy:mainfrom
neutrinoceros:stats/bug/8011/fix_histogram_binning_range_overshoot
Open

BUG: fix histogram binning's consistency with numpy for outer edges#17391
neutrinoceros wants to merge 2 commits intoastropy:mainfrom
neutrinoceros:stats/bug/8011/fix_histogram_binning_range_overshoot

Conversation

@neutrinoceros
Copy link
Contributor

Description

This addresses the bug part from #8011
Since the functionality we need to imitate from numpy already exists as a function, and it's not public, it seemed to me that the most robust option would be to vendor it from there. Maybe we want to import it instead, though that would require a fallback mechanism too so I'm keeping it simple for the first round of review at least.

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

@github-actions github-actions bot added the stats label Nov 14, 2024
@github-actions
Copy link
Contributor

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@github-actions
Copy link
Contributor

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@neutrinoceros neutrinoceros added this to the v7.0.0 milestone Nov 14, 2024

# fmt: off
assert_allclose(bins, [-2.32503077, -1.89228844, -1.4595461, -1.02680377, -0.59406143,
assert_allclose(bins, [X_min, -1.89228844, -1.4595461, -1.02680377, -0.59406143,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot tell from the diff. How much of a difference are we actually talking about? Maybe we shouldn't backport. Might introduce subtle and hard to track differences downstream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm testing X_min for the sake of clarity here but it's not a change with respect to the previous version, only the last edge (X_max) is incorrect on main.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About backporting: I suppose this is technically an API change, but also a fix if we intend to be consistent with numpy, so I think a major release would be the ideal time to make that change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, RC is already out. So will we need RC2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's ask @saimn and @mwcraig for inputs here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems more a behavior change (hence API changes) than a bugfix. Was that causing a bug somehow ?
And btw, since I spent a few minutes to understand what the PR does, I'm not convinced by the need for this change: for the example here bins had the same size the modification, changing the size of bins might lead to subtle bugs for people that are not aware of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The end goal is to delegate what we can to numpy, so here I'm "fixing" the inconsistency in our behavior as compared to theirs. But yeah, #8011 can arguably be triaged as a "bug" or a "feature request". It's apparent from the original request that there were some ambiguity there for @mwcraig too.
If it's too late for 7.0, maybe we could try to get this in early in 8.0's development ? If that's still too controversial somehow, I don't see a way forward and I guess we should close #8011 as won't fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same methods already gives different results it seems, but I'm no expert in this and maybe that's not a problem. But I'm not sure that "fixing" the bins afterwards is the best way to go and I don't know why we duplicates things from numpy (might be that there are some differences on purpose). I think @astrofrog implement some histogram methods here and in numpy so let's ping him 🙏

In [9]: counts, bins = histogram(X, bins="scott")

In [10]: counts2, bins2 = np.histogram(X, bins="scott")

In [11]: counts
Out[11]: array([ 2, 14, 27, 25, 16, 16])

In [12]: counts2
Out[12]: array([ 2, 14, 26, 26, 15, 17])

In [13]: bins
Out[13]: 
array([-2.32503077, -1.59953424, -0.87403771, -0.14854117,  0.57695536,
        1.3024519 ,  2.02794843])

In [14]: bins2
Out[14]: 
array([-2.32503077, -1.60379355, -0.88255632, -0.1613191 ,  0.55991813,
        1.28115536,  2.00239258])

In [15]: counts, bins = histogram(X, bins="freedman")

In [16]: counts2, bins2 = np.histogram(X, bins="fd")

In [17]: counts
Out[17]: array([ 2, 11, 16, 18, 22, 14, 13,  4])

In [18]: counts2
Out[18]: array([ 2,  8, 11, 21, 22, 14, 12, 10])

In [19]: bins
Out[19]: 
array([-2.32503077, -1.74087192, -1.15671306, -0.5725542 ,  0.01160465,
        0.59576351,  1.17992237,  1.76408122,  2.34824008])

In [20]: bins2
Out[20]: 
array([-2.32503077, -1.78410285, -1.24317494, -0.70224702, -0.1613191 ,
        0.37960882,  0.92053674,  1.46146466,  2.00239258])

@pllim pllim modified the milestones: v7.0.0, v7.1.0 Nov 14, 2024
# guarantee that outer edges are consistent with numpy's approach,
# it only ensures that data outside the range is excluded from
# calculation of the bin widths.
bins[0], bins[-1] = _get_outer_edges(a, range)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the request in #8011 was to make the edges consistent with numpy. When does this function give different results from numpy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last edge, as reported in #8011, and as output straight from estimators, may be larger than the maximum data point even when no explicit range is given. In contrast, Numpy guarantees that the last edge will be either the maximum data point or the right edge of the user provided range, and never anything else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we always use numpy's version? That could be a breaking change, but we can also consider it to be a bug fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not addressing the part of #8011 that proposes to delegate computing all edges to numpy here, because I think it'll be a more substantial change, but this is a first step in that direction.

@github-actions github-actions bot added the Close? Tell stale bot that this issue/PR is stale label Apr 14, 2025
@github-actions
Copy link
Contributor

Hi humans 👋 - this pull request hasn't had any new commits for approximately 4 months. I plan to close this in 30 days if the pull request doesn't have any new commits by then.

In lieu of a stalled pull request, please consider closing this and open an issue instead if a reminder is needed to revisit in the future. Maintainers may also choose to add keep-open label to keep this PR open but it is discouraged unless absolutely necessary.

If this PR still needs to be reviewed, as an author, you can rebase it to reset the clock.

< 7DBD p dir="auto">If you believe I commented on this pull request incorrectly, please report this here.

@mwcraig
Copy link
Member
mwcraig commented Apr 14, 2025

I had completely forgotten I opened #8011! Given how long we have gone without being consistent with what numpy does, I think this should count as an API change if it is implemented.

I would lean towards going all-in with using numpy wherever possible in Astropy 8 as long we understand what differences that will cause.

@neutrinoceros neutrinoceros removed the Close? Tell stale bot that this issue/PR is stale label Apr 15, 2025
@github-actions github-actions bot added the Close? Tell stale bot that this issue/PR is stale label Apr 16, 2025
@github-actions
Copy link
Contributor

Hi humans 👋 - this pull request hasn't had any new commits for approximately 5 months. I plan to close this in 30 days if the pull request doesn't have any new commits by then.

In lieu of a stalled pull request, please consider closing this and open an issue instead if a reminder is needed to revisit in the future. Maintainers may also choose to add keep-open label to keep this PR open but it is discouraged unless absolutely necessary.

If this PR still needs to be reviewed, as an author, you can rebase it to reset the clock.

If you believe I commented on this pull request incorrectly, please report this here.

@astrofrog astrofrog modified the milestones: v7.1.0, v7.2.0 Apr 28, 2025
@neutrinoceros neutrinoceros added keep-open and removed Close? Tell stale bot that this issue/PR is stale labels Apr 28, 2025
8E0B
@astrofrog astrofrog modified the milestones: v7.2.0, v8.0.0 Nov 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants

0