8000 ENH: When histogramming data with integer dtype, force bin width >= 1. by anntzer · Pull Request #12150 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: When histogramming data with integer dtype, force bin width >= 1. #12150

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 4, 2024

Conversation

anntzer
Copy link
Contributor
@anntzer anntzer commented Oct 11, 2018

Bins of width < 1 don't make sense for integer data, they just add a
bunch of spurious, unpopulated bins. (Perhaps an even better
improvement would be to make sure that, when using integer data, the
binwidth is also integer, so that each bin always covers the same number
of possible values, but I guess that's possibly a more domain-specific
issue.)

Before the PR:

In [1]: np.histogram_bin_edges(np.tile(np.arange(10), 1000), "auto")
Out[1]:
array([0.  , 0.45, 0.9 , 1.35, 1.8 , 2.25, 2.7 , 3.15, 3.6 , 4.05, 4.5 ,
       4.95, 5.4 , 5.85, 6.3 , 6.75, 7.2 , 7.65, 8.1 , 8.55, 9.  ])

After:

In [1]: np.histogram_bin_edges(np.tile(np.arange(10), 1000), "auto")
Out[1]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

Copy link
Contributor
@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to get the review going two initial thoughts:

  • since matplotlib uses NumPy histogram, are there any cases of i.e., overlaying float and int histograms in a figure that one might worry about, even if a bit strange?
  • the docstring for histogram_bin_edges notes that "methods to estimate the optimal number of bins are well founded in literature"
    • is there literature guidance on this matter? or just plain obvious here?
    • if this is just obvious -- useful to check if i.e., R language hist behavior is similar?

"""
Test that bin width for integer data is at least 1.
"""
estimator_list = ['fd', 'scott', 'rice', 'sturges', 'auto']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a good use case for a parametrized test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this follows test_simple_weighted below, if you want to convert all of them to parametrized that would better be done as a separate PR imo.

@anntzer
Copy link
Contributor Author
anntzer commented Oct 19, 2018

From a quick google search, R doesn't appear to do anything special with integer data (https://www.rdocumentation.org/packages/graphics/versions/3.5.1/topics/hist), but I'd still argue the proposed behavior is clearly an improvement over the current status.
Multi-dataset autobinning "doesn't really work" as of now for Matplotlib (matplotlib/matplotlib#8638 and linked issues) so I wouldn't worry too much about it.

@charris
Copy link
Member
charris commented Nov 13, 2018

Let's get some other opinions. @eric-wieser Thoughts? Is there anyone else who does a lot of graphics work who can comment on this? It makes some sense to me, but I don't know what the effects would be, or what is bad about the current behavior.

Copy link
Member
@eric-wieser eric-wieser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As is, this patch has somewhat alarming behavior:

  • Asking for n bins will sometimes not actually give you n bins
  • An explicitly passed floating point array of bin edges will be ignored and replaced with a set of uniform bins

Both of these are likely to be things that break downstream users.

For bins='auto', yoru suggestion seems fairly harmless though. I'm not sure how reasonable it is for the other bin estimators.

@anntzer anntzer force-pushed the histogram-int-dtype branch from ba10cdf to 0488587 Compare November 13, 2018 08:19
@anntzer
Copy link
Contributor Author
anntzer commented Nov 13, 2018

Indeed, it was never the intent to override explicitly passed-in bins, but only the various estimators.
Thanks for the catch, fixed that (using it for all "string" estimators, but I'm fine if you prefer only doing so for "auto", just let me know).

@eric-wieser
Copy link
Member

Are you deliberately touching only histogram_bin_edges and not the common code path shared between it and histogram (_get_bin_edges)? I can't think of a good argument to make them different.

@anntzer anntzer force-pushed the histogram-int-dtype branch from 0488587 to 79b8458 Compare November 13, 2018 08:37
@anntzer
Copy link
Contributor Author
anntzer commented Nov 13, 2018

Indeed, putting the logic in _get_bin_edges is much better... done.

@anntzer anntzer force-pushed the histogram-int-dtype branch from 79b8458 to 7c04855 Compare November 13, 2018 09:31
@anntzer anntzer force-pushed the histogram-int-dtype branch from 7c04855 to e53f683 Compare January 5, 2019 11:23
@anntzer
Copy link
Contributor Author
anntzer commented Jan 5, 2019

Added tests showing that @eric-wieser's concerns above are handled.

@anntzer
Copy link
Contributor Author
anntzer commented Feb 7, 2019

Kindly bumping (happy to continue the discussion as well if this needs more argumentation).

Base automatically changed from master to main March 4, 2021 02:04
@seberg seberg added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Apr 17, 2024
@mattip
Copy link
Member
mattip commented Apr 17, 2024

We discussed this at a recent triage meeting and it looks reasonable. It needs a rebase and a release note.

@anntzer anntzer force-pushed the histogram-int-dtype branch from e53f683 to 7b38ee4 Compare April 18, 2024 09:06
@anntzer
Copy link
Contributor Author
anntzer commented Apr 18, 2024

Thanks for reviving this; rebased it and added release notes.

@anntzer anntzer force-pushed the histogram-int-dtype branch 2 times, most recently from fa26bf9 to 36596ed Compare April 18, 2024 11:35
Bins of width < 1 don't make sense for integer data, they just add a
bunch of spurious, unpopulated bins.  (Perhaps an even better
improvement would be to make sure that, when using integer data, the
binwidth is also integer, so that each bin always covers the same number
of possible values, but I guess that's possibly a more domain-specific
issue.)

Before the PR:

    In [1]: np.histogram_bin_edges(np.tile(np.arange(10), 1000), "auto")
    Out[1]:
    array([0.  , 0.45, 0.9 , 1.35, 1.8 , 2.25, 2.7 , 3.15, 3.6 , 4.05, 4.5 ,
           4.95, 5.4 , 5.85, 6.3 , 6.75, 7.2 , 7.65, 8.1 , 8.55, 9.  ])

After:

    In [1]: np.histogram_bin_edges(np.tile(np.arange(10), 1000), "auto")
    Out[1]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
@anntzer anntzer force-pushed the histogram-int-dtype branch from 36596ed to c63969c Compare April 18, 2024 13:58
Copy link
Member
@mattip mattip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @eric-wieser would you like to re-review?

@ngoldbaum ngoldbaum removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label May 4, 2024
@ngoldbaum
Copy link
Member

We looked at this at the last triage meeting and I said I’d give it another look and merge if everything looks ok. I can’t think of any issues that previous reviews missed so I’m bringing this in. Thanks @anntzer, sorry for taking so long on this!

@ngoldbaum ngoldbaum merged commit 53cfea9 into numpy:main May 4, 2024
65 checks passed
@anntzer anntzer deleted the histogram-int-dtype branch May 4, 2024 09:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants
0