-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
ENH: When histogramming data with integer dtype, force bin width >= 1. #12150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to get the review going two initial thoughts:
- since matplotlib uses NumPy histogram, are there any cases of i.e., overlaying float and int histograms in a figure that one might worry about, even if a bit strange?
- the docstring for
histogram_bin_edges
notes that "methods to estimate the optimal number of bins are well founded in literature"- is there literature guidance on this matter? or just plain obvious here?
- if this is just obvious -- useful to check if i.e., R language
hist
behavior is similar?
numpy/lib/tests/test_histograms.py
Outdated
""" | ||
Test that bin width for integer data is at least 1. | ||
""" | ||
estimator_list = ['fd', 'scott', 'rice', 'sturges', 'auto'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like a good use case for a parametrized test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this follows test_simple_weighted below, if you want to convert all of them to parametrized that would better be done as a separate PR imo.
From a quick google search, R doesn't appear to do anything special with integer data (https://www.rdocumentation.org/packages/graphics/versions/3.5.1/topics/hist), but I'd still argue the proposed behavior is clearly an improvement over the current status. |
Let's get some other opinions. @eric-wieser Thoughts? Is there anyone else who does a lot of graphics work who can comment on this? It makes some sense to me, but I don't know what the effects would be, or what is bad about the current behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As is, this patch has somewhat alarming behavior:
- Asking for n bins will sometimes not actually give you n bins
- An explicitly passed floating point array of bin edges will be ignored and replaced with a set of uniform bins
Both of these are likely to be things that break downstream users.
For bins='auto'
, yoru suggestion seems fairly harmless though. I'm not sure how reasonable it is for the other bin estimators.
ba10cdf
to
0488587
Compare
Indeed, it was never the intent to override explicitly passed-in bins, but only the various estimators. |
Are you deliberately touching only |
0488587
to
79b8458
Compare
Indeed, putting the logic in |
79b8458
to
7c04855
Compare
7c04855
to
e53f683
Compare
Added tests showing that @eric-wieser's concerns above are handled. |
Kindly bumping (happy to continue the discussion as well if this needs more argumentation). |
We discussed this at a recent triage meeting and it looks reasonable. It needs a rebase and a release note. |
e53f683
to
7b38ee4
Compare
Thanks for reviving this; rebased it and added release notes. |
fa26bf9
to
36596ed
Compare
Bins of width < 1 don't make sense for integer data, they just add a bunch of spurious, unpopulated bins. (Perhaps an even better improvement would be to make sure that, when using integer data, the binwidth is also integer, so that each bin always covers the same number of possible values, but I guess that's possibly a more domain-specific issue.) Before the PR: In [1]: np.histogram_bin_edges(np.tile(np.arange(10), 1000), "auto") Out[1]: array([0. , 0.45, 0.9 , 1.35, 1.8 , 2.25, 2.7 , 3.15, 3.6 , 4.05, 4.5 , 4.95, 5.4 , 5.85, 6.3 , 6.75, 7.2 , 7.65, 8.1 , 8.55, 9. ]) After: In [1]: np.histogram_bin_edges(np.tile(np.arange(10), 1000), "auto") Out[1]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
36596ed
to
c63969c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @eric-wieser would you like to re-review?
We looked at this at the last triage meeting and I said I’d give it another look and merge if everything looks ok. I can’t think of any issues that previous reviews missed so I’m bringing this in. Thanks @anntzer, sorry for taking so long on this! |
Bins of width < 1 don't make sense for integer data, they just add a
bunch of spurious, unpopulated bins. (Perhaps an even better
improvement would be to make sure that, when using integer data, the
binwidth is also integer, so that each bin always covers the same number
of possible values, but I guess that's possibly a more domain-specific
issue.)
Before the PR:
After: