-
-
Notifications
You must be signed in to change notification settings - Fork 11.6k
ENH: When histogramming data with integer dtype, force bin width >= 1. #12150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to get the review going two initial thoughts:
- since matplotlib uses NumPy histogram, are there any cases of i.e., overlaying float and int histograms in a figure that one might worry about, even if a bit strange?
- the docstring for
histogram_bin_edgesnotes that "methods to estimate the optimal number of bins are well founded in literature"- is there literature guidance on this matter? or just plain obvious here?
- if this is just obvious -- useful to check if i.e., R language
histbehavior is similar?
numpy/lib/tests/test_histograms.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like a good use case for a parametrized test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this follows test_simple_weighted below, if you want to convert all of them to parametrized that would better be done as a separate PR imo.
|
From a quick google search, R doesn't appear to do anything special with integer data (https://www.rdocumentation.org/packages/graphics/versions/3.5.1/topics/hist), but I'd still argue the proposed behavior is clearly an improvement over the current status. |
|
Let's get some other opinions. @eric-wieser Thoughts? Is there anyone else who does a lot of graphics work who can comment on this? It makes some sense to me, but I don't know what the effects would be, or what is bad about the current behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As is, this patch has somewhat alarming behavior:
- Asking for n bins will sometimes not actually give you n bins
- An explicitly passed floating point array of bin edges will be ignored and replaced with a set of uniform bins
Both of these are likely to be things that break downstream users.
For bins='auto', yoru suggestion seems fairly harmless though. I'm not sure how reasonable it is for the other bin estimators.
ba10cdf to
0488587
Compare
|
Indeed, it was never the intent to override explicitly passed-in bins, but only the various estimators. |
|
Are you deliberately touching only |
0488587 to
79b8458
Compare
|
Indeed, putting the logic in |
79b8458 to
7c04855
Compare
7c04855 to
e53f683
Compare
|
Added tests showing that @eric-wieser's concerns above are handled. |
|
Kindly bumping (happy to continue the discussion as well if this needs more argumentation). |
|
We discussed this at a recent triage meeting and it looks reasonable. It needs a rebase and a release note. |
e53f683 to
7b38ee4
Compare
|
Thanks for reviving this; rebased it and added release notes. |
fa26bf9 to
36596ed
Compare
Bins of width < 1 don't make sense for integer data, they just add a
bunch of spurious, unpopulated bins. (Perhaps an even better
improvement would be to make sure that, when using integer data, the
binwidth is also integer, so that each bin always covers the same number
of possible values, but I guess that's possibly a more domain-specific
issue.)
Before the PR:
In [1]: np.histogram_bin_edges(np.tile(np.arange(10), 1000), "auto")
Out[1]:
array([0. , 0.45, 0.9 , 1.35, 1.8 , 2.25, 2.7 , 3.15, 3.6 , 4.05, 4.5 ,
4.95, 5.4 , 5.85, 6.3 , 6.75, 7.2 , 7.65, 8.1 , 8.55, 9. ])
After:
In [1]: np.histogram_bin_edges(np.tile(np.arange(10), 1000), "auto")
Out[1]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
36596ed to
c63969c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @eric-wieser would you like to re-review?
|
We looked at this at the last triage meeting and I said I’d give it another look and merge if everything looks ok. I can’t think of any issues that previous reviews missed so I’m bringing this in. Thanks @anntzer, sorry for taking so long on this! |
Bins of width < 1 don't make sense for integer data, they just add a
bunch of spurious, unpopulated bins. (Perhaps an even better
improvement would be to make sure that, when using integer data, the
binwidth is also integer, so that each bin always covers the same number
of possible values, but I guess that's possibly a more domain-specific
issue.)
Before the PR:
After: