ENH: When histogramming data with integer dtype, force bin width >= 1. #12150

anntzer · 2018-10-11T16:23:33Z

Bins of width < 1 don't make sense for integer data, they just add a
bunch of spurious, unpopulated bins. (Perhaps an even better
improvement would be to make sure that, when using integer data, the
binwidth is also integer, so that each bin always covers the same number
of possible values, but I guess that's possibly a more domain-specific
issue.)

Before the PR:

In [1]: np.histogram_bin_edges(np.tile(np.arange(10), 1000), "auto")
Out[1]:
array([0.  , 0.45, 0.9 , 1.35, 1.8 , 2.25, 2.7 , 3.15, 3.6 , 4.05, 4.5 ,
       4.95, 5.4 , 5.85, 6.3 , 6.75, 7.2 , 7.65, 8.1 , 8.55, 9.  ])

After:

In [1]: np.histogram_bin_edges(np.tile(np.arange(10), 1000), "auto")
Out[1]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

tylerjereddy

Just to get the review going two initial thoughts:

since matplotlib uses NumPy histogram, are there any cases of i.e., overlaying float and int histograms in a figure that one might worry about, even if a bit strange?
the docstring for histogram_bin_edges notes that "methods to estimate the optimal number of bins are well founded in literature"
- is there literature guidance on this matter? or just plain obvious here?
- if this is just obvious -- useful to check if i.e., R language hist behavior is similar?

tylerjereddy · 2018-10-18T23:31:32Z

numpy/lib/tests/test_histograms.py

+        """
+        Test that bin width for integer data is at least 1.
+        """
+        estimator_list = ['fd', 'scott', 'rice', 'sturges', 'auto']


looks like a good use case for a parametrized test

this follows test_simple_weighted below, if you want to convert all of them to parametrized that would better be done as a separate PR imo.

anntzer · 2018-10-19T08:09:54Z

From a quick google search, R doesn't appear to do anything special with integer data (https://www.rdocumentation.org/packages/graphics/versions/3.5.1/topics/hist), but I'd still argue the proposed behavior is clearly an improvement over the current status.
Multi-dataset autobinning "doesn't really work" as of now for Matplotlib (matplotlib/matplotlib#8638 and linked issues) so I wouldn't worry too much about it.

charris · 2018-11-13T00:31:37Z

Let's get some other opinions. @eric-wieser Thoughts? Is there anyone else who does a lot of graphics work who can comment on this? It makes some sense to me, but I don't know what the effects would be, or what is bad about the current behavior.

eric-wieser

As is, this patch has somewhat alarming behavior:

Asking for n bins will sometimes not actually give you n bins
An explicitly passed floating point array of bin edges will be ignored and replaced with a set of uniform bins

Both of these are likely to be things that break downstream users.

For bins='auto', yoru suggestion seems fairly harmless though. I'm not sure how reasonable it is for the other bin estimators.

anntzer · 2018-11-13T08:20:30Z

Indeed, it was never the intent to override explicitly passed-in bins, but only the various estimators.
Thanks for the catch, fixed that (using it for all "string" estimators, but I'm fine if you prefer only doing so for "auto", just let me know).

eric-wieser · 2018-11-13T08:24:51Z

Are you deliberately touching only histogram_bin_edges and not the common code path shared between it and histogram (_get_bin_edges)? I can't think of a good argument to make them different.

Indeed, putting the logic in _get_bin_edges is much better... done.

anntzer · 2019-01-05T11:23:21Z

Added tests showing that @eric-wieser's concerns above are handled.

anntzer · 2019-02-07T08:51:59Z

Kindly bumping (happy to continue the discussion as well if this needs more argumentation).

mattip · 2024-04-17T19:33:21Z

We discussed this at a recent triage meeting and it looks reasonable. It needs a rebase and a release note.

anntzer · 2024-04-18T09:07:28Z

Thanks for reviving this; rebased it and added release notes.

Bins of width < 1 don't make sense for integer data, they just add a bunch of spurious, unpopulated bins. (Perhaps an even better improvement would be to make sure that, when using integer data, the binwidth is also integer, so that each bin always covers the same number of possible values, but I guess that's possibly a more domain-specific issue.) Before the PR: In [1]: np.histogram_bin_edges(np.tile(np.arange(10), 1000), "auto") Out[1]: array([0. , 0.45, 0.9 , 1.35, 1.8 , 2.25, 2.7 , 3.15, 3.6 , 4.05, 4.5 , 4.95, 5.4 , 5.85, 6.3 , 6.75, 7.2 , 7.65, 8.1 , 8.55, 9. ]) After: In [1]: np.histogram_bin_edges(np.tile(np.arange(10), 1000), "auto") Out[1]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

mattip

LGTM. @eric-wieser would you like to re-review?

ngoldbaum · 2024-05-04T00:36:27Z

We looked at this at the last triage meeting and I said I’d give it another look and merge if everything looks ok. I can’t think of any issues that previous reviews missed so I’m bringing this in. Thanks @anntzer, sorry for taking so long on this!

Old review

charris added 01 - Enhancement component: numpy.lib labels Oct 11, 2018

tylerjereddy reviewed Oct 18, 2018

View reviewed changes

eric-wieser previously requested changes Nov 13, 2018

View reviewed changes

anntzer force-pushed the histogram-int-dtype branch from ba10cdf to 0488587 Compare November 13, 2018 08:19

anntzer force-pushed the histogram-int-dtype branch from 0488587 to 79b8458 Compare November 13, 2018 08:37

Copy link

Contributor Author

anntzer commented Nov 13, 2018

Indeed, putting the logic in _get_bin_edges is much better... done.

anntzer force-pushed the histogram-int-dtype branch from 79b8458 to 7c04855 Compare November 13, 2018 09:31

anntzer force-pushed the histogram-int-dtype branch from 7c04855 to e53f683 Compare January 5, 2019 11:23

anntzer mentioned this pull request Feb 4, 2020

Request: change hist bins default to 'auto' matplotlib/matplotlib#16403

Closed

Base automatically changed from master to main March 4, 2021 02:04

seberg added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Apr 17, 2024

anntzer force-pushed the histogram-int-dtype branch from e53f683 to 7b38ee4 Compare April 18, 2024 09:06

anntzer force-pushed the histogram-int-dtype branch 2 times, most recently from fa26bf9 to 36596ed Compare April 18, 2024 11:35

anntzer force-pushed the histogram-int-dtype branch from 36596ed to c63969c Compare April 18, 2024 13:58

mattip approved these changes Apr 18, 2024

View reviewed changes

ngoldbaum removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label May 4, 2024

ngoldbaum approved these changes May 4, 2024

View reviewed changes

ngoldbaum merged commit 53cfea9 into numpy:main May 4, 2024

anntzer deleted the histogram-int-dtype branch May 4, 2024 09:28

Donnerstagnacht mentioned this pull request Sep 7, 2024

Type checking support for new Numpy ndarray static shape types microsoft/pyright#8921

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

ENH: When histogramming data with integer dtype, force bin width >= 1. #12150

ENH: When histogramming data with integer dtype, force bin width >= 1. #12150

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ENH: When histogramming data with integer dtype, force bin width >= 1. #12150

ENH: When histogramming data with integer dtype, force bin width >= 1. #12150

Uh oh!

Conversation

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!