-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
ENH: Added 'doane' and 'sqrt' estimators to np.histogram in numpy.function_base #7090
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -294,6 +327,15 @@ def histogram(a, bins=10, range=None, normed=False, weights=None, | |||
This estimator assumes normality of data and is too conservative for larger, | |||
non-normal datasets. This is the default method in R's `hist` method. | |||
|
|||
'Doane' | |||
.. math:: n_h = \\left\\lceil 1 + \\log _{2}(n) + \\log _{2}(1 + \\frac{\\left g_1 \\right}{\\sigma _{g_1}) \\right\\rceil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try to keep the line shorter then 80 characters.
EDIT: if you would clean up the tests in that regard too, would be nice, but I won't press the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed up the line width and some other stuff in the docs.
I read a little more carefully and reconciled the issues numerical stability and excessive temp arrays. The updated version of the formula is now I have also fixed up Scott's formula based on Ward's paper cited in the Cross Validated answer (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.9725). It is a good sign that none of the tests changed or failed for this commit. |
☔ The latest upstream changes (presumably #6656) made this pull request unmergeable. Please resolve the merge conflicts. |
Looks like the release notes have a conflict. Could you rebase? |
0526cdd
to
a14bb52
Compare
Rebased and squashed. |
Any further comments? |
☔ The latest upstream changes (presumably #7181) made this pull request unmergeable. Please resolve the merge conflicts. |
a14bb52
to
ae8ab99
Compare
Squawk. |
ae8ab99
to
a14bb52
Compare
a14bb52
to
b8b5561
Compare
This PR fortuitously fixes the following mishap: https://github.com/numpy/numpy/blob/master/numpy/lib/nanfunctions.py#L922. Not a reason to pull it in in and of itself, but I just noticed that after a rebase and thought I should mention it somewhere without opening another issue/PR. |
OK, sorry for letting this sit. I am a bit overloaded generally right now. Since the only real problem I have with it is "do we really need it", and it is only half public, nor seems a big maintenance burden, I will merge this. If anyone thinks we are adding too many histogram estimators, or finds doane too weird (@josef-pkt if you want to check it go ahead, I never got around checking the original paper, but I think I am willing to trust @madphysicist and this stackexchange guy by now), we can revert the stuff. |
ENH: Added 'doane' and 'sqrt' estimators to np.histogram in numpy.function_base
Sorry, I lost an unfinished reply a while ago. roughly: I thought some examples would be useful However, I saw that the current The Doane paper goes on a lot about finding "nice" numbers, rounding to round numbers mutliple of 5, 100, .... That sounds like a lot of effort for a very doubtful outcome, given that the R function that does something similar (I never tried) is criticized in the related references. So, I don't have any objections to adding options, but they also won't be a huge improvement. What might be useful in general is to switch to "auto" as default. |
Yeah, I had already got the gist of, doane not being a big deal/improvement. I would like switching to "auto" as default, it sounds just so much more useful then the default of 10. But I am not sure it won't be too painful for downstream. If someone whished to attempt it, I think I could be fine with trying. But be prepared to having to undo it or the mailing list already disagreeing.... |
One more "impression" I had looking at some references |
Sounds all interesting :). With the "auto" option, it could also be something that fits a bit more to downstream (i.e. plotting in matplotlib), though it might be a bit tricky to give a different bins keyword based on numpy version. |
It would also be possible to provide a FutureWarning, that users should explicitly specify the option. |
These are a couple of estimators I find myself using sometimes. They do not break existing code and tests show that they work sensibly.