ENH: Added 'doane' and 'sqrt' estimators to np.histogram in numpy.function_base #7090

madphysicist · 2016-01-21T20:32:48Z

These are a couple of estimators I find myself using sometimes. They do not break existing code and tests show that they work sensibly.
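For readers who want to try the two estimators, the following is a minimal usage sketch. It assumes a NumPy release that already contains this change (so that `bins='sqrt'` and `bins='doane'` are accepted) and recent enough to provide `np.random.default_rng`; the sample data and variable names are illustrative only.

```python
import numpy as np

# Illustrative skewed sample; the seed and size are arbitrary choices.
rng = np.random.default_rng(0)
data = rng.exponential(size=1000)

# Pass the estimator name as the `bins` argument.
counts_sqrt, edges_sqrt = np.histogram(data, bins='sqrt')
counts_doane, edges_doane = np.histogram(data, bins='doane')

print(len(counts_sqrt), "bins from 'sqrt'")
print(len(counts_doane), "bins from 'doane'")
```

Both calls return the usual `(counts, bin_edges)` pair, so they drop into existing code that already calls `np.histogram`.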

seberg · 2016-01-24T09:20:03Z

numpy/lib/function_base.py

@@ -294,6 +327,15 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
        This estimator assumes normality of data and is too conservative for larger,
        non-normal datasets. This is the default method in R's `hist` method.

+    'Doane'
+        .. math:: n_h = \\left\\lceil 1 + \\log_{2}(n) + \\log_{2}\\left(1 + \\frac{\\left|g_1\\right|}{\\sigma_{g_1}}\\right) \\right\\rceil


Please try to keep the line shorter than 80 characters.
EDIT: if you would clean up the tests in that regard too, that would be nice, but I won't press the issue.

Fixed up the line width and some other stuff in the docs.

madphysicist · 2016-01-26T23:45:03Z

I read a little more carefully and reconciled the issues of numerical stability and excessive temp arrays. The updated version of the formula is now g1 = mean(((x - mu) / sigma)^3). This is computed using only one temporary array, just like the expanded formula, and is much more stable. I am now using the least expanded definition of gamma1 (a.k.a. g1/G1...) instead of the most expanded one from https://en.wikipedia.org/wiki/Skewness#Pearson.27s_moment_coefficient_of_skewness.
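As an illustration of the formula described above, here is a rough sketch, not the code from this PR. The function names `skewness_g1` and `doane_bin_count` are hypothetical, and the expression for the standard error of g1 is the one commonly quoted for Doane's rule; the in-place operations are one possible way to stay at a single temporary array.

```python
import numpy as np

def skewness_g1(x):
    """g1 = mean(((x - mu) / sigma)**3), reusing one temporary array."""
    x = np.asarray(x, dtype=float)
    sigma = x.std()
    if sigma == 0:
        return 0.0          # constant data: treat skewness as zero
    z = x - x.mean()        # the single temporary array
    z /= sigma              # standardize in place
    z **= 3                 # cube in place
    return float(z.mean())

def doane_bin_count(x):
    """Doane's bin count: ceil(1 + log2(n) + log2(1 + |g1| / sigma_g1))."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if n <= 2:
        return 1            # sigma_g1 is zero or undefined for tiny samples
    sigma_g1 = np.sqrt(6.0 * (n - 2) / ((n + 1.0) * (n + 3)))
    g1 = skewness_g1(x)
    return int(np.ceil(1 + np.log2(n) + np.log2(1 + abs(g1) / sigma_g1)))

print(skewness_g1([0, 0, 0, 0, 10]))
print(doane_bin_count([1, 2, 3, 4, 5]))
```

For symmetric data the skewness term vanishes and the count reduces to the Sturges value.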

I have also fixed up Scott's formula based on Wand's paper cited in the Cross Validated answer (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.9725). cbrt(24 * sqrt(pi)) == 3.490830212.... I will not bother trying to implement any of Wand's higher level estimators until I understand them better.
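To make the constant concrete, here is a small sketch of Scott's bin width written as h = sigma * cbrt(24 * sqrt(pi) / n). The names `SCOTT_FACTOR` and `scott_bin_width` are hypothetical and only for illustration; this is not claimed to be the exact code in the PR.

```python
import numpy as np

# The constant quoted above: cbrt(24 * sqrt(pi)), roughly 3.4908.
SCOTT_FACTOR = np.cbrt(24.0 * np.sqrt(np.pi))

def scott_bin_width(x):
    """Scott's normal-reference bin width for a 1-D sample."""
    x = np.asarray(x, dtype=float)
    return SCOTT_FACTOR * x.std() / np.cbrt(x.size)

print(SCOTT_FACTOR)
print(scott_bin_width([0, 0, 0, 0, 10]))
```

The width shrinks as n**(-1/3), so for a fixed data range the number of bins grows as n**(1/3).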

It is a good sign that none of the tests changed or failed for this commit.

homu · 2016-02-01T13:17:29Z

☔ The latest upstream changes (presumably #6656) made this pull request unmergeable. Please resolve the merge conflicts.

charris · 2016-02-01T18:29:37Z

Looks like the release notes have a conflict. Could you rebase?

madphysicist · 2016-02-01T21:50:44Z

Rebased and squashed.

madphysicist · 2016-02-05T19:48:06Z

Any further comments?

homu · 2016-02-07T17:40:44Z

☔ The latest upstream changes (presumably #7181) made this pull request unmergeable. Please resolve the merge conflicts.

madphysicist · 2016-02-11T13:43:58Z

Squawk.


madphysicist · 2016-02-12T02:49:18Z

This PR fortuitously fixes the following mishap: https://github.com/numpy/numpy/blob/master/numpy/lib/nanfunctions.py#L922. Not a reason to pull it in in and of itself, but I just noticed that after a rebase and thought I should mention it somewhere without opening another issue/PR.

seberg · 2016-02-13T14:52:45Z

OK, sorry for letting this sit. I am a bit overloaded generally right now.

Since the only real problem I have with it is "do we really need it", and it is only half public and does not seem to be a big maintenance burden, I will merge this.

If anyone thinks we are adding too many histogram estimators, or finds doane too weird (@josef-pkt if you want to check it go ahead, I never got around to checking the original paper, but I think I am willing to trust @madphysicist and this stackexchange guy by now), we can revert the stuff.

ENH: Added 'doane' and 'sqrt' estimators to np.histogram in numpy.function_base

josef-pkt · 2016-02-13T15:20:30Z

Sorry, I lost an unfinished reply a while ago.

roughly:
It has the wrong asymptotics which should be n**(1/3) IIRC
I looked at Wand's paper which derives a "statistical" plug-in bandwidth but requires a preliminary histogram and is considerably more computational work.
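To illustrate the asymptotics point, the sketch below compares a logarithmic bin count (Sturges, which Doane extends with a skewness term) against a count that grows as n**(1/3). The Rice rule, ceil(2 * n**(1/3)), is used here only as a convenient stand-in for cube-root growth; it is an assumption for illustration and not something proposed in this thread.

```python
import math

def sturges_bins(n):
    """Logarithmic growth: ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n) + 1)

def cube_root_bins(n):
    """Cube-root growth (Rice rule), used only as an illustration."""
    return math.ceil(2 * n ** (1.0 / 3.0))

for n in (100, 10_000, 1_000_000):
    print(n, sturges_bins(n), cube_root_bins(n))
```

The gap widens quickly with n, which is why log-based rules tend to oversmooth large samples.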

I thought some examples would be useful

However, I saw that the current auto option uses sturges as default for small samples. If that works well, then Doane should also work well and better in skewed samples.
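A quick way to compare these options on a small skewed sample is sketched below. It assumes a later NumPy release that provides `np.histogram_bin_edges` (which did not exist when this comment was written) and that `'auto'` takes the larger of the `'sturges'` and `'fd'` bin counts, as the NumPy documentation describes; the sample itself is arbitrary.

```python
import numpy as np

# Small, skewed, illustrative sample.
rng = np.random.default_rng(1)
sample = rng.exponential(size=50)

def n_bins(method):
    """Number of bins a given estimator name produces for `sample`."""
    return len(np.histogram_bin_edges(sample, bins=method)) - 1

n_sturges = n_bins('sturges')
n_fd = n_bins('fd')
n_auto = n_bins('auto')
n_doane = n_bins('doane')

print(n_sturges, n_fd, n_auto, n_doane)
```

Because Doane only adds a non-negative skewness term to the Sturges count, it should never produce fewer bins than Sturges on the same data.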

The Doane paper goes on a lot about finding "nice" numbers, rounding to round numbers, multiples of 5, 100, .... That sounds like a lot of effort for a very doubtful outcome, given that the R function that does something similar (I never tried) is criticized in the related references.

So, I don't have any objections to adding options, but they also won't be a huge improvement.
(and didn't bother to finish my comment after getting distracted with other things.)

What might be useful in general is to switch to "auto" as default.

seberg · 2016-02-13T15:35:21Z

Yeah, I had already got the gist that doane is not a big deal/improvement. I would like switching to "auto" as the default; it sounds so much more useful than the default of 10. But I am not sure it won't be too painful for downstream. If someone wished to attempt it, I think I could be fine with trying. But be prepared to have to undo it, or for the mailing list to disagree....

josef-pkt · 2016-02-13T15:35:41Z

One more "impression" I had looking at some references
There might be different use cases where for exploratory data analysis we want to learn something about the data where a histogram is similar to a kernel density estimate, and the "info-graphics" use case where we just want a nice picture without too much detail.

seberg · 2016-02-13T15:38:11Z

Sounds all interesting :). With the "auto" option, it could also be something that fits a bit more to downstream (i.e. plotting in matplotlib), though it might be a bit tricky to give a different bins keyword based on numpy version.

josef-pkt · 2016-02-13T16:24:23Z

It would also be possible to provide a FutureWarning, that users should explicitly specify the option.
My impression from reading examples across various places is that most users already have to specify the bins to get away from the small number.
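One possible shape for such a warning is sketched below as a thin wrapper. This is purely hypothetical: the wrapper name `histogram_warn_default`, the sentinel `_UNSET` and the message text are invented for illustration, and nothing like this is part of the PR or of NumPy.

```python
import warnings
import numpy as np

_UNSET = object()  # sentinel to detect "bins was not passed"

def histogram_warn_default(a, bins=_UNSET, **kwargs):
    """Hypothetical wrapper: warn when the caller relies on the default bins."""
    if bins is _UNSET:
        warnings.warn(
            "the default value of `bins` may change in the future; "
            "pass `bins` explicitly to keep the current behaviour",
            FutureWarning,
            stacklevel=2,
        )
        bins = 10  # current default, kept unchanged for now
    return np.histogram(a, bins=bins, **kwargs)

# Explicit bins: no warning is raised.
counts, edges = histogram_warn_default([1, 2, 2, 3, 5], bins=4)
print(counts)
```

Callers that already pass `bins` see no change, while those relying on the default get a notice before any behaviour switch.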

madphysicist mentioned this pull request Jan 21, 2016

ENH: Added 'doane' and 'sqrt' estimators to np.histogram in numpy.function_base #7083

Closed

charris added 01 - Enhancement component: numpy.lib labels Jan 21, 2016

seberg reviewed Jan 24, 2016

madphysicist force-pushed the hist-estimators branch from 0526cdd to a14bb52 Compare February 1, 2016 21:31

madphysicist mentioned this pull request Feb 6, 2016

MAINT: Cleanup for histogram bin estimator selection #7199

Merged

madphysicist force-pushed the hist-estimators branch 3 times, most recently from a14bb52 to ae8ab99 Compare February 11, 2016 03:01

madphysicist force-pushed the hist-estimators branch from ae8ab99 to a14bb52 Compare February 11, 2016 19:21

Added 'doane' and 'sqrt' estimators to np.histogram in numpy.function…

b8b5561

…_base

madphysicist force-pushed the hist-estimators branch from a14bb52 to b8b5561 Compare February 12, 2016 01:08

seberg added a commit that referenced this pull request Feb 13, 2016

Merge pull request #7090 from madphysicist/hist-estimators

a5c8529

ENH: Added 'doane' and 'sqrt' estimators to np.histogram in numpy.function_base

seberg merged commit a5c8529 into numpy:master Feb 13, 2016

homu mentioned this pull request Feb 13, 2016

MAINT/ENH: Support for weights and range when estimating optimal number of bins #6288

Closed

madphysicist deleted the hist-estimators branch February 13, 2016 18:58
