8000 ENH: Added 'doane' and 'sqrt' estimators to np.histogram in numpy.function_base by madphysicist · Pull Request #7090 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: Added 'doane' and 'sqrt' estimators to np.histogram in numpy.function_base #7090

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 13, 2016

Conversation

madphysicist
Copy link
Contributor

These are a couple of estimators I find myself using sometimes. They do not break existing code and tests show that they work sensibly.

@@ -294,6 +327,15 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
This estimator assumes normality of data and is too conservative for larger,
non-normal datasets. This is the default method in R's `hist` method.

'Doane'
.. math:: n_h = \\left\\lceil 1 + \\log _{2}(n) + \\log _{2}(1 + \\frac{\\left g_1 \\right}{\\sigma _{g_1}) \\right\\rceil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please try to keep the line shorter then 80 characters.
EDIT: if you would clean up the tests in that regard too, would be nice, but I won't press the issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed up the line width and some other stuff in the docs.

@madphysicist
Copy link
Contributor Author

I read a little more carefully and reconciled the issues numerical stability and excessive temp arrays. The updated version of the formula is now g1 = mean(((x - mu) / sigma)^3). This is computed using only one temporary array, just like the expanded formula, and is much more stable. I am now using the least expanded definition of gamma1 (a.k.a. g1/G1...) instead of the most expanded one 8000 from https://en.wikipedia.org/wiki/Skewness#Pearson.27s_moment_coefficient_of_skewness.

I have also fixed up Scott's formula based on Ward's paper cited in the Cross Validated answer (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.9725). cbrt(24 * sqrt(pi)) == 3.490830212.... I will not bother trying to implement any of Ward's higher level estimators until I understand them better.

It is a good sign that none of the tests changed or failed for this commit.

@homu
Copy link
Contributor
homu commented Feb 1, 2016

☔ The latest upstream changes (presumably #6656) made this pull request unmergeable. Please resolve the merge conflicts.

@charris
Copy link
Member
charris commented Feb 1, 2016

Looks like the release notes have a conflict. Could you rebase?

@madphysicist
Copy link
Contributor Author

Rebased and squashed.

@madphysicist
Copy link
Contributor Author

Any further comments?

@homu
Copy link
Contributor
homu commented Feb 7, 2016

☔ The latest upstream changes (presumably #7181) made this pull request unmergeable. Please resolve the merge conflicts.

@madphysicist madphysicist force-pushed the hist-estimators branch 3 times, most recently from a14bb52 to ae8ab99 Compare February 11, 2016 03:01
@madphysicist
Copy link
Contributor Author

Squawk.

@madphysicist
Copy link
Contributor Author

This PR fortuitously fixes the following mishap: https://github.com/numpy/numpy/blob/master/numpy/lib/nanfunctions.py#L922. Not a reason to pull it in in and of itself, but I just noticed that after a rebase and thought I should mention it somewhere without opening another issue/PR.

@seberg
Copy link
Member
seberg commented Feb 13, 2016

OK, sorry for letting this sit. I am a bit overloaded generally right now.

Since the only real problem I have with it is "do we really need it", and it is only half public, nor seems a big maintenance burden, I will merge this.

If anyone thinks we are adding too many histogram estimators, or finds doane too weird (@josef-pkt if you want to check it go ahead, I never got around checking the original paper, but I think I am willing to trust @madphysicist and this stackexchange guy by now), we can revert the stuff.

seberg added a commit that referenced this pull request Feb 13, 2016
ENH: Added 'doane' and 'sqrt' estimators to np.histogram in numpy.function_base
@seberg seberg merged commit a5c8529 into numpy:master Feb 13, 2016
@josef-pkt
Copy link

Sorry, I lost an unfinished reply a while ago.

roughly:
It has the wrong asymptotics which should be n**(1/3) IIRC
I looked at Wand's paper which derives a "statistical" plug-in bandwidth but requires a preliminary histogram and is considerably more computational work.

I thought some examples would be useful

However, I saw that the current auto option uses sturges as default for small samples. If that works well, then Doane should also work well and better in skewed samples.

The Doane paper goes on a lot about finding "nice" numbers, rounding to round numbers mutliple of 5, 100, .... That sounds like a lot of effort for a very doubtful outcome, given that the R function that does something similar (I never tried) is criticized in the related references.

So, I don't have any objections to adding options, but they also won't be a huge improvement.
(and didn't bother to finish my comment after getting distracted with other things.)

What might be useful in general is to switch to "auto" as default.

@seberg
Copy link
Member
seberg commented Feb 13, 2016

Yeah, I had already got the gist of, doane not being a big deal/improvement. I would like switching to "auto" as default, it sounds just so much more useful then the default of 10. But I am not sure it won't be too painful for downstream. If someone whished to attempt it, I think I could be fine with trying. But be prepared to having to undo it or the mailing list already disagreeing....

@josef-pkt
Copy link

One more "impression" I had looking at some references
There might be different use cases where for exploratory data analysis we want to learn something about the data where a histogram is similar to a kernel density estimate, and the "info-graphics" use case where we just want a nice picture without too much detail.

@seberg
Copy link
Member
seberg commented Feb 13, 2016

Sounds all interesting :). With the "auto" option, it could also be something that fits a bit more to downstream (i.e. plotting in matplotlib), though it might be a bit tricky to give a different bins keyword based on numpy version.

@josef-pkt
Copy link

It would also be possible to provide a FutureWarning, that users should explicitly specify the option.
My impression from reading examples across various places is that most users already have to specify the bins to get away from the small number.

@madphysicist madphysicist deleted the hist-estimators branch February 13, 2016 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0