ENH: Add 'stone' estimator to np.histogram #8923
Conversation
numpy/lib/function_base.py
Outdated
```python
hh = ptp_x / nbins
return (2 - (n + 1) * (np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)) / (n - 1) / hh

nbins = min((jhat(x, nbins), nbins) for nbins in np.arange(1, max(n // 4, 2)))[1]
```
Looks like you're looking for this here:

```python
min(np.arange(1, max(n // 4, 2)), key=lambda nbins: jhat(x, nbins))
```
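As a side-by-side sketch of the two idioms (with a toy objective standing in for `jhat`, so this is illustrative only):

```python
# Toy objective standing in for jhat(x, nbins); minimized at nbins == 3.
def f(nbins):
    return (nbins - 3) ** 2

candidates = range(1, 10)

# Tuple trick: min() compares (f(nbins), nbins) pairs, then [1] extracts nbins.
best_tuple = min((f(nbins), nbins) for nbins in candidates)[1]

# key= version: same result, but clearer and with no throwaway tuples.
best_key = min(candidates, key=f)

assert best_tuple == best_key == 3
```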
numpy/lib/function_base.py
Outdated
```python
return ptp_x / nbins




```
Should only be two blank lines here
numpy/lib/function_base.py
Outdated
```
Histogram bin estimator based on minimizing the estimated integrated squared error (ISE).

The number of bins is chosen by minimizing the estimated ISE against the unknown true distribution.
The ISE is estimated using cross-validation and can be regarded as a generalization of Scott's rule.
```
If this is based on Scott's rule, then it would make more sense further down in the file, below that rule.
(Below `_hist_bin_scott`)
numpy/lib/function_base.py
Outdated
```
The number of bins is chosen by minimizing the estimated ISE against the unknown true distribution.
The ISE is estimated using cross-validation and can be regarded as a generalization of Scott's rule.
https://en.wikipedia.org/wiki/Histogram
```
Link to a relevant section, not the whole histogram page
numpy/lib/function_base.py
Outdated
```python
def jhat(x, nbins):
    hh = ptp_x / nbins
    return (2 - (n + 1) * (np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)) / (n - 1) / hh
```
I'm a little suspicious of `histogram` itself, but without any of its other arguments. Specifically, not having `weights` available seems suspect.

Edit: Actually this is fine, as the `weights` parameter is forbidden for custom estimators. It'd be nice if you had it available here though, since it looks pretty easy to add support for.
Probably not; the expression is derived via leave-one-out cross-validation.
@eric-wieser, is there a reason why weights are not allowed for any custom estimators? Would it be a good idea to enable the `weights` parameter for custom estimators that can support it?
numpy/lib/function_base.py
Outdated
```python
def jhat(x, nbins):
    hh = ptp_x / nbins
    return (2 - (n + 1) * (np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)) / (n - 1) / hh
```
`N_k = np.histogram(x, bins=nbins)[0]` followed by `N_k.dot(N_k)` is likely faster than `(N_k**2).sum()`.
You also might want to divide through by `n` before doing `dot`, in order to avoid integer overflow.
Do you mean `N_k = np.histogram(x, bins=nbins)[0] / n`?
Yeah, I'm thinking you want to replace:

```python
(np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)
```

with the mathematically equivalent

```python
N_k, unused_edges = np.histogram(x, bins=nbins)
tmp = N_k / n  # if you have a better name, go for it
tmp.dot(tmp)
```

`dot` should be faster than `(...**2).sum()`, as the former does not allocate a temporary array before summing.
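A rough way to check that claim (a micro-benchmark sketch; the array size, repeat count, and timings are illustrative and machine-dependent):

```python
import timeit

import numpy as np

# Synthetic bin proportions, just for timing purposes.
p = np.random.default_rng(0).random(10_000)

# (p ** 2).sum() materializes the squared array before reducing;
# p.dot(p) fuses the multiply and the sum without a temporary.
t_sum = timeit.timeit(lambda: (p ** 2).sum(), number=10_000)
t_dot = timeit.timeit(lambda: p.dot(p), number=10_000)
print(f"(**2).sum(): {t_sum:.3f}s   dot: {t_dot:.3f}s")
```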
numpy/lib/function_base.py
Outdated
```python
hh = ptp_x / nbins
return (2 - (n + 1) * (np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)) / (n - 1) / hh
N_k = np.histogram(x, bins=nbins)[0] / float(n)
```
No need for `float`, this file starts with `from __future__ import division`.
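For context, a minimal illustration of that import's effect (on Python 3 the `__future__` import is a no-op, since `/` is always true division there):

```python
from __future__ import division  # on Python 2, makes `/` true division

import numpy as np

counts = np.array([1, 2, 3])
n = 4
# With true division, int / int yields floats, so wrapping n in float() is redundant.
print(counts / n)  # [0.25 0.5  0.75]
```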
numpy/lib/function_base.py
Outdated
```diff
@@ -247,7 +247,7 @@ def _hist_bin_ise(x):

     The number of bins is chosen by minimizing the estimated ISE against the unknown true distribution.
     The ISE is estimated using cross-validation and can be regarded as a generalization of Scott's rule.
-    https://en.wikipedia.org/wiki/Histogram
+    https://en.wikipedia.org/wiki/Histogram#Scott.27s_normal_reference_rule
```
In your defense, this link didn't actually exist until after you made the PR 😉
numpy/lib/function_base.py
Outdated
```
Histogram bin estimator based on minimizing the estimated integrated squared error (ISE).

The number of bins is chosen by minimizing the estimated ISE against the unknown true distribution.
The ISE is estimated using cross-validation and can be regarded as a generalization of Scott's rule.
```
(Below `_hist_bin_scott`)

Disregard that, I am wrong.
I think you need to find a better reference to give for this approach. Right now, all you have is a link to Wikipedia, where the citation is dead - so I've got nothing to convince myself that that formula is correct or useful - only that your code matches it.
Needs documentation in the …
I've updated the wiki page with a live link that doesn't invoke a cache.
A better reference here (rather than a blog) is Larry Wasserman's "All of Statistics". I've updated Wikipedia, moving this estimator to its own section: https://en.wikipedia.org/wiki/Histogram#Minimizing_cross-validation_estimated_squared_error
This is formula (20.14) in my copy (I have the first printing, in which the formula is actually incorrect, but there is an errata).
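For reference, the cross-validation risk estimate in question (Wasserman's eq. 20.14, with h the bin width, n the sample size, and p̂_j = N_j / n the empirical bin proportions; this matches the `jhat` code up to notation):

```latex
\hat{J}(h) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h} \sum_{j} \hat{p}_j^{\,2}
```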
numpy/lib/function_base.py
Outdated
```python
def jhat(nbins):
    hh = ptp_x / nbins
    N_k = np.histogram(x, bins=nbins)[0] / n
    return (2 - (n + 1) * N_k.dot(N_k)) / (n - 1) / hh
```
Comparing to the formula on the Wikipedia page, it looks like you're missing a factor of `1/n**2` on the second term.

I think it should be (with a few extra parentheses for clarity):

```python
(2 - ((n + 1.0) / n ** 2) * N_k.dot(N_k)) / ((n - 1) * hh)
```
We have already divided `N_k` by `n` to avoid integer overflow (as suggested by @eric-wieser), so that is correct.
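A quick numerical sanity check of that point (illustrative data, not part of the PR):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100)
n = x.size
nbins = 8

counts = np.histogram(x, bins=nbins)[0]
p_k = counts / n

# The (n + 1) / n**2 factor on raw counts equals (n + 1) on p_k = counts / n.
lhs = ((n + 1) / n ** 2) * counts.dot(counts)
rhs = (n + 1) * p_k.dot(p_k)
np.testing.assert_allclose(lhs, rhs)
```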
Ah, very good. I would suggest using `p_k` or `p_j` then (which matches Wasserman's notation; maybe I should update the Wiki page) as the variable name instead of `N_k`.
Also, this divide by `n - 1` is kinda pointless, right? It's always positive and constant, so it doesn't affect the minima.
numpy/lib/function_base.py
Outdated
```python
    N_k = np.histogram(x, bins=nbins)[0] / n
    return (2 - (n + 1) * N_k.dot(N_k)) / (n - 1) / hh

nbins = min(np.arange(1, max(n // 4, 2)), key=jhat)
```
Why `max(n // 4, 2)` for the upper bound? I can imagine some pathological cases where the optimal number of bins would be infinite.
Actually, my pathological cases violate the assumptions behind this derivation, namely that …
It is arbitrary to some degree, to reduce the computation time. Do you have better suggestions?
How slow is this function currently on a moderately sized dataset?
Also, it would be faster to use `range` than `arange` here, since you're just iterating over it anyway.
@shoyer more than a minute for data of size 100,000.
It seems like the upper bound should scale sub-linearly, given that that is the case for all the other estimators.

I also wonder if it would make more sense to use one of the other (ideally more conservative) bin estimators for the upper bound.

Given these concerns, perhaps the square root rule would make sense, given the general arguments that the number of bins should scale like the cube root? Maybe `max(100, int(np.sqrt(n)))` to avoid strange behavior for small `n`?
@shoyer, that upper bound seems fine to me and doesn't break any unit tests.
It would also be nice to add a unit test verifying that this matches Scott's rule for normally distributed data.
numpy/lib/function_base.py
Outdated
```diff
@@ -356,7 +356,7 @@ def jhat(nbins):
     p_k = np.histogram(x, bins=nbins)[0] / n
     return (2 - (n + 1) * p_k.dot(p_k)) / hh

-    nbins = min(range(1, max(n // 4, 2)), key=jhat)
+    nbins = min(range(1, max(100, int(np.sqrt(n)))), key=jhat)
```
Are you sure you didn't mean `min`? What are you trying to achieve here? Clearly, a data set of length 1 should not try 100 bins.
`max` was my idea, with the thought that it's OK to do extra work for the sake of a more exhaustive search on small datasets, because for small `n` the square-root rule may not be a strict upper bound. But I agree, 100 is probably too large.

The other non-data-dependent rule we have is the Rice Rule. It suggests more bins than the square-root rule up to `n=64` (8 bins with both). So at the very least we want the constant to be 8 or larger. Or we could explicitly use the Rice Rule here. Either way, we definitely want a comment explaining the choice ("search up to the larger of the Rice or square root rules").

It would be nice if we had some empirical test suite of distributions to try these rules on. It's hard to guess how much of a margin we need without trying this out on some real data...
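A quick check of where the two rules cross over (rules as commonly stated: Rice as `2 * n**(1/3)`, square root as `sqrt(n)`; this snippet is just for illustration):

```python
import numpy as np

for n in (16, 64, 256, 1024):
    rice = int(np.ceil(2 * n ** (1 / 3)))
    sqrt = int(np.ceil(np.sqrt(n)))
    print(f"n={n:5d}  rice={rice:3d}  sqrt={sqrt:3d}")
# At n = 64 both give 8 bins; Rice suggests more below that, fewer above.
```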
Is `n` guaranteed to be an upper bound on the number of bins needed? So `max(min(n, 100), sqrt(n))`?
I don't know of any strict upper bounds. This paper uses an upper bound of `5 * n ** (1/3)` (for a different rule), which might be a reasonable choice: https://arxiv.org/pdf/physics/0605197.pdf
There is no upper bound in general, so a search with a fixed upper limit will not always work.

This paper by Stone appears to be the origination of this rule, so maybe calling it "stone" would make more sense:
Nice find, @shoyer!
@shoyer, I have implemented a test to verify that Scott's rule and this method converge with an increasing sample size.
@guoci: Your test fails with small numerical errors on one specific test setup. That's pretty weird - perhaps something to do with native float sizes? Also, we should probably include a link to that originating paper?
@eric-wieser, the test did pass on my machine, and all build jobs succeeded except for one, but I have not figured out why it failed.
```python
     for seed in range(256)]

# the average difference between the two methods decreases as the dataset size increases.
assert_almost_equal(abs(np.mean(ll, axis=0) - 0.5),
```
It's probably a good idea to set rtol and/or atol to give a little wiggle room.
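For instance, something along these lines (the arrays and tolerance values are toy stand-ins, purely illustrative):

```python
import numpy as np
from numpy.testing import assert_allclose

# Stand-ins for the test's measured and expected values.
measured = np.array([0.49, 0.502, 0.51])
expected = np.full(3, 0.5)

# rtol/atol absorb small platform-dependent float differences;
# the specific tolerances here are placeholders, not a recommendation.
assert_allclose(measured, expected, rtol=0.05, atol=0.01)
```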
@eric-wieser Am I supposed to write the release notes?

Tagging this as 1.15, so we don't forget. I'll rebase this after #10186 is merged.
Force-pushed from 51b8f63 to 9fb7f5f.
Alright, rebased and squashed. Had to apply the changes manually, since git did not detect the rename. Note that as a result of the file move, many of the outdated comments above may still apply.
numpy/lib/histograms.py
Outdated
```python
n = x.size
ptp_x = np.ptp(x)
if n <= 1 or ptp_x == 0:
    return ptp_x
```
Wouldn't this be better spelt `return 0`? Right now `x = [inf]` causes NaN to be returned, which may not be deliberate.
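A minimal demonstration of the `[inf]` case (assuming the early return shown in the excerpt above):

```python
import numpy as np

x = np.array([np.inf])
ptp_x = np.ptp(x)   # inf - inf -> nan
print(x.size <= 1)  # True: the guard triggers on size alone
print(ptp_x)        # nan, so `return ptp_x` would propagate NaN
```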
This problem will be handled before this function is called. Lines 231 to 233 in a10bf46:

```python
if not (np.isfinite(first_edge) and np.isfinite(last_edge)):
    raise ValueError(
        'range parameter must be finite.')
```
In which case, `return 0` would definitely be clearer here.
I'm a little worried that this will merge badly after my other change to …
Force-pushed from 46dd46d to eae6b3c.
What is the status of this?
I guess we never settled on the name for this parameter. I think … A more descriptive alternative might be something like … So let's use "stone".
@shoyer renamed.
Sorry for another round of nit-picky comments! I promise we will actually merge this this time.
Release note needs updating to 1.16.0.

@guoci Might want to squash these 10 commits.
I just pushed a commit with a minor rewording to the description of the "stone" rule (to mention leave-one-out cross-validation) in the …
I think this is good to go now?

Thanks @guoci, and thanks to Stephan and Eric for the reviews.
I have been using an estimator based on minimizing the estimated integrated squared error (ISE), which is a generalization of Scott's rule.
https://en.wikipedia.org/wiki/Histogram

Unlike most other methods, this estimate does not depend only on the dataset size, but also on the nature of the observed distribution.
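With this PR merged, the estimator is selected by name like the other string rules (a minimal usage sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# 'stone' chooses the bin count by minimizing the leave-one-out
# cross-validation estimate of the integrated squared error.
counts, edges = np.histogram(data, bins='stone')
print(len(edges) - 1, "bins")
```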