ENH: Add 'stone' estimator to np.histogram by guoci · Pull Request #8923 · numpy/numpy


Merged · 3 commits merged into numpy:master from histo_ise · Nov 22, 2018

Conversation

@guoci (Contributor) commented Apr 10, 2017:

I have been using an estimator based on minimizing the estimated integrated squared error (ISE); it is a generalization of Scott's rule.

https://en.wikipedia.org/wiki/Histogram

Unlike most other methods, this estimator does not depend only on the dataset size; it also adapts to the nature of the observed distribution.

def jhat(x, nbins):
    hh = ptp_x / nbins
    return (2 - (n + 1) * (np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)) / (n - 1) / hh

nbins = min((jhat(x, nbins), nbins) for nbins in np.arange(1, max(n // 4, 2)))[1]
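For readers skimming the thread, here is a minimal self-contained sketch of the proposal at this stage, assembled from the snippet above. The enclosing function, n, ptp_x, and the import are filled in for illustration (the _hist_bin_ise name appears in the diff later in the thread); it assumes at least two distinct samples, and the degenerate-input guard and a better search bound are added later in the review:

import numpy as np

def _hist_bin_ise(x):
    # sketch of the initial proposal; the estimator is later renamed 'stone'
    n = x.size
    ptp_x = np.ptp(x)

    def jhat(x, nbins):
        # cross-validation estimate of the integrated squared error (ISE)
        # for a histogram with nbins equal-width bins
        hh = ptp_x / nbins
        counts = np.histogram(x, bins=nbins)[0]
        return (2 - (n + 1) * (counts ** 2).sum() / (n * n)) / (n - 1) / hh

    # evaluate every candidate bin count and keep the minimizer
    nbins = min((jhat(x, nbins), nbins) for nbins in np.arange(1, max(n // 4, 2)))[1]
    return ptp_x / nbins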
@eric-wieser (Member) commented Apr 10, 2017:

Looks like you're looking for this here:

min(np.arange(1, max(n // 4, 2)), key=lambda nbins: jhat(x, nbins))

return ptp_x / nbins



Member:

Should only be two blank lines here

Histogram bin estimator based on minimizing the estimated integrated squared error (ISE).

The number of bins is chosen by minimizing the estimated ISE against the unknown true distribution.
The ISE is estimated using cross-validation and can be regarded as a generalization of Scott's rule.
Member:

If this is based off Scott's rule, then it would make more sense further down in the file below that rule

Member:

(Below _hist_bin_scott)


The number of bins is chosen by minimizing the estimated ISE against the unknown true distribution.
The ISE is estimated using cross-validation and can be regarded as a generalization of Scott's rule.
https://en.wikipedia.org/wiki/Histogram
Member:

Link to a relevant section, not the whole histogram page


def jhat(x, nbins):
    hh = ptp_x / nbins
    return (2 - (n + 1) * (np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)) / (n - 1) / hh
@eric-wieser (Member) commented Apr 10, 2017:

I'm a little suspicious of calling histogram itself without any of its other arguments. Specifically, not having weights available seems suspect.

Edit: Actually this is fine, as the weights parameter is forbidden for custom estimators. It'd be nice if you had it available here though, since it looks pretty easy to add support for.

@guoci (Contributor, Author):

Probably not; the expression is derived from leave-one-out cross-validation.

@guoci (Contributor, Author):

@eric-wieser, is there a reason why weights are not allowed for any custom estimators? Would it be a good idea to enable the weights parameter for the custom estimators that can support it?


def jhat(x, nbins):
    hh = ptp_x / nbins
    return (2 - (n + 1) * (np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)) / (n - 1) / hh
Member:

N_k = np.histogram(x, bins=nbins)[0] followed by N_k.dot(N_k) is likely faster than (N_k**2).sum()

Member:

You also might want to divide through by n before doing dot in order to avoid integer overflow

@guoci (Contributor, Author):

Do you mean N_k = np.histogram(x, bins=nbins)[0] / n?

@eric-wieser (Member) commented Apr 10, 2017:

Yeah, I'm thinking you want to replace:

(np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n))

with the mathematically equivalent

N_k, unused_edges = np.histogram(x, bins=nbins)
tmp = N_k / n  # if you have a better name, go for it
tmp.dot(tmp)

dot should be faster than (...**2).sum(), as the former does not allocate a temporary array before summing
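A quick way to sanity-check that claim (illustrative only; the array size, repeat count, and exact numbers are arbitrary and machine-dependent):

import numpy as np
from timeit import timeit

p = np.random.rand(100_000)
# (p ** 2).sum() materializes a temporary array of squares before summing;
# p.dot(p) computes the same value in a single fused pass
print(timeit(lambda: (p ** 2).sum(), number=1000))
print(timeit(lambda: p.dot(p), number=1000))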

    hh = ptp_x / nbins
-   return (2 - (n + 1) * (np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)) / (n - 1) / hh
+   N_k = np.histogram(x, bins=nbins)[0] / float(n)
Member:

No need for float, this file starts with from __future__ import division

@@ -247,7 +247,7 @@ def _hist_bin_ise(x):

    The number of bins is chosen by minimizing the estimated ISE against the unknown true distribution.
    The ISE is estimated using cross-validation and can be regarded as a generalization of Scott's rule.
-    https://en.wikipedia.org/wiki/Histogram
+    https://en.wikipedia.org/wiki/Histogram#Scott.27s_normal_reference_rule
Member:

In your defense, this link didn't actually exist until after you made the PR 😉

Histogram bin estimator based on minimizing the estimated integrated squared error (ISE).

The number of bins is chosen by minimizing the estimated ISE against the unknown true distribution.
The ISE is estimated using cross-validation and can be regarded as a generalization of Scott's rule.
Member:

(Below _hist_bin_scott)

@eric-wieser (Member) commented Apr 10, 2017:

Not something that this PR should worry about, but...

I'm a little concerned that estimators are all wrong when range != None - namely, they all use ptp in their calculations, when I feel they should probably be using mx - mn.

(Edit: disregard that, I am wrong.)

@eric-wieser (Member):

I think you need to find a better reference to give for this approach.

Right now, all you have is a link to wikipedia, where the citation is dead - so I've got nothing to convince myself that that formula is correct or useful - only that your code matches it.

@eric-wieser (Member):

Needs documentation in the histogram docstring too

@guoci (Contributor, Author) commented Apr 10, 2017:

@eric-wieser (Member):

I've updated the wiki page with a live link that doesn't invoke a cache

@shoyer (Member) left a comment:

A better reference here (rather than a blog) is Larry Wasserman's "All of Statistics". I've updated Wikipedia, moving this estimator to its own section: https://en.wikipedia.org/wiki/Histogram#Minimizing_cross-validation_estimated_squared_error

This is formula (20.14) in my copy (I have the first printing, in which the formula is actually incorrect, but there is an erratum).

def jhat(nbins):
    hh = ptp_x / nbins
    N_k = np.histogram(x, bins=nbins)[0] / n
    return (2 - (n + 1) * N_k.dot(N_k)) / (n - 1) / hh
Member:

Comparing to the formula on the Wikipedia page, it looks like you're missing a factor of 1/n**2 on the second term.

I think it should be (with a few extra parentheses for clarity):

(2 - ((n + 1.0) / n ** 2) * N_k.dot(N_k)) / ((n - 1) * hh)

@guoci (Contributor, Author) commented Apr 16, 2017:

We have already divided N_k by n, to avoid integer overflow (as suggested by @eric-wieser), so this is correct.

@shoyer (Member) commented Apr 16, 2017:

Ah, very good. I would suggest using p_k or p_j then (which matches Wasserman's notation, maybe I should update the Wiki page) as the variable name instead of N_k.

Member:

Also, this divide by n-1 is kinda pointless, right? It's always positive and constant, so doesn't affect the minima
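For reference, the quantity jhat computes is the cross-validation estimate of the ISE as given in Wasserman's formula (20.14) and in the Wikipedia section linked above, written here with the code's $\hat{p}_j = N_j / n$:

$$\hat{J}(h) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h} \sum_j \hat{p}_j^{\,2}, \qquad \hat{p}_j = \frac{N_j}{n}$$

Since $n - 1$ is constant across candidate bin counts, dropping it (as the code eventually does) leaves the minimizer unchanged, whereas the $1/h$ factor varies with the bin count and must stay.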

        N_k = np.histogram(x, bins=nbins)[0] / n
        return (2 - (n + 1) * N_k.dot(N_k)) / (n - 1) / hh

    nbins = min(np.arange(1, max(n // 4, 2)), key=jhat)
Member:

Why max(n // 4, 2) for the upper bound? I can imagine some pathological cases where the optimal number of bins would be infinite.

Member:

Actually, my pathological cases violate the assumption behind this derivation, namely that $\int (f^\prime(u))^2 du < \infty$. Still, it would be good to explain/justify this choice.
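For context, that finiteness condition is exactly what drives the asymptotic analysis behind Scott's rule: minimizing the asymptotic mean integrated squared error gives (Scott, 1979)

$$h^* = \left( \frac{6}{\int f'(x)^2 \, dx} \right)^{1/3} n^{-1/3},$$

which for a normal density with standard deviation $\sigma$ reduces to $h^* \approx 3.49\,\sigma\, n^{-1/3}$. When $\int f'(x)^2\,dx$ diverges, no such finite optimum exists.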

@guoci (Contributor, Author):

It is arbitrary to some degree, to reduce the computation time. Do you have better suggestions?

Member:

How slow is this function currently on a moderately sized dataset?

Member:

Also, it would be faster to use range than arange here, since you're just iterating over it anyway

@guoci (Contributor, Author):

@shoyer more than a minute for data of size 100,000.

Member:

It seems like the upper bound should scale sub-linearly, given that that is the case for all the other estimators.

I also wonder if it would make more sense to use one of the other (ideally more conservative) bin estimators for the upper bound.

Given these concerns, perhaps the square root rule would make sense, given the general arguments that the number of bins should scale like the cube root? Maybe max(100, int(np.sqrt(n))) to avoid strange behavior for small n?

@guoci (Contributor, Author):

@shoyer, that upper bound seems fine to me and doesn't break any unit tests.

@shoyer (Member) commented Apr 16, 2017:

It would also be nice to add a unit test verifying that this matches Scott's rule for normally distributed data.

@@ -356,7 +356,7 @@ def jhat(nbins):
        p_k = np.histogram(x, bins=nbins)[0] / n
        return (2 - (n + 1) * p_k.dot(p_k)) / hh

-    nbins = min(range(1, max(n // 4, 2)), key=jhat)
+    nbins = min(range(1, max(100, int(np.sqrt(n)))), key=jhat)
Member:

Are you sure you didn't mean min? What are you trying to achieve here? Clearly, a data set of length 1 should not try 100 bins

Member:

max was my idea, with the thought that it's OK to do extra work for the sake of a more exhaustive search on small datasets, because for small n the square-root rule may not be a strict upper bound. But I agree, 100 is probably too large.

The other non-data-dependent rule we have is the Rice Rule. It suggests more bins than the square root rule up to n=64 (8 bins with both). So at the very least we want the constant to be 8 or larger. Or we could explicitly use the Rice Rule here. Either way, we definitely want a comment explaining the choice ("search up to the larger of the Rice and square root rules"). See the quick illustration after this comment.

It would be nice if we had some empirical test suite of distributions to try these rules on. It's hard to guess how much of a margin we need without trying this out on some real data...
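A quick illustration of the crossover mentioned above (taking the Rice rule in its common form, ceil(2 * n ** (1/3)); the sample sizes are arbitrary):

import numpy as np

for n in (8, 27, 64, 125, 1000):
    rice = int(np.ceil(2 * n ** (1 / 3)))
    sqrt_rule = int(np.ceil(np.sqrt(n)))
    print(f"n={n}: rice={rice}, sqrt={sqrt_rule}")
# up to n = 64 the Rice rule suggests at least as many bins (both give
# 8 at n = 64); past that the square-root rule grows faster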

@eric-wieser (Member) commented Apr 18, 2017:

Is n guaranteed to be an upper bound on the number of bins needed?

So max(min(n, 100), sqrt(n))

Member:

I don't know of any strict upper bounds.

This paper uses an upper bound of 5 * n ** (1/3) (for a different rule), which might be a reasonable choice:
https://arxiv.org/pdf/physics/0605197.pdf

@guoci (Contributor, Author):

There is no upper bound in general, so a search with a fixed upper limit will not always work.

@shoyer (Member) commented Apr 18, 2017:

This paper by Stone appears to be the origination of this rule, so maybe calling it "stone" would make more sense:
http://digitalassets.lib.berkeley.edu/sdtr/ucb/text/34.pdf

@eric-wieser (Member):

Nice find, @shoyer!

@guoci (Contributor, Author) commented Apr 20, 2017:

@shoyer, I have implemented a test to verify that Scott's rule and this method converge with increasing sample size.

@eric-wieser (Member) commented Apr 21, 2017:

@guoci: Your test fails with small numerical errors on one specific test setup. That's pretty weird - perhaps something to do with native float sizes? assert_almost_equal might solve the problem

Also, we should probably include a link to that originating paper?

@guoci (Contributor, Author) commented Apr 21, 2017:

@eric-wieser, the test did pass on my machine, and all build jobs succeeded except for one, but I have not figured out why it failed.
Where should we put the link?

for seed in range(256)]

# the average difference between the two methods decreases as the dataset size increases.
assert_almost_equal(abs(np.mean(ll, axis=0) - 0.5),
Member:

It's probably a good idea to set rtol and/or atol to give a little wiggle room.
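For readers following along, a sketch of the kind of convergence test under discussion; the sample sizes, seed count, and final check here are illustrative, not the PR's actual values, and it assumes the estimator is available under its eventual name 'stone':

import numpy as np

def nbins_ratio(seed, size):
    # compare edge counts from 'stone' and 'scott' on normal data;
    # a ratio of 0.5 means the two estimators agree exactly
    rng = np.random.RandomState(seed)
    x = rng.normal(loc=0, scale=2, size=size)
    a = len(np.histogram(x, bins='stone')[1])
    b = len(np.histogram(x, bins='scott')[1])
    return a / (a + b)

sizes = (50, 500, 5000)
ll = [[nbins_ratio(seed, size) for size in sizes] for seed in range(32)]

# the average difference between the two methods decreases as the
# dataset size increases; a real test would pin expected values with
# a tolerance (rtol/atol) for wiggle room, as suggested above
avg = np.abs(np.mean(ll, axis=0) - 0.5)
assert avg[-1] < avg[0]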

@eric-wieser added the '56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes' label on Apr 22, 2017
@guoci (Contributor, Author) commented Jul 28, 2017:

@eric-wieser Am I supposed to write the release notes?

@eric-wieser (Member):

Tagging this as 1.15, so we don't forget.

I'll rebase this after #10186 is merged

@eric-wieser added this to the 1.15.0 release milestone on Dec 10, 2017
@eric-wieser force-pushed the histo_ise branch 3 times, most recently from 51b8f63 to 9fb7f5f, on December 27, 2017
@eric-wieser (Member) commented Dec 27, 2017:

Alright, rebased and squashed. Had to apply the changes manually, since git did not detect the rename.

Note that as a result of the file move, many of the outdated comments above may still apply.

n = x.size
ptp_x = np.ptp(x)
if n <= 1 or ptp_x == 0:
    return ptp_x
Member:

Wouldn't this be better spelt return 0?

Right now x = [inf] causes NaN to be returned, which may not be deliberate.

@guoci (Contributor, Author):

This problem will be handled before this function is called.

if not (np.isfinite(first_edge) and np.isfinite(last_edge)):
    raise ValueError(
        'range parameter must be finite.')

Member:

In which case, return 0 would definitely be clearer here
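That is, roughly the following (a sketch of the guard as discussed; per the exchange above, non-finite ranges are rejected before this function runs):

n = x.size
ptp_x = np.ptp(x)
if n <= 1 or ptp_x == 0:
    # degenerate input: a single sample or zero spread; returning a bin
    # width of 0 lets the caller fall back to a single bin
    return 0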

@eric-wieser (Member):

I'm a little worried that this will merge badly after my other change to histogram. Can you do a git pull origin master and update the PR, so that I can see what the merge looks like?

@guoci force-pushed the histo_ise branch 6 times, most recently from 46dd46d to eae6b3c, on November 12, 2018
@charris (Member) commented Nov 13, 2018:

What is the status of this?

@shoyer (Member) commented Nov 13, 2018:

I guess we never settled on the name for this parameter.

I think method='ise' is too vague. This method uses a cross-validation estimate of the integrated squared error, but there are other ways to estimate squared error (e.g., Bayesian methods).

A more descriptive alternative might be something like method='loocv' or method='cv-ise', but that is really starting to string together too many random letters to be memorable.

So let's use method='stone' instead. This matches the convention of the other method names, and conveniently is also a short word in the English dictionary.
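For reference, once this lands the estimator is selected by name like the existing string options; note that in np.histogram the string is passed via the bins argument (the data here is just an example):

import numpy as np

data = np.random.RandomState(0).normal(size=1000)
counts, edges = np.histogram(data, bins='stone')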

@guoci (Contributor, Author) commented Nov 13, 2018:

@shoyer renamed.

@shoyer (Member) left a comment:

Sorry for another round of nit-picky comments! I promise we will actually merge it this time.

@charris (Member) commented Nov 14, 2018:

Release note needs updating to 1.16.0.

@charris changed the title from "ENH: Add 'ise' estimator to np.histogram" to "ENH: Add 'stone' estimator to np.histogram" on Nov 20, 2018
@charris (Member) commented Nov 20, 2018:

@guoci Might want to squash these 10 commits.

@shoyer (Member) commented Nov 20, 2018:

I just pushed a commit with a minor rewording to the description of the "stone" rule (to mention leave-one-out cross-validation) in the np.histogram docstring.

@shoyer (Member) commented Nov 20, 2018:

I think this is good to go now?

@charris charris merged commit 1d38e41 into numpy:master Nov 22, 2018
@charris (Member) commented Nov 22, 2018:

Thanks @guoci , and thanks to Stephan and Eric for the reviews.
