ENH: Add 'stone' estimator to np.histogram #8923
Conversation
numpy/lib/function_base.py
Outdated
```python
hh = ptp_x / nbins
return (2 - (n + 1) * (np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)) / (n - 1) / hh

nbins = min((jhat(x, nbins), nbins) for nbins in np.arange(1, max(n // 4, 2)))[1]
```
Looks like you're looking for this here:

```python
min(np.arange(1, max(n // 4, 2)), key=lambda nbins: jhat(x, nbins))
```
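As a side-by-side sketch of the two idioms (with a toy objective standing in for `jhat`, so this is illustrative only):

```python
# Toy objective standing in for jhat(x, nbins); minimized at nbins == 3.
def f(nbins):
    return (nbins - 3) ** 2

candidates = range(1, 10)

# Tuple trick: min() compares (f(nbins), nbins) pairs, then [1] extracts nbins.
best_tuple = min((f(nbins), nbins) for nbins in candidates)[1]

# key= version: same result, but clearer and with no throwaway tuples.
best_key = min(candidates, key=f)

assert best_tuple == best_key == 3
```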
numpy/lib/function_base.py
Outdated
```python
return ptp_x / nbins




```
Should only be two blank lines here
numpy/lib/function_base.py
Outdated
```
Histogram bin estimator based on minimizing the estimated integrated squared error (ISE).

The number of bins is chosen by minimizing the estimated ISE against the unknown true distribution.
The ISE is estimated using cross-validation and can be regarded as a generalization of Scott's rule.
```
If this is based on Scott's rule, then it would make more sense further down in the file, below that rule.
(Below `_hist_bin_scott`)
numpy/lib/function_base.py
Outdated
```
The number of bins is chosen by minimizing the estimated ISE against the unknown true distribution.
The ISE is estimated using cross-validation and can be regarded as a generalization of Scott's rule.
https://en.wikipedia.org/wiki/Histogram
```
Link to a relevant section, not the whole histogram page
numpy/lib/function_base.py
Outdated
```python
def jhat(x, nbins):
    hh = ptp_x / nbins
    return (2 - (n + 1) * (np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)) / (n - 1) / hh
```
I'm a little suspicious of `histogram` itself, but without any of its other arguments. Specifically, not having `weights` available seems suspect.

Edit: Actually this is fine, as the `weights` parameter is forbidden for custom estimators. It'd be nice if you had it available here though, since it looks pretty easy to add support for.
Probably not; the expression is derived via leave-one-out cross-validation.
@eric-wieser, is there a reason why weights are not allowed for any custom estimators? Would it be a good idea to enable the `weights` parameter for custom estimators that can support it?
numpy/lib/function_base.py
Outdated
```python
def jhat(x, nbins):
    hh = ptp_x / nbins
    return (2 - (n + 1) * (np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)) / (n - 1) / hh
```
`N_k = np.histogram(x, bins=nbins)[0]` followed by `N_k.dot(N_k)` is likely faster than `(N_k**2).sum()`.
You also might want to divide through by `n` before doing `dot`, in order to avoid integer overflow.
Do you mean `N_k = np.histogram(x, bins=nbins)[0] / n`?
Yeah, I'm thinking you want to replace:

```python
(np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)
```

with the mathematically equivalent

```python
N_k, unused_edges = np.histogram(x, bins=nbins)
tmp = N_k / n  # if you have a better name, go for it
tmp.dot(tmp)
```

`dot` should be faster than `(...**2).sum()`, as the former does not allocate a temporary array before summing.
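A rough way to check that claim (a micro-benchmark sketch; the array size, repeat count, and timings are illustrative and machine-dependent):

```python
import timeit

import numpy as np

# Synthetic bin proportions, just for timing purposes.
p = np.random.default_rng(0).random(10_000)

# (p ** 2).sum() materializes the squared array before reducing;
# p.dot(p) fuses the multiply and the sum without a temporary.
t_sum = timeit.timeit(lambda: (p ** 2).sum(), number=10_000)
t_dot = timeit.timeit(lambda: p.dot(p), number=10_000)
print(f"(**2).sum(): {t_sum:.3f}s   dot: {t_dot:.3f}s")
```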
numpy/lib/function_base.py
Outdated
```python
hh = ptp_x / nbins
return (2 - (n + 1) * (np.histogram(x, bins=nbins)[0] ** 2).sum() / (n * n)) / (n - 1) / hh
N_k = np.histogram(x, bins=nbins)[0] / float(n)
```
No need for `float`, this file starts with `from __future__ import division`.
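For context, a minimal illustration of that import's effect (on Python 3 the `__future__` import is a no-op, since `/` is always true division there):

```python
from __future__ import division  # on Python 2, makes `/` true division

import numpy as np

counts = np.array([1, 2, 3])
n = 4
# With true division, int / int yields floats, so wrapping n in float() is redundant.
print(counts / n)  # [0.25 0.5  0.75]
```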
numpy/lib/function_base.py
Outdated
```diff
@@ -247,7 +247,7 @@ def _hist_bin_ise(x):

     The number of bins is chosen by minimizing the estimated ISE against the unknown true distribution.
     The ISE is estimated using cross-validation and can be regarded as a generalization of Scott's rule.
-    https://en.wikipedia.org/wiki/Histogram
+    https://en.wikipedia.org/wiki/Histogram#Scott.27s_normal_reference_rule
```
In your defense, this link didn't actually exist until after you made the PR 😉
numpy/lib/function_base.py
Outdated
```
Histogram bin estimator based on minimizing the estimated integrated squared error (ISE).

The number of bins is chosen by minimizing the estimated ISE against the unknown true distribution.
The ISE is estimated using cross-validation and can be regarded as a generalization of Scott's rule.
```
(Below `_hist_bin_scott`)

Disregard that, I am wrong.
I think you need to find a better reference to give for this approach. Right now, all you have is a link to Wikipedia, where the citation is dead - so I've got nothing to convince myself that that formula is correct or useful - only that your code matches it.
Needs documentation in the …
I've updated the wiki page with a live link that doesn't invoke a cache.
A better reference here (rather than a blog) is Larry Wasserman's "All of Statistics". I've updated Wikipedia, moving this estimator to its own section: https://en.wikipedia.org/wiki/Histogram#Minimizing_cross-validation_estimated_squared_error
This is formula (20.14) in my copy (I have the first printing, in which the formula is actually incorrect, but there is an errata).
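For reference, the cross-validation risk estimate in question (Wasserman's eq. 20.14, with h the bin width, n the sample size, and p̂_j = N_j / n the empirical bin proportions; this matches the `jhat` code up to notation):

```latex
\hat{J}(h) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h} \sum_{j} \hat{p}_j^{\,2}
```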
numpy/lib/function_base.py
Outdated
```python
def jhat(nbins):
    hh = ptp_x / nbins
    N_k = np.histogram(x, bins=nbins)[0] / n
    return (2 - (n + 1) * N_k.dot(N_k)) / (n - 1) / hh
```
Comparing to the formula on the Wikipedia page, it looks like you're missing a factor of `1/n**2` on the second term.

I think it should be (with a few extra parentheses for clarity):

```python
(2 - ((n + 1.0) / n ** 2) * N_k.dot(N_k)) / ((n - 1) * hh)
```
We have already divided `N_k` by `n` to avoid integer overflow (as suggested by @eric-wieser), so that is correct.
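A quick numerical sanity check of that point (illustrative data, not part of the PR):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100)
n = x.size
nbins = 8

counts = np.histogram(x, bins=nbins)[0]
p_k = counts / n

# The (n + 1) / n**2 factor on raw counts equals (n + 1) on p_k = counts / n.
lhs = ((n + 1) / n ** 2) * counts.dot(counts)
rhs = (n + 1) * p_k.dot(p_k)
np.testing.assert_allclose(lhs, rhs)
```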
Ah, very good. I would suggest using `p_k` or `p_j` then (which matches Wasserman's notation; maybe I should update the Wiki page) as the variable name instead of `N_k`.
Also, this divide by `n - 1` is kinda pointless, right? It's always positive and constant, so it doesn't affect the minima.
numpy/lib/function_base.py
Outdated
```python
    N_k = np.histogram(x, bins=nbins)[0] / n
    return (2 - (n + 1) * N_k.dot(N_k)) / (n - 1) / hh

nbins = min(np.arange(1, max(n // 4, 2)), key=jhat)
```
Why `max(n // 4, 2)` for the upper bound? I can imagine some pathological cases where the optimal number of bins would be infinite.
Actually, my pathological cases violate the assumptions behind this derivation, namely that …
It is arbitrary to some degree, to reduce the computation time. Do you have better suggestions?
How slow is this function currently on a moderately sized dataset?
Also, it would be faster to use `range` than `arange` here, since you're just iterating over it anyway.
@shoyer more than a minute for data of size 100,000.
It seems like the upper bound should scale sub-linearly, given that that is the case for all the other estimators.

I also wonder if it would make more sense to use one of the other (ideally more conservative) bin estimators for the upper bound.

Given these concerns, perhaps the square root rule would make sense, given the general arguments that the number of bins should scale like the cube root? Maybe `max(100, int(np.sqrt(n)))` to avoid strange behavior for small `n`?
@shoyer, that upper bound seems fine to me and doesn't break any unit tests.
It would also be nice to add a unit test verifying that this matches Scott's rule for normally distributed data.
numpy/lib/function_base.py
Outdated
```diff
@@ -356,7 +356,7 @@ def jhat(nbins):
     p_k = np.histogram(x, bins=nbins)[0] / n
     return (2 - (n + 1) * p_k.dot(p_k)) / hh

-    nbins = min(range(1, max(n // 4, 2)), key=jhat)
+    nbins = min(range(1, max(100, int(np.sqrt(n)))), key=jhat)
```
Are you sure you didn't mean `min`? What are you trying to achieve here? Clearly, a data set of length 1 should not try 100 bins.
`max` was my idea, with the thought that it's OK to do extra work for the sake of a more exhaustive search on small datasets, because for small `n` the square-root rule may not be a strict upper bound. But I agree, 100 is probably too large.

The other non-data-dependent rule we have is the Rice Rule. It suggests more bins than the square-root rule up to `n=64` (8 bins with both). So at the very least we want the constant to be 8 or larger. Or we could explicitly use the Rice Rule here. Either way, we definitely want a comment explaining the choice ("search up to the larger of the Rice or square root rules").

It would be nice if we had some empirical test suite of distributions to try these rules on. It's hard to guess how much of a margin we need without trying this out on some real data...
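A quick check of where the two rules cross over (rules as commonly stated: Rice as `2 * n**(1/3)`, square root as `sqrt(n)`; this snippet is just for illustration):

```python
import numpy as np

for n in (16, 64, 256, 1024):
    rice = int(np.ceil(2 * n ** (1 / 3)))
    sqrt = int(np.ceil(np.sqrt(n)))
    print(f"n={n:5d}  rice={rice:3d}  sqrt={sqrt:3d}")
# At n = 64 both give 8 bins; Rice suggests more below that, fewer above.
```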
Is `n` guaranteed to be an upper bound on the number of bins needed? So `max(min(n, 100), sqrt(n))`?
I don't know of any strict upper bounds. This paper uses an upper bound of `5 * n ** (1/3)` (for a different rule), which might be a reasonable choice: https://arxiv.org/pdf/physics/0605197.pdf
There is no upper bound in general, so a search with a fixed upper limit will not always work.

This paper by Stone appears to be the origination of this rule, so maybe calling it "stone" would make more sense:
Nice find, @shoyer!
@shoyer, I have implemented a test to verify that Scott's rule and this method converge with an increasing sample size.
@guoci: Your test fails with small numerical errors on one specific test setup. That's pretty weird - perhaps something to do with native float sizes? Also, we should probably include a link to that originating paper?
@eric-wieser, the test did pass on my machine, and all build jobs succeeded except for one, but I have not figured out why it failed.
```python
     for seed in range(256)]

# the average difference between the two methods decreases as the dataset size increases.
assert_almost_equal(abs(np.mean(ll, axis=0) - 0.5),
```
It's probably a good idea to set rtol and/or atol to give a little wiggle room.
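For instance, something along these lines (the arrays and tolerance values are toy stand-ins, purely illustrative):

```python
import numpy as np
from numpy.testing import assert_allclose

# Stand-ins for the test's measured and expected values.
measured = np.array([0.49, 0.502, 0.51])
expected = np.full(3, 0.5)

# rtol/atol absorb small platform-dependent float differences;
# the specific tolerances here are placeholders, not a recommendation.
assert_allclose(measured, expected, rtol=0.05, atol=0.01)
```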
@eric-wieser Am I supposed to write the release notes?

Tagging this as 1.15, so we don't forget. I'll rebase this after #10186 is merged.
Force-pushed from 51b8f63 to 9fb7f5f.
Alright, rebased and squashed. Had to apply the changes manually, since git did not detect the rename. Note that as a result of the file move, many of the outdated comments above may still apply.
numpy/lib/histograms.py
Outdated
```python
n = x.size
ptp_x = np.ptp(x)
if n <= 1 or ptp_x == 0:
    return ptp_x
```
Wouldn't this be better spelt `return 0`? Right now `x = [inf]` causes NaN to be returned, which may not be deliberate.
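A minimal demonstration of the `[inf]` case (assuming the early return shown in the excerpt above):

```python
import numpy as np

x = np.array([np.inf])
ptp_x = np.ptp(x)   # inf - inf -> nan
print(x.size <= 1)  # True: the guard triggers on size alone
print(ptp_x)        # nan, so `return ptp_x` would propagate NaN
```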
This problem will be handled before this function is called. Lines 231 to 233 in a10bf46:

```python
if not (np.isfinite(first_edge) and np.isfinite(last_edge)):
    raise ValueError(
        'range parameter must be finite.')
```
In which case, `return 0` would definitely be clearer here.
I'm a little worried that this will merge badly after my other change to …
Force-pushed from 46dd46d to eae6b3c.
What is the status of this?
I guess we never settled on the name for this parameter. I think … A more descriptive alternative might be something like … So let's use "stone".
@shoyer renamed.
Sorry for another round of nit-picky comments! I promise we will actually merge this this time.
Release note needs updating to 1.16.0.

@guoci Might want to squash these 10 commits.
I just pushed a commit with a minor rewording to the description of the "stone" rule (to mention leave-one-out cross-validation) in the …
I think this is good to go now?

Thanks @guoci, and thanks to Stephan and Eric for the reviews.
I have been using an estimator based on minimizing the estimated integrated squared error (ISE), which is a generalization of Scott's rule.
https://en.wikipedia.org/wiki/Histogram

Unlike most other methods, this estimate does not depend only on the dataset size, but also on the nature of the observed distribution.
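With this PR merged, the estimator is selected by name like the other string rules (a minimal usage sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# 'stone' chooses the bin count by minimizing the leave-one-out
# cross-validation estimate of the integrated squared error.
counts, edges = np.histogram(data, bins='stone')
print(len(edges) - 1, "bins")
```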