-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
How will new histogram bin selectors work with range parameter? #7411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Uhh, very good point, I think they may just ignore it when estimating the number of bins. @nayyarv, @madphysicist what do you guys think? |
If I am not mistaken, @nayyarv included range support in the last commit he made. All the estimators return a number of bins in the end, even the ones that compute a bin width. If I am not mistaken, the bin count is applied over the range. I will look into it. |
I looked back at the modifications to histogram. Support for range was inroduced here: #7243. It effectively trims off any part of the array that is outside the range. However, the conversion of bin-width computations to bin counts is done using the I believe that the way bin counts are handled is a bug since the histogram is normalized to the requested range, not the
I prefer option 2 because it will require less computation and will allow the estimators to be more self contained. Rather than recomputing the I can see a number of actions coming from this issue:
On a side note, weights are coming soon. I am working out A) a new way of doing introselect, and B) some benchmarks to demonstrate that it is much better to do that than naive sort. |
To sum up, @seberg the range is being handled, but there is a subtle bug. |
Thanks for investigating, @madphysicist This addresses part 1 of my question, which is when the range is trimmed. Part 2 of my question is about when the range is extended beyond (min, max), or one side is trimmed while the other is extended. My worry was that the bin selectors would set the number of bins according to the input data (as they should), and then these would be stretched across the full range. This would cause inflated bin widths in the range where the actual data lies. I think that your solution of making the bin estimators return a bin width instead of a number of bins will also solve this second part. |
@madphysicist is very much on top of this. There is an implicit assumption that the number of bins returned is based on the range of the data. It's explicit in the fd and scott estimators, while more implicit for sqrt, sturges, rice, doane, This means when the range is excessively large, the number of bins calculated will be non-optimal since they were calculated for the range of the data passed in, not the actual range set by the user. I agree that returning the binwidth instead is a very clean solution, and then using Here's a suggested PR: master...nayyarv:master The methods that estimate bin numbers are decorated to return binwidth's instead, while those that calculate binwidth are simplified. |
@nayyarv Would you be OK with putting the decorator code directly into the functions? |
Aside from that, LGTM. Moving the duplicated range computation up and documenting the behavior can be a separate PR. |
@davidchall I did not explicitly acknowledge this in my original post, but your original post did in fact uncover a bug. Your worry that the range would be stretched was well founded, but @nayyarv has proposed what I think is the correct fix. With those changes, the optimal size will be computed based on the clipped data but correctly applied to the range regardless of how it overlaps the data. |
@nayyarv. 'auto' should return the minimum width. Also, I am pretty sure that you would want to remove the |
@davidchall It looks like using widths did in fact solve the problem, as you can see from the updated tests. The simple range test now reports about twice the number of bins it used to because the range is approximately twice the PTP of the input. |
@madphysicist Thank you - this is the behaviour that I would expect. It might be a good idea to mention this interplay between range and the binning method in the docstring for the main histogram() function, but that is just my own opinion. |
@davidchall Thanks for noticing that. It was what we discussed and was certainly my intent all along. I will update the docs shortly. |
Fixed. I think we should move further discussion of the code and docs to the PR chat. |
Fixes numpy#7411. Tests and documentation updated. Fixes other small issues with range and bin count computations.
@davidchall The PR has moved to #7423 in case you have any further comments. You can still leave comments in #7416 since the entire set of code changes is visible there. |
Fixes numpy#7411. Tests and documentation updated. Fixes other small issues with range and bin count computations.
I've seen that v1.11 will be introducing aliases for different histogram binning optimization methods. I'm really looking forward to this feature, so thank you for implementing this!
My question is how these methods will interact with the range parameter (this is not currently documented).
The text was updated successfully, but these errors were encountered: