ENH: Added iqr function to compute IQR metric in numpy/function_base.py #7137
Conversation
Updated histogram 'fd' bin estimator to use this function.
I don't think adding almost 400 lines of code for what is essentially documentation for a 2-liner is worthwhile.
Statsmodels does have an equivalent. The docstring in this PR looks better than what's in statsmodels now, and arguably statsmodels is a better place for this function than numpy. Alternatively maybe scipy.stats, but it's not general enough to be included in numpy imho. So maybe improve the statsmodels version instead?
IMO, scipy.stats could also get it and use it (in KDE it would be nice, except that's multivariate), but maybe with the normal constant. Another simple robust dispersion estimate is MAD, and the statsmodels version is used as a helper function by some packages or scripts.
It's really only 8 lines of code, and it seems funny to say we don't want a change because the docs and tests are too thorough :-). Also I have a fond feeling towards these tests since I guess they're what caught that terrible wrong-answers bug in #7129. OTOH it's true that we don't want to keep piling up infinite numbers of functions in numpy to cover every eventuality. On net I think I'm +0.5 for this -- it really is a pretty fundamental operation with obvious semantics, so I don't see why people should have to keep reimplementing it, or pull in statsmodels just for it. (It's useful way beyond people who use statsmodels -- it'd be like requiring statsmodels for anyone who wanted to compute stddev...) And the maintenance burden is unlikely to be high or even measurable.
In my opinion iqr is really a redundant convenience function. In our use cases in statsmodels we would need the variance estimate based on IQR, which is more difficult to remember: divide by 1.349 or 1.34. (I thought statsmodels has it but don't find it.) (Related: statsmodels has robust skew and kurtosis estimates with normal reference based on percentiles, https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/stattools.py#L270)
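As a sketch of the normal-reference scaling mentioned above: for a normal distribution the IQR is about 1.349 standard deviations, so dividing the IQR by that constant gives an outlier-robust estimate of sigma. (The constant and the synthetic data here are for illustration; this is not code from the PR.)

```python
import numpy as np

# Synthetic normal data with a known scale, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=100_000)

# For N(mu, sigma), IQR = 2 * 0.6745 * sigma ~= 1.349 * sigma,
# so rescaling the IQR recovers a robust estimate of sigma.
q75, q25 = np.percentile(x, [75, 25])
sigma_hat = (q75 - q25) / 1.349

print(sigma_hat)  # close to the true scale, 2.0
```

Unlike `np.std`, this estimate barely moves if a few extreme outliers are appended to `x`, which is the point of using it as a dispersion estimator.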
+1
To emphasize the two main points made by @argriffing that really motivated my proposal: a) IQR is very commonly used as a replacement for standard deviation, and b) it is much better suited for dealing with non-normal data, which is why it is used.
I know iqr is common, I just think it's unnecessary to have in numpy just to save one line of code, though if we add it we should also have a nan-aware version.
What would naniqr do? |
NaN is the de facto way to represent missing data in the numpy/pandas stack. |
I think IQR is not that common. statsmodels uses it in KDE, and MAD in robust. IMO, if you start to add two-liners like this (where even the name says how to calculate it, if you remember what the three letters stand for), then you will get a lot of other PR's with functions that are also "commonly" used (in some neighborhood). How about moving the trimmed statistics from scipy.stats to numpy? (My response to a discussion on python user about adding the stats package was mainly where to draw the line. Especially, when someone mentioned t-test and you need to add something like scipy.special to the mix.) Up to you. |
@argriffing. So you would treat NaNs as a mask instead of just returning NaN as the result? Is there a nanpercentile? |
@josef-pkt. Your argument makes a lot of sense. While I would still prefer to leave the PR here, do you think that I could make a similar PR to scipy.stats instead? Successfully, I mean.
http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.nanpercentile.html I don't use these functions, and it looks like pandas implements its own nan ops. |
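Building on the `np.nanpercentile` link above, a NaN-ignoring variant of the proposed function could be sketched as follows. The name `naniqr` and its signature are hypothetical, echoing the naming question raised earlier in the thread; this is not code from the PR.

```python
import numpy as np

def naniqr(a, rng=(25, 75), axis=None):
    """Hypothetical NaN-ignoring IQR, built on np.nanpercentile.

    NaNs are treated as missing data (skipped), rather than
    propagating and making the whole result NaN.
    """
    lo, hi = np.nanpercentile(a, rng, axis=axis)
    return hi - lo

x = np.array([1.0, 2.0, np.nan, 3.0, 4.0, 5.0])
print(naniqr(x))  # 2.0 -- same as the IQR of [1, 2, 3, 4, 5]
```

This mirrors the relationship between `np.percentile` and `np.nanpercentile` in the other `nanxxx` functions mentioned above.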
@madphysicist Adding to scipy.stats is largely up to @rgommers. I would add an option for scaling IQR so it's an estimate for the variance for the normal reference case (*). (That's what I wanted to get at in statsmodels before we added the robust skew and kurtosis, but I never got around to collecting those outlier robust functions.) @argriffing AFAIR, nanops in pandas are older than most of the nanxxx functions in numpy, and still have more coverage. (*) I was playing with some examples a long time ago in related functions to get the scaling for other distributions as reference. But as in the case of MAD in statsmodels, I don't think anyone will usually change from the normal reference scaling or from no scaling, i.e. raw IQR. |
Force-pushed from 4523ef0 to 26c0c0f
Sorry about the weird commits. I just learned a couple of things about rebasing.
This is really statistics related and indeed not that commonly used, I think. Adding it to scipy.stats makes more sense to me than adding it to numpy. If adding to scipy.stats, the tests and docs would carry over. The same applies to the nan-aware variant.
Is there a consensus for moving this to scipy? |
I am perfectly OK with modifying this to be scipy compatible. I do have one question though. If #7129 is only fixed for numpy 1.12.0, the tests for this function will break in scipy. I see a couple of ways for dealing with that:
What are your recommendations for the proper way to handle the situation? |
If it were me then I'd probably go with the latter, because no one will be running the new tests against an old numpy for long.
no |
@rgommers. Fair enough. That was just an example. I will look into |
After making a bunch of changes, I have transferred this PR to |
thanks @madphysicist |
Motivation
This function is used in one place in numpy already (to compute the Freedman-Diaconis histogram bin estimator), in addition to being requested on Stack Overflow a couple of times.
It is also used in matplotlib for box and violin plots: http://matplotlib.org/faq/howto_faq.html#interpreting-box-plots-and-violin-plots. It is a very simple, common and robust dispersion estimator. There does not appear to be an implementation for it anywhere in numpy or scipy.
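To make the histogram connection above concrete, here is a minimal sketch of the Freedman-Diaconis rule, which sets the bin width from the IQR and sample size as h = 2 * IQR * n^(-1/3). The function name is illustrative, not the PR's API.

```python
import numpy as np

def fd_bin_width(x):
    # Freedman-Diaconis rule: h = 2 * IQR * n**(-1/3).
    # Using the IQR instead of the standard deviation makes the
    # bin width robust to outliers in the data.
    q75, q25 = np.percentile(x, [75, 25])
    return 2.0 * (q75 - q25) * len(x) ** (-1.0 / 3.0)

x = np.arange(1000.0)
print(fd_bin_width(x))  # IQR = 499.5, n**(1/3) = 10, so h = 99.9
```

This is the computation the PR description refers to when it says the 'fd' estimator in `np.histogram` was updated to use the new function.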
About
This function is a convenience combination of `np.percentile` and `np.subtract`. As such, it allows the difference between any two percentiles to be computed, not necessarily (25, 75), which is the default. All of the recent enhancements to `percentile` are used. The documentation and testing are borrowed heavily from `np.percentile`.

Wikipedia reference: https://en.wikipedia.org/wiki/Interquartile_range
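The core of the combination described above fits in a couple of lines; a bare-bones sketch (the actual PR layers axis, output, and interpolation handling on top of this, so treat the signature here as illustrative):

```python
import numpy as np

def iqr(a, rng=(25, 75), axis=None):
    """Sketch: difference between two percentiles, (25, 75) by default."""
    # Reverse the pair so the subtraction is upper minus lower.
    return np.subtract(*np.percentile(a, rng[::-1], axis=axis))

x = np.array([1, 3, 5, 7, 9, 11])
print(iqr(x))            # 8.5 - 3.5 = 5.0
print(iqr(x, (10, 90)))  # any percentile pair works: 10.0 - 2.0 = 8.0
```

Because it delegates to `np.percentile`, the sketch inherits that function's interpolation behavior, which is also why the tests depend on the #7129 fix noted below.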
Note
The tests will not pass until the bug-fix for the `np.percentile` kwarg `interpolation='midpoint'` (#7129) is incorporated and this PR is rebased.