
ENH: Added iqr function to compute IQR metric in numpy/function_base.py #7137


Closed
wants to merge 1 commit

Conversation

madphysicist
Contributor

Motivation
This function is used in one place in numpy already (to compute the Freedman-Diaconis histogram bin estimator), in addition to being requested on Stack Overflow a couple of times.

It is also used in matplotlib for box and violin plots: http://matplotlib.org/faq/howto_faq.html#interpreting-box-plots-and-violin-plots. It is a very simple, common and robust dispersion estimator. There does not appear to be an implementation for it anywhere in numpy or scipy.

About
This function is a convenience combination of np.percentile and np.subtract. As such, it allows the difference between any two percentiles to be computed, not just the default (25, 75). All of the recent enhancements to percentile are used.
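
In essence, the function is something like the following minimal sketch (not the PR's actual implementation; the `rng` and `axis` parameter names are illustrative):

```python
import numpy as np

def iqr(x, rng=(25, 75), axis=None):
    # Difference between two percentiles; the default (25, 75)
    # gives the classic interquartile range.
    lo, hi = np.percentile(x, rng, axis=axis)
    return np.subtract(hi, lo)
```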

The documentation and testing are borrowed heavily from np.percentile.

Wikipedia Reference: https://en.wikipedia.org/wiki/Interquartile_range

Note
The tests will not pass until the bug fix for np.percentile's interpolation='midpoint' kwarg (#7129) is merged and this PR is rebased.

Updated histogram 'fd' bin estimator to use this function.
@juliantaylor
Contributor

I don't think adding almost 400 lines of code for what is essentially documentation for a two-liner is worthwhile.
Maybe it is sufficient to mention iqr in the documentation of percentile.

@rgommers
Member

Statsmodels does have an iqr function: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/tools/eval_measures.py#L267

The docstring in this PR looks better than what's in statsmodels now, and arguably statsmodels is a better place for this function than numpy. Alternatively maybe scipy.stats, but it's not general enough to be included in numpy imho.

So maybe improve the statsmodels version instead?

@josef-pkt

IMO, scipy.stats could also get it and use it (in KDE it would be nice, except that's multivariate), but maybe with the normal constant.
FYI, I added it in statsmodels after the fact; it's used inline in several places as a two-liner, and those places don't delegate to the function yet. AFAICS, the statsmodels version predates vectorized percentiles in numpy and scipy, and is mostly redundant now.

Another simple robust dispersion estimate is MAD, and the statsmodels version is used as a helper function by some packages and scripts.
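
For reference, a statsmodels-style MAD is itself only a few lines. A minimal sketch with the usual normal-consistency constant (a production version would add axis and NaN handling):

```python
import numpy as np

def mad(x, c=0.6744897501960817):
    # Median absolute deviation, divided by Phi^{-1}(0.75) so that
    # it estimates the standard deviation for normal data.
    x = np.asarray(x)
    return np.median(np.abs(x - np.median(x))) / c
```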

@njsmith
Member
njsmith commented Jan 28, 2016

It's really only 8 lines of code, and it seems funny to say we don't want a change because the docs and tests are too thorough :-). Also I have a fond feeling towards these tests since I guess they're what caught that terrible wrong-answers bug in #7129. OTOH it's true that we don't want to keep piling up infinite numbers of functions in numpy to cover every eventuality.

On net I think I'm +0.5 for this -- it really is a pretty fundamental operation with obvious semantics, so I don't see why people should have to keep reimplementing it, or pull in statsmodels just for it. (It's useful way beyond people who use statsmodels -- it'd be like requiring statsmodels for anyone who wanted to compute stddev...) And the maintenance burden is unlikely to be high or even measurable.

@josef-pkt

In my opinion iqr is really a redundant convenience function, given that percentile calculates an arbitrary number of quantiles at once.

In our use cases in statsmodels we would need the variance estimate based on IQR, whose scaling is more difficult to remember: divide by 1.349 (or 1.34). (I thought statsmodels had it, but I can't find it.)

(Related: statsmodels has robust skew and kurtosis estimates with normal reference based on percentiles: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/stattools.py#L270)
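
For the record, the 1.349 above comes from the quartiles of the standard normal, e.g. (assuming scipy is available):

```python
from scipy.stats import norm

# The IQR of N(0, 1) is 2 * Phi^{-1}(0.75):
print(2 * norm.ppf(0.75))  # 1.3489795003921634
# So iqr(x) / 1.349 is a robust estimate of the standard deviation,
# and its square estimates the variance.
```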

@argriffing
Contributor

+1
The use case I've seen is in scipy.stats where most distributions belong to location-scale families on the real line. For many distributions (e.g. the normal distribution), a good way to get the roughest initial guess of the location and scale is to look at the sample mean and standard deviation respectively. But for other distributions these summaries are not so great. Instead, you often want to guess the location using the median, and for these distributions iqr is often the analogous way to guess the scale.
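
To make that concrete, here is a hypothetical snippet for the Cauchy distribution, where the sample mean and standard deviation are useless but the quartiles sit exactly at loc ± scale, so half the IQR recovers the scale:

```python
import numpy as np

x = np.random.standard_cauchy(10000)
loc0 = np.median(x)                    # robust initial guess for the location
q25, q75 = np.percentile(x, [25, 75])
scale0 = (q75 - q25) / 2               # half-IQR estimates the Cauchy scale (~1 here)
```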

@madphysicist
Contributor Author

To emphasize the two main points made by @argriffing that really motivated my proposal: a) IQR is very commonly used as a replacement for the standard deviation, and b) it is much better suited to dealing with non-normal data, which is why it is used.

@juliantaylor
Contributor

I know iqr is common, I just think it's unnecessary to have in numpy just to save one line of code.
We also don't have a median absolute deviation, which is a good 10 lines if done properly.

Though if we add it, we should also have naniqr for symmetry.

@madphysicist
Contributor Author

What would naniqr do?

@argriffing
Contributor

What would naniqr do?

NaN is the de facto way to represent missing data in the numpy/pandas stack.
http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.nanstd.html
http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.nanmedian.html

@josef-pkt

I think IQR is not that common. statsmodels uses it in KDE, and MAD in robust.

IMO, if you start to add two-liners like this (where even the name says how to calculate it, if you remember what the three letters stand for), then you will get a lot of other PRs with functions that are also "commonly" used (in some neighborhood).

How about moving the trimmed statistics from scipy.stats to numpy?
What about winsorized mean and standard deviation?
The histogram PR also needs the skewness.

(My response to a discussion on the Python users list about adding a stats package was mainly about where to draw the line. Especially once someone mentioned the t-test, you would need to add something like scipy.special to the mix.)

Up to you.

@madphysicist
Contributor Author

@argriffing. So you would treat NaNs as a mask instead of just returning NaN as the result? Is there a nanpercentile?

@madphysicist
Contributor Author

@josef-pkt. Your argument makes a lot of sense. While I would still prefer to leave the PR here, do you think that I could make a similar PR to scipy.stats instead? Successfully, I mean.

@argriffing
Contributor

Is there a nanpercentile?

http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.nanpercentile.html

I don't use these functions, and it looks like pandas implements its own nan ops.
https://github.com/pydata/pandas/blob/master/pandas/core/nanops.py
I don't know if new code should use the numpy nan* functions or if there is a reason other than consistency for adding new ones.
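
By analogy with the nan* functions above, a naniqr would presumably be a thin wrapper over np.nanpercentile (a hypothetical sketch):

```python
import numpy as np

def naniqr(x, axis=None):
    # IQR that ignores NaNs, mirroring np.nanstd / np.nanmedian.
    q25, q75 = np.nanpercentile(x, [25, 75], axis=axis)
    return q75 - q25
```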

@josef-pkt

@madphysicist Adding to scipy.stats is largely up to @rgommers. I would add an option for scaling IQR so it's an estimate for the variance for the normal reference case (*). (That's what I wanted to get at in statsmodels before we added the robust skew and kurtosis, but I never got around to collecting those outlier robust functions.)
I think having readily available descriptive statistics that are robust to heavy tails or outliers are very useful.
BTW: scipy.stats functions now also have an option for what to do about NaNs, so you can get something like naniqr through a keyword.

@argriffing AFAIR, nanops in pandas are older than most of the nanxxx functions in numpy, and still have more coverage.

(*) A long time ago I was playing with some examples in related functions to get the scaling for other distributions as a reference. But as in the case of MAD in statsmodels, I don't think anyone will usually change from the normal reference scaling or from no scaling, i.e. raw IQR.

@madphysicist
Contributor Author

Sorry about the weird commits. I just learned a couple of things about git rebase the hard way. Everything is back to the way it was, and I will be more careful in the future.

@rgommers
Member

This is really statistics related and indeed not that commonly used I think. Adding it to scipy.stats is fine with me - it's either there or in statsmodels imho.

If adding to scipy.stats, then it can go in stats.py where a number of similar metrics live (sem, trimmed var/std/mean, geometric mean, scoreatpercentile, etc.). It should then have a nan_policy keyword, support masked array inputs, and preferably have the scaling option that @josef-pkt mentions above.
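
Put together, a scipy.stats-style version along the lines of that checklist might look like this sketch (the parameter names and defaults are assumptions, not an agreed API; masked-array support is omitted):

```python
import numpy as np

def iqr(x, axis=None, rng=(25, 75), scale=1.0, nan_policy='propagate'):
    # scale ~ 1.349 (2 * Phi^{-1}(0.75)) would make the result a
    # normal-consistent estimate of the standard deviation.
    x = np.asarray(x)
    if nan_policy == 'raise' and np.isnan(x).any():
        raise ValueError("input contains nan")
    pct = np.nanpercentile if nan_policy == 'omit' else np.percentile
    lo, hi = pct(x, rng, axis=axis)
    return (hi - lo) / scale
```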

The same applies to mad, I'd say.

@charris
Member
charris commented Feb 1, 2016

Is there a consensus for moving this to scipy?

@madphysicist
Contributor Author

I am perfectly OK with modifying this to be scipy compatible. I do have one question, though. If #7129 is only fixed in numpy 1.12.0, the tests for this function will break in scipy under older numpy versions. I see a couple of ways of dealing with that:

1. Comment out the tests that rely on midpoint interpolation and leave a comment to reenable them when the appropriate version of numpy becomes available.
2. Add an if statement around those tests. Something along the lines of `if distutils.version.LooseVersion(np.__version__) >= distutils.version.LooseVersion("1.12.0"): ...`

What are your recommendations for the proper way to handle the situation?

@njsmith
Member
njsmith commented Feb 1, 2016

If it were me then I'd probably go with the latter, because no one will ever notice the comment and reenable the test. But really you should ask the scipy devs, since they're a somewhat different group of people and are the ones you'll actually have to please :-)

@rgommers
Member
rgommers commented Feb 1, 2016

No LooseVersion please, it's useless. Use `from scipy._lib._version import NumpyVersion`.
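
With NumpyVersion, the gated tests could then be skipped rather than commented out, e.g. with the numpy.testing decorators of that era (the test name here is hypothetical):

```python
import numpy as np
from numpy.testing import dec
from scipy._lib._version import NumpyVersion

@dec.skipif(NumpyVersion(np.__version__) < '1.12.0',
            "needs the np.percentile interpolation='midpoint' fix (numpy#7129)")
def test_iqr_midpoint():
    pass  # the midpoint-interpolation tests would go here
```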

@madphysicist
Contributor Author

@rgommers. Fair enough. That was just an example. I will look into NumpyVersion.

@madphysicist
Contributor Author

After making a bunch of changes, I have transferred this PR to scipy.stats.

@rgommers
Member
rgommers commented Feb 4, 2016

thanks @madphysicist
