
ENH: Added iqr function to compute IQR metric in numpy/function_base.py #7137


Closed
wants to merge 1 commit

Conversation

madphysicist
Contributor

Motivation
This function is used in one place in numpy already (to compute the Freedman-Diaconis histogram bin estimator), in addition to being requested on Stack Overflow a couple of times.

It is also used in matplotlib for box and violin plots: http://matplotlib.org/faq/howto_faq.html#interpreting-box-plots-and-violin-plots. It is a very simple, common and robust dispersion estimator. There does not appear to be an implementation for it anywhere in numpy or scipy.

About
This function is a convenience combination of np.percentile and np.subtract. As such, it allows the difference between any two percentiles to be computed, not just the default (25, 75). All of the recent enhancements to percentile are used.
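
In essence, the function is something like the following minimal sketch (not the PR's actual implementation; the `rng` and `axis` parameter names are illustrative):

```python
import numpy as np

def iqr(x, rng=(25, 75), axis=None):
    # Difference between two percentiles; the default (25, 75)
    # gives the classic interquartile range.
    lo, hi = np.percentile(x, rng, axis=axis)
    return np.subtract(hi, lo)
```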

The documentation and testing are borrowed heavily from np.percentile.

Wikipedia Reference: https://en.wikipedia.org/wiki/Interquartile_range

Note
The tests will not pass until the bug fix for np.percentile's interpolation='midpoint' kwarg (#7129) is merged and this PR is rebased.

Updated histogram 'fd' bin estimator to use this function.
@juliantaylor
Contributor

I don't think adding almost 400 lines of code for what is essentially documentation for a two-liner is worthwhile.
Maybe it is sufficient to mention iqr in the documentation of percentile.

@rgommers
Member

Statsmodels does have an iqr function: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/tools/eval_measures.py#L267

The docstring in this PR looks better than what's in statsmodels now, and arguably statsmodels is a better place for this function than numpy. Alternatively maybe scipy.stats, but it's not general enough to be included in numpy imho.

So maybe improve the statsmodels version instead?

@josef-pkt

IMO, scipy.stats could also get it and use it (in KDE it would be nice, except that's multivariate), but maybe with the normal constant.
FYI, I added it in statsmodels after the fact; it's used inline in several places as a two-liner, and those places don't delegate to the function yet. AFAICS, the statsmodels version predates vectorized percentiles in numpy and scipy, and is mostly redundant now.

Another simple robust dispersion estimate is MAD, and the statsmodels version is used as a helper function by some packages and scripts.
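
For reference, a statsmodels-style MAD is itself only a few lines. A minimal sketch with the usual normal-consistency constant (a production version would add axis and NaN handling):

```python
import numpy as np

def mad(x, c=0.6744897501960817):
    # Median absolute deviation, divided by Phi^{-1}(0.75) so that
    # it estimates the standard deviation for normal data.
    x = np.asarray(x)
    return np.median(np.abs(x - np.median(x))) / c
```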

@njsmith
Member
njsmith commented Jan 28, 2016

It's really only 8 lines of code, and it seems funny to say we don't want a change because the docs and tests are too thorough :-). Also I have a fond feeling towards these tests since I guess they're what caught that terrible wrong-answers bug in #7129. OTOH it's true that we don't want to keep piling up infinite numbers of functions in numpy to cover every eventuality.

On net I think I'm +0.5 for this -- it really is a pretty fundamental operation with obvious semantics, so I don't see why people should have to keep reimplementing it, or pull in statsmodels just for it. (It's useful way beyond people who use statsmodels -- it'd be like requiring statsmodels for anyone who wanted to compute stddev...) And the maintenance burden is unlikely to be high or even measurable.

@josef-pkt

In my opinion iqr is really a redundant convenience function, given that percentile calculates an arbitrary number of quantiles at once.

In our use cases in statsmodels we would need the variance estimate based on IQR, whose scaling is more difficult to remember: divide by 1.349 (or 1.34). (I thought statsmodels had it, but I can't find it.)

(Related: statsmodels has robust skew and kurtosis estimates with normal reference based on percentiles: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/stattools.py#L270)
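
For the record, the 1.349 above comes from the quartiles of the standard normal, e.g. (assuming scipy is available):

```python
from scipy.stats import norm

# The IQR of N(0, 1) is 2 * Phi^{-1}(0.75):
print(2 * norm.ppf(0.75))  # 1.3489795003921634
# So iqr(x) / 1.349 is a robust estimate of the standard deviation,
# and its square estimates the variance.
```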

@argriffing
Contributor

+1
The use case I've seen is in scipy.stats where most distributions belong to location-scale families on the real line. For many distributions (e.g. the normal distribution), a good way to get the roughest initial guess of the location and scale is to look at the sample mean and standard deviation respectively. But for other distributions these summaries are not so great. Instead, you often want to guess the location using the median, and for these distributions iqr is often the analogous way to guess the scale.
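
To make that concrete, here is a hypothetical snippet for the Cauchy distribution, where the sample mean and standard deviation are useless but the quartiles sit exactly at loc ± scale, so half the IQR recovers the scale:

```python
import numpy as np

x = np.random.standard_cauchy(10000)
loc0 = np.median(x)                    # robust initial guess for the location
q25, q75 = np.percentile(x, [25, 75])
scale0 = (q75 - q25) / 2               # half-IQR estimates the Cauchy scale (~1 here)
```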

@madphysicist
Contributor Author

To emphasize the two main points made by @argriffing that really motivated my proposal: a) IQR is very commonly used as a replacement for the standard deviation, and b) it is much better suited to dealing with non-normal data, which is why it is used.

@juliantaylor
Contributor

I know iqr is common, I just think it's unnecessary to have in numpy just to save one line of code.
We also don't have a median absolute deviation, which is a good 10 lines if done properly.

Though if we add it, we should also have naniqr for symmetry.

@madphysicist
Contributor Author

What would naniqr do?

@argriffing
Contributor

What would naniqr do?

NaN is the de facto way to represent missing data in the numpy/pandas stack.
http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.nanstd.html
http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.nanmedian.html

@josef-pkt

I think IQR is not that common. statsmodels uses it in KDE, and MAD in robust.

IMO, if you start to add two-liners like this (where even the name says how to calculate it, if you remember what the three letters stand for), then you will get a lot of other PRs with functions that are also "commonly" used (in some neighborhood).

How about moving the trimmed statistics from scipy.stats to numpy?
What about winsorized mean and standard deviation?
The histogram PR also needs the skewness.

(My response to a discussion on the Python users list about adding a stats package was mainly about where to draw the line. Especially once someone mentioned the t-test, you would need to add something like scipy.special to the mix.)

Up to you.

@madphysicist
Contributor Author

@argriffing. So you would treat NaNs as a mask instead of just returning NaN as the result? Is there a nanpercentile?

@madphysicist
Contributor Author

@josef-pkt. Your argument makes a lot of sense. While I would still prefer to leave the PR here, do you think that I could make a similar PR to scipy.stats instead? Successfully, I mean.

@argriffing
Contributor

Is there a nanpercentile?

http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.nanpercentile.html

I don't use these functions, and it looks like pandas implements its own nan ops.
https://github.com/pydata/pandas/blob/master/pandas/core/nanops.py
I don't know if new code should use the numpy nan* functions or if there is a reason other than consistency for adding new ones.
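
By analogy with the nan* functions above, a naniqr would presumably be a thin wrapper over np.nanpercentile (a hypothetical sketch):

```python
import numpy as np

def naniqr(x, axis=None):
    # IQR that ignores NaNs, mirroring np.nanstd / np.nanmedian.
    q25, q75 = np.nanpercentile(x, [25, 75], axis=axis)
    return q75 - q25
```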

@josef-pkt

@madphysicist Adding to scipy.stats is largely up to @rgommers. I would add an option for scaling IQR so it's an estimate for the variance for the normal reference case (*). (That's what I wanted to get at in statsmodels before we added the robust skew and kurtosis, but I never got around to collecting those outlier robust functions.)
I think having readily available descriptive statistics that are robust to heavy tails or outliers are very useful.
BTW: scipy.stats functions now also have an option for what to do about NaNs, so you can get something like naniqr through a keyword.

@argriffing AFAIR, nanops in pandas are older than most of the nanxxx functions in numpy, and still have more coverage.

(*) A long time ago I was playing with some examples in related functions to get the scaling for other distributions as a reference. But as in the case of MAD in statsmodels, I don't think anyone will usually change from the normal reference scaling or from no scaling, i.e. raw IQR.

@madphysicist
Contributor Author

Sorry about the weird commits. I just learned a couple of things about git rebase the hard way. Everything is back to the way it was, and I will be more careful in the future.

@rgommers
Member

This is really statistics related and indeed not that commonly used I think. Adding it to scipy.stats is fine with me - it's either there or in statsmodels imho.

If adding to scipy.stats, then it can go in stats.py where a number of similar metrics live (sem, trimmed var/std/mean, geometric mean, scoreatpercentile, etc.). It should then have a nan_policy keyword, support masked array inputs, and preferably have the scaling option that @josef-pkt mentions above.
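
Put together, a scipy.stats-style version along the lines of that checklist might look like this sketch (the parameter names and defaults are assumptions, not an agreed API; masked-array support is omitted):

```python
import numpy as np

def iqr(x, axis=None, rng=(25, 75), scale=1.0, nan_policy='propagate'):
    # scale ~ 1.349 (2 * Phi^{-1}(0.75)) would make the result a
    # normal-consistent estimate of the standard deviation.
    x = np.asarray(x)
    if nan_policy == 'raise' and np.isnan(x).any():
        raise ValueError("input contains nan")
    pct = np.nanpercentile if nan_policy == 'omit' else np.percentile
    lo, hi = pct(x, rng, axis=axis)
    return (hi - lo) / scale
```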

The same applies to mad, I'd say.

@charris
Member
charris commented Feb 1, 2016

Is there a consensus for moving this to scipy?

@madphysicist
Contributor Author

I am perfectly OK with modifying this to be scipy compatible. I do have one question, though. If #7129 is only fixed in numpy 1.12.0, the tests for this function will break in scipy under older numpy versions. I see a couple of ways of dealing with that:

1. Comment out the tests that rely on midpoint interpolation and leave a comment to reenable them when the appropriate version of numpy becomes available.
2. Add an if statement around those tests. Something along the lines of `if distutils.version.LooseVersion(np.__version__) >= distutils.version.LooseVersion("1.12.0"): ...`

What are your recommendations for the proper way to handle the situation?

@njsmith
Member
njsmith commented Feb 1, 2016

If it were me then I'd probably go with the latter, because no one will ever notice the comment and reenable the test. But really you should ask the scipy devs, since they're a somewhat different group of people and are the ones you'll actually have to please :-)

@rgommers
Member
rgommers commented Feb 1, 2016

No LooseVersion please, it's useless. Use `from scipy._lib._version import NumpyVersion`.
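
With NumpyVersion, the gated tests could then be skipped rather than commented out, e.g. with the numpy.testing decorators of that era (the test name here is hypothetical):

```python
import numpy as np
from numpy.testing import dec
from scipy._lib._version import NumpyVersion

@dec.skipif(NumpyVersion(np.__version__) < '1.12.0',
            "needs the np.percentile interpolation='midpoint' fix (numpy#7129)")
def test_iqr_midpoint():
    pass  # the midpoint-interpolation tests would go here
```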

@madphysicist
Contributor Author

@rgommers. Fair enough. That was just an example. I will look into NumpyVersion.

@madphysicist
Contributor Author

After making a bunch of changes, I have transferred this PR to scipy.stats.

@rgommers
Member
rgommers commented Feb 4, 2016

thanks @madphysicist
