ENH: scipy.stats: add multivariate hypergeometric distribution #12839

tirthasheshpatel · 2020-09-10T20:09:02Z

Reference issue

Closes #12585

What does this implement/fix?

This PR implements the multivariate hypergeometric distribution in scipy.stats module!

Documentation suite for the methods of the distribution
Tests for all the added methods of the distribution

@mdhaber

scipy/stats/_multivariate.py

tirthasheshpatel · 2020-09-12T06:06:50Z

I also tested the time taken by rvs method implemented vs numpy's generator:

speed test

In [1]: import numpy as np

In [2]: from scipy.stats import multivariate_hypergeom as mvhg

In [3]: rng = np.random.Generator(np.random.PCG64())
In [4]: m = np.random.randint(0, 100, 100)

In [5]: n = np.random.randint(0, m.sum())

In [6]: n
Out[6]: 75

In [7]: m
Out[7]:
array([28, 63,  1, 78, 49, 89,  0, 37, 35, 99, 87, 38,  3, 14, 38, 61, 35,
       32, 86, 44, 91, 20, 72, 15, 19, 86, 29, 25, 80,  7, 18, 49, 95, 40,
       11, 53, 72,  8, 17,  8, 69,  2, 42, 27,  0, 58, 91,  3, 51, 23, 92,
       54, 17, 40, 44, 90, 12, 60, 41, 63, 28, 81, 55, 66, 92, 76, 68, 59,
       69, 87, 41, 22, 68, 65, 42, 68, 43, 80,  2, 63, 12, 48,  4, 25, 11,
       82, 86,  8, 51, 90, 26, 74, 49,  3, 54, 16,  9, 34, 73, 37])

In [8]: %time _ = mvhg.rvs(m=m, n=n, size=1000)
CPU times: user 62.5 ms, sys: 0 ns, total: 62.5 ms
Wall time: 50.9 ms

In [9]: %time _ = mvhg.rvs(m=m, n=n, size=1000, random_state=rng)
CPU times: user 31.2 ms, sys: 15.6 ms, total: 46.9 ms
Wall time: 21.8 ms

NumPy's generator is more than twice as fast!

Some more tests

In [10]: %time _ = mvhg.rvs(m=np.repeat(m, 2), n=n*2, size=1000)
CPU times: user 109 ms, sys: 0 ns, total: 109 ms
Wall time: 106 ms

In [11]: %time _ = mvhg.rvs(m=np.repeat(m, 2), n=n*2, size=1000, random_state=rng)
CPU times: user 46.9 ms, sys: 0 ns, total: 46.9 ms
Wall time: 40.8 ms

In [12]: %time _ = mvhg.rvs(m=np.repeat(m, 3), n=n*3, size=1000)
CPU times: user 156 ms, sys: 0 ns, total: 156 ms
Wall time: 150 ms

In [13]: %time _ = mvhg.rvs(m=np.repeat(m, 3), n=n*3, size=1000, random_state=rng)
CPU times: user 62.5 ms, sys: 0 ns, total: 62.5 ms
Wall time: 60.3 ms

In [14]: %time _ = mvhg.rvs(m=np.repeat(m, 30), n=n*30, size=1000)
CPU times: user 1.55 s, sys: 31.2 ms, total: 1.58 s
Wall time: 1.58 s

In [15]: %time _ = mvhg.rvs(m=np.repeat(m, 30), n=n*30, size=1000, random_state=rng)
CPU times: user 656 ms, sys: 31.2 ms, total: 688 ms
Wall time: 691 ms

scipy/stats/tests/test_multivariate.py

scipy/stats/_multivariate.py

mdhaber · 2020-09-12T18:52:58Z

I wouldn't worry about the speed too much. IMO speeding up the RVS does not need to be in scope. You're already using the faster NumPy alternatives when the provided seed is of the appropriate type, right?

tirthasheshpatel · 2020-09-12T18:55:50Z

You're already using the faster NumPy alternatives when the provided seed is of the appropriate type, right?

Yes. When the random_state (or the seed) parameter to the rvs method is a numpy generator, it will use the new numpy distribution to sample which are fast enough!

scipy/stats/_multivariate.py

rlucas7

@tirthasheshpatel one (minor) comment in two places, how the version check for numpy Generator is performed should be pep-440 compliant.

rlucas7 · 2020-10-28T23:54:59Z

scipy/stats/tests/test_multivariate.py

+        assert_allclose(rvs.mean(0), rv.mean(), rtol=1e-2)
+
+    def test_rvs_numpy(self):
+        if np.__version__ < '1.18':


same comment I made before also applies here about the pep-440 comparison and there is a skipif annotation if you want to make it a little cleaner and consistent with how we've handled this in other PRs IIRC.

not sure if I'm seeing a stale change but the test here checks numpy's version for generator support?

scipy/stats/tests/test_multivariate.py

…in tests

WarrenWeckesser · 2020-10-29T05:37:53Z

@tirthasheshpatel, there is a problem with the rvs method:

In [6]: import scipy

In [7]: scipy.__version__
Out[7]: '1.6.0.dev0+d2a1c98'

In [8]: from scipy.stats import multivariate_hypergeom

A basic example, which uses size=1 by default:

In [10]: multivariate_hypergeom.rvs([1, 3, 9], 5)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-fd2505ebd1ae> in <module>
----> 1 multivariate_hypergeom.rvs([1, 3, 9], 5)

~/mc37scipymaster/lib/python3.7/site-packages/scipy-1.6.0.dev0+d2a1c98-py3.7-macosx-10.9-x86_64.egg/scipy/stats/_multivariate.py in rvs(self, m, n, size, random_state)
   4678 
   4679         if size == 1:
-> 4680             return rvs.squeeze(-1)
   4681         return rvs
   4682 

ValueError: cannot select an axis to squeeze out which has size not equal to one

Now try with size=None, which should generate a single variate:

In [11]: multivariate_hypergeom.rvs([1, 3, 9], 5, size=None)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-ed5f23f9cd57> in <module>
----> 1 multivariate_hypergeom.rvs([1, 3, 9], 5, size=None)

~/mc37scipymaster/lib/python3.7/site-packages/scipy-1.6.0.dev0+d2a1c98-py3.7-macosx-10.9-x86_64.egg/scipy/stats/_multivariate.py in rvs(self, m, n, size, random_state)
   4663             size_ = (size, )
   4664 
-> 4665         rvs = np.empty(size_ + (m.shape[-1], ), dtype=m.dtype)
   4666         rem = M
   4667 

TypeError: unsupported operand type(s) for +: 'NoneType' and 'tuple'

I'm working on a fix for this particular problem, and for handling size as I suggested in an earlier comment. A couple changes that I'd like to see are (1) change the default size to None (as it is in multinomial and in the numpy random methods), and (2) eliminate all uses of squeeze for changing the final shape.

We currently don't have a written specification for exactly how the multivariate distribution are supposed to handle broadcasting, nor how broadcasting of the parameters interacts with the size parameter of the rvs method, so I'll work on writing up some notes on that while I work on the code.

tirthasheshpatel · 2020-10-29T06:46:23Z

change the default size to None (as it is in multinomial F438 and in the numpy random methods)

Done!

eliminate all uses of squeeze for changing the final shape

I have eliminated all except one which is necessary to handle a case in the previous comment. When the size=None, the first dimension will be one (because size_=1) that needs to be squeezed for compatibility with numpy. For other conditions, I have removed squeezing as you said.

WarrenWeckesser

@tirthasheshpatel, I made a bunch more suggestions, mostly cosmetic, in-line. Sorry I missed these on my first pass through the code.

There are few more significant issues to be discussed.

This example breaks because M-1 is 0; the result of cov in this case should be an array of zeros with shape (3, 3)):

In [189]: multivariate_hypergeom.cov([1, 0, 0], 1)
/Users/warren/mc37scipymaster/lib/python3.7/site-packages/scipy-1.6.0.dev0+21d6fce-py3.7-macosx-10.9-x86_64.egg/scipy/stats/_multivariate.py:4620: RuntimeWarning: invalid value encountered in true_divide
output = (-n * (M-n)/(M-1) / (M**2) *
/Users/warren/mc37scipymaster/lib/python3.7/site-packages/scipy-1.6.0.dev0+21d6fce-py3.7-macosx-10.9-x86_64.egg/scipy/stats/_multivariate.py:4627: RuntimeWarning: invalid value encountered in long_scalars
m[..., i]*(M-m[..., i]) / (M**2))
Out[189]: 
array([[nan, nan, nan],
     [nan, nan, nan],
     [nan, nan, nan]])

It would be nice if these edges cases were handled:

multivariate_hypergeom.mean([0, 0, 0], 0) should return np.array([0.0, 0.0, 0.0]), and multivariate_hypergeometric.cov([0, 0, 0], 0) should return an array with shape (3, 3) containing all zeros. (The PMF is 1 at [0, 0, 0] and 0 everywhere else, so the expected value of the distribution is [0, 0, 0].)
multivariate_hypergeom.mean([], 0) should return np.array([], dtype=np.float64), and multivariate_hypergeom.cov([], 0) should return np.array([], shape=(0, 0), dtype=np.float64)

I mentioned this in an in-line comment: I don't think we should be casting the inputs from whatever is given to integers. An input such as multivariate_hypergeom.mean([1.2, 3.4, 5.6], 7.8) should be rejected with an error (perhaps something like TypeError: input must be integers). I include the quantiles in this, so multivariate_hypergeom.pmf([2.5, 3.1], [3, 5], 5) should also raise a TypeError. Currently the code checks for non-integer quantiles and returns 0 for the PMF; instead, I think we can interpret the distribution as one whose domain is the integers (it is a discrete distribution, after all), and simply not accept floating point values.
Broadcasting in logpmf, pmf, mean, var and cov works nicely! But there are some issues in the rvs method.

The NumPy sampler for the multivariate hypergeometric distribution does not handle broadcasting. So there is an inconsistency in whether or not multivariate_hypergeom.rvs handles broadcasting, depending on the NumPy version and the parameter random_state.
Reviewing this issue led to looking into the state of broadcasting for the multivariate distributions in NumPy. There is currently a pull request to add it to NumPy's multinomial sampler (numpy/numpy#16740). In numpy/numpy#17669, I started a discussion on how broadcasting and the size parameter interact for the multivariate distributions. To be consistent with the rules that I outlined there, size=None is not the same as size=1. Instead, size=None means the size is determined by the broadcast of the distribution parameters.

scipy/stats/_multivariate.py

scipy/stats/__init__.py

scipy/stats/_multivariate.py

tirthasheshpatel

@WarrenWeckesser A lot of hefty changes below! I have added code to handle all the edge cases and I also added checks for integer x, m , and n.

First of all, thank you so much for such a thorough look at this PR. I really missed all these edge cases and I am happy they are covered now.

I have tried my best to handle all the edge cases as concisely as possible. Below are some changes and their justifications. One big comment I had was whether or not it is OK to use masked_arrays to handle edge cases. I am not very experienced with masked_arrays and don't know the consequences of using them, but they do seem to resolve all the edge cases. Sorry, if I have introduced some breaking code below while trying to resolve the edge cases 😅

scipy/stats/_multivariate.py

tirthasheshpatel · 2020-11-02T14:35:02Z

scipy/stats/_multivariate.py

+        x = np.asarray(x)
+        if not np.issubdtype(x.dtype, np.integer):


The comment about only allowing integers has been addressed here

scipy/stats/_multivariate.py

scipy/stats/tests/test_multivariate.py

WarrenWeckesser

That's great progress!

I made several minor suggestions inline, but there are a few bigger items still be be addressed:

These (very) edge cases don't work:
```
multivariate_hypergeom.mean([], 0)
multivariate_hypergeom.cov([], 0)
```
(See my comment back in #12839 (review)) The problem is that when np.asarray is given an empty list, it use float64 as the default dtype. You could handle this in _process_parameters by doing something like this:
```
 m = np.asarray(m)
 if m.size == 0:
     m = m.astype(int)
```
And do the same for n. Note, however, that I haven't checked if the rest of the code will "just work" with these emtpy arrays and return the expected results. If it turns out that it will require a lot of special code to handle these cases, we could defer handling it for now, and just raise an error stating that empty inputs are not allowed.
Also noted in #12839 (review), the NumPy distribution sampler that is available in NumPy 1.18 or later does not handle broadcasting. So if the user passes parameters to the rvs() method that require broadcasting, we can't use the NumPy sampler. It might be simpler to just not attempt to use the NumPy sampler at all. Always use the version that you implemented here, and add a comment that we can start using the NumPy version when it is updated to handle broadcasting.
The comment I made in item 4 of #12839 (review) still applies: if size is None, the size is not to be treated as size=1. Instead, the size is determined by the broadcast shape of the input parameters. So an input such as
```
multivariate_hypergeom.rvs([2, 3, 5], [1, 2, 3, 4])
```
should act like multivariate_hypergeom.rvs([2, 3, 5], [1, 2, 3, 4], size=(4,)) and generate a result with shape (4, 3), and
```
multivariate_hypergeom.rvs([[5, 10, 20], [10, 10, 15]], [[2], [3], [4], [5]])
```
(where the first argument has shape (2, 3) and the second has shape (4, 1)) should act like
```
multivariate_hypergeom.rvs([[5, 10, 20], [10, 10, 15]], [[2], [3], [4], [5]], size=(4, 2))
```
and generate a result with shape (4, 2, 3).

scipy/stats/_multivariate.py

scipy/stats/tests/test_multivariate.py

…le size=None

tirthasheshpatel · 2020-11-12T12:47:46Z

Thanks, @WarrenWeckesser for the explanation of the behavior when size=None. It should be fixed now. I guess all your comments addressed now, right?

WarrenWeckesser

I have a few more tiny changes. I'll commit them as a batch, and let the tests run once more.

scipy/stats/_multivariate.py

scipy/stats/tests/test_multivariate.py

scipy/stats/_multivariate.py

Whitespace tweaks and remove imports of `NumpyVersion`

WarrenWeckesser · 2020-11-12T18:19:51Z

I'm not sure what happened with appveyor and travisci when I pushed a batch of small changes. Closing and reopening to try again.

WarrenWeckesser · 2020-11-12T18:31:05Z

@rlucas7, I think the change that you requested was related to the use of NumpyVersion. That is no longer being imported. Could you take another look and update your review when you get a chance?

rlucas7

looking good w/the skipif test on rvs, thanks, marking approved.

@WarrenWeckesser or @mdhaber is this ready to squash-and-merge then?

EDIT: It seems ready for merge, if no more comments (or someone else merges beforehand), I'll merge on Sunday (Nov 15th, in the morning EST).

scipy/stats/tests/test_multivariate.py

mdhaber · 2020-11-20T22:20:36Z

@rlucas7 I haven't reviewed recently, but what i saw before was good, so with +2 LGTM!

When you merge, please modify the line about multivariate t in the 1.6.0 release notes e.g.:

The multivariate t distribution has been added as scipy.stats.multivariate_t, and the multivariate hypergeometric distribution has been added as scipy.stats.multivariate_hypergeom.

rlucas7 · 2020-11-21T01:11:56Z

Thanks for the ping @mdhaber
Thanks @tirthasheshpatel for sticking with this one until it gets merged.
And thanks @mdhaber and @WarrenWeckesser for the reviewing

squash and merged-glad to have this in 1.6.0.

tirthasheshpatel · 2020-11-21T03:48:15Z

Glad I could help. This was a lot of fun to do.

Thanks, @mdhaber, @rlucas7, and @WarrenWeckesser for all the reviews and help!

ENH: scipy.stats: add multivariate hypergeometric distribution

89d8611

tirthasheshpatel commented Sep 10, 2020

View reviewed changes

scipy/stats/_multivariate.py Outdated Show resolved Hide resolved

tirthasheshpatel added 5 commits September 11, 2020 10:53

FIXUP: compile the documentation using docformat

061e446

FIXUP: fix the lints

3444daf

FIXUP: fix lints s in test

874fae5

FIXUP: add a test for rvs broadcating

e2b1a02

FIXUP: avoid broadcasting externally in rvs method

93aee37

tirthasheshpatel added 3 commits September 12, 2020 16:57

FIXUP: fix documentation

cac3d99

FIXUP: fix the documentation warning

3d7bc29

FIXUP: fix a doc string

864e2eb