8000 ENH: scipy.stats: add multivariate hypergeometric distribution by tirthasheshpatel · Pull Request #12839 · scipy/scipy · GitHub
[go: up one dir, main page]

Skip to content

ENH: scipy.stats: add multivariate hypergeometric distribution #12839

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 37 commits into from
Nov 21, 2020

Conversation

tirthasheshpatel
Copy link
Member

Reference issue

Closes #12585

What does this implement/fix?

This PR implements the multivariate hypergeometric distribution in scipy.stats module!

  • Documentation suite for the methods of the distribution
  • Tests for all the added methods of the distribution

@mdhaber

@tirthasheshpatel
Copy link
Member Author
tirthasheshpatel commented Sep 12, 2020

I also tested the time taken by rvs method implemented vs numpy's generator:

speed test
In [1]: import numpy as np

In [2]: from scipy.stats import multivariate_hypergeom as mvhg

In [3]: rng = np.random.Generator(np.random.PCG64())
In [4]: m = np.random.randint(0, 100, 100)

In [5]: n = np.random.randint(0, m.sum())

In [6]: n
Out[6]: 75

In [7]: m
Out[7]:
array([28, 63,  1, 78, 49, 89,  0, 37, 35, 99, 87, 38,  3, 14, 38, 61, 35,
       32, 86, 44, 91, 20, 72, 15, 19, 86, 29, 25, 80,  7, 18, 49, 95, 40,
       11, 53, 72,  8, 17,  8, 69,  2, 42, 27,  0, 58, 91,  3, 51, 23, 92,
       54, 17, 40, 44, 90, 12, 60, 41, 63, 28, 81, 55, 66, 92, 76, 68, 59,
       69, 87, 41, 22, 68, 65, 42, 68, 43, 80,  2, 63, 12, 48,  4, 25, 11,
       82, 86,  8, 51, 90, 26, 74, 49,  3, 54, 16,  9, 34, 73, 37])

In [8]: %time _ = mvhg.rvs(m=m, n=n, size=1000)
CPU times: user 62.5 ms, sys: 0 ns, total: 62.5 ms
Wall time: 50.9 ms

In [9]: %time _ = mvhg.rvs(m=m, n=n, size=1000, random_state=rng)
CPU times: user 31.2 ms, sys: 15.6 ms, total: 46.9 ms
Wall time: 21.8 ms

NumPy's generator is more than twice as fast!

Some more tests
In [10]: %time _ = mvhg.rvs(m=np.repeat(m, 2), n=n*2, size=1000)
CPU times: user 109 ms, sys: 0 ns, total: 109 ms
Wall time: 106 ms

In [11]: %time _ = mvhg.rvs(m=np.repeat(m, 2), n=n*2, size=1000, random_state=rng)
CPU times: user 46.9 ms, sys: 0 ns, total: 46.9 ms
Wall time: 40.8 ms

In [12]: %time _ = mvhg.rvs(m=np.repeat(m, 3), n=n*3, size=1000)
CPU times: user 156 ms, sys: 0 ns, total: 156 ms
Wall time: 150 ms

In [13]: %time _ = mvhg.rvs(m=np.repeat(m, 3), n=n*3, size=1000, random_state=rng)
CPU times: user 62.5 ms, sys: 0 ns, total: 62.5 ms
Wall time: 60.3 ms

In [14]: %time _ = mvhg.rvs(m=np.repeat(m, 30), n=n*30, size=1000)
CPU times: user 1.55 s, sys: 31.2 ms, total: 1.58 s
Wall time: 1.58 s

In [15]: %time _ = mvhg.rvs(m=np.repeat(m, 30), n=n*30, size=1000, random_state=rng)
CPU times: user 656 ms, sys: 31.2 ms, total: 688 ms
Wall time: 691 ms

@mdhaber
Copy link
Contributor
mdhaber commented Sep 12, 2020

I wouldn't worry about the speed too much. IMO speeding up the RVS does not need to be in scope. You're already using the faster NumPy alternatives when the provided seed is of the appropriate type, right?

@tirthasheshpatel
Copy link
Member Author
tirthasheshpatel commented Sep 12, 2020

You're already using the faster NumPy alternatives when the provided seed is of the appropriate type, right?

Yes. When the random_state (or the seed) parameter to the rvs method is a numpy generator, it will use the new numpy distribution to sample which are fast enough!

Copy link
Member
@rlucas7 rlucas7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tirthasheshpatel one (minor) comment in two places, how the version check for numpy Generator is performed should be pep-440 compliant.

assert_allclose(rvs.mean(0), rv.mean(), rtol=1e-2)

def test_rvs_numpy(self):
if np.__version__ < '1.18':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment I made before also applies here about the pep-440 comparison and there is a skipif annotation if you want to make it a little cleaner and consistent with how we've handled this in other PRs IIRC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if I'm seeing a stale change but the test here checks numpy's version for generator support?

@WarrenWeckesser
Copy link
Member

@tirthasheshpatel, there is a problem with the rvs method:

In [6]: import scipy

In [7]: scipy.__version__
Out[7]: '1.6.0.dev0+d2a1c98'

In [8]: from scipy.stats import multivariate_hypergeom

A basic example, which uses size=1 by default:

In [10]: multivariate_hypergeom.rvs([1, 3, 9], 5)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-fd2505ebd1ae> in <module>
----> 1 multivariate_hypergeom.rvs([1, 3, 9], 5)

~/mc37scipymaster/lib/python3.7/site-packages/scipy-1.6.0.dev0+d2a1c98-py3.7-macosx-10.9-x86_64.egg/scipy/stats/_multivariate.py in rvs(self, m, n, size, random_state)
   4678 
   4679         if size == 1:
-> 4680             return rvs.squeeze(-1)
   4681         return rvs
   4682 

ValueError: cannot select an axis to squeeze out which has size not equal to one

Now try with size=None, which should generate a single variate:

In [11]: multivariate_hypergeom.rvs([1, 3, 9], 5, size=None)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-ed5f23f9cd57> in <module>
----> 1 multivariate_hypergeom.rvs([1, 3, 9], 5, size=None)

~/mc37scipymaster/lib/python3.7/site-packages/scipy-1.6.0.dev0+d2a1c98-py3.7-macosx-10.9-x86_64.egg/scipy/stats/_multivariate.py in rvs(self, m, n, size, random_state)
   4663             size_ = (size, )
   4664 
-> 4665         rvs = np.empty(size_ + (m.shape[-1], ), dtype=m.dtype)
   4666         rem = M
   4667 

TypeError: unsupported operand type(s) for +: 'NoneType' and 'tuple'

I'm working on a fix for this particular problem, and for handling size as I suggested in an earlier comment. A couple changes that I'd like to see are (1) change the default size to None (as it is in multinomial and in the numpy random methods), and (2) eliminate all uses of squeeze for changing the final shape.

We currently don't have a written specification for exactly how the multivariate distribution are supposed to handle broadcasting, nor how broadcasting of the parameters interacts with the size parameter of the rvs method, so I'll work on writing up some notes on that while I work on the code.

@tirthasheshpatel
Copy link
Member Author

change the default size to None (as it is in multinomial F438 and in the numpy random methods)

Done!

eliminate all uses of squeeze for changing the final shape

I have eliminated all except one which is necessary to handle a case in the previous comment. When the size=None, the first dimension will be one (because size_=1) that needs to be squeezed for compatibility with numpy. For other conditions, I have removed squeezing as you said.

Copy link
Member
@WarrenWeckesser WarrenWeckesser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tirthasheshpatel, I made a bunch more suggestions, mostly cosmetic, in-line. Sorry I missed these on my first pass through the code.

There are few more significant issues to be discussed.

  1. This example breaks because M-1 is 0; the result of cov in this case should be an array of zeros with shape (3, 3)):
In [189]: multivariate_hypergeom.cov([1, 0, 0], 1)
/Users/warren/mc37scipymaster/lib/python3.7/site-packages/scipy-1.6.0.dev0+21d6fce-py3.7-macosx-10.9-x86_64.egg/scipy/stats/_multivariate.py:4620: RuntimeWarning: invalid value encountered in true_divide
output = (-n * (M-n)/(M-1) / (M**2) *
/Users/warren/mc37scipymaster/lib/python3.7/site-packages/scipy-1.6.0.dev0+21d6fce-py3.7-macosx-10.9-x86_64.egg/scipy/stats/_multivariate.py:4627: RuntimeWarning: invalid value encountered in long_scalars
m[..., i]*(M-m[..., i]) / (M**2))
Out[189]: 
array([[nan, nan, nan],
     [nan, nan, nan],
     [nan, nan, nan]])
  1. It would be nice if these edges cases were handled:
  • multivariate_hypergeom.mean([0, 0, 0], 0) should return np.array([0.0, 0.0, 0.0]), and multivariate_hypergeometric.cov([0, 0, 0], 0) should return an array with shape (3, 3) containing all zeros. (The PMF is 1 at [0, 0, 0] and 0 everywhere else, so the expected value of the distribution is [0, 0, 0].)
  • multivariate_hypergeom.mean([], 0) should return np.array([], dtype=np.float64), and multivariate_hypergeom.cov([], 0) should return np.array([], shape=(0, 0), dtype=np.float64)
  1. I mentioned this in an in-line comment: I don't think we should be casting the inputs from whatever is given to integers. An input such as multivariate_hypergeom.mean([1.2, 3.4, 5.6], 7.8) should be rejected with an error (perhaps something like TypeError: input must be integers). I include the quantiles in this, so multivariate_hypergeom.pmf([2.5, 3.1], [3, 5], 5) should also raise a TypeError. Currently the code checks for non-integer quantiles and returns 0 for the PMF; instead, I think we can interpret the distribution as one whose domain is the integers (it is a discrete distribution, after all), and simply not accept floating point values.

  2. Broadcasting in logpmf, pmf, mean, var and cov works nicely! But there are some issues in the rvs method.

  • The NumPy sampler for the multivariate hypergeometric distribution does not handle broadcasting. So there is an inconsistency in whether or not multivariate_hypergeom.rvs handles broadcasting, depending on the NumPy version and the parameter random_state.
  • Reviewing this issue led to looking into the state of broadcasting for the multivariate distributions in NumPy. There is currently a pull request to add it to NumPy's multinomial sampler (numpy/numpy#16740). In numpy/numpy#17669, I started a discussion on how broadcasting and the size parameter interact for the multivariate distributions. To be consistent with the rules that I outlined there, size=None is not the same as size=1. Instead, size=None means the size is determined by the broadcast of the distribution parameters.

Copy link
Member Author
@tirthasheshpatel tirthasheshpatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WarrenWeckesser A lot of hefty changes below! I have added code to handle all the edge cases and I also added checks for integer x, m , and n.

First of all, thank you so much for such a thorough look at this PR. I really missed all these edge cases and I am happy they are covered now.

I have tried my best to handle all the edge cases as concisely as possible. Below are some changes and their justifications. One big comment I had was whether or not it is OK to use masked_arrays to handle edge cases. I am not very experienced with masked_arrays and don't know the consequences of using them, but they do seem to resolve all the edge cases. Sorry, if I have introduced some breaking code below while trying to resolve the edge cases 😅

Comment on lines +4477 to +4478
x = np.asarray(x)
if not np.issubdtype(x.dtype, np.integer):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment about only allowing integers has been addressed here

Copy link
Member
@WarrenWeckesser WarrenWeckesser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's great progress!

I made several minor suggestions inline, but there are a few bigger items still be be addressed:

  1. These (very) edge cases don't work:

    multivariate_hypergeom.mean([], 0)
    multivariate_hypergeom.cov([], 0)
    

    (See my comment back in #12839 (review)) The problem is that when np.asarray is given an empty list, it use float64 as the default dtype. You could handle this in _process_parameters by doing something like this:

     m = np.asarray(m)
     if m.size == 0:
         m = m.astype(int)
    

    And do the same for n. Note, however, that I haven't checked if the rest of the code will "just work" with these emtpy arrays and return the expected results. If it turns out that it will require a lot of special code to handle these cases, we could defer handling it for now, and just raise an error stating that empty inputs are not allowed.

  2. Also noted in #12839 (review), the NumPy distribution sampler that is available in NumPy 1.18 or later does not handle broadcasting. So if the user passes parameters to the rvs() method that require broadcasting, we can't use the NumPy sampler. It might be simpler to just not attempt to use the NumPy sampler at all. Always use the version that you implemented here, and add a comment that we can start using the NumPy version when it is updated to handle broadcasting.

  3. The comment I made in item 4 of #12839 (review) still applies: if size is None, the size is not to be treated as size=1. Instead, the size is determined by the broadcast shape of the input parameters. So an input such as

    multivariate_hypergeom.rvs([2, 3, 5], [1, 2, 3, 4])
    

    should act like multivariate_hypergeom.rvs([2, 3, 5], [1, 2, 3, 4], size=(4,)) and generate a result with shape (4, 3), and

    multivariate_hypergeom.rvs([[5, 10, 20], [10, 10, 15]], [[2], [3], [4], [5]])
    

    (where the first argument has shape (2, 3) and the second has shape (4, 1)) should act like

    multivariate_hypergeom.rvs([[5, 10, 20], [10, 10, 15]], [[2], [3], [4], [5]], size=(4, 2))
    

    and generate a result with shape (4, 2, 3).

@tirthasheshpatel
Copy link
Member Author

Thanks, @WarrenWeckesser for the explanation of the behavior when size=None. It should be fixed now. I guess all your comments addressed now, right?

Copy link
Member
@WarrenWeckesser WarrenWeckesser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few more tiny changes. I'll commit them as a batch, and let the tests run once more.

Whitespace tweaks and remove imports of `NumpyVersion`
@WarrenWeckesser
Copy link
Member
WarrenWeckesser commented Nov 12, 2020

I'm not sure what happened with appveyor and travisci when I pushed a batch of small changes. Closing and reopening to try again.

@WarrenWeckesser
Copy link
Member

@rlucas7, I think the change that you requested was related to the use of NumpyVersion. That is no longer being imported. Could you take another look and update your review when you get a chance?

Copy link
Member
@rlucas7 rlucas7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good w/the skipif test on rvs, thanks, marking approved.

@WarrenWeckesser or @mdhaber is this ready to squash-and-merge then?

EDIT: It seems ready for merge, if no more comments (or someone else merges beforehand), I'll merge on Sunday (Nov 15th, in the morning EST).

@mdhaber
Copy link
Contributor
mdhaber commented Nov 20, 2020

@rlucas7 I haven't reviewed recently, but what i saw before was good, so with +2 LGTM!

When you merge, please modify the line about multivariate t in the 1.6.0 release notes e.g.:

The multivariate t distribution has been added as scipy.stats.multivariate_t, and the multivariate hypergeometric distribution has been added as scipy.stats.multivariate_hypergeom.

@rlucas7 rlucas7 merged commit 3a8a3a1 into scipy:master Nov 21, 2020
@rlucas7
Copy link
Member
rlucas7 commented Nov 21, 2020

Thanks for the ping @mdhaber
Thanks @tirthasheshpatel for sticking with this one until it gets merged.
And thanks @mdhaber and @WarrenWeckesser for the reviewing

squash and merged-glad to have this in 1.6.0.

@tirthasheshpatel
Copy link
Member Author

Glad I could help. This was a lot of fun to do.

Thanks, @mdhaber, @rlucas7, and @WarrenWeckesser for all the reviews and help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add Multivariate Hypergeometric Distribution
4 participants
0