8000 BUG: Add np.random.dirichlet2 by trendelkampschroer · Pull Request #5872 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

BUG: Add np.random.dirichlet2 #5872

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

trendelkampschroer
Copy link
Contributor

np.random.dirichlet fails for small alpha parameters (see #5851).
The new implmentation np.random.dirichlet2 switches to generation
via beta RVs (stick breaking approach) whenever all alpha parameters
are smaller than one.

np.random.dirichlet fails for small alpha parameters (see numpy#5851).
The new implmentation np.random.dirichlet2 switches to generation
via beta RVs (stick breaking approach) whenever all alpha parameters
are smaller than one.

diric = np.zeros(shape, np.float64)
val_arr = <ndarray>diric
val_data= <double*>PyArray_DATA(val_arr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate the general look of this alignment scheme, but I understand you are just replicating what is elsewhere in this module. In any case, there should be a space before =, even if there isn't in the equivalent line of dirichlet.

@argriffing
Copy link
Contributor

If you want to add unit tests, it would make sense to at least add tests analogous to np.random.dirichlet in https://github.com/numpy/numpy/blob/master/numpy/random/tests/test_random.py.

The first test test_dirichlet is just a backwards compatibility test. I guess the dirichlet2 analogue would have different numbers. Maybe it would also make sense to check both branches of the sampler (max alpha < 1 vs. >= 1) and to not use just 2 alpha values (dirichlet with 2 alpha values is a beta distribution).

The second test test_dirichlet_size is for plumbing and broadcasting. I think it could be used for dirichlet2 without any modifications. Well, maybe it would be good to also check a vector of alpha values with length other than 2. Neither of these tests check statistical properties of the distribution, although this is checked in scipy.
https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_multivariate.py

@argriffing
Copy link
Contributor

When I mentioned an analogy to http://en.wikipedia.org/wiki/Pairwise_summation in response to @jaimefrio's concerns about stability, I had in mind something like the following:

from __futur
8000
e__ import print_function, division

import numpy as np

def mybeta(a, b):
    # allow one or both of (a, b) to be zero
    if a < 0 or b < 0:
        raise ValueError
    if a == 0 and b == 0:
        return 0, 0
    if a == 0:
        return 0, 1
    if b == 0:
        return 1, 0
    c = np.random.beta(a, b)
    return c, 1-c

def _pairwise(a, start, stop, out):
    # update `out` in-place and return the sub-sequence sum
    n = stop - start
    if n == 1:
        return a[start]
    else:
        m = start + n // 2
        if n == 2:
            s0 = a[start]
            s1 = a[m]
        else:
            s0 = _pairwise(a, start, m, out)
            s1 = _pairwise(a, m, stop, out)
        c0, c1 = mybeta(s0, s1)
        out[start:m] *= c0
        out[m:stop] *= c1
        return s0 + s1

def _dirichlet2(a):
    a = np.asarray(a, dtype=float)
    out = np.ones_like(a)
    _pairwise(a, 0, len(a), out)
    return out

def dirichlet2(a, size=None):
    if size is None:
        return _dirichlet2(a)
    else:
        return np.array([_dirichlet2(a) for _ in range(size)])


def main():
    for a in (
            [1, 2, 3, 4],
            [1e-20, 1e-18, 0.1],
            [0, 0, 0, 42, 0],
            [0, 0.1, 0, 0, 0, 1e-20, 0],
            ):
        x = dirichlet2(a, size=10000)
        print(x[0])
        print(x.mean(axis=0))
        print()

main()
[ 0.0178193   0.11677658  0.34310545  0.52229867]
[ 0.09893858  0.19934303  0.3018269   0.3998915 ]

[ 0.  0.  1.]
[ 0.  0.  1.]

[ 0.  0.  0.  1.  0.]
[ 0.  0.  0.  1.  0.]

[ 0.  1.  0.  0.  0.  0.  0.]
[ 0.  1.  0.  0.  0.  0.  0.]

This is n log n slowness instead of linear in the number of categories.

Edit: updated to allow inputs that are exactly zero

@charris
Copy link
Member
charris commented Aug 18, 2017

Anyone still interested in this?

@trendelkampschroer
Copy link
8000 Contributor Author

Yes, I think it is still an important issue. But there seems to be little concensus on how to proceed.

@bashtage
Copy link
Contributor

Could be incorporated in NEP 19, xref #13164 #13163

@trendelkampschroer
Copy link
Contributor Author

How would that be done in practice. The new implementation would contain a code path, which would break numpy`strict backward compatibility requirement for random numbers.

Would we have two different dirichlet implementation, a legacy one attached to the legacy random state instance, and a new attached to the new random state instances?

@bashtage
Copy link
Contributor
bashtage commented Apr 12, 2019 via email

@trendelkampschroer
Copy link
Contributor Author
trendelkampschroer commented Apr 12, 2019

Ok, which branch would I need to branch off (#13164 or #13163) and where is the new RandomState class that needs attaching of a new Dirichlet sampler located?

@mattip
Copy link
Member
mattip commented May 31, 2019

#13163 was merged, you can rebase off HEAD.

@bashtage
Copy link
Contributor
bashtage commented Jun 2, 2019

@trendelkampschroer If you have time to work on this, you can replace dirichelet in Generator -- you don't need to create a new function or allow for an alternative algorithm since there is no rule about stream compat since this is a new API.

@mattip
Copy link
Member
mattip commented Jul 5, 2019

Now would be a good time to change the dirichlet function in Generator.pyx since we are breaking stream compatibility with mtrand.

@WarrenWeckesser
Copy link
Member
WarrenWeckesser commented Nov 15, 2019

@trendelkampschroer, thanks for getting this started back in 2015 (!). @bashtage pinged you in June about resuming this work, but we haven't had a response. Here's another ping. I'd like to get this PR, or something similar, into NumPy to fix the problem with small alpha values in dirichlet. If we don't hear from you within a few days, we'll take up where you left off.

As a reminder for anyone else, here's the problem that can occur with dirichlet (using NumPy version '1.18.0.dev0+1ebf711'):

In [21]: alpha = np.array([1e-4, 1e-5, 1e-4])

In [22]: rng = np.random.default_rng(11335577)

In [23]: rng.dirichlet(alpha, size=12)
Out[23]: 
array([[nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [ 0.,  0.,  1.],
       [nan, nan, nan],
       [nan, nan, nan]])

Now that NEP 19 is implemented, we can change the stream of variates produced by a distribution. The legacy code in numpy.random must remain untouched, but the code in the new Generator class (in https://github.com/numpy/numpy/blob/master/numpy/random/_generator.pyx) can be changed.

For what it's worth, here's a Python+NumPy implementation of the "stick-breaking" algorithm:

def dirichlet(alpha, size=1, rng=None):
    if not hasattr(rng, 'beta'):
        rng = np.random.default_rng(rng)

    # XXX alpha is expected to be a sequence of positive floats,
    # with length K at least 2.
    # XXX For now, assume size is an integer (i.e. not None, and not a tuple).
    p = np.empty((size, len(alpha)))

    a = alpha[:-1]
    # b is [sum(alpha[1:]), sum(alpha[2:]), ..., sum(alpha[K-1:])]
    b = np.cumsum(alpha[:0:-1])[::-1]  
    v = rng.beta(a, b, size=(size, len(a)))

    p1mv = np.ones((size, len(a)))
    p1mv[:, 1:] = np.cumprod(1 - v[:, :-1], axis=1)
    p[:, :-1] = v*p1mv
    # The last column is one minus the sum of all the previous columns.
    p[:, -1] = 1 - p[:, :-1].sum(axis=1)

    return p

It handles the small values where the old code fails:

In [24]: rng = np.random.default_rng(11335577)

In [25]: dirichlet(alpha, size=12, rng=rng)
Out[25]: 
array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.]])

Here's a quick check that it is not generating any nans, and it is not doing something egregiously wrong. The mean of a large sample agrees with the theoretical mean:

In [32]: x = dirichlet(alpha, size=1000000)                                               

In [33]: x.mean(axis=0)                                                                   
Out[33]: array([0.47584289, 0.04751686, 0.47664025])

In [34]: alpha / alpha.sum()   # expected values of the distribution                                                              
Out[34]: array([0.47619048, 0.04761905, 0.47619048])

@trendelkampschroer
Copy link
Contributor Author
trendelkampschroer commented Nov 15, 2019

Thanks a ton for looking into this and sorry for not getting back to you. I have started an implementation of the stick breaking approach as cython code, but was not able to compile it. I'll dig it up and paste the relevant code snippet.

I am sorry, but I can't seem to find the necessary time to see that through, although I absolutely would love to contribute. Please feel free to take this up from here.

Thanks so much for putting so much effort into a redesign of numpy random and into the numpy package in general!

@trendelkampschroer
Copy link
Contributor Author
trendelkampschroer commented Nov 17, 2019 R AD86 26;

@WarrenWeckesser @mattip
I have been able to put some work into this and I have opened a PR containing the stick breaking implementation for dirichlet distributed random vectors. The PR is here #14924

While developing this I have found that the beta distribution is suffering from slowness when parameters a, b have small values. I have added tests that demonstrate this. This affects the stick-breaking implementation as the small a, b case is the one that we wanted to fix in the first place.

In order to make the stick breaking approach robust for cases with small alpha parameters we first need to adress the slowness in the beta generator.

Please let me know If I need to use a different base branch than master for my PR. I just wanted to finally bring this into the world to give this some momentum.

@charris
Copy link
Member
charris commented Nov 18, 2019

Closing. There is now #14924.

@charris charris closed this Nov 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants
0