BUG: Add np.random.dirichlet2 #5872

trendelkampschroer · 2015-05-13T16:05:54Z

np.random.dirichlet fails for small alpha parameters (see #5851).
The new implementation np.random.dirichlet2 switches to generation
via beta RVs (stick breaking approach) whenever all alpha parameters
are smaller than one.

jaimefrio · 2015-05-13T16:35:06Z

numpy/random/mtrand/mtrand.pyx

+
+        diric   = np.zeros(shape, np.float64)
+        val_arr = <ndarray>diric
+        val_data= <double*>PyArray_DATA(val_arr)        


I hate the general look of this alignment scheme, but I understand you are just replicating what is elsewhere in this module. In any case, there should be a space before =, even if there isn't in the equivalent line of dirichlet.

argriffing · 2015-05-13T16:49:10Z

If you want to add unit tests, it would make sense to at least add tests analogous to np.random.dirichlet in https://github.com/numpy/numpy/blob/master/numpy/random/tests/test_random.py.

The first test test_dirichlet is just a backwards compatibility test. I guess the dirichlet2 analogue would have different numbers. Maybe it would also make sense to check both branches of the sampler (max alpha < 1 vs. >= 1) and to not use just 2 alpha values (dirichlet with 2 alpha values is a beta distribution).

The second test test_dirichlet_size is for plumbing and broadcasting. I think it could be used for dirichlet2 without any modifications. Well, maybe it would be good to also check a vector of alpha values with length other than 2. Neither of these tests checks statistical properties of the distribution, although those are checked in scipy.
https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_multivariate.py
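
A minimal sketch of the plumbing checks suggested above. It is written against the existing np.random.RandomState.dirichlet, since dirichlet2 is not in any release, and check_dirichlet_samples is a hypothetical helper name. It covers both branches (max alpha < 1 and max alpha >= 1) with three alpha values, and it checks only shape and normalization, not statistical properties.

```python
import numpy as np


def check_dirichlet_samples(sample, alpha, size):
    # `sample` is any callable with the signature sample(alpha, size),
    # e.g. RandomState.dirichlet or a future dirichlet2.
    alpha = np.asarray(alpha, dtype=float)
    x = sample(alpha, size)
    # One probability vector per requested sample.
    assert x.shape == tuple(size) + alpha.shape
    # No nans, no negative entries, and every vector sums to one.
    assert not np.isnan(x).any()
    assert np.all(x >= 0)
    np.testing.assert_allclose(x.sum(axis=-1), 1.0)
    return x


rng = np.random.RandomState(12345)
for alpha in ([0.5, 0.3, 0.2],    # max alpha < 1
              [2.0, 0.5, 3.0]):   # max alpha >= 1
    check_dirichlet_samples(rng.dirichlet, alpha, (4, 5))
```

The same helper could be pointed at a new sampler by passing it in place of rng.dirichlet.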

argriffing · 2015-05-13T22:13:41Z

When I mentioned an analogy to http://en.wikipedia.org/wiki/Pairwise_summation in response to @jaimefrio's concerns about stability, I had in mind something like the following:

from __future__ import print_function, division

import numpy as np

def mybeta(a, b):
    # allow one or both of (a, b) to be zero
    if a < 0 or b < 0:
        raise ValueError
    if a == 0 and b == 0:
        return 0, 0
    if a == 0:
        return 0, 1
    if b == 0:
        return 1, 0
    c = np.random.beta(a, b)
    return c, 1-c

def _pairwise(a, start, stop, out):
    # update `out` in-place and return the sub-sequence sum
    n = stop - start
    if n == 1:
        return a[start]
    else:
        m = start + n // 2
        if n == 2:
            s0 = a[start]
            s1 = a[m]
        else:
            s0 = _pairwise(a, start, m, out)
            s1 = _pairwise(a, m, stop, out)
        c0, c1 = mybeta(s0, s1)
        out[start:m] *= c0
        out[m:stop] *= c1
        return s0 + s1

def _dirichlet2(a):
    a = np.asarray(a, dtype=float)
    out = np.ones_like(a)
    _pairwise(a, 0, len(a), out)
    return out

def dirichlet2(a, size=None):
    if size is None:
        return _dirichlet2(a)
    else:
        return np.array([_dirichlet2(a) for _ in range(size)])


def main():
    for a in (
            [1, 2, 3, 4],
            [1e-20, 1e-18, 0.1],
            [0, 0, 0, 42, 0],
            [0, 0.1, 0, 0, 0, 1e-20, 0],
            ):
        x = dirichlet2(a, size=10000)
        print(x[0])
        print(x.mean(axis=0))
        print()

main()

[ 0.0178193   0.11677658  0.34310545  0.52229867]
[ 0.09893858  0.19934303  0.3018269   0.3998915 ]

[ 0.  0.  1.]
[ 0.  0.  1.]

[ 0.  0.  0.  1.  0.]
[ 0.  0.  0.  1.  0.]

[ 0.  1.  0.  0.  0.  0.  0.]
[ 0.  1.  0.  0.  0.  0.  0.]

This is n log n slowness instead of linear in the number of categories.

Edit: updated to allow inputs that are exactly zero
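
The recursive splitting relies on the aggregation property of the Dirichlet distribution: the total mass of a block of categories is beta distributed with parameters equal to the alpha sums inside and outside the block. Here is a rough empirical check of the top-level split for the first test case; the seed and sample count are arbitrary choices, and it compares means only.

```python
import numpy as np

rng = np.random.RandomState(2015)
alpha = np.array([1.0, 2.0, 3.0, 4.0])
n = 20000

# Mass assigned to the left half {0, 1} by direct Dirichlet draws.
left_mass = rng.dirichlet(alpha, size=n)[:, :2].sum(axis=1)

# The same quantity drawn the way the top-level split of the
# recursion draws it: beta(sum of left alphas, sum of right alphas).
split = rng.beta(alpha[:2].sum(), alpha[2:].sum(), size=n)

# Both sample means should be close to sum(alpha[:2]) / sum(alpha).
expected = alpha[:2].sum() / alpha.sum()
print(left_mass.mean(), split.mean(), expected)
```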

charris · 2017-08-18T19:27:48Z

Anyone still interested in this?

trendelkampschroer · 2017-08-18T21:35:05Z

Yes, I think it is still an important issue. But there seems to be little consensus on how to proceed.

bashtage · 2019-04-10T17:54:45Z

Could be incorporated in NEP 19, xref #13164 #13163

trendelkampschroer · 2019-04-12T07:02:45Z

How would that be done in practice? The new implementation would contain a code path that would break numpy's strict backward compatibility requirement for random numbers.

Would we have two different dirichlet implementations, a legacy one attached to the legacy random state instance, and a new one attached to the new random state instances?

bashtage · 2019-04-12T10:03:07Z

Yes, that's right. Two different versions in different classes. This is already the case for normals, everything that depends on normals, gammas, and exponentials.
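
To illustrate the split for normals, here is a small sketch that assumes NumPy 1.17 or later, where Generator and the bit generator classes exist. Both objects are fed identically seeded MT19937 bit generators, so any difference in the output comes from the sampling algorithm in each class rather than from the underlying bit stream. The seed is arbitrary.

```python
import numpy as np
from numpy.random import Generator, MT19937, RandomState

# Same bit generator type and same seed for both classes.
legacy = RandomState(MT19937(1234))
new = Generator(MT19937(1234))

legacy_draws = legacy.standard_normal(5)
new_draws = new.standard_normal(5)

# The two classes use different algorithms for normals, so the
# streams are not expected to agree even with identical bits.
print(legacy_draws)
print(new_draws)
print(np.allclose(legacy_draws, new_draws))
```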

trendelkampschroer · 2019-04-12T21:22:56Z

Ok, which branch would I need to branch off (#13164 or #13163) and where is the new RandomState class that needs attaching of a new Dirichlet sampler located?

mattip · 2019-05-31T13:10:38Z

#13163 was merged, you can rebase off HEAD.

bashtage · 2019-06-02T01:40:49Z

@trendelkampschroer If you have time to work on this, you can replace dirichlet in Generator. You don't need to create a new function or allow for an alternative algorithm: Generator is a new API, so there is no stream-compatibility rule to preserve.

mattip · 2019-07-05T23:42:02Z

Now would be a good time to change the dirichlet function in Generator.pyx since we are breaking stream compatibility with mtrand.

WarrenWeckesser · 2019-11-15T00:25:09Z

@trendelkampschroer, thanks for getting this started back in 2015 (!). @bashtage pinged you in June about resuming this work, but we haven't had a response. Here's another ping. I'd like to get this PR, or something similar, into NumPy to fix the problem with small alpha values in dirichlet. If we don't hear from you within a few days, we'll take up where you left off.

As a reminder for anyone else, here's the problem that can occur with dirichlet (using NumPy version '1.18.0.dev0+1ebf711'):

In [21]: alpha = np.array([1e-4, 1e-5, 1e-4])

In [22]: rng = np.random.default_rng(11335577)

In [23]: rng.dirichlet(alpha, size=12)
Out[23]: 
array([[nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [ 0.,  0.,  1.],
       [nan, nan, nan],
       [nan, nan, nan]])

Now that NEP 19 is implemented, we can change the stream of variates produced by a distribution. The legacy code in numpy.random must remain untouched, but the code in the new Generator class (in https://github.com/numpy/numpy/blob/master/numpy/random/_generator.pyx) can be changed.

For what it's worth, here's a Python+NumPy implementation of the "stick-breaking" algorithm:

def dirichlet(alpha, size=1, rng=None):
    if not hasattr(rng, 'beta'):
        rng = np.random.default_rng(rng)

    # XXX alpha is expected to be a sequence of positive floats,
    # with length K at least 2.
    # XXX For now, assume size is an integer (i.e. not None, and not a tuple).
    p = np.empty((size, len(alpha)))

    a = alpha[:-1]
    # b is [sum(alpha[1:]), sum(alpha[2:]), ..., sum(alpha[K-1:])]
    b = np.cumsum(alpha[:0:-1])[::-1]  
    v = rng.beta(a, b, size=(size, len(a)))

    p1mv = np.ones((size, len(a)))
    p1mv[:, 1:] = np.cumprod(1 - v[:, :-1], axis=1)
    p[:, :-1] = v*p1mv
    # The last column is one minus the sum of all the previous columns.
    p[:, -1] = 1 - p[:, :-1].sum(axis=1)

    return p

It handles the small values where the old code fails:

In [24]: rng = np.random.default_rng(11335577)

In [25]: dirichlet(alpha, size=12, rng=rng)
Out[25]: 
array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.]])

Here's a quick check that it is not generating any nans, and it is not doing something egregiously wrong. The mean of a large sample agrees with the theoretical mean:

In [32]: x = dirichlet(alpha, size=1000000)

In [33]: x.mean(axis=0)
Out[33]: array([0.47584289, 0.04751686, 0.47664025])

In [34]: alpha / alpha.sum()   # expected values of the distribution
Out[34]: array([0.47619048, 0.04761905, 0.47619048])
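
The function above assumes an integer size. Here is a sketch of one way to lift that restriction so that size can be None, an integer, or a tuple, as in the other Generator methods. The name dirichlet_stick is hypothetical, and this is only an illustration of the shape handling, not the code proposed for NumPy.

```python
import numpy as np


def dirichlet_stick(alpha, size=None, rng=None):
    # default_rng passes an existing Generator through unchanged.
    rng = np.random.default_rng(rng)
    alpha = np.asarray(alpha, dtype=float)

    # Normalize `size` to a tuple of leading dimensions.
    if size is None:
        shape = ()
    elif isinstance(size, (int, np.integer)):
        shape = (int(size),)
    else:
        shape = tuple(size)

    a = alpha[:-1]
    # b is [sum(alpha[1:]), sum(alpha[2:]), ..., sum(alpha[K-1:])]
    b = np.cumsum(alpha[:0:-1])[::-1]
    v = rng.beta(a, b, size=shape + a.shape)

    # Fraction of the stick still unbroken before each break.
    remaining = np.ones_like(v)
    remaining[..., 1:] = np.cumprod(1 - v[..., :-1], axis=-1)

    p = np.empty(shape + alpha.shape)
    p[..., :-1] = v * remaining
    # The last entry is one minus the sum of all the previous entries.
    p[..., -1] = 1 - p[..., :-1].sum(axis=-1)
    return p


print(dirichlet_stick([1e-4, 1e-5, 1e-4], size=(2, 3), rng=11335577).shape)
```

Indexing with the ellipsis keeps the arithmetic identical to the integer-size version while letting any number of leading dimensions through.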

trendelkampschroer · 2019-11-15T10:52:11Z

Thanks a ton for looking into this and sorry for not getting back to you. I have started an implementation of the stick breaking approach as cython code, but was not able to compile it. I'll dig it up and paste the relevant code snippet.

I am sorry, but I can't seem to find the necessary time to see that through, although I absolutely would love to contribute. Please feel free to take this up from here.

Thanks so much for putting so much effort into a redesign of numpy random and into the numpy package in general!

trendelkampschroer · 2019-11-17T22:10:52Z

@WarrenWeckesser @mattip
I have been able to put some work into this and I have opened a PR containing the stick breaking implementation for dirichlet distributed random vectors. The PR is here #14924

While developing this I have found that the beta distribution is suffering from slowness when parameters a, b have small values. I have added tests that demonstrate this. This affects the stick-breaking implementation as the small a, b case is the one that we wanted to fix in the first place.
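
One way to measure this is a small timing loop such as the sketch below. The helper name time_beta, the parameter values, and the sample count are assumptions chosen for illustration; the timings depend on the NumPy version and the machine, so no particular result is implied.

```python
import time

import numpy as np


def time_beta(a, b, n=10000, seed=0):
    # Time a single vectorized call to Generator.beta.
    rng = np.random.default_rng(seed)
    start = time.perf_counter()
    draws = rng.beta(a, b, size=n)
    elapsed = time.perf_counter() - start
    return elapsed, draws


for a in (1.0, 1e-1, 1e-2, 1e-3):
    elapsed, draws = time_beta(a, a)
    print("a = b = %g: %.4f s, nan count = %d"
          % (a, elapsed, np.isnan(draws).sum()))
```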

In order to make the stick breaking approach robust for cases with small alpha parameters we first need to address the slowness in the beta generator.

Please let me know if I need to use a different base branch than master for my PR. I just wanted to finally bring this into the world to give this some momentum.

charris · 2019-11-18T22:40:19Z

Closing. There is now #14924.

BUG: Add np.random.dirichlet2

9329f20

np.random.dirichlet fails for small alpha parameters (see numpy#5851). The new implementation np.random.dirichlet2 switches to generation via beta RVs (stick breaking approach) whenever all alpha parameters are smaller than one.

argriffing added the component: numpy.random label May 13, 2015

jaimefrio reviewed May 13, 2015

charris added 00 - Bug 01 - Enhancement and removed 00 - Bug labels May 13, 2015

argriffing mentioned this pull request Jun 10, 2015

scipy.stats.dirichlet.pdf() bug scipy/scipy#4949

Closed

argriffing mentioned this pull request Jun 22, 2015

clarify dirichlet distribution error handling scipy/scipy#4984

Merged

ev-br mentioned this pull request Jun 23, 2015

Alternative RandomState-type objects scipy/scipy#4989

Closed

charris mentioned this pull request Aug 18, 2017

BUG: Missing dirichlet input validation #9577

Merged

WarrenWeckesser mentioned this pull request Nov 14, 2019

Bug in np.random.dirichlet for small alpha parameters #5851

Closed

charris closed this Nov 18, 2019
