BUG: Add np.random.dirichlet2 by trendelkampschroer · Pull Request #5872 · numpy/numpy · GitHub

Closed
136 changes: 136 additions & 0 deletions numpy/random/mtrand/mtrand.pyx
@@ -4602,6 +4602,141 @@ cdef class RandomState:

        return diric

    def dirichlet2(self, object alpha, size=None):
        """
        dirichlet2(alpha, size=None)

        Draw samples from the Dirichlet distribution.

        Draw `size` samples of dimension k from a Dirichlet distribution. A
        Dirichlet-distributed random variable can be seen as a multivariate
        generalization of a Beta distribution. The Dirichlet pdf is the
        conjugate prior of a multinomial in Bayesian inference.

        Parameters
        ----------
        alpha : array
            Parameter of the distribution (k dimensions for a sample of
            dimension k).
        size : int or tuple of ints, optional
            Output shape. If the given shape is, e.g., ``(m, n, k)``, then
            ``m * n * k`` samples are drawn. Default is None, in which case a
            single value is returned.

        Returns
        -------
        samples : ndarray
            The drawn samples, of shape (size, k).

        Notes
        -----
        .. math:: p(x) \\propto \\prod_{i=1}^{k}{x^{\\alpha_i-1}_i}

        Uses the following property for computation: for each dimension,
        draw a random sample y_i from a standard gamma generator of shape
        `alpha_i`, then
        :math:`X = \\frac{1}{\\sum_{i=1}^k{y_i}} (y_1, \\ldots, y_k)` is
        Dirichlet distributed.

        References
        ----------
        .. [1] David MacKay, "Information Theory, Inference and Learning
               Algorithms," chapter 23,
               http://www.inference.phy.cam.ac.uk/mackay/
        .. [2] Wikipedia, "Dirichlet distribution",
               http://en.wikipedia.org/wiki/Dirichlet_distribution

        Examples
        --------
        Taking an example cited in Wikipedia, this distribution can be used if
        one wanted to cut strings (each of initial length 1.0) into K pieces
        with different lengths, where each piece had, on average, a designated
        average length, but allowing some variation in the relative sizes of
        the pieces.

        >>> s = np.random.dirichlet2((10, 5, 3), 20).transpose()

        >>> plt.barh(range(20), s[0])
        >>> plt.barh(range(20), s[1], left=s[0], color='g')
        >>> plt.barh(range(20), s[2], left=s[0]+s[1], color='r')
        >>> plt.title("Lengths of Strings")

        """

        #=================
        # Pure python algo
        #=================
        #alpha = N.atleast_1d(alpha)
        #k = alpha.size

        #if n == 1:
        #    val = N.zeros(k)
        #    for i in range(k):
        #        val[i] = sgamma(alpha[i], n)
        #    val /= N.sum(val)
        #else:
        #    val = N.zeros((k, n))
        #    for i in range(k):
        #        val[i] = sgamma(alpha[i], n)
        #    val /= N.sum(val, axis = 0)
        #    val = val.T

        #return val

        cdef npy_intp k
        cdef npy_intp totsize
        cdef ndarray alpha_arr, val_arr
        cdef double *alpha_data
        cdef double *val_data
        cdef npy_intp i, j
        cdef double acc, invacc

        cdef double alpha_max
        cdef double alpha_sum

        k           = len(alpha)
        alpha_arr   = <ndarray>PyArray_ContiguousFromObject(alpha, NPY_DOUBLE, 1, 1)
        alpha_data  = <double*>PyArray_DATA(alpha_arr)

        shape = _shape_from_size(size, k)

        diric   = np.zeros(shape, np.float64)
        val_arr = <ndarray>diric
        val_data= <double*>PyArray_DATA(val_arr)
Member:
I hate the general look of this alignment scheme, but I understand you are just replicating what is elsewhere in this module. In any case, there should be a space before =, even if there isn't in the equivalent line of dirichlet.


        i = 0
        totsize = PyArray_SIZE(val_arr)
        alpha_max = alpha_arr.max()

        if alpha_max < 1.0:
            alpha_sum = alpha_arr.sum()
            """Generation via beta-RVs"""
            with self.lock, nogil:
                while i < totsize:
                    acc = 1.0
                    invacc = alpha_sum
                    for j from 0<= j < k-1:
                        invacc -= alpha_data[j]
Member:
Is this numerically stable? Or is there the risk that floating point rounding will have us end with a negative invacc? If the risk doesn't exist, I think it is worth a comment explaining why not, and if it does exist perhaps we should, instead of a single alpha_sum value, store an array, something like:

alpha_cumsums = np.cumsum(alpha_arr[::-1])

and index into it to retrieve the partial sums.
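
A minimal NumPy-level sketch of that idea (names such as tail_sums are illustrative, not from the PR): precompute the tail sums once, so each partial sum is looked up instead of being produced by repeated subtraction and therefore can never go negative through cancellation.

import numpy as np

alpha = np.array([1e-20, 1e-18, .1])

# tail_sums[j] holds alpha[j] + alpha[j+1] + ... + alpha[k-1]
tail_sums = np.cumsum(alpha[::-1])[::-1]

# The quantity the stick-breaking loop needs after peeling off alpha[j]
# is tail_sums[j + 1]: a sum of non-negative terms, so no cancellation.
for j in range(len(alpha) - 1):
    print(tail_sums[j + 1])
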

Member:
To answer my own question:

>>> a = np.array([1e-20, 1e-18, .1])
>>> a_s = a.sum()
>>> for j in a[::-1]:
...     a_s -= j
...     print(a_s)
...
0.0
-1e-18
-1.01e-18

So we need to do something about this.

Contributor Author:
Yes, that's true of course. I had also thought about storing the cumsum of the alphas, but was hesitant to introduce another array. I think we also need to guard against underflow of acc.

Let me think about this.

Contributor:
Maybe use something recursive like http://en.wikipedia.org/wiki/Pairwise_summation?
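
For reference, a toy sketch of pairwise summation (illustrative only, not code from this PR); it shrinks the accumulated rounding error from O(n) to O(log n), though as the next comment points out that is a mitigation rather than a fix.

def pairwise_sum(x, lo, hi):
    # Recursively sum x[lo:hi]; splitting the range in half keeps the
    # rounding error growth logarithmic instead of linear.
    if hi - lo <= 8:
        s = 0.0
        for i in range(lo, hi):
            s += x[i]
        return s
    mid = (lo + hi) // 2
    return pairwise_sum(x, lo, mid) + pairwise_sum(x, mid, hi)
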

Contributor:
That's only a mitigation, not a fix. Consider the case of a failure with only two elements in the array.

Contributor:
If there is a failure with only two elements in the array then I'm not understanding something because the 2-element case is the same as the beta distribution. I'm assuming we can sample from a beta distribution.
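
A quick sanity check of that equivalence with the existing public API (illustrative only; the exact numbers depend on the random draws):

import numpy as np

a, b = 0.3, 0.7
n = 100000

# The first component of a k=2 Dirichlet(a, b) sample is Beta(a, b)
# distributed, so both sample means should be close to a / (a + b) = 0.3.
d = np.random.dirichlet((a, b), size=n)[:, 0]
e = np.random.beta(a, b, size=n)

print(d.mean(), e.mean())
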

Member:
This shouldn't stop the technical discussion, but FTR I'm -1 on merging this until we have some solution that doesn't expose a public dirichlet2 method, for reasons discussed in #5851. In case anyone starts getting any itchy merge fingers *cough* @charris *cough* ;-).

                        val_data[i+j] = acc * rk_beta(self.internal_state, alpha_data[j], invacc)
                        acc = acc - val_data[i+j]
                    val_data[i+(k-1)] = acc
                    i = i+k

        else:
            with self.lock, nogil:
                while i < totsize:
                    acc = 0.0
                    for j from 0 <= j < k:
                        val_data[i+j] = rk_standard_gamma(self.internal_state,
                                                          alpha_data[j])
                        acc = acc + val_data[i+j]
                    invacc = 1/acc
                    for j from 0 <= j < k:
                        val_data[i+j] = val_data[i+j] * invacc
                    i = i + k

        return diric

    # Shuffling and permutations:
    def shuffle(self, object x):
        """
@@ -4755,6 +4890,7 @@ logseries = _rand.logseries
multivariate_normal = _rand.multivariate_normal
multinomial = _rand.multinomial
dirichlet = _rand.dirichlet
dirichlet2 = _rand.dirichlet2

shuffle = _rand.shuffle
permutation = _rand.permutation
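
As background for the diff above: the alpha_max < 1.0 (beta-RV, stick-breaking) branch exists because, with very small concentration parameters, the gamma-based algorithm can have every gamma draw underflow to zero, so the normalization step divides zero by zero and yields NaN. A rough illustration against the existing np.random.dirichlet (behavior may vary with platform and NumPy version):

import numpy as np

alpha = np.full(5, 1e-300)           # extremely small concentration parameters
s = np.random.dirichlet(alpha, 10)   # gamma draws underflow to 0.0, so the
                                     # row sums can be 0 and 0/0 gives NaN

print(np.isnan(s).any())             # typically True for the gamma-based sampler
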