ENH: Allow size=0 in numpy.random.choice, regardless of array #8717

ghost · 2017-02-28T21:00:10Z

Fixes #8311. Includes tests.

eric-wieser · 2017-03-02T17:31:14Z

numpy/random/mtrand/mtrand.pyx

@@ -1094,6 +1094,8 @@ cdef class RandomState:

        # Format and Verify input
        a = np.array(a, copy=False)
+        if np.prod(size) == 0:


This isn't testing the right thing at all. The code already worked when size had zero-entries, and this breaks np.random.choice(['a', 'b'], size=(3, 0, 4))

Test:

assert_equal(np.random.choice(['a', 'b'], size=(3, 0, 4)).shape, (3, 0, 4))

eric-wieser · 2017-03-02T17:31:55Z

numpy/random/tests/test_random.py

+        s = (0,)
+        a = np.array([1.5, 2.5, 3.5])
+        p = [0.1, 0.4, 0.5]
+        assert_equal(np.random.choice([], s).shape, s)


This is the only test that fails on master, all the others already pass

eric-wieser · 2017-03-02T17:38:15Z

Also while we're here, it would be nice if np.random.randint(x, x, size=(1,0,2)) was allowed as well. I think if you fixed that, the only change you'd need to make to choice is removing the error.

ghost · 2017-03-02T17:44:33Z

Hi eric, thanks for your comments. I didn't quite think it through. I wanted np.sum instead of prod and I'll remove the unnecessary tests. Should the randint idea be included here or in a different PR? Seems not quite related to allowing size=0?

eric-wieser · 2017-03-02T17:45:49Z

@MareinK: prod is correct, but the size of the resulting array is wrong.

I don't think you should be special-casing zero size here. Remove the bit that throws the error, and then fix the bit that fails next. Fixing the randint thing will make random.choice(0, (0, 2, 3)) work, so it is part of the same fix IMO.
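On the prod-versus-sum point, a minimal standard-library sketch may help. Here math.prod stands in for np.prod, and the helper name is_empty_shape is made up for illustration:

```python
import math


def is_empty_shape(shape):
    """Return True when an array of this shape would hold no elements."""
    return math.prod(shape) == 0


# The element count is the product of the dimensions, so a single zero
# anywhere in the shape makes the whole array empty.
shape = (3, 0, 4)
print(math.prod(shape))         # element count
print(sum(shape))               # nonzero, so the sum cannot detect emptiness
print(is_empty_shape(shape))
print(is_empty_shape((3, 4)))
```

The empty array still has to keep the requested shape (3, 0, 4), which is why a shortcut that returns a differently shaped empty result is not enough.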

ghost · 2017-03-02T17:47:52Z

Aaah I see what you mean, I will be working on it.

ghost · 2017-03-05T16:22:23Z

I made a new attempt. I'm not sure about the changes to randint_helpers.pxi.in. I had to create a new case for size==0 and move the definition of rng inside the relevant cases because it throws an error under conditions that are valid when size==0. I was thinking about some other alternatives:

Current solution:

    if size is None:
        rng = <npy_{{npy_udt}}>(high - low)
        rk_random_{{npy_udt}}(off, rng, 1, &buf, state)
        return np.{{np_dt}}(<npy_{{npy_dt}}>buf)
    elif np.prod(size) == 0:
        return <ndarray>np.empty(size, np.{{np_dt}})
    else:
        array = <ndarray>np.empty(size, np.{{np_dt}})
        cnt = PyArray_SIZE(array)
        array_data = <npy_{{npy_udt}} *>PyArray_DATA(array)
        rng = <npy_{{npy_udt}}>(high - low)
        with nogil:
            rk_random_{{npy_udt}}(off, rng, cnt, array_data, state)
        return array

First alternative: merge the case for size==0 with the existing case where array is created: just don't fill the array.

    if size is None:
        rng = <npy_{{npy_udt}}>(high - low)
        rk_random_{{npy_udt}}(off, rng, 1, &buf, state)
        return np.{{np_dt}}(<npy_{{npy_dt}}>buf)
    else:
        array = <ndarray>np.empty(size, np.{{np_dt}})
        if np.prod(size) != 0:
            cnt = PyArray_SIZE(array)
            array_data = <npy_{{npy_udt}} *>PyArray_DATA(array)
            rng = <npy_{{npy_udt}}>(high - low)
            with nogil:
                rk_random_{{npy_udt}}(off, rng, cnt, array_data, state)
        return array

Second alternative: keep the definition for rng outside the cases but make it conditional.

    if np.prod(size) != 0:
        rng = <npy_{{npy_udt}}>(high - low)
    if size is None:
        rk_random_{{npy_udt}}(off, rng, 1, &buf, state)
        return np.{{np_dt}}(<npy_{{npy_dt}}>buf)
    elif np.prod(size) == 0:
        return <ndarray>np.empty(size, np.{{np_dt}})
    else:
        array = <ndarray>np.empty(size, np.{{np_dt}})
        cnt = PyArray_SIZE(array)
        array_data = <npy_{{npy_udt}} *>PyArray_DATA(array)
        with nogil:
            rk_random_{{npy_udt}}(off, rng, cnt, array_data, state)
        return array

A third possibility combines these two options.

Do any of these seem preferable? I'm looking forward to any feedback!

ghost · 2017-03-05T16:35:06Z

AppVeyor failed with

There are newer queued builds for this pull request, failing early.

Not sure what caused this?

charris · 2017-03-05T16:56:43Z

It has just finished #8669, not sure what is going on.

charris · 2017-03-05T17:00:46Z

Appveyor just tested the merge of #8669, with the same changeset as this. Not sure what is going on there. Anyway, I restarted the tests.

eric-wieser · 2017-03-05T17:58:18Z

Can you just make rk_random_{{npy_udt}} work when cnt==0? Also, this needs tests for the new randint behaviour.

eric-wieser · 2017-03-05T17:59:32Z

And why can't you compute the subtraction outside the if statement? Surely doing a subtraction alone won't cause an error.

ghost · 2017-03-05T23:25:30Z

The subtraction (high - low) can be computed outside of the if statement, but it is then cast to <npy_{{npy_udt}}>, which is unsigned. So this operation throws an error when low > high, namely OverflowError: can't convert negative value to {{npy_udt}}. But low > high is not an invalid input when prod(size)==0 (it occurs with e.g. randint(0,0,0), because randfunc is called with low, high-1, i.e. 0, -1). That's why I made it so that the operation isn't performed in that case. Is there a better solution?
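The failure mode can be imitated in plain Python. This is only an analogy, not Cython's conversion code: int.to_bytes with signed=False also refuses negative values with an OverflowError, and the helper name to_unsigned_checked is hypothetical.

```python
def to_unsigned_checked(value, nbytes=1):
    """Convert a Python int to an unsigned value, rejecting negatives.

    Stand-in for a checked conversion to an unsigned C type.
    """
    raw = value.to_bytes(nbytes, "little", signed=False)
    return int.from_bytes(raw, "little")


# randint(0, 0, 0) ends up calling the helper with low=0, high=-1.
low, high = 0, -1
try:
    to_unsigned_checked(high - low)
except OverflowError as exc:
    print("checked conversion failed:", exc)

# A non-negative difference converts without complaint.
print(to_unsigned_checked(19 - (-10)))
```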

As for rk_random_{{npy_udt}}: it already works when cnt==0, but calling it requires the rng variable, so I made sure rk_random was not called when rng was not assigned. However the call does seem to work even if rng has not been assigned, but I'm not sure if that is well-defined behaviour? That is, the following passes the current tests:

    cdef npy_{{npy_udt}} off, rng, buf
    cdef npy_{{npy_udt}} *out
    cdef ndarray array "arrayObject"
    cdef npy_intp cnt
    cdef rk_state *state = <rk_state *>PyCapsule_GetPointer(rngstate, NULL)

    off = <npy_{{npy_udt}}>(<npy_{{npy_dt}}>low)
    if low <= high:
        rng = <npy_{{npy_udt}}>(high - low)
    
    if size is None:
        rk_random_{{npy_udt}}(off, rng, 1, &buf, state)
        return np.{{np_dt}}(<npy_{{npy_dt}}>buf)
    else:
        array = <ndarray>np.empty(size, np.{{np_dt}})
        cnt = PyArray_SIZE(array)
        array_data = <npy_{{npy_udt}} *>PyArray_DATA(array)
        with nogil:
            rk_random_{{npy_udt}}(off, rng, cnt, array_data, state)
        return array

I'll add tests for randint too.

seberg · 2017-03-05T23:34:58Z

low > high would always seem to be an invalid input? Is there a real use case for special-casing this? Otherwise the error seems right to me.

eric-wieser · 2017-03-05T23:36:26Z

Yeah, to me the only important case is low==high, which just works here.

seberg · 2017-03-05T23:41:12Z

Hmmm, true, you can take nothing out of an empty set I guess. A bit funny, but probably right.

eric-wieser · 2017-03-05T23:43:34Z

I mean, that's the entire point of this PR!

Although Wikipedia argues that [a,b) is a valid but empty set even if b < a. So maybe the current patch is correct in allowing that.

ghost · 2017-03-06T00:48:21Z

Getting low==high to work in randint is enough, but it's not enough in _rand_{{npy_dt}}. That's because in randint(low,high,...) there is the call randfunc(low,high-1,...) (where randfunc=_rand_{{npy_dt}}). So for example the call randint(0,0,0) (which is also called to compute choice([],0)) calls randfunc(0,-1,...). So then in _rand_{{npy_dt}}, low>=high must work. Or am I mistaken?

eric-wieser · 2017-03-07T01:17:37Z

Lots of good points there. I'm generally pretty happy with this patch as it is, once you add the other tests.

A test of randint(0, -10, 0) would be nice too, to cover the inverted-interval case. I think that should be allowed to pass and return an empty array as well.
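The intended rule (an empty or inverted interval is fine as long as nothing is drawn) can be sketched in pure Python. The function sketch_randint is a hypothetical stand-in, not NumPy's implementation, and it returns a flat list rather than a shaped array:

```python
import math
import random


def sketch_randint(low, high, size):
    """Draw integers from [low, high) into a flat list of prod(size) items."""
    count = math.prod(size) if isinstance(size, tuple) else size
    if count == 0:
        # Nothing is drawn, so an empty or inverted interval is harmless.
        return []
    if low >= high:
        raise ValueError("low >= high")
    return [random.randrange(low, high) for _ in range(count)]


print(sketch_randint(0, -10, 0))        # inverted interval, no samples
print(sketch_randint(0, 0, (1, 0, 2)))  # empty interval, zero-size shape
print(len(sketch_randint(0, 5, (2, 3))))
```

Only a request for at least one sample from an empty interval is rejected, which matches the half-open reading of [low, high) discussed above.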

ghost · 2017-03-07T18:49:47Z

Tests added.

eric-wieser · 2017-03-07T19:06:33Z

Tests look good to me.

> The subtraction (high - low) can be computed outside of the if statement, but it is then cast to <npy_{{npy_udt}}>, which is unsigned. So this operation throws an error when low > high. Namely, OverflowError: can't convert negative value to {{npy_udt}}

Still confused about this. If this is true, then how does this line work:

off = <npy_{{npy_udt}}>(<npy_{{npy_dt}}>low)

ghost · 2017-03-07T21:19:55Z

Do you mean, why does this line not throw the same error when low<0? It seems to be because of the additional cast to npy_{{npy_dt}}. Indeed, adding this cast to the high-low subtraction resolves the 'negative value' error, but it introduces an error in one test and causes another to fail.

Although, I don't really understand how the low parameter even works at all, as the cast to an unsigned type would suggest that randint(-10,20) has the same effect as randint(10,20), right? But of course it doesn't. I thought maybe the sign information would be used somewhere else, but low and high aren't used anywhere else, except in high-low, and the sign information is lost there, too. The only things passed to rk_random_{{npy_udt}} are off and rng as far as the distribution is concerned, and those are both unsigned, so how can it know about negative offsets? This is probably not relevant to the PR though. Just confused.

eric-wieser · 2017-03-07T22:07:55Z

> as the cast to an unsigned type would suggest that randint(-10,20) has the same effect as randint(10,20), right?

Nope, (uint8_t) -10 gives -10 & 0xff = 246. ~~Although that might be implementation defined~~
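For anyone who wants to check this without a C compiler, a small sketch using ctypes, whose integer types do no overflow checking and simply keep the low bits:

```python
import ctypes

# Converting -10 to an 8-bit unsigned value keeps the low 8 bits.
as_unsigned = ctypes.c_uint8(-10).value
print(as_unsigned)
print(-10 & 0xFF)

# Reinterpreting the same bit pattern as signed recovers the original.
print(ctypes.c_int8(as_unsigned).value)
```

In C the conversion from a signed value to an unsigned type is defined as reduction modulo 2^N, so this direction of the cast is not implementation-defined.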

ghost · 2017-03-07T22:13:10Z

Aha, and then random numbers are generated based on that, and then interpreted as signed again... I suppose.
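That round trip can be sketched with 8-bit arithmetic in plain Python. The sketch assumes the C helper returns off plus a uniform draw in [0, rng], wrapped to the unsigned width; the names sketch_bounded, to_unsigned and to_signed are made up for illustration.

```python
import random

BITS = 8
MASK = (1 << BITS) - 1


def to_unsigned(value):
    """Keep the low BITS bits, like a cast to an unsigned type."""
    return value & MASK


def to_signed(value):
    """Reinterpret an unsigned BITS-bit pattern as two's complement."""
    return value - (1 << BITS) if value >= 1 << (BITS - 1) else value


def sketch_bounded(low, high_inclusive):
    off = to_unsigned(low)
    rng = (to_unsigned(high_inclusive) - to_unsigned(low)) & MASK
    draw = random.randint(0, rng)          # uniform in [0, rng]
    return to_signed((off + draw) & MASK)  # wrap, then read as signed


samples = [sketch_bounded(-10, 19) for _ in range(1000)]
print(min(samples), max(samples))
```

The sign is never stored separately: the wrap-around of the unsigned addition carries it.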

How about the issue with off? Is the extra conversion to npy_{{npy_dt}} something to pursue for rng?

eric-wieser · 2017-03-07T22:42:44Z

@MareinK: I've added a commit to fix that - there's actually a bug in the subtraction high - low, which can cause undefined behaviour. Try not to force push over it without intending to!

eric-wieser · 2017-03-07T22:43:29Z

numpy/random/mtrand/mtrand.pyx

 include "numpy.pxd"
+include "randint_helpers.pxi"


This needs to be after include "numpy.pxd" to have access to the numpy types

eric-wieser · 2017-03-07T22:43:44Z

numpy/random/mtrand/randint_helpers.pxi.in

@@ -23,7 +23,7 @@ def get_dispatch(dtypes):

 {{for npy_dt, npy_udt, np_dt in get_dispatch(dtypes)}}

-def _rand_{{npy_dt}}(low, high, size, rngstate):
+def _rand_{{npy_dt}}(npy_{{npy_dt}} low, npy_{{npy_dt}} high, size, rngstate):


May as well enforce casting here

eric-wieser · 2017-03-07T22:44:27Z

numpy/random/mtrand/randint_helpers.pxi.in

-    rng = <npy_{{npy_udt}}>(high - low)
-    off = <npy_{{npy_udt}}>(<npy_{{npy_dt}}>low)
+    off = <npy_{{npy_udt}}>(low)
+    rng = <npy_{{npy_udt}}>(high) - <npy_{{npy_udt}}>(low)


high - low can produce a result that doesn't fit in npy_dt, and cause signed overflow (which is undefined in C - not sure about Cython). Unsigned overflow, however, is well-defined, and does the right thing.
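A worked 8-bit example of this, in plain Python with masking to imitate fixed-width arithmetic (the helper name unsigned_range is hypothetical):

```python
BITS = 8
MASK = (1 << BITS) - 1
INT_MIN = -(1 << (BITS - 1))
INT_MAX = (1 << (BITS - 1)) - 1


def unsigned_range(low, high):
    """Compute high - low in unsigned BITS-bit arithmetic (wraps mod 2**BITS)."""
    return ((high & MASK) - (low & MASK)) & MASK


low, high = INT_MIN, INT_MAX

# The true difference does not fit in the signed type.
true_diff = high - low
print(true_diff, INT_MIN <= true_diff <= INT_MAX)

# Subtracting after the cast wraps around and still yields the true range.
print(unsigned_range(low, high))
```

The unsigned result equals the true difference whenever low <= high, because that difference is always below 2**BITS.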

eric-wieser · 2017-03-08T14:14:11Z

numpy/random/mtrand/mtrand.pyx

@@ -1106,8 +1106,6 @@ cdef class RandomState:
            raise ValueError("a must be 1-dimensional")
        else:
            pop_size = a.shape[0]
-            if pop_size is 0:
-                raise ValueError("a must be non-empty")


Perhaps "a cannot be empty unless no samples are taken"

ghost · 2017-03-08T15:18:15Z

I made the proposed changes. I noticed that for each error that is raised, the np.prod(size)!=0 check is made now. Is this the right way to go about it? Or would one global check be better? Perhaps at least the outcome could be placed in a variable and reused for each check.

I also noticed that now, there are some interesting effects with multi dimensional arrays, such as:

np.random.choice([[[[]]]],(3,0,4)) = array([], shape=(3, 0, 4, 1, 1, 0), dtype=float64)

Not sure if that is desirable. An error can be raised instead by removing just a single prod!=0 check (without other effects). Although going from our work so far I think I would expect array([], shape=(3, 0, 4)) to be the result.

eric-wieser · 2017-03-08T15:45:58Z

numpy/random/mtrand/mtrand.pyx

                raise ValueError("a must be greater than 0")
-        elif a.ndim != 1:
+        elif a.ndim != 1 and np.prod(size) != 0:


This shouldn't be here.

eric-wieser · 2017-03-08T15:46:29Z

numpy/random/mtrand/mtrand.pyx

@@ -1100,14 +1100,14 @@ cdef class RandomState:
                pop_size = operator.index(a.item())
            except TypeError:
                raise ValueError("a must be 1-dimensional or an integer")
-            if pop_size <= 0:
+            if pop_size <= 0 and np.prod(size) != 0:
                raise ValueError("a must be greater than 0")


This message probably wants fixing to be consistent with the others

eric-wieser · 2017-03-08T16:00:48Z

In case my previous comment didn't convey it, np.random.choice([[[[]]]],(3,0,4)) should be illegal, because numpy has decided that choice should only work on 1d arrays.

Arguably you could extend it to work over the axisth dimension of ND arrays, but that's for another PR, and it's not clear what size would mean in that case.
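The validation order this implies can be sketched as follows. This is a hypothetical helper working on shape tuples, not the code in mtrand.pyx: the dimensionality check stays unconditional, and only the emptiness check depends on whether any samples are requested.

```python
import math


def check_choice_args(a_shape, size):
    """Validate a population shape against a requested output shape.

    a_shape: shape tuple of the population array.
    size: requested output shape tuple.
    """
    if len(a_shape) != 1:
        # Applies even when size is empty.
        raise ValueError("a must be 1-dimensional")
    pop_size = a_shape[0]
    if pop_size == 0 and math.prod(size) != 0:
        raise ValueError("a cannot be empty unless no samples are taken")
    return pop_size


print(check_choice_args((0,), (3, 0, 4)))  # empty population, no samples
try:
    check_choice_args((1, 1, 1, 0), (3, 0, 4))  # shape of [[[[]]]]
except ValueError as exc:
    print(exc)
```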

ghost · 2017-03-08T16:02:21Z

Maybe for another time then :) It's illegal again with my latest commit (ValueError: a must be 1-dimensional).

eric-wieser · 2017-03-08T16:09:30Z

@seberg, I think I've had too much of a hand in this to give an impartial review - can you take a quick look?

homu · 2017-03-26T15:50:18Z

☔ The latest upstream changes (presumably #8649) made this pull request unmergeable. Please resolve the merge conflicts.

eric-wieser · 2017-04-26T19:36:58Z

doc/release/1.13.0-notes.rst

+Even when no elements needed to be drawn, ``np.random.randint`` and
+``np.random.choice`` raised an error when the arguments described an empty
+distribution. This has been fixed so that e.g.
+``np.random.choice([],0) == np.array([],dtype=float64)``.
 Bundled version of LAPACK is now 3.2.2


Newline needed here

homu · 2017-04-30T17:24:27Z

☔ The latest upstream changes (presumably #8885) made this pull request unmergeable. Please resolve the merge conflicts.

charris · 2017-08-18T20:37:12Z

#9576 affects randint so may be related.

charris · 2017-08-18T20:42:11Z

See also #7810, which adds multidimensional array support to random.choice.

mattip · 2018-06-13T20:02:47Z

This was approved but now needs a rebase/merge to resolve conflicts

ghost · 2018-06-15T10:21:51Z

I'd like to continue work on this but I've since deleted the fork that this request was based on. Is the best solution to create a new fork and a new pull request? Perhaps as described here? Thanks.

charris · 2018-06-15T13:47:25Z

@MareinK The PR is still available. I can put it up as a new PR if you don't have the code any longer.

ghost · 2018-06-16T15:15:18Z

Indeed I don't have the code available anymore. As I said I can reconstruct it, but I don't know what is the best way to go about it.

mattip · 2018-07-27T12:18:21Z

Replaced by #11383

charris added 01 - Enhancement component: numpy.random labels Feb 28, 2017

eric-wieser reviewed Mar 2, 2017

ghost closed this Mar 5, 2017

ghost reopened this Mar 5, 2017

eric-wieser reviewed Mar 7, 2017

eric-wieser reviewed Mar 8, 2017

Marein Könings and others added 2 commits March 8, 2017 15:35

ENH: allow size=0 in random.randint and random.choice

2623882

BUG: Avoid unsigned overflow in subtraction

4f76793

eric-wieser reviewed Mar 8, 2017

Improve error messages, add tests

5d5545e

eric-wieser approved these changes Mar 8, 2017

eric-wieser requested a review from seberg March 8, 2017 16:09

Merge remote-tracking branch 'origin' into random-choice-zero-size

c9d9ea1

eric-wieser reviewed Apr 26, 2017

eric-wieser mentioned this pull request Jun 15, 2017

randint over all of uint64 raises OverflowError if high=... is uint64 but not if it is int #9256

Closed

charris mentioned this pull request Aug 18, 2017

ENH: Add broadcasting to randint #9576

Closed

bashtage added a commit to bashtage/numpy that referenced this pull request Aug 31, 2017

ENH: Allow zero size arrays in randint

4840744

Allow randint to handle zero size arrays to allow choice to function on empty inputs xref numpy#8717

eric-wieser mentioned this pull request Feb 16, 2018

np.random.choice(0, size=0) fails #10597

Closed

mattip mentioned this pull request Jun 19, 2018

ENH: Allow size=0 in numpy.random.choice #11383

Merged

mattip closed this Jul 27, 2018
