8000 ENH: Allow size=0 in numpy.random.choice, regardless of array by werenike · Pull Request #8717 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: Allow size=0 in numpy.random.choice, regardless of array #8717

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

ENH: Allow size=0 in numpy.random.choice, regardless of array #8717

wants to merge 4 commits into from

Conversation

werenike
Copy link
Contributor

Fixes #8311. Includes tests.

@@ -1094,6 +1094,8 @@ cdef class RandomState:

# Format and Verify input
a = np.array(a, copy=False)
if np.prod(size) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't testing the right thing at all. The code already worked when size had zero-entries, and this breaks np.random.choice(['a', 'b'], size=(3, 0, 4))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test:

assert_equal(np.random.choice(['a', 'b'], size=(3, 0, 4)).shape, (3, 0, 4))

s = (0,)
a = np.array([1.5, 2.5, 3.5])
p = [0.1, 0.4, 0.5]
assert_equal(np.random.choice([], s).shape, s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only test that fails on master, all the others already pass

@eric-wieser
Copy link
Member
eric-wieser commented Mar 2, 2017

Also while we're here, it would be nice if np.random.randint(x, x, size=(1,0,2)) was allowed as well. I think if you fixed that, the only change you'd need to make to choice is removing the error.

@werenike
Copy link
Contributor Author
werenike commented Mar 2, 2017

Hi eric, thanks for your comments. I didn't quite think it through. I wanted np.sum instead of prod and I'll remove the unnecessary tests. Should the randint idea be included here or in a different PR? Seems not quite related to allowing size=0?

@eric-wieser
Copy link
Member
eric-wieser commented Mar 2, 2017

@MareinK: prod is correct, but the size of the resulting array is wrong.

I don't think you should be special casing zeros size here. Remove the bit that throws the error, and then fix the bit that fails next. Fixing the randint thing will make random.choice(0, (0, 2, 3)) work, so is part of the same fix IMO.

@werenike
Copy link
Contributor Author
werenike commented Mar 2, 2017

Aaah I see what you mean, I will be working on it.

@werenike werenike closed this Mar 5, 2017
@werenike
Copy link
Contributor Author
werenike commented Mar 5, 2017

I made a new attempt. I'm not sure about the changes to randint_helpers.pxi.in. I had to create a new case for size==0 and move the definition of rng inside the relevant cases because it throws an error under conditions that are valid when size==0. I was thinking about some other alternatives:

Current solution:

    if size is None:
        rng = <npy_{{npy_udt}}>(high - low)
        rk_random_{{npy_udt}}(off, rng, 1, &buf, state)
        return np.{{np_dt}}(<npy_{{npy_dt}}>buf)
    elif np.prod(size) == 0:
        return <ndarray>np.empty(size, np.{{np_dt}})
    else:
        array = <ndarray>np.empty(size, np.{{np_dt}})
        cnt = PyArray_SIZE(array)
        array_data = <npy_{{npy_udt}} *>PyArray_DATA(array)
        rng = <npy_{{npy_udt}}>(high - low)
        with nogil:
            rk_random_{{npy_udt}}(off, rng, cnt, array_data, state)
        return array

First alternative: merge the case for size==0 with the existing case where array is created: just don't fill the array.

    if size is None:
        rng = <npy_{{npy_udt}}>(high - low)
        rk_random_{{npy_udt}}(off, rng, 1, &buf, state)
        return np.{{np_dt}}(<npy_{{npy_dt}}>buf)
    else:
        array = <ndarray>np.empty(size, np.{{np_dt}})
        if np.prod(size) != 0:
            cnt = PyArray_SIZE(array)
            array_data = <npy_{{npy_udt}} *>PyArray_DATA(array)
            rng = <npy_{{npy_udt}}>(high - low)
            with nogil:
                rk_random_{{npy_udt}}(off, rng, cnt, array_data, state)
        return array

Second alternative: keep the definition for rng outside the cases but make it conditional.

    if np.prod(size) != 0:
        rng = <npy_{{npy_udt}}>(high - low)
    if size is None:
        rk_random_{{npy_udt}}(off, rng, 1, &buf, state)
        return np.{{np_dt}}(<npy_{{npy_dt}}>buf)
    elif np.prod(size) == 0:
        return <ndarray>np.empty(size, np.{{np_dt}})
    else:
        array = <ndarray>np.empty(size, np.{{np_dt}})
        cnt = PyArray_SIZE(array)
        array_data = <npy_{{npy_udt}} *>PyArray_DATA(array)
        with nogil:
            rk_random_{{npy_udt}}(off, rng, cnt, array_data, state)
        return array

A third possibility combines these two options.

Do any of these seem preferable? I'm looking forward to any feedback!

@werenike werenike reopened this Mar 5, 2017
@werenike
Copy link
Contributor Author
werenike commented Mar 5, 2017

AppVeyor failed with

There are newer queued builds for this pull request, failing early.

Not sure what caused this?

@charris
Copy link
Member
charris commented Mar 5, 2017

It has just finished #8669, not sure what is going on.

@charris
Copy link
Member
charris commented Mar 5, 2017

Appveyor just tested the merge of #8669, with the same changeset as this. Not sure what is going on there. Anyway, I restarted the tests.

@eric-wieser
Copy link
Member

Can you just make rk_random_{{npy_udt}} work when cnt==0? Also, needs tests for the new randint behaviour too

@eric-wieser
Copy link
Member

And why can't you compute the subtraction outside the if statement? Surely doing a subtraction alone won't cause an error.

@werenike
Copy link
Contributor Author
werenike commented Mar 5, 2017

The subtraction (high - low) can be computed outside of the if statement, but it is then cast to <npy_{{npy_udt}}>, which is unsigned. So this operation throws an error when low > high. Namely, OverflowError: can't convert negative value to {{npy_udt}}. But low > high is not an invalid input when prod(size)==0 (and it occurs with e.g. randint(0,0,0), as randfunc is called with low, high-1 or 0,-1). So that's why I made it so that the operation isn't performed in that case. Is there a better solution?

As for rk_random_{{npy_udt}}: it already works when cnt==0, but calling it requires the rng variable, so I made sure rk_random was not called when rng was not assigned. However the call does seem to work even if rng has not been assigned, but I'm not sure if that is well-defined behaviour? That is, the following passes the current tests:

    cdef npy_{{npy_udt}} off, rng, buf
    cdef npy_{{npy_udt}} *out
    cdef ndarray array "arrayObject"
    cdef npy_intp cnt
    cdef rk_state *state = <rk_state *>PyCapsule_GetPointer(rngstate, NULL)

    off = <npy_{{npy_udt}}>(<npy_{{npy_dt}}>low)
    if low <= high:
        rng = <npy_{{npy_udt}}>(high - low)
    
    if size is None:
        rk_random_{{npy_udt}}(off, rng, 1, &buf, state)
        return np.{{np_dt}}(<npy_{{npy_dt}}>buf)
    else:
        array = <ndarray>np.empty(size, np.{{np_dt}})
        cnt = PyArray_SIZE(array)
        array_data = <npy_{{npy_udt}} *>PyArray_DATA(array)
        with nogil:
            rk_random_{{npy_udt}}(off, rng, cnt, array_data, state)
        return array

I'll add tests for randint too.

@seberg
Copy link
Member
seberg commented Mar 5, 2017

low > high would seem always an invalid input? Is there a real usecase for special casing this, because otherwise the error seems right to me.

@eric-wieser
Copy link
Member

Yeah, to me the only important case is low==high, which just works here.

@seberg
Copy link
Member
seberg commented Mar 5, 2017

Hmmm, true, you can take nothing out of an empty set I guess. A bit funny, but probably right.

@eric-wieser
Copy link
Member

I mean, that's the entire point of this PR!

Although Wikipedia argues that [a,b) is a valid but empty set even if b < a. So maybe the current patch is correct in allowing that.

@werenike
Copy link
Contributor Author
werenike commented Mar 6, 2017

Getting low==high to work in randint is enough, but it's not enough in _rand_{{npy_dt}}. That's because in randint(low,high,...) there is the call randfunc(low,high-1,...) (where randfunc=_rand_{{npy_dt}}). So for example the call randint(0,0,0) (which is also called to compute choice([],0)) calls randfunc(0,-1,...). So then in _rand_{{npy_dt}}, low>=high must work. Or am I mistaken?

@eric-wieser
Copy link
Member

Lots of good points there. I'm generally pretty happy with this patch as it is, once you add the other tests.

A test of randint(0, -10, 0) would be nice to, to cover the inverted-interval case. I think that should be allowed to pass and return an empty array as well..

@werenike
Copy link
Contributor Author
werenike commented Mar 7, 2017

Tests added.

@eric-wieser
Copy link
Member
eric-wieser commented Mar 7, 2017

Tests look good to me.

The subtraction (high - low) can be computed outside of the if statement, but it is then cast to <npy_{{npy_udt}}>, which is unsigned. So this operation throws an error when low > high. Namely, OverflowError: can't convert negative value to {{npy_udt}}

Still confused about this. If this is true, then how does this line work:

off = <npy_{{npy_udt}}>(<npy_{{npy_dt}}>low)

@werenike
Copy link
Contributor Author
werenike commented Mar 7, 2017

Do you mean, why does this line not throw the same error when low<0? It seems to be because of the additional cast to npy_{{npy_dt}}. Indeed, adding this cast to the high-low subtraction resolves the 'negative value' error, but it introduces an error in one test and causes another to fail.

Although, I don't really understand how the low parameter even works at all, as the cast to an unsigned type would suggest that randint(-10,20) has the same effect as randint(10,20), right? But of course it doesn't. I thought maybe the sign information would be used somewhere else, but low and high aren't used anywhere else, except in high-low, and the sign information is lost there, too. The only things passed to rk_random_{{npy_udt}} are off and rng as far as the distribution is concerned, and those are both unsigned, so how can it know about negative offsets? This is probably not relevant to the PR though. Just confused.

@eric-wieser
Copy link
Member
eric-wieser commented Mar 7, 2017

as the cast to an unsigned type would suggest that randint(-10,20) has the same effect as randint(10,20), right?

Nope, (uint8_t) -10 gives -10 & 0xff = 246. Although that might be implementation defined

@werenike
Copy link
Contributor Author
werenike commented Mar 7, 2017

Aha, and then random numbers are generated based on that, and then interpreted as signed again... I suppose.

How about the issue with off? Is the extra conversion to npy_{{npy_dt}} something to pursue for rng?

@eric-wieser
Copy link
Member
eric-wieser commented Mar 7, 2017

@MareinK: I've added a commit to fix that - there's actually a bug in the subtraction high - low, which can cause undefined behaviour. Try not to force push over it without intending to!

include "numpy.pxd"
include "randint_helpers.pxi"
Copy link
Member
@eric-wieser eric-wieser Mar 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be after include "numpy.pxd" to have access to the numpy types

@@ -23,7 +23,7 @@ def get_dispatch(dtypes):

{{for npy_dt, npy_udt, np_dt in get_dispatch(dtypes)}}

def _rand_{{npy_dt}}(low, high, size, rngstate):
def _rand_{{npy_dt}}(npy_{{npy_dt}} low, npy_{{npy_dt}} high, size, rngstate):
Copy link
Member

Choose a reason for hiding this comment

The reason will be d F438 isplayed to describe this comment to others. Learn more.

May as well enforce casting here

rng = <npy_{{npy_udt}}>(high - low)
off = <npy_{{npy_udt}}>(<npy_{{npy_dt}}>low)
off = <npy_{{npy_udt}}>(low)
rng = <npy_{{npy_udt}}>(high) - <npy_{{npy_udt}}>(low)
Copy link
Member
@eric-wieser eric-wieser Mar 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high - low can produce a result that doesn't fit in npy_dt, and cause signed overflow (which is undefined in C - not sure about Cython). Unsigned overflow, however, is well-defined, and does the right thing.

@@ -1106,8 +1106,6 @@ cdef class RandomState:
raise ValueError("a must be 1-dimensional")
else:
pop_size = a.shape[0]
if pop_size is 0:
raise ValueError("a must be non-empty")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps "a cannot be empty unless no samples are taken"

@werenike
Copy link
Contributor Author
werenike commented Mar 8, 2017

I made the proposed changes. I noticed that for each error that is raised, the np.prod(size)!=0 check is made now. Is this the right way to go about it? Or would one global check be better? Perhaps at least the outcome could be placed in a variable and reused for each check.

I also noticed that now, there are some interesting effects with multi dimensional arrays, such as:

np.random.choice([[[[]]]],(3,0,4)) = array([], shape=(3, 0, 4, 1, 1, 0), dtype=float64)

Not sure if that is desirable. An error can be raised instead by removing just a single prod!=0 check (without other effects). Although going from our work so far I think I would expect array([], shape=(3, 0, 4) to be the result.

raise ValueError("a must be greater than 0")
elif a.ndim != 1:
elif a.ndim != 1 and np.prod(size) != 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be here.

@@ -1100,14 +1100,14 @@ cdef class RandomState:
pop_size = operator.index(a.item())
except TypeError:
raise ValueError("a must be 1-dimensional or an integer")
if pop_size <= 0:
if pop_size <= 0 and np.prod(size) != 0:
raise ValueError("a must be greater than 0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message probably wants fixing to be consistent with the others

@eric-wieser
Copy link
Member
eric-wieser commented Mar 8, 2017

In case my previous comment didn't convey it, np.random.choice([[[[]]]],(3,0,4)) should be illegal, because numpy has decided that choice should only work on 1d arrays.

Arguably you could extend it to work over the axisth dimension of ND arrays, but that's for another PR, and it's not clear what size would mean in that case.

@werenike
Copy link
Contributor Author
werenike commented Mar 8, 2017

Maybe for another time then :) It's illegal again with my latest commit (ValueError: a must be 1-dimensional).

@eric-wieser
Copy link
Member
eric-wieser commented Mar 8, 2017

@seberg, I think I've had too much of a hand in this to give an impartial review - can you take a quick look?

@eric-wieser eric-wieser requested a review from seberg March 8, 2017 16:09
@homu
Copy link
Contributor
homu commented Mar 26, 2017

☔ The latest upstream changes (presumably #8649) made this pull request unmergeable. Please resolve the merge conflicts.

Even when no elements needed to be drawn, ``np.random.randint`` and
``np.random.choice`` raised an error when the arguments described an empty
distribution. This has been fixed so that e.g.
``np.random.choice([],0) == np.array([],dtype=float64)``.
Bundled version of LAPACK is now 3.2.2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline needed here

@homu
Copy link
Contributor
homu commented Apr 30, 2017

☔ The latest upstream changes (presumably #8885) made this pull request unmergeable. Please resolve the merge conflicts.

@charris
Copy link
Member
charris commented Aug 18, 2017

#9576 affects randint so may be related.

@charris
Copy link
Member
charris commented Aug 18, 2017

See also #7810, which adds multidimensional array support to random.choice.

bashtage added a commit to bashtage/numpy that referenced this pull request Aug 31, 2017
Allow randint to handle zero size arrays to allow choice to function
on empty inputs

xref numpy#8717
@mattip
Copy link
Member
mattip commented Jun 13, 2018

This was approved but now needs a rebase/merge to resolve conflicts

@werenike
Copy link
Contributor Author

I'd like to continue work on this but I've since deleted the fork that this request was based on. Is the best solution to create a new fork and a new pull request? Perhaps as described here? Thanks.

@charris
Copy link
Member
charris commented Jun 15, 2018

@MareinK The PR is still available. I can put it up as a new PR if you don't have the code any longer.

@werenike
Copy link
Contributor Author

Indeed I don't have the code available anymore. As I said I can reconstruct it, but I don't know what is the best way to go about it.

@mattip
Copy link
Member
mattip commented Jul 27, 2018

Replaced by #11383

@mattip mattip closed this Jul 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0