8000 ENH: random: Add the method `permuted` to Generator. by WarrenWeckesser · Pull Request #15121 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: random: Add the method permuted to Generator. #15121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Sep 2, 2020

Conversation

WarrenWeckesser
Copy link
Member
@WarrenWeckesser WarrenWeckesser commented Dec 17, 2019

Note: updated on 15-July-2020.

In 2014, I created a github issue [1]_ and started a mailing list
discussion [2]_ about a limitation of the functions shuffle and
permutation in numpy.random. The problem is that those functions
treat the input as 1-d sequence, and only apply the shuffle or
permutation to that 1-d input. There is no function to, say, shuffle
each row of a 2-d array, with the permutation of each row being
independent of the others.

At the time, most folks were in favor of adding new functions, although
there was still some debate about the names. This PR is an implementation
of the Generator method permuted with an out parameter as proposed
by @eric-wieser below, and as favored by @seberg in gh-5173. The full
signature is

def permuted(self, object x, *, axis=None, out=None):

When axis=None, the operation applies to the flattened array.
(The return value of permuted in that case still has the
same shape as x.) Otherwise, random permutations are applied
independently along the given axis.

Currently permuted is the only new function that is implemented.
For a consistent API, we would might eventually deprecate shuffle
and permutation (as proposed in gh=5173), and replace them with

def shuffled(self, object x, *, axis=None, out=None):

This would handle axis like shuffle and permutation currently
handle it. I can add that to this PR if we want to make that change now.

Closes gh-5173.

[1] #5173
[2] https://mail.python.org/pipermail/numpy-discussion/2014-October/071340.html

@WarrenWeckesser
Copy link
Member Author
WarrenWeckesser commented Dec 17, 2019

I guess it needs some unit tests, too. 😧

@WarrenWeckesser WarrenWeckesser changed the title ENH: random: : Add randomly_permute and randomly_permuted methods to Generator. WIP/ENH: random: : Add randomly_permute and randomly_permuted methods to Generator. Dec 17, 2019
@WarrenWeckesser
Copy link
Member Author

I added the WIP prefix to the title. I definitely need feedback on both the API and the implementation. The implementation of randomly_permute seems to be a pretty straightforward application of the iterator created with PyArray_IterAllButAxis. According to the documentation of the C API, that function has been superceded by NpyIter, but I haven't yet figured out how to do the equivalent iteration with NpyIter. Pointers to concrete examples would be appreciated!

Be sure to peruse the old mailing list thread to see the motivation for the API.

@eric-wieser
Copy link
Member
eric-wieser commented Dec 17, 2019

but no, I won't fight to the death for these names

Perhaps one of:

  • random_permutation(x, *, axis=.., out=None)
  • permutation_along_axis(x, *, axis=.., out=None)

Then out=x can mean "in-place"

@WarrenWeckesser
Copy link
Member Author

but no, I won't fight to the death for these names

Perhaps one of:

* `random_permutation(x, *, axis=.., out=None)`

* `permutation_along_axis(x, *, axis=.., out=None)`

Another name that I considered was shuffle_along_axis. None of the names here or in the mailing list thread are obvious first choices for me. If the default remains axis=None, a name of the form <something>_along_axis doesn't feel quite right, because the default doesn't actually shuffle along an axis.

+1 for making axis (and out, if implemented) keyword only.

Then out=x can mean "in-place"

Creating just one new function and using out to provide the in-place capability could work. That API feels like an outlier, though. That is, we don't use that for the existing functions shuffle or permutation, nor for numpy.sort, which theoretically could have almost the same signature as a shuffle function, since both sorting and shuffling create permutations of the input. What do other folks think of this idea?

@matthew-brett
Copy link
Contributor

I'm really sorry if I missed it, but what about adding an iaxis argument to shuffle and permute?

@WarrenWeckesser
Copy link
Member Author

@matthew-brett, is the idea that one gives iaxis=n to shuffle independently along the axis n? How does that parameter interact with the existing axis parameter? Would one ever meaningfully use both axis and iaxis?

I should also mention another API alternative from the email thread.

I originally proposed adding the boolean parameter independent to change the meaning of the axis argument of shuffle, instead of creating new functions. (In the email thread, I also started out with a misguided idea about how axis=None should be handled in that case. My foolishness is resolved after a few emails.) On the idea of adding the parameter independent to shuffle, Robert Kern commented:

It seems to me a perfectly good reason to have two methods instead of
one. I can't imagine when I wouldn't be using a literal True or False
for this, so it really should be two different methods.

@matthew-brett
Copy link
Contributor

Yes, my idea, to the extent I had thought it through, was to have iaxis be a keyword-only argument that was mutually exclusive of axis.

I did see your indedepent=True idea, but I didn't read Robert's mail. He is probably right, that once you realize what permute does, you'd want to specify independent=False, if you really wanted that beheviorr, but that seems to me an advantage. If we were starting from scratch, and there was another way of naming the functions, it would make more sense to have two functions, but with the names we have, and their curent behavior, the presence of the independent argument seems like a good warning about potential surprises.

@bashtage
Copy link
Contributor
bashtage commented Dec 20, 2019

Namespace pollution with these seems pretty bad. I think these only make sense if there is a move to remove shuffle/permutation. If these can't or shouldn't but removed, then I think it would be better to fix this with a cleverly chosen keyword argument. Using randomly_permute also seems to hint at an idea that Generator.permutation is less than random.

I also think having both is not a good idea. Why not just randomly_permute(...,copy=True). The inplace can always return the same object when copy=False

y = randomly_permute(x)
y is x # True
y = randomly_permute(x,copy=True)
y is x # False

I also think ca. 2014 the compatibility guarantee was so strong that it almost dictated to have a new function. I think in Generator there should be more room for maneuvering.

Of the names you have mentioned, I would prefer permute (remove the random) so that there would be permute and permutation. Ideally permutation could be deprecated if permute had all of its features.

Maybe I'm missing something, but how does this differ from the version in master permutation(x, axis=0) (at least for a permutation)?

Edit: I get it now. It is sort of like map(permutation, swapaxes(x,...)) . Maybe permutation_map. Or use the keyword map (or is the an apply) in permutation.

Copy link
Contributor
@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some things that could be improved.

@bashtage
Copy link
Contributor

transposition? I'm not sure if this is describing this type of change or is actually what permutation is doing.

@matthew-brett
Copy link < 8000 div aria-live="polite" aria-atomic="true" class="sr-only" data-clipboard-copy-feedback>
Contributor

I'd just like to humbly beg for the permuted and shuffled versions of these functions.

I know these are trivial extensions, but the same is true of the Python built-in "sorted", and I'm sure y'all have the same experience that I do, of finding it extremely useful, so that I almost invariably use sorted instead of the sort method of a list.

I would also find it very useful for teaching because:

  • I save a line or two of uninteresting code
  • I don't have to address the problem of mutable objects or views until a lot later.
  • I don't have to add a keyword like copy=True that requires me to explain mutable / views.

@bashtage
Copy link
Contributor

Another thought about a keyword argument. This feels like how these differ from permutation. The existing seems like an outer permutation since it permutes entire slices. This permutation seems like an inner since it permutes the elements within a slice (or the slice's .flat).

This suggests something like how="inner" or how="outer" or (less clear but simpler), inner=True as keywords.

@WarrenWeckesser
Copy link
Member Author

@bashtage, thanks for the review!

I probably won't get back to this until after the holidays.

@WarrenWeckesser
Copy link
Member Author

@bashtage, @matthew-brett and @eric-wieser, thanks again for the review and suggestions last December. This dropped down a few levels on my "to do" list, but @seberg commented over in gh-5173, so it is time to bump this back up the list. I'd like to continue the discussion of the API over in gh-5173 before getting back to the PR. I just added a comment in that issue with my attempt to summarize the API discussions so far. Let's use that issue to finish up the API design. This PR might not survive the discussion. 😮

@mattip
Copy link
Member
mattip commented Jul 3, 2020

I am confused about using "permute" and "shuffle" in the docstrings: do they mean the same or different things? The docstring for permutation and docstring for shufle do not clear up my confusion. It does seem that shuffle operates in-place while permutation returns a copy or a view, not clear. With that, it seems the signature at the top of the PR is wrong:

randomly_permuted(x, axis=None)  # Returns a permuted copy.

should have an out argument, like permutation. The docstrings should have some compare and contrast uses-cases: when would I want to use each of these 4 functions.

@WarrenWeckesser WarrenWeckesser marked this pull request as draft July 3, 2020 12:25
@WarrenWeckesser
Copy link
Member Author 8000

I am confused about using "permute" and "shuffle" in the docstrings: do they mean the same or different things?

"randomly permute" and "shuffle" mean the same thing: reorder the elements randomly. When typeset as code, shuffle refers to the existing Generator method.

It does seem that shuffle operates in-place while permutation returns a copy or a view, not clear.

permutation returns a shuffled copy. (permutation has an extra feature:
if it is given in integer n, it returns a shuffled arange(n).)

randomly_permuted(x, axis=None) # Returns a permuted copy. should have an out argument, like permutation.

permutation doesn't have an out argument--are you saying both should have one?


In my latest update to this PR, I added a new method, permuted, that has an out parameter. This function provides an alternative to the pair randomly_permute/randomly_permuted. (Just like randomly_permute[d], consider the name a place-holder--the primary question to be decided is the structure of the API.) I wanted to be able to experiment with both APIs before deciding which we should use. We have to decide which API is preferable before merging the PR. Let's discuss the API over in #5173

@mattip
Copy link
Member
mattip commented Jul 3, 2020

Sorry, my bad. permutation does not have an out arg.

So if I am a first-time user, how do I know which to use: permutation, shuffle, randomly_permute or randomly_permuted? In this PR there should be some higher level documentation of when to use what and how to distinguish between these similar-but-not-identical APIs.

@bashtage
Copy link
Contributor
bashtage commented Jul 3, 2020

As for the api, I find permuted with out to be very confusing. and inplace flag feels clearer, despite its singleton usage. I also think having the word random appear in the same of any Generator method except random() is not a good design. Random is implied.

I still think add permute, permuted (no out) and deprecate permutation is the long-term best solution.

@WarrenWeckesser
Copy link
Member Author

@bashtage, let's discuss the API over in #5173

@WarrenWeckesser WarrenWeckesser changed the title WIP/ENH: random: : Add randomly_permute and randomly_permuted methods to Generator. ENH: random: Add the method permuted to Generator. Jul 15, 2020
shuffle of the columns.

In-place vs. copy
~~~~~~~~~~~~~~~~~
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the moved up, the axis difference is secondary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a table summarizing this lengthy description or some other kind of TL;DR summary:

method copy/inplace axis handling
shuffle inplace as-if 1d
permutation copy as-if 1d
permute either copy or inplace axis independent

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also perhaps worth noting that shuffle works on non-numpy sequences.

@WarrenWeckesser
Copy link
Member Author

I'm updating the PR to ensure that the GIL is held when shuffling object arrays. This affects the existing shuffle method, so I'll update that too.

shuffle has this bit of code:

            with self.lock, nogil:
                # We trick gcc into providing a specialized implementation for
                # the most common case, yielding a ~33% performance improvement.
                # Note that apparently, only one branch can ever be specialized.
                if itemsize == sizeof(np.npy_intp):
                    _shuffle_raw(&self._bitgen, n, 1, sizeof(np.npy_intp),
                                 stride, x_ptr, buf_ptr)
                else:
                    _shuffle_raw(&self._bitgen, n, 1, itemsize, stride, x_ptr,
                                 buf_ptr)

Can anyone explain that "trick"? The calls to _shuffle_raw appear to be exactly the same, except the first one is passed a size that the compiler might know about. Is the first one supposed to be _shuffle_int (under the assumption that sizeof(npy_intp) == sizeof(int64))?

@eric-wieser
Copy link
Member

except the first one is passed a size that the compiler might know about

That's exactly the point. This strategy is used repeatedly by ufunc loops too. You can repeat identical code multiple times, and have the compiler optimize it a different way each time based on the if statement that surrounds it.

Doing it this way is safer and faster than trying to push the optimization through yourself.

@WarrenWeckesser
Copy link
Member Author

@eric-wieser, thanks. In _shuffle_raw, itemsize is only used as the third argument of memcpy. That still allows the compiler to optimize based on knowing that itemsize is sizeof(npy_intp)?

@seberg
Copy link
Member
seberg commented Aug 20, 2020

@WarrenWeckesser I just wondered if it was possible, and I think you can do with nogil(if_acceptable): and not change any other code?

@WarrenWeckesser
Copy link
Member Author

@seberg, thanks, I wasn't aware of that option. I don't think it helps here, because according to the Cython docs, the argument to nogil must be a constant known at compile time.

@WarrenWeckesser
Copy link
Member Author

That still allows the compiler to optimize based on knowing that itemsize is sizeof(npy_intp)?

From Sebastian and from some further reading: yes, gcc can, in fact, replace a memcpy call with inline code.

@WarrenWeckesser
Copy link
Member Author

@bashtage wrote

I do wonder if a fast past for shuffling a C contiguous array using the first axes should be supported now. I think this case would only be around 6 LOC using shuffle_int.

In the current version of the PR, if axis is None and the input is contiguous (either C or F), permuted will pass the array to shuffle as a 1-d array, and shuffle will use its fast path. Does this address your concern?

@WarrenWeckesser
Copy link
Member Author

In the latest version, I think I've addressed all the comments.

To cut down on the amount of repeated code, I made a thin inline wrapper of _shuffle_raw that does the "gcc trick" that should allow gcc to optimize the use of memcpy. Does that look like a reasonable approach?

Copy link
Member
@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I am happy, the only real change here is the except -1 to make sure that errors during WriteBackIfCopy (which I guess are possible) are floated correctly. I wonder if our __init__.pyd really needs those excepts in many places, but that is a different thing.

I am OK with this approach, its not super pretty but works. I believe that this was mentioned a few times on the mailing list before, so my plan will be to merge soon and then we post a heads-up there just in case someone is unhappy with the new API (now we have nice docs to point to as well).

@@ -28,6 +29,13 @@ from ._common cimport (POISSON_LAM_MAX, CONS_POSITIVE, CONS_NONE,
validate_output_shape
)

cdef extern from "numpy/arrayobject.h":
int PyArray_ResolveWritebackIfCopy(np.ndarray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int PyArray_ResolveWritebackIfCopy(np.ndarray)
int PyArray_ResolveWritebackIfCopy(np.ndarray) except -1

self.shuffle(out)
return out

ax = normalize_axis_index(axis, np.ndim(out))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ax = normalize_axis_index(axis, np.ndim(out))
ax = normalize_axis_index(axis, out.ndim)

@bashtage
Copy link
Contributor
bashtage commented Sep 1, 2020

LGTM. Clearly lots of hard work and thinking about the edge cases.

@mattip mattip merged commit f1d0378 into numpy:master Sep 2, 2020
@mattip
Copy link
Member
mattip commented Sep 2, 2020

Thanks @WarrenWeckesser

@charris charris mentioned this pull request Oct 10, 2020
20 tasks
charris added a commit to charris/numpy that referenced this pull request Nov 22, 2020
@WarrenWeckesser WarrenWeckesser deleted the shuffleupagus branch February 18, 2021 04:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Alternative to random.shuffle, with an axis argument.
7 participants
0