RandomState.shuffle has a terrible overhead for NumPy arrays #5514

Closed
sturlamolden opened this issue Jan 28, 2015 · 22 comments

Comments
@sturlamolden
Contributor

The shuffle function for numpy.random.RandomState is implemented like this:

        cdef npy_intp i, j

        i = len(x) - 1

        # Logic adapted from random.shuffle()
        if isinstance(x, np.ndarray) and \
           (x.ndim > 1 or x.dtype.fields is not None):
            # For a multi-dimensional ndarray, indexing returns a view onto
            # each row. So we can't just use ordinary assignment to swap the
            # rows; we need a bounce buffer.
            buf = np.empty_like(x[0])
            with self.lock:
                while i > 0:
                    j = rk_interval(i, self.internal_state)
                    buf[...] = x[j]
                    x[j] = x[i]
                    x[i] = buf
                    i = i - 1
        else:
            # For single-dimensional arrays, lists, and any other Python
            # sequence types, indexing returns a real object that's
            # independent of the array contents, so we can just swap directly.
            with self.lock:
                while i > 0:
                    j = rk_interval(i, self.internal_state)
                    x[i], x[j] = x[j], x[i]
                    i = i - 1

This implementation is close to optimal for Python lists, as it just swaps PyObject* pointers. We could probably do a little better by declaring x as a list (cdef list x), but not much.

However for NumPy arrays this implementation is awful.

First, the indexing is implemented as Python function calls, since Cython does not know that x is an ndarray. Along with that comes all the usual overhead, such as bounds checking.

Second, the swapping statement x[i], x[j] = x[j], x[i] continuously creates and destroys dtype scalars (Python objects). In the multidimensional case it creates temporary ndarrays instead.

All in all, this is as bad as it gets.

In the single-dimensional case we should have the shuffling loop

while i > 0:
    j = rk_interval(i, self.internal_state)
    x[i], x[j] = x[j], x[i]
    i = i - 1

as a C function, with specialisations depending on the dtype (e.g. a switch statement on the typenum). Another possibility, avoiding C, is to use "fused types" in Cython to code this generically.
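For illustration, a minimal sketch of the fused-types route (not NumPy's actual code; fused_shuffle and item_t are made-up names, and libc's rand() is only a biased stand-in for rk_interval to keep the example self-contained):

cimport cython
from libc.stdlib cimport rand

ctypedef fused item_t:
    long
    double

@cython.boundscheck(False)
@cython.wraparound(False)
def fused_shuffle(item_t[::1] x):
    # Fisher-Yates over a typed memoryview: no Python indexing, no
    # temporary dtype scalars, just C-level loads and stores.
    cdef Py_ssize_t i, j
    cdef item_t tmp
    for i in range(x.shape[0] - 1, 0, -1):
        j = rand() % (i + 1)  # placeholder for rk_interval(i, state); biased
        tmp = x[i]
        x[i] = x[j]
        x[j] = tmp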

In the multidimensional case we could shuffle an array of void* pointers to the subarrays, and use this to fill in the output array.

This will kill an enormous amount of Python overhead in Monte Carlo simulations where we need to shuffle, e.g. in permutation tests.
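For concreteness, a hedged sketch of such a permutation test (permutation_test is a made-up name): shuffle() is called once per iteration, so its per-call overhead scales directly with n_iter.

import numpy as np

def permutation_test(a, b, n_iter=10000, seed=0):
    # Two-sample permutation test on the absolute difference of means.
    rng = np.random.RandomState(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # hot loop: shuffle overhead dominates runtime
        diff = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        count += diff >= observed
    return count / n_iter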

anntzer added a commit to anntzer/numpy that referenced this issue Jan 3, 2016
This patch modifies random.shuffle so that (when working on a ndarray)
an array of indices is shuffled and then elements are take()n from that
array in that order.  This allows the inner loop to be statically typed
(it turns out it is not so easy to write generic shuffling code
using Cython fused types) and thus much faster (~6x for me), at the
expense of a threefold increase in memory use (I guess take() needs to
create a copy, and an additional array of indices is created).

See numpy#5514.
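In NumPy-level terms, the approach the commit describes amounts to something like the following sketch (not the patch itself), here shuffling along axis 0:

import numpy as np

x = np.arange(12.0).reshape(4, 3)  # any ndarray to be shuffled along axis 0
idx = np.arange(len(x))
np.random.shuffle(idx)             # idx is a plain 1-D intp array: fast path
x[...] = x.take(idx, axis=0)       # take() makes a copy, hence the memory cost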
anntzer added a commit to anntzer/numpy that referenced this issue Jan 17, 2016
Only for 1d-ndarrays exactly, as subtypes (e.g. masked arrays) may not
allow direct shuffle of the underlying buffer (in fact, the old
implementation destroyed the underlying values of masked arrays while
shuffling).

Also handles struct-containing-object 1d ndarrays properly.

See numpy#6776 for an earlier, less general (but even faster: ~6x)
improvement attempt, numpy#5514 for the original issue.
jaimefrio pushed a commit to jaimefrio/numpy that referenced this issue Mar 22, 2016
@charris
Member
charris commented Jun 10, 2016

Can this be closed? The 1-D case has been fixed. For the multidimensional case I would suggest either shuffling the indexes and making a copy using take, or a special indexing class that uses shuffled indexes. The latter won't work for slices, however.

@charris
Member
charris commented Jun 10, 2016

I suppose it would be faster in the multidimensional case to make a shuffled copy, then copy back. The function should be reentrant as long as the index shuffling does not take place under the lock.

@charris
Member
charris commented Jun 10, 2016

I suppose one could also shuffle a 1-D object array of views in the multidimensional case, but the result would not act like a true multidimensional array.
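Presumably something along these lines (a hedged sketch of that idea):

import numpy as np

x = np.arange(12.0).reshape(4, 3)
rows = np.empty(len(x), dtype=object)
for i in range(len(x)):
    rows[i] = x[i]            # each element is a view onto one row of x
np.random.shuffle(rows)       # swaps PyObject* pointers; x itself is untouched
# rows now iterates in shuffled order, but it is a 1-D object array,
# not a true 2-D ndarray (no 2-D slicing, dtype is object, etc.)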

@anntzer
Contributor
anntzer commented Jun 10, 2016

The nd implementation uses a bounce buffer just big enough to hold one "row", which should be close to optimal.

@sturlamolden
Contributor Author
sturlamolden commented Jun 10, 2016

There are still some minor issues with the single-dimensional case. The loop on line 5100 should be over range(n-1,0,-1) to avoid the Python function reversed (unless Cython has become smart enough to optimize it away). We should also release the GIL around the loop. But apart from that it looks fixed.

The multidimensional case I am not sure about. An ndarray is generally not the appropriate data structure for shuffling more than one dimension. I would prefer to use a ragged array in my own code. Another option would be to shuffle a 1D array of indices and then use it for fancy indexing. I don't think it is worth the effort to try to fix something that is inherently unfixable because the data structure is unsuitable for the task.

@charris
Member
charris commented Jun 10, 2016

I think you need to keep the GIL to maintain repeatability. Does Cython deal with range(n-1,0,-1)? It would be interesting to see if there is a noticeable speedup with that change.

@anntzer
Contributor
anntzer commented Jun 10, 2016

Cython does optimize reversed away.
Using a shuffled index array was discussed in #6776 and shot down as it would incur a huge memory cost.

@sturlamolden
Contributor Author
sturlamolden commented Jun 10, 2016

We don't need to keep the GIL because we have acquired the lock on the RandomState object (line 5073).

@sturlamolden
Contributor Author

Yes, Cython does deal with range(n-1,0,-1), but if it can optimize away reversed we don't need it.

@seberg
Member
seberg commented Jun 10, 2016

Note that you need to be careful with releasing the GIL. Yes, you can do it, but only if the underlying array has no object references.

@sturlamolden
Contributor Author

We are just calling rk_interval and memcpy, none of which needs the GIL.
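For illustration, a hedged sketch of that kind of nogil loop (raw_shuffle is a made-up name, and libc rand() is a biased stand-in for rk_interval, used only to keep the sketch compilable; the caller is assumed to pass a buffer of n * itemsize bytes):

cimport cython
from libc.stdlib cimport rand, malloc, free
from libc.string cimport memcpy

@cython.boundscheck(False)
@cython.wraparound(False)
def raw_shuffle(unsigned char[::1] data, Py_ssize_t n, size_t itemsize):
    # Fisher-Yates over n fixed-size items in a contiguous byte buffer;
    # nothing in the loop touches Python objects, so the GIL can be released.
    cdef Py_ssize_t i, j
    cdef char *buf = <char *>malloc(itemsize)
    if buf == NULL:
        raise MemoryError()
    try:
        with nogil:
            for i in range(n - 1, 0, -1):
                j = rand() % (i + 1)  # placeholder for rk_interval(i, state)
                memcpy(buf, &data[j * itemsize], itemsize)
                memcpy(&data[j * itemsize], &data[i * itemsize], itemsize)
                memcpy(&data[i * itemsize], buf, itemsize)
    finally:
        free(buf)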

@anntzer
Contributor
anntzer commented Jun 10, 2016

You should also set the array flags to non-writable before entering the loop.
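That is, something along these lines (a hedged sketch; the C-level loop writes through pointers obtained beforehand, so it is unaffected by the flag, while reasonably behaved threads fail fast instead of racing):

import numpy as np

x = np.arange(10.0)
x.flags.writeable = False   # other threads now get an error if they write
try:
    pass  # the C-level nogil shuffle would run here, via raw pointers
finally:
    x.flags.writeable = True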

@sturlamolden
Contributor Author

Even in the case of object references we do not need it because we are not changing refcounts.

@sturlamolden
Contributor Author

Setting the non-writable flag temporarily would be safer, but an idiot could just undo the flag.

@anntzer
Contributor
anntzer commented Jun 10, 2016

It doesn't mean we shouldn't at least try to avoid data corruption by reasonably behaved multithreaded programs...

@sturlamolden
Contributor Author

I agree that setting non-writable is a good idea.

@seberg
Member
seberg commented Jun 10, 2016

I doubt it helps too much, but maybe a bit... Basically you have to be careful in any case, since you will get undefined behaviour even if you only read. But if you read from an object array, you might well blow up your reference counts and crash Python. Frankly, I just don't think it is worth the trouble to think about releasing the GIL for object arrays.

@sturlamolden
Contributor Author

Ok, so let us just special-case them. It is just an extra line of code.

@seberg
Member
seberg commented Jun 10, 2016

That was my thought; I hoped you would open a PR, since you seem to have an exact plan, and point that out ;).

@sturlamolden
Contributor Author

Will do, once I get home. No coding from my cell phone 😉

@bashtage
Contributor

Has this been fixed in master? There is a special path now.

@seberg
Member
seberg commented Apr 12, 2019

Yeah, I will close it. One could probably try to do similar optimizations for higher-dimensional arrays, but there are a lot of traps there, and in most cases it is probably not very useful. Doesn't mean we can't do it, but there is no need to keep an issue around.

@seberg seberg closed this as completed Apr 12, 2019