ENH: tracking issue for merging randomgen into numpy · Issue #13164 · numpy/numpy
Closed · 15 of 16 tasks
mattip opened this issue Mar 20, 2019 · 32 comments

@mattip
Member
mattip commented Mar 20, 2019

Merge bashtage/randomgen into numpy, as part of NEP 19.

  • get cythonize to work properly. Currently we call python -mcython infile outfile with no way to add extra command-line args
  • get the code to compile and import
  • get the tests to run and pass
  • replace mtrand with randomgen
  • work the documentation cleanly into our documentation
  • fix any TODOs
    • decrease module depth by moving numpy.random.randomgen.* into numpy.random
    • rewrite parts of the documentation, since randomgen now lives inside numpy.random rather than outside it
  • benchmark
  • check that the NEP and the implementation are consistent

Post merge checklist

  • work through the other issues and PRs related to random
  • make documentation flow more smoothly with indices and hierarchies
    • integrate examples to Numba, Cython, CFFI, ctypes more tightly
    • refactor out the common BRNG attribute/method documentation
  • discuss deprecating np.random.seed and access to functions via np.random.xxx (see the sketch at the end of this comment)
  • fix vendored code for NumPy style (two-space tabbing and more)

Edit: add more tasks (replace mtrand, fix TODOs)
Edit: add NEP check
Edit: add post-merge tasks
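
As an illustration of the np.random.seed deprecation item above: the discussion is about steering users away from hidden module-level state toward an explicit generator object. A minimal sketch, using the names as they eventually shipped (default_rng and Generator), not anything fixed by this checklist:

import numpy as np

# Legacy pattern under discussion for deprecation: hidden global state shared
# by all module-level np.random.* functions.
np.random.seed(12345)
legacy_sample = np.random.random(3)

# Explicit alternative: a locally owned Generator instance.
rng = np.random.default_rng(12345)
new_sample = rng.random(3)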

@mattip
Member Author
mattip commented Mar 20, 2019

xref #13163

@mattip
Member Author
mattip commented Mar 24, 2019

The first three tasks are completed.

@mattip
Member Author
mattip commented Apr 10, 2019

marked more tasks as completed as #13163 has progressed. Added two more tasks to refactor the namespaces and more tightly integrate documentation

@bashtage
Contributor

Should remove random_integers. This has been deprecated since 1.11.

@mattip
Member Author
mattip commented Apr 10, 2019

Should remove random_integers. This has been deprecated since 1.11.

Edit: reverted this change. We need to discuss this more widely; there still seem to be some use cases for random_integers, since with an np.int-typed max value, high + 1 will overflow.

@bashtage
Contributor
bashtage commented Apr 14, 2019

A summary of issues/PRs that I think are fixed in randomgen:

@bashtage
Contributor

@mattip This isn't quite right. You can generate every possible value of a dtype using randint, including the max value. random_integers is redundant.

@bashtage
Contributor

For example

import numpy as np
from randomgen import RandomGenerator, MT19937
rg = RandomGenerator(MT19937(0))
rg.randint(0, 2**64, dtype=np.uint64)
rg.randint(0, 2**32, dtype=np.int32)

@charris
Member
charris commented Apr 15, 2019

I think the problem comes when the range is a NumPy integer type, which is unavoidable for broadcasting if we go that way. Python ints are fine. In the old generator the C-level routine worked with closed intervals, so all that was needed was 2**64 - 1, which is a safe 64-bit value. That said, I don't know how randomgen does things.
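
A small sketch of the overflow described here, using only plain NumPy scalar arithmetic; int64 is used because its wrap-around matches the output quoted later in this thread:

import numpy as np

high = np.int64(2**63 - 1)  # maximum representable int64
wrapped = high + 1          # wraps to -2**63 and emits a RuntimeWarning (overflow)
exact = int(high) + 1       # Python ints are unbounded, so this stays exact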

@eric-wieser
Member

The problem @charris mentions has come up on GitHub at least once before. Some options I think I remember seeing:

  • randint(0, 2**64-1, closed=True), randint(0, 2**64-1, bounds='closed') or similar spellings to use a closed interval instead of a half-open interval
  • randint(0, 0, allow_empty=False, dtype=np.uint64) or similar spellings to treat the upper bound as if it had overflowed (eg, interpret x, x as the interval [x, x+2**64) instead of the empty interval). In theory allow_empty could vectorize, allowing the user to pass in an out-of-band bit for each item

@bashtage
Contributor

The underlying C generation code is the same and works on closed intervals so that there are no issues with types in C. The value checking and downcasting are handled using object dtype for u/int64, and the next largest type for all other types.

Broadcasting also works in extreme cases,

import numpy as np
from randomgen import RandomGenerator, MT19937

g = RandomGenerator(MT19937())
g.randint(np.zeros((3, 1)), np.array([2**64] * 3), dtype='uint64')
Out[6]: 
array([[10573188468909976714,   520699726749771223, 16676690052973521388],
       [ 8784019325716569234, 17221063269744199451, 14319133892835349315],
       [ 4530430698426414801,  4248163222813391406,  8521872816997754320]],
      dtype=uint64)

@eric-wieser
Member

The danger comes with code like

arr = np.array(..., dtype=np.int64)  # including int64 max
range_min = arr.min()
range_max = arr.max()
rand_arr = randint(range_min, range_max+1)

How does this behave?

@bashtage
Contributor

You get ValueError: low >= high, since the addition overflows:

range_min, range_max+1
C:\Anaconda\lib\site-packages\ipykernel_launcher.py:1: RuntimeWarning: overflow encountered in longlong_scalars
  """Entry point for launching an IPython kernel.
Out[9]: (-9223372036854775808, -9223372036854775808)

On the other hand, converting the bound to a Python int first works:

rand_arr = g.randint(range_min, int(range_max)+1)

Now if we want to get rid of randint and only have a bounded interval generator, then that would allow the code to be simplified.

@bashtage
Contributor

I still think random_integers should be removed, at least for now. It could always be brought back later with a more reasonable set of options and choices. There are three reasons why I think that removal is the right choice:

  1. It is completely redundant. It is just a specific, wrapped call to randint with dtype='l'. It doesn't even help in the one case where it theoretically could, since it doesn't cast to a Python int internally, so adding 1 to an np.int64 wraps here and you get a high < low error.
In [6]: np.random.random_integers(0,np.int64(2**63-1))

/home/kevin/miniconda3/bin/ipython:1: DeprecationWarning: This function is deprecated. Please call randint(0, 9223372036854775807 + 1) instead
  #!/home/kevin/miniconda3/bin/python
/home/kevin/miniconda3/bin/ipython:1: RuntimeWarning: overflow encountered in long_scalars
  #!/home/kevin/miniconda3/bin/python
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-a7b97ffba7f7> in <module>
----> 1 np.random.random_integers(0,np.int64(2**63-1))

mtrand.pyx in mtrand.RandomState.random_integers()

mtrand.pyx in mtrand.RandomState.randint()

ValueError: Range cannot be empty (low >= high) unless no samples are taken
  2. Its type is platform dependent.
  3. The error messages returned by random_integers are wrong.

@bashtage
Contributor

The closed=True option seems like the easiest fix. It would require virtually no changes except for the u/int64 paths.

@bashtage
Contributor
bashtage commented Apr 15, 2019

Should versionadded tags be removed since this module is fresh?

I mean only in RandomGenerator, not in RandomState

@bashtage
Contributor

mattip#15 contains a patch that adds the closed kwarg. It isn't going to be merged yet, so that the merits of this approach can be discussed after #13163 is merged.

@mattip
Member Author
mattip commented May 14, 2019

randint(..., closed=False) in mattip#15 was merged, so the new randomgen will not have random_integers.
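
For reference, in the API that eventually shipped the closed-interval behaviour is exposed as the endpoint keyword of Generator.integers (a different spelling than the closed kwarg discussed here), which makes the full uint64 range reachable without ever forming high + 1:

import numpy as np

rng = np.random.default_rng(12345)
# endpoint=True makes the upper bound inclusive, so the whole uint64 range
# [0, 2**64 - 1] can be sampled without overflowing a 64-bit bound.
rng.integers(0, 2**64 - 1, size=3, dtype=np.uint64, endpoint=True)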

@mattip
Member Author
mattip commented May 29, 2019

I went quickly through the issues and PRs mentioned above and closed the ones I could do without too much thinking. I will make another pass through soonish. There are still a few open, as well as all those labelled with numpy.random

@bashtage
Contributor

Do you think some of these need additional tests?

@bashtage
Contributor
bashtage commented Jun 1, 2019

I think #7861 can be closed now since one can use the new random API in low-level Cython code.

@mattip
Member Author
mattip commented Jun 6, 2019

Summary of the present status:
The remaining items in the top comment's checklist are enhancements that can be done moving forward, for instance documentation improvements and creating a base class for the BitGenerators.

Looking at the issues and PRs marked with numpy.random, the following seem to be possible blockers for final acceptance of the randomgen branch:

A number of other issues propose enhancements and clarifications on best practices, like #6132, or #9650 that can be done on an ongoing basis.

@charris
Member
charris commented Jun 6, 2019

I think zipf (and related) needs a range keyword for when a is near 1. The idea is to rescale the sampling so that most draws are in a valid, representable range rather than inf. Doing so speeds things up enormously. Might need a new function name (szipf?, zipfm1?).

@charris
Member
charris commented Jun 6, 2019

To clarify a bit, the sample scaling is one problem, the other is that with values of a near 1, one wants the small valued tail, and the range sets the limit on the tail. Previously the tail limit was set by the size of long integers.
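
A quick illustration of that heavy tail, using the Generator.zipf that shipped (the proposed range keyword / szipf variant does not exist; exact numbers will vary from run to run):

import numpy as np

rng = np.random.default_rng(0)
# With a close to 1 the tail is extremely heavy: typical draws are dwarfed by
# occasional draws many orders of magnitude larger, historically capped only
# by the size of a C long.
draws = rng.zipf(a=1.1, size=100_000)
int(np.median(draws)), int(draws.max())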

@rkern
Member
rkern commented Jun 12, 2019

A couple more things:

  • There is now a better algorithm, thanks to @imneme, for deriving starting states from seeds than the ones we have currently. I'd like to see us use it consistently. It ensures that each bit in the provided seed data has a chance of affecting each bit in the requested output, which means that even sequential seeds, changing only in the LSBs of a very long sequence of bits, will give a good spread in the state space (see the sketch after this list). Which is just a nice property on its own, but it also:

  • Allows for reproducible PRNG spawning without needing to plan ahead, like .jumped() requires. The guarantees are "merely" probabilistic, but given the strength of the avalanche properties of the seed algorithm and the state sizes of the PRNGs under consideration, I am willing to stake my reputation on recommending this. I've tested PCG32 (the one most likely to fail on size-of-state grounds and what little architectural similarity to the hashing algorithm it has), PCG64, and Xoroshiro256 out to a few TiB each on PractRand with 4096 streams derived by this spawning algorithm (DSFMT fails earlier, but DSFMT just fails with 1 stream). I consider the likelihood of user error with .jumped() much higher than the chance of a collision in the streams. We have a good chance to advance best practices here for a modern set of dynamic parallel programming tools.

  • I'd like to remove the settable-increments from PCG32 and PCG64 and just treat them as things that get seeded. Users won't pick good ones. We could get around it by hashing the user-provided increment, but I think seeding it is better (and using either .spawn() or .jumped() to get new streams). This would assist in aligning the feature sets of all of the PRNGs, making them more easily replaced with each other. That alleviates some of the worry of adding PCG64 as the default (insofar as having a default with a more full-featured API might cause people to rely on that API).
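
A small sketch of the seeding behaviour described in the first bullet, spelled with the SeedSequence API that later shipped in NumPy (released names; nothing here was fixed at the time of this comment):

import numpy as np

# Sequential seeds differ only in their lowest bits, but SeedSequence's hashing
# spreads them across the whole space of the derived state words.
for seed in (0, 1, 2):
    ss = np.random.SeedSequence(seed)
    print(seed, ss.generate_state(2))  # two well-mixed 32-bit words per seed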

@rkern
Member
rkern commented Jun 12, 2019

Oops, a slightly old version of that got copy-pasted. I also tried Philox and ThreeFry successfully as well, but I was really unconcerned about them.

This is how I envisioned SeedSequence to be used, to make it explicit:

The BitGenerator constructors can all take an explicit ISeedSequence instance (which could be SeedSequence or anything that implements that interface) or any one of the inputs that SeedSequence takes (which will be converted to SeedSequence) OR a properly-formatted state dict for bitgen.state['state'] if someone wants complete control over the state. This would make the constructors more like each other instead of per-BitGenerator ad hoc arguments like ThreeFry's and Philox's key= argument and PCG32's and PCG64's inc= argument.

# Each of these would be valid ways to instantiate a Philox BitGenerator.
np.random.Philox()  # -> Philox(SeedSequence())
np.random.Philox(None)  # -> Philox(SeedSequence(None))
np.random.Philox(12345)  # -> Philox(SeedSequence(12345))
np.random.Philox([314159, 271828])  # -> Philox(SeedSequence([314159, 271828]))
np.random.Philox(SeedSequence(12345))
np.random.Philox(MyOwnSeedSequenceImplementation(12345))
np.random.Philox({'counter': np.array([0x01, 0, 0, 0], dtype=np.uint64),
                  'key': np.array([ 7639784375349706623, 18058138874403901375], dtype=np.uint64)})

Why the ISeedSequence ABC instead of just the one implementation? At the very least, we need an implementation of the old MT19937 seeding algorithm to reproduce RandomState's pre-1.17 behavior. So RandomState.__init__(bit_generator=None) would have something like:

elif not hasattr(bit_generator, 'capsule'):
    bit_generator = _MT19937(LegacyMTSeedSequence(bit_generator))

Other generator authors also provide their own recommended ways of expanding integer seeds out to internal states. I'm not concerned about replacing those with SeedSequence for general use (I think this algorithm is at least as good as any of them), but people may want to implement those for comparison purposes with other systems.

We discussed the spawning in the most recent Development Meeting. I think the general consensus was not to expose the API at the level of the Generator, yet. We should probably get a little bit of experience with this out in the wild first before including it in the most prominent API. For 1.17, people experimenting with this facility would instantiate their root SeedSequence, explicitly pass its spawn out to the processes that need it, and use np.random.default_gen(seed_seq) to get the Generator for each. It's possible that we will need to have BitGenerators own and expose the SeedSequence that they create or are passed in order to enable this future API. We won't be able to add it later because third-party implementations won't do that. On the other hand, it's possible that we can implement the handling of SeedSequence in BitGenerator.__cinit__() or BitGenerator.__init__() and make sure all implementations call those. I think that would let third-party implementations not have to handle it explicitly and let us add the ownership at a later time. There might be some Cython details that preclude that; I'm not sure.
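
A sketch of that 1.17-era workflow, using the names as they were eventually released (default_rng rather than the default_gen mentioned above):

import numpy as np

# Instantiate the root SeedSequence once, spawn children explicitly, and hand
# each child to the process or task that needs its own independent stream.
root_ss = np.random.SeedSequence(20190612)
child_seeds = root_ss.spawn(4)  # one child per worker
generators = [np.random.default_rng(ss) for ss in child_seeds]
samples = [g.standard_normal(2) for g in generators]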

I'm also happy to drop the program_entropy= argument for this release. It's something we can add in later when I am able to formulate a better elevator pitch for the concept.

@mattip
Member Author
mattip commented Jun 13, 2019

In addition to changing seed and jumped, we need to decide about:

  • which BitGenerators to include in the upcoming numpy release. The general feeling in the last community meeting was to go for a minimum rather than a maximum. That means including two: the legacy MT19937 and the Generator default.
  • which BitGenerator should be the Generator default. Despite the performance concerns, it was felt PCG64 is the best option, even though it is slow on some platforms.

@bashtage: any thoughts? Especially around the idea to continue to maintain a repo of alternative BitGenerators that would not be directly part of this repo?

@seberg
Member
seberg commented Jun 13, 2019

As much as I like the SeedSequence idea, I am a bit unsure whether we need to expose it. I understand the argument that it hides spawn from the Generator or BitGenerator at the moment. But in the long term, there seems little usage for SeedSequence directly? Except maybe that if we carry around a SeedSequence object implicitly that could mean carrying around a largish array if someone uses a large array for seeding.

I suppose the argument of being able to replace the actual seeding procedure is valid, but I am unsure if it is an actual likely use case.

Or is there an API reason for wanting to use a SeedSequence for anything more than spawn? (I could imagine spawning a different BitGenerator from an existing one, thus wanting a/its SeedSequence as a start.)

On the other hand, if we use it internally anyway, we could just deprecate it at some point and keep it around indefinitely without much harm....

@rkern
Member
rkern commented Jun 13, 2019

To me, the costs entailed by exposing it are not much. Since the BitGenerators do provide stream compatibility guarantees, the bulk of the usual indefinite maintenance cost is already priced in; we have to maintain the algorithm indefinitely. It's just a matter of the precise public API, which can be managed with deprecations, if we need to.

Since exposing SeedSequence as part of the documented API lets us delay promoting a Generator.spawn() API until we have built up some confidence in the concept, I think that cost is worthwhile.

I think the ISeedSequence interface also fits in with the basic concept of this redesign, like the Generator/BitGenerator split. We're giving advanced developers the tools to plug in different implementations and experiment with new options. For example, someone wanting to use the counter-mode crypto PRNGs with strictly incrementing keys to get independent streams. At the same time, no one has to touch any of that, or even think about it; they will just use np.random.default_gen(seed) most of the time.
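
To make the two levels concrete, a minimal sketch using the released names (default_rng, Generator, Philox; default_gen above became default_rng):

import numpy as np

# Most users: pass a seed and never think about SeedSequence at all.
rng = np.random.default_rng(12345)

# Advanced users: build the SeedSequence explicitly and feed it to a specific
# BitGenerator, here Philox, wrapped in a Generator.
ss = np.random.SeedSequence(12345)
advanced_rng = np.random.Generator(np.random.Philox(ss))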

@mattip
Member Author
mattip commented Aug 18, 2019

Restating the remaining open random issues from this comment

A number of other issues propose enhancements and clarifications on best practices, like #6132, or #9650 that can be done on an ongoing basis.

@rkern, @bashtage any opinions?

@bashtage
Contributor

I am:

I think #6132 is . #9650 is solved using SeedSequence although it isn't automatic I think.

@mattip
Member Author
mattip commented Dec 8, 2019

I would like to close this issue, since it has served its purpose while we were intensively working on random. Not all the issues referenced are closed, but I don't think keeping this open raises the chance of closing those. Please reopen with more details if you see a purpose in keeping this open.

mattip closed this as completed Dec 8, 2019