8000 ENH: use SeedSequence instead of seed() by mattip · Pull Request #13780 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: use SeedSequence instead of seed() #13780

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 26, 2019
Merged

Conversation

mattip
Copy link
Member
@mattip mattip commented Jun 14, 2019

Modify BitGenerators to use a SeedSequence class for seeded entropy, the class also provides a spawned() API to generate a number of independent entropy states for creating independent random streams.

Based on this gist by @rkern, the idea is

  • create a common base class for all BItGenerators
  • BitGenerator(seed) accepts an int, a sequence of ints, or a SeedSequence.
  • There is no longer a seed method to modify an existing BitGenerator, since creation of a new one is cheap. To seed a BitGenerator, simply create a new one with the desired seed value. This may change based on user acceptance of the idea.

The first commit only changes PCG64, as a proof of concept, since the change is rather large, and does not expose the SeedSequence's spawn method to the BitGenerator.

@mattip
Copy link
Member Author
mattip commented Jun 14, 2019

xref #13164, #13685, #13635, #13675

@mattip
Copy link
Member Author
mattip commented Jun 14, 2019

Hmm, python3.5 does not have a secrets module. Some other source of entropy is needed.

@bashtage
Copy link
Contributor

Hmm, python3.5 does not have a secrets module. Some other source of entropy is needed.

You can use random_entropy from numpy.random.entropy.

Copy link
Contributor
@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the multitude of comments.

I think BitGenerator is effectively to defining the interface and so needs to be adaptable to different generators.

else:
raise TypeError('{}(seed) expects int or SeedSequence for seed '
'not {}'.format(type(self).__name__, seed))
self._seed = seed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the seed get set into the bit generators state? In the old model, you would not call self.seed(self._seed)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could use a private class self._set_seed(seed) that is requried by children that handles specific seeding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand. I was thinking that the the BitGenerator class would call self._set_seed(self._seed) (which is provided by PCG64._set_seed(self, value)) at the end of its __init__. Might not be worth the hassle since then PCG64.__init__ would have to ensure that everything existed before the super call (which is fine for PCG64, but not for a generator that needs aligned memory).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BitGenerators call SeedSequence.generate_state, then use that value to set their seed. in particular PCG64 calls

val = self._seed_seq.generate_state(4, np.uint64); 
pcg64_set_seed(&self.rng_state,
                       <uint64_t *>np.PyArray_DATA(val),
                       <uint64_t *>np.PyArray_DATA(val) + 2)

@bashtage
Copy link
Contributor
  • There is no longer a seed method to modify an existing BitGenerator, since creation of a new one is cheap. To seed a BitGenerator, simply create a new one with the desired seed value. This may change based on user acceptance of the idea.

IMO this is not a good idea. Think about testing. Why create many generators when you really want to use the same one with the same state? The cost of creating a generator may also depend on the size of the state. For example, MT19937 allocates 4*624 + 4 bytes of state.

@rkern
Copy link
Member
rkern commented Jun 14, 2019

Think about testing. Why create many generators when you really want to use the same one with the same state?

It's actually pretty hard in most test frameworks to make just one object for the test suite and reseed it each test. You have to go out of your way to do it. It's much easier just to make it fresh each time in the setUp(). Tests are not time-critical, in any case, and even "large" PRNGs like MT are only large relative to other PRNGs. They are not large compared to many of the other objects in the tests.

@charris charris added this to the 1.17.0 release milestone Jun 14, 2019
@charris
Copy link
Member
charris commented Jun 14, 2019

Doc failure looks legitimate.

@bashtage
Copy link
Contributor

It's actually pretty hard in most test frameworks to make just one object for the test suite and reseed it each test. You have to go out of your way to do it. It's much easier just to make it fresh each time in the setUp(). Tests are not time-critical, in any case, and even "large" PRNGs like MT are only large relative to other PRNGs. They are not large compared to many of the other objects in the tests.

What I don't understand is that is the upside to moving seed from being public to private? There is a downside (creating 2 or 3 extra objects, depending on how the seed is implemented).

IME in pytest is easy to create a global automatically reset generator in conftest using a fixture, e.g.,

SEED = 12343928
GENERATOR = np.random.Generator(SEED)
@pytest.fixture()
def reset_generator():
    """
    Fixture that provides a Generator using the same initial state
    """
    GENERATOR.bit_generator.seed(SEED)
    yield GENERATOR

and then

def test_something(reset_generator):
    reset_generator.integers(...)

Of course one can do the same thing by grabbing and then resetting the initial state, so maybe this isn't much of a concern in this context.

The proof is in the pudding I suppose:

In [20]: %timeit rg.Generator(rg.PCG64())
27 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [21]: %timeit g.seed(12334)
5.49 µs ± 75.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [22]: state = g.state
In [23]: %timeit g.state = state
2.39 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So directly setting the state is fastest, which makes sense since it has the fewest safeties. This said, none are too slow in any important way, and even 10K calls is < 1s.

@mattip
Copy link
Member Author
mattip commented Jun 15, 2019

I am aiming for the minimal interface for the first release. We can always add seed back in if the performance hit hurts users.

@rkern
Copy link
Member
rkern commented Jun 16, 2019

We've seen such great misuse of .seed() in the old design as it's presence suggests to people that "this is the way to provide the seed for the RNG", that micro-optimizations aren't enough, in my mind, to keep it.

@seberg
Copy link
Member
seberg commented Jun 16, 2019

I have to agree with being minimal. If there is any other reason except slight optimization, sure. But I doubt few tests even notice that slowdown.

@mattip
Copy link
Member Author
mattip commented Jun 16, 2019

Does the design pattern for SeedSequence seem reasonable (create it in the base BitGenerator.__init__, use it in the child PCG64.__init__)? If so, I will convert the other BitGenerators to use this pattern

@mattip mattip changed the title WIP, ENH: use SeedSequence instead of seed(), jumped() WIP, ENH: use SeedSequence instead of seed() Jun 16, 2019
@mattip
Copy link
Member Author
mattip commented Jun 19, 2019

I have moved dSFMT, MT19937 and PCG32 to use seeding via SeedSequence (MT19937 will only use the new SeedSequence if a SeedSequence is provided for legacy reasons). I also updated the documentation.

I don't think it makes sense to use SeedSequence in the counter-based ThreeFry and Philox since they already mix some entropy into their initial state.

@rkern
Copy link
Member
rkern commented Jun 19, 2019
628C

The use of SeedSequence to convert non-ISeedSequence inputs should be universal in order to enable the later addition of Generator.spawn(). We won't be able to add that later if it's not the default now. It also enables all BitGenerators to accept any of the various forms of entropy that SeedSequence accepts, so it makes the APIs more uniform.

The only thing that needs the old MT19937 seeding is RandomState(seed), not MT19937(seed), so that can be an implementation detail of RandomState.

seed = random_entropy(1, 'fallback')
mt19937_seed(&self.rng_state, seed[0])
elif isinstance(seed, SeedSequence):
val = self._seed_seq.generate_state(1, np.uint32)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please get at least 128 bits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is more better? Should I use RK_STATE_LEN?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we'd bypass mt19937_init_by_array() entirely by pulling RK_STATE_LEN uint32 words, then make sure the first word isn't 0 like in the usual routine. Then that's the state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will let us replace SeedSequence with a LegacyMTSeedSequence at the Python level that replicates mt19937_init_by_array(). That would be a good way to implement RandomState(seed).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If following this path, should probably disallow BitGenerators from being input into RandomState, so that RandomState(MT19937(seed)) is not allowed since it will be different from RandomState(seed)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not to refactor the code from mt19937_init_by_array into a LegacyMTSeedSequence.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, @classmethod it is, then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked. Now RandomState.seed calls MT19337.legacy_seeded. If RandomState has a different BitGenerator, the call will fail. Also reworked the tests to create new Generators for each test rather than reseed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, NEP 19 says (well, I said) that in such cases we should pass the provided seed through to the BitGenerator. Though at the time I was probably still thinking of BitGenerator.seed() being an API that people might use, however disrecommended it was. I think I prefer the semantics you have now, though. If people are calling np.random.seed(), they probably wanted to reproduce the old bitstream, so making this error out when someone has threaded through another BitGenerator should raise the discrepancy up to them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If needed we can propose a change to the NEP since we now have a better picture of how this all should work

Copy link
Contributor
@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly formatting issues.

seed = random_entropy(1, 'fallback')
mt19937_seed(&self.rng_state, seed[0])
elif isinstance(seed, SeedSequence):
val = self._seed_seq.generate_state(1, np.uint32)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If following this path, should probably disallow BitGenerators from being input into RandomState, so that RandomState(MT19937(seed)) is not allowed since it will be different from RandomState(seed)

@bashtage
Copy link
Contributor

The use of SeedSequence to convert non-ISeedSequence inputs should be universal in order to enable the later addition of Generator.spawn(). We won't be able to add that later if it's not the default now. It also enables all BitGenerators to accept any of the various forms of entropy that SeedSequence accepts, so it makes the APIs more uniform.

I've been wondering about the API design for this. Is it necessary for a BitGenerator to expose this up to a Generator? In particular, I could imagine another approach that has a signature like

pgen = ParallelGenerator(seed: int, generator:Generator-like, bit_gen:BitGenerator, seed_size: int = 128)

that would be able to enable sequenced creation of any Generator-like object from any BitGenerator-like object as long as the BitGenerator can accept 32-bit unsigned integers. This avoids tight coupling of SeedSequence at the bottom and simplifies writing Generator-like classes that could contain user-provided distributions.

This is orthogonal to the use of SeedSequence internally to the BitGenerators I think.

@rkern
Copy link
Member
rkern commented Jun 19, 2019

Can you show a usage example? I don't have a clear idea from that constructor what you are proposing.

@mattip
Copy link
Member Author
mattip commented Jun 25, 2019

Let's keep this instancemethod private, either with an underscore ...

Adopting

@mattip
Copy link
Member Author
mattip commented Jun 25, 2019

This doesn't work. If this is the method that RandomState.seed(None) ...

What is "This"?

@rkern
Copy link
Member
rkern commented Jun 25, 2019

You should be testing that legacy_seeding creates an equivalent state.

All the tests in test_random have to use legacy_seeding since they use the old API, I did not change those test results and the tests pass

Well now you have the test's name test_seed_equivalency() opposite to the test's contents. The test was written to exercise certain semantics that now appear under a different API. Test that new API. Or delete the test; what you are testing now has no use.

@bashtage
Copy link
Contributor

All the tests in test_random have to use legacy_seeding since they use the old API, I did not change those test results and the tests pass

IIRC all of the tests use a scalar seed. Might be worth triple checking that there is at least one that uses the vector seed path, and adding one with a random-looking seed.

@mattip
Copy link
Member Author
mattip commented Jun 25, 2019

TestSeed seems to have the cases you mentioned.

@rkern
Copy link
Member
rkern commented Jun 25, 2019

This doesn't work. If this is the method that RandomState.seed(None) ...

What is "This"?

Are you able to see the code context of my comment? I made it on the commit, so maybe it's not being displayed to you.

val = self._seed_seq.generate_state(1, np.uint32)

  1. SeedSequence.generate_state() is idempotent. For a given requested length, it returns the same thing every time. That means that 2 calls to .legacy_seeding(None) resets the state to the same thing. This is not the required semantics, which is to get fresh entropy from the OS on each call.

  2. 32 bits is too small of a bottleneck. We've been over this before, and it's as true in this context as the previous one.

@mattip
Copy link
Member Author
mattip commented Jun 25, 2019

Since it seems Philox is one of the BitGenerators that will be shipped, I refactored it to use SeedSequence

Copy link
Member
@rkern rkern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are just a few items left that could be addressed in a subsequent PR. Getting this in will help unblock some of the other PRs.

Seeds can be passed to any of the BitGenerators. The provided value is mixed
via `~.SeedSequence` to spread a possible sequence of seeds across a wider
range of initialization states for the BitGenerator (except for `~.MT19937`
which for legacy reasons will use the older initialization methods). Here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now untrue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updating

`RandomState` object. Random number generation is separated into two
components, a bit generator and a random generator.
`RandomState` object. Random number generation is separated into three
components, a seed sequencer, a bit generator and a random generator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
components, a seed sequencer, a bit generator and a random generator.
components, a seed sequence, a bit generator and a random generator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adopting

@@ -153,6 +170,11 @@ The included BitGenerators are:
* MT19937 - The standard Python BitGenerator. Produces identical results to
Python using the same seed/state. Adds a `~mt19937.MT19937.jumped` function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only true with _legacy_seeding()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing

sequence) of unsigned 32-bit integers, a `SeedSequence`, or ``None``
(the default). If `seed` is ``None``, a 32-bit unsigned integer is
read from ``/dev/urandom`` (or the Windows equivalent) if available. If
unavailable, a 32-bit hash of the time and process ID is used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This text is out of date. It should be rationalized with the other SeedSequence-using BitGenerators.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing

self.rng_state.ctr.v[i] = counter[i]

self._reset_state_variables()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ctypes and cffi properties can also be deleted, as they are provided by BitGenerator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing

@@ -132,7 +131,7 @@ def test_permutation_subclass(self):
class N(np.ndarray):
pass

mt19937.bit_generator.seed(1)
mt19937 = RandomState(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this test using RandomState while all of the others using Generator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing last vestiges of RandomState from this file.

@rkern
Copy link
Member
rkern commented Jun 25, 2019

Needs a rebase, of course.

@mattip
Copy link
Member Author
mattip commented Jun 25, 2019

I will rebase once #13793 is merged

@mattip
Copy link
Member Author
mattip commented Jun 25, 2019

There are 31 commits here, should I rebase and squash them down to 1 or leave it be?

@charris
Copy link
Member
charris commented Jun 25, 2019

Might as well squash them down, it will make the rebase easier.

@charris
Copy link
Member
charris commented Jun 25, 2019

Should be good to rebase now.

@charris
Copy link
Member
charris commented Jun 25, 2019

CircleCI has found a couple of nits.

def __init__(self, seed=None, counter=None, key=None):
def __init__(self, seed_seq=None, counter=None, key=None):
if seed_seq is not None and key is not None:
raise ValueError('seed and key cannot be both used')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably be 'seed_seq and key`

n_children_spawned : {int}, optional
The number of children already spawned. Only pass this if
reconstructing a `SeedSequence` from a serialized form.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need an Attributes section for program_entropy and pool? If these are private should they be _ed?

@charris
Copy link
Member
charris commented Jun 25, 2019

Fixed the merge problem, let's see if things go.

@bashtage Lets get this in and then do the fixups. I think it is getting late out your way and Matti will be traveling tomorrow.

@charris
Copy link
Member
charris commented Jun 25, 2019

What should be the fix here?

   def test_pickle(self):
        import pickle
    
        bit_generator = self.bit_generator(*self.data1['seed'])
        state = bit_generator.state
        bitgen_pkl = pickle.dumps(bit_generator)
        reloaded = pickle.loads(bitgen_pkl)
        reloaded_state = reloaded.state
        assert_array_equal(Generator(bit_generator).standard_normal(1000),
                           Generator(reloaded).standard_normal(1000))
        assert bit_generator is not reloaded
        assert_state_equal(reloaded_state, state)
    
>       ss = SeedSequence(100)
E       NameError: name 'SeedSequence' is not defined

@rkern
Copy link
Member
rkern commented Jun 25, 2019

Just import it. With all of the squashing, it's hard to tell what happened.

@charris
Copy link
Member
charris commented Jun 26, 2019

I'm putting this in, Shippable is well behind at this point.

@bashtage @rkern Please make further fixups as PRs on master.

Thanks to all the reviewers, and thanks to Matti for hanging in there :)

@charris charris merged commit 4668b46 into numpy:master Jun 26, 2019
@rkern
Copy link
Member
rkern commented Jun 26, 2019

Thank you all for your patience and endurance!

@mattip mattip deleted the seedsequence branch October 13, 2020 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0