WIP: Small Fast Chaotic PRNGs #41


Closed · wants to merge 22 commits

Conversation

@rkern commented Jun 22, 2019

Here are two more PRNGs to enter the fray. They are each at least as fast as Xoshiro256, at least on my 64-bit Linux machine. I expect they'll be about the same on 32-bit machines, but I have not tested that.

❯ python benchmark.py
--------------------------------------------------------------------------------

Time to produce 1,000,000 Uniforms
************************************************************
JSF64         2.38 ms
MT19937       6.88 ms
PCG32         3.58 ms
PCG64         4.10 ms
Philox        6.15 ms
SFC64         2.44 ms
ThreeFry      8.53 ms
Xoshiro256    2.80 ms
numpy         6.88 ms
dtype: object

Uniforms per second
************************************************************
JSF64         419.99 million
MT19937       145.28 million
PCG32         279.56 million
PCG64         243.88 million
Philox        162.65 million
SFC64         410.45 million
ThreeFry      117.18 million
Xoshiro256    356.89 million
numpy         145.28 million
dtype: object

Speed-up relative to NumPy
************************************************************
JSF64         189.1%
MT19937         0.0%
PCG32          92.4%
PCG64          67.9%
Philox         12.0%
SFC64         182.5%
ThreeFry      -19.3%
Xoshiro256    145.7%
dtype: object
--------------------------------------------------------------------------------

They are each 256-bit PRNGs architected fairly similarly, based on random invertible mappings. They each have a small number of separate cycles within their state space, but it is exceedingly unlikely that your seeding won't fall on the big ones. You can expect a period of about 2**255. SFC64 incorporates a 64-bit counter in its state so you have the extra safety that, should you happen to be enormously unlucky, you always have a minimum of 2**64 states in your cycle.
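
For concreteness, here is a minimal C++ sketch of the SFC64 step function as it appears in PractRand (rotation 24, right shift 11, left shift 3); the seeding shown is purely illustrative and is not the scheme used by this PR.

// Minimal sketch of the SFC64 step function (PractRand constants: rotate 24,
// right shift 11, left shift 3). Illustrative only -- not the code in this PR.
#include <cstdint>
#include <cstdio>

struct sfc64_state { uint64_t a, b, c, counter; };

static inline uint64_t rotl64(uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

static inline uint64_t sfc64_next(sfc64_state &s) {
    const uint64_t out = s.a + s.b + s.counter++;  // counter term drives the minimum period
    s.a = s.b ^ (s.b >> 11);
    s.b = s.c + (s.c << 3);
    s.c = rotl64(s.c, 24) + out;
    return out;
}

int main() {
    // Illustrative seeding only: fill a/b/c with arbitrary seed material, start
    // the counter at 1, and discard some initial outputs to mix the state.
    sfc64_state s = {0x9e3779b97f4a7c15ULL, 0xbf58476d1ce4e5b9ULL,
                     0x94d049bb133111ebULL, 1};
    for (int i = 0; i < 12; ++i) (void)sfc64_next(s);
    std::printf("%016llx\n", (unsigned long long)sfc64_next(s));
}

The counter term is what gives the 2**64 minimum cycle length mentioned above: even if a, b, and c happened to land on a short cycle, the full state cannot repeat until the counter wraps.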

Both are very well-tested. JSF64 has been analyzed for a long time. SFC64 seems to be inspired by that work; it was written by the author of PractRand so it too has been pretty thoroughly tested.

Their one downside is that they don't have an efficient jumpahead operation. @imneme (whose implementation I referenced) seems to think that it might be doable for SFC64, but I don't know where to start with that.

@bashtage

Some Win64 timings

Time to produce 1,000,000 Uniforms
************************************************************
JSF64          5.45 ms
MT19937       10.48 ms
PCG64          6.23 ms
Philox        27.40 ms
SFC64          5.20 ms
ThreeFry      20.50 ms
Xoshiro256     6.89 ms

and Win32

Time to produce 1,000,000 Uniforms
************************************************************
JSF64          13.88 ms
MT19937        13.98 ms
PCG64          46.19 ms
Philox         60.32 ms
SFC64          13.19 ms
ThreeFry       41.87 ms
Xoshiro256     15.91 ms

@mattip (Owner) commented Jun 22, 2019

I would prefer to add only one of them to numpy. We could add both to the collection of alternative BitGenerators in numpy/bitgenerators.

@rkern (Author) commented Jun 22, 2019

I would prefer to add only one of them to numpy.

Certainly, hence the WIP. I wanted both up for comparison purposes. I'd pick SFC64 of the two.

@imneme commented Jun 23, 2019

BTW, I suspect the comment about jump-ahead is a copy-and-paste-o. The same comment is at the bottom of some other PRNGs where it is actually true.

FWIW, I reported to Chris that PractRand claimed to find a problem with SFC64 (but not the smaller variant). In version 0.94 of PractRand Chris changed the constants for SFC, which may address/avoid the issue.

@rkern (Author) commented Jun 23, 2019

Hmm, no, there's no difference in the sfc64 implementation between 0.93 and 0.94. The last update to the constants happened in 0.93.

@imneme commented Jun 23, 2019

Oh, sorry, I misread the release notes. You're right, the constants were improved in 0.93. Which is unfortunate because it means that the version that failed pat5 is still the standard one. Someone should rerun the test just to be sure, but it'll take 12 days 13.5 hours on my (somewhat slow but 48-core) machine. You can rerun it with:

./RNG_test sfc64 -te 1 -tlmin 20 -tlmax 50 -seed 0xdf7e5fd0802274af

In the interests of making sure the result reproduces, I've set it rerunning on a faster machine. It should be done in about a week.

@rkern (Author) commented Jun 23, 2019

I've patched PractRand to let me just run the Pat5 tests. Let's see if that provokes the failure any faster in wall-clock time.

@imneme commented Jun 23, 2019

FWIW, here's the full log file from when I tested it. The whole thing is pretty weird. It's super unhappy about it, but then its outrage fades as more data comes in.

sfc64-test.log

@rkern (Author) commented Jun 23, 2019

Interesting. In my Pat5-only run, I got to 64 TiB without a failure:

❯ ./RNG_test sfc64 -te 100 -tlmin 30 -tlmax 50 -seed 0xdf7e5fd0802274af -multithreaded
...
rng=sfc64, seed=0xdf7e5fd0802274af
length= 64 terabytes (2^46 bytes), time= 46871 seconds
  no anomalies in 72 test result(s)

And then promptly mis-pressed Ctrl-C to copy that here, so I have to start again.

My other run with a different seed is still running at 64 TiB as well and is still going.

@imneme commented Jun 23, 2019

It's all a bit of a PITA. When I tried yesterday to reproduce it with 0.94, I saw early differences from my run with some of the other tests. Rerunning with 0.93, I see the same output for these other tests which gives me more confidence. I looked briefly at backporting your pat5-only option to 0.93, but the code is structured rather differently, and as I contemplated doing so I wasn't sure what it would mean anyway if it did or didn't work with 0.93.

Alas, running the code on my iMac at home (an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz) was supposed to be faster than on my main test machine (an Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz), but as the test wears on that is proving not to be the case.

(It's possible that the whole thing was a glitch related to a race condition in the -multithreaded option; but I can't say I'll be happy if that's the case, because you really want to feel you can trust that a test failure that shows up after running a test for a couple of weeks [or more than half a year in some cases] actually means what it says.)

@rkern (Author) commented Jun 23, 2019

Interesting. I'll do the backporting. If I can provoke the same failure, then I think that'll be confirmation enough. If not, then a full run is still probably warranted.

I've pushed an implementation of David Blackman's GJrand as well. It has many of the same desirable properties as SFC64, like each of the 2**128 unique seed states being at least 2**64 steps away from any other. It's moderately slower than JSF64/SFC64/Xoshiro256, about halfway between PCG64 and that lot on my 64-bit system.

@imneme commented Jun 24, 2019

One issue with all of these chaotic PRNGs is that their authors don't allow you to seed the full state of the generator, just part of it. For JSF, it's merely because the author did some testing on the limited range of allowed seedings (beyond the probabilistic guarantee that would have existed); for the counter-based variants, it's because the guarantees that the counter provides disappear if you let the counter have an arbitrary starting value.

Having less random seed data than the state size of the PRNG will typically introduce a kind of bias. This bias is only easily detectable if the space of allowed seeds is very small (e.g., 32 bits); otherwise we're in too-big-to-fail territory.

To explore this, I just wrote seeding_bias.cpp, which is a modest tweak on code I wrote to show bias initializing the Mersenne Twister with simple seeds.
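
That file isn't reproduced here, but the enumeration idea it implements is roughly the following (a minimal sketch, not imneme's seeding_bias.cpp, using Bob Jenkins's published jsf32 — rotate constants 27/17, a = 0xf1ea5eed, 20 warm-up rounds — as the generator under test):

// Sketch of the bias test: seed jsf32 with every possible 32-bit value, take
// the first output from each seed, and count 32-bit outputs that never occur.
// (Needs ~512 MB for the bitmap and takes a while to run.)
#include <cstdint>
#include <cstdio>
#include <vector>

struct jsf32 { uint32_t a, b, c, d; };
static inline uint32_t rot32(uint32_t x, int k) { return (x << k) | (x >> (32 - k)); }

static inline uint32_t jsf32_next(jsf32 &s) {
    uint32_t e = s.a - rot32(s.b, 27);
    s.a = s.b ^ rot32(s.c, 17);
    s.b = s.c + s.d;
    s.c = s.d + e;
    s.d = e + s.a;
    return s.d;
}

// Jenkins's seeding: only 32 of the 128 state bits come from the seed.
static void jsf32_seed(jsf32 &s, uint32_t seed) {
    s.a = 0xf1ea5eed; s.b = s.c = s.d = seed;
    for (int i = 0; i < 20; ++i) (void)jsf32_next(s);
}

int main() {
    std::vector<bool> seen(1ull << 32, false);   // one bit per possible output
    for (uint64_t seed = 0; seed <= 0xffffffffULL; ++seed) {
        jsf32 s;
        jsf32_seed(s, static_cast<uint32_t>(seed));
        seen[jsf32_next(s)] = true;
    }
    uint64_t missing = 0;
    for (uint64_t v = 0; v <= 0xffffffffULL; ++v)
        if (!seen[v]) ++missing;
    // A bijective seed-to-first-output map (seed size == state size) would give
    // zero missing values; an under-seeded chaotic PRNG behaves more like a
    // random function, so roughly a 1/e fraction of outputs never appears.
    std::printf("outputs never occurring: %llu\n", (unsigned long long)missing);
}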

Here's what happens if we run it on jsf32

unix% g++ -Wall -O3 -std=c++14 seeding_bias.cpp -o jsf32_bias -DRNG_TYPE=jsf32 -DINCLUDE_FILE='"jsf.hpp"'

unix% ./jsf32_bias 1
Bias test of jsf32:
- Will enumerate through all 4294967296 seeds, reading 1 outputs from each one.
- Enumerating: ................................................................

Never occurring: 1, 3, 4, 8, 11, 12, 13, 17, 19, 20, 22, 23, 27, 28, 29, 31, 32, 40, 44, 48, ...
Most overrepresented: 264362982 -- repeated 12 times

unix% ./jsf32_bias 4
Bias test of jsf32:
- Will enumerate through all RNG seeds, reading 4 outputs from each one.
- Enumerating: ................................................................

Never occurring: 130, 135, 387, 613, 685, 757, 766, 875, 890, 935, 983, 988, 1167, 1175, 1199, 1251, 1258, 1312, 1350, 1353, ...
Most overrepresented: 790806589 -- repeated 25 times

unix% ./jsf32_bias 16
Bias test of jsf32:
- Will enumerate through all RNG seeds, reading 16 outputs from each one.
- Enumerating: ................................................................

Never occurring: 3835856, 7814077, 7839806, 19677170, 22684530, 37104417, 37886080, 44763576, 44915593, 50908421, 51401130, 62306837, 83121896, 86907698, 101248875, 106001438, 106066804, 106908924, 109382321, 119572514, ...
Most overrepresented: 1674563904 -- repeated 47 times

This also happens for gjrand and sfc:

unix% g++ -Wall -O3 -std=c++14 seeding_bias.cpp -o gjrand_bias -DRNG_TYPE=gjrand32 -DINCLUDE_FILE='"gjrand.hpp"'

unix% g++-8 -Wall -O3 -std=c++14 seeding_bias.cpp -o sfc_bias -DRNG_TYPE=sfc32 -DINCLUDE_FILE='"sfc.hpp"'

unix% ./sfc_bias 1
Bias test of sfc32:
- Will enumerate through all 4294967296 seeds, reading 1 outputs from each one.
- Enumerating: ................................................................

Never occurring: 1, 4, 5, 8, 9, 10, 13, 18, 21, 23, 24, 28, 29, 30, 37, 38, 40, 41, 43, 45, ...
Most overrepresented: 1585543738 -- repeated 13 times

unix% ./gjrand_bias 1
Bias test of gjrand32:
- Will enumerate through all 4294967296 seeds, reading 1 outputs from each one.
- Enumerating: ................................................................

Never occurring: 6, 8, 12, 17, 18, 22, 23, 26, 27, 31, 34, 39, 40, 41, 43, 44, 45, 46, 47, 49, ...
Most overrepresented: 210398684 -- repeated 12 times

(These two PRNGs also allow three-word and two-word seeding respectively, which is too large a space to test exhaustively, but which would technically have bias as well for the same reasons.)

Just to show that it doesn't have to be this way (i.e., if the amount of seed data matches the internal state size):

unix% g++ -Wall -O3 -std=c++14 seeding_bias.cpp -o xorshift32_bias -DRNG_TYPE=xorshift32plain32a -DINCLUDE_FILE='"xorshift.hpp"'

unix% g++ -Wall -O3 -std=c++14 seeding_bias.cpp -o xorshift32star16_bias -DRNG_TYPE=xorshift32star16a -DSEED_TYPE=uint32_t -DINCLUDE_FILE='"xorshift.hpp"'

unix% g++ -Wall -O3 -std=c++14 seeding_bias.cpp -o pcg32_once_insecure_bias -DRNG_TYPE='pcg32_once_insecure' -DSEED_TYPE=uint32_t -DINCLUDE_FILE='"pcg_random.hpp"'

unix% g++ -Wall -O3 -std=c++14 seeding_bias.cpp -o pcg16_bias -DRNG_TYPE='pcg_engines::setseq_xsh_rr_32_16' -DSEED_TYPE=uint32_t -DINCLUDE_FILE='"pcg_random.hpp"'

unix% ./xorshift32_bias      
Bias test of xorshift32plain32a:
- Will enumerate through all 4294967296 seeds, reading 1 outputs from each one.
- Enumerating: ................................................................

Never occurring: (none)
Most overrepresented: (nothing overrepresented)

unix% ./xorshift32star16_bias
Bias test of xorshift32star16a:
- Will enumerate through all 4294967296 seeds, reading 1 outputs from each one.
- Enumerating: ................................................................

Never occurring: (none)
Most overrepresented: (nothing overrepresented)

unix% ./pcg32_once_insecure_bias 
Bias test of pcg32_once_insecure:
- Will enumerate through all 4294967296 seeds, reading 1 outputs from each one.
- Enumerating: ................................................................

Never occurring: (none)
Most overrepresented: (nothing overrepresented)

unix% ./pcg16_bias 
Bias test of pcg_engines::setseq_xsh_rr_32_16:
- Will enumerate through all 4294967296 seeds, reading 1 outputs from each one.
- Enumerating: ................................................................

Never occurring: (none)
Most overrepresented: (nothing overrepresented)

FWIW, for both pcg16 and pcg32_once_insecure, in one sense we're actually only providing half the state, because we haven't specified the stream (but could). This is a (rare) positive that comes from the fact that streams control a parameter of the generator rather than changing state.
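
To illustrate that last point (a hedged sketch against the pcg-cpp API at https://github.com/imneme/pcg-cpp, not code from this PR): the stream is a constructor parameter supplied alongside the seed, not state that the seed has to fill in.

// Illustrative only: in pcg-cpp the stream selector is a generator parameter,
// passed next to the seed, independent of the seeded state.
#include <cstdint>
#include <iostream>
#include "pcg_random.hpp"   // from https://github.com/imneme/pcg-cpp

int main() {
    pcg32 rng_default(42u);        // seed only; default stream
    pcg32 rng_streamed(42u, 54u);  // same seed, explicitly selected stream
    std::cout << rng_default() << " " << rng_streamed() << "\n";
}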

@rkern (Author) commented Jun 24, 2019

@imneme commented Jun 24, 2019

Thanks. I had to also patch show_checkpoint to have return type void rather than double to avoid undefined (and problematic) behavior. Running now.

(Clang also warned about various other problems with the overall source.)

@rkern (Author) commented Jun 24, 2019

Yeah, I applied a patch for that as well.

@mattip (Owner) commented Jun 24, 2019

Would it make sense to add GJrand, JSF64 and SFC64 to numpy/bitgenerators and then change this PR to only add SFC64 to numpy.random?

I understand that we would then remove PCG64 and PCG32 from numpy.random as well.

@bashtage

I don't think PCG64 is going away, is it?

@mattip (Owner) commented Jun 24, 2019

As I understood it, the goal is to release 1.17.0 with two BitGenerators: MT19937 and one other. All the rest would live in numpy/bitgenerators.

@bashtage

I might have missed that thread/comment.

@imneme commented Jun 24, 2019

With the pat5-only test on PractRand 0.93, I see:

rng=sfc64, seed=0xdf7e5fd0802274af
length= 64 terabytes (2^46 bytes), time= 29910 seconds
  no anomalies in 18 test result(s)

That leads me to believe it was some kind of strange one-time glitch in the tester, and sfc64 gets a clean bill of health. It's good for sfc, but bad in other ways, because it means we can't be as sure about drawing inferences from seeing PractRand fail something.

@rkern (Author) commented Jun 24, 2019

As I understood it, the goal is to release 1.17.0 with two BitGenerators: MT19937 and one other. All the rest would live in numpy/bitgenerators.

No consensus has been reached on that point. I proposed it, but was convinced that including Philox as a representative of the counter-based cryptoish PRNG clade was a good idea. One often uses those in a different way than other PRNGs thanks to their guarantees, so including one would assist porting code that uses such a PRNG (setting aside the actual numbers that come out). I still like PCG64 as the default as a good all-rounder; the same argument for a counter-based PRNG applies to jumpable PRNGs. But I think there's also room for one more speed-at-all-costs-except-quality PRNG, even on Win32, which SFC64 would provide.

My current proposal is PCG64 as the default, MT19937 for legacy, Philox and SFC64 for minimal amounts of diversity.

@mattip (Owner) commented Jun 24, 2019

I added GJrand, JSF64, and SFC64 to numpy/bitgenerators. Could you redo this PR against numpy/numpy to add SFC64? Then we can move toward PCG64, MT19937, Philox, and SFC64 only.

@rkern (Author) commented Jun 24, 2019

That leads me to believe it was some kind of strange one-time glitch in the tester and sfc64 gets a clean bill of health. It's good for sfc, but bad in other ways because it means we can't be as sure about drawing inferences from seeing PractRand fail something.

I concur.

Possibly-insulting question: did the original machine on which you ran the tests have ECC memory? I always enjoy cosmic-ray-corrupted-my-process stories. But a bug in PractRand's multithreading seems more likely.

@mattip (Owner) commented Jun 24, 2019

I updated the README to describe the bitgenerators; I hope I was accurate.

@imneme commented Jun 24, 2019

Possibly-insulting question: on the original machine which you ran the tests, did you have ECC memory? I always enjoy cosmic-ray-corrupted-my-process stories. But a bug in PractRand's multithreading seems more likely.

FWIW, my main test machine is a Silicon Mechanics Rackform R422.v5 [based on the SuperMicro X10QBL board], with four 12-core Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz and 32 Samsung M393B2G70QH0-YK0 16GB DDR3 1600 ECC Registered DIMMs, giving me 96 hyperthreaded cores and 512 GB of goodness.

(It's a really nice machine!)

@rkern (Author) commented Jun 26, 2019

Replaced by numpy#13838

@rkern closed this Jun 26, 2019