Decide on new PRNG BitGenerator default #13635
What does the Intel Windows compiler do for 128-bit integers? How much slower is PCG64 compiled with MSVC compared to MT19937 on Windows? I suspect that the jump-ahead feature will be widely used, so it might be good to have it by default.
Not entirely sure; I don't know if there are ABI implications that ICC would care to be constrained by. If we just want to get any idea of the generated assembly that we could use, then this is a handy resource: https://godbolt.org/z/kBntXH
Do you mean settable streams, rather? That's a good point, but I wonder if it might not cut the other way. If our choice of default actually matters much, then maybe if we pick one of these more fully-featured PRNGs, people will use those features more extensively in library code without documenting that they require those "advanced" features, because, after all, they are available "standard". But then if another user tries to use that library with a less-featured `BitGenerator`, things will break. On the gripping hand, that eventuality would make the fully-featured choice a de facto requirement anyway.
In benchmarks posted by @bashtage in #13163, PCG64 is nearly half the speed of MT19937 on Windows, which is pretty disappointing performance out of MSVC and friends. Compare that to Linux, where PCG64 is 23% faster.
Other compilers like Clang, GCC, and the Intel compiler implement 128-bit integers on 64-bit systems the same way that they implemented 64-bit integers on 32-bit systems: all the same techniques, with no new ideas needed. Microsoft didn't bother to do that for MSVC, so there are no 128-bit integers directly supported by the compiler. As a result, for MSVC the existing implementation of PCG64 in #13163 hand-implements 128-bit math by calling out to Microsoft intrinsics like _umul128 on x86_64 (and it could presumably also use equivalent and more portable Intel intrinsics like _mulx_u64 instead), thereby hand-coding what GCC, Clang, and the Intel compiler would do by themselves. The biggest issue is likely that Microsoft's compiler doesn't optimize these intrinsics very well (hopefully they're at least inlined?). It's possible that hand-coded assembler might go faster, but the proper fix would be for the compiler not to be so diabolically poor.
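To make the emulation concrete, here is a rough sketch in Python (with explicit masking standing in for fixed-width registers; the function name is made up for illustration) of the 32-bit-halves decomposition that compilers, or the hand-written intrinsic fallback, use to build a 64×64 → 128-bit multiply:

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mul_64x64_to_128(a, b):
    """Multiply two 64-bit values into a (hi, lo) 128-bit result using
    only 32x32 -> 64-bit partial products, the same decomposition a
    compiler uses for 64-bit math on 32-bit targets."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32

    p0 = a_lo * b_lo          # contributes to bits 0..63
    p1 = a_lo * b_hi          # contributes to bits 32..95
    p2 = a_hi * b_lo          # contributes to bits 32..95
    p3 = a_hi * b_hi          # contributes to bits 64..127

    mid = (p1 + (p0 >> 32)) & MASK64        # cannot overflow 64 bits
    mid2 = (p2 + (mid & MASK32)) & MASK64   # cannot overflow 64 bits

    lo = ((mid2 << 32) | (p0 & MASK32)) & MASK64
    hi = (p3 + (mid >> 32) + (mid2 >> 32)) & MASK64
    return hi, lo
```

The point of the sketch is just to show that the technique is routine; the performance problem is that MSVC does not optimize the equivalent intrinsic-based code well, not that the math is exotic.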
I'm glad you like jump ahead, but I'm curious why you think it'd be widely used. (Personally, I really like […].)
Probably unfamiliarity with current terminology. What I mean is easily obtained independent streams that can be used to run simulations in parallel. I don't know how many simulation problems can be parallelized, but I suspect it is a lot and given the number of cores people get on a chip these days, that could easily make up for a speed disadvantage.
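The `SeedSequence.spawn` API that numpy later shipped serves exactly this need. As a self-contained illustration of the idea (not numpy's actual algorithm; the function name here is invented), independent per-worker seeds can be derived by hashing a root seed together with a stream index:

```python
import hashlib

def stream_seed(root_seed: int, stream_id: int, nbits: int = 128) -> int:
    """Derive a seed for parallel stream `stream_id` by hashing the root
    seed with the stream index. Distinct stream ids yield unrelated
    seeds with overwhelming probability, so each worker's generator is
    statistically independent of the others."""
    material = root_seed.to_bytes(16, "little") + stream_id.to_bytes(8, "little")
    digest = hashlib.sha256(material).digest()
    return int.from_bytes(digest, "little") >> (256 - nbits)

# Each parallel worker gets its own reproducible seed:
seeds = [stream_seed(12345, i) for i in range(4)]
```

This is the "easily obtained independent streams" pattern: reproducible from a single root seed, with no coordination needed between workers.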
So that will hurt our wheels. OTOH, many folks on Windows get their packages from Anaconda or Enthought, both of which use Intel compilers, and folks who really care about performance are probably on Linux, Mac, or maybe AIX. EDIT: And perhaps if Microsoft is concerned, they could offer a bounty for fixing the problem.
FWIW, here's the assembly that […]
Very cool, @rkern. Any chance you can do the same to see what MSVC is doing with the hand-written 128-bit code?
It's, uh, not pretty. Oops, forgot to add […].
It's not quite that bad. You didn't have optimization on, so it didn't inline anything. I've added […].

Still kind of a train wreck, though. But I think it's fair to say that with a bit of attention, the PCG64 code could be tweaked to compile on MSVC into something that isn't utterly embarrassing for everyone.
In order to allow everything else to be merged, why not pick "no default" for now? That leaves us free to make the decision on the default later (even after one or more releases) without breaking compatibility.
Most of our users are not random number experts; we should be providing defaults for them. Beyond the prosaic "now they need to type more code", what happens when we change something? In the case where the BitGenerator is hard-coded (because we did not provide a default), every non-sophisticated user will have to refactor their code, and hopefully understand the nuances of their choice (note we cannot even agree amongst ourselves what is best). However, if we provide a default, we might noisily break their tests because the new default or new version is not bit-stream compatible. Between the assumption that the bit stream will always be constant and the assumption that the NumPy developers know what they are doing and the default values should be best-of-brand, I would err on the side of the second assumption, even if it breaks the first. Edit: clarify which developers should know what they are doing
Well, we'll certainly be documenting recommendations, at the very least, regardless of whether or not we have a default or what the default is.
What "something" are you thinking about? I can't follow your argument.
@mattip is referring to changing the default bit generator. This would make users who have adopted it mad, and could require some code change. For example, if you used

[…]

and the underlying bit generator was changed, then this would be wrong. If you did the more sane thing and used

[…]

then you would not see it. IMO, when considering the choice of default, we should think of it as fixed until a fatal flaw is found in the underlying bit generator, or Intel adds a QUANTUM_NI to its chips that produces a many-orders-of-magnitude improvement in random performance.
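To make the hazard concrete, here is a deliberately tiny toy model (every class and constant here is invented for illustration, not numpy code): seeding through a project-controlled default silently changes meaning when the default flips, while naming the bit generator pins the stream forever:

```python
class BitGenA:
    """Toy 64-bit LCG standing in for one candidate bit generator."""
    def __init__(self, seed):
        self.state = seed
    def next64(self):
        self.state = (self.state * 6364136223846793005 + 1442695040888963407) % 2**64
        return self.state

class BitGenB:
    """A different toy generator: same seed, entirely different stream."""
    def __init__(self, seed):
        self.state = seed ^ 0xDEADBEEF
    def next64(self):
        self.state = (self.state * 2862933555777941757 + 3037000493) % 2**64
        return self.state

DEFAULT_BITGEN = BitGenA  # imagine a later release flips this to BitGenB

class Generator:
    def __init__(self, bitgen):
        self.bitgen = bitgen
    def random_raw(self):
        return self.bitgen.next64()

# Pinned: reproducible regardless of what the project's default becomes.
pinned = Generator(BitGenA(42)).random_raw()
# Default-based: silently changes if DEFAULT_BITGEN is ever reassigned.
defaulted = Generator(DEFAULT_BITGEN(42)).random_raw()
```

The seeded-default case is the one that "noisily breaks tests" in @mattip's framing: nothing in the user's code changed, but the stream did.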
I realize I'm a bit of an outsider here, but I don't think it's reasonable to expect that which PRNG is the default choice is fixed forever and never changes. (In C++, for example, `std::default_random_engine` is explicitly allowed to differ between implementations.)

Rather, there needs to be a mechanism to reproduce prior results. Thus once a particular implementation exists, it is very uncool to change it (e.g., MT19937 is MT19937; you can't tweak it to give different output). [And it's also uncool to remove an implementation that already exists.] When the default changes, people who want to keep reproducing old results will need to ask for the previous default by name. (You could help with that by providing a mechanism to select the default corresponding to a prior release.)

That said, even if you're allowed to swap out the default generator for something else, it really needs to be strictly better — any features present in the default generator represent a commitment to support that feature in the future. If your default generator has efficient jump-ahead, for example, any replacement will need to provide it too.

In summary, there are ways to make sure users can have reproducible results without trying to lock yourselves into a contract where the default is unchanged forever. It'll also reduce the stakes for the choice you make.

(FWIW, this is what I did in PCG. The default PCG 32-bit PRNG is currently the XSH-RR variant [accessed as […]].)
It is not really forever (this entire project is 90% driven by a real, genuine, and honored forever-promise made about 14 years ago), but as you say, switching needs (a) a compelling reason to change and (b) would take at least a few years given the deprecation cycle. It is much better to try hard today to get it as close to right as possible.

One thing that isn't banned, of course, is improving PRNG code after release as long as it produces the same values. For example, if we went with a PRNG that used uint128, we could let MS add uint128 support (fat chance) or add assembly for Win64 in a future version.
Right, that seems to be arguing, with @eric-wieser, for the "No default" option, which I can't square with the initial statement "Most of our users are not random number experts, we should be providing defaults for them."
Between no default and a friendly, fully-assumed default, I would always choose the latter.

Now: […]

My preference: […]
Now, I don't think this is going to get in, but I think one way to prolong the use of RandomState is to make this only available to users who feel they are expert enough to choose a bit generator.
Yes, we have that: users can grab the seeded `BitGenerator` that they want. The central question about the default is "What does the code `Generator()` do?"
@rkern: If you only care about the no-arguments case where the seed is autogenerated from available entropy, then it really doesn't matter much what the underlying generator is — it could change on an hourly basis, since the results would never be reproducible (different runs would get different seeds). In contrast, @bashtage seems to care about a default generator that's provided with a seed.
You can reseed the bit generator in place:

```python
g = Generator()
g.bit_generator.seed(seed)
```

That's somewhat tedious, which is why I suggested at top that maybe most people would usually opt for explicitly instantiating a seeded `BitGenerator` instead. So I guess we also have a broader question in front of us: "What are all the ways that we want users to instantiate one of these? And if those ways have default settings, what should those defaults be?" We have some open design space here, including @bashtage's suggestion of accepting a seed directly.

@bashtage: How much do you think documentation would help? That is, if we said at the top of the docs what the recommended pattern is, would that address the concern? I might be more convinced to have a […]
What about actually making a […]?

I agree with @bashtage that it would be much more friendly to directly support integer seeds as arguments to […].

In documentation for […]
(Just saw this edit to an earlier comment)
For MSVC, it's not […]
Actually, even better would be to include an explicit […]. Presumably […]
Hmmm. That's appealing. It feels like it's overcomplicating things and adding another loop to this Gordian knot (and that there should be a more Alexander-esque stroke available to us), but that's really the only bad thing I have to say about it. It does make the remaining decisions easier: we can focus on statistical quality and performance.
I'm less a fan of this. Unlike the […]
Consider the following from the perspective of a typical NumPy user:
I think it's safe to assume that most of our users know very little about the relative merits of RNG algorithms. But even without reading any docs, they can safely guess that a name containing "default" is a sensible choice. In contrast, names like `MT19937` or `PCG64` mean nothing to users without specialized knowledge.
In your use case, no one is selecting the `BitGenerator` by name. And in any case, that level of cross-release reproducibility is something that we've disclaimed in NEP 19. The arguments against versioning the distributions apply just as well here.
A few thoughts on the default:
Sorry to say, but 32-bit still matters on Windows; see pypa/manylinux#118 (comment)
Yes, when he explains the basics it's considered okay!
@lemire: There's a humor piece I like that I think has a lot of truth to it called How To Criticize Computer Scientists. The underlying idea behind the piece is that theorists favor sophistication and experimentalists favor simplicity. So if your audience is one of experimentalists, they'll be delighted by simplicity, but if your audience is one of theorists, not so much.
The default […]
Very much inspired by this thread, I have some news to report…

**Background**

By many measures […] BUT, the pathologies that came up in this thread didn't sit well with me. Sure, I could say "well, don't hold it that way", but the whole point of a general-purpose PRNG is that it ought to be robust. I wanted to do better...

So, about 25 days ago I began thinking about designing a new member of the PCG family…

**Goal**

My goal was to design a new PCG family member that could be a drop-in replacement for the current […].

As always there is a trade-off as we try to get the best quality we can as quickly as we can. If we didn't care at all about speed, we could have more steps in the output function to produce more heavily scrambled output, but the point of PCG was that the underlying LCG was "almost good enough" and so we didn't need to go to quite as much effort as we would with something like a counter incrementing by 1.

**Spoiler**

I'm pleased to report success! About 25 days ago when I was first thinking about this I was actually on vacation. When I got back about ten days ago, I tried the ideas I had and was pleased to find that they worked well. The subsequent time has mostly been spent on various kinds of testing. Yesterday I was satisfied enough that I pushed the code into the C++ version of PCG. Tests at small sizes indicate that it is much better than XSL RR, and competitive with RXS M, but it actually shines at larger sizes. It meets all the other goals as well.

**Details**

FWIW, the new output function is (for the 64-bit output case):

```cpp
uint64_t output(__uint128_t internal)
{
    uint64_t hi = internal >> 64;
    uint64_t lo = internal;

    lo |= 1;
    hi ^= hi >> 32;
    hi *= 0xda942042e4dd58b5ULL;
    hi ^= hi >> 48;
    hi *= lo;

    return hi;
}
```

This output function is inspired by xorshift-multiply, which is widely used. The choice of multipliers is (a) to keep the number of magic constants down, and (b) to prevent the permutation being undone (if you don't have access to low-order bits), and also provide the whole "randomized-by-itself" quality that PCG output functions typically have.

**Other changes**

It's also the case that […].

**Testing**

I've tested it thoroughly (PractRand and TestU01) and I'm happy with it. Tests included scenarios outlined in this thread (e.g., taking a gang of generators either on sequential streams or advanced by 2^64 and interleaving their output — I tested a gang of four and a gang of 8192 out to 8 TB with no issues, as well as a stream and its opposite-land counterpart).

**Speed**

I could go on at length about speed tests and benchmarks. There are all sorts of factors that influence whether one PRNG runs faster than another in a given benchmark, but overall, this variant seems to often be a little faster, sometimes a lot faster, and occasionally a little slower. Factors like the compiler and the application have a much greater impact on benchmark variability.

**Availability**

Users of the C++ header can access this new family member now as […].

**Formal Announcements**

Overall, I'm very happy with this new family member and at some point later in the summer, there will be blog posts with more detail, likely referencing this thread.

**Your Choices...**

Of course, you have to work out what to do with this. Regardless of whether you'd use it or not, I'd actually be pretty curious to see whether it does better or worse in your speed benchmarks.
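For readers following along in Python, a direct transliteration of the C output function quoted above (masking stands in for C's fixed-width wraparound; the function name is mine, not part of PCG) makes the permutation easy to poke at:

```python
MASK64 = (1 << 64) - 1

def dxsm_output(internal: int) -> int:
    """Python transliteration of the DXSM-style output step quoted above.
    `internal` is the 128-bit state; returns the 64-bit output."""
    hi = (internal >> 64) & MASK64
    lo = (internal & MASK64) | 1              # force odd: multiply stays invertible
    hi ^= hi >> 32                            # xorshift
    hi = (hi * 0xda942042e4dd58b5) & MASK64   # multiply by a fixed constant
    hi ^= hi >> 48                            # xorshift
    hi = (hi * lo) & MASK64                   # multiply by the (odd) low half
    return hi
```

Note the "randomized-by-itself" structure: the final multiply uses the state's own low half as the multiplier, so the scrambling differs from step to step.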
@imneme Does it get around the need to have a fast full 64-bit multiplier? (Which is super fast on x64 but a tad slower on some weaker architectures.)
@lemire: 128-bit × 64-bit multiply, which internally will be done with two multiply instructions on x86 (a 64-bit × 64-bit → 128-bit multiply for the low-order word, and a 64-bit × 64-bit multiply for the high-order word; both multiplies can take place in parallel, and then the two results need to be added). It's still potentially better than 128-bit × 128-bit, although how much better depends on how well instruction scheduling goes at that moment. You're right that on ARM, 64-bit × 64-bit → 128-bit result is actually two instructions. (It's totally possible, of course, to gang together two 64-bit LCGs and mix them. The space of all the PRNGs that could exist and would work well is pretty huge.)
A quick-and-dirty implementation in our framework suggests a mild performance improvement, at least on 64-bit Linux:
I think it's outputting from the pre-iterated state that gives it the bump, at least in this context. I left that out at first and got essentially the same performance as the current `PCG64`.

I guess the real question for you, @imneme, is: how annoyed will you be with the name […]?
It's your choice to make! I have no problem with your shipping the 1.0 version of PCG64. Plenty of other folks have used that variant. I do think the DXSM variant has the advantage of avoiding the edge-case-usage issues that came up in this thread (after all, that's pretty much why it exists), but on the other hand, it has the disadvantage that it's late to the party. It might even seem quite reckless to ship a PRNG that is less than a month old out to users (even though it is based on the same ideas as the more time-tested PCG variants).

(That said, if it were my choice, notwithstanding possible accusations of recklessness, I'd probably ship the new one; I think the delay to get it up and running in NumPy is pretty minimal. And the risk is very low — it's already thoroughly tested with BigCrush, and tested with PractRand out to 16 TB, including […].)

[Glad the performance did get a tad better. Five years ago, using pre-iterated state was a pessimization for 128-bit sizes with 128-bit multipliers. But that was then, on the machines I was testing on, with the benchmarks I was using.]
I meant more about using the name `PCG64`.
@rkern If it is just a naming issue, then PCG64DXSM and PCG64 set them apart nicely, no?
For numpy, certainly. I am just wondering if @imneme would prefer that we not name our 1.0 implementation `PCG64`.
In Section 6.3 of the PCG paper, it says:
And the C and C++ libraries are structured that way. The libraries provide low-level names for each specific family member, plus convenient high-level aliases that point at the currently recommended member.
In this way, the aliases can be updated to point to newer family members, but users who want to exactly reproduce older results will still be able to, by using the low-level interface to select the family member that was previously reachable by a convenient high-level alias. If you're going to ship a PRNG called […]
> The default generator is implemented as `np.random.default_gen()`.

One minor nit: what about calling this `np.random.default_rng()`?
I had the same thought, but then, […]
👍 I like `default_rng`.

👍 for `default_rng`.
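For reference, this is the name that shipped: `numpy.random.default_rng` (available since NumPy 1.17) is how users get a `Generator` with the nominated default bit generator. A minimal usage sketch:

```python
import numpy as np

# default_rng() with no argument seeds from OS entropy; passing a seed
# gives a reproducible stream without naming a BitGenerator explicitly.
rng = np.random.default_rng(12345)
x = rng.random(5)                               # five floats in [0, 1)
same = np.random.default_rng(12345).random(5)   # identical stream to x
```

This is exactly the "friendly default" entry point debated above: non-experts never have to name an algorithm, while experts can still construct `Generator` with an explicit `BitGenerator`.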
#13163 will be bringing in the long-awaited replacement of numpy's PRNG infrastructure. In the interest of keeping that PR manageable, we will merge it to master before all of the decisions are finalized, like which `BitGenerator` will be nominated as the default.

We must make a decision before the first release with the new infrastructure. Once released, we will be stuck with our choice for a while, so we should be sure that we are comfortable with our decision.

On the other hand, the choice of the default does not have that many consequences. We are not talking about the default `BitGenerator` underlying the `numpy.random.*` convenience functions. Per NEP 19, these remain aliases to the legacy `RandomState`, whose `BitGenerator` remains `MT19937`. The only place where the default comes in is when `Generator()` is instantiated without arguments; i.e. when a user requests a `Generator` with an arbitrary state, presumably to then call the `.seed()` method on it. This might probably be pretty rare, as it would be about as easy to just explicitly instantiate it with the seeded `BitGenerator` that they actually want. A legitimate choice here might actually be to nominate no default and always require the user to specify a `BitGenerator`.

Nonetheless, we will have recommendations as to which `BitGenerator` people should use most of the time, and while we can change recommendations fairly freely, whichever one has pride of place will probably get written about most in books, blogs, tutorials, and such.

IMO, there are a few main options (with my commentary, please feel free to disagree; I have not attempted to port over all the relevant comments from #13163):

**No default**

Always require `Generator(ChosenBitGenerator(maybe_seed))`. This is a little unfriendly, but as it's a pretty convenient way to get the generator properly initialized for reproducibility, people may end up doing this anyways, even if we do have a default.

**`MT19937`**

This would be a good conservative choice. It is certainly no worse than the status quo. As the Mersenne Twister is still widely regarded as "the standard" choice, it might help academic users who need their papers to be reviewed by people who might question "non-standard" choices, regardless of the specific qualities of the PRNG. "No one ever got fired for hiring IBM." The main downsides of `MT19937` are mostly that it is slower than some of the available alternatives, due to its very large state, and that it fails some statistical quality tests. In choosing another PRNG, we have an opportunity (but not an obligation, IMO) to be opinionated here and try to move "the standard", if we wish.

**`PCG64`**

This is likely the one that I'll be using most often, personally. The main downside is that it uses 128-bit integer arithmetic, which is emulated in C if the compiler does not provide such an integer type. The two main platforms for which this is the case are 32-bit CPUs and 64-bit MSVC, which just does not support 128-bit integers even when the CPU does. Personally, I do not suggest letting the performance of increasingly-rare 32-bit CPUs dictate our choices. But the MSVC performance is important, since our Windows builds do need that compiler and not other Windows compilers. It can probably be addressed with some assembly/compiler intrinsics, but someone would have to write them. The fact that it's only MSVC that we have to do this for makes this somewhat more palatable than other times when we are confronted with assembly.

**`Xoshiro256`**

Another modern choice for a small, fast PRNG. It does have a few known statistical quirks, but they are unlikely to be a major factor for most uses. Those quirks make me shy away from it, but that's my personal choice for the code I'll be writing.
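As a concrete illustration of the "No default" option above, explicit instantiation with the released `numpy.random` classes looks like this (a usage sketch; the seeds are arbitrary):

```python
from numpy.random import Generator, MT19937, PCG64

# Explicitly naming the BitGenerator makes the stream reproducible and
# independent of whatever default numpy nominates:
gen_mt = Generator(MT19937(12345))    # conservative "standard" choice
gen_pcg = Generator(PCG64(12345))     # smaller state, uses 128-bit arithmetic

sample = gen_pcg.random(3)            # three floats in [0, 1)
```

This is the `Generator(ChosenBitGenerator(maybe_seed))` pattern described under "No default": slightly more typing, but it pins the algorithm by name.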