Incorrect use of pseudorandom numbers · Issue #114 · UBC-DSCI/introduction-to-datascience-python · GitHub

Closed
trevorcampbell opened this issue Jan 8, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@trevorcampbell
Contributor
trevorcampbell commented Jan 8, 2023

I've noticed that a lot of the code in Ch 6 is constantly setting random_state = 1. I may be wrong, but I'm fairly sure this is equivalent to fixing the random state of the PRNG at each function call where that argument is passed.

That's bad usage of PRNGs. You generally do not want to seed a PRNG many times -- just once -- because they only function as intended when run as a single long chain.

So: we should look through the book carefully once we have done the first round of polish to check for this bug.

trevorcampbell added the bug label Jan 8, 2023
@joelostblom
Contributor

Adding my thoughts here from #70 and expanding a bit. I am not familiar with the internals of how PRNGs work, but I think both pandas and sklearn recommend passing random_state repeatedly over setting seeds globally, so there might be something in those packages that accounts for what you mention?

One issue I see with setting seeds globally is that the number of computations performed impacts the result. So if I insert a function that uses randomness somewhere in the middle of my analysis, it will change the results of everything below it, which makes the analysis much less reproducible and harder to keep track of.

For example, consider that I have two important computations to carry out that both use randomness:

[screenshot: two notebook cells, each running a computation that uses randomness after a global seed is set]

All is fine as long as I keep the notebook cells in this exact order and don't add anything new. But what if I realize that I want to perform some other computation before executing the second cell? Maybe something that is only semi-related and that I don't even realize uses randomness. The results of the second computation will now change "mysteriously" for me:

[screenshot: the same notebook with a new cell inserted in between, changing the output of the second computation]
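A minimal sketch of that scenario, assuming the cells just draw from numpy's global PRNG (the actual computations in the screenshots may differ):

```python
import numpy as np

np.random.seed(2023)                    # global seed set once at the top

sample_a = np.random.normal(size=5)     # cell 1: first computation using randomness

# If a new cell that also consumes random numbers is inserted here, e.g.
#   extra = np.random.uniform(size=3)
# then the global PRNG state has advanced further by the time cell 2 runs ...

sample_b = np.random.normal(size=5)     # cell 2: its values now differ from before
```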

To me this seems like quite a big caveat that many students might run into. I would personally not want to rely on a data analysis where the number of cells impacts the output of my model; I think it is better in that case to use random_state to set the state intentionally for each call that uses randomness. What do you think?

@trevorcampbell
Contributor Author
trevorcampbell commented Jan 13, 2023

> I am not familiar with the internals of how PRNGs work, but I think both pandas and sklearn recommend passing random_state repeatedly over setting seeds globally

I am reasonably well-versed in recent research on PRNGs. Surprisingly, numpy (which I guess is what pandas/sklearn use under the hood) seems to be using the Mersenne Twister algorithm. MT used to be the standard, but it's really quite poor compared with modern alternatives. Either way, basically every PRNG operates on the same principle: the seed value is used to pick an initial state vector x_0 (in a binary field), and then each time the RNG is called, it transitions the state x_{t+1} = F(x_t) with a linear transformation and then scrambles the output y_{t+1} = G(x_{t+1}) with some nonlinearity. Different PRNGs use different F and G functions.
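As a toy illustration of that structure (the constants and functions below are made up for the example, not taken from any real generator):

```python
# Toy PRNG: a 64-bit state advanced by a linear congruential step F,
# then scrambled by a nonlinear output function G.
MASK = (1 << 64) - 1

def F(state):             # state transition (toy constants)
    return (6364136223846793005 * state + 1442695040888963407) & MASK

def G(state):             # output scrambler (xorshift-style, toy)
    return (state ^ (state >> 33)) & MASK

def prng(seed, n):
    x = seed              # x_0 chosen from the seed
    out = []
    for _ in range(n):
        x = F(x)          # x_{t+1} = F(x_t)
        out.append(G(x))  # y_{t+1} = G(x_{t+1})
    return out

print(prng(1, 3))
```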

I would be really surprised if pandas/scikit-learn advocated doing random_state = 10 or whatever everywhere. In this case you're just repeatedly setting x_0 and retrieving x_1 as your "random" number. This would be totally unacceptable in any computational stat code. One notion of "quality" of a PRNG is its cycle length -- how long it takes to get a repeated random number. Most are measured in super large numbers -- even the Mersenne Twister, which is not considered to be very good, has a period of like 2^19937. If you do random_state = number everywhere, your cycle length is 2.
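To see the effect concretely with numpy's RandomState (a minimal demonstration of the re-seeding problem described above):

```python
import numpy as np

# Re-seeding before every draw (the effect of passing random_state=1 to each call)
# always returns the stream's first output:
print([np.random.RandomState(1).randint(0, 1000) for _ in range(3)])
# -> the same number three times

# Seeding once and letting the state advance gives an actual pseudorandom sequence:
rng = np.random.RandomState(1)
print([rng.randint(0, 1000) for _ in range(3)])
```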

On the other hand, if they advocate creating a random state object rng = RandomState() and propagating that everywhere with random_state = rng, that's fine. rng maintains the current x_t and allows the state to transition properly.
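A short sketch of that pattern, using train_test_split as a stand-in for any scikit-learn call that accepts random_state:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)        # seed once at the top of the analysis

X, y = np.arange(20).reshape(10, 2), np.arange(10)

# The same rng object is threaded through every call that needs randomness,
# so its internal state keeps advancing instead of being reset each time.
split_one = train_test_split(X, y, random_state=rng)
split_two = train_test_split(X, y, random_state=rng)   # different, but reproducible from the one seed
```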

> One issue I see with setting seeds globally is that the number of computations performed impacts the result.

> But what if I realize that I want to perform some other computation before executing the second cell? Maybe something that is only semi-related and that I don't even realize uses randomness.

I agree, but for the purposes of DSCI100 these cases will not come up much. I think students are waaaaay more likely to accidentally forget a random_state = ... argument than they are to get confused when they rearrange code.

We can also include an example in the text of what can happen if you are using the default PRNG and insert some other function from another library into your code.

@joelostblom
Contributor

Great, thanks for explaining more about PRNGs! I see that the cycle length is important, and I wonder how close to those large numbers we would get when using PRNGs in our analysis code. I looked into the docs a bit and this is what I found:

Pandas docs talk very little about randomness, but where they do mention it, they seem to indicate that setting random_state to an integer between calls is fine.

Scikit-learn, on the other hand, has an entire help section dedicated to how to deal with randomness. In summary, their recommendation is to always use the random_state parameter instead of the global seed, setting it to an instance of np.random.RandomState for estimators and to an integer for splitters. Setting it to an instance has a similar effect to setting the numpy random seed globally, in the sense that the order of execution now matters. However, the benefits they describe seem to be more about advancing the state of the PRNG across different CV folds when calling an estimator rather than between estimators, so I asked a follow-up question here.
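For reference, that recommendation looks roughly like the following (the particular estimator and splitter here are just illustrative, not taken from their docs):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)

# Their suggestion, as summarized above: a RandomState instance for estimators ...
clf = RandomForestClassifier(random_state=rng)

# ... and a plain integer for splitters, so the CV folds are identical across runs.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
```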

@joelostblom
Contributor

> MT used to be the standard, but it's really quite poor compared with modern alternatives.

Btw, this is also one of the reasons that the numpy docs recommend moving away from np.random.seed and using the new Generator class instead, which uses the PCG64 algorithm by default (which, according to what I read in the docs, is better). Is this one you're familiar with, and do you think it's better?
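For comparison, a small sketch of the two APIs side by side (the specific draws are arbitrary):

```python
import numpy as np

# Legacy global-state API (Mersenne Twister under the hood):
np.random.seed(1)
print(np.random.randint(0, 100, size=3))

# New-style API: default_rng returns a Generator backed by PCG64 by default.
rng = np.random.default_rng(1)
print(rng.integers(0, 100, size=3))
```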

@trevorcampbell
Contributor Author

> I see that the cycle length is important and I wonder how close to those large numbers we would get when using PRNGs in our analysis code

You'd never reach anywhere near the period of a PRNG in practice, even in much more sophisticated code than any data analysis would have. But that's not really the point -- cycle length is more of a necessary condition for performance. A long cycle length does not imply good performance (e.g., I could just count up to intmax starting from 0 -- that's obviously a really bad PRNG, but the cycle length is maximized for an integer state); but a short cycle length does imply bad performance. With a cycle length of 2 (which is essentially what you get by re-seeding constantly) you aren't using pseudorandomness in any reasonable sense.

> Pandas docs talk very little about randomness, but where they do mention it, they seem to indicate that setting random_state to an integer between calls is fine.

That page doesn't seem to say much about setting the random state -- it's just used incidentally in one place. Do they make a clear recommendation somewhere?

> Setting it to an instance has a similar effect to setting the numpy random seed globally, in the sense that the order of execution now matters

I don't think there's any reason to try to avoid making the order of execution matter. In any code you write, order of execution will matter, random or not.

> use the new Generator class instead, which uses the PCG64 algorithm by default (which, according to what I read in the docs, is better). Is this one you're familiar with, and do you think it's better?

PCG is one of the current top performers. It uses a rather simple F function (I think they just use a linear congruential generator if I recall) and the nonlinear G does most of the heavy lifting. It performs well empirically on standard benchmarks.

Two other families that are worth knowing:

  • The xorshift family (e.g. xoshiro128++) is the other way around -- the linear F function does all the heavy lifting, with a relatively simple scrambler G. These PRNGs admit a lot more theoretical analysis (since analyzing linear state machines is "easy"), so this family is better understood theoretically. It performs well empirically. I don't think there's a clear winner between the PCG and xorshift families; they are both fine.
  • Partial cryptographic hash-based PRNGs like Philox. The idea there is that F is trivial (it's just a counter -- 1, 2, 3, ...) but G is basically a few rounds of a crypto hash. The big win here is that you can get random numbers in parallel, since you can very easily jump the internal state forward (state T is just T). But note that you can jump other PRNGs too, just with a bit more work. A rough numpy sketch of this parallel-streams idea is below.
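A rough numpy sketch of that parallel-streams idea (the number of streams and draws is arbitrary):

```python
import numpy as np

# Philox is counter-based, so jumping the state ahead to get independent
# streams for parallel workers is cheap.
bit_gen = np.random.Philox(seed=1)
streams = [np.random.Generator(bit_gen.jumped(i)) for i in range(1, 5)]
print([g.integers(0, 100) for g in streams])
```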

@trevorcampbell
Contributor Author

(I should also say that MT19937 is totally fine for anything we do in DSCI100, and frankly for almost everything people do in practice. So I don't see that as being a drawback of np.random.seed here)

@joelostblom
Contributor

Thanks! Replies below, and just to be clear, I don't think we need to prioritize this for the first semester, although I think it is important to revisit for the next iteration.

> With a cycle length of 2 (which is essentially what you get by re-seeding constantly) you aren't using pseudorandomness in any reasonable sense.

Ah that makes sense, thanks for expanding on that explanation.

> Do they make a clear recommendation somewhere?

Nope, they don't use it much, but where they do I see integers.

> I don't think there's any reason to try to avoid making the order of execution matter. In any code you write, order of execution will matter, random or not.

I find execution order more important here because it can be subtle that simply using the random instance additional times will change things downstream. It is not as obvious that execution order matters here as when we are reassigning or manipulating some object directly (in my opinion).

There is also the fact that sklearn has this line in the description of random_state in the docstrings: "Pass an int for reproducible output across multiple function calls." It is also the most common usage in their documentation, as well as in general when I see it taught (including how we teach it in MDS) and used in the wild (none of which means it is necessarily the right choice, but it is worth discussing more I think).

> Two other families that are worth knowing:

Interesting! Thank you!

> (I should also say that MT19937 is totally fine for anything we do in DSCI100, and frankly for almost everything people do in practice. So I don't see that as being a drawback of np.random.seed here)

Yup I agree with that.

@trevorcampbell
Contributor Author

> It is not as obvious that execution order matters here as when we are reassigning or manipulating some object directly (in my opinion).

I don't really find this argument compelling. If you're coding, you should care about the order in which you do things. Sure, there are some commutative operations, I guess, but they're a lot less common than the alternative.

> sklearn has this line in the description of random_state in the docstrings: "Pass an int for reproducible output across multiple function calls." It is also the most common usage in their documentation, as well as in general when I see it taught (including how we teach it in MDS) and used in the wild (none of which means it is necessarily the right choice, but it is worth discussing more I think)

I mean...just because sklearn docs say to do something, that doesn't mean it's the right thing to do. My honest guess is that the docs suggest this because they don't want to have to write rng = RandomState(seed=[integer]) everywhere.

I decided to dig into a few sklearn functions, and indeed setting random_state = [integer] is equivalent to random_state = np.random.RandomState(seed = [integer]). I figured they might be doing something clever under the hood to obtain reasonable-quality pseudorandomness from fixed integer inputs, but no, there isn't anything fancier going on. It's just seeding a PRNG and using it.
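One quick way to see this from the outside is scikit-learn's check_random_state helper, which is what estimators use to resolve the parameter:

```python
import numpy as np
from sklearn.utils import check_random_state

# An integer is simply wrapped in a freshly seeded RandomState -- nothing fancier.
rng = check_random_state(10)
print(isinstance(rng, np.random.RandomState))   # True
```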

So I really don't see a reason that random_state = [integer] is suggested, aside from laziness. It's strictly better, and only one extra line of code, to seed a random state once with rng = RandomState(seed = [integer]) and use it everywhere it's needed via random_state = rng. It's also better to just set np.random.seed([integer]) at the start of the code. The arguments against np.random.seed are just 1) Mersenne Twister isn't state of the art, and 2) other code might mess with the global PRNG state. But if you're using random_state = [integer] everywhere in your code, you aren't serious enough about using reasonable-quality randomness to care about the minor drawbacks of MT, and you probably aren't doing anything sophisticated where other code changing the RNG state matters (multithreaded/multiprocess/async).

> (including how we teach it in MDS)

I would strongly suggest changing it everywhere to rng = RandomState(seed = [integer]) once at the beginning and random_state = rng everywhere else. If you're teaching random_state = [integer], you're teaching bad habits and sweeping potential issues under the rug. Another drawback is that if you want to check how sensitive your analysis is, which is important for any serious data analysis, it's annoying to go through and change all your manually-chosen random states. It's much easier to just change one initial seed.
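A hedged sketch of that sensitivity check, with toy data and a kNN classifier as stand-ins:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data_rng = np.random.RandomState(0)
X = data_rng.rand(100, 2)
y = data_rng.randint(0, 2, size=100)

# With a single top-level seed, checking sensitivity just means varying it here:
for seed in [1, 2, 3]:
    rng = np.random.RandomState(seed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=rng)
    acc = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)
    print(f"seed={seed}, accuracy={acc:.2f}")
```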

I prefer np.random.seed for DSCI100 because it's much simpler and less error-prone for the things they do in that class, and they aren't anywhere near the complexity of code where the drawbacks matter.

@joelostblom
Contributor

Thanks for this detailed reply @trevorcampbell, and sorry for my late reply. I wanted to dig into the sklearn functions myself as well and learn more about that, but it seems like I am never getting there. I agree that just because the docs say something, it doesn't mean it is the right thing to do. I thought they would have a good reason for saying it that maybe wasn't obvious at first look, but based on your investigation it seems like maybe they don't. I want to probe the maintainers' rationale a bit more in that thread I started when I get time.

I forwarded our discussion to the instructors in MDS who are in charge of the more ML heavy courses, and it sounds like a more careful discussion of randomness could be both beneficial and interesting for our students.

Thinking about it more, I can also see an argument for not setting the random state, or at least trying a range of numbers, to make sure there are no spurious things happening and that the results are somewhat robust between runs with varying seeds.

I will close this for now since it seems like there is no action required at this point, and we can re-open if that changes.
