Incorrect use of pseudorandom numbers #114
Adding my thoughts here from #70 and expanding a bit. I am not familiar with the internals of how PRNGs work, but I think both pandas and sklearn recommend using `random_state` repeatedly over setting seeds globally, so there might be something in those packages that accounts for what you mention?

One issue I see with setting seeds globally is that the number of computations performed affects the result. If I insert a function that uses randomness somewhere in the middle of my analysis, it changes the results of everything below it, which makes the analysis much less reproducible and harder to keep track of. For example, suppose I have two important computations to carry out that both use randomness: all is fine as long as I keep the notebook cells in this exact order and don't add anything new. But what if I realize that I want to perform some other computation before executing the second cell? Maybe something that is only semi-related and that I don't even realize uses randomness. The results from the second computation will then change "mysteriously" on me.

To me this seems like quite a big caveat that many students might run into, and I would personally not want to rely on a data analysis where the number of cells impacts the output of my model. I think it is better in that case to use `random_state`.
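A minimal sketch of that failure mode (the seed value and the specific draws are illustrative, not from the book):

```python
import numpy as np

np.random.seed(2023)  # one global seed at the top of the notebook

# Cell 1: first important computation
sample_a = np.random.normal(size=3)

# ...suppose a new cell gets inserted here later, e.g. a helper from
# another library that silently draws from the same global stream:
# _ = np.random.uniform(size=10)

# Cell 2: second important computation. Its output depends on how many
# draws happened above it, so un-commenting the inserted line silently
# changes sample_b even though the code in this cell is untouched.
sample_b = np.random.normal(size=3)
```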
I am reasonably well-versed in recent research on PRNGs. I would be really surprised if pandas/scikit-learn advocated repeatedly re-seeding with something like `np.random.seed`. On the other hand, if they advocate creating a random state object once and passing it around, that is a different story.
I agree, but for the purposes of DSCI100 these cases will not come up much. I think students are waaaaay more likely to accidentally forget to set a seed at all. We can also include an example in the text (like the one sketched above) of what can happen if you are using the default PRNG and insert some other function from another library into your code.
Great, thanks for explaining more about PRNGs! I see that the cycle length is important, and I wonder how close to those large numbers we would get when using PRNGs in our analysis code.

I looked into the docs a bit and this is what I found: the pandas docs talk very little about randomness, but where they do mention it, they seem to indicate that setting `random_state` with an integer is the way to go. Scikit-learn, on the other hand, has an entire help section dedicated to how to deal with randomness. In summary, their recommendation is to always use the `random_state` parameter rather than setting a global seed.
Btw, this is also one of the reasons that the numpy docs recommend moving away from the legacy `np.random.seed` global state and toward creating an explicit `Generator` with `np.random.default_rng`.
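For reference, a small sketch of the two styles (the seed value is arbitrary):

```python
import numpy as np

# Legacy style: np.random.seed mutates hidden global state that every
# np.random.* call (in any library) shares
np.random.seed(42)
legacy_draws = np.random.normal(size=3)

# Recommended style: an explicit Generator object, seeded once and
# passed around, so no unrelated code can disturb its stream
rng = np.random.default_rng(42)
draws = rng.normal(size=3)
```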
You'd never reach anywhere near the period of a PRNG in practice, even in much more sophisticated code than any data analysis would have. But that's not really the point -- cycle length is more of a necessary condition for performance. A long cycle length does not imply good performance (e.g., I could just count up to intmax starting from 0 -- that's obviously a really bad PRNG, but the cycle length is maximized for an integer state); but a short cycle length does imply bad performance. With a cycle length of 2 (which is essentially what you get by re-seeding constantly) you aren't using pseudorandomness in any reasonable sense.
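To make the counting example concrete, here is a toy "PRNG" with the maximal cycle for its 32-bit state but obviously non-random output (purely illustrative):

```python
class CounterPRNG:
    """Counts 0, 1, 2, ... mod 2**32.

    The cycle length is the maximum possible for a 32-bit state, yet the
    output is obviously not random: a long cycle is necessary for a good
    PRNG, but nowhere near sufficient.
    """

    def __init__(self, seed=0):
        self.state = seed % 2**32

    def next(self):
        self.state = (self.state + 1) % 2**32
        return self.state
```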
That page doesn't seem to say much about setting the random state -- it's just used incidentally in one place. Do they make a clear recommendation somewhere?
I don't think there's any reason to try to avoid making the order of execution matter. In any code you write, order of execution will matter, random or not.
PCG is one of the current top performers. It uses a rather simple linear congruential generator for its internal state transition, combined with a permuting output function. There are a couple of other modern families worth knowing about as well.
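numpy exposes PCG as a bit generator, so you can try it directly (the seed is arbitrary; Philox is one of the counter-based alternatives numpy also ships):

```python
import numpy as np

# default_rng is backed by PCG64; a bit generator can also be chosen
# explicitly if you want a different family
rng_pcg = np.random.Generator(np.random.PCG64(7))
rng_philox = np.random.Generator(np.random.Philox(7))

print(rng_pcg.random(3))
print(rng_philox.random(3))
```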
(I should also say that MT19937 is totally fine for anything we do in DSCI100, and frankly for almost everything people do in practice. So I don't see that as being a real drawback of the legacy `np.random` interface.)
Thanks! Replies below. Just to be clear, I don't think we need to prioritize this for the first semester, although I think it is important to revisit for the next iteration.
Ah that makes sense, thanks for expanding on that explanation.
Nope, they don't use it much, but where they do I see integers.
I find execution order more important here because the dependence can be more subtle: merely using the random instance additional times will change things downstream. That is not as obvious as when execution order matters because we are reassigning or manipulating some object directly (in my opinion). There is also the fact that sklearn has a line in the description of `random_state` recommending passing an int for reproducible output across multiple function calls.
Interesting! Thank you!
Yup, I agree with that.
I don't really find this argument compelling. If you're coding, you should care about the order that you do things. Sure there are some commutative operations, I guess, but they're a lot less common than the alternative.
I mean... just because the sklearn docs say to do something, that doesn't mean it's the right thing to do. My honest guess is that the docs suggest this because they don't want to have to document the behavior of shared global state across their whole API.

I decided to dig into a few sklearn functions, and indeed, setting `random_state` to an integer just seeds a fresh PRNG inside each call. So I really don't see a reason that per-call `random_state` integers should be preferred over a single global seed.
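One way to see that behavior from the outside, using sklearn's `train_test_split` (the data here is made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10)

# The same int re-seeds a fresh PRNG inside each call, so the two
# "random" splits below are identical draw-for-draw
train1, test1 = train_test_split(X, random_state=1)
train2, test2 = train_test_split(X, random_state=1)
assert (train1 == train2).all() and (test1 == test2).all()
```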
I would strongly suggest changing the repeated `random_state=1` everywhere to a single seed set once at the top. I prefer that approach for the reasons above.
Thanks for this detailed reply @trevorcampbell, and sorry for my late reply. I wanted to dig into the scipy functions myself as well and learn more about that, but it seems like I am never getting there.

I agree that just because the docs say something, it doesn't mean it is the right thing to do. I thought they would have a good reason for saying it that maybe wasn't obvious at first look, but based on your investigation it seems like maybe they don't. I want to probe the maintainers' rationale a bit more in that thread I started when I get time.

I forwarded our discussion to the instructors in MDS who are in charge of the more ML-heavy courses, and it sounds like a more careful discussion of randomness could be both beneficial and interesting for our students. Thinking about it more, I can also see an argument for not setting the random state, or at least trying a range of numbers, to make sure nothing spurious is happening and that the results are somewhat robust between runs with varying seeds.

I will close this for now since there seems to be no action required at this point; we can re-open if that changes.
I've noticed that a lot of the code in Ch 6 is constantly setting `random_state=1`. I may be wrong, but I'm fairly sure this is equivalent to fixing the random state of the PRNG at each function call where that argument is passed. That's bad usage of PRNGs: you generally do not want to seed a PRNG many times -- just once -- because it only functions as intended when run as a single long chain.
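A quick sketch of the difference (the seed and draws are illustrative):

```python
import numpy as np

# Re-seeding before every draw restarts the chain: both "random"
# vectors below are the same three numbers
print(np.random.RandomState(1).rand(3))
print(np.random.RandomState(1).rand(3))  # identical to the line above

# Seeding once and drawing repeatedly gives one long chain, as intended
rs = np.random.RandomState(1)
print(rs.rand(3))
print(rs.rand(3))  # a different draw, continuing the chain
```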
So: we should look through the book carefully once we have done the first round of polish to check for this bug.