Series/DataFrame sample method with/without replacement #2419
Comments
Something like |
Or even just |
This doesn't need to get done for 0.10 |
I would like to propose that we should copy the API from dplyr for this method: namely, we should have two methods, CC @hayd |
Steal all the dplyr! To keep the number of new methods low, would you favor a single method
And we can have a |
@TomAugspurger Hmm. I've used |
Good enough for me. On Wed, Jan 21, 2015 at 1:20 PM, Stephan Hoyer notifications@github.com
|
+1 |
I'd be happy to take a look at this in about a week (after a presentation). How would people feel about an implementation built around a numpy sampling of the index, followed by a .loc[] call, similar (though with the suggested
|
That sounds fine. You'll also want to accept a The only wrinkle is how to handle duplicates in the index. If you use |
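To make the duplicate-index wrinkle concrete, here is a minimal sketch (not code from this thread) comparing positional selection with label-based selection on a toy frame whose index repeats a label. The frame, its column name and the fixed seed are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame whose index contains a repeated label ("a" appears twice).
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=["a", "a", "b", "c"])

rs = np.random.RandomState(0)

# Sample two *positions* without replacement.
positions = rs.choice(len(df), size=2, replace=False)

# Positional selection returns exactly one row per sampled position.
by_position = df.take(positions, axis=0)

# Label-based selection returns every row carrying a requested label,
# so a duplicated label can yield more rows than were asked for.
by_label = df.loc[["a", "b"]]

print(len(by_position))  # 2
print(len(by_label))     # 3
```

Sampling integer positions and selecting by position sidesteps the problem, which is one reason a positional lookup may be preferable to .loc[] here.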
Sounds great -- I'll get to it next week! |
@nickeubank glad you're excited about this! It would be great if you could get this finished :). Here are the rough versions (mostly untested) that I wrote a few weeks ago:

import numpy as np

def sample_n(df, n, replace=False, weight=None, seed=None):
    """Sample n rows from a DataFrame at random
    """
    rs = np.random.RandomState(seed)
    locs = rs.choice(df.shape[0], size=n, replace=replace, p=weight)
    return df.take(locs, axis=0)

def sample_frac(df, frac, replace=False, weight=None, seed=None):
    """Sample some fraction of a DataFrame at random
    """
    n = int(round(frac * df.shape[0]))
    return sample_n(df, n, replace=replace, weight=weight, seed=seed)

I think these get a couple of things right:
What this needs:
Also, it would be really nice for these methods to work with grouped operations, so you could write something like |
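The grouped syntax that was proposed in the comment above did not survive in this copy of the thread, so the following is only a sketch of the idea under stated assumptions: it samples within each group by looping over `groupby` and concatenating the pieces. `sample_rows` is a hypothetical helper adapted from the rough `sample_n` draft, and the column names `g` and `x` are made up for the example.

```python
import numpy as np
import pandas as pd


def sample_rows(df, n, rs):
    """Hypothetical helper: take n rows by position, without replacement."""
    locs = rs.choice(df.shape[0], size=n, replace=False)
    return df.take(locs, axis=0)


df = pd.DataFrame({"g": ["a", "a", "a", "b", "b", "b"], "x": range(6)})

# One shared RandomState so each group gets its own independent draw.
rs = np.random.RandomState(0)

# Sample two rows from every group, then stitch the pieces back together.
parts = [sample_rows(group, 2, rs) for _, group in df.groupby("g")]
sampled = pd.concat(parts)

print(sampled["g"].value_counts().to_dict())
```

A built-in grouped method would presumably hide the loop, but the result should have the same shape: a fixed number of rows per group, with the original row labels preserved.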
@shoyer Great! Looks like this is in great shape. I'll start by building some tests and look into a weight implementation and get back to you, then we can pivot to the groupby once that's done. Do you have an existing fork I should work on? |
@nickeubank Nope, feel free to start from scratch. I needed |
Quick poll: I'm inclined to call the function "rand()" and accept both "size" and "size_type = {number, frac}" to accommodate both requests for an exact number of rows and requests for a fraction of rows. My personal interest in this is mostly for being able to quickly query a random set of rows to examine my data frame, so having "df.rand()" return 5 random rows in a manner analogous to "df.head()" feels more appealing than longer function names like sample_n() or sample_frac(). But I'm open to input -- would people prefer sample_n() and sample_frac()? Or does rand() seem OK? |
I am not a fan of For me, adding a few characters to the length of the function is not such a big concern, because I'm almost always using auto-complete in IPython, anyways. I'm afraid I'm also not a fan of returning 5 random rows as the default. That feels like a very arbitrary number to me -- and again, something that would be hard to guess. |
I'm also in favor of |
@nickeubank be sure to also check #7274, a closed PR trying to implement this for some inspiration (comments, tests) I also like |
OK, sounds like a consensus in favor of like @jorisvandenbossche, I'm inclined to one method with a @shoyer Regarding the default return of five rows, it's a little arbitrary, but is analogous to what |
Like I said before, my main issue with plain |
Ah, I see -- you were thinking that if a size value is between 0 and 1, the function infers the user wants a share of rows; if size is an integer greater than 1, the function assumes they want N rows? I was just going to make it a function option. That gets rid of the corner case. Basically:
|
If we would make it one But also ok to make two functions of it |
Also, I would use |
|
and actually |
Ha! Do you think this is the exact conversation that the dplyr developers had? Sounds like there's a pretty good consensus around 2 functions -- I'll code that up! |
Actually, I think @jorisvandenbossche and I are now voting for one function, two arguments :). |
Oh! Misread post on length. :) OK, so something like the following, with an error thrown if both n and frac values are provided:
|
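The snippet originally posted with the comment above is not preserved here. As a stand-in, this is a minimal sketch of the behaviour it describes: one function taking either `n` or `frac` and raising if both are given. The function name, the choice to also raise when neither is given, and the toy frame are assumptions, not the code from the thread or from the eventual pull request.

```python
import numpy as np
import pandas as pd


def sample(obj, n=None, frac=None, replace=False, seed=None):
    """Hypothetical helper: sample rows of a Series or DataFrame.

    Exactly one of `n` (a row count) or `frac` (a share of rows) is expected.
    """
    if (n is None) == (frac is None):
        # Assumption: reject both-given and neither-given alike, to avoid
        # guessing a default sample size.
        raise ValueError("provide exactly one of `n` or `frac`")
    if frac is not None:
        n = int(round(frac * len(obj)))
    rs = np.random.RandomState(seed)
    locs = rs.choice(len(obj), size=n, replace=replace)
    return obj.take(locs)


df = pd.DataFrame({"x": range(10)})
three_rows = sample(df, n=3, seed=0)
half_rows = sample(df, frac=0.5, seed=0)
```

Keeping `n` and `frac` as separate keyword arguments avoids the corner case of inferring from the value alone whether `1` means one row or all rows.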
Yes, that looks very close. One thing to note is that you'll need to make Also, |
On first point: Great. On weights: I was coding this into "core/generic.py" so it would also work with Series, and in a Series the string wouldn't mean anything. With that in mind, I thought I'd just ask for a Series in the Or do you think we need an |
Never mind -- I'll just add an "if dataframe" clause. :) |
Little late to the party here, but I am -1 on passing in a string to weights to mean a column. Why not just accept a single thing--a Series--and it works with both series and frame without having to know what the type of self is. It's also more clear what the meaning is IMO. |
I agree this functionality is not essential, but we already use this sort of syntax as a shortcut (e.g., with |
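To illustrate the two positions on weights, here is a sketch of how a helper might accept either a Series or, only when sampling from a DataFrame, a column name. `resolve_weights` is a hypothetical name, and the validation rules (no missing or negative weights, normalise to sum to one) are assumptions rather than anything settled in this thread.

```python
import numpy as np
import pandas as pd


def resolve_weights(obj, weights):
    """Hypothetical helper: turn `weights` into probabilities for rs.choice."""
    if isinstance(weights, str):
        # A string is only meaningful as a column name on a DataFrame.
        if not isinstance(obj, pd.DataFrame):
            raise ValueError("string weights require a DataFrame")
        weights = obj[weights]
    if isinstance(weights, pd.Series):
        # Align on labels so each weight follows the row it belongs to.
        weights = weights.reindex(obj.index)
    values = np.asarray(weights, dtype=float)
    if len(values) != len(obj):
        raise ValueError("weights must have one entry per row")
    if np.isnan(values).any() or (values < 0).any():
        raise ValueError("weights must be non-negative and not missing")
    return values / values.sum()


df = pd.DataFrame({"x": [1, 2, 3], "w": [1, 1, 2]})
p_from_column = resolve_weights(df, "w")
p_from_series = resolve_weights(df["x"], pd.Series([1, 1, 2]))
```

The resulting array can be passed as `p` to `RandomState.choice`. The string branch is the only place the helper needs to know the type of the object being sampled.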
Submitted as pull request #9666. Input welcome! |
closed by #9666 |
Should use a more intelligent algorithm than using np.random.permutation
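The issue does not say which algorithm it has in mind. One well-known option, shown here only as an illustration, is Robert Floyd's sampling algorithm, which picks `k` distinct positions out of `n` using `k` random draws instead of permuting all `n` positions. The function name and signature are hypothetical.

```python
import numpy as np


def floyd_sample(n, k, seed=None):
    """Hypothetical helper: k distinct integers from range(n), Floyd's method."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    rs = np.random.RandomState(seed)
    chosen = set()
    for j in range(n - k, n):
        # Draw t uniformly from 0..j inclusive (randint's upper bound is exclusive).
        t = int(rs.randint(0, j + 1))
        # If t was already picked, j itself is guaranteed to be new.
        chosen.add(j if t in chosen else t)
    return np.fromiter(chosen, dtype=np.intp, count=k)


# Five distinct positions out of a million, without building a full permutation.
locs = floyd_sample(1_000_000, 5, seed=0)
```

The returned positions come out in arbitrary order rather than a uniformly random order, so a caller that cares about ordering would need to shuffle the `k` results afterwards.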