8000 Cythonized GroupBy Fill by WillAyd · Pull Request #19673 · pandas-dev/pandas · GitHub
[go: up one dir, main page]

Skip to content

Cythonized GroupBy Fill #19673

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Feb 25, 2018
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
c58123c
Added test case for groupby fill methods
WillAyd Feb 12, 2018
2bc8023
Added code for group_fillna
WillAyd Feb 13, 2018
3cb25c0
Added ASV benchmarks
WillAyd Feb 13, 2018
7fecc11
Connected GroupBy method to Cython fillna
WillAyd Feb 13, 2018
3c2fb36
Fixed issue when filling Series after GroupBy
WillAyd Feb 13, 2018
a52b8c4
Added tests to mix group entries; fixed sort bug
WillAyd Feb 13, 2018
16c1823
Simplied groupby Cython calls for ffill/bfill
WillAyd Feb 13, 2018
bd3d5e0
Removed abandoned Cython implementation
WillAyd Feb 13, 2018
cae65af
Added upcast to int64 to prevent 32 bit failures
WillAyd Feb 13, 2018
0266514
Fixed issue with reconstructing grouped Series
WillAyd Feb 14, 2018
50dc690
Changed .view to .astype to avoid 32 bit segfaults
WillAyd Feb 15, 2018
9fa8e25
Added whatsnew
WillAyd Feb 15, 2018
5da06d8
Aligned group_fillna and group_shift signatures
WillAyd Feb 19 8000 , 2018
2fe91a4
Fixed failing test; list comp for _fill method
WillAyd Feb 19, 2018
825ba17
Updated whatsnew
WillAyd Feb 19, 2018
127c71c
PEP8 fixes
WillAyd Feb 19, 2018
3a23cd6
Py27 support with super call
WillAyd Feb 20, 2018
a363146
Fixed LINT issue
WillAyd Feb 20, 2018
fd513c8
Used kwargs to call Cython groupby funcs
WillAyd Feb 20, 2018
776d1b7
Docstring for _fill method
WillAyd Feb 20, 2018
33f0d06
Cleaned up kwargs passing to Cython layer
WillAyd Feb 21, 2018
662008a
Idiomatic update - replace join with concat
WillAyd Feb 22, 2018
27e24fa
Moved non-templated funcs to groupby.pyx
WillAyd Feb 23, 2018
6f72476
Code update - swap group_index.take with grouper
WillAyd Feb 23, 2018
eff6603
Rebase and update import
WillAyd Feb 24, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Aligned group_fillna and group_shift signatures
  • Loading branch information
WillAyd committed Feb 24, 2018
commit 5da06d868dc5cf044a84a32dcf74f0924d2b919d
20 changes: 11 additions & 9 deletions pandas/_libs/groupby_helper.pxi.in
Original file line number Diff line number Diff line change
Expand Up @@ -906,7 +906,7 @@ def group_cumsum(numeric[:, :] out,

@cython.boundscheck(False)
@cython.wraparound(False)
def group_shift_indexer(int64_t[:] out, int64_t[:] labels,
def group_shift_indexer(ndarray[int64_t] out, ndarray[int64_t] labels,
int ngroups, int periods):
cdef:
Py_ssize_t N, i, j, ii
Expand Down Expand Up @@ -957,21 +957,19 @@ def group_shift_indexer(int64_t[:] out, int64_t[:] labels,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just realized this is in groupby_helper.pxi.in, since this doesn't use templating (nor does group_shift_indexer). I think ok to omve to groupby.pxi. can you move?

@cython.wraparound(False)
@cython.boundscheck(False)
def group_fillna_indexer(ndarray[int64_t] out,
ndarray[uint8_t] mask,
ndarray[int64_t] labels,
object method,
def group_fillna_indexer(ndarray[int64_t] out, ndarray[int64_t] labels,
ndarray[uint8_t] mask, object direction,
int64_t limit):
"""Fills values forwards or backwards within a group
"""Indexes how to fill values forwards or backwards within a group

Parameters
----------
out : array of int64_t values which this method will write its results to
Missing values will be written to with a value of -1
mask : array of int64_t values where a 1 indicates a missing value
labels : array containing unique label for each group, with its ordering
matching up to the corresponding record in `values`
method : {'ffill', 'bfill'}
mask : array of int64_t values where a 1 indicates a missing value
direction : {'ffill', 'bfill'}
Direction for fill to be applied (forwards or backwards, respectively)
limit : Consecutive values to fill before stopping, or -1 for no limit

Expand All @@ -987,8 +985,11 @@ def group_fillna_indexer(ndarray[int64_t] out,

N = len(out)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe want an assert that N == len(labels) == len(mask)


# Make sure all arrays are the same size
assert N == len(labels) == len(mask)

sorted_labels = np.argsort(labels).astype(np.int64, copy=False)
if method == 'bfill':
if direction == 'bfill':
sorted_labels = sorted_labels[::-1]

with nogil:
Expand All @@ -1004,6 +1005,7 @@ def group_fillna_indexer(ndarray[int64_t] out,
curr_fill_idx = idx

out[idx] = curr_fill_idx

# If we move to the next group, reset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line here

# the fill_idx and counter
if i == N - 1 or labels[idx] != labels[sorted_labels[i+1]]:
Expand Down
92 changes: 64 additions & 28 deletions pandas/core/groupby.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
import types
from functools import wraps
from functools import wraps, partial
import numpy as np
import datetime
import collections
Expand Down Expand Up @@ -1457,25 +1457,14 @@ def expanding(self, *args, **kwargs):
from pandas.core.window import ExpandingGroupby
return ExpandingGroupby(self, *args, **kwargs)

def _fill(self, how, limit=None):
labels, _, _ = self.grouper.group_info

def _fill(self, direction, limit=None):
# Need int value for Cython
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string

if limit is None:
limit = -1
output = {}
if type(self) is DataFrameGroupBy:
for grp in self.grouper.groupings:
ser = grp.group_index.take(grp.labels)
output[ser.name] = ser.values
for name, obj in self._iterate_slices():
indexer = np.zeros_like(labels)
mask = isnull(obj.values).view(np.uint8)
libgroupby.group_fillna_indexer(indexer, mask, labels, how,
limit)
output[name] = algorithms.take_nd(obj.values, indexer)

return self._wrap_transformed_output(output)
return self._get_cythonized_result('group_fillna_indexer',
self.grouper, needs_mask=True,
direction=direction, limit=limit)

@Substitution(name='groupby')
def pad(self, limit=None):
Expand Down Expand Up @@ -1863,6 +1852,52 @@ def cummax(self, axis=0, **kwargs):

return self._cython_transform('cummax', numeric_only=False)

def _get_cythonized_result(self, how, grouper, needs_mask=False,
needs_ngroups=False, **kwargs):
"""Get result for Cythonized functions

Parameters
----------
how : str, Cythonized function name to be called
grouper : Grouper object containing pertinent group info
needs_mask : bool, default False
Whether boolean mask needs to be part of the Cython call signature
needs_ngroups : bool, default False
Whether number of groups part of the Cython call signature
**kwargs : dict
Extra arguments required for the given function. This method
internally stores an OrderedDict that maps those keywords to
positional arguments before calling the Cython layer

Returns
-------
GroupBy object populated with appropriate result(s)
"""
exp_kwds = collections.OrderedDict([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be passed in directly (from the calling function), no? rather than hard coded inside this function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, maybe the caller passes in a partial the kwargs baked rather than the string version of the funtion?

_get_cythonized_result(
    partial(pd._libs.groupby.group_shift_indexer, nperiods=2),
    ...
)

Copy link
Member Author
@WillAyd WillAyd Feb 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll think more about that. The one complicating factor is that masked values cannot be passed in directly as they are calculated on a per slice basis. To avoid a convoluted set of calls I figured it would make sense to have all keywords and positional arguments resolved within one function, but there’s certainly other ways to go about this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chris-b1 comment is about the exp_kwds, I am still not clear why the caller cannot pass these

(('group_fillna_indexer'), ('direction', 'limit')),
(('group_shift_indexer'), ('nperiods',))])

labels, _, ngroups = grouper.group_info
output = collections.OrderedDict()
base_func = getattr(libgroupby, how)

for name, obj in self._iterate_slices():
indexer = np.zeros_like(labels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to specify int64? (or platform int)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, int64

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a copy/paste of existing code, so I didn't give it too much thought. That said, labels is already wrapped with a _ensure_int64 call as part of group_info within the BaseGrouper, so unless there's something in particular you know of I figure it would be easiest to inherit that type from labels and not override

func = partial(base_func, indexer, labels)
if needs_mask:
mask = isnull(obj.values).astype(np.uint8, copy=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contrary to the int64 case, you do want to use view here - numpy doesn't see bool and uint8 as the 'same' type, so will always copy, which view elides.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up - still trying to wrap my head around some of the nuances there. Out of curiosity, how do you know this? Just from experience or is there a good reference?

I do see that astype documentation mentions dtype, order and subok requirements need to be satisfied to prevent copy, but I'm not clear on what those requirements are and if they are documented

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately all I can really point to is trial and error. Some "general" rules, at least in a pandas context

  1. copy=False will only elide a copy if the dtype is identical (may be contradictions to this, but at least not in general). Useful for conditional casts.
  2. arr.view(dtype) never copies. Viewing across types should only be used if if the itemsize of the two types are identical. Primarily useful for casting a type with some metadata down to a a more primitive type (bool to uint8, datetime64 to int64).

func = partial(func, mask)

if needs_ngroups:
func = partial(func, ngroups)

# Convert any keywords into positional arguments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can pass kwargs to cython functions, so not sure this is necessary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explicitly converted kwargs into positional arguments from the first comment back in #19481. If we are OK with kwargs in the Cython layer I can rewrite this, as it could simplify a few things

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you can pass kwargs

func = partial(func, *(kwargs[x] for x in exp_kwds[how]))
func() # Call func to modify indexer values in place
output[name] = algorithms.take_nd(obj.values, indexer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need .values here (I think), or maybe need _values

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above - believe we need the ndarray not the Series.

Looking at internals.py I see external_values and internal_values (respectively for .values and ._values) both ultimately dispatch to the same property (self.values) - is the distinction that we are trying to make here forward looking?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no this is fine


return self._wrap_transformed_output(output)

@Substitution(name='groupby')
@Appender(_doc_template)
def shift(self, periods=1, freq=None, axis=0):
Expand All @@ -1880,17 +1915,10 @@ def shift(self, periods=1, freq=None, axis=0):
if freq is not None or axis != 0:
return self.apply(lambda x: x.shift(periods, freq, axis))

labels, _, ngroups = self.grouper.group_info

# filled in by Cython
indexer = np.zeros_like(labels)
libgroupby.group_shift_indexer(indexer, labels, ngroups, periods)
return self._get_cythonized_result('group_shift_indexer',
self.grouper, needs_ngroups=True,
nperiods=periods)

output = {}
for name, obj in self._iterate_slices():
output[name] = algorithms.take_nd(obj.values, indexer)

return self._wrap_transformed_output(output)

@Substitution(name='groupby')
@Appender(_doc_template)
Expand Down Expand Up @@ -3597,7 +3625,6 @@ def describe(self, **kwargs):
def value_counts(self, normalize=False, sort=True, ascending=False,
bins=None, dropna=True):

from functools import partial
from pandas.core.reshape.tile import cut
from pandas.core.reshape.merge import _get_join_indexers

Expand Down Expand Up @@ -4605,9 +4632,18 @@ def _apply_to_column_groupbys(self, func):
in self._iterate_column_groupbys()),
keys=self._selected_obj.columns, axis=1)

def _fill(self, direction, limit=None):
"""Overriden method to concat grouped columns in output"""
res = super()._fill(direction, limit=limit)
output = collections.OrderedDict()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be slightly simpler as a list-comprehension

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though maybe this should be an arg to _fill? (IOW the option to do it should be done in fill)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean as an arg to _get_cythonized_result (fill would always use, unless we wanted to implement a new feature)? I was thinking about that but given fill on a DataFrameGroupBy is the only operation I know of at the moment that includes the grouping in the body of the returned object I figured it made the most sense to just perform that action there

for grp in self.grouper.groupings:
ser = grp.group_index.take(grp.labels)
output[ser.name] = ser.values

return self._wrap_transformed_output(output).join(res)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here, rather than joining, could't the entries of res be added be added to output? should have the same shape so shouldn't need to go through the join machinery

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As is the super call already wraps it's output into a DataFrame or Series (actually returned by _get_cythonized_result. If we wanted to, we could create some private methods that pass around the "unwrapped" output which private methods in this class and its parent can access before ever calling _wrap_transformed_output, but I would question if the juice is really worth the squeeze to do that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can do a concat no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concat should work just as well. Is that out of performance reasons we prefer that to join?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just idiomatic


def count(self):
""" Compute count of group, excluding missing values """
from functools import partial
from pandas.core.dtypes.missing import _isna_ndarraylike as isna

data, _ = self._get_data_to_aggregate()
Expand Down
0