8000 Cythonized GroupBy Fill by WillAyd · Pull Request #19673 · pandas-dev/pandas · GitHub
[go: up one dir, main page]

Skip to content

Cythonized GroupBy Fill #19673

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Feb 25, 2018
Merged
Changes from 1 commit
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
c58123c
Added test case for groupby fill methods
WillAyd Feb 12, 2018
2bc8023
Added code for group_fillna
WillAyd Feb 13, 2018
3cb25c0
Added ASV benchmarks
WillAyd Feb 13, 2018
7fecc11
Connected GroupBy method to Cython fillna
WillAyd Feb 13, 2018
3c2fb36
Fixed issue when filling Series after GroupBy
WillAyd Feb 13, 2018
a52b8c4
Added tests to mix group entries; fixed sort bug
WillAyd Feb 13, 2018
16c1823
Simplied groupby Cython calls for ffill/bfill
WillAyd Feb 13, 2018
bd3d5e0
Removed abandoned Cython implementation
WillAyd Feb 13, 2018
cae65af
Added upcast to int64 to prevent 32 bit failures
WillAyd Feb 13, 2018
0266514
Fixed issue with reconstructing grouped Series
WillAyd Feb 14, 2018
50dc690
Changed .view to .astype to avoid 32 bit segfaults
WillAyd Feb 15, 2018
9fa8e25
Added whatsnew
WillAyd Feb 15, 2018
5da06d8
Aligned group_fillna and group_shift signatures
WillAyd Feb 19, 2018
2fe91a4
Fixed failing test; list comp for _fill method
WillAyd Feb 19, 2018
825ba17
Updated whatsnew
WillAyd Feb 19, 2018
127c71c
PEP8 fixes
WillAyd Feb 19, 2018
3a23cd6
Py27 support with super call
WillAyd Feb 20, 2018
a363146
Fixed LINT issue
WillAyd Feb 20, 2018
fd513c8
Used kwargs to call Cython groupby funcs
WillAyd Feb 20, 2018
776d1b7
Docstring for _fill method
WillAyd Feb 20, 2018
33f0d06
Cleaned up kwargs passing to Cython layer
WillAyd Feb 21, 2018
662008a
Idiomatic update - replace join with concat
WillAyd Feb 22, 2018
27e24fa
Moved non-templated funcs to groupby.pyx
WillAyd Feb 23, 2018
6f72476
Code update - swap group_index.take with grouper
WillAyd Feb 23, 2018
eff6603
Rebase and update import
WillAyd Feb 24, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Fixed failing test; list comp for _fill method
8000
  • Loading branch information
WillAyd committed Feb 24, 2018
commit 2fe91a4ac6309becb958e97c6b94e61e1dd2c9e2
13 changes: 6 additions & 7 deletions pandas/core/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -1882,10 +1882,10 @@ def _get_cythonized_result(self, how, grouper, needs_mask=False,
base_func = getattr(libgroupby, how)

for name, obj in self._iterate_slices():
indexer = np.zeros_like(labels)
indexer = np.zeros_like(labels, dtype=np.int64)
func = partial(base_func, indexer, labels)
if needs_mask:
mask = isnull(obj.values).astype(np.uint8, copy=False)
mask = isnull(obj.values).view(np.uint8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont need to use .values here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing .values for this returns a Series object so I do think we need this to get the ndarray before passing to Cython

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right this is ok

func = partial(func, mask)

if needs_ngroups:
Expand Down Expand Up @@ -4633,12 +4633,11 @@ def _apply_to_column_groupbys(self, func):
keys=self._selected_obj.columns, axis=1)

def _fill(self, direction, limit=None):
"""Overriden method to concat grouped columns in output"""
"""Overriden method to join grouped columns in output"""
res = super()._fill(direction, limit=limit)
output = collections.OrderedDict()
for grp in self.grouper.groupings:
ser = grp.group_index.take(grp.labels)
output[ser.name] = ser.values
output = collections.OrderedDict(
(grp.name, grp.group_index.take(grp.labels)) for grp in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe missing something obvious, but isn't the take unnecessary - couldn't you use the original grouping columns?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly could be a better way to do it - first time working this far into the API so if there's any easier way you know of that I missed let me know.

FWIW I think that instead of the take we could replace this with grp.grouper and get the same values - is that what you are referring to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay - was working from my memory and I honestly get lost on all the levels of groupby, groupers, and groupings. What you want here is grp.grouper - it has the original values without the need to take again.

In [40]: df = pd.DataFrame({'k': [1, 1, 2, 2, 1, 1], 'v': [1., np.nan, 2., np.nan, np.nan, 3.]})

In [41]: gb = df.groupby('k')

In [42]: [grp.grouper for grp in gb.grouper.groupings]
Out[42]: [array([1, 1, 2, 2, 1, 1], dtype=int64)]

self.grouper.groupings)

return self._wrap_transformed_output(output).join(res)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here, rather than joining, could't the entries of res be added be added to output? should have the same shape so shouldn't need to go through the join machinery

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As is the super call already wraps it's output into a DataFrame or Series (actually returned by _get_cythonized_result. If we wanted to, we could create some private methods that pass around the "unwrapped" output which private methods in this class and its parent can access before ever calling _wrap_transformed_output, but I would question if the juice is really worth the squeeze to do that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can do a concat no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concat should work just as well. Is that out of performance reasons we prefer that to join?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just idiomatic


Expand Down
0