8000 BUG: groupby.agg/transform casts UDF results by rhshadrach · Pull Request #40790 · pandas-dev/pandas · GitHub
[go: up one dir, main page]

Skip to content

BUG: groupby.agg/transform casts UDF results #40790

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
May 3, 2021
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
477d813
BUG: groupby.agg/transform downcasts UDF results
rhshadrach Apr 2, 2021
d932c93
Merge branch 'master' of https://github.com/pandas-dev/pandas into do…
rhshadrach Apr 10, 2021
f2069a7
Reverted behavior change when input and output are the same kind
rhshadrach Apr 10, 2021
35c789f
Patch via maybe_convert_objects
rhshadrach Apr 10, 2021
93fa089
Merge branch 'master' of https://github.com/pandas-dev/pandas into do…
rhshadrach Apr 22, 2021
1cb216e
fixups
rhshadrach Apr 22, 2021
0cafcee
whatsnew
rhshadrach Apr 22, 2021
785ac9d
dtype test fixes
rhshadrach Apr 23, 2021
737a366
Merge branch 'master' of https://github.com/pandas-dev/pandas into do…
rhshadrach Apr 23, 2021
0b00aa7
Merge branch 'master' of https://github.com/pandas-dev/pandas into do…
rhshadrach Apr 24, 2021
de0f7b5
fixup
rhshadrach Apr 24, 2021
e95bb49
Merge branch 'dont_cast_udfs' of https://github.com/rhshadrach/pandas…
rhshadrach Apr 24, 2021
4ef6794
Fixup
rhshadrach Apr 24, 2021
4f97288
Add GH issue to TODOs
rhshadrach Apr 24, 2021 8000
ad7d990
Added docs to user guide, agg docstring
rhshadrach Apr 25, 2021
11529e3
Updated docs
rhshadrach Apr 25, 2021
0ca49f6
Merge branch 'dont_cast_udfs' of https://github.com/rhshadrach/pandas…
rhshadrach Apr 25, 2021
a0a2640
Fixup
rhshadrach Apr 27, 2021
eb1943a
Fixup
rhshadrach Apr 27, 2021
180bc23
docsting fixup
rhshadrach Apr 29, 2021
47d97ae
Merge branch 'master' of https://github.com/pandas-dev/pandas into do…
rhshadrach Apr 29, 2021
4a0978e
Add versionchanged
rhshadrach May 1, 2021
2b38e5c
Merge branch 'master' of https://github.com/pandas-dev/pandas into do…
rhshadrach May 1, 2021
6b80c10
Added versionchanged
rhshadrach May 1, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Reverted behavior change when input and output are the same kind
  • Loading branch information
rhshadrach committed Apr 10, 2021
commit f2069a7b214ea0ac15b75ea854d09af825cac47d
7 changes: 6 additions & 1 deletion pandas/core/dtypes/cast.py
Original file line number Diff line number Diff line change
Expand Up @@ -304,7 +304,7 @@ def maybe_downcast_to_dtype(result: ArrayLike, dtype: str | np.dtype) -> ArrayLi


def maybe_downcast_numeric(
result: ArrayLike, dtype: DtypeObj, do_round: bool = False
result: ArrayLike, dtype: DtypeObj, do_round: bool = False, same_kind: bool = False
) -> ArrayLike:
"""
Subset of maybe_downcast_to_dtype restricted to numeric dtypes.
Expand All @@ -314,6 +314,9 @@ def maybe_downcast_numeric(
result : ndarray or ExtensionArray
dtype : np.dtype or ExtensionDtype
do_round : bool
same_kind: bool
Whether to only possibly downcast when result.dtype is the same kind
as dtype.

Returns
-------
Expand All @@ -332,6 +335,8 @@ def trans(x):
# don't allow upcasts here (except if empty)
if result.dtype.itemsize <= dtype.itemsize and result.size:
return result
elif same_kind:
return result

if is_bool_dtype(dtype) or is_integer_dtype(dtype):

Expand Down
8 changes: 8 additions & 0 deletions pandas/core/groupby/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,7 @@
is_dict_like,
is_integer_dtype,
is_interval_dtype,
is_numeric_dtype,
is_scalar,
needs_i8_conversion,
)
Expand Down Expand Up @@ -578,6 +579,13 @@ def _transform_general(self, func, *args, **kwargs):
result = self._set_result_index_ordered(concatenated)
else:
result = self.obj._constructor(dtype=np.float64)
# we will only try to coerce the result type if
# we have a numeric dtype, as these are *always* user-defined funcs
# the cython take a different path (and casting)
if is_numeric_dtype(result.dtype):
result = maybe_downcast_numeric(
result, self._selected_obj.dtype, same_kind=True
)

result.name = self._selected_obj.name
return result
Expand Down
3 changes: 3 additions & 0 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -1226,6 +1226,9 @@ def _python_agg_general(self, func, *args, **kwargs):
assert result is not None
key = base.OutputKey(label=name, position=idx)

if is_numeric_dtype(obj.dtype):
result = maybe_downcast_numeric(result, obj.dtype, same_kind=True)

if self.grouper._filter_empty_groups:
mask = counts.ravel() > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach can i get your help in this nearby piece of code? in all existing tests, when we get here, we have mask.all(). trying to come up with a case where this doesnt hold (or prove that it must always hold). any thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is now removed - guessing mask.all() always held.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep.

next thing to ask your help with : IIRC you've done a lot of work in core.apply, which DataFrameGroupBy.aggregate uses. id like to make SeriesGroupBy.aggregate and DataFrameGroupBy.aggregate share more code (or at least be more obviously-similar). can i get your thoughts on how to achieve this (and whether its the effort)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we make aggregate always aggregate (PoC implemented in #40275), we can greatly simplify these methods. However, in order to do that we need to separate the apply/agg paths: currently apply uses agg for list/dicts and agg also uses apply for UDFs. I make a couple of attempts to do this but kept running into issues with changing behaviors without having a clear way to deprecate. This was the motivation for #41112. I plan to start working on that, assuming that's a good approach, in 1.3.


Expand Down
6 changes: 5 additions & 1 deletion pandas/tests/groupby/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -302,7 +302,11 @@ def f(x):
return float(len(x))

agged = grouped.agg(f)
expected = Series([4.0, 2.0], index=["bar", "foo"])

# precision will only be preserved when the input dtype is the same kind as output
expected = Series(
[4.0, 2.0], index=["bar", "foo"], dtype=dtype if dtype == "float32" else None
)
tm.assert_series_equal(agged, expected)


Expand Down
3 changes: 0 additions & 3 deletions pandas/tests/resample/test_datetime_index.py
Original file line number Diff line number Diff line change
Expand Up @@ -1204,9 +1204,6 @@ def test_resample_median_bug_1688():

result = df.resample("T").apply(lambda x: x.mean())
exp = df.asfreq("T")
if dtype == "float32":
# TODO: fastpath for apply comes back at float64
exp = exp.astype("float64")
tm.assert_frame_equal(result, exp)

result = df.resample("T").median()
Expand Down
0