8000 ENH/BUG: Use Kleene logic for groupby any/all by mzeitlin11 · Pull Request #40819 · pandas-dev/pandas · GitHub
[go: up one dir, main page]

Skip to content

ENH/BUG: Use Kleene logic for groupby any/all #40819

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 40 commits into from
Apr 13, 2021
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
40 commits
Select commit Hold shift + click to select a range
088ca14
WIP
mzeitlin11 Apr 6, 2021
2554921
TSTS: Consolidate groupby any, all
mzeitlin11 Apr 6, 2021
9a8f9c9
Fixture fixup
mzeitlin11 Apr 6, 2021
5ca9c4b
Unmove test
mzeitlin11 Apr 6, 2021
68fd995
Merge remote-tracking branch 'origin/master' into enh/any_all_kleene
mzeitlin11 Apr 6, 2021
6530491
Merge branch 'tst/any_all' into enh/any_all_kleene
mzeitlin11 Apr 6, 2021
26146c2
Add initial tests
mzeitlin11 Apr 6, 2021
20f475d
Add whatsnew, bench
mzeitlin11 Apr 6, 2021
924b38e
Clean up edge case
mzeitlin11 Apr 6, 2021
423f43f
Fix typo
mzeitlin11 Apr 6, 2021
b1408ac
Avoid copy if possible
mzeitlin11 Apr 6, 2021
47ef037
Fix old level test
mzeitlin11 Apr 7, 2021
4415060
Precommit fixup
mzeitlin11 Apr 7, 2021
bb04c1c
Clean up print
mzeitlin11 Apr 7, 2021
9c90886
Merge remote-tracking branch 'origin/master' into enh/any_all_kleene
mzeitlin11 Apr 7, 2021
ef3fbe2
precommit fixup
mzeitlin11 Apr 7, 2021
f4c8a8a
Merge remote-tracking branch 'origin/master' into enh/any_all_kleene
mzeitlin11 Apr 8, 2021
1c3cb7d
Split out test
mzeitlin11 Apr 8, 2021
7cbf85b
Split whatsnew
mzeitlin11 Apr 8, 2021
809b8a4
whatsnew typo
mzeitlin11 Apr 8, 2021
58fd33a
Modify dispatch, add mixed test
mzeitlin11 Apr 8, 2021
80a65bb
Fix post proc check
mzeitlin11 Apr 8, 2021
c9b9d5f
Address review comments
mzeitlin11 Apr 8, 2021
7514568
Merge remote-tracking branch 'origin/master' into enh/any_all_kleene
mzeitlin11 Apr 9, 2021
740ad7b
Merge remote-tracking branch 'origin/master' into enh/any_all_kleene
mzeitlin11 Apr 10, 2021
a116bed
Name arguments better
mzeitlin11 Apr 10, 2021
8a428d4
Use -1 as mask signal
mzeitlin11 Apr 10, 2021
8e3c5be
Consistent typing
mzeitlin11 Apr 10, 2021
b627618
Don't use inspect
mzeitlin11 Apr 10, 2021
23b3b64
precommit fixup
mzeitlin11 Apr 10, 2021
a30496c
Clean up docstring
mzeitlin11 Apr 10, 2021
3051a99
Merge remote-tracking branch 'origin/master' into enh/any_all_kleene
mzeitlin11 Apr 12, 2021
4cd2833
Update doc/source/whatsnew/v1.3.0.rst
mzeitlin11 Apr 13, 2021
98cd401
Update doc/source/whatsnew/v1.3.0.rst
mzeitlin11 Apr 13, 2021
a92c637
Update doc/source/whatsnew/v1.3.0.rst
mzeitlin11 Apr 13, 2021
7c5c8e6
Update pandas/tests/groupby/test_any_all.py
mzeitlin11 Apr 13, 2021
c66d1fd
Update pandas/tests/groupby/test_any_all.py
mzeitlin11 Apr 13, 2021
0950234
Merge branch 'master' into enh/any_all_kleene
mzeitlin11 Apr 13, 2021
c81c1a5
Simplify teasts
mzeitlin11 Apr 13, 2021
d2b8ad0
Merge branch 'enh/any_all_kleene' of github.com:/mzeitlin11/pandas in…
mzeitlin11 Apr 13, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Address review comments
  • Loading branch information
mzeitlin11 committed Apr 8, 2021
commit c9b9d5f25820523898970be1a746f752b8e0dc4b
15 changes: 9 additions & 6 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -77,11 +77,11 @@ class providing the base-class of operations.
from pandas.core import nanops
import pandas.core.algorithms as algorithms
from pandas.core.arrays import (
BaseMaskedArray,
BooleanArray,
Categorical,
ExtensionArray,
)
from pandas.core.arrays.boolean import BooleanArray
from pandas.core.arrays.masked import BaseMaskedArray
from pandas.core.base import (
DataError,
PandasObject,
Expand Down Expand Up @@ -1444,9 +1444,9 @@ def result_to_bool(
needs_mask=True,
pre_processing=objs_to_bool,
post_processing=result_to_bool,
use_nullable_input_arg=True,
val_test=val_test,
skipna=skipna,
masked=False,
)

@final
Expand Down Expand Up @@ -2630,6 +2630,7 @@ def _get_cythonized_result(
result_is_index: bool = False,
pre_processing=None,
post_processing=None,
use_nullable_input_arg: bool = False,
**kwargs,
):
"""
Expand Down Expand Up @@ -2677,6 +2678,9 @@ def _get_cythonized_result(
second argument, i.e. the signature should be
(ndarray, Type). Optionally, a third argument can be "masked", to
allow for processing specific to nullable values
use_nullable_input_arg : bool, default False
If True, pass an argument "masked" to the cython function specifying
whether or not the input should be treated as masked
**kwargs : dict
Extra arguments to be passed back to Cython funcs

Expand Down Expand Up @@ -2705,7 +2709,6 @@ def _get_cythonized_result(
post_processing_accepts_masked = post_processing is not None and (
"masked" in inspect.signature(post_processing).parameters
)
kwargs_accepts_masked = "masked" in kwargs

error_msg = ""
for idx, obj in enumerate(self._iterate_slices()):
Expand Down Expand Up @@ -2757,8 +2760,8 @@ def _get_cythonized_result(
if needs_ngroups:
func = partial(func, ngroups)

if kwargs_accepts_masked:
kwargs["masked"] = is_nullable
if use_nullable_input_arg:
func = partial(func, masked=is_nullable)

func(**kwargs) # Call func to modify indexer values in place

Expand Down
15 changes: 11 additions & 4 deletions pandas/tests/groupby/test_any_all.py
Original file line number Diff line number Diff line change
Expand Up @@ -74,8 +74,9 @@ def test_bool_aggs_dup_column_labels(bool_agg_func):
@pytest.mark.parametrize("bool_agg_func", ["any", "all"])
@pytest.mark.parametrize("skipna", [True, False])
@pytest.mark.parametrize(
# expected indexed as [skipna][bool_agg_func == "all"]
"data,expected",
# expected_data indexed as [[skipna=False/any, skipna=False/all],
# [skipna=True/any, skipna=True/all]]
"data,expected_data",
[
([False, False, False], [[False, False], [False, False]]),
([True, True, True], [[True, True], [True, True]]),
Expand All @@ -85,16 +86,22 @@ def test_bool_aggs_dup_column_labels(bool_agg_func):
([True, pd.NA, False], [[True, False], [True, False]]),
],
)
def test_masked_kleene_logic(bool_agg_func, data, expected, skipna):
def test_masked_kleene_logic(bool_agg_func, data, expected_data, skipna):
# GH#37506
df = DataFrame(data, dtype="boolean")
expected = DataFrame(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this expected is really hard to parse can you do in multiple steps / make simpler

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressing your comment above makes this simpler, let me know if you think any piece is still confusing

[expected[skipna][bool_agg_func == "all"]], dtype="boolean", index=[1]
[expected_data[skipna][bool_agg_func == "all"]], dtype="boolean", index=[1]
)

result = df.groupby([1, 1, 1]).agg(bool_agg_func, skipna=skipna)
tm.assert_frame_equal(result, expected)

# The expected result we compared to should match aggregating on the whole
# series
result = getattr(df[0], bool_agg_func)(skipna=skipna)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check is really hard to parse.
what exactly are you checking? can you make this much simpler

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this is essentially doing is series.any(skipna=skipna), but then with getattr because it's generic for any/all. Do you have a concrete suggestion to change? (or to clarify the comment?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a complicated expexcted then a really complicated assert. This needs to be greatly simplified. It is impossible to grok with all of this logic. my suggesetion is to break this up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be very concrete, you mean breaking it up into 2 lines, like

series = df[0]
getattr(series, method)(skipna=skipna)

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have tried to simplify by moving this validation of "expected" into a separate test in test_reductions.py (replacing the existing test there as this tests a superset of the logic).

expected = expected_data[skipna][bool_agg_func == "all"]
assert (result is pd.NA and expected is pd.NA) or result == expected


@pytest.mark.parametrize(
"dtype1,dtype2,exp_col1,exp_col2",
Expand Down
0