ENH: Add dropna in groupby to allow NaN in keys by charlesdong1991 · Pull Request #30584 · pandas-dev/pandas · GitHub

ENH: Add dropna in groupby to allow NaN in keys #30584


Merged
59 commits merged on May 9, 2020

Changes from 1 commit of 59 commits
7e461a1
remove \n from docstring
charlesdong1991 Dec 3, 2018
1314059
fix conflicts
charlesdong1991 Jan 19, 2019
8bcb313
Merge remote-tracking branch 'upstream/master'
charlesdong1991 Jul 30, 2019
13b03a8
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Dec 31, 2019
98f6127
fix issue 3729
charlesdong1991 Dec 31, 2019
d5fd74c
fix conflicts
charlesdong1991 Dec 31, 2019
eb717ec
not check type
charlesdong1991 Dec 31, 2019
de2ee5d
Add groupby test for Series
charlesdong1991 Dec 31, 2019
def05cc
Add whatsnew note
charlesdong1991 Dec 31, 2019
2888807
Code change based on JR review
charlesdong1991 Jan 1, 2020
b357659
fix conflicts
charlesdong1991 Jan 1, 2020
dc4fef1
add forgotten commits
charlesdong1991 Jan 1, 2020
25482ec
add forgotten commit
charlesdong1991 Jan 1, 2020
015336d
Add dropna for series
charlesdong1991 Jan 1, 2020
ac2a79f
add doc example for Series
charlesdong1991 Jan 1, 2020
eb9a6f7
Add level example for series groupby
charlesdong1991 Jan 1, 2020
ffb70f8
Add doc example for frame groupby
charlesdong1991 Jan 1, 2020
b0e3cce
Code change based on JR reviews
charlesdong1991 Jan 2, 2020
a1d5510
add doc
charlesdong1991 Jan 2, 2020
11ef56a
move doc
charlesdong1991 Jan 2, 2020
b247a8b
NaN to NA
charlesdong1991 Jan 2, 2020
7cb027c
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Jan 2, 2020
d730c4a
Fix linting
charlesdong1991 Jan 2, 2020
42c4934
fix rst issue
charlesdong1991 Jan 2, 2020
2ba79b9
fix rst issue
charlesdong1991 Jan 2, 2020
8b79b6c
refactor based on WA review
charlesdong1991 Jan 3, 2020
a4fdf2d
merge master and resolve conflicts
charlesdong1991 Feb 10, 2020
4ac15e3
remove blank
charlesdong1991 Feb 10, 2020
4ebbad3
code change on reviews
charlesdong1991 Feb 10, 2020
f141b80
use pd.testing
charlesdong1991 Feb 10, 2020
23ad19b
linting
charlesdong1991 Feb 10, 2020
bafc4a5
fixup
charlesdong1991 Feb 10, 2020
c98bafe
fixup
charlesdong1991 Feb 10, 2020
86a5958
doc
charlesdong1991 Feb 10, 2020
6cf31d7
validation
charlesdong1991 Feb 10, 2020
2b77f37
xfail windows
charlesdong1991 Feb 10, 2020
451ec97
rebase and resolve conflict
charlesdong1991 Feb 19, 2020
1089b18
fixup based on WA review
charlesdong1991 Feb 22, 2020
63da563
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Feb 22, 2020
1b3f22a
fix conflicts
charlesdong1991 Apr 7, 2020
3f360a9
reduce tests
charlesdong1991 Apr 7, 2020
5cabe4b
fix pep8
charlesdong1991 Apr 7, 2020
76ffb9f
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 11, 2020
6c126c7
rebase and docs fixes
charlesdong1991 Apr 11, 2020
6d61d6a
fixup doc
charlesdong1991 Apr 11, 2020
3630e8b
remove inferred type
charlesdong1991 Apr 11, 2020
1cec7f1
better comment
charlesdong1991 Apr 11, 2020
1a1bb49
remove xfail
charlesdong1991 Apr 11, 2020
7ea2e79
use fixture
charlesdong1991 Apr 11, 2020
13b1e9a
coelse type for windows build
charlesdong1991 Apr 11, 2020
92a7eed
fixup
charlesdong1991 Apr 11, 2020
1315a9d
fixup
charlesdong1991 Apr 11, 2020
a7959d5
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 15, 2020
9fec9a8
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 27, 2020
ffbae76
Doc fixup
charlesdong1991 Apr 27, 2020
ef90d7c
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 27, 2020
e219748
rebase and resolve conflict
charlesdong1991 Apr 27, 2020
2940908
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 May 7, 2020
4ea6aa0
try merge master again
charlesdong1991 May 7, 2020
Code change based on JR reviews
charlesdong1991 committed Jan 2, 2020
commit b0e3cce4dc203ac1ccf6917e2169db21aa59035b
2 changes: 1 addition & 1 deletion pandas/core/algorithms.py
@@ -635,7 +635,7 @@ def factorize(
uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
)
if not dropna and (codes == na_sentinel).any():
Member:
Can you assign codes == na_sentinel to a mask variable prior to this? Slightly preferable to re-evaluating this in the branch

Member Author:
done, thanks!
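
A minimal standalone sketch of the refactor being requested here, reusing the comparison through a mask instead of re-evaluating it (the variable name na_mask and the toy inputs are illustrative, not part of the actual patch):

import numpy as np

# Toy reproduction of the tail of factorize discussed above; inputs are made up.
codes = np.array([0, -1, 1, 0])           # -1 plays the role of na_sentinel
uniques = np.array(["a", "b"], dtype=object)
na_sentinel = -1
dropna = False

na_mask = codes == na_sentinel            # evaluate the comparison once
if not dropna and na_mask.any():
    uniques = np.append(uniques, [None])  # NA becomes its own group key
    codes = np.where(na_mask, len(uniques) - 1, codes)

print(uniques)  # ['a' 'b' None]
print(codes)    # [0 2 1 0]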

uniques = np.append(uniques, [np.nan])
uniques = np.append(uniques, [None])
Member:
These are typically int64, right? If so, just curious whether coercing the dtype is going to cause performance regressions.

Member Author (@charlesdong1991, Jan 3, 2020):
I think uniques are typically str/category, and codes are int. I used None since I thought it was the safer choice, because _factorize_array uses None as the default for na_value. I will rethink this regression issue, thanks!

If you have any idea of a better way of doing this, you are most welcome! 😃
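
To make the dtype concern concrete, a small self-contained illustration (not the patch itself, and separate from the re-casting _reconstruct_data does later) of how appending an NA marker affects a NumPy array's dtype:

import numpy as np

# Standalone illustration of the coercion being discussed; not pandas internals.
int_uniques = np.array([1, 2, 3])
print(np.append(int_uniques, [np.nan]).dtype)  # float64: the int64 values are upcast
print(np.append(int_uniques, [None]).dtype)    # object

obj_uniques = np.array(["a", "b"], dtype=object)
print(np.append(obj_uniques, [None]).dtype)    # object: unchanged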

Contributor:
This is likely changing the dtype; you need the NA value for this dtype, so use pandas.core.dtypes.cast.na_value_for_dtype.

Member Author:
I do not think so. First of all, if this were changed to use cast.na_value_for_dtype, then _reconstruct_data would fail since it cannot cast NaN to integer; secondly, _reconstruct_data casts the data to the correct type, so as you can see from the output, all data are cast correctly.

Member:
Is leveraging the Int64 type an option here? This just seems a little strange in its current state
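
As a reference point for the Int64 suggestion, a minimal sketch of the nullable integer dtype holding missing values without upcasting (illustrative only; it is not what this commit does):

import pandas as pd

# The nullable Int64 extension dtype keeps integer values alongside pd.NA.
arr = pd.array([1, 2, None], dtype="Int64")
print(arr.dtype)  # Int64
print(arr)        # [1, 2, <NA>]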

codes = np.where(codes == na_sentinel, len(uniques) - 1, codes)

uniques = _reconstruct_data(uniques, dtype, original)
2 changes: 1 addition & 1 deletion pandas/core/frame.py
@@ -5649,7 +5649,7 @@ def update(
Captive 210.0
Wild 185.0

We can also choose to include NaN in group keys or not by defining
We can also choose to include NaN in group keys or not by setting
`dropna` parameter:
Contributor:
mention the default


>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
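
For context, a minimal sketch expanding the truncated doctest above to show the documented dropna behaviour (the DataFrame construction follows the snippet in the diff; the printed comparisons are illustrative, not the exact rendered docstring):

import pandas as pd

# Continue the example started in the docstring diff above.
l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df = pd.DataFrame(l, columns=["a", "b", "c"])

print(df.groupby(by=["b"]).sum())                # default dropna=True: the NaN key is dropped
print(df.groupby(by=["b"], dropna=False).sum())  # dropna=False: NaN kept as its own group key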
4 changes: 2 additions & 2 deletions pandas/core/generic.py
@@ -7347,9 +7347,9 @@ def clip(

.. versionadded:: 0.23.0
dropna : bool, default True
If True, and if group keys contain NaN values, NaN values together
If True, and if group keys contain NA values, NA values together
with row/column will be dropped.
If False, NaN values will also be treated as the key in groups
If False, NA values will also be treated as the key in groups

.. versionadded:: 1.0.0
Member:
Suggested change
.. versionadded:: 1.0.0
.. versionadded:: 1.1.0


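
A short sketch of the same flag for a Series grouped by an index level, mirroring the test cases shown further down in this PR (illustrative, and assuming a pandas version that includes this change):

import numpy as np
import pandas as pd

# Group a Series by its index level, with and without dropping the NaN label.
ser = pd.Series([1, 2, 3, 3], index=["a", "a", "b", np.nan])

print(ser.groupby(level=0).sum())                # 'a' -> 3, 'b' -> 3; the NaN label is dropped
print(ser.groupby(level=0, dropna=False).sum())  # additionally NaN -> 3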
143 changes: 0 additions & 143 deletions pandas/tests/groupby/test_groupby.py
@@ -2025,146 +2025,3 @@ def test_groupby_crash_on_nunique(axis):
expected = expected.T

tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize(
"dropna, tuples, outputs",
[
(
True,
[["A", "B"], ["B", "A"]],
{"c": [13.0, 123.23], "d": [13.0, 123.0], "e": [13.0, 1.0]},
),
(
False,
[["A", "B"], ["A", np.nan], ["B", "A"]],
{
"c": [13.0, 12.3, 123.23],
"d": [13.0, 233.0, 123.0],
"e": [13.0, 12.0, 1.0],
},
),
],
)
def test_groupby_dropna_multi_index_dataframe(dropna, tuples, outputs):
# GH 3729
df_list = [
["A", "B", 12, 12, 12],
["A", None, 12.3, 233.0, 12],
["B", "A", 123.23, 123, 1],
["A", "B", 1, 1, 1.0],
]
df = pd.DataFrame(df_list, columns=["a", "b", "c", "d", "e"])
grouped = df.groupby(["a", "b"], dropna=dropna).sum()

mi = pd.MultiIndex.from_tuples(tuples, names=list("ab"))
expected = pd.DataFrame(outputs, index=mi)

tm.assert_frame_equal(grouped, expected, check_index_type=False)


@pytest.mark.parametrize(
"dropna, idx, outputs",
[
(True, ["A", "B"], {"b": [123.23, 13.0], "c": [123.0, 13.0], "d": [1.0, 13.0]}),
(
False,
["A", "B", np.nan],
{
"b": [123.23, 13.0, 12.3],
"c": [123.0, 13.0, 233.0],
"d": [1.0, 13.0, 12.0],
},
),
],
)
def test_groupby_dropna_normal_index_dataframe(dropna, idx, outputs):
# GH 3729
df_list = [
["B", 12, 12, 12],
[None, 12.3, 233.0, 12],
["A", 123.23, 123, 1],
["B", 1, 1, 1.0],
]
df = pd.DataFrame(df_list, columns=["a", "b", "c", "d"])
grouped = df.groupby("a", dropna=dropna).sum()

expected = pd.DataFrame(outputs, index=pd.Index(idx, dtype="object", name="a"))

tm.assert_frame_equal(grouped, expected, check_index_type=False)


@pytest.mark.parametrize(
"dropna, idx, expected",
[
(True, ["a", "a", "b", np.nan], pd.Series([3, 3], index=["a", "b"])),
(
False,
["a", "a", "b", np.nan],
pd.Series([3, 3, 3], index=["a", "b", np.nan]),
),
],
)
def test_groupby_dropna_series_level(dropna, idx, expected):
ser = pd.Series([1, 2, 3, 3], index=idx)

result = ser.groupby(level=0, dropna=dropna).sum()
tm.assert_series_equal(result, expected)


@pytest.mark.parametrize(
"dropna, expected",
[
(True, pd.Series([210.0, 350.0], index=["a", "b"], name="Max Speed")),
(
False,
pd.Series([210.0, 350.0, 20.0], index=["a", "b", np.nan], name="Max Speed"),
),
],
)
def test_groupby_dropna_series_by(dropna, expected):
ser = pd.Series(
[390.0, 350.0, 30.0, 20.0],
index=["Falcon", "Falcon", "Parrot", "Parrot"],
name="Max Speed",
)

result = ser.groupby(["a", "b", "a", np.nan], dropna=dropna).mean()
tm.assert_series_equal(result, expected)


@pytest.mark.parametrize(
"dropna, tuples, outputs",
[
(
True,
[["A", "B"], ["B", "A"]],
{"c": [13.0, 123.23], "d": [12.0, 123.0], "e": [1.0, 1.0]},
),
(
False,
[["A", "B"], ["A", np.nan], ["B", "A"]],
{
"c": [13.0, 12.3, 123.23],
"d": [12.0, 233.0, 123.0],
"e": [1.0, 12.0, 1.0],
},
),
],
)
def test_groupby_dropna_multi_index_dataframe_agg(dropna, tuples, outputs):
# GH 3729
df_list = [
["A", "B", 12, 12, 12],
["A", None, 12.3, 233.0, 12],
["B", "A", 123.23, 123, 1],
["A", "B", 1, 1, 1.0],
]
df = pd.DataFrame(df_list, columns=["a", "b", "c", "d", "e"])
agg_dict = {"c": sum, "d": max, "e": "min"}
grouped = df.groupby(["a", "b"], dropna=dropna).agg(agg_dict)

mi = pd.MultiIndex.from_tuples(tuples, names=list("ab"))
expected = pd.DataFrame(outputs, index=mi)

tm.assert_frame_equal(grouped, expected, check_index_type=False)