Odd behaviour of groupby-agg with certain boolean columns and multi-index #21240
Comments
Do you have any way to simplify this example? FWIW, you could just sum your boolean columns to get the same result.
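A minimal sketch of the `sum` suggestion, assuming a two-level index and two boolean columns. The index level names `k1`/`k2` and all data values are hypothetical and are not taken from the original report.

```python
import pandas as pd

# Hypothetical data: a two-level index and two boolean columns.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 1), ("a", 2), ("b", 1), ("b", 1)],
    names=["k1", "k2"],
)
df = pd.DataFrame(
    {
        "C1": [True, False, True, True, True],
        "C2": [False, False, True, False, True],
    },
    index=index,
)

# True counts as 1 and False as 0, so summing a boolean column
# gives the number of True values in each group.
counts = df.groupby(level=["k1", "k2"]).sum()
print(counts)
```

Because `sum` is a built-in groupby reduction, no Python function has to be called once per group.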
This simplified example still shows the erroneous behaviour:
Whereas using
OK, thanks for the last example. I suppose the difference is that `agg` casts the result of your anonymous function back to boolean when the result contains only zeros and ones, and `apply` does not. This also explains why none of the other columns were cast back to boolean (they had more than just 0s and 1s) and why having N != 1 would not cast back to boolean either. There are a variety of ways to handle this from the end-user perspective, whether through explicit typing or by using something like `sum`, which would be more idiomatic and performant than what you are trying to do anyway. That said, investigation and PRs to align the casting rules for `apply` and `agg` are always welcome!
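One way the "explicit typing" option could look is sketched below: cast the aggregated result to an integer dtype so that it no longer depends on whether the result was cast back to boolean. The data and the level names `k1`/`k2` are hypothetical, and this is only one possible workaround.

```python
import pandas as pd

# Hypothetical data: one row per index combination and a boolean column.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
    names=["k1", "k2"],
)
df = pd.DataFrame({"C3": [True, False, False, True]}, index=index)

grouped = df.groupby(level=["k1", "k2"])["C3"]

# Count True values per group, then state the result dtype explicitly.
# If the counts came back as booleans, True/False map to 1/0 here.
counted = grouped.agg(lambda s: len(s[s])).astype("int64")

# The same counts via sum, also cast explicitly for comparison.
summed = grouped.sum().astype("int64")

print(counted)
print(summed)
```

The cast is safe in this situation because a cast back to boolean can only happen when every count is 0 or 1.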
xref #14873 is the root issue here.
Problem Description
I stumbled over some very specific odd behaviour when trying to do a multi-index groupby over several boolean columns using `agg` with a dictionary. The dictionary applies a function to each column that counts the number of True values by using the boolean series as a boolean index on itself and taking the length of the result.
Example code:
The result I'm getting is:
As you can see, the second column isn't properly aggregated for some reason.
If I set `NN = 10`, I'm getting a proper result:
The problem only happens when using a multi-index, only with the `C3` column, only when `NN=1`, and only when using the `agg` function (it works fine with `apply` on the single column), so it can be reduced to this call:
It seems like the specific combination of the multi-index and the boolean values causes some odd behaviour.
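A rough sketch of how `agg` and `apply` could be compared on a single boolean column grouped by a two-level index. The helper name `count_true`, the level names and the `C3` values are hypothetical; they are not the data that triggered the problem, and whether the two dtypes differ depends on the pandas version in use.

```python
import pandas as pd

def count_true(s):
    # Index the boolean series with itself and count what is left.
    return len(s[s])

# Hypothetical data: one row per index combination.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
    names=["k1", "k2"],
)
df = pd.DataFrame({"C3": [False, True, True, False]}, index=index)

grouped = df.groupby(level=["k1", "k2"])["C3"]

via_agg = grouped.agg(count_true)
via_apply = grouped.apply(count_true)

# Comparing the dtypes shows whether one path cast the counts to bool.
print("agg dtype:  ", via_agg.dtype)
print("apply dtype:", via_apply.dtype)
```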
EDIT: I did not write the `C3` data specifically to cause the error; I just played around with random combinations until the error appeared, so I have no clue why exactly that combination causes the problem.
In my real application this happens to two of three columns in a dataframe with 570 rows, so it's not limited to very short dataframes like in the example.
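The kind of setup described above, a dictionary passed to `agg` that counts True values per column over a two-level index, might be sketched as follows. Everything here is an assumption made for illustration: the column values, the level names `k1`/`k2`, the helper `count_true`, and the reading of `NN` as the number of rows per index combination.

```python
import pandas as pd

NN = 1  # assumed meaning: number of rows per index combination

keys = [("a", 1), ("a", 2), ("b", 1), ("b", 2)]
index = pd.MultiIndex.from_tuples(
    [key for key in keys for _ in range(NN)],
    names=["k1", "k2"],
)

# Hypothetical boolean patterns, one value per index combination,
# each repeated NN times to match the index.
base = {
    "C1": [True, False, True, False],
    "C2": [True, True, False, False],
    "C3": [False, True, True, False],
}
df = pd.DataFrame(
    {name: [v for v in values for _ in range(NN)] for name, values in base.items()},
    index=index,
)

def count_true(s):
    # Index the boolean series with itself and count what is left.
    return len(s[s])

result = df.groupby(level=["k1", "k2"]).agg(
    {"C1": count_true, "C2": count_true, "C3": count_true}
)
print(result)
print(result.dtypes)
```

Printing `result.dtypes` makes it easy to see whether any column came back as boolean instead of integer on a given pandas version.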
Output of `pd.show_versions()`
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None