-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
POC: consistent NaN treatment for pyarrow dtypes #61732
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@mroeschke when convenient id like to get your thoughts before getting this working. it looks pretty feasible. |
# If user specifically asks to cast a numpy float array with NaNs | ||
# to pyarrow integer, we'll treat those NaNs as NA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would personally be in favor of a harder break - this should raise like PyArrow does, and auser that want this behavior should fill NaN's first.
In [3]: pa.array(np.array([1.0, np.nan]), from_pandas=False, type=pa.int64())
ArrowInvalid: Float value nan was truncated converting to int64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be fine with that. I just tried it out-- expecting it to break a ton of tests-- and im only seeing 147 failures (vs 89 without it), so not that bad.
Probably need to add the behavior of convert_dtypes
to the checklist above. ATM this branch doesn't change its behavior.
@@ -510,19 +525,32 @@ def _box_pa_array( | |||
value = to_timedelta(value, unit=pa_type.unit).as_unit(pa_type.unit) | |||
value = value.to_numpy() | |||
|
|||
mask = None | |||
if getattr(value, "dtype", None) is None or value.dtype.kind not in "mfM": | |||
# similar to isna(value) but exclude NaN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're moving towards making NaN
distinct from NA
(NaN not a missing value), maybe we should eventually make isna
have a nan_as_null: bool = True
argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that may be necessary, but im a little wary of it since it might mean needing to add that argument to everywhere with a skipna keyword
Generally +1 in this direction. Glad to see the changes to make this work are fairly minimal |
This is the third of several POCs stemming from the discussion in #61618 (see #61708, #61716). The main goal is to see how invasive it would be.
Specifically, this changes the behavior of pyarrow floating dtypes to treat NaN as distinct from NA in the constructors and
__setitem__
(xref #32265)Notes:
11389 failing tests locally. Most of these are in json, sql, or test_EA_types (which is about csv round-tripping).