8000 POC: consistent NaN treatment for pyarrow dtypes by jbrockmendel · Pull Request #61732 · pandas-dev/pandas · GitHub
[go: up one dir, main page]

Skip to content

POC: consistent NaN treatment for pyarrow dtypes #61732

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jbrockmendel
Copy link
Member
@jbrockmendel jbrockmendel commented Jun 28, 2025

This is the third of several POCs stemming from the discussion in #61618 (see #61708, #61716). The main goal is to see how invasive it would be.

Specifically, this changes the behavior of pyarrow floating dtypes to treat NaN as distinct from NA in the constructors and __setitem__ (xref #32265)

Notes:

  • This makes the decision to treat NaNs as close-enough to NA when a user explicitly asks for a pyarrow integer dtype. I think this is the right API, but won't check the box until there's a concensus.
  • I still have 113 89 failing tests locally. Most of these are in json, sql, or test_EA_types (which is about csv round-tripping).
  • Finding the mask to pass to pa.array needs optimization.
  • The kludge in NDFrame.where is ugly and fragile.
  • Need to double-check the new expected in the rank test. Maybe re-write the test with NA instead of NaN?
  • This doesn't currently change the behavior of convert_dtypes. Should it?

@jbrockmendel
Copy link
Member Author

@mroeschke when convenient id like to get your thoughts before getting this working. it looks pretty feasible.

Comment on lines +536 to +537
# If user specifically asks to cast a numpy float array with NaNs
# to pyarrow integer, we'll treat those NaNs as NA
Copy link
Member
@mroeschke mroeschke Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally be in favor of a harder break - this should raise like PyArrow does, and auser that want this behavior should fill NaN's first.

In [3]: pa.array(np.array([1.0, np.nan]), from_pandas=False, type=pa.int64())
ArrowInvalid: Float value nan was truncated converting to int64

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be fine with that. I just tried it out-- expecting it to break a ton of tests-- and im only seeing 147 failures (vs 89 without it), so not that bad.

Probably need to add the behavior of convert_dtypes to the checklist above. ATM this branch doesn't change its behavior.

@@ -510,19 +525,32 @@ def _box_pa_array(
value = to_timedelta(value, unit=pa_type.unit).as_unit(pa_type.unit)
value = value.to_numpy()

mask = None
if getattr(value, "dtype", None) is None or value.dtype.kind not in "mfM":
# similar to isna(value) but exclude NaN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're moving towards making NaN distinct from NA (NaN not a missing value), maybe we should eventually make isna have a nan_as_null: bool = True argument

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that may be necessary, but im a little wary of it since it might mean needing to add that argument to everywhere with a skipna keyword

@mroeschke
Copy link
Member

Generally +1 in this direction. Glad to see the changes to make this work are fairly minimal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0