POC: consistent NaN treatment for pyarrow dtypes #61732

jbrockmendel · 2025-06-28T17:23:26Z

This is the third of several POCs stemming from the discussion in #61618 (see #61708, #61716). The main goal is to see how invasive it would be.

Specifically, this changes the behavior of pyarrow floating dtypes to treat NaN as distinct from NA in the constructors and __setitem__ (xref #32265).

Notes:

- This makes the decision to treat NaNs as close enough to NA when a user explicitly asks for a pyarrow integer dtype. I think this is the right API, but won't check the box until there's a consensus.
- I still have ~~113~~ 89 failing tests locally. Most of these are in json, sql, or test_EA_types (which is about csv round-tripping).
- Finding the mask to pass to pa.array needs optimization.
- The kludge in NDFrame.where is ugly and fragile.
- Need to double-check the new expected in the rank test. Maybe rewrite the test with NA instead of NaN?
- This doesn't currently change the behavior of convert_dtypes. Should it?

jbrockmendel · 2025-06-30T15:19:46Z

@mroeschke when convenient I'd like to get your thoughts before getting this working. It looks pretty feasible.

mroeschke · 2025-06-30T16:59:25Z

pandas/core/arrays/arrow/array.py

+                # If user specifically asks to cast a numpy float array with NaNs
+                #  to pyarrow integer, we'll treat those NaNs as NA


I would personally be in favor of a harder break: this should raise like PyArrow does, and a user who wants this behavior should fill NaNs first.

In [3]: pa.array(np.array([1.0, np.nan]), from_pandas=False, type=pa.int64())
ArrowInvalid: Float value nan was truncated converting to int64

I'd be fine with that. I just tried it out, expecting it to break a ton of tests, and I'm only seeing 147 failures (vs 89 without it), so not that bad.

Probably need to add the behavior of convert_dtypes to the checklist above. ATM this branch doesn't change its behavior.

mroeschke · 2025-06-30T17:02:16Z

pandas/core/arrays/arrow/array.py

@@ -510,19 +525,32 @@ def _box_pa_array(
                value = to_timedelta(value, unit=pa_type.unit).as_unit(pa_type.unit)
                value = value.to_numpy()

+            mask = None

            
+            if getattr(value, "dtype", None) is None or value.dtype.kind not in "mfM":
+                # similar to isna(value) but exclude NaN


If we're moving towards making NaN distinct from NA (NaN not a missing value), maybe we should eventually make isna have a nan_as_null: bool = True argument.

That may be necessary, but I'm a little wary of it since it might mean needing to add that argument everywhere that has a skipna keyword.

mroeschke · 2025-06-30T17:03:05Z

Generally +1 in this direction. Glad to see the changes to make this work are fairly minimal.

jbrockmendel added 2 commits June 28, 2025 10:07

POC: consistent NaN treatment for pyarrow dtypes

3fad33f

comment

f1e8ba0

mroeschke reviewed Jun 30, 2025
