8000 ENH: general concat with ExtensionArrays through find_common_type by jorisvandenbossche · Pull Request #33607 · pandas-dev/pandas · GitHub
ENH: general concat with ExtensionArrays through find_common_type #33607


Merged: 14 commits, May 2, 2020

Conversation

jorisvandenbossche
Member

Exploring what is being discussed in #22994. @TomAugspurger your idea seems to be working nicely! (it removes almost as much code as it adds (apart from tests/docs), and fixes the EA concat bugs ;))

Few notes compared to proposal in #22994:

  • Since we already have a find_common_type function, I decided to use this as the "get_concat_dtype", since it seems to do what is needed for concat
  • Extending the EA interface with an ExtensionDtype._get_common_type method that is used in pandas' find_common_type function to dispatch the logic to the extension type

What I already handled:

  • general protocol, using this for concat with axis=0 (e.g. concatenating Series) and using it as an example for IntegerDtype
  • handling categoricals this way (which removes the concat_categorical helper). This turned up a few cases where we currently have value-dependent behaviour, which we can't easily preserve (mainly regarding NaNs and int dtype, see below)

Still need to handle sparse (that has failing tests now) and maybe datetime, and check other failures.
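To make the proposed dispatch concrete, here is a minimal, self-contained sketch of the protocol described above. All names here (ToyExtensionDtype, ToyIntegerDtype) are illustrative stand-ins, not pandas internals; only the shape of the hook mirrors the PR's proposal:

```python
import numpy as np


class ToyExtensionDtype:
    """Toy stand-in for pandas' ExtensionDtype, illustrating the hook
    proposed in this PR (here spelled _get_common_type)."""

    def _get_common_type(self, dtypes):
        # Return the common dtype for `dtypes`, or None to signal that
        # there is no common dtype other than object.
        raise NotImplementedError


class ToyIntegerDtype(ToyExtensionDtype):
    """Toy nullable-integer dtype implementing the hook."""

    def _get_common_type(self, dtypes):
        if all(isinstance(t, ToyIntegerDtype) for t in dtypes):
            return ToyIntegerDtype()
        return None  # mixed with something else: fall back to object


def find_common_type(types):
    """Sketch of the dispatch: if any dtype is an extension dtype, let it
    decide; otherwise fall back to numpy's promotion rules."""
    for t in types:
        if isinstance(t, ToyExtensionDtype):
            result = t._get_common_type(types)
            return np.dtype(object) if result is None else result
    return np.result_type(*types)
```

With this shape, concat only needs to call find_common_type on the input dtypes and cast each array to the result before concatenating; the extension type itself owns the promotion logic.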

Comment on lines 111 to 121
if (
    is_categorical_dtype(arr.dtype)
    and isinstance(dtype, np.dtype)
    and np.issubdtype(dtype, np.integer)
):
    # problem case: categorical of int -> gives int as result dtype,
    # but categorical can contain NAs -> fall back to object dtype
    try:
        return arr.astype(dtype, copy=False)
    except ValueError:
        return arr.astype(object, copy=False)
Member Author


This complication is to try to preserve some of the value-dependent behaviour of Categorical (in case of integer categories: missing values present or not?)

E.g. when concatenating an integer categorical with an integer series:

pd.concat([pd.Series([1, 2], dtype="category"), pd.Series([3, 4])]) 
-> results in int dtype
pd.concat([pd.Series([1, None], dtype="category"), pd.Series([3, 4])]) 
-> results in object dtype

Currently, when concatenating, a Categorical with integer categories gets converted to an int numpy array if no missing values are present, but to an object numpy array if missing values are present (to preserve the integers).
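The value-dependent rule can be illustrated with a small toy outside pandas. This is not pandas' implementation; it just mimics the described behaviour, using the convention that a categorical stores integer codes into a categories array, with -1 marking a missing value:

```python
import numpy as np

# Toy illustration (not pandas internals): integer categories,
# codes index into them, -1 means "missing".
categories = np.array([1, 2], dtype="int64")


def densify(codes):
    """Mimic the value-dependent rule: int64 result when no missing
    values are present, object (so NaN fits alongside ints) otherwise."""
    if (codes == -1).any():
        out = np.empty(len(codes), dtype=object)
        for i, c in enumerate(codes):
            out[i] = np.nan if c == -1 else categories[c]
        return out
    return categories.take(codes)


densify(np.array([0, 1]))   # dense int64: no missing values
densify(np.array([0, -1]))  # object: NaN cannot live in an int64 array
```

Because an int64 numpy array has no slot for NaN, the result dtype depends on the *values* (whether -1 appears in the codes), which is exactly what makes this behaviour hard to express through a purely dtype-based find_common_type.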

@jorisvandenbossche
Member Author
jorisvandenbossche commented Apr 17, 2020

I needed to make one change to the tests related to categorical (another value-dependent behaviour). Consider these examples involving an integer categorical and float arrays:

In [13]: pd.concat([pd.Series([1, None], dtype="category"), pd.Series([3.5, 4.5])])    
Out[13]: 
0      1
1    NaN
0    3.5
1    4.5
dtype: object

In [14]: pd.concat([pd.Series([1, 2], dtype="category"), pd.Series([3.5, 4.5])])    
Out[14]: 
0    1.0
1    2.0
0    3.5
1    4.5
dtype: float64

With find_common_type, we will always find float. But the old behaviour is that the integer categorical becomes object if missing values are present.
But I would say: if you are concatting with a float anyway, it doesn't make sense to try to preserve the "integer categories" by keeping the result object instead of casting to float.

So I would say that the new behaviour (always returning float in the above two examples) is better.

@@ -322,3 +323,33 @@ def _is_boolean(self) -> bool:
bool
"""
return False

def _get_common_type(self, dtypes: List[DtypeObj]) -> Optional[DtypeObj]:
Contributor


Mmm can we keep the return type as ExtensionDtype? Do you envision cases where we'd like to return a plain NumPy dtype?

Oh... I suppose tz-naive DatetimeArray might break this, since it wants to return a NumPy dtype...

Member Author


Yes, that was my first thought as well. But right now, e.g. Categorical can end up as any kind of numpy dtype (depending on the dtype of its categories).

As long as not all dtypes have an EA version yet, I don't think it is feasible to require ExtensionDtype here.
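As an illustration of why the return type cannot be restricted to ExtensionDtype, here is a hedged toy (CategoricalLikeDtype is made up for this sketch, not pandas code): a categorical-like dtype whose common dtype with a plain numpy dtype is found by promoting the underlying category dtypes, which yields a numpy dtype, not an extension dtype.

```python
import numpy as np


class CategoricalLikeDtype:
    """Toy stand-in (not pandas' CategoricalDtype): the categories can
    have any numpy dtype, e.g. int64 or float64."""

    def __init__(self, categories_dtype):
        self.categories_dtype = np.dtype(categories_dtype)

    def _get_common_type(self, dtypes):
        # Promote the underlying category dtypes together with any plain
        # numpy dtypes in the mix. The result is a plain numpy dtype,
        # which is why the hook's return annotation is Optional[DtypeObj]
        # rather than ExtensionDtype.
        underlying = [
            t.categories_dtype if isinstance(t, CategoricalLikeDtype) else t
            for t in dtypes
        ]
        return np.result_type(*underlying)
```

For example, an int64 categorical combined with a float64 numpy dtype promotes to float64, a numpy dtype.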

@jorisvandenbossche
Member Author

I ran into one other hairy question:

  • Should _get_common_type be able to handle "not fully initialized" dtype objects, like a CategoricalDtype without categories (which we need to handle for the "category" string)?
    And if so, what is the expected behaviour?

Right now I added some special cases in Categorical._get_common_type to have the find_common_type tests passing, but I am not sure it is necessarily doing the correct thing in a general way.
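To make the open question concrete, here is one purely illustrative convention for the uninitialized case, as a toy sketch. The class name, and especially the chosen behaviour (an uninitialized dtype matches any categorical and the result stays uninitialized), are assumptions for discussion, not what the PR implements:

```python
class ToyCategoricalDtype:
    """Toy categorical-like dtype; categories=None mimics the bare
    "category" string, i.e. a dtype that is not fully initialized."""

    def __init__(self, categories=None):
        self.categories = categories

    def _get_common_type(self, dtypes):
        if not all(isinstance(t, ToyCategoricalDtype) for t in dtypes):
            return None  # mixed with non-categorical: no categorical result
        if any(t.categories is None for t in dtypes):
            # assumed convention: an uninitialized dtype is compatible with
            # any categorical, and the common dtype is itself uninitialized
            return ToyCategoricalDtype(None)
        # fully initialized: take the union of the categories
        merged = sorted(set().union(*(t.categories for t in dtypes)))
        return ToyCategoricalDtype(merged)
```

Whether the uninitialized result should instead adopt the categories of the initialized operand (or raise) is exactly the kind of question the comment above leaves open.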

@jorisvandenbossche jorisvandenbossche added API Design Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. labels Apr 17, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Apr 17, 2020
@@ -95,6 +95,15 @@ def construct_array_type(cls) -> Type["IntegerArray"]:
"""
return IntegerArray

def _get_common_type(self, dtypes: List[DtypeObj]) -> Optional[DtypeObj]:
Member


Should this be common_type or common_dtype? We've been loose about this distinction so far and I think it has caused ambiguity.

Member Author


I don't care that much. I mainly used "type" because it is meant to be used in find_common_type.

(That find_common_type name is inspired by the numpy function, and that one actually handles both dtypes and scalar types, which I assume is the reason for the name. The pandas version, though, doesn't really make the distinction, so it could have been named "find_common_dtype".)

Member Author


Renamed to "common_dtype" instead of "common_type". The internal function that uses this is still find_common_type, but that name from numpy is actually a misnomer here, since we are only dealing with dtypes, and not scalar types.

Member


Thanks for indulging me on this nitpick