-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ExtensionArray.take default implementation #20814
Conversation
Implements a take interface that's compatible with NumPy and optionally pandas' NA semantics. ```python In [1]: import pandas as pd In [2]: from pandas.tests.extension.decimal.array import * In [3]: arr = DecimalArray(['1.1', '1.2', '1.3']) In [4]: arr.take([0, 1, -1]) Out[4]: DecimalArray(array(['1.1', '1.2', '1.3'], dtype=object)) In [5]: arr.take([0, 1, -1], fill_value=float('nan')) Out[5]: DecimalArray(array(['1.1', '1.2', Decimal('NaN')], dtype=object)) ``` Closes pandas-dev#20640
Hello @TomAugspurger! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on April 27, 2018 at 11:03 Hours UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marked as WIP for now, because something is going strange with JSONArray. NumPy doesn't like us trying to shove dictionaries into scalar spots of an ndarray I think. Still debugging.
pandas/compat/__init__.py
Outdated
@@ -451,3 +451,5 @@ def is_platform_mac(): | |||
|
|||
def is_platform_32bit(): | |||
return struct.calcsize("P") * 8 < 64 | |||
|
|||
_default_fill_value = object() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably isn't the right place for this, but it has to be importable from pandas/arrays/base.py
and core/algorithms.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas.core.dtypes.missing
or pandas.core.missing
pandas/core/algorithms.py
Outdated
@@ -1558,6 +1559,81 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None, | |||
take_1d = take_nd | |||
|
|||
|
|||
def take_ea(arr, indexer, fill_value=_default_fill_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I threw this in core/algorithms.py
. Does it make more sense in core/arrays/base.py
, either as a standalone method or as a staticmethod on ExtensionArray?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move it. you need a set of methods that are developer exposed for EA authors that are useful in building the EA, but are not pandas internals. core/arrays/base.py with an core/arrays/api.py for exposing them would be appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this ended up exposed in pd.api.extensions but it isn't clear to me why. are we sure its needed there?
pandas/core/algorithms.py
Outdated
@@ -1558,6 +1559,81 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None, | |||
take_1d = take_nd | |||
|
|||
|
|||
def take_ea(arr, indexer, fill_value=_default_fill_value): | |||
"""Extension-array compatible take. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just copy-pased the docstring for now. Could do a shared doc if moved.
pandas/core/arrays/base.py
Outdated
Returns | ||
------- | ||
array-like | ||
Must satisify NumPy's `take` semantics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n.b.: this doesn't have to be an ndarray. Anything that satisfies NumPy's take
(and I suppose masking) semantics will work.
pandas/core/arrays/base.py
Outdated
from pandas.core.algorithms import take_ea | ||
from pandas.core.missing import isna | ||
|
||
if isna(fill_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this if
condition could be removed if we changed all the callers of pandas.core.algorithms.take*
to specify what kind they want (NumPy or pandas-style). But that's a bit of work right now.
@@ -134,12 +134,29 @@ def test_take(self, data, na_value, na_cmp): | |||
|
|||
def test_take_empty(self, data, na_value, na_cmp): | |||
empty = data[:0] | |||
result = empty.take([-1]) | |||
na_cmp(result[0], na_value) | |||
# result = empty.take([-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: should this raise now? Since this is doing NumPy style, as na_value
isn't passed to take
, then I think it should raise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should then raise (as the indexer is out of bounds?). Numpy and pandas do allow empty takes on empty values:
In [93]: np.array([]).take([])
Out[93]: array([], dtype=float64)
In [94]: pd.Series([]).take([])
Out[94]: Series([], dtype: float64)
In [95]: np.array([]).take([0])
...
IndexError: cannot do a non-empty take from an empty axes.
But we should keep a test like this for the case of allow_fill=True
@@ -81,18 +81,32 @@ def test_merge(self, data, na_value): | |||
|
|||
|
|||
class TestGetitem(base.BaseGetitemTests): | |||
skip_take = pytest.mark.skip(reason="GH-20664.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a WIP. Need to fix categorical to warn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably doing that in a followup PR...
pandas/compat/__init__.py
Outdated
@@ -451,3 +451,5 @@ def is_platform_mac(): | |||
|
|||
def is_platform_32bit(): | |||
return struct.calcsize("P") * 8 < 64 | |||
|
|||
_default_fill_value = object() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas.core.dtypes.missing
or pandas.core.missing
pandas/core/algorithms.py
Outdated
@@ -1558,6 +1559,81 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None, | |||
take_1d = take_nd | |||
|
|||
|
|||
def take_ea(arr, indexer, fill_value=_default_fill_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move it. you need a set of methods that are developer exposed for EA authors that are useful in building the EA, but are not pandas internals. core/arrays/base.py with an core/arrays/api.py for exposing them would be appropriate
pandas/core/algorithms.py
Outdated
fill_value : any, optional | ||
Fill value to use for NA-indicies. This has a few behaviors. | ||
|
||
* fill_value is not specified : triggers NumPy's semantics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, re-reading this, I find this very confusing. Why are you overloading fill_value here and not making this a separate parameter?
IOW, (terrible name), but use_numpy_semanatics=False, fill_value=None
why not this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this very confusing
Yeah, it's not just you 😆 I'll see how much work it is to implement your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although, there are two things going on here:
np.nan
being swapped forExtensionArray.dtype.na_value
.fill_value
being overloaded to switch the meaning of negative indexers.
Which bit is confusing? If I'm able to change things so that fill_value
is simply passed through, would that be OK? Then someone specifying ExtensionArray.take(indexer, fill_value=thing)
would mean -1
is an NA marker. ExtensionArray.take(indexer)
would mean -1
is the last element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not overloading fill_value
is how it is currently, with both allow_fill
and fill_value
. If we find that clearer, I am fine with going back to that (although it would be nice to only need one keyword for this).
I have more an objection on the np.nan swapping. It would be nice if we could prevent this, but didn't look in detail enough to assess whether this would be possible or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although, if we make sure to not pass through a default fill_value
, you would in most cases only need to specify allow_fill=True
to get pandas semantics, instead of needing to specify both allow_fill
and fill_value
. So maybe that is not that bad.
Currenlty, eg Series.reindex
calls under the hood (via Block.take_nd) take_1d
on the Block values, by passing through the Block.fill_value
as default fill_value
. Could we make sure to not pass anything through in case of ExtensionBlocks, or to let the ExtensionBlock.fill_value be the appropriate value? (to avoid the swapping of np.nan
)
Oh fun In [3]: np.array([{'a': 10}])
Out[3]: array([{'a': 10}], dtype=object)
In [4]: np.array([UserDict({'a': 10})])
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-4-400145814467> in <module>()
----> 1 np.array([UserDict({'a': 10})])
/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/collections/__init__.py in __getitem__(self, key)
989 if hasattr(self.__class__, "__missing__"):
990 return self.__class__.__missing__(self, key)
--> 991 raise KeyError(key)
992 def __setitem__(self, key, item): self.data[key] = item
993 def __delitem__(self, key): del self.data[key]
KeyError: 0 (we use |
Quick general comment: why make a EA specific take method instead of just a general one that works for both ndarrays and EAs? (and for EA it would just dispatch to the EA). I am also not sure if it is worth to add the
I think it is not too much asked for an ExtensionArray author to implement this themselves. There is also a problem with what a public |
I think the way In [1]: from pandas.core.arrays.base import take_ea
In [2]: import numpy as np
In [3]: arr = np.array([1, 2, 3])
In [4]: take_ea(arr, [0, 1, -1])
Out[4]: array([1, 2, 3]) Suggestions for a better name?
That seems fine.
This can be private. |
The |
How I had it in my head (but which doesn't mean it is better!), was that such a |
pandas/core/algorithms.py
Outdated
return arr._from_sequence([fill_value] * len(indexer)) | ||
|
||
result = arr.take(indexer) | ||
result[mask] = fill_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to not use the take_1d
machinery here? (maybe there is, or maybe there is no reason to use it since the above is simpler and works fine, but was just wondering)
We should be able to have a clean implementation of The tricky part is fixing the callers like |
Pushed another WIP if anyone has some thoughts on the general approach. The extension tests all pass but I'm still breaking things elsewhere, and I think things can be simplified. The main idea is to have a sentinel |
Regarding |
That feeling is correct. Is that not what we want? I can re-add the allow_fill argument. |
I am going a bit back and forth on it, not sure ... Regarding what I said above about the dubious meaning of "na_value" (the physical value that is stored, or the public one), I think this is actually not really a problem. It should just be the public one, so the value that you get if you get an item of an EA that is missing (type of |
Trying to get my head around the current logic. Some pseudo code flow path in case of a Series reindex operation with EA values:
(but in case of a reindex/aligment operation, the final So what I find complex about this: a) as said above different meaning of |
I think if we use This should be able to be simplified greatly. Let's see... |
Just brainstorming on an alternative:
(I don't know if this scheme is anywhere near clear for somebody that did not write it ...?) This introduces again the second keyword, but eliminates double usecase of But of course this is only a single code path, I don't know how it interacts with all other usages of the take machinery. |
Actually, I don't think that's true. There would still be the ambiguity. Anyway, we can leave that up to the EA author I think... Your proposal looks sensible. I'll try to implement it. |
I don't think we can do that generically, since we don't know what a valid na_value is. I'll add one in decimal though. |
If you fill it with an existing scalar (like But it is actually a good question what should happen when you pass a value that is not accepted by the ExtensionArray. I suppose it is up to the EA to decide on that? (raise an error, or coerce to object) |
One additional point of feedback from geopandas: From running the extension tests in geopandas, it seems
snippet was not sufficient and I needed to add But I don't understand why it is not failing for DecimalArray, as you added an assert to the class constructor that all values passed are actually decimals (so also Decimal('NaN') instead of np.nan), which should catch this problem. |
Huh... I'll see what I can cook up. Maybe a dtype thing, since DecimalArray is fundamentally object dtype, but geopandas is integer? For cyberpandas, I can't quite use this This does raise the question, should |
OK, found the reason: I forgot to set the |
So that's the question I raised before (and therefore also questioning whether it should be public), but I have now come to a somewhat conclusion that I think this should be the user-facing scalar ipaddres.IPv4Address(0) and not the low-level (0, 0) (so in case of geopandas: None and not 0). Although it makes it not very useful for actual implementations, it's then more kind of an "informative attribute". |
That's my conclusion too. I will document this more fully in ExtensionArray.take, as
Yeah. In my case, all of the fields are the same dtype, so I'm able reshape & view as a unit64 ndarray, do the take, and the convert back (without copies). I'll make sure everything works, but all that reshaping is a bit complicated :) |
https://github.com/pandas-dev/pandas/pull/20814/files#diff-0267ba650641bdaf3930400c2478b084R203 I think all your comments were addressed. |
pandas/core/algorithms.py
Outdated
@@ -1504,7 +1506,7 @@ def take(arr, indexer, allow_fill=False, fill_value=None): | |||
>>> from pandas.api.extensions import take | |||
|
|||
With the default ``allow_fill=False``, negative numbers indicate | |||
slices from the right. | |||
positional indicies from the right. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indicies -> indices
pandas/core/algorithms.py
Outdated
arr = np.asarray(arr) | ||
|
||
# Do we require int64 or intp here? | ||
indices = np.asarray(indices, dtype='int') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be intp, which is what take accepts for indexers
fill_value : any, default None | ||
Fill value to replace -1 values with. If applicable, this should | ||
10000 td> | use the sentinel missing value for this type. | |
indices : sequence of integers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you share this doc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we want people to look at the source code of this (as we explicitly say the code of this base class is kind of the documentation), so I would certainly try to keep this one here (not sure if it would be easy to reuse it for the take
function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sharing docs between these two are a bit difficult because of circular import issues. The docstring would have to go in a third module, which isn't desirable.
mask = (indices >= n) | (indices < 0) | ||
if mask.any(): | ||
raise IndexError("indices are out-of-bounds") | ||
return indices | ||
|
||
|
||
def validate_indices(indices, n): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed, as opposed to maybe_convert_indices? (maybe refactor that a bit?)
Apparently the random failure in test_merge was not fixed, at least not for all numpy arrays. https://circleci.com/gh/pandas-dev/pandas/13593#tests/containers/2 had a failures. Hopefully eb43fa4, which is ugly, fixes it for good. I ran test_merge 1000 times both before and after that commit and things pass, so who knows :/ |
pandas/core/dtypes/base.py
Outdated
# na_value is the default NA value to use for this type. This is used in | ||
# e.g. ExtensionArray.take. This should be the user-facing "boxed" version | ||
# of the NA value, not the physical NA vaalue for storage. | ||
na_value = np.nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe even show an example (from JSON?)
mask = (indices >= n) | (indices < 0) | ||
if mask.any(): | ||
raise IndexError("indices are out-of-bounds") | ||
return indices | ||
|
||
|
||
def validate_indices(indices, n): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so now we have 2 different code paths which do very similar type of validation, I would try to simplify Series._take
and NDFrame._take
to use these new helper functions, removing the need for maybe_convert_indices
entirely.
@jreback do you have any other comments? |
I guess this is fine. I am sure this will spawn many more bugs and would like this to live in master for a while. |
To what are you referring? |
@@ -5457,6 +5471,12 @@ def get_empty_dtype_and_na(join_units): | |||
if blk is None: | |||
return np.float64, np.nan | |||
|
|||
if is_uniform_reindex(join_units): | |||
# XXX: integrate property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger long shot any recollection what this comment is asking for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the need for that is_uniform_reindex
check by including that case in the main if / elif
block (which may be done on master now?)
Implements a take interface that's compatible with NumPy and optionally pandas'
NA semantics.
Closes #20640
git diff upstream/master -u -- "*.py" | flake8 --diff