-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
ExtensionArray.take default implementation #20814
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Implements a take interface that's compatible with NumPy and optionally pandas'
NA semantics.
```python
In [1]: import pandas as pd
In [2]: from pandas.tests.extension.decimal.array import *
In [3]: arr = DecimalArray(['1.1', '1.2', '1.3'])
In [4]: arr.take([0, 1, -1])
Out[4]: DecimalArray(array(['1.1', '1.2', '1.3'], dtype=object))
In [5]: arr.take([0, 1, -1], fill_value=float('nan'))
Out[5]: DecimalArray(array(['1.1', '1.2', Decimal('NaN')], dtype=object))
```
Closes pandas-dev#20640
|
Hello @TomAugspurger! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on April 27, 2018 at 11:03 Hours UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marked as WIP for now, because something is going strange with JSONArray. NumPy doesn't like us trying to shove dictionaries into scalar spots of an ndarray I think. Still debugging.
pandas/compat/__init__.py
Outdated
| def is_platform_32bit(): | ||
| return struct.calcsize("P") * 8 < 64 | ||
|
|
||
| _default_fill_value = object() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably isn't the right place for this, but it has to be importable from pandas/arrays/base.py and core/algorithms.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas.core.dtypes.missing or pandas.core.missing
pandas/core/algorithms.py
Outdated
| take_1d = take_nd | ||
|
|
||
|
|
||
| def take_ea(arr, indexer, fill_value=_default_fill_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I threw this in core/algorithms.py. Does it make more sense in core/arrays/base.py, either as a standalone method or as a staticmethod on ExtensionArray?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move it. you need a set of methods that are developer exposed for EA authors that are useful in building the EA, but are not pandas internals. core/arrays/base.py with an core/arrays/api.py for exposing them would be appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this ended up exposed in pd.api.extensions but it isn't clear to me why. are we sure its needed there?
pandas/core/algorithms.py
Outdated
|
|
||
|
|
||
| def take_ea(arr, indexer, fill_value=_default_fill_value): | ||
| """Extension-array compatible take. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just copy-pased the docstring for now. Could do a shared doc if moved.
pandas/core/arrays/base.py
Outdated
| Returns | ||
| ------- | ||
| array-like | ||
| Must satisify NumPy's `take` semantics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n.b.: this doesn't have to be an ndarray. Anything that satisfies NumPy's take (and I suppose masking) semantics will work.
pandas/core/arrays/base.py
Outdated
| from pandas.core.algorithms import take_ea | ||
| from pandas.core.missing import isna | ||
|
|
||
| if isna(fill_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this if condition could be removed if we changed all the callers of pandas.core.algorithms.take* to specify what kind they want (NumPy or pandas-style). But that's a bit of work right now.
| empty = data[:0] | ||
| result = empty.take([-1]) | ||
| na_cmp(result[0], na_value) | ||
| # result = empty.take([-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: should this raise now? Since this is doing NumPy style, as na_value isn't passed to take, then I think it should raise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should then raise (as the indexer is out of bounds?). Numpy and pandas do allow empty takes on empty values:
In [93]: np.array([]).take([])
Out[93]: array([], dtype=float64)
In [94]: pd.Series([]).take([])
Out[94]: Series([], dtype: float64)
In [95]: np.array([]).take([0])
...
IndexError: cannot do a non-empty take from an empty axes.
But we should keep a test like this for the case of allow_fill=True
|
|
||
|
|
||
| class TestGetitem(base.BaseGetitemTests): | ||
| skip_take = pytest.mark.skip(reason="GH-20664.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a WIP. Need to fix categorical to warn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably doing that in a followup PR...
pandas/compat/__init__.py
Outdated
| def is_platform_32bit(): | ||
| return struct.calcsize("P") * 8 < 64 | ||
|
|
||
| _default_fill_value = object() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas.core.dtypes.missing or pandas.core.missing
pandas/core/algorithms.py
Outdated
| take_1d = take_nd | ||
|
|
||
|
|
||
| def take_ea(arr, indexer, fill_value=_default_fill_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move it. you need a set of methods that are developer exposed for EA authors that are useful in building the EA, but are not pandas internals. core/arrays/base.py with an core/arrays/api.py for exposing them would be appropriate
pandas/core/algorithms.py
Outdated
| fill_value : any, optional | ||
| Fill value to use for NA-indicies. This has a few behaviors. | ||
| * fill_value is not specified : triggers NumPy's semantics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, re-reading this, I find this very confusing. Why are you overloading fill_value here and not making this a separate parameter?
IOW, (terrible name), but use_numpy_semanatics=False, fill_value=None why not this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this very confusing
Yeah, it's not just you 😆 I'll see how much work it is to implement your suggestion.
|
Oh fun In [3]: np.array([{'a': 10}])
Out[3]: array([{'a': 10}], dtype=object)
In [4]: np.array([UserDict({'a': 10})])
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-4-400145814467> in <module>()
----> 1 np.array([UserDict({'a': 10})])
/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/collections/__init__.py in __getitem__(self, key)
989 if hasattr(self.__class__, "__missing__"):
990 return self.__class__.__missing__(self, key)
--> 991 raise KeyError(key)
992 def __setitem__(self, key, item): self.data[key] = item
993 def __delitem__(self, key): del self.data[key]
KeyError: 0(we use |
All reactions
Sorry, something went wrong.
|
Quick general comment: why make a EA specific take method instead of just a general one that works for both ndarrays and EAs? (and for EA it would just dispatch to the EA). I am also not sure if it is worth to add the I think it is not too much asked for an ExtensionArray author to implement this themselves. There is also a problem with what a public |
All reactions
Sorry, something went wrong.
I think the way In [1]: from pandas.core.arrays.base import take_ea
In [2]: import numpy as np
In [3]: arr = np.array([1, 2, 3])
In [4]: take_ea(arr, [0, 1, -1])
Out[4]: array([1, 2, 3])Suggestions for a better name?
That seems fine.
This can be private. |
All reactions
Sorry, something went wrong.
The |
All reactions
Sorry, something went wrong.
|
How I had it in my head (but which doesn't mean it is better!), was that such a |
All reactions
Sorry, something went wrong.
pandas/core/algorithms.py
Outdated
| return arr._from_sequence([fill_value] * len(indexer)) | ||
|
|
||
| result = arr.take(indexer) | ||
| result[mask] = fill_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to not use the take_1d machinery here? (maybe there is, or maybe there is no reason to use it since the above is simpler and works fine, but was just wondering)
Sorry, something went wrong.
All reactions
|
We should be able to have a clean implementation of The tricky part is fixing the callers like |
All reactions
Sorry, something went wrong.
|
Pushed another WIP if anyone has some thoughts on the general approach. The extension tests all pass but I'm still breaking things elsewhere, and I think things can be simplified. The main idea is to have a sentinel |
All reactions
Sorry, something went wrong.
|
Regarding |
All reactions
Sorry, something went wrong.
That feeling is correct. Is that not what we want? I can re-add the allow_fill argument. |
All reactions
Sorry, something went wrong.
I am going a bit back and forth on it, not sure ... Regarding what I said above about the dubious meaning of "na_value" (the physical value that is stored, or the public one), I think this is actually not really a problem. It should just be the public one, so the value that you get if you get an item of an EA that is missing (type of |
All reactions
Sorry, something went wrong.
|
Trying to get my head around the current logic. Some pseudo code flow path in case of a Series reindex operation with EA values: (but in case of a reindex/aligment operation, the final So what I find complex about this: a) as said above different meaning of |
All reactions
Sorry, something went wrong.
I think if we use This should be able to be simplified greatly. Let's see... |
All reactions
Sorry, something went wrong.
|
Just brainstorming on an alternative: (I don't know if this scheme is anywhere near clear for somebody that did not write it ...?) This introduces again the second keyword, but eliminates double usecase of But of course this is only a single code path, I don't know how it interacts with all other usages of the take machinery. |
All reactions
Sorry, something went wrong.
Actually, I don't think that's true. There would still be the ambiguity. Anyway, we can leave that up to the EA author I think... Your proposal looks sensible. I'll try to implement it. |
All reactions
Sorry, something went wrong.
I don't think we can do that generically, since we don't know what a valid na_value is. I'll add one in decimal though. |
All reactions
Sorry, something went wrong.
If you fill it with an existing scalar (like But it is actually a good question what should happen when you pass a value that is not accepted by the ExtensionArray. I suppose it is up to the EA to decide on that? (raise an error, or coerce to object) |
All reactions
Sorry, something went wrong.
|
One additional point of feedback from geopandas: From running the extension tests in geopandas, it seems snippet was not sufficient and I needed to add But I don't understand why it is not failing for DecimalArray, as you added an assert to the class constructor that all values passed are actually decimals (so also Decimal('NaN') instead of np.nan), which should catch this problem. |
All reactions
Sorry, something went wrong.
|
Huh... I'll see what I can cook up. Maybe a dtype thing, since DecimalArray is fundamentally object dtype, but geopandas is integer? For cyberpandas, I can't quite use this This does raise the question, should |
All reactions
Sorry, something went wrong.
|
OK, found the reason: I forgot to set the |
All reactions
Sorry, something went wrong.
So that's the question I raised before (and therefore also questioning whether it should be public), but I have now come to a somewhat conclusion that I think this should be the user-facing scalar ipaddres.IPv4Address(0) and not the low-level (0, 0) (so in case of geopandas: None and not 0). Although it makes it not very useful for actual implementations, it's then more kind of an "informative attribute". |
All reactions
Sorry, something went wrong.
That's my conclusion too. I will document this more fully in ExtensionArray.take, as
Yeah. In my case, all of the fields are the same dtype, so I'm able reshape & view as a unit64 ndarray, do the take, and the convert back (without copies). I'll make sure everything works, but all that reshaping is a bit complicated :) |
All reactions
Sorry, something went wrong.
https://github.com/pandas-dev/pandas/pull/20814/files#diff-0267ba650641bdaf3930400c2478b084R203 I think all your comments were addressed. |
All reactions
Sorry, something went wrong.
pandas/core/algorithms.py
Outdated
| With the default ``allow_fill=False``, negative numbers indicate | ||
| slices from the right. | ||
| positional indicies from the right. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indicies -> indices
Sorry, something went wrong.
All reactions
pandas/core/algorithms.py
Outdated
| arr = np.asarray(arr) | ||
|
|
||
| # Do we require int64 or intp here? | ||
| indices = np.asarray(indices, dtype='int') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be intp, which is what take accepts for indexers
Sorry, something went wrong.
All reactions
| fill_value : any, default None | ||
| Fill value to replace -1 values with. If applicable, this should | ||
| use the sentinel missing value for this type. | ||
| indices : sequence of integers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you share this doc?
Sorry, something went wrong.
All reactions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we want people to look at the source code of this (as we explicitly say the code of this base class is kind of the documentation), so I would certainly try to keep this one here (not sure if it would be easy to reuse it for the take function)
Sorry, something went wrong.
All reactions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sharing docs between these two are a bit difficult because of circular import issues. The docstring would have to go in a third module, which isn't desirable.
Sorry, something went wrong.
All reactions
| return indices | ||
|
|
||
|
|
||
| def validate_indices(indices, n): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed, as opposed to maybe_convert_indices? (maybe refactor that a bit?)
Sorry, something went wrong.
All reactions
|
Apparently the random failure in test_merge was not fixed, at least not for all numpy arrays. https://circleci.com/gh/pandas-dev/pandas/13593#tests/containers/2 had a failures. Hopefully eb43fa4, which is ugly, fixes it for good. I ran test_merge 1000 times both before and after that commit and things pass, so who knows :/ |
All reactions
Sorry, something went wrong.
| # na_value is the default NA value to use for this type. This is used in | ||
| # e.g. ExtensionArray.take. This should be the user-facing "boxed" version | ||
| # of the NA value, not the physical NA vaalue for storage. | ||
| na_value = np.nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe even show an example (from JSON?)
Sorry, something went wrong.
All reactions
| return indices | ||
|
|
||
|
|
||
| def validate_indices(indices, n): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so now we have 2 different code paths which do very similar type of validation, I would try to simplify Series._take and NDFrame._take to use these new helper functions, removing the need for maybe_convert_indices entirely.
Sorry, something went wrong.
All reactions
|
@jreback do you have any other comments? |
All reactions
Sorry, something went wrong.
|
I guess this is fine. I am sure this will spawn many more bugs and would like this to live in master for a while. |
All reactions
48C9
summary>
Sorry, something went wrong.
To what are you referring? |
All reactions
Sorry, something went wrong.
| return np.float64, np.nan | ||
|
|
||
| if is_uniform_reindex(join_units): | ||
| # XXX: integrate property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger long shot any recollection what this comment is asking for?
Sorry, something went wrong.
All reactions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the need for that is_uniform_reindex check by including that case in the main if / elif block (which may be done on master now?)
Sorry, something went wrong.
All reactions
Reviewers
jorisvandenbossche
jbrockmendel
jreback
Assignees
No one assignedLabels
Projects
Milestone
0.23.0Development
Successfully m 4420 erging this pull request may close these issues.
API: take interface for (Extension)Array-likes
Implements a take interface that's compatible with NumPy and optionally pandas'
NA semantics.
Closes #20640
git diff upstream/master -u -- "*.py" | flake8 --diff