8000 ENH: allow a duck-typed non NDArray sub-class to use the iterators · Issue #4240 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: allow a duck-typed non NDArray sub-class to use the iterators #4240

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jreback opened this issue Jan 29, 2014 · 34 comments
Closed

ENH: allow a duck-typed non NDArray sub-class to use the iterators #4240

jreback opened this issue Jan 29, 2014 · 34 comments

Comments

@jreback
Copy link
jreback commented Jan 29, 2014

needs pandas 0.13 to fail
(fails with numpy 1.6 thru 1.9-dev)

import numpy as np
import pandas as pd    

rng = np.arange(5)

rng[rng > 2]                       # works as expected
>>> array([3, 4])

b = pd.Series(rng > 2)
rng[b]                               # doesn't work anymore
>>> IndexError: unsupported iterator index

Prior to pandas 0.13, Series WAS a sub-class of NDArray, so this worked as expected.

Series respects and implements virtually all of the interface to an NDArray.

This exception is coming from: https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/iterators.c#L867

I suspect that the PyArray_Check(obj) is not entered because Series is not a direct-sub class.

If this check were relaxed a bit (or provide a way to emulate this type of check; which without hooking into the c-interface I don't think it possible ATM).

This was probably done for perf reasons I would guess.

thanks.

Jeff

related pandas issue: pandas-dev/pandas#6168

@njsmith
Copy link
Member
njsmith commented Jan 29, 2014

Not sure what the problem is from the description here - I think you want
to be able to use an arbitrary object as an index into a ndarray? Is that
right?

If so then possibly the right way to do this is to add support for a
numpy_index method, that works just like index except can return
any numpy indexing object, not just an int.
On 29 Jan 2014 15:05, "jreback" notifications@github.com wrote:

needs pandas 0.13 to fail
(fails with numpy 1.6 thru 1.9-dev)

import numpy as np
import pandas as pd

rng = np.arange(5)

rng[rng > 2] # works as expected

array([3, 4])

b = pd.Series(rng > 2)
rng[b] # doesn't work anymore

IndexError: unsupported iterator index

Prior to pandas 0.13, Series WAS a sub-class of NDArray, so this worked
as expected.

Series respects and implements virtually all of the interface to an
NDArray.

This exception is coming from:
https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/iterators.c#L867

I suspect that the PyArray_Check(obj) is not entered because Series is
not a direct-sub class.

If this check were relaxed a bit (or provide a way to emulate this type of
check; which without hooking into the c-interface I don't think it possible
ATM).

This was probably done for perf reasons I would guess.

thanks.

Jeff

related pandas issue: pandas-dev/pandas#6168pandas-dev/pandas#6168


Reply to this email directly or view it on GitHubhttps://github.com//issues/4240
.

@jreback
Copy link
Author
jreback commented Jan 29, 2014

@njsmith yes..that's exactly what I am suggesting, doesn't seem to me that their is a way ATM.

@juliantaylor
Copy link
Contributor

maybe we could try getting an indexing array from a pep3118 buffer, if Series already exposes the buffer pandas might not need changing.

@juliantaylor
Copy link
Contributor

hm pandas 0.13 Series does not support the buffer interface anymore
@jreback, is that an intentional change or a regression as its no ndarray subtype anymore?

@njsmith
Copy link
Member
njsmith commented Jan 29, 2014

Yeah, the other option (which I think is a generalization of what you're
suggesting) is basically to try calling asarray() on any unrecognized index
objects?

On Wed, Jan 29, 2014 at 5:01 PM, Julian Taylor notifications@github.comwrote:

maybe we could try getting an indexing array from a pep3118 buffer, if
Series already exposes the buffer pandas might not need changing.


Reply to this email directly or view it on GitHubhttps://github.com//issues/4240#issuecomment-33604865
.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@jreback
Copy link
Author
jreback commented Jan 29, 2014

@juliantaylor this was an intential change to not be a direct sub-class of ndarray any longer, only had a couple of minor incompatibilites: http://pandas.pydata.org/pandas-docs/dev/whatsnew.html#internal-refactoring.

This brings the Series into line with other pandas objects (mainly DataFrame) so can share a lot of code

Series supports completely the python interface (e.g. __array__, only); I am not aware a way to support the buffer interface w/o the c-api, is this right? (but in truth everything goes thru higher level methods in any event).

@njsmith
Copy link
Member
njsmith commented Jan 29, 2014

I'm pretty numpy's indexing code doesn't go through the higher-level
methods, it just grovels arounds in the innards of the passed-in arrays
directly :-)

On Wed, Jan 29, 2014 at 5:28 PM, jreback notifications@github.com wrote:

@juliantaylor https://github.com/juliantaylor this was an intential
change to not be a direct sub-class of ndarray any longer, only had a
couple of minor incompatibilites:
http://pandas.pydata.org/pandas-docs/dev/whatsnew.html#internal-refactoring
.

This brings the Series into line with other pandas objects (mainly
DataFrame) so can share a lot of code

Series supports completely the python interface (e.g. array, only); I
am not aware a way to support the buffer interface w/o the c-api, is this
right? (but in truth everything goes thru higher level methods in any
event).


Reply to this email directly or view it on GitHubhttps://github.com//issues/4240#issuecomment-33607899
.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@jreback
Copy link
Author
jreback commented Jan 29, 2014

@njsmith oh I meant users going thru Series higher level methods.

It seems that their should be a way for numpy to say, hay you are not a sub-class (e.g. PyArray_Check fails, but you can still obey the iterator protocol, so I'll accept you;
by definition if I return an array when np.array is called on me, then this should work, no?

@juliantaylor
Copy link
Contributor

right, just call asarray/PyArray_FromAny on it, that supports __array__ and buffers.

@seberg
Copy link
Member
seberg commented Jan 29, 2014

It works in my code branch ;).

@seberg
Copy link
Member
seberg commented Jan 29, 2014

Anyway, if someone wants to attempt to fix it for a 1.8.x release or so. The fancy indexing code should be able to handle this fine (I am pretty sure it calls asarray even now), but the code deciding to trigger fancy indexing needs to be adjusted (you likely probably can use a Series, if you have another array in there somewhere ;)). I don't think it is possible to get everything quite consistent, but that probably doens't matter.

@seberg
Copy link
Member
seberg commented Jan 29, 2014

Hmmm, spam... But doesn't isn't a Series a python sequence with a length and everything? I can plug in for example array.arrays and it works fine. Heck, even strings work fine (though not a single string), or is there a difference that a Series doesn't implement the C-Api sequence protocol maybe?

@seberg
Copy link
Member
seberg commented Jan 29, 2014

Actually... This is fun, because it doesn't quite work in my branch:

In [6]: rng[b] 
/usr/bin/ipython:1: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
  #! /usr/bin/python
Out[6]: array([0, 0, 0, 1, 1])

And we would have to find some logic for that, and I am not sure what it would be.

@jreback
Copy link
Author
jreback commented Jan 29, 2014

@seberg Series implements __array__/dtype. plus all the usual __getitem__, __len__. Looks like an ordered sequence.

@jreback
Copy link
Author
jreback commented Jan 29, 2014

is numpy calling any special method when trying to figure out what the indexer is?
aside from the type checking, e.g. its not a scalar or slice (and not an ndarray sub-class)

@seberg
Copy link
Member
seberg commented Jan 29, 2014

didn't check the code, but I would think it calls PySequence_Check(...).

@seberg
Copy link
Member
seberg commented Jan 29, 2014

That said, my new code would basically just try:/except: the integer case (unless it is an array, where that would be potentially wrong currently). Though I will also double check the array at the end if it is a scalar. However, that part would not matter in how the code is currently written.

Are we talking about fixing this in 1.9 or hacking something for 1.8? The bigger problem in both cases is that your Series here is boolean, but a list (and anything else that is not a plain boolean array) such as [True, False] is interpreted as [1, 0] curretly and I do not see how we can easily distinguish this (if at all).

@jreback
Copy link
Author
jreback commented Jan 29, 2014

why would you not just try to asarray the indexer? (maybe in a try: except:)

I agree using this is subject to odd dtypes (it really should to be boolean or an int64 indexer array).
but I think you check this in any event.

In [4]: s
Out[4]: 
0    False
1    False
2    False
3     True
4     True
dtype: bool

In [5]: np.array(s)
Out[5]: array([False, False, False,  True,  True], dtype=bool)

@seberg
Copy link
Member
seberg commented Jan 29, 2014

Because we were just going to do that after a deprecation period which would start in 1.9...

@jreback
Copy link
Author
jreback commented Jan 29, 2014

ah..ok..np....

@seberg
Copy link
Member
seberg commented Jan 30, 2014

I somewhat read this issue first as "we have a serious problem here". If it is more a feature request and it is fine if it only works in 1.9 and the boolean case even later, then I think we can consider it handled for the moment. If this is creating serious problems out there, we need to think about how to best solve them, how bad the boolean side is, etc.

8000

@jreback
Copy link
Author
jreback commented Jan 30, 2014

its fine if it works in >= 1.9

the use case IMHO is pretty narrow; if you are already using Series you are almost always using them for most of the indexing, otherwise you can wrap as needed.

that said; a Series behaves and looks like a 1-dim ndarray, so compat should be their for this case.

this is one of those edge cases where a sub-class ndarray is different than a duck-typed ndarray

@seberg
Copy link
Member
seberg commented Feb 17, 2014

My branch has been merged in master, so that asarray logic is will now be always consistently used. The boolean discrepancy is of course there until deprecation is done (and that will be a while).

So closing this.

@seberg seberg closed this as completed Feb 17, 2014
@seberg
Copy link
Member
seberg commented Feb 17, 2014

That said, I just realized there is a big "be careful" here:

In [15]: a = np.arange(4).reshape(2,2)

In [16]: class C(object):
    def __array__(self):
        return a
    def __len__(self):
        return 2
    def __getitem__(self, item):
        return a[item]
   ....:     

In [17]: np.arange(10)[C()]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-17-4ea877d2b70c> in <module>()
----> 1 np.arange(10)[C()]

IndexError: too many indices for array

This is because the index is interpreted as np.arange(10)[C[0], C[1]] because it is a sequence of length <32 which contains other sequences! We could add a "is an array-like check", that would change behaviour but in a pretty wonky special case.

@seberg
Copy link
Member
seberg commented Feb 17, 2014

Or does anyone have a nice idea how to solve this ambiguity? In principle both solutions are correct after all...

@njsmith
Copy link
Member
njsmith commented Feb 17, 2014

Shouldn't the rule be that we only have multiple indexes if type(key) ==
tuple?
On 17 Feb 2014 12:55, "seberg" notifications@github.com wrote:

Or does anyone have a nice idea how to solve this ambiguity? In principle
both solutions are correct after all...


Reply to this email directly or view it on GitHubhttps://github.com//issues/4240#issuecomment-35307539
.

@seberg
Copy link
Member
seberg commented Feb 17, 2014

@njsmith, that would be nice, but even numpy tends to use lists to build these "tuples", though I suppose we could be brutal and deprecate that. Or only allow tuples and lists with this funky test. I don't like that length check anyway. Behaviour shouldn't work for large lists and suddenly fail if they get short, even if it probably doesn't matter in practice.

@jreback
Copy link
Author
jreback commented Feb 17, 2014

@seberg thanks for all the effort on this!

can you ref the commit for this change?

@njsmith
Copy link
Member
njsmith commented Feb 17, 2014

If we've supported lists in the past then I guess we have to continue to do
so, but it really is inherently broken... (When in doubt, refuse the
temptation to guess.)

Anyway we def. shouldn't start allowing more non-tuple types to do this
IMHO, and probably we should deprecate the use of lists too... (I know
people will scream but if it only works for certain types of indexes and
not others then it's hardly a reliable and generally useable feature!)
On 17 Feb 2014 17:39, "seberg" notifications@github.com wrote:

@njsmith https://github.com/njsmith, that would be nice, but even numpy
tends to use lists to build these "tuples", though I suppose we could be
brutal and deprecate that. Or only allow tuples and lists with this funky
test. I don't like that length check anyway. Behaviour shouldn't work for
large lists and suddenly fail if they get short, even if it probably
doesn't matter in practice.


Reply to this email directly or view it on GitHubhttps://github.com//issues/4240#issuecomment-35329752
.

@seberg
Copy link
Member
seberg commented Feb 17, 2014

@jreback there is no one commit, the PR is gh-3798, but it is basically all rewritten. https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/mapping.c#L186 is the start of the code branch deciding whether to treat a non-tuple as multiple indices or not (this is unchanged). Array likes will drop through all the checks until reaching https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/mapping.c#L392 at which point asarray is called on them.

@njsmith, I did not change anything in that logic... But since sequences other then list or tuple subclasses which are used in such a way are probably non-existing, we could maybe deprecate at least anything but lists fast?

@jreback
Copy link
Author
jreback commented Feb 17, 2014

beautiful...thank you much!

In [9]: np.__version__
Out[9]: '1.9.0.dev-b69ee44'

In [10]: rng = np.arange(5)

In [11]: b = pd.Series(rng > 2)

In [12]: rng[b]   
Out[12]: array([0, 0, 0, 1, 1])

In [13]: b
Out[13]: 
0    False
1    False
2    False
3     True
4     True
dtype: bool

@seberg
Copy link
Member
seberg commented Feb 17, 2014

@jreback, well half beautiful for now, enable deprecation warnings to be shown...

@njsmith
Copy link
Member
njsmith commented Feb 17, 2014

On Mon, Feb 17, 2014 at 6:23 PM, seberg notifications@github.com wrote:

@njsmith https://github.com/njsmith, I did not change anything in that
logic... But since sequences other then list or tuple subclasses which are
used in such a way are probably non-existing, we could maybe deprecate at
least anything but lists fast?

Eh, I guess it doesn't much matter whether we deprecate lists+everything
else, or just immediately remove everything else while deprecating lists.
And you're right that the former is more correct (even if it probably
doesn't make a difference in practice). So sure, let's deprecate everything
but tuples?

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@seberg
Copy link
Member
seberg commented Feb 17, 2014

It would require a couple of bla[tuple(whatever)] replacements, but in general I really dislike this logic too. But this needs to be discussed on the mailing list. Will send something soon I guess unless you are faster :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
0