ENH: allow a duck-typed non NDArray sub-class to use the iterators #4240

jreback · 2014-01-29T15:05:25Z

needs pandas 0.13 to fail
(fails with numpy 1.6 thru 1.9-dev)

import numpy as np
import pandas as pd    

rng = np.arange(5)

rng[rng > 2]                       # works as expected
>>> array([3, 4])

b = pd.Series(rng > 2)
rng[b]                               # doesn't work anymore
>>> IndexError: unsupported iterator index

Prior to pandas 0.13, Series WAS a sub-class of NDArray, so this worked as expected.

Series respects and implements virtually all of the interface to an NDArray.

This exception is coming from: https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/iterators.c#L867

I suspect that the PyArray_Check(obj) is not entered because Series is not a direct-sub class.

If this check were relaxed a bit (or provide a way to emulate this type of check; which without hooking into the c-interface I don't think it possible ATM).

This was probably done for perf reasons I would guess.

thanks.

Jeff

related pandas issue: pandas-dev/pandas#6168

The text was updated successfully, but these errors were encountered:

njsmith · 2014-01-29T16:06:15Z

Not sure what the problem is from the description here - I think you want
to be able to use an arbitrary object as an index into a ndarray? Is that
right?

If so then possibly the right way to do this is to add support for a
numpy_index method, that works just like index except can return
any numpy indexing object, not just an int.
On 29 Jan 2014 15:05, "jreback" notifications@github.com wrote:

needs pandas 0.13 to fail
(fails with numpy 1.6 thru 1.9-dev)

import numpy as np
import pandas as pd

rng = np.arange(5)

rng[rng > 2] # works as expected

array([3, 4])

b = pd.Series(rng > 2)
rng[b] # doesn't work anymore

IndexError: unsupported iterator index

Prior to pandas 0.13, Series WAS a sub-class of NDArray, so this worked
as expected.

Series respects and implements virtually all of the interface to an
NDArray.

This exception is coming from:
https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/iterators.c#L867

I suspect that the PyArray_Check(obj) is not entered because Series is
not a direct-sub class.

If this check were relaxed a bit (or provide a way to emulate this type of
check; which without hooking into the c-interface I don't think it possible
ATM).

This was probably done for perf reasons I would guess.

thanks.

Jeff

related pandas issue: pandas-dev/pandas#6168 pandas-dev/pandas#6168

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/4240
.

jreback · 2014-01-29T16:55:18Z

@njsmith yes..that's exactly what I am suggesting, doesn't seem to me that their is a way ATM.

juliantaylor · 2014-01-29T17:01:51Z

maybe we could try getting an indexing array from a pep3118 buffer, if Series already exposes the buffer pandas might not need changing.

juliantaylor · 2014-01-29T17:06:31Z

hm pandas 0.13 Series does not support the buffer interface anymore
@jreback, is that an intentional change or a regression as its no ndarray subtype anymore?

njsmith · 2014-01-29T17:21:11Z

Yeah, the other option (which I think is a generalization of what you're
suggesting) is basically to try calling asarray() on any unrecognized index
objects?

On Wed, Jan 29, 2014 at 5:01 PM, Julian Taylor notifications@github.comwrote:

maybe we could try getting an indexing array from a pep3118 buffer, if
Series already exposes the buffer pandas might not need changing.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/4240#issuecomment-33604865
.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

jreback · 2014-01-29T17:28:49Z

@juliantaylor this was an intential change to not be a direct sub-class of ndarray any longer, only had a couple of minor incompatibilites: http://pandas.pydata.org/pandas-docs/dev/whatsnew.html#internal-refactoring.

This brings the Series into line with other pandas objects (mainly DataFrame) so can share a lot of code

Series supports completely the python interface (e.g. __array__, only); I am not aware a way to support the buffer interface w/o the c-api, is this right? (but in truth everything goes thru higher level methods in any event).

njsmith · 2014-01-29T17:56:18Z

I'm pretty numpy's indexing code doesn't go through the higher-level
methods, it just grovels arounds in the innards of the passed-in arrays
directly :-)

On Wed, Jan 29, 2014 at 5:28 PM, jreback notifications@github.com wrote:

@juliantaylor https://github.com/juliantaylor this was an intential
change to not be a direct sub-class of ndarray any longer, only had a
couple of minor incompatibilites:
http://pandas.pydata.org/pandas-docs/dev/whatsnew.html#internal-refactoring
.

This brings the Series into line with other pandas objects (mainly
DataFrame) so can share a lot of code

Series supports completely the python interface (e.g. array, only); I
am not aware a way to support the buffer interface w/o the c-api, is this
right? (but in truth everything goes thru higher level methods in any
event).

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/4240#issuecomment-33607899
.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

jreback · 2014-01-29T18:04:46Z

@njsmith oh I meant users going thru Series higher level methods.

It seems that their should be a way for numpy to say, hay you are not a sub-class (e.g. PyArray_Check fails, but you can still obey the iterator protocol, so I'll accept you;
by definition if I return an array when np.array is called on me, then this should work, no?

juliantaylor · 2014-01-29T18:22:24Z

right, just call asarray/PyArray_FromAny on it, that supports __array__ and buffers.

seberg · 2014-01-29T18:28:02Z

It works in my code branch ;).

seberg · 2014-01-29T18:48:19Z

Anyway, if someone wants to attempt to fix it for a 1.8.x release or so. The fancy indexing code should be able to handle this fine (I am pretty sure it calls asarray even now), but the code deciding to trigger fancy indexing needs to be adjusted (you likely probably can use a Series, if you have another array in there somewhere ;)). I don't think it is possible to get everything quite consistent, but that probably doens't matter.

seberg · 2014-01-29T19:06:10Z

Hmmm, spam... But doesn't isn't a Series a python sequence with a length and everything? I can plug in for example array.arrays and it works fine. Heck, even strings work fine (though not a single string), or is there a difference that a Series doesn't implement the C-Api sequence protocol maybe?

seberg · 2014-01-29T19:25:05Z

Actually... This is fun, because it doesn't quite work in my branch:

In [6]: rng[b] 
/usr/bin/ipython:1: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
  #! /usr/bin/python
Out[6]: array([0, 0, 0, 1, 1])

And we would have to find some logic for that, and I am not sure what it would be.

jreback · 2014-01-29T19:28:21Z

@seberg Series implements __array__/dtype. plus all the usual __getitem__, __len__. Looks like an ordered sequence.

jreback · 2014-01-29T19:56:02Z

is numpy calling any special method when trying to figure out what the indexer is?
aside from the type checking, e.g. its not a scalar or slice (and not an ndarray sub-class)

seberg · 2014-01-29T20:00:57Z

didn't check the code, but I would think it calls PySequence_Check(...).

seberg · 2014-01-29T20:23:21Z

That said, my new code would basically just try:/except: the integer case (unless it is an array, where that would be potentially wrong currently). Though I will also double check the array at the end if it is a scalar. However, that part would not matter in how the code is currently written.

Are we talking about fixing this in 1.9 or hacking something for 1.8? The bigger problem in both cases is that your Series here is boolean, but a list (and anything else that is not a plain boolean array) such as [True, False] is interpreted as [1, 0] curretly and I do not see how we can easily distinguish this (if at all).

jreback · 2014-01-29T20:31:05Z

why would you not just try to asarray the indexer? (maybe in a try: except:)

I agree using this is subject to odd dtypes (it really should to be boolean or an int64 indexer array).
but I think you check this in any event.

In [4]: s
Out[4]: 
0    False
1    False
2    False
3     True
4     True
dtype: bool

In [5]: np.array(s)
Out[5]: array([False, False, False,  True,  True], dtype=bool)

seberg · 2014-01-29T20:45:10Z

Because we were just going to do that after a deprecation period which would start in 1.9...

jreback · 2014-01-29T20:51:06Z

ah..ok..np....

seberg · 2014-01-30T10:44:39Z

I somewhat read this issue first as "we have a serious problem here". If it is more a feature request and it is fine if it only works in 1.9 and the boolean case even later, then I think we can consider it handled for the moment. If this is creating serious problems out there, we need to think about how to best solve them, how bad the boolean side is, etc.

jreback · 2014-01-30T11:03:56Z

its fine if it works in >= 1.9

the use case IMHO is pretty narrow; if you are already using Series you are almost always using them for most of the indexing, otherwise you can wrap as needed.

that said; a Series behaves and looks like a 1-dim ndarray, so compat should be their for this case.

this is one of those edge cases where a sub-class ndarray is different than a duck-typed ndarray

seberg · 2014-02-17T17:35:34Z

My branch has been merged in master, so that asarray logic is will now be always consistently used. The boolean discrepancy is of course there until deprecation is done (and that will be a while).

So closing this.

seberg · 2014-02-17T17:53:48Z

That said, I just realized there is a big "be careful" here:

In [15]: a = np.arange(4).reshape(2,2)

In [16]: class C(object):
    def __array__(self):
        return a
    def __len__(self):
        return 2
    def __getitem__(self, item):
        return a[item]
   ....:     

In [17]: np.arange(10)[C()]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-17-4ea877d2b70c> in <module>()
----> 1 np.arange(10)[C()]

IndexError: too many indices for array

This is because the index is interpreted as np.arange(10)[C[0], C[1]] because it is a sequence of length <32 which contains other sequences! We could add a "is an array-like check", that would change behaviour but in a pretty wonky special case.

seberg · 2014-02-17T17:55:01Z

Or does anyone have a nice idea how to solve this ambiguity? In principle both solutions are correct after all...

njsmith · 2014-02-17T19:41:48Z

Shouldn't the rule be that we only have multiple indexes if type(key) ==
tuple?
On 17 Feb 2014 12:55, "seberg" notifications@github.com wrote:

Or does anyone have a nice idea how to solve this ambiguity? In principle
both solutions are correct after all...

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/4240#issuecomment-35307539
.

seberg · 2014-02-17T22:39:54Z

@njsmith, that would be nice, but even numpy tends to use lists to build these "tuples", though I suppose we could be brutal and deprecate that. Or only allow tuples and lists with this funky test. I don't like that length check anyway. Behaviour shouldn't work for large lists and suddenly fail if they get short, even if it probably doesn't matter in practice.

jreback · 2014-02-17T23:12:17Z

@seberg thanks for all the effort on this!

can you ref the commit for this change?

njsmith · 2014-02-17T23:16:25Z

If we've supported lists in the past then I guess we have to continue to do
so, but it really is inherently broken... (When in doubt, refuse the
temptation to guess.)

Anyway we def. shouldn't start allowing more non-tuple types to do this
IMHO, and probably we should deprecate the use of lists too... (I know
people will scream but if it only works for certain types of indexes and
not others then it's hardly a reliable and generally useable feature!)
On 17 Feb 2014 17:39, "seberg" notifications@github.com wrote:

@njsmith https://github.com/njsmith, that would be nice, but even numpy
tends to use lists to build these "tuples", though I suppose we could be
brutal and deprecate that. Or only allow tuples and lists with this funky
test. I don't like that length check anyway. Behaviour shouldn't work for
large lists and suddenly fail if they get short, even if it probably
doesn't matter in practice.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/4240#issuecomment-35329752
.

seberg · 2014-02-17T23:23:04Z

@jreback there is no one commit, the PR is gh-3798, but it is basically all rewritten. https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/mapping.c#L186 is the start of the code branch deciding whether to treat a non-tuple as multiple indices or not (this is unchanged). Array likes will drop through all the checks until reaching https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/mapping.c#L392 at which point asarray is called on them.

@njsmith, I did not change anything in that logic... But since sequences other then list or tuple subclasses which are used in such a way are probably non-existing, we could maybe deprecate at least anything but lists fast?

jreback · 2014-02-17T23:39:48Z

beautiful...thank you much!

In [9]: np.__version__
Out[9]: '1.9.0.dev-b69ee44'

In [10]: rng = np.arange(5)

In [11]: b = pd.Series(rng > 2)

In [12]: rng[b]   
Out[12]: array([0, 0, 0, 1, 1])

In [13]: b
Out[13]: 
0    False
1    False
2    False
3     True
4     True
dtype: bool

seberg · 2014-02-17T23:50:21Z

@jreback, well half beautiful for now, enable deprecation warnings to be shown...

njsmith · 2014-02-17T23:50:31Z

On Mon, Feb 17, 2014 at 6:23 PM, seberg notifications@github.com wrote:

@njsmith https://github.com/njsmith, I did not change anything in that
logic... But since sequences other then list or tuple subclasses which are
used in such a way are probably non-existing, we could maybe deprecate at
least anything but lists fast?

Eh, I guess it doesn't much matter whether we deprecate lists+everything
else, or just immediately remove everything else while deprecating lists.
And you're right that the former is more correct (even if it probably
doesn't make a difference in practice). So sure, let's deprecate everything
but tuples?

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

seberg · 2014-02-17T23:52:15Z

It would require a couple of bla[tuple(whatever)] replacements, but in general I really dislike this logic too. But this needs to be discussed on the mailing list. Will send something soon I guess unless you are faster :)

jreback mentioned this issue Jan 30, 2014

Compatibility with pandas 0.13 matplotlib/matplotlib#2775

Closed

seberg closed this as completed Feb 17, 2014

jreback mentioned this issue Dec 27, 2014

ValueError: setting an array element with a sequence. pandas-dev/pandas#9156

Closed

Uh oh!

ENH: allow a duck-typed non NDArray sub-class to use the iterators #4240

ENH: allow a duck-typed non NDArray sub-class to use the iterators #4240

Comments

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!