8000 Override column __getitem__ to return int item as ndarray not column by taldcroft · Pull Request #3095 · astropy/astropy · GitHub
[go: up one dir, main page]

Skip to content

Override column __getitem__ to return int item as ndarray not column #3095

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 2, 2014

Conversation

taldcroft
Copy link
Member

This was pulled from #2790 and is pending further discussion of when table columns should be dropped to ndarray.

@mhvk
Copy link
Contributor
mhvk commented Nov 11, 2014

Just a reminder about the discussion in #2790 and in separate e-mails, that @astrofrog and I argued for __getitem__ always returning the same type of object, though we didn't quite converge on whether this should always be an ndarray or always a Column.

My tendency would be to always return a Column even for single items, is that I think the metadata is just as relevant for a single item. Here, one can think in analogy with Row, which still holds all the meta information for a single row of a table.

@embray
Copy link
Member
embray commented Nov 11, 2014

I don't have a strong opinion about this either way. There has to be some default behavior for __getitem__ regardless. But if there's some other access mode that loses out because of that we could still add some other kind of indexing proxy for different access modes a la .ix, .loc, etc. in pandas.DataFrame

@astrofrog
Copy link
Member

@mhvk - just one clarification - Column.meta is different from Table.meta. If you access the meta on a row then it returns the Table.meta but if you then access an item from a row it also currently drops any Row-ness:

In [17]: t[1]['col1']
Out[17]: 3

so by symmetry we should get the same for:

In [17]: t['col1'][1]
Out[17]: 3

I think that just the symmetry is enough to show that we can't get a column if we slice a column, because we don't get a row if we slice a row, and we want to make sure that we get the same whether we access the rows first or columns first.

@mhvk
Copy link
Contributor
mhvk commented Nov 11, 2014

Yes, the current behaviour is consistent in the sense you imply; the only inconsistency is columns holding multiple dimensions; my Row example is somewhat of a red herring. Fortunately, if we make any changes to __getitem__ it will affect the examples in the same way.

Anyway, I'll let those speak who are more likely to use columns in their present form -- I'll be moving fully to SmartTable with quantity columns as soon as it is available!

@mhvk
Copy link
Contributor
mhvk commented Nov 11, 2014

Just so it is clear the current behaviour is most definitely broken (see entry [5]):

In [1]: from astropy.table import Table

In [2]: t = Table([[[1.111111, 2.22222], [4.4444444, 5.55555]], [4, 6]], names=('a', 'b'))

In [3]: print(t)
       a [2]          b 
-------------------- ---
 1.111111 .. 2.22222   4
4.4444444 .. 5.55555   6

In [4]: print(t['a'])
       a [2]        
--------------------
 1.111111 .. 2.22222
4.4444444 .. 5.55555

In [5]: print(t['a'][0])
   a    
--------
1.111111
 2.22222

In [6]: print(t[0])
<Row 0 of table
 values=([1.111111, 2.22222], 4)
 dtype=[('a', '<f8', (2,)), ('b', '<i8')]>

@taldcroft
Copy link
Member Author

My inclination (and intent with this patch) was to basically follow the numpy subclassing behavior, which is (I think):

  • Slicing and fancy indexing return the subclass (Column).
  • Accessing a single item returns an ndarray version of that element

This is the same as Pandas Series behavior (noting that Series can't contain multi-dim elements, but I think that single item access would return an ndarray version of the element).

EDIT: by "ndarray version of that element", I meant a numpy scalar or array as relevant.

@taldcroft
Copy link
Member Author

This PR makes Column consistent with Row (as of 0.4):

In [7]: t = Table([[[1.111111, 2.22222], [4.4444444, 5.55555]], [4, 6]], names=('a', 'b'))

In [8]: print t[0]['a']
[ 1.111111  2.22222 ]

In [9]: print type(t[0]['a'])
<type 'numpy.ndarray'>

@taldcroft
Copy link
Member Author

So in terms of precedence from Numpy itself and Pandas, my vote is to mostly leave Column item access the same but put in this patch to make multi-dim elements behave like scalars.

This is independent of what happens when Columns are involved in arithmetic operations or whatnot in the __array_wrap__ stage.

@taldcroft
Copy link
Member Author

@mhvk @embray @astrofrog - any thoughts on this?

Here are my thoughts on the current options that have been discussed (along with comment on precedent from other packages):

  1. Changing __getitem__ so that a Column is always returned, even for a single item like t['a'][5] is a big API change because people expect np scalar types (for 1-d columns). There is also a huge performance hit (factor of ~50) because making a Column is expensive. The xray table package does this.
  2. Changing __getitem__ to always return a pure numpy object (ndarray or np scalar) is likewise a big API change since slicing has returned a Column since the beginning, and it is natural to expect slicing or fancy indexing to return the same subclass. The bcolz table package always returns pure numpy for specialized reasons, including limited API of the carray object.
  3. Changing __getitem__ to always return a pure numpy object for an integer item access. This is the smallest API change (only the return type for multi-dim columns), and the current behavior is considered "clearly broken". The pandas package does this.

@astrofrog
Copy link
Member

@taldcroft - I think I like 3 best. So just to be clear, this means that accessing a single 'cell' in the table will return a Numpy scalar or array, whereas selecting multiple rows from a column will still return a column. Is this correct? If so, then 👍 from me.

< 8000 /div>

@astrofrog
Copy link
Member

It would be good to include a test which also includes comments for each test case to explain the behavior we expect in each case.

@taldcroft
Copy link
Member Author

Added tests.

@taldcroft
Copy link
Member Author

@embray @astrofrog - there are 3 Travis fails here, all on Python 3.x at the point where a method docstring in astropy.time tries to import matplotlib. This is unrelated to this PR. Any ideas? I dug around a little but don't understand the test system well enough.

=================================== FAILURES ===================================
_______________________ [doctest] astropy.time.core.val ________________________
1546     """
1547     Matplotlib `~matplotlib.pyplot.plot_date` input:
1548     1 + number of days from 0001-01-01 00:00:00 UTC
1549 
1550     This can be used directly in the matplotlib `~matplotlib.pyplot.plot_date`
1551     function::
1552 
1553       >>> import matplotlib.pyplot as plt
UNEXPECTED EXCEPTION: ImportError("No module named 'matplotlib'",)
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/test/lib/python3.3/doctest.py", line 1313, in __run
    compileflags, 1), test.globs)
  File "<doctest astropy.time.core.val[0]>", line 1, in <module>
ImportError: No module named 'matplotlib'

@astrofrog
Copy link
Member

That doctest should be skipped if the optional dependencies are not installed. Not sure what changed, and how it works on 2.7 though...

@taldcroft
Copy link
Member Author

Not sure what changed, and how it works on 2.7 though...

Exactly.

@mhvk
Copy link
Contributor
mhvk commented Dec 1, 2014

@taldcroft - OK with option 3, at least for now. We probably will have to revisit a little when we do the TIme, etc., columns. For those, the "column-ness" is much lighter and the column-data is added to the object, so it is less clear it should be propagated. Anyway, that's for #3011.

@taldcroft
Copy link
Member Author

I restarted one of the failing test cases and it passed this time, so I've restarted the other two as well. Hopefully it'll work 🙏.

@embray
Copy link
Member
embray commented Dec 1, 2014

@taldcroft I've seen that error before--it seems to happen non-deterministically. I really want to get to the bottom of it at some point because it is annoying an bizarre, but incredibly difficult to debug. As @astrofrog points out, that test should always be skipped.

@embray
Copy link
Member
embray commented Dec 1, 2014

AH, I think I figured it out. See #3169.

embray added a commit to embray/astropy that referenced this pull request Dec 1, 2014
…ormat classes to the TIME_FORMAT and TIME_DELTA_FORMAT dicts at class creation time instead of after the fact. This should also reduce the effort required to define new formats external to Astropy. This may understandably look overengineered, but it does have the advantage of fixing the annoyance mentioned here: astropy#3095 (comment) by not doing any module-level hackery.
@taldcroft
Copy link
Member Author

Tests passing and now ready for final review.

@astrofrog
Copy link
Member

👍

taldcroft added a commit that referenced this pull request Dec 2, 2014
Override column __getitem__ to return int item as ndarray not column
@taldcroft taldcroft merged commit 0e69c4d into astropy:master Dec 2, 2014
@taldcroft taldcroft deleted the table-np-item branch December 2, 2014 14:00
embray added a commit to embray/astropy that referenced this pull request Dec 2, 2014
…ormat classes to the TIME_FORMAT and TIME_DELTA_FORMAT dicts at class creation time instead of after the fact. This should also reduce the effort required to define new formats external to Astropy. This may understandably look overengineered, but it does have the advantage of fixing the annoyance mentioned here: astropy#3095 (comment) by not doing any module-level hackery.
taldcroft added a commit to taldcroft/astropy that referenced this pull request Jul 8, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0