Override column __getitem__ to return int item as ndarray not column #3095

taldcroft · 2014-11-11T15:40:28Z

This was pulled from #2790 and is pending further discussion of when table columns should be dropped to ndarray.

mhvk · 2014-11-11T16:17:30Z

Just a reminder about the discussion in #2790 and in separate e-mails, that @astrofrog and I argued for __getitem__ always returning the same type of object, though we didn't quite converge on whether this should always be an ndarray or always a Column.

My tendency would be to always return a Column even for single items, since I think the metadata is just as relevant for a single item. Here, one can think in analogy with Row, which still holds all the meta information for a single row of a table.

embray · 2014-11-11T18:22:00Z

I don't have a strong opinion about this either way. There has to be some default behavior for __getitem__ regardless. But if some other access mode loses out because of that, we could still add some other kind of indexing proxy for different access modes a la .ix, .loc, etc. in pandas.DataFrame.

astrofrog · 2014-11-11T19:04:08Z

@mhvk - just one clarification - Column.meta is different from Table.meta. If you access the meta on a row then it returns the Table.meta but if you then access an item from a row it also currently drops any Row-ness:

In [17]: t[1]['col1']
Out[17]: 3

so by symmetry we should get the same for:

In [17]: t['col1'][1]
Out[17]: 3

I think that just the symmetry is enough to show that we can't get a column if we slice a column, because we don't get a row if we slice a row, and we want to make sure that we get the same whether we access the rows first or columns first.

mhvk · 2014-11-11T19:14:39Z

Yes, the current behaviour is consistent in the sense you imply; the only inconsistency is columns holding multiple dimensions; my Row example is somewhat of a red herring. Fortunately, if we make any changes to __getitem__ it will affect the examples in the same way.

Anyway, I'll let those speak who are more likely to use columns in their present form -- I'll be moving fully to SmartTable with quantity columns as soon as it is available!

mhvk · 2014-11-11T20:09:07Z

Just so it is clear the current behaviour is most definitely broken (see entry [5]):

In [1]: from astropy.table import Table

In [2]: t = Table([[[1.111111, 2.22222], [4.4444444, 5.55555]], [4, 6]], names=('a', 'b'))

In [3]: print(t)
       a [2]          b 
-------------------- ---
 1.111111 .. 2.22222   4
4.4444444 .. 5.55555   6

In [4]: print(t['a'])
       a [2]        
--------------------
 1.111111 .. 2.22222
4.4444444 .. 5.55555

In [5]: print(t['a'][0])
   a    
--------
1.111111
 2.22222

In [6]: print(t[0])
<Row 0 of table
 values=([1.111111, 2.22222], 4)
 dtype=[('a', '<f8', (2,)), ('b', '<i8')]>

taldcroft · 2014-11-11T20:10:51Z

My inclination (and intent with this patch) was to basically follow the numpy subclassing behavior, which is (I think):

Slicing and fancy indexing return the subclass (Column).
Accessing a single item returns an ndarray version of that element

This is the same as Pandas Series behavior (noting that Series can't contain multi-dim elements, but I think that single item access would return an ndarray version of the element).

EDIT: by "ndarray version of that element", I meant a numpy scalar or array as relevant.
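
The numpy subclassing behavior described above can be checked with a minimal sketch. `MyColumn` is a hypothetical bare ndarray subclass used only for illustration; it is not astropy's Column and carries no metadata.

```python
import numpy as np


class MyColumn(np.ndarray):
    """Hypothetical bare ndarray subclass, standing in for a column."""


one_d = np.array([4, 6]).view(MyColumn)
two_d = np.array([[1.111111, 2.22222], [4.4444444, 5.55555]]).view(MyColumn)

sliced = one_d[0:1]      # slicing keeps the subclass
fancy = one_d[[0, 1]]    # fancy indexing keeps the subclass
scalar = one_d[0]        # single item of a 1-d array is a numpy scalar
element = two_d[0]       # single item of a 2-d array is still the subclass

print(type(sliced).__name__, type(fancy).__name__)
print(type(scalar).__name__, type(element).__name__)
```

In this sketch the 1-d case already drops to a numpy scalar, while the 2-d case keeps the subclass unless `__getitem__` is overridden, which appears to be the multi-dim case this patch targets.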

taldcroft · 2014-11-11T20:13:06Z

This PR makes Column consistent with Row (as of 0.4):

In [7]: t = Table([[[1.111111, 2.22222], [4.4444444, 5.55555]], [4, 6]], names=('a', 'b'))

In [8]: print t[0]['a']
[ 1.111111  2.22222 ]

In [9]: print type(t[0]['a'])
<type 'numpy.ndarray'>

taldcroft · 2014-11-11T20:16:20Z

So in terms of precedent from Numpy itself and Pandas, my vote is to mostly leave Column item access the same but put in this patch to make multi-dim elements behave like scalars.

This is independent of what happens when Columns are involved in arithmetic operations or whatnot in the __array_wrap__ stage.

taldcroft · 2014-11-30T22:54:04Z

@mhvk @embray @astrofrog - any thoughts on this?

Here are my thoughts on the current options that have been discussed (along with comment on precedent from other packages):

1. Changing __getitem__ so that a Column is always returned, even for a single item like t['a'][5], is a big API change because people expect np scalar types (for 1-d columns). There is also a huge performance hit (factor of ~50) because making a Column is expensive. The xray table package does this.
2. Changing __getitem__ to always return a pure numpy object (ndarray or np scalar) is likewise a big API change since slicing has returned a Column since the beginning, and it is natural to expect slicing or fancy indexing to return the same subclass. The bcolz table package always returns pure numpy for specialized reasons, including limited API of the carray object.
3. Changing __getitem__ to always return a pure numpy object for an integer item access. This is the smallest API change (only the return type for multi-dim columns), and the current behavior is considered "clearly broken". The pandas package does this.
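
The third option above (integer item access returns a pure numpy object, everything else keeps the subclass) can be sketched roughly as follows. `SketchColumn` is a hypothetical ndarray subclass written for illustration; it is an assumption about the general shape of such an override, not the code merged in this PR.

```python
import numpy as np


class SketchColumn(np.ndarray):
    """Hypothetical column-like ndarray subclass; not astropy's Column."""

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            # Integer item: index a plain-ndarray view so the result is a
            # numpy scalar (1-d column) or a plain ndarray (multi-dim column).
            return self.view(np.ndarray)[item]
        # Slices, fancy indexing, etc. fall through and keep the subclass.
        return super().__getitem__(item)


col_a = np.array([[1.111111, 2.22222], [4.4444444, 5.55555]]).view(SketchColumn)
col_b = np.array([4, 6]).view(SketchColumn)

print(type(col_a[0]).__name__)    # plain ndarray for a multi-dim element
print(type(col_a[0:1]).__name__)  # slicing still returns the subclass
print(col_b[1])                   # numpy scalar for a 1-d column
```

The design choice here is that only the integer branch is special-cased, so the cost of building a column object is avoided for single-item access while slicing behaves as before.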

astrofrog · 2014-12-01T09:40:12Z

@taldcroft - I think I like 3 best. So just to be clear, this means that accessing a single 'cell' in the table will return a Numpy scalar or array, whereas selecting multiple rows from a column will still return a column. Is this correct? If so, then 👍 from me.

astrofrog · 2014-12-01T09:40:57Z

It would be good to include a test which also includes comments for each test case to explain the behavior we expect in each case.

taldcroft · 2014-12-01T13:36:38Z

Added tests.

taldcroft · 2014-12-01T14:47:48Z

@embray @astrofrog - there are 3 Travis fails here, all on Python 3.x at the point where a method docstring in astropy.time tries to import matplotlib. This is unrelated to this PR. Any ideas? I dug around a little but don't understand the test system well enough.

=================================== FAILURES ===================================
_______________________ [doctest] astropy.time.core.val ________________________
1546     """
1547     Matplotlib `~matplotlib.pyplot.plot_date` input:
1548     1 + number of days from 0001-01-01 00:00:00 UTC
1549 
1550     This can be used directly in the matplotlib `~matplotlib.pyplot.plot_date`
1551     function::
1552 
1553       >>> import matplotlib.pyplot as plt
UNEXPECTED EXCEPTION: ImportError("No module named 'matplotlib'",)
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/test/lib/python3.3/doctest.py", line 1313, in __run
    compileflags, 1), test.globs)
  File "<doctest astropy.time.core.val[0]>", line 1, in <module>
ImportError: No module named 'matplotlib'

astrofrog · 2014-12-01T14:54:56Z

That doctest should be skipped if the optional dependencies are not installed. Not sure what changed, and how it works on 2.7 though...
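
As a generic illustration of skipping a test when an optional dependency is missing, here is a minimal standard-library sketch. It is not astropy's doctest machinery; `HAS_MATPLOTLIB` and `PlotExample` are hypothetical names chosen for this example.

```python
import importlib.util
import unittest

# True only when the optional package can be found on the import path.
HAS_MATPLOTLIB = importlib.util.find_spec("matplotlib") is not None


class PlotExample(unittest.TestCase):
    @unittest.skipUnless(HAS_MATPLOTLIB, "matplotlib is not installed")
    def test_pyplot_import(self):
        # Only runs when the optional dependency is available.
        import matplotlib.pyplot as plt

        self.assertTrue(hasattr(plt, "plot"))
```

With a guard like this the test is reported as skipped rather than failed on environments that lack the optional package.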

taldcroft · 2014-12-01T15:40:08Z

> Not sure what changed, and how it works on 2.7 though...

Exactly.

mhvk · 2014-12-01T16:25:15Z

@taldcroft - OK with option 3, at least for now. We probably will have to revisit a little when we do the Time, etc., columns. For those, the "column-ness" is much lighter and the column-data is added to the object, so it is less clear it should be propagated. Anyway, that's for #3011.

taldcroft · 2014-12-01T17:02:36Z

I restarted one of the failing test cases and it passed this time, so I've restarted the other two as well. Hopefully it'll work 🙏.

embray · 2014-12-01T22:35:04Z

@taldcroft I've seen that error before--it seems to happen non-deterministically. I really want to get to the bottom of it at some point because it is annoying and bizarre, but incredibly difficult to debug. As @astrofrog points out, that test should always be skipped.

embray · 2014-12-01T22:47:30Z

AH, I think I figured it out. See #3169.

…ormat classes to the TIME_FORMAT and TIME_DELTA_FORMAT dicts at class creation time instead of after the fact. This should also reduce the effort required to define new formats external to Astropy. This may understandably look overengineered, but it does have the advantage of fixing the annoyance mentioned here: astropy#3095 (comment) by not doing any module-level hackery.

taldcroft · 2014-12-02T10:23:18Z

Tests passing and now ready for final review.

astrofrog · 2014-12-02T10:51:03Z

👍

taldcroft added the table label Nov 11, 2014

taldcroft added this to the v1.0.0 milestone Nov 11, 2014

taldcroft self-assigned this Nov 11, 2014

taldcroft mentioned this pull request Nov 11, 2014

Remove numpy structured array as data container in Table #2790

Merged

Override column __getitem__ to return int item as ndarray not column …

1064be3

…[skip ci]

taldcroft added 2 commits December 1, 2014 08:28

Add tests for integer item access to column

2360686

Update CHANGES.rst

fd83529

taldcroft force-pushed the table-np-item branch from 7d626ce to fd83529 Compare December 1, 2014 13:35

taldcroft added the Affects-release label Dec 1, 2014

Include Py2 long type in getitem and test all int types

a374307

embray mentioned this pull request Dec 1, 2014

Add TimeFormatMeta and TimeDeltaFormatMeta #3169

Merged

taldcroft added the Ready-for-final-review label Dec 2, 2014

taldcroft added a commit that referenced this pull request Dec 2, 2014

Merge pull request #3095 from taldcroft/table-np-item

0e69c4d

Override column __getitem__ to return int item as ndarray not column

taldcroft merged commit 0e69c4d into astropy:master Dec 2, 2014

taldcroft deleted the table-np-item branch December 2, 2014 14:00

taldcroft mentioned this pull request Dec 25, 2014

0e69c4db introduced asv regression #3232

Closed

embray removed the Ready-for-final-review label Jan 13, 2015

taldcroft mentioned this pull request Jul 8, 2015

Back out #3095 and remove the Column __getitem__ override #3929

Closed

taldcroft added a commit to taldcroft/astropy that referenced this pull request Jul 8, 2015

Back out astropy#3095 and remove the Column __getitem__ override

e43aaf2

mhvk mentioned this pull request Sep 28, 2017

Remove unnessary numpy lookups for int, float, complex, bool and str #6603

Merged
