8000 ENH: add np.stack by shoyer · Pull Request #5605 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: add np.stack #5605

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 12, 2015
Merged

ENH: add np.stack #5605

merged 1 commit into from
May 12, 2015

Conversation

shoyer
Copy link
Member
@shoyer shoyer commented Feb 25, 2015

The motivation here is to present a uniform and N-dimensional interface for joining arrays along a new axis, similarly to how concatenate provides a uniform and N-dimensional interface for joining arrays along an existing axis.

Background

Currently, users can choose between hstack, vstack, column_stack and dstack, but none of these functions handle N-dimensional input. In my opinion, it's also difficult to keep track of the differences between these methods and to predict how they will handle input with different dimensions.

In the past, my preferred approach has been to either construct the result array explicitly and use indexing for assignment, to or use np.array to stack along the first dimension and then use transpose (or a similar method) to reorder dimensions if necessary. This is pretty awkward.

I brought this proposal up a few weeks on the numpy-discussion list:
http://mail.scipy.org/pipermail/numpy-discussion/2015-February/072199.html

I also received positive feedback on Twitter:
https://twitter.com/shoyer/status/565937244599377920

Implementation notes

The one line summaries for concatenate and stack have been (re)written to mirror each other, and to make clear that the distinction between these functions is whether they join over an existing or new axis.

In general, I've tweaked the documentation and docstrings with an eye toward pointing users to concatenate/stack/split as a fundamental set of basic array manipulation routines, and away from array_split/{h,v,d}split/{h,v,d,column_}stack

I put this implementation in numpy.core.shape_base alongside hstack/vstack, but it appears that there is also a numpy.lib.shape_base module that contains another larger set of functions, including dstack. I'm not really sure where this belongs (or if it even matters).

Finally, it might be a good idea to write a masked array version of stack. But I don't use masked arrays, so I'm not well motivated to do that.

if axis < 0:
axis += result_ndim
sl = (slice(None),) * axis + (_nx.newaxis,)
return _nx.concatenate([arr[sl] for arr in arrays], axis=axis)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't trust this on first sight (but if you have a test, ignore the comment). What if the second array has more dimensions then the first one?
One suggestion for the API. How about adding an ndmin kwarg. That way hstack, etc. would really be just special cases of this function, I believe (though whether we should implement them as such, I don't know).

EDIT: Frankly, I am not sure if ndmin makes sense, it might be a bad design choice (and that hstack does it doesn't make it better)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me at least, it would be clearer if one were to do:

new_shape = array[0].shape[:axis] + [_nx.newaxis] + array[0].shape[axis:]
return _nx.concatenate([arr.reshape(new_shape) for arr in arrays], axis=axis)

(note that with the broadcasting suggestion, it might be faster to use the machinery from #5371 to get the broadcast shape at the top, and then use broadcast_to(arr, new_shape) here. Alternatively, use b = np.broadcast(*arrays) above, and new_shape = b.shape[:axis] ... here).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seberg I find the ndmin behavior of the {v,h,d}stack functions confusing. I would say that the lack of automated rules for adding dimensions is a feature of this function :).

@mhvk My only hesitation with using reshape is that I know it can result in copies instead of views in some edge cases. So, I usually stick to indexing to be safe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer - actually, looking better both your method and mine can lead to unexpected errors when a later array has fewer dimensions than the first one, like::

np.stack([np.arange(9.).reshape(3,3), np.arange(3)], axis=2)

At least, I would think you would get sl=[:,:,np.newaxis] in which case you get an IndexError: too many indices on the second array (rather than the much more informative ValueError that concatenate would throw).

Though it would perhaps be needlessly expensive to test for this beforehand. Maybe just

try:
    return _nx.concatenate(...)
except IndexError:
    raise ValueError(<like concatenate>)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch -- fixed. Though again, weird stuff will happen if you pass an array whose dimensions can't be expanded in here and this will make that error message harder to track down. I would think that would include np.matrix, but look at what it does:

In [8]: np.mat('1')[:, :, None].shape
Out[8]: (1, 1, 1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the indexing slightly safer, one could explicitly use reshape(arr.shape[:axis] + [np.newaxis] + arr.shape[axis:]) after all -- in that case one uses both the shape and reshape of the array subclass itself. Though, matrix is again interesting...

np.mat([[1,2],[3,4]]).reshape(2, 1, 2).shape
# (2, 2)

I don't know what's worse... Though again I don't feel one should "punish" other subclasses for strange behaviour of some.

@rgommers
Copy link
Member
rgommers commented Mar 8, 2015

@shoyer did you see gh-5057, which also adds a stack function? A comparison with that PR may be useful (these PRs can't both be merged...).

@shoyer
Copy link
Member Author
shoyer commented Mar 8, 2015

@rgommers Good point. I did see that discussion on the mailing list; I did not realize that #5057 proposed using the same name.

I'll raise this point again on the mailing list. I think both types of functionality are useful, but this is a more fundamental type of "stacking" (it unifies the existing *stack functions) and fills a more obvious hole in NumPy's functionality (no N-dimensional way to stack ndarray objects).

@shoyer
Copy link
Member Author
shoyer commented Mar 22, 2015

I asked on the mailing list about the name np.stack last week: http://mail.scipy.org/pipermail/numpy-discussion/2015-March/072491.html

Based on @sotte (author of #5057) and @njsmith's comments, it seems like we would be OK to use stack here -- #5057 should probably use something else, such as barray or block.

@mhvk
Copy link
Contributor
mhvk commented Mar 23, 2015

Looks all OK.

@njsmith
Copy link
Member
njsmith commented Mar 23, 2015

Just looked over the api and I like it.

row_stack and column_stack are special in that they magically alternate
between stacking and concatenation depending on the dimensionality of the
input. But {h,v,d}stack are not so magical. Are they equivalent to
stack(..., axis=...)? (This is both a question about api consistency, and a
question about whether their implementation can be simplified.)
On Feb 25, 2015 2:38 AM, "Stephan Hoyer" notifications@github.com wrote:

The motivation here is to present a uniform and N-dimensional interface
for joining arrays along a new axis, similarly to how concatenate
provides a uniform and N-dimensional interface for joining arrays along an
existing axis.
Background

Currently, users can choose between hstack, vstack, column_stack and
dstack, but none of these functions handle N-dimensional input. In my
opinion, it's also difficult to keep track of the differences between these
methods and to predict how they will handle input with different dimensions.

In the past, my preferred approach has been to either construct the result
array explicitly and use indexing for assignment, to or use np.array to
stack along the first dimension and then use transpose (or a similar
method) to reorder dimensions if necessary. This is pretty awkward.

I brought this proposal up a few weeks on the numpy-discussion list:
http://mail.scipy.org/pipermail/numpy-discussion/2015-February/072199.html

I also received positive feedback on Twitter:
https://twitter.com/shoyer/status/565937244599377920
Implementation notes

The one line summaries for concatenate and stack have been (re)written to
mirror each other, and to make clear that the distinction between these
functions is whether they join over an existing or new axis.

In general, I've tweaked the documentation and docstrings with an eye
toward pointing users to concatenate/stack/split as a fundamental set of
basic array manipulation routines, and away from array_split/{h,v,d}split/
{h,v,d,column_}stack

I put this implementation in numpy.core.shape_base alongside hstack/vstack,
but it appears that there is also a numpy.lib.shape_base module that
contains another larger set of functions, including dstack. I'm not
really sure where this belongs (or if it even matters).

Finally, it might be a good idea to write a masked array version of stack.

But I don't use masked arrays, so I'm not well motivated to do that.

You can view, comment on, or merge this pull request online at:

#5605
Commit Summary

  • ENH: add np.stack

File Changes

Patch Links:


Reply to this email directly or view it on GitHub
#5605.

@shoyer
Copy link
Member Author
shoyer commented Mar 23, 2015

@njsmith I'm afraid {h,v,d}stack are so magical (note that row_stack is an alias for vstack):

In [12]: x1 = np.zeros(1)

In [13]: x2 = np.zeros((1, 1))

In [14]: x3 = np.zeros((1, 1, 1))

In [16]: np.hstack([x1, x1]).shape
Out[16]: (2,)

In [17]: np.hstack([x2, x2]).shape
Out[17]: (1, 2)

In [18]: np.hstack([x3, x3]).shape
Out[18]: (1, 2, 1)

In [19]: np.vstack([x1, x1]).shape
Out[19]: (2, 1)

In [20]: np.vstack([x2, x2]).shape
Out[20]: (2, 1)

In [21]: np.vstack([x3, x3]).shape
Out[21]: (2, 1, 1)

In [22]: np.dstack([x1, x1]).shape
Out[22]: (1, 1, 2)

In [23]: np.dstack([x2, x2]).shape
Out[23]: (1, 1, 2)

In [24]: np.dstack([x3, x3]).shape
Out[24]: (1, 1, 2)

To summarize: if array.ndim < stack_dim (where stack_dim is 1 for hstack, 2 to vstack and 3 for dstack), then the {h,v,d}stack function stacks. Otherwise, it concatenates. There are currently no functions in the NumPy API that stack for arbitrary dimensional input -- that's why I made this PR.

From an implementation perspective, things are not so bad -- each of these methods just calls np.atleast_*d on all the inputs and then np.concatenate on the result. So in my opinion the API design problems are two fold:

  1. The behavior of atleast_*d is not very intuitive -- axes are inserted in somewhat random locations:

    In [45]: np.atleast_2d(np.zeros(2)).shape
    Out[45]: (1, 2)
    
    In [46]: np.atleast_3d(np.zeros(2)).shape
    Out[46]: (1, 2, 1)
    
    In [47]: np.atleast_3d(np.zeros((2, 2))).shape
    Out[47]: (2, 2, 1)
    

    If we were starting from scratch, I would pick the rule "always insert insert new dimensions at the start", which at least works like array broadcasting.

  2. The axis that {h,v,d}stack concatenates along has no clear progression:

    • hstack: use atleast_1d and concatenates with axis=1 (unless input is 1d, in which case axis=0)
    • vstack: use atleast_2d and concatenates with axis=0
    • dstack: use atleast_3d and concatenates with axis=2

Basically there is no way to understand these functions without reading the source code.

@shoyer
Copy link
Member Author
shoyer commented Apr 22, 2015

Ping! Does this need more work? I'd love to be able to merge this in time for numpy 1.10....

@charris charris added this to the 1.10.0 release milestone Apr 22, 2015

Parameters
----------
arrays : sequence of ndarrays
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inputs are converted to arrays, so this should be "sequence of array_like"

@charris
Copy link
Member
charris commented May 6, 2015

Does this deal with array scalers and empty arrays?

[3, 4]])

"""
arrays = [asanyarray(arr) for arr in arrays]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens with mixed subtypes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could maybe check that all types are the same.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do that by checking that set(type(a) for a in arrays) has one member.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens for subtypes is mostly dictated by the behavior of np.concatenate. I don't see much advantage in explicitly checking for consistent types here when none of the logic in this function relies on that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the type checking should be left for concatenate (which does not currently do this all that well, but could be rewritten, e.g., using insert methods if present on the first member or so).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

@shoyer
Copy link
Member Author
shoyer commented May 6, 2015

@charris It does seem to deal properly with array scalers and empty arrays -- I'll add some tests.

% (axis, result_ndim, result_ndim))
if axis < 0:
A93C axis += result_ndim
sl = (slice(None),) * axis + (_nx.newaxis,)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative method, once you have the shape of the arrays, is

newshape = shape[:axis] + (1,) + shape[axis:]
expanded_arrays = [a.reshape(newshape) for a in arrays]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, getting rid of expanded_arrays

_nx.concatenate([a.reshape(newshape) for a in arrays], axis=axis)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like using slicing for this operation rather than reshape because I know that slicing will always using a view rather than a copy. Though I suppose reshape is probably also safe when used in this way.

@shoyer
Copy link
Member Author
shoyer commented May 8, 2015

My latest commit includes changes in response to @charris's review (thanks!)

@charris
Copy link
Member
charris commented May 11, 2015

@shoyer Some more inprovement in the summary part of the docstring is needed, then this can go in.

@shoyer shoyer force-pushed the stack branch 2 times, most recently from 1113c59 to a9c7d8b Compare May 12, 2015 00:33
@shoyer
Copy link
Member Author
shoyer commented May 12, 2015

@charris I added your suggested words to the docstring and squashed my commits. I'm still waiting for Travis to run its tests, but they did pass (except for USE_BENTO=1) on my fork:
https://travis-ci.org/shoyer/numpy/builds/62164157

"""
Join a sequence of arrays along a new axis.

.. versionadded:: 1.10.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. versionadded:: 1.10.0 and its preceding blank line should come at the end of the summary ;)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@charris
Copy link
Member
charris commented May 12, 2015

Looks good. One more nitpick and it's done...

The motivation here is to present a uniform and N-dimensional interface for
joining arrays along a new axis, similarly to how `concatenate` provides a
uniform and N-dimensional interface for joining arrays along an existing axis.

Background
~~~~~~~~~~

Currently, users can choose between `hstack`, `vstack`, `column_stack` and
`dstack`, but none of these functions handle N-dimensional input. In my
opinion, it's also difficult to keep track of the differences between these
methods and to predict how they will handle input with different
dimensions.

In the past, my preferred approach has been to either construct the result
array explicitly and use indexing for assignment, to or use `np.array` to
stack along the first dimension and then use `transpose` (or a s
D690
imilar method)
to reorder dimensions if necessary. This is pretty awkward.

I brought this proposal up a few weeks on the numpy-discussion list:
http://mail.scipy.org/pipermail/numpy-discussion/2015-February/072199.html

I also received positive feedback on Twitter:
https://twitter.com/shoyer/status/565937244599377920

Implementation notes
~~~~~~~~~~~~~~~~~~~~

The one line summaries for `concatenate` and `stack` have been (re)written to
mirror each other, and to make clear that the distinction between these functions
is whether they join over an existing or new axis.

In general, I've tweaked the documentation and docstrings with an eye toward
pointing users to `concatenate`/`stack`/`split` as a fundamental set of basic
array manipulation routines, and away from
`array_split`/`{h,v,d}split`/`{h,v,d,column_}stack`

I put this implementation in `numpy.core.shape_base` alongside `hstack`/`vstack`,
but it appears that there is also a `numpy.lib.shape_base` module that contains
another larger set of functions, including `dstack`. I'm not really sure where
this belongs (or if it even matters).

Finally, it might be a good idea to write a masked array version of `stack`.
But I don't use masked arrays, so I'm not well motivated to do that.
charris added a commit that referenced this pull request May 12, 2015
@charris charris merged commit 18c89db into numpy:master May 12, 2015
@charris
Copy link
Member
charris commented May 12, 2015

Great. Thanks Stephan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants
0