ENH nddata: collapse operations on NDDataArray, improved Masked Quantity support by bmorris3 · Pull Request #14175 · astropy/astropy · GitHub

ENH nddata: collapse operations on NDDataArray, improved Masked Quantity support #14175


Merged
merged 12 commits into astropy:main from the extend-nddata branch on Feb 28, 2023

Conversation

bmorris3
Contributor
@bmorris3 bmorris3 commented Dec 13, 2022

Description

The nddata module (docs) supports arithmetic between NDDataArray objects with propagation of units, masking and uncertainties. The currently supported arithmetic operations include addition, subtraction, multiplication and division.

A set of currently unsupported operations would "collapse" NDDataArrays along specified axes. For a concrete example, imagine that an NDDataArray named cube represents a spectral cube from an IFU, with a spectral dimension (axis=0) and two spatial dimensions (axes 1 and 2). It may be useful to collapse the cube to a single spectrum by taking the maximum over the spatial dimensions at each wavelength, which might be written with numpy-like syntax as collapsed_spectrum = cube.max(axis=(1, 2)). In the case of min/max, we expect the uncertainties in collapsed_spectrum to be the uncertainties at the maximum flux in the spatial dimensions at each wavelength. If instead we took the mean with cube.mean(axis=(1, 2)), we would propagate uncertainties along the collapsed axes. The mask should also be propagated so that only unmasked values are included in the collapse operation.

This type of generic collapse operation is something we need over in jdaviz, where we currently rely on glue to collapse our spectral cubes after they have been converted to glue Data objects. Since we don't anticipate unit/mask/error propagation to be added to glue Data objects, we're hoping to implement this functionality in astropy, and then offer jdaviz users the option to propagate units/masks/errors via the astropy methods.

This PR is a first-draft implementation of sum, mean, min, and max methods for NDDataArray, which accept an integer or a tuple of ints for the axis argument. As of opening this draft PR, I have implemented the collapse arithmetic for the data and uncertainties, and I'm now working on correctly applying the masks. I'm putting up this unfinished PR to collect comments and discussion while I put together docs and tests. Comments welcome and needed!
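For illustration, here is a minimal sketch of the proposed usage, built around the sum/mean/min/max methods described above (the toy cube, its shape, and its values are made up for this example):

import numpy as np
import astropy.units as u
from astropy.nddata import NDDataArray, StdDevUncertainty

# Toy spectral cube: 5 spectral planes, each 4x4 spatial pixels
rng = np.random.default_rng(42)
flux = rng.normal(100, 5, size=(5, 4, 4))
cube = NDDataArray(
    flux,
    uncertainty=StdDevUncertainty(np.full(flux.shape, 5.0)),
    mask=np.zeros(flux.shape, dtype=bool),
    unit=u.MJy / u.sr,
)

# Collapse the two spatial axes, keeping the spectral axis:
spectrum_max = cube.max(axis=(1, 2))    # uncertainty taken from the maximum pixel
spectrum_mean = cube.mean(axis=(1, 2))  # uncertainties propagated over the collapsed axes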

Checklist for package maintainer(s)

This checklist is meant to remind the package maintainer(s) who will review this pull request of some common things to look for. This list is not exhaustive.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see "When to rebase and squash commits".
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the Extra CI label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the no-changelog-entry-needed label. If this is a manual backport, use the skip-changelog-checks label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • Is a milestone set? Milestone must be set but astropy-bot check might be missing; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate backport-X.Y.x label(s) before merge.

See additional to-do items at #14175 (comment)

@pllim pllim added this to the v5.3 milestone Dec 13, 2022
@pllim pllim requested review from eteq and mwcraig December 13, 2022 17:20
@pllim pllim added the API change PRs and issues that change an existing API, possibly requiring a deprecation period label Dec 13, 2022
@github-actions
Contributor

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@Cadair
Member
Cadair commented Dec 14, 2022

@DanRyanIrish this looks like something you might want to look at.

@DanRyanIrish
Contributor

Hi @bmorris3. This looks like a specific case of the same problem we are trying to solve in ndcube in this currently open PR: https://github.com/sunpy/ndcube/pull/450/files#diff-929794bb532be1486907b360285433b432f8505ab7e7ff63eca3eab657f7d00dR712

In our case we are implementing superpixel, which downsamples a cube by collapsing contiguous groups of pixels. But a special case of this is to collapse over entire axes. As well as sum, mean, min and max, we are also supporting median. I'm not sure about the timescale of your effort, but perhaps the most desirable long-term result would be for ndcube's superpixel to live in NDData, and then your axis-collapsing methods could just be wrappers around superpixel with a simpler API. What do you think?

I haven't gone through your draft implementation in detail yet, but at least parts of your implementation might be better than what we have in ndcube so far. Possibly vice versa as well. It would be great to collaborate on this, even if, to start with, these PRs proceed separately.

@bmorris3
Contributor Author

Hi @DanRyanIrish, thanks for the heads up on a related effort in sunpy/ndcube#450.

As you say, the superpixel implementation looks slightly different to the collapse methods in this PR. In my domain we'd usually call your operation "binning", and of course, the broadest possible binning is a single bin per axis, so I agree that this astropy PR is a special case of sunpy/ndcube#450.

At first look, there might be two main differences.

  1. superpixel seems to perform the operations on exactly one axis at a time:

https://github.com/sunpy/ndcube/blob/488fc6bba269e6699d940696b1a0691761d90658/ndcube/ndcube.py#L808-L810

  2. superpixel does not support uncertainty propagation for min, max or median:

https://github.com/sunpy/ndcube/blob/488fc6bba269e6699d940696b1a0691761d90658/ndcube/ndcube.py#L731

Two key features of this astropy PR are generalized solutions for (1) and (2). For example, in astronomical observations with spectral cubes, we often have an unresolved star producing most of the signal in a single pixel. A max operation on the two spatial dimensions while preserving the dispersion dimension would extract a first-guess stellar spectrum, while preserving the uncertainties and masks from each pixel in the original cube. I don't think this kind of ND operation is possible with sunpy/ndcube#450.

As you noticed, I'm still working on an efficient implementation of median, so it hasn't been included in this PR yet. Medians are tougher because they require sorting (potentially over multiple axes).

@bmorris3
Contributor Author

Looking closer, I'm not sure that my previous comment is correct about (1). It looks like superpixel loops over the axes.

@DanRyanIrish
Contributor
DanRyanIrish commented Dec 14, 2022

Thanks for your thoughts on sunpy/ndcube#450 @bmorris3.

1. `superpixel` seems to perform the operations on exactly one axis at a time:

As you mentioned subsequently, superpixel does operate on any number of axes, but it applies the operation to each in a loop. It would be nice to get rid of that loop, but we haven't come up with a better way yet. If you have a better implementation, that would be great.

2. `superpixel` does not support uncertainty propagation for `min`, `max` or `median`:

The reason uncertainty propagation is not supported for these operations is that, as far as I can see, the uncertainty is ill-defined. For example, there may be multiple array values equal to the maximum value in the cube. However, each may have different uncertainties, e.g. due to different noise levels in different pixels. In this case, what should the uncertainty be? The same ambiguity holds for min and median.

Two key features of this astropy PR are generalized solutions for (1) and (2). For example, in astronomical observations with spectral cubes, we often have an unresolved star producing most of the signal in a single pixel. A max operation on the two spatial dimensions while preserving the dispersion dimension would extract a first-guess stellar spectrum, while preserving the uncertainties and masks from each pixel in the original cube. I don't think this kind of ND operation is possible with sunpy/ndcube#450.

To my mind, the way to handle this use case and unambiguously maintain the uncertainty is to find the index of the spatial pixel and then slice the cube so it gives you the spectrum at that spatial location. This does involve at least two operations rather than just one, though, and so is a little more cumbersome.

What do you think about the above? Have I misunderstood your use case? Is uncertainty well-defined for min, max and median in some way I haven't thought of?

@bmorris3
Contributor Author

However, each may have different uncertainties, e.g. due to different noise levels in different pixels. In this case what should the uncertainty be? The same ambiguity holds for min, and median.

It's true that min and max are not guaranteed to be unique. This is a generic problem that you see in numpy too, of course. For example, there's a warning in the Notes section in the docs for np.argmin:

In case of multiple occurrences of the minimum values, the indices 
corresponding to the first occurrence are returned.

I think the advantage of this NDDataArray approach, which mirrors the numpy behavior, is that you always get an answer. The answer will be "correct" if the values in the data array are unique, and may correspond to the uncertainty elsewhere if the array is non-unique.

median is a different scenario. The median of an odd-length array is simply the middle value of the sorted data array. That measurement in the data array has a corresponding uncertainty that should be preserved. The median of an even-length array is the mean of the two middle values, which allows us to use the mean implementation here within NDDataArray on the two data array values and their uncertainties.
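To make the reasoning above concrete, here is a minimal plain-numpy sketch of a median with a preserved uncertainty (not code from this PR; median_with_uncertainty, data, and sigma are hypothetical names for illustration):

import numpy as np

def median_with_uncertainty(data, sigma):
    # Median of a 1D array and a corresponding uncertainty, following the
    # odd/even-length reasoning above (illustrative only).
    order = np.argsort(data)
    n = len(data)
    if n % 2 == 1:
        mid = order[n // 2]
        return data[mid], sigma[mid]  # middle value and its own uncertainty
    lo, hi = order[n // 2 - 1], order[n // 2]
    median = 0.5 * (data[lo] + data[hi])
    # mean of two values: propagate as for an average of independent measurements
    error = 0.5 * np.sqrt(sigma[lo] ** 2 + sigma[hi] ** 2)
    return median, error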

@mhvk
Contributor
mhvk commented Dec 14, 2022

Just in case it helps: much of the machinery for propagating masks would seem identical to what is done in the Masked class [1] - might it be possible to re-use that? One specific thing worth mentioning: summing with masks is not all that trivial, as it is not obvious how to "ignore" masked items (replacing with zero not always being a good choice). .mean() is far easier. See https://docs.astropy.org/en/latest/utils/masked/index.html#utils-masked-vs-numpy-maskedarray

[1] E.g.,

def mean(self, axis=None, dtype=None, out=None, keepdims=False, *, where=True):
    # Implementation based on that in numpy/core/_methods.py
    # Cast bool, unsigned int, and int to float64 by default,
    # and do float16 at higher precision.
    is_float16_result = False
    if dtype is None:
        if issubclass(self.dtype.type, (np.integer, np.bool_)):
            dtype = np.dtype("f8")
        elif issubclass(self.dtype.type, np.float16):
            dtype = np.dtype("f4")
            is_float16_result = out is None
    where = ~self.mask & where
    result = self.sum(
        axis=axis, dtype=dtype, out=out, keepdims=keepdims, where=where
    )
    n = np.add.reduce(where, axis=axis, keepdims=keepdims)
    result /= n
    if is_float16_result:
        result = result.astype(self.dtype)
    return result
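For context, a short usage sketch of the Masked class referenced above (arbitrary values; the result follows from the mean implementation quoted here):

import numpy as np
from astropy.utils.masked import Masked

ma = Masked(np.array([1.0, 2.0, 3.0, 4.0]), mask=[False, True, False, False])
# Per the quoted mean implementation, the masked element is excluded:
# (1 + 3 + 4) / 3
print(ma.mean())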

@bmorris3
Contributor Author
bmorris3 commented Dec 14, 2022

@mhvk thanks for the pointer! I was actually just reading about this module because I didn't know much about it.

@bmorris3
Contributor Author

Upon further inspection of astropy.utils.masked, this looks like the way to go and @mhvk is my hero 🦸🏻.

@bmorris3
Contributor Author
bmorris3 commented Dec 15, 2022

I'm still sorting out all of the implications of this, but taking up @mhvk's suggestion to build on Masked arrays from astropy.utils.masked, I pushed some changes in 9ae6115. This allows most numpy functions to be called on NDDataArray objects while following the rules in astropy.utils.masked for propagating masks.

Examples

Suppose you have loaded a spectral cube (two spatial dims, one spectral dim) into an NDDataArray called nddata. Now you can do

np.sum(nddata, axis=(1, 2))  # internally this becomes `nddata.sum(axis=(1,2))`

and you will be returned the collapsed spectrum as an NDDataArray with propagated masks, uncertainties, and units:

NDDataArray([68098.234, 80767.66 , 69840.2  , ..., 94635.03 , 92484.08 ,
             89727.45 ], unit='MJy / sr')

We can only propagate errors, and return an NDDataArray, if we can explicitly define the rules for error propagation for that numpy operation. In this PR I've defined those rules for sum, mean, min, and max.

If you call a numpy function, like std, for which this PR does not add explicit error-propagation support:

np.std(nddata, axis=(1, 2))

it falls back on returning a np.ma.MaskedArray like this:

masked_array(data=[137.33192443847656, 138.0076141357422,
                   137.83934020996094, ..., 238.78318786621094,
                   238.67127990722656, 238.54486083984375],
             mask=[False, False, False, ..., False, False, False],
       fill_value=1e+20,
            dtype=float32)

@mhvk
Contributor
mhvk commented Dec 15, 2022

Happy that this looks helpful! Might it make sense to return MaskedNDArray instead of numpy's MaskedArray? The two behave a little differently (see link above), so that might help consistency.

p.s. At some point I really hope to make some progress on getting proper error propagation, including covariance, supported in uncertainties -- see #3715.

@bmorris3
Contributor Author
bmorris3 commented Dec 15, 2022

@mhvk I agree that the behavior above is inconsistent, but I think this stems from a feature rather than a bug? nddata objects currently pass off masked arrays to numpy operations by default:

def __array__(self):
    """
    This allows code that requests a Numpy array to use an NDData
    object as a Numpy array.
    """
    if self.mask is not None:
        return np.ma.masked_array(self.data, self.mask)
    else:
        return np.array(self.data)

The above snippet exports the NDDataArray to a np.ma.masked_array when you call np.std(nddata, axis=(1, 2)). I think that behavior is intended – if you call a numpy function on an astropy object, you export the astropy object to a numpy object and get out of numpy's way.

The (few) NDDataArray methods that I've implemented in this PR allow you to propagate errors/masks/units when the rules are defined, but require you to use the built-in method on NDDataArray, like NDDataArray.sum. The ethos could be that NDDataArray "knows how to do a sum" on each of its own attributes, while np.std does not.

Clearly, one likely source of confusion in this type of implementation is that you have different output types if you call np.std(nddata) (returns np.ma.masked_array) compared with np.sum(nddata) or nddata.sum() (both return NDDataArray).
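To make the distinction concrete, here is a short sketch of the behavior described above, reusing the nddata spectral cube from the earlier example:

import numpy as np

collapsed = nddata.sum(axis=(1, 2))           # NDDataArray with propagated mask/uncertainty/unit
also_collapsed = np.sum(nddata, axis=(1, 2))  # dispatches to NDDataArray.sum -> NDDataArray
stds = np.std(nddata, axis=(1, 2))            # no propagation rule -> np.ma.MaskedArray via __array__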

@mhvk
Contributor
mhvk commented Dec 15, 2022

OK, that certainly makes sense. Maybe in the future this can change (e.g., one might want to return MaskedQuantity if a unit is set), but for this PR certainly best to stick with the API that is there!

@bmorris3
Contributor Author
bmorris3 commented Dec 15, 2022

Great! I want to preempt another likely source of confusion here: if you call a numpy function like np.mean on an NDDataArray object, and that function also has an implementation on the NDDataArray object (NDDataArray.mean), you still get an NDDataArray. This is a feature of numpy – the source code in numpy for np.mean(x, axis=(1, 2)) actually first tries to call the mean method on x. So for sum, mean, min, and max, the numpy functions also return NDDataArray objects, which might not be expected if you only read my comment above.

Relevant notes in the numpy docs here.

@pllim
Member
pllim commented Dec 19, 2022

There is a conflict now, FYI

@DanRyanIrish
Contributor

It's true that min and max are not guaranteed to be unique. This is a generic problem that you see in numpy too, of course. For example, there's a warning in the Notes section in the docs for np.argmin:

In case of multiple occurrences of the minimum values, the indices 
corresponding to the first occurrence are returned.

I think the advantage of this NDDataArray approach, which mirrors the numpy behavior, is that you always get an answer. The answer will be "correct" if the values in the data array are unique, and may correspond to the uncertainty elsewhere if the array is non-unique.

I guess it comes down to whether you want a correct answer most of the time in a way that mimics numpy, or to never give a wrong answer. This is probably a matter of preference, but one argument against the numpy-like behavior is that numpy arrays don't have uncertainties. So whether np.argmin takes the first occurrence or does something more sophisticated, the end result is degenerate. That's not the case when you have uncertainties.

This has prompted me to attempt to support uncertainty preservation for min/max in ndcube#450. However, instead of taking the first-found uncertainty, I preserve the largest uncertainty. This seems to me to better reflect our actual knowledge of what the min/max value is. Any thoughts?

median is a different scenario. The median of an odd-length array is simply the middle value of the sorted data array. That measurement in the data array has a corresponding uncertainty that should be preserved. The median of an even-length array is the mean of the two middle values, which allows us to use the mean implementation here within NDDataArray on the two data array values and their uncertainties.

I hadn't thought of this approach. It's compelling. One aspect is still not clear to me, though. In the odd-length case, if there are multiple occurrences of the median value, whose uncertainty gets used? Is it meaningful to simply take the "middle" one as determined by the sorting algorithm, or should it depend on the uncertainty, e.g. taking the largest as in the ndcube min/max case?

@mhvk
Contributor
mhvk commented Dec 20, 2022

Bit of a comment from the side, but I think that picking particular elements based on their uncertainty is not really "better"; the question of what the minimum, maximum, and median are in the presence of uncertainties likely has more statistically meaningful answers (which may well depend on the distribution, the number of points, and the confidence level one wishes to have -- just as the mean has an error in the mean, which is not so trivially determined if the data points are not consistent with the mean within their uncertainties, i.e., reduced chi2 inconsistent with unity).

@bmorris3
Contributor Author

I agree with @mhvk. To summarize for others joining the discussion, I think the first decision to make is whether or not we return any result when the operation is ambiguous. We both agree that the user should get an answer, but we're disagreeing on which answer.

The approach @DanRyanIrish is suggesting says "take the most conservative possible uncertainty", which is absolutely a handy approach when doing model optimization on observations, for example. The approach I've implemented says "return the first found minimum/maximum and its corresponding uncertainty", which mirrors the behavior in numpy.

@DanRyanIrish
Contributor
DanRyanIrish commented Dec 22, 2022

To further contribute to this discussion, I spoke with a couple of solar colleagues who argued for a third way of calculating the uncertainty in the min & max cases, which won me over. For the case of max, the returned data value is simply the max over the axis/superpixel. But the uncertainty is (data + uncertainty).max(axis=axis) - data.max(axis=axis), where data and uncertainty are the data and uncertainty arrays and axis is/are the axis/axes over which the data is being collapsed. This captures the maximum possible value given all the uncertainties and data values being collapsed. Hence it is in line with @mhvk's comment that the uncertainty should reflect the distribution of values and uncertainties, not simply be the uncertainty of a given pixel. And I agree that this is more meaningful than my original suggestion or the first-found method. It also should not be significantly more computationally expensive than simply finding the max of the data.
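Rendered as a hypothetical plain-numpy sketch, the suggestion above amounts to something like:

import numpy as np

def max_with_envelope_uncertainty(data, uncertainty, axis):
    # Collapse data with max; the uncertainty is the gap between the maximum
    # possible value (data + uncertainty) and the maximum data value.
    collapsed = data.max(axis=axis)
    error = (data + uncertainty).max(axis=axis) - collapsed
    return collapsed, error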

@bmorris3 bmorris3 force-pushed the extend-nddata branch 2 times, most recently from d5dbb77 to 28a106a Compare December 28, 2022 15:04
@bmorris3 bmorris3 changed the title Introducing collapse operations on NDDataArray ENH nddata: collapse operations on NDDataArray, improved Masked Quantity support Dec 28, 2022
@bmorris3 bmorris3 marked this pull request as ready for review December 28, 2022 17:48
@DanRyanIrish
Contributor

Hmm. Perhaps. exclude_masked_data?

@bmorris3
Contributor Author

In conversations with @mwcraig and @DanRyanIrish, we have agreed that the kwarg discussed earlier (#14175 (comment)) should be changed to operation_ignores_mask, so in 0938834 I've used this new name.

@mwcraig
Member
mwcraig commented Feb 28, 2023

@bmorris3 -- will a rebase fix the issue with the weird coverage results?

Adding masking support in nddata collapse ops with astropy.utils.masked

Adding tests for NDDataArray collapse operations

correcting NDDataArray collapse w propagation for ndim<=3

remove print statement from testing

Fixing unit propagation bug

preserve masked copying behavior

fixing up masked array views in nddata

NDDataArray arithmetic testing improvements

Further NDDataArray testing improvements

Fixing ufunc inplace error test

Updating docs to reflect small API change when not handling masks

Cleaning up PR 14175

Fixing mask overwriting order for masked quantities

more thorough handling of inputs for NDData constructor

return to assumption  NDData.data is ndarray

fixing incorrect result for np.ma.maskedarray

Increasing coverage

adding changelog entry

support for mask=None case

fixing up astropy mask handling in ndarithmetic

further improvements for mask handling

remove amin/amax, drop uncertainties from collapse via min/max by default

generalizing to >3D cube collapse

Apply suggestions from code review

Co-authored-by: Matt Craig <mattwcraig@gmail.com>

Splitting out masked improvements for separate PR

Adding filterwarnings for div by zero error (addressed in another PR)

simplifying private methods for extrema in nddata

Improving test coverage
@bmorris3
Contributor Author

Only one way to find out. Here we go!

Member
@mwcraig mwcraig left a comment


This is good to go once tests pass! Thanks to everyone who has provided input here.

@pllim pllim added the Extra CI Run cron CI as part of PR label Feb 28, 2023
@pllim
Member
pllim commented Feb 28, 2023

Just to be safe, I am also running the cron jobs. 😬

@pllim
Member
pllim commented Feb 28, 2023

Since this is slated to go in soon, I also took the liberty to fix the What's New listing that I noticed while glancing at the diff. Hope you don't mind!

@pllim
Member
pllim commented Feb 28, 2023

linkcheck failure is unrelated. Merging, thank you all very much!

@pllim pllim merged commit 6fcea11 into astropy:main Feb 28, 2023
@bmorris3
Contributor Author

Thanks to all reviewers! I'm very happy to have had your helpful, patient feedback.

Labels
API change (PRs and issues that change an existing API, possibly requiring a deprecation period), Extra CI (Run cron CI as part of PR), nddata, Ready-for-final-review, whatsnew-needed