ENH nddata: collapse operations on NDDataArray, improved Masked Quantity support by bmorris3 · Pull Request #14175 · astropy/astropy · GitHub

ENH nddata: collapse operations on NDDataArray, improved Masked Quantity support #14175


Merged
merged 12 commits into astropy:main from the extend-nddata branch on Feb 28, 2023

Conversation

bmorris3
Contributor
@bmorris3 bmorris3 commented Dec 13, 2022

Description

The nddata module (docs) supports arithmetic between NDDataArray objects with propagation of units, masking and uncertainties. The currently supported arithmetic operations include addition, subtraction, multiplication and division.

A set of currently unsupported operations would "collapse" NDDataArrays along specified axes. For a concrete example, imagine that an NDDataArray named cube represents a spectral cube from an IFU, with a spectral dimension (axis=0) and two spatial dimensions (axes 1 and 2). It may be useful to collapse the cube to a single spectrum by taking the maximum over the spatial dimensions at each wavelength, which might be written with numpy-like syntax as collapsed_spectrum = cube.max(axis=(1, 2)). In the case of min/max, we expect the uncertainties in collapsed_spectrum to be the uncertainties at the maximum flux in the spatial dimensions at each wavelength. If instead we took the mean with cube.mean(axis=(1, 2)), we would propagate uncertainties along the collapsed axes. The mask should also be propagated so that only unmasked values are included in the collapse operation.

This type of generic collapse operation is something we need over in jdaviz, where we currently rely on glue to collapse our spectral cubes after they have been converted to glue Data objects. Since we don't anticipate unit/mask/error propagation to be added to glue Data objects, we're hoping to implement this functionality in astropy, and then offer jdaviz users the option to propagate units/masks/errors via the astropy methods.

This PR is a first-draft implementation of sum, mean, min, and max methods for NDDataArray, which accept an integer or a tuple of ints for the axis argument. As of opening this draft PR, I have implemented the collapse arithmetic for the data and uncertainties, and I'm now working on correctly applying the masks. I'm putting up this unfinished PR to collect comments and discussion while I put together docs and tests. Comments welcome and needed!
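For illustration, here is a minimal sketch of the proposed usage, built around the sum/mean/min/max methods described above (the toy cube, its shape, and its values are made up for this example):

import numpy as np
import astropy.units as u
from astropy.nddata import NDDataArray, StdDevUncertainty

# Toy spectral cube: 5 spectral planes, each 4x4 spatial pixels
rng = np.random.default_rng(42)
flux = rng.normal(100, 5, size=(5, 4, 4))
cube = NDDataArray(
    flux,
    uncertainty=StdDevUncertainty(np.full(flux.shape, 5.0)),
    mask=np.zeros(flux.shape, dtype=bool),
    unit=u.MJy / u.sr,
)

# Collapse the two spatial axes, keeping the spectral axis:
spectrum_max = cube.max(axis=(1, 2))    # uncertainty taken from the maximum pixel
spectrum_mean = cube.mean(axis=(1, 2))  # uncertainties propagated over the collapsed axes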

Checklist for package maintainer(s)

This checklist is meant to remind the package maintainer(s) who will review this pull request of some common things to look for. This list is not exhaustive.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see "When to rebase and squash commits".
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the Extra CI label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the no-changelog-entry-needed label. If this is a manual backport, use the skip-changelog-checks label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • Is a milestone set? Milestone must be set but astropy-bot check might be missing; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate backport-X.Y.x label(s) before merge.

See additional to-do items at #14175 (comment)

@pllim pllim added this to the v5.3 milestone Dec 13, 2022
@pllim pllim requested review from eteq and mwcraig December 13, 2022 17:20
@pllim pllim added the API change PRs and issues that change an existing API, possibly requiring a deprecation period label Dec 13, 2022
@github-actions
Contributor

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@Cadair
Member
Cadair commented Dec 14, 2022

@DanRyanIrish this looks like something you might want to look at.

@DanRyanIrish
Contributor

Hi @bmorris3. This looks like a specific case of the same problem we are trying to solve in ndcube in this currently open PR: https://github.com/sunpy/ndcube/pull/450/files#diff-929794bb532be1486907b360285433b432f8505ab7e7ff63eca3eab657f7d00dR712

In our case we are implementing superpixel, which downsamples a cube by collapsing contiguous groups of pixels. But a special case of this is to collapse over entire axes. As well as sum, mean, min and max, we are also supporting median. I'm not sure about the timescale of your effort, but perhaps the most desirable long-term result would be for ndcube's superpixel to live in NDData, and then your axis-collapsing methods could just be wrappers around superpixel with a simpler API. What do you think?

I haven't gone through your draft implementation in detail yet, but at least parts of your implementation might be better than what we have in ndcube so far. Possibly vice versa as well. It would be great to collaborate on this, even if, to start with, these PRs proceed separately.

@bmorris3
Contributor Author

Hi @DanRyanIrish, thanks for the heads up on a related effort in sunpy/ndcube#450.

As you say, the superpixel implementation looks slightly different to the collapse methods in this PR. In my domain we'd usually call your operation "binning", and of course, the broadest possible binning is a single bin per axis, so I agree that this astropy PR is a special case of sunpy/ndcube#450.

At first look, there might be two main differences.

  1. superpixel seems to perform the operations on exactly one axis at a time:

https://github.com/sunpy/ndcube/blob/488fc6bba269e6699d940696b1a0691761d90658/ndcube/ndcube.py#L808-L810

  2. superpixel does not support uncertainty propagation for min, max or median:

https://github.com/sunpy/ndcube/blob/488fc6bba269e6699d940696b1a0691761d90658/ndcube/ndcube.py#L731

Two key features of this astropy PR are generalized solutions for (1) and (2). For example, in astronomical observations with spectral cubes, we often have an unresolved star producing most of the signal in a single pixel. A max operation on the two spatial dimensions while preserving the dispersion dimension would extract a first-guess stellar spectrum, while preserving the uncertainties and masks from each pixel in the original cube. I don't think this kind of ND operation is possible with sunpy/ndcube#450.

As you noticed, I'm still working on an efficient implementation of median, so it hasn't been included in this PR yet. Medians are tougher because they require sorting (potentially over multiple axes).

@bmorris3
Contributor Author

Looking closer, I'm not sure that my previous comment is correct about (1). It looks like superpixel loops over the axes.

@DanRyanIrish
Contributor
DanRyanIrish commented Dec 14, 2022

Thanks for your thoughts on sunpy/ndcube#450 @bmorris3.

1. `superpixel` seems to perform the operations on exactly one axis at a time:

As you mentioned subsequently, superpixel does operate on any number of axes, but it applies the operation to each in a loop. It would be nice to get rid of that loop, but we haven't come up with a better way yet. If you have a better implementation, that would be great.

2. `superpixel` does not support uncertainty propagation for `min`, `max` or `median`:

The reason uncertainty propagation is not supported for these operations is that, as far as I can see, the uncertainty is ill-defined. For example, there may be multiple array values equal to the maximum value in the cube. However, each may have different uncertainties, e.g. due to different noise levels in different pixels. In this case, what should the uncertainty be? The same ambiguity holds for min and median.

Two key features of this astropy PR are generalized solutions for (1) and (2). For example, in astronomical observations with spectral cubes, we often have an unresolved star producing most of the signal in a single pixel. A max operation on the two spatial dimensions while preserving the dispersion dimension would extract a first-guess stellar spectrum, while preserving the uncertainties and masks from each pixel in the original cube. I don't think this kind of ND operation is possible with sunpy/ndcube#450.

To my mind, the way to handle this use case and unambiguously maintain the uncertainty is to find the index of the spatial pixel and then slice the cube so it gives you the spectrum at that spatial location. This does involve at least two operations rather than just one, though, and so is a little more cumbersome.

What do you think about the above? Have I misunderstood your use case? Is uncertainty well-defined for min, max and median in some way I haven't thought of?

@bmorris3
Contributor Author

However, each may have different uncertainties, e.g. due to different noise levels in different pixels. In this case what should the uncertainty be? The same ambiguity holds for min, and median.

It's true that min and max are not guaranteed to be unique. This is a generic problem that you see in numpy too, of course. For example, there's a warning in the Notes section in the docs for np.argmin:

In case of multiple occurrences of the minimum values, the indices 
corresponding to the first occurrence are returned.

I think the advantage of this NDDataArray approach, which mirrors the numpy behavior, is that you always get an answer. The answer will be "correct" if the values in the data array are unique, and may correspond to the uncertainty elsewhere if the array is non-unique.

median is a different scenario. The median of an odd-length array is simply the middle value of the sorted data array. That measurement in the data array has a corresponding uncertainty that should be preserved. The median of an even-length array is the mean of the two middle values, which allows us to use the mean implementation here within NDDataArray on the two data array values and their uncertainties.
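To make the reasoning above concrete, here is a minimal plain-numpy sketch of a median with a preserved uncertainty (not code from this PR; median_with_uncertainty, data, and sigma are hypothetical names for illustration):

import numpy as np

def median_with_uncertainty(data, sigma):
    # Median of a 1D array and a corresponding uncertainty, following the
    # odd/even-length reasoning above (illustrative only).
    order = np.argsort(data)
    n = len(data)
    if n % 2 == 1:
        mid = order[n // 2]
        return data[mid], sigma[mid]  # middle value and its own uncertainty
    lo, hi = order[n // 2 - 1], order[n // 2]
    median = 0.5 * (data[lo] + data[hi])
    # mean of two values: propagate as for an average of independent measurements
    error = 0.5 * np.sqrt(sigma[lo] ** 2 + sigma[hi] ** 2)
    return median, error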

@mhvk
Contributor
mhvk commented Dec 14, 2022

Just in case it helps: much of the machinery for propagating masks would seem identical to what is done in the Masked class [1] - might it be possible to re-use that? One specific thing worth mentioning: summing with masks is not all that trivial, as it is not obvious how to "ignore" masked items (replacing with zero not always being a good choice). .mean() is far easier. See https://docs.astropy.org/en/latest/utils/masked/index.html#utils-masked-vs-numpy-maskedarray

[1] E.g.,

def mean(self, axis=None, dtype=None, out=None, keepdims=False, *, where=True):
    # Implementation based on that in numpy/core/_methods.py
    # Cast bool, unsigned int, and int to float64 by default,
    # and do float16 at higher precision.
    is_float16_result = False
    if dtype is None:
        if issubclass(self.dtype.type, (np.integer, np.bool_)):
            dtype = np.dtype("f8")
        elif issubclass(self.dtype.type, np.float16):
            dtype = np.dtype("f4")
            is_float16_result = out is None
    where = ~self.mask & where
    result = self.sum(
        axis=axis, dtype=dtype, out=out, keepdims=keepdims, where=where
    )
    n = np.add.reduce(where, axis=axis, keepdims=keepdims)
    result /= n
    if is_float16_result:
        result = result.astype(self.dtype)
    return result
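For context, a short usage sketch of the Masked class referenced above (arbitrary values; the result follows from the mean implementation quoted here):

import numpy as np
from astropy.utils.masked import Masked

ma = Masked(np.array([1.0, 2.0, 3.0, 4.0]), mask=[False, True, False, False])
# Per the quoted mean implementation, the masked element is excluded:
# (1 + 3 + 4) / 3
print(ma.mean())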

@bmorris3
Contributor Author
bmorris3 commented Dec 14, 2022

@mhvk thanks for the pointer! I was actually just reading about this module because I didn't know much about it.

@bmorris3
Contributor Author

Upon further inspection of astropy.utils.masked, this looks like the way to go and @mhvk is my hero 🦸🏻.

@bmorris3
Contributor Author
bmorris3 commented Dec 15, 2022

I'm still sorting out all of the implications of this, but taking up @mhvk's suggestion to build on Masked arrays from astropy.utils.masked, I pushed some changes in 9ae6115. This allows most numpy functions to be called on NDDataArray objects while following the rules in astropy.utils.masked for propagating masks.

Examples

Suppose you have loaded a spectral cube (two spatial dims, one spectral dim) into an NDDataArray called nddata. Now you can do

np.sum(nddata, axis=(1, 2))  # internally this becomes `nddata.sum(axis=(1,2))`

and you will be returned the collapsed spectrum as an NDDataArray with propagated masks, uncertainties, and units:

NDDataArray([68098.234, 80767.66 , 69840.2  , ..., 94635.03 , 92484.08 ,
             89727.45 ], unit='MJy / sr')

We can only propagate errors, and return an NDDataArray, if we can explicitly define the rules for error propagation for that numpy operation. In this PR I've defined those rules for sum, mean, min, and max.

If you call a numpy function, like std, for which this PR does not add explicit error-propagation support:

np.std(nddata, axis=(1, 2))

it falls back on returning a np.ma.MaskedArray like this:

masked_array(data=[137.33192443847656, 138.0076141357422,
                   137.83934020996094, ..., 238.78318786621094,
                   238.67127990722656, 238.54486083984375],
             mask=[False, False, False, ..., False, False, False],
       fill_value=1e+20,
            dtype=float32)

@mhvk
Contributor
mhvk commented Dec 15, 2022

Happy that this looks helpful! Might it make sense to return MaskedNDArray instead of numpy's MaskedArray? The two behave a little differently (see link above), so that might help consistency.

p.s. At some point I really hope to make some progress on getting proper error propagation, including covariance, supported in uncertainties -- see #3715.

@bmorris3
Contributor Author
bmorris3 commented Dec 15, 2022

@mhvk I agree that the behavior above is inconsistent, but I think this stems from a feature rather than a bug? nddata objects currently pass off masked arrays to numpy operations by default:

def __array__(self):
    """
    This allows code that requests a Numpy array to use an NDData
    object as a Numpy array.
    """
    if self.mask is not None:
        return np.ma.masked_array(self.data, self.mask)
    else:
        return np.array(self.data)

The above snippet exports the NDDataArray to a np.ma.masked_array when you call np.std(nddata, axis=(1, 2)). I think that behavior is intended – if you call a numpy function on an astropy object, you export the astropy object to a numpy object and get out of numpy's way.

The (few) NDDataArray methods that I've implemented in this PR allow you to propagate errors/masks/units when the rules are defined, but require you to use the built-in method on NDDataArray, like NDDataArray.sum. The ethos could be that NDDataArray "knows how to do a sum" on each of its own attributes, while np.std does not.

Clearly, one likely source of confusion in this type of implementation is that you have different output types if you call np.std(nddata) (returns np.ma.masked_array) compared with np.sum(nddata) or nddata.sum() (both return NDDataArray).
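To make the distinction concrete, here is a short sketch of the behavior described above, reusing the nddata spectral cube from the earlier example:

import numpy as np

collapsed = nddata.sum(axis=(1, 2))           # NDDataArray with propagated mask/uncertainty/unit
also_collapsed = np.sum(nddata, axis=(1, 2))  # dispatches to NDDataArray.sum -> NDDataArray
stds = np.std(nddata, axis=(1, 2))            # no propagation rule -> np.ma.MaskedArray via __array__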

@mhvk
Contributor
mhvk commented Dec 15, 2022

OK, that certainly makes sense. Maybe in the future this can change (e.g., one might want to return MaskedQuantity if a unit is set), but for this PR certainly best to stick with the API that is there!

@bmorris3
Contributor Author
bmorris3 commented Dec 15, 2022

Great! I want to preempt another likely source of confusion here: if you call a numpy function like np.mean on an NDDataArray object, and that function also has an implementation on the NDDataArray object (NDDataArray.mean), you still get an NDDataArray. This is a feature of numpy – the source code in numpy for np.mean(x, axis=(1, 2)) actually first tries to call the mean method on x. So for sum, mean, min, and max, the numpy functions also return NDDataArray objects, which might not be expected if you only read my comment above.

Relevant notes in the numpy docs here.

@pllim
Member
pllim commented Dec 19, 2022

There is a conflict now, FYI

@DanRyanIrish
Contributor

It's true that min and max are not guaranteed to be unique. This is a generic problem that you see in numpy too, of course. For example, there's a warning in the Notes section in the docs for np.argmin:

In case of multiple occurrences of the minimum values, the indices 
corresponding to the first occurrence are returned.

I think the advantage of this NDDataArray approach, which mirrors the numpy behavior, is that you always get an answer. The answer will be "correct" if the values in the data array are unique, and may correspond to the uncertainty elsewhere if the array is non-unique.

I guess it comes down to whether you want a correct answer most of the time in a way that mimics numpy, or to never give a wrong answer. This is probably a matter of preference, but one argument against the numpy-like behavior is that numpy arrays don't have uncertainties. So whether np.argmin takes the first occurrence or does something more sophisticated, the end result is degenerate. That's not the case when you have uncertainties.

This has prompted me to attempt to support uncertainty preservation for min/max in ndcube#450. However, instead of taking the first-found uncertainty, I preserve the largest uncertainty. This seems to me to better reflect our actual knowledge of what the min/max value is. Any thoughts?

median is a different scenario. The median of an odd-length array is simply the middle value of the sorted data array. That measurement in the data array has a corresponding uncertainty that should be preserved. The median of an even-length array is the mean of the two middle values, which allows us to use the mean implementation here within NDDataArray on the two data array values and their uncertainties.

I hadn't thought of this approach. It's compelling. One aspect is still not clear to me, though. In the odd-length case, if there are multiple occurrences of the median value, whose uncertainty gets used? Is it meaningful to simply take the "middle" one as determined by the sorting algorithm, or should it depend on the uncertainty, e.g. taking the largest as in the ndcube min/max case?

@mhvk
Contributor
mhvk commented Dec 20, 2022

Bit of a comment from the side, but I think that picking particular elements based on their uncertainty is not really "better"; the question of what the minimum, maximum, and median are in the presence of uncertainties likely has more statistically meaningful answers (which may well depend on the distribution, the number of points, and the confidence level one wishes to have -- just as the mean has an error in the mean, which is not so trivially determined if the data points are not consistent with the mean within their uncertainties, i.e., reduced chi2 inconsistent with unity).

@bmorris3
Contributor Author

I agree with @mhvk. To summarize for others joining the discussion, I think the first decision to make is whether or not we return any result when the operation is ambiguous. We both agree that the user should get an answer, but we're disagreeing on which answer.

The approach @DanRyanIrish is suggesting says "take the most conservative possible uncertainty", which is absolutely a handy approach when doing model optimization on observations, for example. The approach I've implemented says "return the first found minimum/maximum and its corresponding uncertainty", which mirrors the behavior in numpy.

@DanRyanIrish
Contributor
DanRyanIrish commented Dec 22, 2022

To further contribute to this discussion, I spoke with a couple of solar colleagues who argued for a third way of calculating the uncertainty in the min & max cases, which won me over. For the case of max, the returned data value is simply the max over the axis/superpixel. But the uncertainty is (data + uncertainty).max(axis=axis) - data.max(axis=axis), where data and uncertainty are the data and uncertainty arrays and axis is/are the axis/axes over which the data is being collapsed. This captures the maximum possible value given all the uncertainties and data values being collapsed. Hence it is in line with @mhvk's comment that the uncertainty should reflect the distribution of values and uncertainties, not simply be the uncertainty of a given pixel. And I agree that this is more meaningful than my original suggestion or the first-found method. It also should not be significantly more computationally expensive than simply finding the max of the data.
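Rendered as a hypothetical plain-numpy sketch, the suggestion above amounts to something like:

import numpy as np

def max_with_envelope_uncertainty(data, uncertainty, axis):
    # Collapse data with max; the uncertainty is the gap between the maximum
    # possible value (data + uncertainty) and the maximum data value.
    collapsed = data.max(axis=axis)
    error = (data + uncertainty).max(axis=axis) - collapsed
    return collapsed, error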

@bmorris3 bmorris3 force-pushed the extend-nddata branch 2 times, most recently from d5dbb77 to 28a106a Compare December 28, 2022 15:04
@bmorris3 bmorris3 changed the title Introducing collapse operations on NDDataArray ENH nddata: collapse operations on NDDataArray, improved Masked Quantity support Dec 28, 2022
@bmorris3 bmorris3 marked this pull request as ready for review December 28, 2022 17:48
@DanRyanIrish
Contributor

Hmm. Perhaps. exclude_masked_data?

@bmorris3
Contributor Author

In conversations with @mwcraig and @DanRyanIrish, we have agreed that the kwarg discussed earlier (#14175 (comment)) should be changed to operation_ignores_mask, so in 0938834 I've used this new name.

@mwcraig
Member
mwcraig commented Feb 28, 2023

@bmorris3 -- will a rebase fix the issue with the weird coverage results?

Adding masking support in nddata collapse ops with astropy.utils.masked

Adding tests for NDDataArray collapse operations

correcting NDDataArray collapse w propagation for ndim<=3

remove print statement from testing

Fixing unit propagation bug

preserve masked copying behavior

fixing up masked array views in nddata

NDDataArray arithmetic testing improvements

Further NDDataArray testing improvements

Fixing ufunc inplace error test

Updating docs to reflect small API change when not handling masks

Cleaning up PR 14175

Fixing mask overwriting order for masked quantities

more thorough handling of inputs for NDData constructor

return to assumption  NDData.data is ndarray

fixing incorrect result for np.ma.maskedarray

Increasing coverage

adding changelog entry

support for mask=None case

fixing up astropy mask handling in ndarithmetic

further improvements for mask handling

remove amin/amax, drop uncertainties from collapse via min/max by default

generalizing to >3D cube collapse

Apply suggestions from code review

Co-authored-by: Matt Craig <mattwcraig@gmail.com>

Splitting out masked improvements for separate PR

Adding filterwarnings for div by zero error (addressed in another PR)

simplifying private methods for extrema in nddata

Improving test coverage
@bmorris3
Contributor Author

Only one way to find out. Here we go!

Member
@mwcraig mwcraig left a comment


This is good to go once tests pass! Thanks to everyone who has provided input here.

@pllim pllim added the Extra CI Run cron CI as part of PR label Feb 28, 2023
@pllim
Member
pllim commented Feb 28, 2023

Just to be safe, I am also running the cron jobs. 😬

@pllim
Member
pllim commented Feb 28, 2023

Since this is slated to go in soon, I also took the liberty to fix the What's New listing that I noticed while glancing at the diff. Hope you don't mind!

@pllim
Member
pllim commented Feb 28, 2023

linkcheck failure is unrelated. Merging, thank you all very much!

@pllim pllim merged commit 6fcea11 into astropy:main Feb 28, 2023
@bmorris3
Contributor Author

Thanks to all reviewers! I'm very happy to have had your helpful, patient feedback.

Labels
API change (PRs and issues that change an existing API, possibly requiring a deprecation period), Extra CI (Run cron CI as part of PR), nddata, Ready-for-final-review, whatsnew-needed