WIP Variable class that includes uncertainties; application to UQuantity by mhvk · Pull Request #3715 · astropy/astropy

WIP Variable class that includes uncertainties; application to UQuantity #3715


Closed · wants to merge 1 commit

Conversation

@mhvk (Contributor) commented Apr 23, 2015

EDIT (updated): this PR provides ways to propagate uncertainties analytically, keeping track of correlations. It is meant as a companion to the Distribution class, which effectively follows a Monte-Carlo approach to propagating uncertainties. Items still to do:

  • Decide on a name (Variable is poor, Measurement does not include propagation/correlated errors, Variate suggests random number generation; Measurand may be closest but is not common);
  • Like for Distribution, consider auto-generating subclasses (VariableArray, VariableQuantity, etc.); a rough sketch of the idea follows this list;
  • Make the PR consistent with the current minimum numpy version (i.e., use __array_ufunc__ only, no __array_prepare__, etc.);
  • Implement reductions; by default, these should drop correlations.
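For the auto-generation bullet, a rough sketch of the idea (illustrative only; the names VariableBase, _class_cache and variable_class_for are made up for this sketch, and nothing here is the PR's actual code):

import numpy as np

class VariableBase:
    """Stand-in for the uncertainty-propagation machinery (hypothetical)."""

_class_cache = {}

def variable_class_for(array_cls):
    """Auto-generate and cache a subclass mixing the propagation
    machinery into a given array class."""
    if array_cls not in _class_cache:
        name = 'Variable' + array_cls.__name__
        _class_cache[array_cls] = type(name, (VariableBase, array_cls), {})
    return _class_cache[array_cls]

# variable_class_for(np.ndarray) would play the role of VariableArray;
# with astropy's Quantity it would yield a VariableQuantity.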

Follow-up of this PR might include:

  • Implement user control over when/whether correlations are kept.
  • Ensure that error propagation can be used as a module in nddata.

BELOW IS THE OUTDATED ORIGINAL TEXT

Very much work in progress, so do not merge

This is a first stab at a general Variable class, a subclass of ndarray that tracks uncertainties. Also implemented is an application to UQuantity, i.e., a mixin with Quantity. It is inspired by the uncertainties package [1], written by @lebigot, and many of the test cases are taken from the GSoC work of @Epitrochoid. Unlike those efforts, however, this is a proper ndarray subclass (in @Epitrochoid's approach arrays were not yet possible, while uncertainties uses object arrays holding Variable instances, which is much slower).

Note that I started writing this using __numpy_ufunc__, which is the most logical and fastest approach, and hence this PR includes the commits of #2583 and #2948. Upon further thought, however, it became clear that this would work with the usual __array_prepare__ and __array_wrap__ as well; for now, I didn't feel it was worth removing the __numpy_ufunc__ machinery.

Anyway, as is, this works well for generic ufunc operations. One big outstanding issue is what to do with methods that combine (parts of) an array, such as sum or mean. E.g., if we do x - x.mean(), where x is some large array, do we want to track for every element of x its dependence on every other element of x (which, for a 1000x1000 image would imply a 1000x1000x1000 derivative matrix).
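To make this concrete, here is a minimal sketch of the underlying idea, first-order propagation with tracked derivatives (the class name Var is made up here; this is not the implementation in this PR):

import numpy as np

class Var:
    """Toy value-with-uncertainty that tracks derivatives w.r.t. the
    independent inputs, so correlations propagate correctly."""
    def __init__(self, value, uncertainty=None, derivatives=None):
        self.value = np.asarray(value, dtype=float)
        if derivatives is None:
            # A new independent variable: derivative 1 w.r.t. itself.
            derivatives = {id(self): (np.ones_like(self.value),
                                      np.asarray(uncertainty, dtype=float))}
        self.derivatives = derivatives

    @property
    def uncertainty(self):
        # Gaussian propagation: add contributions in quadrature.
        return np.sqrt(sum((deriv * sigma) ** 2
                           for deriv, sigma in self.derivatives.values()))

    def __sub__(self, other):
        derivs = dict(self.derivatives)
        for key, (deriv, sigma) in other.derivatives.items():
            if key in derivs:  # shared input: derivatives combine
                derivs[key] = (derivs[key][0] - deriv, sigma)
            else:
                derivs[key] = (-deriv, sigma)
        return Var(self.value - other.value, derivatives=derivs)

x = Var(10.0, 2.0)
print((x - x).uncertainty)  # 0.0: the correlation cancels exactly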

cc: @wkerzendorf, @eteq, @mwcraig

@lebigot: please let me know if you are interested in using this in your package.

[1] https://github.com/lebigot/uncertainties

@mhvk force-pushed the uncertainty branch 4 times, most recently from 113f8d1 to ef14802 on April 23, 2015 19:29
@embray (Member) commented Apr 24, 2015

Remind me to look closer at this later; it might be good to start looking into if/how this can be integrated into modeling as well.

@lebigot commented May 6, 2015

Thanks @mhvk: I will have a look at what you did.

The "big outstanding issue" that you are mentioning (large memory consumption for even simple array operations) is tricky (it is one of the main reasons why I did not delve sooner into this business!). A NxN = 1000x1000 matrix x of numbers with uncertainties actually gives through x - x.mean() an N^4 array of derivatives (not N^3) so the difficulty is even worse than what you mention.

There are C (and C++?) libraries out there that handle uncertainties in arrays: maybe they solved this issue and can be a source of inspiration?

UFUNC_DERIVATIVES[np.abs] = UFUNC_DERIVATIVES[np.fabs]

if hasattr(np, 'cbrt'):
    UFUNC_DERIVATIVES[np.cbrt] = lambda x: 1./(3.*np.cbrt(x)**2)

Shouldn't this be **(2/3)? (__future__ is used, so no need for 2./3).

Contributor Author

I had to think again about why I did it, but then remembered: np.cbrt is pretty optimized, as is squaring. As a result, what I have is faster than just raising to the power 2/3:

In [2]: a = np.arange(100.)

In [3]: %timeit np.cbrt(a)
100000 loops, best of 3: 2.93 µs per loop

In [4]: %timeit np.cbrt(a)**2
100000 loops, best of 3: 3.78 µs per loop

In [5]: %timeit a**(2/3)
100000 loops, best of 3: 7.33 µs per loop


Ah, sorry, I had missed the fact that you had put np.cbrt(x): I thought it was only x. Good timing results (I guess that in your example you have 2/3 = 0.666… and not 0, right?).

@mhvk (Contributor Author) commented May 7, 2015

@lebigot - indeed, you're right: in x - x.mean(), each element would depend on each other element, so it would be a 1000**4-element derivative matrix. A nice way to run out of memory... Of course, in that case, the dependence on the other elements is also so small that one can really ignore it. At some level, I think this has to be left to common sense: if a variable depends on more than some large number of others, one just calculates the uncertainty and forgets the dependencies. Or one makes a constant uncertainty component from all variables that contribute less than some percentage to the total uncertainty.

Overall, if only for clarity of implementation, it seems reasonably sensible to let any of the reduce methods (like mean, sum, etc.) ignore the dependencies. But I welcome further thought.
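As a concrete sketch of that default (illustrative only; mean_dropping_correlations is a made-up name, not code from this PR), a reduction like mean would propagate the variances but return a result that no longer remembers its dependence on the inputs:

import numpy as np

def mean_dropping_correlations(values, uncertainties):
    """Propagate variances through a mean, then drop the dependencies."""
    values = np.asarray(values, dtype=float)
    sigmas = np.asarray(uncertainties, dtype=float)
    n = values.size
    # d(mean)/d(x_i) = 1/n for every element, so to first order:
    mean_sigma = np.sqrt(np.sum(sigmas ** 2)) / n
    # Returning plain numbers discards the link back to the inputs, so a
    # subsequent x - x.mean() would (slightly) overestimate the errors.
    return values.mean(), mean_sigma

print(mean_dropping_correlations([1.0, 2.0, 3.0], [0.1, 0.1, 0.1]))
# (2.0, 0.0577...)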

@lebigot commented May 7, 2015

Good that you welcome further thoughts: I have needed mean() etc. with uncertainties quite a few times. Handling its uncertainty is almost OK: the mean with its derivative information is essentially only as big as the array itself. So, in my ideal world, mean() would keep its correct uncertainty. Optimizations and approximations could be triggered, though, for x - x.mean(). Again, I would be curious to see how C libraries that handle uncertainties in arrays tackle this.

@astropy-bot (bot) commented Sep 27, 2017

Hi there @mhvk 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog and making sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃.

I see this is an experimental pull request. I'll report back on the checks once the PR discussion is settled.

If there are any issues with this message, please report them here.

@mhvk (Contributor Author) commented Sep 27, 2017

Rebased. @eteq - it would be good to discuss this together with the attempt at implementing distributions.


@astropy deleted a comment from astropy-bot (bot) on Sep 27, 2017
@bsipocz (Member) commented Sep 27, 2017

@mhvk - I see that you've deleted a comment from the bot. Was it misbehaving? Any feedback is great at this point, so we can improve it.

@mhvk (Contributor Author) commented Sep 27, 2017

Not at all, it was simply no longer relevant, so I hoped to make life easier for someone going over the earlier discussion.


import numpy as np
from astropy.utils.misc import isiterable
from astropy.utils.compat import NUMPY_LT_1_10
Member

since we require np 1.10+, this isn't available any more

@@ -0,0 +1,5 @@
# Licensed under a 3-clause BSD style license - see LICENSE.rst
from __future__ import absolute_import
Member

please remove the future import

        return result

    def __array_finalize__(self, obj):
        if super(Variable, self).__array_finalize__:
Contributor

With only Python 3 support you don't need the arguments for super.


    def __array_finalize__(self, obj):
        if super(Variable, self).__array_finalize__:
            super(Variable, self).__array_finalize__(obj)
Contributor

Ditto.

        value = q_cls(value, unit, **kwargs)
        if uncertainty is not None:
            uncertainty = q_cls(uncertainty, value.unit)
        return super(UQuantity, cls).__new__(cls, value, uncertainty, **kwargs)
Contributor

Ditto.

        # Use str('U') to avoid unicode for class name in python2.
        self[q_cls] = type(str('U') + q_cls.__name__,
                           (UQuantity, Variable, q_cls), {})
        return super(_SubclassDict, self).__getitem__(q_cls)
Contributor

Ditto.

@mhvk (Contributor Author) commented Jul 25, 2019

Talking about NDData interaction: we need a way to ignore covariance. Similarly, we may want some way to ensure covariance is tracked even in reductions.
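One possible shape for that control (purely hypothetical; neither _TRACK_COVARIANCE nor track_covariance exists in this PR) would be a module-level flag with a context manager, which NDData-style code could use to opt out cheaply:

import contextlib

_TRACK_COVARIANCE = True  # hypothetical module-level default

@contextlib.contextmanager
def track_covariance(enabled):
    """Temporarily switch covariance tracking on or off."""
    global _TRACK_COVARIANCE
    previous = _TRACK_COVARIANCE
    _TRACK_COVARIANCE = enabled
    try:
        yield
    finally:
        _TRACK_COVARIANCE = previous

# Hypothetical usage:
# with track_covariance(False):
#     result = a * b + c   # propagate variances only, no cross terms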

@astromancer (Contributor)

@mhvk This is great work, and very necessary in my opinion! I'm keen to help you develop and test this. It would be helpful if you could point out the aspects on which help would be most beneficial; based on your comments, it seems that right now making the covariance-tracking mechanism optional is a priority. Otherwise I'll just start hacking and send PRs to your fork.

# -*- coding: utf-8 -*-
# Licensed under a 3-clause BSD style license - see LICENSE.rst
"""
Distribution class and associated machinery.
Contributor

This should probably read "Uncertainty" instead of "Distribution"

    np.multiply: (lambda x, y: y,
                  lambda x, y: x),
    np.true_divide: (lambda x, y: 1/y,
                     lambda x, y: -x/y**2),
Contributor

In terms of performance, np.square is typically somewhat faster (5-15% in my experience) than using the power operator to square. Replacing all the squares here with np.square should give you a moderate performance boost. The same is probably true for np.power, but I have not tested this.

Contributor Author

Numpy makes this optimization internally for __pow__: https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/number.c#L507

Contributor

Interesting. I wasn't aware of this optimization. I convinced myself that np.square was faster by running this test, but I suppose these results are themselves uncertain (ahem):

for p in range(1, 8):
    n = int(10 ** p)
    x = np.random.rand(n)
    t_pow = %timeit -o x ** 2
    t_sqr = %timeit -o np.square(x)
    speedup = np.mean(1 - np.divide(t_sqr.timings, t_pow.timings))
    print('speedup (n = %.0e): %.2f%%' % (n, speedup * 100))

The results were:

971 ns ± 63.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
911 ns ± 52.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
speedup (n = 1e+01): 5.71%
977 ns ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
909 ns ± 23.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
speedup (n = 1e+02): 6.84%
1.6 µs ± 45.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.53 µs ± 73.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
speedup (n = 1e+03): 4.19%
5.23 µs ± 407 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.81 µs ± 302 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
speedup (n = 1e+04): 7.70%
64.2 µs ± 5.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
66 µs ± 7.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
speedup (n = 1e+05): -3.82%
1.86 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.86 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
speedup (n = 1e+06): -0.71%
55.4 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
47.9 ms ± 7.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
speedup (n = 1e+07): 12.76%

Contributor Author

Quite large fluctuations, indeed. Overall, my sense would be to leave it for the future - there may be larger advantages to be had by carefully ensuring more complicated operations are done in-place (though some of that is also done automatically in numpy).

        return '{0}({1})'.format(type(self).__name__, self.derivatives)


class Variable(np.ndarray):
Contributor

I wonder if there may not be a better name for this class. "Variable" is a term that is already quite overloaded; it doesn't seem such a good fit for the intended use of this class and could be confusing to the uninitiated. In statistical parlance, the objects represented by (instances of) this class are random variates, that is, realizations of a random variable. It would also be a good idea to include "Array" in the name to indicate that this is actually a numpy array subclass. I would therefore suggest RandomVariateArray as a potential alternative name.

Contributor Author

Agreed the name is very poor, but "random" to me suggests there is a random number generator involved, which is of course not the case. Maybe Measurement? But then I hope to use the same thing also for derived quantities... Wikipedia suggests Measurand, but this is the first time I have ever seen that term...

Contributor

I'm in favour of Measurement or Variate or even VariateArray over Variable

Contributor Author

My problem with Variate remains the association with random number generators - I have not seen it used for measurements or derived quantities. My problem with Measurement is that it suggests no dependence on other objects, while the whole point of the class is to include correlations. But Variable sounds like a variable in a program, which is also no good.

What I'll do for now is put an item right at the top that we have to decide on the name...

@mhvk (Contributor Author) commented Sep 11, 2019

@astromancer - it would be great to get some help with this. It is also on the schedule for a bit of hacking in November with @eteq, but the further along it is by that time, the better - a definite goal is to have it in 4.0.

The discussion above indeed mentions opting in/out of tracking covariance, but my feeling is that it is most important to have a good default. I think for now this has to mean that all reductions drop the covariances.

Looking at the code, though, I see that it really is a bit out of date (e.g., still having __array_prepare__ and __array_wrap__). I also think I would now do the construction a bit more like Distribution, where there is a non-array subclass that deals with the propagation and which mixes itself in with the input class to make a new class.

But I don't think any of the above should stop you from contributing. In particular, it would be fantastic to have more tests. Or to have the basis for the reductions (method "reduce" in __array_ufunc__).
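For orientation, a minimal sketch of where that "reduce" hook lives (illustrative only; the derivative bookkeeping from this PR is elided, so this is not the actual implementation):

import numpy as np

class Variable(np.ndarray):
    """Sketch of where reductions would be intercepted."""
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Strip the subclass so numpy performs the plain operation.
        arrays = [np.asarray(inp) for inp in inputs]
        result = getattr(ufunc, method)(*arrays, **kwargs)
        if method == 'reduce':
            # Proposed default: compute the reduced uncertainty, but
            # drop the correlations with the inputs (return plain data).
            return result
        # For method == '__call__', one would wrap `result` back into
        # Variable and attach the propagated derivatives here.
        return result

v = np.arange(3.).view(Variable)
print(np.add.reduce(v))  # 3.0, as a plain value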

@bsipocz (Member) commented Sep 11, 2019

it is also on the schedule for a bit of hacking in November with @eteq, but the further it is by that time, the better - a definite goal to have it in 4.0.

@mhvk @eteq - feature freeze for 4.0 is the 25th of October

@mhvk modified the milestones: v4.0, v4.1 on Oct 24, 2019
@mhvk (Contributor Author) commented Oct 24, 2019

Made some good progress with this, but it will not be ready by tomorrow. And, as for Masked Quantities, it is best to introduce this in a feature release, not an LTS.

@astropy-bot (bot) commented Nov 18, 2019

Hi humans 👋 - this pull request hasn't had any new commits for approximately 5 months. I plan to close this in a month if the pull request doesn't have any new commits by then.

In lieu of a stalled pull request, please consider closing this and opening an issue instead if a reminder is needed to revisit it in the future. Maintainers may also choose to add the keep-open label to keep this PR open, but this is discouraged unless absolutely necessary.

If this PR still needs to be reviewed, as an author, you can rebase it to reset the clock.

If you believe I commented on this pull request incorrectly, please report this here.

@bsipocz (Member) commented Nov 18, 2019

@mhvk - could you rebase to keep this PR alive?

@astropy-bot (bot) commented Dec 19, 2019

I'm going to close this pull request as per my previous message. If you think what is being added/fixed here is still important, please remember to open an issue to keep track of it. Thanks!

If this is the first time I am commenting on this issue, or if you believe I closed this issue incorrectly, please report this here.

@astropy-bot (bot) closed this Dec 19, 2019
@bsipocz removed this from the v4.1 milestone Jan 2, 2020
@nstarman (Member)
@mhvk, v5.0? Would be very cool.

@mhvk (Contributor Author) commented Apr 20, 2021

@nstarman - I'd happily revive this!

@nstarman (Member)

And I'd be happy to help. I would love a way to naturally represent errors in coordinates and this seems a natural first stepping stone.
