corrcoef started to escape [-1, 1] range · Issue #7392 · numpy/numpy · GitHub

corrcoef started to escape [-1, 1] range #7392

Closed
yarikoptic opened this issue Mar 8, 2016 · 31 comments

Comments

@yarikoptic
Contributor

Our tests of PyMVPA started to fail while building in Debian sid, which carries a beta of 1.11, so I tried now on master:

$> python -c 'import numpy as np; print np.__version__; d = np.random.normal(size=(100, 100)).astype(np.float64); print (np.max(np.corrcoef(d)) - 1.0)'
1.12.0.dev0+b92cc76
2.22044604925e-16
@yarikoptic
Contributor Author

bisect led to the following commit (08e8c14), which I haven't double-checked, but the commit msg suggests that it must be the one ;)

08e8c1415670b029c26ae8ce0585fb9ea0b11e63 is the first bad commit
commit 08e8c1415670b029c26ae8ce0585fb9ea0b11e63
Author: Ruediger Meier <ruediger.meier@ga-group.nl>
Date:   Sat Sep 26 14:01:26 2015 +0200

    MAINT: corrcoef, memory usage optimization

    We calculate sqrt on the small vector rather than on that huge
    product matrix and we combine the "outer" product with element-wise
    devision. So even though we have a slower loop over the rows now ... this
    code snippet runs about 3 times faster than before.

    However the speed improvement of the whole function is not really
    significant because cov() takes 80-99% of the time (dependent on
    blas/lapack implementation and number of CPU cores).

    More important is that we will safe 1/3 memory. For example corrcoef()
    for a [23k, m]  matrix needs 8GB now instead of 12GB.

:040000 040000 275d9aec0035017678feb97f435d8ae36e163357 9bba7b0eb5a1b1bbe0bd517e0f846943201ff38b M  numpy
bisect run success
git-bisect-py $PWD v1.10.4 pre-removal-numpybook-2863-gb92cc76   419.31s user 15.93s system 97% cpu 7:25.90 total

@charris
Member
charris commented Mar 8, 2016

I note that the error is 1 ulp; I wonder if that is worth worrying about, since roundoff error is to be expected in floating point computations.

@yarikoptic
Contributor Author

IMHO it is not just a precision issue. First of all, it is a regression and IMHO should be treated as such -- it is hard to say how many people may be relying on the fact that corrcoef <= 1.0, which would no longer hold here, regardless of how small the imprecision is.

@njsmith
Member
njsmith commented Mar 10, 2016

@yarikoptic: that seems like a fair point. I guess it's sort of similar to how a 1 eps error that changes the sign is really bad (even though 1 eps errors are usually OK) -- preserving structural invariants is different from plain accuracy.

Do you see an easy way to fix this? (I guess we could just call np.clip, but maybe there's a better way. And surely we are not the first to wonder about the right way to compute correlations :-).)

Cc @rudimeier

@yarikoptic
Contributor Author

Nothing other than straightforward clipping came to mind, but I haven't yet looked into the OPT reimplementation which led to this situation.

@frederikhermans

The issue seems to come from where the sqrt is computed. The new code (08e8c14) computes the outer product of the square root of d with itself. The old code (e6dbcea) computed the square root of the outer product of d with itself. In particular, if we change the old code as follows:

 diff --git a/numpy/lib/function_base.py b/numpy/lib/function_base.py
 index 007ff42..53de721 100644
 --- a/numpy/lib/function_base.py
 +++ b/numpy/lib/function_base.py
 @@ -2349,7 +2349,8 @@ def corrcoef(x, y=None, rowvar=1, bias=np._NoValue, ddof=np._NoValue):
      except ValueError:  # scalar covariance
          # nan if incorrect value (nan, inf, 0), 1 otherwise
          return c / c
 -    return c / sqrt(multiply.outer(d, d))
 +    d = np.sqrt(d)
 +    return c / multiply.outer(d, d)

then the same problem occurs:

 >>> print np.max(np.corrcoef(d) - 1.0)
 2.22044604925e-16

A possible fix would be to change the new code to compute the sqrt in the loop:

 diff --git a/numpy/lib/function_base.py b/numpy/lib/function_base.py
 index e649807..267c0d6 100644
 --- a/numpy/lib/function_base.py
 +++ b/numpy/lib/function_base.py
 @@ -2349,10 +2349,9 @@ def corrcoef(x, y=None, rowvar=1, bias=np._NoValue, ddof=np._NoValue):
      except ValueError:  # scalar covariance
          # nan if incorrect value (nan, inf, 0), 1 otherwise
          return c / c
 -    d = sqrt(d)
      # calculate "c / multiply.outer(d, d)" row-wise ... for memory and speed
      for i in range(0, d.size):
 -        c[i,:] /= (d * d[i])
 +        c[i,:] /= sqrt(d * d[i])
      return c

However, this slows down the execution. Without the proposed change:

 In [13]: d = np.random.normal(size=(2500, 2500)).astype(np.float64)
 In [14]: %timeit np.corrcoef(d)
 1 loop, best of 3: 416 ms per loop

With the proposed change:

 In [15]: %timeit np.corrcoef(d)
 1 loop, best of 3: 519 ms per loop

For reference, here's the performance before the bug was introduced:

 In [16]: %timeit np.corrcoef(d)
 1 loop, best of 3: 471 ms per loop

Hope this helps. I'll see if I can find a fix that doesn't affect performance.
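For intuition, here is a minimal standalone sketch (not from the numpy source) of why the placement of the sqrt matters: sqrt(d[i]*d[j]) and sqrt(d[i])*sqrt(d[j]) round differently, so a diagonal element divided by sqrt(d[i])*sqrt(d[i]) can land one ulp above 1 even though mathematically it is exactly 1.

    import numpy as np

    rng = np.random.RandomState(0)
    d = rng.normal(size=100000) ** 2 + 1.0      # positive "variance-like" values

    old_norm = np.sqrt(d * d)                   # old style: sqrt of the product
    new_norm = np.sqrt(d) * np.sqrt(d)          # new style: product of the sqrts
    print(np.max(np.abs(old_norm - new_norm)))  # typically nonzero (rounding differs)
    print(np.max(d / new_norm) - 1.0)           # typically ~2.2e-16, i.e. a "diagonal" > 1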

@rudimeier
Contributor

On Friday 11 March 2016 11:05:16 frederikhermans wrote:

          # nan if incorrect value (nan, inf, 0), 1 otherwise
          return c / c
 -    return c / sqrt(multiply.outer(d, d))
 +    d = np.sqrt(d)
 +    return c / multiply.outer(d, d)

then the same problem occurs:
>>> print np.max(np.corrcoef(d) - 1.0)

 2.22044604925e-16

IMO you could never prove that any algorithm on certain machines would never
violate this constraint. Every numpy user knows that floating point
calculations on computers are never 100% exact.

So why not simply something like this at the end:
c = max(-1, min(1, c))

@jaimefrio
Member

If it ends up boiling down to @rudimeier's post processing suggestion, remember np.clip!
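For reference, a small sketch of that post-processing step with np.clip, clamping the computed matrix back into [-1, 1] in place (real-valued case only):

    import numpy as np

    c = np.corrcoef(np.random.normal(size=(100, 100)))
    np.clip(c, -1.0, 1.0, out=c)   # any 1 + eps entries become exactly 1
    assert c.max() <= 1.0 and c.min() >= -1.0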

@frederikhermans

I agree with @rudimeier that moving the sqrt around doesn't really fix the core of the issue.

My only concern with clip is that it may mask other bugs: if there are later changes that produce obviously wrong correlation coefficients, these wrong coefficients would be silently clipped to [-1, 1].

@charris
Member
charris commented Mar 11, 2016

@frederikhermans Good point. I'd go further and argue that any program that relied on the inequality being exactly preserved is likely incorrect.

@charris
Member
charris commented Mar 11, 2016

Note that the current implementation still uses too much memory ;)

    for i in range(0, d.size):
        c[i,:] /= (d * d[i])

could be replaced by

c /= d[:, None]
c /= d[None, :]

and d should be inverted first so that * can be used instead of /.
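A rough sketch of that suggestion (corrcoef_sketch is an illustrative helper, not the actual numpy code), assuming a real-valued 2-D input with nonzero variances:

    import numpy as np

    def corrcoef_sketch(x):
        c = np.cov(x)
        d = np.sqrt(np.diag(c))    # per-row standard deviations
        d_inv = 1.0 / d            # invert once, so the hot path multiplies
        c *= d_inv[:, None]        # scale rows ...
        c *= d_inv[None, :]        # ... then columns; no (n, n) outer-product temporary
        return c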

@charris
Member
charris commented Mar 11, 2016

And both of the improved methods can lose symmetry when x is y.

@njsmith
Member
njsmith commented Mar 11, 2016

IMO you could never prove that any algorithm on certain machines would never violate this constraint. Every numpy user knows that floating point calculations on computers are never 100% exact.

This seems unnecessarily pessimistic. Floating point is quirky, but it isn't random; IEEE 754 is very carefully designed to make it possible to write algorithms that have various nice numerical guarantees, even when using floating point. (At least, if you are very clever.)

But yeah, I'm not 100% sure but I do suspect it's true that the old algorithm could also produce values outside [-1, 1]. Apparently not as often, though? I'd still like to find whatever literature there is on this (anyone know a numerical analyst?), because we are certainly not the first to run into it.

I'd go further and argue that any program that relied on the inequality being exactly preserved is likely incorrect.

I dunno. Would you feel the same way about a program that relied on real-valued sqrt returning a number >= 0? (E.g., does code like log(sqrt(x)) look wrong to you?) And I'm not sure how much weight to put on the analogy from code that tries to use exact equality. Exact equality can never be made to work in general, and nothing like it can either, so code that assumes it can is generally structurally broken and needs to be rethought. But code that assumes corrcoef falls in [-1, 1] is generally going to be fine if you give it a corrcoef that satisfies this, and you can always produce such a function e.g. with np.clip. (Using clip makes for inelegant code, but numerically it can only increase accuracy.)

@rudimeier
Contributor

On Friday 11 March 2016 17:32:19 Charles Harris wrote:

@frederikhermans Good point. I'd go further and argue that any program that
relied on the inequality being exactly preserved is likely incorrect.

Of course, but clip to [-1,1] would not hurt anybody.

On Friday 11 March 2016 17:43:28 Charles Harris wrote:

Note that the current implementation still uses too much memory ;)

I feel guilty because in pull request #6396 I had promised to add blas syrk()
but still haven't done that, except for my friend's private project:
sheyma/eigen_decomp@33fdf69

    for i in range(0, d.size):
        c[i,:] /= (d * d[i])

could be replaced by

c /= d[:, None]
c /= d[None, :]

and d should be inverted first so the * could be used instead of /.

This was discussed already in pull request #6396, see
"So I would go for this one, replacing expensive divisions by
multiplications:"

On Friday 11 March 2016 17:53:53 Charles Harris wrote:

And both of the improved methods can lose symmetry when x is y.

This should be fixed by the way when we use syrk().

@charris
Member
charris commented Mar 11, 2016

real-valued sqrt returning a number >= 0

Different case, and the sign is exact. The use of positive square roots instead of negative is a choice, and that is specified. OTOH, it wouldn't surprise me if an algorithm that assumed a computed number was non-negative and took its square root might lead to an exception.

@rudimeier
Contributor

Sorry, my last reply by email was messed up somehow. I think this numpy issue tracker behaves differently than the GitHub default ... Here it is again:

@frederikhermans Good point. I'd go further and argue that any program that
relied on the inequality being exactly preserved is likely incorrect.

Of course, but clip to [-1,1] would not hurt anybody.

Note that the current implementation still uses too much memory ;)

I feel guilty because in pull request #6396 I had promised to add blas syrk()
but still haven't done that, except for my friend's private project :)
sheyma/eigen_decomp@33fdf69

    for i in range(0, d.size):
        c[i,:] /= (d * d[i])

could be replaced by

c /= d[:, None]
c /= d[None, :]

and d should be inverted first so the * could be used instead of /.

This was discussed already in pull request #6396, see
"So I would go for this one, replacing expensive divisions by
multiplications:"

And both of the improved methods can lose symmetry when x is y.

This should be fixed by the way when we use syrk().

@charris
Member
charris commented Mar 11, 2016

This should be fixed by the way when we use syrk().

That applies to the unnormalized correlation. The normalization also needs to preserve symmetry, which is not the case in the improvement I gave, as the divisions take place in a different order for symmetric off-diagonal elements. Probably OK with the current implementation if multiplication is commutative.

@charris
Member
charris commented Mar 11, 2016

I think the function is busted anyway, as diag is used and it returns a view these days. I've had my doubts that making diagonal return a view was a good idea; those doubts are getting stronger...

EDIT: And we are doing an inplace operation on the array from which the diagonal is taken.
EDIT: NVM, taking the sqrt makes a copy. Still, the possibility for problems with diagonal is worrisome.

@charris charris added this to the 1.11.0 release milestone Mar 11, 2016
@charris
Member
charris commented Mar 11, 2016

In any case, we should make sure that inplace operations work with overlapping lhs and rhs before going too far down the view route.

@njsmith
Member
njsmith commented Mar 12, 2016

After thinking a bit and failing to find anything useful in Google, I'm leaning towards using something like

stddev = np.sqrt(c.diagonal())
c /= stddev[:, None]
c /= stddev[None, :]
np.clip(c, 0, 1, out=c)

(Or the / could be replaced by *, I don't care, though I'd want some checks that it doesn't kill numerical stability -- I'm wary because it's the same as doing dot(inv(...)) instead of solve(...).)

@njsmith
Member
njsmith commented Mar 12, 2016

I can even see an argument for doing c[diag_indices] = 1 at the end. No point in returning 0.99999 for these elements when it's basically free to return the exact answer instead.

@charris
Member
charris commented Mar 12, 2016

I think you mean np.clip(c, -1, 1, out=c). Note that some off-diagonal elements may also be exactly +/- 1, but pursuing this logic too far will lead us down the rabbit hole.

@njsmith
Member
njsmith commented Mar 12, 2016

Sorry yes, -1 obviously. And yeah, that's why the clip and diagonal setting are not redundant. But agreed trying to detect exact 1s is definitely outside the scope of this function -- the suggestion was just that it might be worth handling the trivial obvious case exactly.
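A short sketch of that two-step idea (normalize_corr is a hypothetical helper, not the eventual patch): clip into [-1, 1], then force the diagonal to exactly 1. Note that it would overwrite a NaN diagonal, which corrcoef can legitimately produce for degenerate input.

    import numpy as np

    def normalize_corr(c):
        np.clip(c, -1.0, 1.0, out=c)   # clamp roundoff that escapes [-1, 1]
        np.fill_diagonal(c, 1.0)       # the trivial diagonal is known exactly
        return c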

@rudimeier
Contributor

I have tested something like this:

    d = 1.0/sqrt(d)
    n = d.size
    for i in range(0, n):
        c[i,i+1:n] *= (d[i+1:n] * d[i])
        np.clip( c[i,i+1:n], -1, 1, out=c[i,i+1:n])
        c[i+1:n,i] = c[i,i+1:n].conj()
        c[i,i] = 1.0

It's symmetric, in range [-1, 1], and diag = 1.0.
This is BTW more like you would do it in C if you use syrk/herk directly, which does not calculate the lower triangle. That's what I meant by "using syrk implies symmetry".

Would there be a chance of getting this merged if I implement that in C? I expect a speed-up by a factor of 2.

@charris
Member
charris commented Mar 12, 2016

@rudimeier I wouldn't be a fan of a C implementation. IMHO, even clipping verges on overkill; floating point is floating point, and there is virtue in having simple and transparent code.

@charris
Member
charris commented Mar 13, 2016

@njsmith I don't think using the inverse is quite the same as using a matrix inverse; in particular, there is no row pivoting. The advantage of multiplication is that it is faster than division.

@charris
Member
charris commented Mar 13, 2016

Note that clipping doesn't work for complex values. The Schwarz inequality only gives us that abs(x) <= 1 for the elements x of the result. Consequently we would need to clip the modulus of complex values, and I don't see a roundoff-safe way to do that.

@charris
Member
charris commented Mar 13, 2016

For complex observations we should use the real part of the diagonal, and probably just replace the diagonal by 1. There is a limit to what we can do with off-diagonal elements.

@yarikoptic
Contributor Author

On Sat, 12 Mar 2016, Charles Harris wrote:

For complex observations we should use the real part of the diagonal, and
probably just replace the diagonal by 1. There is a limit to what we can do
with off-diagonal elements.

I know that this is a somewhat awkward but, I think, viable use case
against doing anything special to the diagonal: comparing correlation
values to identify identical series (maybe just as a side effect of
computing correlations). If the diagonal got special treatment, then
some naive algorithms seeking numerical identity among the elements of
a row would fail even if, e.g., multiple input series are identical bit
for bit. I am not sure how much better or worse it would be not to have
a diagonal of ones ;)

@charris
Member
charris commented Mar 13, 2016

Note that NaN is a valid result, so we cannot just fill the diagonal with ones. I suggest that, if we want to make a stab at maintaining abs(c[i,j]) <= 1, we simply clip the real and imag parts to the interval [-1, 1]. That doesn't cure all problems but is the only thing that is simple and easy to understand. Myself, I'd be inclined to just leave everything alone...
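A sketch of that suggestion (clip_corr is a hypothetical helper, not the merged fix): clip the real and imaginary parts separately for complex output. This only bounds abs(c) by sqrt(2) rather than 1, but it is simple, and since np.clip propagates NaN, NaN entries are left untouched.

    import numpy as np

    def clip_corr(c):
        if np.iscomplexobj(c):
            np.clip(c.real, -1.0, 1.0, out=c.real)   # .real/.imag are writable views
            np.clip(c.imag, -1.0, 1.0, out=c.imag)
        else:
            np.clip(c, -1.0, 1.0, out=c)
        return c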

@charris
Member
charris commented Mar 13, 2016

PR for consideration in #7414.

charris added a commit to charris/numpy that referenced this issue Mar 14, 2016
The non-nan elements of the result of corrcoef should satisfy the
inequality abs(x) <= 1 and the non-nan elements of the diagonal should
be exactly one. We can't guarantee those results due to roundoff, but
clipping the real and imaginary parts to the interval [-1, 1] improves
things to a small degree.

Closes numpy#7392.
jaimefrio pushed a commit to jaimefrio/numpy that referenced this issue Mar 22, 2016
The non-nan elements of the result of corrcoef should satisfy the
inequality abs(x) <= 1 and the non-nan elements of the diagonal should
be exactly one. We can't guarantee those results due to roundoff, but
clipping the real and imaginary parts to the interval [-1, 1] improves
things to a small degree.

Closes numpy#7392.