corrcoef started to escape [-1, 1] range · Issue #7392 · numpy/numpy · GitHub

corrcoef started to escape [-1, 1] range #7392

Closed
yarikoptic opened this issue Mar 8, 2016 · 31 comments

Comments

@yarikoptic
Contributor

Our tests of PyMVPA started to fail while building in Debian sid, which carries a beta of 1.11, so I tried now on master:

$> python -c 'import numpy as np; print np.__version__; d = np.random.normal(size=(100, 100)).astype(np.float64); print (np.max(np.corrcoef(d)) - 1.0)'
1.12.0.dev0+b92cc76
2.22044604925e-16
@yarikoptic
Contributor Author

bisect led to the following commit (08e8c14), which I haven't double-checked, but the commit msg suggests that it must be the one ;)

08e8c1415670b029c26ae8ce0585fb9ea0b11e63 is the first bad commit
commit 08e8c1415670b029c26ae8ce0585fb9ea0b11e63
Author: Ruediger Meier <ruediger.meier@ga-group.nl>
Date:   Sat Sep 26 14:01:26 2015 +0200

    MAINT: corrcoef, memory usage optimization

    We calculate sqrt on the small vector rather than on that huge
    product matrix and we combine the "outer" product with element-wise
    devision. So even though we have a slower loop over the rows now ... this
    code snippet runs about 3 times faster than before.

    However the speed improvement of the whole function is not really
    significant because cov() takes 80-99% of the time (dependent on
    blas/lapack implementation and number of CPU cores).

    More important is that we will safe 1/3 memory. For example corrcoef()
    for a [23k, m]  matrix needs 8GB now instead of 12GB.

:040000 040000 275d9aec0035017678feb97f435d8ae36e163357 9bba7b0eb5a1b1bbe0bd517e0f846943201ff38b M  numpy
bisect run success
git-bisect-py $PWD v1.10.4 pre-removal-numpybook-2863-gb92cc76   419.31s user 15.93s system 97% cpu 7:25.90 total

@charris
Member
charris commented Mar 8, 2016

I note that the error is 1 ulp; I wonder if that is worth worrying about, since roundoff error is to be expected in floating point computations.

@yarikoptic
Contributor Author

IMHO it is not just a precision issue. First of all, it is a regression and IMHO should be treated as such -- it is hard to say how many people may be relying on the fact that corrcoef <= 1.0, which would no longer hold here, regardless of how small the imprecision is.

@njsmith
Member
njsmith commented Mar 10, 2016

@yarikoptic: that seems like a fair point. I guess it's sort of similar to how a 1 eps error that changes the sign is really bad (even though 1 eps errors are usually OK) -- preserving structural invariants is different from plain accuracy.

Do you see an easy way to fix this? (I guess we could just call np.clip, but maybe there's a better way. And surely we are not the first to wonder about the right way to compute correlations :-).)

Cc @rudimeier

@yarikoptic
Contributor Author

Nothing other than straightforward clipping came to mind, but I haven't yet looked into the OPT reimplementation which led to this situation.

@frederikhermans

The issue seems to come from where the sqrt is computed. The new code (08e8c14) computes the outer product of the square root of d with itself. The old code (e6dbcea) computed the square root of the outer product of d with itself. In particular, if we change the old code as follows:

 diff --git a/numpy/lib/function_base.py b/numpy/lib/function_base.py
 index 007ff42..53de721 100644
 --- a/numpy/lib/function_base.py
 +++ b/numpy/lib/function_base.py
 @@ -2349,7 +2349,8 @@ def corrcoef(x, y=None, rowvar=1, bias=np._NoValue, ddof=np._NoValue):
      except ValueError:  # scalar covariance
          # nan if incorrect value (nan, inf, 0), 1 otherwise
          return c / c
 -    return c / sqrt(multiply.outer(d, d))
 +    d = np.sqrt(d)
 +    return c / multiply.outer(d, d)

then the same problem occurs:

 >>> print np.max(np.corrcoef(d) - 1.0)
 2.22044604925e-16

A possible fix would be to change the new code to compute the sqrt in the loop:

 diff --git a/numpy/lib/function_base.py b/numpy/lib/function_base.py
 index e649807..267c0d6 100644
 --- a/numpy/lib/function_base.py
 +++ b/numpy/lib/function_base.py
 @@ -2349,10 +2349,9 @@ def corrcoef(x, y=None, rowvar=1, bias=np._NoValue, ddof=np._NoValue):
      except ValueError:  # scalar covariance
          # nan if incorrect value (nan, inf, 0), 1 otherwise
          return c / c
 -    d = sqrt(d)
      # calculate "c / multiply.outer(d, d)" row-wise ... for memory and speed
      for i in range(0, d.size):
 -        c[i,:] /= (d * d[i])
 +        c[i,:] /= sqrt(d * d[i])
      return c

However, this slows down the execution. Without the proposed change:

 In [13]: d = np.random.normal(size=(2500, 2500)).astype(np.float64)
 In [14]: %timeit np.corrcoef(d)
 1 loop, best of 3: 416 ms per loop

With the proposed change:

 In [15]: %timeit np.corrcoef(d)
 1 loop, best of 3: 519 ms per loop

For reference, here's the performance before the bug was introduced:

 In [16]: %timeit np.corrcoef(d)
 1 loop, best of 3: 471 ms per loop

Hope this helps. I'll see if I can find a fix that doesn't affect performance.
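For intuition, here is a minimal standalone sketch (not from the numpy source) of why the placement of the sqrt matters: sqrt(d[i]*d[j]) and sqrt(d[i])*sqrt(d[j]) round differently, so a diagonal element divided by sqrt(d[i])*sqrt(d[i]) can land one ulp above 1 even though mathematically it is exactly 1.

    import numpy as np

    rng = np.random.RandomState(0)
    d = rng.normal(size=100000) ** 2 + 1.0      # positive "variance-like" values

    old_norm = np.sqrt(d * d)                   # old style: sqrt of the product
    new_norm = np.sqrt(d) * np.sqrt(d)          # new style: product of the sqrts
    print(np.max(np.abs(old_norm - new_norm)))  # typically nonzero (rounding differs)
    print(np.max(d / new_norm) - 1.0)           # typically ~2.2e-16, i.e. a "diagonal" > 1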

@rudimeier
Contributor

On Friday 11 March 2016 11:05:16 frederikhermans wrote:

          # nan if incorrect value (nan, inf, 0), 1 otherwise
          return c / c
 -    return c / sqrt(multiply.outer(d, d))
 +    d = np.sqrt(d)
 +    return c / multiply.outer(d, d)

then the same problem occurs:
>>> print np.max(np.corrcoef(d) - 1.0)

 2.22044604925e-16

IMO you could never prove that any algorithm on certain machines would never
violate this constraint. Every numpy user knows that floating point
calculations on computers are never 100% exact.

So why not simply something like this at the end:
c = max(-1, min(1, c))

@jaimefrio
Member

If it ends up boiling down to @rudimeier's post processing suggestion, remember np.clip!
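For reference, a small sketch of that post-processing step with np.clip, clamping the computed matrix back into [-1, 1] in place (real-valued case only):

    import numpy as np

    c = np.corrcoef(np.random.normal(size=(100, 100)))
    np.clip(c, -1.0, 1.0, out=c)   # any 1 + eps entries become exactly 1
    assert c.max() <= 1.0 and c.min() >= -1.0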

@frederikhermans

I agree with @rudimeier that moving the sqrt around doesn't really fix the core of the issue.

My only concern with clip is that it may mask other bugs: if there are later changes that produce obviously wrong correlation coefficients, these wrong coefficients would be silently clipped to [-1, 1].

@charris
Member
charris commented Mar 11, 2016

@frederikhermans Good point. I'd go further and argue that any program that relied on the inequality being exactly preserved is likely incorrect.

@charris
Member
charris commented Mar 11, 2016

Note that the current implementation still uses too much memory ;)

    for i in range(0, d.size):
        c[i,:] /= (d * d[i])

could be replaced by

c /= d[:, None]
c /= d[None, :]

and d should be inverted first so that * can be used instead of /.
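A rough sketch of that suggestion (corrcoef_sketch is an illustrative helper, not the actual numpy code), assuming a real-valued 2-D input with nonzero variances:

    import numpy as np

    def corrcoef_sketch(x):
        c = np.cov(x)
        d = np.sqrt(np.diag(c))    # per-row standard deviations
        d_inv = 1.0 / d            # invert once, so the hot path multiplies
        c *= d_inv[:, None]        # scale rows ...
        c *= d_inv[None, :]        # ... then columns; no (n, n) outer-product temporary
        return c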

@charris
Member
charris commented Mar 11, 2016

And both of the improved methods can lose symmetry when x is y.

@njsmith
Member
njsmith commented Mar 11, 2016

IMO you could never prove that any algorithm on certain machines would never violate this constraint. Every numpy user knows that floating point calculations on computers are never 100% exact.

This seems unnecessarily pessimistic. Floating point is quirky, but it isn't random; IEEE 754 is very carefully designed to make it possible to write algorithms that have various nice numerical guarantees, even when using floating point. (At least, if you are very clever.)

But yeah, I'm not 100% sure but I do suspect it's true that the old algorithm could also produce values outside [-1, 1]. Apparently not as often, though? I'd still like to find whatever literature there is on this (anyone know a numerical analyst?), because we are certainly not the first to run into it.

I'd go further and argue that any program that relied on the inequality being exactly preserved is likely incorrect.

I dunno. Would you feel the same way about a program that relied on real-valued sqrt returning a number >= 0? (E.g., does code like log(sqrt(x)) look wrong to you?) And I'm not sure how much weight to put on the analogy from code that tries to use exact equality. Exact equality can never be made to work in general, and nothing like it can either, so code that assumes it can is generally structurally broken and needs to be rethought. But code that assumes corrcoef falls in [-1, 1] is generally going to be fine if you give it a corrcoef that satisfies this, and you can always produce such a function e.g. with np.clip. (Using clip makes for inelegant code, but numerically it can only increase accuracy.)

@rudimeier
Contributor

On Friday 11 March 2016 17:32:19 Charles Harris wrote:

@frederikhermans Good point. I'd go further and argue that any program that
relied on the inequality being exactly preserved is likely incorrect.

Of course, but clip to [-1,1] would not hurt anybody.

On Friday 11 March 2016 17:43:28 Charles Harris wrote:

Note that the current implementation still uses too much memory ;)

I feel guilty because in pull request #6396 I had promised to add blas syrk()
but still haven't done that, except for my friend's private project:
sheyma/eigen_decomp@33fdf69

    for i in range(0, d.size):
        c[i,:] /= (d * d[i])

could be replaced by

c /= d[:, None]
c /= d[None, :]

and d should be inverted first so the * could be used instead of /.

This was discussed already in pull request #6396, see
"So I would go for this one, replacing expensive divisions by
multiplications:"

On Friday 11 March 2016 17:53:53 Charles Harris wrote:

And both of the improved methods can lose symmetry when x is y.

This should be fixed by the way when we use syrk().

@charris
Member
charris commented Mar 11, 2016

real-valued sqrt returning a number >= 0

Different case, and the sign is exact. The use of positive square roots instead of negative is a choice, and that is specified. OTOH, it wouldn't surprise me if an algorithm that assumed a computed number was non-negative and took its square root might lead to an exception.

@rudimeier
Contributor

Sorry, my last reply by email was messed up somehow. I think this numpy issue tracker behaves differently than the GitHub default ... Here it is again:

@frederikhermans Good point. I'd go further and argue that any program that
relied on the inequality being exactly preserved is likely incorrect.

Of course, but clip to [-1,1] would not hurt anybody.

Note that the current implementation still uses too much memory ;)

I feel guilty because in pull request #6396 I had promised to add blas syrk()
but still haven't done that, except for my friend's private project :)
sheyma/eigen_decomp@33fdf69

    for i in range(0, d.size):
        c[i,:] /= (d * d[i])

could be replaced by

c /= d[:, None]
c /= d[None, :]

and d should be inverted first so the * could be used instead of /.

This was discussed already in pull request #6396, see
"So I would go for this one, replacing expensive divisions by
multiplications:"

And both of the improved methods can lose symmetry when x is y.

This should be fixed by the way when we use syrk().

@charris
Member
charris commented Mar 11, 2016

This should be fixed by the way when we use syrk().

That applies to the unnormalized correlation. The normalization also needs to preserve symmetry, which is not the case in the improvement I gave, as the divisions take place in a different order for symmetric off-diagonal elements. Probably OK with the current implementation if multiplication is commutative.

@charris
Member
charris commented Mar 11, 2016

I think the function is busted anyway, as diag is used and it returns a view these days. I've had my doubts that making diagonal return a view was a good idea; those doubts are getting stronger...

EDIT: And we are doing an inplace operation on the array from which the diagonal is taken.
EDIT: NVM, taking the sqrt makes a copy. Still, the possibility for problems with diagonal is worrisome.

@charris charris added this to the 1.11.0 release milestone Mar 11, 2016
@charris
Member
charris commented Mar 11, 2016

In any case, we should make sure that inplace operations work with overlapping lhs and rhs before going too far down the view route.

@njsmith
Member
njsmith commented Mar 12, 2016

After thinking a bit and failing to find anything useful in Google, I'm leaning towards using something like

stddev = np.sqrt(c.diagonal())
c /= stddev[:, None]
c /= stddev[None, :]
np.clip(c, 0, 1, out=c)

(Or the / could be replaced by *, I don't care, though I'd want some checks that it doesn't kill numerical stability -- I'm wary because it's the same as doing dot(inv(...)) instead of solve(...).)

@njsmith
Member
njsmith commented Mar 12, 2016

I can even see an argument for doing c[diag_indices] = 1 at the end. No point in returning 0.99999 for these elements when it's basically free to return the exact answer instead.

@charris
Member
charris commented Mar 12, 2016

I think you mean np.clip(c, -1, 1, out=c). Note that some off-diagonal elements may also be exactly +/- 1, but pursuing this logic too far will lead us down the rabbit hole.

@njsmith
Member
njsmith commented Mar 12, 2016

Sorry yes, -1 obviously. And yeah, that's why the clip and diagonal setting are not redundant. But agreed trying to detect exact 1s is definitely outside the scope of this function -- the suggestion was just that it might be worth handling the trivial obvious case exactly.
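A short sketch of that two-step idea (normalize_corr is a hypothetical helper, not the eventual patch): clip into [-1, 1], then force the diagonal to exactly 1. Note that it would overwrite a NaN diagonal, which corrcoef can legitimately produce for degenerate input.

    import numpy as np

    def normalize_corr(c):
        np.clip(c, -1.0, 1.0, out=c)   # clamp roundoff that escapes [-1, 1]
        np.fill_diagonal(c, 1.0)       # the trivial diagonal is known exactly
        return c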

@rudimeier
Contributor

I have tested something like this:

    d = 1.0/sqrt(d)
    n = d.size
    for i in range(0, n):
        c[i,i+1:n] *= (d[i+1:n] * d[i])
        np.clip( c[i,i+1:n], -1, 1, out=c[i,i+1:n])
        c[i+1:n,i] = c[i,i+1:n].conj()
        c[i,i] = 1.0

It's symmetric, in range [-1, 1], and diag = 1.0.
This is BTW more like you would do it in C if you use syrk/herk directly, which does not calculate the lower triangle. That's what I meant by "using syrk implies symmetry".

Would there be a chance of getting this merged if I implement that in C? I expect a speed-up by a factor of 2.

@charris
Member
charris commented Mar 12, 2016

@rudimeier I wouldn't be a fan of a C implementation. IMHO, even clipping verges on overkill; floating point is floating point, and there is virtue in having simple and transparent code.

@charris
Member
charris commented Mar 13, 2016

@njsmith I don't think using the inverse is quite the same as using a matrix inverse; in particular, there is no row pivoting. The advantage of multiplication is that it is faster than division.

@charris
Member
charris commented Mar 13, 2016

Note that clipping doesn't work for complex values. The Schwarz inequality only gives us that abs(x) <= 1 for the elements x of the result. Consequently we would need to clip the modulus of complex values, and I don't see a roundoff-safe way to do that.

@charris
Member
charris commented Mar 13, 2016

For complex observations we should use the real part of the diagonal, and probably just replace the diagonal by 1. There is a limit to what we can do with off-diagonal elements.

@yarikoptic
Contributor Author

On Sat, 12 Mar 2016, Charles Harris wrote:

For complex observations we should use the real part of the diagonal, and
probably just replace the diagonal by 1. There is a limit to what we can do
with off-diagonal elements.

I know that this is a somewhat awkward but, I think, viable use case
against doing anything special to the diagonal: comparing correlation
values to identify identical series (maybe just as a side effect of
computing correlations). If the diagonal got special treatment, then
some naive algorithms seeking numerical identity among the elements of
a row would fail even if, e.g., multiple input series are identical bit
for bit. I am not sure how much better or worse it would be not to have
a diagonal of ones ;)

@charris
Member
charris commented Mar 13, 2016

Note that NaN is a valid result, so we cannot just fill the diagonal with ones. I suggest that, if we want to make a stab at maintaining abs(c[i,j]) <= 1, we simply clip the real and imag parts to the interval [-1, 1]. That doesn't cure all problems but is the only thing that is simple and easy to understand. Myself, I'd be inclined to just leave everything alone...
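A sketch of that suggestion (clip_corr is a hypothetical helper, not the merged fix): clip the real and imaginary parts separately for complex output. This only bounds abs(c) by sqrt(2) rather than 1, but it is simple, and since np.clip propagates NaN, NaN entries are left untouched.

    import numpy as np

    def clip_corr(c):
        if np.iscomplexobj(c):
            np.clip(c.real, -1.0, 1.0, out=c.real)   # .real/.imag are writable views
            np.clip(c.imag, -1.0, 1.0, out=c.imag)
        else:
            np.clip(c, -1.0, 1.0, out=c)
        return c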

@charris
Member
charris commented Mar 13, 2016

PR for consideration in #7414.

charris added a commit to charris/numpy that referenced this issue Mar 14, 2016
The non-nan elements of the result of corrcoef should satisfy the
inequality abs(x) <= 1 and the non-nan elements of the diagonal should
be exactly one. We can't guarantee those results due to roundoff, but
clipping the real and imaginary parts to the interval [-1, 1] improves
things to a small degree.

Closes numpy#7392.
jaimefrio pushed a commit to jaimefrio/numpy that referenced this issue Mar 22, 2016
The non-nan elements of the result of corrcoef should satisfy the
inequality abs(x) <= 1 and the non-nan elements of the diagonal should
be exactly one. We can't guarantee those results due to roundoff, but
clipping the real and imaginary parts to the interval [-1, 1] improves
things to a small degree.

Closes numpy#7392.