np.sum returns wrong answers · Issue #21769 · numpy/numpy · GitHub

np.sum returns wrong answers #21769


Closed
nc10 opened this issue Jun 15, 2022 · 16 comments
Labels
33 - Question Question about NumPy usage or development

Comments

nc10 commented Jun 15, 2022

Describe the issue:

When I calculate the tail probability of a Poisson distribution, I get a wrong value with np.sum, while the built-in sum returns the correct value.

Reproduce the code example:

import pandas as pd
from scipy.stats import poisson
import numpy as np

rate_1 = 7*10**-6
rate_2 = 7.5*10**-6
rate_3 = 8*10**-6
scipy_dist_1 = poisson.pmf(np.arange(9), rate_1)
scipy_dist_2 = poisson.pmf(np.arange(9), rate_2)
scipy_dist_3 = poisson.pmf(np.arange(9), rate_3)
comparison = pd.DataFrame(
    {"rate=7*10**-6": scipy_dist_1,
     "rate=7.5*10**-6": scipy_dist_2,
     "rate=8*10**-6": scipy_dist_3, }
)
print(f"First 8 probabilities for three different rates:\n{comparison}")

comparison_tail_np = pd.DataFrame(
    {"rate=7*10**-6": (1-np.sum(scipy_dist_1)),
     "rate=7.5*10**-6": (1-np.sum(scipy_dist_2)),
     "rate=8*10**-6": (1-np.sum(scipy_dist_3))},
    index=[1]
)

print(f"Tail probabilities using 'np.sum':\n{comparison_tail}")

comparison_tail = pd.DataFrame(
    {"rate=7*10**-6": (1-sum(scipy_dist_1)),
     "rate=7.5*10**-6": (1-sum(scipy_dist_2)),
     "rate=8*10**-6": (1-sum(scipy_dist_3))},
    index=[1]
)

print(f"Tail probabilities using 'sum':\n{comparison_tail}")

Error message:

No error message, returns a wrong answer!

Couple of remarks: 
1. The rates are increasing; therefore, the distributions are stochastically ordered. This means that the tail probabilities should be increasing (if they are not zero).
2. The rates are quite small; therefore, I was expecting to get all zeros. 
3. This comes up when one wants to compute derivatives numerically, so it is not a rare use case. 
4. Scipy returns correct probability values; I have checked them manually.

NumPy/Python version information:

import sys, numpy; print(numpy.__version__, sys.version)
# 1.22.4 3.8.8 (default, Apr 13 2021, 19:58:26) [GCC 7.3.0]

nc10 added the 00 - Bug label Jun 15, 2022
nc10 (Author) commented Jun 15, 2022

I am including the code and its output again; there were a few typos in my first submission. Sorry!

Code:

import pandas as pd
from scipy.stats import poisson
import numpy as np

rate_1 = 7*10**-6
rate_2 = 7.5*10**-6
rate_3 = 8*10**-6
scipy_dist_1 = poisson.pmf(np.arange(9), rate_1)
scipy_dist_2 = poisson.pmf(np.arange(9), rate_2)
scipy_dist_3 = poisson.pmf(np.arange(9), rate_3)
comparison = pd.DataFrame(
    {"rate=7*10**-6": scipy_dist_1,
     "rate=7.5*10**-6": scipy_dist_2,
     "rate=8*10**-6": scipy_dist_3, }
)
print(f"First 8 probabilities for three different rates:\n{comparison}")

comparison_tail_np = pd.DataFrame(
    {"rate=7*10**-6": (1-np.sum(scipy_dist_1)),
     "rate=7.5*10**-6": (1-np.sum(scipy_dist_2)),
     "rate=8*10**-6": (1-np.sum(scipy_dist_3))},
    index=[1]
)

print(f"Tail probabilities using 'np.sum':\n{comparison_tail_np}")

comparison_tail = pd.DataFrame(
    {"rate=7*10**-6": (1-sum(scipy_dist_1)),
     "rate=7.5*10**-6": (1-sum(scipy_dist_2)),
     "rate=8*10**-6": (1-sum(scipy_dist_3))},
    index=[1]
)

print(f"Tail probabilities using 'sum':\n{comparison_tail}")

Output:

First 9 probabilities for three different rates:
   rate=7*10**-6  rate=7.5*10**-6  rate=8*10**-6
0   9.999930e-01     9.999925e-01   9.999920e-01
1   6.999951e-06     7.499944e-06   7.999936e-06
2   2.449983e-11     2.812479e-11   3.199974e-11
3   5.716627e-17     7.031197e-17   8.533265e-17
4   1.000410e-22     1.318349e-22   1.706653e-22
5   1.400574e-28     1.977524e-28   2.730645e-28
6   1.634002e-34     2.471905e-34   3.640860e-34
7   1.634002e-40     2.648470e-40   4.160983e-40
8   1.429752e-46     2.482941e-46   4.160983e-46
Tail probabilities using 'np.sum':
   rate=7*10**-6  rate=7.5*10**-6  rate=8*10**-6
1            0.0     1.110223e-16            0.0
Tail probabilities using 'sum':
   rate=7*10**-6  rate=7.5*10**-6  rate=8*10**-6
1            0.0              0.0            0.0

seberg added the 33 - Question label and removed the 00 - Bug label Jun 15, 2022
nc10 (Author) commented Jun 16, 2022

@seberg Sebastian,

Thanks for reviewing this issue. When you get a few minutes, I would appreciate it if you could clarify why you don't consider this a bug. The computation is clearly returning a mathematically wrong answer. To clarify: given the stochastic dominance relationship between the three Poisson distributions, the tail probabilities should be monotonically increasing (higher rates should yield higher tail probabilities). But it is not true that 1.11e-16 < 0! Also, what is curious is that the answer is correct for 7.0*10**-6 and 8.0*10**-6 but incorrect for 7.5*10**-6, and only for 7.5. For example, if I change 7.5 to 7.3 or 7.8, I get 0.0, which is the correct answer (the true tail probability is of the order of 10**-52, which is essentially zero).

From my perspective this is a high-priority bug. If you cannot count on NumPy summing numbers correctly, then you cannot count on anything it computes.

Thanks for your attention.

seberg (Member) commented Jun 16, 2022

@nc10 the typical precision computers use is "double precision", i.e. Python's float. You can use Python's sum and will also get 1 (it can be more precise, though). Double precision numbers have about 15-16 decimal digits, and your numbers are clearly much smaller compared to 1.

You can try things like mpmath (arbitrary precision) or sympy (symbolic math; it uses mpmath, I think). But if you need precision in those ranges, you simply cannot work with typical computer math.

Now, there may be tricks available: np.logaddexp is a function that can help when working in log space. I doubt such tricks apply here, but I am not sure.
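
As an illustrative sketch of the log-space idea (assuming SciPy's poisson.logpmf and scipy.special.logsumexp are available; it avoids underflow in the individual terms, though not the 1 - sum cancellation itself):

import numpy as np
from scipy.special import logsumexp  # stable log(sum(exp(...))), same idea as np.logaddexp
from scipy.stats import poisson

rate = 7.5e-6
log_terms = poisson.logpmf(np.arange(9), rate)  # logs of the pmf values, no underflow
log_total = logsumexp(log_terms)                # log of the sum, accumulated in log space
print(np.exp(log_total))                        # back on the linear scale, ~1.0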

So no, it is clearly not a NumPy issue, but rather too-high expectations of typical computer math. For more details, see for example:
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

seberg closed this as completed Jun 16, 2022
charris (Member) commented Jun 16, 2022

See also #8786. There are several other discussions of the problem.

nc10 (Author) commented Jun 16, 2022

I understand your point of view, but this is not about precision; it is about consistency. NumPy is returning a nonzero value while returning zero for a theoretically larger quantity. I actually don't need this level of precision. If a larger quantity falls outside of float64 resolution and is returned as zero, a smaller quantity should also fall outside of that resolution and be returned as zero.

As an additional piece of information, the expression evaluates to the correct value (zero) if I compile it with Numba (i.e., it evaluates consistently). This suggests that the problem is not about how the numerical system represents small numbers but about how NumPy carries out the calculation.

Does this make sense?

charris (Member) commented Jun 16, 2022

Out of curiosity, are you running on a 32-bit platform?

rkern (Member) commented Jun 16, 2022

FWIW, if you want actually accurate tail values, you should use the sf() method rather than any summing and subtracting. 0.0 is not the "correct" value. You are bound to get errors on the order of the ~1e-16 (np.finfo(float).eps / 2) that you observe (1 ULP difference between 1.0 and the number just before it), depending on the precise order in which the summing is done. Subtracting two close numbers is always dangerous in floating-point arithmetic.
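
An illustrative sketch of the sf() approach (the exact printed values will depend on the SciPy version):

import numpy as np
from scipy.stats import poisson

# Survival function: P(X > 8) computed directly, with no summing and subtracting.
for rate in (7e-6, 7.5e-6, 8e-6):
    print(rate, poisson.sf(8, rate))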

np.sum() uses a pairwise summation scheme that is usually more accurate than just summing the values in the order given. But in extreme cases like yours, sometimes it isn't. There are other summation schemes that would also perform differently, with a different set of tradeoffs in accuracy and speed. All of them are valid "sum" functions, and this amount of error is well within the bounds of what any "sum" function over floating-point inputs is supposed to provide.
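
A quick sketch showing that several valid summation schemes can round differently; the three results may differ by about one ULP of 1.0:

import math
import numpy as np
from scipy.stats import poisson

vals = poisson.pmf(np.arange(9), 7.5e-6)
print(1 - np.sum(vals))     # pairwise summation
print(1 - sum(vals))        # naive left-to-right summation
print(1 - math.fsum(vals))  # exactly rounded summation (Shewchuk's algorithm)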

nc10 (Author) commented Jun 16, 2022

A bit of theory: the problem is theoretically related to infinite sums and absolutely summable sequences. In numerical computation, the order in which the terms of a sequence are added up matters, even though from the mathematical perspective it does not (as you pointed out, this is because of finite precision). In short, to get an accurate result when a sequence is absolutely summable, one should first sort the values in increasing order of absolute value and then add them up from small to large.
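
An illustrative sketch of this strategy (sorted_sum is a hypothetical helper, not a NumPy function):

import numpy as np

def sorted_sum(x):
    # Sum the smallest-magnitude terms first; this often reduces rounding
    # error for sequences spanning many orders of magnitude, though it does
    # not guarantee a correctly rounded result.
    x = np.asarray(x)
    return np.sum(x[np.argsort(np.abs(x))])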

nc10 (Author) commented Jun 16, 2022

@charris No, I am using a 64-bit platform (Ubuntu 18.04).

charris (Member) commented Jun 16, 2022

I ask because the FPU may differ between 32-bit and 64-bit platforms, even on the same hardware. Results also vary depending on which hardware features the compiler uses and how it organizes the computation.

nc10 (Author) commented Jun 16, 2022

@rkern
Thanks for suggesting the sf() method. For the case of the Poisson distribution, we can bound the tail very accurately, since the Poisson pmf values are simply terms of the Taylor expansion of exp(-lambda)*exp(lambda) [keeping exp(-lambda) as is and replacing exp(lambda) by its Taylor expansion].
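
An illustrative sketch of such a bound (using the standard inequality exp(-lam) * sum_{k>=m} lam**k/k! <= lam**m/m!):

from math import factorial

def poisson_tail_bound(lam, m):
    # Upper bound on P(X >= m) for Poisson(lam):
    # exp(-lam) * sum_{k>=m} lam**k / k!  <=  lam**m / m!
    return lam**m / factorial(m)

print(poisson_tail_bound(7.5e-6, 9))  # ~2e-52, far below float64 eps relative to 1.0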

I don't need accuracy at the level of 1e-16 for my numerical computation (1e-6 is far more than enough). What causes problems (at least for me) is inconsistent evaluation: if 1e-18 is represented as zero, then 1e-19 should also be represented as zero.

Also, don't forget that this is happening only for 7.5e-6. Both 7.4e-6 and 7.6e-6 evaluate to zero!

nc10 (Author) commented Jun 16, 2022

@charris
I can provide any additional information you may need. I can also run any tests you may find useful.

rkern (Member) commented Jun 16, 2022

No one is "representing 1e-18 as zero". That's not what's going on here.

Forget about the properties of Poisson distributions like stochastic dominance (if you need something closer to those ideal properties, you must use the much more accurate sf()). You aren't summing a Poisson distribution: you are summing the finite numerical representations that came out of poisson.pmf(). There is already error in the scipy_dist_* arrays. If you do the sums in higher-precision arithmetic like mpmath, you will see that scipy_dist_1 and scipy_dist_3 sum to values greater than 1.0 by about 3.2e-17, while scipy_dist_2 sums to a value less than 1.0 by about 3.2e-17. When the sums round, the two that were over the true value rounded down to 1.0, while the one that was under rounded down to 1.0 - eps/2 under the pairwise summation of np.sum() (which is not the only valid way to do a sum, but is a valid way to do a sum). So that's the source of the inconsistency: the error that was already present in the input arrays.

>>> scipy_dist_1 = stats.poisson.pmf(np.arange(9), 7.*10**-6)
>>> scipy_dist_2 = stats.poisson.pmf(np.arange(9), 7.5*10**-6)
>>> scipy_dist_3 = stats.poisson.pmf(np.arange(9), 8*10**-6)
>>> import mpmath
>>> mpmath.mp.dps = 100
>>> mx1 = [mpmath.mpf(x) for x in scipy_dist_1]
>>> mx2 = [mpmath.mpf(x) for x in scipy_dist_2]
>>> mx3 = [mpmath.mpf(x) for x in scipy_dist_3]
>>> sum(mx1), sum(mx2), sum(mx3)
(mpf('1.000000000000000035543158992444986125307281598412537179984155602848568771752708462773803217685800334755'),
 mpf('0.9999999999999999679825336353806521389437440611735724656187933667814377354645920848919252321491996717316'),
 mpf('1.000000000000000035582096603948007111483074791071511523511410452349876933505127464929931662285796620975'))

Now, it does happen to be the case that 1.0 would be a better rounding of that more-accurate sum than 1.0 - eps/2, and other summation schemes do get that in this case, but it is a tradeoff with other cases where those schemes would be less accurate. Still, we're within 1 ULP of that answer, which is about all you can ask of a floating-point routine.

nc10 (Author) commented Jun 16, 2022

@rkern

Thanks for the explanation. I just made up this example to demonstrate the problem I am facing in my actual numerical computation. But your detailed description helped me to better understand the issue. I guess with your clarification, I should be able to devise a workaround.

Thanks again!

rkern (Member) commented Jun 16, 2022

Changing the order can get you different results, sometimes preferable for certain purposes (np.sum(scipy_dist_2[::-1]), for example). It would be nice to have an option to turn off the partial pairwise summation to get the most control, but we haven't had a huge clamoring for it.
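
An illustrative sketch of the order dependence (which of the two results lands exactly on 0.0 can vary with NumPy version and hardware):

import numpy as np
from scipy.stats import poisson

vals = poisson.pmf(np.arange(9), 7.5e-6)
print(1 - np.sum(vals))        # forward pairwise summation
print(1 - np.sum(vals[::-1]))  # the same values summed in reverse order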

nc10 (Author) commented Jun 16, 2022

Yes, you are correct. In my experience, the most reliable approach is to sort the sequence in increasing order of absolute value and then sum from small to large. This of course makes it an O(n log n) algorithm.

This entire issue came up because I was testing my computation by checking its continuity. I have closed-form expressions for the ratio of two tail probabilities at zero rate and nonzero rate, and I wanted to verify my computation numerically by checking whether the limit (as the rate goes to zero) matches my direct computation.

In any case, your comment was very helpful. Thanks again!
