8000 Wrong result when calculating the mean masking nan values on a 64 bit system (Trac #2170) · Issue #624 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

Wrong result when calculating the mean masking nan values on a 64 bit system (Trac #2170) #624

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
numpy-gitbot opened this issue Oct 19, 2012 · 3 comments
Labels
00 - Bug component: Other Priority: high High priority, also add milestones for urgent issues

Comments

@numpy-gitbot
Copy link

Original ticket http://projects.scipy.org/numpy/ticket/2170 on 2012-06-19 by trac user knopfra, assigned to unknown.

The array provided as a pickle file at the following link [http://dl.dropbox.com/u/30592272/data.pkl.gz] can be used to detect the following issue (64 bit Linux system, python 2.7.3, numpy 1.6.1). The array has 15606478 elements.

In [1]: import numpy as np

In [2]: a = np.load('data.pkl')

In [3]: np.nanmin(a)[[BR]]
Out[3]: 4.715836

In [4]: np.nanmax(a)[[BR]]
Out[4]: 4.7189121

In [5]: idx = np.where(np.isfinite(a))

In [6]: a[idx].mean()[[BR]]
Out[6]: 4.1792714738680736

In [7]: from scipy.stats import nanmean

In [8]: nanmean(a)[[BR]]
Out[8]: 4.1792714738680727

The mean value obtained is clearly wrong.
On a 32 bit system the result is the following:

In [1]: import numpy as np

In [2]: a = np.load('data.pkl')

In [3]: idx = np.where(np.isfinite(a))

In [4]: a[idx].mean()[[BR]]
Out[4]: 4.7184738182116019

and this time the mean value is correct.

@numpy-gitbot
Copy link
Author

trac user knopfra wrote on 2012-06-20

An additional comment that reduces the priority of the issue reported:

  • the array provided contains elements of type float32
  • as reported by the numpy.mean documentation, the mean is computed using the same precision the input has and this explains the incorrect mean value in the 64 bit system
  • converting the array elements to type float64, the mean value obtained is correct
  • in the 32 bit system, the python version used is " Python 2.7.2 -- EPD 7.1-2 (32-bit)" (Enthought Python Distribution)

The different results obtained by the 64 bit and 32 bit versions is still not clear

@numpy-gitbot
Copy link
Author

trac user zio_tom78 wrote on 2012-06-20

I see that there was at least another report of this kind (#1063). That bug was closed as "wontfix", because the developers did not want to force an implicit conversion of floating-point types from 32 to 64 bit in the NumPy code.

However, it seems clear to me that this bug arises from the kind of algorithm used to calculate the average: having a large number of samples causes the sum to throw away the least significant digits, incurring in a significant numerical error in the result.

Instead of accumulating the sum and then dividing by the number of elements, one might use the recurrence relation used e.g. by GSL in the implementation of [http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/mean_source.c gsl_stats_mean]. The value of the variable "mean" is always of the same order of magnitude of the result, and therefore overflows are less likely to occur. (I asked knopfra to apply this formula on his dataset, and he reported me that even 32-bit arithmetic yields the correct answer, 4.718489.)

@charris
Copy link
Member
charris commented Feb 11, 2014

This looks to be fixed, both in numpy and scipy.stats.nanmean. There is a slight difference for float32 result, but that is accounted for by roundoff error in float32.

In [3]: nanmean(a)
Out[3]: 4.7184114

In [4]: nanmean(a, dtype=float64)
Out[4]: 4.7184738693419233

In [5]: from scipy.stats import nanmean as spy_nanmean

In [6]: spy_nanmean(a)
Out[6]: 4.7184113922989948

Note that scipy nanmean returns the wrong scalar type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug component: Other Priority: high High priority, also add milestones for urgent issues
Projects
None yet
Development

No branches or pull requests

2 participants
0