The array provided as a gzipped pickle file at http://dl.dropbox.com/u/30592272/data.pkl.gz reproduces the following issue (64-bit Linux, Python 2.7.3, NumPy 1.6.1). The array has 15606478 elements.
In [1]: import numpy as np
In [2]: a = np.load('data.pkl')
In [3]: np.nanmin(a)
Out[3]: 4.715836
In [4]: np.nanmax(a)
Out[4]: 4.7189121
In [5]: idx = np.where(np.isfinite(a))
In [6]: a[idx].mean()
Out[6]: 4.1792714738680736
In [7]: from scipy.stats import nanmean
In [8]: nanmean(a)
Out[8]: 4.1792714738680727
The mean value obtained is clearly wrong: it lies below the minimum returned by np.nanmin.
On a 32-bit system the result is the following:
In [1]: import numpy as np
In [2]: a = np.load('data.pkl')
In [3]: idx = np.where(np.isfinite(a))
In [4]: a[idx].mean()
Out[4]: 4.7184738182116019
and this time the mean value is correct.
An additional comment that reduces the priority of the reported issue:
the array provided contains elements of type float32
as the numpy.mean documentation states, the mean is computed in the same precision as the input, which explains the incorrect mean value on the 64-bit system
converting the array elements to float64 yields the correct mean value
on the 32-bit system, the Python version used is "Python 2.7.2 -- EPD 7.1-2 (32-bit)" (Enthought Python Distribution)
The difference between the 64-bit and 32-bit results is still not explained.
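The float32 accumulation problem described above can be reproduced without the original dataset. The sketch below uses a synthetic stand-in array (same element count, constant value near the reported mean — an assumption, not the real data) and uses np.cumsum to emulate the old naive sequential float32 sum, since reductions in current NumPy use pairwise summation and would mask the effect:

```python
import numpy as np

# Synthetic stand-in for the reported dataset (NOT the original data):
# same element count, filled with a constant near the reported mean.
n = 15606478
val = 4.7184738
a = np.full(n, val, dtype=np.float32)

# np.cumsum accumulates sequentially in the array dtype, so its last
# element reproduces a naive float32 running sum: once the partial sum
# is large (~7e7, where one float32 ulp is 8), each ~4.7 addend is
# rounded away almost entirely.
naive_mean = np.cumsum(a)[-1] / np.float32(n)

# Accumulating in float64 fixes the problem without copying the array.
accurate_mean = a.mean(dtype=np.float64)

print(naive_mean, accurate_mean)
```

The float64 accumulator recovers the true mean, while the sequential float32 sum is off by a large margin, matching the behavior reported in the ticket.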
I see that there was at least one other report of this kind (#1063). That bug was closed as "wontfix", because the developers did not want to force an implicit conversion of floating-point types from 32 to 64 bit in the NumPy code.
However, it seems clear to me that this bug arises from the algorithm used to calculate the average: with a large number of samples, the accumulated sum grows so large that the least significant digits of each new addend are thrown away, incurring a significant numerical error in the result.
Instead of accumulating the sum and then dividing by the number of elements, one might use the recurrence relation used e.g. by GSL in the implementation of gsl_stats_mean (http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/mean_source.c). The value of the variable "mean" always stays at the same order of magnitude as the result, so catastrophic loss of precision is much less likely to occur. (I asked knopfra to apply this formula to his dataset, and he reported to me that even 32-bit arithmetic yields the correct answer, 4.718489.)
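The recurrence above can be sketched as follows. This is a minimal illustration, not GSL's actual code: `running_mean_f32` is a hypothetical helper, and the float32 accumulator is deliberate, to show that the accumulator never leaves the magnitude of the data:

```python
import numpy as np

def running_mean_f32(data):
    """Running-mean recurrence (as in GSL's gsl_stats_mean), here with a
    float32 accumulator to show it stays at the data's magnitude."""
    mean = np.float32(0.0)
    for k, x in enumerate(data, start=1):
        # mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k
        mean = np.float32(mean + (np.float32(x) - mean) / np.float32(k))
    return mean

# Alternating values with true mean ~4.715; even a float32 accumulator
# stays close, because it never grows beyond the data range.
data = np.tile(np.array([4.71, 4.72], dtype=np.float32), 5000)
m = running_mean_f32(data)
print(m)
```

Unlike the naive sum, the intermediate value here is always near 4.7, so adding each update never discards the significant digits of the data.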
This looks to be fixed, in both numpy and scipy.stats.nanmean. There is a slight difference in the float32 result, but that is accounted for by roundoff error in float32.
In [3]: nanmean(a)
Out[3]: 4.7184114
In [4]: nanmean(a, dtype=float64)
Out[4]: 4.7184738693419233
In [5]: from scipy.stats import nanmean as spy_nanmean
In [6]: spy_nanmean(a)
Out[6]: 4.7184113922989948
Note that scipy nanmean returns the wrong scalar type.
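The dtype behavior underlying that last remark can be checked directly. In current NumPy (where np.nanmean exists as a built-in, and scipy.stats.nanmean has since been deprecated in its favor), the result preserves the input's floating dtype unless an explicit dtype is requested:

```python
import numpy as np

# np.nanmean preserves the input's floating dtype unless dtype= is given.
a = np.array([1.0, np.nan, 3.0], dtype=np.float32)

m32 = np.nanmean(a)                    # float32 scalar
m64 = np.nanmean(a, dtype=np.float64)  # float64 scalar

print(m32.dtype, m64.dtype)
```

For large float32 arrays, passing dtype=np.float64 remains the simple way to get a well-conditioned mean without converting the whole array.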
Original ticket http://projects.scipy.org/numpy/ticket/2170 on 2012-06-19 by trac user knopfra, assigned to unknown.