The array provided as a gzipped pickle file at http://dl.dropbox.com/u/30592272/data.pkl.gz reproduces the following issue (64-bit Linux, Python 2.7.3, NumPy 1.6.1). The array has 15606478 elements.
In [1]: import numpy as np
In [2]: a = np.load('data.pkl')
In [3]: np.nanmin(a)
Out[3]: 4.715836
In [4]: np.nanmax(a)
Out[4]: 4.7189121
In [5]: idx = np.where(np.isfinite(a))
In [6]: a[idx].mean()
Out[6]: 4.1792714738680736
In [7]: from scipy.stats import nanmean
In [8]: nanmean(a)
Out[8]: 4.1792714738680727
The mean value obtained is clearly wrong: it lies below the minimum returned by np.nanmin.
On a 32-bit system the result is the following:
In [1]: import numpy as np
In [2]: a = np.load('data.pkl')
In [3]: idx = np.where(np.isfinite(a))
In [4]: a[idx].mean()
Out[4]: 4.7184738182116019
and this time the mean value is correct.
An additional comment that reduces the priority of the reported issue:
the array provided contains elements of type float32
as the numpy.mean documentation states, the mean is computed in the same precision as the input, which explains the incorrect mean value on the 64-bit system
converting the array elements to float64 yields the correct mean value
on the 32-bit system, the Python version used is "Python 2.7.2 -- EPD 7.1-2 (32-bit)" (Enthought Python Distribution)
The difference between the 64-bit and 32-bit results is still not explained.
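The float32 accumulation problem described above can be reproduced without the original dataset. The sketch below uses a synthetic stand-in array (same element count, constant value near the reported mean — an assumption, not the real data) and uses np.cumsum to emulate the old naive sequential float32 sum, since reductions in current NumPy use pairwise summation and would mask the effect:

```python
import numpy as np

# Synthetic stand-in for the reported dataset (NOT the original data):
# same element count, filled with a constant near the reported mean.
n = 15606478
val = 4.7184738
a = np.full(n, val, dtype=np.float32)

# np.cumsum accumulates sequentially in the array dtype, so its last
# element reproduces a naive float32 running sum: once the partial sum
# is large (~7e7, where one float32 ulp is 8), each ~4.7 addend is
# rounded away almost entirely.
naive_mean = np.cumsum(a)[-1] / np.float32(n)

# Accumulating in float64 fixes the problem without copying the array.
accurate_mean = a.mean(dtype=np.float64)

print(naive_mean, accurate_mean)
```

The float64 accumulator recovers the true mean, while the sequential float32 sum is off by a large margin, matching the behavior reported in the ticket.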
I see that there was at least one other report of this kind (#1063). That bug was closed as "wontfix", because the developers did not want to force an implicit conversion of floating-point types from 32 to 64 bit in the NumPy code.
However, it seems clear to me that this bug arises from the algorithm used to calculate the average: with a large number of samples, the accumulated sum grows so large that the least significant digits of each new addend are thrown away, incurring a significant numerical error in the result.
Instead of accumulating the sum and then dividing by the number of elements, one might use the recurrence relation used e.g. by GSL in the implementation of gsl_stats_mean (http://bzr.savannah.gnu.org/lh/gsl/trunk/annotate/head:/statistics/mean_source.c). The value of the variable "mean" always stays at the same order of magnitude as the result, so catastrophic loss of precision is much less likely to occur. (I asked knopfra to apply this formula to his dataset, and he reported to me that even 32-bit arithmetic yields the correct answer, 4.718489.)
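The recurrence above can be sketched as follows. This is a minimal illustration, not GSL's actual code: `running_mean_f32` is a hypothetical helper, and the float32 accumulator is deliberate, to show that the accumulator never leaves the magnitude of the data:

```python
import numpy as np

def running_mean_f32(data):
    """Running-mean recurrence (as in GSL's gsl_stats_mean), here with a
    float32 accumulator to show it stays at the data's magnitude."""
    mean = np.float32(0.0)
    for k, x in enumerate(data, start=1):
        # mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k
        mean = np.float32(mean + (np.float32(x) - mean) / np.float32(k))
    return mean

# Alternating values with true mean ~4.715; even a float32 accumulator
# stays close, because it never grows beyond the data range.
data = np.tile(np.array([4.71, 4.72], dtype=np.float32), 5000)
m = running_mean_f32(data)
print(m)
```

Unlike the naive sum, the intermediate value here is always near 4.7, so adding each update never discards the significant digits of the data.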
This looks to be fixed, in both numpy and scipy.stats.nanmean. There is a slight difference in the float32 result, but that is accounted for by roundoff error in float32.
In [3]: nanmean(a)
Out[3]: 4.7184114
In [4]: nanmean(a, dtype=float64)
Out[4]: 4.7184738693419233
In [5]: from scipy.stats import nanmean as spy_nanmean
In [6]: spy_nanmean(a)
Out[6]: 4.7184113922989948
Note that scipy nanmean returns the wrong scalar type.
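The dtype behavior underlying that last remark can be checked directly. In current NumPy (where np.nanmean exists as a built-in, and scipy.stats.nanmean has since been deprecated in its favor), the result preserves the input's floating dtype unless an explicit dtype is requested:

```python
import numpy as np

# np.nanmean preserves the input's floating dtype unless dtype= is given.
a = np.array([1.0, np.nan, 3.0], dtype=np.float32)

m32 = np.nanmean(a)                    # float32 scalar
m64 = np.nanmean(a, dtype=np.float64)  # float64 scalar

print(m32.dtype, m64.dtype)
```

For large float32 arrays, passing dtype=np.float64 remains the simple way to get a well-conditioned mean without converting the whole array.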
Original ticket http://projects.scipy.org/numpy/ticket/2170 on 2012-06-19 by trac user knopfra, assigned to unknown.