BUG?: np.fromfile on Py3k #4118

larsoner · 2013-12-11T18:49:42Z

I've translated some code we have from Python2 to Python3, and I think I've encountered a strange bug with np.fromfile. If I run the following code snippet:

import numpy as np

for mode in ['rb', 'rb+']:
    with open('test.w', mode) as fid:
        fid.read(2)
        pos1 = fid.tell()
        data = np.fromfile(fid, count=1)
        pos2 = fid.tell()
        print(pos1, pos2)

I get this output in Python2:

(2, 10)
(2, 10)

But this output in Python3:

2 12
2 10

This is not good, because subsequent reads from the file will not occur at correct positions. In other words, I have to add the + to the file open parameter in order to get multiple calls to fromfile to read with correct positions. I've uploaded the test file here in case someone wants to try to replicate:

http://faculty.washington.edu/larsoner/test.w

I'm running Ubuntu Linux 13.10 64-bit. I had this problem on the system version, as well as on a version I just compiled myself (with MKL support).

The text was updated successfully, but these errors were encountered:

pv · 2013-12-11T20:32:34Z

To debug, run it in gdb with breakpoint on array_fromfile, and step through the execution.

Numpy uses a dup'd raw file pointer, and calls seek() on the Python level afterward, but it might be this runs afoul of something in cpython-side buffering. Difficult to see why rb and rb+ would be different if this is the cause, however.

cgohlke · 2013-12-11T22:35:54Z

cgohlke commented

Dec 11, 2013

I can reproduce this on Windows, except that the output is 2 -4084 2 10. I occasionally ran into a similar issue in the tifffile.py module on Python 3 and had to add a fd.seek(0, 2) after calls to np.fromfile to get back to a defined state. I was not able to isolate this so far...

larsoner · 2013-12-11T22:51:10Z

Ouch. We use np.fromfile extensively (on hundreds of lines) to do I/O, so forcing the correct seek afterward would be impractical for our use case. I'll see if I can figure it out using gdb when I get a chance, hopefully I will have some time soon. Interesting that it's a cross-platform error, too.

juliantaylor · 2013-12-11T23:15:21Z

numpy dups the file descriptor then seeks it, this also seeks the original file descriptor without notifying the file object about this, it then screws up its accounting. python2 does not seem to need any extra accounting so it works.

larsoner · 2013-12-11T23:16:54Z

@juliantaylor is this fixable at the numpy end then?

larsoner · 2013-12-11T23:20:26Z

To me this seems like a dangerous bug, it caused a non-zero amount of pain for me to figure out that np.fromfile wasn't behaving the same way between versions -- it could fail silently or inconsistently in different versions. Come to think about it, about a year back I had a heck of a time with some file reading under Windows, where the code worked perfectly under Linux -- in retrospect, this bug could easily be the culprit!

pv · 2013-12-12T00:35:17Z

Python 3 has no C-level API for reading files. The only proper fix is to rewrite all I/o in numpy to not use file* pointers on py3. This however will be a significant performance hit, since each fgetc etc becomes PyObject_CallMethod, so it's probably no go without numpy-side buffering.

There are a few hacks that can be tried, eg. seeking the fd back to the original position, before calling the python level seek to the new position. Whether this is enough depends on the cpython side implementation. Apparently, the cpython side strategy is different for rb and rb+ modes, so our tests did not catch this.

cgohlke · 2013-12-12T01:36:06Z

seeking the fd back to the original position, before calling the python level seek to the new position

That seems to work here. Another workaround is to first seek to the end of the file on the Python level.

cgohlke · 2013-12-12T04:35:38Z

This issue disappears when using Raw I/O, open('test.w', 'rb', buffering=0).

pv · 2013-12-12T08:57:35Z

It's probably safer to store the original C-level tell in a variable, and to C-level seek back to it before calling the Python-level seek to go to the new position. You cannot trust Python tell() call to return the correct location...

This scheme should be quite safe, because if no Python code is run in between, then I don't see how it could fail.

pv · 2013-12-12T10:31:44Z

It's safer to use ftell on the C-level file handle and stash the original position. The less the Python3 I/O api is messed with on the C level, the better...

larsoner · 2013-12-12T17:35:20Z

@cgohlke I can confirm on my test systems that buffering=0 is a workaround. Thanks for finding that, and for looking into fixing the underlying issue.

juliantaylor · 2013-12-12T20:39:01Z

This however will be a significant performance hit, since each fgetc etc becomes PyObject_CallMethod, so it's probably no go without numpy-side buffering.

there is only one instance of using fgetc for reading textfiles in numpy, so it shouldn't really matter.
But your suggesting of simply storing the position and seeking back sounds good to me and is also a smaller change.
Why is the file descriptor dup'ed in the first place? its not clear to me from the code.

cgohlke · 2013-12-12T21:01:02Z

Not using FILE pointers could/would take care of the long standing issue #2230 (numpy.fromfile does not accept StringIO object) (?)

pv · 2013-12-12T21:08:35Z

IIRC, it's dup'd because fclose on the fdopened handle closes the file.

charris · 2013-12-20T23:24:48Z

What is the status of this? I would be good to have a fix ;)

juliantaylor · 2013-12-27T01:09:20Z

you just need to seek the already existing (dup'd) C file descriptor back to what it was before the function call.
seeking a dup will seek the original too.

juliantaylor · 2013-12-27T01:30:39Z

npy_PyFile_Dup are not public functions, so they can be changed, and if I'm mistaken with that there is no reason array_fromfile needs to use them, they can use a internal copy of them that does accept the original position as argument.

juliantaylor · 2013-12-27T01:32:29Z

oh the header is installed so they are public, a internal copy it is then.

juliantaylor · 2013-12-27T01:44:50Z

something along the lines of this:
(I'll be able to do a PR earliest next sunday evening, please go ahead if you want to fix it earlier)

EDIT: doesn't work, also needs to remove the seeking from npy_PyFile_Dup

--- a/numpy/core/src/multiarray/multiarraymodule.c
+++ b/numpy/core/src/multiarray/multiarraymodule.c
@@ -1991,6 +1991,14 @@ array_fromfile(PyObject *NPY_UNUSED(ignored), PyObject *args, PyObject *keywds)
         Py_INCREF(file);
         own = 0;
     }
+
+    PyObject * ret3 = PyObject_CallMethod(file, "tell", "");
+    if (ret3 == NULL) {
+        goto fail;
+    }   
+    off_t cur = PyNumber_AsSsize_t(ret3, PyExc_OverflowError);
+
+    Py_DECREF(ret3);
     fp = npy_PyFile_Dup(file, "rb");
     if (fp == NULL) {
         PyErr_SetString(PyExc_IOError,
@@ -2001,11 +2009,17 @@ array_fromfile(PyObject *NPY_UNUSED(ignored), PyObject *args, PyObject *keywds)
     if (type == NULL) {
         type = PyArray_DescrFromType(NPY_DEFAULT_TYPE);
     }
+
     ret = PyArray_FromFile(fp, type, (npy_intp) nin, sep);
+    off_t new = ftell(fp);
+    // restore to internal pyfile bookkeeping
+    fseek(fp, cur, SEEK_SET);

-    if (npy_PyFile_DupClose(file, fp) < 0) {
+    PyObject  * ret2 = PyObject_CallMethod(file, "seek", NPY_SSIZE_T_PYFMT "i", new, 0); 
+    if (ret == NULL) {
+       fclose(fp);
         goto fail;
-    }
+    }   
+    Py_DECREF(ret2);
+    fclose(fp);
+
     if (own && npy_PyFile_CloseFile(file) < 0) {
         goto fail;
     }

juliantaylor · 2013-12-27T01:56:19Z

I would deprecate and remove them instead of ugly hacks to fix them.
Does anyone outside numpy use them?

pv · 2013-12-27T10:50:05Z

The comment on top of the header states that no backward compatibility is given, and all users should instead copy the header file. So no need to deprecate before changes.

Scipy doesn't actually use a copy (probably should), but IIRC the pyfile functions are not used there.

pv · 2014-01-02T17:39:02Z

Regarding switching away from FILE* interface altogether: the main block is in implementing fscanf (or streaming versions of strtoi / strtod for all of the supported data types). Since there is no ungetc in Python file API, the ability to read directly from non-seekable streams would be lost on Python 2, too. The performance impact should also be considered --- e.g. fromfile may need to be rewritten to use intermediate buffers.

The Python3 file handles keep records of current file positions, so that the raw handle needs to be restored to the original position in order to not confuse the mechanism. Fixes numpygh-4118

larsoner mentioned this issue Dec 11, 2013

WIP: Better py3k support mne-tools/mne-python#976

Closed

charris mentioned this issue Dec 22, 2013

BUG: Fix object scalar return type of 0-d array indices #4109

Merged

pv mentioned this issue Dec 27, 2013

BUG: core: ensure file handle positions are in sync #4152

Merged

charris closed this as completed in #4152 Jan 9, 2014

cjermain mentioned this issue Sep 3, 2015

numpy.fromfile OSError: first argument must be an open file (Python 3) #6246

Closed

rgommers mentioned this issue Dec 2, 2015

io tests crash on Windows scipy/scipy#5540

Closed

apontzen mentioned this issue Apr 13, 2019

Performance of fromfile on Python 3 #13319

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

BUG?: np.fromfile on Py3k #4118

BUG?: np.fromfile on Py3k #4118

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

BUG?: np.fromfile on Py3k #4118

BUG?: np.fromfile on Py3k #4118

Comments

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!