8000 BUG?: np.fromfile on Py3k · Issue #4118 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

BUG?: np.fromfile on Py3k #4118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
larsoner opened this issue Dec 11, 2013 · 23 comments · Fixed by #4152
Closed

BUG?: np.fromfile on Py3k #4118

larsoner opened this issue Dec 11, 2013 · 23 comments · Fixed by #4152

Comments

@larsoner
Copy link
Contributor

I've translated some code we have from Python2 to Python3, and I think I've encountered a strange bug with np.fromfile. If I run the following code snippet:

import numpy as np

for mode in ['rb', 'rb+']:
    with open('test.w', mode) as fid:
        fid.read(2)
        pos1 = fid.tell()
        data = np.fromfile(fid, count=1)
        pos2 = fid.tell()
        print(pos1, pos2)

I get this output in Python2:

(2, 10)
(2, 10)

But this output in Python3:

2 12
2 10

This is not good, because subsequent reads from the file will not occur at correct positions. In other words, I have to add the + to the file open parameter in order to get multiple calls to fromfile to read with correct positions. I've uploaded the test file here in case someone wants to try to replicate:

http://faculty.washington.edu/larsoner/test.w

I'm running Ubuntu Linux 13.10 64-bit. I had this problem on the system version, as well as on a version I just compiled myself (with MKL support).

@pv
Copy link
Member
pv commented Dec 11, 2013

To debug, run it in gdb with breakpoint on array_fromfile, and step through the execution.

Numpy uses a dup'd raw file pointer, and calls seek() on the Python level afterward, but it might be this runs afoul of something in cpython-side buffering. Difficult to see why rb and rb+ would be different if this is the cause, however.

@cgohlke
Copy link < 8000 div aria-live="polite" aria-atomic="true" class="sr-only" data-clipboard-copy-feedback>
Contributor
cgohlke commented Dec 11, 2013

I can reproduce this on Windows, except that the output is 2 -4084 2 10. I occasionally ran into a similar issue in the tifffile.py module on Python 3 and had to add a fd.seek(0, 2) after calls to np.fromfile to get back to a defined state. I was not able to isolate this so far...

@larsoner
Copy link
Contributor Author

Ouch. We use np.fromfile extensively (on hundreds of lines) to do I/O, so forcing the correct seek afterward would be impractical for our use case. I'll see if I can figure it out using gdb when I get a chance, hopefully I will have some time soon. Interesting that it's a cross-platform error, too.

@juliantaylor
Copy link
Contributor

numpy dups the file descriptor then seeks it, this also seeks the original file descriptor without notifying the file object about this, it then screws up its accounting. python2 does not seem to need any extra accounting so it works.

@larsoner
Copy link
Contributor Author

@juliantaylor is this fixable at the numpy end then?

@larsoner
Copy link
Contributor Author

To me this seems like a dangerous bug, it caused a non-zero amount of pain for me to figure out that np.fromfile wasn't behaving the same way between versions -- it could fail silently or inconsistently in different versions. Come to think about it, about a year back I had a heck of a time with some file reading under Windows, where the code worked perfectly under Linux -- in retrospect, this bug could easily be the culprit!

@pv
Copy link
Member
pv commented Dec 12, 2013

Python 3 has no C-level API for reading files. The only proper fix is to rewrite all I/o in numpy to not use file* pointers on py3. This however will be a significant performance hit, since each fgetc etc becomes PyObject_CallMethod, so it's probably no go without numpy-side buffering.

There are a few hacks that can be tried, eg. seeking the fd back to the original position, before calling the python level seek to the new position. Whether this is enough depends on the cpython side implementation. Apparently, the cpython side strategy is different for rb and rb+ modes, so our tests did not catch this.

@cgohlke
Copy link
Contributor
cgohlke commented Dec 12, 2013

seeking the fd back to the original position, before calling the python level seek to the new position

That seems to work here. Another workaround is to first seek to the end of the file on the Python level.

@cgohlke
Copy link
Contributor
cgohlke commented Dec 12, 2013

This issue disappears when using Raw I/O, open('test.w', 'rb', buffering=0).

@pv
Copy link
Member
pv commented Dec 12, 2013

It's probably safer to store the original C-level tell in a variable, and to C-level seek back to it before calling the Python-level seek to go to the new position. You cannot trust Python tell() call to return the correct location...

This scheme should be quite safe, because if no Python code is run in between, then I don't see how it could fail.

@pv
Copy link
Member
pv commented Dec 12, 2013

It's safer to use ftell on the C-level file handle and stash the original position. The less the Python3 I/O api is messed with on the C level, the better...

@larsoner
Copy link
Contributor Author

@cgohlke I can confirm on my test systems that buffering=0 is a workaround. Thanks for finding that, and for looking into fixing the underlying issue.

@juliantaylor
Copy link
Contributor
This however will be a significant performance hit, since each fgetc etc becomes PyObject_CallMethod, so it's probably no go without numpy-side buffering.

there is only one instance of using fgetc for reading textfiles in numpy, so it shouldn't really matter.
But your suggesting of simply storing the position and seeking back sounds good to me and is also a smaller change.
Why is the file descriptor dup'ed in the first place? its not clear to me from the code.

@cgohlke
Copy link
Contributor
cgohlke commented Dec 12, 2013

Not using FILE pointers could/would take care of the long standing issue #2230 (numpy.fromfile does not accept StringIO object) (?)

@pv
Copy link
Member
pv commented Dec 12, 2013

IIRC, it's dup'd because fclose on the fdopened handle closes the file.

@charris
Copy link
Member
charris commented Dec 20, 2013

What is the status of this? I would be good to have a fix ;)

@juliantaylor
Copy link
Contributor

you just need to seek the already existing (dup'd) C file descriptor back to what it was before the function call.
seeking a dup will seek the original too.

@juliantaylor
Copy link
Contributor

npy_PyFile_Dup are not public functions, so they can be changed, and if I'm mistaken with that there is no reason array_fromfile needs to use them, they can use a internal copy of them that does accept the original position as argument.

@juliantaylor
Copy link
Contributor

oh the header is installed so they are public, a internal copy it is then.

@juliantaylor
Copy link
Contributor

something along the lines of this:
(I'll be able to do a PR earliest next sunday evening, please go ahead if you want to fix it earlier)

EDIT: doesn't work, also needs to remove the seeking from npy_PyFile_Dup

--- a/numpy/core/src/multiarray/multiarraymodule.c
+++ b/numpy/core/src/multiarray/multiarraymodule.c
@@ -1991,6 +1991,14 @@ array_fromfile(PyObject *NPY_UNUSED(ignored), PyObject *args, PyObject *keywds)
         Py_INCREF(file);
         own = 0;
     }
+
+    PyObject * ret3 = PyObject_CallMethod(file, "tell", "");
+    if (ret3 == NULL) {
+        goto fail;
+    }   
+    off_t cur = PyNumber_AsSsize_t(ret3, PyExc_OverflowError);
+
+    Py_DECREF(ret3);
     fp = npy_PyFile_Dup(file, "rb");
     if (fp == NULL) {
         PyErr_SetString(PyExc_IOError,
@@ -2001,11 +2009,17 @@ array_fromfile(PyObject *NPY_UNUSED(ignored), PyObject *args, PyObject *keywds)
     if (type == NULL) {
         type = PyArray_DescrFromType(NPY_DEFAULT_TYPE);
     }
+
     ret = PyArray_FromFile(fp, type, (npy_intp) nin, sep);
+    off_t new = ftell(fp);
+    // restore to internal pyfile bookkeeping
+    fseek(fp, cur, SEEK_SET);

-    if (npy_PyFile_DupClose(file, fp) < 0) {
+    PyObject  * ret2 = PyObject_CallMethod(file, "seek", NPY_SSIZE_T_PYFMT "i", new, 0); 
+    if (ret == NULL) {
+       fclose(fp);
         goto fail;
-    }
+    }   
+    Py_DECREF(ret2);
+    fclose(fp);
+
     if (own && npy_PyFile_CloseFile(file) < 0) {
         goto fail;
     }

@juliantaylor
Copy link
Contributor

I would deprecate and remove them instead of ugly hacks to fix them.
Does anyone outside numpy use them?

@pv
Copy link
Member
pv commented Dec 27, 2013

The comment on top of the header states that no backward compatibility is given, and all users should instead copy the header file. So no need to deprecate before changes.

Scipy doesn't actually use a copy (probably should), but IIRC the pyfile functions are not used there.

@pv
Copy link
Member
pv commented Jan 2, 2014

Regarding switching away from FILE* interface altogether: the main block is in implementing fscanf (or streaming versions of strtoi / strtod for all of the supported data types). Since there is no ungetc in Python file API, the ability to read directly from non-seekable streams would be lost on Python 2, too. The performance impact should also be considered --- e.g. fromfile may need to be rewritten to use intermediate buffers.

pv added a commit to pv/numpy that referenced this issue Jan 9, 2014
The Python3 file handles keep records of current file positions, so that
the raw handle needs to be restored to the original position in order
to not confuse the mechanism.

Fixes numpygh-4118
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants
0