-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
BUG?: np.fromfile on Py3k #4118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
To debug, run it in gdb with breakpoint on array_fromfile, and step through the execution. Numpy uses a dup'd raw file pointer, and calls seek() on the Python level afterward, but it might be this runs afoul of something in cpython-side buffering. Difficult to see why rb and rb+ would be different if this is the cause, however. |
I can reproduce this on Windows, except that the output is |
Ouch. We use |
numpy dups the file descriptor then seeks it, this also seeks the original file descriptor without notifying the file object about this, it then screws up its accounting. python2 does not seem to need any extra accounting so it works. |
@juliantaylor is this fixable at the |
To me this seems like a dangerous bug, it caused a non-zero amount of pain for me to figure out that |
Python 3 has no C-level API for reading files. The only proper fix is to rewrite all I/o in numpy to not use file* pointers on py3. This however will be a significant performance hit, since each fgetc etc becomes PyObject_CallMethod, so it's probably no go without numpy-side buffering. There are a few hacks that can be tried, eg. seeking the fd back to the original position, before calling the python level seek to the new position. Whether this is enough depends on the cpython side implementation. Apparently, the cpython side strategy is different for rb and rb+ modes, so our tests did not catch this. |
That seems to work here. Another workaround is to first seek to the end of the file on the Python level. |
This issue disappears when using Raw I/O, |
It's probably safer to store the original C-level This scheme should be quite safe, because if no Python code is run in between, then I don't see how it could fail. |
It's safer to use |
@cgohlke I can confirm on my test systems that |
there is only one instance of using fgetc for reading textfiles in numpy, so it shouldn't really matter. |
Not using FILE pointers could/would take care of the long standing issue #2230 (numpy.fromfile does not accept StringIO object) (?) |
IIRC, it's dup'd because fclose on the fdopened handle closes the file. |
What is the status of this? I would be good to have a fix ;) |
you just need to seek the already existing (dup'd) C file descriptor back to what it was before the function call. |
|
oh the header is installed so they are public, a internal copy it is then. |
something along the lines of this: EDIT: doesn't work, also needs to remove the seeking from npy_PyFile_Dup --- a/numpy/core/src/multiarray/multiarraymodule.c
+++ b/numpy/core/src/multiarray/multiarraymodule.c
@@ -1991,6 +1991,14 @@ array_fromfile(PyObject *NPY_UNUSED(ignored), PyObject *args, PyObject *keywds)
Py_INCREF(file);
own = 0;
}
+
+ PyObject * ret3 = PyObject_CallMethod(file, "tell", "");
+ if (ret3 == NULL) {
+ goto fail;
+ }
+ off_t cur = PyNumber_AsSsize_t(ret3, PyExc_OverflowError);
+
+ Py_DECREF(ret3);
fp = npy_PyFile_Dup(file, "rb");
if (fp == NULL) {
PyErr_SetString(PyExc_IOError,
@@ -2001,11 +2009,17 @@ array_fromfile(PyObject *NPY_UNUSED(ignored), PyObject *args, PyObject *keywds)
if (type == NULL) {
type = PyArray_DescrFromType(NPY_DEFAULT_TYPE);
}
+
ret = PyArray_FromFile(fp, type, (npy_intp) nin, sep);
+ off_t new = ftell(fp);
+ // restore to internal pyfile bookkeeping
+ fseek(fp, cur, SEEK_SET);
- if (npy_PyFile_DupClose(file, fp) < 0) {
+ PyObject * ret2 = PyObject_CallMethod(file, "seek", NPY_SSIZE_T_PYFMT "i", new, 0);
+ if (ret == NULL) {
+ fclose(fp);
goto fail;
- }
+ }
+ Py_DECREF(ret2);
+ fclose(fp);
+
if (own && npy_PyFile_CloseFile(file) < 0) {
goto fail;
} |
I would deprecate and remove them instead of ugly hacks to fix them. |
The comment on top of the header states that no backward compatibility is given, and all users should instead copy the header file. So no need to deprecate before changes. Scipy doesn't actually use a copy (probably should), but IIRC the pyfile functions are not used there. |
Regarding switching away from FILE* interface altogether: the main block is in implementing |
The Python3 file handles keep records of current file positions, so that the raw handle needs to be restored to the original position in order to not confuse the mechanism. Fixes numpygh-4118
I've translated some code we have from Python2 to Python3, and I think I've encountered a strange bug with
np.fromfile
. If I run the following code snippet:I get this output in Python2:
But this output in Python3:
This is not good, because subsequent reads from the file will not occur at correct positions. In other words, I have to add the
+
to the file open parameter in order to get multiple calls tofromfile
to read with correct positions. I've uploaded the test file here in case someone wants to try to replicate:http://faculty.washington.edu/larsoner/test.w
I'm running Ubuntu Linux 13.10 64-bit. I had this problem on the system version, as well as on a version I just compiled myself (with MKL support).
The text was updated successfully, but these errors were encountered: