numpy.save broken in 1.14 with python 2.7.5 · Issue #10348 · numpy/numpy · GitHub

Closed
pnbat opened this issue Jan 9, 2018 · 35 comments

@pnbat (Contributor) commented Jan 9, 2018

The following minimal example crashes with numpy 1.14 and python 2.7.5:

import numpy
a = numpy.zeros(64)
numpy.save("/tmp/a.npy", a)
lib/python2.7/site-packages/numpy/lib/format.pyc in _write_array_header(fp, d, version)
    306     header.append("}")
    307     header = "".join(header)
--> 308     header = asbytes(_filter_header(header))
    309 
    310     hlen = len(header) + 1 # 1 for newline

lib/python2.7/site-packages/numpy/lib/format.pyc in _filter_header(s)
    465             tokens.append(token)
    466         last_token_was_number = (token_type == tokenize.NUMBER)
--> 467     return tokenize.untokenize(tokens)
    468 
    469

/usr/lib64/python2.7/tokenize.pyc in untokenize(iterable)
    260     """
    261     ut = Untokenizer()
--> 262     return ut.untokenize(iterable)
    263 
    264 def generate_tokens(readline):

/usr/lib64/python2.7/tokenize.pyc in untokenize(self, iterable)
    196                 break
    197             tok_type, token, start, end, line = t
--> 198             self.add_whitespace(start)
    199             self.tokens.append(token)
    200             self.prev_row, self.prev_col = end

/usr/lib64/python2.7/tokenize.pyc in add_whitespace(self, start)
    185     def add_whitespace(self, start):
    186         row, col = start
--> 187         assert row <= self.prev_row
    188         col_offset = col - self.prev_col
    189         if col_offset:

AssertionError:

The tokenization is now performed before the padding and the newline are added to the header string. This does not seem to work in python 2.7. Adding the newline in _filter_header(s) fixes the problem for python 2.7, but does not work in python 3. This could be done in the version-distinguishing part of _filter_header(s). I am not familiar with the numpy internals and the recent change for the 64-byte alignment (#9025), so I do not feel comfortable changing it myself.
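For context, _filter_header round-trips the header through Python's tokenize module. A minimal sketch of that round trip (the header value is illustrative): with a trailing newline supplied it succeeds on modern Pythons, while on Python 2.7.5 the same round trip raised the AssertionError above when the newline was missing.

```python
import io
import tokenize

# Illustrative header string, shaped like the one _write_array_header builds.
header = "{'descr': '<f8', 'fortran_order': False, 'shape': (64,), }"

# Tokenize and reassemble the header, as _filter_header does.  Python
# 2.7.5's untokenize asserted when the input lacked a trailing newline;
# appending one sidesteps the assert.
tokens = list(tokenize.generate_tokens(io.StringIO(header + "\n").readline))
roundtripped = tokenize.untokenize(tokens)
assert roundtripped == header + "\n"
```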

@eric-wieser (Member)

Can't seem to reproduce on python 2.7.11.

Can you enter the %debug magic in ipython, and show the value of header causing the issue?

@pnbat (Contributor, Author) commented Jan 9, 2018

My colleague just sent me the following email: (It seems this is a python 2.7.5 issue)

Newest RPM in CentOS 7 (os)
http://mirror.centos.org/centos/7/os/x86_64/Packages/python-2.7.5-58.el7.x86_64.rpm

Only in Software Collections there's a newer version but we cannot use that one
https://centos.pkgs.org/7/centos-sclo-rh/python27-python-2.7.13-3.el7.x86_64.rpm.html

Tokens in _filter_header are displayed below; python 2.7.5 shows no newline (\n):

  • PYTHON 2.7.5
Python 2.7.5 (default, Aug  4 2017, 00:39:18) 
Type "copyright", "credits" or "license" for more information.

IPython 5.5.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import numpy

In [2]: a = numpy.ndarray([0,0,0,0], dtype=numpy.uint8)

In [3]: numpy.save("/tmp/a.npy", a)
token (51, '{', (1, 0), (1, 1), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (3, "'descr'", (1, 1), (1, 8), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (51, ':', (1, 8), (1, 9), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (3, "'|u1'", (1, 10), (1, 15), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (51, ',', (1, 15), (1, 16), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (3, "'fortran_order'", (1, 17), (1, 32), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (51, ':', (1, 32), (1, 33), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (1, 'False', (1, 34), (1, 39), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (51, ',', (1, 39), (1, 40), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (3, "'shape'", (1, 41), (1, 48), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (51, ':', (1, 48), (1, 49), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (51, '(', (1, 50), (1, 51), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (2, '0', (1, 51), (1, 52), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (51, ',', (1, 52), (1, 53), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (2, '0', (1, 54), (1, 55), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (51, ',', (1, 55), (1, 56), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (2, '0', (1, 57), (1, 58), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (51, ',', (1, 58), (1, 59), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (2, '0', (1, 60), (1, 61), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (51, ')', (1, 61), (1, 62), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (51, ',', (1, 62), (1, 63), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (51, '}', (1, 64), (1, 65), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }")
token (0, '', (2, 0), (2, 0), '')
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-3-63a44cce5b8d> in <module>()
----> 1 numpy.save("/tmp/a.npy", a)

/usr/lib64/python2.7/site-packages/numpy/lib/npyio.pyc in save(file, arr, allow_pickle, fix_imports)
    509         arr = np.asanyarray(arr)
    510         format.write_array(fid, arr, allow_pickle=allow_pickle,
--> 511                            pickle_kwargs=pickle_kwargs)
    512     finally:
    513         if own_fid:

/usr/lib64/python2.7/site-packages/numpy/lib/format.py in write_array(fp, array, version, allow_pickle, pickle_kwargs)
    561     _check_version(version)
    562     used_ver = _write_array_header(fp, header_data_from_array_1_0(array),
--> 563                                    version)
    564     # this warning can be removed when 1.9 has aged enough
    565     if version != (2, 0) and used_ver == (2, 0):

/usr/lib64/python2.7/site-packages/numpy/lib/format.py in _write_array_header(fp, d, version)
    306     header.append("}")
    307     header = "".join(header)
--> 308     header = asbytes(_filter_header(header))
    309 
    310     hlen = len(header) + 1 # 1 for newline

/usr/lib64/python2.7/site-packages/numpy/lib/format.py in _filter_header(s)
    466             tokens.append(token)
    467         last_token_was_number = (token_type == tokenize.NUMBER)
--> 468     return tokenize.untokenize(tokens)
    469 
    470 

/usr/lib64/python2.7/tokenize.pyc in untokenize(iterable)
    260     """
    261     ut = Untokenizer()
--> 262     return ut.untokenize(iterable)
    263 
    264 def generate_tokens(readline):

/usr/lib64/python2.7/tokenize.pyc in untokenize(self, iterable)
    196                 break
    197             tok_type, token, start, end, line = t
--> 198             self.add_whitespace(start)
    199             self.tokens.append(token)
    200             self.prev_row, self.prev_col = end

/usr/lib64/python2.7/tokenize.pyc in add_whitespace(self, start)
    185     def add_whitespace(self, start):
    186         row, col = start
--> 187         assert row <= self.prev_row
    188         col_offset = col - self.prev_col
    189         if col_offset:

AssertionError: 
  • PYTHON 2.7.14
Python 2.7.14 (default, Dec 11 2017, 16:08:01) 
Type "copyright", "credits" or "license" for more information.

IPython 5.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import numpy

In [2]: a = numpy.ndarray([0,0,0,0], dtype=numpy.uint8)

In [3]: numpy.save("/tmp/a.npy", a)
token (51, '{', (1, 0), (1, 1), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (3, "'descr'", (1, 1), (1, 8), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (51, ':', (1, 8), (1, 9), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (3, "'|u1'", (1, 10), (1, 15), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (51, ',', (1, 15), (1, 16), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (3, "'fortran_order'", (1, 17), (1, 32), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (51, ':', (1, 32), (1, 33), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (1, 'False', (1, 34), (1, 39), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (51, ',', (1, 39), (1, 40), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (3, "'shape'", (1, 41), (1, 48), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (51, ':', (1, 48), (1, 49), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (51, '(', (1, 50), (1, 51), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (2, '0', (1, 51), (1, 52), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (51, ',', (1, 52), (1, 53), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (2, '0', (1, 54), (1, 55), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (51, ',', (1, 55), (1, 56), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (2, '0', (1, 57), (1, 58), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (51, ',', (1, 58), (1, 59), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (2, '0', (1, 60), (1, 61), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (51, ')', (1, 61), (1, 62), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (51, ',', (1, 62), (1, 63), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (51, '}', (1, 64), (1, 65), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (4, '\n', (1, 69), (1, 70), "{'descr': '|u1', 'fortran_order': False, 'shape': (0, 0, 0, 0), }    \n")
token (0, '', (2, 0), (2, 0), '')

@ghost commented Jan 9, 2018

FWIW, there was a big rant from one of the Red Hat folks about people not using the Pythons from software collections a while ago. If it's a Python bug, then there's (probably) nothing to fix on our end.

@indiajoe

I have the same error on CentOS 7, which uses Python 2.7.5 (default, Aug 4 2017, 00:39:18) and numpy 1.14.0.

Is the only option to wait until CentOS 7 updates Python?

@HuynhLam

I have the same problem while running on CentOS 7, python 2.7.5 and numpy 1.14.0

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-32-9eb6a987a90e> in <module>()
      4 # saved_path = os.path.join('err_stats')
      5 # print(saved_path)
----> 6 np.save('err_stats', err_vals)
      7 mask = np.std(err_vals, axis=0)
      8 print(mask.shape)

/lib/python2.7/site-packages/numpy/lib/npyio.pyc in save(file, arr, allow_pickle, fix_imports)
    509         arr = np.asanyarray(arr)
    510         format.write_array(fid, arr, allow_pickle=allow_pickle,
--> 511                            pickle_kwargs=pickle_kwargs)
    512     finally:
    513         if own_fid:

/lib/python2.7/site-packages/numpy/lib/format.pyc in write_array(fp, array, version, allow_pickle, pickle_kwargs)
    560     _check_version(version)
    561     used_ver = _write_array_header(fp, header_data_from_array_1_0(array),
--> 562                                    version)
    563     # this warning can be removed when 1.9 has aged enough
    564     if version != (2, 0) and used_ver == (2, 0):

/lib/python2.7/site-packages/numpy/lib/format.pyc in _write_array_header(fp, d, version)
    306     header.append("}")
    307     header = "".join(header)
--> 308     header = asbytes(_filter_header(header))
    309 
    310     hlen = len(header) + 1 # 1 for newline

/lib/python2.7/site-packages/numpy/lib/format.pyc in _filter_header(s)
    465             tokens.append(token)
    466         last_token_was_number = (token_type == tokenize.NUMBER)
--> 467     return tokenize.untokenize(tokens)
    468 
    469 

/usr/lib64/python2.7/tokenize.pyc in untokenize(iterable)
    260     """
    261     ut = Untokenizer()
--> 262     return ut.untokenize(iterable)
    263 
    264 def generate_tokens(readline):

/usr/lib64/python2.7/tokenize.pyc in untokenize(self, iterable)
    196                 break
    197             tok_type, token, start, end, line = t
--> 198             self.add_whitespace(start)
    199             self.tokens.append(token)
    200             self.prev_row, self.prev_col = end

/usr/lib64/python2.7/tokenize.pyc in add_whitespace(self, start)
    185     def add_whitespace(self, start):
    186         row, col = start
--> 187         assert row <= self.prev_row
    188         col_offset = col - self.prev_col
    189         if col_offset:

AssertionError: 

@eric-wieser (Member)

I'll ask again - in the ipython traceback you have there, can you type %debug, then up a few times until you're in _write_array_header, and then run print(repr(header))?

@HuynhLam commented Jan 20, 2018
> /usr/lib64/python2.7/tokenize.py(187)add_whitespace()
    185     def add_whitespace(self, start):
    186         row, col = start
--> 187         assert row <= self.prev_row
    188         col_offset = col - self.prev_col
    189         if col_offset:

ipdb> up
> /usr/lib64/python2.7/tokenize.py(198)untokenize()
    196                 break
    197             tok_type, token, start, end, line = t
--> 198             self.add_whitespace(start)
    199             self.tokens.append(token)
    200             self.prev_row, self.prev_col = end

ipdb> up
> /usr/lib64/python2.7/tokenize.py(262)untokenize()
    260     """
    261     ut = Untokenizer()
--> 262     return ut.untokenize(iterable)
    263 
    264 def generate_tokens(readline):

ipdb> up
> /var/tmp/nb-cache/test/lib/python2.7/site-packages/numpy/lib/format.py(467)_filter_header()
    465             tokens.append(token)
    466         last_token_was_number = (token_type == tokenize.NUMBER)
--> 467     return tokenize.untokenize(tokens)
    468 
    469 

ipdb> up
> /var/tmp/nb-cache/test/lib/python2.7/site-packages/numpy/lib/format.py(308)_write_array_header()
    306     header.append("}")
    307     header = "".join(header)
--> 308     header = asbytes(_filter_header(header))
    309 
    310     hlen = len(header) + 1 # 1 for newline

ipdb> print(repr(header))
"{'descr': '<f8', 'fortran_order': False, 'shape': (1449, 480, 640), }"

@lzj1769 commented Jan 25, 2018

I have the same problem using python 2.7.5, numpy 1.14.0 and CentOS.
Does anyone have an idea how to solve this issue?

@littleeleventhwolf

Same problem while running on CentOS 7, python 2.7.5 and numpy 1.14.0.

def convert_mean(binMean, npyMean):
    blob = caffe.proto.caffe_pb2.BlobProto()
    bin_mean = open(binMean, 'rb').read()
    blob.ParseFromString(bin_mean)
    arr = np.array(caffe.io.blobproto_to_array(blob))
    npy_mean = arr[0]
    np.save(npyMean, npy_mean)
binMean = caffe_root + 'examples/cifar10/mean.binaryproto'
npyMean = caffe_root + 'examples/cifar10/mean.npy'
convert_mean(binMean, npyMean)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-17-61deea8c4153> in <module>()
      9 binMean = caffe_root + 'examples/cifar10/mean.binaryproto'
     10 npyMean = caffe_root + 'examples/cifar10/mean.npy'
---> 11 convert_mean(binMean, npyMean)

<ipython-input-17-61deea8c4153> in convert_mean(binMean, npyMean)
      6     arr = np.array(caffe.io.blobproto_to_array(blob))
      7     npy_mean = arr[0]
----> 8     np.save(npyMean, npy_mean)
      9 binMean = caffe_root + 'examples/cifar10/mean.binaryproto'
     10 npyMean = caffe_root + 'examples/cifar10/mean.npy'

/usr/lib64/python2.7/site-packages/numpy/lib/npyio.pyc in save(file, arr, allow_pickle, fix_imports)
    509         arr = np.asanyarray(arr)
    510         format.write_array(fid, arr, allow_pickle=allow_pickle,
--> 511                            pickle_kwargs=pickle_kwargs)
    512     finally:
    513         if own_fid:

/usr/lib64/python2.7/site-packages/numpy/lib/format.pyc in write_array(fp, array, version, allow_pickle, pickle_kwargs)
    560     _check_version(version)
    561     used_ver = _write_array_header(fp, header_data_from_array_1_0(array),
--> 562                                    version)
    563     # this warning can be removed when 1.9 has aged enough
    564     if version != (2, 0) and used_ver == (2, 0):

/usr/lib64/python2.7/site-packages/numpy/lib/format.pyc in _write_array_header(fp, d, version)
    306     header.append("}")
    307     header = "".join(header)
--> 308     header = asbytes(_filter_header(header))
    309 
    310     hlen = len(header) + 1 # 1 for newline

/usr/lib64/python2.7/site-packages/numpy/lib/format.pyc in _filter_header(s)
    465             tokens.append(token)
    466         last_token_was_number = (token_type == tokenize.NUMBER)
--> 467     return tokenize.untokenize(tokens)
    468 
    469 

/usr/lib64/python2.7/tokenize.pyc in untokenize(iterable)
    260     """
    261     ut = Untokenizer()
--> 262     return ut.untokenize(iterable)
    263 
    264 def generate_tokens(readline):

/usr/lib64/python2.7/tokenize.pyc in untokenize(self, iterable)
    196                 break
    197             tok_type, token, start, end, line = t
--> 198             self.add_whitespace(start)
    199             self.tokens.append(token)
    200             self.prev_row, self.prev_col = end

/usr/lib64/python2.7/tokenize.pyc in add_whitespace(self, start)
    185     def add_whitespace(self, start):
    186         row, col = start
--> 187         assert row <= self.prev_row
    188         col_offset = col - self.prev_col
    189         if col_offset:

AssertionError: 

@HuynhLam commented Jan 26, 2018 via email

@emma-sjwang

I solved this problem by
pip install numpy==1.13.3

@pnbat (Contributor, Author) commented Jan 31, 2018

I am still receiving emails regarding this issue. Just to make it clear again, it turned out that this is not a numpy issue. The problem occurs while using numpy 1.14 with python 2.7.5. Unfortunately, that python version is the most up-to-date version in the default CentOS repositories. So what can you do about it?

As pointed out before, you could simply use a previous version of numpy (e.g. 1.13.3). This does not look like a good solution to me, but if you have to stick to python 2.7.5 for any good reason, then this might be the solution you are looking for.
Otherwise, using the newer python 2 provided by the CentOS software collections repository should solve the issue. On most other distributions, a newer python 2 version should be already available (probably since quite a while).
If you are bound to CentOS and its default repositories, but you have the freedom to use docker, this could also help, depending on your particular use case.

@jo11111 commented Feb 2, 2018

I'm also seeing this problem with python 2.7.6.
I've downgraded numpy to get around it.

@ahaldane (Member) commented Feb 2, 2018

Did we numpy devs figure out what changed between 1.13 and 1.14 to trigger this?

@pnbat (Contributor, Author) commented Feb 2, 2018

@ahaldane Well, it is pretty much related to the changes of #9025.
I thought my first two comments would provide enough information. I was going through the code changes and, if I remember correctly, together with the changes for the alignment the logic also changed quite a bit. I think the order of some operations changed. This should not be a huge problem, but with some old python 2 versions "Tokens in _filter_header displayed below, python 2.7.5 shows no newline (\n)" (cf. my second comment in this thread).
I was about to change the code myself, but it seems a bit messy and I have not worked on numpy before. What I also figured out was that pull request #9025 was pushed quite a lot to be integrated into 1.14. Things like this are usually not good for overall code quality...

@eric-wieser (Member) commented Feb 2, 2018

So would adding a trailing newline before _filter_header, and then stripping it again after, solve the problem?

The ENH in #9025 is misleading - it's a substantial bugfix for incorrect alignments in python 2 and format 2.0, which I would consider to be a big plus for code quality

@pnbat (Contributor, Author) commented Feb 2, 2018

If you look at the diff, in line 313 of the old code, the header was modified to contain a newline:
header = header + ' '*topad + '\n'
This was moved to line 333 of the pull request code:
header = header + b' '*topad + b'\n'
That means the _filter_header method was called with the modified header in the old code, and is called with the unmodified header (not padded, no newline) in the new code. I do not have time to dig deeper into this, but adding the newline before calling _filter_header was probably the workaround for the python 2.7.5 issue with the tokenizer.

Regarding code quality, the pull request definitely made things worse. There is plenty of duplication and a lot of comments (which means the code itself is not clear). Well, if this is a bugfix for some alignment issue, then having things work properly might be more important. Too bad, it caused other problems which need to be fixed now.

@pnbat (Contributor, Author) commented Feb 2, 2018

How do you guys test the library? Are you running tests with different python versions?

@ahaldane (Member) commented Feb 2, 2018

Here's the cpython commit that affected things: python/cpython@5e6db31

I don't have time right now to read it and understand why this fixed missing newlines, but I did check that in python 2.7.5:

>>> tokenize.untokenize(tokenize.generate_tokens(StringIO("1000").read))
AssertionError
>>> tokenize.untokenize(tokenize.generate_tokens(StringIO("1000\n").read))
'1000\n'

so adding the newline would fix it for old python.

@pnbat (Contributor, Author) commented Feb 2, 2018

@ahaldane Well, there was a newline before the pull request...

@ahaldane (Member) commented Feb 2, 2018

Yeah, I got that now; I hadn't read your comments carefully before, sorry. My last comment was to reply that adding the newline looks like a valid fix.

What I don't quite see is why the old tokenize code failed if a newline was missing, and whether that was intentional or not (seems not).

As to which pythons we test, it is 2.7.14. If you click around here you can see the options: https://travis-ci.org/numpy/numpy.

The 1.14 release notes say "This release supports Python 2.7 and 3.4 - 3.6.", so I think we should try to fix this for 2.7.5.

@eric-wieser (Member)

There is plenty of duplication and a lot of comments (which means the code itself is not clear)

Can you point to duplication that was introduced by the patch?

Also, a patch adding comments doesn't mean the code became less clear - it might just mean the old code was already unclear.

I think we could fix this by adding the newline first, and changing the padding to

header = header[:-1] + b' '*topad + header[-1]
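As a sketch of that padding order (values hypothetical; note that on Python 3, `header[-1]` on a bytes object yields an int, so `header[-1:]` is needed to keep a bytes result):

```python
# Hypothetical header that already carries the trailing newline the
# tokenizer needs, plus an illustrative pad length.
header = b"{'descr': '<f8', 'fortran_order': False, 'shape': (64,), }\n"
topad = 4

# Insert the spaces *before* the trailing newline so the newline stays last.
padded = header[:-1] + b' ' * topad + header[-1:]

assert padded.endswith(b'\n')
assert len(padded) == len(header) + topad
```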

@ahaldane (Member) commented Feb 2, 2018

Yeah I would accept anything along those lines, with a comment explaining why it's needed.

Might also do header = asbytes(_filter_header(header + "\n"))[:-1]

@ahaldane (Member) commented Feb 2, 2018

I'm going to open and tag 1.14.1

@ahaldane ahaldane reopened this Feb 2, 2018
@ahaldane ahaldane added this to the 1.14.1 release milestone Feb 2, 2018
@pnbat (Contributor, Author) commented Feb 2, 2018

@eric-wieser I am not writing here to bother you guys, so there is no need to react like that. If you want to sort things out by yourself, fine for me.

@eric-wieser @ahaldane I would be careful with the two solutions you proposed. They both assume that in all python versions (by all I mean all you want to support) the newline will still be present after _filter_header has been called and all the tokenize and untokenize work has been performed. I would suggest double-checking this first.

Regarding the code quality:
What about
padded_header = pad_header_for_alignment(header, topad)
instead of

# Pad the header with spaces and a final newline such that the magic
# string, the header-length short and the header are aligned on a
# ARRAY_ALIGN byte boundary.  This supports memory mapping of dtypes
# aligned up to ARRAY_ALIGN on systems like Linux where mmap()
# offset must be page-aligned (i.e. the beginning of the file).
header = header + b' '*topad + b'\n'

And regarding the duplication, the version specific handling got much worse:

padlen_v1 = ARRAY_ALIGN - ((MAGIC_LEN + struct.calcsize('<H') + hlen) % ARRAY_ALIGN)
padlen_v2 = ARRAY_ALIGN - ((MAGIC_LEN + struct.calcsize('<I') + hlen) % ARRAY_ALIGN)
if hlen + padlen_v1 < 2**16 and version in (None, (1, 0)):
    version = (1, 0)
    header_prefix = magic(1, 0) + struct.pack('<H', hlen + padlen_v1)
    topad = padlen_v1
elif hlen + padlen_v2 < 2**32 and version in (None, (2, 0)):
    version = (2, 0)
    header_prefix = magic(2, 0) + struct.pack('<I', hlen + padlen_v2)
    topad = padlen_v2

Having two classes containing the version specific functionality with the same interface (the python kind of polymorphism) could improve the code a lot.

@eric-wieser (Member)

so there is no need to react like that

Sorry if my tone came across that way, and thanks for calling out the examples. I agree, there's still room for improvement there - patches welcome :)

@pnbat (Contributor, Author) commented Feb 2, 2018

@eric-wieser I will see if I find some time to contribute a bit. Seems like a good starting point now...

@pnbat (Contributor, Author) commented Feb 3, 2018

@eric-wieser @ahaldane currently on an 11-hour flight. I had a look at format.py and would like to work on a few issues, including the python 2.7.5 issue.

I never contributed to any large open source project; I guess it's time to change that. How can I contribute to numpy, specifically to format.py?

@ahaldane (Member) commented Feb 3, 2018

@pnbat I also started with some local fixes like this, and I found it was pretty easy to contribute once you are familiar with git/github.

We have a set of docs about contributing at https://docs.scipy.org/doc/numpy/dev/index.html, and the quick summary is at https://docs.scipy.org/doc/numpy/dev/gitwash/development_workflow.html.

Here's how I do it, in brief: I have a fork of numpy on github, which I clone to my computer. On my clone, I 1. checkout a new branch, 2. make my changes, 3. run tests with python ./runtests.py, 4. If passed, commit them with a properly formatted commit message, 5. push to that branch in my github fork. 6. Press the button on github to make a new pull request. Then we will review and eventually merge, possibly after a few rounds of fixes.

The docs have more details, but some things from the dev guide that I initially found numpy devs were picky about: 1. You should follow our/python's coding standards (PEP8). A common problem for me is forgetting to add a space between ) and {, and leaving trailing whitespace on a line. 2. The commit header should start with BUG: my description or similar, and the commit message should say Fixes #10348 to automate issue resolution.

Usually any pull request should include a new unit test, following the pattern of other tests. The exact placement (which file) doesn't matter as long as it vaguely makes sense. But for this fix, since the bug only occurs in python 2.7.5 there may be no unit test to add since we don't test that version, and probably numpy already fails tests for that python version.

@charris (Member) commented Feb 3, 2018

I'm starting to look at preparing the 1.14.1 release, so it would be good if this gets fixed soon. No pressure :-)

@pnbat (Contributor, Author) commented Feb 3, 2018

If it is that urgent I would recommend one of the workarounds that were suggested above: either adding the newline earlier, removing the +1 from the length and adapting the padding (Eric's suggestion), or adding the newline just for the filter and removing it afterwards (which could also be done in the filter method itself).
I have made a lot of changes which require some cleanup now. Tests are running, but I am calling private methods during the tests because I changed the interface. It did not look right to me that certain methods were part of the public interface. Enough work on the airplane; I will get back to you guys tomorrow.

@pnbat (Contributor, Author) commented Feb 4, 2018

@charris @eric-wieser @ahaldane I have pushed a few changes to my numpy fork (https://github.com/pnbat/numpy/tree/format)

The changes include a bugfix for this issue here (tested it with python 2.7.5 and 2.7.14, but unfortunately I cannot test it with python 3). This is basically the last commit. I went for the easy solution, adding a newline right before the filtering and removing it after the filtering. To make this work I also had to change from StringIO.read to StringIO.readline (which seems more appropriate in any case).
Apart from that I started with some refactoring to make the code more readable and more testable. It should also be quite easy to make the alignment configurable now. I could continue a bit more into that direction, but I would be happy about a bit of feedback at this point.
It is also not really clear to me what is part of the public API of the format module. To me the methods of interest are write_array, read_array and open_memmap. The methods magic, read_magic, dtype_to_descr, header_data_from_array_1_0, write_array_header_1_0, write_array_header_2_0, read_array_header_1_0, read_array_header_2_0 do not look like they should be exposed. What is the reason behind that?
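For reference, tokenize.generate_tokens expects a readline-style callable that yields one line per call, which is why .readline is the appropriate argument (a sketch using Python 3's io.StringIO; the header value is illustrative):

```python
import io
import tokenize

src = "{'descr': '<f8', 'fortran_order': False, 'shape': (3,), }\n"

# generate_tokens takes a callable returning one line per call -- the
# readline method of a file-like object -- not the whole text from .read().
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))

# The full token tuples round-trip back to the exact source.
assert tokenize.untokenize(tokens) == src
```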

@ahaldane (Member) commented Feb 5, 2018

Great! If you go to the page for your github fork of numpy, you should see a button "create pull request". Or, if you go to that specific branch, there should be a button "create pull request". Push the button.

It will give you a little form for a description. Fill it in briefly to explain what the pull request (PR) does, why the changes were needed, and any other comments (i.e., just copy your last comment). I suggest giving the PR the title "BUG: fix np.save tokenizer for python 2.7.5".

Then submit the PR. Then automatic tests will be run, and we will be able to do line-by-line comparison and review of the diff on github, and answer questions. You can later update your PR based on comments.

@baixiaoustc

Guys, how can I upgrade to the newest numpy, since 1.14.1 is not released?

@charris (Member) commented Feb 12, 2018

There should be a release in the next week, we are just waiting on a couple of things.

@baixiaoustc

@charris thanks a lot
