BUG: fix np.save tokenizer for python 2.7.5 #10523

pnbat · 2018-02-05T01:24:30Z

The changes include a bugfix for this issue here (tested it with python 2.7.5 and 2.7.14, but unfortunately I cannot test it with python 3). This is basically the last commit. I went for the easy solution, adding a newline right before the filtering and removing it after the filtering. To make this work I also had to change from StringIO.read to StringIO.readline (which seems more appropriate in any case).
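The newline workaround described above can be sketched roughly as follows. This is a minimal illustration rather than the code from this PR: the function name `filter_long_suffix` is hypothetical, and the sketch assumes the header is a dict literal in which Python 2 may have written integers with an `L` suffix.

```python
import tokenize
from io import StringIO


def filter_long_suffix(header):
    """Drop Python 2 style 'L' suffixes from integers in a header string.

    Hypothetical helper, for illustration only.
    """
    # Workaround: append a newline so the tokenizer sees a complete line.
    source = header + "\n"
    tokens = []
    last_was_number = False
    # generate_tokens expects a readline-style callable, hence .readline
    # rather than .read.
    for token in tokenize.generate_tokens(StringIO(source).readline):
        token_type, token_string = token[0], token[1]
        if (last_was_number and token_type == tokenize.NAME
                and token_string == "L"):
            # Skip the long-integer suffix that directly follows a number.
            last_was_number = False
            continue
        tokens.append(token)
        last_was_number = (token_type == tokenize.NUMBER)
    # Remove the newline that was appended above.
    return tokenize.untokenize(tokens)[:-1]
```

The filtered string may contain extra whitespace where a suffix was dropped, which does not matter because the header is evaluated as a literal when it is read back.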

Apart from that I started with some refactoring to make the code more readable and more testable. It should also be quite easy to make the alignment configurable now. I could continue a bit more into that direction, but I would be happy about a bit of feedback at this point.

It is also not really clear to me what is part of the public API of the format module. To me the methods of interest are write_array, read_array and open_memmap. The methods magic, read_magic, dtype_to_descr, header_data_from_array_1_0, write_array_header_1_0, write_array_header_2_0, read_array_header_1_0, read_array_header_2_0 do not look like they should be exposed. What is the reason behind that?

pnbat · 2018-02-05T02:36:37Z

What is the issue with circleci? I suspect it is not related to my changes...

eric-wieser · 2018-02-05T03:08:54Z

Don't worry about circle CI yet - it was added recently and seems to have teething trouble.

This seems like too large a refactor to slip into a bugfix release, especially since the original problem was not a bug in our code. I'd be more comfortable with a tiny change for the workaround, and then queue up this refactor for 1.15.

pnbat · 2018-02-05T03:21:50Z

Well, the last two commits should fix the original issue.
Regarding the refactoring: I did not touch any tests and kept the interface of the format module as it was before. So it boils down to how much you trust the tests :) I can also prepare two pull requests, one for the bugfix and the other for the refactoring. In fact, I would also like to continue a bit more with the refactoring. But for that I would be happy about some feedback, and I also need to know which methods in the format module need to be exposed (public) and which are exposed for other reasons (e.g. for tests without calling private methods).

eric-wieser · 2018-02-05T03:53:44Z

In practice, I think we consider np.lib.* to be private, and only the functions exposed through np.* to be public.

eric-wieser · 2018-02-05T01:46:21Z

numpy/lib/format.py

-        # Totally non-contiguous data. We will have to make it C-contiguous
-        # before writing. Note that we need to test for C_CONTIGUOUS first
-        # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS.
-        d['fortran_order'] = False


This comment should be preserved

I have the feeling that the part about "[w]e will have to make it C-contiguous before writing" belongs somewhere else. At least I do not see the connection between setting d['fortran_order'] = False and making it C-contiguous before writing. The part that c_contiguous implies not f_contiguous could be explained above the refactored code...

Just to fill you in: The c_contiguous and f_contiguous flags have somewhat confusing meanings which it is too late for us to change for back-compat reasons. It is possible for arrays to be both c-contiguous and f-contiguous (e.g., all non-unusually-strided 1d arrays are like this). It is also possible to be neither (e.g., for weird strides). I think the logic here is that if it is neither, we will later copy it into a C-contiguous buffer, so fortran_order should be False.

I agree something about this code seems a bit messy, since we repeat the check for f-order in two places: Here, and in write_array (which is where we copy to a C-contiguous buffer if needed). But maybe it's OK...
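The flag logic discussed in this thread can be sketched in plain Python. This is a simplified model, not NumPy's implementation: the helper names are hypothetical, and layouts are described by `shape`, `strides` (in bytes) and `itemsize` instead of real arrays, whose flags are available as `arr.flags.c_contiguous` and `arr.flags.f_contiguous`.

```python
def is_c_contiguous(shape, strides, itemsize):
    """Return True if the layout is row-major with no gaps."""
    if 0 in shape:
        return True  # empty arrays count as contiguous
    expected = itemsize
    # Walk from the last (fastest varying) axis to the first.
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim == 1:
            continue  # a length-1 axis places no constraint on its stride
        if stride != expected:
            return False
        expected *= dim
    return True


def is_f_contiguous(shape, strides, itemsize):
    """Return True if the layout is column-major with no gaps."""
    if 0 in shape:
        return True
    expected = itemsize
    # Walk from the first (fastest varying) axis to the last.
    for dim, stride in zip(shape, strides):
        if dim == 1:
            continue
        if stride != expected:
            return False
        expected *= dim
    return True


def header_fortran_order(shape, strides, itemsize):
    """Mirror the three-way branch discussed in the review comment."""
    # C order must be tested first: a 1-D array satisfies both checks.
    if is_c_contiguous(shape, strides, itemsize):
        return False
    if is_f_contiguous(shape, strides, itemsize):
        return True
    # Neither: the data would be copied to C order before writing.
    return False
```

For example, a 1-D layout `(5,)` with strides `(8,)` passes both checks, while an every-other-element view with shape `(3,)` and strides `(16,)` passes neither; in both cases the sketch reports `fortran_order` as `False`.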

eric-wieser · 2018-02-05T01:57:38Z

numpy/lib/format.py

+        result_list = ["{"]
+        for key, value in sorted(dictionary.items()):
+            # Need to use repr here, since we eval these when reading
+            result_list.append("'%s': %s, " % (key, repr(value)))


You should use repr for both, via %r

It was like that before, I just moved this piece of code around a bit...
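The `%r` suggestion can be illustrated with a small sketch. The function name `serialize_header` is hypothetical and this is not the refactored code from the PR; it only shows that applying `repr` to the key as well keeps the output evaluable when a key contains a quote character.

```python
def serialize_header(dictionary):
    """Serialize a dict so that evaluating the result gives the dict back."""
    result_list = ["{"]
    for key, value in sorted(dictionary.items()):
        # %r applies repr to both key and value, so quoting is handled
        # for keys as well as values.
        result_list.append("%r: %r, " % (key, value))
    result_list.append("}")
    return "".join(result_list)
```

With `"'%s': %s"`, a key such as `it's` would produce the broken literal `'it's'`, whereas `%r` lets `repr` choose a safe quoting.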

charris · 2018-02-05T04:51:24Z

I agree with Eric here that a very simple fix for the 1.14 problems is desirable; refactoring is best left to 1.15, where there is more time.

charris · 2018-02-05T04:54:54Z

The circleci problem probably indicates that you didn't branch off of current master. I don't know if a rebase will help as circleci seems to not update once it has run. Might be worth a try, though, just so we can learn if that works.

pnbat · 2018-02-05T04:59:29Z

@charris If you want I can prepare two new pull requests. One with the bugfix and then a second one with the refactorings so far. I just have to do it as soon as possible, since I am currently between two jobs and from tomorrow on I need to get a permission from my employer. This should not be a problem, but might take a few days.

pnbat · 2018-02-05T05:36:46Z

@charris The new pull request for the bugfix is #10524 (circleci is working fine now) and the new pull request for the refactoring is #10525.

charris · 2018-02-05T05:54:17Z

@pnbat Great, thanks.

rgommers · 2018-02-05T07:46:49Z

In practice, I think we consider np.lib.* to be private, and only the functions exposed through np.* to be public.

Not really. Everything that's also in np.* should be imported from there, however there are certainly functions and classes that are only exposed in np.lib (e.g. NumpyVersion, stride_tricks, Arrayterator).

eric-wieser · 2018-02-05T09:18:27Z

Looks like I made that up - thanks for calling me out on it - looking again, we've done a pretty good job in lib of naming private helper functions appropriately

pnbat added 5 commits February 4, 2018 23:50

class MagicString

87bf314

class HeaderVersion, some refactorings

07394d9

class DictionarySerializer, some refactoring

2306d7b

classes MultiVersionHeaderSerializer and HeaderSerializer

7265d1b

python 2.7.5 bugfix

019cb64

charris added the 00 - Bug, component: numpy.lib, and 09 - Backport-Candidate labels Feb 5, 2018

charris added this to the 1.14.1 release milestone Feb 5, 2018

minor fix for python3 compatibility

bb5bc08

eric-wieser reviewed Feb 5, 2018

pnbat closed this Feb 5, 2018

charris removed this from the 1.14.1 release milestone Feb 5, 2018

charris removed the 09 - Backport-Candidate label Feb 5, 2018
