8000 ENH: Save to ZIP files without using temporary files. by serhiy-storchaka · Pull Request #9863 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: Save to ZIP files without using temporary files. #9863

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 15, 2017

Conversation

serhiy-storchaka
Copy link
Contributor

In Python 3.6 large data can be written to ZIP files directly, without using temporary files.

try:
format.write_array(fid, np.asanyarray(val),
val = np.asanyarray(val)
force_zip64 = val.nbytes > 1000000000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be clearer if a simple power of 2 is used, something like val.nbytes > 2**30.

fd, tmpfile = tempfile.mkstemp(prefix=file_prefix, dir=file_dir, suffix='-numpy.npy')
os.close(fd)
try:
if sys.version_info >= (3, 6):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might add a comment that since Python 3.6 it is possible to write directly to a zipfile.

@charris
Copy link
Member
charris commented Oct 15, 2017

This is worthy of a release note. Could you add something to doc/release/1.14.0-notes.rst, probably under improvements.

@serhiy-storchaka
Copy link
Contributor Author

Done. If you with I can add other improvements. Save small data without using temporary files even in Python <3.6. Use the context manager for avoiding file descriptor leak in case of failure.

@charris
Copy link
Member
charris commented Oct 15, 2017

Save small data without using temporary files even in Python < 3.6.

How would you do that? The problem is that the array data needs to be formatted and the resultant stream cannot be written directly to the zipfile.

@charris
Copy link
Member
charris commented Oct 15, 2017

Upgrades are pretty easy these days, I'd be inclined to simply point to Python 3.6 for people who find the old implementation limiting.

@serhiy-storchaka
Copy link
Contributor Author

The size of the formatted data can be estimated by the nbytes attribute of the array. If use in-memory files for formatting arrays smaller than say 1MB, this would significantly increase the performance of writing small arrays, but don't add too much to memory consumption (CPython itself needs tens megabytes to start).

@charris
Copy link
Member
charris commented Oct 15, 2017

Hmm, those are pretty small files. I suspect speed is not of great importance here, but size can be. However, if you wish to add that I won't complain.

Since Python 3.6 it is possible to write directly to a ZIP file,
without creating temporary files.
@serhiy-storchaka
Copy link
Contributor Author

If speed is not of great importance here, I think it is not worth to complicate the code.

I have moved import tempfile which is not needed in 3.6.

format.write_array(fid, np.asanyarray(val),
val = np.asanyarray(val)
force_zip64 = val.nbytes >= 2**30
with zipf.open(fname, 'w', force_zip64=force_zip64) as fid:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is force_zip64 an unrelated fix here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and should this be wb to match the one below?

Copy link
Member
@charris charris Oct 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mode parameter, if included, must be 'r' (the default) or 'w'

I suspect the zipped files are automatically binary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

force_zip64 is needed when write a stream. Tests are failed without this option.

'r' and 'w' are the only supported options. Opened file-like objects are binary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like zip64 has been available since 2.7, but the force_zip64 keyword is new. It is used for files of unknown size that may exceed 2 GiB. As we are formatting the array, I suppose it qualifies as of unknown size, but probably not by much.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using force_zip64 on all version of python 3, not just 3.6?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

force_zip64 doesn't exist in earlier versions.

@charris charris merged commit ebf4a6c into numpy:master Oct 15, 2017
@charris
Copy link
Member
charris commented Oct 15, 2017

Thanks @serhiy-storchaka .

@serhiy-storchaka serhiy-storchaka deleted the savez-no-tmpfile branch October 15, 2017 21:47
@serhiy-storchaka
Copy link
Contributor Author

Thank you for your time and review @charris.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0