8000 EHN: Using in-mem temporary files rather than in-disk for building zip archive in _savez by rherault-pro · Pull Request #6540 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

EHN: Using in-mem temporary files rather than in-disk for building zip archive in _savez #6540

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

rherault-pro
Copy link

Temporary files are used in the _savez function to stage data during
archiving. The choice is done through new keyword arg: disk_temp_files
If set to True, use in-disk temporary files (default_option).
If set to False, use in-mem temporary files.

Please note that the in-mem files are based on BytesIO.
In python2 BytesIO lacks getbuffer method which returns the content of
the file without copying it. Thus in python2, getvalue is used which is
less memory efficient. In python3 getbuffer is privileged.

This pull request is based on an other pull request #6545.

@charris
Copy link
Member
charris commented Oct 21, 2015

Please read doc/source/dev/gitwash/development_workflow.rst for commit message formatting.

@rherault-pro rherault-pro changed the title Using in-mem temporary files rather than in-disk for building zip archive in _savez EHN: Using in-mem temporary files rather than in-disk for building zip archive in _savez Oct 21, 2015
@rherault-pro
Copy link
Author

All test are passing except the wheel one. Before squashing commits the "wheel" used to pass too an my code do not touch at all distribution. Any clue on what can happen ?

@charris
Copy link
Member
charris commented Oct 21, 2015

Do a force push: git push --force origin HEAD if you have that branch checked out.

@njsmith
Copy link
Member
njsmith commented Oct 21, 2015

I haven't read carefully, but I doubt you can turn on in-memory tempfiles by default, because AFAICT they greatly increase peak memory usage so you'll break existing code that tries to save arrays that are, say, 90% of the size of RAM. The real win would be to avoid making a copy of the data at all (e.g. by pointing zipfile directly at the array's internal memory buffer when possible).

Can you make one pull request just for the with stuff and then look at avoiding the on-disk temporary files as a second PR?

@rherault-pro
Copy link
Author

@charris done, but that do not trigger the checks.

@njsmith at least in python3 buffer is shared between the memfile and zipfile.
I don't see how we can manage to have a flat memory region with the npy header followed by the array data meanwhile the array data is used by an actual ndarray. May be it can be done with some mmap magic but it will for sure break compatibility with non *nix systems.

I want first to solve that wheel problem before splitting the PR and defaulting to in-disk temp file

@rherault-pro rherault-pro force-pushed the memfileinsavez branch 2 times, most recently from b92bcfe to 08359d9 Compare October 21, 2015 21:18
@rherault-pro
Copy link
Author

I've done the split between #6545 and current pull request.
It is defaulting to in-disk temp file.
TODO need to make unit tests on the in-mem temp files (as a result of defaulting to in-disk)

@rherault-pro
Copy link
Author

DONE Unit test on mem file added to test_io.py

@rherault-pro rherault-pro force-pushed the memfileinsavez branch 4 times, most recently from 2f0659e to 22b6fe3 Compare October 27, 2015 16:09
@rherault-pro
Copy link
Author

I've got a problem with test_load_refcount that randomly fails even on the master branch
#6571

So sometimes Travis check fails or not with the same exact code.

@homu
Copy link
Contributor
homu commented Feb 9, 2016

☔ The latest upstream changes (presumably #7133) made this pull request unmergeable. Please resolve the merge conflicts.

The function _savez stages data during archiving through temporary
files. Before that commit these temp and zip files were handled through
old open/close style. This commits switches to 'with' statement.
which is the recommendation since python 2.5. It's simplifying A LOT
exception handling and garbage collection.

Due to uncompatibility between zipfile in python2.6 and with statement,
ZipFile class has been subclassed to add needed stub methods.

BUG: WinNT prevents a file to be opened twice

Using contextlib.closing rather than own class
Temporary files are used in the _savez function to stage data during
archiving. The choice is done through new keyword arg: disk_temp_files
If set to True, use in-disk temporary files (default_option).
If set to False, use in-mem temporary files.

Please note that the in-mem files are based on BytesIO.
In python2 BytesIO lacks getbuffer method which returns the content of
the file without copying it. Thus in python2, getvalue is used which is
less memory efficient. In python3 getbuffer is privileged.

WINNT bug
@homu
Copy link
Contributor
homu commented Apr 7, 2016

☔ The latest upstream changes (presumably #7518) made this pull request unmergeable. Please resolve the merge conflicts.

@serhiy-storchaka
Copy link
Contributor

In Python 3.6 you can write to ZIP files directly, without using temporary files. See #9863.

@charris
Copy link
Member
charris commented Oct 15, 2017

After the release of NumPy 1.14.0, this may be achieved by upgrading to Python 3.6, which allows writing streams directly to the zipfile. That seems a better solution, so closing this. Thanks @rherault-insa.

@charris charris closed this Oct 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0