[MRG] Uses gzip when caching in fetch_openml by thomasjpfan · Pull Request #11830 · scikit-learn/scikit-learn · GitHub

[MRG] Uses gzip when caching in fetch_openml #11830

Merged
merged 5 commits into from
Sep 2, 2018

Conversation

thomasjpfan
Member

Reference Issues/PRs

Fixes #11822

What does this implement/fix? Explain your changes.

Adds the HTTP header: Accept-encoding: gzip when data_home is not None.
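A minimal sketch of the behaviour described above, assuming Python 3's urllib.request. The names build_request and EXAMPLE_URL are hypothetical and used only for illustration; they are not taken from this PR's diff.

```python
from urllib.request import Request

# Hypothetical placeholder URL (assumption); the real OpenML prefix is not shown here.
EXAMPLE_URL = "https://example.org/data/v1/download/1"


def build_request(url, data_home):
    """Build a request that asks for gzip only when a cache directory is set."""
    req = Request(url)
    if data_home is not None:
        # Ask the server for a compressed body; the server may still ignore this.
        req.add_header("Accept-encoding", "gzip")
    return req


cached = build_request(EXAMPLE_URL, data_home="/tmp/scikit_learn_data")
uncached = build_request(EXAMPLE_URL, data_home=None)
print(cached.has_header("Accept-encoding"), uncached.has_header("Accept-encoding"))
```

Note that the header is only a request: the server decides whether the body is actually compressed.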

@thomasjpfan thomasjpfan changed the title MRG: Uses gzip when caching [MRG]: Uses gzip when caching Aug 15, 2018
@thomasjpfan thomasjpfan changed the title [MRG]: Uses gzip when caching [MRG] Uses gzip when caching Aug 15, 2018
@rth
Member
rth commented Aug 16, 2018

Thanks for your PR!

Could you please add "openml" to the title somewhere? Currently it's not really clear from the title or description what this PR is about. (That would help attract reviewers.)

Also, tests (in particular Travis CI) are failing.

@thomasjpfan thomasjpfan changed the title [MRG] Uses gzip when caching [MRG] Uses gzip when caching in openml Aug 16, 2018
@thomasjpfan thomasjpfan changed the title [MRG] Uses gzip when caching in openml [WIP] Uses gzip when caching in openml Aug 16, 2018
@thomasjpfan thomasjpfan changed the title [WIP] Uses gzip when caching in openml [MRG] Uses gzip when caching in openml Aug 16, 2018
@thomasjpfan thomasjpfan changed the title [MRG] Uses gzip when caching in openml [MRG] Uses gzip when caching in fetch_openml Aug 16, 2018
@thomasjpfan
Member Author

Ah I see, the test mocks have been updated to account for the gzip feature.

Member
@jnothman jnothman left a comment

Can you confirm that this results in a faster fetch of MNIST when cache=False?

fsrc = urlopen(_OPENML_PREFIX + openml_path)
with open(local_path, 'wb') as fdst:
req.add_header('Accept-encoding', 'gzip')
fsrc = urlopen(req)
Member

This should probably double check that the http response says it is actually gzipped.

Member Author

When the response's Content-Encoding is not gzip, I see two options to handle this:

  1. Do another request without gzip.
  2. Raise an exception.

Which do you prefer?

@thomasjpfan
Member Author

With gzip enabled, the response body size is 19.8 MB; without gzip, it is 127.9 MB. With the MNIST dataset, there is little to no speed difference when running fetch_openml, since it takes time to uncompress the response. The biggest advantage of using gzip is the 84% savings in the response body size.
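As a quick check of the figure quoted above, the savings can be recomputed from the two reported sizes. This is only arithmetic on the numbers in this comment; size_savings is a hypothetical helper name.

```python
def size_savings(compressed_mb, uncompressed_mb):
    """Return the fraction of the response body saved by compression."""
    return 1.0 - compressed_mb / uncompressed_mb


# Sizes as reported in this comment for the MNIST download.
savings = size_savings(compressed_mb=19.8, uncompressed_mb=127.9)
print(f"{savings:.1%}")  # about 84.5%, in line with the quoted 84%
```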

@jnothman
Member
jnothman commented Aug 30, 2018 via email

@rth
Member
rth commented Aug 30, 2018

When the response's Content-Encoding is not gzip, I see two options to handle this:

  1. Do another request without gzip.
  2. Raise an exception.

Can't you always set Accept-encoding: gzip in the request, then read the Content-Encoding of the response and decompress or not accordingly?
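The suggestion above could look roughly like the following sketch. It assumes Python 3 and a response object that exposes info(), as the object returned by urlopen does. The names open_url_maybe_gzipped and wrap_response are hypothetical helpers, not functions from scikit-learn.

```python
import gzip
from urllib.request import Request, urlopen


def wrap_response(response):
    """Return a file-like object that yields the decoded body.

    If the server reports a gzip-encoded body, wrap the response in a
    GzipFile so that callers always read uncompressed bytes.
    """
    if response.info().get("Content-Encoding", "") == "gzip":
        return gzip.GzipFile(fileobj=response, mode="rb")
    return response


def open_url_maybe_gzipped(url):
    """Always ask for gzip, then decode only if the server complied."""
    req = Request(url)
    req.add_header("Accept-encoding", "gzip")
    return wrap_response(urlopen(req))
```

With this shape, no second request and no exception are needed when the server ignores the header: the plain response is simply returned as-is.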

@thomasjpfan
Member Author

@rth I like the idea. I will update this PR accordingly.

@thomasjpfan
Member Author

@jnothman Previously, when cache=False, gzip was not requested, which led to similar fetch_openml timings. With the latest commit, ALL requests are sent with the Accept-encoding: gzip header, which results in faster downloads and fetch_openml timings.

Member
@rth rth left a comment

This is looking nice, but tests are failing with,

AttributeError: MockHTTPResponse instance has no attribute 'tell'

on python 2.7

return MockHTTPResponse(fp, True)
else:
fp = read_fn(path, 'rb')
return MockHTTPResponse(fp, False)
Member

This branch is not tested, can we add a test for it?

Member Author

This PR has been updated to test both branches.

@@ -147,32 +149,64 @@ def _monkey_patch_webbased_functions(context, data_id, gziped_files):
path_suffix = '.gz'
read_fn = gzip.open

class MockHTTPResponse():
Member

MockHTTPResponse(object) to use new-style classes on Py2, not that it matters much..
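For context, a test double of this kind might look like the sketch below. It is an illustration under assumptions, not the code in this PR: only the class name and the (file object, is-gzipped flag) constructor arguments come from the diff fragments quoted in this thread, and every method body is assumed. The tell and seek methods are included because the Python 2.7 failure above complained about a missing tell attribute.

```python
import gzip
import io


class MockHTTPResponse(object):
    """Hypothetical stand-in for the object returned by urlopen."""

    def __init__(self, data, is_gzip):
        self.data = data        # binary file-like object holding the body
        self.is_gzip = is_gzip  # whether the body is gzip-compressed

    def read(self, amt=-1):
        return self.data.read(amt)

    def tell(self):
        return self.data.tell()

    def seek(self, pos, whence=0):
        return self.data.seek(pos, whence)

    def close(self):
        self.data.close()

    def info(self):
        # Mimic the response headers that the fetching code inspects.
        if self.is_gzip:
            return {"Content-Encoding": "gzip"}
        return {}


# Exercise both branches: a gzipped body and a plain one.
body = b"@relation example"
gzipped = MockHTTPResponse(io.BytesIO(gzip.compress(body)), True)
plain = MockHTTPResponse(io.BytesIO(body), False)
assert gzip.GzipFile(fileobj=gzipped, mode="rb").read() == body
assert plain.read() == body
```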

@thomasjpfan
Member Author

I updated this PR to address the API differences of urlopen between Python 2 and 3.

@jnothman jnothman added this to the 0.20 milestone Sep 2, 2018
Member
@rth rth left a comment

LGTM, thanks @thomasjpfan !

@rth
Member
rth commented Sep 2, 2018

@jnothman Do you have any other comments about this? Should we merge it?

@jnothman
Member
jnothman commented Sep 2, 2018

I tried it out in our slow Australian internet and was very pleased :)

@jnothman jnothman merged commit 83e7375 into scikit-learn:master Sep 2, 2018
@jnothman
Member
jnothman commented Sep 2, 2018

Thanks yet again for your great work, @thomasjpfan!

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Sep 2, 2018
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Sep 17, 2018
Development

Successfully merging this pull request may close these issues.

fetch_openml: Use compressed HTTP responses
3 participants