-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG] Uses gzip when caching in fetch_openml #11830
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c20f9ef
to
97dd729
Compare
Thanks for your PR! Could you please add "openml" to the title somewhere? Currently it's not really clear from the title or description what this PR is about. (That would help attracting reviewers). Also tests (and in particular Travis CI) are failing.. |
Ah I see, the test mocks have been updated to account for the gzip feature. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you confirm that this results in a faster fetch of MNIST when cache=False?
sklearn/datasets/openml.py
Outdated
fsrc = urlopen(_OPENML_PREFIX + openml_path) | ||
with open(local_path, 'wb') as fdst: | ||
req.add_header('Accept-encoding', 'gzip') | ||
fsrc = urlopen(req) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably double check that the http response says it is actually gzipped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the responses' s Content-Encoding
is not gzip
, I see two options to handle this:
- Do another request for without
gzip
. - Raise an exception.
Which do you prefer?
With gzip enabled the response body size is 19.8 MB, without gzip enabled, it is 127.9 MB. With the MNIST datasets, there is little to none speed difference when running |
Okay.
Raise an exception please
|
Can't you allways set |
@rth I like the idea. I will update this PR according. |
@jnothman Previously, when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking nice, but tests are failing with,
AttributeError: MockHTTPResponse instance has no attribute 'tell'
on python 2.7
return MockHTTPResponse(fp, True) | ||
else: | ||
fp = read_fn(path, 'rb') | ||
return MockHTTPResponse(fp, False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This branch is not tested, can we add a test for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR has been updated to test both branches.
@@ -147,32 +149,64 @@ def _monkey_patch_webbased_functions(context, data_id, gziped_files): | |||
path_suffix = '.gz' | |||
read_fn = gzip.open | |||
|
|||
class MockHTTPResponse(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MockHTTPResponse(object)
to use new-style classes on Py2, not that it matters much..
I update this PR to address the API differences of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @thomasjpfan !
@jnothman Do you have any other comments about this? Should we merge it? |
I tried it out in our slow Australian internet and was very pleased :) |
Thanks yet again for your great work, @thomasjpfan! |
Reference Issues/PRs
Fixes #11822
What does this implement/fix? Explain your changes.
Adds the HTTP header:
Accept-encoding: gzip
whendata_home
is notNone
.