[MRG] Validate MD5 checksum of downloaded ARFF in fetch_openml #11890
Conversation
Why closed?

Please don't close just because there are test failures.

Sorry, didn't mean to do that exactly.
Thanks @chadykamar , a few comments from a partial review are below.
sklearn/datasets/openml.py (Outdated)

    warn('Data set file hash {} does not match the checksum {}.'
         .format(md5_hash, md5_checksum))

    fp = tempfile.TemporaryFile()
It would be preferable to avoid creating a temporary file. Maybe using BytesIO?
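A minimal sketch of what the reviewer is suggesting: buffer the downloaded bytes in an in-memory `io.BytesIO` instead of a temporary file on disk. The helper name `buffer_response` is hypothetical, not part of the PR.

```python
import io
import shutil


def buffer_response(fsrc):
    """Copy a readable binary stream into an in-memory buffer.

    Hypothetical sketch: avoids tempfile.TemporaryFile() by keeping
    the downloaded bytes in memory instead of on disk.
    """
    buf = io.BytesIO()
    shutil.copyfileobj(fsrc, buf)  # copies in chunks by default
    buf.seek(0)  # rewind so callers can read from the start
    return buf
```

The trade-off is memory: `BytesIO` holds the whole payload in RAM, which is fine for typical OpenML datasets but worth keeping in mind for very large ones.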
sklearn/datasets/openml.py (Outdated)

    warn('Data set file hash {} does not match the'
         ' checksum {}.'.format(md5_hash, md5_checksum))

    fdst.write(content)
This is not strictly equivalent: in copyfileobj, the data is read in chunks by default to avoid uncontrolled memory consumption.

I'm not fully sure whether hashing by chunks (see e.g. these SO answers) is worthwhile here.
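For reference, chunked hashing looks like the sketch below: the digest is updated incrementally so memory use stays bounded by the chunk size, regardless of file size. The function name and chunk size are illustrative choices, not from the PR.

```python
import hashlib
import io


def md5_of_stream(fsrc, chunk_size=8192):
    """Compute the MD5 hex digest of a binary stream incrementally.

    Illustrative sketch: reading in chunks keeps memory bounded even
    for very large files, at the cost of a Python-level loop.
    """
    md5 = hashlib.md5()
    for chunk in iter(lambda: fsrc.read(chunk_size), b''):
        md5.update(chunk)
    return md5.hexdigest()


# MD5 of b"hello" is the well-known digest below.
md5_of_stream(io.BytesIO(b"hello"))  # '5d41402abc4b2a76b9719d911017c592'
```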
I'll do that, it might be necessary for very large data files.
    # Capture all warnings
    with pytest.warns(None) as records:
        fetch_openml(data_id=data_id, data_home=str(tmpdir), cache=cache)
    # assert no warnings
The above 2 comments are unnecessary IMO.

- Replace temp file with BytesIO when not caching.
- Hash and write by chunks.
- Removed unnecessary comments in test_openml.py.
…into md5-checksum
Generally this LGTM, apart from the two comments below.
(Maybe it was a merge that went wrong?)
    # test return_X_y option
    fetch_func = partial(fetch_openml, data_id=data_id, cache=False,
                         target_column=target_column)
    check_return_X_y(data_by_id, fetch_func)
Why was this removed?
I made a mistake when merging.
    UserWarning,
    "Multiple active versions of the dataset matching the name"
    " iris exist. Versions may be fundamentally different, "
    "returning version 1.",
Same comment: we still want to check this warning message, unless I am missing something.
    'expect_sparse': False,
    'expected_data_dtype': np.float64,
    'expected_target_dtype': object,
    'compare_default_target': True}
In all the above, the changes are due to fixing the merge. We don't want to see any of these changes in this diff because they would complicate e.g. git blame. The easiest way to remove them would be:

- either manually, with git diff master...HEAD (or HEAD...master, I can never remember which) and editing the file until none of it shows,
- or copying this file somewhere, reverting to the version on master with git checkout master -- <path_to_this_file>, then adding back your changes.

Don't worry about individual commits, as in the end everything will be squashed into a single one, so only the total diff matters. Apart from this, this looks good.
sklearn/datasets/openml.py (Outdated)

    @@ -399,6 +436,10 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
        If True, returns ``(data, target)`` instead of a Bunch object. See
        below for more information about the `data` and `target` objects.

    verify_checksum : boolean, default=True
        Whether or not to validate that the dataset file's MD5 hash matches the
        data set description's expected checksum.
Maybe note that "Verification occurs when fetching the data, so if cache=True, the data will be presumed valid on subsequent calls."
sklearn/datasets/openml.py (Outdated)

    fdst.write(block)

    if md5_checksum != md5.hexdigest():
        warn('Data set file hash {} does not match the '
Why do we want a warning rather than an exception?
I thought an exception would be raised when decoding in _arff, but it might be better to fail earlier. Which exception would you raise instead?
If it's incomplete but stops at a line break, there will be no exception. We should raise one here.
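The point above can be sketched as follows: a truncated download that happens to end at a line break still parses as valid ARFF, so the mismatch must be turned into an exception at the checksum stage. The helper name and the choice of ValueError are illustrative assumptions, not the PR's final code.

```python
import hashlib


def check_md5(content, expected_checksum):
    """Raise (rather than warn) when downloaded bytes fail verification.

    Hypothetical sketch: ValueError is one plausible exception type for
    failing fast on a corrupted or truncated download.
    """
    actual = hashlib.md5(content).hexdigest()
    if actual != expected_checksum:
        raise ValueError(
            'Data set file hash {} does not match the checksum {}.'
            .format(actual, expected_checksum))


# A matching checksum passes silently.
check_md5(b"data", hashlib.md5(b"data").hexdigest())
```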
    if md5_checksum is None:
        return response

    stream = io.BytesIO()
there seems to be some duplication between here and a few lines below. Could you factor that out in a function or something?
Could you please resolve the conflicts? There are also a few CI failures.

@chadykamar, any updates on this one? You also need to rebase master to resolve conflicts.

I tried to resolve the conflicts, but it's not trivial. I think it would probably be easier to add these changes on top of upstream master manually.
    def _check_md5_checksum(fsrc, fdst, md5_checksum):

        md5 = hashlib.md5()
        block_size = 128 * md5.block_size
        for block in iter(lambda: fsrc.read(block_size), b''):
            md5.update(block)
            fdst.write(block)

        if md5_checksum != md5.hexdigest():
            msg = 'Data set file hash {} does not match the checksum {}.'
            msg = msg.format(md5.hexdigest(), md5_checksum)
            raise Exception(msg)
This does more than just check the MD5 hash of a file; it also writes to a file stream. So it is at least named misleadingly.
@chadykamar Are you still on this? I would give it a try, even restarting on top of master, but I don't want you to lose any credit for your changes so far.
Following @adrinjalali's advice, I'll give it a try and start from scratch now.

Closing as #14800 fixes the issue.
Reference Issues/PRs
Fixes #11816
What does this implement/fix? Explain your changes.
Compares the MD5 hash of the downloaded ARFF with the checksum provided in the OpenML data set description. Issues a warning if they are not equal.
Any other comments?