[MRG] Validate MD5 checksum of downloaded ARFF in fetch_openml by chadykamar · Pull Request #11890 · scikit-learn/scikit-learn · GitHub

Closed · wants to merge 15 commits
Conversation

chadykamar (Contributor):
Reference Issues/PRs

Fixes #11816

What does this implement/fix? Explain your changes.

Compares the MD5 hash of the downloaded ARFF with the checksum provided in the OpenML data set description. Issues a warning if they are not equal.

Any other comments?

@chadykamar chadykamar closed this Aug 22, 2018
@jnothman (Member) commented Aug 22, 2018 via email:
Please don't close just because there are test failures

@chadykamar (Contributor, Author):

Sorry, didn't mean to do that exactly.

@chadykamar chadykamar reopened this Aug 22, 2018
@rth (Member) left a comment:
Thanks @chadykamar, a few comments from a partial review are below.

warn('Data set file hash {} does not match the checksum {}.'
.format(md5_hash, md5_checksum))

fp = tempfile.TemporaryFile()
Member:
It would be preferable to avoid creating a temporary file. Maybe using BytesIO ?
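The suggested switch, roughly (a sketch with a placeholder payload standing in for the bytes read from the OpenML response):

```python
import hashlib
import io

# Placeholder payload standing in for response.read() on the ARFF URL.
content = b'@relation iris\n@attribute sepallength numeric\n'

# Buffer the payload in memory instead of tempfile.TemporaryFile();
# BytesIO exposes the same file-like interface with no disk I/O.
fp = io.BytesIO(content)
md5_hash = hashlib.md5(content).hexdigest()
```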

warn('Data set file hash {} does not match the'
' checksum {}.'.format(md5_hash, md5_checksum))

fdst.write(content)
Member:

This is not strictly equivalent. In copyfileobj,

by default the data is read in chunks to avoid uncontrolled memory consumption

I'm not fully sure whether hashing by chunks (see e.g. these SO answers) is worthwhile here

Contributor (Author):

I'll do that; it might be necessary for very large data files.

# Capture all warnings
with pytest.warns(None) as records:
fetch_openml(data_id=data_id, data_home=str(tmpdir), cache=cache)
# assert no warnings
Member:

Above 2 comments are unnecessary IMO

- Replace temp file with BytesIO when not caching.
- Hashing and writing by chunks
- Removed unnecessary comments in test_openml.py
@rth (Member) left a comment:

Generally this LGTM, apart from the two comments below.

(Maybe it was a merge that went wrong?)

# test return_X_y option
fetch_func = partial(fetch_openml, data_id=data_id, cache=False,
target_column=target_column)
check_return_X_y(data_by_id, fetch_func)
Member:

Why was this removed?

Contributor (Author):

I made a mistake when merging.

UserWarning,
"Multiple active versions of the dataset matching the name"
" iris exist. Versions may be fundamentally different, "
"returning version 1.",
Member:

Same comment: we still want to check this warning message, unless I am missing something.

'expect_sparse': False,
'expected_data_dtype': np.float64,
'expected_target_dtype': object,
'compare_default_target': True}
@rth (Member), Aug 26, 2018:

In all the above, the changes are due to fixing the merge. We don't want to see any of these changes in this diff because it would complicate e.g. git blame. So the easiest way to remove these changes would be,

  • either manually, with git diff master...HEAD (or HEAD...master, I never can remember) and editing the file until none of it shows,
  • or copying this file somewhere, reverting to the version on master with git checkout master -- <path_to_this_file>, then adding back your changes.

Don't worry about individual commits, as in the end everything will be squashed into a single one, so only the total diff matters. Apart from this, this looks good.

@@ -399,6 +436,10 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
If True, returns ``(data, target)`` instead of a Bunch object. See
below for more information about the `data` and `target` objects.

verify_checksum : boolean, default=True
Whether or not to validate that the dataset file's MD5 hash matches the
data set description's expected checksum.
Member:

Maybe note that "Verification occurs when fetching the data, so if cache=True, the data will be presumed valid on subsequent calls."

fdst.write(block)

if md5_checksum != md5.hexdigest():
warn('Data set file hash {} does not match the '
Member:

Why do we want a warning rather than an exception?

Contributor (Author):

I thought an exception would be raised when decoding in _arff, but it might be better to fail earlier. Which exception would you raise instead?

Member:

If it's incomplete but stops at a line break, there will be no exception. We should raise one here.

if md5_checksum is None:
return response

stream = io.BytesIO()
Member:

There seems to be some duplication between here and a few lines below. Could you factor that out into a function or something?

@rth (Member) commented Sep 14, 2018:

Could you please resolve conflicts? There are also a few CI failures.

@adrinjalali (Member):

@chadykamar, any updates on this one? You also need to rebase master to resolve conflicts.

@rth (Member) commented Jul 18, 2019:

Tried to resolve conflicts, but it's not trivial. I think it would probably be easier to add these changes on top of upstream master manually.

@rth rth added the Stalled label Jul 18, 2019
Comment on lines +93 to +104
def _check_md5_checksum(fsrc, fdst, md5_checksum):

md5 = hashlib.md5()
block_size = 128 * md5.block_size
for block in iter(lambda: fsrc.read(block_size), b''):
md5.update(block)
fdst.write(block)

if md5_checksum != md5.hexdigest():
msg = 'Data set file hash {} does not match the checksum {}.'
msg = msg.format(md5.hexdigest(), md5_checksum)
raise Exception(msg)
Contributor:

This does more than just check the MD5 hash of a file; it also writes to a file stream. So it is, at the very least, misleadingly named.

@deeplook (Contributor):

@chadykamar Are you still on this? I would give it a try, even restarting on top of master, but I don't want you to lose any credit for your changes so far.

@deeplook (Contributor):

Following @adrinjalali's advice I'll give it a try and start from scratch now.

@adrinjalali (Member):

Closing as #14800 fixes the issue.

Successfully merging this pull request may close these issues:
Validate MD5 checksum of downloaded ARFF in fetch_openml