[MRG] Validate MD5 checksum of downloaded ARFF in fetch_openml #11890
Conversation
Why closed?

Please don't close just because there are test failures.

Sorry, didn't mean to do that exactly.
Thanks @chadykamar , a few comments from a partial review are below.
sklearn/datasets/openml.py (Outdated)

    warn('Data set file hash {} does not match the checksum {}.'
         .format(md5_hash, md5_checksum))

    fp = tempfile.TemporaryFile()
It would be preferable to avoid creating a temporary file. Maybe using BytesIO?
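A minimal sketch of what the reviewer is suggesting: buffer the downloaded bytes in an in-memory `io.BytesIO` instead of a temporary file on disk. The helper name `buffer_response` is hypothetical, not part of the PR.

```python
import io
import shutil


def buffer_response(fsrc):
    """Copy a readable binary stream into an in-memory buffer.

    Hypothetical sketch: avoids tempfile.TemporaryFile() by keeping
    the downloaded bytes in memory instead of on disk.
    """
    buf = io.BytesIO()
    shutil.copyfileobj(fsrc, buf)  # copies in chunks by default
    buf.seek(0)  # rewind so callers can read from the start
    return buf
```

The trade-off is memory: `BytesIO` holds the whole payload in RAM, which is fine for typical OpenML datasets but worth keeping in mind for very large ones.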
sklearn/datasets/openml.py (Outdated)

    warn('Data set file hash {} does not match the'
         ' checksum {}.'.format(md5_hash, md5_checksum))

    fdst.write(content)
This is not strictly equivalent: in copyfileobj, the data is read in chunks by default to avoid uncontrolled memory consumption.

I'm not fully sure whether hashing by chunks (see e.g. these SO answers) is worthwhile here.
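For reference, chunked hashing looks like the sketch below: the digest is updated incrementally so memory use stays bounded by the chunk size, regardless of file size. The function name and chunk size are illustrative choices, not from the PR.

```python
import hashlib
import io


def md5_of_stream(fsrc, chunk_size=8192):
    """Compute the MD5 hex digest of a binary stream incrementally.

    Illustrative sketch: reading in chunks keeps memory bounded even
    for very large files, at the cost of a Python-level loop.
    """
    md5 = hashlib.md5()
    for chunk in iter(lambda: fsrc.read(chunk_size), b''):
        md5.update(chunk)
    return md5.hexdigest()


# MD5 of b"hello" is the well-known digest below.
md5_of_stream(io.BytesIO(b"hello"))  # '5d41402abc4b2a76b9719d911017c592'
```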
I'll do that, it might be necessary for very large data files.
    # Capture all warnings
    with pytest.warns(None) as records:
        fetch_openml(data_id=data_id, data_home=str(tmpdir), cache=cache)
    # assert no warnings
The above 2 comments are unnecessary IMO.

- Replace temp file with BytesIO when not caching.
- Hash and write by chunks.
- Removed unnecessary comments in test_openml.py.
…into md5-checksum
Generally this LGTM, apart from the two comments below.
(Maybe it was a merge that went wrong?)
    # test return_X_y option
    fetch_func = partial(fetch_openml, data_id=data_id, cache=False,
                         target_column=target_column)
    check_return_X_y(data_by_id, fetch_func)
Why was this removed?
I made a mistake when merging.
    UserWarning,
    "Multiple active versions of the dataset matching the name"
    " iris exist. Versions may be fundamentally different, "
    "returning version 1.",
Same comment: we still want to check this warning message, unless I am missing something.
    'expect_sparse': False,
    'expected_data_dtype': np.float64,
    'expected_target_dtype': object,
    'compare_default_target': True}
In all the above, the changes are due to fixing the merge. We don't want to see any of these changes in this diff because they would complicate e.g. git blame. The easiest way to remove them would be:

- either manually, with git diff master...HEAD (or HEAD...master, I can never remember which) and editing the file until none of it shows,
- or copying this file somewhere, reverting to the version on master with git checkout master -- <path_to_this_file>, then adding back your changes.

Don't worry about individual commits, as in the end everything will be squashed into a single one, so only the total diff matters. Apart from this, this looks good.
sklearn/datasets/openml.py (Outdated)

    @@ -399,6 +436,10 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
        If True, returns ``(data, target)`` instead of a Bunch object. See
        below for more information about the `data` and `target` objects.

    verify_checksum : boolean, default=True
        Whether or not to validate that the dataset file's MD5 hash matches the
        data set description's expected checksum.
Maybe note that "Verification occurs when fetching the data, so if cache=True, the data will be presumed valid on subsequent calls."
sklearn/datasets/openml.py (Outdated)

    fdst.write(block)

    if md5_checksum != md5.hexdigest():
        warn('Data set file hash {} does not match the '
Why do we want a warning rather than an exception?
I thought an exception would be raised when decoding in _arff, but it might be better to fail earlier. Which exception would you raise instead?
If it's incomplete but stops at a line break, there will be no exception. We should raise one here.
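The point above can be sketched as follows: a truncated download that happens to end at a line break still parses as valid ARFF, so the mismatch must be turned into an exception at the checksum stage. The helper name and the choice of ValueError are illustrative assumptions, not the PR's final code.

```python
import hashlib


def check_md5(content, expected_checksum):
    """Raise (rather than warn) when downloaded bytes fail verification.

    Hypothetical sketch: ValueError is one plausible exception type for
    failing fast on a corrupted or truncated download.
    """
    actual = hashlib.md5(content).hexdigest()
    if actual != expected_checksum:
        raise ValueError(
            'Data set file hash {} does not match the checksum {}.'
            .format(actual, expected_checksum))


# A matching checksum passes silently.
check_md5(b"data", hashlib.md5(b"data").hexdigest())
```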
    if md5_checksum is None:
        return response

    stream = io.BytesIO()
there seems to be some duplication between here and a few lines below. Could you factor that out in a function or something?
Could you please resolve the conflicts? There are also a few CI failures.

@chadykamar, any updates on this one? You also need to rebase master to resolve conflicts.

I tried to resolve the conflicts, but it's not trivial. I think it would probably be easier to add these changes on top of upstream master manually.
    def _check_md5_checksum(fsrc, fdst, md5_checksum):

        md5 = hashlib.md5()
        block_size = 128 * md5.block_size
        for block in iter(lambda: fsrc.read(block_size), b''):
            md5.update(block)
            fdst.write(block)

        if md5_checksum != md5.hexdigest():
            msg = 'Data set file hash {} does not match the checksum {}.'
            msg = msg.format(md5.hexdigest(), md5_checksum)
            raise Exception(msg)
This does more than just check the MD5 hash of a file; it also writes to a file stream. So it is at least named misleadingly.
@chadykamar Are you still on this? I would give it a try, even restarting on top of master, but I don't want you to lose any credit for your changes so far.
Following @adrinjalali's advice, I'll give it a try and start from scratch now.

Closing as #14800 fixes the issue.
Reference Issues/PRs
Fixes #11816
What does this implement/fix? Explain your changes.
Compares the MD5 hash of the downloaded ARFF with the checksum provided in the OpenML data set description. Issues a warning if they are not equal.
Any other comments?