8000 [MRG+1] Fix check array sparse param by jkarno · Pull Request #7937 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG+1] Fix check array sparse param #7937

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Dec 20, 2016

Conversation

jkarno
Copy link
Contributor
@jkarno jkarno commented Nov 24, 2016

Reference Issue

Fixes #7880

What does this implement/fix? Explain your changes.

  • Validates that the accept_sparse parameter for check_array is a string, list of strings, boolean, or None
  • Updates the documentation to state that booleans are allowed
  • Added test cases to test_validation.py

Any other comments?

  1. I made the decision to currently allow booleans because there were lots of places that used accept_sparse=True which this would break otherwise.
  2. Currently, the validation is checked in _ensure_sparse_format because this is where the parameter is used. However, this means no warning or exception is thrown if the array is not sparse. This may or may not be the right behavior.

@@ -218,11 +218,12 @@ def _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy,
spmatrix : scipy sparse matrix
Input to validate and convert.

accept_sparse : string, list of string or None (default=None)
accept_sparse : string, list of string, boolean or None (default=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be tempted to drop documenting None.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An empty list should perhaps have the same behaviour (which is untested).

'data is required. Use X.toarray() to '
'convert to a dense numpy array.')
elif (isinstance(accept_sparse, (list, tuple)) and
len(accept_sparse) and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the len and the isinstance? That's automatic, right? And the error if it's not a string might be better than what is below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you pass in an empty list or tuple then it will pass the isinstance, but it may be an empty list. len ensures it has an element and the next check ensures that the first element is a string.

I can take those checks out however and allow it to fail on its own when it tries to index the list or format with a non-string, if that would be better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm right the empty list would fail with the indexing, sorry I didn't think about that. But the not a sparse format error you get from the string seems better than the generic accept_parse error message below, so I'd remove that check.

changed_format = True
elif accept_sparse is not True:
# any other type
raise ValueError('The parameter \'accept_sparse\' was '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also provide the value of accept_sparse in the message. Btw, if you mix " and ' you can avoid escaping ;)

Copy link
Contributor Author
@jkarno jkarno Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to provide the value with "accept_sparse=%s" % accept_sparse but when an empty tuple is passed in (essentially "%s" % ()), this errors with "not enough arguments for format string". Do you have a suggestion for how to format properly in case its an empty tuple, aside from just checking for that specific case?

@jnothman
Copy link
Member
jnothman commented Nov 30, 2016 via email

assert_raise_message(ValueError, msg.format(invalid_type),
check_array, X_csr, accept_sparse=invalid_type)

msg = "Can't convert 'SVR' object to str implicitly"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use "'SVR' object" and it'll work on python2 and python3 and we know it's the right error (probably)

changed_format = True
elif accept_sparse is not True:
# any other type
raise ValueError(("The parameter 'accept_sparse={}' "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "Invalid parameter 'accept_sparse'={}"? You you prefer the current wording, it should be "is not of correct type". The argument is (probably) not a type.

@amueller
Copy link
Member

LGTM. Having the default option (None) not documented is a bit weird, though. Maybe we should change the default to False and deprecate passing None?

@lesteve
Copy link
Member
lesteve commented Dec 1, 2016

LGTM. Having the default option (None) not documented is a bit weird, though. Maybe we should change the default to False and deprecate passing None?

+1 to that.

Copy link
Member
@lesteve lesteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments

@@ -218,11 +218,12 @@ def _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy,
spmatrix : scipy sparse matrix
Input to validate and convert.

accept_sparse : string, list of string or None (default=None)
accept_sparse : string, list of strings or boolean (default=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"string, boolean or list of strings" make it clearer. You avoid the potential confusion of reading list of strings or list of booleans.

the allowed format, it will be converted to the first listed format.
'csr', 'coo', 'dok', 'bsr', 'lil', 'dia'). If the input is sparse but
not in the allowed format, it will be converted to the first listed
format. True allows the input to be any format. False or None means
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you deprecated None my feeling is that you may remove it from the docstring.

@@ -284,11 +295,12 @@ def check_array(array, accept_sparse=None, dtype="numeric", order=None,
array : object
Input object to check / convert.

accept_sparse : string, list of string or None (default=None)
accept_sparse : string, list of strings or boolean (default=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above: string, boolean or list of strings

elif accept_sparse is not True:
# any other type
raise ValueError(("Invalid parameter "
"'accept_sparse={}'").format(accept_sparse))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please list the allowed type of parameters, i.e. something like 'accept_sparse' should be a string, boolean or list of strings, you provided 'accept_sparse={}'.format(accept_sparse).

check_array(X_csr, accept_sparse=True)
check_array(X_csr, accept_sparse='csr')
check_array(X_csr, accept_sparse=['csr'])
check_array(X_csr, accept_sparse=('csr'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not checking what you think it is checking (I guess you meant to test accept_sparse as a tuple) ('csr') is not a tuple, ('csr',) is a tuple with one element.

To be honest this four lines need should be moved to another function since they do not test exception. accept_sparse=True, accept_sparse='csr' and accept_sparse=['csr']. It would be great to test the type of the returned object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and by the way tuple is not mentioned in the docstring ...

spmatrix = spmatrix.asformat(accept_sparse[0])
changed_format = True

if isinstance(accept_sparse, str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use six.string_types rather than str so that unicode strings can be passed in in Python 2.7.

@lesteve
Copy link
Member
lesteve commented Dec 1, 2016

Having the default option (None) not documented is a bit weird, though. Maybe we should change the default to False and deprecate passing None?

check_X_y has accept_sparse=None by default as well, maybe worth deprecating it too?

To be honest thinking about it a bit more, maybe deprecating accept_sparse=None can be done in a separate PR. It will make this PR easier to review and merge. Your call @jkarno.

@jkarno
Copy link
Contributor Author
jkarno commented Dec 1, 2016

The deprecation didn't seem like too much extra so I decided to work on adding it to this PR. However, I am pushing now because I have a few small questions.

  • The deprecation test case is failing because it is raising two exceptions, but I am unsure how to catch both while only testing the message of the DeprecationWarning.
  • I feel that there is probably a better deprecation message than what I have currently. Particularly, I'm not sure what version this would be introduced and when the deprecation would be converted to a warning.
  • Similarly to the type checking, I only put the deprecation warning within the _ensure_sparse_format method because check_array and check_X_y both should lead to this point if a sparse array is passed in. Let me know if this message should be in every method instead.

Also, I assume I should be adding something to the what's new document now that the deprecation is being added?

msg = "Passing None to parameter 'accept_sparse' is " \
"deprecated in 0.19. Use False instead."

assert_raise_message((DeprecationWarning, TypeError), msg,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For warnings you'd use assert_warns, not assert_raise_message. And you're best to test with something that does not raise an exception (although assert_warns(DeprecationError, assert_raise_message(TypeError, func, args)) should work also).

if accept_sparse is None:
warnings.warn(
"Passing None to parameter 'accept_sparse' is "
"deprecated in 0.19. Use False instead.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"0.19" -> "version 0.19 and will be removed in 0.21"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should deprecate that in check_array imho, as the deprecation warning should be raised even if a dense array is passed.

@amueller
Copy link
Member
amueller commented Dec 6, 2016

X_csr = sp.csr_matrix(X)
invalid_type = SVR()

msg = "A sparse matrix was passed, but dense data is required. " \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usually we would use parentheses around a string to get alignment more visually pleasing.

check_array, X_csr, accept_sparse=None)

msg = "Parameter 'accept_sparse' should be a string, " \
"boolean or list of strings. You provided 'accept_sparse={}'."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this indentation is inconsistent with the similar string above

@jnothman
Copy link
Member

LGTM

@jnothman jnothman changed the title [MRG] Fix check array sparse param [MRG+1] Fix check array sparse param Dec 12, 2016
Copy link
Member
@lesteve lesteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will need to add an entry in doc/whats_new.rst to advertise the deprecation of accept_sparse=None I think.

@@ -218,11 +218,12 @@ def _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy,
spmatrix : scipy sparse matrix
Input to validate and convert.

accept_sparse : string, list of string or None (default=None)
accept_sparse : string, boolean or list/tuple of strings (default=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading docstring, there is no default for this function. Actually this is a problem for all the parameters in this function, can you please fix this?

F987
# accept_sparse 'None' deprecation check
if accept_sparse is None:
warnings.warn(
"Passing 'None' to parameter 'accept_sparse' is "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the function name to the warning. The origin of the warning is sometimes not trivial to understand so you want to make it as explicit as possible.

if isinstance(accept_sparse, six.string_types):
accept_sparse = [accept_sparse]

if accept_sparse is False:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use elif and remove the newline mostly for consistency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That won't work, right? We want to end up in the case below where isinstance(accept_sparse, (list, tuple))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That won't work, right?

Oh yeah, sorry I missed that. Ignore my comment then.

'data is required. Use X.toarray() to '
'convert to a dense numpy array.')
elif (isinstance(accept_sparse, (list, tuple)) and
len(accept_sparse)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm so what is the behaviour of accept_sparse=[], Is it the same as accept_sparse=False? Maybe this should be tackled inside this if clause to be more clear, i.e. something like:

elif ininstance(accept_sparse, (list, tuple)):
    if len(accept_sparse) == 0:
        raise ValueError("When providing 'accept_sparse' as a tuple or list it should be non empty or some better errror message")

changed_format = True
elif accept_sparse is not True:
# any other type
raise ValueError(("Parameter 'accept_sparse' should be a string, "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick you don't need the additional parentheses inside ValueError

@@ -451,11 +470,12 @@ def check_X_y(X, y, accept_sparse=None, dtype="numeric", order=None,
y : nd-array, list or sparse matrix
Labels.

accept_sparse : string, list of string or None (default=None)
accept_sparse : string, boolean or list of string (default=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to deprecate accept_sparse=None here too.

@@ -451,11 +470,12 @@ def check_X_y(X, y, accept_sparse=None, dtype="numeric", order=None,
y : nd-array, list or sparse matrix
Labels.

accept_sparse : string, list of string or None (default=None)
accept_sparse : string, boolean or list of string (default=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean raise a warning? But check_array is called...

Copy link
Member
@lesteve lesteve Dec 13, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry I missed that, maybe the deprecation message in check_array can be changed to mention both check_array and check_X_y then.

@jkarno
Copy link
Contributor Author
jkarno commented Dec 18, 2016

Apologies on the gaps between commits, I was caught in the middle of finals this week.

I am working on rebasing on master to solve the conflicts with whats_new, based on the contribution guide. Now when I try to push it says Updates were rejected because the tip of your current branch is behind its remote counterpart. Do I need to do a force push to update after the rebase?

@lesteve
Copy link
Member
lesteve commented Dec 19, 2016

In this case the conflict is not too hard to fix so I would recommend to use the new Github feature of resolving the conflicts via the web UI. Click on "Resolve conflicts".

Now when I try to push it says Updates were rejected because the tip of your current branch is behind its remote counterpart. Do I need to do a force push to update after the rebase?

If you want to do that on the command line instead, then the answer to your question is yes. When you do a rebase you change the history of your branch so you need to force push. If you don't feel super confident with the force push, you can make a backup branch (git branch fix_check_array_sparse_param_bak fix_check_array_sparse_param) in case something goes wrong. Also please look at your diff compared to upstream/master (for example git diff upstream/master... helps in general) to make sure that what you are force pushing is sensible.

@lesteve
Copy link
Member
lesteve commented Dec 19, 2016

OK I pushed a few minor changes into your branch and ended up resolving the conflicts via the web UI while I was at it. Hope you don't mind.

@lesteve
Copy link
Member
lesteve commented Dec 19, 2016

LGTM by the way, let's wait for the CIs.

@lesteve
Copy link
Member
lesteve commented Dec 19, 2016

@jkarno looks like you have some flake8 violations.

@amueller
Copy link
Member
amueller commented Dec 19, 2016 via email

@jnothman
Copy link
Member

(@lesteve, I assume I can squash and merge and you forego any attribution for your changes?)

@lesteve
Copy link
Member
lesteve commented Dec 20, 2016

(@lesteve, I assume I can squash and merge and you forego any attribution for your changes?)

Yeah go for it!

@jnothman jnothman merged commit e5ceda8 into scikit-learn:master Dec 20, 2016
@jnothman
Copy link
Member

Thanks @jkarno

@lesteve
Copy link
Member
lesteve commented Dec 20, 2016

Nice clean-up @jkarno!

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
@Przemo10 Przemo10 mentioned this pull request Mar 17, 2017
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

check_array 'accept_sparse' parameter quirk since 0.17
4 participants
0