8000 Ensure that the shape of sample_weight is checked in all the functions · Issue #9926 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Ensure that the shape of sample_weight is checked in all the functions #9926

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
qinhanmin2014 opened this issue Oct 15, 2017 · 9 comments · Fixed by #11598
Closed

Ensure that the shape of sample_weight is checked in all the functions #9926

qinhanmin2014 opened this issue Oct 15, 2017 · 9 comments · Fixed by #11598
Labels
help wanted Moderate Anything that requires some knowledge of conventions and best practices

Comments

@qinhanmin2014
Copy link
Member
qinhanmin2014 commented Oct 15, 2017

proposed by @lesteve in #9786 (comment) and #9903 (comment)

in this PR we spotted a place where check_consistent_lengths(X, y) was used where check_consistent_lengths(X, y, sample_weight) should have called it would be good to double-check that this error is not present in some other places in our codebase.

Note what I had originally in mind was something a bit more generic than just doing it for metrics i.e. checking that each time a function signature had sample_weight, there was a check_consistent_length(..., ..., sample_weight) call within its body. Not sure how easy this though.

After going through all the public functions, it seems that the following functions do not check the shape of sample_weight and rely on certain statement to block the code from running through. (See #9903 (comment) for more detail)

DummyClassifier
KernelRidge
LinearRegression, RANSACRegressor, Ridge, RidgeClassifier
MultinomialNB, ComplementNB, BernoulliNB

This may cause:
(1)Users can't get meaningful error message (e.g., now you may get something like operands could not be broadcast together with shapes (2,1) (3,1))
(2)Sometimes all the statements fail to block the code and you even can't get an erorr (e.g., roc_auc_score previously)

It is not difficult to fix the code but the core issue here is how to test it in an elegant way. Also see #5367, #5515.

@lesteve
Copy link
Member
lesteve commented Oct 16, 2017

Clearer error messages is always better, so I would fix the problems you found and add tests.

@qinhanmin2014
Copy link
Member Author

Thanks :)

so I would fix the problems you found and add tests.

You mean to assign it to yourself? Otherwise, I think I'll try it recently. Or you can also tag it to attract others since my work on it will not start too soon.

@jnothman jnothman added Moderate Anything that requires some knowledge of conventions and best practices Need Contributor labels Oct 16, 2017
@jnothman
Copy link
Member

I don't think @lesteve's intention was to assign it to himself

@lesteve
Copy link
Member
lesteve commented Oct 17, 2017

You mean to assign it to yourself?

No it was only a recommendation ;-).

@qinhanmin2014
Copy link
Member Author

Thanks to you all for the instant reply :)
I think for this issue, it is not hard to fix the code. The core issue here is how to test it elegantly. If we only add test for specific estimators, it will be quite simple and the test will be regression test, but I'm afraid it might be a bit awkward. I recently find that we are attemping to add common test for sample_weight in #5367, #5515, so we might do it there, but in this way we are not able to test the error message (different between modules) so that the test will not be regression test. WDYT? Thanks :)

@jbschiratti
Copy link
jbschiratti commented Jul 17, 2018

@qinhanmin2014 If I understand correctly, what could be done to fix this issue is:

Is that the way to go?

@qinhanmin2014
Copy link
Member Author

@jbschiratti Sorry for the late reply
Your first point is related to this PR. The purpose of this PR is to use check_consistent_length to avoid uninformative error message when the shape of sample_weight is invalid.
Your second point is related to a different PR (#11316), which aims to provide common tests for sample_weight.
Currently, I would suggest you to take up #11316 since that's more important than current PR.

@rth
Copy link < 8000 /clipboard-copy>
Member
rth commented Nov 3, 2019

Closing as I think this is superseded by _check_sample_weight (#15358) that does more general checks than check_consistent_lengths (and there have been a number of PR on that lately).

@rth rth closed this as completed Nov 3, 2019
@rth
Copy link
Member
rth commented Feb 5, 2020

Ropening since this was not fully addressed, and there a few related open PRs.

@rth rth reopened this Feb 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Moderate Anything that requires some knowledge of conventions and best practices
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants
0