[MRG+2] Fix empty input data common checks #4245
Travis is not happy.
Can you rebase?
I think these are intended to be smoke tests? If so, summarizing the 11 failures (one error message included):
- Needs handling of 0 samples
- Needs handling of 0 features
- Needs rewording of the error message for the 0-samples test
- Needs rewording of the error message for the 0-features test
Others:
(force-pushed: 39dbaf7 to 25898de)
I rebased. I know the tests don't pass: as @ragv summarized, many estimators are not consistent in the 0-samples or 0-features edge cases. I am under the impression that supporting either of those two cases is never useful and can be a source of confusion when users are not aware of the shape of the data structures they feed to sklearn estimators. This is why I think we should always raise an exception consistently, with an informative error message. I will work on fixing the failures. We can discuss whether raising exceptions consistently is desirable or not.
+1 on failing with a good error.
(force-pushed: 25898de to 549d98c)
# Convert data
# ensure_2d=False because there are actually unit test checking we fail
# for 1d. FIXME make this consistent in the future.
X = check_array(X, dtype=DTYPE, ensure_2d=False, accept_sparse="csc")
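For context, `check_array` (from `sklearn.utils.validation`) is the validator whose behavior on empty input this PR standardizes. A minimal sketch of the rejection it performs by default (the exact message wording is sklearn's and may vary across versions):

```python
import numpy as np
from sklearn.utils.validation import check_array

# A 0-sample array should be rejected by sklearn's input validation
# (ensure_min_samples defaults to 1).
X_empty = np.empty((0, 5))
try:
    check_array(X_empty)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

With the default settings, `error_message` ends up holding an informative complaint about the 0 samples rather than a cryptic downstream failure.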
@amueller I removed ensure_2d=False here and no unit test check failed, contrary to what the FIXME inline comment says. Do you remember which tests used to fail?
This change is technically not required for this PR. I can revert it if you prefer to deal with this case in another PR, such as the one tackling #4252 for instance.
That was a tree-specific test.
Or rather forest-specific.
Apparently it no longer fails.
I fixed all the failures revealed by the new checks.
# Check that all estimator yield informative messages when
# trained on empty datasets
multi_output = name.startswith('MultiTask')
yield (check_estimators_empty_data_messages,
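The common tests use the nose-style pattern in which a generator yields `(check, *args)` tuples that the test runner then calls one by one. A self-contained sketch of that pattern, with a stub standing in for the real check (the names and stub body here are illustrative, not sklearn's actual code):

```python
def check_estimators_empty_data_messages(name, estimator, multi_output=False):
    # Stub: the real check would fit `estimator` on empty data and
    # assert that an informative ValueError is raised.
    return (name, multi_output)

def _yield_all_checks(name, estimator):
    # MultiTask* estimators need a 2d target, hence the flag.
    multi_output = name.startswith('MultiTask')
    yield (check_estimators_empty_data_messages, name, estimator, multi_output)

results = [check(*args)
           for check, *args in _yield_all_checks('MultiTaskLasso', None)]
```

Each yielded tuple becomes an independently reported test case, which is why per-estimator flags like `multi_output` are threaded through as arguments.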
Why don't you use the "make_y_multioutput" helper?
I cannot find what you are talking about in the code base.
Sorry, I meant `y = multioutput_estimator_convert_y_2d(name, y)`
Maybe add a whatsnew entry (helpful for looking up when something was fixed), and maybe use the multioutput helper. Otherwise +1
@@ -318,6 +319,28 @@ def check_estimators_dtypes(name, Estimator):
    pass


def check_estimators_empty_data_messages(name, Estimator, multi_output=False):
    with warnings.catch_warnings(record=True):
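On the `catch_warnings(record=True)` question discussed below: recording warnings keeps them out of the test output instead of letting them propagate, much as an ignore-warnings decorator would. A minimal illustration of the pattern (the noisy function is made up for the example):

```python
import warnings

def noisy_fit():
    # Stand-in for an estimator whose fit emits a deprecation warning.
    warnings.warn("old behaviour", DeprecationWarning)
    return "fitted"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record everything, ignore no filter
    result = noisy_fit()

# The warning was captured in `caught` instead of reaching the caller.
```

Note that without the `simplefilter("always")` call, default warning filters may silently drop some categories, which is one reason a dedicated decorator can be cleaner.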
I would use the ignore warnings decorator.
Can you explain why you need this? I am missing something here.
> I would use the ignore warnings decorator.

Indeed.

> Can you explain why you need this? I am missing something here.

I will relaunch the tests without the catch to make sure.
(force-pushed: 090ad08 to 67f6ad1)
LGTM
X_zero_features = np.empty(0).reshape(3, 0)
# the following y should be accepted by both classifiers and regressors
# and ignored by unsupervised models
if multi_output:
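The empty-input fixtures in the snippet above are just reshaped empty arrays. Spelled out (variable names beyond the quoted `X_zero_features` are illustrative):

```python
import numpy as np

X_zero_samples = np.empty(0).reshape(0, 3)   # shape (0, 3): no samples
X_zero_features = np.empty(0).reshape(3, 0)  # shape (3, 0): no features

# A target accepted by classifiers and regressors alike; MultiTask
# estimators would instead get a 2d version of it.
y = np.array([1, 0, 1])
y_2d = y.reshape(-1, 1)
```

Both empty arrays have zero elements, so the distinction between "no samples" and "no features" lives entirely in the shape.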
So here you could do `y = multioutput_estimator_convert_y_2d(name, y)` and remove the `multi_output` argument. This is how it is done in all the other tests.
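For reference, a minimal sketch of what such a conversion helper could look like; this is an assumption based on the discussion, not the actual sklearn implementation:

```python
import numpy as np

def multioutput_estimator_convert_y_2d(name, y):
    # Assumed behavior: MultiTask* estimators expect a 2d target;
    # everything else keeps the 1d target unchanged.
    if name.startswith('MultiTask'):
        return np.reshape(y, (-1, 1))
    return y

y = np.array([0, 1, 2])
y_multi = multioutput_estimator_convert_y_2d('MultiTaskLasso', y)
y_plain = multioutput_estimator_convert_y_2d('Ridge', y)
```

Centralizing the conversion in one helper is what lets the per-check `multi_output` flag disappear from the generator.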
(force-pushed: 7d7b95a to feceab6)
Thanks @amueller, I addressed your comment, squashed and merged.
This PR includes #4214 but also adds common tests that currently fail on some estimators whose behavior is inconsistent and will probably need to be fixed on a case-by-case basis.