[MRG+2] Fix empty input data common checks by ogrisel · Pull Request #4245 · scikit-learn/scikit-learn

Merged

Conversation

@ogrisel (Member) commented Feb 13, 2015

This PR includes #4214, but also adds common tests that currently fail on some estimators that do not behave consistently and that probably need to be fixed on a case-by-case basis.

@agramfort (Member)

travis is not happy.

@amueller (Member)

can you rebase?

@raghavrv (Member)

I think these are intended to be smoke tests?

If so, here is a summary of the 11 failures (1 error message included); a small reproduction sketch follows the list:

Needs handling of 0 samples

  • sklearn.preprocessing.data.StandardScaler (?)
  • sklearn.kernel_approximation.Nystroem

Needs handling of 0 features

  • sklearn.preprocessing.data.MinMaxScaler

Needs rewording of the error message for the 0 samples test

  • sklearn.linear_model.ridge.RidgeClassifierCV
  • sklearn.linear_model.ridge.RidgeClassifier

Needs rewording of the error message for the 0 features test

  • sklearn.ensemble.forest.ExtraTreesRegressor
  • sklearn.ensemble.forest.ExtraTreesClassifier
  • sklearn.ensemble.forest.RandomForestRegressor
  • sklearn.ensemble.forest.RandomForestClassifier

Others:

  • sklearn.linear_model.coordinate_descent.MultiTaskLassoCV and
  • sklearn.linear_model.coordinate_descent.MultiTaskElasticNetCV
    complain that they are not supposed to be used for the single-task case.
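
For context, a minimal reproduction sketch of the two degenerate shapes exercised here (not part of the PR; the estimator choice and exact messages are just for illustration and depend on the scikit-learn version):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# The two degenerate shapes exercised by the new common checks.
X_zero_samples = np.empty((0, 3))   # 0 samples, 3 features
X_zero_features = np.empty((3, 0))  # 3 samples, 0 features

for est, X in [(StandardScaler(), X_zero_samples),
               (MinMaxScaler(), X_zero_features)]:
    try:
        est.fit(X)
        print(type(est).__name__, "accepted the empty input silently")
    except Exception as exc:
        print(type(est).__name__, "raised:", exc)
```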

@ogrisel ogrisel force-pushed the fix-empty-input-data-common-checks branch from 39dbaf7 to 25898de Compare February 16, 2015 10:01
@ogrisel (Member Author) commented Feb 16, 2015

I rebased. I know the tests don't pass: as @ragv summarized, many estimators are not consistent with the 0-samples or 0-features edge cases.

I am under the impression that supporting either of those two cases is never useful and can be a source of confusion when the user is not aware of the shape of the data structures they feed to sklearn estimators. This is why I think we should always raise an exception consistently, with an informative error message.

I will work on fixing the failures. We can discuss if maintaining exceptions is desirable or not.
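
One way to get such a consistent, informative failure is check_array's ensure_min_samples / ensure_min_features parameters (both default to 1 in current scikit-learn); a small sketch, not necessarily the exact code touched by this PR:

```python
import numpy as np
from sklearn.utils import check_array

for X in (np.empty((0, 3)), np.empty((3, 0))):
    try:
        # With the defaults ensure_min_samples=1 and ensure_min_features=1,
        # empty inputs are rejected with an explicit message.
        check_array(X)
    except ValueError as exc:
        # e.g. "Found array with 0 sample(s)/feature(s) ... while a minimum of 1 is required"
        print(exc)
```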

@amueller (Member)

+1 on failing with a good error.

@ogrisel ogrisel force-pushed the fix-empty-input-data-common-checks branch from 25898de to 549d98c Compare February 16, 2015 14:18
@ogrisel (Member Author) commented Feb 16, 2015

Somewhat related: I opened #4252 to track the inconsistent handling of 1D input. Fixing #4252 is probably not required to fix the remaining failures of this PR, but it's good to keep in mind as we move closer to 1.0.

# Convert data
# ensure_2d=False because there are actually unit test checking we fail
# for 1d. FIXME make this consistent in the future.
X = check_array(X, dtype=DTYPE, ensure_2d=False, accept_sparse="csc")
@ogrisel (Member Author):

@amueller I removed ensure_2d=False here and no unit test failed, contrary to what the FIXME inline comment says. Do you remember which tests used to fail?

This change is technically not required for this PR. I can revert it if you prefer to deal with this case in another PR such as the one tackling #4252 for instance.

Member:

That was a tree-specific test.

Member:

Or rather forest-specific.

@ogrisel (Member Author):

It no longer fails, apparently.
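
For readers following along, a small illustration of what ensure_2d controls (behaviour on 1d input has changed across scikit-learn versions; recent releases raise by default):

```python
import numpy as np
from sklearn.utils import check_array

X_1d = np.array([1.0, 2.0, 3.0])

try:
    check_array(X_1d)  # ensure_2d=True is the default
except ValueError as exc:
    print("rejected 1d input:", exc)

# Opting out keeps the 1d array as-is (this is what the removed flag allowed).
print(check_array(X_1d, ensure_2d=False).shape)  # (3,)
```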

@ogrisel (Member Author) commented Feb 16, 2015

I fixed all the failures revealed by the new checks.

@ogrisel ogrisel changed the title [WIP] Fix empty input data common checks [MRG] Fix empty input data common checks Feb 16, 2015
@ogrisel ogrisel added the Bug label Feb 23, 2015
@ogrisel ogrisel added this to the 0.16 milestone Feb 23, 2015
# Check that all estimator yield informative messages when
# trained on empty datasets
multi_output = name.startswith('MultiTask')
yield (check_estimators_empty_data_messages,
Member:

why don't you use the "make_y_multioutput" helper?

@ogrisel (Member Author):

I cannot find what you are talking about in the code base.

Member:

Sorry I meant y = multioutput_estimator_convert_y_2d(name, y)

@amueller (Member)

Maybe add a whatsnew (helpful for looking up when something was fixed), and maybe use the multioutput helper. Otherwise +1

@@ -318,6 +319,28 @@ def check_estimators_dtypes(name, Estimator):
        pass


def check_estimators_empty_data_messages(name, Estimator, multi_output=False):
    with warnings.catch_warnings(record=True):
Member:

I would use the ignore warnings decorator.

Member:

Can you explain why you need this? I am missing something here.

@ogrisel (Member Author):

> I would use the ignore warnings decorator.

Indeed.

> Can you explain why you need this? I am missing something here.

I will relaunch the tests without the catch to make sure.
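
For orientation, a rough sketch of what a check like check_estimators_empty_data_messages asserts; the real implementation lives in sklearn/utils/estimator_checks.py and differs in detail (the helper name below, assert_empty_data_error, is hypothetical):

```python
import numpy as np
from sklearn.linear_model import Ridge

def assert_empty_data_error(estimator):
    # Fitting on a 0-sample design matrix must raise a ValueError whose
    # message points at the offending dimension.
    X_zero_samples = np.empty((0, 3))
    y = np.array([])
    try:
        estimator.fit(X_zero_samples, y)
    except ValueError as exc:
        assert "0 sample(s)" in str(exc) or "0 feature(s)" in str(exc), str(exc)
    else:
        raise AssertionError("fit on empty data did not raise ValueError")

assert_empty_data_error(Ridge())
```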

@ogrisel ogrisel force-pushed the fix-empty-input-data-common-checks branch from 090ad08 to 67f6ad1 Compare February 25, 2015 14:33
@ogrisel (Member Author) commented Feb 25, 2015

@arjoly I addressed your comments.

@amueller I don't know which helper you are referring to...

@ogrisel ogrisel changed the title [MRG] Fix empty input data common checks [MRG+1] Fix empty input data common checks Feb 25, 2015
@arjoly (Member) commented Feb 25, 2015

LGTM

@arjoly arjoly changed the title [MRG+1] Fix empty input data common checks [MRG+2] Fix empty input data common checks Feb 25, 2015
X_zero_features = np.empty(0).reshape(3, 0)
# the following y should be accepted by both classifiers and regressors
# and ignored by unsupervised models
if multi_output:
Member:

So here you could do

    y = multioutput_estimator_convert_y_2d(name, y)

and remove the multioutput argument. This is how it is done in all the other tests.
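
For readers unfamiliar with that helper, a hedged approximation of what it does (the function below is illustrative only; the authoritative version is multioutput_estimator_convert_y_2d in sklearn.utils.estimator_checks at the time of this PR):

```python
import numpy as np

def convert_y_2d_for_multitask(name, y):
    # MultiTask* estimators expect a 2d y (one column per task); everything
    # else keeps the original 1d y.
    if name.startswith('MultiTask'):
        return np.reshape(y, (-1, 1))
    return y

y = np.array([1, 0, 2])
print(convert_y_2d_for_multitask('MultiTaskLassoCV', y).shape)  # (3, 1)
print(convert_y_2d_for_multitask('RidgeClassifier', y).shape)   # (3,)
```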

@ogrisel ogrisel force-pushed the fix-empty-input-data-common-checks branch from 7d7b95a to feceab6 Compare February 26, 2015 09:46
@ogrisel ogrisel merged commit feceab6 into scikit-learn:master Feb 26, 2015
@ogrisel (Member Author) commented Feb 26, 2015

Thanks @amueller, I addressed your comment, squashed and merged.
