More generic estimator checks #14712


Closed

raamana opened this issue Aug 21, 2019 · 5 comments · Fixed by #29875
Labels
API · Developer API (Third party developer API related)

Comments

@raamana
Contributor
raamana commented Aug 21, 2019

Description

While trying to roll my own estimator as part of kernelmethods, I ran into a failed estimator check check_non_transformer_estimators_n_iter.

According to the code below, this check is supposed to be skipped for SVR and related estimators (my estimator is derived from SVR):

    # These models are dependent on external solvers like
    # libsvm and accessing the iter parameter is non-trivial.
    not_run_check_n_iter = ['Ridge', 'SVR', 'NuSVR', 'NuSVC',
                            'RidgeClassifier', 'SVC', 'RandomizedLasso',
                            'LogisticRegressionCV', 'LinearSVC',
                            'LogisticRegression']

    # Tested in test_transformer_n_iter
    not_run_check_n_iter += CROSS_DECOMPOSITION
    if name in not_run_check_n_iter:
        return

However, since my estimator's name (OptimalKernelSVR) is not literally in that list, the check runs anyway, even though it is irrelevant to my estimator. To prevent this, I suggest making these checks less specific to sklearn's native estimators and more amenable to external libraries. Some possible solutions:

  1. Change if name in not_run_check_n_iter: to a substring match (note that the estimator's name should contain the listed name, not the other way around), e.g.:

for no_test_name in not_run_check_n_iter:
    if no_test_name.lower() in name.lower():
        return

  2. Allow passing flags to check_estimator to skip a specific check (a rough sketch of this option follows below).
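A rough sketch of what option 2 could look like from a library author's side, assuming a scikit-learn release newer than the 0.21.2 reported below, where check_estimator(..., generate_only=True) yields the individual (estimator, check) pairs; the skipped-check set here is purely illustrative:

from functools import partial

from kernelmethods.algorithms import OptimalKernelSVR
from sklearn.utils.estimator_checks import check_estimator

SKIPPED_CHECKS = {'check_non_transformer_estimators_n_iter'}


def _check_name(check):
    # Individual checks are plain functions or functools.partial objects.
    func = check.func if isinstance(check, partial) else check
    return func.__name__


for estimator, check in check_estimator(OptimalKernelSVR(k_bucket='light'),
                                        generate_only=True):
    if _check_name(check) in SKIPPED_CHECKS:
        continue  # skip checks that do not apply to this estimator
    check(estimator)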

Steps/Code to Reproduce

  1. Clone and install kernelmethods in dev mode

  2. Run this code

from kernelmethods.algorithms import OptimalKernelSVR
from sklearn.utils.estimator_checks import check_estimator

OKSVR = OptimalKernelSVR(k_bucket='light')
check_estimator(OKSVR)  # fails in check_non_transformer_estimators_n_iter

Versions

System:
    python: 3.7.2 (default, Dec 29 2018, 00:00:04)  [Clang 4.0.1 (tags/RELEASE_401/final)]
executable: /Users/Reddy/anaconda3/envs/py36/bin/python
   machine: Darwin-18.6.0-x86_64-i386-64bit
BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
  lib_dirs: /Users/Reddy/anaconda3/envs/py36/lib
cblas_libs: mkl_rt, pthread
Python deps:
       pip: 19.1.1
setuptools: 41.0.1
   sklearn: 0.21.2
     numpy: 1.15.4
     scipy: 1.1.0
    Cython: None
    pandas: 0.24.2
@raamana
Contributor Author
raamana commented Aug 21, 2019

Another peripheral suggestion:
As I ran into multiple failures in check_estimator, I noticed there is too much reliance on ValueError to verify that the estimator raises an error when it is supposed to. I also noticed that some checks look for a specific string (e.g. Inf or NaN) in the exception message. While this is okay, I think it would be more valuable to raise a dedicated SKLearnException with a specific tag/flag to identify the errors more precisely.

This would also help distinguish errors of incorrect usage (e.g. predicting before fitting) from errors of invalid input, etc.
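Purely as an illustration (SKLearnException and the reason tag below are hypothetical, not part of scikit-learn's API), the idea would be something like:

class SKLearnException(ValueError):
    """Hypothetical exception carrying a machine-readable reason tag."""

    def __init__(self, message, reason):
        super().__init__(message)
        self.reason = reason  # e.g. 'not_fitted', 'invalid_input', 'nan_or_inf'


def _predict_before_fit():
    # Stand-in for an estimator method that refuses to predict before fit().
    raise SKLearnException('This estimator is not fitted yet.', reason='not_fitted')


# A check could then assert on the tag instead of matching the message string:
try:
    _predict_before_fit()
except SKLearnException as exc:
    assert exc.reason == 'not_fitted'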

@amueller
Member

Thanks for your input. Also see #13969, #6715.
This specific test is kinda weird and maybe we don't want to run it for external models, so it might be excluded in a non-strict mode.

Clearly the hardcoded list here makes this impossible to apply to third party estimators.

You might also be interested in #14381, which should make running the tests much nicer.
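For context, recent scikit-learn releases provide a pytest helper along these lines; a minimal sketch (parametrize_with_checks is not available in the 0.21.2 reported above, and the OptimalKernelSVR arguments come from the reproduction snippet):

from sklearn.utils.estimator_checks import parametrize_with_checks

from kernelmethods.algorithms import OptimalKernelSVR


# Each common check becomes its own pytest test case, so individual checks
# can be skipped or marked xfail without bypassing the whole suite.
@parametrize_with_checks([OptimalKernelSVR(k_bucket='light')])
def test_sklearn_compatibility(estimator, check):
    check(estimator)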

@raamana
Contributor Author
raamana commented Aug 21, 2019

Thanks, I didn't realize this was already being actively discussed and improved (strangely, my search for check_estimator in the issues didn't turn any of that up).

Another suggestion to help developers is to add a concrete check-list to the Rolling Your Own Estimator guide, detailing the expectations. For example: .fit() must set at least one attribute with a trailing underscore (param_), as this is checked! Perhaps the checklist could be populated directly from the tests themselves, grouped by 3-5 types of estimators.
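A minimal sketch of that fitted-attribute convention (MeanRegressor is a made-up toy estimator, not from scikit-learn or kernelmethods):

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class MeanRegressor(BaseEstimator, RegressorMixin):
    """Toy estimator that always predicts the mean of the training targets."""

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.mean_ = np.mean(y)  # trailing underscore marks a fitted attribute
        return self

    def predict(self, X):
        check_is_fitted(self, 'mean_')  # raises NotFittedError before fit()
        X = check_array(X)
        return np.full(X.shape[0], self.mean_)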

With the checks disassembled into individual tests, plus this checklist, users would be less confused when estimator checks fail.

Funnily enough, in this case the proposed check-list is figurative as well as quite literal (i.e. a list of checks).

@amueller
Member

I agree ;) #10082
Help welcome!

@raamana
Contributor Author
raamana commented Aug 23, 2019

Thanks, I would have loved to help. Given the multiple facets to this (disassembling the whole check_estimator part of the codebase, different versions of the checklist for different estimator types, and refining them by targeting the tests directly), it is a non-trivial job, and I am unfortunately quite busy with my own software, trying to get kernelmethods published in JMLR, etc. But I'd be happy to review any PRs in the meantime, as I believe this is important, and I am also interested in extending the estimator's design to allow more advanced uses (such as passing in covariates).
