8000 Use check_scalar for parameters validation · Issue #20724 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Use check_scalar for parameters validation #20724

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
glemaitre opened this issue Aug 10, 2021 · 41 comments
Closed

Use check_scalar for parameters validation #20724

glemaitre opened this issue Aug 10, 2021 · 41 comments
Labels
Moderate Anything that requires some knowledge of conventions and best practices

Comments

@glemaitre
Copy link
Member
glemaitre commented Aug 10, 2021

I just discover that we have a helper function to validate scalar:
https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_scalar.html

Since this helper could help to get consistent error types and messages, I was wondering if we could make a long-running issue to introduce this helper everywhere possible.

I think this could be a good issue for first contributors and short sprints.

I made an example PR that make the tool slightly more flexible:
#20723

For more detailed instructions on how to proceed with this issue, see issue #21927

@glemaitre glemaitre added the RFC label Aug 10, 2021
@glemaitre
Copy link
Member Author

@adrinjalali @thomasjpfan do you think it is worth it?

@glemaitre glemaitre added Sprint Easy Well-defined and straightforward way to resolve good first issue Easy with clear instructions to resolve labels Aug 10, 2021
@adrinjalali
Copy link
Member

I'm in favor of consistent input validation, there's also some discussion here: #14721

@glemaitre
Copy link
Member Author

So _check_in_range proposed by @thomasjpfan is indeed exactly the job of check_scalar with a TypeError on the top :).

We might want to introduce the check_scalar everywhere, change the error message or add tests. Then we could see if we want to refactor using a decorator or not?

@adrinjalali
Copy link
Member

I'm all in favor. It's certainly an improvement.

@architjen
Copy link

@glemaitre I'm interested in working on this issue, it would be my very first contribution.

< 8000 /task-lists>

@glemaitre
Copy link
Member Author

@architjen you could take an estimator and make the changes. You can have a look at #20723

I would like to merge #20723 first such that we have the strictly_* parameters thought.

@jjerphan
Copy link
Member

As of now check_scalar returns nothing, but it would be nice to be able to return the checked values similarly to what's done for check_array, e.g.:

number = check_scalar(number, "number", Integral)

What do you think?

@adrinjalali
Copy link
Member

that's a fair point @jjerphan . I'd be happy to either return a value (in case of type conversions, e.g. int to float), or change the name to test_scalar, maybe.

@kaushikroychowdhury
Copy link
Contributor

Hey I want to work on this issue. This will be my first contribution.
Will you please help me @glemaitre @architjen

@shivangij19
Copy link

I would also like to contribute to this issue.
Can u help walk me through @glemaitre

@SanjayMarreddi
Copy link
Contributor
SanjayMarreddi commented Aug 17, 2021

Hey @glemaitre and @adrinjalali!
I have been using scikit-learn so frequently in my AI side projects. I also contributed to a few Python OpenSource projects. Now I would like to start contributing to scikit-learn too!

I would like to begin with this issue. I went through all the discussions available here & linked PRs too.
I am planning to start with parameters validation for the K Means or DBSCAN clustering algorithms using check_scalar.
Would like to know if you have any suggestions for me on this estimator?

Thanks and Regards,
Sanjay Marreddi.

@vishaljha2121
Copy link

Hello @glemaitre, I am interested in contributing to this issue. Can you please assign it to me?

@thomasjpfan
Copy link
Member

Thank you everyone that is interested in the issue! Before we start working on other estimators, we should wait for #20723 to get merged. That PR makes adjustments to the error message and adds more API to check_scalar which will be useful in estimators.

@nshah1503
Copy link

Hello @glemaitre, I would like to work on this issue, this will be my first contribution to open-source and It would be amazing if you suggest how to proceed.

@NaYrA-IaR
Copy link

Hello, I am a beginner and I would like to contribute to this project. It will be my first contribution to open source. Can you help me with this @glemaitre ?

@adrinjalali
Copy link
Member

@NaYrA-IaR @nshah1503 please read the discussion on issues before asking if you can work on it. As you see in the comment here: #20724 (comment) you need to wait for another PR (pull request) to finish, before anyone can continue working on this issue.

@SanjayMarreddi
Copy link
Contributor

@glemaitre
I am ( In fact many people ) are eagerly waiting for your PR ( #20723 ) to be merged!
The reason being I also tried using check_scalar for parameters validation for a different estimator, but to make my draft PR, I need to wait since your PR makes adjustments to the error message and adds more API to check_scalar.

I am sorry if my mentioning of you here caused some inconvenience. I do understand you are fully occupied with your works. I am just requesting you to do so due to the curiosity of my contribution to scikit-learn!

Thanks!

@glemaitre glemaitre removed the RFC label Aug 22, 2021
@ArulananthamAnujan
Copy link

I'm interested in working on this issue, it would be my very first contribution.

@S-T-A-R-L-O-R-D
Copy link

is this issue still open, if yes can I start working on it?

@adrinjalali adrinjalali removed Sprint good first issue Easy with clear instructions to resolve labels Aug 24, 2021
@glemaitre
Copy link
Member Author

#20723 has been merged and can serve as an example to see which modifications are expected. This PR was modifying in the AffinityPropagation estimator.

@genvalen
Copy link
Contributor
genvalen commented Nov 6, 2021

Working on AdaBoostRegressor

@genvalen
Copy link
Contributor
genvalen commented Nov 10, 2021

Working on ensemble/_gb.py. This includes:
GradientBoostingClassifier, and
GradientBoostingRegressor.

@hvassard
Copy link
Contributor

Hello @glemaitre @jjerphan !

As #20723 is now merged I'd like to work on this issue, this would be my very first contribution !

On which estimator / classifier would you like me to work on ?

Do you have any suggestions / advices ?

Thank you ! 🙏

@adrinjalali
Copy link
Member

@hvassard a part of this issue is to find places where this change would make sense. I suggest you start by looking at a few PRs linked here and see what they've done, and then have a look at the code base on different estimators, and see if you can fix any of them. It would also give you a better idea of how the code is organized :)

@jjerphan
Copy link
Member

@hvassard: thank you for your interest!

I suggest you have a look at this comment which explains how to proceed #20724 (comment).

@hvassard
Copy link
Contributor
hvassard commented Dec 4, 2021

Working on SpectralClustering

@reshamas
Copy link
Member
reshamas commented Dec 7, 2021

Working on: file (class)

  • linear_model/_glm/glm.py (GeneralizedLinearRegressor)
  • linear_model/_glm/glm.py (PoissonRegressor)
  • linear_model/_glm/glm.py (GammaRegressor)
  • linear_model/_glm/glm.py (TweedieRegressor)

@genvalen
Copy link
Contributor

Working on BaseDecisionTree.
This one's needed before I can complete ensemble/_gb.py

@reshamas reshamas added Moderate Anything that requires some knowledge of conventions and best practices and removed Easy Well-defined and straightforward way to resolve labels Jan 7, 2022
@genvalen
Copy link
Contributor
genvalen commented Jan 9, 2022

Working on ensemble/_voting.py.
Includes:
VotingClassifier,
VotingRegressor.

@genvalen
Copy link
Contributor

Hi everyone! I am trying to understand the range for n_jobs in voting estimators. Glossary definition here.
It seems like it can be any integer (positive or negative), except for zero. Is that correct?

@reshamas
Copy link
Member

@genvalen I think that's right.
What happens if you test it out in a binder notebook and try n_jobs = 0?

  • None ( n_jobs = 1 )
  • n_jobs < -1
    • n_jobs = -2 & n_cpus = 2 ==> 1 (n_cpus + 1 + n_jobs)
    • n_jobs = -2 & n_cpus = 64 ==> 63 (n_cpus + 1 + n_jobs)
  • n_jobs = -1
  • n_jobs = 1

@genvalen
Copy link
Contributor

@reshamas this is the traceback from the REPL for:
clf1 = LogisticRegression(multi_class='multinomial', n_jobs=0)
clf1.fit(X,y)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gen/Desktop/scikit-learn/sklearn/linear_model/_logistic.py", line 1589, in fit
    fold_coefs_ = Parallel(
  File "/Users/gen/anaconda3/envs/sklearn-dev/lib/python3.9/site-packages/joblib/parallel.py", line 966, in __call__
    n_jobs = self._initialize_backend()
  File "/Users/gen/anaconda3/envs/sklearn-dev/lib/python3.9/site-packages/joblib/parallel.py", line 733, in _initialize_backend
    n_jobs = self._backend.configure(n_jobs=self.n_jobs, parallel=self,
  File "/Users/gen/anaconda3/envs/sklearn-dev/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 489, in configure
    n_jobs = self.effective_n_jobs(n_jobs)
  File "/Users/gen/anaconda3/envs/sklearn-dev/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 504, in effective_n_jobs
    raise ValueError('n_jobs == 0 in Parallel has no meaning')
ValueError: n_jobs == 0 in Parallel has no meaning

@thomasjpfan
Copy link
Member

Since we pass n_jobs directly to joblib's Parallel which does the validation, I do not think we need to validate n_jobs in scikit-learn itself.

@genvalen
Copy link
Contributor
genvalen commented Jan 31, 2022

Working on ensemble/_stacking.py.
Includes:
StackingRegressor,
StackingClassifier.

@matiasrvazquez
Copy link
Contributor

Hi all. I will be working on sklearn.linear_model._bayes, which includes the classes BayesianRidge and ARDRegression

@jeremiedbb
Copy link
Member

Closing in favor of #23462

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Moderate Anything that requires some knowledge of conventions and best practices
Projects
None yet
Development

No branches or pull requests

0