Another input needed for the parameter `n_features_to_select` in SequentialFeatureSelector · Issue #21291 · scikit-learn/scikit-learn

Open
hellojinwoo opened this issue Oct 9, 2021 · 4 comments

Comments

@hellojinwoo

Describe the workflow you want to enable

Currently, to use SequentialFeatureSelector, you must supply the parameter n_features_to_select up front. However, according to the book 'Introduction to Statistical Learning', you can only know how many features are appropriate after fitting the best model of every size and comparing them.

[Figure: excerpt from ISLR showing model selection across all subset sizes]

This excerpt from ISLR shows that the appropriate number of features can only be determined after testing every number of predictors; it cannot be fixed beforehand.

Describe your proposed solution

I suggest adding options such as "best adjusted R squared" for the parameter n_features_to_select. With such an option, the selector would choose the number of features with the highest adjusted R squared, which cannot be known in advance.
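As a rough sketch of the proposed behavior using the existing API (dataset and estimator here are illustrative), one can fit a SequentialFeatureSelector for every candidate size and keep the size with the best cross-validated score:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy data: 8 features, only 3 of which carry signal.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       random_state=0)

best_k, best_score = None, -np.inf
# n_features_to_select as an int must be strictly less than n_features.
for k in range(1, X.shape[1]):
    sfs = SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=k, cv=5
    ).fit(X, y)
    score = cross_val_score(LinearRegression(), sfs.transform(X), y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)
```

The proposal would fold this outer loop into the selector itself, so the user does not have to refit the selection path once per candidate size.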

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@bmreiniger
Contributor

@thomasjpfan
Member

Closing because this is a duplicate of #20137. Note that there is ongoing work at #20145 that will resolve the issue.

@bmreiniger
Contributor
bmreiniger commented Oct 10, 2021

@thomasjpfan this would be a little different if the scores-vs-number-of-features graph isn't convex: this proposal would order all of the features and then choose the best number, which might not be at the first turnaround. That said, it's not clear how common it would be that later additions/deletions would be sufficiently better to be worth the extra processing time.

With the work in #20145, a user could get this by setting tol=-np.inf and then manually setting the number of features (would changing the n_features_to_select_ learned attribute have the desired effect when transforming?). I'd imagine, if it were deemed worth it, we could add an option n_features_to_select="best" without too much trouble after #20145.

@thomasjpfan reopened this Oct 10, 2021
@hellojinwoo
Author
hellojinwoo commented Nov 2, 2021

@bmreiniger As you said, checking an estimated test score (e.g. adjusted R squared, AIC, BIC, etc.) for every number of features and only then deciding which number to use is not a duplicate of #20137. #20137 proposes something like "continue selecting features as long as AIC improves". The methodology here, by contrast, assumes you cannot rely on a single early-stopping criterion: feature selection keeps going (as long as training SSE decreases) until all features have been added, and the number of features is chosen afterwards by comparing several estimated test scores across all sizes.
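For reference, adjusted R squared is one such size-penalized score: it only rewards a larger model if the raw R squared improves enough to justify the extra predictors. A small illustrative helper (not part of scikit-learn):

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Penalizes the plain R^2 for the number of predictors p used,
    so adding weak features can lower the score.
    """
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# A 20-feature model with only slightly better raw R^2 loses to a
# 5-feature model once the penalty is applied:
small = adjusted_r2(0.90, n_samples=100, n_features=5)
large = adjusted_r2(0.91, n_samples=100, n_features=20)
print(small, large)
```

AIC and BIC apply the same idea with different penalty terms, which is why comparing all sizes under several such scores, as described above, can give different answers than a single greedy stopping rule.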

R supports this kind of exhaustive subset comparison, so it would be great to see it in scikit-learn as well :)
