8000 FIX Allow input validation by pass in MLPClassifier by betatim · Pull Request #24873 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
/ < 8000 a data-pjax="#repo-content-pjax-container" data-turbo-frame="repo-content-turbo-frame" href="/scikit-learn/scikit-learn">scikit-learn Public

FIX Allow input validation by pass in MLPClassifier #24873

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jan 4, 2023

Conversation

betatim
Copy link
Member
@betatim betatim commented Nov 9, 2022

With arrays created inside the estimator we do not need to perform validation again. This change allows the estimator to compute the score during fitting, without having to provide column names for the validation dataset.

This solves a problem where fitting data with feature names generates warnings. The reason for the warnings is that the input validation strips the column names from the input data, which is then split into a train and validation set. When the validation set is passed to score, it performs input validation and raises a warning because the input data doesn't have feature names.

Reference Issues/PRs

closes #24846

What does this implement/fix? Explain your changes.

This adds a private score method which is used during fitting with early stopping. The private score method uses a _predict method that does not perform input data validation. This is Ok because the validation dataset is derived from a train_test_split of "input validated data".

Alternatives

We could turn X_val into a dataframe when it is created by using self.feature_names_in_. Feels awkward.

We could delay setting feature_names_in_ until the fitting is complete, would require reworking the validation functions.

What do people think?

With arrays created inside the estimator we do not need to perform
validation again. This change allows the stimator to compute the score
during fitting, without having to provide column names for the
validation dataset.
@cmarmo
Copy link
Contributor
cmarmo commented Nov 11, 2022

Hi @betatim, thanks for the pull request ( :) ).
Pandas is still a soft dependency for scikit-learn tests.
You might want to use pytest.importorskip as in

pd = pytest.importorskip("pandas")
to fix the checks.

@betatim
Copy link
Member Author
betatim commented Nov 14, 2022

TIL about importorskip!

@betatim
Copy link
Member Author
betatim commented Nov 14, 2022

This is ready for review. No Changelog entry because I don't think this is needed for this change.

@cmarmo cmarmo added No Changelog Needed Validation related to input validation labels Nov 14, 2022
Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @betatim !

@betatim
Copy link
Member Author
betatim commented Nov 24, 2022

All comments addressed.

Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a big enough change to have a change log entry as a bug fix.

@betatim
Copy link
Member Author
betatim commented Dec 2, 2022

Added what's new entry, targeting v1.3. Not sure about how/when things still get added to the upcoming v1.2 but I guess we can do that in addition?

@thomasjpfan thomasjpfan changed the title ENH Allow input validation by pass in MLPClassifier FIX Allow input validation by pass in MLPClassifier Dec 2, 2022
Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@betatim betatim added the Waiting for Second Reviewer First reviewer is done, need a second one! label Dec 6, 2022
@glemaitre glemaitre self-requested a review December 28, 2022 11:01
@glemaitre glemaitre removed the Waiting for Second Reviewer First reviewer is done, need a second one! label Dec 28, 2022
@betatim
Copy link
Member Author
betatim commented Jan 4, 2023

Thanks for the review @glemaitre. I've implemented your changes.

@glemaitre glemaitre merged commit 97fe999 into scikit-learn:main Jan 4, 2023
@glemaitre
Copy link
Member

LGTM. Thanks @betatim

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:neural_network Validation related to input validation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"X does not have valid feature names" when training on DataFrame input with MLP
4 participants
0