[RFC] deprecate 1d X in check_array [was reshape sensibly] by amueller · Pull Request #4511 · scikit-learn/scikit-learn · GitHub

[RFC] deprecate 1d X in check_array [was reshape sensibly] #4511


Closed
amueller wants to merge 10 commits into scikit-learn:master from amueller:single_feature_X

Conversation

amueller
Member
@amueller amueller commented Apr 3, 2015

reshape in check_array for ndim==1 using reshape(-1, 1), not reshape(1, -1).
See #4509 #4466. [edit] Not sure this is the right idea any more[/edit].
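For concreteness, a minimal numpy sketch of the two possible interpretations of a 1d array (not scikit-learn code, just the two reshapes being discussed):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # ambiguous 1d input

x.reshape(-1, 1)  # shape (3, 1): three samples of a single feature
x.reshape(1, -1)  # shape (1, 3): one sample with three features
```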

On master, all "transform", "decision_function" and "predict_proba" take X of shape (n_features,)
without issue. Investigating whether I brought this upon us with check_array.

Sadness so far:

  • Naive Bayes, DictionaryLearning, GradientBoosting, SGDClassifier, LSHForest, BallTree, KDTree, RBM and some feature_selection worked the other way around (assuming shape (1, n_features)).
  • Trees & forests asserted they don't work on 1d (only slightly saddens me to remove this test).

@agramfort
Member

you have my +1 on this. We "just" need to fix all the estimators that complain...

@amueller
Member Author
amueller commented Apr 5, 2015

I'm on it ;)

@amueller
Member Author
amueller commented Apr 5, 2015

Most of the remaining fun seems to be estimators that in general just don't handle 1d data... great!

@amueller
Member Author
amueller commented Apr 5, 2015

This will probably break a lot of code, seeing how it broke so many tests.

I see the following possible choices:

  1. Go through a deprecation cycle for those estimators that expected it "the other way around" and then switch it to be consistent (n_samples, 1).
  2. Go through a deprecation cycle for accepting 1d at all and in the future just die.
  3. Just break people's code that relied on it being (1, n_features) in any of the estimators I mentioned above (and the ones I didn't notice).

Option 1 means a lot of additional input validation and testing code. In particular, we need to make sure that we really didn't change previous behavior.
Option 2 to me means an inconvenient interface, with a similar amount of testing code.
Option 3 means breaking people's code.

@landscape-bot

Code Health
Repository health decreased by 0.01% when pulling ecf628c on amueller:single_feature_X into 6560a8e on scikit-learn:master.

@amueller
Member Author
amueller commented Apr 9, 2015

Feedback from @GaelVaroquaux @ogrisel @jnothman would be very welcome. I'll probably go ahead and "fix" this anyhow, but the deprecations probably need somewhat different work.

@ogrisel
Member
ogrisel commented Apr 9, 2015

+1 on making check_array(data, ensure_2d=True) treat the case of ndim==1 as an n_samples dimension, that is, reshape to (-1, 1).

@amueller
Member Author
amueller commented Apr 9, 2015

@ogrisel and deprecate the current behavior? on master, all predict, decision_function and transform do it the other way around.

@amueller
Member Author
amueller commented Apr 9, 2015

FYI, in 0.15, most estimators "worked" but some broke on these methods when given (n_features,), with such helpful errors as "einstein sum subscripts string contains too many subscripts for operand 0".

@ogrisel
Member
ogrisel commented Apr 9, 2015

@ogrisel and deprecate the current behavior? on master, all predict, decision_function and transform do it the other way around.

Indeed I had not realized. I am not so sure anymore.

@amueller
Member Author
amueller commented Apr 9, 2015

Checking the current fit functions in master:
it looks like all but the following three assume (n_features,): DictionaryLearning, MinMaxScaler and StandardScaler, which assume (n_samples,) instead.

It is a bit hard to say, though, as many estimators crash when given a single sample.

@ogrisel
Member
ogrisel commented Apr 9, 2015

I think for fit we should always assume (n_samples,), as it really does not make sense to do machine learning on a single-sample dataset.

Or alternatively we could raise a ValueError for fit with a single sample.

For predict / transform and friends, we don't want to break people's code that might be leveraging the current (partially unintended) behavior as a convenience for single-sample predictions. I would be OK with keeping the current master behavior for those guys. That means that check_array should be called with ensure_min_samples=2 in fit throughout the code base.
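For illustration, a minimal sketch of what that fit-time check does (ensure_min_samples is an existing check_array parameter; the array values here are made up):

```python
import numpy as np
from sklearn.utils import check_array

X_single = np.array([[1.0, 2.0, 3.0]])  # one sample, three features

# predict/transform-style validation: a single sample passes.
check_array(X_single)

# fit-style validation as proposed: require at least two samples.
try:
    check_array(X_single, ensure_min_samples=2)
except ValueError as exc:
    print(exc)  # complains that 1 sample was found while a minimum of 2 is required
```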

I wonder what other people think.

@amueller
Member Author
amueller commented Apr 9, 2015

Ask on the ML?
I think I mostly agree with you. But it would be weird to have ndim=1 mean a different axis for prediction and fitting. So maybe raise a ValueError in fit / deprecate if it currently works?

@ogrisel ogrisel added this to the 1.0 milestone Apr 9, 2015
@ogrisel ogrisel added the API label Apr 9, 2015
@amueller amueller changed the title FIX make check_array reshape sensibly [RFC] make check_array reshape sensibly May 1, 2015
@amueller
Member Author

After the discussion on the ML, I think we deprecate and "raise"?

@dukebody
dukebody commented Aug 9, 2015

I think the proposed solution is the right one since it is the only one consistent with the label transformers interface. For example, LabelEncoder accepts a 1-d array of shape (n_samples,), not (n_features,). This way we will always have the first dimension (axis=0, rows) as the samples, and the second one (axis=1, columns) as the features.
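For reference, a quick sketch of that LabelEncoder behavior (the example labels are made up):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# A 1d input is interpreted as (n_samples,), not (n_features,).
codes = le.fit_transform(["paris", "tokyo", "paris", "amsterdam"])
print(codes)  # [1 2 1 0], with classes_ sorted as [amsterdam, paris, tokyo]
```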

My 2 cents.

@amueller
Member Author

@dukebody well, the label transformers operate on y. In the future we will deprecate passing 1d arrays as X, as the API is currently so inconsistent.
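A rough sketch of what that deprecation path could look like; the helper name and warning text below are hypothetical, not the actual implementation that landed later:

```python
import warnings
import numpy as np

def _deprecate_1d_X(X):
    # Hypothetical sketch: warn on 1d X for now, keep reshaping for backward
    # compatibility, and turn the warning into an error in a later release.
    X = np.asarray(X)
    if X.ndim == 1:
        warnings.warn(
            "Passing 1d arrays as X is deprecated. Reshape your data with "
            "X.reshape(-1, 1) if it has a single feature, or "
            "X.reshape(1, -1) if it is a single sample.",
            DeprecationWarning,
        )
        X = X.reshape(-1, 1)
    return X
```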

@amueller amueller changed the title [RFC] make check_array reshape sensibly [RFC] deprecate 1d X in check_array [was reshape sensibly] Aug 24, 2015
@amueller
Member Author
amueller commented Sep 9, 2015

Replaced by #5152 which got merged.
