E544 [RFC] deprecate 1d X in check_array [was reshape sensibly] by amueller · Pull Request #4511 · scikit-learn/scikit-learn · GitHub

Conversation

amueller
Member
@amueller amueller commented Apr 3, 2015

reshape in check_array for ndim==1 using reshape(-1, 1), not reshape(1, -1).
See #4509 #4466. [edit] Not sure this is the right idea any more[/edit].

On master, all "transform", "decision_function" and "predict_proba" take X of shape (n_features,)
without issue. Investigating whether I brought this upon us with check_array.

Sadness so far:

  • Naive Bayes, DictionaryLearning, GradientBoosting, SGDClassifier, LSHForest, BallTree, KDTree, RBM and some feature_selection worked the other way around (assuming shape (1, n_features)).
  • Trees & forests asserted they don't work on 1d (it only slightly saddens me to remove this test).
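For concreteness, here is the ambiguity the PR is about, as a minimal NumPy sketch (nothing scikit-learn-specific): a 1d array can be read as one sample or as one feature, and the two reshapes pick opposite interpretations.

```python
import numpy as np

# A 1d X is ambiguous: one sample with three features,
# or three samples with one feature?
x = np.array([1.0, 2.0, 3.0])

# (1, n_features): a single sample -- what many estimators
# on master currently assume for predict/transform.
row = x.reshape(1, -1)

# (n_samples, 1): a single feature -- what this PR originally
# proposed check_array should produce for ndim == 1.
col = x.reshape(-1, 1)

print(row.shape)  # (1, 3)
print(col.shape)  # (3, 1)
```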

@agramfort
Member

you have my +1 on this. We "just" need to fix all the estimators that complain...

@amueller
Member Author
amueller commented Apr 5, 2015

I'm on it ;)

@amueller
Member Author
amueller commented Apr 5, 2015

Most of the remaining fun seems to be estimators that just in general don't handle 1d data.... great!

@amueller
Member Author
amueller commented Apr 5, 2015

This will probably break a lot of code, seeing how it broke so many tests.

I see the following possible choices:

  1. Go through a deprecation cycle for those estimators that expected it "the other way around", then switch it to be consistent (n_samples, 1).
  2. Go through a deprecation cycle for accepting 1d at all, and in the future just die.
  3. Just break people's code that relied on it being (1, n_features) in any of the estimators I mentioned above (and the ones I didn't notice).

Option 1 means a lot of additional input validation and testing code. In particular, we need to make sure that we really didn't change previous behavior. Option 2 to me means an inconvenient interface, with a similar amount of testing code. Option 3 means breaking people's code.

@landscape-bot

Code Health
Repository health decreased by 0.01% when pulling ecf628c on amueller:single_feature_X into 6560a8e on scikit-learn:master.

@amueller
Member Author
amueller commented Apr 9, 2015

Feedback from @GaelVaroquaux @ogrisel @jnothman would be very welcome. I'll probably go ahead and "fix" this anyhow, but the deprecations probably need somewhat different work.

@ogrisel
Member
ogrisel commented Apr 9, 2015

+1 on making check_array(data, ensure_2d=True) treat the case of ndim==1 as a n_samples dimension, that is reshape to (-1, 1).

@amueller
Member Author
amueller commented Apr 9, 2015

@ogrisel and deprecate the current behavior? on master, all predict, decision_function and transform do it the other way around.

@amueller
Member Author
amueller commented Apr 9, 2015

FYI, in 0.15, most estimators "worked" but some broke on these methods when given (n_features,). With such helpful errors as "einstein sum subscripts string contains too many subscripts for operand 0".

@ogrisel
Member
ogrisel commented Apr 9, 2015

@ogrisel and deprecate the current behavior? on master, all predict, decision_function and transform do it the other way around.

Indeed I had not realized. I am not so sure anymore.

@amueller
Member Author
amueller commented Apr 9, 2015

Checking the current fit functions on master:
it looks like all but the following three assume (n_features,).
DictionaryLearning, MinMaxScaler and StandardScaler assume (n_samples,).

It is a bit hard to say, though, as many estimators crash when given a single sample.

@ogrisel
Member
ogrisel commented Apr 9, 2015

I think for fit we should always assume (n_samples,), as it really does not make sense to do machine learning on a single-sample dataset.

Or alternatively we could raise a ValueError for fit with a single sample.

For predict / transform and friends, we don't want to break people's code that might be leveraging the current (partially unintended) behavior as a convenience for single-sample predictions. I would be OK with keeping the current master behavior for those. That means that check_array should be called with ensure_min_samples=2 in fit throughout the code base.
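The fit-time guard suggested here could look roughly like the following standalone sketch (a hypothetical helper illustrating the ensure_min_samples idea, not the actual check_array internals):

```python
import numpy as np


def ensure_min_samples(X, min_samples=2):
    """Hypothetical sketch of the proposed fit-time check:
    refuse datasets with fewer than `min_samples` rows."""
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError("Expected 2d X, got %dd array instead." % X.ndim)
    if X.shape[0] < min_samples:
        raise ValueError(
            "Found array with %d sample(s) while a minimum of %d "
            "is required." % (X.shape[0], min_samples)
        )
    return X
```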

I wonder what other people think.

@amueller
Member Author
amueller commented Apr 9, 2015

Ask on the ML?
I think I mostly agree with you. But it would be weird to have ndim=1 mean a different axis for prediction and fitting. So maybe raise a ValueError in fit / deprecate if it currently works?

@ogrisel ogrisel added this to the 1.0 milestone Apr 9, 2015
@ogrisel ogrisel added the API label Apr 9, 2015
@amueller amueller changed the title FIX make check_array reshape sensibly [RFC] make check_array reshape sensibly May 1, 2015
@amueller
Member Author

After the discussion on the ML, I think we deprecate and "raise"?

@dukebody
dukebody commented Aug 9, 2015

I think the proposed solution is the right one since it is the only one consistent with the label transformers interface. For example, LabelEncoder accepts a 1-d array of shape (n_samples,), not (n_features,). This way we will always have the first dimension (axis=0, rows) as the samples, and the second one (axis=1, columns) as the features.

My 2 cents.

@amueller
Member Author

@dukebody well the label transformers are on y. In the future we will deprecate passing 1d arrays to X as the API is currently so inconsistent.

@amueller amueller changed the title [RFC] make check_array reshape sensibly [RFC] deprecate 1d X in check_array [was reshape sensibly] Aug 24, 2015
@amueller
Member Author
amueller commented Sep 9, 2015

Replaced by #5152 which got merged.
