E544 [RFC] deprecate 1d X in check_array [was reshape sensibly] by amueller · Pull Request #4511 · scikit-learn/scikit-learn · GitHub

Conversation

amueller
Member
@amueller amueller commented Apr 3, 2015

reshape in check_array for ndim==1 using reshape(-1, 1), not reshape(1, -1).
See #4509 #4466. [edit] Not sure this is the right idea any more[/edit].

On master, all "transform", "decision_function" and "predict_proba" take X of shape (n_features,)
without issue. Investigating whether I brought this upon us with check_array.

Sadness so far:

  • Naive Bayes, DictionaryLearning, GradientBoosting, SGDClassifier, LSHForest, BallTree, KDTree, RBM and some feature_selection worked the other way around (assuming shape (1, n_features)).
  • Trees & forests asserted they don't work on 1d (it only slightly saddens me to remove this test).
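For concreteness, here is the ambiguity the PR is about, as a minimal NumPy sketch (nothing scikit-learn-specific): a 1d array can be read as one sample or as one feature, and the two reshapes pick opposite interpretations.

```python
import numpy as np

# A 1d X is ambiguous: one sample with three features,
# or three samples with one feature?
x = np.array([1.0, 2.0, 3.0])

# (1, n_features): a single sample -- what many estimators
# on master currently assume for predict/transform.
row = x.reshape(1, -1)

# (n_samples, 1): a single feature -- what this PR originally
# proposed check_array should produce for ndim == 1.
col = x.reshape(-1, 1)

print(row.shape)  # (1, 3)
print(col.shape)  # (3, 1)
```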

@agramfort
Member

you have my +1 on this. We "just" need to fix all the estimators that complain...

@amueller
Member Author
amueller commented Apr 5, 2015

I'm on it ;)

@amueller
Member Author
amueller commented Apr 5, 2015

Most of the remaining fun seems to be estimators that just in general don't handle 1d data.... great!

@amueller
Member Author
amueller commented Apr 5, 2015

This will probably break a lot of code, seeing how it broke so many tests.

I see the following possible choices:

  1. Go through a deprecation cycle for those estimators that expected it "the other way around", then switch it to be consistent (n_samples, 1).
  2. Go through a deprecation cycle for accepting 1d at all, and in the future just die.
  3. Just break people's code that relied on it being (1, n_features) in any of the estimators I mentioned above (and the ones I didn't notice).

Option 1 means a lot of additional input validation and testing code. In particular, we need to make sure that we really didn't change previous behavior. Option 2 to me means an inconvenient interface, with a similar amount of testing code. Option 3 means breaking people's code.

@landscape-bot

Code Health
Repository health decreased by 0.01% when pulling ecf628c on amueller:single_feature_X into 6560a8e on scikit-learn:master.

@amueller
Member Author
amueller commented Apr 9, 2015

Feedback from @GaelVaroquaux @ogrisel @jnothman would be very welcome. I'll probably go ahead and "fix" this anyhow, but the deprecations probably need somewhat different work.

@ogrisel
Member
ogrisel commented Apr 9, 2015

+1 on making check_array(data, ensure_2d=True) treat the case of ndim==1 as a n_samples dimension, that is reshape to (-1, 1).

@amueller
Member Author
amueller commented Apr 9, 2015

@ogrisel and deprecate the current behavior? on master, all predict, decision_function and transform do it the other way around.

@amueller
Member Author
amueller commented Apr 9, 2015

FYI, in 0.15, most estimators "worked" but some broke on these methods when given (n_features,). With such helpful errors as "einstein sum subscripts string contains too many subscripts for operand 0".

@ogrisel
Member
ogrisel commented Apr 9, 2015

@ogrisel and deprecate the current behavior? on master, all predict, decision_function and transform do it the other way around.

Indeed I had not realized. I am not so sure anymore.

@amueller
Member Author
amueller commented Apr 9, 2015

Checking the current fit functions on master:
it looks like all but the following three assume (n_features,).
DictionaryLearning, MinMaxScaler and StandardScaler assume (n_samples,).

It is a bit hard to say, though, as many estimators crash when given a single sample.

@ogrisel
Member
ogrisel commented Apr 9, 2015

I think for fit we should always assume (n_samples,), as it really does not make sense to do machine learning on a single-sample dataset.

Or alternatively we could raise a ValueError for fit with a single sample.

For predict / transform and friends, we don't want to break people's code that might be leveraging the current (partially unintended) behavior as a convenience for single-sample predictions. I would be OK with keeping the current master behavior for those. That means that check_array should be called with ensure_min_samples=2 in fit throughout the code base.
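The fit-time guard suggested here could look roughly like the following standalone sketch (a hypothetical helper illustrating the ensure_min_samples idea, not the actual check_array internals):

```python
import numpy as np


def ensure_min_samples(X, min_samples=2):
    """Hypothetical sketch of the proposed fit-time check:
    refuse datasets with fewer than `min_samples` rows."""
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError("Expected 2d X, got %dd array instead." % X.ndim)
    if X.shape[0] < min_samples:
        raise ValueError(
            "Found array with %d sample(s) while a minimum of %d "
            "is required." % (X.shape[0], min_samples)
        )
    return X
```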

I wonder what other people think.

@amueller
Member Author
amueller commented Apr 9, 2015

Ask on the ML?
I think I mostly agree with you. But it would be weird to have ndim=1 mean a different axis for prediction and fitting. So maybe raise a ValueError in fit / deprecate if it currently works?

@ogrisel ogrisel added this to the 1.0 milestone Apr 9, 2015
@ogrisel ogrisel added the API label Apr 9, 2015
@amueller amueller changed the title FIX make check_array reshape sensibly [RFC] make check_array reshape sensibly May 1, 2015
@amueller
Member Author

After the discussion on the ML, I think we deprecate and "raise"?

@dukebody
dukebody commented Aug 9, 2015

I think the proposed solution is the right one since it is the only one consistent with the label transformers interface. For example, LabelEncoder accepts a 1-d array of shape (n_samples,), not (n_features,). This way we will always have the first dimension (axis=0, rows) as the samples, and the second one (axis=1, columns) as the features.

My 2 cents.

@amueller
Member Author

@dukebody well the label transformers are on y. In the future we will deprecate passing 1d arrays to X as the API is currently so inconsistent.

@amueller amueller changed the title [RFC] make check_array reshape sensibly [RFC] deprecate 1d X in check_array [was reshape sensibly] Aug 24, 2015
@amueller
Member Author
amueller commented Sep 9, 2015

Replaced by #5152 which got merged.
