undesirable behavior from check_array function when passed a 1D numpy array in 0.16.1 · Issue #4877 · scikit-learn/scikit-learn · GitHub

undesirable behavior from check_array function when passed a 1D numpy array in 0.16.1 #4877


Closed
JLConawayII opened this issue Jun 19, 2015 · 14 comments

Comments

@JLConawayII

I'm going to be using Gaussian Mixture Models for my research, so I tried a few examples to see how the package works. When I ran the 1D Gaussian Mixture Example located at http://www.astroml.org/book_figures/chapter4/fig_GMM_1D.html it produced this error in PyCharm:

Traceback (most recent call last):
  File "/home/jconaway/Research/Kepler_Analysis_2/gaussian_mixture_example.py", line 86, in <module>
    logprob, responsibilities = M_best.score_samples(x)
  File "/home/jconaway/anaconda3/lib/python3.4/site-packages/sklearn/mixture/gmm.py", line 315, in score_samples
    raise ValueError('The shape of X is not compatible with self')
ValueError: The shape of X is not compatible with self

Process finished with exit code 1

I figured it should have worked as-is, so I did some exploring as to why it wasn't working correctly. I found that when M_best.score_samples(x) reaches X = check_array(X) on line 309 of gmm.py, it looks like it doesn't return the correct array. Here's some doodling around I did with some arbitrary arrays:

In [1]: import numpy as np

In [2]: x = np.linspace(-3,3,7)

In [3]: y = x[:,np.newaxis]

In [4]: x
Out[4]: array([-3., -2., -1.,  0.,  1.,  2.,  3.])

In [5]: y
Out[5]:
array([[-3.],
       [-2.],
       [-1.],
       [ 0.],
       [ 1.],
       [ 2.],
       [ 3.]])

In [6]: y.shape[1]
Out[6]: 1

In [7]: from sklearn.utils import validation

In [8]: Q = np.array([1,2,3,4,5])

In [9]: q = validation.check_array(Q)

In [10]: q
Out[10]: array([[1, 2, 3, 4, 5]])

In [11]: s = q[:,np.newaxis]

In [12]: s
Out[12]: array([[[1, 2, 3, 4, 5]]])

When I went back to the example and changed it to this:

s = x[:,np.newaxis]
logprob, responsibilities = M_best.score_samples(s)

it worked fine. That's as far as I've gotten with this. Hope it's helpful.
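
For readers who want to see the two shapes side by side, here is a minimal NumPy-only sketch. It assumes the promotion performed by check_array in 0.16.1 behaves like np.atleast_2d, which matches the Out[10] result above; np.atleast_2d is used purely as a stand-in here, and scikit-learn itself is not called.

```python
import numpy as np

x = np.linspace(-3, 3, 7)       # 1D input, shape (7,)

# Assumed stand-in for the promotion seen in Out[10]: a new leading axis.
as_row = np.atleast_2d(x)

# The workaround from the report: a new trailing axis.
as_column = x[:, np.newaxis]

print(as_row.shape)     # (1, 7)
print(as_column.shape)  # (7, 1)
```

x.reshape(-1, 1) produces the same (7, 1) array as x[:, np.newaxis].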

@amueller
Member

Thanks for the report. Currently support for 1d input is inconsistent, though it shouldn't raise such a bad error. In the future, giving 1d vectors will always raise an error telling you to do x[:, np.newaxis]. See also #4511.
Which version of scikit-learn are you using?

@JLConawayII
Author

So my impromptu fix to the example was the correct way to go about it. Good to know. I'm using Anaconda with scikit-learn 0.16.1 and Python 3.4.3

@amueller
Member

@xuewei4d maybe make sure that this doesn't happen in the new implementation... I am surprised that we give such a bad error :-/

@xuewei4d
Contributor

Sure. @amueller

I reproduced this. @JLConawayII, correct me if I am wrong.

The GMM is fit on data of shape (n, 1), then score_samples is called on data of shape (n,), which check_array transforms into (1, n). That is what raises the error.

If check_array in #4511 does raise an error, then we leave the problem to the user. When would you merge #4511, until 1.0 ? @amueller
But before the check_array change is merged, I think we'd better raise a warning whenever we encounter 1D data and transform it to (n, 1), or should we raise an error? The current implementation in the master branch does not deal with 1D data in _fit. It will eventually raise an error 'EM cannot converge' since X has only one data sample.
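
To make the shape bookkeeping in this comment concrete, here is a small sketch of how the two arrays would be read under the (n_samples, n_features) convention. It uses np.atleast_2d as an assumed stand-in for what check_array returned in 0.16.1; the variable names are illustrative only.

```python
import numpy as np

n = 7
x = np.linspace(-3, 3, n)

X_fit = x[:, np.newaxis]     # what the model was fit on: (n, 1)
X_score = np.atleast_2d(x)   # assumed stand-in for the promoted 1D input: (1, n)

for name, X in [("fit", X_fit), ("score", X_score)]:
    n_samples, n_features = X.shape
    print(f"{name}: {n_samples} sample(s), {n_features} feature(s)")
# fit: 7 sample(s), 1 feature(s)
# score: 1 sample(s), 7 feature(s)
```

The promoted array is read as a single sample with n features, which is why the feature counts disagree at scoring time and why fitting on it would see only one data sample.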

@amueller
Member

#4511 currently does the wrong thing and needs a rewrite. If you create a new class, please make sure that it will raise an error in all functions if a 1d array is given. Maybe I should rewrite #4511 :-/

@JLConawayII
Author

Yes that looks right. When it returns the array the dimensions are switched.

To me it seems like a strange choice to have the function work this way. Personally I wouldn't have a validation tool change the input array at all, but check_array seems to do several things at once. It seems especially unnecessary in this case, since immediately after the check_array function is called in score_samples there is code to convert the 1D array into a (n,1) array:

gmm.py
...
309         X = check_array(X)
310         if X.ndim == 1:
311             X = X[:, np.newaxis]
312         if X.size == 0:
313             return np.array([]), np.empty((0, self.n_components))
314         if X.shape[1] != self.means_.shape[1]:
315             raise ValueError('The shape of X is not compatible with self')

Lines 310-311 add an axis to the 1D array if necessary, and then lines 314-315 check for a shape mismatch. I'm not really sure what the intention was here.
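
A rough sketch of why lines 310-311 can never run once the array has been promoted to 2D. The function below is a hypothetical mirror of the excerpt, not scikit-learn code: np.atleast_2d is an assumed stand-in for check_array on line 309, n_features stands in for self.means_.shape[1], and the empty-input check on lines 312-313 is omitted.

```python
import numpy as np

def validate_like_gmm(X, n_features):
    """Hypothetical mirror of lines 309-315; not scikit-learn code."""
    X = np.atleast_2d(X)              # assumed stand-in for check_array (line 309)
    reshaped = False
    if X.ndim == 1:                   # lines 310-311: unreachable, X is already 2D
        X = X[:, np.newaxis]
        reshaped = True
    if X.shape[1] != n_features:      # lines 314-315
        raise ValueError('The shape of X is not compatible with self')
    return X, reshaped

# A column vector passes untouched, and the reshape branch is not taken.
X_ok, reshaped = validate_like_gmm(np.arange(5.0)[:, np.newaxis], n_features=1)
print(X_ok.shape, reshaped)           # (5, 1) False

# A 1D vector is promoted to (1, 5) first, so it fails the feature check.
try:
    validate_like_gmm(np.arange(5.0), n_features=1)
except ValueError as exc:
    print(exc)                        # The shape of X is not compatible with self
```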

@amueller
Member

This was an oversight on my part when I introduced the check_array function. It made many things much simpler, but it seems I overlooked this check (I had to edit all files). We are now switching to raising an error whenever a 1d array is passed, with a deprecation cycle.
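
As an illustration of what "raising an error, with a deprecation cycle" could look like, here is a minimal sketch. The function name check_2d and its strict flag are hypothetical and are not part of scikit-learn; the sketch only shows the general pattern of warning first while keeping the old behavior, then raising later.

```python
import warnings
import numpy as np

def check_2d(X, strict=False):
    """Hypothetical validator illustrating a deprecation cycle."""
    X = np.asarray(X)
    if X.ndim == 1:
        msg = ("Got a 1d array; reshape it explicitly with X[:, np.newaxis] "
               "(one feature) or X[np.newaxis, :] (one sample).")
        if strict:
            # End state of the cycle: 1d input is rejected outright.
            raise ValueError(msg)
        # During the cycle: warn, but keep the old promotion to (1, n).
        warnings.warn(msg, DeprecationWarning, stacklevel=2)
        X = np.atleast_2d(X)
    return X

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = check_2d(np.arange(3))

print(out.shape)                   # (1, 3)
print(caught[0].category.__name__) # DeprecationWarning
```

Callers who reshape explicitly never hit either branch, so their code keeps working unchanged before and after the switch.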

@xuewei4d
Contributor

@amueller Okay.

@jnothman
Member
jnothman commented Jul 2, 2015

So is there a resolution for this issue? Should a more specific issue be created?

@xuewei4d
Contributor
xuewei4d commented Jul 3, 2015

Before #4511 is merged, in the PR #4802 I am working on, the class will raise a ValueError on 1D data. @jnothman

@jnothman
Member
jnothman commented Jul 4, 2015

So this will be fixed by either of #4511 or #4802 being merged?

@xuewei4d
Contributor
xuewei4d commented Jul 7, 2015

Yes. I think so. Or we create another PR to fix it? @jnothman

@amueller
Member

Yes. Well the fixed version of #4511. The current version is bs.

@amueller
Member
amueller commented Sep 9, 2015

Fixed via #5152.

@amueller amueller closed this as completed Sep 9, 2015