[MRG+2] Fix K Means init center bug - Included test case #7872
Conversation
K-Means: Subtract X_means from initial centroids iff it's also subtracted from X

The bug happens when X is sparse and initial cluster centroids are given. In this case the means of each of X's columns are computed and subtracted from init for no reason. To reproduce:

import numpy as np
import scipy
from sklearn.cluster import KMeans
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

'''Get a local optimum'''
centers = KMeans(n_clusters=3).fit(X).cluster_centers_

'''Fit starting from a local optimum shouldn't change the solution'''
np.testing.assert_allclose(
    centers,
    KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
)

'''The same should be true when X is sparse, but wasn't before the bug fix'''
X_sparse = scipy.sparse.csr_matrix(X)
np.testing.assert_allclose(
    centers,
    KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
)
@jkarno can you see why Travis is not happy?
@agramfort Looks like there was one line that was too long, failing a pyflakes test, so I fixed that. The other failure seems to have been Travis hanging while downloading a certain package. It's passing now on a rebuild.
The AppVeyor failure is unrelated.
if hasattr(init, '__array__'):
    init = check_array(init, dtype=X.dtype.type, copy=True)
    _validate_center_shape(X, n_clusters, init)
I don't think this is correct. Even if X is sparse, we can still pass explicit initial centers.
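The point of the fix: dense X is mean-centered before the Lloyd iterations (for numerical precision), so user-provided centers must be shifted by the same mean; sparse X is never centered, so explicit centers must be left alone. A minimal sketch of that invariant, with a hypothetical helper name that is not the actual scikit-learn internal:

```python
import numpy as np
import scipy.sparse as sp


def prepare_init(X, init):
    """Hypothetical helper illustrating the fix: shift explicit initial
    centers only when X itself is mean-centered (i.e. only when X is dense).
    """
    init = np.array(init, dtype=float, copy=True)
    if sp.issparse(X):
        # Sparse data is not centered, so init must stay untouched.
        return init
    # Match the centering that is applied to dense X.
    return init - X.mean(axis=0)


X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
init = np.array([[1.0, 2.0], [5.0, 6.0]])

dense_init = prepare_init(X, init)              # shifted by column means [3, 4]
sparse_init = prepare_init(sp.csr_matrix(X), init)  # returned unchanged
```

With this guard, fitting from explicit centers behaves identically for dense and sparse input, which is exactly what the regression test in this PR asserts.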
np.testing.assert_allclose(
    centers,
    KMeans(n_clusters=3,
           init=centers,
This doesn't call validate_centers, right?
maybe add a test that an error is raised if init has the wrong shape (say 4 clusters)
    'performing only one init in k-means instead of n_init=%d'
    % n_init, RuntimeWarning, stacklevel=2)
n_init = 1
init -= X_mean
Only make this conditional on whether X is not sparse.
Sorry, could you clarify this again? Are you saying that it doesn't need to check if it's an array in order to subtract the mean? Or are you saying that this is the only line that should stay under the "is not sparse" check, as well as the array check?
Because I'm not sure how it should then handle the other cases of init being a string or a callable.
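The reviewer's suggestion keeps the existing branching on init's type intact: strings and callables already derive centers from the (possibly centered) data, so only the explicit-array branch needs the density check before subtracting the mean. A rough sketch of that control flow — simplified names and logic, not the actual scikit-learn source:

```python
import numpy as np
import scipy.sparse as sp


def resolve_init(X, init, n_clusters, rng):
    """Simplified illustration of dispatching on the type of `init`.

    Strings and callables produce centers from X itself; explicit arrays
    are copied and, for dense X only, shifted by the column means that
    were subtracted from X.
    """
    if isinstance(init, str) and init == "random":
        # Pick n_clusters distinct samples as starting centers.
        idx = rng.choice(X.shape[0], n_clusters, replace=False)
        centers = X[idx].toarray() if sp.issparse(X) else X[idx].copy()
    elif callable(init):
        centers = init(X, n_clusters, rng)
    else:
        centers = np.array(init, dtype=float, copy=True)
        if not sp.issparse(X):
            # Only dense X is mean-centered, so only then shift init.
            centers -= X.mean(axis=0)
    return centers


rng = np.random.default_rng(0)
X = np.arange(12.0).reshape(6, 2)
init = np.array([[0.0, 0.0], [2.0, 2.0]])

dense_centers = resolve_init(X, init, 2, rng)               # shifted
sparse_centers = resolve_init(sp.csr_matrix(X), init, 2, rng)  # unchanged
random_centers = resolve_init(X, "random", 2, rng)
```

So the string and callable code paths are untouched; the sparsity check only guards the `init -= X_mean` line, as the reviewer proposed.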
# Test that a ValueError is raised for validate_center_shape
classifier = KMeans(n_clusters=3, init=centers, n_init=1)
assert_raises(ValueError, classifier.fit, X)
you could use assert_raise_message to be more specific.
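The stronger assertion checks the error message, not just the exception type. A minimal pure-Python stand-in for scikit-learn's test helper of that name (the real one lives in the project's test utilities; `fit_with_bad_centers` below is a hypothetical stand-in for fitting with a mis-shaped init):

```python
def assert_raise_message(exception, message, func, *args, **kwargs):
    """Check that func(*args, **kwargs) raises `exception` AND that the
    error message contains `message` as a substring."""
    try:
        func(*args, **kwargs)
    except exception as exc:
        assert message in str(exc), (
            "%r not found in %r" % (message, str(exc)))
    else:
        raise AssertionError("%s not raised" % exception.__name__)


def fit_with_bad_centers():
    # Hypothetical stand-in for KMeans(n_clusters=3, init=centers).fit(X)
    # where `centers` has 4 rows instead of 3.
    raise ValueError("The shape of the initial centers (4, 2) does not "
                     "match the number of clusters 3")


assert_raise_message(ValueError, "does not match the number of clusters",
                     fit_with_bad_centers)
```

Compared to a bare `assert_raises(ValueError, ...)`, this fails if the ValueError comes from some unrelated validation step, which is the point of the reviewer's request.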
LGTM
Please address @amueller's comment regarding a stronger assertion. Otherwise LGTM. Please add an entry to what's new.
@@ -88,6 +88,10 @@ Bug fixes
- Tree splitting criterion classes' cloning/pickling is now memory safe
  :issue:`7680` by `Ibraim Ganiev`_.

- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a sparse array
  X and initial centroids, where X's means were unnecessarily being subtracted from
Please keep line length < 80 chars where possible
sorry to be nitpicky, but if I get you to fix it up once you hopefully won't forget it next time
@@ -88,6 +88,10 @@ Bug fixes
- Tree splitting criterion classes' cloning/pickling is now memory safe
  :issue:`7680` by `Ibraim Ganiev`_.

- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a
  sparse array X and initial centroids, where X's means were unnecessarily
  being subtracted from the centroids.
attribution? issue number?
Is this what you wanted for the attribution? I don't have a link associated with my name, so I didn't include the link markdown for my name.
Seems like there is a genuine error on AppVeyor on Python 2.7 64bit on Windows.
By default we use links to GitHub, i.e. https://github.com/jkarno in your case.
@@ -824,3 +824,47 @@ def test_KMeans_init_centers():
    km = KMeans(init=init_centers_test, n_clusters=3, n_init=1)
    km.fit(X_test)
    assert_equal(False, np.may_share_memory(km.cluster_centers_, init_centers))
def test_sparse_KMeans_init_centers(): |
Wow flake8 doesn't enforce naming conventions, I did not know that.
Actually I did not read the error message very well, the problem is that the message has additional L
@@ -88,6 +88,10 @@ Bug fixes
- Tree splitting criterion classes' cloning/pickling is now memory safe
  :issue:`7680` by `Ibraim Ganiev`_.

- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a
  sparse array X and initial centroids, where X's means were unnecessarily
  being subtracted from the centroids. :issue:`7872` by Josh Karnofsky
you can use by :user:`Josh Karnofsky <jkarno>`
Thanks @jkarno!
Reference Issue
Fixes #6740 and builds upon #6741 with additional test case
What does this implement/fix? Explain your changes.
This takes the previous PR and adds the test case described by the user. It also resolves conflicts with the master branch.
Any other comments?
I added the test case described by the previous user. Please let me know if there are other necessary test cases to be handled.
Also, apologies for the slow update, I was traveling throughout this week.