[MRG+2] Fix K Means init center bug - Included test case #7872
Conversation
K-Means: Subtract X_means from initial centroids iff it's also subtracted from X

The bug happens when X is sparse and initial cluster centroids are given. In this case the means of each of X's columns are computed and subtracted from init for no reason. To reproduce:

import numpy as np
import scipy
from sklearn.cluster import KMeans
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

'''Get a local optimum'''
centers = KMeans(n_clusters=3).fit(X).cluster_centers_

'''Fit starting from a local optimum shouldn't change the solution'''
np.testing.assert_allclose(
    centers,
    KMeans(n_clusters=3, init=centers, n_init=1).fit(X).cluster_centers_
)

'''The same should be true when X is sparse, but wasn't before the bug fix'''
X_sparse = scipy.sparse.csr_matrix(X)
np.testing.assert_allclose(
    centers,
    KMeans(n_clusters=3, init=centers, n_init=1).fit(X_sparse).cluster_centers_
)
@jkarno can you see why Travis is not happy?
@agramfort Looks like there was one line that was too long, failing a pyflakes test, so I fixed that. The other failure seems to have been Travis hanging while downloading a certain package. It's passing now on a rebuild.
The AppVeyor failure is unrelated.
if hasattr(init, '__array__'):
    init = check_array(init, dtype=X.dtype.type, copy=True)
    _validate_center_shape(X, n_clusters, init)
I don't think this is correct. Even if X is sparse, we can still pass explicit initial centers.
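The point of the fix: dense X is mean-centered before the Lloyd iterations (for numerical precision), so user-provided centers must be shifted by the same mean; sparse X is never centered, so explicit centers must be left alone. A minimal sketch of that invariant, with a hypothetical helper name that is not the actual scikit-learn internal:

```python
import numpy as np
import scipy.sparse as sp


def prepare_init(X, init):
    """Hypothetical helper illustrating the fix: shift explicit initial
    centers only when X itself is mean-centered (i.e. only when X is dense).
    """
    init = np.array(init, dtype=float, copy=True)
    if sp.issparse(X):
        # Sparse data is not centered, so init must stay untouched.
        return init
    # Match the centering that is applied to dense X.
    return init - X.mean(axis=0)


X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
init = np.array([[1.0, 2.0], [5.0, 6.0]])

dense_init = prepare_init(X, init)              # shifted by column means [3, 4]
sparse_init = prepare_init(sp.csr_matrix(X), init)  # returned unchanged
```

With this guard, fitting from explicit centers behaves identically for dense and sparse input, which is exactly what the regression test in this PR asserts.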
np.testing.assert_allclose(
    centers,
    KMeans(n_clusters=3,
           init=centers,
This doesn't call validate_centers, right?
maybe add a test that an error is raised if init has the wrong shape (say 4 clusters)
    'performing only one init in k-means instead of n_init=%d'
    % n_init, RuntimeWarning, stacklevel=2)
n_init = 1
init -= X_mean
Only make this conditional on whether X is not sparse.
Sorry, could you clarify this again? Are you saying that it doesn't need to check if it's an array in order to subtract the mean? Or are you saying that this is the only line that should stay under the "is not sparse" check, as well as the array check?
Because I'm not sure how it should then handle the other cases of init being a string or a callable.
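The reviewer's suggestion keeps the existing branching on init's type intact: strings and callables already derive centers from the (possibly centered) data, so only the explicit-array branch needs the density check before subtracting the mean. A rough sketch of that control flow — simplified names and logic, not the actual scikit-learn source:

```python
import numpy as np
import scipy.sparse as sp


def resolve_init(X, init, n_clusters, rng):
    """Simplified illustration of dispatching on the type of `init`.

    Strings and callables produce centers from X itself; explicit arrays
    are copied and, for dense X only, shifted by the column means that
    were subtracted from X.
    """
    if isinstance(init, str) and init == "random":
        # Pick n_clusters distinct samples as starting centers.
        idx = rng.choice(X.shape[0], n_clusters, replace=False)
        centers = X[idx].toarray() if sp.issparse(X) else X[idx].copy()
    elif callable(init):
        centers = init(X, n_clusters, rng)
    else:
        centers = np.array(init, dtype=float, copy=True)
        if not sp.issparse(X):
            # Only dense X is mean-centered, so only then shift init.
            centers -= X.mean(axis=0)
    return centers


rng = np.random.default_rng(0)
X = np.arange(12.0).reshape(6, 2)
init = np.array([[0.0, 0.0], [2.0, 2.0]])

dense_centers = resolve_init(X, init, 2, rng)               # shifted
sparse_centers = resolve_init(sp.csr_matrix(X), init, 2, rng)  # unchanged
random_centers = resolve_init(X, "random", 2, rng)
```

So the string and callable code paths are untouched; the sparsity check only guards the `init -= X_mean` line, as the reviewer proposed.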
# Test that a ValueError is raised for validate_center_shape
classifier = KMeans(n_clusters=3, init=centers, n_init=1)
assert_raises(ValueError, classifier.fit, X)
you could use assert_raise_message to be more specific.
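The stronger assertion checks the error message, not just the exception type. A minimal pure-Python stand-in for scikit-learn's test helper of that name (the real one lives in the project's test utilities; `fit_with_bad_centers` below is a hypothetical stand-in for fitting with a mis-shaped init):

```python
def assert_raise_message(exception, message, func, *args, **kwargs):
    """Check that func(*args, **kwargs) raises `exception` AND that the
    error message contains `message` as a substring."""
    try:
        func(*args, **kwargs)
    except exception as exc:
        assert message in str(exc), (
            "%r not found in %r" % (message, str(exc)))
    else:
        raise AssertionError("%s not raised" % exception.__name__)


def fit_with_bad_centers():
    # Hypothetical stand-in for KMeans(n_clusters=3, init=centers).fit(X)
    # where `centers` has 4 rows instead of 3.
    raise ValueError("The shape of the initial centers (4, 2) does not "
                     "match the number of clusters 3")


assert_raise_message(ValueError, "does not match the number of clusters",
                     fit_with_bad_centers)
```

Compared to a bare `assert_raises(ValueError, ...)`, this fails if the ValueError comes from some unrelated validation step, which is the point of the reviewer's request.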
LGTM
Please address @amueller's comment regarding a stronger assertion. Otherwise LGTM. Please add an entry to what's new.
@@ -88,6 +88,10 @@ Bug fixes
- Tree splitting criterion classes' cloning/pickling is now memory safe
  :issue:`7680` by `Ibraim Ganiev`_.

- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a sparse array
  X and initial centroids, where X's means were unnecessarily being subtracted from
Please keep line length < 80 chars where possible
sorry to be nitpicky, but if I get you to fix it up once you hopefully won't forget it next time
@@ -88,6 +88,10 @@ Bug fixes
- Tree splitting criterion classes' cloning/pickling is now memory safe
  :issue:`7680` by `Ibraim Ganiev`_.

- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a
  sparse array X and initial centroids, where X's means were unnecessarily
  being subtracted from the centroids.
attribution? issue number?
Is this what you wanted for the attribution? I don't have a link associated with my name, so I didn't include the link markdown for my name.
Seems like there is a genuine error on AppVeyor on Python 2.7 64bit on Windows.
By default we use links to GitHub, i.e. https://github.com/jkarno in your case.
@@ -824,3 +824,47 @@ def test_KMeans_init_centers():
    km = KMeans(init=init_centers_test, n_clusters=3, n_init=1)
    km.fit(X_test)
    assert_equal(False, np.may_share_memory(km.cluster_centers_, init_centers))
def test_sparse_KMeans_init_centers(): |
Wow flake8 doesn't enforce naming conventions, I did not know that.
Actually I did not read the error message very well, the problem is that the message has additional L
@@ -88,6 +88,10 @@ Bug fixes
- Tree splitting criterion classes' cloning/pickling is now memory safe
  :issue:`7680` by `Ibraim Ganiev`_.

- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a
  sparse array X and initial centroids, where X's means were unnecessarily
  being subtracted from the centroids. :issue:`7872` by Josh Karnofsky
you can use by :user:`Josh Karnofsky <jkarno>`
Thanks @jkarno!
Reference Issue
Fixes #6740 and builds upon #6741 with additional test case
What does this implement/fix? Explain your changes.
This takes the previous PR and adds the test case described by the user. It also resolves conflicts with the master branch.
Any other comments?
I added the test case described by the previous user. Please let me know if there are other necessary test cases to be handled.
Also, apologies for the slow update, I was traveling throughout this week.