8000 [MRG+2] Fix sparse matrix handling in clustering by amueller · Pull Request #4052 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG+2] Fix sparse matrix handling in clustering #4052

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 16, 2015

Conversation

amueller
Copy link
Member
@amueller amueller commented Jan 6, 2015

See #4051.

Needs fixing:

  • MeanShift
  • DBSCAN
  • AffinityPropagation

@amueller
Copy link
Member Author
amueller commented Jan 6, 2015

@jnothman
The issue in DBSCAN is here:

> /home/andy/checkout/scikit-learn/sklearn/cluster/dbscan_.py(234)fit()
-> self.components_ = X[self.core_sample_indices_].copy()
(Pdb) p self.core_sample_indices_
array([], dtype=int64)

@jnothman
Copy link
Member
jnothman commented Jan 6, 2015

Huh? Oh, perhaps that's the only part of the implementation I didn't test! :s

@amueller
Copy link
Member Author
amueller commented Jan 6, 2015

You can never test enough ;)

@jnothman
Copy link
Member
jnothman commented Jan 6, 2015

No, had nothing to do with parts I didn't test. It's an edge case. That scipy.sparse doesn't seem to handle well... Until recently there was no support for sparse matrices with a zero-dimension. Apart from that, it seems csr_matrix.__getitem__ doesn't handle the edge case of an empty index array

@jnothman jnothman closed this Jan 6, 2015
@jnothman jnothman reopened this Jan 6, 2015
@amueller
Copy link
Member Author
amueller commented Jan 6, 2015

Meh :-/ Do you want to look into it / do a fix? I'm not super familiar with the DBSCAN code.

def is_supervised(estimator):
return (isinstance(estimator, ClassifierMixin)
or isinstance(estimator, RegressorMixin)
# transformers can all take a y
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

y, oh y?

@amueller
Copy link
Member Author
amueller commented Jan 6, 2015

more tests with less code :D

@jnothman
Copy link
Member
jnothman commented Jan 6, 2015

Regrding the edge case of slicing with an empty index array: this is unsupported in scipy 0.13, but is fixed in master. So perhaps the solution is to special-case it. If we land up with no core samples, create a dense array of the correct shape (but with no data)! Or, could special-case it only if scipy raises an error.

@amueller
Copy link
Member Author
amueller commented Jan 6, 2015

maybe just special-case it all the time and comment saying it is scipy backward compatibility?

@amueller amueller force-pushed the clustering_sparse_matrix branch 2 times, most recently from 9925a98 to c71ebb2 Compare January 6, 2015 23:14
@amueller
Copy link
Member Author
amueller commented Jan 7, 2015

fixed the dbscan issue.

@amueller amueller force-pushed the clustering_sparse_matrix branch from ae632fc to 75f2f5e Compare January 9, 2015 18:21
@amueller amueller changed the title [WIP] TST add test for sparse matrix handling in clustering [MRG] TST add test for sparse matrix handling in clustering Jan 9, 2015
@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) when pulling 75f2f5e on amueller:clustering_sparse_matrix into 69567ec on scikit-learn:master.

8000
@@ -78,6 +77,19 @@ def test_dbscan_sparse():
assert_array_equal(labels_dense, labels_sparse)


def test_dbscan_no_score_samples():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*no_core_samples

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Maybe that's where the missing s went from a very different patch)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha I bet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment has not been addressed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn. I forgot to push at some point or a rebase messed it up :-/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jnothman
Copy link
Member

Except for check_sparsify_multiclass_classifier needing a better name, and the special-casing in dbscan needing a comment, this LGTM.

@amueller
Copy link
Member Author

The check_sparsify has a better name in the other PR ;) I need to rebase this one, sorry.

@amueller
Copy link
Member Author

But I can also ditch the other one and we just use this.

@amueller
Copy link
Member Author

the problem is that I used #4058 elsewhere, too.

@amueller amueller force-pushed the clustering_sparse_matrix branch 2 times, most recently from 04a4364 to 2345d23 Compare January 11, 2015 18:56
@amueller amueller changed the title [MRG] TST add test for sparse matrix handling in clustering [MRG + 1] Fix sparse matrix handling in clustering Jan 11, 2015
@jnothman
Copy link
Member

Ah sorry, I'd forgotten this was on top of #4058

On 12 January 2015 at 05:45, Andreas Mueller notifications@github.com
wrote:

The check_sparsify has a better name in the other PR ;) I need to rebase
this one, sorry.


Reply to this email directly or view it on GitHub
#4052 (comment)
.

@ogrisel ogrisel changed the title [MRG + 1] Fix sparse matrix handling in clustering [MRG+1] Fix sparse matrix handling in clustering Jan 15, 2015
@ogrisel
Copy link
Member
ogrisel commented Jan 15, 2015

#4058 has been merged but the diff view of this PR does not reflect it. @amueller can you please rebase this on top of the current master and push -f to get github to compute a new diff (and have travis re-run the tests just to be sure?).

@amueller amueller force-pushed the clustering_sparse_matrix branch from 2345d23 to 3e3fdd3 Compare January 15, 2015 16:16
@amueller
Copy link
Member Author

done

@amueller amueller force-pushed the clustering_sparse_matrix branch from 3e3fdd3 to 93ed5d8 Compare January 15, 2015 17:04
X[X < .8] = 0

db = DBSCAN().fit(X)
assert_array_equal(db.components_, np.empty((0, X.shape[1])))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that all samples are clustered as outliers in that case? It would be great to add an assertion to check that here, first by asserting that core_sample_indices_ has shape (0,) and second that labels_ is filed with -1s.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@ogrisel
Copy link
Member
ogrisel commented Jan 16, 2015

Aside from this last minor comment, +1 on my side as well.

Make sparsity check check everything.

don't test everything. That would be nice but is out of scope :-/

catch special case of no core samples in DBSCAN

add nonregression test for sparse dbscan with no core samples.
@amueller amueller force-pushed the clustering_sparse_matrix branch from 93ed5d8 to 494b8e5 Compare January 16, 2015 17:03
@amueller
Copy link
Member Author

will merge if travis agrees with me.

@amueller amueller changed the title [MRG+1] Fix sparse matrix handling in clustering [MRG+2] Fix sparse matrix handling in clustering Jan 16, 2015
amueller added a commit that referenced this pull request Jan 16, 2015
[MRG+2] Fix sparse matrix handling in clustering
@amueller amueller merged commit bf203de into scikit-learn:master Jan 16, 2015
@amueller amueller deleted the clustering_sparse_matrix branch January 16, 2015 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0