-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG+2] Fix sparse matrix handling in clustering #4052
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+2] Fix sparse matrix handling in clustering #4052
Conversation
@jnothman
|
77a4e13
to
de91554
Compare
Huh? Oh, perhaps that's the only part of the implementation I didn't test! :s |
You can never test enough ;) |
No, had nothing to do with parts I didn't test. It's an edge case. That scipy.sparse doesn't seem to handle well... Until recently there was no support for sparse matrices with a zero-dimension. Apart from that, it seems |
Meh :-/ Do you want to look into it / do a fix? I'm not super familiar with the DBSCAN code. |
def is_supervised(estimator): | ||
return (isinstance(estimator, ClassifierMixin) | ||
or isinstance(estimator, RegressorMixin) | ||
# transformers can all take a y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
y, oh y?
more tests with less code :D |
Regrding the edge case of slicing with an empty index array: this is unsupported in scipy 0.13, but is fixed in master. So perhaps the solution is to special-case it. If we land up with no core samples, create a dense array of the correct shape (but with no data)! Or, could special-case it only if scipy raises an error. |
maybe just special-case it all the time and comment saying it is scipy backward compatibility? |
9925a98
to
c71ebb2
Compare
fixed the dbscan issue. |
ae632fc
to
75f2f5e
Compare
@@ -78,6 +77,19 @@ def test_dbscan_sparse(): | |||
assert_array_equal(labels_dense, labels_sparse) | |||
|
|||
|
|||
def test_dbscan_no_score_samples(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*no_core_samples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Maybe that's where the missing s went from a very different patch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha I bet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment has not been addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Damn. I forgot to push at some point or a rebase messed it up :-/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Except for |
The |
But I can also ditch the other one and we just use this. |
the problem is that I used #4058 elsewhere, too. |
04a4364
to
2345d23
Compare
Ah sorry, I'd forgotten this was on top of #4058 On 12 January 2015 at 05:45, Andreas Mueller notifications@github.com
|
2345d23
to
3e3fdd3
Compare
done |
3e3fdd3
to
93ed5d8
Compare
X[X < .8] = 0 | ||
|
||
db = DBSCAN().fit(X) | ||
assert_array_equal(db.components_, np.empty((0, X.shape[1]))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that all samples are clustered as outliers in that case? It would be great to add an assertion to check that here, first by asserting that core_sample_indices_
has shape (0,)
and second that labels_
is filed with -1
s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Aside from this last minor comment, +1 on my side as well. |
Make sparsity check check everything. don't test everything. That would be nice but is out of scope :-/ catch special case of no core samples in DBSCAN add nonregression test for sparse dbscan with no core samples.
93ed5d8
to
494b8e5
Compare
will merge if travis agrees with me. |
[MRG+2] Fix sparse matrix handling in clustering
See #4051.
Needs fixing: