[MRG] Error for cosine affinity when zero vectors present #7943
Conversation
I think that this is not the right fix. There are two problems here: one is that cosine is not well defined when vectors are null, and the other is that hierarchical clustering is buggy. We might consider doing bugware and implementing such a check, to save users' time (I stress the "might", as it would really be bugware). If we do this, we should:
Thanks for the feedback. I apologize if I am making things harder than they need to be. I still have a few questions:
Cosine is not well defined --
Hierarchical clustering is buggy --
I made the additions (1) and (2) in your list, @GaelVaroquaux. Regarding (3), as I mentioned in my previous post, I don't believe there is a remaining bug in SciPy's hierarchical clustering. What I propose is: we put this merge on hold while we create an issue asking what Agglomerative Clustering should do when using cosine distance with zero vectors. Once that gets resolved, we can either go forward with this fix, or proceed with whatever gets decided. If this sounds reasonable, I can submit this new issue.
# 'cosine' affinity is used
X = np.array([[0, 1],
              [0, 0]])
assert_raises(ValueError, linkage_tree, X, affinity='cosine')
please use assert_raises_message
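For reference, a minimal sketch of what such a test could look like, assuming sklearn.utils.testing.assert_raise_message (the testing helper available in scikit-learn at the time) and the module path used by the version this PR targets; the "cosine" message fragment below is a placeholder assumption, not the PR's exact error text.

# Sketch only; the message fragment "cosine" is an assumption, not the PR's exact text.
import numpy as np
from sklearn.cluster.hierarchical import linkage_tree
from sklearn.utils.testing import assert_raise_message

X = np.array([[0, 1],
              [0, 0]])  # second row is a zero vector

# Checks both the exception type and that the message contains the given substring.
assert_raise_message(ValueError, "cosine", linkage_tree, X, affinity="cosine")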
sklearn/cluster/hierarchical.py
Outdated
@@ -379,6 +379,10 @@ def linkage_tree(X, connectivity=None, n_components=None,
        'Unknown linkage option, linkage should be one '
        'of %s, but %s was given' % (linkage_choices.keys(), linkage))

    if np.sum(1 - np.any(X, axis=1)) > 0 and affinity == 'cosine':
Please put the faster condition first.
@GaelVaroquaux why do you consider this bugware (also, I'm not sure what bugware is)? If we agree this solution is OK, then LGTM from a code perspective.
ping @GaelVaroquaux again?
re "ping @GaelVaroquaux again?" If no objections, +1 to merge.
sklearn/cluster/hierarchical.py
Outdated
@@ -379,6 +379,10 @@ def linkage_tree(X, connectivity=None, n_components=None,
        'Unknown linkage option, linkage should be one '
        'of %s, but %s was given' % (linkage_choices.keys(), linkage))

    if affinity == 'cosine' and np.sum(1 - np.any(X, axis=1)) > 0:
I hate to be the person always complaining, but "np.sum(1 - np.any(X, axis=1)) > 0" is a strange test. What exactly is it testing? That X is zero everywhere?
It makes sure there are no rows with all zeros. For example,
[[0, 0],
 [1, 1]]
will produce this error, but the following example won't:
[[1, 0],
 [1, 1]]
Use ~ instead of 1 -, and use np.any instead of sum > 0.
Then it would be readable as "any rows without any values"
I would have written that as "np.any(np.all(X == 0, axis=1))". But what is written here is fine.
Thanks!
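As a side note, here is a small standalone sketch (not from the PR) comparing the formulations discussed above; all three detect whether X contains an all-zero row:

import numpy as np

X = np.array([[0, 0],
              [1, 1]])

# Original formulation: count the rows for which np.any(...) is False.
has_zero_row_sum = np.sum(1 - np.any(X, axis=1)) > 0

# Suggested rewrite: "any rows without any values".
has_zero_row_any = np.any(~X.any(axis=1))

# Alternative spelling: any row that is zero everywhere.
has_zero_row_all = np.any(np.all(X == 0, axis=1))

print(has_zero_row_sum, has_zero_row_any, has_zero_row_all)  # True True True
# For [[1, 0], [1, 1]] all three checks would be False.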
LGTM.
+1 for merge once CI has run.
Codecov Report
@@ Coverage Diff @@
## master #7943 +/- ##
==========================================
+ Coverage 96.19% 96.26% +0.06%
==========================================
Files 348 401 +53
Lines 64645 72877 +8232
Branches 0 7895 +7895
==========================================
+ Hits 62187 70154 +7967
- Misses 2458 2699 +241
- Partials 0 24 +24
Continue to review full report at Codecov.
…rn#7943)
* cosine affinity cannot be used when X contains zero vectors
* fixed issue with tabs spaces
* changed to np.any and created a test for this new ValueError
* use assert_raise_message and flipped order of if conditions
* fixed 0 row calculation
Reference Issue
Fixes #7689
What does this implement/fix? Explain your changes.
Previously, the code would hang or leak memory when agglomerating observations/features using cosine distance on data containing zero vectors. Now an error is raised in this setting.
Cosine distance is generally undefined when taking distances to zero vectors; in fact, different answers are produced depending on the code used: scipy's pdist vs. sklearn's cosine_distances vs. sklearn's paired_cosine_distances.
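To illustrate that inconsistency, a small sketch (not part of the PR; exact outputs and warnings depend on the scipy/scikit-learn versions in use):

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import cosine_distances, paired_cosine_distances

X = np.array([[0., 0.],
              [1., 1.]])  # first row is a zero vector

# scipy computes 1 - u.v / (||u|| * ||v||); with a zero vector this is 0/0,
# typically yielding NaN (possibly with a runtime warning).
print(pdist(X, metric='cosine'))

# sklearn's cosine_distances normalizes rows first and handles zero rows specially,
# so it returns a finite number rather than NaN.
print(cosine_distances(X)[0, 1])

# paired_cosine_distances uses yet another formulation and can give a different value again.
print(paired_cosine_distances(X[:1], X[1:]))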
Any other comments?