8000 [MRG] Error for cosine affinity when zero vectors present by mthorrell · Pull Request #7943 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG] Error for cosine affinity when zero vectors present #7943

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jun 21, 2019

Conversation

mthorrell
Copy link
Contributor

Reference Issue

Fixes #7689

What does this implement/fix? Explain your changes.

Previously, code would hang/create memory leak when trying to agglomerate observations/features when using cosine distance and zero vectors. Now an error is produced in this setting.

Cosine distance is generally undefined when taking distances to zero vectors--and in fact different answers are produced depending on the code used scipy's pdist vs. sklearn's cosine_distances vs. sklearn's paired_cosine_distances.

Any other comments?

@GaelVaroquaux
Copy link
Member

I think that this is not the right fix.

There are two problems here: one that cosine is not well defined when vectors are null, and the other that hierarchical clustering is buggy.

We might consider to do bugware and implement such a check, to save the users' time (I stress the "might", as I would really be bugware).

If we do this, we should:

  1. Use "np.any" instead of the second np.sum and comparison
  2. Add a test
  3. Create an issue in scipy's tracker as, if I don't get it wrong, the hierarchical-clustering bug is there.

@mthorrell
Copy link
Contributor Author

Thanks for the feedback. I apologize if I am making things harder than they need to be. I still have a few questions:

Cosine is not well defined --

  • Should we make a separate issue for this at least for the agglomeration setting? So that we can get consensus on what cosine distance to zero should be in this setting?
  • I was planning on making even a different issue that points out that paired_cosine_distances and cosine_distances in metrics.pairwise handle zeros differently to hopefully resolve a part of this issue more generally.

Hierarchical Clustering is buggy --

  • An update to scipy as described here scipy#6774 will essentially produce the same output as this pull request: it will create an Error when a distance is nan or Inf. As far as I can tell, this is where they will leave their implementation of hierarchical clustering.
  • With this in mind, would you still call scipy's clustering buggy?
  • Also with this in mind (and assuming this is the desired behavior), is kind of preempting the scipy fix--as I have done with this PR--appropriate?

@mthorrell
Copy link
Contributor Author

I made the additions (1) and (2) in your list, @GaelVaroquaux.

Regarding (3), as I mentioned in my previous post, I don't believe there to be a remaining bug in Scipy's hierarchical-clustering.

What I propose is: we put this merge on hold while we create an issue regarding: What should Agglomerative Clustering do when using Cosine Distance with zero vectors? Once this gets resolved, we can either go forward with this fix... or we can proceed with whatever gets decided.

If this sounds reasonable, I can submit this new Issue.

# 'cosine' affinity is used
X = np.array([[0, 1],
[0, 0]])
assert_raises(ValueError, linkage_tree, X, affinity='cosine')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use assert_raises_message

@@ -379,6 +379,10 @@ def linkage_tree(X, connectivity=None, n_components=None,
'Unknown linkage option, linkage should be one '
'of %s, but %s was given' % (linkage_choices.keys(), linkage))

if np.sum(1 - np.any(X, axis=1)) > 0 and affinity == 'cosine':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put the faster condition first.

@amueller
Copy link
Member
amueller commented Dec 1, 2016

@GaelVaroquaux why do you consider this bugware (also I'm not sure what bugware is).
Erroring on invalid input seems ok for me?

If we agree this solution is ok, then LGTM from a code perspective.

@amueller
Copy link
Member

ping @GaelVaroquaux again?

< 8000 a class="d-inline-block" data-hovercard-type="user" data-hovercard-url="/users/rth/hovercard" data-octo-click="hovercard-link-click" data-octo-dimensions="link_type:self" href="/rth">@rth
Copy link
Member
rth commented Jun 20, 2019

re "ping @GaelVaroquaux again?"

If no objections +1 to merge.

@@ -379,6 +379,10 @@ def linkage_tree(X, connectivity=None, n_components=None,
'Unknown linkage option, linkage should be one '
'of %s, but %s was given' % (linkage_choices.keys(), linkage))

if affinity == 'cosine' and np.sum(1 - np.any(X, axis=1)) > 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate to be the person always complaining, but "np.sum(1 - np.any(X, axis=1)) > 0" is a strange test. What exactly is it testing? That X is zero everywhere?

Copy link
Contributor Author
@mthorrell mthorrell Jun 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sure there are no rows with all zeros.

[[0,0],
[1,1]] 

for example will produce this error, but the following example won't.

[[1,0],
[1,1]] 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ~ instead of 1- and use np.any instead of sum > 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it would be readable as "any rows without any values"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have written that "np.any(np.all(X == 0, axis=1))". But what is written here is fine.

Thanks!

8000

@GaelVaroquaux GaelVaroquaux self-requested a review June 21, 2019 13:08
Copy link
Member
@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@GaelVaroquaux
Copy link
Member

+1 for merge once CI has ran.

@codecov
Copy link
codecov bot commented Jun 21, 2019

Codecov Report

Merging #7943 into master will increase coverage by 0.06%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7943      +/-   ##
==========================================
+ Coverage   96.19%   96.26%   +0.06%     
==========================================
  Files         348      401      +53     
  Lines       64645    72877    +8232     
  Branches        0     7895    +7895     
==========================================
+ Hits        62187    70154    +7967     
- Misses       2458     2699     +241     
- Partials        0       24      +24
Impacted Files Coverage Δ
sklearn/cluster/hierarchical.py 99.69% <ø> (+0.05%) ⬆️
sklearn/cluster/tests/test_hierarchical.py 100% <100%> (ø) ⬆️
sklearn/__init__.py 67.64% <0%> (-30.09%) ⬇️
sklearn/ensemble/gradient_boosting.py 68.12% <0%> (-27.97%) ⬇️
sklearn/tests/test_common.py 85.83% <0%> (-10.64%) ⬇️
sklearn/_build_utils/__init__.py 52% <0%> (-9.23%) ⬇️
sklearn/datasets/covtype.py 46% <0%> (-6.18%) ⬇️
sklearn/feature_extraction/hashing.py 95.34% <0%> (-4.66%) ⬇️
sklearn/datasets/olivetti_faces.py 29.26% <0%> (-4.07%) ⬇️
sklearn/decomposition/sparse_pca.py 96.1% <0%> (-3.9%) ⬇️
... and 400 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3e29334...67f3fdf. Read the comment docs.

@GaelVaroquaux GaelVaroquaux merged commit 3d99769 into scikit-learn:master Jun 21, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
…rn#7943)

* cosine affinity cannot be used when X contains zero vectors

* fixed issue with tabs spaces

* changed to np.any and created a test for this new ValueError

* use assert_raise_message and flipped order of if conditions

* fixed 0 row calculation
@jnothman jnothman mentioned this pull request Oct 28, 2019
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AgglomerativeClustering with metric='cosine' broken for all-zero rows
5 participants
0