
AgglomerativeClustering with metric='cosine' broken for all-zero rows #7689


Closed
weixuanfu opened this issue Oct 17, 2016 · 22 comments · Fixed by #7943

@weixuanfu
weixuanfu commented Oct 17, 2016

Original title: "Cosine" affinity type in FeatureAgglomeration somehow causes memory overflow in a particular dataset

Description

Please carefully test the code below. Using the "cosine" affinity type in FeatureAgglomeration, the code causes a memory overflow with a particular dataset (download here). Other affinity types are fine, and the issue cannot be reproduced with simulated data (e.g. from make_classification in scikit-learn). Not sure why this happens.

Steps/Code to Reproduce

from sklearn.cluster import FeatureAgglomeration
import numpy as np
import time

train_data = np.genfromtxt('fold_2_trainFeatVec.csv', delimiter=',')
train_labels = np.genfromtxt('fold_2_trainLabels.csv', delimiter=',')

fa = FeatureAgglomeration(affinity="cosine", linkage="average")  # same problem with linkage="complete"
time_start = time.time()
fa.fit(train_data, train_labels)  # memory keeps increasing here
time_end = time.time()
print('Time usage:', time_end - time_start)

@amueller
Member

And with another affinity that's not a problem?
How large is the dataset?

@weixuanfu
Author

Yep, other affinity types have no problem. The training dataset has more than 4000 features and a sample size of ~900.


@jnothman
Member
jnothman commented Nov 9, 2016

The infinite loop occurs in _hc_get_descendent. I assume there's a cycle in the "tree". Ping @GaelVaroquaux

@jnothman
Member
jnothman commented Nov 9, 2016

(if it is truly a result of cosine, this might relate to it not being a proper distance; otherwise, it might just be that this data happened to create the problem with cosine, but another metric could be a problem for another dataset.)

@GaelVaroquaux
Member
GaelVaroquaux commented Nov 9, 2016 via email

@jnothman
Member
jnothman commented Nov 9, 2016

Well, yes, I get a segfault when two features are identical, e.g.

train_data = [[1, 0, 0], [.5, 0, 0], [0, 0, 0]]
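
Expanded into a self-contained sketch for reference (illustrative only; affinity is the parameter name used in scikit-learn at the time of this issue, and the behaviour depends on the installed scipy):

import numpy as np
from sklearn.cluster import FeatureAgglomeration

# Columns 1 and 2 are all zeros, so their cosine distances are undefined.
train_data = np.array([[1.0, 0.0, 0.0],
                       [0.5, 0.0, 0.0],
                       [0.0, 0.0, 0.0]])

fa = FeatureAgglomeration(n_clusters=2, affinity="cosine", linkage="average")
fa.fit(train_data)  # hangs or segfaults on affected scikit-learn/scipy versions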

@jnothman changed the title from '"Cosine" affinity type in FeatureAgglomeration somehow causes memory overflow in a particular dataset' to 'AgglomerativeClustering broken for duplicate samples' on Nov 9, 2016
@jnothman
Member
jnothman commented Nov 9, 2016

I've changed the title to reflect that issue

@jnothman
Member
jnothman commented Nov 9, 2016

although I've not actually checked that this dataset has duplicate features.


@jnothman changed the title from 'AgglomerativeClustering broken for duplicate samples' to "AgglomerativeClustering with metric='cosine' broken for all-zero rows" on Nov 9, 2016
@jnothman
Member
jnothman commented Nov 9, 2016

Actually that segfault occurs in scipy.cluster.hierarchy.linkage, and it's because of vectors with a norm of zero, not because of duplication. Posted issue at scipy/scipy#6774. I've renamed this issue too early. There aren't duplicate distances from any point to others in the supplied dataset.

However, there are a number of features in train_data with all-zero values, and rather than segfaulting, pdist seems to be returning inf for these. These inf values are passed on by hierarchy.linkage, perhaps together with cycles in the tree it builds.

Note that our own cosine_distances implementation (apart from being much faster than pdist) does not return inf, but returns 1 for distances involving vectors with zero norms. It does so even in the case where those vectors are identical, which is a bit surprising.
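
For illustration, a small sketch of that difference (not from the original thread; the exact non-finite values returned by pdist depend on the scipy version):

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import cosine_distances

X = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])

print(pdist(X, metric="cosine"))  # pairs involving a zero row come out non-finite (inf/nan)
print(cosine_distances(X))        # the same pairs come out as 1, even for zero-vs-zero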

Therefore I'm not sure the following is correct in its results, but at least it terminates with a result:

from sklearn.metrics.pairwise import cosine_distances

fa = FeatureAgglomeration(affinity="precomputed", linkage="average")
fa.fit(cosine_distances(train_data.T), train_labels)  # terminates, unlike the original snippet

@mthorrell
Contributor

Regarding cosine_distances: shouldn't any cosine distance to a zero vector be nan, as it is with the pdist function? The cosine distance formula itself encounters a 0/0, which is indeterminate. The practical implication is that there are probably different scenarios where the cosine distance to a zero vector is usefully defined as 0, 1, or 2; thus, using nan would be the most accurate option. Perhaps this is a bug to consider?

Specifically regarding feature agglomeration: shouldn't zero columns just be removed since they carry no information? This seems like the quickest fix to this bug, and checking the code, removing the zero columns in the dataset lets the code run fine. Not sure if this is an overly drastic solution.
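
A rough sketch of that workaround as a preprocessing step (illustrative only; it reuses train_data from the original snippet, and VarianceThreshold drops all constant columns, a superset of the all-zero ones):

from sklearn.feature_selection import VarianceThreshold
from sklearn.cluster import FeatureAgglomeration

# train_data: the array loaded in the original snippet above.
selector = VarianceThreshold()  # default threshold=0.0 removes constant (incl. all-zero) columns
train_data_nonzero = selector.fit_transform(train_data)

fa = FeatureAgglomeration(affinity="cosine", linkage="average")
fa.fit(train_data_nonzero)  # runs to completion once the zero columns are gone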

Regarding hierarchical agglomeration using cosine distance in general: defining distances to zero vectors as 1 (as they would be using cosine_distances) will make the code run, but I suspect some undesirable edge cases may show up with this answer. I would propose to keep zeros as a separate group until the last agglomeration step... But this may be out of scope for this issue.

I'm trying to get started contributing to sklearn, so please let me know if I've strayed too far on these items. I would love to discuss and/or be a contributor on a solution that gets decided.

@jnothman
Member

I'm trying to get started contributing to sklearn, so please let me know if I've strayed too far on these items. I would love to discuss and/or be a contributor on a solution that gets decided.

I really appreciate your putting thought to it. Except in rare cases, we tend to find the merits of one solution or another much easier to evaluate once there's a patch in front of us that attempts to implement one. I agree, marking the distance to zero as NaN would be correct, except that we're explicitly here for machine learning applications, and that means sometimes choosing sensible approximations that work (perhaps with a warning so that the user isn't blind to that compromise). I think you've already identified that there are multiple changes involved to really get to the bottom of this issue, and introducing a warning may be one of them.

Specifically regarding feature agglomeration: shouldn't zero columns just be removed since they carry no information? This seems like the quickest fix to this bug, and checking the code, removing the zero columns in the dataset lets the code run fine. Not sure if this is an overly drastic solution.

Well, zero-variance columns can be removed from feature agglomeration in general. I think this patch to FeatureAgglomeration would be acceptable, but wouldn't fix the underlying issue here.

@GaelVaroquaux
Member
GaelVaroquaux commented Nov 14, 2016 via email

@mthorrell
Contributor

To keep this simple to start with, I'll remove zero and zero-variance columns in FeatureAgglomeration and have it generate a warning. Then I'll submit the change for you all to review.

@GaelVaroquaux
Member
GaelVaroquaux commented Nov 14, 2016 via email

@jnothman
Member

specific to the correlation distance (which is not a distance)

I suppose you mean "cosine". Even if we used the arccosine distance, which is apparently a true distance, it would not be defined where points are 0. Certainly scipy's linkage admits cosine, and clustering in cosine space is not uncommon.

@mthorrell
Contributor
mthorrell commented Nov 15, 2016

Rethinking my previous comment, maybe removing all zero-variance columns is not the right thing to do. However, on the question of whether we choose a number for the cosine distance to 0 or punt and do something else: I'm still in the "do something else" camp.

Hierarchical agglomeration, I believe, resolves ties in distances by making basically arbitrary decisions (deciding based on observation number, for instance). If we define the cosine distance to 0 as 1 (or any number between 0 and 2), we risk introducing a lot of ties into the agglomeration algorithm, so many agglomeration steps could be decided arbitrarily. This would lead to unpredictable performance in some cases.

I would propose to remove the zero vectors prior to clustering... and then do something else with them. Maybe add them back in at the end?
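
A rough sketch of that idea for FeatureAgglomeration (agglomerate_with_zero_columns is a hypothetical helper for illustration, not an existing API):

import numpy as np
from sklearn.cluster import FeatureAgglomeration

def agglomerate_with_zero_columns(X, n_clusters):
    # Cluster only the columns with a non-zero norm; the all-zero columns are
    # kept out of the cosine computation and assigned one extra label at the end.
    zero_cols = np.abs(X).sum(axis=0) == 0
    fa = FeatureAgglomeration(n_clusters=n_clusters,
                              affinity="cosine", linkage="average")
    fa.fit(X[:, ~zero_cols])
    labels = np.empty(X.shape[1], dtype=int)
    labels[~zero_cols] = fa.labels_
    labels[zero_cols] = n_clusters  # zero columns form a separate final group
    return labels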

@mthorrell
Contributor
mthorrell commented Nov 19, 2016

For what it's worth, the update to scipy discussed by @jnothman (scipy/scipy#6774) fixes this bug to the following extent: when using the dev version of scipy, the memory overflow no longer occurs when running the original code snippet; instead, an error is raised. In other words, scipy refuses to perform agglomerative clustering when using cosine distance with zero vectors.

I am unsure of the correct action to take for sklearn given the existence of an upcoming solution in Scipy. Is it appropriate to wait? Or should we produce an error in sklearn to be safe? And I suppose the question remains: is this the right functionality... though it seems reasonable to me.
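
For the "produce an error in sklearn" option, something along these lines would do (an illustrative sketch; checked_pdist is a hypothetical helper, not the actual scikit-learn fix):

import numpy as np
from scipy.spatial import distance

def checked_pdist(X, metric="cosine"):
    # Validate the condensed distance matrix before it reaches
    # scipy.cluster.hierarchy.linkage, so zero-norm rows give a clear error
    # instead of a hang or segfault on older scipy.
    d = distance.pdist(X, metric=metric)
    if not np.all(np.isfinite(d)):
        raise ValueError("Non-finite distances found; with metric='cosine' this "
                         "usually means some rows have zero norm.")
    return d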

@jnothman
Member

Well, we don't like leading our users to segfaults, and we don't require them to upgrade to that scipy for another couple of years yet!


@GaelVaroquaux
Member

We have two problems here:

  • Figuring out what cosine should return for all-zero rows
  • Making sure that AgglomerativeClustering does not crash in any case
    (crashers are bad things).

Ideally, we should also solve the 2nd one.

@mthorrell
Contributor
mthorrell commented Nov 21, 2016

On the general question of what cosine distance should return for all-zero rows: I do think it would be folly to define distance to zero as 1 in every setting. There may be settings where choosing 1 may give undesired performance. In fact, this bug may have given us one such setting.

Consider the following code where cosine distance to 0 is 1 (and d(0,0) = 1 as well).

import numpy as np
from scipy.cluster import hierarchy 
from scipy.spatial import distance

X = np.array([[0,0,0],
              [1,0,0],
              [0.1,1,0],
              [0,0,1],
              [0,0,0]])

y = distance.pdist(X, metric="cosine")
y[np.isnan(y)] = 1

out = hierarchy.linkage(y,method='average')

children = out[:, :2].astype(int)
print(children)

Output:

[[1 2]
 [0 5]
 [3 6]
 [4 7]]

Hence points 1 and 2 are correctly grouped first. Then it grabs the first zero due to ordering. Then it grabs point [0,0,1]. Then the last zero, again due to ordering. This sequence seems very unnatural to me.

@jnothman
Member
jnothman commented Nov 24, 2016

@mthorrell, without much capacity to focus on this issue in detail myself, I'd appreciate:

  • one or more new issues with a clear, specific description of the problem and optionally proposed solutions
  • pull requests for proposed solutions, eventually

Thanks!

@mthorrell
Contributor

No problem. Thanks.
