AgglomerativeClustering with metric='cosine' broken for all-zero rows #7689
Comments
And with another affinity that's not a problem?
Yep, other affinity types have no problem. The training dataset has more than 4000 features and a sample size of ~900.
The infinite loop occurs in
(If it is truly a result of cosine, this might relate to it not being a proper distance; otherwise it might just be that this data happened to trigger the problem with cosine, and another metric could be a problem for another dataset.)
Could it be because the pairwise distance matrix has exact draws, i.e. a point that is exactly at the same distance from two others? This might create a non-tree graph.

Absolute intuition-based hunch.
Well, yes, I get a segfault when two features are identical, e.g. `train_data = [[1, 0, 0], [.5, 0, 0], [0, 0, 0]]`
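A minimal sketch of how that example reaches scipy, assuming (as diagnosed in the next comment) that the crash comes from nan cosine distances for zero-norm vectors; this calls scipy directly rather than going through FeatureAgglomeration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

train_data = np.array([[1, 0, 0], [.5, 0, 0], [0, 0, 0]], dtype=float)

# clustering the *features* (columns): two of them are all-zero vectors;
# pdist(..., metric="cosine") yields nan for any pair involving a zero vector,
# and those nans crashed scipy's C linkage code (segfault) before the fix --
# recent scipy versions raise an error instead
Z = linkage(train_data.T, method="average", metric="cosine")
```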
I've changed the title to reflect that issue, although I've not actually checked that this dataset has duplicate features.
Actually that segfault occurs in scipy.cluster.hierarchy.linkage, and it's because of vectors with a norm of zero, not because of duplication. Posted issue at scipy/scipy#6774. I've renamed this issue too early: there aren't duplicate distances from any point to the others in the supplied dataset. However, there are a number of features in the dataset with zero norm. Note that our own cosine_distances returns a finite distance for those, where scipy's pdist returns nan. Therefore I'm not sure the following is correct in its results, but at least it terminates with a result:

```python
from sklearn.cluster import FeatureAgglomeration
from sklearn.metrics.pairwise import cosine_distances

fa = FeatureAgglomeration(affinity="precomputed", linkage="average")
fa.fit(cosine_distances(train_data.T), train_labels)
```
Regarding cosine_distances: shouldn't any cosine distance to a zero vector be nan like it is using the pdist function?

Specifically regarding feature agglomeration: shouldn't zero columns just be removed, since they carry no information? This seems like the quickest fix to this bug, and checking the code, removing the zero columns from the dataset lets the code run fine (a sketch follows below). I'm not sure whether this is an overly drastic solution.

Regarding hierarchical agglomeration using cosine distance in general: defining distances to zero vectors as 1 (as they would be using cosine_distances) seems like an arbitrary choice.

I'm trying to get started contributing to sklearn, so please let me know if I've strayed too far on these items. I would love to discuss and/or be a contributor on a solution that gets decided.
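A minimal sketch of the zero-column removal proposed above, assuming `train_data` is the samples-by-features array from the report:

```python
import numpy as np

# keep only features (columns) with at least one nonzero entry; all-zero
# columns carry no information and have an undefined cosine distance
nonzero_cols = np.any(train_data != 0, axis=0)
train_data_filtered = train_data[:, nonzero_cols]
```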
I really appreciate your putting thought into it. Except in rare cases, we tend to find the merits of one solution or another much easier to evaluate once there's a patch in front of us that attempts to implement one. I agree that marking the distance to zero as NaN would be correct, except that we're explicitly here for machine learning applications, and that means sometimes choosing sensible approximations that work (perhaps with a warning so that the user isn't blind to that compromise). I think you've already identified that there are multiple changes involved to really get to the bottom of this issue, and introducing a warning may be one of them.
Well, zero-variance columns can be removed in feature agglomeration in general. I think this patch to FeatureAgglomeration would be acceptable, but it wouldn't fix the underlying issue here.
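For reference, sklearn already ships a transformer that drops zero-variance columns, which could serve as such a pre-filtering step (a sketch, not the patch under discussion):

```python
from sklearn.feature_selection import VarianceThreshold

# removes every constant column (which includes all-zero columns)
# before handing the data to FeatureAgglomeration
filtered = VarianceThreshold(threshold=0.0).fit_transform(train_data)
```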
> Regarding cosine_distances: shouldn't any cosine distance to a zero vector be nan like it is using the pdist function?

Nan is not a number, hence it's not a distance.

Part of me frowns at returning nan from a distance function. However, there's a difficult choice to be made: should it be zero, or something large? Or should we just return nan? If we return nan, it's not clear that AgglomerativeClustering will do the right thing.
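To make the divergence concrete, a small sketch comparing the two behaviours under discussion (the toy array is made up):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import cosine_distances

X = np.array([[1.0, 0.0], [0.0, 0.0]])
print(pdist(X, metric="cosine"))  # [nan]: scipy propagates the 0/0
print(cosine_distances(X))        # the pair involving the zero row gets
                                  # distance 1.0 rather than nan
```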
To make this simple then to start, I'll remove zero and zero-variance columns in Feature Agglomeration and I'll have it generate a warning. Then I'll submit the change for you all to review.
> To make this simple then to start, I'll remove zero and zero-variance columns in Feature Agglomeration and I'll have it generate a warning.

Is this a good idea? It seems to me that the problem you are having is specific to the correlation distance (which is not a distance). For other distances, it would be legitimate to have zero and zero-variance columns.
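Indeed, for other metrics a zero vector is perfectly well defined; a toy check:

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[1.0, 0.0], [0.0, 0.0]])
print(pdist(X, metric="euclidean"))  # [1.]: the zero row poses no problem here
```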
I suppose you mean "cosine". Even if we used the arccosine distance, which is apparently a true distance, it would not be defined where points are 0. Certainly scipy's
Re-thinking my previous comment, maybe removing all zero-variance columns is not the right thing to do. However, on the question of whether we choose a number for the cosine distance to 0 or punt and do something else, I'm still in the "do something else" camp.

Hierarchical agglomeration, I believe, resolves ties in distances by making essentially arbitrary decisions (deciding based on observation number, for instance). If we choose the cosine distance to 0 to be 1 (or any number between 0 and 2), we risk introducing a lot of ties into the agglomeration algorithm, so many agglomeration steps could be decided arbitrarily. This would lead to unpredictable behaviour in some cases.

I would propose to remove the zero vectors prior to clustering... and then do something else with them. Maybe add them back in at the end?
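A sketch of that proposal under stated assumptions: `cluster_ignoring_zero_rows` is a hypothetical helper, not an existing sklearn API, and `affinity` was the parameter name at the time (newer sklearn calls it `metric`):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_ignoring_zero_rows(X, n_clusters):
    """Hypothetical helper: cluster only rows with nonzero norm, then put
    all-zero rows into one extra catch-all cluster at the end."""
    nonzero = np.any(X != 0, axis=1)
    labels = np.full(len(X), n_clusters, dtype=int)  # extra label for zero rows
    model = AgglomerativeClustering(
        n_clusters=n_clusters, affinity="cosine", linkage="average")
    labels[nonzero] = model.fit_predict(X[nonzero])
    return labels
```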
For what it's worth, the update to scipy discussed by @jnothman (scipy/scipy#6774) fixes this bug to the extent that, when using the dev version of scipy, the memory overflow no longer occurs when running the original code snippet; an error is raised instead. In other words, scipy refuses to perform agglomerative clustering when using cosine distance with zero vectors. I am unsure of the correct action for sklearn given the upcoming fix in scipy. Is it appropriate to wait, or should we raise an error in sklearn to be safe? And I suppose the question remains whether this is the right behaviour, though it seems reasonable to me.
Well, we don't like leading our users to segfaults, and we don't require a scipy version where this is fixed.
We have two problems here:

1. the memory overflow / segfault when zero-norm vectors reach the hierarchical clustering code;
2. the question of what the cosine distance involving a zero vector should be in the first place.

Ideally, we should also solve the 2nd one.
On the general question of what cosine distance should return for all-zero rows: I do think it would be folly to define the distance to zero as 1 in every setting. There may be settings where choosing 1 gives undesired behaviour; in fact, this bug may have given us one such setting. Consider the following example, where the cosine distance to 0 is 1 (and d(0, 0) = 1 as well).

Points 1 and 2 are correctly grouped first. Then the algorithm grabs the first zero, purely due to ordering. Then it grabs point [0, 0, 1]. Then the last zero, again due to ordering. This merge sequence seems very unnatural to me.
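Since the snippet and its output did not survive here, the following is a reconstruction of that experiment under the stated convention d(x, 0) = d(0, 0) = 1, with a hypothetical five-point dataset matching the ordering described:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def cosine_dist_zero_as_one(X):
    """Cosine distance matrix where any pair involving a zero vector gets 1."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = np.divide(X, norms, out=np.zeros_like(X), where=norms > 0)
    D = 1.0 - Xn @ Xn.T        # zero rows -> similarity 0 -> distance 1
    np.fill_diagonal(D, 0.0)
    return D

X = np.array([[1.0, 0.0, 0.0],   # point 1
              [0.9, 0.1, 0.0],   # point 2, nearly parallel to point 1
              [0.0, 0.0, 0.0],   # first zero
              [0.0, 0.0, 1.0],   # orthogonal point
              [0.0, 0.0, 0.0]])  # last zero

# checks=False: tiny floating-point asymmetry from the matrix product is harmless
Z = linkage(squareform(cosine_dist_zero_as_one(X), checks=False), method="average")
print(Z)  # points 1 and 2 merge first; every later merge happens at height
          # exactly 1.0, so the order is pure tie-breaking, not geometry
```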
@mthorrell, without much capacity to focus on this issue in detail myself, I'd appreciate:
Thanks! |
No problem. Thanks. |
Original title: "Cosine" affinity type in FeatureAgglomeration somehow causes memory overflow in a particular dataset
Description
Please test the code below carefully! Using the "cosine" affinity type in FeatureAgglomeration, the code causes memory overflow with a particular dataset (Download here). It is all right with other affinity types, and the issue cannot be reproduced with simulated data (e.g. using make_classification in scikit-learn). Not sure why it happens.
Steps/Code to Reproduce
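The original snippet and dataset link are not reproduced above; the following is a sketch that triggers the same failure mode, assuming the key property of the dataset is the presence of all-zero feature columns (the shape and n_clusters value are made up):

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

# stand-in for the linked dataset: ~900 samples, >4000 features,
# a few of which are entirely zero (the trigger for this bug)
rng = np.random.RandomState(0)
train_data = rng.rand(900, 4000)
train_data[:, :10] = 0.0

fa = FeatureAgglomeration(n_clusters=32, affinity="cosine", linkage="average")
fa.fit(train_data)  # memory keeps increasing here on affected versions
```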