
AgglomerativeClustering with metric='cosine' broken for all-zero rows #7689


Closed
weixuanfu opened this issue Oct 17, 2016 · 22 comments · Fixed by #7943

@weixuanfu
weixuanfu commented Oct 17, 2016

Original title: "Cosine" affinity type in FeatureAgglomeration somehow causes memory overflow in a particular dataset

Description

Please carefully test the code below. Using the "cosine" affinity type in FeatureAgglomeration, the code causes a memory overflow with a particular dataset (download here). Other affinity types are fine, and the issue cannot be reproduced with simulated data (e.g. from make_classification in scikit-learn). Not sure why this happens.

Steps/Code to Reproduce

from sklearn.cluster import FeatureAgglomeration
import numpy as np
import time

train_data = np.genfromtxt('fold_2_trainFeatVec.csv', delimiter=',')
train_labels = np.genfromtxt('fold_2_trainLabels.csv', delimiter=',')

fa = FeatureAgglomeration(affinity="cosine", linkage="average")  # same problem with linkage="complete"
time_start = time.time()
fa.fit(train_data, train_labels)  # memory keeps increasing here
time_end = time.time()
print('Time usage:', time_end - time_start)

@amueller
Member

And with another affinity that's not a problem?
How large is the dataset?

@weixuanfu
Author

Yep, other affinity types have no problem. The training dataset has more than 4000 features and a sample size of ~900.


@jnothman
Member
jnothman commented Nov 9, 2016

The infinite loop occurs in _hc_get_descendent. I assume there's a cycle in the "tree". Ping @GaelVaroquaux

@jnothman
Member
jnothman commented Nov 9, 2016

(if it is truly a result of cosine, this might relate to it not being a proper distance; otherwise, it might just be that this data happened to create the problem with cosine, but another metric could be a problem for another dataset.)

@GaelVaroquaux
Member
GaelVaroquaux commented Nov 9, 2016 via email

@jnothman
Member
jnothman commented Nov 9, 2016

Well, yes, I get a segfault when two features are identical, e.g.

train_data = [[1, 0, 0], [.5, 0, 0], [0, 0, 0]]
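
Expanded into a self-contained sketch for reference (illustrative only; affinity is the parameter name used in scikit-learn at the time of this issue, and the behaviour depends on the installed scipy):

import numpy as np
from sklearn.cluster import FeatureAgglomeration

# Columns 1 and 2 are all zeros, so their cosine distances are undefined.
train_data = np.array([[1.0, 0.0, 0.0],
                       [0.5, 0.0, 0.0],
                       [0.0, 0.0, 0.0]])

fa = FeatureAgglomeration(n_clusters=2, affinity="cosine", linkage="average")
fa.fit(train_data)  # hangs or segfaults on affected scikit-learn/scipy versions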

@jnothman changed the title from '"Cosine" affinity type in FeatureAgglomeration somehow causes memory overflow in a particular dataset' to 'AgglomerativeClustering broken for duplicate samples' on Nov 9, 2016
@jnothman
Member
jnothman commented Nov 9, 2016

I've changed the title to reflect that issue

@jnothman
Member
jnothman commented Nov 9, 2016

although I've not actually checked that this dataset has duplicate features.


@jnothman changed the title from 'AgglomerativeClustering broken for duplicate samples' to "AgglomerativeClustering with metric='cosine' broken for all-zero rows" on Nov 9, 2016
@jnothman
Member
jnothman commented Nov 9, 2016

Actually that segfault occurs in scipy.cluster.hierarchy.linkage, and it's because of vectors with a norm of zero, not because of duplication. Posted issue at scipy/scipy#6774. I've renamed this issue too early. There aren't duplicate distances from any point to others in the supplied dataset.

However, there are a number of features in train_data with all-zero values, and rather than segfaulting, pdist seems to be returning inf for these. These inf values are passed on by hierarchy.linkage, perhaps together with cycles in the tree it builds.

Note that our own cosine_distances implementation (apart from being much faster than pdist) does not return inf, but returns 1 for distances involving vectors with zero norms. It does so even in the case where those vectors are identical, which is a bit surprising.
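
For illustration, a small sketch of that difference (not from the original thread; the exact non-finite values returned by pdist depend on the scipy version):

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import cosine_distances

X = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])

print(pdist(X, metric="cosine"))  # pairs involving a zero row come out non-finite (inf/nan)
print(cosine_distances(X))        # the same pairs come out as 1, even for zero-vs-zero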

Therefore I'm not sure the following is correct in its results, but at least it terminates with a result:

from sklearn.metrics.pairwise import cosine_distances

fa = FeatureAgglomeration(affinity="precomputed", linkage="average")
fa.fit(cosine_distances(train_data.T), train_labels)  # terminates, unlike the original snippet

@mthorrell
Contributor

Regarding cosine_distances: shouldn't any cosine distance to a zero vector be nan, as it is with the pdist function? The cosine distance formula itself encounters a 0/0, which is indeterminate. The practical implication is that there are probably different scenarios where the cosine distance to a zero vector is usefully defined as 0, 1, or 2; thus, using nan would be the most accurate option. Perhaps this is a bug to consider?

Specifically regarding feature agglomeration: shouldn't zero columns just be removed since they carry no information? This seems like the quickest fix to this bug, and checking the code, removing the zero columns in the dataset lets the code run fine. Not sure if this is an overly drastic solution.
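
A rough sketch of that workaround as a preprocessing step (illustrative only; it reuses train_data from the original snippet, and VarianceThreshold drops all constant columns, a superset of the all-zero ones):

from sklearn.feature_selection import VarianceThreshold
from sklearn.cluster import FeatureAgglomeration

# train_data: the array loaded in the original snippet above.
selector = VarianceThreshold()  # default threshold=0.0 removes constant (incl. all-zero) columns
train_data_nonzero = selector.fit_transform(train_data)

fa = FeatureAgglomeration(affinity="cosine", linkage="average")
fa.fit(train_data_nonzero)  # runs to completion once the zero columns are gone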

Regarding hierarchical agglomeration using cosine distance in general: defining distances to zero vectors as 1 (as they would be using cosine_distances) will make the code run, but I suspect some undesirable edge cases may show up with this answer. I would propose to keep zeros as a separate group until the last agglomeration step... But this may be out of scope for this issue.

I'm trying to get started contributing to sklearn, so please let me know if I've strayed too far on these items. I would love to discuss and/or be a contributor on a solution that gets decided.

@jnothman
Member

I'm trying to get started contributing to sklearn, so please let me know if I've strayed too far on these items. I would love to discuss and/or be a contributor on a solution that gets decided.

I really appreciate your putting thought to it. Except in rare cases, we tend to find the merits of one solution or another much easier to evaluate once there's a patch in front of us that attempts to implement one. I agree, marking the distance to zero as NaN would be correct, except that we're explicitly here for machine learning applications, and that means sometimes choosing sensible approximations that work (perhaps with a warning so that the user isn't blind to that compromise). I think you've already identified that there are multiple changes involved to really get to the bottom of this issue, and introducing a warning may be one of them.

Specifically regarding feature agglomeration: shouldn't zero columns just be removed since they carry no information? This seems like the quickest fix to this bug, and checking the code, removing the zero columns in the dataset lets the code run fine. Not sure if this is an overly drastic solution.

Well, zero-variance columns can be removed from feature agglomeration in general. I think this patch to FeatureAgglomeration would be acceptable, but wouldn't fix the underlying issue here.

@GaelVaroquaux
Member
GaelVaroquaux commented Nov 14, 2016 via email

@mthorrell
Contributor

To keep this simple to start with, I'll remove zero and zero-variance columns in FeatureAgglomeration and have it generate a warning. Then I'll submit the change for you all to review.

@GaelVaroquaux
Member
GaelVaroquaux commented Nov 14, 2016 via email

@jnothman
Member

specific to the correlation distance (which is not a distance)

I suppose you mean "cosine". Even if we used the arccosine distance, which is apparently a true distance, it would not be defined where points are 0. Certainly scipy's linkage admits cosine, and clustering in cosine space is not uncommon.

@mthorrell
Contributor
mthorrell commented Nov 15, 2016

Rethinking my previous comment, maybe removing all zero-variance columns is not the right thing to do. However, on the question of whether we choose a number for the cosine distance to 0 or punt and do something else: I'm still in the "do something else" camp.

Hierarchical agglomeration, I believe, resolves ties in distances by making basically arbitrary decisions (deciding based on observation number, for instance). If we define the cosine distance to 0 as 1 (or any number between 0 and 2), we risk introducing a lot of ties into the agglomeration algorithm, so many agglomeration steps could be decided arbitrarily. This would lead to unpredictable performance in some cases.

I would propose to remove the zero vectors prior to clustering... and then do something else with them. Maybe add them back in at the end?
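
A rough sketch of that idea for FeatureAgglomeration (agglomerate_with_zero_columns is a hypothetical helper for illustration, not an existing API):

import numpy as np
from sklearn.cluster import FeatureAgglomeration

def agglomerate_with_zero_columns(X, n_clusters):
    # Cluster only the columns with a non-zero norm; the all-zero columns are
    # kept out of the cosine computation and assigned one extra label at the end.
    zero_cols = np.abs(X).sum(axis=0) == 0
    fa = FeatureAgglomeration(n_clusters=n_clusters,
                              affinity="cosine", linkage="average")
    fa.fit(X[:, ~zero_cols])
    labels = np.empty(X.shape[1], dtype=int)
    labels[~zero_cols] = fa.labels_
    labels[zero_cols] = n_clusters  # zero columns form a separate final group
    return labels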

@mthorrell
Contributor
mthorrell commented Nov 19, 2016

For what it's worth, the update to scipy discussed by @jnothman (scipy/scipy#6774) fixes this bug to the following extent: when using the dev version of scipy, the memory overflow no longer occurs when running the original code snippet; instead, an error is raised. In other words, scipy refuses to perform agglomerative clustering when using cosine distance with zero vectors.

I am unsure of the correct action to take for sklearn given the existence of an upcoming solution in Scipy. Is it appropriate to wait? Or should we produce an error in sklearn to be safe? And I suppose the question remains: is this the right functionality... though it seems reasonable to me.
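
For the "produce an error in sklearn" option, something along these lines would do (an illustrative sketch; checked_pdist is a hypothetical helper, not the actual scikit-learn fix):

import numpy as np
from scipy.spatial import distance

def checked_pdist(X, metric="cosine"):
    # Validate the condensed distance matrix before it reaches
    # scipy.cluster.hierarchy.linkage, so zero-norm rows give a clear error
    # instead of a hang or segfault on older scipy.
    d = distance.pdist(X, metric=metric)
    if not np.all(np.isfinite(d)):
        raise ValueError("Non-finite distances found; with metric='cosine' this "
                         "usually means some rows have zero norm.")
    return d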

@jnothman
Member

Well, we don't like leading our users to segfaults, and we don't require them to upgrade to that scipy for another couple of years yet!


@GaelVaroquaux
Member

We have two problems here:

  • Figuring out what cosine should return for all-zero rows
  • Making sure that AgglomerativeClustering does not crash in any case
    (crashers are bad things).

Ideally, we should also solve the 2nd one.

@mthorrell
Contributor
mthorrell commented Nov 21, 2016

On the general question of what cosine distance should return for all-zero rows: I do think it would be folly to define distance to zero as 1 in every setting. There may be settings where choosing 1 may give undesired performance. In fact, this bug may have given us one such setting.

Consider the following code where cosine distance to 0 is 1 (and d(0,0) = 1 as well).

import numpy as np
from scipy.cluster import hierarchy 
from scipy.spatial import distance

X = np.array([[0,0,0],
              [1,0,0],
              [0.1,1,0],
              [0,0,1],
              [0,0,0]])

y = distance.pdist(X, metric="cosine")
y[np.isnan(y)] = 1

out = hierarchy.linkage(y,method='average')

children = out[:, :2].astype(int)
print(children)

Output:

[[1 2]
 [0 5]
 [3 6]
 [4 7]]

Hence points 1 and 2 are correctly grouped first. Then it grabs the first zero due to ordering. Then it grabs point [0,0,1]. Then the last zero, again due to ordering. This sequence seems very unnatural to me.

@jnothman
Member
jnothman commented Nov 24, 2016

@mthorrell, without much capacity to focus on this issue in detail myself, I'd appreciate:

  • one or more new issues with a clear, specific description of the problem and optionally proposed solutions
  • pull requests for proposed solutions, eventually

Thanks!

@mthorrell
Contributor

No problem. Thanks.
