ValueError in distance matrix with agglomerative clustering · Issue #10076 · scikit-learn/scikit-learn · GitHub

ValueError in distance matrix with agglomerative clustering #10076


Closed
tilmanbeck opened this issue Nov 6, 2017 · 7 comments

@tilmanbeck

Description

A ValueError is thrown when applying AgglomerativeClustering to textual data because the distance matrix contains infinite values.

Steps/Code to Reproduce

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering


def main():
    # Load all 20 newsgroups posts with headers, footers and quotes stripped.
    dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42,
                                 remove=('headers', 'footers', 'quotes'))
    data_samples = dataset.data
    targets = dataset.target
    categories = dataset.target_names
    # One cluster per newsgroup category.
    k = np.unique(targets).shape[0]
    tf_vectorizer = TfidfVectorizer(max_features=50000, max_df=1.0, min_df=1)
    tfs = tf_vectorizer.fit_transform(data_samples)
    agg = AgglomerativeClustering(linkage="complete", n_clusters=k, affinity="cosine")
    # This call raises the ValueError shown under "Actual Results".
    agg.fit(tfs.toarray())
    return dataset


if __name__ == '__main__':
    main()

Expected Results

No error should be thrown, and the distance matrix should not contain infinite values.

Actual Results

File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 750, in fit
    **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 527, in _complete_linkage
    return linkage_tree(*args, **kwargs)
  File "/venv/lib/python3.5/site-packages/sklearn/cluster/hierarchical.py", line 417, in linkage_tree
    out = hierarchy.linkage(X, method=linkage, metric=affinity)
  File "/venv/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 713, in linkage
    raise ValueError("The condensed distance matrix must contain only "
ValueError: The condensed distance matrix must contain only finite values.

Versions

>>> import platform; print(platform.platform())
Linux-4.4.0-81-generic-x86_64-with-Ubuntu-16.04-xenial
>>> import sys; print("Python", sys.version)
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.3
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.0.0
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.19.0
>>> 

Comment
I ran the same code on a subset of the Reuters-21578 text dataset and no error was thrown. I was not able to track down what might have caused the infinite values in the distance matrix.

@jnothman
Member
jnothman commented Nov 6, 2017 via email

@tilmanbeck
Author
tilmanbeck commented Nov 7, 2017

Hi @jnothman ,
after some deeper investigation (I should have done that before ;) I found that there were empty text documents, which resulted, as you suggested, in vectors with no non-zero elements. I tracked it down: the remove=('headers', 'footers', 'quotes') parameter of the fetch_20newsgroups function strips the entire text of some documents.
Thanks for the tip!
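
The diagnosis in this comment can be checked, and worked around, before clustering. What follows is only a minimal sketch, assuming a small dense toy matrix in place of the real TF-IDF output; `cosine_distance` and `drop_zero_rows` are hypothetical helpers written for illustration and are not part of scikit-learn or SciPy.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance 1 - u.v / (|u| |v|); NaN if either vector is all zeros."""
    with np.errstate(invalid="ignore", divide="ignore"):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def drop_zero_rows(X):
    """Hypothetical helper: keep rows with a non-zero norm, return them and the mask."""
    X = np.asarray(X, dtype=float)
    keep = np.linalg.norm(X, axis=1) > 0
    return X[keep], keep


# Toy stand-in for tfs.toarray(): row 1 mimics a document that became
# empty once headers, footers and quotes were removed.
X = np.array([
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.6, 0.8],
])

d_bad = cosine_distance(X[0], X[1])             # involves the empty document
X_clean, keep = drop_zero_rows(X)               # filter before clustering
d_ok = cosine_distance(X_clean[0], X_clean[1])  # finite once the zero row is gone

print("non-finite distance with zero row:", np.isnan(d_bad))
print("rows kept:", keep)
print("distance after filtering:", d_ok)
```

For a sparse TF-IDF matrix, a mask such as `tfs.getnnz(axis=1) > 0` should give the same selection without densifying first; any labels (here `targets`) would need to be filtered with the same mask.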

@jnothman
Member
jnothman commented Nov 7, 2017

So is it fine to close this?

@tilmanbeck
Author

Yes, although I wonder whether the distance between two zero-valued vectors should simply be zero instead of non-finite. Do you think it makes sense to track this down, or is this expected behaviour?
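
To make the question concrete: cosine distance is 1 - u.v / (|u| |v|), so two all-zero vectors give 0/0, which is why the result is non-finite. The sketch below shows one possible convention that returns zero in that case. `safe_cosine_distance` is a hypothetical function written for illustration, not the behaviour of scikit-learn or SciPy, and returning 1.0 when only one of the vectors is zero is likewise an assumption.

```python
import numpy as np


def safe_cosine_distance(u, v):
    """Cosine distance with an explicit (assumed) convention for zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0 and nv == 0:
        # Assumed convention: two zero vectors are treated as identical.
        return 0.0
    if nu == 0 or nv == 0:
        # Assumed convention: a zero vector is maximally dissimilar
        # (in the orthogonal sense) to any non-zero vector.
        return 1.0
    return float(1.0 - np.dot(u, v) / (nu * nv))


print(safe_cosine_distance([0, 0], [0, 0]))
print(safe_cosine_distance([0, 0], [1, 0]))
print(safe_cosine_distance([1, 0], [0, 1]))
```

Whether such a convention is appropriate depends on the application; dropping the empty documents before clustering avoids having to choose one.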

@jnothman
Member
jnothman commented Nov 7, 2017

Actually this is a duplicate of #7689, so see there...

@lesteve lesteve closed this as completed Nov 8, 2017
@bottydim

For me, the problem was that the gram_matrix contained identical observations, which meant that the condensed distance matrix contained only zeros.

@selahlynch

I've discovered that columns consisting entirely of 1's cause the same error. I found them with df.columns[df.nunique() == 1], dropped them, and my problem was solved.
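
The filter from this comment can be sketched as follows, assuming the data lives in a pandas DataFrame; the toy frame and its column names are made up for illustration.

```python
import pandas as pd

# Toy data: column "a" is constant, the others vary.
df = pd.DataFrame({
    "a": [1, 1, 1],
    "b": [0.2, 0.5, 0.9],
    "c": [3, 1, 2],
})

# Columns with a single distinct value carry no information for clustering.
constant_cols = df.columns[df.nunique() == 1]
df_clean = df.drop(columns=constant_cols)

print(list(constant_cols))
print(list(df_clean.columns))
```

Note that `nunique()` ignores NaN by default, so a column that is entirely NaN has zero unique values and would not be caught by this test.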
