Segmentation fault when calculating euclidean_distances for large numbers of rows #4197

dschallis · 2015-02-03T12:05:35Z

With scikit-learn 0.15.2, numpy 1.9.1, python 2.7.8 (on OS X), the following code segfaults:

import numpy
import sklearn.cluster
import sklearn.metrics

numpy.random.seed(1)
X = numpy.random.random((50000, 100))
model = sklearn.cluster.KMeans(n_clusters=3, random_state=1)
model.fit_predict(X)
print sklearn.metrics.silhouette_score(X, model.labels_, metric='euclidean')

Results in:

Segmentation fault: 11

With the row count reduced to 30000, the above completes fine. With 40000 rows, the script takes a very long time, but it didn't appear to segfault.

jnothman · 2015-02-03T12:39:16Z

You can cut that example down to:

import numpy
from sklearn.metrics import euclidean_distances
numpy.random.seed(1)
X = numpy.random.random((50000, 100))
euclidean_distances(X)

I'm hence changing the issue title.

dschallis · 2015-02-03T12:41:49Z

Thanks, that makes sense to me. The reduced example also segfaults for me.

jnothman · 2015-02-03T12:46:31Z

The crash occurs in numpy.dot for numpy 1.8.1. And indeed, np.dot(X, X.T) segfaults. Please check whether the segfault still occurs on numpy master and, if it does, report it there.
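For scale, here is a rough back-of-the-envelope sketch of how large the dense result of np.dot(X, X.T) is for the row counts mentioned in this thread. It assumes float64 (8-byte) elements, which is what numpy.random.random returns, and it counts only the output array. The helper name dense_result_gib is made up for illustration. This does not by itself explain the segfault; it only shows how much memory the operation asks for.

```python
def dense_result_gib(n_rows, itemsize=8):
    """Size in GiB of an n_rows x n_rows array of `itemsize`-byte elements."""
    # float() keeps the division exact on Python 2 as well as Python 3.
    return n_rows * n_rows * itemsize / float(2 ** 30)


for n in (30000, 40000, 50000):
    print("%d rows -> %.1f GiB" % (n, dense_result_gib(n)))
# 30000 rows -> 6.7 GiB
# 40000 rows -> 11.9 GiB
# 50000 rows -> 18.6 GiB
```

The number of columns (100 here) does not affect the size of the result, only the cost of computing it.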

jnothman · 2015-02-03T12:59:35Z

I'm closing this as belonging to numpy (until further notice?).

amueller · 2015-02-03T16:43:08Z

That is ... surprising... to say the least.

dschallis · 2015-02-03T16:58:29Z

I've reported this to numpy with some more details for now. Thanks for the investigation work.

Renzhh · 2015-05-11T04:46:16Z

@dschallis Has the problem been solved by numpy? I ran into the same problem, originally when calculating silhouette_score or silhouette_samples with a large number of rows. See #4701

dschallis · 2015-05-14T15:04:43Z

@Renzhh I don't think so, it's currently still an open issue with numpy: numpy/numpy#5533

Renzhh · 2015-05-15T02:53:48Z

@dschallis Is the problem "Segmentation fault when calculating euclidean_distances for large numbers of rows" really caused by numpy? I ran into this problem too. See my issue: #4701. When the sparse matrix has fewer than 30,000 rows, both silhouette_score and silhouette_samples work and give the expected results. But when X has more than 100,000 rows, the program crashes with "Segmentation fault (core dumped)". I'm debugging...

In the meantime, how do you calculate silhouette_score for a large number of rows?
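One possible workaround, sketched here under assumptions rather than taken from this thread, is to avoid materialising the full n x n distance matrix and compute Euclidean distances one block of rows at a time. The helper euclidean_distance_blocks and its block_rows parameter are hypothetical names, the sketch handles dense input only, and it is not scikit-learn's implementation. Separately, silhouette_score accepts a sample_size argument that scores a random subset of the rows, which may be enough when an approximate score is acceptable.

```python
import numpy as np


def euclidean_distance_blocks(X, block_rows=1000):
    """Yield (start, D) where D[i, j] is the distance from X[start + i] to X[j].

    Only one block of shape (block_rows, n_rows) is held in memory at a time.
    """
    X = np.asarray(X, dtype=np.float64)
    sq_norms = np.einsum("ij,ij->i", X, X)  # squared norm of every row
    for start in range(0, X.shape[0], block_rows):
        stop = start + block_rows
        block = X[start:stop]
        # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
        d2 = sq_norms[start:stop, None] - 2.0 * block.dot(X.T) + sq_norms[None, :]
        np.maximum(d2, 0.0, out=d2)  # clip tiny negatives caused by rounding
        yield start, np.sqrt(d2)


# Usage: accumulate a per-row statistic without storing all pairwise distances.
rng = np.random.RandomState(1)
X_small = rng.random_sample((2500, 100))
mean_dist = np.empty(X_small.shape[0])
for start, D in euclidean_distance_blocks(X_small, block_rows=500):
    mean_dist[start:start + D.shape[0]] = D.mean(axis=1)
```

Each block still needs block_rows * n_rows * 8 bytes, so block_rows should be chosen to fit the available memory.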

jnothman · 2015-05-15T03:01:25Z

For a sparse matrix we may need to assume it is a different issue. Still, check whether it's a problem with silhouette_score (unlikely, because it doesn't do anything low-level enough to result in a segfault) or with euclidean_distances. Also, density in sparse matrices may be more important than number of rows. In any case, a segfault is undesirable behaviour and needs to be fixed. Finally, what version of scipy do you have? Does the segfault occur with the most recent version?

Renzhh · 2015-05-15T05:59:42Z

@jnothman After checking, I'm sure the problem is caused by euclidean_distances. My scipy version is 0.13.3 and my numpy version is 1.8.2. I'll now try the most recent stable versions to check whether the segmentation fault still occurs.

Renzhh · 2015-05-18T02:54:00Z

@jnothman With recent versions (scipy 0.15.1 and numpy 1.9.2), the segmentation fault still occurs. However, running scipy.test() suggests that my installed scipy package has some minor problems.

jnothman changed the title ~~Segmentation fault when calculating silhouette_score for large numbers of rows~~ Segmentation fault when calculating euclidean_distances for large numbers of rows Feb 3, 2015

jnothman closed this as completed Feb 3, 2015

amueller mentioned this issue May 11, 2015

python crashed when computing silhouette_score/ silhouette_samples of KMeans on large amounts of data #4701

Closed

argriffing mentioned this issue May 21, 2015

Numpy inverse of very large matrix returns all-zero matrix without error numpy/numpy#5898

Closed

daydayup1 mentioned this issue May 14, 2017

There was a “ValueError:array is too big...” when computing silhouette_samples of KMeans on large amounts of data #8878

Closed

sturlamolden mentioned this issue Jul 21, 2017

Memory Allocation Fault in LU factorization scipy/scipy#7131

Closed

Gitman-code mentioned this issue Dec 9, 2017

MemoryError from sklearn.metrics.silhouette_samples #10279

Closed
