Segmentation fault when calculating euclidean_distances for large numbers of rows · Issue #4197 · scikit-learn/scikit-learn · GitHub

Segmentation fault when calculating euclidean_distances for large numbers of rows #4197


Closed
dschallis opened this issue Feb 3, 2015 · 12 comments

Comments

@dschallis

With scikit-learn 0.15.2, numpy 1.9.1, python 2.7.8 (on OS X), the following code segfaults:

import numpy
import sklearn.cluster
import sklearn.metrics  # imported explicitly; silhouette_score lives here

numpy.random.seed(1)
X = numpy.random.random((50000, 100))
model = sklearn.cluster.KMeans(n_clusters=3, random_state=1)
model.fit_predict(X)
print(sklearn.metrics.silhouette_score(X, model.labels_, metric='euclidean'))

Results in:

Segmentation fault: 11

With the rows dropped to 30000, the above completes fine. With 40000 rows, the script takes a very long time but didn't appear to segfault.

@jnothman
Member
jnothman commented Feb 3, 2015

You can cut that example down to:

import numpy
from sklearn.metrics import euclidean_distances
numpy.random.seed(1)
X = numpy.random.random((50000, 100))
euclidean_distances(X)

I'm hence changing the issue title.

@jnothman jnothman changed the title from "Segmentation fault when calculating silhouette_score for large numbers of rows" to "Segmentation fault when calculating euclidean_distances for large numbers of rows" Feb 3, 2015
@dschallis
Author

Thanks, that makes sense to me; the above also segfaults for me.

@jnothman
Member
jnothman commented Feb 3, 2015

The crash occurs in numpy.dot for numpy 1.8.1. And indeed, np.dot(X, X.T) segfaults. Check whether the segfault occurs at numpy master, and otherwise report it there.
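For a sense of scale, both np.dot(X, X.T) and euclidean_distances(X) produce a dense square array with one row and one column per row of X. The sketch below only does the arithmetic for the row counts mentioned in this thread; nothing large is allocated. The helper name dense_square_bytes is hypothetical, and the thread does not establish that output size is what triggers the crash.

```python
def dense_square_bytes(n_rows, itemsize=8):
    """Bytes for a dense (n_rows, n_rows) array; 8 is the float64 item size."""
    return n_rows * n_rows * itemsize


# Row counts taken from the reports above; X from numpy.random.random is float64.
for n_rows in (30000, 40000, 50000):
    size = dense_square_bytes(n_rows)
    print("%d rows -> %.1f GB" % (n_rows, size / 1e9))

# 30000 rows -> 7.2 GB
# 40000 rows -> 12.8 GB
# 50000 rows -> 20.0 GB
```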

@jnothman
Member
jnothman commented Feb 3, 2015

I'm closing this as belonging to numpy (until further notice?).

@jnothman jnothman closed this as completed Feb 3, 2015
@amueller
Member
amueller commented Feb 3, 2015

That is ... surprising... to say the least.

@dschallis
Author

I've submitted this to numpy with some more details for now. Thanks for the investigation work.

@Renzhh
Renzhh commented May 11, 2015

@dschallis Has the problem been solved by NumPy? I originally ran into the same problem when calculating silhouette_score or silhouette_samples with a large number of rows. See #4701

@dschallis
Author

@Renzhh I don't think so, it's currently still an open issue with numpy: numpy/numpy#5533

@Renzhh
Renzhh commented May 15, 2015

@dschallis Is the problem "Segmentation fault when calculating euclidean_distances for large numbers of rows" really caused by numpy? I ran into this problem too; see my issue #4701. When the sparse matrix has fewer than 30,000 rows, both silhouette_score and silhouette_samples work and return the expected results. But when X has more than 100,000 rows, the program crashes with "Segmentation fault (core dumped)". I'm debugging...

Since then, how have you been calculating silhouette_score for a large number of rows?
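One possible workaround, not suggested by anyone in this thread, is the sample_size parameter of silhouette_score. It computes the score on a random subset of rows, so the pairwise distance matrix covers only the sample. The sketch below uses a small stand-in for X, and the sizes are arbitrary assumptions. The result is an estimate on the subsample, not the exact score, and this does not address the segfault itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Small stand-in data; the shape and sample size here are illustrative only.
rng = np.random.RandomState(1)
X = rng.random_sample((2000, 20))

labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Score a random subset of 500 rows instead of all rows, so the
# pairwise distance matrix is 500 x 500 rather than 2000 x 2000.
score = silhouette_score(X, labels, metric='euclidean',
                         sample_size=500, random_state=1)
print(score)
```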

@jnothman
Member

For a sparse matrix we may need to assume it is a different issue. Still, check whether it's a problem with silhouette_score (unlikely, because it doesn't do anything low-level enough to result in a segfault) or with euclidean_distances. Also, density in sparse matrices may be more important than number of rows. In any case, a segfault is undesirable behaviour and needs to be fixed. Finally, what version of scipy do you have? Does the segfault occur with the most recent version?

@Renzhh
Renzhh commented May 15, 2015

@jnothman After checking, I'm sure the problem is caused by euclidean_distances. My scipy version is 0.13.3 and my numpy version is 1.8.2. I'll now try the most recent stable versions to check whether the segmentation fault still occurs.

@Renzhh
Renzhh commented May 18, 2015

@jnothman With recent versions (scipy 0.15.1 and numpy 1.9.2), the segmentation fault still occurs. However, running scipy.test() suggests that my installed scipy package has some minor problems.
