K-Means clustering performance improvements #10744
I suppose there's some sense in handling low dimensions as fast paths for something like KMeans. Interesting.
One can get a large performance improvement in any dimension. Here is the code; I ran it on a MacBookPro (4 cores, Haswell, AVX2) and a dual Xeon (36 cores, Skylake, AVX512). There is still room for improvement, but even in large dimensions one can increase the performance substantially.

```cython
import time

import numpy as np
import sklearn.cluster
from sklearn.cluster import KMeans

from cython.parallel import prange
cimport cython

cdef int nb_points = 100000
cdef int nb_clusters = 1024
cdef int dim = 128
cdef int nb_iterations_insideloop = 1
cdef int nb_iterations_scikitlearn = 1
cdef int nb_iterations_intel = 1

point_il = np.random.rand(nb_points * dim).astype('float32')
point = np.random.rand(nb_points, dim).astype('float32')
cluster = np.arange(0, nb_points, dtype='int32') % nb_clusters
centroid_il = np.zeros(nb_clusters * dim, dtype='float32')
pop_il = np.zeros(nb_clusters, dtype='int32')

cdef float[::1] point_view = point_il
cdef int[::1] cluster_view = cluster
cdef float[::1] centroid_view = centroid_il
cdef int[::1] pop_view = pop_il

# InsideLoop: the Cython implementation defined below
t0 = time.time()
kmeans(dim, point_view, cluster_view, centroid_view, pop_view,
       nb_iterations_insideloop)
t1 = time.time()
gflops = 1.0e-9 * nb_points * nb_iterations_insideloop * nb_clusters * 3 * dim / (t1 - t0)
print('   Time for KMeans clustering, InsideLoop: {} s'.format(
    (t1 - t0) / nb_iterations_insideloop))
print('                               Performance: {} Gflops/s'.format(gflops))

# Scikit-Learn, function interface
t0 = time.time()
res = sklearn.cluster.k_means(point, nb_clusters, init='random',
                              n_init=1, tol=1.0e-16,
                              max_iter=nb_iterations_scikitlearn,
                              return_n_iter=True)
t1 = time.time()
gflops = 1.0e-9 * nb_points * nb_iterations_scikitlearn * nb_clusters * 3 * dim / (t1 - t0)
print(' Time for KMeans clustering, Scikit-Learn: {} s'.format(
    (t1 - t0) / nb_iterations_scikitlearn))
print('                               Performance: {} Gflops/s'.format(gflops))

# Scikit-Learn, estimator interface (Intel distribution when installed)
t0 = time.time()
res = KMeans(init='random', n_init=1, tol=1.0e-16, n_clusters=nb_clusters,
             random_state=0, max_iter=nb_iterations_intel).fit(point)
t1 = time.time()
gflops = 1.0e-9 * nb_points * nb_iterations_intel * nb_clusters * 3 * dim / (t1 - t0)
print('        Time for KMeans clustering, Intel: {} s'.format(
    (t1 - t0) / nb_iterations_intel))
print('                               Performance: {} Gflops/s'.format(gflops))


@cython.boundscheck(False)
@cython.wraparound(False)
cpdef void kmeans(int dim,
                  float[::1] point, int[::1] cluster,
                  float[::1] centroid, int[::1] pop,
                  int nb_iterations) nogil:
    cdef int m = len(point) // dim     # m is the number of points
    cdef int n = len(centroid) // dim  # n is the number of clusters
    cdef int i, j, k, l, best_j
    cdef float alpha, x, distance, best_distance
    for k in range(nb_iterations):
        # Reset, accumulate, then average: recompute every centroid
        # from the current assignment.
        for j in range(n):
            for l in range(dim):
                centroid[j * dim + l] = 0.0
            pop[j] = 0
        for i in range(m):
            j = cluster[i]
            for l in range(dim):
                centroid[j * dim + l] += point[i * dim + l]
            pop[j] += 1
        for j in range(n):
            if pop[j] > 0:
                alpha = 1.0 / pop[j]
                for l in range(dim):
                    centroid[j * dim + l] *= alpha
        # Assignment step, parallelized over the points with OpenMP.
        for i in prange(m):
            best_distance = dim + 1.0  # upper bound for data in [0, 1]^dim
            best_j = -1
            for j in range(n):
                distance = 0.0
                for l in range(dim):
                    x = point[i * dim + l] - centroid[j * dim + l]
                    distance = distance + x * x
                if distance < best_distance:
                    best_distance = distance
                    best_j = j
            cluster[i] = best_j
```
Interesting benchmarks, thank you. To use something like that, there are currently two blockers, as far as I know: one is OpenMP usage (see #7650) and the other is memoryviews (#10624, also related #10663, and probably some older issues). Looks like both could get resolved; @ogrisel and @lesteve would know more about it...
Wouldn't that be an unfair advantage for these benchmarks, depending on what the compiler optimizes for? Maybe importing the Cython function from a pure Python script and running the timings there would be safer, just to be sure?
Concerning the case where the dimension is known at compile time: it was just to show that this information can lead to large speedups. It was the first code I tried, but it is not the main point I would like to focus on.

Concerning OpenMP, I don't know how you handle multithreading in Scikit-Learn (apart from using a multithreaded BLAS). I've heard from Alexandre Gramfort that OpenMP might cause problems, as some compilers (think Apple and their version of clang) do not support OpenMP out of the box. I know that Intel has been working hard to integrate TBB (Threading Building Blocks) into their Python distribution. It is an open-source library (Apache 2), works on many platforms (including ARM), and has the huge advantage of handling nested parallelism. Moreover, Intel MKL supports both OpenMP and TBB. I don't know if this is an option you have considered.

Concerning memoryviews, they are extremely important if you want vectorization. With NumPy, the problem is that the strides can be whatever you want: as far as I know, there is no guarantee that a[i] and a[i + 1] are next to each other in memory. This alone just kills performance and prevents vectorization when writing loops. With AVX-512 and 32-bit floating-point numbers, you are leaving a potential performance factor of 16 on the table. Even on your laptop with AVX2, this is a performance factor of 8.

As I said, I am a newcomer to the Python community and I don't know your requirements and problems well, but I would be glad to learn more, and I have some good knowledge of multithreading and vectorization on CPUs. I can help and give you a hand on those subjects.
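To make the stride point above concrete, here is a minimal Python sketch (array names are illustrative) showing how a NumPy view can be non-contiguous, and what a contiguous memoryview declaration guarantees:

```python
import numpy as np

a = np.random.rand(100000, 128).astype('float32')

row = a[0, :]   # C-contiguous view: consecutive elements 4 bytes apart
col = a[:, 0]   # strided view: consecutive elements 512 bytes apart

print(row.flags['C_CONTIGUOUS'])  # True
print(col.flags['C_CONTIGUOUS'])  # False

# A Cython declaration such as float[::1] rejects the strided view;
# forcing a compact copy restores unit stride:
col_contig = np.ascontiguousarray(col)
print(col_contig.flags['C_CONTIGUOUS'])  # True
```

This is exactly why the `float[::1]` declarations matter: the unit stride is known at compile time, so the C compiler can emit vectorized (AVX) loads for the inner loop.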
It is also interesting to notice that multithreading and vectorization do not explain everything. On my MacBookPro with 4 cores and AVX2 (256-bit registers holding 32-bit floats), we could hope for a theoretical speedup of 4 x (256 / 32) = 32. Usually you are happy when you get a speedup of 16 in such cases. Here the speedup is 234. So not only are you not exploiting multithreading and vectorization correctly, you are also doing something that makes things worse. In your current implementation, the hotspot is here:

```cython
for sample_idx in range(n_samples):
    min_dist = -1
    for center_idx in range(n_clusters):
        dist = 0.0
        # hardcoded: minimize euclidean distance to cluster center:
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>
        dist += dot(n_features, &X[sample_idx, 0], x_stride,
                    &centers[center_idx, 0], center_stride)
        dist *= -2
        dist += center_squared_norms[center_idx]
        dist += x_squared_norms[sample_idx]
        if min_dist == -1 or dist < min_dist:
            min_dist = dist
            labels[sample_idx] = center_idx
    if store_distances:
        distances[sample_idx] = min_dist
    inertia += min_dist
```

This code is the hotspot of your current implementation.
Now imagine the disaster when you call dot to compute a scalar product of vectors in dimension 3: the function-call overhead dwarfs the three multiply-adds it performs, so you are slowing things down.
As a consequence, I believe that many things can be done to make your code faster and more stable. I would do, in order:
Now comes the question of multithreading. There is a saying in HPC: vectorize the innermost loop and use multithreading for the outermost loop. Unfortunately, the Python language pushes people to let multithreading be handled by low-level libraries, and therefore to use multithreading in innermost loops. If there is a loop that needs multithreading here, it is

These are my 2 cents on this implementation. I can help implement something without the (a-b)^2 trick and without the call to the BLAS. It should make things faster and make it way more efficient to parallelize the different initializations. But I would really need memoryviews. I don't understand your problem regarding that. Can someone give me some pointers on the problems it raises?

François
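On the stability point: here is a minimal Python sketch (values chosen for illustration) of how the expanded form ||a||^2 + ||b||^2 - 2 <a, b> loses precision in float32 when points are far from the origin, while the direct (a - b)^2 computation stays accurate:

```python
import numpy as np

rng = np.random.RandomState(0)
a = (rng.rand(3) + 1000.0).astype(np.float32)  # a point far from the origin
b = a + np.float32(1e-3)                       # a very close neighbour

# Direct computation: the differences are tiny and accurate.
direct = np.sum((a - b) ** 2)

# Expanded form: three dot products of magnitude ~3e6, each accurate
# only to ~0.3 in float32, must cancel down to ~3e-6. The lost digits
# cannot be recovered, and the result can even come out negative.
expanded = np.dot(a, a) + np.dot(b, b) - 2.0 * np.dot(a, b)

print(direct)    # ~3e-6
print(expanded)  # wildly wrong, possibly < 0
```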
I don't think I'm able to invest too deeply in this discussion right now. I don't really see how we're going to compile out the number of features without adding dependencies to perform that just-in-time compilation... Do we get big gains without that optimisation?
Hi Joel. In my first post on this thread, the benchmark needs the number of features to be known at compile time. But in my second post, the number of features is not known at compile time and we still get large speedups. PS1: Note that I have managed to make a few changes (register blocking) and made k-means clustering with 3 features (known at compile time)
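In case it helps, here is a hypothetical Cython sketch of the register-blocking idea for 3 features (all names are mine, data assumed in [0, 1]^3, and the remainder loop for odd m is omitted): two points are scanned per centroid pass, so their partial sums stay in registers and their multiply-adds pipeline independently:

```cython
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef void assign_dim3_blocked(float[::1] point, float[::1] centroid,
                               int[::1] cluster, int m, int n) nogil:
    cdef int i, j, b0, b1
    cdef float dx0, dy0, dz0, dx1, dy1, dz1, d0, d1, best0, best1
    for i in range(0, m - 1, 2):
        best0 = best1 = 4.0  # > any squared distance for data in [0, 1]^3
        b0 = b1 = -1
        for j in range(n):
            # Point i and point i + 1 are handled in the same pass.
            dx0 = point[3 * i]     - centroid[3 * j]
            dy0 = point[3 * i + 1] - centroid[3 * j + 1]
            dz0 = point[3 * i + 2] - centroid[3 * j + 2]
            dx1 = point[3 * i + 3] - centroid[3 * j]
            dy1 = point[3 * i + 4] - centroid[3 * j + 1]
            dz1 = point[3 * i + 5] - centroid[3 * j + 2]
            d0 = dx0 * dx0 + dy0 * dy0 + dz0 * dz0
            d1 = dx1 * dx1 + dy1 * dy1 + dz1 * dz1
            if d0 < best0:
                best0 = d0
                b0 = j
            if d1 < best1:
                best1 = d1
                b1 = j
        cluster[i] = b0
        cluster[i + 1] = b1
```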
Thanks for the summary. Very interesting analysis. The Euclidean computation trick has bitten us elsewhere. I'm +1 for the improvement to that and to dot at a minimum. Our docs aren't cheating so much as trying to stay fast to build.
Thanks for this in-depth analysis @FrancoisFayard! It's valuable to have a fresh look on this. Below are a few comments and questions (cc @massich).
Outside of Cython we use
Regarding TBB: no, not as far as GitHub issue history goes. BTW, #9429 might be somewhat related to this issue in general. Also note that scikit-learn is frequently used in combination with OpenBLAS, and I'm not sure how much TBB is used in that context (e.g. OpenMathLib/OpenBLAS#1282).
The issue was that until recently Cython did not support read-only memoryviews (cf. #10624 (comment)), which is currently necessary for parallel computations with joblib. Therefore most of scikit-learn's Cython code currently uses the array buffer interface. That interface also allows forcing array contiguity and alignment (e.g. example here), but from what I understand indexing does call the Python C API, one can't release the GIL, and there is some additional performance cost. The upcoming Cython 0.28 will support read-only memoryviews, but it remains to be seen how well this will work with the existing code base cython/cython#1869 (comment) (hopefully it will). (See the sketch at the end of this comment contrasting the two interfaces.)
Interesting. If that's true in general, why do MKL / OpenBLAS multithread those? (I imagine
How did you install the default scikit-learn version ("Scikit-Learn - Vanilla" in your benchmarks)? Bear in mind that for portability reasons, binary distributions of scikit-learn may have been compiled with intentionally outdated versions of gcc. For instance, the manylinux1 policy that produces the binary wheels uploaded to PyPI uses gcc 4.x. This can have some impact I imagine (e.g. squeaky-pl/japronto#52). Similarly, until recently conda used gcc 4.x. As far as I could see, on the Travis CI used to build wheels the default compilation flag is
As Joel said, your help would be appreciated. I don't know if it's possible to get some of those speedups with the Cython buffer interface instead of memoryviews for now.
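Following up on the memoryview point above, a minimal Cython sketch contrasting the two interfaces (function names are made up for illustration):

```cython
import numpy as np
cimport numpy as np
cimport cython

np.import_array()

# Array buffer interface (what most scikit-learn code used at the time):
# indexing is fast, but the argument type cannot be used in nogil code.
@cython.boundscheck(False)
def sum_buffer(np.ndarray[np.float32_t, ndim=1] a):
    cdef float s = 0.0
    cdef int i
    for i in range(a.shape[0]):
        s += a[i]
    return s

# Typed memoryview with forced C-contiguity: the [::1] declaration
# guarantees unit stride, and the loop can run with the GIL released.
@cython.boundscheck(False)
def sum_memview(float[::1] a):
    cdef float s = 0.0
    cdef int i
    with nogil:
        for i in range(a.shape[0]):
            s += a[i]
    return s
```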
Thanks for your reply @rth
This is to say that I don't think that using an old version of gcc (I have seen worse than gcc 4.x) is a major problem. The bigger problem would be to generate a single binary from your Cython code that works optimally on all Intel architectures, whether they are SSE, AVX, AVX2 or AVX-512. There are some solutions to do that in C/C++, but they would need to be integrated into Cython if you want to get it from Cython.
I'll try to optimize my C++ code to see how it compares against Intel's DAAL, and I'll then try to make it as fast using Cython and memoryviews. It might take some time though, as I am quite busy right now. I'll come back to you, and I'll even try to give a talk at INRIA about my findings.
Thank you for a great analysis. Not sure if that has been addressed elsewhere, but

That's exactly what we do if n_jobs > 1. I think the (a-b)^2 trick is there so that we can make a BLAS call, which can use multithreading / a fast BLAS.
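For context, a minimal NumPy sketch (names and sizes are illustrative) of how the expansion turns the whole distance matrix into a single matrix product that a multithreaded BLAS can accelerate:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(10000, 128).astype(np.float32)  # samples
C = rng.rand(1024, 128).astype(np.float32)   # cluster centers

# ||x - c||^2 = ||x||^2 + ||c||^2 - 2 <x, c>: the cross term for all
# (sample, center) pairs is one GEMM (X @ C.T), executed by MKL or
# OpenBLAS at near-peak flops on all cores.
x_norms = (X ** 2).sum(axis=1)[:, np.newaxis]
c_norms = (C ** 2).sum(axis=1)[np.newaxis, :]
distances = x_norms + c_norms - 2.0 * (X @ C.T)

labels = distances.argmin(axis=1)
```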
We can only consider TBB if it works well with OpenBLAS and if we can access it from Cython.
I think what loky allows here (#7650 has more details) is to use OpenMP through Cython while avoiding freezes due to the interaction of Python multiprocessing with OpenMP (to sum up: a bad interaction of fork without exec with libraries maintaining their own thread pool, see this for more details). Note that loky is only part of the joblib development version and so is not bundled in scikit-learn at the time of writing. There have been quite a few changes in joblib and it may take a little bit of time until the next joblib release.

While I am here, a comment more to the point of this issue: until now it seems like the focus is how fast we can make KMeans in an ideal world. I think it would be good at some point to also look at how these ideas can be integrated into the existing scikit-learn code, and to have some benchmarks of a scikit-learn PR vs scikit-learn master. There is a benchmark in benchmarks/bench_plot_fastkmeans.py; maybe some of it can be reused.
@amueller: I agree that the (a-b)^2 trick is a nice way to use a BLAS, but it can be used better. I can propose a replacement that could be way better, especially when the number of features is large. Concerning TBB, I have talked to some Intel people, and they might consider supporting it in Cython if there is demand. But I don't think you could get TBB in OpenBLAS unless you do it yourself. Still, it is possible to mix multithreading libraries and call an OpenMP-multithreaded library from a TBB-multithreaded library. It is not ideal, but it can be done.
@FrancoisFayard Hi, I don't know if you are still working on this, but if you have some time I've got a few questions about your benchmark. I'm trying to reproduce your benchmarks (with n_dim unknown at compile time) but I'm not even getting close to yours.

```python
extensions = [Extension("kmeans",
                        sources=["kmeans.pyx"],
                        extra_compile_args=['-qopenmp', '-march=core-avx2'],
                        extra_link_args=['-qopenmp'],
                        include_dirs=[numpy.get_include()])]

setup(
    ext_modules=cythonize(extensions, annotate=True)
)
```

Could you tell me which flags you used to compile with icc? Also, did you try to compile with gcc on your laptop? And finally, can you give me more precise information about your laptop's CPU? Are you close to the peak?
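For what it's worth, here is a variant of the setup above for gcc/clang; the flag spellings are assumptions to adjust for your toolchain (gcc spells the OpenMP flag -fopenmp where icc uses -qopenmp, and -march=native targets the build machine):

```python
# setup.py sketch for a gcc build of the same extension
import numpy
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

extensions = [Extension("kmeans",
                        sources=["kmeans.pyx"],
                        extra_compile_args=['-fopenmp', '-O3', '-march=native'],
                        extra_link_args=['-fopenmp'],
                        include_dirs=[numpy.get_include()])]

setup(ext_modules=cythonize(extensions, annotate=True))
```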
Hi,
I am new to the Scikit-Learn world, but I've talked a lot with Alexandre Gramfort about some performance enhancements that can be made to Scikit-Learn. I knew for a long time that the KMeans clustering code was suboptimal, but my version was C++ code and it seems that is not a language used in Scikit-Learn. Whether or not it was possible to make the KMeans clustering implementation of Scikit-Learn faster within the constraints of the community was an open question to me, as I am not a Python programmer.
I am going to focus on Intel architectures. The benchmarks I've done compare the vanilla Scikit-Learn (the latest one on Anaconda), Intel's Scikit-Learn (the latest one from the intel channel, which is linked to the DAAL) and a code I've written using Cython and compiled with the Intel compiler.
Here are the speed differences for classifying 270 000 points into 1024 clusters in dimension 3. This example comes from the color quantization documentation of Scikit-Learn, which uses an image of 270 000 pixels. The reference speed is the vanilla Scikit-Learn on a MacBook Pro with 4 cores; any higher number represents a speedup over this configuration.

MacBookPro: 4 cores, Haswell (AVX2)

- Scikit-Learn - Vanilla: x 1
- Scikit-Learn - Intel: x 52
- InsideLoop - Icc 18: x 234 (about 129 GFlops/s)

Dual Xeon: 36 cores, Skylake (AVX512)

- Scikit-Learn - Vanilla: x 2
- Scikit-Learn - Intel: x 85
- InsideLoop - Gcc 7.3: x 276
- InsideLoop - Icc 18: x 1382 (about 754 GFlops/s)

This shows that some huge improvements can be made, even above the Intel version. There are a few things where my code has an advantage:
Here is the code if you are interested. A few things are missing, such as the handling of empty clusters, but I have never seen that being a hotspot for k-means clustering. Also, bear in mind that I am more of a C/C++ programmer: this is my first Cython code and there might be things that are not really "appropriate". To get the speedup, I leverage both parallelization (using OpenMP) and vectorization (thanks to memoryviews and the Intel compiler).
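For anyone who wants to reproduce the setting above, a minimal sketch of the 270 000-point, dimension-3 problem; the sample image name is an assumption taken from scikit-learn's color quantization example:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

# 427 x 640 RGB image -> about 273 000 points in dimension 3
image = load_sample_image("china.jpg")
pixels = image.reshape(-1, 3).astype(np.float32) / 255.0

km = KMeans(n_clusters=1024, init='random', n_init=1, max_iter=1,
            random_state=0).fit(pixels)
print(km.inertia_)
```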