Description
I ran some benchmarks on KMeans performance when varying the working_memory (see #10280). I am opening this discussion as suggested by @rth in #11271. In KMeans, working memory is involved in the function pairwise_distances_argmin_min.
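For readers who have not used the setting: a minimal sketch of how working_memory can be changed around a call to pairwise_distances_argmin_min, using sklearn.config_context. The array sizes here are arbitrary and much smaller than in my benchmark. The setting should only affect how the computation is chunked, not the result, and depending on the scikit-learn version this function may not consult working_memory at all.

```python
import numpy as np
import sklearn
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.RandomState(0)
X = rng.rand(2000, 50)        # samples (illustrative size)
centers = rng.rand(100, 50)   # cluster centers (illustrative size)

# working_memory is expressed in MiB; a small value forces many small chunks.
with sklearn.config_context(working_memory=1):
    idx_small, dist_small = pairwise_distances_argmin_min(X, centers)

# A large value lets the distances be computed in few (or one) chunks.
with sklearn.config_context(working_memory=1024):
    idx_large, dist_large = pairwise_distances_argmin_min(X, centers)

# Chunking changes memory use and speed, not the nearest-center assignment.
print(np.array_equal(idx_small, idx_large))
```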
You can see benchmarks below. I benchmarked KMeans.fit on a problem with 100000 samples, 50 dimensions and 1000 clusters, on 3 different machines.
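A rough sketch of how such a timing could be reproduced. This is not the exact script I used: the helper name time_kmeans_fit, the KMeans parameters (n_init, max_iter, init) and the reduced default sizes are assumptions chosen so that it runs quickly. Note also that on scikit-learn versions where KMeans does not go through the chunked pairwise distances, working_memory may have no effect on these timings.

```python
import time

import numpy as np
import sklearn
from sklearn.cluster import KMeans


def time_kmeans_fit(working_memory_mib, n_samples=2000, n_features=50,
                    n_clusters=20, seed=0):
    """Return the wall-clock time (seconds) of one KMeans.fit call.

    Hypothetical helper: the sizes default to a small problem; the
    benchmark in this issue used 100000 samples, 50 dimensions and
    1000 clusters.
    """
    rng = np.random.RandomState(seed)
    X = rng.rand(n_samples, n_features)
    # Single init and a fixed iteration cap keep runs comparable.
    km = KMeans(n_clusters=n_clusters, init="random", n_init=1,
                max_iter=5, random_state=seed)
    with sklearn.config_context(working_memory=working_memory_mib):
        start = time.perf_counter()
        km.fit(X)
        return time.perf_counter() - start


# Sweep a few working_memory values (in MiB) and print the timings.
for wm in (1, 8, 64, 1024):
    print(f"working_memory={wm} MiB: {time_kmeans_fit(wm):.3f} s")
```

For meaningful numbers, each setting should be repeated several times on an otherwise idle machine.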
It seems that working memory has an impact on performance, and moreover that the optimum is close to the CPU cache size. I think the first benchmark is quite noisy because it was run on my machine with other processes running; it also focuses on smaller working memory values.
Even though the improvement is at most about 2x, it's worth considering a change to the default value of the working memory, which is currently 1000 MB. However, the optimal value depends on the CPU specs. Would it be possible to infer working_memory from them?
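To make the question concrete, here is one possible way to look up the cache size, as a sketch only. It assumes a Linux machine that exposes cache information under /sys/devices/system/cpu/cpu0/cache; the helper names parse_cache_size and largest_cache_size_bytes are hypothetical, and nothing like this exists in scikit-learn as far as I know.

```python
import re
from pathlib import Path


def parse_cache_size(text):
    """Convert a sysfs cache size string such as '8192K' to bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized cache size: {text!r}")
    multiplier = {"": 1, "K": 2**10, "M": 2**20, "G": 2**30}[match.group(2)]
    return int(match.group(1)) * multiplier


def largest_cache_size_bytes(cache_dir="/sys/devices/system/cpu/cpu0/cache"):
    """Return the largest cache size found for cpu0, or None if unavailable.

    Assumption: Linux sysfs layout with one index*/size file per cache level.
    """
    root = Path(cache_dir)
    if not root.is_dir():
        return None
    sizes = []
    for size_file in root.glob("index*/size"):
        try:
            sizes.append(parse_cache_size(size_file.read_text()))
        except (OSError, ValueError):
            continue  # skip unreadable or oddly formatted entries
    return max(sizes) if sizes else None


cache_bytes = largest_cache_size_bytes()
if cache_bytes is None:
    print("cache size not available on this platform")
else:
    # working_memory is given in MiB, so convert before passing it to
    # sklearn.set_config(working_memory=...) or sklearn.config_context.
    print(f"candidate working_memory: {cache_bytes / 2**20:.2f} MiB")
```

Whether the largest cache level is the right target, rather than a per-core level, is exactly the kind of thing that would need more benchmarks.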
ping @ogrisel