Discussion: working memory vs. performance #11506
Thanks for these benchmarks @jeremiedbb! They are consistent with @jnothman's earlier benchmarks in #10280 (comment). If I rerun the first set of benchmarks, divide all timings by the curve minimum (to be able to compare them on a linear scale), remove the log scale, and add more points: (updated plot omitted). My L3 CPU cache is 3 MB, and the optimum here appears to be around 30 MB, which is rather consistent with your figures above.
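The normalization step described above (dividing each timing curve by its minimum so curves from different machines share one linear scale) can be sketched as follows; the arrays are hypothetical, stand-ins for the actual benchmark data:

```python
import numpy as np

# Hypothetical timings (seconds), one per tested working_memory value (MB)
working_memory_mb = np.array([1, 4, 16, 30, 64, 256, 1024])
timings = np.array([12.0, 8.5, 6.2, 5.9, 6.4, 7.8, 9.1])

# Dividing by the curve minimum maps the optimum to 1.0 on every machine,
# so curves from CPUs with very different speeds can share one plot.
relative = timings / timings.min()

best = working_memory_mb[np.argmin(relative)]
print(best)  # → 30 (the working_memory with the fastest run in this fake data)
```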
Assuming we could detect the CPU L3 cache size (which doesn't seem very straightforward), what do you think the relationship would be? |
Not that consistent: I find the minimum to be around the CPU cache size, while yours seems to be about 10× the CPU cache. Which function did you benchmark? |
What's the cache size on your machine? |
Sorry, I didn't say: it's 4 MB. What's the number of clusters, and what's the dtype of your arrays? |
See the gist with the benchmark code in the first link of my first comment. |
https://github.com/workhorsy/py-cpuinfo seems to be the python tool of choice to get the L3 CPU cache size. |
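A minimal sketch of how py-cpuinfo might be queried, assuming its `get_cpu_info()` dict exposes an `l3_cache_size` field (an int in bytes in recent versions, a string like `"3072 KB"` in older ones); the helper below normalizes both forms:

```python
def cache_size_bytes(value):
    """Normalize a py-cpuinfo cache-size value to bytes.

    Recent py-cpuinfo versions report an int (bytes); older ones
    report strings such as '3072 KB' or '3 MB'.
    """
    if isinstance(value, int):
        return value
    units = {"KB": 1024, "MB": 1024 ** 2, "KIB": 1024, "MIB": 1024 ** 2}
    number, unit = value.split()
    return int(float(number) * units[unit.upper()])

try:
    import cpuinfo  # pip install py-cpuinfo (assumption: field name below exists)
    l3 = cache_size_bytes(cpuinfo.get_cpu_info()["l3_cache_size"])
    print(f"L3 cache: {l3 / 1024 ** 2:.0f} MB")
except (ImportError, KeyError):
    l3 = None  # fall back to a fixed working_memory default
```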
@rth I ran your benchmark on my machine and got the same result: a minimum around 30 MB of working memory.
Agreed. However, I've no idea how the optimal working memory relates to the CPU specs. So should we use a lower fixed default like 32 or 64 MB? |
In addition, the dtype of X in your benchmarks is np.float64. When I run the same benchmark with dtype=np.float32, I find the minimum at 60 MB. See below. Hmm, actually I'm not sure, but the chunk sizes seem to be computed incorrectly in:

```python
chunk_n_rows = get_chunk_n_rows(row_bytes=8 * _num_samples(Y),
                                max_n_rows=n_samples_X,
                                working_memory=working_memory)
```

We should compute row_bytes according to X.dtype. @jnothman You wrote that code, right? Can you confirm? |
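To illustrate the suggested fix, here is a self-contained sketch (not sklearn's actual get_chunk_n_rows, though it mirrors its MiB-based arithmetic) where row_bytes is derived from the array's dtype instead of the hard-coded 8-byte float64 itemsize:

```python
import numpy as np

def chunk_n_rows(row_bytes, max_n_rows, working_memory):
    """Sketch of the chunking logic: how many rows of the distance
    matrix fit in `working_memory` MiB, capped at `max_n_rows`."""
    n = int(working_memory * 2 ** 20 // row_bytes)
    return max(1, min(n, max_n_rows))

# The fix suggested above: derive row_bytes from X.dtype instead of
# hard-coding 8 (the float64 itemsize).
X = np.zeros((100_000, 50), dtype=np.float32)
n_samples_Y = 1000  # e.g. number of cluster centers in KMeans

row_bytes = X.dtype.itemsize * n_samples_Y  # 4 * 1000 for float32
print(chunk_n_rows(row_bytes, max_n_rows=X.shape[0], working_memory=64))
# → 16777 (twice as many rows per chunk as the hard-coded float64 path)
```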
We won't be able to add an external dependency like py-cpuinfo for this.
Yes, but what about #10280 (comment)?
There was a somewhat related discussion in #10280 (comment), but I'm not sure if this was desired or not. (I would test 64 bit rather than 32 bit by default, given #9354.) |
Sorry, I don't have much attention for this right now. Yes, I may have forgotten to make the output dtype size configurable. I did not, however, expect us to get a perfect parameterisation on the first shot; I needed, in the first instance, a sensible mechanism for controlling algorithms where unbounded memory usage was causing problems.

I had also initially found 64 MiB was appropriate, but increased the value given others' benchmarks. Again, I don't think this requires fine-tuning as long as the API is stable, but I would happily see it decreased an order of magnitude or calculated from system specs where readily available. |
Given the recent improvements to pairwise distances and reductions, which are memory-efficient and fast, I think the subject of this issue is becoming outdated. pairwise_distances_chunked and pairwise_distances_argmin(_min) are destined to disappear at some point. |
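For context, pairwise_distances_chunked is the function that honours the working_memory setting: it yields the distance matrix in row blocks sized to fit the budget. A quick sketch of its use (the array here is arbitrary demo data):

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(500, 10)

# Each yielded chunk is a block of rows of the full 500 x 500 distance
# matrix, sized so a chunk stays within working_memory (in MB). With a
# 1 MB budget this yields several chunks instead of one big matrix.
chunks = list(pairwise_distances_chunked(X, working_memory=1))
D = np.vstack(chunks)
print(D.shape)  # → (500, 500)
```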
I ran some benchmarks on KMeans performance when varying working_memory (see #10280). I open this discussion as suggested by @rth in #11271. In KMeans, working memory is involved in the function pairwise_distances_argmin_min. You can see the benchmarks below: I benchmarked KMeans.fit on a problem with 100000 samples, 50 dimensions and 1000 clusters, on 3 different machines.

(benchmark plots omitted)

It seems that working memory has an impact on performance, and moreover that the optimum is close to the CPU cache size. I think the first plot is quite noisy because it was made on my machine with other processes running, and it also focuses on smaller working memories.

Even if the improvement is at most 2x, it's worth considering changing the default value of working memory, which is currently 1000 MB. However, the optimum depends on the CPU specs. Would it be possible to make working_memory be inferred from them? ping @ogrisel
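For anyone wanting to reproduce these experiments: whatever the default ends up being, it can already be overridden globally with sklearn.set_config or temporarily with sklearn.config_context, e.g.:

```python
import sklearn

# Global override: chunked pairwise-distance computations (including the
# pairwise_distances_argmin_min calls inside KMeans) target ~64 MB chunks.
sklearn.set_config(working_memory=64)
print(sklearn.get_config()["working_memory"])  # → 64

# Temporary override, handy when sweeping values in a benchmark loop
with sklearn.config_context(working_memory=4):
    print(sklearn.get_config()["working_memory"])  # → 4
```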