Some processes not working under clustering.MeanShift #6943
Possibly related: #6023 |
It works fine on my desktop but I can reproduce the behaviour on a big memory server with 48 cores. Will need to investigate more. |
Yes, here too, the problem shows up on large machines. |
So it seems like the automatic batching of tasks is not well suited to some machines. I am not exactly sure why yet. A work-around that works for me is to set joblib.parallel.MIN_IDEAL_BATCH_DURATION to a higher value. If you can test whether this snippet works for you, that'd be great:

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.externals.joblib import parallel

parallel.MIN_IDEAL_BATCH_DURATION = 1.
parallel.MAX_IDEAL_BATCH_DURATION = parallel.MIN_IDEAL_BATCH_DURATION * 10

ndim = 4
points = np.random.random([100000, ndim])
MS = MeanShift(n_jobs=20, bandwidth=0.1)
print("Starting.")
MS.fit(points)
```
|
This does indeed make things better. Should we add your workaround to the MeanShift code, or are you going to fix this in joblib? |
Glad to hear that!
I think this kind of work-around is best left in client code rather than in scikit-learn. I'll try to understand the problem in more detail, and if there is a fix it will happen in joblib. |
Great sleuthing @lesteve! (I'm wondering how you narrowed it down to that, or whether the timing of the problem gave it away.) |
To be perfectly honest, I tried different things before it started to make sense. At one point I tried different batch_size values in the Parallel object and realized that the auto-batching wasn't performing very well. |
I have opened an issue in joblib: joblib/joblib#372. I'll close this one. |
@lesteve I'm still experiencing this problem when working with very large datasets. My supervisor worked on it and found the only possible workaround was to define

```python
def _mean_shift_multi_seeds(my_means, X, nbrs, max_iter):
    return [_mean_shift_single_seed(my_mean, X, nbrs, max_iter)
            for my_mean in my_means]
```

and then

```python
all_res = Parallel(n_jobs=n_jobs, max_nbytes=1e6, verbose=2)(
    delayed(_mean_shift_multi_seeds, has_shareable_memory)
    (seeds[i * nseeds:(i + 1) * nseeds], X, nbrs, max_iter)
    for i in range(n_jobs))
```

In other words, manually splitting the seeds into a number of arrays corresponding to the number of jobs we want to spawn. However, this means that joblib can't handle it properly. How can we solve this? At the moment, we have an ill-functioning method in sklearn. |
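For reference, nseeds is not defined in the snippet above; a minimal sketch of one way it could be computed, assuming it is simply the number of seeds handled per job (the concrete values here are hypothetical stand-ins):

```python
import numpy as np

# Hypothetical values standing in for the real ones in the snippet above.
n_jobs = 20
seeds = np.random.random((100000, 4))

# Assumption: `nseeds` is the chunk size so that `seeds` is split into
# `n_jobs` roughly equal slices (the last slice may be a bit shorter).
nseeds = int(np.ceil(len(seeds) / n_jobs))
chunks = [seeds[i * nseeds:(i + 1) * nseeds] for i in range(n_jobs)]
assert sum(len(c) for c in chunks) == len(seeds)
```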
I am not sure what you mean by this, are you saying that the work-around of setting MIN_IDEAL_BATCH_DURATION to a higher value is not enough?
Out of interest, what is …?
As far as I can tell, your work-around is to have just fewer but longer tasks. This agrees with the conclusion we reached before: the automatic batching of tasks is not working great in your setup. |
Yes, setting the batch duration is not enough when datasets are very large. What I don't understand is the following: if I set n_jobs to be 20, say, joblib should, at least at some point, split the seeds among the 20 processes and keep them all busy. You're right, the automatic batching doesn't work well, but "my setup" is simply to use sklearn.cluster.MeanShift... |
Thanks for clarifying this. If you had typical sizes for which you start encountering the problem (data shape, n_jobs, seeds shape), that would be great. Also, have you tried tweaking the batch duration settings further?
From what I remember the problem was arising on a big memory server but not on my laptop, which is why I was talking about setup. My guess was that IPC (Inter-Process Communication) had more overhead on the former than on the latter, but I never found time to investigate further since then.
I am not following you, so I'll try to clarify: n_jobs=20 means you have a pool of 20 subprocesses waiting for tasks to execute. The number of tasks is the length of the iterator you use in your Parallel call, in your case the number of seeds. Now auto-batching was done to try to alleviate this kind of problem. Auto-batching tries to group tasks into batches (a batch is basically a Python list of tasks) and does so dynamically by measuring the time taken by each batch. It has some heuristics based on MIN_IDEAL_BATCH_DURATION and MAX_IDEAL_BATCH_DURATION. |
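To make the batching concrete, here is a small sketch of how the auto-batching heuristics can be bypassed entirely by passing an explicit batch_size to Parallel; the tiny_task function and the sizes are made up for illustration, not taken from MeanShift itself:

```python
import numpy as np
from sklearn.externals.joblib import Parallel, delayed


def tiny_task(seed):
    # Stand-in for a very short per-seed task such as _mean_shift_single_seed.
    return float(seed.sum())


seeds = np.random.random((10000, 4))

# batch_size='auto' (the default) lets joblib size the batches from timing
# heuristics; an explicit integer dispatches that many tasks per batch,
# trading scheduling overhead against load balancing.
results = Parallel(n_jobs=4, batch_size=500)(
    delayed(tiny_task)(seed) for seed in seeds)
```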
I see, I'm starting to understand. Now, I can keep using the trick I mentioned, but should we do anything to sklearn? Or just wait for joblib to solve this? |
As I was saying above:
It would be great to know when the work-around seems to alleviate the problem and when it does not seem effective any more. Seeing the influence of the data size and of the batch duration settings would also help. These pieces of information would allow me to check whether I can reproduce the same patterns on the big memory server I have access to and investigate the problem in more detail.
I was hoping that setting MIN_IDEAL_BATCH_DURATION to a higher value would be enough. Ideally we would find a way to fix this problem in joblib. On the other hand, if fixing it in joblib in a generic manner turns out to be too hard, and if we know that individual tasks are very short (and with a reasonably uniform computation time) in a MeanShift context, we could group them by hand in the scikit-learn code. Funny story: I thought about this problem a few days ago in a different context, so maybe it is the universe telling me to take the plunge and dive deep into this issue ;-). |
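If the manual grouping route were taken, a rough sketch of what pre-batching the seeds could look like, using gen_even_slices; _mean_shift_single_seed is the existing private helper discussed above, while the wrapper names here are purely illustrative:

```python
import numpy as np
from sklearn.utils import gen_even_slices
from sklearn.externals.joblib import Parallel, delayed
# Private helper mentioned earlier in this thread (module path as of 0.17/0.18).
from sklearn.cluster.mean_shift_ import _mean_shift_single_seed


def _run_seed_batch(seed_batch, X, nbrs, max_iter):
    # Run a whole slice of seeds inside one task to amortise dispatch overhead.
    return [_mean_shift_single_seed(seed, X, nbrs, max_iter)
            for seed in seed_batch]


def fit_seeds_prebatched(seeds, X, nbrs, max_iter, n_jobs):
    # One contiguous slice of seeds per worker.
    slices = list(gen_even_slices(len(seeds), n_jobs))
    per_batch = Parallel(n_jobs=n_jobs)(
        delayed(_run_seed_batch)(seeds[sl], X, nbrs, max_iter)
        for sl in slices)
    # Flatten back to one result per seed.
    return [res for batch in per_batch for res in batch]
```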
So, we are using this code:

```python
import numpy as np
from sklearn.cluster import MeanShift
import tempfile
import os
from sklearn.externals.joblib import load, dump
from sklearn.externals.joblib import parallel
import shutil

parallel.MIN_IDEAL_BATCH_DURATION = 1.
parallel.MAX_IDEAL_BATCH_DURATION = parallel.MIN_IDEAL_BATCH_DURATION * 10


def crash_it(n):
    arr = np.random.random_sample(size=(n, 4))
    MS = MeanShift(bin_seeding=True, bandwidth=0.03,
                   cluster_all=True, min_bin_freq=1, n_jobs=-1)
    temp_folder = tempfile.mkdtemp()
    filename = os.path.join(temp_folder, 'joblib_test.mmap')
    print(filename)
    if os.path.exists(filename):
        os.unlink(filename)
    mmap_arr = np.memmap(filename, dtype=arr.dtype, shape=arr.shape, mode='w+')
    dump(arr, filename)
    mmap_arr = load(filename, mmap_mode='r')
    MS.fit_predict(mmap_arr)
    try:
        shutil.rmtree(temp_folder)
    except OSError:
        pass


crash_it(100000)
```

Only one core seems to work for n=1000000, while it's working reasonably well at n=100000. Note that using a mmap array does seem to help. This was run on 48 cores. |
Thanks a lot for the snippet, I'll take a closer look to see whether I can see the same patterns. |
Which version of scikit-learn are you using? Things have moved around a bit in joblib with the recent parallel backends feature. With scikit-learn 0.18 the work-around is a bit different and you need to do this instead:

```python
from sklearn.externals.joblib._parallel_backends import AutoBatchingMixin

min_ideal_batch_duration = 1.
AutoBatchingMixin.MIN_IDEAL_BATCH_DURATION = min_ideal_batch_duration
AutoBatchingMixin.MAX_IDEAL_BATCH_DURATION = 10 * min_ideal_batch_duration
```

@martinosorb can you try this and let me know if it fixes your issue? |
I was using 0.17; now on 0.18 with the change you suggest, I see no difference. |
OK, thanks for the feedback, I will keep looking then. |
This happens when there is a very large number of seeds. Then joblib.delayed creates an equally huge number of tuples to work through, and for some reason this becomes very slow, preventing full utilisation of all CPUs. |
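As a small illustration of why the number of seeds matters here: delayed does not run anything itself, it merely packages each call as a tuple, so one such tuple is built per seed (the task function below is a made-up stand-in):

```python
from sklearn.externals.joblib import delayed


def task(seed):
    return seed * 2


# Calling the delayed wrapper returns a (function, args, kwargs) tuple;
# with a huge number of seeds, an equally huge number of these tuples
# ends up being created and dispatched.
packaged = delayed(task)(21)
print(packaged)  # (<function task at ...>, (21,), {})
```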
Is joblib.Parallel's batch_size argument sufficient? |
No, that does not help, we tried all sorts of things. The problems start when we have several million data points to cluster, and it seems to come down to the number of seeds. Maybe this even comes down to how Python handles lists, not sure. |
In principle, joblib should be able to tackle this via its auto-batching mechanism and reach a solution as efficient as your manual batching. In practice it looks like the heuristics that joblib uses for its auto-batching are not great for big-memory machines for some reason, which I am afraid I was not really able to figure out. A few questions whose answers may help: |
I can still see the problem on a 48-core machine, with the same code I posted when I opened the issue (n_jobs = 20; seeds is, I believe, the same number as the points, so 100000). On smaller machines there is no problem. As said above, setting ideal batch durations makes it better. |
Thanks @martinosorb for your answer! What about you @mhhennig? |
Oh, mhhennig and I work together on the same machines. But let's see if he has anything to add. |
Right, I have prepared a test data set that should illustrate the problem - this is 4D data, by the way: https://datasync.ed.ac.uk/index.php/s/0m3ebEgihENqAps (passwd is …). A short script in that folder will first execute the modified version, which takes about 9 minutes on 12 cores (6 physical) to cluster. Then it executes the shipped version, which should take a whole night or so to complete on the same machine. |
Thanks, I will try to understand what goes wrong in joblib. If that fails like my previous attempt did, we can always adopt a pragmatic approach in scikit-learn and do some manual pre-batching if the number of seeds is too big. |
Ok, have a look at the modified version too (that's in mean_shift_.py in the same folder). I suspect in any case this implementation is more efficient, although I have not tested this (yet). If you want to make it worse, by the way, just reduce min_bin_freq to say 10, as this will yield even more seeds. |
Hi,
I'm using the parallel version of clustering.MeanShift (which I had written, interestingly). I've now noticed that most of the processes are actually "sleeping", and only a few actually work. Even more oddly, this doesn't always happen: using multiprocessing instead of joblib makes it work. I have no idea where to start...

Reproduce
When running the code
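(A minimal reproduction sketch along the lines of the snippet shared in the comments above; the exact parameters of the original report are assumed.)

```python
import numpy as np
from sklearn.cluster import MeanShift

ndim = 4
points = np.random.random([100000, ndim])

# n_jobs=20 on a many-core machine, as discussed in the comments above.
MS = MeanShift(n_jobs=20, bandwidth=0.1)
print("Starting.")
MS.fit(points)
```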
a call to htop shows that most of the worker processes are sleeping.

Versions
Linux-2.6.32-573.3.1.el6.x86_64-x86_64-with-redhat-6.6-Carbon
Python 3.4.2 (default, Feb 4 2015, 08:24:27)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)]
NumPy 1.11.1
SciPy 0.17.1
Scikit-Learn 0.17.1