Some processes not working under clustering.MeanShift · Issue #6943 · scikit-learn/scikit-learn · GitHub

Some processes not working under clustering.MeanShift #6943

Closed
martinosorb opened this issue Jun 27, 2016 · 30 comments

@martinosorb

Hi,
I'm using the parallel version of clustering.MeanShift (which I had written, interestingly). I've now noticed that most of the processes are "sleeping", and only a few actually work. Even more oddly, this doesn't always happen:

  • the problem is worse on some machines than on others
  • the problem doesn't seem to appear when working with 2 dimensions instead of 4 (see code below).
  • changing the code to use multiprocessing instead of joblib makes it work

I have no idea where to start...

Reproduce

When running the code

from sklearn.cluster import MeanShift
import numpy as np

ndim = 4
points = np.random.random([100000, ndim])

MS = MeanShift(n_jobs=20, bandwidth=0.1)
print("Starting.")
MS.fit(points)

a call to htop shows:

[screenshot from 2016-06-27 14-11-33: htop output showing most worker processes sleeping]

Versions

Linux-2.6.32-573.3.1.el6.x86_64-x86_64-with-redhat-6.6-Carbon
Python 3.4.2 (default, Feb 4 2015, 08:24:27)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)]
NumPy 1.11.1
SciPy 0.17.1
Scikit-Learn 0.17.1

@martinosorb
Author

Possibly related: #6023

@lesteve
Member
lesteve commented Jun 28, 2016

It works fine on my desktop but I can reproduce the behaviour on a big memory server with 48 cores. Will need to investigate more.

@martinosorb
Author

Yes, here too, the problem shows up on large machines.

@lesteve
Member
lesteve commented Jun 28, 2016

So it seems like the automatic batching of tasks is not well suited to some machines. I am not exactly sure why yet.

A work-around that works for me is to set joblib.parallel.MIN_IDEAL_BATCH_DURATION to a higher value. If you can test whether this snippet works for you, that'd be great:

import numpy as np

from sklearn.cluster import MeanShift
from sklearn.externals.joblib import parallel

parallel.MIN_IDEAL_BATCH_DURATION = 1.
parallel.MAX_IDEAL_BATCH_DURATION = parallel.MIN_IDEAL_BATCH_DURATION * 10

ndim = 4
points = np.random.random([100000, ndim])

MS = MeanShift(n_jobs=20, bandwidth=0.1)
print("Starting.")
MS.fit(points)

@martinosorb
Author

This does indeed make things better. Should we add your workaround to the MeanShift code, or are you going to fix this in joblib?
Anyway, thanks a lot.

@lesteve
Member
lesteve commented Jun 28, 2016

This does indeed make things better.

Glad to hear that!

Should we add your workaround to the MeanShift code, or are you going to fix this in joblib?

I think this kind of work-around is best left in client code rather than in scikit-learn. I'll try to understand the problem in more detail, and if there is a fix it will happen in joblib.

@jnothman
Member

Great sleuthing @lesteve! (I'm wondering how you narrowed it down to that, or whether the timing of the problem gave it away.)

@lesteve
Member
lesteve commented Jun 28, 2016

I'm wondering how you narrowed it down to that, or whether the timing of the problem gave it away

To be perfectly honest, I tried different things before it started to make sense. At one point I tried different batch_size values in the Parallel object and realized that the auto-batching wasn't performing very well.

@lesteve
Member
lesteve commented Jun 28, 2016

I have opened an issue in joblib: joblib/joblib#372. I'll close this one.

@lesteve lesteve closed this as completed Jun 28, 2016
@martinosorb
Author

@lesteve I'm still experiencing this problem when working with very large datasets. My supervisor worked on it and found that the only possible workaround was to define

def _mean_shift_multi_seeds(my_means, X, nbrs, max_iter):
    return [_mean_shift_single_seed(my_mean, X, nbrs, max_iter)
            for my_mean in my_means]

and then

    # nseeds is the number of seeds handled per job (presumably len(seeds) // n_jobs)
    all_res = Parallel(n_jobs=n_jobs, max_nbytes=1e6, verbose=2)(
        delayed(_mean_shift_multi_seeds, has_shareable_memory)
        (seeds[i*nseeds:(i+1)*nseeds], X, nbrs, max_iter)
        for i in range(n_jobs))

In other words, manually splitting the seeds into a number of arrays equal to the number of jobs we want to spawn. However, this shows that joblib can't handle the batching properly on its own. How can we solve this? At the moment, we have an ill-functioning method in sklearn.

@lesteve
Member
lesteve commented Dec 14, 2016

@lesteve I'm still experiencing this problem when working with very large datasets.

I am not sure what you mean by this: are you saying that the work-around of setting parallel.MIN_IDEAL_BATCH_DURATION and parallel.MAX_IDEAL_BATCH_DURATION as in #6943 (comment) is sub-optimal in some cases? Are you saying the work-around is not helping at all? Are you still seeing the same behaviour, where it only happens on some machines and not others?

In other words, manually splitting the seeds in a number of arrays corresponding to the number of jobs we want to spawn. However, this means that joblib can't handle it properly. How can we solve this? At the moment, we have an ill-functioning method in sklearn.

Out of interest, what is n_jobs in your case and what is the shape of seeds?

As far as I can tell, your work-around is to have just fewer but longer tasks. This agrees with the conclusion we reached before: the automatic batching of tasks is not working great in your setup.

@martinosorb
Author

Yes, setting the batch duration is not enough when datasets are very large. What I don't understand is the following: if I set n_jobs to be 20, say, joblib should, at least at some point, split the seeds into 20 parts and give them to the 8 processes. So why does it seem to work only if I manually split the seeds into 8?
I'll ask my supervisor about the details of his experiment.

You're right, the automatic batching doesn't work well, but "my setup" is simply to use sklearn.cluster.MeanShift...

@lesteve
Member
lesteve commented Dec 14, 2016

Yes, setting the batch duration is not enough when datasets are very large.

Thanks for clarifying this. If you had typical sizes for which you start encountering the problem (data shape, n_jobs, seeds shape), that would be great.

Also, have you tried tweaking parallel.MIN_IDEAL_BATCH_DURATION and parallel.MAX_IDEAL_BATCH_DURATION to see whether you could get to an acceptable speed?

You're right, the automatic batching doesn't work well, but "my setup" is simply to use sklearn.cluster.MeanShift...

From what I remember the problem was arising on a big memory server but not on my laptop, this is why I was talking about setup. My guess was that IPC (Inter-Process Communication) had more overhead on the former than on the latter but I never found time to investigate further since.

What I don't understand is the following: if I set n_jobs to be 20, say, joblib should, at least at some point, split seeds in 20 parts and give them to the 8 processes. So why it seems to work only if I manually split seeds in 8?

I am not following you, so I'll try to clarify: n_jobs=20 means you have a pool of 20 subprocesses waiting for tasks to execute. The number of tasks is the length of the iterator you use in your Parallel call, in your case len(seeds). If your tasks take a very short time to run in the subprocess, you are dominated by the IPC overhead, and running in parallel may actually take more time than running sequentially.

Now, auto-batching was introduced to try to alleviate this kind of problem. Auto-batching tries to group tasks into batches (a batch is basically a Python list of tasks) and does so dynamically by measuring the time taken by each batch. It has some heuristics based on parallel.MIN_IDEAL_BATCH_DURATION and parallel.MAX_IDEAL_BATCH_DURATION to decide the batch size dynamically. It seems like these heuristics are not well suited to the context of a big memory server.
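To make the task-versus-batch distinction concrete, here is a minimal sketch (not from this thread) contrasting one tiny task per dispatch with an explicit batch size; the tiny_task function and the sleep duration are made-up stand-ins for a short per-seed computation:

import time
from sklearn.externals.joblib import Parallel, delayed  # plain joblib works too

def tiny_task(x):
    # Stand-in for a very short per-seed computation.
    time.sleep(0.001)
    return x

n_tasks = 10000

# One task per dispatch: the inter-process communication overhead can dominate.
Parallel(n_jobs=4, batch_size=1)(delayed(tiny_task)(i) for i in range(n_tasks))

# Grouping many tasks per batch amortises that overhead. batch_size='auto'
# (the default) tries to pick this number dynamically from measured batch
# durations, steered by the MIN/MAX_IDEAL_BATCH_DURATION heuristics.
Parallel(n_jobs=4, batch_size=500)(delayed(tiny_task)(i) for i in range(n_tasks))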

@martinosorb
Author

I see, I'm starting to understand. Now, I can keep using the trick I mentioned, but should we do anything to sklearn? Or just wait for joblib to solve this?

@lesteve
Member
lesteve commented Dec 14, 2016

As I was saying above:

If you had typical sizes for which you start encountering the problem (data shape, n_jobs, seeds shape), that would be great.

It would be great to know when the work-around seems to alleviate the problem and when it does not seem effective any more. Seeing the influence of parallel.MIN_IDEAL_BATCH_DURATION and parallel.MAX_IDEAL_BATCH_DURATION would be great too.

These pieces of information would allow me to check whether I can reproduce the same patterns on the big memory server I have access to and investigate the problem in more detail.

Now, I can keep using the trick I mentioned, but should we do anything to sklearn? Or just wait for joblib to solve this?

I was hoping that setting parallel.MIN_IDEAL_BATCH_DURATION and parallel.MAX_IDEAL_BATCH_DURATION would be enough as a stop-gap solution but apparently it is not (not sure why yet).

Ideally we would find a way to fix this problem in joblib. On the other hand, if fixing it in joblib in a generic manner turns out to be too hard, and if we know that individual tasks are very short (and have a reasonably uniform computation time) in a MeanShift context, we could group them by hand in the scikit-learn code.

Funny story: I thought about this problem a few days ago in a different context so maybe it is the universe telling me to take the plunge and dive deep into this issue ;-).

@martinosorb
Author

So, we are using this code

import numpy as np
from sklearn.cluster import MeanShift

import tempfile
import os
from sklearn.externals.joblib import load, dump
from sklearn.externals.joblib import parallel
import shutil

parallel.MIN_IDEAL_BATCH_DURATION = 1.
parallel.MAX_IDEAL_BATCH_DURATION = parallel.MIN_IDEAL_BATCH_DURATION * 10


def crash_it(n):
    arr = np.random.random_sample(size=(n, 4))
    MS = MeanShift(bin_seeding=True, bandwidth=0.03,
                   cluster_all=True, min_bin_freq=1, n_jobs=-1)

    # Dump the data to a memory-mapped file so the workers get a shared,
    # read-only view of the array rather than a pickled copy.
    temp_folder = tempfile.mkdtemp()
    filename = os.path.join(temp_folder, 'joblib_test.mmap')
    print(filename)
    if os.path.exists(filename):
        os.unlink(filename)
    mmap_arr = np.memmap(filename, dtype=arr.dtype, shape=arr.shape, mode='w+')
    dump(arr, filename)
    mmap_arr = load(filename, mmap_mode='r')
    MS.fit_predict(mmap_arr)
    try:
        shutil.rmtree(temp_folder)
    except OSError:
        pass

crash_it(100000)

Only one core seems to work for n=1000000, while it's working reasonably well at n=100000. Note that using a mmap array does seem to help. This was run on 48 cores.
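For reference, joblib can also memmap large input arrays on its own via Parallel's max_nbytes argument, rather than dumping and reloading by hand. A minimal sketch of that mechanism (not part of the snippet above; the col_mean helper is made up purely for illustration):

import numpy as np
from sklearn.externals.joblib import Parallel, delayed

def col_mean(arr, j):
    # Trivial per-column task, just enough to exercise the memmapping machinery.
    return arr[:, j].mean()

big = np.random.random_sample((100000, 4))

# Arrays larger than max_nbytes are dumped to a temporary memmap and the
# workers receive a read-only memmapped view instead of a pickled copy.
means = Parallel(n_jobs=4, max_nbytes='1M')(
    delayed(col_mean)(big, j) for j in range(big.shape[1]))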

@lesteve
Member
lesteve commented Dec 14, 2016

Thanks a lot for the snippet, I'll take a closer look to see whether I can see the same patterns.

@lesteve
Member
lesteve commented Dec 19, 2016

Which version of scikit-learn are you using?

Things have moved around a bit in joblib with the recent parallel backends feature. With scikit-learn 0.18 the work-around is a bit modified and you need to do this instead:

from sklearn.externals.joblib._parallel_backends import AutoBatchingMixin

min_ideal_batch_duration = 1.
AutoBatchingMixin.MIN_IDEAL_BATCH_DURATION = min_ideal_batch_duration
AutoBatchingMixin.MAX_IDEAL_BATCH_DURATION = 10 * min_ideal_batch_duration

@martinosorb can you try this and let me know if that fixes your issue?

@martinosorb
Author

I was using 0.17; now on 0.18 with the change you suggest I see no difference.
By the way, I repeated the test I had done and I actually see it working at n=1M, even if only after ten seconds of single-core work. At n=10M, it stays single-core for a long time.

@lesteve
Member
lesteve commented Dec 19, 2016

OK, thanks for the feedback, will keep looking then.

@mhhennig

This happens when there is a very large number of seeds. joblib.delayed then creates an equally huge number of tuples to work through, and for some reason this becomes very slow, preventing full utilisation of all CPUs.
I have a fix: instead of putting every seed individually through delayed(), the seeds are split into equally sized batches, and the batches are then worked through in parallel using a little auxiliary function. It's a small change, it has been extensively tested, and it works.
Do you want a pull request?
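A rough sketch of what that kind of pre-batching could look like, patterned on the earlier snippet in this thread and assuming seeds, X, nbrs, max_iter and n_jobs are already defined as in the mean_shift code; the _mean_shift_seed_batch helper and the splitting strategy are illustrative, not the actual patch:

import numpy as np
from sklearn.externals.joblib import Parallel, delayed
from sklearn.cluster.mean_shift_ import _mean_shift_single_seed

def _mean_shift_seed_batch(seed_batch, X, nbrs, max_iter):
    # Process a whole batch of seeds inside one task so that each dispatch
    # does enough work to amortise the inter-process overhead.
    return [_mean_shift_single_seed(seed, X, nbrs, max_iter)
            for seed in seed_batch]

# Split the seeds into roughly one batch per worker (n_jobs assumed to be a
# positive worker count here), run the batches in parallel, then flatten the
# per-batch results back into one result per seed.
seed_batches = np.array_split(seeds, n_jobs)
batched_results = Parallel(n_jobs=n_jobs)(
    delayed(_mean_shift_seed_batch)(batch, X, nbrs, max_iter)
    for batch in seed_batches)
all_res = [res for batch in batched_results for res in batch]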

@jnothman
Member
jnothman commented Mar 10, 2018 via email

@mhhennig

No, that does not help; we tried all sorts of things. The problems start when we have several million data points to cluster, and it seems to come down to the number of seeds. Maybe this even goes down to how Python handles lists, not sure.

@lesteve
Member
lesteve commented Mar 11, 2018

In principle, joblib should be able to tackle this via its auto-batching mechanism and reach a solution as efficient as your manual batching. In practice it looks like the heuristics that joblib uses for its auto-batching are not great for big memory machines for some reason, which I am afraid I was not really able to figure out.

A few questions whose answers may help:

  • what kind of machine are you seeing the problem on? Were they big memory servers as well (48 cores, 384GB RAM in the setup I was testing on, IIRC)?
  • what is the value of n_jobs?
  • what is the typical number of seeds you are seeing a problem with?
  • could you provide a snippet to reproduce the problem, or can you reproduce the problem on some of the snippets that were already given in this issue?

@martinosorb
Author

I can still see the problem on a 48-core machine, with the same code I posted when I opened the issue (n_jobs = 20; seeds is, I believe, the same number as the points, so 100000). On smaller machines there is no problem.

As said above, setting ideal batch durations makes it better.

@lesteve
Member
lesteve commented Mar 12, 2018

Thanks @martinosorb for your answer! What about you @mhhennig?

@martinosorb
Author

Oh, mhhennig and I work together on the same machines. But let's see if he has anything to add.

@mhhennig

Right, I have prepared a test data set that should illustrate the problem - this is 4D data, by the way:

https://datasync.ed.ac.uk/index.php/s/0m3ebEgihENqAps

(passwd is meanshift)

A short script in that folder will first execute the modified version, which takes about 9 minutes on 12 cores (6 physical) to cluster. Then it executes the shipped version, which should take a whole night or so to complete on the same machine.

@lesteve
Member
lesteve commented Mar 13, 2018

Thanks, I will try to understand what goes wrong in joblib. If that fails like my previous attempt, we can always adopt a pragmatic approach in scikit-learn and do some manual pre-batching when the number of seeds is too big.

@mhhennig

OK, have a look at the modified version too (that's in mean_shift_.py in the same folder). I suspect this implementation is more efficient in any case, although I have not tested this (yet). If you want to make it worse, by the way, just reduce min_bin_freq to, say, 10, as this will yield even more seeds.
