Some processes not working under clustering.MeanShift · Issue #6943 · scikit-learn/scikit-learn · GitHub

Some processes not working under clustering.MeanShift #6943

Closed
martinosorb opened this issue Jun 27, 2016 · 30 comments

@martinosorb

Hi,
I'm using the parallel version of clustering.MeanShift (which I had written, interestingly). I've now noticed that most of the processes are "sleeping", and only a few actually work. Even more oddly, this doesn't always happen:

  • the problem is worse on some machines than on others
  • the problem doesn't seem to appear when working with 2 dimensions instead of 4 (see code below).
  • changing the code to use multiprocessing instead of joblib makes it work

I have no idea where to start...

Reproduce

When running the code

from sklearn.cluster import MeanShift
import numpy as np

ndim = 4
points = np.random.random([100000, ndim])

MS = MeanShift(n_jobs=20, bandwidth=0.1)
print("Starting.")
MS.fit(points)

a call to htop shows:

[screenshot from 2016-06-27 14-11-33: htop output showing most worker processes sleeping]

Versions

Linux-2.6.32-573.3.1.el6.x86_64-x86_64-with-redhat-6.6-Carbon
Python 3.4.2 (default, Feb 4 2015, 08:24:27)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)]
NumPy 1.11.1
SciPy 0.17.1
Scikit-Learn 0.17.1

@martinosorb
Author

Possibly related: #6023

@lesteve
Member
lesteve commented Jun 28, 2016

It works fine on my desktop but I can reproduce the behaviour on a big memory server with 48 cores. Will need to investigate more.

@martinosorb
Author

Yes, here too, the problem shows up on large machines.

@lesteve
Member
lesteve commented Jun 28, 2016

So it seems like the automatic batching of tasks is not well suited to some machines. I am not exactly sure why yet.

A work-around that works for me is to set joblib.parallel.MIN_IDEAL_BATCH_DURATION to a higher value. If you can test whether this snippet works for you, that'd be great:

import numpy as np

from sklearn.cluster import MeanShift
from sklearn.externals.joblib import parallel

parallel.MIN_IDEAL_BATCH_DURATION = 1.
parallel.MAX_IDEAL_BATCH_DURATION = parallel.MIN_IDEAL_BATCH_DURATION * 10

ndim = 4
points = np.random.random([100000, ndim])

MS = MeanShift(n_jobs=20, bandwidth=0.1)
print("Starting.")
MS.fit(points)

@martinosorb
Author

This does indeed make things better. Should we add your workaround to the MeanShift code, or are you going to fix this in joblib?
Anyway, thanks a lot.

@lesteve
Member
lesteve commented Jun 28, 2016

This does indeed make things better.

Glad to hear that!

Should we add your workaround to the MeanShift code, or are you going to fix this in joblib?

I think this kind of work-around is best left in client code rather than in scikit-learn. I'll try to understand the problem in more detail, and if there is a fix it will happen in joblib.

@jnothman
Member

Great sleuthing @lesteve! (I'm wondering how you narrowed it down to that, or whether the timing of the problem gave it away.)

@lesteve
Member
lesteve commented Jun 28, 2016

I'm wondering how you narrowed it down to that, or whether the timing of the problem gave it away

To be perfectly honest, I tried different things before it started to make sense. At one point I tried different batch_size values in the Parallel object and realized that the auto-batching wasn't performing very well.

@lesteve
Member
lesteve commented Jun 28, 2016

I have opened an issue in joblib: joblib/joblib#372. I'll close this one.

@lesteve lesteve closed this as completed Jun 28, 2016
@martinosorb
Author

@lesteve I'm still experiencing this problem when working with very large datasets. My supervisor worked on it and found that the only possible workaround was to define

def _mean_shift_multi_seeds(my_means, X, nbrs, max_iter):
    return [_mean_shift_single_seed(my_mean, X, nbrs, max_iter)
            for my_mean in my_means]

and then

    # nseeds is the number of seeds handled per job (presumably len(seeds) // n_jobs)
    all_res = Parallel(n_jobs=n_jobs, max_nbytes=1e6, verbose=2)(
        delayed(_mean_shift_multi_seeds, has_shareable_memory)
        (seeds[i*nseeds:(i+1)*nseeds], X, nbrs, max_iter)
        for i in range(n_jobs))

In other words, manually splitting the seeds into a number of arrays equal to the number of jobs we want to spawn. However, this shows that joblib can't handle the batching properly on its own. How can we solve this? At the moment, we have an ill-functioning method in sklearn.

@lesteve
Member
lesteve commented Dec 14, 2016

@lesteve I'm still experiencing this problem when working with very large datasets.

I am not sure what you mean by this: are you saying that the work-around of setting parallel.MIN_IDEAL_BATCH_DURATION and parallel.MAX_IDEAL_BATCH_DURATION as in #6943 (comment) is sub-optimal in some cases? Are you saying the work-around is not helping at all? Are you still seeing the same behaviour, where it only happens on some machines and not others?

In other words, manually splitting the seeds in a number of arrays corresponding to the number of jobs we want to spawn. However, this means that joblib can't handle it properly. How can we solve this? At the moment, we have an ill-functioning method in sklearn.

Out of interest, what is n_jobs in your case and what is the shape of seeds?

As far as I can tell, your work-around is to have just fewer but longer tasks. This agrees with the conclusion we reached before: the automatic batching of tasks is not working great in your setup.

@martinosorb
Author

Yes, setting the batch duration is not enough when datasets are very large. What I don't understand is the following: if I set n_jobs to be 20, say, joblib should, at least at some point, split the seeds into 20 parts and give them to the 8 processes. So why does it seem to work only if I manually split the seeds into 8?
I'll ask my supervisor about the details of his experiment.

You're right, the automatic batching doesn't work well, but "my setup" is simply to use sklearn.cluster.MeanShift...

@lesteve
Member
lesteve commented Dec 14, 2016

Yes, setting the batch duration is not enough when datasets are very large.

Thanks for clarifying this. If you had typical sizes for which you start encountering the problem (data shape, n_jobs, seeds shape), that would be great.

Also, have you tried tweaking parallel.MIN_IDEAL_BATCH_DURATION and parallel.MAX_IDEAL_BATCH_DURATION to see whether you could get to an acceptable speed?

You're right, the automatic batching doesn't work well, but "my setup" is simply to use sklearn.cluster.MeanShift...

From what I remember the problem was arising on a big memory server but not on my laptop, this is why I was talking about setup. My guess was that IPC (Inter-Process Communication) had more overhead on the former than on the latter but I never found time to investigate further since.

What I don't understand is the following: if I set n_jobs to be 20, say, joblib should, at least at some point, split seeds in 20 parts and give them to the 8 processes. So why it seems to work only if I manually split seeds in 8?

I am not following you, so I'll try to clarify: n_jobs=20 means you have a pool of 20 subprocesses waiting for tasks to execute. The number of tasks is the length of the iterator you use in your Parallel call, in your case len(seeds). If your tasks take a very short time to run in the subprocess, you are dominated by the IPC overhead, and running in parallel may actually take more time than running sequentially.

Now, auto-batching was introduced to try to alleviate this kind of problem. Auto-batching tries to group tasks into batches (a batch is basically a Python list of tasks) and does so dynamically by measuring the time taken by each batch. It has some heuristics based on parallel.MIN_IDEAL_BATCH_DURATION and parallel.MAX_IDEAL_BATCH_DURATION to decide the batch size dynamically. It seems like these heuristics are not well suited to the context of a big memory server.
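To make the task-versus-batch distinction concrete, here is a minimal sketch (not from this thread) contrasting one tiny task per dispatch with an explicit batch size; the tiny_task function and the sleep duration are made-up stand-ins for a short per-seed computation:

import time
from sklearn.externals.joblib import Parallel, delayed  # plain joblib works too

def tiny_task(x):
    # Stand-in for a very short per-seed computation.
    time.sleep(0.001)
    return x

n_tasks = 10000

# One task per dispatch: the inter-process communication overhead can dominate.
Parallel(n_jobs=4, batch_size=1)(delayed(tiny_task)(i) for i in range(n_tasks))

# Grouping many tasks per batch amortises that overhead. batch_size='auto'
# (the default) tries to pick this number dynamically from measured batch
# durations, steered by the MIN/MAX_IDEAL_BATCH_DURATION heuristics.
Parallel(n_jobs=4, batch_size=500)(delayed(tiny_task)(i) for i in range(n_tasks))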

@martinosorb
Author

I see, I'm starting to understand. Now, I can keep using the trick I mentioned, but should we do anything to sklearn? Or just wait for joblib to solve this?

@lesteve
Member
lesteve commented Dec 14, 2016

As I was saying above:

If you had typical sizes for which you start encountering the problem (data shape, n_jobs, seeds shape), that would be great.

It would be great to know when the work-around seems to alleviate the problem and when it does not seem effective any more. Seeing the influence of parallel.MIN_IDEAL_BATCH_DURATION and parallel.MAX_IDEAL_BATCH_DURATION would be great too.

These pieces of information would allow me to check whether I can reproduce the same patterns on the big memory server I have access to and investigate the problem in more detail.

Now, I can keep using the trick I mentioned, but should we do anything to sklearn? Or just wait for joblib to solve this?

I was hoping that setting parallel.MIN_IDEAL_BATCH_DURATION and parallel.MAX_IDEAL_BATCH_DURATION would be enough as a stop-gap solution but apparently it is not (not sure why yet).

Ideally we would find a way to fix this problem in joblib. On the other hand, if fixing it in joblib in a generic manner turns out to be too hard, and if we know that individual tasks are very short (and have a reasonably uniform computation time) in a MeanShift context, we could group them by hand in the scikit-learn code.

Funny story: I thought about this problem a few days ago in a different context so maybe it is the universe telling me to take the plunge and dive deep into this issue ;-).

@martinosorb
Author

So, we are using this code

import numpy as np
from sklearn.cluster import MeanShift

import tempfile
import os
from sklearn.externals.joblib import load, dump
from sklearn.externals.joblib import parallel
import shutil

parallel.MIN_IDEAL_BATCH_DURATION = 1.
parallel.MAX_IDEAL_BATCH_DURATION = parallel.MIN_IDEAL_BATCH_DURATION * 10


def crash_it(n):
    arr = np.random.random_sample(size=(n, 4))
    MS = MeanShift(bin_seeding=True, bandwidth=0.03,
                   cluster_all=True, min_bin_freq=1, n_jobs=-1)

    # Dump the data to a memory-mapped file so the workers get a shared,
    # read-only view of the array rather than a pickled copy.
    temp_folder = tempfile.mkdtemp()
    filename = os.path.join(temp_folder, 'joblib_test.mmap')
    print(filename)
    if os.path.exists(filename):
        os.unlink(filename)
    mmap_arr = np.memmap(filename, dtype=arr.dtype, shape=arr.shape, mode='w+')
    dump(arr, filename)
    mmap_arr = load(filename, mmap_mode='r')
    MS.fit_predict(mmap_arr)
    try:
        shutil.rmtree(temp_folder)
    except OSError:
        pass

crash_it(100000)

Only one core seems to work for n=1000000, while it's working reasonably well at n=100000. Note that using a mmap array does seem to help. This was run on 48 cores.
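For reference, joblib can also memmap large input arrays on its own via Parallel's max_nbytes argument, rather than dumping and reloading by hand. A minimal sketch of that mechanism (not part of the snippet above; the col_mean helper is made up purely for illustration):

import numpy as np
from sklearn.externals.joblib import Parallel, delayed

def col_mean(arr, j):
    # Trivial per-column task, just enough to exercise the memmapping machinery.
    return arr[:, j].mean()

big = np.random.random_sample((100000, 4))

# Arrays larger than max_nbytes are dumped to a temporary memmap and the
# workers receive a read-only memmapped view instead of a pickled copy.
means = Parallel(n_jobs=4, max_nbytes='1M')(
    delayed(col_mean)(big, j) for j in range(big.shape[1]))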

@lesteve
Member
lesteve commented Dec 14, 2016

Thanks a lot for the snippet, I'll take a closer look to see whether I can see the same patterns.

@lesteve
Member
lesteve commented Dec 19, 2016

Which version of scikit-learn are you using?

Things have moved around a bit in joblib with the recent parallel backends feature. With scikit-learn 0.18 the work-around is a bit modified and you need to do this instead:

from sklearn.externals.joblib._parallel_backends import AutoBatchingMixin

min_ideal_batch_duration = 1.
AutoBatchingMixin.MIN_IDEAL_BATCH_DURATION = min_ideal_batch_duration
AutoBatchingMixin.MAX_IDEAL_BATCH_DURATION = 10 * min_ideal_batch_duration

@martinosorb can you try this and let me know if that fixes your issue?

@martinosorb
Author

I was using 0.17; now on 0.18 with the change you suggest I see no difference.
By the way, I repeated the test I had done and I actually see it working at n=1M, even if only after ten seconds of single-core work. At n=10M, it stays single-core for a long time.

@lesteve
Member
lesteve commented Dec 19, 2016

OK, thanks for the feedback, will keep looking then.

@mhhennig

This happens when there is a very large number of seeds. joblib.delayed then creates an equally huge number of tuples to work through, and for some reason this becomes very slow, preventing full utilisation of all CPUs.
I have a fix: instead of putting every seed individually through delayed(), the seeds are split into equally sized batches, and the batches are then worked through in parallel using a little auxiliary function. It's a small change, it has been extensively tested, and it works.
Do you want a pull request?
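A rough sketch of what that kind of pre-batching could look like, patterned on the earlier snippet in this thread and assuming seeds, X, nbrs, max_iter and n_jobs are already defined as in the mean_shift code; the _mean_shift_seed_batch helper and the splitting strategy are illustrative, not the actual patch:

import numpy as np
from sklearn.externals.joblib import Parallel, delayed
from sklearn.cluster.mean_shift_ import _mean_shift_single_seed

def _mean_shift_seed_batch(seed_batch, X, nbrs, max_iter):
    # Process a whole batch of seeds inside one task so that each dispatch
    # does enough work to amortise the inter-process overhead.
    return [_mean_shift_single_seed(seed, X, nbrs, max_iter)
            for seed in seed_batch]

# Split the seeds into roughly one batch per worker (n_jobs assumed to be a
# positive worker count here), run the batches in parallel, then flatten the
# per-batch results back into one result per seed.
seed_batches = np.array_split(seeds, n_jobs)
batched_results = Parallel(n_jobs=n_jobs)(
    delayed(_mean_shift_seed_batch)(batch, X, nbrs, max_iter)
    for batch in seed_batches)
all_res = [res for batch in batched_results for res in batch]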

@jnothman
Member
jnothman commented Mar 10, 2018 via email

@mhhennig

No, that does not help; we tried all sorts of things. The problems start when we have several million data points to cluster, and it seems to come down to the number of seeds. Maybe this even goes down to how Python handles lists, not sure.

@lesteve
Member
lesteve commented Mar 11, 2018

In principle, joblib should be able to tackle this via its auto-batching mechanism and reach a solution as efficient as your manual batching. In practice it looks like the heuristics that joblib uses for its auto-batching are not great for big memory machines for some reason, which I am afraid I was not really able to figure out.

A few questions whose answers may help:

  • what kind of machine are you seeing the problem on? Were they big memory servers as well (48 cores, 384GB RAM in the setup I was testing on, IIRC)?
  • what is the value of n_jobs?
  • what is the typical number of seeds you are seeing a problem with?
  • could you provide a snippet to reproduce the problem, or can you reproduce the problem on some of the snippets that were already given in this issue?

@martinosorb
Author

I can still see the problem on a 48-core machine, with the same code I posted when I opened the issue (n_jobs = 20; seeds is, I believe, the same number as the points, so 100000). On smaller machines there is no problem.

As said above, setting ideal batch durations makes it better.

@lesteve
Member
lesteve commented Mar 12, 2018

Thanks @martinosorb for your answer! What about you @mhhennig?

@martinosorb
Author

Oh, mhhennig and I work together on the same machines. But let's see if he has anything to add.

@mhhennig

Right, I have prepared a test data set that should illustrate the problem - this is 4D data, by the way:

https://datasync.ed.ac.uk/index.php/s/0m3ebEgihENqAps

(passwd is meanshift)

A short script in that folder will first execute the modified version, which takes about 9 minutes on 12 cores (6 physical) to cluster. Then it executes the shipped version, which should take a whole night or so to complete on the same machine.

@lesteve
Member
lesteve commented Mar 13, 2018

Thanks, I will try to understand what goes wrong in joblib. If that fails like my previous attempt, we can always adopt a pragmatic approach in scikit-learn and do some manual pre-batching when the number of seeds is too big.

@mhhennig

OK, have a look at the modified version too (that's in mean_shift_.py in the same folder). I suspect this implementation is more efficient in any case, although I have not tested this (yet). If you want to make it worse, by the way, just reduce min_bin_freq to, say, 10, as this will yield even more seeds.
