Parallelization Notes
Different algorithms supported by OPS have different concerns when it comes to parallelization. Here we discuss some of the challenges around parallelizing each algorithm.
Note that, of course, all methods can benefit from parallelization within a single trajectory's dynamics. The question here is how to run multiple trajectories simultaneously.
## TPS: Independent jobs
Whether doing fixed path length or flexible path length, you can run several independent walkers for TPS.
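As a toy sketch of what "independent jobs" means here: each walker is a separate process with its own RNG seed, and `run_walker` is a hypothetical stand-in for a real OPS `PathSampling` run writing to per-walker storage (the acceptance loop below is made up for illustration).

```python
from concurrent.futures import ProcessPoolExecutor
import random

def run_walker(args):
    # hypothetical stand-in for: sampler = PathSampling(...); sampler.run(n_steps)
    walker_id, seed, n_steps = args
    rng = random.Random(seed)  # each walker gets an independent RNG stream
    n_accepted = sum(1 for _ in range(n_steps) if rng.random() < 0.4)
    return walker_id, n_accepted

def run_independent_walkers(n_walkers, n_steps):
    # no communication between walkers: this is embarrassingly parallel
    jobs = [(i, 1000 + i, n_steps) for i in range(n_walkers)]
    with ProcessPoolExecutor(max_workers=n_walkers) as pool:
        return dict(pool.map(run_walker, jobs))
```

Since the walkers never communicate, this scales trivially: on a cluster, each walker would simply be its own submitted job rather than a local process.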
## TIS: Independent jobs
In regular TIS (without replica exchange), the interfaces are completely independent, and so each interface can be run as a separate job. Since some interfaces will have shorter average path lengths, those jobs will end sooner.
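To make the load-balancing point concrete, here is a toy per-interface walltime estimate (all numbers are illustrative, not from any real system): job cost scales with the average path length in that interface's ensemble, so inner interfaces finish first.

```python
def estimated_walltime(n_mc_steps, avg_path_length, cost_per_frame):
    # total frames generated ~ (MC trials) x (average frames per trial path)
    return n_mc_steps * avg_path_length * cost_per_frame

# illustrative average path lengths (in frames) for three interfaces
avg_lengths = {"lambda_0": 25, "lambda_1": 60, "lambda_2": 140}
# per-interface job cost at a made-up rate of 0.5 node-seconds per frame
walltimes = {iface: estimated_walltime(1000, length, 0.5)
             for iface, length in avg_lengths.items()}
```

The spread between the innermost and outermost interface is the reason the jobs end at very different times when submitted independently.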
## RETIS: ???
This is hard to load-balance. Different ensembles have different distributions of path lengths: although they have well-defined average path lengths, the actual lengths can vary widely. The best approach I (DWHS) have thought of is to run each trajectory in a separate job, and to use the filesystem to communicate that we're still waiting on other jobs. The code to do this (OneWrapper) was an ugly hack of Python and bash, and is only about 3/4 tested. OneWrapper was designed under the assumption that you have a high-demand, high-availability cluster: if you free a node, others will use it, and you won't have to wait too long for a node to become free again.
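A minimal sketch of the filesystem-communication pattern, assuming a filesystem shared across nodes (this is a hypothetical reimplementation of the idea, not the actual OneWrapper code): each per-trajectory job drops a flag file when it finishes, and any job that needs all of them polls for the flags.

```python
import os
import tempfile
import time

def mark_done(workdir, replica_id):
    # each job touches its own flag file when its trajectory is finished
    open(os.path.join(workdir, "replica_%d.done" % replica_id), "w").close()

def all_done(workdir, replica_ids):
    return all(os.path.exists(os.path.join(workdir, "replica_%d.done" % r))
               for r in replica_ids)

def wait_for_all(workdir, replica_ids, poll=1.0, timeout=3600.0):
    # a job that needs every replica (e.g. before a swap) polls the flags
    start = time.time()
    while not all_done(workdir, replica_ids):
        if time.time() - start > timeout:
            raise TimeoutError("still waiting on other replicas")
        time.sleep(poll)
```

The appeal of this pattern is that it needs no message-passing infrastructure at all; the cost is polling latency and cleanup of stale flag files between runs.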
## SRTIS

- Fixed bias: Independent jobs
- Adaptive bias: ???
In fixed-bias SRTIS, we have independent walkers. In adaptive bias, the bias needs to be adjusted regularly. This requires careful thought about how parallelization affects detailed balance, which Peter and Weina may have already done, but I haven't.
## AMS

- Initial trajectories: Independent jobs
- AMS itself: None possible?
AMS starts by generating a large number (maybe 10k?) of trajectories in what TIS would call the "innermost interface." This can be done by running an arbitrary number of trajectories from the initial state in parallel (although I'd caution against that number getting too large).
After the initial trajectories are calculated, each AMS step replaces one trajectory. I do not think this can be parallelized, beyond the parallelization of the trajectory dynamics.
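A toy sketch of that structure (the "progress" scalar is invented, standing in for the maximum reaction-coordinate value along a path; this is not real AMS dynamics): the initial ensemble is embarrassingly parallel, while each subsequent step touches only the single least-advanced trajectory.

```python
import random

def initial_ensemble(n_traj, seed=0):
    # parallelizable: every initial trajectory is generated independently;
    # each float is a made-up "progress" value (max reaction coordinate)
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_traj)]

def ams_step(ensemble, rng):
    # inherently sequential: kill the least-advanced trajectory and
    # rebranch from a randomly chosen survivor
    worst = min(range(len(ensemble)), key=ensemble.__getitem__)
    survivors = [i for i in range(len(ensemble)) if i != worst]
    parent = rng.choice(survivors)
    ensemble[worst] = ensemble[parent] + rng.random() * 0.1
    return ensemble
```

The point of the sketch: `initial_ensemble` could be split across as many workers as you like, but each `ams_step` depends on the ensemble left by the previous one, so only the trajectory dynamics inside a step can be parallelized.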
## Look-ahead job queues

This is an idea I had for optimizing the decision of which trajectory/ensemble to run next, or more generally, for creating a queue of potential samples to be produced next.
Note that this concept is orthogonal to the question of how to distribute these jobs to workers on a cluster. It is only meant to figure out what we have to do, and potentially in which order, not who will do the work.
The idea is based on the path sampler, which uses a more or less complex tree of shooters. The definition of a shooting move allows us to look ahead at what we will have to do in the future, and hence to create a list of jobs that need to be executed. Once one of those jobs (a path generation) finishes, we can include the new information in our decision of what to do next.
The decision of which samples to run next can draw on lots of information, but foremost we have to figure out which simulations we can currently run because we have all the prerequisites — namely, the preceding trajectories from which we can pick frames to start the simulations.
To keep this close to the shooter concept, there are a few types of information that are required before a pathmover can work. These are usually:

- results of submovers, which in our context are the returned `SampleSet` that results from applying the changes from a submove, plus whether the move was accepted (I think -- there could be more complex decisions based on the returned `PathMoveChange`)
- results of path generations, which give `Trajectory` elements
- results of random number generators
- the current input `SampleSet` -- more precisely, some of the samples in the input `SampleSet`
We will assume that some of these ingredients can be evaluated independently. In particular, the random numbers are instantly available. This means we might be able to make predictions about the resulting change, especially if we assume that a `SampleSet` is only influenced/altered by samples returned from a `PathMover`.
An example: for a `RandomChoiceMover`, we can decide which mover to run even if the move will be called far in the future. This immediately tells us which movers will be run, and we can make predictions based on that fact.
The overall scheme is to maintain a list of ingredients for the movers, which we try to complete; given new ingredients, we update what can be done. This looks like a partially evaluated `PathMoveTree`, and it requires that a `PathMover` can make predictions about its result given the ingredients currently present. More ingredients will always reduce the number of possible outcomes.
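A minimal sketch of the ingredient bookkeeping, with invented names rather than actual OPS classes: random-number ingredients are drawn up front, so a choice between movers can be resolved before any dynamics run, exposing which jobs are actually runnable right now.

```python
import random

class PendingMove:
    # invented class, not an OPS API: a queued move plus the ingredients
    # (prerequisites) it still needs before it can run
    def __init__(self, name, needs):
        self.name = name
        self.needs = set(needs)

def runnable(pending, available):
    """Moves whose every ingredient is already available."""
    return [move for move in pending if move.needs <= available]

# random numbers are free ingredients: draw them now to resolve which
# branch a RandomChoiceMover will take, long before the move is executed
rng = random.Random(42)
choice = "forward_shoot" if rng.random() < 0.5 else "backward_shoot"

pending = [
    PendingMove(choice, needs={"input_sampleset"}),           # choice resolved
    PendingMove("minus_move", needs={"input_sampleset", "traj_7"}),
]
available = {"input_sampleset"}  # traj_7 is still being generated
jobs = runnable(pending, available)
```

Here only the pre-resolved choice mover is runnable; the minus move stays queued until the `traj_7` ingredient arrives, at which point `runnable` is simply re-evaluated.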
This is the basic idea; it can then be extended, e.g.:

- The expensive part will be the actual MD simulations, and we might have an estimate of the path-length distribution (either from looking at what was already produced, or from adding some prior knowledge, e.g. "minus is expensive").
- If we have to choose among several possible next samples, we can use the potential time each takes and predict how many new ingredients will be required next. This requires a `PathMover` to speculate about the consequences of additional ingredients, e.g. "pick the simulation that is least likely to leave us with nothing to do but wait for jobs to finish."
- If there is nothing to do (which is a waste of computational resources, unless we have a smart usage of the cluster like David suggests), we could predict the most likely outcome and collect ingredients in advance. These speculative ingredients might be discarded later, e.g. "A move might be rejected, which could result in no changes. We could run ingredients based on the assumption that the move fails; the decision of which to run could use more information, such as how often that move failed before."
My current opinion is that this is not too complicated to implement and get working. It requires some knowledge about the likelihoods of moves, which needs some experienced TIS people (check), and some additional functionality in `PathMover` and `PathMoveChange`. It also needs a special way of running `PathMoveTree`s, a concept of ingredients, etc. All possible...
The only question is whether this will give such a big gain if our `PathMover`s are designed such that not much in the future can be done anyway. I think this was a valid point that David made: if everything depends on a single `MinusMove`, we just have to wait, and all workers will be idle, which is not good. Still, this could be pretty cool.