Scheduling Algorithms
(or the Process of Choosing the Best Job to Run)
Scheduling Algorithms
Introduction
By default
Workload managers, their schedulers, and algorithms
Future directions
Introduction
Efficient scheduling across nodes is necessary to maximize application performance, regardless of the efficiency of your parallel algorithms.
Dynamic scheduling in a heterogeneous environment is significantly more complicated.
Programming parallel applications is difficult, and not worth the effort unless large performance gains can be realized.
Scheduling
Scheduling is a key part of workload management software, which usually performs some or all of the following:
Queuing
Scheduling
Monitoring
Resource management
Accounting
Scheduling (cont.)
The difficult part of scheduling is balancing policy enforcement with resource optimization in order to pick the best job to run. Essentially, one can think of the scheduler as performing the following loop:
1. Select the best job to run, according to policy and available resources.
2. Start the job.
3. Stop the job and/or clean up after completion.
4. Repeat.
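The loop above can be sketched as a minimal simulation in Python; the job names, priorities, and CPU counts below are hypothetical, and in this sketch a job that does not fit the available resources is simply skipped rather than re-queued:

```python
import heapq

def schedule_loop(pending_jobs, free_cpus):
    """Minimal sketch of the scheduler loop: pick the best runnable job
    by (priority, submit order), start it, then clean up and repeat."""
    # Higher priority first; earlier submission breaks ties.
    queue = [(-prio, order, name, cpus)
             for order, (name, prio, cpus) in enumerate(pending_jobs)]
    heapq.heapify(queue)
    started = []
    while queue:
        neg_prio, order, name, cpus = heapq.heappop(queue)
        if cpus <= free_cpus:    # policy check: job must fit available resources
            free_cpus -= cpus    # start the job
            started.append(name)
            free_cpus += cpus    # job completes; clean up, release resources
    return started

jobs = [("a", 1, 4), ("b", 5, 2), ("c", 5, 16)]
print(schedule_loop(jobs, free_cpus=8))  # ['b', 'a']  (c never fits 8 CPUs)
```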
Scheduling (cont.)
Scheduling can be described as either:
Application level: APST, AMWAT, MARS, (AppLeS) Prophet
Resource level: UNIX, Condor, Maui, PBS
By Default: Unix Scheduling Algorithms
UNIX is fundamentally a time-share (TS) operating system [1978].
The TS scheduler provides the illusion to each user that they are the only one accessing system resources.
It employs a round-robin scheduling algorithm.
Round-robin
Each process is placed in a run-queue and allocated a service quantum of time (commonly set to 10 milliseconds).
Processes that demand less time run without being interrupted.
Processes exceeding the service quantum are interrupted and returned to the back of the run-queue to await further processing.
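This quantum mechanism can be sketched as follows (the process names and demand times below are made up for illustration):

```python
from collections import deque

def round_robin(demands, quantum=10):
    """Round-robin sketch: each process gets a quantum (10 ms); processes
    needing more are interrupted and sent to the back of the run-queue."""
    queue = deque(demands.items())        # (name, remaining demand in ms)
    finish_order = []
    while queue:
        name, remaining = queue.popleft()
        if remaining <= quantum:          # short demand: runs to completion
            finish_order.append(name)
        else:                             # exceeds quantum: back of the queue
            queue.append((name, remaining - quantum))
    return finish_order

# A short-demand process finishes first, even when submitted after a long one.
print(round_robin({"long": 30, "short": 5}))  # ['short', 'long']
```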
Round-robin Cont
Round-robin favours short process demands.
Consequently, it is biased in favour of a greedy user who runs many short-demand processes: the UNIX scheduler schedules processes, not users.
What was needed was a fair-share (FS) scheduler.
Fair-share
Implemented on UNIX systems circa 1988.
In the context of computer resources, "fair" is meant to imply equity, not equality, in resource consumption.
Fair-share Cont
The FS scheduler has two levels of scheduling: process and user.
Process-level scheduling is the same as in standard UNIX (priority and nice values act as a bias to the scheduler as it repositions processes in the run-queue).
The user-level scheduling relationship can be seen in the following (simplified) pseudo-code.
Fair-share Algorithm: User Level Scheduling
/* Every 4 seconds (delta), decay and accumulate the CPU usage per user */
Parameters: delta = 4 sec; decayUsage;
for (i = 0; i < USERS; i++) {
    usage[i] *= decayUsage;
    usage[i] += cost[i];
    cost[i] = 0;
}
Fair-share Algorithm: Process Level Scheduling
Parameters: priorityDecay = a * p_nice[k] + b, a real number in the range [0..1]

/* Every 100 milliseconds, decay ALL process priority values */
for (k = 0; k < PROCS; k++) {
    priorityDecay = a * p_nice[k] + b;
    sharepriority[k] *= priorityDecay;
}

/* Every 10 milliseconds (or tick), alter the active process priorities per user */
for (i = 0; i < USERS; i++) {
    sharepriority[i] += usage[i] * p_active[i];
}
Fair-share Cont
Whereas process-level scheduling still occurs 100 times a second, user-level scheduling adjustments (the usage parameter) occur once every 4 seconds.
Also, once a second, process-level priority adjustments that were made in the previous second begin to be forgotten; this avoids starving a process.
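The interaction of the two levels can be sketched as a single user-level pass, collapsing each user's processes into one priority value; the decay constant and usage figures are illustrative, not actual UNIX values:

```python
def fairshare_tick(usage, cost, sharepriority, active, decay_usage=0.5):
    """One 4-second user-level pass of the fair-share sketch: decay each
    user's accumulated usage, fold in new CPU cost, then bias the priority
    of that user's active processes by the result."""
    for user in usage:
        usage[user] = usage[user] * decay_usage + cost[user]
        cost[user] = 0
        # heavier recent usage -> larger (worse) share priority value
        sharepriority[user] += usage[user] * active[user]
    return sharepriority

usage = {"alice": 100.0, "bob": 0.0}   # alice has a history of heavy use
cost = {"alice": 40.0, "bob": 40.0}    # both burned 40 ms this interval
prio = {"alice": 0.0, "bob": 0.0}
active = {"alice": 1, "bob": 1}
print(fairshare_tick(usage, cost, prio, active))
# alice's history pushes her priority value higher, so she is scheduled less
```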
Commercial Products
O/S            Manager    Capping   Parameter
AIX            WLM        Yes       Targets/limits
Compaq Tru64   -          No        Shares
HP-UX          PRM        Yes       Entitlements
IRIX           SHARE II   No        Shares
Solaris        SRM        No        Shares
Cluster Managers, Their Schedulers and Algorithms
Condor
Full-featured batch system.
Used to manage clusters of Beowulf nodes.

Portable Batch System (PBS)
Based on the Network Queuing System (NQS), developed by NASA in 1986, first on Cray and then ported to other architectures.
IEEE POSIX standard (1003.2d).
Cluster Managers, Their Schedulers and Algorithms
Maui
Developed to manage machines at the Maui HPCC.
External scheduler: uses native schedulers such as PBS and LoadLeveler.
Included in cluster-building toolkits such as Rocks and OSCAR.

Genetic algorithms (GAs)
Condor
DAGMan (Directed Acyclic Graph Manager)
A meta-scheduler for Condor jobs.
Submits jobs in an order represented by a DAG and processes the results.
Condor Cont
An example input file:

# Nodes B and C are run in parallel
Job A A.condor
Job B B.condor
Job C C.condor
Job D D.condor
Parent A Child B C
Parent B C Child D

[Diagram: the DAG, with A at the top, B and C as parallel children of A, and D depending on both B and C]
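The ordering DAGMan enforces for this example can be sketched as a generic level-by-level traversal (an illustration of the DAG semantics, not DAGMan's implementation):

```python
def run_levels(deps):
    """deps maps each node to the set of parents that must finish first.
    Returns the levels in which nodes become runnable; nodes within a
    level (e.g. B and C) can run in parallel."""
    done, levels = set(), []
    while len(done) < len(deps):
        ready = sorted(n for n, parents in deps.items()
                       if n not in done and parents <= done)
        levels.append(ready)
        done.update(ready)
    return levels

# The example DAG: A -> {B, C} -> D
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(run_levels(deps))  # [['A'], ['B', 'C'], ['D']]
```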
Condor Cont
Job queues are not strictly FIFO:
Condor implements priority scheduling.
Priority preemption can also be used.
Priority calculations are based on ratios instead of absolutes.
Starvation is prevented by an FS algorithm which attempts to give all users the same amount of machine allocation time over a specified (configurable) interval.
PBS
Includes several built-in schedulers; the default is FIFO, but not strict (it is based on queue priority).
The behaviour is to maximize CPU utilization.
A starving-jobs mechanism helps prevent long-waiting jobs from being passed over indefinitely.
PBS Scheduling Policies
PBS has seven scheduling algorithms (* represents the default):
Decaying FS algorithm
Round-robin for jobs and queues
Queue priority *
Job type or job ordering
Strict FIFO
Load balancing jobs between hosts
Dedicated times/nodes
Maui
Scheduling behaviour is constrained by way of throttling policies:
Both soft and hard limits are used.
Policies are applied at each iteration.
Three main algorithms are used:
Backfill
Priority
FS
Maui Backfill Algorithm
Backfill is a scheduling optimization based on earliest-job-start information.
Two passes are made:
Pass 1: jobs that meet soft policies.
Pass 2: expands the list from pass 1 to include hard fairness policies.
Maui applies the backfill algorithm specified by the BACKFILLPOLICY parameter, be it firstfit, bestfit, or balfit.
Backfill Algorithm
Assuming bestfit, the following steps apply:
1. Feasible backfill jobs are filtered, selecting those that actually fit the current backfill window.
2. The degree of fit is based on the scheduling criteria (i.e. processors, seconds, etc.).
3. The job with the best fit is started and the backfill window size adjusted.
4. While backfill jobs and idle resources remain, repeat from step 1.
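The four steps can be sketched as follows, assuming for illustration that degree-of-fit is measured by requested processors only:

```python
def bestfit_backfill(idle_procs, jobs):
    """Best-fit backfill sketch: repeatedly pick, among the jobs that fit
    the current backfill window, the one with the largest processor
    request (best degree-of-fit), start it, and shrink the window."""
    started = []
    remaining = list(jobs)                        # (name, procs) pairs
    while True:
        feasible = [j for j in remaining          # step 1: filter jobs
                    if j[1] <= idle_procs]        #   that fit the window
        if not feasible:
            break
        best = max(feasible, key=lambda j: j[1])  # step 2: best degree-of-fit
        started.append(best[0])                   # step 3: start the job and
        idle_procs -= best[1]                     #   adjust the window size
        remaining.remove(best)                    # step 4: repeat
    return started, idle_procs

jobs = [("j1", 6), ("j2", 4), ("j3", 3), ("j4", 2)]
print(bestfit_backfill(8, jobs))  # (['j1', 'j4'], 0)
```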
Maui Priority Algorithm
The default is a trivial FIFO, but priorities are weighted and combined based on service, requested resources, fair-share, direct priority, target, and bypass:

Priority = serviceweight * servicefactor
         + resourceweight * resourcefactor
         + fairshareweight * fairsharefactor
         + directspecweight * directspecfactor
         + targetweight * targetfactor
         + bypassweight * bypassfactor

Each *weight value is a configurable parameter and each *factor is calculated from the subcomponents listed above (i.e. user, group, priority, QoS, etc.).
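The weighted sum can be sketched directly; the weight and factor values below are a hypothetical site configuration, not Maui defaults:

```python
def maui_priority(weights, factors):
    """Weighted-sum priority sketch: each configurable *weight multiplies
    its calculated *factor; the components mirror the formula above."""
    components = ("service", "resource", "fairshare",
                  "directspec", "target", "bypass")
    return sum(weights.get(c, 0) * factors.get(c, 0.0) for c in components)

weights = {"service": 1, "fairshare": 10}        # hypothetical site config
factors = {"service": 200.0, "fairshare": -3.5}  # hypothetical job factors
print(maui_priority(weights, factors))  # 165.0
```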
Maui Fair-share Algorithm
Composed of several parts which handle historical information, fairshare windows, usage, and impact.
All are site-configurable parameters.
The purpose of a fairshare algorithm is to steer existing workload.
Genetic Algorithms (GA)
A typical GA structure
Genetic_Algorithm() {
    Initialize population;    /* an initial population of randomly generated strings */
    Evaluate population;      /* check fitness of each string against the fitness function */
    while (termination criterion not reached) {
        /* genetic operators */
        Select solutions for the next population;
        Perform crossover and mutation;
        Evaluate population;
    }
}
GAs Cont
GAs mimic the evolutionary process. They were introduced by J. H. Holland in 1975. In nature, organisms constantly compete for scant resources such as food and space.
GAs Cont
A GA can be analyzed by breaking it up into the following sequence:
Encoding mechanism: converts variables into a form that can be processed by the GA.
Fitness function: each generated string is assigned a fitness value (or weight), which determines the survival rate of the string.
GAs Cont
Operators:
Reproduction: the process by which individual strings are copied according to their fitness values; strings with higher fitness values are more likely to be selected.
Crossover: all strings in the population are randomly paired up and may or may not be subjected to a crossover, depending on a randomly generated outcome.
Mutation: an occasional alteration to the strings in the population, used to safeguard against premature convergence.
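These operators can be sketched as a small, runnable GA on the classic one-max problem (fitness is the number of 1 bits in the string); the population size, mutation rate, and tournament selection are illustrative choices, not prescribed here:

```python
import random

def genetic_algorithm(bits=16, pop_size=20, generations=60, seed=1):
    """Minimal GA sketch on the one-max problem, following the structure
    above: reproduction (selection), crossover, and mutation."""
    rng = random.Random(seed)
    fitness = sum  # fitness of a bit string = its count of 1s
    pop = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Reproduction: fitter strings are more likely to be selected
        # (3-way tournament selection).
        parents = [max(rng.sample(pop, 3), key=fitness)
                   for _ in range(pop_size)]
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = rng.randrange(1, bits)          # single-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                # Mutation: occasional bit flips guard against
                # premature convergence.
                nxt.append([bit ^ (rng.random() < 0.02) for bit in child])
        pop = nxt
        if any(fitness(s) == bits for s in pop):  # termination criterion
            break
    return max(pop, key=fitness)

best = genetic_algorithm()
print(sum(best), "of 16 bits set in the best string found")
```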
Future Directions
High Performance Grid Schedulers
Scheduling holds the key to performance in grid environments; however, high-performance application scheduling on computational grids represents a world in which much progress needs to be made. Part of the problem lies in the inherent difficulty of the scheduling problem, whose optimal solution is considered infeasible even in the simplest environments.
Current High Performance Grid Scheduler Projects
AppLeS (Application Level Scheduler)
Selects the best of the candidate schedules based on the user's performance criteria.

MARS (Meta-computer Adaptive Runtime Scheduler)
Determines the candidate schedule that minimizes execution time.

Prophet
Determines the schedule with the minimal predicted execution time.

VDCE (Virtual Distributed Computing Environment)
List scheduling is used to match resources with application tasks.

SEA (Scheduling Expert Advisor)
Ready tasks enabled in the program graph are the next to be scheduled.
Projects continued
I-SOFT
A centralized scheduler maintains user queues and static capacities (FIFO).

IOS
Offline genetic algorithm (GA) mappings, indexed by dynamic parameters, are used to determine the mapping for the current iteration.

SPP(X)
Determines a performance model for candidate schedules with minimal execution time.

DOME
Globally controlled or locally controlled load balancing.
Conclusion
Scheduling in a heterogeneous environment is hard.
Keywords to remember:
Adaptive
Dynamic
Links
http://www.openpbs.org/
http://www.cs.wisc.edu/condor
http://www.supercluster.org/maui