| <!--#include virtual="header.txt"--> |
| |
| <H1>Gang Scheduling</H1> |
| |
| <P> |
| SLURM supports timesliced gang scheduling in which two or more jobs are |
| allocated to the same resources and these jobs are alternately suspended to |
| let one job at a time have dedicated access to the resources for a configured |
| period of time. |
SLURM also supports preemptive priority job scheduling in which a higher
priority job can preempt a lower priority one until the higher priority job
completes.
| See the <a href="preempt.html">Preemption</a> document for more information. |
| </P> |
| <P> |
| A resource manager that supports timeslicing can improve responsiveness |
| and utilization by allowing more jobs to begin running sooner. |
| Shorter-running jobs no longer have to wait in a queue behind longer-running |
| jobs. |
Instead, they can run "in parallel" with the longer-running jobs, allowing
them to start and finish more quickly.
| Throughput is also improved because overcommitting the resources provides |
| opportunities for "local backfilling" to occur (see example below). |
| </P> |
| <P> |
| In SLURM version 2.0 and earlier, the <I>sched/gang</I> plugin provides |
| timeslicing. |
In SLURM version 2.1, the gang scheduling logic was moved directly into the
main code base to permit the use of gang scheduling together with the backfill
scheduler plugin, <I>sched/backfill</I>.
| In either case, the gang scheduling logic monitors each of the partitions in |
| SLURM. |
If a new job has been allocated to resources in a partition that have already
been allocated to an existing job, then the gang scheduling logic will suspend
the new job until the configured <I>SchedulerTimeSlice</I> interval has
elapsed.
Then it will suspend the running job and let the new job make use of the
resources for a <I>SchedulerTimeSlice</I> interval.
| This will continue until one of the jobs terminates. |
| </P> |
| |
| <H2>Configuration</H2> |
| |
| <P> |
There are several important configuration parameters relating to
gang scheduling; a combined <I>slurm.conf</I> example follows the list:
| </P> |
| <UL> |
| <LI> |
| <B>SelectType</B>: The SLURM gang scheduler supports nodes |
| allocated by the <I>select/linear</I> plugin and socket/core/CPU resources |
| allocated by the <I>select/cons_res</I> plugin. |
| </LI> |
| <LI> |
<B>SelectTypeParameters</B>: Since resources will be overcommitted with
jobs, the resource selection plugin should be configured to track the
amount of memory used by each job to ensure that memory page swapping does
not occur. When <I>select/linear</I> is chosen, we recommend setting
<I>SelectTypeParameters=CR_Memory</I>. When <I>select/cons_res</I> is
chosen, we recommend including Memory as a resource (e.g.
<I>SelectTypeParameters=CR_Core_Memory</I>).
| </LI> |
| <LI> |
| <B>DefMemPerCPU</B>: Since job requests may not explicitly specify |
| a memory requirement, we also recommend configuring |
| <I>DefMemPerCPU</I> (default memory per allocated CPU) or |
| <I>DefMemPerNode</I> (default memory per allocated node). |
| It may also be desirable to configure |
| <I>MaxMemPerCPU</I> (maximum memory per allocated CPU) or |
| <I>MaxMemPerNode</I> (maximum memory per allocated node) in <I>slurm.conf</I>. |
| Users can use the <I>--mem</I> or <I>--mem-per-cpu</I> option |
| at job submission time to specify their memory requirements. |
| </LI> |
| <LI> |
| <B>JobAcctGatherType and JobAcctGatherFrequency</B>: |
| If you wish to enforce memory limits, accounting must be enabled |
| using the <I>JobAcctGatherType</I> and <I>JobAcctGatherFrequency</I> |
parameters. If accounting is enabled and a job exceeds its configured
memory limits, it will be canceled in order to prevent it from
adversely affecting other jobs sharing the same resources.
| </LI> |
| <LI> |
| <B>PreemptMode</B>: set the <I>GANG</I> option. |
| Additional options may be specified to enable job preemption in addition |
| to gang scheduling. |
| </LI> |
| <LI> |
| <B>SchedulerTimeSlice</B>: The default timeslice interval is 30 seconds. |
| To change this duration, set <I>SchedulerTimeSlice</I> to the desired interval |
| (in seconds) in <I>slurm.conf</I>. For example, to set the timeslice interval |
| to one minute, set <I>SchedulerTimeSlice=60</I>. Short values can increase |
| the overhead of gang scheduling. |
| </LI> |
| <LI> |
| <B>Shared</B>: Configure the partition's <I>Shared</I> setting to |
| <I>FORCE</I> for all partitions in which timeslicing is to take place. |
| The <I>FORCE</I> option supports an additional parameter that controls |
| how many jobs can share a resource (FORCE[:max_share]). By default the |
| max_share value is 4. To allow up to 6 jobs from this partition to be |
| allocated to a common resource, set <I>Shared=FORCE:6</I>. To only let 2 jobs |
| timeslice on the same resources, set <I>Shared=FORCE:2</I>. |
| NOTE: Gang scheduling is performed independently for each partition, so |
| configuring partitions with overlapping nodes and gang scheduling is generally |
| not recommended. |
| </LI> |
| </UL> |
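<P>
Putting these parameters together, a combined <I>slurm.conf</I> excerpt for a
gang scheduled cluster might look like the following (the node and partition
definitions are illustrative only and should be adjusted for your site):
</P>
<PRE>
# slurm.conf excerpt (illustrative values)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=1024
PreemptMode=GANG
SchedulerTimeSlice=30
NodeName=n[12-16] Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=8192
PartitionName=active Nodes=n[12-16] Default=YES Shared=FORCE:4
</PRE>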
| <P> |
| In order to enable gang scheduling after making the configuration changes |
| described above, restart SLURM if it is already running. Any change to the |
| plugin settings in SLURM requires a full restart of the daemons. If you |
| just change the partition <I>Shared</I> setting, this can be updated with |
| <I>scontrol reconfig</I>. |
| </P> |
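<P>
For example, on a system where the SLURM daemons are managed by an init
script (the script name and location vary by installation), the two cases
might look like this:
</P>
<PRE>
# Changes to plugin settings (e.g. SelectType, PreemptMode) require a full
# restart; the restart command depends on how SLURM was installed:
/etc/init.d/slurm restart

# A change limited to a partition's Shared= value can be applied with:
scontrol reconfig
</PRE>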
| <P> |
| For an advanced topic discussion on the potential use of swap space, |
| see "Making use of swap space" in the "Future Work" section below. |
| </P> |
| |
| <H2>Timeslicer Design and Operation</H2> |
| |
| <P> |
| When enabled, the gang scheduler keeps track of the resources |
| allocated to all jobs. For each partition an "active bitmap" is maintained that |
| tracks all concurrently running jobs in the SLURM cluster. Each time a new |
| job is allocated to resources in a partition, the gang scheduler |
| compares these newly allocated resources with the resources already maintained |
| in the "active bitmap". |
If these two sets of resources are disjoint then the new job is added to the
"active bitmap". If these two sets of resources overlap then the new job is
suspended. All jobs are tracked in a per-partition job queue within the gang
scheduler logic.
| </P> |
| <P> |
| A separate <I>timeslicer thread</I> is spawned by the gang scheduler |
| on startup. This thread sleeps for the configured <I>SchedulerTimeSlice</I> |
| interval. When it wakes up, it checks each partition for suspended jobs. If |
| suspended jobs are found then the <I>timeslicer thread</I> moves all running |
| jobs to the end of the job queue. It then reconstructs the "active bitmap" for |
| this partition beginning with the suspended job that has waited the longest to |
run (this will be the first suspended job in the job queue). Each following job
| is then compared with the new "active bitmap", and if the job can be run |
| concurrently with the other "active" jobs then the job is added. Once this is |
| complete then the <I>timeslicer thread</I> suspends any currently running jobs |
| that are no longer part of the "active bitmap", and resumes jobs that are new |
| to the "active bitmap". |
| </P> |
| <P> |
This <I>timeslicer thread</I> algorithm for rotating jobs is designed to
prevent jobs from starving (remaining in the suspended state indefinitely) and
| to be as fair as possible in the distribution of runtime while still keeping |
| all of the resources as busy as possible. |
| </P> |
| <P> |
| The gang scheduler suspends jobs via the same internal functions that |
| support <I>scontrol suspend</I> and <I>scontrol resume</I>. |
| A good way to observe the operation of the timeslicer is by running |
<I>squeue -i&lt;time&gt;</I> in a terminal window where <I>time</I> is set
| equal to <I>SchedulerTimeSlice</I>. |
| </P> |
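<P>
For example, with the default <I>SchedulerTimeSlice</I> of 30 seconds:
</P>
<PRE>
[user@n16 load]$ <B>squeue -i30</B>
</PRE>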
| |
| <H2>A Simple Example</H2> |
| |
| <P> |
| The following example is configured with <I>select/linear</I> and <I>Shared=FORCE</I>. |
| This example takes place on a small cluster of 5 nodes: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 5 idle n[12-16] |
| </PRE> |
| <P> |
| Here are the Scheduler settings (excerpt of output): |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>scontrol show config</B> |
| ... |
| PreemptMode = GANG |
| ... |
| SchedulerTimeSlice = 30 |
| SchedulerType = sched/builtin |
| ... |
| </PRE> |
| <P> |
| The <I>myload</I> script launches a simple load-generating app that runs |
| for the given number of seconds. Submit <I>myload</I> to run on all nodes: |
| </P> |
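<P>
The <I>myload</I> script itself is not included in this document. As a rough
sketch only (not the actual script used in these examples), such a load
generator could be as simple as:
</P>
<PRE>
#!/bin/bash
# Illustrative sketch of a "myload" batch script: busy-loop on every
# allocated task for the number of seconds given as the first argument.
srun bash -c "end=\$((SECONDS+$1)); while [ \$SECONDS -lt \$end ]; do :; done"
</PRE>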
| <PRE> |
| [user@n16 load]$ <B>sbatch -N5 ./myload 300</B> |
| sbatch: Submitted batch job 3 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
3 active myload user R 0:05 5 n[12-16]
| </PRE> |
| <P> |
| Submit it again and watch the gang scheduler suspend it: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sbatch -N5 ./myload 300</B> |
| sbatch: Submitted batch job 4 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 3 active myload user R 0:13 5 n[12-16] |
| 4 active myload user S 0:00 5 n[12-16] |
| </PRE> |
| <P> |
| After 30 seconds the gang scheduler swaps jobs, and now job 4 is the |
| active one: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 4 active myload user R 0:08 5 n[12-16] |
| 3 active myload user S 0:41 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 4 active myload user R 0:21 5 n[12-16] |
| 3 active myload user S 0:41 5 n[12-16] |
| </PRE> |
| <P> |
| After another 30 seconds the gang scheduler sets job 3 running again: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 3 active myload user R 0:50 5 n[12-16] |
| 4 active myload user S 0:30 5 n[12-16] |
| </PRE> |
| |
| <P> |
| <B>A possible side effect of timeslicing</B>: Note that jobs that are |
| immediately suspended may cause their <I>srun</I> commands to produce the |
| following output: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>cat slurm-4.out</B> |
| srun: Job step creation temporarily disabled, retrying |
| srun: Job step creation still disabled, retrying |
| srun: Job step creation still disabled, retrying |
| srun: Job step creation still disabled, retrying |
| srun: Job step created |
| </PRE> |
| <P> |
This occurs because <I>srun</I> is attempting to launch a job step in an
allocation that has been suspended. The <I>srun</I> process will continue to
retry until the allocation has been resumed and the job step can be launched.
| </P> |
| <P> |
| When the gang scheduler is enabled, this type of output in the user |
| jobs should be considered benign. |
| </P> |
| |
| <H2>More examples</H2> |
| |
| <P> |
| The following example shows how the timeslicer algorithm keeps the resources |
| busy. Job 10 runs continually, while jobs 9 and 11 are timesliced: |
| </P> |
| |
| <PRE> |
| [user@n16 load]$ <B>sbatch -N3 ./myload 300</B> |
| sbatch: Submitted batch job 9 |
| |
| [user@n16 load]$ <B>sbatch -N2 ./myload 300</B> |
| sbatch: Submitted batch job 10 |
| |
| [user@n16 load]$ <B>sbatch -N3 ./myload 300</B> |
| sbatch: Submitted batch job 11 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 9 active myload user R 0:11 3 n[12-14] |
| 10 active myload user R 0:08 2 n[15-16] |
| 11 active myload user S 0:00 3 n[12-14] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 10 active myload user R 0:50 2 n[15-16] |
| 11 active myload user R 0:12 3 n[12-14] |
| 9 active myload user S 0:41 3 n[12-14] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 10 active myload user R 1:04 2 n[15-16] |
| 11 active myload user R 0:26 3 n[12-14] |
| 9 active myload user S 0:41 3 n[12-14] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 9 active myload user R 0:46 3 n[12-14] |
| 10 active myload user R 1:13 2 n[15-16] |
| 11 active myload user S 0:30 3 n[12-14] |
| </PRE> |
| <P> |
| The next example displays "local backfilling": |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sbatch -N3 ./myload 300</B> |
| sbatch: Submitted batch job 12 |
| |
| [user@n16 load]$ <B>sbatch -N5 ./myload 300</B> |
| sbatch: Submitted batch job 13 |
| |
| [user@n16 load]$ <B>sbatch -N2 ./myload 300</B> |
| sbatch: Submitted batch job 14 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 12 active myload user R 0:14 3 n[12-14] |
| 14 active myload user R 0:06 2 n[15-16] |
| 13 active myload user S 0:00 5 n[12-16] |
| </PRE> |
| <P> |
Without timeslicing and without the backfill scheduler enabled, job 14 would
have to wait for job 13 to finish.
| </P> |
| <P> |
| This is called "local" backfilling because the backfilling only occurs with |
| jobs close enough in the queue to get allocated by the scheduler as part of |
| oversubscribing the resources. Recall that the number of jobs that can |
| overcommit a resource is controlled by the <I>Shared=FORCE:max_share</I> value, |
| so this value effectively controls the scope of "local backfilling". |
| </P> |
| <P> |
| Normal backfill algorithms check <U>all</U> jobs in the wait queue. |
| </P> |
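<P>
As noted above, SLURM version 2.1 and later can run the backfill scheduler
plugin together with gang scheduling. A minimal configuration sketch that
enables both (values are illustrative):
</P>
<PRE>
# slurm.conf excerpt (illustrative)
SchedulerType=sched/backfill
PreemptMode=GANG
SchedulerTimeSlice=30
</PRE>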
| |
| <H2>Consumable Resource Examples</H2> |
| |
| <P> |
| The following two examples illustrate the primary difference between |
| <I>CR_CPU</I> and <I>CR_Core</I> when consumable resource selection is enabled |
| (<I>select/cons_res</I>). |
| </P> |
| <P> |
| When <I>CR_CPU</I> (or <I>CR_CPU_Memory</I>) is configured then the selector |
| treats the CPUs as simple, <I>interchangeable</I> computing resources |
<U>unless</U> task affinity is enabled. However, when task affinity is enabled
with <I>CR_CPU</I>, or when <I>CR_Core</I> (or <I>CR_Core_Memory</I>) is
configured, the selector treats the CPUs as individual resources that are
<U>specifically</U> allocated to jobs.
| This subtle difference is highlighted when timeslicing is enabled. |
| </P> |
| <P> |
| In both examples 6 jobs are submitted. Each job requests 2 CPUs per node, and |
| all of the nodes contain two quad-core processors. The timeslicer will |
| initially let the first 4 jobs run and suspend the last 2 jobs. |
The manner in which these jobs are timesliced depends upon the configured
<I>SelectTypeParameters</I>.
| </P> |
| <P> |
| In the first example <I>CR_Core_Memory</I> is configured. Note that jobs 46 |
| and 47 don't <U>ever</U> get suspended. This is because they are not sharing |
| their cores with any other job. |
| Jobs 48 and 49 were allocated to the same cores as jobs 44 and 45. |
| The timeslicer recognizes this and timeslices only those jobs: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 5 idle n[12-16] |
| |
| [user@n16 load]$ <B>scontrol show config | grep Select</B> |
| SelectType = select/cons_res |
| SelectTypeParameters = CR_CORE_MEMORY |
| |
| [user@n16 load]$ <B>sinfo -o "%20N %5D %5c %5z"</B> |
| NODELIST NODES CPUS S:C:T |
| n[12-16] 5 8 2:4:1 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 44 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 45 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 46 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 47 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 48 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 49 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 44 active myload user R 0:09 5 n[12-16] |
| 45 active myload user R 0:08 5 n[12-16] |
| 46 active myload user R 0:08 5 n[12-16] |
| 47 active myload user R 0:07 5 n[12-16] |
| 48 active myload user S 0:00 5 n[12-16] |
| 49 active myload user S 0:00 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 46 active myload user R 0:49 5 n[12-16] |
| 47 active myload user R 0:48 5 n[12-16] |
| 48 active myload user R 0:06 5 n[12-16] |
| 49 active myload user R 0:06 5 n[12-16] |
| 44 active myload user S 0:44 5 n[12-16] |
| 45 active myload user S 0:43 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 44 active myload user R 1:23 5 n[12-16] |
| 45 active myload user R 1:22 5 n[12-16] |
| 46 active myload user R 2:22 5 n[12-16] |
| 47 active myload user R 2:21 5 n[12-16] |
| 48 active myload user S 1:00 5 n[12-16] |
| 49 active myload user S 1:00 5 n[12-16] |
| </PRE> |
| <P> |
| Note the runtime of all 6 jobs in the output of the last <I>squeue</I> command. |
| Jobs 46 and 47 have been running continuously, while jobs 44 and 45 are |
| splitting their runtime with jobs 48 and 49. |
| </P> |
| <P> |
| The next example has <I>CR_CPU_Memory</I> configured and the same 6 jobs are |
| submitted. Here the selector and the timeslicer treat the CPUs as countable |
| resources which results in all 6 jobs sharing time on the CPUs: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 5 idle n[12-16] |
| |
| [user@n16 load]$ <B>scontrol show config | grep Select</B> |
| SelectType = select/cons_res |
| SelectTypeParameters = CR_CPU_MEMORY |
| |
| [user@n16 load]$ <B>sinfo -o "%20N %5D %5c %5z"</B> |
| NODELIST NODES CPUS S:C:T |
| n[12-16] 5 8 2:4:1 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 51 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 52 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 53 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 54 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 55 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 56 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 51 active myload user R 0:11 5 n[12-16] |
| 52 active myload user R 0:11 5 n[12-16] |
| 53 active myload user R 0:10 5 n[12-16] |
| 54 active myload user R 0:09 5 n[12-16] |
| 55 active myload user S 0:00 5 n[12-16] |
| 56 active myload user S 0:00 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 51 active myload user R 1:09 5 n[12-16] |
| 52 active myload user R 1:09 5 n[12-16] |
| 55 active myload user R 0:23 5 n[12-16] |
| 56 active myload user R 0:23 5 n[12-16] |
| 53 active myload user S 0:45 5 n[12-16] |
| 54 active myload user S 0:44 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 53 active myload user R 0:55 5 n[12-16] |
| 54 active myload user R 0:54 5 n[12-16] |
| 55 active myload user R 0:40 5 n[12-16] |
| 56 active myload user R 0:40 5 n[12-16] |
| 51 active myload user S 1:16 5 n[12-16] |
| 52 active myload user S 1:16 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 51 active myload user R 3:18 5 n[12-16] |
| 52 active myload user R 3:18 5 n[12-16] |
| 53 active myload user R 3:17 5 n[12-16] |
| 54 active myload user R 3:16 5 n[12-16] |
| 55 active myload user S 3:00 5 n[12-16] |
| 56 active myload user S 3:00 5 n[12-16] |
| </PRE> |
| <P> |
| Note that the runtime of all 6 jobs is roughly equal. Jobs 51-54 ran first so |
| they're slightly ahead, but so far all jobs have run for at least 3 minutes. |
| </P> |
| <P> |
| At the core level this means that SLURM relies on the Linux kernel to move |
| jobs around on the cores to maximize performance. |
This differs from the <I>CR_Core_Memory</I> case, where the jobs
effectively remain "pinned" to their specific cores for the duration of
the job.
| Note that <I>CR_Core_Memory</I> supports CPU binding, while |
| <I>CR_CPU_Memory</I> does not. |
| </P> |
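<P>
If that kind of core pinning is desired, one possible combination
(illustrative; check the documentation for your SLURM version) is
<I>CR_Core_Memory</I> together with the task affinity plugin:
</P>
<PRE>
# slurm.conf excerpt (illustrative)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
TaskPlugin=task/affinity
</PRE>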
| |
| <H2>Future Work</H2> |
| |
| <P> |
<B>Making use of swap space</B>: (note that this topic is not currently
scheduled for development unless someone would like to pursue it) Timeslicing
provides an interesting mechanism for high performance jobs to make use of
swap space.
The optimal scenario is one in which suspended jobs are "swapped out" and
active jobs are "swapped in".
The swapping activity would only occur once every <I>SchedulerTimeSlice</I>
interval.
| </P> |
| <P> |
| However, SLURM should first be modified to include support for scheduling jobs |
| into swap space and to provide controls to prevent overcommitting swap space. |
| For now this idea could be experimented with by disabling memory support in |
| the selector and submitting appropriately sized jobs. |
| </P> |
| |
| <p style="text-align:center;">Last modified 24 June 2011</p> |
| |
| <!--#include virtual="footer.txt"--> |