<!--#include virtual="header.txt"-->
<H1>Preemption</H1>
<P>
SLURM supports job preemption, the act of stopping one or more "low-priority"
jobs to let a "high-priority" job run uninterrupted until it completes.
Job preemption is implemented as a variation of SLURM's
<a href="gang_scheduling.html">Gang Scheduling</a> logic.
When a high-priority job has been allocated resources that have already been
allocated to one or more low priority jobs, the low priority job(s) are
preempted.
The low priority job(s) can resume once the high priority job completes.
Alternatively, in newer versions of SLURM, the low priority job(s) can be
requeued and started using other resources if so configured.
</P>
<P>
In SLURM version 2.0 and earlier, high priority work is identified by the
priority of the job's partition and low priority jobs are always suspended.
The job preemption logic is within the <I>sched/gang</I> plugin.
In SLURM version 2.1 and higher, the job's partition priority or its
Quality Of Service (QOS) can be used to identify which jobs can preempt
or be preempted by other jobs.
</P>
<P>
SLURM version 2.1 offers several job preemption mechanisms, including
checkpoint, requeue, and cancel.
Checkpointed jobs are not automatically requeued or restarted.
Requeued jobs may restart faster by using different resources.
All of these new job preemption mechanisms release a job's memory space for
use by other jobs.
In SLURM version 2.1, some job preemption logic was moved into the
<I>select</I> plugin and main code base to permit use of job preemption
together with the backfill scheduler plugin, <I>sched/backfill</I>.
</P>
<H2>Configuration</H2>
<P>
There are several important configuration parameters relating to preemption
(a combined <I>slurm.conf</I> example follows this list):
</P>
<UL>
<LI>
<B>SelectType</B>: SLURM job preemption logic supports nodes allocated by the
<I>select/linear</I> plugin and socket/core/CPU resources allocated by the
<I>select/cons_res</I> plugin.
</LI>
<LI>
<B>SelectTypeParameters</B>: Since resources may be over-allocated
with jobs (suspended jobs remain in memory), the resource selection
plugin should be configured to track the amount of memory used by each job to
ensure that memory page swapping does not occur. When <I>select/linear</I> is
chosen, we recommend setting <I>SelectTypeParameters=CR_Memory</I>. When
<I>select/cons_res</I> is chosen, we recommend including memory as a resource
(e.g. <I>SelectTypeParameters=CR_Core_Memory</I>).
<BR><B>NOTE:</B> Unless <I>PreemptMode=SUSPEND,GANG</I> these memory management
parameters are not critical.
</LI>
<LI>
<B>DefMemPerCPU</B>: Since job requests may not explicitly specify
a memory requirement, we also recommend configuring
<I>DefMemPerCPU</I> (default memory per allocated CPU) or
<I>DefMemPerNode</I> (default memory per allocated node).
It may also be desirable to configure
<I>MaxMemPerCPU</I> (maximum memory per allocated CPU) or
<I>MaxMemPerNode</I> (maximum memory per allocated node) in <I>slurm.conf</I>.
Users can use the <I>--mem</I> or <I>--mem-per-cpu</I> option
at job submission time to specify their memory requirements.
<BR><B>NOTE:</B> Unless <I>PreemptMode=SUSPEND,GANG</I> these memory management
parameters are not critical.
</LI>
<LI>
<B>JobAcctGatherType and JobAcctGatherFrequency</B>: The "maximum data segment
size" and "maximum virtual memory size" system limits will be configured for
each job to ensure that the job does not exceed its requested amount of memory.
If you wish to enable additional enforcement of memory limits, configure job
accounting with the <I>JobAcctGatherType</I> and <I>JobAcctGatherFrequency</I>
parameters. When accounting is enabled and a job exceeds its configured memory
limits, it will be canceled in order to prevent it from adversely affecting
other jobs sharing the same resources.
<BR><B>NOTE:</B> Unless <I>PreemptMode=SUSPEND,GANG</I> these memory management
parameters are not critical.
</LI>
<LI>
<B>PreemptMode</B>: Configure to <I>CANCEL</I>, <I>CHECKPOINT</I>,
<I>SUSPEND</I> or <I>REQUEUE</I> depending on the desired action for low
priority jobs.
<UL>
<LI>A value of <I>CANCEL</I> will always cancel the job.</LI>
<LI>A value of <I>CHECKPOINT</I> will checkpoint (if possible) or kill low
priority jobs. Checkpointed jobs are not automatically restarted.</LI>
<LI>A value of <I>REQUEUE</I> will requeue (if possible) or kill low priority
jobs. Requeued jobs are permitted to be restarted on different resources.</LI>
<LI>A value of <I>SUSPEND</I> will suspend and automatically resume the low
priority jobs. The <I>SUSPEND</I> option must be used with the <I>GANG</I>
option (e.g. "PreemptMode=SUSPEND,GANG").</LI>
</UL>
</LI>
<LI>
<B>PreemptType</B>: Configure to the desired mechanism used to identify
which jobs can preempt other jobs.
<UL>
<LI><I>preempt/none</I> indicates that jobs will not preempt each other
(default).</LI>
<LI><I>preempt/partition_prio</I> indicates that jobs from one partition
can preempt jobs from lower priority partitions.</LI>
<LI><I>preempt/qos</I> indicates that jobs from one Quality Of Service (QOS)
can preempt jobs from a lower QOS. These jobs can be in the same partition
or different partitions. PreemptMode must be set to CANCEL, CHECKPOINT,
SUSPEND or REQUEUE. This option requires the use of a database identifying
available QOS and their preemption rules. </LI>
</UL>
</LI>
<LI>
<B>Priority</B>: Configure the partition's <I>Priority</I> setting relative to
other partitions to control the preemptive behavior when
<I>PreemptType=preempt/partition_prio</I>.
This option is not relevant if <I>PreemptType=preempt/qos</I>.
If two jobs from two
different partitions are allocated to the same resources, the job in the
partition with the greater <I>Priority</I> value will preempt the job in the
partition with the lesser <I>Priority</I> value. If the <I>Priority</I> values
of the two partitions are equal then no preemption will occur. The default
<I>Priority</I> value is 1.
<BR><B>NOTE:</B> Unless <I>PreemptType=preempt/partition_prio</I> the
partition <I>Priority</I> is not critical.
</LI>
<LI>
<B>Shared</B>: Configure the partition's <I>Shared</I> setting to
<I>FORCE</I> for all partitions in which job preemption is to take place.
The <I>FORCE</I> option supports an additional parameter that controls
how many jobs can share a resource (FORCE[:max_share]). By default the
max_share value is 4. In order to preempt jobs (and not gang schedule them),
always set max_share to 1. To allow up to 2 jobs from this partition to be
allocated to a common resource (and gang scheduled), set
<I>Shared=FORCE:2</I>.
</LI>
</UL>
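<P>
Putting these parameters together, a minimal <I>slurm.conf</I> excerpt for
suspend/resume preemption between two partitions might look like the
following sketch. The node names, memory sizes and accounting plugin shown
here are illustrative only and must be adapted to your site:
</P>
<PRE>
# Illustrative slurm.conf excerpt: partition-priority based preemption
# with suspend/resume (values are examples only)
SelectType=select/linear
# Track memory so that the memory of suspended jobs remains accounted for
SelectTypeParameters=CR_Memory
# Default memory (in MB) per allocated CPU for jobs that do not specify --mem
DefMemPerCPU=1024
# Optional: enforce memory limits through job accounting
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
PreemptMode=SUSPEND,GANG
PreemptType=preempt/partition_prio
PartitionName=active Nodes=n[12-16] Default=YES Shared=FORCE:1 Priority=1
PartitionName=hipri  Nodes=n[12-16] Shared=FORCE:1 Priority=2
</PRE>
<P>
If <I>PreemptType=preempt/qos</I> is used instead, the preemption
relationships are defined on the QOS records in the accounting database
(for example with <I>sacctmgr modify qos</I>) rather than by the partition
<I>Priority</I> values.
</P>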
<P>
To enable preemption after making the configuration changes described above,
restart SLURM if it is already running. Any change to the plugin settings in
SLURM requires a full restart of the daemons. If you just change the partition
<I>Priority</I> or <I>Shared</I> setting, this can be updated with
<I>scontrol reconfig</I>.
</P>
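<P>
For example (the exact service management commands depend upon how SLURM was
installed at your site):
</P>
<PRE>
# Changes to SelectType, PreemptType or PreemptMode require a daemon restart,
# e.g. through the distribution's init script:
/etc/init.d/slurm restart

# Changes limited to a partition's Priority or Shared setting only require:
scontrol reconfig
</PRE>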
<H2>Preemption Design and Operation</H2>
<P>
The select plugin will identify resources where a pending job can begin
execution.
When <I>PreemptMode</I> is configured to CANCEL, CHECKPOINT, SUSPEND or
REQUEUE, the select plugin will also preempt running jobs as needed to
initiate the pending job.
When <I>PreemptMode=SUSPEND,GANG</I> the select plugin will initiate the
pending job and rely upon the gang scheduling logic to perform job suspend
and resume as described below.
</P>
<P>
When enabled, the gang scheduling logic (which also supports job
preemption) keeps track of the resources allocated to all jobs.
For each partition an "active bitmap" is maintained that tracks all
concurrently running jobs in the SLURM cluster.
Each partition also maintains a job list for that partition, and a list of
"shadow" jobs.
The "shadow" jobs are high priority job allocations that "cast shadows" on the
active bitmaps of the low priority jobs.
Jobs caught in these "shadows" will be preempted.
</P>
<P>
Each time a new job is allocated to resources in a partition and begins
running, the gang scheduler adds a "shadow" of this job to all lower priority
partitions.
The active bitmaps of these lower priority partitions are then rebuilt, with
the shadow jobs added first.
Any existing jobs that were replaced by one or more "shadow" jobs are
suspended (preempted). Conversely, when a high priority running job completes,
it's "shadow" goes away and the active bitmaps of the lower priority
partitions are rebuilt to see if any suspended jobs can be resumed.
</P>
<P>
The gang scheduler plugin is designed to be <I>reactive</I> to the resource
allocation decisions made by the "select" plugins.
The "select" plugins have been enhanced to recognize when job preemption has
been configured, and to factor in the priority of each partition when selecting resources for a job.
When choosing resources for each job, the selector avoids resources that are
in use by other jobs (unless sharing has been configured, in which case it
does some load-balancing).
However, when job preemption is enabled, the select plugins may choose
resources that are already in use by jobs from partitions with a lower
priority setting, even when sharing is disabled in those partitions.
</P>
<P>
This leaves the gang scheduler in charge of controlling which jobs should run
on the over-allocated resources.
If <I>PreemptMode=SUSPEND</I>, jobs are suspended using the same internal
functions that support <I>scontrol suspend</I> and <I>scontrol resume</I>.
A good way to observe the operation of the gang scheduler is by running
<I>squeue -i&lt;time&gt;</I> in a terminal window.
</P>
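<P>
For example, the following refreshes the queue display every 5 seconds so
that the suspend and resume activity can be watched as it happens:
</P>
<PRE>
[user@n16 ~]$ <B>squeue -i5</B>
</PRE>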
<H2>A Simple Example</H2>
<P>
The following example is configured with <I>select/linear</I> and
<I>PreemptMode=SUSPEND,GANG</I>.
This example takes place on a cluster of 5 nodes:
</P>
<PRE>
[user@n16 ~]$ <B>sinfo</B>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
active* up infinite 5 idle n[12-16]
hipri up infinite 5 idle n[12-16]
</PRE>
<P>
Here are the Partition settings:
</P>
<PRE>
[user@n16 ~]$ <B>grep PartitionName /shared/slurm/slurm.conf</B>
PartitionName=DEFAULT Shared=FORCE:1 Nodes=n[12-16]
PartitionName=active Priority=1 Default=YES
PartitionName=hipri Priority=2
</PRE>
<P>
The <I>runit.pl</I> script launches a simple load-generating app that runs
for the given number of seconds. Submit 5 single-node <I>runit.pl</I> jobs to
run on all nodes:
</P>
<PRE>
[user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B>
sbatch: Submitted batch job 485
[user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B>
sbatch: Submitted batch job 486
[user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B>
sbatch: Submitted batch job 487
[user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B>
sbatch: Submitted batch job 488
[user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B>
sbatch: Submitted batch job 489
[user@n16 ~]$ <B>squeue -Si</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
485 active runit.pl user R 0:06 1 n12
486 active runit.pl user R 0:06 1 n13
487 active runit.pl user R 0:05 1 n14
488 active runit.pl user R 0:05 1 n15
489 active runit.pl user R 0:04 1 n16
</PRE>
<P>
Now submit a short-running 3-node job to the <I>hipri</I> partition:
</P>
<PRE>
[user@n16 ~]$ <B>sbatch -N3 -p hipri ./runit.pl 30</B>
sbatch: Submitted batch job 490
[user@n16 ~]$ <B>squeue -Si</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
485 active runit.pl user S 0:27 1 n12
486 active runit.pl user S 0:27 1 n13
487 active runit.pl user S 0:26 1 n14
488 active runit.pl user R 0:29 1 n15
489 active runit.pl user R 0:28 1 n16
490 hipri runit.pl user R 0:03 3 n[12-14]
</PRE>
<P>
Job 490 in the <I>hipri</I> partition preempted jobs 485, 486, and 487 from
the <I>active</I> partition. Jobs 488 and 489 in the <I>active</I> partition
remained running.
</P>
<P>
This state persisted until job 490 completed, at which point the preempted jobs
were resumed:
</P>
<PRE>
[user@n16 ~]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST
485 active runit.pl user R 0:30 1 n12
486 active runit.pl user R 0:30 1 n13
487 active runit.pl user R 0:29 1 n14
488 active runit.pl user R 0:59 1 n15
489 active runit.pl user R 0:58 1 n16
</PRE>
<H2><A NAME="future_work">Future Ideas</A></H2>
<P>
<B>More intelligence in the select plugins</B>: This implementation of
preemption relies on intelligent job placement by the <I>select</I> plugins.
In SLURM version 2.0 preemptive placement support was added to the
SelectType plugins, but there is still room for improvement.
</P><P>
Take the following example:
</P>
<PRE>
[user@n8 ~]$ <B>sinfo</B>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
active* up infinite 5 idle n[1-5]
hipri up infinite 5 idle n[1-5]
[user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B>
sbatch: Submitted batch job 17
[user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B>
sbatch: Submitted batch job 18
[user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B>
sbatch: Submitted batch job 19
[user@n8 ~]$ <B>squeue</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
17 active sleepme cholmes R 0:03 1 n1
18 active sleepme cholmes R 0:03 1 n2
19 active sleepme cholmes R 0:02 1 n3
[user@n8 ~]$ <B>sbatch -N3 -n6 -p hipri ./sleepme 20</B>
sbatch: Submitted batch job 20
[user@n8 ~]$ <B>squeue -Si</B>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
17 active sleepme cholmes S 0:16 1 n1
18 active sleepme cholmes S 0:16 1 n2
19 active sleepme cholmes S 0:15 1 n3
20 hipri sleepme cholmes R 0:03 3 n[1-3]
[user@n8 ~]$ <B>sinfo</B>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
active* up infinite 3 alloc n[1-3]
active* up infinite 2 idle n[4-5]
hipri up infinite 3 alloc n[1-3]
hipri up infinite 2 idle n[4-5]
</PRE>
<P>
It would be better if the "hipri" job had been placed on nodes n[3-5], which
would have allowed jobs 17 and 18 to continue running. However, a new "intelligent"
algorithm would have to include factors such as job size and required nodes in
order to support ideal placements such as this, which can quickly complicate
the design. Any and all help is welcome here!
</P>
<p style="text-align:center;">Last modified 19 August 2009</p>
<!--#include virtual="footer.txt"-->