| <!--#include virtual="header.txt"--> |
| |
| <H1>Preemption</H1> |
| |
| <P> |
| SLURM version 1.2 and earlier supported dedication of resources |
| to jobs based on a simple "first come, first served" policy with backfill. |
| Beginning in SLURM version 1.3, priority partitions and priority-based |
| <I>preemption</I> are supported. Preemption is the act of suspending one or more |
| "low-priority" jobs to let a "high-priority" job run uninterrupted until it |
| completes. Preemption provides the ability to prioritize the workload on a |
| cluster. |
| </P> |
| <P> |
| The SLURM version 1.3.1 <I>sched/gang</I> plugin supports preemption. |
| When configured, |
the plugin monitors each of the partitions in SLURM. If a new job in a
high-priority partition has been allocated resources that are already in use by
one or more jobs from lower priority partitions, the plugin respects the
partition priorities and suspends the low-priority job(s). The low-priority
job(s) remain suspended until the high-priority job completes, at which point
they are resumed.
| </P> |
| |
| <H2>Configuration</H2> |
| <P> |
There are several important configuration parameters relating to preemption
(a combined <I>slurm.conf</I> example appears after the list):
| </P> |
| <UL> |
| <LI> |
| <B>SelectType</B>: The SLURM <I>sched/gang</I> plugin supports nodes |
| allocated by the <I>select/linear</I> plugin and socket/core/CPU resources |
| allocated by the <I>select/cons_res</I> plugin. |
| </LI> |
| <LI> |
<B>SelectTypeParameters</B>: Since resources will be overallocated
(suspended jobs remain resident in memory), the resource selection
plugin should be configured to track the amount of memory used by each job to
ensure that memory page swapping does not occur. When <I>select/linear</I> is
chosen, we recommend setting <I>SelectTypeParameters=CR_Memory</I>. When
<I>select/cons_res</I> is chosen, we recommend including memory as a resource
(e.g. <I>SelectTypeParameters=CR_Core_Memory</I>).
| </LI> |
| <LI> |
| <B>DefMemPerCPU</B>: Since job requests may not explicitly specify |
| a memory requirement, we also recommend configuring |
| <I>DefMemPerCPU</I> (default memory per allocated CPU) or |
| <I>DefMemPerNode</I> (default memory per allocated node). |
| It may also be desirable to configure |
| <I>MaxMemPerCPU</I> (maximum memory per allocated CPU) or |
| <I>MaxMemPerNode</I> (maximum memory per allocated node) in <I>slurm.conf</I>. |
| Users can use the <I>--mem</I> or <I>--mem-per-cpu</I> option |
| at job submission time to specify their memory requirements. |
| </LI> |
| <LI> |
| <B>JobAcctGatherType and JobAcctGatherFrequency</B>: The "maximum data segment |
| size" and "maximum virtual memory size" system limits will be configured for |
| each job to ensure that the job does not exceed its requested amount of memory. |
| If you wish to enable additional enforcement of memory limits, configure job |
| accounting with the <I>JobAcctGatherType</I> and <I>JobAcctGatherFrequency</I> |
parameters. When accounting is enabled and a job exceeds its configured memory
limits, it will be canceled in order to prevent it from adversely affecting
other jobs sharing the same resources.
| </LI> |
| <LI> |
| <B>SchedulerType</B>: Configure the <I>sched/gang</I> plugin by setting |
| <I>SchedulerType=sched/gang</I> in <I>slurm.conf</I>. |
| </LI> |
| <LI> |
| <B>Priority</B>: Configure the partition's <I>Priority</I> setting relative to |
| other partitions to control the preemptive behavior. If two jobs from two |
| different partitions are allocated to the same resources, the job in the |
| partition with the greater <I>Priority</I> value will preempt the job in the |
| partition with the lesser <I>Priority</I> value. If the <I>Priority</I> values |
| of the two partitions are equal then no preemption will occur. The default |
| <I>Priority</I> value is 1. |
| </LI> |
| <LI> |
| <B>SchedulerTimeSlice</B>: The default timeslice interval is 30 seconds. |
| To change this duration, set <I>SchedulerTimeSlice</I> to the desired interval |
| (in seconds) in <I>slurm.conf</I>. For example, to set the timeslice interval |
to one minute, set <I>SchedulerTimeSlice=60</I>. Short values can increase
the overhead of gang scheduling. This parameter is only relevant if timeslicing
within a partition is configured; preemption and timeslicing can occur at
the same time.
| </LI> |
| </UL> |
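<P>
Putting these parameters together, a <I>slurm.conf</I> fragment for a
preemption-capable configuration might look like the following. This is only
an illustrative sketch: the node names, CPU counts, memory sizes, and
partition layout are assumptions, not a recommended configuration.
</P>
<PRE>
# Resource selection: track cores and memory so that suspended jobs
# (which stay resident in memory) are accounted for.
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Default and maximum memory per allocated CPU (values are examples).
DefMemPerCPU=512
MaxMemPerCPU=1024

# Enable the gang scheduler and set the timeslice interval (seconds).
SchedulerType=sched/gang
SchedulerTimeSlice=60

# Optional: enforce memory limits through job accounting.
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

# Example nodes and partitions; "hipri" preempts "active".
NodeName=n[12-16] Procs=4 RealMemory=2048
PartitionName=active Priority=1 Default=YES Shared=NO Nodes=n[12-16]
PartitionName=hipri  Priority=2 Shared=NO Nodes=n[12-16]
</PRE>
<P>
Jobs can still override the memory defaults at submission time, for example
<I>sbatch --mem-per-cpu=256 -N1 ./myjob.sh</I> (the script name here is just a
placeholder).
</P>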
| <P> |
| To enable preemption after making the configuration changes described above, |
| restart SLURM if it is already running. Any change to the plugin settings in |
SLURM requires a full restart of the daemons. If you only change a partition's
<I>Priority</I> or <I>Shared</I> setting, the change can be applied with
<I>scontrol reconfig</I>.
| </P> |
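<P>
For example, after editing <I>slurm.conf</I> to select the <I>sched/gang</I>
plugin you would restart the daemons, whereas a partition priority change can
be picked up without a restart. The restart commands below assume a typical
init-script installation and may differ on your system:
</P>
<PRE>
# Plugin changes (SchedulerType, SelectType, ...) require a full restart
# of slurmctld and all slurmd daemons, for example:
/etc/init.d/slurm stop
/etc/init.d/slurm start

# Partition Priority or Shared changes only need a reconfigure:
scontrol reconfig
</PRE>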
| |
| <H2>Preemption Design and Operation</H2> |
| |
| <P> |
| When enabled, the <I>sched/gang</I> plugin keeps track of the resources |
| allocated to all jobs. For each partition an "active bitmap" is maintained that |
| tracks all concurrently running jobs in the SLURM cluster. Each partition also |
| maintains a job list for that partition, and a list of "shadow" jobs. The |
| "shadow" jobs are job allocations from higher priority partitions that "cast |
| shadows" on the active bitmaps of the lower priority partitions. Jobs in lower |
| priority partitions that are caught in these "shadows" will be suspended. |
| </P> |
| <P> |
Each time a new job is allocated resources in a partition and begins running,
the <I>sched/gang</I> plugin adds a "shadow" of this job to all lower priority
partitions. The active bitmaps of these lower priority partitions are then
rebuilt, with the shadow jobs added first. Any existing jobs that were replaced
by one or more "shadow" jobs are suspended (preempted). Conversely, when a
high-priority running job completes, its "shadow" goes away and the active
bitmaps of the lower priority partitions are rebuilt to see if any suspended
jobs can be resumed.
| </P> |
| <P> |
| The gang scheduler plugin is designed to be <I>reactive</I> to the resource |
| allocation decisions made by the "select" plugins. The "select" plugins have |
| been enhanced to recognize when "sched/gang" has been configured, and to factor |
| in the priority of each partition when selecting resources for a job. When |
| choosing resources for each job, the selector avoids resources that are in use |
| by other jobs (unless sharing has been configured, in which case it does some |
| load-balancing). However, when "sched/gang" is enabled, the select plugins may |
| choose resources that are already in use by jobs from partitions with a lower |
| priority setting, even when sharing is disabled in those partitions. |
| </P> |
| <P> |
| This leaves the gang scheduler in charge of controlling which jobs should run on |
| the overallocated resources. The <I>sched/gang</I> plugin suspends jobs via the |
| same internal functions that support <I>scontrol suspend</I> and <I>scontrol |
| resume</I>. A good way to observe the act of preemption is by running <I>watch |
| squeue</I> in a terminal window. |
| </P> |
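<P>
The same mechanism can be exercised by hand. For example, an administrator
could suspend and later resume a job manually (the job ID below is just an
illustration):
</P>
<PRE>
# Suspend job 485, then resume it later.
scontrol suspend 485
scontrol resume 485

# Watch jobs move between the R (running) and S (suspended) states.
watch squeue
</PRE>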
| |
| <H2>A Simple Example</H2> |
| |
| <P> |
| The following example is configured with <I>select/linear</I> and |
| <I>sched/gang</I>. This example takes place on a cluster of 5 nodes: |
| </P> |
| <PRE> |
| [user@n16 ~]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 5 idle n[12-16] |
| hipri up infinite 5 idle n[12-16] |
| </PRE> |
| <P> |
| Here are the Partition settings: |
| </P> |
| <PRE> |
| [user@n16 ~]$ <B>grep PartitionName /shared/slurm/slurm.conf</B> |
| PartitionName=active Priority=1 Default=YES Shared=NO Nodes=n[12-16] |
| PartitionName=hipri Priority=2 Shared=NO Nodes=n[12-16] |
| </PRE> |
| <P> |
| The <I>runit.pl</I> script launches a simple load-generating app that runs |
| for the given number of seconds. Submit 5 single-node <I>runit.pl</I> jobs to |
| run on all nodes: |
| </P> |
| <PRE> |
| [user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B> |
| sbatch: Submitted batch job 485 |
| [user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B> |
| sbatch: Submitted batch job 486 |
| [user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B> |
| sbatch: Submitted batch job 487 |
| [user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B> |
| sbatch: Submitted batch job 488 |
| [user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B> |
| sbatch: Submitted batch job 489 |
| [user@n16 ~]$ <B>squeue -Si</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 485 active runit.pl user R 0:06 1 n12 |
| 486 active runit.pl user R 0:06 1 n13 |
| 487 active runit.pl user R 0:05 1 n14 |
| 488 active runit.pl user R 0:05 1 n15 |
| 489 active runit.pl user R 0:04 1 n16 |
| </PRE> |
| <P> |
| Now submit a short-running 3-node job to the <I>hipri</I> partition: |
| </P> |
| <PRE> |
| [user@n16 ~]$ <B>sbatch -N3 -p hipri ./runit.pl 30</B> |
| sbatch: Submitted batch job 490 |
| [user@n16 ~]$ <B>squeue -Si</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 485 active runit.pl user S 0:27 1 n12 |
| 486 active runit.pl user S 0:27 1 n13 |
| 487 active runit.pl user S 0:26 1 n14 |
| 488 active runit.pl user R 0:29 1 n15 |
| 489 active runit.pl user R 0:28 1 n16 |
| 490 hipri runit.pl user R 0:03 3 n[12-14] |
| </PRE> |
| <P> |
| Job 490 in the <I>hipri</I> partition preempted jobs 485, 486, and 487 from |
| the <I>active</I> partition. Jobs 488 and 489 in the <I>active</I> partition |
| remained running. |
| </P> |
| <P> |
| This state persisted until job 490 completed, at which point the preempted jobs |
| were resumed: |
| </P> |
| <PRE> |
| [user@n16 ~]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 485 active runit.pl user R 0:30 1 n12 |
| 486 active runit.pl user R 0:30 1 n13 |
| 487 active runit.pl user R 0:29 1 n14 |
| 488 active runit.pl user R 0:59 1 n15 |
| 489 active runit.pl user R 0:58 1 n16 |
| </PRE> |
| |
| |
| <H2><A NAME="future_work">Future Ideas</A></H2> |
| |
| <P> |
| <B>More intelligence in the select plugins</B>: This implementation of |
preemption relies on intelligent job placement by the <I>select</I> plugins. In
SLURM 1.3.1 the <I>select/linear</I> plugin has a decent preemptive placement
algorithm, but the consumable resource <I>select/cons_res</I> plugin does not.
Preemptive placement support was added to the <I>select/cons_res</I> plugin in
SLURM 1.4, but there is still room for improvement.
| </P><P> |
| Take the following example: |
| </P> |
| <PRE> |
| [user@n8 ~]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 5 idle n[1-5] |
| hipri up infinite 5 idle n[1-5] |
| [user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B> |
| sbatch: Submitted batch job 17 |
| [user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B> |
| sbatch: Submitted batch job 18 |
| [user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B> |
| sbatch: Submitted batch job 19 |
| [user@n8 ~]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 17 active sleepme cholmes R 0:03 1 n1 |
| 18 active sleepme cholmes R 0:03 1 n2 |
| 19 active sleepme cholmes R 0:02 1 n3 |
| [user@n8 ~]$ <B>sbatch -N3 -n6 -p hipri ./sleepme 20</B> |
| sbatch: Submitted batch job 20 |
| [user@n8 ~]$ <B>squeue -Si</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 17 active sleepme cholmes S 0:16 1 n1 |
| 18 active sleepme cholmes S 0:16 1 n2 |
| 19 active sleepme cholmes S 0:15 1 n3 |
| 20 hipri sleepme cholmes R 0:03 3 n[1-3] |
| [user@n8 ~]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 3 alloc n[1-3] |
| active* up infinite 2 idle n[4-5] |
| hipri up infinite 3 alloc n[1-3] |
| hipri up infinite 2 idle n[4-5] |
| </PRE> |
| <P> |
Ideally the "hipri" job would have been placed on nodes n[3-5], which
would have allowed jobs 17 and 18 to continue running. However, a more
"intelligent" algorithm would have to factor in job size and required nodes
to support placements such as this, which can quickly complicate
the design. Any and all help is welcome here!
| </P> |
| <P> |
<B>Preemptive backfill</B>: The current backfill scheduler plugin
| ("sched/backfill") is a nice way to make efficient use of otherwise idle |
| resources. But SLURM only supports one scheduler plugin at a time. Fortunately, |
| given the design of the new "sched/gang" plugin, there is no direct overlap |
| between the backfill functionality and the gang-scheduling functionality. Thus, |
| it's possible that these two plugins could technically be merged into a new |
| scheduler plugin that supported preemption <U>and</U> backfill. <B>NOTE:</B> |
| this is only an idea based on a code review so there would likely need to be |
| some additional development, and plenty of testing! |
</P>
| <P> |
<B>Requeue a preempted job</B>: In some situations it may be desirable to
requeue a low-priority job rather than suspend it. Suspending a job leaves the
job in memory, while requeuing terminates the job and submits it again. The
"sched/gang" plugin would need to be modified to recognize which jobs can be
requeued, to requeue them only for preemption (not for timeslicing!), and to
issue the requeue request.
| </P> |
| |
| <p style="text-align:center;">Last modified 5 December 2008</p> |
| |
| <!--#include virtual="footer.txt"--> |