| <!--#include virtual="header.txt"--> |
| |
| <H1>Preemption</H1> |
| |
| <P> |
| SLURM supports job preemption, the act of stopping one or more "low-priority" |
| jobs to let a "high-priority" job run uninterrupted until it completes. |
| Job preemption is implemented as a variation of SLURM's |
| <a href="gang_scheduling.html">Gang Scheduling</a> logic. |
| When a high-priority job has been allocated resources that have already been |
| allocated to one or more low priority jobs, the low priority job(s) are |
| preempted. |
| The low priority job(s) can resume once the high priority job completes. |
Alternatively, in newer versions of SLURM, the low priority job(s) can be
requeued and started using other resources.
| </P> |
| <P> |
| In SLURM version 2.0 and earlier, high priority work is identified by the |
| priority of the job's partition and low priority jobs are always suspended. |
| The job preemption logic is within the <I>sched/gang</I> plugin. |
| In SLURM version 2.1 and higher, the job's partition priority or its |
Quality Of Service (QOS) can be used to identify which jobs can preempt
| or be preempted by other jobs. |
| </P> |
| <P> |
SLURM version 2.1 offers several options for the job preemption mechanism,
including checkpointing, requeuing, or canceling the low priority jobs.
Checkpointed jobs are not automatically requeued or restarted.
| Requeued jobs may restart faster by using different resources. |
| All of these new job preemption mechanisms release a job's memory space for |
| use by other jobs. |
In SLURM version 2.1, some job preemption logic was moved into the
<I>select</I> plugin and main code base to permit use of job preemption
together with the backfill scheduler plugin, <I>sched/backfill</I>.
| </P> |
| |
| <H2>Configuration</H2> |
| <P> |
There are several important configuration parameters relating to preemption
(a sample <I>slurm.conf</I> excerpt follows this list):
| </P> |
| <UL> |
| <LI> |
| <B>SelectType</B>: SLURM job preemption logic supports nodes allocated by the |
| <I>select/linear</I> plugin and socket/core/CPU resources allocated by the |
| <I>select/cons_res</I> plugin. |
| </LI> |
| <LI> |
<B>SelectTypeParameters</B>: Since resources may be over-allocated with
jobs (suspended jobs remain in memory), the resource selection
plugin should be configured to track the amount of memory used by each job to
ensure that memory page swapping does not occur. When <I>select/linear</I> is
chosen, we recommend setting <I>SelectTypeParameters=CR_Memory</I>. When
<I>select/cons_res</I> is chosen, we recommend including Memory as a resource
(e.g. <I>SelectTypeParameters=CR_Core_Memory</I>).
<BR><B>NOTE:</B> These memory management parameters are not critical
unless <I>PreemptMode=SUSPEND,GANG</I> is configured.
| </LI> |
| <LI> |
| <B>DefMemPerCPU</B>: Since job requests may not explicitly specify |
| a memory requirement, we also recommend configuring |
| <I>DefMemPerCPU</I> (default memory per allocated CPU) or |
| <I>DefMemPerNode</I> (default memory per allocated node). |
| It may also be desirable to configure |
| <I>MaxMemPerCPU</I> (maximum memory per allocated CPU) or |
| <I>MaxMemPerNode</I> (maximum memory per allocated node) in <I>slurm.conf</I>. |
| Users can use the <I>--mem</I> or <I>--mem-per-cpu</I> option |
| at job submission time to specify their memory requirements. |
<BR><B>NOTE:</B> These memory management parameters are not critical
unless <I>PreemptMode=SUSPEND,GANG</I> is configured.
| </LI> |
| <LI> |
| <B>JobAcctGatherType and JobAcctGatherFrequency</B>: The "maximum data segment |
| size" and "maximum virtual memory size" system limits will be configured for |
| each job to ensure that the job does not exceed its requested amount of memory. |
| If you wish to enable additional enforcement of memory limits, configure job |
| accounting with the <I>JobAcctGatherType</I> and <I>JobAcctGatherFrequency</I> |
| parameters. When accounting is enabled and a job exceeds its configured memory |
limits, it will be canceled in order to prevent it from adversely affecting
| other jobs sharing the same resources. |
<BR><B>NOTE:</B> These memory management parameters are not critical
unless <I>PreemptMode=SUSPEND,GANG</I> is configured.
| </LI> |
| <LI> |
| <B>PreemptMode</B>: Configure to <I>CANCEL</I>, <I>CHECKPOINT</I>, |
| <I>SUSPEND</I> or <I>REQUEUE</I> depending on the desired action for low |
| priority jobs. |
| <UL> |
| <LI>A value of <I>CANCEL</I> will always cancel the job.</LI> |
<LI>A value of <I>CHECKPOINT</I> will checkpoint (if possible) or kill low
priority jobs. Checkpointed jobs are not automatically restarted.</LI>
| <LI>A value of <I>REQUEUE</I> will requeue (if possible) or kill low priority |
| jobs. Requeued jobs are permitted to be restarted on different resources.</LI> |
| <LI>A value of <I>SUSPEND</I> will suspend and automatically resume the low |
| priority jobs. The <I>SUSPEND</I> option must be used with the <I>GANG</I> |
| option (e.g. "PreemptMode=SUSPEND,GANG").</LI> |
| </UL> |
| </LI> |
| <LI> |
| <B>PreemptType</B>: Configure to the desired mechanism used to identify |
| which jobs can preempt other jobs. |
| <UL> |
| <LI><I>preempt/none</I> indicates that jobs will not preempt each other |
| (default).</LI> |
| <LI><I>preempt/partition_prio</I> indicates that jobs from one partition |
| can preempt jobs from lower priority partitions.</LI> |
| <LI><I>preempt/qos</I> indicates that jobs from one Quality Of Service (QOS) |
| can preempt jobs from a lower QOS. These jobs can be in the same partition |
| or different partitions. PreemptMode must be set to CANCEL, CHECKPOINT, |
| SUSPEND or REQUEUE. This option requires the use of a database identifying |
| available QOS and their preemption rules. </LI> |
| </UL> |
| </LI> |
| <LI> |
| <B>Priority</B>: Configure the partition's <I>Priority</I> setting relative to |
| other partitions to control the preemptive behavior when |
| <I>PreemptType=preempt/partition_prio</I>. |
| This option is not relevant if <I>PreemptType=preempt/qos</I>. |
| If two jobs from two |
| different partitions are allocated to the same resources, the job in the |
| partition with the greater <I>Priority</I> value will preempt the job in the |
| partition with the lesser <I>Priority</I> value. If the <I>Priority</I> values |
| of the two partitions are equal then no preemption will occur. The default |
| <I>Priority</I> value is 1. |
<BR><B>NOTE:</B> The partition <I>Priority</I> is not critical
unless <I>PreemptType=preempt/partition_prio</I> is configured.
| </LI> |
| <LI> |
| <B>Shared</B>: Configure the partition's <I>Shared</I> setting to |
| <I>FORCE</I> for all partitions in which job preemption is to take place. |
| The <I>FORCE</I> option supports an additional parameter that controls |
| how many jobs can share a resource (FORCE[:max_share]). By default the |
| max_share value is 4. In order to preempt jobs (and not gang schedule them), |
| always set max_share to 1. To allow up to 2 jobs from this partition to be |
| allocated to a common resource (and gang scheduled), set |
| <I>Shared=FORCE:2</I>. |
| </LI> |
| </UL> |
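<P>
As a concrete illustration, below is a minimal <I>slurm.conf</I> excerpt
combining the parameters described above. This is only a sketch: the node
and partition names are hypothetical and the memory values must be adjusted
for your site.
</P>
<PRE>
# Track memory so that suspended jobs do not trigger page swapping
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=1024

# Suspend low priority jobs and automatically resume them later
PreemptMode=SUSPEND,GANG
PreemptType=preempt/partition_prio

# Jobs in "hipri" may preempt jobs in "active";
# max_share=1 prevents gang scheduling within a partition
PartitionName=active Priority=1 Shared=FORCE:1 Nodes=n[1-5] Default=YES
PartitionName=hipri  Priority=2 Shared=FORCE:1 Nodes=n[1-5]
</PRE>
<P>
If <I>PreemptType=preempt/qos</I> is used instead, the preemption rules are
stored with the QOS records in the accounting database. Assuming QOS named
"high" and "normal" already exist there, a rule permitting "high" to preempt
"normal" might be defined with:
</P>
<PRE>
[user@n16 ~]$ <B>sacctmgr modify qos where name=high set preempt=normal</B>
</PRE>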
| <P> |
| To enable preemption after making the configuration changes described above, |
restart SLURM if it is already running; any change to the plugin settings
requires a full restart of the daemons. If you only change the partition
<I>Priority</I> or <I>Shared</I> settings, they can be applied with
<I>scontrol reconfig</I>, as shown below.
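</P>
<P>
For example (the service script name and path are site-specific and shown
here only as a hypothetical illustration):
</P>
<PRE>
# PreemptType, PreemptMode or other plugin settings changed:
# a full restart of the daemons is required
[root@n16 ~]# <B>/etc/init.d/slurm restart</B>

# Only a partition's Priority or Shared setting changed:
[user@n16 ~]$ <B>scontrol reconfig</B>
</PRE>
<P>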
| </P> |
| |
| <H2>Preemption Design and Operation</H2> |
| |
| <P> |
| The select plugin will identify resources where a pending job can begin |
| execution. |
| When <I>PreemptMode</I> is configured to CANCEL, CHECKPOINT, SUSPEND or |
| REQUEUE, the select plugin will also preempt running jobs as needed to |
| initiate the pending job. |
| When <I>PreemptMode=SUSPEND,GANG</I> the select plugin will initiate the |
| pending job and rely upon the gang scheduling logic to perform job suspend |
| and resume as described below. |
| </P> |
| <P> |
When enabled, the gang scheduling logic (which also supports job
| preemption) keeps track of the resources allocated to all jobs. |
| For each partition an "active bitmap" is maintained that tracks all |
| concurrently running jobs in the SLURM cluster. |
| Each partition also maintains a job list for that partition, and a list of |
| "shadow" jobs. |
| The "shadow" jobs are high priority job allocations that "cast shadows" on the |
| active bitmaps of the low priority jobs. |
| Jobs caught in these "shadows" will be preempted. |
| </P> |
| <P> |
| Each time a new job is allocated to resources in a partition and begins |
| running, the gang scheduler adds a "shadow" of this job to all lower priority |
| partitions. |
The active bitmaps of these lower priority partitions are then rebuilt,
with the shadow jobs added first.
| Any existing jobs that were replaced by one or more "shadow" jobs are |
| suspended (preempted). Conversely, when a high priority running job completes, |
| it's "shadow" goes away and the active bitmaps of the lower priority |
| partitions are rebuilt to see if any suspended jobs can be resumed. |
| </P> |
| <P> |
| The gang scheduler plugin is designed to be <I>reactive</I> to the resource |
| allocation decisions made by the "select" plugins. |
| The "select" plugins have been enhanced to recognize when job preemption has |
| been configured, and to factor in the priority of each partition when selecting resources for a job. |
| When choosing resources for each job, the selector avoids resources that are |
| in use by other jobs (unless sharing has been configured, in which case it |
| does some load-balancing). |
| However, when job preemption is enabled, the select plugins may choose |
| resources that are already in use by jobs from partitions with a lower |
| priority setting, even when sharing is disabled in those partitions. |
| </P> |
| <P> |
| This leaves the gang scheduler in charge of controlling which jobs should run |
| on the over-allocated resources. |
| If <I>PreemptMode=SUSPEND</I>, jobs are suspended using the same internal |
| functions that support <I>scontrol suspend</I> and <I>scontrol resume</I>. |
| A good way to observe the operation of the gang scheduler is by running |
<I>squeue -i&lt;time&gt;</I> in a terminal window.
| </P> |
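<P>
For example, the following reprints the queue every 5 seconds; the same
suspend/resume functions can also be exercised by hand on an arbitrary
job ID (485 here is only an illustration):
</P>
<PRE>
[user@n16 ~]$ <B>squeue -i5</B>

[user@n16 ~]$ <B>scontrol suspend 485</B>
[user@n16 ~]$ <B>scontrol resume 485</B>
</PRE>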
| |
| <H2>A Simple Example</H2> |
| |
| <P> |
| The following example is configured with <I>select/linear</I> and |
| <I>PreemptMode=SUSPEND,GANG</I>. |
| This example takes place on a cluster of 5 nodes: |
| </P> |
| <PRE> |
| [user@n16 ~]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 5 idle n[12-16] |
| hipri up infinite 5 idle n[12-16] |
| </PRE> |
| <P> |
| Here are the Partition settings: |
| </P> |
| <PRE> |
| [user@n16 ~]$ <B>grep PartitionName /shared/slurm/slurm.conf</B> |
| PartitionName=DEFAULT Shared=FORCE:1 Nodes=n[12-16] |
| PartitionName=active Priority=1 Default=YES |
| PartitionName=hipri Priority=2 |
| </PRE> |
| <P> |
| The <I>runit.pl</I> script launches a simple load-generating app that runs |
| for the given number of seconds. Submit 5 single-node <I>runit.pl</I> jobs to |
| run on all nodes: |
| </P> |
| <PRE> |
| [user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B> |
| sbatch: Submitted batch job 485 |
| [user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B> |
| sbatch: Submitted batch job 486 |
| [user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B> |
| sbatch: Submitted batch job 487 |
| [user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B> |
| sbatch: Submitted batch job 488 |
| [user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B> |
| sbatch: Submitted batch job 489 |
| [user@n16 ~]$ <B>squeue -Si</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 485 active runit.pl user R 0:06 1 n12 |
| 486 active runit.pl user R 0:06 1 n13 |
| 487 active runit.pl user R 0:05 1 n14 |
| 488 active runit.pl user R 0:05 1 n15 |
| 489 active runit.pl user R 0:04 1 n16 |
| </PRE> |
| <P> |
| Now submit a short-running 3-node job to the <I>hipri</I> partition: |
| </P> |
| <PRE> |
| [user@n16 ~]$ <B>sbatch -N3 -p hipri ./runit.pl 30</B> |
| sbatch: Submitted batch job 490 |
| [user@n16 ~]$ <B>squeue -Si</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 485 active runit.pl user S 0:27 1 n12 |
| 486 active runit.pl user S 0:27 1 n13 |
| 487 active runit.pl user S 0:26 1 n14 |
| 488 active runit.pl user R 0:29 1 n15 |
| 489 active runit.pl user R 0:28 1 n16 |
| 490 hipri runit.pl user R 0:03 3 n[12-14] |
| </PRE> |
| <P> |
| Job 490 in the <I>hipri</I> partition preempted jobs 485, 486, and 487 from |
| the <I>active</I> partition. Jobs 488 and 489 in the <I>active</I> partition |
| remained running. |
| </P> |
| <P> |
| This state persisted until job 490 completed, at which point the preempted jobs |
| were resumed: |
| </P> |
| <PRE> |
| [user@n16 ~]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 485 active runit.pl user R 0:30 1 n12 |
| 486 active runit.pl user R 0:30 1 n13 |
| 487 active runit.pl user R 0:29 1 n14 |
| 488 active runit.pl user R 0:59 1 n15 |
| 489 active runit.pl user R 0:58 1 n16 |
| </PRE> |
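<P>
The <I>runit.pl</I> script itself is not shown in this document. Any program
that keeps a CPU busy for the number of seconds given as its argument will
behave the same way; a minimal hypothetical shell stand-in (assuming a
system where <I>date +%s</I> is available) might look like:
</P>
<PRE>
#!/bin/sh
# Busy-loop for the number of seconds given as the first argument
end=$(( $(date +%s) + $1 ))
while [ $(date +%s) -lt "$end" ]; do :; done
</PRE>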
| |
| |
| <H2><A NAME="future_work">Future Ideas</A></H2> |
| |
| <P> |
| <B>More intelligence in the select plugins</B>: This implementation of |
| preemption relies on intelligent job placement by the <I>select</I> plugins. |
| In SLURM version 2.0 preemptive placement support was added to the |
| SelectType plugins, but there is still room for improvement. |
| </P><P> |
| Take the following example: |
| </P> |
| <PRE> |
| [user@n8 ~]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 5 idle n[1-5] |
| hipri up infinite 5 idle n[1-5] |
| [user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B> |
| sbatch: Submitted batch job 17 |
| [user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B> |
| sbatch: Submitted batch job 18 |
| [user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B> |
| sbatch: Submitted batch job 19 |
| [user@n8 ~]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 17 active sleepme cholmes R 0:03 1 n1 |
| 18 active sleepme cholmes R 0:03 1 n2 |
| 19 active sleepme cholmes R 0:02 1 n3 |
| [user@n8 ~]$ <B>sbatch -N3 -n6 -p hipri ./sleepme 20</B> |
| sbatch: Submitted batch job 20 |
| [user@n8 ~]$ <B>squeue -Si</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 17 active sleepme cholmes S 0:16 1 n1 |
| 18 active sleepme cholmes S 0:16 1 n2 |
| 19 active sleepme cholmes S 0:15 1 n3 |
| 20 hipri sleepme cholmes R 0:03 3 n[1-3] |
| [user@n8 ~]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 3 alloc n[1-3] |
| active* up infinite 2 idle n[4-5] |
| hipri up infinite 3 alloc n[1-3] |
| hipri up infinite 2 idle n[4-5] |
| </PRE> |
| <P> |
It would be better if the "hipri" job were placed on nodes n[3-5], which
would allow jobs 17 and 18 to continue running. However, a new "intelligent"
| algorithm would have to include factors such as job size and required nodes in |
| order to support ideal placements such as this, which can quickly complicate |
| the design. Any and all help is welcome here! |
| </P> |
| |
| <p style="text-align:center;">Last modified 19 August 2009</p> |
| |
| <!--#include virtual="footer.txt"--> |