| <!--#include virtual="header.txt"--> |
| |
| <H1>Gang Scheduling</H1> |
| |
| <P> |
| SLURM supports timesliced gang scheduling in which two or more jobs are |
| allocated to the same resources and these jobs are alternately suspended to |
| let one job at a time have dedicated access to the resources for a configured |
| period of time. |
SLURM also supports preemptive priority job scheduling in which a higher
priority job can preempt a lower priority one until the higher priority job
completes.
| See the <a href="preempt.html">Preemption</a> document for more information. |
| </P> |
| <P> |
| A resource manager that supports timeslicing can improve responsiveness |
| and utilization by allowing more jobs to begin running sooner. |
| Shorter-running jobs no longer have to wait in a queue behind longer-running |
| jobs. |
Instead, they can run "in parallel" with the longer-running jobs, allowing
them to start and finish more quickly.
| Throughput is also improved because overcommitting the resources provides |
| opportunities for "local backfilling" to occur (see example below). |
| </P> |
| <P> |
| In SLURM version 2.0 and earlier, the <I>sched/gang</I> plugin provides |
| timeslicing. |
In SLURM version 2.1, the gang scheduling logic was moved directly into the
main code base to permit the use of gang scheduling together with the backfill
scheduler plugin, <I>sched/backfill</I>.
| In either case, the gang scheduling logic monitors each of the partitions in |
| SLURM. |
If a new job has been allocated to resources in a partition that have already
been allocated to an existing job, then the gang scheduling logic will suspend
the new job until the configured <I>SchedulerTimeSlice</I> interval has
elapsed.
Then it will suspend the running job and let the new job make use of the
resources for a <I>SchedulerTimeSlice</I> interval.
| This will continue until one of the jobs terminates. |
| </P> |
| |
| <H2>Configuration</H2> |
| |
| <P> |
There are several important configuration parameters relating to
gang scheduling; a combined <I>slurm.conf</I> example follows the list:
| </P> |
| <UL> |
| <LI> |
| <B>SelectType</B>: The SLURM gang scheduler supports nodes |
| allocated by the <I>select/linear</I> plugin and socket/core/CPU resources |
| allocated by the <I>select/cons_res</I> plugin. |
| </LI> |
| <LI> |
<B>SelectTypeParameters</B>: Since resources will be overcommitted with
jobs, the resource selection plugin should be configured to track the
amount of memory used by each job to ensure that memory page swapping does
not occur. When <I>select/linear</I> is chosen, we recommend setting
<I>SelectTypeParameters=CR_Memory</I>. When <I>select/cons_res</I> is
chosen, we recommend including Memory as a resource (e.g.
<I>SelectTypeParameters=CR_Core_Memory</I>).
| </LI> |
| <LI> |
| <B>DefMemPerCPU</B>: Since job requests may not explicitly specify |
| a memory requirement, we also recommend configuring |
| <I>DefMemPerCPU</I> (default memory per allocated CPU) or |
| <I>DefMemPerNode</I> (default memory per allocated node). |
| It may also be desirable to configure |
| <I>MaxMemPerCPU</I> (maximum memory per allocated CPU) or |
| <I>MaxMemPerNode</I> (maximum memory per allocated node) in <I>slurm.conf</I>. |
| Users can use the <I>--mem</I> or <I>--mem-per-cpu</I> option |
| at job submission time to specify their memory requirements. |
| </LI> |
| <LI> |
| <B>JobAcctGatherType and JobAcctGatherFrequency</B>: |
| If you wish to enforce memory limits, accounting must be enabled |
| using the <I>JobAcctGatherType</I> and <I>JobAcctGatherFrequency</I> |
parameters. If accounting is enabled and a job exceeds its configured
memory limits, it will be canceled in order to prevent it from
adversely affecting other jobs sharing the same resources.
| </LI> |
| <LI> |
| <B>PreemptMode</B>: set the <I>GANG</I> option. |
| Additional options may be specified to enable job preemption in addition |
| to gang scheduling. |
| </LI> |
| <LI> |
| <B>SchedulerTimeSlice</B>: The default timeslice interval is 30 seconds. |
| To change this duration, set <I>SchedulerTimeSlice</I> to the desired interval |
| (in seconds) in <I>slurm.conf</I>. For example, to set the timeslice interval |
| to one minute, set <I>SchedulerTimeSlice=60</I>. Short values can increase |
| the overhead of gang scheduling. |
| </LI> |
| <LI> |
| <B>Shared</B>: Configure the partition's <I>Shared</I> setting to |
| <I>FORCE</I> for all partitions in which timeslicing is to take place. |
| The <I>FORCE</I> option supports an additional parameter that controls |
| how many jobs can share a resource (FORCE[:max_share]). By default the |
| max_share value is 4. To allow up to 6 jobs from this partition to be |
| allocated to a common resource, set <I>Shared=FORCE:6</I>. To only let 2 jobs |
| timeslice on the same resources, set <I>Shared=FORCE:2</I>. |
| NOTE: Gang scheduling is performed independently for each partition, so |
| configuring partitions with overlapping nodes and gang scheduling is generally |
| not recommended. |
| </LI> |
| </UL> |
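<P>
Putting these parameters together, a combined <I>slurm.conf</I> excerpt for a
gang scheduled cluster might look like the following (the node and partition
definitions are illustrative only and should be adjusted for your site):
</P>
<PRE>
# slurm.conf excerpt (illustrative values)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=1024
PreemptMode=GANG
SchedulerTimeSlice=30
NodeName=n[12-16] Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=8192
PartitionName=active Nodes=n[12-16] Default=YES Shared=FORCE:4
</PRE>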
| <P> |
| In order to enable gang scheduling after making the configuration changes |
| described above, restart SLURM if it is already running. Any change to the |
| plugin settings in SLURM requires a full restart of the daemons. If you |
| just change the partition <I>Shared</I> setting, this can be updated with |
| <I>scontrol reconfig</I>. |
| </P> |
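<P>
For example, on a system where the SLURM daemons are managed by an init
script (the script name and location vary by installation), the two cases
might look like this:
</P>
<PRE>
# Changes to plugin settings (e.g. SelectType, PreemptMode) require a full
# restart; the restart command depends on how SLURM was installed:
/etc/init.d/slurm restart

# A change limited to a partition's Shared= value can be applied with:
scontrol reconfig
</PRE>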
| <P> |
| For an advanced topic discussion on the potential use of swap space, |
| see "Making use of swap space" in the "Future Work" section below. |
| </P> |
| |
| <H2>Timeslicer Design and Operation</H2> |
| |
| <P> |
| When enabled, the gang scheduler keeps track of the resources |
| allocated to all jobs. For each partition an "active bitmap" is maintained that |
| tracks all concurrently running jobs in the SLURM cluster. Each time a new |
| job is allocated to resources in a partition, the gang scheduler |
| compares these newly allocated resources with the resources already maintained |
| in the "active bitmap". |
If these two sets of resources are disjoint then the new job is added to the
"active bitmap". If these two sets of resources overlap then the new job is
suspended. All jobs are tracked in a per-partition job queue within the gang
scheduler logic.
| </P> |
| <P> |
| A separate <I>timeslicer thread</I> is spawned by the gang scheduler |
| on startup. This thread sleeps for the configured <I>SchedulerTimeSlice</I> |
| interval. When it wakes up, it checks each partition for suspended jobs. If |
| suspended jobs are found then the <I>timeslicer thread</I> moves all running |
| jobs to the end of the job queue. It then reconstructs the "active bitmap" for |
| this partition beginning with the suspended job that has waited the longest to |
run (this will be the first suspended job in the job queue). Each following job
| is then compared with the new "active bitmap", and if the job can be run |
| concurrently with the other "active" jobs then the job is added. Once this is |
| complete then the <I>timeslicer thread</I> suspends any currently running jobs |
| that are no longer part of the "active bitmap", and resumes jobs that are new |
| to the "active bitmap". |
| </P> |
| <P> |
This <I>timeslicer thread</I> algorithm for rotating jobs is designed to
prevent jobs from starving (remaining in the suspended state indefinitely) and
| to be as fair as possible in the distribution of runtime while still keeping |
| all of the resources as busy as possible. |
| </P> |
| <P> |
| The gang scheduler suspends jobs via the same internal functions that |
| support <I>scontrol suspend</I> and <I>scontrol resume</I>. |
| A good way to observe the operation of the timeslicer is by running |
<I>squeue -i&lt;time&gt;</I> in a terminal window where <I>time</I> is set
| equal to <I>SchedulerTimeSlice</I>. |
| </P> |
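<P>
For example, with the default <I>SchedulerTimeSlice</I> of 30 seconds:
</P>
<PRE>
[user@n16 load]$ <B>squeue -i30</B>
</PRE>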
| |
| <H2>A Simple Example</H2> |
| |
| <P> |
| The following example is configured with <I>select/linear</I> and <I>Shared=FORCE</I>. |
| This example takes place on a small cluster of 5 nodes: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 5 idle n[12-16] |
| </PRE> |
| <P> |
| Here are the Scheduler settings (excerpt of output): |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>scontrol show config</B> |
| ... |
| PreemptMode = GANG |
| ... |
| SchedulerTimeSlice = 30 |
| SchedulerType = sched/builtin |
| ... |
| </PRE> |
| <P> |
| The <I>myload</I> script launches a simple load-generating app that runs |
| for the given number of seconds. Submit <I>myload</I> to run on all nodes: |
| </P> |
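<P>
The <I>myload</I> script itself is not included in this document. As a rough
sketch only (not the actual script used in these examples), such a load
generator could be as simple as:
</P>
<PRE>
#!/bin/bash
# Illustrative sketch of a "myload" batch script: busy-loop on every
# allocated task for the number of seconds given as the first argument.
srun bash -c "end=\$((SECONDS+$1)); while [ \$SECONDS -lt \$end ]; do :; done"
</PRE>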
| <PRE> |
| [user@n16 load]$ <B>sbatch -N5 ./myload 300</B> |
| sbatch: Submitted batch job 3 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
3 active myload user R 0:05 5 n[12-16]
| </PRE> |
| <P> |
| Submit it again and watch the gang scheduler suspend it: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sbatch -N5 ./myload 300</B> |
| sbatch: Submitted batch job 4 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 3 active myload user R 0:13 5 n[12-16] |
| 4 active myload user S 0:00 5 n[12-16] |
| </PRE> |
| <P> |
| After 30 seconds the gang scheduler swaps jobs, and now job 4 is the |
| active one: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 4 active myload user R 0:08 5 n[12-16] |
| 3 active myload user S 0:41 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 4 active myload user R 0:21 5 n[12-16] |
| 3 active myload user S 0:41 5 n[12-16] |
| </PRE> |
| <P> |
| After another 30 seconds the gang scheduler sets job 3 running again: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 3 active myload user R 0:50 5 n[12-16] |
| 4 active myload user S 0:30 5 n[12-16] |
| </PRE> |
| |
| <P> |
| <B>A possible side effect of timeslicing</B>: Note that jobs that are |
| immediately suspended may cause their <I>srun</I> commands to produce the |
| following output: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>cat slurm-4.out</B> |
| srun: Job step creation temporarily disabled, retrying |
| srun: Job step creation still disabled, retrying |
| srun: Job step creation still disabled, retrying |
| srun: Job step creation still disabled, retrying |
| srun: Job step created |
| </PRE> |
| <P> |
This occurs because <I>srun</I> is attempting to launch a job step in an
allocation that has been suspended. The <I>srun</I> process will continue to
retry until the allocation has been resumed and the job step can be launched.
| </P> |
| <P> |
| When the gang scheduler is enabled, this type of output in the user |
| jobs should be considered benign. |
| </P> |
| |
| <H2>More examples</H2> |
| |
| <P> |
| The following example shows how the timeslicer algorithm keeps the resources |
| busy. Job 10 runs continually, while jobs 9 and 11 are timesliced: |
| </P> |
| |
| <PRE> |
| [user@n16 load]$ <B>sbatch -N3 ./myload 300</B> |
| sbatch: Submitted batch job 9 |
| |
| [user@n16 load]$ <B>sbatch -N2 ./myload 300</B> |
| sbatch: Submitted batch job 10 |
| |
| [user@n16 load]$ <B>sbatch -N3 ./myload 300</B> |
| sbatch: Submitted batch job 11 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 9 active myload user R 0:11 3 n[12-14] |
| 10 active myload user R 0:08 2 n[15-16] |
| 11 active myload user S 0:00 3 n[12-14] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 10 active myload user R 0:50 2 n[15-16] |
| 11 active myload user R 0:12 3 n[12-14] |
| 9 active myload user S 0:41 3 n[12-14] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 10 active myload user R 1:04 2 n[15-16] |
| 11 active myload user R 0:26 3 n[12-14] |
| 9 active myload user S 0:41 3 n[12-14] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 9 active myload user R 0:46 3 n[12-14] |
| 10 active myload user R 1:13 2 n[15-16] |
| 11 active myload user S 0:30 3 n[12-14] |
| </PRE> |
| <P> |
| The next example displays "local backfilling": |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sbatch -N3 ./myload 300</B> |
| sbatch: Submitted batch job 12 |
| |
| [user@n16 load]$ <B>sbatch -N5 ./myload 300</B> |
| sbatch: Submitted batch job 13 |
| |
| [user@n16 load]$ <B>sbatch -N2 ./myload 300</B> |
| sbatch: Submitted batch job 14 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 12 active myload user R 0:14 3 n[12-14] |
| 14 active myload user R 0:06 2 n[15-16] |
| 13 active myload user S 0:00 5 n[12-16] |
| </PRE> |
| <P> |
Without timeslicing and without the backfill scheduler enabled, job 14 would
have to wait for job 13 to finish.
| </P> |
| <P> |
| This is called "local" backfilling because the backfilling only occurs with |
| jobs close enough in the queue to get allocated by the scheduler as part of |
| oversubscribing the resources. Recall that the number of jobs that can |
| overcommit a resource is controlled by the <I>Shared=FORCE:max_share</I> value, |
| so this value effectively controls the scope of "local backfilling". |
| </P> |
| <P> |
| Normal backfill algorithms check <U>all</U> jobs in the wait queue. |
| </P> |
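<P>
As noted above, SLURM version 2.1 and later can run the backfill scheduler
plugin together with gang scheduling. A minimal configuration sketch that
enables both (values are illustrative):
</P>
<PRE>
# slurm.conf excerpt (illustrative)
SchedulerType=sched/backfill
PreemptMode=GANG
SchedulerTimeSlice=30
</PRE>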
| |
| <H2>Consumable Resource Examples</H2> |
| |
| <P> |
| The following two examples illustrate the primary difference between |
| <I>CR_CPU</I> and <I>CR_Core</I> when consumable resource selection is enabled |
| (<I>select/cons_res</I>). |
| </P> |
| <P> |
| When <I>CR_CPU</I> (or <I>CR_CPU_Memory</I>) is configured then the selector |
| treats the CPUs as simple, <I>interchangeable</I> computing resources |
<U>unless</U> task affinity is enabled. However, when task affinity is enabled
with <I>CR_CPU</I>, or when <I>CR_Core</I> (or <I>CR_Core_Memory</I>) is
configured, the selector treats the CPUs as individual resources that are
<U>specifically</U> allocated to jobs.
| This subtle difference is highlighted when timeslicing is enabled. |
| </P> |
| <P> |
| In both examples 6 jobs are submitted. Each job requests 2 CPUs per node, and |
| all of the nodes contain two quad-core processors. The timeslicer will |
| initially let the first 4 jobs run and suspend the last 2 jobs. |
The manner in which these jobs are timesliced depends upon the configured
<I>SelectTypeParameters</I>.
| </P> |
| <P> |
| In the first example <I>CR_Core_Memory</I> is configured. Note that jobs 46 |
| and 47 don't <U>ever</U> get suspended. This is because they are not sharing |
| their cores with any other job. |
| Jobs 48 and 49 were allocated to the same cores as jobs 44 and 45. |
| The timeslicer recognizes this and timeslices only those jobs: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 5 idle n[12-16] |
| |
| [user@n16 load]$ <B>scontrol show config | grep Select</B> |
| SelectType = select/cons_res |
| SelectTypeParameters = CR_CORE_MEMORY |
| |
| [user@n16 load]$ <B>sinfo -o "%20N %5D %5c %5z"</B> |
| NODELIST NODES CPUS S:C:T |
| n[12-16] 5 8 2:4:1 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 44 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 45 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 46 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 47 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 48 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 49 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 44 active myload user R 0:09 5 n[12-16] |
| 45 active myload user R 0:08 5 n[12-16] |
| 46 active myload user R 0:08 5 n[12-16] |
| 47 active myload user R 0:07 5 n[12-16] |
| 48 active myload user S 0:00 5 n[12-16] |
| 49 active myload user S 0:00 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 46 active myload user R 0:49 5 n[12-16] |
| 47 active myload user R 0:48 5 n[12-16] |
| 48 active myload user R 0:06 5 n[12-16] |
| 49 active myload user R 0:06 5 n[12-16] |
| 44 active myload user S 0:44 5 n[12-16] |
| 45 active myload user S 0:43 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 44 active myload user R 1:23 5 n[12-16] |
| 45 active myload user R 1:22 5 n[12-16] |
| 46 active myload user R 2:22 5 n[12-16] |
| 47 active myload user R 2:21 5 n[12-16] |
| 48 active myload user S 1:00 5 n[12-16] |
| 49 active myload user S 1:00 5 n[12-16] |
| </PRE> |
| <P> |
| Note the runtime of all 6 jobs in the output of the last <I>squeue</I> command. |
| Jobs 46 and 47 have been running continuously, while jobs 44 and 45 are |
| splitting their runtime with jobs 48 and 49. |
| </P> |
| <P> |
| The next example has <I>CR_CPU_Memory</I> configured and the same 6 jobs are |
| submitted. Here the selector and the timeslicer treat the CPUs as countable |
| resources which results in all 6 jobs sharing time on the CPUs: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 5 idle n[12-16] |
| |
| [user@n16 load]$ <B>scontrol show config | grep Select</B> |
| SelectType = select/cons_res |
| SelectTypeParameters = CR_CPU_MEMORY |
| |
| [user@n16 load]$ <B>sinfo -o "%20N %5D %5c %5z"</B> |
| NODELIST NODES CPUS S:C:T |
| n[12-16] 5 8 2:4:1 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 51 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 52 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 53 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 54 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 55 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 56 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 51 active myload user R 0:11 5 n[12-16] |
| 52 active myload user R 0:11 5 n[12-16] |
| 53 active myload user R 0:10 5 n[12-16] |
| 54 active myload user R 0:09 5 n[12-16] |
| 55 active myload user S 0:00 5 n[12-16] |
| 56 active myload user S 0:00 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 51 active myload user R 1:09 5 n[12-16] |
| 52 active myload user R 1:09 5 n[12-16] |
| 55 active myload user R 0:23 5 n[12-16] |
| 56 active myload user R 0:23 5 n[12-16] |
| 53 active myload user S 0:45 5 n[12-16] |
| 54 active myload user S 0:44 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 53 active myload user R 0:55 5 n[12-16] |
| 54 active myload user R 0:54 5 n[12-16] |
| 55 active myload user R 0:40 5 n[12-16] |
| 56 active myload user R 0:40 5 n[12-16] |
| 51 active myload user S 1:16 5 n[12-16] |
| 52 active myload user S 1:16 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 51 active myload user R 3:18 5 n[12-16] |
| 52 active myload user R 3:18 5 n[12-16] |
| 53 active myload user R 3:17 5 n[12-16] |
| 54 active myload user R 3:16 5 n[12-16] |
| 55 active myload user S 3:00 5 n[12-16] |
| 56 active myload user S 3:00 5 n[12-16] |
| </PRE> |
| <P> |
| Note that the runtime of all 6 jobs is roughly equal. Jobs 51-54 ran first so |
| they're slightly ahead, but so far all jobs have run for at least 3 minutes. |
| </P> |
| <P> |
| At the core level this means that SLURM relies on the Linux kernel to move |
| jobs around on the cores to maximize performance. |
This differs from the <I>CR_Core_Memory</I> case, where the jobs
effectively remain "pinned" to their specific cores for the duration of
the job.
| Note that <I>CR_Core_Memory</I> supports CPU binding, while |
| <I>CR_CPU_Memory</I> does not. |
| </P> |
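<P>
If that kind of core pinning is desired, one possible combination
(illustrative; check the documentation for your SLURM version) is
<I>CR_Core_Memory</I> together with the task affinity plugin:
</P>
<PRE>
# slurm.conf excerpt (illustrative)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
TaskPlugin=task/affinity
</PRE>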
| |
| <H2>Future Work</H2> |
| |
| <P> |
<B>Making use of swap space</B>: (note that this topic is not currently
scheduled for development unless someone would like to pursue it) Timeslicing
provides an interesting mechanism for high performance jobs to make use of
swap space.
The optimal scenario is one in which suspended jobs are "swapped out" and
active jobs are "swapped in".
The swapping activity would only occur once every <I>SchedulerTimeSlice</I>
interval.
| </P> |
| <P> |
| However, SLURM should first be modified to include support for scheduling jobs |
| into swap space and to provide controls to prevent overcommitting swap space. |
| For now this idea could be experimented with by disabling memory support in |
| the selector and submitting appropriately sized jobs. |
| </P> |
| |
| <p style="text-align:center;">Last modified 24 June 2011</p> |
| |
| <!--#include virtual="footer.txt"--> |