| <!--#include virtual="header.txt"--> |
| |
| <H1>Gang Scheduling</H1> |
| |
| <P> |
SLURM versions 1.2 and earlier supported only the dedication of resources to jobs.
Beginning in SLURM version 1.3, timesliced gang scheduling is supported.
In timesliced gang scheduling, two or more jobs are allocated to the same
resources and are alternately suspended so that one job at a time has
dedicated access to the resources for a configured period of time.
| </P> |
| <P> |
Preemptive priority job scheduling is another form of gang scheduling that is
| supported by SLURM. See the <a href="preempt.html">Preemption</a> document for |
| more information. |
| </P> |
| <P> |
A resource manager that supports timeslicing can improve its responsiveness
and utilization by allowing more jobs to begin running sooner. Shorter-running
jobs no longer have to wait in a queue behind longer-running jobs. Instead they
can be run "in parallel" with the longer-running jobs, which allows them
to finish more quickly. Throughput is also improved because overcommitting the
| resources provides opportunities for "local backfilling" to occur (see example |
| below). |
| </P> |
| <P> |
In SLURM 1.3.0 the <I>sched/gang</I> plugin provides timeslicing. When enabled,
it monitors each of the partitions in SLURM. If a new job has been allocated to
resources in a partition that have already been allocated to an existing job,
then the plugin will suspend the new job until the configured
<I>SchedulerTimeSlice</I> interval has elapsed. Then it will suspend the
running job and let the new job make use of the resources for a
<I>SchedulerTimeSlice</I> interval. This will continue until one of the
jobs terminates.
| </P> |
| |
| <H2>Configuration</H2> |
| |
| <P> |
| There are several important configuration parameters relating to |
| gang scheduling: |
| </P> |
| <UL> |
| <LI> |
| <B>SelectType</B>: The SLURM <I>sched/gang</I> plugin supports nodes |
| allocated by the <I>select/linear</I> plugin and socket/core/CPU resources |
| allocated by the <I>select/cons_res</I> plugin. |
| </LI> |
| <LI> |
<B>SelectTypeParameters</B>: Since resources will be overcommitted with
jobs, the resource selection plugin should be configured to track the
amount of memory used by each job to ensure that memory page swapping does
not occur. When <I>select/linear</I> is chosen, we recommend setting
<I>SelectTypeParameters=CR_Memory</I>. When <I>select/cons_res</I> is
chosen, we recommend including memory as a resource (e.g.
<I>SelectTypeParameters=CR_Core_Memory</I>).
| </LI> |
| <LI> |
| <B>DefMemPerCPU</B>: Since job requests may not explicitly specify |
| a memory requirement, we also recommend configuring |
| <I>DefMemPerCPU</I> (default memory per allocated CPU) or |
| <I>DefMemPerNode</I> (default memory per allocated node). |
| It may also be desirable to configure |
| <I>MaxMemPerCPU</I> (maximum memory per allocated CPU) or |
| <I>MaxMemPerNode</I> (maximum memory per allocated node) in <I>slurm.conf</I>. |
| Users can use the <I>--mem</I> or <I>--mem-per-cpu</I> option |
| at job submission time to specify their memory requirements. |
| </LI> |
| <LI> |
| <B>JobAcctGatherType and JobAcctGatherFrequency</B>: |
| If you wish to enforce memory limits, accounting must be enabled |
| using the <I>JobAcctGatherType</I> and <I>JobAcctGatherFrequency</I> |
| parameters. If accounting is enabled and a job exceeds its configured |
| memory limits, it will be canceled in order to prevent it from |
adversely affecting other jobs sharing the same resources.
| </LI> |
| <LI> |
| <B>SchedulerType</B>: Configure the <I>sched/gang</I> plugin by setting |
| <I>SchedulerType=sched/gang</I> in <I>slurm.conf</I>. |
| </LI> |
| <LI> |
| <B>SchedulerTimeSlice</B>: The default timeslice interval is 30 seconds. |
| To change this duration, set <I>SchedulerTimeSlice</I> to the desired interval |
| (in seconds) in <I>slurm.conf</I>. For example, to set the timeslice interval |
| to one minute, set <I>SchedulerTimeSlice=60</I>. Short values can increase |
| the overhead of gang scheduling. |
| </LI> |
| <LI> |
<B>Shared</B>: Configure the partition's <I>Shared</I> setting to
<I>FORCE</I> for all partitions in which timeslicing is to take place.
The <I>FORCE</I> option now supports an additional parameter that controls
how many jobs can share a resource (<I>FORCE[:max_share]</I>). By default the
max_share value is 4. To allow up to 6 jobs from this partition to be
allocated to a common resource, set <I>Shared=FORCE:6</I>. To allow only 2 jobs
to timeslice on the same resources, set <I>Shared=FORCE:2</I>.
| </LI> |
| </UL> |
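<P>
Taken together, a minimal <I>slurm.conf</I> excerpt for gang scheduling with
<I>select/linear</I> might look like the following sketch. The node and
partition definitions, memory sizes, and accounting plugin choice are
illustrative placeholders, not required values:
</P>
<PRE>
# Illustrative slurm.conf excerpt for gang scheduling
SelectType=select/linear
SelectTypeParameters=CR_Memory      # track memory to avoid page swapping
DefMemPerCPU=512                    # default MB per allocated CPU
SchedulerType=sched/gang
SchedulerTimeSlice=30               # seconds each job runs per rotation
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30           # needed to enforce memory limits
NodeName=n[12-16] Procs=8 RealMemory=4096
PartitionName=active Nodes=n[12-16] Shared=FORCE:4 Default=YES
</PRE>
<P>
With memory tracking enabled, users can state their requirements explicitly at
submission time, for example <I>sbatch --mem-per-cpu=512 -N5 ./myload 300</I>.
</P>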
| <P> |
To enable gang scheduling after making the configuration changes described
above, restart SLURM if it is already running. Any change to the plugin
settings in SLURM requires a full restart of the daemons. If you only
change the partition <I>Shared</I> setting, that change can be applied with
<I>scontrol reconfig</I>.
| </P> |
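<P>
For example, to reduce <I>max_share</I> on the <I>active</I> partition used in
the examples below, one could edit the partition line in <I>slurm.conf</I> and
then apply the change without restarting the daemons:
</P>
<PRE>
# In slurm.conf, change the partition definition, e.g.:
#   PartitionName=active Nodes=n[12-16] Shared=FORCE:2
# then push the new setting to the running daemons:
[user@n16 load]$ <B>scontrol reconfig</B>
</PRE>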
| <P> |
| For an advanced topic discussion on the potential use of swap space, |
| see "Making use of swap space" in the "Future Work" section below. |
| </P> |
| |
| <H2>Timeslicer Design and Operation</H2> |
| |
| <P> |
| When enabled, the <I>sched/gang</I> plugin keeps track of the resources |
| allocated to all jobs. For each partition an "active bitmap" is maintained that |
| tracks all concurrently running jobs in the SLURM cluster. Each time a new |
| job is allocated to resources in a partition, the <I>sched/gang</I> plugin |
| compares these newly allocated resources with the resources already maintained |
| in the "active bitmap". If these two sets of resources are disjoint then the new |
| job is added to the "active bitmap". If these two sets of resources overlap then |
| the new job is suspended. All jobs are tracked in a per-partition job queue |
| within the <I>sched/gang</I> plugin. |
| </P> |
| <P> |
| A separate <I>timeslicer thread</I> is spawned by the <I>sched/gang</I> plugin |
| on startup. This thread sleeps for the configured <I>SchedulerTimeSlice</I> |
| interval. When it wakes up, it checks each partition for suspended jobs. If |
| suspended jobs are found then the <I>timeslicer thread</I> moves all running |
| jobs to the end of the job queue. It then reconstructs the "active bitmap" for |
| this partition beginning with the suspended job that has waited the longest to |
run (this will be the first suspended job in the job queue). Each following job
| is then compared with the new "active bitmap", and if the job can be run |
| concurrently with the other "active" jobs then the job is added. Once this is |
| complete then the <I>timeslicer thread</I> suspends any currently running jobs |
| that are no longer part of the "active bitmap", and resumes jobs that are new to |
| the "active bitmap". |
| </P> |
| <P> |
| This <I>timeslicer thread</I> algorithm for rotating jobs is designed to prevent |
| jobs from starving (remaining in the suspended state indefinitely) and to be as |
| fair as possible in the distribution of runtime while still keeping all of the |
| resources as busy as possible. |
| </P> |
| <P> |
| The <I>sched/gang</I> plugin suspends jobs via the same internal functions that |
| support <I>scontrol suspend</I> and <I>scontrol resume</I>. A good way to |
| observe the operation of the timeslicer is by running <I>watch squeue</I> in a |
| terminal window. |
| </P> |
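<P>
For example, with the default 30-second timeslice, a short refresh interval
makes each rotation easy to follow:
</P>
<PRE>
[user@n16 load]$ <B>watch -n 5 squeue</B>
</PRE>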
| |
| <H2>A Simple Example</H2> |
| |
| <P> |
| The following example is configured with <I>select/linear</I>, |
| <I>sched/gang</I>, and <I>Shared=FORCE</I>. This example takes place on a small |
| cluster of 5 nodes: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 5 idle n[12-16] |
| </PRE> |
| <P> |
| Here are the Scheduler settings (the last two settings are the relevant ones): |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>scontrol show config | grep Sched</B> |
| FastSchedule = 1 |
| SchedulerPort = 7321 |
| SchedulerRootFilter = 1 |
| SchedulerTimeSlice = 30 |
| SchedulerType = sched/gang |
| </PRE> |
<P>
The <I>myload</I> script launches a simple load-generating app that runs
for the given number of seconds.
</P>
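<P>
The script itself is not shown here; a hypothetical <I>myload</I> could be as
simple as a busy-loop launched by <I>srun</I> on every allocated node (a
sketch, not the actual script used to produce the output below):
</P>
<PRE>
#!/bin/sh
# myload - hypothetical sketch of a load generator: spin on
# every allocated node for the number of seconds given in $1.
SECS="$1"
srun sh -c "end=\$((\$(date +%s)+$SECS)); while [ \$(date +%s) -lt \$end ]; do :; done"
</PRE>
<P>
Submit <I>myload</I> to run on all nodes:
</P>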
| <PRE> |
| [user@n16 load]$ <B>sbatch -N5 ./myload 300</B> |
| sbatch: Submitted batch job 3 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
3 active myload user R 0:05 5 n[12-16]
| </PRE> |
| <P> |
| Submit it again and watch the <I>sched/gang</I> plugin suspend it: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sbatch -N5 ./myload 300</B> |
| sbatch: Submitted batch job 4 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 3 active myload user R 0:13 5 n[12-16] |
| 4 active myload user S 0:00 5 n[12-16] |
| </PRE> |
| <P> |
| After 30 seconds the <I>sched/gang</I> plugin swaps jobs, and now job 4 is the |
| active one: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 4 active myload user R 0:08 5 n[12-16] |
| 3 active myload user S 0:41 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 4 active myload user R 0:21 5 n[12-16] |
| 3 active myload user S 0:41 5 n[12-16] |
| </PRE> |
| <P> |
| After another 30 seconds the <I>sched/gang</I> plugin sets job 3 running again: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 3 active myload user R 0:50 5 n[12-16] |
| 4 active myload user S 0:30 5 n[12-16] |
| </PRE> |
| |
| <P> |
| <B>A possible side effect of timeslicing</B>: Note that jobs that are |
| immediately suspended may cause their srun commands to produce the following |
| output: |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>cat slurm-4.out</B> |
| srun: Job step creation temporarily disabled, retrying |
| srun: Job step creation still disabled, retrying |
| srun: Job step creation still disabled, retrying |
| srun: Job step creation still disabled, retrying |
| srun: Job step created |
| </PRE> |
| <P> |
This occurs because <I>srun</I> is attempting to launch a job step in an
allocation that has been suspended. The <I>srun</I> process will continue in a
retry loop to launch the job step until the allocation has been resumed and the
job step can be launched.
| </P> |
| <P> |
| When the <I>sched/gang</I> plugin is enabled, this type of output in the user |
| jobs should be considered benign. |
| </P> |
| |
| <H2>More examples</H2> |
| |
| <P> |
The following example shows how the timeslicer algorithm keeps the resources
busy. Job 10 runs continuously, while jobs 9 and 11 are timesliced:
| </P> |
| |
| <PRE> |
| [user@n16 load]$ <B>sbatch -N3 ./myload 300</B> |
| sbatch: Submitted batch job 9 |
| |
| [user@n16 load]$ <B>sbatch -N2 ./myload 300</B> |
| sbatch: Submitted batch job 10 |
| |
| [user@n16 load]$ <B>sbatch -N3 ./myload 300</B> |
| sbatch: Submitted batch job 11 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 9 active myload user R 0:11 3 n[12-14] |
| 10 active myload user R 0:08 2 n[15-16] |
| 11 active myload user S 0:00 3 n[12-14] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 10 active myload user R 0:50 2 n[15-16] |
| 11 active myload user R 0:12 3 n[12-14] |
| 9 active myload user S 0:41 3 n[12-14] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 10 active myload user R 1:04 2 n[15-16] |
| 11 active myload user R 0:26 3 n[12-14] |
| 9 active myload user S 0:41 3 n[12-14] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 9 active myload user R 0:46 3 n[12-14] |
| 10 active myload user R 1:13 2 n[15-16] |
| 11 active myload user S 0:30 3 n[12-14] |
| </PRE> |
| <P> |
| The next example displays "local backfilling": |
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sbatch -N3 ./myload 300</B> |
| sbatch: Submitted batch job 12 |
| |
| [user@n16 load]$ <B>sbatch -N5 ./myload 300</B> |
| sbatch: Submitted batch job 13 |
| |
| [user@n16 load]$ <B>sbatch -N2 ./myload 300</B> |
| sbatch: Submitted batch job 14 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 12 active myload user R 0:14 3 n[12-14] |
| 14 active myload user R 0:06 2 n[15-16] |
| 13 active myload user S 0:00 5 n[12-16] |
| </PRE> |
| <P> |
Without timeslicing and without the backfill scheduler enabled, job 14 would
have to wait for job 13 to finish.
| </P> |
| <P> |
| This is called "local" backfilling because the backfilling only occurs with jobs |
| close enough in the queue to get allocated by the scheduler as part of |
| oversubscribing the resources. Recall that the number of jobs that can |
| overcommit a resource is controlled by the <I>Shared=FORCE:max_share</I> value, |
| so this value effectively controls the scope of "local backfilling". |
| </P> |
| <P> |
| Normal backfill algorithms check <U>all</U> jobs in the wait queue. |
| </P> |
| |
| <H2>Consumable Resource Examples</H2> |
| |
| <P> |
| The following two examples illustrate the primary difference between |
| <I>CR_CPU</I> and <I>CR_Core</I> when consumable resource selection is enabled |
| (<I>select/cons_res</I>). |
| </P> |
| <P> |
| When <I>CR_CPU</I> (or <I>CR_CPU_Memory</I>) is configured then the selector |
| treats the CPUs as simple, <I>interchangeable</I> computing resources. However |
| when <I>CR_Core</I> (or <I>CR_Core_Memory</I>) is enabled the selector treats |
| the CPUs as individual resources that are <U>specifically</U> allocated to jobs. |
| This subtle difference is highlighted when timeslicing is enabled. |
| </P> |
| <P> |
| In both examples 6 jobs are submitted. Each job requests 2 CPUs per node, and |
| all of the nodes contain two quad-core processors. The timeslicer will initially |
| let the first 4 jobs run and suspend the last 2 jobs. The manner in which these |
jobs are timesliced depends upon the configured <I>SelectTypeParameters</I>.
| </P> |
| <P> |
In the first example <I>CR_Core_Memory</I> is configured. Note that jobs 46 and
47 don't <U>ever</U> get suspended. This is because they are not sharing their
cores with any other job. Jobs 48 and 49 were allocated to the same cores as
jobs 44 and 45. The timeslicer recognizes this and timeslices only those jobs:
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 5 idle n[12-16] |
| |
| [user@n16 load]$ <B>scontrol show config | grep Select</B> |
| SelectType = select/cons_res |
| SelectTypeParameters = CR_CORE_MEMORY |
| |
| [user@n16 load]$ <B>sinfo -o "%20N %5D %5c %5z"</B> |
| NODELIST NODES CPUS S:C:T |
| n[12-16] 5 8 2:4:1 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 44 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 45 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 46 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 47 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 48 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 49 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 44 active myload user R 0:09 5 n[12-16] |
| 45 active myload user R 0:08 5 n[12-16] |
| 46 active myload user R 0:08 5 n[12-16] |
| 47 active myload user R 0:07 5 n[12-16] |
| 48 active myload user S 0:00 5 n[12-16] |
| 49 active myload user S 0:00 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 46 active myload user R 0:49 5 n[12-16] |
| 47 active myload user R 0:48 5 n[12-16] |
| 48 active myload user R 0:06 5 n[12-16] |
| 49 active myload user R 0:06 5 n[12-16] |
| 44 active myload user S 0:44 5 n[12-16] |
| 45 active myload user S 0:43 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 44 active myload user R 1:23 5 n[12-16] |
| 45 active myload user R 1:22 5 n[12-16] |
| 46 active myload user R 2:22 5 n[12-16] |
| 47 active myload user R 2:21 5 n[12-16] |
| 48 active myload user S 1:00 5 n[12-16] |
| 49 active myload user S 1:00 5 n[12-16] |
| </PRE> |
| <P> |
| Note the runtime of all 6 jobs in the output of the last <I>squeue</I> command. |
Jobs 46 and 47 have been running continuously, while jobs 44 and 45 are
splitting their runtime with jobs 48 and 49.
| </P> |
| <P> |
| The next example has <I>CR_CPU_Memory</I> configured and the same 6 jobs are |
submitted. Here the selector and the timeslicer treat the CPUs as countable
resources, which results in all 6 jobs sharing time on the CPUs:
| </P> |
| <PRE> |
| [user@n16 load]$ <B>sinfo</B> |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST |
| active* up infinite 5 idle n[12-16] |
| |
| [user@n16 load]$ <B>scontrol show config | grep Select</B> |
| SelectType = select/cons_res |
| SelectTypeParameters = CR_CPU_MEMORY |
| |
| [user@n16 load]$ <B>sinfo -o "%20N %5D %5c %5z"</B> |
| NODELIST NODES CPUS S:C:T |
| n[12-16] 5 8 2:4:1 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 51 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 52 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 53 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 54 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 55 |
| |
| [user@n16 load]$ <B>sbatch -n10 -N5 ./myload 300</B> |
| sbatch: Submitted batch job 56 |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 51 active myload user R 0:11 5 n[12-16] |
| 52 active myload user R 0:11 5 n[12-16] |
| 53 active myload user R 0:10 5 n[12-16] |
| 54 active myload user R 0:09 5 n[12-16] |
| 55 active myload user S 0:00 5 n[12-16] |
| 56 active myload user S 0:00 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 51 active myload user R 1:09 5 n[12-16] |
| 52 active myload user R 1:09 5 n[12-16] |
| 55 active myload user R 0:23 5 n[12-16] |
| 56 active myload user R 0:23 5 n[12-16] |
| 53 active myload user S 0:45 5 n[12-16] |
| 54 active myload user S 0:44 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 53 active myload user R 0:55 5 n[12-16] |
| 54 active myload user R 0:54 5 n[12-16] |
| 55 active myload user R 0:40 5 n[12-16] |
| 56 active myload user R 0:40 5 n[12-16] |
| 51 active myload user S 1:16 5 n[12-16] |
| 52 active myload user S 1:16 5 n[12-16] |
| |
| [user@n16 load]$ <B>squeue</B> |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 51 active myload user R 3:18 5 n[12-16] |
| 52 active myload user R 3:18 5 n[12-16] |
| 53 active myload user R 3:17 5 n[12-16] |
| 54 active myload user R 3:16 5 n[12-16] |
| 55 active myload user S 3:00 5 n[12-16] |
| 56 active myload user S 3:00 5 n[12-16] |
| </PRE> |
| <P> |
| Note that the runtime of all 6 jobs is roughly equal. Jobs 51-54 ran first so |
| they're slightly ahead, but so far all jobs have run for at least 3 minutes. |
| </P> |
| <P> |
At the core level this means that SLURM relies on the Linux kernel to move jobs
around on the cores to maximize performance. This differs from the
<I>CR_Core_Memory</I> configuration, where jobs effectively remain
"pinned" to their specific cores for the duration of the job. Note that
<I>CR_Core_Memory</I> supports CPU binding, while <I>CR_CPU_Memory</I> does not.
| </P> |
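<P>
For example, under <I>CR_Core_Memory</I> a job's tasks can be bound to their
allocated cores at launch time (a hypothetical invocation, and one that
assumes a task affinity plugin is configured):
</P>
<PRE>
[user@n16 load]$ <B>srun --cpu_bind=cores -n10 ./a.out</B>
</PRE>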
| |
| <H2>Future Work</H2> |
| |
| <P> |
<B>Making use of swap space</B>: (this topic is not currently scheduled
for development, unless someone would like to pursue it) It should
be noted that timeslicing does provide an interesting mechanism for high
performance jobs to make use of swap space. The optimal scenario is one in which
suspended jobs are "swapped out" and active jobs are "swapped in". The swapping
activity would only occur once every <I>SchedulerTimeSlice</I> interval.
| </P> |
| <P> |
| However, SLURM should first be modified to include support for scheduling jobs |
| into swap space and to provide controls to prevent overcommitting swap space. |
| For now this idea could be experimented with by disabling memory support in the |
| selector and submitting appropriately sized jobs. |
| </P> |
| |
| <p style="text-align:center;">Last modified 5 December 2008</p> |
| |
| <!--#include virtual="footer.txt"--> |