| <!--#include virtual="header.txt"--> |
| |
| <h1><a name="top">Consumable Resources in Slurm</a></h1> |
| |
<p>Slurm, using the default node allocation plugin, allocates nodes to jobs in
exclusive mode. This means that even if a given job does not utilize all the
resources within a node, no other job will have access to the unused resources.
Nodes possess resources such as processors, memory, swap, local
disk, etc., and jobs consume these resources. Slurm's default exclusive-use
policy can result in inefficient utilization of the cluster and of its nodes'
resources.
Slurm's <i>cons_tres</i> plugin is available to
manage resources on a much more fine-grained basis, as described below.</p>
| |
| <h2 id="using_cons_tres"> |
| Using the Consumable Trackable Resource Plugin: <b>select/cons_tres</b> |
| <a class="slurm_link" href="#using_cons_tres"></a> |
| </h2> |
| |
<p>The Consumable Trackable Resources (<b>cons_tres</b>) plugin has been built
to work with several resources. It can track a board, socket, core, or CPU as
the consumable resource, as well as any combination of these logical processors
with memory:</p>
| <ul> |
<li><b>CPU</b> (<i>CR_CPU</i>): CPU as a consumable resource.
  <ul>
  <li>No notion of sockets, cores, or threads.</li>
  <li>On a multi-core system CPUs will be cores.</li>
  <li>On a multi-core/hyperthread system CPUs will be threads.</li>
  <li>On a single-core system CPUs are CPUs.</li>
  </ul>
</li>
| <li><b>Board</b> (<i>CR_Board</i>): Baseboard as a consumable resource.</li> |
| <li><b>Socket</b> (<i>CR_Socket</i>): Socket as a consumable resource.</li> |
| <li><b>Core</b> (<i>CR_Core</i>): Core as a consumable resource.</li> |
| <li><b>Socket and Memory</b> (<i>CR_Socket_Memory</i>): Socket |
| and Memory as consumable resources.</li> |
| <li><b>Core and Memory</b> (<i>CR_Core_Memory</i>): Core and |
| Memory as consumable resources.</li> |
<li><b>CPU and Memory</b> (<i>CR_CPU_Memory</i>): CPU and Memory
as consumable resources.</li>
| </ul> |
| |
| <p>All CR_* parameters assume <b>OverSubscribe=No</b> or |
| <b>OverSubscribe=Force</b>.</p> |
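<p>For reference, the default behavior corresponds to a partition definition
such as the following sketch (the partition name and node list are only
illustrative):</p>
<pre>
# Hypothetical excerpt from slurm.conf
PartitionName=batch Nodes=node[01-04] OverSubscribe=NO Default=YES State=UP
</pre>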
| |
| <p>The cons_tres plugin also provides functionality specifically |
| related to GPUs.</p> |
| |
| <p>Additional parameters available for the <b>cons_tres</b> plugin:</p> |
| <ul> |
| <li><b>DefCpuPerGPU</b>: Default number of CPUs allocated per GPU.</li> |
| <li><b>DefMemPerGPU</b>: Default amount of memory allocated per GPU.</li> |
| </ul> |
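<p>As a sketch, these defaults could be set in slurm.conf as shown below (the
values of 2 CPUs and 4096 MB per GPU are arbitrary examples):</p>
<pre>
# Hypothetical excerpt from slurm.conf
SelectType=select/cons_tres
DefCpuPerGPU=2
DefMemPerGPU=4096
</pre>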
| <p>Additional job submit options available for the <b>cons_tres</b> plugin:</p> |
| <ul> |
| <li><b>--cpus-per-gpu=</b>: Number of CPUs for every GPU.</li> |
| <li><b>--gpus=</b>: Count of GPUs for entire job allocation.</li> |
| <li><b>--gpu-bind=</b>: Bind task to specific GPU(s).</li> |
| <li><b>--gpu-freq=</b>: Request specific GPU/memory frequencies.</li> |
| <li><b>--gpus-per-node=</b>: Number of GPUs per node.</li> |
| <li><b>--gpus-per-socket=</b>: Number of GPUs per socket.</li> |
| <li><b>--gpus-per-task=</b>: Number of GPUs per task.</li> |
| <li><b>--mem-per-gpu=</b>: Amount of memory for each GPU.</li> |
| </ul> |
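<p>For illustration, a GPU job might combine several of these options as in the
sketch below (the script and application names are hypothetical, and a gres/gpu
configuration on the nodes is assumed):</p>
<pre>
# sbatch --gpus=4 --cpus-per-gpu=2 --mem-per-gpu=8G batch_script.sh   <-- 4 GPUs total, 2 CPUs and 8 GB per GPU
# srun -N 2 --ntasks-per-node=1 --gpus-per-node=1 --gpu-bind=closest ./gpu_app   <-- 1 GPU per node, tasks bound to the closest GPU
</pre>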
| |
<p>srun's <i>-B</i> option for specifying sockets, cores, and threads is
ignored by the node allocation mechanism when CR_CPU or
CR_CPU_Memory is selected. It is used to compute the total
number of tasks when <i>-n</i> is not specified.</p>
| |
| <p>In the cases where Memory is a consumable resource, the <b>RealMemory</b> |
| parameter must be set in the slurm.conf to define a node's amount of real |
| memory.</p> |
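<p>For example, the hydra nodes used in the examples below could be described
with a node definition such as this sketch (additional parameters may be needed
on a real system; RealMemory is in megabytes):</p>
<pre>
# Hypothetical excerpt from slurm.conf
NodeName=hydra[12-16] Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=2007
</pre>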
| |
<p>The job submission commands (salloc, sbatch and srun) support the options
<i>--mem=MB</i> and <i>--mem-per-cpu=MB</i>, permitting users to specify
the maximum amount of real memory required per node or per allocated CPU.
One of these options is required in environments where Memory is a consumable
resource. It is important to specify enough memory since Slurm will not allow
the application to use more than the requested amount of real memory. The
default value for --mem is inherited from <b>DefMemPerNode</b>. See
<a href="srun.html#OPT_mem">srun</a>(1) for more details.</p>
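<p>For instance, either of the following requests would work in a
memory-as-consumable-resource configuration (the application name is
hypothetical):</p>
<pre>
# srun -N 2 --mem=2000 ./my_app          <-- 2000 MB of real memory per node
# srun -n 8 --mem-per-cpu=500 ./my_app   <-- 500 MB of real memory per allocated CPU
</pre>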
| |
<p>Using <i>--overcommit</i> or <i>-O</i> is allowed. When
process-to-logical-processor pinning is enabled by an appropriate TaskPlugin
configuration parameter, the extra processes will time share the allocated
resources.</p>
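<p>For example, with <i>TaskPlugin=task/affinity</i> set in slurm.conf, an
overcommitted launch such as the sketch below would let 8 tasks time share the
4 CPUs of a single node (the application name is hypothetical):</p>
<pre>
# srun -N 1 -n 8 --overcommit ./my_app
</pre>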
| |
| <p>The Consumable Trackable Resource plugin is enabled via the SelectType |
| parameter in the slurm.conf.</p> |
| <pre> |
| # Excerpt from sample slurm.conf file |
| SelectType=select/cons_tres |
| </pre> |
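<p>The resources to be tracked are then chosen with <b>SelectTypeParameters</b>.
A minimal sketch using cores and memory (any of the CR_* values described above
could be substituted):</p>
<pre>
# Excerpt from sample slurm.conf file
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
</pre>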
| |
| <h2 id="general">General Comments<a class="slurm_link" href="#general"></a></h2> |
| |
<p>Slurm's default <b>select/linear</b> plugin uses a best-fit algorithm based
on the number of consecutive nodes.</p>
| |
| <p>The <b>select/cons_tres</b> plugin is enabled or disabled cluster-wide.</p> |
| |
<p>When <b>select/linear</b> is enabled, the normal Slurm
behaviors are not disrupted. The major change users see when using the
<b>select/cons_tres</b> plugin is that jobs can be
co-scheduled on nodes when resources permit it. Generic resources (such as GPUs)
can also be tracked individually with this plugin.
The rest of Slurm, such as srun and its options (except srun -s ...), is not
affected by this plugin. From the user's point of view, Slurm works the
same way as with the default node selection scheme.</p>
| |
<p>The <i>--exclusive</i> srun option allows users to request nodes in
exclusive mode even when consumable resources are enabled. See
<a href="srun.html#OPT_exclusive">srun</a>(1) for details.</p>
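<p>For example (the application name is hypothetical):</p>
<pre>
# srun -N 2 --exclusive ./my_app   <-- both nodes are allocated as whole nodes, even if CPUs remain unused
</pre>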
| |
| <p>srun's <i>-s</i> or <i>--oversubscribe</i> is incompatible with the consumable |
| resource environment and will therefore not be honored. Since this |
| environment's nodes are shared by default, <i>--exclusive</i> allows users to |
| obtain dedicated nodes.</p> |
| |
<p>The <i>--oversubscribe</i> and <i>--exclusive</i> options are mutually
exclusive when used at job submission. If both options are set when submitting
a job, the job submission command used will exit with a fatal error.</p>
| |
| |
<h2 id="example_mem">Examples of CR_Socket_Memory and CR_CPU_Memory
type consumable resources
| <a class="slurm_link" href="#example_mem"></a> |
| </h2> |
| |
| <pre> |
| # sinfo -lNe |
| NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY |
| hydra[12-16] 5 allNodes* ... 4 2:2:1 2007 |
| </pre> |
| |
<p>Using the select/cons_tres plugin with CR_Socket_Memory (2 sockets/node)</p>
| <pre> |
| Example 1: |
| # srun -N 5 -n 5 --mem=1000 sleep 100 & <-- running |
| # srun -n 1 -w hydra12 --mem=2000 sleep 100 & <-- queued and waiting for resources |
| |
| # squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 1890 allNodes sleep sballe PD 0:00 1 (Resources) |
| 1889 allNodes sleep sballe R 0:08 5 hydra[12-16] |
| |
| Example 2: |
| # srun -N 5 -n 10 --mem=10 sleep 100 & <-- running |
# srun -n 1 --mem=10 sleep 100 & <-- queued and waiting for resources
| |
| # squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 1831 allNodes sleep sballe PD 0:00 1 (Resources) |
| 1830 allNodes sleep sballe R 0:07 5 hydra[12-16] |
| </pre> |
| |
<p>Using the select/cons_tres plugin with CR_CPU_Memory (4 CPUs/node)</p>
| <pre> |
| Example 1: |
| # srun -N 5 -n 5 --mem=1000 sleep 100 & <-- running |
| # srun -N 5 -n 5 --mem=10 sleep 100 & <-- running |
| # srun -N 5 -n 5 --mem=1000 sleep 100 & <-- queued and waiting for resources |
| |
| # squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 1835 allNodes sleep sballe PD 0:00 5 (Resources) |
| 1833 allNodes sleep sballe R 0:10 5 hydra[12-16] |
| 1834 allNodes sleep sballe R 0:07 5 hydra[12-16] |
| |
| Example 2: |
| # srun -N 5 -n 20 --mem=10 sleep 100 & <-- running |
| # srun -n 1 --mem=10 sleep 100 & <-- queued and waiting for resources |
| |
| # squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 1837 allNodes sleep sballe PD 0:00 1 (Resources) |
| 1836 allNodes sleep sballe R 0:11 5 hydra[12-16] |
| </pre> |
| |
| |
| <h2 id="example_node"> |
| Example of Node Allocations Using Consumable Resource Plugin |
| <a class="slurm_link" href="#example_node"></a> |
| </h2> |
| |
<p>The following example illustrates the different ways four jobs
are allocated across a cluster using (1) Slurm's default allocation method
(exclusive mode) and (2) a processor-as-consumable-resource
approach.</p>
| |
<p>It is important to understand that the example listed below is a
contrived example and is only given here to illustrate the use of CPUs as
consumable resources. Job 2 and Job 3 call for the node count to equal
the processor count. This would typically be done because
the one task per node requires all of the memory, disk space, etc.; the
bottleneck would not be the processor count.</p>
| |
<p>Trying to execute more than one job per node will almost certainly severely
impact a parallel job's performance.
The biggest beneficiaries of CPUs as consumable resources will be serial jobs or
jobs with modest parallelism, which can effectively share resources. On many
systems with larger processor counts, jobs typically run one fewer task than
there are processors to minimize interference from the kernel and daemons.</p>
| |
| <p>The example cluster is composed of 4 nodes (10 CPUs in total):</p> |
| |
| <ul> |
| <li>linux01 (with 2 processors), </li> |
| <li>linux02 (with 2 processors), </li> |
| <li>linux03 (with 2 processors), and</li> |
| <li>linux04 (with 4 processors). </li> |
| </ul> |
| |
| <p>The four jobs are the following:</p> |
| |
| <ul> |
| <li>[2] srun -n 4 -N 4 sleep 120 &</li> |
| <li>[3] srun -n 3 -N 3 sleep 120 &</li> |
| <li>[4] srun -n 1 sleep 120 &</li> |
| <li>[5] srun -n 3 sleep 120 &</li> |
| </ul> |
| |
| <p>The user launches them in the same order as listed above.</p> |
| |
| |
| <h2 id="using_default">Using Slurm's Default Node Allocation (Non-shared Mode) |
| <a class="slurm_link" href="#using_default"></a> |
| </h2> |
| |
| <p>The four jobs have been launched and 3 of the jobs are now |
| pending, waiting to get resources allocated to them. Only Job 2 is running |
| since it uses one CPU on all 4 nodes. This means that linux01 to linux03 each |
| have one idle CPU and linux04 has 3 idle CPUs.</p> |
| |
| <pre> |
| # squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 3 lsf sleep root PD 0:00 3 (Resources) |
| 4 lsf sleep root PD 0:00 1 (Resources) |
| 5 lsf sleep root PD 0:00 1 (Resources) |
| 2 lsf sleep root R 0:14 4 linux[01-04] |
| </pre> |
| |
| <p>Once Job 2 is finished, Job 3 is scheduled and runs on |
| linux01, linux02, and linux03. Job 3 is only using one CPU on each of the 3 |
| nodes. Job 4 can be allocated onto the remaining idle node (linux04) so Job 3 |
| and Job 4 can run concurrently on the cluster.</p> |
| |
| <p>Job 5 has to wait for idle nodes to be able to run.</p> |
| |
| <pre> |
| # squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 5 lsf sleep root PD 0:00 1 (Resources) |
| 3 lsf sleep root R 0:11 3 linux[01-03] |
| 4 lsf sleep root R 0:11 1 linux04 |
| </pre> |
| |
| <p>Once Job 3 finishes, Job 5 is allocated resources and can run.</p> |
| |
<p>The advantage of the exclusive mode scheduling policy is
that a job gets all the resources of the assigned nodes for optimal
parallel performance. The drawback is
that jobs can tie up large amounts of resources that they do not use and which
cannot be shared with other jobs.</p>
| |
| |
| <h2 id="using_proc">Using a Processor Consumable Resource Approach |
| <a class="slurm_link" href="#using_proc"></a> |
| </h2> |
| |
<p>We will run through the same scenario again using the <b>cons_tres</b>
plugin and CPUs as the consumable resource. The output of squeue shows that we
have 3 out of the 4 jobs allocated and running. This is an increase of two
running jobs over the default Slurm approach.</p>
| |
<p>Job 2 is running on nodes linux01
to linux04. Job 2's allocation is the same as with Slurm's default allocation:
it uses one CPU on each of the 4 nodes. Once Job 2 is scheduled
and running, nodes linux01, linux02 and linux03 still have one idle CPU each
and node linux04 has 3 idle CPUs. The main difference between this approach and
the exclusive mode approach described above is that idle CPUs within a node
can now be assigned to other jobs.</p>
| |
<p>It is important to note that
<i>assigned</i> doesn't mean <i>oversubscribed</i>. The consumable resource approach
tracks how much of each available resource (in our case CPUs) must be dedicated
to a given job. This allows us to prevent per-node oversubscription of
resources (CPUs).</p>
| |
<p>Once Job 2 is running, Job 3 is
scheduled onto nodes linux01, linux02, and linux03 (using one CPU on each of the
nodes) and Job 4 is scheduled onto one of the remaining idle CPUs on linux04.</p>
| |
| <p>Job 2, Job 3, and Job 4 are now running concurrently on the cluster.</p> |
| |
| <pre> |
| # squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 5 lsf sleep root PD 0:00 1 (Resources) |
| 2 lsf sleep root R 0:13 4 linux[01-04] |
| 3 lsf sleep root R 0:09 3 linux[01-03] |
| 4 lsf sleep root R 0:05 1 linux04 |
| |
| # sinfo -lNe |
| NODELIST NODES PARTITION STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON |
| linux[01-03] 3 lsf* allocated 2 2981 1 1 (null) none |
| linux04 1 lsf* allocated 4 3813 1 1 (null) none |
| </pre> |
| |
| <p>Once Job 2 finishes, Job 5, which was pending, is allocated available resources and is then |
| running as illustrated below:</p> |
| |
| <pre> |
| # squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 3 lsf sleep root R 1:58 3 linux[01-03] |
| 4 lsf sleep root R 1:54 1 linux04 |
| 5 lsf sleep root R 0:02 3 linux[01-03] |
| # sinfo -lNe |
| NODELIST NODES PARTITION STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON |
| linux[01-03] 3 lsf* allocated 2 2981 1 1 (null) none |
| linux04 1 lsf* idle 4 3813 1 1 (null) none |
| </pre> |
| |
| <p>Job 3, Job 4, and Job 5 are now running concurrently on the cluster.</p> |
| |
| <pre> |
| # squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 5 lsf sleep root R 1:52 3 linux[01-03] |
| </pre> |
| |
| <p>Job 3 and Job 4 have finished and Job 5 is still running on nodes linux[01-03].</p> |
| |
<p>The advantage of the consumable resource scheduling policy
is that job throughput can increase dramatically. The overall
productivity of the cluster increases, reducing the
amount of time users have to wait for their jobs to complete and
improving the overall efficiency of cluster use. The drawback is
that users do not have entire nodes dedicated to their jobs by default.</p>
| |
<p>We have added the <i>--exclusive</i> option to srun (see
<a href="srun.html#OPT_exclusive">srun</a>(1) for more details),
which allows users to specify that they would like
their nodes to be allocated in exclusive mode.
This is to accommodate users who might have MPI/threaded/OpenMP
programs that will take advantage of all the CPUs on a node but only need
one MPI process per node.</p>
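<p>For example, such a hybrid code wanting whole nodes with one task per node
might be launched as in the sketch below (the application name is
hypothetical):</p>
<pre>
# srun -N 4 -n 4 --exclusive ./hybrid_app   <-- 4 whole nodes, one MPI task per node
</pre>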
| |
| |
| <p style="text-align:center;">Last modified 09 July 2025</p> |
| |
| <!--#include virtual="footer.txt"--> |