<!--#include virtual="header.txt"-->
<h1><a name="top">Consumable Resources in Slurm</a></h1>
<p>Slurm, using the default node allocation plugin, allocates nodes to jobs in
exclusive mode. This means that even when not all the resources within a node are
utilized by a given job, another job will not have access to those resources.
Nodes possess resources such as processors, memory, swap, local
disk, etc., and jobs consume these resources. The exclusive use default policy
in Slurm can result in inefficient utilization of the cluster and of its nodes'
resources.
Slurm's <i>cons_tres</i> plugin is available to
manage resources on a much more fine-grained basis as described below.</p>
<h2 id="using_cons_tres">
Using the Consumable Trackable Resource Plugin: <b>select/cons_tres</b>
<a class="slurm_link" href="#using_cons_tres"></a>
</h2>
<p>The Consumable Trackable Resources (<b>cons_tres</b>) plugin has been built
to work with several resources. It can track Boards, Sockets, Cores, CPUs, and Memory,
as well as combinations of these logical processors with Memory:</p>
<ul>
<li><b>CPU</b> (<i>CR_CPU</i>): CPU as a consumable resource.</li>
<ul>
<li>No notion of sockets, cores, or threads.</li>
<li>On a multi-core system CPUs will be cores.</li>
<li>On a multi-core/hyperthread system CPUs will be threads.</li>
<li>On a single-core system CPUs are CPUs.</li>
</ul>
<li><b>Board</b> (<i>CR_Board</i>): Baseboard as a consumable resource.</li>
<li><b>Socket</b> (<i>CR_Socket</i>): Socket as a consumable resource.</li>
<li><b>Core</b> (<i>CR_Core</i>): Core as a consumable resource.</li>
<li><b>Socket and Memory</b> (<i>CR_Socket_Memory</i>): Socket
and Memory as consumable resources.</li>
<li><b>Core and Memory</b> (<i>CR_Core_Memory</i>): Core and
Memory as consumable resources.</li>
<li><b>CPU and Memory</b> (<i>CR_CPU_Memory</i>): CPU and Memory
as consumable resources.</li>
</ul>
<p>All CR_* parameters assume <b>OverSubscribe=No</b> or
<b>OverSubscribe=Force</b>.</p>
<p>The cons_tres plugin also provides functionality specifically
related to GPUs.</p>
<p>Additional parameters available for the <b>cons_tres</b> plugin (a sample
configuration excerpt follows the list):</p>
<ul>
<li><b>DefCpuPerGPU</b>: Default number of CPUs allocated per GPU.</li>
<li><b>DefMemPerGPU</b>: Default amount of memory allocated per GPU.</li>
</ul>
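<p>For example, a slurm.conf excerpt setting these per-GPU defaults might look
like the following; the values are illustrative only, not recommendations:</p>
<pre>
# Illustrative per-GPU defaults (2 CPUs and 8192 MB of memory per requested GPU)
DefCpuPerGPU=2
DefMemPerGPU=8192
</pre>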
<p>Additional job submit options available for the <b>cons_tres</b> plugin (an
example command follows the list):</p>
<ul>
<li><b>--cpus-per-gpu=</b>: Number of CPUs for every GPU.</li>
<li><b>--gpus=</b>: Count of GPUs for entire job allocation.</li>
<li><b>--gpu-bind=</b>: Bind task to specific GPU(s).</li>
<li><b>--gpu-freq=</b>: Request specific GPU/memory frequencies.</li>
<li><b>--gpus-per-node=</b>: Number of GPUs per node.</li>
<li><b>--gpus-per-socket=</b>: Number of GPUs per socket.</li>
<li><b>--gpus-per-task=</b>: Number of GPUs per task.</li>
<li><b>--mem-per-gpu=</b>: Amount of memory for each GPU.</li>
</ul>
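<p>For example, a hypothetical job requesting two GPUs on each of two nodes,
with four CPUs and 16 GB of memory per GPU, could be submitted as shown below
(the application name and the values are placeholders):</p>
<pre>
# srun -N 2 --gpus-per-node=2 --cpus-per-gpu=4 --mem-per-gpu=16G ./my_gpu_app
</pre>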
<p>srun's <i>-B</i> extension for sockets, cores, and threads is
ignored within the node allocation mechanism when CR_CPU or
CR_CPU_Memory is selected. It is used to compute the total
number of tasks when <i>-n</i> is not specified.</p>
<p>In the cases where Memory is a consumable resource, the <b>RealMemory</b>
parameter must be set in the slurm.conf to define a node's amount of real
memory.</p>
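<p>For example, a node definition in slurm.conf with <b>RealMemory</b> set
might look like the following; the node names and sizes are illustrative:</p>
<pre>
# Illustrative node definition; RealMemory is specified in megabytes
NodeName=node[01-04] CPUs=16 RealMemory=64000 State=UNKNOWN
</pre>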
<p>The job submission commands (salloc, sbatch and srun) support the options
<i>--mem=MB</i> and <i>--mem-per-cpu=MB</i>, permitting users to specify
the maximum amount of real memory required per node or per allocated CPU.
Specifying one of these options is required in environments where Memory is a consumable
resource. It is important to specify enough memory since Slurm will not allow
the application to use more than the requested amount of real memory. The
default value for --mem is inherited from <b>DefMemPerNode</b>. See
<a href="srun.html#OPT_mem">srun</a>(1) for more details.</p>
<p>Using <i>--overcommit</i> or <i>-O</i> is allowed, as in the example below.
When process-to-logical-processor pinning is enabled by using an appropriate TaskPlugin
configuration parameter, the extra processes will time-share the allocated
resources.</p>
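<p>For example, the following contrived command starts more tasks than there
are allocated CPUs on a single 4-CPU node; the extra tasks time-share the
allocated CPUs:</p>
<pre>
# srun -N 1 -n 8 --overcommit sleep 100 &
</pre>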
<p>The Consumable Trackable Resource plugin is enabled via the SelectType
parameter in the slurm.conf.</p>
<pre>
# Excerpt from sample slurm.conf file
SelectType=select/cons_tres
</pre>
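<p>The consumable resource to track is selected with the
<b>SelectTypeParameters</b> option. A sketch using cores and memory
(CR_Core_Memory is just one of the values listed above) might be:</p>
<pre>
# Excerpt from sample slurm.conf file
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
</pre>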
<h2 id="general">General Comments<a class="slurm_link" href="#general"></a></h2>
<p>Slurm's default <b>select/linear</b> plugin uses a best-fit algorithm
based on the number of consecutive nodes.</p>
<p>The <b>select/cons_tres</b> plugin is enabled or disabled cluster-wide.</p>
<p>In the case where <b>select/linear</b> is enabled, the normal Slurm
behaviors are not disrupted. The major change users see when using the
<b>select/cons_tres</b> plugin is that jobs can be
co-scheduled on nodes when resources permit it. Generic resources (such as GPUs)
can also be tracked individually with this plugin.
The rest of Slurm, such as srun and its options (except <i>srun -s ...</i>), is not
affected by this plugin. From the user's point of view, Slurm works the
same way as when using the default node selection scheme.</p>
<p>The <i>--exclusive</i> srun option allows users to request nodes in
exclusive mode even when consumable resources are enabled. See
<a href="srun.html#OPT_exclusive">srun</a>(1) for details. </p>
<p>srun's <i>-s</i> or <i>--oversubscribe</i> is incompatible with the consumable
resource environment and will therefore not be honored. Since this
environment's nodes are shared by default, <i>--exclusive</i> allows users to
obtain dedicated nodes.</p>
<p>The <i>--oversubscribe</i> and <i>--exclusive</i> options are mutually
exclusive when used at job submission. If both options are set when submitting
a job, the job submission command will exit with a fatal error.</p>
<h2 id="example_mem">Examples of CR_Socket_Memory, and CR_CPU_Memory
type consumable resources
<a class="slurm_link" href="#example_mem"></a>
</h2>
<pre>
# sinfo -lNe
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY
hydra[12-16] 5 allNodes* ... 4 2:2:1 2007
</pre>
<p>Using the select/cons_tres plugin with CR_Socket_Memory (2 sockets/node)</p>
<pre>
Example 1:
# srun -N 5 -n 5 --mem=1000 sleep 100 & <-- running
# srun -n 1 -w hydra12 --mem=2000 sleep 100 & <-- queued and waiting for resources
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1890 allNodes sleep sballe PD 0:00 1 (Resources)
1889 allNodes sleep sballe R 0:08 5 hydra[12-16]
Example 2:
# srun -N 5 -n 10 --mem=10 sleep 100 & <-- running
# srun -n 1 --mem=10 sleep 100 & <-- queued and waiting for resources
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1831 allNodes sleep sballe PD 0:00 1 (Resources)
1830 allNodes sleep sballe R 0:07 5 hydra[12-16]
</pre>
<p>Using the select/cons_tres plugin with CR_CPU_Memory (4 CPUs/node)</p>
<pre>
Example 1:
# srun -N 5 -n 5 --mem=1000 sleep 100 & <-- running
# srun -N 5 -n 5 --mem=10 sleep 100 & <-- running
# srun -N 5 -n 5 --mem=1000 sleep 100 & <-- queued and waiting for resources
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1835 allNodes sleep sballe PD 0:00 5 (Resources)
1833 allNodes sleep sballe R 0:10 5 hydra[12-16]
1834 allNodes sleep sballe R 0:07 5 hydra[12-16]
Example 2:
# srun -N 5 -n 20 --mem=10 sleep 100 & <-- running
# srun -n 1 --mem=10 sleep 100 & <-- queued and waiting for resources
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1837 allNodes sleep sballe PD 0:00 1 (Resources)
1836 allNodes sleep sballe R 0:11 5 hydra[12-16]
</pre>
<h2 id="example_node">
Example of Node Allocations Using Consumable Resource Plugin
<a class="slurm_link" href="#example_node"></a>
</h2>
<p>The following example illustrates the different ways four jobs
are allocated across a cluster using (1) Slurm's default allocation method
(exclusive mode) and (2) a processor consumable resource
approach.</p>
<p>It is important to understand that the example listed below is
contrived and is only given here to illustrate the use of CPUs as
consumable resources. Job 2 and Job 3 call for the node count to equal
the processor count. This would typically be done because
the single task on each node requires all of the memory, disk space, etc.;
the bottleneck would not be processor count.</p>
<p>Trying to execute more than one job per node will almost certainly severely
impact a parallel job's performance.
The biggest beneficiary of CPUs as consumable resources will be serial jobs or
jobs with modest parallelism, which can effectively share resources. On many
systems with larger processor counts, jobs typically run one fewer task than
there are processors to minimize interference by the kernel and daemons.</p>
<p>The example cluster is composed of 4 nodes (10 CPUs in total):</p>
<ul>
<li>linux01 (with 2 processors), </li>
<li>linux02 (with 2 processors), </li>
<li>linux03 (with 2 processors), and</li>
<li>linux04 (with 4 processors). </li>
</ul>
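<p>For reference, a slurm.conf sketch matching this contrived cluster might
contain the following; the node names are taken from the example, and the
remaining values are placeholders:</p>
<pre>
# Hypothetical node and partition definitions for the example cluster
NodeName=linux[01-03] CPUs=2 State=UNKNOWN
NodeName=linux04 CPUs=4 State=UNKNOWN
PartitionName=lsf Nodes=linux[01-04] Default=YES State=UP
</pre>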
<p>The four jobs are the following:</p>
<ul>
<li>[2] srun -n 4 -N 4 sleep 120 &amp;</li>
<li>[3] srun -n 3 -N 3 sleep 120 &amp;</li>
<li>[4] srun -n 1 sleep 120 &amp;</li>
<li>[5] srun -n 3 sleep 120 &amp;</li>
</ul>
<p>The user launches them in the same order as listed above.</p>
<h2 id="using_default">Using Slurm's Default Node Allocation (Non-shared Mode)
<a class="slurm_link" href="#using_default"></a>
</h2>
<p>The four jobs have been launched and 3 of the jobs are now
pending, waiting to get resources allocated to them. Only Job 2 is running
since it uses one CPU on all 4 nodes. This means that linux01 to linux03 each
have one idle CPU and linux04 has 3 idle CPUs.</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 lsf sleep root PD 0:00 3 (Resources)
4 lsf sleep root PD 0:00 1 (Resources)
5 lsf sleep root PD 0:00 1 (Resources)
2 lsf sleep root R 0:14 4 linux[01-04]
</pre>
<p>Once Job 2 is finished, Job 3 is scheduled and runs on
linux01, linux02, and linux03. Job 3 is only using one CPU on each of the 3
nodes. Job 4 can be allocated onto the remaining idle node (linux04) so Job 3
and Job 4 can run concurrently on the cluster.</p>
<p>Job 5 has to wait for idle nodes to be able to run.</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 lsf sleep root PD 0:00 1 (Resources)
3 lsf sleep root R 0:11 3 linux[01-03]
4 lsf sleep root R 0:11 1 linux04
</pre>
<p>Once Job 3 finishes, Job 5 is allocated resources and can run.</p>
<p>The advantage of the exclusive mode scheduling policy is
that a job gets all the resources of its assigned nodes for optimal
parallel performance. The drawback is
that a job can tie up a large amount of resources that it does not use and which
cannot be shared with other jobs.</p>
<h2 id="using_proc">Using a Processor Consumable Resource Approach
<a class="slurm_link" href="#using_proc"></a>
</h2>
<p>We will run through the same scenario again using the <b>cons_tres</b>
plugin and CPUs as the consumable resource. The output of squeue shows that we
have 3 out of the 4 jobs allocated and running, two more running jobs than with
the default Slurm approach.</p>
<p> Job 2 is running on nodes linux01
to linux04. Job 2's allocation is the same as for Slurm's default allocation
which is that it uses one CPU on each of the 4 nodes. Once Job 2 is scheduled
and running, nodes linux01, linux02 and linux03 still have one idle CPU each
and node linux04 has 3 idle CPUs. The main difference between this approach and
the exclusive mode approach described above is that idle CPUs within a node
are now allowed to be assigned to other jobs.</p>
<p>It is important to note that
<i>assigned</i> doesn't mean <i>oversubscription</i>. The consumable resource approach
tracks how much of each available resource (in our case CPUs) must be dedicated
to a given job. This allows us to prevent per node oversubscription of
resources (CPUs).</p>
<p>Once Job 2 is running, Job 3 is
scheduled onto nodes linux01, linux02, and linux03 (using one CPU on each of the
nodes) and Job 4 is scheduled onto one of the remaining idle CPUs on linux04.</p>
<p>Job 2, Job 3, and Job 4 are now running concurrently on the cluster.</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 lsf sleep root PD 0:00 1 (Resources)
2 lsf sleep root R 0:13 4 linux[01-04]
3 lsf sleep root R 0:09 3 linux[01-03]
4 lsf sleep root R 0:05 1 linux04
# sinfo -lNe
NODELIST NODES PARTITION STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
linux[01-03] 3 lsf* allocated 2 2981 1 1 (null) none
linux04 1 lsf* allocated 4 3813 1 1 (null) none
</pre>
<p>Once Job 2 finishes, Job 5, which was pending, is allocated available resources and is then
running as illustrated below:</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 lsf sleep root R 1:58 3 linux[01-03]
4 lsf sleep root R 1:54 1 linux04
5 lsf sleep root R 0:02 3 linux[01-03]
# sinfo -lNe
NODELIST NODES PARTITION STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
linux[01-03] 3 lsf* allocated 2 2981 1 1 (null) none
linux04 1 lsf* idle 4 3813 1 1 (null) none
</pre>
<p>Job 3, Job 4, and Job 5 are now running concurrently on the cluster.</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 lsf sleep root R 1:52 3 linux[01-03]
</pre>
<p>Job 3 and Job 4 have finished and Job 5 is still running on nodes linux[01-03].</p>
<p>The advantage of the consumable resource scheduling policy
is that job throughput can increase dramatically. The overall
productivity of the cluster increases, reducing the
amount of time users have to wait for their jobs to complete and
increasing the overall efficiency of the cluster. The drawback is
that users do not have entire nodes dedicated to their jobs by default.</p>
<p>We have added the <i>--exclusive</i> option to srun (see
<a href="srun.html#OPT_exclusive">srun</a>(1) for more details),
which allows users to specify that they would like
their nodes to be allocated in exclusive mode.
This is to accommodate users who might have MPI/threaded/OpenMP
programs that will take advantage of all the CPUs on a node but only need
one MPI process per node, as in the example below.</p>
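<p>A hypothetical hybrid launch of that kind, with one task per node and the
nodes allocated exclusively, might look like the following (the application
name is a placeholder):</p>
<pre>
# srun -N 4 --ntasks-per-node=1 --exclusive ./hybrid_app
</pre>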
<p style="text-align:center;">Last modified 09 July 2025</p>
<!--#include virtual="footer.txt"-->