<!--#include virtual="header.txt"-->
<h1><a name="top">Consumable Resources in Slurm</a></h1>
<p>Slurm, using the default node allocation plugin, allocates nodes to jobs in
exclusive mode. This means that even when not all the resources within a node are
utilized by a given job, another job will not have access to those resources.
Nodes possess resources such as processors, memory, swap, local
disk, etc., and jobs consume these resources. The exclusive use default policy
in Slurm can result in inefficient utilization of the cluster and of its nodes'
resources.
Slurm's <i>cons_tres</i> plugin is available to
manage resources on a much more fine-grained basis as described below.</p>
<h2 id="using_cons_tres">
Using the Consumable Trackable Resource Plugin: <b>select/cons_tres</b>
<a class="slurm_link" href="#using_cons_tres"></a>
</h2>
<p>The Consumable Trackable Resources (<b>cons_tres</b>) plugin has been built
to work with several resources. It can track Boards, Sockets, Cores, CPUs, and Memory,
as well as combinations of these logical processors with Memory:</p>
<ul>
<li><b>CPU</b> (<i>CR_CPU</i>): CPU as a consumable resource.</li>
<ul>
<li>No notion of sockets, cores, or threads.</li>
<li>On a multi-core system CPUs will be cores.</li>
<li>On a multi-core/hyperthread system CPUs will be threads.</li>
<li>On a single-core system CPUs are CPUs.</li>
</ul>
<li><b>Board</b> (<i>CR_Board</i>): Baseboard as a consumable resource.</li>
<li><b>Socket</b> (<i>CR_Socket</i>): Socket as a consumable resource.</li>
<li><b>Core</b> (<i>CR_Core</i>): Core as a consumable resource.</li>
<li><b>Socket and Memory</b> (<i>CR_Socket_Memory</i>): Socket
and Memory as consumable resources.</li>
<li><b>Core and Memory</b> (<i>CR_Core_Memory</i>): Core and
Memory as consumable resources.</li>
<li><b>CPU and Memory</b> (<i>CR_CPU_Memory</i>): CPU and Memory
as consumable resources.</li>
</ul>
<p>All CR_* parameters assume <b>OverSubscribe=No</b> or
<b>OverSubscribe=Force</b>.</p>
<p>The cons_tres plugin also provides functionality specifically
related to GPUs.</p>
<p>Additional parameters available for the <b>cons_tres</b> plugin (a sample
configuration excerpt follows the list):</p>
<ul>
<li><b>DefCpuPerGPU</b>: Default number of CPUs allocated per GPU.</li>
<li><b>DefMemPerGPU</b>: Default amount of memory allocated per GPU.</li>
</ul>
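<p>For example, a slurm.conf excerpt setting these per-GPU defaults might look
like the following; the values are illustrative only, not recommendations:</p>
<pre>
# Illustrative per-GPU defaults (2 CPUs and 8192 MB of memory per requested GPU)
DefCpuPerGPU=2
DefMemPerGPU=8192
</pre>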
<p>Additional job submit options available for the <b>cons_tres</b> plugin (an
example command follows the list):</p>
<ul>
<li><b>--cpus-per-gpu=</b>: Number of CPUs for every GPU.</li>
<li><b>--gpus=</b>: Count of GPUs for entire job allocation.</li>
<li><b>--gpu-bind=</b>: Bind task to specific GPU(s).</li>
<li><b>--gpu-freq=</b>: Request specific GPU/memory frequencies.</li>
<li><b>--gpus-per-node=</b>: Number of GPUs per node.</li>
<li><b>--gpus-per-socket=</b>: Number of GPUs per socket.</li>
<li><b>--gpus-per-task=</b>: Number of GPUs per task.</li>
<li><b>--mem-per-gpu=</b>: Amount of memory for each GPU.</li>
</ul>
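<p>For example, a hypothetical job requesting two GPUs on each of two nodes,
with four CPUs and 16 GB of memory per GPU, could be submitted as shown below
(the application name and the values are placeholders):</p>
<pre>
# srun -N 2 --gpus-per-node=2 --cpus-per-gpu=4 --mem-per-gpu=16G ./my_gpu_app
</pre>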
<p>srun's <i>-B</i> extension for sockets, cores, and threads is
ignored within the node allocation mechanism when CR_CPU or
CR_CPU_Memory is selected. It is used to compute the total
number of tasks when <i>-n</i> is not specified.</p>
<p>In the cases where Memory is a consumable resource, the <b>RealMemory</b>
parameter must be set in the slurm.conf to define a node's amount of real
memory.</p>
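<p>For example, a node definition in slurm.conf with <b>RealMemory</b> set
might look like the following; the node names and sizes are illustrative:</p>
<pre>
# Illustrative node definition; RealMemory is specified in megabytes
NodeName=node[01-04] CPUs=16 RealMemory=64000 State=UNKNOWN
</pre>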
<p>The job submission commands (salloc, sbatch and srun) support the options
<i>--mem=MB</i> and <i>--mem-per-cpu=MB</i>, permitting users to specify
the maximum amount of real memory required per node or per allocated CPU.
Specifying one of these options is required in environments where Memory is a consumable
resource. It is important to specify enough memory since Slurm will not allow
the application to use more than the requested amount of real memory. The
default value for --mem is inherited from <b>DefMemPerNode</b>. See
<a href="srun.html#OPT_mem">srun</a>(1) for more details.</p>
<p>Using <i>--overcommit</i> or <i>-O</i> is allowed, as in the example below.
When process-to-logical-processor pinning is enabled by using an appropriate TaskPlugin
configuration parameter, the extra processes will time-share the allocated
resources.</p>
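<p>For example, the following contrived command starts more tasks than there
are allocated CPUs on a single 4-CPU node; the extra tasks time-share the
allocated CPUs:</p>
<pre>
# srun -N 1 -n 8 --overcommit sleep 100 &
</pre>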
<p>The Consumable Trackable Resource plugin is enabled via the SelectType
parameter in the slurm.conf.</p>
<pre>
# Excerpt from sample slurm.conf file
SelectType=select/cons_tres
</pre>
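<p>The consumable resource to track is selected with the
<b>SelectTypeParameters</b> option. A sketch using cores and memory
(CR_Core_Memory is just one of the values listed above) might be:</p>
<pre>
# Excerpt from sample slurm.conf file
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
</pre>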
<h2 id="general">General Comments<a class="slurm_link" href="#general"></a></h2>
<p>Slurm's default <b>select/linear</b> plugin uses a best-fit algorithm
based on the number of consecutive nodes.</p>
<p>The <b>select/cons_tres</b> plugin is enabled or disabled cluster-wide.</p>
<p>In the case where <b>select/linear</b> is enabled, the normal Slurm
behaviors are not disrupted. The major change users see when using the
<b>select/cons_tres</b> plugin is that jobs can be
co-scheduled on nodes when resources permit it. Generic resources (such as GPUs)
can also be tracked individually with this plugin.
The rest of Slurm, such as srun and its options (except <i>srun -s ...</i>), is not
affected by this plugin. From the user's point of view, Slurm works the
same way as when using the default node selection scheme.</p>
<p>The <i>--exclusive</i> srun option allows users to request nodes in
exclusive mode even when consumable resources are enabled. See
<a href="srun.html#OPT_exclusive">srun</a>(1) for details. </p>
<p>srun's <i>-s</i> or <i>--oversubscribe</i> is incompatible with the consumable
resource environment and will therefore not be honored. Since this
environment's nodes are shared by default, <i>--exclusive</i> allows users to
obtain dedicated nodes.</p>
<p>The <i>--oversubscribe</i> and <i>--exclusive</i> options are mutually
exclusive when used at job submission. If both options are set when submitting
a job, the job submission command will exit with a fatal error.</p>
<h2 id="example_mem">Examples of CR_Socket_Memory, and CR_CPU_Memory
type consumable resources
<a class="slurm_link" href="#example_mem"></a>
</h2>
<pre>
# sinfo -lNe
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY
hydra[12-16] 5 allNodes* ... 4 2:2:1 2007
</pre>
<p>Using the select/cons_tres plugin with CR_Socket_Memory (2 sockets/node)</p>
<pre>
Example 1:
# srun -N 5 -n 5 --mem=1000 sleep 100 & <-- running
# srun -n 1 -w hydra12 --mem=2000 sleep 100 & <-- queued and waiting for resources
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1890 allNodes sleep sballe PD 0:00 1 (Resources)
1889 allNodes sleep sballe R 0:08 5 hydra[12-16]
Example 2:
# srun -N 5 -n 10 --mem=10 sleep 100 & <-- running
# srun -n 1 --mem=10 sleep 100 & <-- queued and waiting for resources
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1831 allNodes sleep sballe PD 0:00 1 (Resources)
1830 allNodes sleep sballe R 0:07 5 hydra[12-16]
</pre>
<p>Using the select/cons_tres plugin with CR_CPU_Memory (4 CPUs/node)</p>
<pre>
Example 1:
# srun -N 5 -n 5 --mem=1000 sleep 100 & <-- running
# srun -N 5 -n 5 --mem=10 sleep 100 & <-- running
# srun -N 5 -n 5 --mem=1000 sleep 100 & <-- queued and waiting for resources
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1835 allNodes sleep sballe PD 0:00 5 (Resources)
1833 allNodes sleep sballe R 0:10 5 hydra[12-16]
1834 allNodes sleep sballe R 0:07 5 hydra[12-16]
Example 2:
# srun -N 5 -n 20 --mem=10 sleep 100 & <-- running
# srun -n 1 --mem=10 sleep 100 & <-- queued and waiting for resources
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1837 allNodes sleep sballe PD 0:00 1 (Resources)
1836 allNodes sleep sballe R 0:11 5 hydra[12-16]
</pre>
<h2 id="example_node">
Example of Node Allocations Using Consumable Resource Plugin
<a class="slurm_link" href="#example_node"></a>
</h2>
<p>The following example illustrates the different ways four jobs
are allocated across a cluster using (1) Slurm's default allocation method
(exclusive mode) and (2) a processor consumable resource
approach.</p>
<p>It is important to understand that the example listed below is
contrived and is only given here to illustrate the use of CPUs as
consumable resources. Job 2 and Job 3 call for the node count to equal
the processor count. This would typically be done because
the single task on each node requires all of the memory, disk space, etc.;
the bottleneck would not be processor count.</p>
<p>Trying to execute more than one job per node will almost certainly severely
impact a parallel job's performance.
The biggest beneficiary of CPUs as consumable resources will be serial jobs or
jobs with modest parallelism, which can effectively share resources. On many
systems with larger processor counts, jobs typically run one fewer task than
there are processors to minimize interference by the kernel and daemons.</p>
<p>The example cluster is composed of 4 nodes (10 CPUs in total):</p>
<ul>
<li>linux01 (with 2 processors), </li>
<li>linux02 (with 2 processors), </li>
<li>linux03 (with 2 processors), and</li>
<li>linux04 (with 4 processors). </li>
</ul>
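<p>For reference, a slurm.conf sketch matching this contrived cluster might
contain the following; the node names are taken from the example, and the
remaining values are placeholders:</p>
<pre>
# Hypothetical node and partition definitions for the example cluster
NodeName=linux[01-03] CPUs=2 State=UNKNOWN
NodeName=linux04 CPUs=4 State=UNKNOWN
PartitionName=lsf Nodes=linux[01-04] Default=YES State=UP
</pre>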
<p>The four jobs are the following:</p>
<ul>
<li>[2] srun -n 4 -N 4 sleep 120 &amp;</li>
<li>[3] srun -n 3 -N 3 sleep 120 &amp;</li>
<li>[4] srun -n 1 sleep 120 &amp;</li>
<li>[5] srun -n 3 sleep 120 &amp;</li>
</ul>
<p>The user launches them in the same order as listed above.</p>
<h2 id="using_default">Using Slurm's Default Node Allocation (Non-shared Mode)
<a class="slurm_link" href="#using_default"></a>
</h2>
<p>The four jobs have been launched and 3 of the jobs are now
pending, waiting to get resources allocated to them. Only Job 2 is running
since it uses one CPU on all 4 nodes. This means that linux01 to linux03 each
have one idle CPU and linux04 has 3 idle CPUs.</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 lsf sleep root PD 0:00 3 (Resources)
4 lsf sleep root PD 0:00 1 (Resources)
5 lsf sleep root PD 0:00 1 (Resources)
2 lsf sleep root R 0:14 4 linux[01-04]
</pre>
<p>Once Job 2 is finished, Job 3 is scheduled and runs on
linux01, linux02, and linux03. Job 3 is only using one CPU on each of the 3
nodes. Job 4 can be allocated onto the remaining idle node (linux04) so Job 3
and Job 4 can run concurrently on the cluster.</p>
<p>Job 5 has to wait for idle nodes to be able to run.</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 lsf sleep root PD 0:00 1 (Resources)
3 lsf sleep root R 0:11 3 linux[01-03]
4 lsf sleep root R 0:11 1 linux04
</pre>
<p>Once Job 3 finishes, Job 5 is allocated resources and can run.</p>
<p>The advantage of the exclusive mode scheduling policy is
that a job gets all the resources of its assigned nodes for optimal
parallel performance. The drawback is
that a job can tie up a large amount of resources that it does not use and which
cannot be shared with other jobs.</p>
<h2 id="using_proc">Using a Processor Consumable Resource Approach
<a class="slurm_link" href="#using_proc"></a>
</h2>
<p>We will run through the same scenario again using the <b>cons_tres</b>
plugin and CPUs as the consumable resource. The output of squeue shows that we
have 3 out of the 4 jobs allocated and running, two more running jobs than with
the default Slurm approach.</p>
<p> Job 2 is running on nodes linux01
to linux04. Job 2's allocation is the same as for Slurm's default allocation
which is that it uses one CPU on each of the 4 nodes. Once Job 2 is scheduled
and running, nodes linux01, linux02 and linux03 still have one idle CPU each
and node linux04 has 3 idle CPUs. The main difference between this approach and
the exclusive mode approach described above is that idle CPUs within a node
are now allowed to be assigned to other jobs.</p>
<p>It is important to note that
<i>assigned</i> doesn't mean <i>oversubscription</i>. The consumable resource approach
tracks how much of each available resource (in our case CPUs) must be dedicated
to a given job. This allows us to prevent per node oversubscription of
resources (CPUs).</p>
<p>Once Job 2 is running, Job 3 is
scheduled onto nodes linux01, linux02, and linux03 (using one CPU on each of the
nodes) and Job 4 is scheduled onto one of the remaining idle CPUs on linux04.</p>
<p>Job 2, Job 3, and Job 4 are now running concurrently on the cluster.</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 lsf sleep root PD 0:00 1 (Resources)
2 lsf sleep root R 0:13 4 linux[01-04]
3 lsf sleep root R 0:09 3 linux[01-03]
4 lsf sleep root R 0:05 1 linux04
# sinfo -lNe
NODELIST NODES PARTITION STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
linux[01-03] 3 lsf* allocated 2 2981 1 1 (null) none
linux04 1 lsf* allocated 4 3813 1 1 (null) none
</pre>
<p>Once Job 2 finishes, Job 5, which was pending, is allocated available resources and is then
running as illustrated below:</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 lsf sleep root R 1:58 3 linux[01-03]
4 lsf sleep root R 1:54 1 linux04
5 lsf sleep root R 0:02 3 linux[01-03]
# sinfo -lNe
NODELIST NODES PARTITION STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
linux[01-03] 3 lsf* allocated 2 2981 1 1 (null) none
linux04 1 lsf* idle 4 3813 1 1 (null) none
</pre>
<p>Job 3, Job 4, and Job 5 are now running concurrently on the cluster.</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 lsf sleep root R 1:52 3 linux[01-03]
</pre>
<p>Job 3 and Job 4 have finished and Job 5 is still running on nodes linux[01-03].</p>
<p>The advantage of the consumable resource scheduling policy
is that job throughput can increase dramatically. The overall
productivity of the cluster increases, reducing the
amount of time users have to wait for their jobs to complete and
increasing the overall efficiency of the cluster. The drawback is
that users do not have entire nodes dedicated to their jobs by default.</p>
<p>We have added the <i>--exclusive</i> option to srun (see
<a href="srun.html#OPT_exclusive">srun</a>(1) for more details),
which allows users to specify that they would like
their nodes to be allocated in exclusive mode.
This is to accommodate users who might have MPI/threaded/OpenMP
programs that will take advantage of all the CPUs on a node but only need
one MPI process per node, as in the example below.</p>
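<p>A hypothetical hybrid launch of that kind, with one task per node and the
nodes allocated exclusively, might look like the following (the application
name is a placeholder):</p>
<pre>
# srun -N 4 --ntasks-per-node=1 --exclusive ./hybrid_app
</pre>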
<p style="text-align:center;">Last modified 09 July 2025</p>
<!--#include virtual="footer.txt"-->