<!--#include virtual="header.txt"-->
<h1><a name="top">Consumable Resources in SLURM</a></h1>
<p>SLURM, using the default node allocation plug-in, allocates nodes to jobs in
exclusive mode. This means that even when all the resources within a node are
not utilized by a given job, another job will not have access to these resources.
Nodes possess resources such as processors, memory, swap, local
disk, etc., and jobs consume these resources. The exclusive-use default policy
in SLURM can result in inefficient utilization of the cluster and of its nodes'
resources.
<p>A plug-in supporting CPUs as a consumable resource is available in
SLURM version 0.5.0 and newer. Information on how to use
this plug-in is provided below.
</p>
<h2>Using the Consumable Resource Node Allocation Plugin: <b>select/cons_res</b></h2>
<ol start=1 type=1>
<li><b>SLURM v0.5 and up to SLURM v1.1: <u>ONLY</u> CPUs as a consumable resource</b></li>
<ul>
<li>Managing <b>CPUs</b> as a consumable resource means that SLURM will
not overallocate CPUs. In this implementation it is possible to oversubscribe
memory if co-located jobs are using more memory than is available on the node.
See features for SLURM 1.2 (below) for memory as a consumable resource.</li>
<li>The consumable resource plugin is enabled via SelectType in the
slurm.conf (e.g. <i>SelectType=select/cons_res</i>).</li>
<pre>
#
# "SelectType" : node selection logic for scheduling.
# "select/bluegene" : the default on BlueGene systems, aware of
# system topology, manages bglblocks, etc.
# "select/cons_res" : allocate individual consumable resources
# (i.e. processors, memory, etc.)
# "select/linear" : the default on non-BlueGene systems,
# no topology awareness, oriented toward
# allocating nodes to jobs rather than
# resources within a node (e.g. CPUs)
#
# SelectType=select/linear
SelectType=select/cons_res
</pre>
<li>The <b>select/cons_res</b> plug-in requires Shared=No at the
partition level.</li>
<li>Using the <i>--overcommit</i> or <i>-O</i> switch in the
<b>select/cons_res</b> environment is only possible when users
request dedicated nodes using <i>--exclusive</i> (see the sketch
after this list). Overcommitting CPUs in a non-dedicated environment
would impact jobs that are co-located on the same nodes, which is not
desirable. Support for <i>--overcommit</i> without dedicated nodes is
available in SLURM 1.2 (see below).</li>
</ul>
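<p>A hypothetical illustration of the rule above (the node count, task count,
and job are made up for this sketch, not taken from a real configuration):
overcommitting CPUs under these SLURM versions requires requesting the nodes
in dedicated mode.</p>
<pre>
# Request 2 dedicated nodes and overcommit CPUs on them; without
# --exclusive, the -O/--overcommit switch is not honored here.
srun -N 2 --exclusive -O -n 16 sleep 100 &
</pre>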
<li><b>SLURM 1.2 and newer versions of SLURM</b></li>
<ul>
<li>Consumable resources have been enhanced with several new resources,
namely CPU (the same as in the previous version), Socket, Core, and Memory,
as well as any combination of the logical processors with Memory:</li>
<ul>
<li><b>CPU</b> (<i>CR_CPU</i>): CPU as a consumable resource.
<ul>
<li>No notion of sockets, cores, or threads.</li>
<li>On a multi-core system CPUs will be cores.</li>
<li>On a multi-core/hyperthread system CPUs will be threads.</li>
<li>On single-core systems CPUs are CPUs. ;-) </li>
</ul>
<li><b>Socket</b> (<i>CR_Socket</i>): Socket as a consumable
resource.</li>
<li><b>Core</b> (<i>CR_Core</i>): Core as a consumable
resource.</li>
<li><b>Memory</b> (<i>CR_Memory</i>): Memory <u>only</u> as a
consumable resource. Note! CR_Memory assumes Shared=Yes.</li>
<li><b>Socket and Memory</b> (<i>CR_Socket_Memory</i>): Socket
and Memory as consumable resources.</li>
<li><b>Core and Memory</b> (<i>CR_Core_Memory</i>): Core and
Memory as consumable resources.</li>
<li><b>CPU and Memory</b> (<i>CR_CPU_Memory</i>): CPU and Memory
as consumable resources.</li>
</ul>
<li>In the cases where Memory is the consumable resource, or one of
the two consumable resources, the <b>Memory</b> parameter, which
defines a node's amount of real memory in slurm.conf, must be
set when FastSchedule=1 (see the configuration sketch after this
list).</li>
<li>srun's <i>-E</i> extension for sockets, cores, and threads is
ignored within the node allocation mechanism when CR_CPU or
CR_CPU_MEMORY is selected. It is used only to compute the total
number of tasks when -n is not specified.</li>
<li>A new srun switch <i>--job-mem=MB</i> was added to allow users
to specify the maximum amount of real memory per node required
by their application. This switch is needed in environments
where Memory is a consumable resource. It is important to specify
enough memory, since slurmd will not allow the application to use
more than the requested amount of real memory per node. The
default value for --job-mem is 1 MB. See the srun man page for more
details and the usage sketch after this list.</li>
<li><b>All CR_s assume Shared=No</b> or Shared=Force EXCEPT for
<b>CR_MEMORY</b> which <b>assumes Shared=Yes</b></li>
<li>The consumable resource plugin is enabled via SelectType and
SelectTypeParameter in the slurm.conf.</li>
<pre>
#
# "SelectType" : node selection logic for scheduling.
# "select/bluegene" : the default on BlueGene systems, aware of
# system topology, manages bglblocks, etc.
# "select/cons_res" : allocate individual consumable resources
# (i.e. processors, memory, etc.)
# "select/linear" : the default on non-BlueGene systems,
# no topology awareness, oriented toward
# allocating nodes to jobs rather than
# resources within a node (e.g. CPUs)
#
# SelectType=select/linear
SelectType=select/cons_res
# o Define parameters to describe the SelectType plugin. For
# - select/bluegene - this parameter is currently ignored
# - select/linear - this parameter is currently ignored
# - select/cons_res - the parameters available are
# - CR_CPU (1) - CPUs as consumable resources.
# No notion of sockets, cores, or threads.
# On a multi-core system CPUs will be cores
# On a multi-core/hyperthread system CPUs will
# be threads
# On a single-core systems CPUs are CPUs. ;-)
# - CR_Socket (2) - Sockets as a consumable resource.
# - CR_Core (3) - Cores as a consumable resource.
# (Not yet implemented)
# - CR_Memory (4) - Memory as a consumable resource.
# Note! CR_Memory assumes Shared=Yes
# - CR_Socket_Memory (5) - Socket and Memory as consumable
# resources.
# - CR_Core_Memory (6) - Core and Memory as consumable
# resources. (Not yet implemented)
# - CR_CPU_Memory (7) - CPU and Memory as consumable
# resources.
#
# (#) refer to the output of "scontrol show config"
#
# NB!: The -E extension for sockets, cores, and threads
# are ignored within the node allocation mechanism
# when CR_CPU or CR_CPU_MEMORY is selected.
# They are considered to compute the total number of
# tasks when -n is not specified
#
# NB! All CR_s assume Shared=No or Shared=Force EXCEPT for
# CR_MEMORY which assumes Shared=Yes
#
#SelectTypeParameters=CR_CPU (default)
</pre>
<li>Using <i>--overcommit</i> or <i>-O</i> is allowed in this new version
of consumable resources. When process-to-logical-processor pinning is
enabled (task/affinity plug-in), the extra processes will not affect
co-scheduled jobs other than other jobs started with the -O flag.
We are currently investigating alternative approaches for handling the
pinning of jobs started with <i>--overcommit</i>.</li>
<li><i>-c</i> or <i>--cpus-per-task</i> works in this version of
consumable resources.</li>
</ul>
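<p>The listing below is a minimal, hypothetical sketch pulling together the
points above for memory-based selection; the parameter values, task counts,
and application name are illustrative assumptions, not taken from a real
cluster.</p>
<pre>
# slurm.conf (sketch): treat sockets and memory as consumable resources.
# With a memory-based CR_* value, each node's real memory must be defined
# in its node entry (e.g. RealMemory=) and FastSchedule=1 set so that the
# configured values are used.
FastSchedule=1
SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory
# Optional: enable process to logical processor pinning
TaskPlugin=task/affinity

# srun usage (sketch): 4 tasks, 2 CPUs per task, and at most 1000 MB of
# real memory per node for a hypothetical application.
srun -N 2 -n 4 -c 2 --job-mem=1000 ./my_app
</pre>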
<li><b>General comments</b></li>
<ul>
<li>SLURM's default <b>select/linear</b> plugin uses a best-fit algorithm based on
the number of consecutive nodes. The same node allocation approach is used in
<b>select/cons_res</b> for consistency.</li>
<li>The <b>select/cons_res</b> plugin is enabled or disabled cluster-wide.</li>
<li>In the case where <b>select/cons_res</b> is not enabled, the normal SLURM behaviors
are not disrupted. The only change users see when using the <b>select/cons_res</b>
plug-in is that jobs can be co-scheduled on nodes when resources permit it.
The rest of SLURM, such as srun and its switches (except srun -s ...), is not
affected by this plugin. From a user's point of view, SLURM works the same
way as with the default node selection scheme.</li>
<li>The <i>--exclusive</i> srun switch allows users to request nodes in
exclusive mode even when the consumable resource plugin is enabled. See
"man srun" for details and the example after this list.</li>
<li>srun's <i>-s</i> or <i>--share</i> is incompatible with the consumable resource
environment and will therefore not be honored. Since in this environment nodes
are shared by default, <i>--exclusive</i> allows users to obtain dedicated nodes.</li>
</ul>
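<p>For example, a user who wants whole nodes even though the consumable
resource plugin is enabled could submit the following (a hypothetical sketch;
the node count and application name are assumptions):</p>
<pre>
# Request 2 nodes in dedicated (exclusive) mode even though node sharing
# is the default under select/cons_res.
srun -N 2 --exclusive ./my_app
</pre>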
</ol>
<p class="footer"><a href="#top">top</a></p>
<h2>Limitations and future work</h2>
<p>We are aware of several limitations with the current consumable
resource plug-in and plan to enhance the plug-in as time permits. We
also rely on requests from users to help us prioritize the features.
Please send comments and requests about the consumable resources to
<a href="mailto:slurm-dev@lists.llnl.gov">slurm-dev@lists.llnl.gov</a>.</p>
<ol start=1 type=1>
<li><b>Issue with --max_nodes, --max_sockets_per_node, --max_cores_per_socket and --max_threads_per_core</b></li>
<ul>
<li><b>Problem:</b> The example below was obtained when using CR_CPU
(the default mode). The systems are all dual-socket, dual-core,
single-threaded systems (= 4 CPUs per system).</li>
<li>The first 3 serial jobs are allocated to node hydra12,
which means that one CPU is still available on hydra12.</li>
<li>The 4th job "srun -N 2-2 -E 2:2 sleep 100" requires 8 CPUs,
and since the algorithm fills up nodes in consecutive order
(when not in dedicated mode), it will want to use the
remaining CPU on hydra12 first. Because the user has requested
a maximum of two nodes, the allocation will put the job on
hold until hydra12 becomes available or, if backfill is enabled,
until hydra12's remaining CPU gets allocated to another job,
which will allow the 4th job to get two dedicated nodes.</li>
<li><b>Note!</b> If you want to specify <i>--max_????</i>, this
problem can be avoided in the current implementation by asking
for the nodes in dedicated mode using <i>--exclusive</i> (see the
workaround sketch after this list).</li>
<pre>
# srun sleep 100 &
# srun sleep 100 &
# srun sleep 100 &
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1132 allNodes sleep sballe R 0:05 1 hydra12
1133 allNodes sleep sballe R 0:04 1 hydra12
1134 allNodes sleep sballe R 0:02 1 hydra12
# srun -N 2-2 -E 2:2 sleep 100 &
srun: job 1135 queued and waiting for resources
#squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1135 allNodes sleep sballe PD 0:00 2 (Resources)
1132 allNodes sleep sballe R 0:24 1 hydra12
1133 allNodes sleep sballe R 0:23 1 hydra12
1134 allNodes sleep sballe R 0:21 1 hydra12
#
</pre>
<li><b>Proposed solution:</b> Enhance the selection mechanism to go through {node,socket,core,thread}-tuples to find an available match for the specific request (a bounded knapsack problem).</li>
</ul>
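<p>The workaround mentioned in the note above would submit the 4th job with
the nodes requested in dedicated mode, for example:</p>
<pre>
# Same 2-node request as before, but --exclusive asks for dedicated nodes,
# so partially used nodes (such as hydra12 above) are not considered and
# the job is not held waiting for hydra12's last CPU.
srun -N 2-2 -E 2:2 --exclusive sleep 100 &
</pre>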
<li><b>Binding of processes in the case when <i>--overcommit</i> is specified.</b></li>
<ul>
<li>In the current implementation (SLURM 1.2) we have chosen not
to bind processes that have been started with the <i>--overcommit</i>
flag. The reasoning behind this decision is that the Linux
scheduler will move non-bound processes to available resources
when jobs with process pinning enabled are started. The
non-bound jobs do not affect the bound jobs, but co-scheduled
non-bound jobs would affect each other's runtime. We have decided
that for now this is an adequate solution.</li>
</ul>
</ol>
<p class="footer"><a href="#top">top</a></p>
<h2>Examples of CR_Memory, CR_Socket_Memory, and CR_CPU_Memory type consumable resources</h2>
<pre>
sinfo -lNe
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY
hydra[12-16] 5 allNodes* ... 4 2:2:1 2007
</pre>
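<p>A hypothetical slurm.conf node and partition definition consistent with the
sinfo output above might look as follows (the values are assumptions inferred
from that output, not an actual configuration):</p>
<pre>
NodeName=hydra[12-16] Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=2007
# Shared is set according to the CR_* type in use: CR_Memory assumes
# Shared=Yes, while the other CR_* types assume Shared=No or Shared=Force.
PartitionName=allNodes Nodes=hydra[12-16] Default=YES
</pre>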
<p>Using select/cons_res plug-in with CR_Memory</p>
<pre>
Example:
srun -N 5 -n 20 --job-mem=1000 sleep 100 & <-- running
srun -N 5 -n 20 --job-mem=10 sleep 100 & <-- running
srun -N 5 -n 10 --job-mem=1000 sleep 100 & <-- queued and waiting for resources
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1820 allNodes sleep sballe PD 0:00 5 (Resources)
1818 allNodes sleep sballe R 0:17 5 hydra[12-16]
1819 allNodes sleep sballe R 0:11 5 hydra[12-16]
</pre>
<p>Using select/cons_res plug-in with CR_Socket_Memory (2 sockets/node)</p>
<pre>
Example 1:
srun -N 5 -n 5 --job-mem=1000 sleep 100 & <-- running
srun -n 1 -w hydra12 --job-mem=2000 sleep 100 & <-- queued and waiting for resources
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1890 allNodes sleep sballe PD 0:00 1 (Resources)
1889 allNodes sleep sballe R 0:08 5 hydra[12-16]
Example 2:
srun -N 5 -n 10 --job-mem=10 sleep 100 & <-- running
srun -n 1 --job-mem=10 sleep 100 & <-- queued and waiting for resources
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1831 allNodes sleep sballe PD 0:00 1 (Resources)
1830 allNodes sleep sballe R 0:07 5 hydra[12-16]
</pre>
<p>Using select/cons_res plug-in with CR_CPU_Memory (4 CPUs/node)</p>
<pre>
Example 1:
srun -N 5 -n 5 --job-mem=1000 sleep 100 & <-- running
srun -N 5 -n 5 --job-mem=10 sleep 100 & <-- running
srun -N 5 -n 5 --job-mem=1000 sleep 100 & <-- queued and waiting for resources
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1835 allNodes sleep sballe PD 0:00 5 (Resources)
1833 allNodes sleep sballe R 0:10 5 hydra[12-16]
1834 allNodes sleep sballe R 0:07 5 hydra[12-16]
Example 2:
srun -N 5 -n 20 --job-mem=10 sleep 100 & <-- running
srun -n 1 --job-mem=10 sleep 100 & <-- queued and waiting for resources
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1837 allNodes sleep sballe PD 0:00 1 (Resources)
1836 allNodes sleep sballe R 0:11 5 hydra[12-16]
</pre>
<p class="footer"><a href="#top">top</a></p>
<h2>Example of Node Allocations Using Consumable Resource Plugin</h2>
<p>The following example illustrates the different ways four jobs
are allocated across a cluster using (1) SLURM's default allocation
(exclusive mode) and (2) a processor as consumable resource
approach.</p>
<p>It is important to understand that the example listed below is a
contrived example and is only given here to illustrate the use of CPUs as
consumable resources. Job 2 and Job 3 call for the node count to equal
the processor count. This would typically be done because
one task per node requires all of the memory, disk space, etc.; the
bottleneck would not be the processor count.</p>
<p>Trying to execute more than one job per node will almost certainly severely
impact a parallel job's performance.
The biggest beneficiaries of CPUs as consumable resources are serial jobs and
jobs with modest parallelism, which can effectively share resources. On many
systems with a larger processor count, jobs typically run one fewer task than
there are processors to minimize interference by the kernel and daemons.</p>
<p>The example cluster is composed of 4 nodes (10 cpus in total):</p>
<ul>
<li>linux01 (with 2 processors), </li>
<li>linux02 (with 2 processors), </li>
<li>linux03 (with 2 processors), and</li>
<li>linux04 (with 4 processors). </li>
</ul>
<p>The four jobs are the following:</p>
<ul>
<li>[2] srun -n 4 -N 4 sleep 120 &amp;</li>
<li>[3] srun -n 3 -N 3 sleep 120 &amp;</li>
<li>[4] srun -n 1 sleep 120 &amp;</li>
<li>[5] srun -n 3 sleep 120 &amp;</li>
</ul>
<p>The user launches them in the same order as listed above.</p>
<p class="footer"><a href="#top">top</a></p>
<h2>Using SLURM's Default Node Allocation (Non-shared Mode)</h2>
<p>The four jobs have been launched and 3 of the jobs are now
pending, waiting to get resources allocated to them. Only Job 2 is running
since it uses one cpu on all 4 nodes. This means that linux01 to linux03 each
have one idle cpu and linux04 has 3 idle cpus.</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 lsf sleep root PD 0:00 3 (Resources)
4 lsf sleep root PD 0:00 1 (Resources)
5 lsf sleep root PD 0:00 1 (Resources)
2 lsf sleep root R 0:14 4 linux[01-04]
</pre>
<p>Once Job 2 is finished, Job 3 is scheduled and runs on
linux01, linux02, and linux03. Job 3 is only using one cpu on each of the 3
nodes. Job 4 can be allocated onto the remaining idle node (linux04) so Job 3
and Job 4 can run concurrently on the cluster.</p>
<p>Job 5 has to wait for idle nodes to be able to run.</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 lsf sleep root PD 0:00 1 (Resources)
3 lsf sleep root R 0:11 3 linux[01-03]
4 lsf sleep root R 0:11 1 linux04
</pre>
<p>Once Job 3 finishes, Job 5 is allocated resources and can run.</p>
<p>The advantage of the exclusive-mode scheduling policy is
that a job gets all the resources of its assigned nodes for optimal
parallel performance. The drawback is
that a job can tie up a large amount of resources that it does not use and that
cannot be shared with other jobs.</p>
<p class="footer"><a href="#top">top</a></p>
<h2>Using a Processor Consumable Resource Approach</h2>
<p>The output of squeue shows that we
have 3 out of the 4 jobs allocated and running. This is an increase of two
running jobs over the default SLURM approach.</p>
<p>Job 2 is running on nodes linux01
to linux04. Job 2's allocation is the same as with SLURM's default allocation:
it uses one cpu on each of the 4 nodes. Once Job 2 is scheduled
and running, nodes linux01, linux02 and linux03 still have one idle cpu each
and node linux04 has 3 idle cpus. The main difference between this approach and
the exclusive mode approach described above is that idle cpus within a node
can now be assigned to other jobs.</p>
<p>It is important to note that
<i>assigned</i> doesn't mean <i>oversubscription</i>. The consumable resource approach
tracks how much of each available resource (in our case cpus) must be dedicated
to a given job. This allows us to prevent per node oversubscription of
resources (cpus).</p>
<p>Once Job 2 is running, Job 3 is
scheduled onto nodes linux01, linux02, and linux03 (using one cpu on each of the
nodes) and Job 4 is scheduled onto one of the remaining idle cpus on linux04.</p>
<p>Job 2, Job 3, and Job 4 are now running concurrently on the cluster.</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 lsf sleep root PD 0:00 1 (Resources)
2 lsf sleep root R 0:13 4 linux[01-04]
3 lsf sleep root R 0:09 3 linux[01-03]
4 lsf sleep root R 0:05 1 linux04
# sinfo -lNe
NODELIST NODES PARTITION STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
linux[01-03] 3 lsf* allocated 2 2981 1 1 (null) none
linux04 1 lsf* allocated 4 3813 1 1 (null) none
</pre>
<p>Once Job 2 finishes, Job 5, which was pending, is allocated available resources and is then
running as illustrated below:</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 lsf sleep root R 1:58 3 linux[01-03]
4 lsf sleep root R 1:54 1 linux04
5 lsf sleep root R 0:02 3 linux[01-03]
# sinfo -lNe
NODELIST NODES PARTITION STATE CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
linux[01-03] 3 lsf* allocated 2 2981 1 1 (null) none
linux04 1 lsf* idle 4 3813 1 1 (null) none
</pre>
<p>Job 3, Job 4, and Job 5 are now running concurrently on the cluster.</p>
<pre>
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 lsf sleep root R 1:52 3 linux[01-03]
</pre>
<p>Job 3 and Job 4 have finished and Job 5 is still running on nodes linux[01-03].</p>
<p>The advantage of the consumable resource scheduling policy
is that job throughput can increase dramatically. The overall
productivity of the cluster increases, thereby reducing the amount of
time users have to wait for their jobs to complete and increasing the
overall efficiency of the cluster. The drawback is that users do not
have entire nodes dedicated to their jobs, since they have to share nodes with
other jobs if they do not use all of the resources on the nodes.</p>
<p>We have added an <i>--exclusive</i> switch to srun which allows users
to specify that they would like their allocated
nodes in exclusive mode. For more information see "man srun".
This is useful, for example, when users have MPI/threaded/OpenMP
programs that will take advantage of all the CPUs within a node but need
only one MPI process per node (see the sketch below).</p>
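<p>As a sketch of that use case (the program name, node count, and thread
usage are hypothetical), a hybrid MPI/OpenMP job could request dedicated nodes
so that each MPI task can spawn threads onto all of its node's CPUs:</p>
<pre>
# One MPI task per node on 4 dedicated nodes; each task's threads may
# then use all CPUs of its node.
srun -N 4 -n 4 --exclusive ./hybrid_mpi_openmp_app
</pre>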
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 25 September 2006</p>
<!--#include virtual="footer.txt"-->