| <!--#include virtual="header.txt"--> |
| |
| <h1><a name="top">Sharing Consumable Resources</a></h1> |
| |
| <h2 id="cpu_management">CPU Management |
| <a class="slurm_link" href="#cpu_management"></a> |
| </h2> |
| <P> |
| (Disclaimer: In this "CPU Management" section, the term "consumable resource" |
| does not include memory. The management of memory as a consumable resource is |
| discussed in its own section below.) |
| </P> |
| <P> |
| The per-partition <CODE>OverSubscribe</CODE> setting applies to the <U>entity |
| being selected for scheduling</U>: |
| <UL><LI><P> |
| When the <CODE>select/linear</CODE> plugin is enabled, the |
| per-partition <CODE>OverSubscribe</CODE> setting controls whether or not the |
| <B>nodes</B> are shared among jobs. |
| </P></LI><LI><P> |
| When the default <CODE>select/cons_tres</CODE> plugin is |
| enabled, the per-partition <CODE>OverSubscribe</CODE> setting controls |
| whether or not the <B>configured consumable resources</B> are shared among jobs. |
| When a consumable resource such as a core, |
| socket, or CPU is shared, it means that more than one job can be assigned to it. |
| </P></LI></UL> |
| </P> |
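<P>
As a minimal illustration (the partition and node names below are placeholders,
not recommendations), a <CODE>slurm.conf</CODE> excerpt along these lines enables
the <CODE>select/cons_tres</CODE> plugin and sets a per-partition
<CODE>OverSubscribe</CODE> policy:
</P>
<PRE>
# Illustrative excerpt only; node and partition names are placeholders.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

# Jobs here may share cores, but only if they request it with --oversubscribe.
PartitionName=debug Nodes=node[01-04] OverSubscribe=YES

# Jobs here never share cores.
PartitionName=batch Nodes=node[05-16] OverSubscribe=NO
</PRE>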
| <P> |
The following table describes this behavior in more detail:
| </P> |
| <TABLE CELLPADDING=3 CELLSPACING=1 BORDER=1> |
| <TR><TH>Selection Setting</TH> |
| <TH>Per-partition <CODE>OverSubscribe</CODE> Setting</TH> |
| <TH>Resulting Behavior</TH> |
| </TR><TR> |
| <TD ROWSPAN=3>SelectType=<B>select/linear</B></TD> |
| <TD>OverSubscribe=NO</TD> |
| <TD>Whole nodes are allocated to jobs. No node will run more than one job |
| per partition/queue.</TD> |
| </TR><TR> |
| <TD>OverSubscribe=YES</TD> |
<TD>By default, behaves the same as OverSubscribe=NO. Nodes allocated to a job may
be shared with other jobs if each job allows sharing via the
<CODE>srun --oversubscribe</CODE> option.</TD>
| </TR><TR> |
| <TD>OverSubscribe=FORCE</TD> |
<TD>Each whole node can be allocated to multiple jobs up to the count specified
per partition/queue (default 4 jobs per node).</TD>
| </TR><TR> |
| <TD ROWSPAN=3>SelectType=<B>select/cons_tres</B><BR> |
| Plus one of the following:<BR> |
| SelectTypeParameters=<B>CR_Core</B><BR> |
| SelectTypeParameters=<B>CR_Core_Memory</B></TD> |
| <TD>OverSubscribe=NO</TD> |
| <TD>Cores are allocated to jobs. No core will run more than one job |
| per partition/queue.</TD> |
| </TR><TR> |
| <TD>OverSubscribe=YES</TD> |
<TD>By default, behaves the same as OverSubscribe=NO. Cores allocated to a job may
be shared with other jobs if each job allows sharing via the
<CODE>srun --oversubscribe</CODE> option.</TD>
| </TR><TR> |
| <TD>OverSubscribe=FORCE</TD> |
| <TD>Each core can be allocated to multiple jobs up to the count specified |
| per partition/queue (default 4 jobs per core).</TD> |
| </TR><TR> |
| <TD ROWSPAN=3>SelectType=<B>select/cons_tres</B><BR> |
| Plus one of the following:<BR> |
| SelectTypeParameters=<B>CR_CPU</B><BR> |
| SelectTypeParameters=<B>CR_CPU_Memory</B></TD> |
| <TD>OverSubscribe=NO</TD> |
| <TD>CPUs are allocated to jobs. No CPU will run more than one job |
| per partition/queue.</TD> |
| </TR><TR> |
| <TD>OverSubscribe=YES</TD> |
<TD>By default, behaves the same as OverSubscribe=NO. CPUs allocated to a job may
be shared with other jobs if each job allows sharing via the
<CODE>srun --oversubscribe</CODE> option.</TD>
| </TR><TR> |
| <TD>OverSubscribe=FORCE</TD> |
| <TD>Each CPU can be allocated to multiple jobs up to the count specified |
| per partition/queue (default 4 jobs per CPU).</TD> |
| </TR><TR> |
| <TD ROWSPAN=3>SelectType=<B>select/cons_tres</B><BR> |
| Plus one of the following:<BR> |
| SelectTypeParameters=<B>CR_Socket</B><BR> |
| SelectTypeParameters=<B>CR_Socket_Memory</B></TD> |
| <TD>OverSubscribe=NO</TD> |
<TD>Sockets are allocated to jobs. No socket will run more than one job
per partition/queue.</TD>
| </TR><TR> |
| <TD>OverSubscribe=YES</TD> |
<TD>By default, behaves the same as OverSubscribe=NO. Sockets allocated to a job may
be shared with other jobs if each job allows sharing via the
<CODE>srun --oversubscribe</CODE> option.</TD>
| </TR><TR> |
| <TD>OverSubscribe=FORCE</TD> |
| <TD>Each socket can be allocated to multiple jobs up to the count specified |
| per partition/queue (default 4 jobs per socket).</TD> |
| </TR> |
| </TABLE> |
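<P>
With <CODE>OverSubscribe=YES</CODE>, resources are only shared when every job
involved asks for it. As a sketch of the submission side, assuming a partition
named <CODE>debug</CODE> configured as in the excerpt above:
</P>
<PRE>
# This job is willing to share its allocated cores with other willing jobs:
srun --partition=debug --ntasks=4 --oversubscribe ./my_app

# Submitted without --oversubscribe, this job keeps its cores to itself:
srun --partition=debug --ntasks=4 ./my_app
</PRE>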
| <P>When <CODE>OverSubscribe=FORCE</CODE> is configured, the consumable resources are |
| scheduled for jobs using a <B>least-loaded</B> algorithm. Thus, idle |
| CPUs|cores|sockets will be allocated to a job before busy ones, and |
| CPUs|cores|sockets running one job will be allocated to a job before ones |
| running two or more jobs. This is the same approach that the |
| <CODE>select/linear</CODE> plugin uses when allocating "shared" nodes. |
| </P> |
| <P> |
| Note that the <B>granularity</B> of the "least-loaded" algorithm is what |
| distinguishes the consumable resource and linear plugins |
| when <CODE>OverSubscribe=FORCE</CODE> is configured. With the |
| <CODE>select/cons_tres</CODE> plugin enabled, |
the CPUs of a node are not
overcommitted until the CPUs on <B>all</B> of the other nodes have already been
overcommitted. Thus, if one job allocates half of the CPUs on a node and then a
| second job is submitted that requires more than half of the CPUs, the |
| consumable resource plugin will attempt to place this new job on other |
| busy nodes that have more than half of the CPUs available for use. The |
| <CODE>select/linear</CODE> plugin simply counts jobs on nodes, and does not |
| track the CPU usage on each node. |
| </P><P> |
| The sharing functionality in the |
| <CODE>select/cons_tres</CODE> plugin also supports the |
<CODE>OverSubscribe=FORCE:&lt;num&gt;</CODE> syntax. If <CODE>OverSubscribe=FORCE:3</CODE>
| is configured with a consumable resource plugin and <CODE>CR_Core</CODE> or |
| <CODE>CR_Core_Memory</CODE>, then the plugin will |
| run up to 3 jobs on each <U>core</U> of each node in the partition. If |
| <CODE>CR_Socket</CODE> or <CODE>CR_Socket_Memory</CODE> is configured, then the |
| plugin will run up to 3 jobs on each <U>socket</U> |
| of each node in the partition. |
| </P> |
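<P>
A partition definition along the following lines (the partition and node names
are placeholders) would therefore allow up to 3 jobs per core on its nodes:
</P>
<PRE>
# Illustrative only: up to 3 jobs may be scheduled on each core of these nodes.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
PartitionName=shared Nodes=node[01-08] OverSubscribe=FORCE:3
</PRE>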
| <h2 id="multiple_partitions">Nodes in Multiple Partitions |
| <a class="slurm_link" href="#multiple_partitions"></a> |
| </h2> |
| <P> |
| Slurm has supported configuring nodes in more than one partition since version |
| 0.7.0. The following table describes how nodes configured in two partitions with |
| different <CODE>OverSubscribe</CODE> settings will be allocated to jobs. Note that |
| "shared" jobs are jobs that are submitted to partitions configured with |
| <CODE>OverSubscribe=FORCE</CODE> or with <CODE>OverSubscribe=YES</CODE> and the job requested |
| sharing with the <CODE>srun --oversubscribe</CODE> option. Conversely, "non-shared" |
| jobs are jobs that are submitted to partitions configured with |
| <CODE>OverSubscribe=NO</CODE> or <CODE>OverSubscribe=YES</CODE> and the job did <U>not</U> |
| request shareable resources. |
| </P> |
| <TABLE CELLPADDING=3 CELLSPACING=1 BORDER=1> |
| <TR><TH> </TH><TH>First job "shareable"</TH><TH>First job not |
| "shareable"</TH></TR> |
| <TR><TH>Second job "shareable"</TH><TD>Both jobs can run on the same nodes and |
| may share resources</TD><TD>Jobs do not run on the same nodes</TD></TR> |
| <TR><TH>Second job not "shareable"</TH><TD>Jobs do not run on the same nodes</TD> |
| <TD>Jobs can run on the same nodes but will not share resources</TD></TR> |
| </TABLE> |
| <P> |
| The next table contains several scenarios with the <CODE>select/cons_tres</CODE> |
| plugin enabled to further |
| clarify how a node is used when it is configured in more than one partition and |
the partitions have different <CODE>OverSubscribe</CODE> policies.
| </P> |
| <TABLE CELLPADDING=3 CELLSPACING=1 BORDER=1> |
| <TR><TH>Slurm configuration</TH> |
| <TH>Resulting Behavior</TH> |
| </TR><TR> |
| <TD>Two <CODE>OverSubscribe=NO</CODE> partitions assigned the same set of nodes</TD> |
<TD>Jobs from either partition will be assigned available consumable
resources. No consumable resource will be shared. One node could have 2 jobs
running on it, and each job could be from a different partition.</TD>
| </TR><TR> |
| <TD>Two partitions assigned the same set of nodes: one partition is |
| <CODE>OverSubscribe=FORCE</CODE>, and the other is <CODE>OverSubscribe=NO</CODE></TD> |
| <TD>A node will only run jobs from one partition at a time. If a node is |
| running jobs from the <CODE>OverSubscribe=NO</CODE> partition, then none of its |
| consumable resources will be shared. If a node is running jobs from the |
| <CODE>OverSubscribe=FORCE</CODE> partition, then its consumable resources can be |
| shared.</TD> |
| </TR><TR> |
| <TD>Two <CODE>OverSubscribe=FORCE</CODE> partitions assigned the same set of nodes</TD> |
| <TD>Jobs from either partition will be assigned consumable resources. All |
| consumable resources can be shared. One node could have 2 jobs running on it, |
| and each job could be from a different partition.</TD> |
| </TR><TR> |
| <TD>Two partitions assigned the same set of nodes: one partition is |
| <CODE>OverSubscribe=FORCE:3</CODE>, and the other is <CODE>OverSubscribe=FORCE:5</CODE></TD> |
| <TD>Generally the same behavior as above. However no consumable resource will |
| ever run more than 3 jobs from the first partition, and no consumable resource |
| will ever run more than 5 jobs from the second partition. A consumable resource |
| could have up to 8 jobs running on it at one time.</TD> |
| </TR> |
| </TABLE> |
| <P> |
| Note that the "mixed shared setting" configuration (row #2 above) introduces the |
possibility of <B>starvation</B> between jobs in each partition. If a set of
nodes is running jobs from the <CODE>OverSubscribe=NO</CODE> partition, then these
nodes will remain available only to jobs from that partition, even if
| jobs submitted to the <CODE>OverSubscribe=FORCE</CODE> partition have a higher |
| priority. This works in reverse also, and in fact it's easier for jobs from the |
| <CODE>OverSubscribe=FORCE</CODE> partition to hold onto the nodes longer because the |
| consumable resource "sharing" provides more resource availability for new jobs |
| to begin running "on top of" the existing jobs. This happens with the |
| <CODE>select/linear</CODE> plugin also, so it's not specific to the |
| <CODE>select/cons_tres</CODE> plugin. |
| </P> |
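<P>
The "mixed shared setting" of row #2 could be produced by a configuration along
these lines (partition and node names are placeholders); the starvation behavior
described above applies to whichever partition's jobs reach a node first:
</P>
<PRE>
# Both partitions contain the same nodes; names are illustrative only.
PartitionName=exclusive Nodes=node[01-08] OverSubscribe=NO
PartitionName=shared    Nodes=node[01-08] OverSubscribe=FORCE
</PRE>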
| |
| <h2 id="memory_management">Memory Management |
| <a class="slurm_link" href="#memory_management"></a> |
| </h2> |
| <P> |
Memory can also be managed as a consumable resource. This prevents
oversubscription of memory, which would result in memory pages being swapped
out and severely degraded performance.
| </P> |
| <TABLE CELLPADDING=3 CELLSPACING=1 BORDER=1> |
| <TR><TH>Selection Setting</TH> |
| <TH>Resulting Behavior</TH> |
| </TR><TR> |
| <TD>SelectType=<B>select/linear</B></TD> |
| <TD>Memory allocation is not tracked. Jobs are allocated to nodes without |
| considering if there is enough free memory. Swapping could occur!</TD> |
| </TR><TR> |
| <TD>SelectType=<B>select/linear</B> plus<BR> |
| SelectTypeParameters=<B>CR_Memory</B></TD> |
| <TD>Memory allocation is tracked. Nodes that do not have enough available |
memory to meet the job's memory requirement will not be allocated to the job.
| </TD> |
| </TR><TR> |
| <TD>SelectType=<B>select/cons_tres</B><BR> |
| Plus one of the following:<BR> |
| SelectTypeParameters=<B>CR_Core</B><BR> |
| SelectTypeParameters=<B>CR_CPU</B><BR> |
| SelectTypeParameters=<B>CR_Socket</B></TD> |
| <TD>Memory allocation is not tracked. Jobs are allocated to consumable resources |
| without considering if there is enough free memory. Swapping could occur!</TD> |
| </TR><TR> |
| <TD>SelectType=<B>select/cons_tres</B><BR> |
| Plus one of the following:<BR> |
| SelectTypeParameters=<B>CR_Core_Memory</B><BR> |
| SelectTypeParameters=<B>CR_CPU_Memory</B><BR> |
| SelectTypeParameters=<B>CR_Socket_Memory</B></TD> |
<TD>Memory allocation for all jobs is tracked. Nodes that do not have enough
available memory to meet the job's memory requirement will not be allocated to
| the job.</TD> |
| </TR> |
| </TABLE> |
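<P>
As a sketch, tracking memory requires one of the <CODE>*_Memory</CODE>
parameters and typically a meaningful <CODE>RealMemory</CODE> value on each
node (the node names and memory size below are placeholders):
</P>
<PRE>
# Track cores and memory; jobs are not placed on nodes that cannot
# satisfy their memory request.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# RealMemory (in megabytes) tells Slurm how much memory each node has.
NodeName=node[01-16] RealMemory=192000
</PRE>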
<P>Users can specify their job's memory requirements in one of two ways. The
<CODE>srun --mem=&lt;num&gt;</CODE> option can be used to specify the job's
memory requirement on a per-allocated-node basis. This option is recommended
for use with the <CODE>select/linear</CODE> plugin, which allocates
whole nodes to jobs. The
<CODE>srun --mem-per-cpu=&lt;num&gt;</CODE> option can be used to specify the
job's memory requirement on a per-allocated-CPU basis. This is recommended
| for use with the <CODE>select/cons_tres</CODE> |
| plugin, which can allocate individual CPUs to jobs.</P> |
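<P>
As a sketch of the two styles (the memory sizes here are arbitrary):
</P>
<PRE>
# Per-node request, natural with select/linear (whole-node allocations):
srun --nodes=2 --mem=16G ./my_app

# Per-CPU request, natural with select/cons_tres (per-CPU allocations):
srun --ntasks=8 --mem-per-cpu=2G ./my_app
</PRE>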
| |
| <P>Default and maximum values for memory on a per node or per CPU basis can |
| be configured by the system administrator using the following |
| <CODE>slurm.conf</CODE> options: <CODE>DefMemPerCPU</CODE>, |
| <CODE>DefMemPerNode</CODE>, <CODE>MaxMemPerCPU</CODE> and |
| <CODE>MaxMemPerNode</CODE>. |
| Users can use the <CODE>--mem</CODE> or <CODE>--mem-per-cpu</CODE> option |
| at job submission time to override the default value, but they cannot exceed |
| the maximum value. |
</P>
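<P>
For example, per-CPU defaults and ceilings might be set as follows (the values
are arbitrary and given in megabytes):
</P>
<PRE>
DefMemPerCPU=2048
MaxMemPerCPU=4096
</PRE>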
<P>
Enforcement of a job's memory allocation is performed by setting the "maximum
| data segment size" and the "maximum virtual memory size" system limits to the |
| appropriate values before launching the tasks. Enforcement is also managed by |
| the accounting plugin, which periodically gathers data about running jobs. Set |
<CODE>JobAcctGatherType</CODE> and <CODE>JobAcctGatherFrequency</CODE> to
| values suitable for your system.</P> |
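<P>
A typical accounting setup might look like the following sketch (the sampling
interval is arbitrary):
</P>
<PRE>
# Sample task usage through the Linux /proc interface every 30 seconds.
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=task=30
</PRE>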
| <P> |
| <B>NOTE</B>: The <CODE>--oversubscribe</CODE> and <CODE>--exclusive</CODE> |
| options are mutually exclusive when used at job submission. If both options are |
set when submitting a job, the job submission command will exit with a fatal error.
| </P> |
| |
| <p style="text-align:center;">Last modified 09 July 2025</p> |
| |
| <!--#include virtual="footer.txt"--> |