| <!--#include virtual="header.txt"--> |
| |
| <h1>Support for Multi-core/Multi-thread Architectures</h1> |
| |
| <b>Note:</b> This document describes features added to SLURM version 1.2. |
| |
| <h2>Contents</h2> |
| <UL> |
| <LI> <a href=#defs>Definitions</a> |
| <LI> <a href=#flags>Overview of new srun flags</a> |
| <LI> <a href=#motivation>Motivation behind high-level srun flags</a> |
| <LI> <a href=#utilities>Extensions to sinfo/squeue/scontrol</a> |
| <LI> <a href=#config>Configuration settings in slurm.conf</a> |
| </UL> |
| |
| <!--------------------------------------------------------------------------> |
| <a name=defs> |
| <h2>Definitions</h2></a> |
| |
| <P> <b>Socket/Core/Thread</b> - Figure 1 illustrates the notion of |
| Socket, Core and Thread as it is defined in SLURM's multi-core/multi-thread |
| support documentation.</p> |
| |
| <center> |
| <img src="mc_support.gif"> |
| <br> |
| Figure 1: Definitions of Socket, Core, & Thread |
| </center> |
| |
| <dl> |
| <dt><b>Affinity</b> |
| <dd>The state of being bound to a specific logical processor. |
| <dt><b>Affinity Mask</b> |
| <dd>A bitmask where indices correspond to logical processors. |
| The least significant bit corresponds to the first |
| logical processor number on the system, while the most |
| significant bit corresponds to the last logical processor |
| number on the system. |
| A '1' in a given position indicates a process can run |
| on the associated logical processor. |
| <dt><b>Fat Masks</b> |
| <dd>Affinity masks with more than 1 bit set |
| allowing a process to run on more than one logical processor. |
| </dl> |
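<p>For example, on a hypothetical node with four logical processors, the
affinity mask 0x5 is a fat mask permitting a process to run on logical
processors 0 and 2:</p>

<PRE>
0x5 = binary 0101
             ||||
             |||+--- logical processor 0 (can run here)
             ||+---- logical processor 1 (cannot)
             |+----- logical processor 2 (can run here)
             +------ logical processor 3 (cannot)
</PRE>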
| |
| <!--------------------------------------------------------------------------> |
| <a name=flags> |
| <h2>Overview of new srun flags</h2></a> |
| |
| <p> Several new flags have been defined to allow users to |
| better take advantage of the new architecture by |
| explicitly specifying the number of sockets, cores, and threads required |
| by their application. Table 1 summarizes the new multi-core flags. |
| |
| <P> |
| <table border=1 cellspacing=0 cellpadding=0> |
| <tr><td colspan=2> |
| <b><a href="#srun_lowlevelmc">Low-level (explicit binding)</a></b> |
| </td></tr> |
| <tr> |
| <td> --cpu_bind=... </td> |
| <td>Explicit process affinity binding and control options |
| </td></tr> |
| <tr><td colspan=2> |
| <b><a href="#srun_highlevelmc">High-level (automatic mask generation)</a></b> |
| </td></tr> |
| <tr> |
| <td> --sockets-per-node=<i>S</i></td> |
| <td>Number of sockets in a node to dedicate to a job (minimum or range) |
| </td></tr> |
| <tr> |
| <td> --cores-per-socket=<i>C</i></td> |
<td> Number of cores in a socket to dedicate to a job (minimum or range)
| </td></tr> |
| <tr> |
| <td> --threads-per-core=<i>T</i></td> |
| <td> Number of threads in a core |
| to dedicate to a job (minimum or range) |
| </td></tr> |
| <tr> |
| <td> -B <i>S[:C[:T]]</i></td> |
<td> Combined shortcut option for --sockets-per-node, --cores-per-socket, --threads-per-core
| </td></tr> |
| <tr><td colspan=2> |
| <b><a href="#srun_dist">New Distributions</b> |
| </td></tr> |
| <tr> |
| <td> -m / --distribution </td> |
| <td> Distributions of: block | cyclic | hostfile |
| | <a href="dist_plane.html"><u>plane=<i>x</i></u></a> |
| | <u>[block|cyclic]:[block|cyclic]</u> |
| </td></tr> |
| <tr><td colspan=2> |
| <b><a href="#srun_constraints">New Constraints</b> |
| </td></tr> |
| <tr> |
| <td> --minsockets=<i>MinS</i></td> |
| <td> Nodes must meet this minimum number of sockets |
| </td></tr> |
| <tr> |
| <td> --mincores=<i>MinC</i></td> |
| <td> Nodes must meet this minimum number of cores per socket |
| </td></tr> |
| <tr> |
| <td> --minthreads=<i>MinT</i></td> |
| <td> Nodes must meet this minimum number of threads per core |
| </td></tr> |
| <tr><td colspan=2> |
| <b><a href="#srun_consres">Memory as a consumable resource</a></b> |
| </td></tr> |
| <tr> |
| <td> --job-mem=<i>mem</i></td> |
| <td> maximum amount of real memory per node required by the job. |
| </td></tr> |
| <tr><td colspan=2> |
| <b><a href="#srun_ntasks">Task invocation control</a></b> |
| </td></tr> |
| <tr> |
| <td> --ntasks-per-node=<i>ntasks</i></td> |
| <td> number of tasks to invoke on each node |
| </td></tr> |
<tr>
<td> --ntasks-per-socket=<i>ntasks</i></td>
| <td> number of tasks to invoke on each socket |
| </td></tr> |
<tr>
<td> --ntasks-per-core=<i>ntasks</i></td>
| <td> number of tasks to invoke on each core |
| </td></tr> |
| <tr><td colspan=2> |
| <b><a href="#srun_hints">Application hints</a></b> |
| </td></tr> |
| <tr> |
| <td> --hint=compute_bound</td> |
| <td> use all cores in each physical CPU |
| </td></tr> |
<tr>
<td> --hint=memory_bound</td>
| <td> use only one core in each physical CPU |
| </td></tr> |
<tr>
<td> --hint=[no]multithread</td>
| <td> [don't] use extra threads with in-core multi-threading |
| </td></tr> |
| </table> |
| |
| <p> |
| <center> |
| Table 1: New srun flags to support the multi-core/multi-threaded environment |
| </center> |
| |
<p>It is important to note that many of these
flags are only meaningful if the processes' affinity is set. In order for
the affinity to be set, the task/affinity plugin must first be enabled in
slurm.conf:
| |
| <PRE> |
| TaskPlugin=task/affinity # enable task affinity |
| </PRE> |
| |
| <p>See the "Task Launch" section if generating slurm.conf via |
| <a href="configurator.html">configurator.html</a>. |
| |
| <a name="srun_lowlevelmc"> |
| <h3>Low-level --cpu_bind=... - Explicit binding interface</h3></a> |
| |
| <p>The following srun flag provides a low-level core binding interface:</p> |
| |
| |
| <PRE> |
| --cpu_bind= Bind tasks to CPUs |
| q[uiet] quietly bind before task runs (default) |
| v[erbose] verbosely report binding before task runs |
| no[ne] don't bind tasks to CPUs (default) |
| rank bind by task rank |
map_cpu:<i>&lt;list&gt;</i> specify a CPU ID binding for each task
where <i>&lt;list&gt;</i> is <i>&lt;cpuid1&gt;,&lt;cpuid2&gt;,...&lt;cpuidN&gt;</i>
mask_cpu:<i>&lt;list&gt;</i> specify a CPU ID binding mask for each task
where <i>&lt;list&gt;</i> is <i>&lt;mask1&gt;,&lt;mask2&gt;,...&lt;maskN&gt;</i>
| sockets auto-generated masks bind to sockets |
| cores auto-generated masks bind to cores |
| threads auto-generated masks bind to threads |
| help show this help message |
| </PRE> |
| |
<p> The affinity can be set either to a specific logical processor
(socket, core, or thread) or at a coarser granularity than the lowest level
of logical processor (core or thread).
In the latter case the processes are allowed to roam within a specific
socket or core.
| |
| <p>Examples:</p> |
| |
| <UL> |
| <UL> |
| <LI> srun -n 8 -N 4 --cpu_bind=mask_cpu:0x1,0x4 a.out |
| <LI> srun -n 8 -N 4 --cpu_bind=mask_cpu:0x3,0xD a.out |
| </UL> |
| </UL> |
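<p>In the first example above, each of the two tasks per node is pinned to a
single logical processor, while in the second example the fat masks allow each
task to roam over several processors. A sketch of how the hexadecimal masks
decode (the least significant bit is logical processor 0):</p>

<PRE>
mask_cpu:0x1,0x4   task 0 -> 0x1 (processor 0)
                   task 1 -> 0x4 (processor 2)
mask_cpu:0x3,0xD   task 0 -> 0x3 (processors 0,1 - a fat mask)
                   task 1 -> 0xD (processors 0,2,3 - a fat mask)
</PRE>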
| |
| <p>See also 'srun --cpu_bind=help' and 'man srun'</p> |
| |
| <a name="srun_highlevelmc"> |
| <h3>High-level -B <i>S[:C[:T]]</i> - Automatic mask generation interface</h3></a> |
| |
| <p>We have updated the node |
| selection infrastructure with a mechanism that allows selection of logical |
| processors at a finer granularity. Users are able to request a specific number |
| of nodes, sockets, cores, and threads:</p> |
| |
| <PRE> |
| -B --extra-node-info=<i>S[:C[:T]]</i> Expands to: |
| --sockets-per-node=<i>S</i> number of sockets per node to allocate |
| --cores-per-socket=<i>C</i> number of cores per socket to allocate |
| --threads-per-core=<i>T</i> number of threads per core to allocate |
| each field can be 'min[-max]' or wildcard '*' |
| |
| <font face="serif">Total cpus requested = (<i>Nodes</i>) x (<i>S</i> x <i>C</i> x <i>T</i>)</font> |
| </PRE> |
| |
| <p>Examples: |
| |
| <UL> |
| <UL> |
| <LI> srun -n 8 -N 4 -B 2:1 a.out |
| <LI> srun -n 8 -N 4 -B 2 a.out |
| <BR> |
| note: compare the above with the previous corresponding --cpu_bind=... examples |
| |
| <LI> srun -n 16 -N 4 a.out |
| <LI> srun -n 16 -N 4 -B 2:2:1 a.out |
| <LI> srun -n 16 -N 4 -B 2-2:2-2:1-1 a.out |
| <BR> or |
| <LI> srun -n 16 -N 4 --sockets-per-node 2-2 --cores-per-socket 2-2 --threads-per-core 1-1 a.out |
| <LI> srun -n 16 -N 2-4 -B '1-2:*:1' a.out |
<LI> srun -n 16 -N 4-4 -B '2:*:1-1' a.out
<LI> srun -n 16 -N 4-4 -B '1-1:1-1' a.out
| </UL> |
| </UL> |
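<p>Checking the requests against the total CPU formula above confirms that the
-B 2:2:1 examples are self-consistent:</p>

<PRE>
Total cpus requested = (Nodes) x (S x C x T) = 4 x (2 x 2 x 1) = 16   # matches -n 16
</PRE>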
| |
| <p>Notes:</p> |
| <ul> |
<li> Adding --cpu_bind=no to the command line will cause the processes
not to be bound to the logical processors.
<li> Adding --cpu_bind=verbose to the command line (or setting the
SLURM_CPU_BIND environment variable to "verbose") will cause each task
to report the affinity mask in use.
| <li> Binding is on by default when -B is used. The default binding on |
| multi-core/multi-threaded systems is equivalent to the level of |
| resource enumerated in the -B option. |
| </ul> |
| |
| <p>See also 'srun --help' and 'man srun'</p> |
| |
| <a name="srun_dist"> |
| <h3>New distributions: Extensions to -m / --distribution</h3></a> |
| |
| <p>The -m / --distribution option for distributing processes across nodes |
| has been extended to also describe the distribution within the lowest level |
| of logical processors. |
| Available distributions include: |
| <br> |
block | cyclic | hostfile | <u>plane=<i>x</i></u> | <u>[block|cyclic]:[block|cyclic]</u>
| </p> |
| |
| <p>The new <A HREF="dist_plane.html">plane distribution</A> (plane=<i>x</i>) |
| results in a block cyclic distribution of blocksize equal to <i>x</i>. |
| In the following we use "lowest level of logical processors" |
| to describe sockets, cores or threads depending of the architecture. |
| The new distribution divides |
| the cluster into planes (including a number of the lowest level of logical |
| processors on each node) and then schedule first within each plane and then |
| across planes.</p> |
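<p>For instance, with 16 tasks on four nodes, plane=2 would lay the tasks out
in blocks of two, round-robin across the nodes (an illustrative sketch; see the
plane distribution page for the full description):</p>

<PRE>
node 0: tasks 0, 1,  8,  9
node 1: tasks 2, 3, 10, 11
node 2: tasks 4, 5, 12, 13
node 3: tasks 6, 7, 14, 15
</PRE>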
| |
| <p>For the two dimensional distributions ([block|cyclic]:[block|cyclic]), |
| the second distribution (after ":") allows users to specify a distribution |
| method for processes within a node and applies to the lowest level of logical |
processors (socket, core, or thread depending on the architecture).</p>
| |
<p>The binding is enabled automatically when the high-level flags are used,
as long as the task/affinity plugin is enabled.
To disable binding at the job level use --cpu_bind=no.</p>
| |
| <p>The distribution flags can be combined with the other switches: |
| |
| <UL> |
| <UL> |
| <LI>srun -n 16 -N 4 -B '2:*:1' -m block:cyclic --cpu_bind=socket a.out |
| <LI>srun -n 16 -N 4 -B '2:*:1' -m plane=2 --cpu_bind=core a.out |
| <LI>srun -n 16 -N 4 -B '2:*:1' -m plane=2 a.out |
| </UL> |
| </UL> |
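<p>As an illustration of the first command above (a sketch, assuming
dual-socket nodes): the block distribution places tasks 0-3 on the first node,
the cyclic distribution within the node alternates those tasks between the two
sockets, and --cpu_bind=socket lets each task roam within its socket:</p>

<PRE>
node 0: task 0 -> socket 0    task 1 -> socket 1
        task 2 -> socket 0    task 3 -> socket 1
(nodes 1-3 are laid out the same way with tasks 4-7, 8-11, and 12-15)
</PRE>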
| |
| <p>The default distribution on multi-core/multi-threaded systems is equivalent |
| to -m block:cyclic with --cpu_bind=thread.</p> |
| |
| <p>See also 'srun --help'</p> |
| |
| <a name="srun_constraints"> |
| <h3>New Constraints</h3></a> |
| |
<p>To complement the existing SLURM job minimum constraints
| (CPUs, memory, temp disk), |
| constraint flags have also been added to allow a user to |
| specify a minimum number of sockets, cores, or threads:</p> |
| |
| <PRE> |
| --mincpus=<i>n</i> minimum number of logical cpus per node |
| <u>--minsockets=<i>n</i></u> minimum number of sockets per node |
<u>--mincores=<i>n</i></u> minimum number of cores per socket
| <u>--minthreads=<i>n</i></u> minimum number of threads per core |
| --mem=<i>MB</i> minimum amount of real memory |
| --tmp=<i>MB</i> minimum amount of temporary disk |
| </PRE> |
| |
| <p>These constraints are separate from the -N or -B allocation minimums. |
| Using these constraints allows the user to exclude smaller nodes from |
| the allocation request.</p> |
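<p>For example, a minimal sketch of excluding any node that has fewer than two
sockets or fewer than two cores per socket:</p>

<PRE>
% srun -n 16 --minsockets=2 --mincores=2 a.out
</PRE>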
| |
| <p>See also 'srun --help' and 'man srun'</p> |
| |
| <a name="srun_consres"> |
| <h3>Memory as a Consumable Resource</h3></a> |
| |
| <p>The --job-mem flag specifies the maximum amount of memory in MB |
| needed by the job per node. This flag is used to support the memory |
| as a consumable resource allocation strategy.</p> |
| |
| <PRE> |
| --job-mem=<i>MB</i> maximum amount of real memory per node |
| required by the job. |
| --mem >= --job-mem if --mem is specified. |
| </PRE> |
| |
<p>This flag allows the scheduler to co-allocate jobs on specific nodes
given that their combined memory requirements do not exceed the amount
of memory on the nodes.</p>
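<p>As a sketch of the co-allocation arithmetic (assuming nodes with roughly
2 GB of real memory, as in the sinfo examples later in this document, and the
consumable resource configuration shown below):</p>

<PRE>
% srun -n 1 --job-mem=1000 a.out &amp;   # job 1: 1000 MB allocated on the node
% srun -n 1 --job-mem=1000 a.out &amp;   # job 2: 2000 MB total still fits; co-allocated
% srun -n 1 --job-mem=1000 a.out &amp;   # job 3: would exceed the node's memory; placed elsewhere
</PRE>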
| |
| |
<p>In order to use memory as a consumable resource, the select/cons_res
plugin must first be enabled in slurm.conf:
| <PRE> |
| SelectType=select/cons_res # enable consumable resources |
| SelectTypeParameters=CR_Memory # memory as a consumable resource |
| </PRE> |
| |
| <p> Using memory as a consumable resource can also be combined with |
| the CPU, Socket, or Core consumable resources via: |
| <PRE> |
| CR_CPU_Memory, CR_Socket_Memory, CR_Core_Memory |
| </PRE> |
| |
| <p>See the "Resource Selection" section if generating slurm.conf |
| via <a href="configurator.html">configurator.html</a>. |
| |
| <p>See also 'srun --help' and 'man srun'</p> |
| |
| <a name="srun_ntasks"> |
| <h3>Task invocation as a function of logical processors</h3></a> |
| |
| <p>The <tt>--ntasks-per-{node,socket,core}=<i>ntasks</i></tt> flags |
| allow the user to request that no more than <tt><i>ntasks</i></tt> |
| be invoked on each node, socket, or core. |
This is similar to using <tt>--cpus-per-task=<i>ncpus</i></tt>
but does not require knowledge of the actual number of cpus on
each node. In some cases, it is more convenient to be able to
request that no more than a specific number of tasks be invoked
on each node, socket, or core. Examples of this include submitting
a hybrid MPI/OpenMP application where only one MPI "task/rank" should be
assigned to each node while allowing the OpenMP portion to utilize
all of the parallelism present in the node, or submitting a single
setup/cleanup/monitoring job to each node of a pre-existing
allocation as one step in a larger job script.
| This can now be specified via the following flags:</p> |
| |
| <PRE> |
| --ntasks-per-node=<i>n</i> number of tasks to invoke on each node |
| --ntasks-per-socket=<i>n</i> number of tasks to invoke on each socket |
| --ntasks-per-core=<i>n</i> number of tasks to invoke on each core |
| </PRE> |
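<p>For instance, the hybrid MPI/OpenMP scenario mentioned above might be
launched as follows (a sketch; it assumes the application honors
OMP_NUM_THREADS and that each node has four cores):</p>

<PRE>
% OMP_NUM_THREADS=4 srun -n 4 -N 4 --ntasks-per-node=1 a.out
    # one MPI rank per node; the OpenMP portion of each rank
    # can then use all four cores of its node
</PRE>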
| |
| <p>For example, given a cluster with nodes containing two sockets, |
| each containing two cores, the following commands illustrate the |
| behavior of these flags:</p> |
| <pre> |
| % srun -n 4 hostname |
| hydra12 |
| hydra12 |
| hydra12 |
| hydra12 |
| % srun -n 4 --ntasks-per-node=1 hostname |
| hydra12 |
| hydra14 |
| hydra15 |
| hydra13 |
| % srun -n 4 --ntasks-per-node=2 hostname |
| hydra12 |
| hydra13 |
| hydra13 |
| hydra12 |
| % srun -n 4 --ntasks-per-socket=1 hostname |
| hydra12 |
| hydra13 |
| hydra13 |
| hydra12 |
| % srun -n 4 --ntasks-per-core=1 hostname |
| hydra12 |
| hydra12 |
| hydra12 |
| hydra12 |
| </pre> |
| |
| <p>See also 'srun --help' and 'man srun'</p> |
| |
| <a name="srun_hints"> |
| <h3>Application hints</h3></a> |
| |
| <p>Different applications will have various levels of resource |
| requirements. Some applications tend to be computationally intensive |
| but require little to no inter-process communication. Some applications |
| will be memory bound, saturating the memory bandwidth of a processor |
| before exhausting the computational capabilities. Other applications |
| will be highly communication intensive causing processes to block |
| awaiting messages from other processes. Applications with these |
| different properties tend to run well on a multi-core system given |
| the right mappings.</p> |
| |
| <p>For computationally intensive applications, all cores in a multi-core |
| system would normally be used. For memory bound applications, only |
| using a single core on each CPU will result in the highest per |
| core memory bandwidth. For communication intensive applications, |
| using in-core multi-threading (e.g. hyperthreading, SMT, or TMT) |
| may also improve performance. |
| The following command line flags can be used to communicate these |
| types of application hints to the SLURM multi-core support:</p> |
| |
| <PRE> |
| --hint= Bind tasks according to application hints |
| compute_bound use all cores in each physical CPU |
| memory_bound use only one core in each physical CPU |
| [no]multithread [don't] use extra threads with in-core multi-threading |
| help show this help message |
| </PRE> |
| |
| <p>For example, given a cluster with nodes containing two sockets, |
| each containing two cores, the following commands illustrate the |
| behavior of these flags. In the verbose --cpu_bind output, tasks |
| are described as 'hostname, task Global_ID Local_ID [PID]':</p> |
| <pre> |
| % srun -n 4 --hint=compute_bound --cpu_bind=verbose sleep 1 |
| cpu_bind=MASK - hydra12, task 0 0 [15425]: mask 0x1 set |
| cpu_bind=MASK - hydra12, task 1 1 [15426]: mask 0x4 set |
| cpu_bind=MASK - hydra12, task 2 2 [15427]: mask 0x2 set |
| cpu_bind=MASK - hydra12, task 3 3 [15428]: mask 0x8 set |
| |
| % srun -n 4 --hint=memory_bound --cpu_bind=verbose sleep 1 |
| cpu_bind=MASK - hydra12, task 0 0 [15550]: mask 0x1 set |
| cpu_bind=MASK - hydra12, task 1 1 [15551]: mask 0x4 set |
| cpu_bind=MASK - <u><b>hydra13</b></u>, task 2 <u><b>0</b></u> [14974]: mask 0x1 set |
| cpu_bind=MASK - <u><b>hydra13</b></u>, task 3 <u><b>1</b></u> [14975]: mask 0x4 set |
| </pre> |
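<p>Decoding the masks (assuming block core numbering, with socket 0 holding
processors {0,1} and socket 1 holding {2,3}): compute_bound uses all four
cores of hydra12, while memory_bound uses only one core per socket, so the
last two tasks spill over onto hydra13:</p>

<PRE>
compute_bound: hydra12 masks 0x1,0x4,0x2,0x8 -> processors 0,2,1,3 (all cores)
memory_bound:  hydra12 masks 0x1,0x4 -> processors 0,2 (one core per socket)
               hydra13 masks 0x1,0x4 -> tasks 2 and 3 bind the same way there
</PRE>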
| |
| <p>See also 'srun --hint=help' and 'man srun'</p> |
| <!--------------------------------------------------------------------------> |
| <a name=motivation> |
| <h2>Motivation behind high-level srun flags</h2></a> |
| |
<p>The motivation behind allowing users to use the higher-level <i>srun</i>
flags instead of --cpu_bind is that the latter can be difficult to use. The
high-level flags are easier to use than --cpu_bind because:</p>
| |
| <ul> |
| <li>Affinity mask generation happens automatically when using the high-level flags. </li> |
<li>The combination of the -B and --distribution flags is far shorter and
less complex than the equivalent --cpu_bind masks or maps, making the
high-level flags much easier to use.</li>
| </ul> |
| |
<p>Also, as illustrated in the example below, it is much simpler to specify
a different layout using the high-level flags, since users do not have to
recalculate mask or CPU IDs: there are no masks or maps to rearrange.</p>
| |
<p>Given a 32-process MPI job and a four-node cluster of quad-processor
dual-core nodes, we want to use a block distribution across the four nodes and then a
cyclic distribution within the node across the physical processors. We have had
several requests from users that they would like this distribution to be the
default distribution on multi-core clusters. Below we show how to obtain the
desired layout using 1) the new high-level flags and 2) --cpu_bind.</p>
| |
| <h3>High-Level flags</h3> |
| |
| <p>Using SLURM's new high-level flag, users can obtain the above layout with:</p> |
| |
| <DL> |
| <DL> |
| <DT><p>% mpirun -srun -n 32 -N 4 -B 4:2 --distribution=block:cyclic a.out</p> |
| or |
| <DT><p>% mpirun -srun -n 32 -N 4 -B 4:2 a.out</p> |
| </DL> |
| </DL> |
| |
| <p>(since --distribution=block:cyclic is the default distribution)</p> |
| |
| <p>The cores are shown as c0 and c1 and the processors are shown |
| as p0 through p3. The resulting task IDs are: </p> |
| |
| <table border=0 align=center> |
| <tr> |
| <td> |
| <table border=1 align=center> |
| <tr><td> </td> |
| <td width=40 align=center>c0</td><td width=40 align=center>c1</td></tr> |
| <tr><td>p0</td><td align=center> 0 </td><td align=center> 4 </td></tr> |
| <tr><td>p2</td><td align=center> 2 </td><td align=center> 6 </td></tr> |
| </table> |
| </td> |
| <td width=5> |
| </td> |
| <td> |
| <table border=1 align=center> |
| <tr><td> </td> |
| <td width=40 align=center>c0</td><td width=40 align=center>c1</td></tr> |
| <tr><td>p1</td><td align=center> 1 </td><td align=center> 5 </td></tr> |
| <tr><td>p3</td><td align=center> 3 </td><td align=center> 7 </td></tr> |
| </table> |
| </td> |
| </tr> |
| </table> |
| |
<p>The computation and assignment of the task IDs is transparent
to the user. Users don't have to worry about the core numbering or about
setting any CPU affinities. By default, CPU affinity
will be set when using the multi-core supporting flags. </p>
| |
| <h3>Low-level flag --cpu_bind</h3> |
| |
<p>Using SLURM's --cpu_bind flag, users must compute the CPU IDs or
masks as well as make sure they understand the core numbering on their
system. Another problem arises when core numbering is not the same on all
nodes. The --cpu_bind option only allows users to specify a single
mask for all the nodes. Using the SLURM high-level flags removes this limitation
since SLURM will correctly generate the appropriate masks for each requested node.</p>
| |
<h3>On a four quad-processor dual-core node cluster with core block numbering</h3>
| |
| <p>The cores are shown as c0 and c1 and the processors are shown |
| as p0 through p3. The CPU IDs within a node in the block numbering are: |
| (this information is available from the /proc/cpuinfo file on the system)</p> |
| |
| <table border=0 align=center> |
| <tr> |
| <td> |
| <table border=1 align=center> |
| <tr><td> </td> |
| <td width=40 align=center>c0</td><td width=40 align=center>c1</td></tr> |
| <tr><td>p0</td><td align=center> 0 </td><td align=center> 1 </td></tr> |
| <tr><td>p2</td><td align=center> 4 </td><td align=center> 5 </td></tr> |
| </table> |
| </td> |
| <td width=5> |
| </td> |
| <td> |
| <table border=1 align=center> |
| <tr><td> </td> |
| <td width=40 align=center>c0</td><td width=40 align=center>c1</td></tr> |
| <tr><td>p1</td><td align=center> 2 </td><td align=center> 3 </td></tr> |
| <tr><td>p3</td><td align=center> 6 </td><td align=center> 7 </td></tr> |
| </table> |
| </td> |
| </tr> |
| </table> |
| |
<p>resulting in the following mapping for processors/cores
and task IDs, which users need to calculate:</p>
| |
| <center> |
| mapping for processors/cores |
| </center> |
| <table border=0 align=center> |
| <tr> |
| <td> |
| <table border=1 align=center> |
| <tr><td> </td> |
| <td width=40 align=center>c0</td><td width=40 align=center>c1</td></tr> |
| <tr><td>p0</td><td align=center> 0x01 </td><td align=center> 0x02 </td></tr> |
| <tr><td>p2</td><td align=center> 0x10 </td><td align=center> 0x20 </td></tr> |
| </table> |
| </td> |
| <td width=5> |
| </td> |
| <td> |
| <table border=1 align=center> |
| <tr><td> </td> |
| <td width=40 align=center>c0</td><td width=40 align=center>c1</td></tr> |
| <tr><td>p1</td><td align=center> 0x04 </td><td align=center> 0x08 </td></tr> |
| <tr><td>p3</td><td align=center> 0x40 </td><td align=center> 0x80 </td></tr> |
| </table> |
| </td> |
| </tr> |
| </table> |
| |
| <p> |
| <center> |
| task IDs |
| </center> |
| <table border=0 align=center> |
| <tr> |
| <td> |
| <table border=1 align=center> |
| <tr><td> </td> |
| <td width=40 align=center>c0</td><td width=40 align=center>c1</td></tr> |
| <tr><td>p0</td><td align=center> 0 </td><td align=center> 4 </td></tr> |
| <tr><td>p2</td><td align=center> 2 </td><td align=center> 6 </td></tr> |
| </table> |
| </td> |
| <td width=5> |
| </td> |
| <td> |
| <table border=1 align=center> |
| <tr><td> </td> |
| <td width=40 align=center>c0</td><td width=40 align=center>c1</td></tr> |
| <tr><td>p1</td><td align=center> 1 </td><td align=center> 5 </td></tr> |
| <tr><td>p3</td><td align=center> 3 </td><td align=center> 7 </td></tr> |
| </table> |
| </td> |
| </tr> |
| </table> |
| |
| <p>The above maps and task IDs can be translated into the |
| following mpirun command:</p> |
| |
| <DL> |
| <DL> |
| <DT><p>% mpirun -srun -n 32 -N 4 --cpu_bind=mask_cpu:1,4,10,40,2,8,20,80 a.out</p> |
| or |
| <DT><p>% mpirun -srun -n 32 -N 4 --cpu_bind=map_cpu:0,2,4,6,1,3,5,7 a.out</p> |
| </DL> |
| </DL> |
| |
<h3>Same cluster but with its cores numbered cyclically instead of in blocks</h3>
| |
<p>On a system with cyclically numbered cores, the correct map
argument to the <i>mpirun</i>/<i>srun</i> command looks like this (it will
achieve the same layout as the command above on a system with core block
numbering):</p>
| |
| <DL> |
| <DL> |
| <DT><p>% mpirun -srun -n 32 -N 4 --cpu_bind=map_cpu:0,1,2,3,4,5,6,7 a.out</p> |
| </DL> |
| </DL> |
| |
| <h3>Block map_cpu on a system with cyclic core numbering</h3> |
| |
<p>If users do not check their system's core numbering before specifying
the map_cpu list, and thereby do not realize that the new system has cyclic core
numbering instead of block numbering, then they will not get the expected
layout. For example, if they decide to re-use their mpirun command from above:</p>
| |
| <DL> |
| <DL> |
| <DT><p>% mpirun -srun -n 32 -N 4 --cpu_bind=map_cpu:0,2,4,6,1,3,5,7 a.out</p> |
| </DL> |
| </DL> |
| |
<p>they get the following unintended task ID layout:</p>
| |
| <table border=0 align=center> |
| <tr> |
| <td> |
| <table border=1 align=center> |
| <tr><td> </td> |
| <td width=40 align=center>c0</td><td width=40 align=center>c1</td></tr> |
| <tr><td>p0</td><td align=center> 0 </td><td align=center> 2 </td></tr> |
| <tr><td>p2</td><td align=center> 1 </td><td align=center> 3 </td></tr> |
| </table> |
| </td> |
| <td width=5> |
| </td> |
| <td> |
| <table border=1 align=center> |
| <tr><td> </td> |
| <td width=40 align=center>c0</td><td width=40 align=center>c1</td></tr> |
| <tr><td>p1</td><td align=center> 4 </td><td align=center> 6 </td></tr> |
| <tr><td>p3</td><td align=center> 5 </td><td align=center> 7 </td></tr> |
| </table> |
| </td> |
| </tr> |
| </table> |
| |
| <p>since the processor IDs within a node in the cyclic numbering are:</p> |
| |
| <table border=0 align=center> |
| <tr> |
| <td> |
| <table border=1 align=center> |
| <tr><td> </td> |
| <td width=40 align=center>c0</td><td width=40 align=center>c1</td></tr> |
| <tr><td>p0</td><td align=center> 0 </td><td align=center> 4 </td></tr> |
| <tr><td>p2</td><td align=center> 2 </td><td align=center> 6 </td></tr> |
| </table> |
| </td> |
| <td width=5> |
| </td> |
| <td> |
| <table border=1 align=center> |
| <tr><td> </td> |
| <td width=40 align=center>c0</td><td width=40 align=center>c1</td></tr> |
| <tr><td>p1</td><td align=center> 1 </td><td align=center> 5 </td></tr> |
| <tr><td>p3</td><td align=center> 3 </td><td align=center> 7 </td></tr> |
| </table> |
| </td> |
| </tr> |
| </table> |
| |
<p>The important conclusion is that using the --cpu_bind flag is not
trivial and assumes that users are experts.</p>
| |
| <!--------------------------------------------------------------------------> |
| <a name=utilities> |
| <h2>Extensions to sinfo/squeue/scontrol</h2></a> |
| |
| <p>Several extensions have also been made to the other SLURM utilities to |
| make working with multi-core/multi-threaded systems easier.</p> |
| |
| <!- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - --> |
| <h3>sinfo</h3> |
| |
| <p>The long version (-l) of the sinfo node listing (-N) has been |
| extended to display the sockets, cores, and threads present for each |
| node. For example: |
| |
| <PRE> |
| % sinfo -N |
| NODELIST NODES PARTITION STATE |
| hydra[12-15] 4 parts* idle |
| |
| % sinfo -lN |
| Thu Sep 14 17:47:13 2006 |
| NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON |
| hydra[12-15] 4 parts* idle 8+ 2+:4+:1+ 2007 41447 1 (null) none |
| |
| % sinfo -lNe |
| Thu Sep 14 17:47:18 2006 |
| NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON |
| |
| hydra[12-14] 3 parts* idle 8 2:4:1 2007 41447 1 (null) none |
| hydra15 1 parts* idle 64 8:4:2 2007 41447 1 (null) none |
| </PRE> |
| |
| <p>For user specified output formats (-o/--format) and sorting (-S/--sort), |
| the following identifiers are available:</p> |
| |
| <PRE> |
| %X Number of sockets per node |
| %Y Number of cores per socket |
| %Z Number of threads per core |
| %z Extended processor information: number of |
sockets, cores, threads (S:C:T) per node
| </PRE> |
| |
| <p>For example:</p> |
| |
| <PRE> |
| % sinfo -o '%9P %4c %8z %8X %8Y %8Z' |
| PARTITION CPUS S:C:T SOCKETS CORES THREADS |
| parts* 4 2:2:1 2 2 1 |
| </PRE> |
| |
| <p>See also 'sinfo --help' and 'man sinfo'</p> |
| |
| <!- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - --> |
| <h3>squeue</h3> |
| |
| <p>For user specified output formats (-o/--format) and sorting (-S/--sort), |
| the following identifiers are available:</p> |
| |
| <PRE> |
| %H Minimum number of sockets per node requested by the job. |
| This reports the value of the srun --minsockets option. |
| %I Minimum number of cores per socket requested by the job. |
| This reports the value of the srun --mincores option. |
| %J Minimum number of threads per core requested by the job. |
| This reports the value of the srun --minthreads option. |
| %m Minimum size of memory (in MB) requested by the job |
| %X Number of requested sockets per node |
| %Y Number of requested cores per socket |
| %Z Number of requested threads per core |
| %z Extended processor information: number of requested |
| sockets, cores, threads (S:C:T) per node |
| </PRE> |
| |
| <p>Below is an example squeue output after running 7 copies of:</p> |
| |
<DL>
<DL>
<DT><p>% srun -n 4 -B 2:2:1
--minsockets=4 --mincores=2 --minthreads=1 --mem=1024 sleep 100 &</p>
</DL>
</DL>
| |
| <PRE> |
| % squeue -o '%.5i %.2t %.4M %.5D %7X %7Y %7Z %7z %R' |
| JOBID ST TIME NODES SOCKETS CORES THREADS S:C:T NODELIST(REASON) |
| 17 PD 0:00 1 2 2 1 2:2:1 (Resources) |
| 18 PD 0:00 1 2 2 1 2:2:1 (Resources) |
| 19 PD 0:00 1 2 2 1 2:2:1 (Resources) |
| 13 R 1:27 1 2 2 1 2:2:1 hydra12 |
| 14 R 1:26 1 2 2 1 2:2:1 hydra13 |
| 15 R 1:26 1 2 2 1 2:2:1 hydra14 |
| 16 R 1:26 1 2 2 1 2:2:1 hydra15 |
| |
| % squeue -o '%.5i %.2t %.4M %.5D %9c %11H %9I %11J' |
| JOBID ST TIME NODES MIN_PROCS MIN_SOCKETS MIN_CORES MIN_THREADS |
| 17 PD 0:00 1 1 4 2 1 |
| 18 PD 0:00 1 1 4 2 1 |
| 19 PD 0:00 1 1 4 2 1 |
| 13 R 1:29 1 0 0 0 0 |
| 14 R 1:28 1 0 0 0 0 |
| 15 R 1:28 1 0 0 0 0 |
| 16 R 1:28 1 0 0 0 0 |
| </PRE> |
| |
| <p> |
| The display of the minimum size of memory requested by the job has |
| been extended to also show the amount of memory requested by |
| the --job-mem flag. If --job-mem and --mem are set to the |
same value, a single number is displayed for MIN_MEMORY. Otherwise
| a range is reported: |
| |
| <p>submit job 21: |
| <pre> |
| % srun sleep 100 & |
| </pre> |
| |
| <p>submit job 22: |
| <pre> |
| % srun --job-mem=2048MB --mem=1024MB sleep 100 & |
| srun: mem < job-mem - resizing mem to be equal to job-mem |
| </pre> |
| |
| <p>submit job 23: |
| <pre> |
| % srun --job-mem=2048MB --mem=10240MB sleep 100 & |
| </pre> |
| |
| <pre> |
| % squeue -o "%.5i %.2t %.4M %.5D %m" |
| JOBID ST TIME NODES MIN_MEMORY |
| 21 PD 0:00 1 0-1 |
| 22 PD 0:00 1 2048 |
| 23 PD 0:00 1 2048-10240 |
| 17 R 1:12 1 0 |
| 18 R 1:11 1 0 |
| 19 R 1:11 1 0 |
| 20 R 1:10 1 0 |
| </pre> |
| |
| <p>In the above examples, note that once a job starts running, the |
| MIN_* constraints are all reported as zero regardless of what |
| their initial values were (since they are meaningless once |
| the job starts running). |
| |
| <p>See also 'squeue --help' and 'man squeue'</p> |
| |
| <!- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - --> |
| <h3>scontrol</h3> |
| |
| <p>The following job settings can be adjusted using scontrol: |
| |
| <PRE> |
| Requested Allocation: |
ReqSockets=&lt;count&gt;   Set the job's count of required sockets
ReqCores=&lt;count&gt;     Set the job's count of required cores
ReqThreads=&lt;count&gt;   Set the job's count of required threads

Constraints:
MinSockets=&lt;count&gt;   Set the job's minimum number of sockets per node
MinCores=&lt;count&gt;     Set the job's minimum number of cores per socket
MinThreads=&lt;count&gt;   Set the job's minimum number of threads per core
| </PRE> |
| |
| <p>For example:</p> |
| |
| <PRE> |
| # scontrol update JobID=18 MinThreads=2 |
| # scontrol update JobID=18 MinCores=4 |
| # scontrol update JobID=18 MinSockets=8 |
| |
| % squeue -o '%.5i %.2t %.4M %.5D %9c %11H %9I %11J' |
| JOBID ST TIME NODES MIN_PROCS MIN_SOCKETS MIN_CORES MIN_THREADS |
| 17 PD 0:00 1 1 4 2 1 |
| 18 PD 0:00 1 1 8 4 2 |
| 19 PD 0:00 1 1 4 2 1 |
| 13 R 1:35 1 0 0 0 0 |
| 14 R 1:34 1 0 0 0 0 |
| 15 R 1:34 1 0 0 0 0 |
| 16 R 1:34 1 0 0 0 0 |
| </PRE> |
| |
<p>The 'scontrol show job' command can be used to display
the number of allocated CPUs per node as well as the sockets, cores,
and threads specified in the request and constraints.
| |
| <PRE> |
| % srun -N 2 -B 2:1-1 sleep 100 & |
| % scontrol show job 20 |
| JobId=20 UserId=(30352) GroupId=users(1051) |
| Name=sleep |
| Priority=4294901749 Partition=parts BatchFlag=0 |
| AllocNode:Sid=hydra16:3892 TimeLimit=UNLIMITED |
| JobState=RUNNING StartTime=09/25-17:17:30 EndTime=NONE |
| NodeList=hydra[12-14] NodeListIndices=0,2,-1 |
| <u>AllocCPUs=1,2,1</u> |
| ReqProcs=4 ReqNodes=2 <u>ReqS:C:T=2:1-1</u> |
| Shared=0 Contiguous=0 CPUs/task=0 |
| MinProcs=0 <u>MinSockets=0 MinCores=0 MinThreads=0</u> |
| MinMemory=0 MinTmpDisk=0 Features=(null) |
| Dependency=0 Account=(null) Reason=None Network=(null) |
| ReqNodeList=(null) ReqNodeListIndices=-1 |
| ExcNodeList=(null) ExcNodeListIndices=-1 |
| SubmitTime=09/25-17:17:30 SuspendTime=None PreSusTime=0 |
| </PRE> |
| |
| <p>See also 'scontrol --help' and 'man scontrol'</p> |
| |
| <!--------------------------------------------------------------------------> |
| <a name=config> |
| <h2>Configuration settings in slurm.conf</h2></a> |
| |
| <p>Several slurm.conf settings are available to control the multi-core |
| features described above. |
| |
| <p>In addition to the description below, also see the "Task Launch" and |
| "Resource Selection" sections if generating slurm.conf |
| via <a href="configurator.html">configurator.html</a>. |
| |
<p>As previously mentioned, in order for the affinity to be set, the
task/affinity plugin must first be enabled in slurm.conf:
| |
| <PRE> |
| TaskPlugin=task/affinity # enable task affinity |
| </PRE> |
| |
| <p>This setting is part of the task launch specific parameters:</p> |
| |
| <PRE> |
| # o Define task launch specific parameters |
| # |
| # "TaskProlog" : Define a program to be executed as the user before each |
| # task begins execution. |
| # "TaskEpilog" : Define a program to be executed as the user after each |
| # task terminates. |
| # "TaskPlugin" : Define a task launch plugin. This may be used to |
| # provide resource management within a node (e.g. pinning |
| # tasks to specific processors). Permissible values are: |
| # "task/none" : no task launch actions, the default. |
| # "task/affinity" : CPU affinity support |
| # |
| # Example: |
| # |
| # TaskProlog=/usr/local/slurm/etc/task_prolog # default is none |
| # TaskEpilog=/usr/local/slurm/etc/task_epilog # default is none |
| # TaskPlugin=task/affinity # default is task/none |
| </PRE> |
| |
| <p>SLURM will automatically detect the architecture of the nodes used |
| by examining /proc/cpuinfo. If, for some reason, the administrator |
| wishes to override the automatically selected architecture, the |
| NodeName parameter can be used in combination with FastSchedule: |
| |
| <PRE> |
| FastSchedule=1 |
| NodeName=dualcore[01-16] Procs=4 CoresPerSocket=2 ThreadsPerCore=1 |
| </PRE> |
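<p>In this example the socket count is not listed explicitly; SLURM infers it
from the other values:</p>

<PRE>
Sockets = Procs / (CoresPerSocket x ThreadsPerCore) = 4 / (2 x 1) = 2
</PRE>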
| |
| <p>Below is a more complete description of the configuration possible |
| using NodeName: |
| |
| <PRE> |
| # |
| # o Node configuration |
| # |
| # The configuration information of nodes (or machines) to be managed |
| # by SLURM is described here. The only required value in this section |
| # of the config file is the "NodeName" field, which specifies the |
| # hostnames of the node or nodes to manage. It is recommended, however, |
| # that baseline values for the node configuration be established |
# using the following parameters (see slurm.conf(5) for more info):
| # |
| # "NodeName" : The only required node configuration parameter, NodeName |
| # specifies a node or set of nodes to be managed by SLURM. |
| # The special NodeName of "DEFAULT" may be used to establish |
| # default node configuration parameters for subsequent node |
| # records. Typically this would be the string that |
| # `/bin/hostname -s` would return on the node. However |
| # NodeName may be an arbitrary string if NodeHostname is |
| # used (see below). |
| # |
| # "Feature" : comma separated list of "features" for the given node(s) |
| # |
| # "NodeAddr" : preferred address for contacting the node. This may be |
| # either a name or IP address. |
| # |
| # "NodeHostname" |
| # : the string that `/bin/hostname -s` would return on the |
# node. In other words, NodeName may be a name other than
| # the real hostname. |
| # |
| # "RealMemory" : Amount of real memory (in Megabytes) |
| # |
| # "Procs" : Number of logical processors on the node. |
| # If Procs is omitted, it will be inferred from: |
| # Sockets, CoresPerSocket, and ThreadsPerCore. |
| # |
| # "Sockets" : Number of physical processor sockets/chips on the node. |
| # If Sockets is omitted, it will be inferred from: |
| # Procs, CoresPerSocket, and ThreadsPerCore. |
| # |
| # "CoresPerSocket" |
| # : Number of cores in a single physical processor socket |
| # The CoresPerSocket value describes physical cores, not |
| # the logical number of processors per socket. |
| # The default value is 1. |
| # |
| # "ThreadsPerCore" |
| # : Number of logical threads in a single physical core. |
| # The default value is 1. |
| # |
| # "State" : Initial state (IDLE, DOWN, etc.) |
| # |
| # "TmpDisk" : Temporary disk space available on node |
| # |
| # "Weight" : Priority of node for scheduling purposes |
| # |
| # If any of the above values are set for a node or group of nodes, and |
| # that node checks in to the slurm controller with less than the |
| # configured resources, the node's state will be set to DOWN, in order |
| # to avoid scheduling any jobs on a possibly misconfigured machine. |
| # |
| # Example Node configuration: |
| # |
| # NodeName=DEFAULT Procs=2 TmpDisk=64000 State=UNKNOWN |
| # NodeName=host[0-25] NodeAddr=ehost[0-25] Weight=16 |
| # NodeName=host26 NodeAddr=ehost26 Weight=32 Feature=graphics_card |
| # NodeName=dualcore01 Procs=4 CoresPerSocket=2 ThreadsPerCore=1 |
| # NodeName=dualcore02 Procs=4 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 |
| # NodeName=multicore03 Procs=64 Sockets=8 CoresPerSocket=4 ThreadsPerCore=2 |
| </PRE> |
| |
| <!--------------------------------------------------------------------------> |
| <p style="text-align:center;">Last modified 2 February 2007</p> |
| |
| <!--#include virtual="footer.txt"--> |
| |