<!--#include virtual="header.txt"-->

<h1>Intel Knights Landing (KNL) User and Administrator Guide</h1>

<h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2>

<p>This document describes the unique features of Slurm on computers with
the Intel Knights Landing (KNL) processor.
You should be familiar with Slurm's mode of operation on Linux clusters
before studying the relatively few differences in Intel KNL system operation
described in this document.</p>

<h2 id="user_tools">User Tools
<a class="slurm_link" href="#user_tools"></a>
</h2>
<p>The desired NUMA and MCDRAM modes for a KNL processor should be specified
using the -C or --constraint option of Slurm's job submission commands: salloc,
sbatch, and srun. Currently available NUMA and MCDRAM modes are shown in the
table below. Each node's available and current NUMA and MCDRAM modes are
visible in the "available features" and "active features" fields respectively,
which may be seen using the scontrol, sinfo, or sview commands.
Note that a node may need to be rebooted to get the desired NUMA and MCDRAM
modes, and nodes may only be rebooted when they contain no running jobs
(i.e. sufficient resources may be available to run a pending job, but until
the node is idle and can be rebooted, the pending job can not be allocated
resources). Also note that the job will be charged for resources from the time
of resource allocation, which may include the time required to reboot a node
into the desired NUMA and MCDRAM configuration.</p>

<p>Slurm supports a very rich set of node constraint options
(exclusive OR, node counts for each constraint, etc.).
See the man pages for the salloc, sbatch and srun commands for more information
about the constraint syntax, and see the examples sketched after the table below.
Jobs may specify their desired NUMA and/or MCDRAM configuration. If no
NUMA and/or MCDRAM configuration is specified, then a node with any possible
value for that configuration will be used.</p>

<table width="100%" border=1 cellspacing=0 cellpadding=4>
<tr>
<th width="15%">Type</th>
<th width="15%">Name</th>
<th width="70%">Description</th>
</tr>
<tr><td>MCDRAM</td><td>cache</td><td>All of MCDRAM to be used as cache</td></tr>
<tr><td>MCDRAM</td><td>equal</td><td>MCDRAM to be used partly as cache and partly combined with primary memory</td></tr>
<tr><td>MCDRAM</td><td>flat</td><td>MCDRAM to be combined with primary memory into a "flat" memory space</td></tr>
<tr><td>NUMA</td><td>a2a</td><td>All to all</td></tr>
<tr><td>NUMA</td><td>hemi</td><td>Hemisphere</td></tr>
<tr><td>NUMA</td><td>snc2</td><td>Sub-NUMA cluster 2</td></tr>
<tr><td>NUMA</td><td>snc4</td><td>Sub-NUMA cluster 4 (<a href="#note2">NOTE</a>)</td></tr>
<tr><td>NUMA</td><td>quad</td><td>Quadrant</td></tr>
</table>
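
<p>As a sketch of the constraint syntax (the script name and node counts are
placeholders):</p>

<pre>
# Any node whose MCDRAM mode is "flat", regardless of NUMA mode
$ sbatch -C flat -N1 my.script

# Matching OR: four nodes in either snc2 or quad NUMA mode,
# with all allocated nodes in the same mode
$ sbatch -C "[snc2|quad]" -N4 my.script
</pre>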

<p>Jobs requiring some or all of the KNL high bandwidth memory (HBM) should
explicitly request that memory using Slurm's Generic RESource (GRES) options.
The HBM is always known by the Slurm GRES name of "hbm".
The examples below demonstrate use of HBM.</p>

<p>Sorting of the free cache pages at step startup using Intel's zonesort
module can be configured as the default for all steps using the
"LaunchParameters=mem_sort" option in the slurm.conf file.
Individual job steps can enable or disable sorting using the "--mem-bind=sort"
or "--mem-bind=nosort" command line options for srun.
Sorting will be performed only for the NUMA nodes allocated to the job step.</p>
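
<p>For example (a sketch; task counts and program names are placeholders):</p>

<pre>
# In slurm.conf: run the zonesort module at every step startup by default
LaunchParameters=mem_sort

# Per-step overrides on the srun command line
$ srun --mem-bind=sort -n68 a.out
$ srun --mem-bind=nosort -n68 a.out
</pre>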

<p><a id="note"><b>NOTE</b></a>: Slurm provides limited support
for restricting use of HBM. At some point in the future, the amount of HBM
requested by the job will be compared with the total amount of HBM and number of
memory-containing NUMA nodes available on the KNL processor. The job will then
be bound to specific NUMA nodes in order to limit the total amount of HBM
available to the job, and thus reserve the remaining HBM for other jobs running
on that KNL processor.</p>

<p><a id="note2"><b>NOTE</b></a>: Slurm can only
support homogeneous nodes (e.g. the same number of cores per NUMA node).
KNL processors with <u>68 cores</u> (a subset of KNL models) will not have
homogeneous NUMA nodes in snc4 mode; each NUMA node will have
either 16 or 18 cores. This will result in Slurm using the lower core count,
finding a total of 256 threads rather than 272 threads, and setting the node
to a DOWN state.</p>

<h3 id="accounting">Accounting<a class="slurm_link" href="#accounting"></a></h3>

<p>If a node requires rebooting for a job's required configuration, the job
will be charged for the resource allocation from the time of allocation through
the lifetime of the job, including the time consumed for booting the nodes.
The job's time limit will be calculated from the time that all nodes are ready
for use.
For example, a job with a 10 minute time limit may be allocated resources at
10:00:00.
If the nodes require rebooting, they might not be available for use until
10:20:00, 20 minutes after allocation, and the job will begin execution at
that time.
The job must complete no later than 10:30:00 in order to satisfy its time limit
(10 minutes after execution actually begins).
However, the job will be charged for 30 minutes of resource use, which includes
the boot time.</p>

<h3 id="use_case">Sample Use Cases
<a class="slurm_link" href="#use_case"></a>
</h3>

<pre>
# Two whole nodes in flat/a2a mode, with 8 GB of HBM per node
$ sbatch -C flat,a2a -N2 --gres=hbm:8g --exclusive my.script

# 36 tasks on nodes in hemi/cache mode
$ srun --constraint=hemi,cache -n36 a.out

# 36 tasks on flat-mode nodes, with 2 GB of HBM per node
$ srun --constraint=flat --gres=hbm:2g -n36 a.out

# Show each node's active and available features
$ sinfo -o "%30N %20b %f"
NODELIST                       ACTIVE_FEATURES      AVAIL_FEATURES
nid000[10-11]
nid000[12-35]                  flat,a2a             flat,a2a,snc2,hemi
nid000[36-43]                  cache,a2a            flat,equal,cache,a2a,hemi
</pre>

<h3 id="topology">Network Topology
<a class="slurm_link" href="#topology"></a>
</h3>

<p>Slurm will optimize performance using those resources available without
rebooting. If node rebooting is required, then it will optimize layout with
respect to network bandwidth using both nodes currently in the desired
configuration and those which can be made available after rebooting.
This can result in more nodes being rebooted than strictly needed, but will
improve application performance.</p>

<p>Users can specify that they want all resources allocated on a specific count
of leaf switches (Dragonfly groups) using Slurm's <b>--switches</b> option.
They can also specify how much additional time they are willing to wait for
such a configuration. If the desired configuration can not be made available
within the specified time interval, the job will be allocated nodes optimized
with respect to network bandwidth to the extent possible. On a Dragonfly
network, this means allocating resources either over a single group or
distributed evenly over as many groups as possible. For example:</p>
<pre>
# Wait up to 10 minutes for 16 nodes on a single leaf switch
srun --switches=1@10:00 -N16 a.out
</pre>
<p>Note that system administrators can disable use of the <b>--switches</b>
option or limit the amount of time the job can be deferred using the
<b>max_switch_wait</b> option of <b>SchedulerParameters</b> in slurm.conf.
For example:</p>
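<pre>
# In slurm.conf (a sketch): cap the time a job may wait for its
# requested switch layout at 600 seconds (10 minutes)
SchedulerParameters=max_switch_wait=600
</pre>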

<h3 id="boot_problems">Booting Problems
<a class="slurm_link" href="#boot_problems"></a>
</h3>

<p>If node boots fail, those nodes are drained and the job is requeued so that
it can be allocated a different set of nodes. The nodes originally allocated
to the job that booted successfully will remain available to the job, so it
will likely need only a small number of replacement nodes.</p>

<h2 id="administration">System Administration
<a class="slurm_link" href="#administration"></a>
</h2>

<p>Four important components are required to use Slurm on an Intel KNL system.</p>
<ol>
<li>Slurm needs a mechanism to determine the node's current topology (e.g.
how many NUMA nodes exist and which cores are associated with each NUMA node).
Slurm relies upon <a href="http://www.open-mpi.org/projects/hwloc/">
Portable Hardware Locality (HWLOC)</a> for this functionality. Please install
HWLOC before building Slurm (a quick check is sketched after this list).</li>

<li>The node features plugin manages the available and active features
information for each KNL node.</li>

<li>A configuration file is used to define various timeouts, default
configuration, etc. The configuration file name and contents will depend upon
the node features plugin used. See the <a href="knl.conf.html">knl.conf</a>
man page for more information.</li>

<li>A mechanism is required to boot nodes in the desired configuration. This
mechanism must be integrated with the existing Slurm infrastructure for
<a href="sbatch.html">rebooting nodes on user request (--reboot)</a>.</li>
</ol>
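
<p>A minimal sketch of verifying that HWLOC is installed before building Slurm
(the package and tool names assume a typical HWLOC installation):</p>

<pre>
# Check for the HWLOC development files and report the version
$ pkg-config --modversion hwloc

# Print the node topology as HWLOC sees it
$ lstopo-no-graphics | head
</pre>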

<p>In addition, there is a DebugFlags option of "NodeFeatures" which will
generate detailed information about KNL operations. For example:</p>
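<pre>
# In slurm.conf: log detailed NodeFeatures information
DebugFlags=NodeFeatures

# Or enable it at run time without restarting the daemons (a sketch)
$ scontrol setdebugflags +NodeFeatures
</pre>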

<p>The KNL-specific available and active features are configured differently
based upon the plugin configured.<br>
<u>For the knl_generic plugin</u>, KNL-specific features should be defined
in the "slurm.conf" configuration file. When the slurmd daemon starts on each
compute node, it will update the available and active features as needed.<br>
Features which are not KNL-specific (e.g. rack number, "knl", etc.) will be
copied from the node's "Features" configuration in "slurm.conf" to both the
available and active feature fields and will not be modified by the NodeFeatures
plugin. For example:</p>
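<pre>
# Sketch of a slurm.conf node definition (the node name and feature list
# are hypothetical): "knl" and "rack1" are copied through unmodified, while
# the KNL-specific features are managed by the NodeFeatures plugin
NodeName=nid00010 Feature=knl,rack1,flat,cache,a2a,hemi
</pre>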

<p><b>NOTE</b>: For Dell KNL systems you must also include the
<i>SystemType=Dell</i> option for successful operation and will likely need to
increase the <i>SyscfgTimeout</i> to allow enough time for the command to
complete successfully. Experience at one site has shown that a 10 second
timeout may be necessary, configured in milliseconds as <i>SyscfgTimeout=10000</i>.
For example:</p>
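<pre>
# Sketch of knl.conf additions for a Dell KNL system
SystemType=Dell
SyscfgTimeout=10000   # in milliseconds; 10 seconds
</pre>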

<p>Slurm does not support the concept of multiple NUMA nodes
within a single socket. If a KNL node is booted with multiple NUMA nodes, then
each NUMA node will appear in Slurm as a separate socket.
In the slurm.conf configuration file, set node socket and
core counts to values which are appropriate for some NUMA mode to be used on the
node. When the node boots and the slurmd daemon on the node starts, it will
report the node's actual socket (NUMA) and core counts to the slurmctld daemon,
which will update Slurm's data structures for the node to the currently
configured values.
Note that Slurm currently does not support the concept of
differing numbers of cores in each socket (or NUMA node). We are currently
working to address these issues. For example:</p>
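<pre>
# Sketch for a 64-core KNL node expected to run in snc4 mode: the four
# NUMA nodes appear to Slurm as four sockets of 16 cores each
# (node names and memory size are hypothetical)
NodeName=nid[00000-00127] Sockets=4 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=112000 Feature=knl
</pre>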

<h3 id="operation">Mode of Operation
<a class="slurm_link" href="#operation"></a>
</h3>

<ol>
<li>The node's configured "Features" are copied to the available and active
feature fields.</li>
<li>The node features plugin determines the node's current MCDRAM and NUMA
values as well as those which are available, and adds those values to the node's
active and available feature fields respectively. Note that these values may
not be available until the node has booted and the slurmd daemon on the
compute node sends that information to the slurmctld daemon.</li>
<li>Jobs will be allocated nodes already in the requested MCDRAM and NUMA mode
if possible. If insufficient resources are available with the requested
configuration, then other nodes will be selected and booted into the desired
configuration once no other jobs are active on the node. Until a node is idle,
its configuration can not be changed. Note that node reboot time is on the
order of 20 minutes, and a job whose nodes are being rebooted is shown in the
CONFIGURING state, as sketched after this list.</li>
</ol>
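
<p>A sketch of watching this from the command line (the job ID and output are
illustrative):</p>

<pre>
# Request a mode that may require rebooting nodes
$ sbatch -C flat,quad -N4 my.script
Submitted batch job 1234

# While the allocated nodes boot, the job is in the CONFIGURING (CF) state
$ squeue -j 1234 -o "%i %T %R"
JOBID STATE       NODELIST(REASON)
1234  CONFIGURING nid000[12-15]
</pre>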

<h3 id="config">Generic Cluster Configuration
<a class="slurm_link" href="#config"></a>
</h3>

<p>Clusters with KNL nodes should have NodeFeaturesPlugins configured to
"knl_generic".
This plugin performs all operations directly on the compute nodes, using Intel's
<i>syscfg</i> program to get and modify the node's MCDRAM and NUMA mode and
using the Linux <i>reboot</i> program to reboot the compute node in order for
modifications in MCDRAM and/or NUMA mode to take effect.
Make sure that RebootProgram is defined in the slurm.conf file.
This plugin currently does <u>not</u> permit the specification of ResumeProgram,
SuspendProgram, SuspendTime, etc. in slurm.conf; however, that limitation may
be removed in the future (the ResumeProgram currently has no means of changing
the node's MCDRAM and/or NUMA mode prior to reboot).</p>

<p><b>NOTE</b>: The syscfg program reports the MCDRAM and NUMA mode to be used
when the node is next booted. If the syscfg program is used to modify the MCDRAM
or NUMA mode of a node, but it is not rebooted, then Slurm will be making
scheduling decisions based upon incorrect state information. If you want to
change node state information outside of Slurm, then use the following
procedure:</p>
<ol>
<li>Drain the nodes of interest</li>
<li>Change their MCDRAM and/or NUMA mode</li>
<li>Reboot the nodes, then</li>
<li>Restore them to service in Slurm</li>
</ol>
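
<p>A sketch of this procedure using scontrol (the node name is hypothetical, and
the syscfg invocation varies by platform, so it is shown only as a comment):</p>

<pre>
# 1. Drain the node
$ scontrol update NodeName=nid00010 State=DRAIN Reason="mode change"

# 2. Change the MCDRAM/NUMA mode with Intel's syscfg program, then
# 3. reboot the node (e.g. with the reboot command)

# 4. Return the node to service
$ scontrol update NodeName=nid00010 State=RESUME
</pre>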

<h4>Sample knl_generic.conf File</h4>

<pre>
# Sample knl_generic.conf
SyscfgPath=/usr/bin/syscfg
DefaultNUMA=a2a       # NUMA=all2all
AllowNUMA=a2a,snc2,hemi
DefaultMCDRAM=cache   # MCDRAM=cache
</pre>

<h4>Sample slurm.conf File</h4>

<pre>
# Sample slurm.conf
NodeFeaturesPlugins=knl_generic
DebugFlags=NodeFeatures
GresTypes=hbm
RebootProgram=/sbin/reboot
...
NodeName=DEFAULT Sockets=1 CoresPerSocket=68 ThreadsPerCore=4 RealMemory=128000 Feature=knl
NodeName=nid[00000-00127] State=UNKNOWN
</pre>

<p style="text-align:center;">Last modified 13 March 2024</p>

<!--#include virtual="footer.txt"-->