| <!--#include virtual="header.txt"--> |
| |
| <h1>Large Cluster Administration Guide</h1> |
| |
| <p>This document contains Slurm administrator information specifically |
| for clusters containing 1,024 nodes or more. |
| Some examples of large systems currently managed by Slurm are: |
| <ul> |
| <li>Frontier at Oak Ridge National Laboratory (ORNL) with 8,699,904 cores.</li> |
| <li>Tianhe-2 at the National University of Defense Technology in China with |
| 4,981,760 cores.</li> |
| <li>Perlmutter at National Energy Research Scientific Computing (NERSC) with |
| 761,856 cores.</li> |
| </ul> |
| Slurm operation on systems orders of magnitude larger has been validated |
| using emulation. |
| Getting optimal performance at that scale does require some tuning and |
| this document should help get you off to a good start. |
| A working knowledge of Slurm should be considered a prerequisite |
| for this material.</p> |
| |
| <h2 id="perf">Performance<a class="slurm_link" href="#perf"></a></h2> |
| |
<p>The times below are for executing an MPI program that prints "Hello world"
and exits, and include the time required to process the output. Your
performance may vary due to differences in hardware, software, and
configuration.</p>
| <ul> |
| <li>1,966,080 tasks on 122,880 compute nodes of a BlueGene/Q: 322 seconds</li> |
| <li>30,000 tasks on 15,000 compute nodes of a Linux cluster: 30 seconds</li> |
| </ul> |
| |
| <h2 id="sys_config">System Configuration |
| <a class="slurm_link" href="#sys_config"></a> |
| </h2> |
| |
<p>Three system configuration parameters must be set to support a large number
of open files and TCP connections with large bursts of messages. To preserve
the changes across reboots, place them in <b>/etc/sysctl.conf</b> or in a
startup script such as <b>/etc/rc.d/rc.local</b>. Values can also be applied
immediately by writing directly to the corresponding files under /proc
(e.g. <i>"echo 388067 > /proc/sys/fs/file-max"</i>).
A sample <b>/etc/sysctl.conf</b> fragment follows the list below.</p>
| <ul> |
| <li><b>/proc/sys/fs/file-max</b>: |
The maximum number of concurrently open files. The appropriate value is highly
dependent on system specifications and workload. We recommend starting with
388067 or the default for your OS, whichever is greater, and adjusting upward
as needed.</li>
| <li><b>/proc/sys/net/ipv4/tcp_max_syn_backlog</b>: |
The maximum number of remembered connection requests that have not yet
received an acknowledgment from the connecting client.
The default value is 1024 for systems with more than 128 MB of memory, and 128
for low-memory machines. If the server suffers from overload, try increasing
this number.</li>
| <li><b>/proc/sys/net/core/somaxconn</b>: |
| Limit of socket listen() backlog, known in userspace as SOMAXCONN. Defaults to |
128. The value should be raised substantially to support bursts of requests.
| For example, to support a burst of 1024 requests, set somaxconn to 1024.</li> |
| </ul> |
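<p>As an illustrative example, the settings above could be made persistent
with an <b>/etc/sysctl.conf</b> fragment such as the following. The
<i>fs.file-max</i> and <i>somaxconn</i> values come from the discussion above;
the <i>tcp_max_syn_backlog</i> value is only an assumption and should be tuned
for your workload.</p>
<pre>
# Maximum number of concurrently open files
fs.file-max = 388067
# Backlog of connection requests not yet acknowledged by the client
net.ipv4.tcp_max_syn_backlog = 4096
# Socket listen() backlog (SOMAXCONN)
net.core.somaxconn = 1024
</pre>
<p>Running <i>"sysctl -p"</i> applies the values immediately without a
reboot.</p>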
| |
| <p>The transmit queue length (<b>txqueuelen</b>) may also need to be modified |
| using the ifconfig command. A value of 4096 has been found to work well for one |
| site with a very large cluster |
| (e.g. <i>"ifconfig <interface> txqueuelen 4096"</i>).</p> |
| |
| <h3 id="process_limit">Thread/Process Limit |
| <a class="slurm_link" href="#process_limit"></a> |
| </h3> |
| |
<p>SLES 12 SP2 (used on Cray systems starting with CLE 6.0 UP04) introduced
a new limit on threads and processes.
| The version of systemd shipped with SLES 12 SP2 contains support for the |
| <a href="https://www.suse.com/releasenotes/x86_64/SUSE-SLES/12-SP2/#fate-320358"> |
| PIDs cgroup controller</a>. |
| Under the new systemd version, each init script or systemd service is limited |
| to 512 threads/processes by default. |
| This could cause issues for the slurmctld and slurmd daemons on large clusters |
| or systems with a high job throughput rate. |
| To increase the limit beyond the default:</p> |
| <ul> |
| <li>If using a systemd service file: Add <i>TasksMax=N</i> to the [Service] |
section. N can be a specific number or the special value <i>infinity</i>
(see the example following this list).</li>
| <li>If using an init script: Create the file<br> |
| /etc/systemd/system/<init script name>.service.d/override.conf<br> |
| with these contents: |
| <pre> |
| [Service] |
| TasksMax=N |
| </pre></li> |
</ul>
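<p>For example, assuming a systemd-managed <i>slurmd.service</i> (adjust the
unit name and path for your installation), a drop-in file can remove the
limit entirely:</p>
<pre>
# /etc/systemd/system/slurmd.service.d/override.conf
[Service]
TasksMax=infinity
</pre>
<p>After creating the file, run <i>"systemctl daemon-reload"</i> and restart
the service for the new limit to take effect.</p>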
| <p>Note: Earlier versions of systemd that don't support the PIDs cgroup |
| controller simply ignore the TasksMax setting.</p> |
| |
| <h2 id="limits">User Limits<a class="slurm_link" href="#limits"></a></h2> |
| |
| <p>The <b>ulimit</b> values in effect for the <b>slurmctld</b> daemon should |
| be set quite high for memory size, open file count and stack size.</p> |
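<p>If slurmctld is started by systemd, one way to raise these limits is with
<i>Limit*</i> directives in a drop-in file for the service unit. The file
name and values below are illustrative assumptions, not recommendations from
this document:</p>
<pre>
# /etc/systemd/system/slurmctld.service.d/override.conf
[Service]
# Open file count
LimitNOFILE=65536
# Stack size
LimitSTACK=infinity
# Locked memory
LimitMEMLOCK=infinity
</pre>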
| |
| <h2 id="select_type">Node Selection Plugin (SelectType) |
| <a class="slurm_link" href="#select_type"></a> |
| </h2> |
| |
<p>While allocating individual processors within a node works well
for smaller clusters, keeping track of the individual processors and
memory within each node adds significant overhead.
| For best scalability, allocate whole nodes using <i>select/linear</i> |
| and avoid <i>select/cons_tres</i>.</p> |
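<p>In <i>slurm.conf</i> this corresponds to a single line:</p>
<pre>
# Allocate whole nodes rather than individual processors and memory
SelectType=select/linear
</pre>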
| |
| <h2 id="acct_gather">Job Accounting Gather Plugin (JobAcctGatherType) |
| <a class="slurm_link" href="#acct_gather"></a> |
| </h2> |
| |
| <p>Job accounting relies upon the <i>slurmstepd</i> daemon on each compute |
| node periodically sampling data. |
| This data collection will take compute cycles away from the application |
| inducing what is known as <i>system noise</i>. |
| For large parallel applications, this system noise can detract from |
| application scalability. |
For optimal application performance, it is best to disable job accounting
(<i>jobacct_gather/none</i>).
Consider using job completion records (<i>JobCompType</i>) for accounting
purposes, as this entails far less overhead.
If job accounting is required, configure the sampling interval
to a relatively large value (e.g. <i>JobAcctGatherFrequency=task=300</i>).
| Some experimentation may be required to deal with collisions |
| on data transmission.</p> |
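<p>A <i>slurm.conf</i> fragment reflecting this advice might look as follows;
the cgroup plugin shown in the commented alternative is just one
possibility:</p>
<pre>
# Disable accounting sampling for minimal system noise
JobAcctGatherType=jobacct_gather/none

# Or, if job accounting is required, sample infrequently
#JobAcctGatherType=jobacct_gather/cgroup
#JobAcctGatherFrequency=task=300
</pre>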
| |
| <h2 id="node_config">Node Configuration |
| <a class="slurm_link" href="#node_config"></a> |
| </h2> |
| |
| <p>While Slurm can track the amount of memory and disk space actually found |
| on each compute node and use it for scheduling purposes, this entails |
| extra overhead. |
| Optimize performance by specifying the expected configuration using |
| the available parameters (<i>RealMemory</i>, <i>CPUs</i>, and |
| <i>TmpDisk</i>). |
If a node is found to contain fewer resources than configured,
it will be marked DOWN and not used.
| While Slurm can easily handle a heterogeneous cluster, configuring |
| the nodes using the minimal number of lines in <i>slurm.conf</i> |
| will both make for easier administration and better performance.</p> |
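<p>For example, a homogeneous block of nodes can be described on a single
<i>slurm.conf</i> line; the node names and resource values here are
hypothetical:</p>
<pre>
# One line covers 1,024 identical nodes
NodeName=node[0001-1024] CPUs=128 RealMemory=256000 TmpDisk=102400 State=UNKNOWN
</pre>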
| |
| <h2 id="timers">Timers<a class="slurm_link" href="#timers"></a></h2> |
| |
| <p>The <i>EioTimeout</i> configuration parameter controls how long the srun |
| command will wait for the slurmstepd to close the TCP/IP connection used to |
| relay data between the user application and srun when the user application |
| terminates. The default value is 60 seconds. Larger systems and/or slower |
| networks may need a higher value.</p> |
| |
| <p>If a high throughput of jobs is anticipated (i.e. large numbers of jobs |
| with brief execution times) then configure <i>MinJobAge</i> to the smallest |
| interval practical for your environment. <i>MinJobAge</i> specifies the |
| minimum number of seconds that a terminated job will be retained by Slurm's |
| control daemon before purging. After this time, information about terminated |
| jobs will only be available through accounting records.</p> |
| |
| <p>The configuration parameter <i>SlurmdTimeout</i> determines the interval |
| at which <i>slurmctld</i> routinely communicates with <i>slurmd</i>. |
| Communications occur at half the <i>SlurmdTimeout</i> value. |
| The purpose of this is to determine when a compute node fails |
| and thus should not be allocated work. |
| Longer intervals decrease system noise on compute nodes (we do |
| synchronize these requests across the cluster, but there will |
| be some impact upon applications). |
For very large clusters, <i>SlurmdTimeout</i> values of
| 120 seconds or more are reasonable.</p> |
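<p>A <i>slurm.conf</i> fragment combining the timer-related parameters
discussed above might look as follows; the values are illustrative starting
points rather than recommendations:</p>
<pre>
# srun waits up to two minutes for slurmstepd to close its I/O connection
EioTimeout=120
# Purge terminated job records from slurmctld after 60 seconds
MinJobAge=60
# slurmctld contacts each slurmd every 60 seconds (half of this value)
SlurmdTimeout=120
</pre>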
| |
| <p>If MPICH-2 is used, the srun command will manage the key-pairs |
| used to bootstrap the application. |
| Depending upon the processor speed and architecture, the communication |
| of key-pair information may require extra time. |
Additional time can be allotted by setting the PMI_TIME environment variable
before executing srun to launch the tasks.
| The default value of PMI_TIME is 500 and this is the number of |
| microseconds allotted to transmit each key-pair. |
| We have executed up to 16,000 tasks with a value of PMI_TIME=4000.</p> |
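<p>For example, the variable can be set on the srun command line; the task
count and program name below are placeholders:</p>
<pre>
# Allot 4000 microseconds per key-pair transmission
PMI_TIME=4000 srun -n 16000 ./my_mpi_app
</pre>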
| |
| <p>The individual slurmd daemons on compute nodes will initiate messages |
| to the slurmctld daemon only when they start up or when the epilog |
| completes for a job. When a job allocated a large number of nodes |
| completes, it can cause a very large number of messages to be sent |
| by the slurmd daemons on these nodes to the slurmctld daemon all at |
| the same time. In order to spread this message traffic out over time |
| and avoid message loss, The <i>EpilogMsgTime</i> parameter may be |
| used. Note that even if messages are lost, they will be retransmitted, |
| but this will result in a delay for reallocating resources to new jobs.</p> |
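<p>For example, in <i>slurm.conf</i> (the value shown is only an illustrative
assumption):</p>
<pre>
# Microseconds used to space out epilog completion messages
EpilogMsgTime=2000
</pre>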
| |
| <h2 id="other">Other<a class="slurm_link" href="#other"></a></h2> |
| |
| <p>Slurm uses hierarchical communications between the slurmd daemons |
| in order to increase parallelism and improve performance. The |
| <i>TreeWidth</i> configuration parameter controls the fanout of messages. |
| The default value is 16, meaning each slurmd daemon can communicate |
| with up to 16 other slurmd daemons and 4368 nodes can be contacted |
| with three message hops. |
| The default value will work well for most clusters. |
| Optimal system performance can typically be achieved if <i>TreeWidth</i> |
| is set to the cube root of the number of nodes in the cluster.</p> |
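<p>As a worked example, a cluster of roughly 8,000 nodes has a cube root of
about 20, so the following would be a reasonable starting point:</p>
<pre>
# Fanout for hierarchical slurmd communications
TreeWidth=20
</pre>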
| |
| <p>The srun command automatically increases its open file limit to |
| the hard limit in order to process all of the standard input and output |
| connections to the launched tasks. It is recommended that you set the |
| open file hard limit to 8192 across the cluster.</p> |
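<p>One common way to raise the hard limit cluster-wide is through
<b>/etc/security/limits.conf</b> (or a file under /etc/security/limits.d/),
assuming PAM applies these limits in your environment:</p>
<pre>
# Raise the hard open file limit for all users
*    hard    nofile    8192
</pre>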
| |
| <p style="text-align:center;">Last modified 2 October 2024</p> |
| |
| <!--#include virtual="footer.txt"--> |