<!--#include virtual="header.txt"-->
<h1>Large Cluster Administration Guide</h1>
<p>This document contains SLURM administrator information specifically
for clusters containing 1,024 nodes or more.
Virtually all SLURM components have been validated (through emulation)
for clusters containing up to 65,536 compute nodes.
Getting optimal performance at that scale does require some tuning, and
this document should help get you off to a good start.
A working knowledge of SLURM should be considered a prerequisite
for this material.</p>
<h2>Performance Results</h2>
<p>SLURM has been used on clusters containing up to 4,184 nodes.
At that scale, the total time to execute a simple program (resource
allocation, task launch, I/O processing, and cleanup, e.g.
"time srun -N4184 -n8368 uname") at 8,368 tasks
across the 4,184 nodes was under 57 seconds. The table below shows
total execution times for several large clusters with different architectures.</p>
<table border>
<caption>SLURM Total Job Execution Time</caption>
<tr>
<th>Nodes</th><th>Tasks</th><th>Seconds</th>
</tr>
<tr>
<td>256</td><td>512</td><td>1.0</td>
</tr>
<tr>
<td>512</td><td>1024</td><td>2.2</td>
</tr>
<tr>
<td>1024</td><td>2048</td><td>3.7</td>
</tr>
<tr>
<td>2123</td><td>4246</td><td>19.5</td>
</tr>
<tr>
<td>4184</td><td>8368</td><td>56.6</td>
</tr>
</table>
<h2>Node Selection Plugin (SelectType)</h2>
<p>While allocating individual processors within a node works well
for smaller clusters, tracking the individual processors and memory
within each node adds significant overhead.
For best scalability, allocate whole nodes using <i>select/linear</i>
or <i>select/bluegene</i> and avoid <i>select/cons_res</i>.</p>
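<p>For example, a minimal <i>slurm.conf</i> sketch (the plugin shown is an
illustration for a generic Linux cluster, not a requirement):</p>
<pre>
# slurm.conf: allocate whole nodes to minimize scheduling overhead
SelectType=select/linear
</pre>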
<h2>Job Accounting Gather Plugin (JobAcctGatherType)</h2>
<p>Job accounting relies upon the <i>slurmstepd</i> daemon on each compute
node periodically sampling data.
This data collection will take compute cycles away from the application
inducing what is known as <i>system noise</i>.
For large parallel applications, this system noise can detract from
application scalability.
For optimal application performance, disabling job accounting
is best (<i>jobacct_gather/none</i>).
Consider using job completion records (<i>JobCompType</i>) for accounting
purposes, as they entail far less overhead.
If job accounting is required, configure the sampling interval
to a relatively large value (e.g. <i>JobAcctGatherFrequency=300</i>).
Some experimentation may also be required to deal with collisions
on data transmission.</p>
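<p>A sketch of these settings in <i>slurm.conf</i> (the job completion
plugin and log path shown are illustrative assumptions; adjust them for
your site):</p>
<pre>
# slurm.conf: disable per-task accounting sampling, keep completion records
JobAcctGatherType=jobacct_gather/none
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/job_completions
# If per-task accounting is required, sample at a long interval (seconds):
# JobAcctGatherType=jobacct_gather/linux
# JobAcctGatherFrequency=300
</pre>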
<h2>Node Configuration</h2>
<p>While SLURM can track the amount of memory and disk space actually found
on each compute node and use it for scheduling purposes, this entails
extra overhead.
Optimize performance by specifying the expected configuration using
the available parameters (<i>RealMemory</i>, <i>Procs</i>, and
<i>TmpDisk</i>).
If a node is found to contain fewer resources than configured,
it will be marked DOWN and not used.
Also set the <i>FastSchedule</i> parameter.
While SLURM can easily handle a heterogeneous cluster, configuring
the nodes using the minimal number of lines in <i>slurm.conf</i>
will both make for easier administration and better performance.</p>
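<p>As an illustration, a single <i>slurm.conf</i> line can describe many
identically configured nodes (the node names and resource sizes below are
hypothetical):</p>
<pre>
# slurm.conf: describe 1,024 identical nodes on one line
FastSchedule=1
NodeName=tux[0001-1024] Procs=8 RealMemory=16384 TmpDisk=65536
</pre>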
<h2>Timers</h2>
<p>The configuration parameter <i>SlurmdTimeout</i> determines the interval
at which <i>slurmctld</i> routinely communicates with <i>slurmd</i>.
Communications occur at half the <i>SlurmdTimeout</i> value.
The purpose of this is to determine when a compute node fails
and thus should not be allocated work.
Longer intervals decrease system noise on compute nodes (we do
synchronize these requests across the cluster, but there will
be some impact upon applications).
For really large clusters, <i>SlurmdTimeout</i> values of
120 seconds or more are reasonable.</p>
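<p>For instance (the value shown is the one suggested above; tune it for
your cluster):</p>
<pre>
# slurm.conf: ping compute nodes less frequently to reduce system noise
SlurmdTimeout=120
</pre>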
<p>If MPICH-2 is used, the srun command will manage the key-pairs
used to bootstrap the application.
Depending upon the processor speed and architecture, the communication
of key-pair information may require extra time.
Extra time can be allotted by setting the environment variable PMI_TIME
before executing srun to launch the tasks.
The default value of PMI_TIME is 500, which is the number of
microseconds allotted to transmit each key-pair.
We have executed up to 16,000 tasks with a value of PMI_TIME=4000.</p>
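<p>A minimal usage sketch (the application name and node/task counts are
placeholders):</p>
<pre>
# Allow more time per key-pair exchange before launching a large job
export PMI_TIME=4000
srun -N1024 -n8192 ./my_mpi_app
</pre>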
<p>The individual slurmd daemons on compute nodes will initiate messages
to the slurmctld daemon only when they start up or when the epilog
completes for a job. When a job that was allocated a large number of nodes
completes, it can cause a very large number of messages to be sent
by the slurmd daemons on these nodes to the slurmctld daemon all at
the same time. In order to spread this message traffic out over time
and avoid message loss, the <i>EpilogMsgTime</i> parameter may be
used. Note that even if messages are lost, they will be retransmitted,
but this will result in a delay for reallocating resources to new jobs.</p>
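<p>For example (the value is only illustrative; see the <i>slurm.conf</i>
man page for the units and default):</p>
<pre>
# slurm.conf: spread epilog completion messages out over time
EpilogMsgTime=3000
</pre>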
<h2>Other</h2>
<p>SLURM uses hierarchical communications between the slurmd daemons
in order to increase parallelism and improve performance. The
<i>TreeWidth</i> configuration parameter controls the fanout of messages.
The default value is 50, meaning each slurmd daemon can communicate
with up to 50 other slurmd daemons and over 2500 nodes can be contacted
with two message hops.
The default value will work well for most clusters.
Optimal system performance can typically be achieved if <i>TreeWidth</i>
is set to the square root of the number of nodes in the cluster for
systems having no more than 2500 nodes or the cube root for larger
systems.</p>
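<p>For example (the node count and resulting value are illustrative; round
the root to an integer):</p>
<pre>
# slurm.conf: fanout for a hypothetical 4,096-node cluster
# sqrt(4096) = 64, so two message hops reach every node
TreeWidth=64
</pre>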
<p>The srun command automatically increases its open file limit to
the hard limit in order to process all of the standard input and output
connections to the launched tasks. It is recommended that you set the
open file hard limit to 8192 across the cluster.</p>
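<p>One way to do this on a Linux system using PAM limits (the file and
syntax below are standard for <i>/etc/security/limits.conf</i>, but adjust
for your environment):</p>
<pre>
# /etc/security/limits.conf: raise the open file hard limit cluster-wide
*    hard    nofile    8192
</pre>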
<p style="text-align:center;">Last modified 11 March 2008</p>
<!--#include virtual="footer.txt"-->