<!--#include virtual="header.txt"-->
<h1>High Throughput Computing Administration Guide</h1>
<p>This document contains SLURM administrator information specifically
for high throughput computing, namely the execution of many short jobs.
Getting optimal performance for high throughput computing does require
some tuning; this document should help get you off to a good start.
A working knowledge of SLURM should be considered a prerequisite
for this material.</p>
<h2>Performance Results</h2>
<p>SLURM has been validated to process 100,000 jobs and job steps per hour
on a sustained basis with short bursts of activity at a much higher level.
Actual performance depends upon the jobs to be executed plus the hardware and
configuration used.</p>
<h2>System configuration</h2>
<p>Three system configuration parameters must be set to support a large number
of open files and TCP connections with large bursts of messages. Changes can
be made using the <b>/etc/rc.d/rc.local</b> or <b>/etc/sysctl.conf</b>
script to preserve the changes after a reboot. In either case, the values can
also be written directly into the <i>/proc</i> files to take effect immediately
(e.g. <i>"echo 32832 &gt; /proc/sys/fs/file-max"</i>); a sample
<i>sysctl.conf</i> excerpt follows the list below.</p>
<ul>
<li><b>/proc/sys/fs/file-max</b>:
The maximum number of concurrently open files.
We recommend a limit of at least 32,832.</li>
<li><b>/proc/sys/net/ipv4/tcp_max_syn_backlog</b>:
The maximum number of remembered connection requests that have not yet
received an acknowledgment from the connecting client.
The default value is 1024 for systems with more than 128 MB of memory, and 128
for low memory machines. If the server suffers from overload, try increasing
this number.</li>
<li><b>/proc/sys/net/core/somaxconn</b>:
Limit of socket listen() backlog, known in userspace as SOMAXCONN. Defaults to
128. The value should be raised substantially to support bursts of requests.
For example, to support a burst of 1024 requests, set somaxconn to 1024.</li>
</ul>
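<p>For reference, a minimal <i>/etc/sysctl.conf</i> excerpt implementing these
settings might look like the sketch below. The values shown are only
illustrative starting points drawn from the recommendations above and should
be tuned for your workload; they can be applied without a reboot using
<i>"sysctl -p"</i>.</p>
<pre>
# /etc/sysctl.conf excerpt: kernel limits for high throughput operation
# (illustrative values only; apply with "sysctl -p" or at boot)
fs.file-max = 32832
net.ipv4.tcp_max_syn_backlog = 2048
net.core.somaxconn = 1024
</pre>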
<p>The transmit queue length (<b>txqueuelen</b>) may also need to be modified
using the ifconfig command. A value of 4096 has been found to work well for one
site with a very large cluster
(e.g. <i>"ifconfig &lt;interface&gt; txqueuelen 4096"</i>).</p>
<h2>User limits</h2>
<p>The <b>ulimit</b> values in effect for the <b>slurmctld</b> daemon should
be set quite high for memory size, open file count and stack size.</p>
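<p>A minimal sketch of one way to do this, assuming <b>slurmctld</b> is
launched from an init script, is to raise the limits in that script
immediately before starting the daemon; the exact limits, path and startup
mechanism will vary by site.</p>
<pre>
# Hypothetical init script fragment: raise limits before launching slurmctld
ulimit -n 32832        # open file count
ulimit -s unlimited    # stack size
ulimit -v unlimited    # memory (address space) size
/usr/sbin/slurmctld    # path may differ on your system
</pre>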
<h2>SLURM Configuration</h2>
<p>NOTE: Substantial changes were made in SLURM version 2.4 to support higher
throughput rates. Version 2.5 includes more enhancements.</p>
<p>Several SLURM configuration parameters should be adjusted to
reflect the needs of high throughput computing; a sample <i>slurm.conf</i>
excerpt follows the list below.</p>
<ul>
<li><b>MaxJobCount</b>:
Controls how many jobs may be in the <b>slurmctld</b> daemon records at any
point in time (pending, running, suspended, or recently completed and
temporarily retained). The default value is 10,000.</li>
<li><b>MessageTimeout</b>:
Controls how long to wait for a response to messages.
The default value is 10 seconds.
While the <b>slurmctld</b> daemon is highly threaded, its responsiveness
is load dependent. This value might need to be increased somewhat.</li>
<li><b>MinJobAge</b>:
Controls how soon the record of a completed job can be purged from
<b>slurmctld</b> memory, after which it is no longer visible with the <b>squeue</b> command.
The record of jobs run will be preserved in accounting records and logs.
The default value is 300 seconds. The value should be reduced to a few
seconds if possible.</li>
<li><b>PriorityType</b>:
The <b>priority/builtin</b> is considerably faster than other options, but
schedules jobs only on a First In First Out (FIFO) basis.</li>
<li><b>SchedulerParameters</b>:
Several scheduling parameters are available.
<ul>
<li>Setting option <b>defer</b> will avoid attempting to schedule each job
individually at job submit time, but defer it until a later time when
scheduling multiple jobs simultaneously may be possible.
This option may improve system responsiveness when large numbers of jobs
(many hundreds) are submitted at the same time, but it will delay the
initiation time of individual jobs.</li>
<li>A variation of <b>defer</b> would be to configure <b>default_queue_depth</b>
to a relatively small number to avoid attempting to schedule large numbers of
jobs every time some job completes or another routine action occurs. (NOTE:
the default value of <b>default_queue_depth</b> should be fine in most
cases).</li>
<li>The <i>sched/backfill</i> plugin has relatively high overhead if used with
large numbers of jobs. Configuring <b>max_job_bf</b> to a modest size (say 100
jobs or less) and <b>interval</b> to 30 seconds or more will limit the
overhead of backfill scheduling (NOTE: the default values are fine for both
of these parameters).</li>
</ul></li>
<li><b>SelectType</b>:
The <b>select/serial</b> plugin is highly optimized if executing only serial
(single CPU) jobs.</li>
<li><b>SlurmctldPort</b>:
It is desirable to configure the <b>slurmctld</b> daemon to accept incoming
messages on more than one port in order to avoid having incoming messages
discarded by the operating system due to exceeding the SOMAXCONN limit
described above. Using between two and ten ports is suggested when large
numbers of simultaneous requests are to be supported.</li>
<li><b>SlurmctldDebug</b>:
More detailed logging will decrease system throughput. Set to 2 (log errors
only) or 3 (general information logging). Each increment in the logging level
will increase the number of messages by a factor of about 3.</li>
<li><b>SlurmdDebug</b>:
More detailed logging will decrease system throughput. Set to 2 (log errors
only) or 3 (general information logging). Each increment in the logging level
will increase the number of messages by a factor of about 3.</li>
<li><b>SlurmdLogFile</b>:
Writing to local storage is recommended.</li>
<li>Other: Configure logging, accounting and other overhead to a minimum
appropriate for your environment.</li>
</ul>
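<p>Putting these together, a <i>slurm.conf</i> excerpt tuned for high
throughput might resemble the sketch below. The specific values are
illustrative starting points based on the guidance above rather than required
settings, and the port range and file path are assumptions that must match
your site's configuration.</p>
<pre>
# slurm.conf excerpt: illustrative high throughput settings (adjust for your site)
MaxJobCount=100000             # retain many more job records than the 10,000 default
MessageTimeout=30              # allow extra time for responses under heavy load
MinJobAge=10                   # purge completed job records after a few seconds
PriorityType=priority/builtin  # FIFO scheduling, lowest overhead
SchedulerParameters=defer      # batch scheduling decisions rather than per-submission
SelectType=select/serial       # only if all jobs are single-CPU
SlurmctldPort=6820-6825        # multiple ports to absorb bursts of incoming messages
SlurmctldDebug=3               # general information logging
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log   # local storage assumed
</pre>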
<p style="text-align:center;">Last modified 12 July 2012</p>
<!--#include virtual="footer.txt"-->