| <!--#include virtual="header.txt"--> |
| |
| <h1>High Throughput Computing Administration Guide</h1> |
| |
| <p>This document contains Slurm administrator information specifically |
| for high throughput computing, namely the execution of many short jobs. |
Getting optimal performance for high throughput computing does require
some tuning and this document should help get you off to a good start.
| A working knowledge of Slurm should be considered a prerequisite |
| for this material.</p> |
| |
| <h2 id="performance">Performance Results |
| <a class="slurm_link" href="#performance"></a> |
| </h2> |
| |
<p>Slurm has been validated to execute 500 simple batch jobs per second
| on a sustained basis with short bursts of activity at a much higher level. |
| Actual performance depends upon the jobs to be executed plus the hardware and |
| configuration used.</p> |
| |
| <h2 id="sys_config">System configuration |
| <a class="slurm_link" href="#sys_config"></a> |
| </h2> |
| |
<p>Several system configuration parameters may require modification to support
a large number of open files and TCP connections with large bursts of messages.
Changes can be made persistent across reboots by placing them in
<b>/etc/rc.d/rc.local</b> or <b>/etc/sysctl.conf</b>.
Values can also be written directly to the corresponding files under /proc for
immediate effect (e.g. <i>"echo 32832 > /proc/sys/fs/file-max"</i>).</p>
| <ul> |
| <li><b>/proc/sys/fs/file-max</b>: |
| The maximum number of concurrently open files. |
| We recommend a limit of at least 32,832.</li> |
<li><b>/proc/sys/net/ipv4/tcp_max_syn_backlog</b>:
The maximum number of SYN requests to keep in memory for which the third packet
of the 3-way handshake has not yet been received.
The default value is 1024 for systems with more than 128 MB of memory, and 128
for low memory machines. If the server suffers from overload, try increasing
this value.</li>
<li><b>/proc/sys/net/ipv4/tcp_syncookies</b>:
Used to send out <i>syncookies</i> to hosts when the kernel's SYN backlog queue
for a specific socket overflows.
The default value is 0, which disables this functionality.
Set the value to 1.</li>
<li><b>/proc/sys/net/ipv4/tcp_synack_retries</b>:
How many times to retransmit the SYN,ACK reply to a SYN request.
In other words, this tells the system how many times to try to establish a
passive TCP connection that was started by another host.
This variable takes an integer value, but should under no circumstances be
larger than 255.
Each retransmission will take approximately 30 to 40 seconds.
The default value is 5, which results in a timeout of passive TCP connections
after approximately 180 seconds, and is generally satisfactory.</li>
| <li><b>/proc/sys/net/core/somaxconn</b>: |
| Limit of socket listen() backlog, known in userspace as SOMAXCONN. Defaults to |
128. The value should be raised substantially to support bursts of requests.
| For example, to support a burst of 1024 requests, set somaxconn to 1024.</li> |
| <li><b>/proc/sys/net/ipv4/ip_local_port_range</b>: |
Identifies the range of ephemeral ports available, which are used for many Slurm
| communications. The value may be raised to support a high volume of |
| communications. |
| For example, write the value "32768 65535" into the ip_local_port_range file |
| in order to make that range of ports available.</li> |
| </ul> |
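
<p>For changes that persist across reboots, the same settings can be expressed
in <b>/etc/sysctl.conf</b> (or a file under /etc/sysctl.d/) and loaded with
<i>"sysctl -p"</i>. The fragment below is only a sketch restating the
recommendations above; the tcp_max_syn_backlog value is an illustrative choice
rather than a recommendation from this guide.</p>
<pre>
# /etc/sysctl.conf -- illustrative values based on the recommendations above
fs.file-max = 32832
net.ipv4.tcp_max_syn_backlog = 4096    # example value; raised from the default of 1024
net.ipv4.tcp_syncookies = 1
net.core.somaxconn = 1024              # sized for a burst of 1024 requests
net.ipv4.ip_local_port_range = 32768 65535
</pre>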
| |
| <p>The transmit queue length (<b>txqueuelen</b>) may also need to be modified |
| using the ifconfig command. A value of 4096 has been found to work well for one |
| site with a very large cluster |
| (e.g. <i>"ifconfig <interface> txqueuelen 4096"</i>).</p> |
| |
| <h2 id="munge_config">Munge configuration |
| <a class="slurm_link" href="#munge_config"></a> |
| </h2> |
| |
| <p>By default the Munge daemon runs with two threads, but a higher thread count |
| can improve its throughput. We suggest starting the Munge daemon with ten |
| threads for high throughput support (e.g. <i>"munged --num-threads 10"</i>).</p> |
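
<p>If Munge is managed by systemd, one way to pass this option is a drop-in
override. This is a sketch only; the <i>ExecStart</i> path is an assumption and
should match the munge.service unit shipped by your distribution.</p>
<pre>
# systemctl edit munge
# (creates /etc/systemd/system/munge.service.d/override.conf)
[Service]
ExecStart=
ExecStart=/usr/sbin/munged --num-threads 10
</pre>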
| |
| <h2 id="user_limits">User limits |
| <a class="slurm_link" href="#user_limits"></a> |
| </h2> |
| |
| <p>The <b>ulimit</b> values in effect for the <b>slurmctld</b> daemon should |
| be set quite high for memory size, open file count and stack size.</p> |
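
<p>When slurmctld is started by systemd, these limits are taken from the unit
file rather than from shell or PAM limits, so a drop-in override is one way to
raise them. The values below are illustrative assumptions, not recommendations
from this guide.</p>
<pre>
# systemctl edit slurmctld
# (creates a drop-in override for slurmctld.service; values are illustrative)
[Service]
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
</pre>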
| |
| <h2 id="slurm_config">Slurm Configuration |
| <a class="slurm_link" href="#slurm_config"></a> |
| </h2> |
| |
<p>Several Slurm configuration parameters should be adjusted to
reflect the needs of high throughput computing. The changes described below
will not be possible in all environments, but these are the configuration
options that you may want to consider for higher throughput. A combined
slurm.conf sketch follows the list below.</p>
| |
| <ul> |
<li><b>AccountingStorageType</b>:
Storing accounting records can be disabled by not setting this plugin, although
turning accounting off provides only a minimal improvement in performance.
If using SlurmDBD, increased speedup can be achieved by setting the
CommitDelay option in <a href=slurmdbd.conf.html>slurmdbd.conf</a>.</li>
| <li><b>JobAcctGatherType</b>: |
| Disabling the collection of job accounting information will improve job |
| throughput. Disable collection of accounting by using the |
| <i>jobacct_gather/none</i> plugin.</li> |
| <li><b>JobCompType</b>: |
| Disabling recording of job completion information will improve job |
| throughput. Disable recording of job completion information by using the |
| <i>jobcomp/none</i> plugin.</li> |
| <li><b>JobSubmitPlugins</b>: |
| Use of a lua job submit plugin is not recommended. slurmctld runs this |
| script while holding internal locks, and only a single copy of this script |
| can run at a time. This blocks most concurrency in slurmctld. Therefore, we |
| do not recommend using it in a high throughput environment.</li> |
| <li><b>MaxJobCount</b>: |
| Controls how many jobs may be in the <b>slurmctld</b> daemon records at any |
point in time (pending, running, suspended, or recently completed and
temporarily retained).
| The default value is 10,000.</li> |
| <li><b>MessageTimeout</b>: |
| Controls how long to wait for a response to messages. |
| The default value is 10 seconds. |
| While the <b>slurmctld</b> daemon is highly threaded, its responsiveness |
| is load dependent. This value might need to be increased somewhat.</li> |
| <li><b>MinJobAge</b>: |
| Controls how soon the record of a completed job can be purged from the |
| <b>slurmctld</b> memory and thus not visible using the <b>squeue</b> command. |
| The record of jobs run will be preserved in accounting records and logs. |
| The default value is 300 seconds. The value should be reduced to a few |
| seconds if possible. Use of accounting records for older jobs can increase |
| the job throughput rate compared with retaining old jobs in the memory of |
| the slurmctld daemon.</li> |
| <li><b>PriorityType</b>: |
The <i>priority/basic</i> plugin is considerably faster than other options, but
| schedules jobs only on a First In First Out (FIFO) basis.</li> |
| <li><b>PrologSlurmctld/EpilogSlurmctld</b>: |
| Neither of these is recommended for a high throughput environment. When they |
| are enabled a separate slurmctld thread has to be created for every job start |
| (or task for a job array). |
| Current architecture requires acquisition of a job write lock in every thread, |
which is a costly operation that severely limits scheduler throughput.</li>
| <li><b>SchedulerParameters</b>: |
| Many scheduling parameters are available. |
| <ul> |
| <li>Setting option <b>batch_sched_delay</b> will control how long the |
scheduling of batch jobs can be delayed. This affects only batch jobs.
| For example, if many jobs are submitted each second, the overhead of |
| trying to schedule each one will adversely impact the rate at which jobs |
| can be submitted. The default value is 3 seconds.</li> |
| <li>Setting option <b>defer</b> will avoid attempting to schedule each job |
| individually at job submit time, but defer it until a later time when |
| scheduling multiple jobs simultaneously may be possible. |
| This option may improve system responsiveness when large numbers of jobs |
| (many hundreds) are submitted at the same time, but it will delay the |
| initiation time of individual jobs.</li> |
| <li>Setting the <b>defer_batch</b> option is similar to the <b>defer</b> |
| option, as explained above. The difference is that <b>defer_batch</b> will |
| allow interactive jobs to be started immediately, but jobs submitted with |
| sbatch will be deferred to allow multiple jobs to accumulate and be scheduled |
| at once.</li> |
<li><b>sched_min_interval</b> is yet another configuration parameter to control
how frequently the scheduling logic runs. It can still be triggered on each
job submit, job termination, or other state change which could permit a new
job to be started. However, that triggering does not cause the scheduling logic
to run immediately, but only once the configured <b>sched_min_interval</b> has
elapsed since the previous run.
For example, if sched_min_interval=2000000 (microseconds) and 100 jobs are submitted
within a 2 second time window, then the scheduling logic will be executed one time
rather than the 100 times it would run with sched_min_interval set to 0 (no delay).</li>
<li>In addition to how frequently the scheduling logic runs, the number of jobs
considered for starting in each scheduler iteration can be controlled with the
<b>default_queue_depth</b> configuration parameter. The default value of
default_queue_depth is 100 (jobs), which should be fine in most cases.</li>
| <li>The <i>sched/backfill</i> plugin has relatively high overhead if used with |
large numbers of jobs. Configuring <b>bf_max_job_test</b> to a modest size (say 100
| jobs or less) and <b>bf_interval</b> to 30 seconds or more will limit the |
| overhead of backfill scheduling (<b>NOTE</b>: the default values are fine for |
| both of these parameters). Other backfill options available for tuning backfill |
| scheduling include <b>bf_max_job_user</b>, <b>bf_resolution</b> and |
| <b>bf_window</b>. See the slurm.conf man page for details.</li> |
| <li>A set of scheduling parameters currently used for running hundreds of jobs |
| per second on a sustained basis on one cluster follows. Note that every |
| environment is different and this set of parameters will not work well |
in every case, but it may serve as a good starting point.
| <ul> |
| <li>batch_sched_delay=20</li> |
| <li>bf_continue</li> |
| <li>bf_interval=300</li> |
| <li>bf_min_age_reserve=10800</li> |
| <li>bf_resolution=600</li> |
| <li>bf_yield_interval=1000000</li> |
| <li>partition_job_depth=500</li> |
| <li>sched_max_job_start=200</li> |
| <li>sched_min_interval=2000000</li> |
</ul></li>
| </ul></li> |
| <li><b>SlurmctldParameters</b>: |
| Many slurmctld daemon parameters are available. |
| <ul> |
| <li>Increasing <b>conmgr_max_connections</b> will allow slurmctld to accept |
more connections at once to avoid connect() timeouts during times of high load,
but not necessarily read or write timeouts. The trade-off is that slurmctld will
| use more memory as each connection reserves memory to buffer inbound and |
outbound data along with the connection state. <b>conmgr_max_connections</b>
should be at least the number of hardware CPU threads available, but less than
| <code>sysctl net.nf_conntrack_max</code> and |
| <code>sysctl net.core.somaxconn</code>. Enabling |
| <code>sysctl net.ipv4.tcp_syncookies=1</code> is also suggested to |
| allow the kernel to better manage larger bursts of incoming sockets. |
| When modifying this parameter, you should monitor for relative changes in |
<code>sdiag</code>'s output. The <i>ave_time</i> field under <i>Remote
Procedure Call statistics</i> should be given special attention, as changes to
| that can have a dramatic impact on overall response times. Increasing |
| <b>conmgr_max_connections</b> too much could cause an <i>Out of Memory</i> |
| event which will cause slurmctld to crash, potentially losing jobs and |
| accounting. Sites are advised to try changing <b>MessageTimeout</b> and |
| <b>TCPTimeout</b> before changing the <b>conmgr_max_connections</b> parameter. |
| </li> |
| <li> |
| The <b>conmgr_threads</b> option controls the size of the thread pool that is |
| used to process communications. Threads are used as needed to handle I/O or to |
process incoming RPCs and generate replies. The trade-off is that slurmctld will
| use more memory for each additional thread. Increasing thread counts will also |
| cause increased kernel scheduler contention when there are more threads than |
| available hardware CPUs, increasing the potential for thread starvation. While |
| processing incoming RPC requests, slurmctld usually has to obtain one or |
| more of the global slurmctld locks. Each thread attempting to obtain a lock can |
| cause increased contention with the scheduler threads. Lock contention will |
result in the job scheduler running slower or with non-negligible delays.
| Sites wishing for more RPC throughput can increase <b>conmgr_threads</b> from |
| the defaults, while sites wishing to prioritize scheduler threads can decrease |
| the thread count. Sites are advised to monitor for changes in |
| <code>sdiag</code>'s job start statistics when changing this parameter. |
<b>conmgr_threads</b> should generally be 2-4 times the number of
hardware CPU threads available to the slurmctld daemon, since most RPC
processing needs to wait on global locks. Increasing <b>conmgr_threads</b> too much
| could cause an <i>Out of Memory</i> event which will cause slurmctld to |
| crash, potentially losing jobs and accounting. |
| </li> |
</ul></li>
| <li><b>SchedulerType</b>: |
If most jobs are short-lived, then use of the <i>sched/builtin</i> plugin is
recommended. This manages a queue of jobs on a First-In-First-Out (FIFO) basis
and eliminates logic used to sort the queue by priority.</li>
| <li><b>SlurmctldDebug</b>: |
| More detailed logging will decrease system throughput. Set to <i>error</i> or |
<i>info</i> for regular operations with a high throughput workload.</li>
| <li><b>SlurmctldPort</b>: |
| It is desirable to configure the <b>slurmctld</b> daemon to accept incoming |
| messages on more than one port in order to avoid having incoming messages |
| discarded by the operating system due to exceeding the SOMAXCONN limit |
| described above. Using between two and ten ports is suggested when large |
| numbers of simultaneous requests are to be supported.</li> |
| <li><b>SlurmdDebug</b>: |
| More detailed logging will decrease system throughput. Set to <i>error</i> or |
<i>info</i> for regular operations with a high throughput workload.</li>
| <li><b>SlurmdLogFile</b>: |
| Writing to local storage is recommended.</li> |
<li>The ability to do RPC rate limiting on a per-user basis was introduced in
Slurm 23.02. It acts as a virtual bucket of tokens that users consume with
| Remote Procedure Calls. This allows users to submit a large number of requests |
| in a short period of time, but not a sustained high rate of requests that |
| would add stress to the scheduler. You can define the maximum number of tokens |
| with <b>rl_bucket_size</b>, the rate at which new tokens are added with |
| <b>rl_refill_rate</b>, the frequency with which tokens are refilled with |
| <b>rl_refill_period</b> and the number of entities to track with |
| <b>rl_table_size</b>. It is enabled with <b>rl_enable</b>.</li> |
| <li><b>Other</b>: Configure logging, accounting and other overhead to a minimum |
| appropriate for your environment.</li> |
| </ul> |
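
<p>Pulling the parameters above together, a slurm.conf fragment for a high
throughput configuration might look like the sketch below. The
SchedulerParameters line repeats the example set given above; the MinJobAge,
MaxJobCount, MessageTimeout, SlurmctldPort and conmgr values are placeholders
that must be sized to your own hardware and workload.</p>
<pre>
# slurm.conf fragment (illustrative sketch; adjust for your site)
JobAcctGatherType=jobacct_gather/none
JobCompType=jobcomp/none
MinJobAge=10                  # seconds; reduced from the default of 300
MaxJobCount=100000            # placeholder; size to your expected job backlog
MessageTimeout=30             # placeholder; raise from the 10 second default if needed
PriorityType=priority/basic
SchedulerType=sched/builtin
SlurmctldPort=6817-6818       # accept incoming messages on more than one port
SlurmctldDebug=error
SlurmdDebug=error
SchedulerParameters=batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_resolution=600,bf_yield_interval=1000000,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=2000000
SlurmctldParameters=conmgr_max_connections=128,conmgr_threads=64   # placeholders; see above
</pre>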
| |
| <h2 id="slurmdbd_config">SlurmDBD Configuration |
| <a class="slurm_link" href="#slurmdbd_config"></a> |
| </h2> |
| |
| <p>Turning accounting off provides a minimal improvement in performance. |
If using SlurmDBD, increased speedup can be achieved by setting the CommitDelay
| option in the <a href=slurmdbd.conf.html>slurmdbd.conf</a> to introduce a |
| delay between the time slurmdbd receives a connection from slurmctld and |
| when it commits the information to the database. This allows multiple |
| requests to be accumulated and reduces the number of commit requests |
| to the database.</p> |
| |
<p>You might also consider setting the '<i>Purge*</i>' options in your
slurmdbd.conf to clear out old data. A typical configuration would
look like this:</p>
| <ul> |
| <li><b>PurgeEventAfter</b>=12months</li> |
| <li><b>PurgeJobAfter</b>=12months</li> |
| <li><b>PurgeResvAfter</b>=2months</li> |
| <li><b>PurgeStepAfter</b>=2months</li> |
| <li><b>PurgeSuspendAfter</b>=1month</li> |
| <li><b>PurgeTXNAfter</b>=12months</li> |
| <li><b>PurgeUsageAfter</b>=12months</li> |
| |
| </ul> |
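
<p>Expressed as slurmdbd.conf entries, the purge settings above combined with a
CommitDelay might look like the following; the CommitDelay value is only an
illustrative choice.</p>
<pre>
# slurmdbd.conf fragment (illustrative)
CommitDelay=1                 # seconds over which commits are batched; example value
PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeResvAfter=2months
PurgeStepAfter=2months
PurgeSuspendAfter=1month
PurgeTXNAfter=12months
PurgeUsageAfter=12months
</pre>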
| |
| <p style="text-align:center;">Last modified 13 March 2025</p> |
| |
| <!--#include virtual="footer.txt"--> |