doc/html/big_sys.shtml - SchedMD/slurm - Git at Google

 <!--#include virtual="header.txt"-->

 <h1>Large Cluster Administration Guide</h1>

 <p>This document contains Slurm administrator information specifically
 for clusters containing 1,024 nodes or more.
 Some examples of large systems currently managed by Slurm are:
 <ul>
 <li>Frontier at Oak Ridge National Laboratory (ORNL) with 8,699,904 cores.</li>
 <li>Tianhe-2 at the National University of Defense Technology in China with
 4,981,760 cores.</li>
 <li>Perlmutter at National Energy Research Scientific Computing (NERSC) with
 761,856 cores.</li>
 </ul>
 Slurm operation on systems orders of magnitude larger has been validated
 using emulation.
 Getting optimal performance at that scale does require some tuning and
 this document should help get you off to a good start.
 A working knowledge of Slurm should be considered a prerequisite
 for this material.</p>

 <h2 id="perf">Performance<a class="slurm_link" href="#perf"></a></h2>

 <p>Times below are for execution of an MPI program printing "Hello world" and
 exiting and includes the time for processing output. Your performance may
 vary due to differences in hardware, software, and configuration.</p>
 <ul>
 <li>1,966,080 tasks on 122,880 compute nodes of a BlueGene/Q: 322 seconds</li>
 <li>30,000 tasks on 15,000 compute nodes of a Linux cluster: 30 seconds</li>
 </ul>

 <h2 id="sys_config">System Configuration
 <a class="slurm_link" href="#sys_config"></a>
 </h2>

 <p>Three system configuration parameters must be set to support a large number
 of open files and TCP connections with large bursts of messages. Changes can
 be made using the <b>/etc/rc.d/rc.local</b> or <b>/etc/sysctl.conf</b>
 script to preserve changes after reboot. In either case, you can write values
 directly into these files
 (e.g. <i>"echo 388067 &gt; /proc/sys/fs/file-max"</i>).</p>
 <ul>
 <li><b>/proc/sys/fs/file-max</b>:
 The maximum number of concurrently open files. The appropriate amount is highly
 dependent on system specs and workload. We recommend starting with a minimum of
 388067 or the default for your OS, whichever is greater. This may need to be
 adjusted upwards, depending on your needs.</li>
 <li><b>/proc/sys/net/ipv4/tcp_max_syn_backlog</b>:
 Maximum number of remembered connection requests, which still have not
 received an acknowledgment from the connecting client.
 The default value is 1024 for systems with more than 128Mb of memory, and 128
 for low memory machines. If server suffers of overload, try to increase this
 number.</li>
 <li><b>/proc/sys/net/core/somaxconn</b>:
 Limit of socket listen() backlog, known in userspace as SOMAXCONN. Defaults to
 128. The value should be raised substantially to support bursts of request.
 For example, to support a burst of 1024 requests, set somaxconn to 1024.</li>
 </ul>

 <p>The transmit queue length (<b>txqueuelen</b>) may also need to be modified
 using the ifconfig command. A value of 4096 has been found to work well for one
 site with a very large cluster
 (e.g. <i>"ifconfig <interface> txqueuelen 4096"</i>).</p>

 <h3 id="process_limit">Thread/Process Limit
 <a class="slurm_link" href="#process_limit"></a>
 </h3>

 <p>There is a newly introduced limit in SLES 12 SP2 (used on Cray systems
 with CLE 6.0UP04, to be released mid-2017).
 The version of systemd shipped with SLES 12 SP2 contains support for the
 <a href="https://www.suse.com/releasenotes/x86_64/SUSE-SLES/12-SP2/#fate-320358">
 PIDs cgroup controller</a>.
 Under the new systemd version, each init script or systemd service is limited
 to 512 threads/processes by default.
 This could cause issues for the slurmctld and slurmd daemons on large clusters
 or systems with a high job throughput rate.
 To increase the limit beyond the default:</p>
 <ul>
 <li>If using a systemd service file: Add <i>TasksMax=N</i> to the [Service]
  section. N can be a specific number, or special value <i>infinity</i>.</li>
 <li>If using an init script: Create the file<br>
 /etc/systemd/system/&lt;init script name&gt;.service.d/override.conf<br>
 with these contents:
 <pre>
   [Service]
   TasksMax=N
 </pre></li>
 </ul></p>
 <p>Note: Earlier versions of systemd that don't support the PIDs cgroup
 controller simply ignore the TasksMax setting.</p>

 <h2 id="limits">User Limits<a class="slurm_link" href="#limits"></a></h2>

 <p>The <b>ulimit</b> values in effect for the <b>slurmctld</b> daemon should
 be set quite high for memory size, open file count and stack size.</p>

 <h2 id="select_type">Node Selection Plugin (SelectType)
 <a class="slurm_link" href="#select_type"></a>
 </h2>

 <p>While allocating individual processors within a node is great
 for smaller clusters, the overhead of keeping track of the individual
 processors and memory within each node adds significant overhead.
 For best scalability, allocate whole nodes using <i>select/linear</i>
 and avoid <i>select/cons_tres</i>.</p>

 <h2 id="acct_gather">Job Accounting Gather Plugin (JobAcctGatherType)
 <a class="slurm_link" href="#acct_gather"></a>
 </h2>

 <p>Job accounting relies upon the <i>slurmstepd</i> daemon on each compute
 node periodically sampling data.
 This data collection will take compute cycles away from the application
 inducing what is known as <i>system noise</i>.
 For large parallel applications, this system noise can detract from
 application scalability.
 For optimal application performance, disabling job accounting
 is best (<i>jobacct_gather/none</i>).
 Consider use of job completion records (<i>JobCompType</i>) for accounting
 purposes as this entails far less overhead.
 If job accounting is required, configure the sampling interval
 to a relatively large size (e.g. <i>JobAcctGatherFrequency=task=300</i>).
 Some experimentation may be required to deal with collisions
 on data transmission.</p>

 <h2 id="node_config">Node Configuration
 <a class="slurm_link" href="#node_config"></a>
 </h2>

 <p>While Slurm can track the amount of memory and disk space actually found
 on each compute node and use it for scheduling purposes, this entails
 extra overhead.
 Optimize performance by specifying the expected configuration using
 the available parameters (<i>RealMemory</i>, <i>CPUs</i>, and
 <i>TmpDisk</i>).
 If the node is found to contain less resources than configured,
 it will be marked DOWN and not used.
 While Slurm can easily handle a heterogeneous cluster, configuring
 the nodes using the minimal number of lines in <i>slurm.conf</i>
 will both make for easier administration and better performance.</p>

 <h2 id="timers">Timers<a class="slurm_link" href="#timers"></a></h2>

 <p>The <i>EioTimeout</i> configuration parameter controls how long the srun
 command will wait for the slurmstepd to close the TCP/IP connection used to
 relay data between the user application and srun when the user application
 terminates. The default value is 60 seconds. Larger systems and/or slower
 networks may need a higher value.</p>

 <p>If a high throughput of jobs is anticipated (i.e. large numbers of jobs
 with brief execution times) then configure <i>MinJobAge</i> to the smallest
 interval practical for your environment. <i>MinJobAge</i> specifies the
 minimum number of seconds that a terminated job will be retained by Slurm's
 control daemon before purging. After this time, information about terminated
 jobs will only be available through accounting records.</p>

 <p>The configuration parameter <i>SlurmdTimeout</i> determines the interval
 at which <i>slurmctld</i> routinely communicates with <i>slurmd</i>.
 Communications occur at half the <i>SlurmdTimeout</i> value.
 The purpose of this is to determine when a compute node fails
 and thus should not be allocated work.
 Longer intervals decrease system noise on compute nodes (we do
 synchronize these requests across the cluster, but there will
 be some impact upon applications).
 For really large clusters, <i>SlurmdTimeout</i> values of
 120 seconds or more are reasonable.</p>

 <p>If MPICH-2 is used, the srun command will manage the key-pairs
 used to bootstrap the application.
 Depending upon the processor speed and architecture, the communication
 of key-pair information may require extra time.
 This can be done by setting an environment variable PMI_TIME before
 executing srun to launch the tasks.
 The default value of PMI_TIME is 500 and this is the number of
 microseconds allotted to transmit each key-pair.
 We have executed up to 16,000 tasks with a value of PMI_TIME=4000.</p>

 <p>The individual slurmd daemons on compute nodes will initiate messages
 to the slurmctld daemon only when they start up or when the epilog
 completes for a job. When a job allocated a large number of nodes
 completes, it can cause a very large number of messages to be sent
 by the slurmd daemons on these nodes to the slurmctld daemon all at
 the same time. In order to spread this message traffic out over time
 and avoid message loss, The <i>EpilogMsgTime</i> parameter may be
 used. Note that even if messages are lost, they will be retransmitted,
 but this will result in a delay for reallocating resources to new jobs.</p>

 <h2 id="other">Other<a class="slurm_link" href="#other"></a></h2>

 <p>Slurm uses hierarchical communications between the slurmd daemons
 in order to increase parallelism and improve performance. The
 <i>TreeWidth</i> configuration parameter controls the fanout of messages.
 The default value is 16, meaning each slurmd daemon can communicate
 with up to 16 other slurmd daemons and 4368 nodes can be contacted
 with three message hops.
 The default value will work well for most clusters.
 Optimal system performance can typically be achieved if <i>TreeWidth</i>
 is set to the cube root of the number of nodes in the cluster.</p>

 <p>The srun command automatically increases its open file limit to
 the hard limit in order to process all of the standard input and output
 connections to the launched tasks. It is recommended that you set the
 open file hard limit to 8192 across the cluster.</p>

 <p style="text-align:center;">Last modified 2 October 2024</p>

 <!--#include virtual="footer.txt"-->
	<!--#include virtual="header.txt"-->

	<h1>Large Cluster Administration Guide</h1>

	<p>This document contains Slurm administrator information specifically
	for clusters containing 1,024 nodes or more.
	Some examples of large systems currently managed by Slurm are:
	<ul>
	<li>Frontier at Oak Ridge National Laboratory (ORNL) with 8,699,904 cores.</li>
	<li>Tianhe-2 at the National University of Defense Technology in China with
	4,981,760 cores.</li>
	<li>Perlmutter at National Energy Research Scientific Computing (NERSC) with
	761,856 cores.</li>
	</ul>
	Slurm operation on systems orders of magnitude larger has been validated
	using emulation.
	Getting optimal performance at that scale does require some tuning and
	this document should help get you off to a good start.
	A working knowledge of Slurm should be considered a prerequisite
	for this material.</p>

	<h2 id="perf">Performance<a class="slurm_link" href="#perf"></a></h2>

	<p>Times below are for execution of an MPI program printing "Hello world" and
	exiting and includes the time for processing output. Your performance may
	vary due to differences in hardware, software, and configuration.</p>
	<ul>
	<li>1,966,080 tasks on 122,880 compute nodes of a BlueGene/Q: 322 seconds</li>
	<li>30,000 tasks on 15,000 compute nodes of a Linux cluster: 30 seconds</li>
	</ul>

	<h2 id="sys_config">System Configuration
	<a class="slurm_link" href="#sys_config"></a>
	</h2>

	<p>Three system configuration parameters must be set to support a large number
	of open files and TCP connections with large bursts of messages. Changes can
	be made using the <b>/etc/rc.d/rc.local</b> or <b>/etc/sysctl.conf</b>
	script to preserve changes after reboot. In either case, you can write values
	directly into these files
	(e.g. <i>"echo 388067 > /proc/sys/fs/file-max"</i>).</p>
	<ul>
	<li><b>/proc/sys/fs/file-max</b>:
	The maximum number of concurrently open files. The appropriate amount is highly
	dependent on system specs and workload. We recommend starting with a minimum of
	388067 or the default for your OS, whichever is greater. This may need to be
	adjusted upwards, depending on your needs.</li>
	<li><b>/proc/sys/net/ipv4/tcp_max_syn_backlog</b>:
	Maximum number of remembered connection requests, which still have not
	received an acknowledgment from the connecting client.
	The default value is 1024 for systems with more than 128Mb of memory, and 128
	for low memory machines. If server suffers of overload, try to increase this
	number.</li>
	<li><b>/proc/sys/net/core/somaxconn</b>:
	Limit of socket listen() backlog, known in userspace as SOMAXCONN. Defaults to
	128. The value should be raised substantially to support bursts of request.
	For example, to support a burst of 1024 requests, set somaxconn to 1024.</li>
	</ul>

	<p>The transmit queue length (<b>txqueuelen</b>) may also need to be modified
	using the ifconfig command. A value of 4096 has been found to work well for one
	site with a very large cluster
	(e.g. <i>"ifconfig <interface> txqueuelen 4096"</i>).</p>

	<h3 id="process_limit">Thread/Process Limit
	<a class="slurm_link" href="#process_limit"></a>
	</h3>

	<p>There is a newly introduced limit in SLES 12 SP2 (used on Cray systems
	with CLE 6.0UP04, to be released mid-2017).
	The version of systemd shipped with SLES 12 SP2 contains support for the
	<a href="https://www.suse.com/releasenotes/x86_64/SUSE-SLES/12-SP2/#fate-320358">
	PIDs cgroup controller</a>.
	Under the new systemd version, each init script or systemd service is limited
	to 512 threads/processes by default.
	This could cause issues for the slurmctld and slurmd daemons on large clusters
	or systems with a high job throughput rate.
	To increase the limit beyond the default:</p>
	<ul>
	<li>If using a systemd service file: Add <i>TasksMax=N</i> to the [Service]
	section. N can be a specific number, or special value <i>infinity</i>.</li>
	<li>If using an init script: Create the file<br>
	/etc/systemd/system/<init script name>.service.d/override.conf<br>
	with these contents:
	<pre>
	[Service]
	TasksMax=N
	</pre></li>
	</ul></p>
	<p>Note: Earlier versions of systemd that don't support the PIDs cgroup
	controller simply ignore the TasksMax setting.</p>

	<h2 id="limits">User Limits<a class="slurm_link" href="#limits"></a></h2>

	<p>The <b>ulimit</b> values in effect for the <b>slurmctld</b> daemon should
	be set quite high for memory size, open file count and stack size.</p>

	<h2 id="select_type">Node Selection Plugin (SelectType)
	<a class="slurm_link" href="#select_type"></a>
	</h2>

	<p>While allocating individual processors within a node is great
	for smaller clusters, the overhead of keeping track of the individual
	processors and memory within each node adds significant overhead.
	For best scalability, allocate whole nodes using <i>select/linear</i>
	and avoid <i>select/cons_tres</i>.</p>

	<h2 id="acct_gather">Job Accounting Gather Plugin (JobAcctGatherType)
	<a class="slurm_link" href="#acct_gather"></a>
	</h2>

	<p>Job accounting relies upon the <i>slurmstepd</i> daemon on each compute
	node periodically sampling data.
	This data collection will take compute cycles away from the application
	inducing what is known as <i>system noise</i>.
	For large parallel applications, this system noise can detract from
	application scalability.
	For optimal application performance, disabling job accounting
	is best (<i>jobacct_gather/none</i>).
	Consider use of job completion records (<i>JobCompType</i>) for accounting
	purposes as this entails far less overhead.
	If job accounting is required, configure the sampling interval
	to a relatively large size (e.g. <i>JobAcctGatherFrequency=task=300</i>).
	Some experimentation may be required to deal with collisions
	on data transmission.</p>

	<h2 id="node_config">Node Configuration
	<a class="slurm_link" href="#node_config"></a>
	</h2>

	<p>While Slurm can track the amount of memory and disk space actually found
	on each compute node and use it for scheduling purposes, this entails
	extra overhead.
	Optimize performance by specifying the expected configuration using
	the available parameters (<i>RealMemory</i>, <i>CPUs</i>, and
	<i>TmpDisk</i>).
	If the node is found to contain less resources than configured,
	it will be marked DOWN and not used.
	While Slurm can easily handle a heterogeneous cluster, configuring
	the nodes using the minimal number of lines in <i>slurm.conf</i>
	will both make for easier administration and better performance.</p>

	<h2 id="timers">Timers<a class="slurm_link" href="#timers"></a></h2>

	<p>The <i>EioTimeout</i> configuration parameter controls how long the srun
	command will wait for the slurmstepd to close the TCP/IP connection used to
	relay data between the user application and srun when the user application
	terminates. The default value is 60 seconds. Larger systems and/or slower
	networks may need a higher value.</p>

	<p>If a high throughput of jobs is anticipated (i.e. large numbers of jobs
	with brief execution times) then configure <i>MinJobAge</i> to the smallest
	interval practical for your environment. <i>MinJobAge</i> specifies the
	minimum number of seconds that a terminated job will be retained by Slurm's
	control daemon before purging. After this time, information about terminated
	jobs will only be available through accounting records.</p>

	<p>The configuration parameter <i>SlurmdTimeout</i> determines the interval
	at which <i>slurmctld</i> routinely communicates with <i>slurmd</i>.
	Communications occur at half the <i>SlurmdTimeout</i> value.
	The purpose of this is to determine when a compute node fails
	and thus should not be allocated work.
	Longer intervals decrease system noise on compute nodes (we do
	synchronize these requests across the cluster, but there will
	be some impact upon applications).
	For really large clusters, <i>SlurmdTimeout</i> values of
	120 seconds or more are reasonable.</p>

	<p>If MPICH-2 is used, the srun command will manage the key-pairs
	used to bootstrap the application.
	Depending upon the processor speed and architecture, the communication
	of key-pair information may require extra time.
	This can be done by setting an environment variable PMI_TIME before
	executing srun to launch the tasks.
	The default value of PMI_TIME is 500 and this is the number of
	microseconds allotted to transmit each key-pair.
	We have executed up to 16,000 tasks with a value of PMI_TIME=4000.</p>

	<p>The individual slurmd daemons on compute nodes will initiate messages
	to the slurmctld daemon only when they start up or when the epilog
	completes for a job. When a job allocated a large number of nodes
	completes, it can cause a very large number of messages to be sent
	by the slurmd daemons on these nodes to the slurmctld daemon all at
	the same time. In order to spread this message traffic out over time
	and avoid message loss, The <i>EpilogMsgTime</i> parameter may be
	used. Note that even if messages are lost, they will be retransmitted,
	but this will result in a delay for reallocating resources to new jobs.</p>

	<h2 id="other">Other<a class="slurm_link" href="#other"></a></h2>

	<p>Slurm uses hierarchical communications between the slurmd daemons
	in order to increase parallelism and improve performance. The
	<i>TreeWidth</i> configuration parameter controls the fanout of messages.
	The default value is 16, meaning each slurmd daemon can communicate
	with up to 16 other slurmd daemons and 4368 nodes can be contacted
	with three message hops.
	The default value will work well for most clusters.
	Optimal system performance can typically be achieved if <i>TreeWidth</i>
	is set to the cube root of the number of nodes in the cluster.</p>

	<p>The srun command automatically increases its open file limit to
	the hard limit in order to process all of the standard input and output
	connections to the launched tasks. It is recommended that you set the
	open file hard limit to 8192 across the cluster.</p>

	<p style="text-align:center;">Last modified 2 October 2024</p>

	<!--#include virtual="footer.txt"-->