| <!--#include virtual="header.txt"--> |
| |
| <h1>Large Cluster Administration Guide</h1> |
| |
| <p>This document contains SLURM administrator information specifically |
| for clusters containing 1,024 nodes or more. |
| Virtually all SLURM components have been validated (through emulation) |
| for clusters containing up to 16,384 compute nodes. |
Getting good performance at that scale does require some tuning, and
this document should help get you off to a good start.
| A working knowledge of SLURM should be considered a prerequisite |
| for this material.</p> |
| |
| <h2>Node Selection Plugin (SelectType)</h2> |
| |
<p>While allocating individual processors within a node works well
for smaller clusters, keeping track of the individual processors and
memory within each node adds significant overhead.
For best scalability, avoid the consumable resource plugin
(<i>select/cons_res</i>) and allocate whole nodes to jobs.</p>
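
<p>For example, the following <i>slurm.conf</i> setting selects whole-node
allocation. This is a minimal sketch; it assumes the <i>select/linear</i>
plugin is available and appropriate for your system.</p>
<pre>
# Allocate whole nodes to jobs rather than tracking individual
# processors and memory (assumes select/linear is available)
SelectType=select/linear
</pre>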
| |
| <h2>Job Accounting Plugin (JobAcctType)</h2> |
| |
| <p>Job accounting relies upon the <i>slurmstepd</i> daemon on each compute |
| node periodically sampling data. |
| This data collection will take compute cycles away from the application |
| inducing what is known as <i>system noise</i>. |
For large parallel applications, this system noise can detract from
application scalability.
For optimal application performance, it is best to disable job accounting
(<i>jobacct/none</i>).
| Consider use of job completion records (<i>JobCompType</i>) for accounting |
| purposes as this entails far less overhead. |
If job accounting is required, configure the sampling interval
to be relatively long (e.g. <i>JobAcctFrequency=300</i>).
| Some experimentation may also be required to deal with collisions |
| on data transmission.</p> |
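
<p>For example, the relevant <i>slurm.conf</i> settings might look like
the following (the values are illustrative only):</p>
<pre>
# Disable job accounting entirely for minimal system noise
JobAcctType=jobacct/none

# Or, if accounting is required, sample relatively infrequently
#JobAcctFrequency=300
</pre>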
| |
| <h2>Node Configuration</h2> |
| |
| <p>While SLURM can track the amount of memory and disk space actually found |
| on each compute node and use it for scheduling purposes, this entails |
| extra overhead. |
| Optimize performance by specifying the expected configuration using |
| the available parameters (<i>RealMemory</i>, <i>Procs</i>, and |
| <i>TmpDisk</i>). |
If a node is found to contain fewer resources than configured,
it will be marked DOWN and not used.
Also set <i>FastSchedule=1</i> so that scheduling decisions are based
upon the configured values rather than those reported by each node.
| While SLURM can easily handle a heterogeneous cluster, configuring |
| the nodes using the minimal number of lines in <i>slurm.conf</i> |
| will both make for easier administration and better performance.</p> |
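
<p>For example, a cluster of 1,024 identical nodes might be described
with a single <i>NodeName</i> line. The node names and resource values
below are purely illustrative:</p>
<pre>
FastSchedule=1
NodeName=tux[0-1023] Procs=4 RealMemory=2048 TmpDisk=16384
</pre>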
| |
| <h2>Timers</h2> |
| |
| <p>The configuration parameter <i>SlurmdTimeout</i> determines the interval |
| at which <i>slurmctld</i> routinely communicates with <i>slurmd</i>. |
| Communications occur at half the <i>SlurmdTimeout</i> value. |
| The purpose of this is to determine when a compute node fails |
| and thus should not be allocated work. |
| Longer intervals decrease system noise on compute nodes (we do |
| synchronize these requests across the cluster, but there will |
| be some impact upon applications). |
For very large clusters, <i>SlurmdTimeout</i> values of
| 120 seconds or more are reasonable.</p> |
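
<p>For example (a starting point only; some experimentation may be needed):</p>
<pre>
SlurmdTimeout=120
</pre>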
| |
| <p style="text-align:center;">Last modified 28 January 2006</p> |
| |
| <!--#include virtual="footer.txt"--> |