doc/html/power_save.shtml - SchedMD/slurm - Git at Google

 <!--#include virtual="header.txt"-->

 <h1>Power Saving Guide</h1>

 <p>SLURM provides an integrated power saving mechanism for idle nodes.
 Nodes that remain idle for a configurable period of time can be placed
 in a power saving mode.
 The nodes will be restored to normal operation once work is assigned to them.
 Beginning with version 2.0.0, nodes can be fully powered down.
 Earlier releases of SLURM do not support the powering down of nodes,
 only support of reducing their performance and thus their power consumption.
 For example, power saving can be accomplished using a <i>cpufreq</i> governor
 that can change CPU frequency and voltage (note that the <i>cpufreq</i> driver
 must be enabled in the Linux kernel configuration).
 Of particular note, SLURM can power nodes up or down
 at a configurable rate to prevent rapid changes in power demands.
 For example, starting a 1000 node job on an idle cluster could result
 in an instantaneous surge in power demand of multiple megawatts without
 SLURM's support to increase power demands in a gradual fashion.</p>


 <h2>Configuration</h2>

 <p>A great deal of flexibility is offered in terms of when and
 how idle nodes are put into or removed from power save mode.
 Note that the SLURM control daemon, <i>slurmctld</i>, must be
 restarted to initially enable power saving mode.
 Changes in the configuration parameters (e.g. <i>SuspendTime</i>)
 will take effect after modifying the <i>slurm.conf</i> configuration
 file and executing "<i>scontrol reconfig</i>".
 The following configuration parameters are available:
 <ul>

 <li><b>SuspendTime</b>:
 Nodes becomes eligible for power saving mode after being idle
 for this number of seconds.
 The configured value should exceed the time to suspend and resume a node.
 A negative number disables power saving mode.
 The default value is -1 (disabled).</li>

 <li><b>SuspendRate</b>:
 Maximum number of nodes to be placed into power saving mode
 per minute.
 A value of zero results in no limits being imposed.
 The default value is 60.
 Use this to prevent rapid drops in power consumption.</li>

 <li><b>ResumeRate</b>:
 Maximum number of nodes to be removed from power saving mode
 per minute.
 A value of zero results in no limits being imposed.
 The default value is 300.
 Use this to prevent rapid increases in power consumption.</li>

 <li><b>SuspendProgram</b>:
 Program to be executed to place nodes into power saving mode.
 The program executes as <i>SlurmUser</i> (as configured in
 <i>slurm.conf</i>).
 The argument to the program will be the names of nodes to
 be placed into power savings mode (using SLURM's hostlist
 expression format).</li>

 <li><b>ResumeProgram</b>:
 Program to be executed to remove nodes from power saving mode.
 The program executes as <i>SlurmUser</i> (as configured in
 <i>slurm.conf</i>).
 The argument to the program will be the names of nodes to
 be removed from power savings mode (using SLURM's hostlist
 expression format).
 This program may use the <i>scontrol show node</i> command
 to insure that a node has booted and the <i>slurmd</i>
 daemon started.
 If the <i>slurmd</i> daemon fails to respond within the
 configured <b>SlurmdTimeout</b> value, the node will be
 placed in a DOWN state and the job requesting the node
 will be requeued.
 For reasons of reliability, <b>ResumeProgram</b> may execute
 more than once for a node when the <b>slurmctld</b> daemon
 crashes and is restarted.</li>

 <li><b>SuspendTimeout</b>:
 Maximum time permitted (in second) between when a node suspend request
 is issued and when the node shutdown is complete.
 At that time the node must ready for a resume request to be issued
 as needed for new workload.
 The default value is 30 seconds.</li>

 <li><b>ResumeTimeout</b>:
 Maximum time permitted (in second) between when a node resume request
 is issued and when the node is actually available for use.
 Nodes which fail to respond in this time frame may be marked DOWN and
 the jobs scheduled on the node requeued.
 The default value is 60 seconds.</li>

 <li><b>SuspendExcNodes</b>:
 List of nodes to never place in power saving mode.
 Use SLURM's hostlist expression format.
 By default, no nodes are excluded.</li>

 <li><b>SuspendExcParts</b>:
 List of partitions with nodes to never place in power saving mode.
 Multiple partitions may be specified using a comma separator.
 By default, no nodes are excluded.</li>
 </ul></p>

 <p>Note that <i>SuspendProgram</i> and <i>ResumeProgram</i> execute as
 <i>SlurmUser</i> on the node where the <i>slurmctld</i> daemon runs
 (primary and backup server nodes).
 Use of <i>sudo</i> may be required for <i>SlurmUser</i>to power down
 and restart nodes.
 If you need to convert SLURM's hostlist expression into individual node
 names, the <i>scontrol show hostnames</i> command may prove useful.
 The commands used to boot or shut down nodes will depend upon your
 cluster management tools.</p>

 <p>Note that <i>SuspendProgram</i> and <i>ResumeProgram</i> are not
 subject to any time limits.
 They should perform the required action, ideally verify the action
 (e.g. node boot and start the <i>slurmd</i> daemon, thus the node is
 no longer non-responsive to <i>slurmctld</i>) and terminate.
 Long running programs will be logged by <i>slurmctld</i>, but not
 aborted.</p>

 <p>Also note that the stderr/out of the suspend and resume programs
 are not logged.  If logging is desired it should be added to the
 scripts.</p>

 <pre>
 #!/bin/bash
 # Example SuspendProgram
 echo "`date` Suspend invoked $0 $*" >>/var/log/power_save.log
 hosts=`scontrol show hostnames $1`
 for host in "$hosts"
 do
    sudo node_shutdown $host
 done

 #!/bin/bash
 # Example ResumeProgram
 echo "`date` Resume invoked $0 $*" >>/var/log/power_save.log
 hosts=`scontrol show hostnames $1`
 for host in "$hosts"
 do
    sudo node_startup $host
 done
 </pre>

 <p>Subject to the various rates, limits and exclusions, the power save
 code follows this logic:
 <ol>
 <li>Identify nodes which have been idle for at least <b>SuspendTime</b>.</li>
 <li>Execute <b>SuspendProgram</b> with an argument of the idle node names.</li>
 <li>Identify the nodes which are in power save mode (a flag in the node's
 state field), but have been allocated to jobs.</li>
 <li>Execute <b>ResumeProgram</b> with an argument of the allocated node names.</li>
 <li>Once the <i>slurmd</i> responds, initiate the job and/or job steps
 allocated to it.</li>
 <li>If the <i>slurmd</i> fails to respond within the value configured for
 <b>SlurmdTimeout</b>, the node will be marked DOWN and the job requeued
 if possible.</li>
 <li>Repeat indefinitely.</li>
 </ol></p>

 <p>The slurmctld daemon will periodically (every 10 minutes) log how many
 nodes are in power save mode using messages of this sort:
 <pre>
 [May 02 15:31:25] Power save mode 0 nodes
 ...
 [May 02 15:41:26] Power save mode 10 nodes
 ...
 [May 02 15:51:28] Power save mode 22 nodes
 </pre>

 <p>Using these logs you can easily see the effect of SLURM's power saving
 support.
 You can also configure SLURM with programs that perform no action as <b>SuspendProgram</b> and <b>ResumeProgram</b> to assess the potential
 impact of power saving mode before enabling it.</p>

 <h2>Use of Allocations</h2>

 <p>A resource allocation request will be granted as soon as resources
 are selected for use, possibly before the nodes are all available
 for use.
 The launching of job steps will be delayed until the required nodes
 have been restored to service (it prints a warning about waiting for
 nodes to become available and periodically retries until they are
 available).</p>

 <p>In the case of an <i>sbatch</i> command, the batch program will start
 when node zero of the allocation is ready for use and pre-processing can
 be performed as needed before using <i>srun</i> to launch job steps.
 The operation of <i>salloc</i> and <i>srun</i> follow a similar pattern
 of getting an job allocation at one time, but possibly being unable to
 launch job steps until later.
 If <i>ssh</i> or some other tools is used by <i>salloc</i> it may be
 desirable to execute "<i>srun /bin/true</i>" or some other command
 first to insure that all nodes are booted and ready for use.
 We plan to add a job and node state of <i>CONFIGURING</i> in SLURM
 version 2.1, which could be used to prevent salloc from executing
 any processes (including <i>ssh</i>) until all of the nodes are
 ready for use.</p>

 <h2>Fault Tolerance</h2>

 <p>If the <i>slurmctld</i> daemon is terminated gracefully, it will
 wait up to <b>SuspendTimeout</b> or <b>ResumeTimeout</b> (whichever
 is larger) for any spawned <b>SuspendProgram</b> or
 <b>ResumeProgram</b> to terminate before the daemon terminates.
 If the spawned program does not terminate within that time period,
 the event will be logged and <i>slurmctld</i> will exit in order to
 permit another <i>slurmctld</i> daemon to be initiated.
 Synchronization problems could also occur when the <i>slurmctld</i>
 daemon crashes (a rare event) and is restarted. </p>

 <p>In either event, the newly initiated <i>slurmctld</i> daemon (or
 the backup server) will recover saved node state information that
 may not accurately describe the actual node state.
 In the case of a failed <b>SuspendProgram</b>, the negative impact is
 limited to increased power consumption, so no special action is
 currently taken to execute <b>SuspendProgram</b> multiple times in
 order to insure the node is in a reduced power mode.
 The case of a failed <b>ResumeProgram</b> is more serious in that the
 node could be placed into a DOWN state and/or jobs could fail.
 In order to minimize this risk, when the <i>slurmctld</i> daemon is
 started and node which should be allocated to a job fails to respond,
 the <b>ResumeProgram</b> will be executed (possibly for a second time).</p>

 <h2>Booting Different Images</h2>

 <p>SLURM's <b>PrologSlurmctld</b> configuration parameter can identify a
 program to boot different operating system images for each job based upon it's
 constraint field (or possibly comment).
 If you want <b>ResumeProgram</b> to boot a various images according to
 job specifications, it will need to be a fairly sophisticated program
 and perform the following actions:
 <ol>
 <li>Determine which jobs are associated with the nodes to be booted</li>
 <li>Determine which image is required for each job and</li>
 <li>Boot the appropriate image for each node</li>
 </ol>

 <p style="text-align:center;">Last modified 6 August 2009</p>

 <!--#include virtual="footer.txt"-->
	<!--#include virtual="header.txt"-->

	<h1>Power Saving Guide</h1>

	<p>SLURM provides an integrated power saving mechanism for idle nodes.
	Nodes that remain idle for a configurable period of time can be placed
	in a power saving mode.
	The nodes will be restored to normal operation once work is assigned to them.
	Beginning with version 2.0.0, nodes can be fully powered down.
	Earlier releases of SLURM do not support the powering down of nodes,
	only support of reducing their performance and thus their power consumption.
	For example, power saving can be accomplished using a <i>cpufreq</i> governor
	that can change CPU frequency and voltage (note that the <i>cpufreq</i> driver
	must be enabled in the Linux kernel configuration).
	Of particular note, SLURM can power nodes up or down
	at a configurable rate to prevent rapid changes in power demands.
	For example, starting a 1000 node job on an idle cluster could result
	in an instantaneous surge in power demand of multiple megawatts without
	SLURM's support to increase power demands in a gradual fashion.</p>


	<h2>Configuration</h2>

	<p>A great deal of flexibility is offered in terms of when and
	how idle nodes are put into or removed from power save mode.
	Note that the SLURM control daemon, <i>slurmctld</i>, must be
	restarted to initially enable power saving mode.
	Changes in the configuration parameters (e.g. <i>SuspendTime</i>)
	will take effect after modifying the <i>slurm.conf</i> configuration
	file and executing "<i>scontrol reconfig</i>".
	The following configuration parameters are available:
	<ul>

	<li><b>SuspendTime</b>:
	Nodes becomes eligible for power saving mode after being idle
	for this number of seconds.
	The configured value should exceed the time to suspend and resume a node.
	A negative number disables power saving mode.
	The default value is -1 (disabled).</li>

	<li><b>SuspendRate</b>:
	Maximum number of nodes to be placed into power saving mode
	per minute.
	A value of zero results in no limits being imposed.
	The default value is 60.
	Use this to prevent rapid drops in power consumption.</li>

	<li><b>ResumeRate</b>:
	Maximum number of nodes to be removed from power saving mode
	per minute.
	A value of zero results in no limits being imposed.
	The default value is 300.
	Use this to prevent rapid increases in power consumption.</li>

	<li><b>SuspendProgram</b>:
	Program to be executed to place nodes into power saving mode.
	The program executes as <i>SlurmUser</i> (as configured in
	<i>slurm.conf</i>).
	The argument to the program will be the names of nodes to
	be placed into power savings mode (using SLURM's hostlist
	expression format).</li>

	<li><b>ResumeProgram</b>:
	Program to be executed to remove nodes from power saving mode.
	The program executes as <i>SlurmUser</i> (as configured in
	<i>slurm.conf</i>).
	The argument to the program will be the names of nodes to
	be removed from power savings mode (using SLURM's hostlist
	expression format).
	This program may use the <i>scontrol show node</i> command
	to insure that a node has booted and the <i>slurmd</i>
	daemon started.
	If the <i>slurmd</i> daemon fails to respond within the
	configured <b>SlurmdTimeout</b> value, the node will be
	placed in a DOWN state and the job requesting the node
	will be requeued.
	For reasons of reliability, <b>ResumeProgram</b> may execute
	more than once for a node when the <b>slurmctld</b> daemon
	crashes and is restarted.</li>

	<li><b>SuspendTimeout</b>:
	Maximum time permitted (in second) between when a node suspend request
	is issued and when the node shutdown is complete.
	At that time the node must ready for a resume request to be issued
	as needed for new workload.
	The default value is 30 seconds.</li>

	<li><b>ResumeTimeout</b>:
	Maximum time permitted (in second) between when a node resume request
	is issued and when the node is actually available for use.
	Nodes which fail to respond in this time frame may be marked DOWN and
	the jobs scheduled on the node requeued.
	The default value is 60 seconds.</li>

	<li><b>SuspendExcNodes</b>:
	List of nodes to never place in power saving mode.
	Use SLURM's hostlist expression format.
	By default, no nodes are excluded.</li>

	<li><b>SuspendExcParts</b>:
	List of partitions with nodes to never place in power saving mode.
	Multiple partitions may be specified using a comma separator.
	By default, no nodes are excluded.</li>
	</ul></p>

	<p>Note that <i>SuspendProgram</i> and <i>ResumeProgram</i> execute as
	<i>SlurmUser</i> on the node where the <i>slurmctld</i> daemon runs
	(primary and backup server nodes).
	Use of <i>sudo</i> may be required for <i>SlurmUser</i>to power down
	and restart nodes.
	If you need to convert SLURM's hostlist expression into individual node
	names, the <i>scontrol show hostnames</i> command may prove useful.
	The commands used to boot or shut down nodes will depend upon your
	cluster management tools.</p>

	<p>Note that <i>SuspendProgram</i> and <i>ResumeProgram</i> are not
	subject to any time limits.
	They should perform the required action, ideally verify the action
	(e.g. node boot and start the <i>slurmd</i> daemon, thus the node is
	no longer non-responsive to <i>slurmctld</i>) and terminate.
	Long running programs will be logged by <i>slurmctld</i>, but not
	aborted.</p>

	<p>Also note that the stderr/out of the suspend and resume programs
	are not logged. If logging is desired it should be added to the
	scripts.</p>

	<pre>
	#!/bin/bash
	# Example SuspendProgram
	echo "`date` Suspend invoked $0 $*" >>/var/log/power_save.log
	hosts=`scontrol show hostnames $1`
	for host in "$hosts"
	do
	sudo node_shutdown $host
	done

	#!/bin/bash
	# Example ResumeProgram
	echo "`date` Resume invoked $0 $*" >>/var/log/power_save.log
	hosts=`scontrol show hostnames $1`
	for host in "$hosts"
	do
	sudo node_startup $host
	done
	</pre>

	<p>Subject to the various rates, limits and exclusions, the power save
	code follows this logic:
	<ol>
	<li>Identify nodes which have been idle for at least <b>SuspendTime</b>.</li>
	<li>Execute <b>SuspendProgram</b> with an argument of the idle node names.</li>
	<li>Identify the nodes which are in power save mode (a flag in the node's
	state field), but have been allocated to jobs.</li>
	<li>Execute <b>ResumeProgram</b> with an argument of the allocated node names.</li>
	<li>Once the <i>slurmd</i> responds, initiate the job and/or job steps
	allocated to it.</li>
	<li>If the <i>slurmd</i> fails to respond within the value configured for
	<b>SlurmdTimeout</b>, the node will be marked DOWN and the job requeued
	if possible.</li>
	<li>Repeat indefinitely.</li>
	</ol></p>

	<p>The slurmctld daemon will periodically (every 10 minutes) log how many
	nodes are in power save mode using messages of this sort:
	<pre>
	[May 02 15:31:25] Power save mode 0 nodes
	...
	[May 02 15:41:26] Power save mode 10 nodes
	...
	[May 02 15:51:28] Power save mode 22 nodes
	</pre>

	<p>Using these logs you can easily see the effect of SLURM's power saving
	support.
	You can also configure SLURM with programs that perform no action as <b>SuspendProgram</b> and <b>ResumeProgram</b> to assess the potential
	impact of power saving mode before enabling it.</p>

	<h2>Use of Allocations</h2>

	<p>A resource allocation request will be granted as soon as resources
	are selected for use, possibly before the nodes are all available
	for use.
	The launching of job steps will be delayed until the required nodes
	have been restored to service (it prints a warning about waiting for
	nodes to become available and periodically retries until they are
	available).</p>

	<p>In the case of an <i>sbatch</i> command, the batch program will start
	when node zero of the allocation is ready for use and pre-processing can
	be performed as needed before using <i>srun</i> to launch job steps.
	The operation of <i>salloc</i> and <i>srun</i> follow a similar pattern
	of getting an job allocation at one time, but possibly being unable to
	launch job steps until later.
	If <i>ssh</i> or some other tools is used by <i>salloc</i> it may be
	desirable to execute "<i>srun /bin/true</i>" or some other command
	first to insure that all nodes are booted and ready for use.
	We plan to add a job and node state of <i>CONFIGURING</i> in SLURM
	version 2.1, which could be used to prevent salloc from executing
	any processes (including <i>ssh</i>) until all of the nodes are
	ready for use.</p>

	<h2>Fault Tolerance</h2>

	<p>If the <i>slurmctld</i> daemon is terminated gracefully, it will
	wait up to <b>SuspendTimeout</b> or <b>ResumeTimeout</b> (whichever
	is larger) for any spawned <b>SuspendProgram</b> or
	<b>ResumeProgram</b> to terminate before the daemon terminates.
	If the spawned program does not terminate within that time period,
	the event will be logged and <i>slurmctld</i> will exit in order to
	permit another <i>slurmctld</i> daemon to be initiated.
	Synchronization problems could also occur when the <i>slurmctld</i>
	daemon crashes (a rare event) and is restarted. </p>

	<p>In either event, the newly initiated <i>slurmctld</i> daemon (or
	the backup server) will recover saved node state information that
	may not accurately describe the actual node state.
	In the case of a failed <b>SuspendProgram</b>, the negative impact is
	limited to increased power consumption, so no special action is
	currently taken to execute <b>SuspendProgram</b> multiple times in
	order to insure the node is in a reduced power mode.
	The case of a failed <b>ResumeProgram</b> is more serious in that the
	node could be placed into a DOWN state and/or jobs could fail.
	In order to minimize this risk, when the <i>slurmctld</i> daemon is
	started and node which should be allocated to a job fails to respond,
	the <b>ResumeProgram</b> will be executed (possibly for a second time).</p>

	<h2>Booting Different Images</h2>

	<p>SLURM's <b>PrologSlurmctld</b> configuration parameter can identify a
	program to boot different operating system images for each job based upon it's
	constraint field (or possibly comment).
	If you want <b>ResumeProgram</b> to boot a various images according to
	job specifications, it will need to be a fairly sophisticated program
	and perform the following actions:
	<ol>
	<li>Determine which jobs are associated with the nodes to be booted</li>
	<li>Determine which image is required for each job and</li>
	<li>Boot the appropriate image for each node</li>
	</ol>

	<p style="text-align:center;">Last modified 6 August 2009</p>

	<!--#include virtual="footer.txt"-->