<!--#include virtual="header.txt"-->
<h1>Power Saving Guide</h1>
<p>SLURM provides an integrated power saving mechanism for idle nodes.
Nodes that remain idle for a configurable period of time can be placed
in a power saving mode.
The nodes will be restored to normal operation once work is assigned to them.
Beginning with version 2.0.0, nodes can be fully powered down.
Earlier releases of SLURM cannot power nodes down;
they can only reduce node performance and thus power consumption.
For example, power saving can be accomplished using a <i>cpufreq</i> governor
that can change CPU frequency and voltage (note that the <i>cpufreq</i> driver
must be enabled in the Linux kernel configuration).
Of particular note, SLURM can power nodes up or down
at a configurable rate to prevent rapid changes in power demands.
For example, starting a 1000 node job on an idle cluster could result
in an instantaneous surge in power demand of multiple megawatts without
SLURM's support to increase power demands in a gradual fashion.</p>
<h2>Configuration</h2>
<p>A great deal of flexibility is offered in terms of when and
how idle nodes are put into or removed from power save mode.
Note that the SLURM control daemon, <i>slurmctld</i>, must be
restarted to initially enable power saving mode.
Changes in the configuration parameters (e.g. <i>SuspendTime</i>)
will take effect after modifying the <i>slurm.conf</i> configuration
file and executing "<i>scontrol reconfig</i>".
The following configuration parameters are available:
<ul>
<li><b>SuspendTime</b>:
Nodes become eligible for power saving mode after being idle
for this number of seconds.
The configured value should exceed the time to suspend and resume a node.
A negative number disables power saving mode.
The default value is -1 (disabled).</li>
<li><b>SuspendRate</b>:
Maximum number of nodes to be placed into power saving mode
per minute.
A value of zero results in no limits being imposed.
The default value is 60.
Use this to prevent rapid drops in power consumption.</li>
<li><b>ResumeRate</b>:
Maximum number of nodes to be removed from power saving mode
per minute.
A value of zero results in no limits being imposed.
The default value is 300.
Use this to prevent rapid increases in power consumption.</li>
<li><b>SuspendProgram</b>:
Program to be executed to place nodes into power saving mode.
The program executes as <i>SlurmUser</i> (as configured in
<i>slurm.conf</i>).
The argument to the program will be the names of nodes to
be placed into power savings mode (using SLURM's hostlist
expression format).</li>
<li><b>ResumeProgram</b>:
Program to be executed to remove nodes from power saving mode.
The program executes as <i>SlurmUser</i> (as configured in
<i>slurm.conf</i>).
The argument to the program will be the names of nodes to
be removed from power savings mode (using SLURM's hostlist
expression format).
This program may use the <i>scontrol show node</i> command
to ensure that a node has booted and the <i>slurmd</i>
daemon started.
If the <i>slurmd</i> daemon fails to respond within the
configured <b>SlurmdTimeout</b> value, the node will be
placed in a DOWN state and the job requesting the node
will be requeued.
For reasons of reliability, <b>ResumeProgram</b> may execute
more than once for a node when the <b>slurmctld</b> daemon
crashes and is restarted.</li>
<li><b>SuspendTimeout</b>:
Maximum time permitted (in seconds) between when a node suspend request
is issued and when the node shutdown is complete.
At that time the node must be ready for a resume request to be issued
as needed for new workload.
The default value is 30 seconds.</li>
<li><b>ResumeTimeout</b>:
Maximum time permitted (in seconds) between when a node resume request
is issued and when the node is actually available for use.
Nodes which fail to respond in this time frame may be marked DOWN and
the jobs scheduled on the node requeued.
The default value is 60 seconds.</li>
<li><b>SuspendExcNodes</b>:
List of nodes to never place in power saving mode.
Use SLURM's hostlist expression format.
By default, no nodes are excluded.</li>
<li><b>SuspendExcParts</b>:
List of partitions with nodes to never place in power saving mode.
Multiple partitions may be specified using a comma separator.
By default, no nodes are excluded.</li>
</ul></p>
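<p>As an illustration, a <i>slurm.conf</i> excerpt using these parameters
might look like the following.
The values, program paths, node names and partition name are assumptions
chosen for the example, not recommendations; note that <i>SuspendTime</i>
is kept larger than the combined suspend and resume timeouts.</p>
<pre>
# slurm.conf excerpt (illustrative values only)
SuspendTime=1800
SuspendRate=60
ResumeRate=300
# Hypothetical script locations
SuspendProgram=/usr/local/sbin/slurm_suspend
ResumeProgram=/usr/local/sbin/slurm_resume
SuspendTimeout=30
ResumeTimeout=300
# Hypothetical node and partition names
SuspendExcNodes=login[1-2]
SuspendExcParts=debug
</pre>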
<p>Note that <i>SuspendProgram</i> and <i>ResumeProgram</i> execute as
<i>SlurmUser</i> on the node where the <i>slurmctld</i> daemon runs
(primary and backup server nodes).
Use of <i>sudo</i> may be required for <i>SlurmUser</i> to power down
and restart nodes.
If you need to convert SLURM's hostlist expression into individual node
names, the <i>scontrol show hostnames</i> command may prove useful.
The commands used to boot or shut down nodes will depend upon your
cluster management tools.</p>
<p>Note that <i>SuspendProgram</i> and <i>ResumeProgram</i> are not
subject to any time limits.
They should perform the required action, ideally verify the action
(e.g. confirm that the node has booted and started the <i>slurmd</i>
daemon, so that it again responds to <i>slurmctld</i>) and terminate.
Long running programs will be logged by <i>slurmctld</i>, but not
aborted.</p>
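<p>One way a <i>ResumeProgram</i> might verify that a node has come back
is to poll <i>scontrol show node</i>.
The following is a minimal sketch, not part of SLURM: the function names,
retry count and delay are hypothetical, and it assumes that the output
contains a <i>State=</i> field and that a trailing "*" on the state marks
a non-responding node, which should be checked against your release.</p>
<pre>
#!/bin/bash
# Print the State= value reported for one node (empty if not found).
node_state() {
   scontrol show node "$1" | tr ' ' '\n' | sed -n 's/^State=//p' | head -n 1
}

# Poll until the node responds or the retry budget is used up.
# Usage: wait_for_node NODE [TRIES] [DELAY_SECONDS]
wait_for_node() {
   local node=$1 tries=${2:-30} delay=${3:-10} state
   while [ "$tries" -gt 0 ]
   do
      state=$(node_state "$node")
      case "$state" in
         ""|*"*"|DOWN*|UNKNOWN*) ;;   # not ready yet, keep polling
         *) return 0 ;;               # node is responding
      esac
      tries=$((tries - 1))
      sleep "$delay"
   done
   return 1
}
</pre>
<p>A resume script could call <i>wait_for_node</i> for each host after
issuing its boot command and log any node that does not respond within
the retry budget.</p>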
<p>Also note that the standard output and standard error of the suspend
and resume programs are not logged. If logging is desired, it should be
added to the scripts.</p>
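<pre>
#!/bin/bash
# Example SuspendProgram
# $1 is a hostlist expression naming the nodes to power down.
echo "$(date) Suspend invoked $0 $*" >>/var/log/power_save.log
# Expand the hostlist expression into individual node names.
hosts=$(scontrol show hostnames "$1")
# $hosts is left unquoted so that the loop runs once per node.
for host in $hosts
do
   # node_shutdown stands in for your site's power-down command.
   sudo node_shutdown "$host"
done

#!/bin/bash
# Example ResumeProgram
# $1 is a hostlist expression naming the nodes to power up.
echo "$(date) Resume invoked $0 $*" >>/var/log/power_save.log
hosts=$(scontrol show hostnames "$1")
for host in $hosts
do
   # node_startup stands in for your site's power-up command.
   sudo node_startup "$host"
done
</pre>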
<p>Subject to the various rates, limits and exclusions, the power save
code follows this logic:
<ol>
<li>Identify nodes which have been idle for at least <b>SuspendTime</b>.</li>
<li>Execute <b>SuspendProgram</b> with an argument of the idle node names.</li>
<li>Identify the nodes which are in power save mode (a flag in the node's
state field), but have been allocated to jobs.</li>
<li>Execute <b>ResumeProgram</b> with an argument of the allocated node names.</li>
<li>Once the <i>slurmd</i> responds, initiate the job and/or job steps
allocated to it.</li>
<li>If the <i>slurmd</i> fails to respond within the value configured for
<b>SlurmdTimeout</b>, the node will be marked DOWN and the job requeued
if possible.</li>
<li>Repeat indefinitely.</li>
</ol></p>
<p>The slurmctld daemon will periodically (every 10 minutes) log how many
nodes are in power save mode using messages of this sort:
<pre>
[May 02 15:31:25] Power save mode 0 nodes
...
[May 02 15:41:26] Power save mode 10 nodes
...
[May 02 15:51:28] Power save mode 22 nodes
</pre>
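<p>The node counts in messages of this form can be pulled out with a short
filter.
This is a sketch that assumes the message layout shown above; the function
name is hypothetical and the log file location depends on your
configuration.</p>
<pre>
#!/bin/bash
# Read slurmctld log lines on stdin and print "timestamp count" for
# each power save summary message; all other lines are ignored.
# Example: grep 'Power save mode' /path/to/slurmctld.log | power_save_counts
power_save_counts() {
   sed -n 's/^\[\(.*\)\] Power save mode \([0-9][0-9]*\) nodes.*$/\1 \2/p'
}
</pre>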
<p>Using these logs you can easily see the effect of SLURM's power saving
support.
You can also configure SLURM with programs that perform no action as <b>SuspendProgram</b> and <b>ResumeProgram</b> to assess the potential
impact of power saving mode before enabling it.</p>
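<p>A program that performs no action can be as small as the following
sketch, which only records each request.
The same script could be configured as both <b>SuspendProgram</b> and
<b>ResumeProgram</b>; the <i>POWER_SAVE_LOG</i> variable and the function
name are assumptions made for this example and are not SLURM settings.</p>
<pre>
#!/bin/bash
# No-op SuspendProgram or ResumeProgram for a trial run.
# POWER_SAVE_LOG is a hypothetical override for the log location.
LOG=${POWER_SAVE_LOG:-/var/log/power_save.log}

# Append one timestamped line describing the request.
log_request() {
   echo "$(date) $1 would act on: $2" >>"$LOG"
}

# SLURM passes the hostlist expression as the first argument.
if [ $# -gt 0 ]
then
   log_request "$0" "$1"
fi
</pre>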
<h2>Use of Allocations</h2>
<p>A resource allocation request will be granted as soon as resources
are selected for use, possibly before the nodes are all available
for use.
The launching of job steps will be delayed until the required nodes
have been restored to service (<i>srun</i> prints a warning about waiting
for nodes to become available and periodically retries until they are
available).</p>
<p>In the case of an <i>sbatch</i> command, the batch program will start
when node zero of the allocation is ready for use and pre-processing can
be performed as needed before using <i>srun</i> to launch job steps.
The operation of <i>salloc</i> and <i>srun</i> follows a similar pattern
of getting a job allocation at one time, but possibly being unable to
launch job steps until later.
If <i>ssh</i> or some other tool is used by <i>salloc</i> it may be
desirable to execute "<i>srun /bin/true</i>" or some other command
first to ensure that all nodes are booted and ready for use.
We plan to add a job and node state of <i>CONFIGURING</i> in SLURM
version 2.1, which could be used to prevent salloc from executing
any processes (including <i>ssh</i>) until all of the nodes are
ready for use.</p>
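<p>Until such a state is available, a script run under <i>salloc</i> can
wrap the "<i>srun /bin/true</i>" check in a small helper.
This is only a sketch: the function name is hypothetical, and it assumes
that <i>srun</i> with no other options starts a task on every node of the
allocation.</p>
<pre>
#!/bin/bash
# Run a command only after srun has been able to start a task,
# which implies that the allocated nodes are booted.
# Usage (inside an allocation): run_when_ready ssh NODE COMMAND
run_when_ready() {
   srun /bin/true || return 1
   "$@"
}
</pre>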
<h2>Fault Tolerance</h2>
<p>If the <i>slurmctld</i> daemon is terminated gracefully, it will
wait up to <b>SuspendTimeout</b> or <b>ResumeTimeout</b> (whichever
is larger) for any spawned <b>SuspendProgram</b> or
<b>ResumeProgram</b> to terminate before the daemon terminates.
If the spawned program does not terminate within that time period,
the event will be logged and <i>slurmctld</i> will exit in order to
permit another <i>slurmctld</i> daemon to be initiated.
Synchronization problems could also occur when the <i>slurmctld</i>
daemon crashes (a rare event) and is restarted. </p>
<p>In either event, the newly initiated <i>slurmctld</i> daemon (or
the backup server) will recover saved node state information that
may not accurately describe the actual node state.
In the case of a failed <b>SuspendProgram</b>, the negative impact is
limited to increased power consumption, so no special action is
currently taken to execute <b>SuspendProgram</b> multiple times in
order to ensure the node is in a reduced power mode.
The case of a failed <b>ResumeProgram</b> is more serious in that the
node could be placed into a DOWN state and/or jobs could fail.
In order to minimize this risk, when the <i>slurmctld</i> daemon is
started and a node which should be allocated to a job fails to respond,
the <b>ResumeProgram</b> will be executed (possibly for a second time).</p>
<h2>Booting Different Images</h2>
<p>SLURM's <b>PrologSlurmctld</b> configuration parameter can identify a
program to boot different operating system images for each job based upon its
constraint field (or possibly comment).
If you want <b>ResumeProgram</b> to boot various images according to
job specifications, it will need to be a fairly sophisticated program
and perform the following actions:
<ol>
<li>Determine which jobs are associated with the nodes to be booted</li>
<li>Determine which image is required for each job and</li>
<li>Boot the appropriate image for each node</li>
</ol>
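<p>The first two steps could be sketched as follows.
This is an illustration under stated assumptions rather than a tested
recipe: the function names are hypothetical, the <i>squeue</i> format
fields (%i job ID, %f features, %N node list) should be checked against
your release, and the convention that a feature named
<i>image_NAME</i> selects image <i>NAME</i> is invented for the example.
Only comma-separated constraints are handled.</p>
<pre>
#!/bin/bash
# Print the constraint (feature) string of the job holding a node.
job_features_for_node() {
   local node=$1 jobid features nodelist
   squeue -h -o "%i;%f;%N" | while IFS=';' read -r jobid features nodelist
   do
      [ -n "$nodelist" ] || continue       # pending job, no nodes yet
      if scontrol show hostnames "$nodelist" | grep -qxF "$node"
      then
         echo "$features"
         break
      fi
   done
}

# Map a constraint string to an image name (hypothetical convention).
image_for_features() {
   local f
   for f in $(echo "$1" | tr ',' ' ')
   do
      case "$f" in
         image_*) echo "${f#image_}"; return 0 ;;
      esac
   done
   echo default
}
</pre>
<p>A <b>ResumeProgram</b> could call these for each node it is asked to
boot and pass the resulting image name to the site's own boot command.</p>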
<p style="text-align:center;">Last modified 6 August 2009</p>
<!--#include virtual="footer.txt"-->