<!--#include virtual="header.txt"-->
<h1>IBM Parallel Environment User and Administrator Guide</h1>
<p>
<a href="#overview">Overview</a><br>
<a href="#user">User Tools</a><br>
<a href="#admin">System Administration</a></p>
<h2><a name="overview">Overview</a></h2>
<p>This document describes the unique features of SLURM on the
IBM computers with the
<a href="http://www-03.ibm.com/systems/software/parallel/">Parallel Environment (PE)</a>
software. You should be familiar with SLURM's mode of operation on Linux
clusters before studying the relatively few differences in operation on systems
with PE, which are described in this document.</p>
<p>Note that Slurm is designed to be a replacement for IBM's LoadLeveler;
the two are not designed to concurrently schedule resources.
Slurm manages network resources and provides the POE command with
a library that emulates LoadLeveler functionality.</p>
<h2><a name="user">User Tools</a></h2>
<p>The normal set of SLURM user tools (srun, scancel, sinfo, squeue, scontrol,
etc.) provides all of the expected services.
The only SLURM command not supported is sattach.
Job steps are launched using the
srun command, which translates its options and invokes IBM's poe command. The
poe command actually launches the tasks. The poe command may also be invoked
directly if desired. The actual task launch process is as follows:</p>
<ol>
<li>Invoke srun command with desired options.</li>
<li>The srun command creates a job allocation (if necessary).</li>
<li>The srun command translates its options and invokes the poe command.</li>
<li>The poe command loads a SLURM library that provides various resource
management functionality.</li>
<li>The poe command, through the SLURM library, creates a SLURM step
allocation and launches a process named
"pmdv12" on the appropriate compute nodes. Note that the "v12" on the end of
the process name represents the version number of the "pmd" process and is
subject to change.</li>
<li>The poe command interacts with the pmdv12 process to launch the application
tasks, handle their I/O, etc. Since the task launch procedure occurs outside of
SLURM's control, none of the normal task-level SLURM support is available.</li>
<li>The poe command, through the SLURM library, reports the completion of the
job step.</li>
</ol>
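<p>For example, assuming an application named "my_app" (illustrative only),
a job step could be launched with a command like the one below; srun creates
the job allocation if necessary and then invokes poe to launch the tasks:</p>
<pre>
$ srun -N2 -n4 ./my_app
</pre>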
<h3>Network Options</h3>
<p>Each job step can specify its desired network options.
For example, one job step may use IP mode communications and the next use
User Space (US) mode communications.
Network options may be specified using srun's --network option or the
SLURM_NETWORK environment variable. Supported network options include:</p>
<ul>
<li>Network protocol</li>
<ul>
<li><b>ip</b> Internet protocol, version 4</li>
<li><b>ipv4</b> Internet protocol, version 4 (default)</li>
<li><b>ipv6</b> Internet protocol, version 6</li>
<li><b>us</b> User Space protocol, may be combined with ipv4 or ipv6</li>
</ul>
<li>Programming interface</li>
<ul>
<li><b>lapi</b> Low-level Application Programming Interface</li>
<li><b>mpi</b> Message Passing Interface (default)</li>
<li><b>pami</b> Parallel Active Message Interface</li>
<li><b>shmem</b> OpenSHMEM interface</li>
<li><b>upc</b> Unified Parallel C Interface</li>
</ul>
<li>Other options</li>
<ul>
<li><b>bulk_xfer [=<i>resources</i>]</b>
Enable bulk transfer of data using Remote Direct-Memory Access (RDMA).
The optional <i>resources</i> specification is a numeric value which can have
a suffix of "k", "K", "m", "M", "g" or "G" for kilobytes, megabytes or
gigabytes.
<b>NOTE:</b> The <i>resources</i> specification is not supported by the
underlying IBM infrastructure as of Parallel Environment version 2.2 and no
value should be specified at this time.</li>
<li><b>cau=<i>count</i></b>
Specify the count of Collective Acceleration Units (CAU) required per
programming interface.
Default value is zero.
Applies only to IBM Power7-IH processors.
POE requires that if <b>cau</b> has a non-zero value then <b>us</b>,
<b>devtype=IB</b> or <b>devtype=HFI</b> must be explicitly specified; otherwise
the request may attempt to allocate CAU with IP communications and fail.</li>
<li><b>devname=<i>name</i></b>
Specify the name of an individual network adapter to use.
For example: "eth0" or "mlx4_0".</li>
<li><b>devtype=<i>type</i></b>
Specify the device type to use for communications.
The supported values of <i>type</i> are:
"IB" (InfiniBand), "HFI" (P7 Host Fabric Interface),
"IPONLY" (IP-Only interfaces), "HPCE" (HPC Ethernet), and
"KMUX" (Kernel Emulation of HPCE).
The devices allocated to a job must all be of the same type.
The default value depends upon what hardware is available and, in order
of preference, is IPONLY (which is not considered in User Space mode),
HFI, IB, HPCE, and KMUX.</li>
<li><b>immed=<i>count</i></b>
Specify the count of immediate send slots per adapter window.
Default value is zero.
Applies only to IBM Power7-IH processors.</li>
<li><b>instances=<i>count</i></b>
Specify the number of network connections for each task on each network.
The default instance count is 1.</li>
<li><b>sn_all</b> Use all available switch adapters (default).
This option can not be combined with sn_single.</li>
<li><b>sn_single</b> Use only one switch adapter.
This option can not be combined with sn_all.
If multiple adapters of different types exist, the devname and/or
devtype option can also be used to select one of them.</li>
</ul>
</ul>
<p>Examples of network option use:
<br><br>
<b>--network=sn_all,mpi</b><br>
Allocate one switch window per task on each node and every network supporting
MPI.
<br><br>
<b>--network=sn_all,mpi,bulk_xfer,us</b><br>
Allocate one switch window per task on each node and every network supporting
MPI and user space communications. Reserve resources for RDMA.
<br><br>
<b>--network=sn_all,instances=3,mpi</b><br>
Allocate three switch windows per task on each node and every network supporting
MPI.
<br><br>
<b>--network=sn_all,mpi,pami</b><br>
Allocate one switch window per task on each node and every network supporting
MPI and a second window supporting PAMI.
<br><br>
<b>--network=devtype=ib,instances=2,lapi,mpi</b><br>
On every InfiniBand network connection, allocate two switch windows each for
both lapi and mpi interfaces. If each node has one InfiniBand network connection,
this would result in four switch windows per task.
</p>
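<p>As a complete command line sketch (the application name and the node and
task counts are illustrative), the network options above may be given either
directly on the srun command line or through the SLURM_NETWORK environment
variable:</p>
<pre>
$ srun -N2 -n8 --network=sn_all,us,mpi,bulk_xfer ./my_app

$ export SLURM_NETWORK=sn_all,us,mpi,bulk_xfer
$ srun -N2 -n8 ./my_app
</pre>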
<p><b>NOTE:</b> Switch resources on a node are shared between all job steps on
that node. If a job step can not be initiated due to insufficient switch
resources being available, that job step will periodically retry allocating
resources for the lifetime of the job unless srun's --immediate option is
used.</p>
<h3>Debugging</h3>
<p>Most debuggers require detailed information about launched tasks such as
host name, process ID, etc. Since that information is only available from
poe (which launches those tasks), the srun command wrapper can not be used
for most debugging purposes. You or the debugging tool must invoke the poe
command directly. In order to facilitate the direct use of poe, srun's
<b>--launch-cmd</b> option may be used together with the options normally used.
srun will then print the equivalent poe command line, which
can subsequently be used with the debugger. The poe options must be explicitly
set even if the command is executed from within an existing SLURM allocation
(i.e. from within an allocation created by the salloc or sbatch command).</p>
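<p>For example, a command of the form shown below (the application name and
options are illustrative) will print the equivalent poe command line; that
command line can then be used with the debugger:</p>
<pre>
$ srun -N2 -n4 --network=us,mpi --launch-cmd ./my_app
</pre>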
<h3>Checkpoint</h3>
<p>Checkpoint/restart is only supported with LoadLeveler.
The <b>checkpoint/poe</b> plugin is based on SLURM's support for checkpointing
of poe in the 2005 time frame and does not work with the current versions
of poe.</p>
<!--
<p>In order to enable checkpoint, the shell executing the poe command must
itself be initiated with the environment variable <b>CHECKPOINT=yes</b>.
One file is written for each node on which the job is executing, plus
another for the script executing poe.
By default, the checkpoint files will be written to the current working
directory of the job.
Names and locations of these files can be controlled using the
environment variables <b>MP_CKPTFILE</b> and <b>MP_CKPTDIR</b>.
Use the squeue command to identify the job and job step of interest.
To initiate a checkpoint in which the job step will continue execution,
use the command: <br>
<b>scontrol check create <i>job_id.step_id</i></b><br>
To initiate a checkpoint in which the job step will terminate afterwards,
use the command: <br>
<b>scontrol check vacate <i>job_id.step_id</i></b></p>
-->
<h3>Unsupported Options</h3>
<p>Some SLURM options can not be supported by PE and the following srun options
are silently ignored:</p>
<ul>
<li>-D, --chdir (set working directory)</li>
<li>-K, --kill-on-bad-exit (terminate step if any task has a non-zero exit code)</li>
<li>-k, --no-kill (set to not kill job upon node failure)</li>
<li>--ntasks-per-core (number of tasks to invoke per core)</li>
<li>--ntasks-per-socket (number of tasks to invoke per socket)</li>
<li>-O, --overcommit (over subscribe resources)</li>
<li>--resv-ports (communication ports reserved for OpenMPI)</li>
<li>--runjob-opts (used only on IBM BlueGene/Q systems)</li>
<li>--signal (signal to send when near time limit and the remaining time required)</li>
<li>--sockets-per-node (number of sockets per node required)</li>
<li>--task-epilog (per-task epilog program)</li>
<li>--task-prolog (per-task prolog program)</li>
<li>-u, --unbuffered (avoid line buffering)</li>
<li>-W, --wait (specify job wait time after first task exit)</li>
<li>-Z, --no-allocate (launch tasks without creating a job allocation)</li>
</ul>
<p>A limited subset of srun's --cpu-bind options are supported as shown
below. If the --cpus-per-task option is not specified, a value of one is used
by default. Note that SLURM's mask_cpu and map_cpu options are not supported,
nor are options to bind to sockets or boards.</p>
<table border=1 cellspacing=0 cellpadding=4>
<tr>
<td><b>SLURM option</b></td>
<td><b>POE equivalent</b></td>
</tr>
<tr>
<td>--cpu-bind=threads --cpus-per-task=#</td>
<td>-task_affinity=cpu:#</td>
</tr>
<tr>
<td>--cpu-bind=cores --cpus-per-task=#</td>
<td>-task_affinity=core:#</td>
</tr>
<tr>
<td>--cpu-bind=rank --cpus-per-task=#</td>
<td>-task_affinity=cpu:#</td>
</tr>
</table>
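<p>For example, the srun command below (task count and binary name are
illustrative) corresponds to the POE task affinity option shown in the
comment:</p>
<pre>
$ srun --cpu-bind=cores --cpus-per-task=2 -n8 ./my_app
# POE equivalent task affinity option: -task_affinity=core:2
</pre>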
<p>In addition, file name specifications with expression substitution
(e.g. file names including "%j" for job_ID, "%J" for job_ID.step_ID,
"%s" for step_ID, "%t" for task_ID, or "%n" for node_ID) are not supported.
This affects the following options:</p>
<ul>
<li>-e, --error</li>
<li>-i, --input</li>
<li>-o, --output</li>
</ul>
<p>For the srun command's --multi-prog option (Multiple Program,
Multiple Data configurations), the command file will be translated from
SLURM's format to a POE format. POE does not support SLURM expressions
in the MPMD configuration file (e.g. "%t" will not be replaced with the task's
number and "%o" will not be replaced with the task's offset within
this range). The command file will be stored in a temporary file
in the ".slurm" subdirectory of your home directory. The file name will have a
prefix of "slurm_cmdfile." followed by srun's process id. If the srun command
does not terminate gracefully, this file may persist after the job step's
termination and not be purged. You can find and purge these files using
commands of the form shown below. You should only purge older files, after
their job steps are completed.</p>
<pre>
$ ls -l ~/.slurm/slurm_cmdfile.*
$ rm ~/.slurm/slurm_cmdfile.*
</pre>
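<p>For reference, a SLURM MPMD command file has the form sketched below (the
rank ranges and program names are illustrative); it would be used as
<b>srun -n4 --multi-prog my.conf</b>, and the translated POE version of this
file is what gets written to the ".slurm" subdirectory described above:</p>
<pre>
# my.conf: SLURM --multi-prog command file
0    ./master
1-3  ./worker
</pre>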
<p>The -L/--label option differs slightly in that when the output from multiple
tasks is identical, it is combined on a single line with a prefix
identifying which task(s) generated the output. In addition, there is a colon
but no space between the task IDs and output. For example:</p>
<pre>
# SLURM OUTPUT
0: foo
1: foo
2: foo
0: bar
1: barr
2: bar
# POE OUTPUT
0-2:foo
0 2:bar
1:barr
</pre>
<p>In addition, when srun's --multi-prog option (for Multiple Program,
Multiple Data configurations) is used with the -L/--label option, a job
step ID, colon and space will precede the task ID and colon. For example:</p>
<pre>
# SLURM OUTPUT
0: zero
1: one
2: two
# POE OUTPUT (FOR STEP ID 1)
1: 0: zero
1: 1: one
1: 2: two
</pre>
<p>The srun command is not able to report task status upon receipt of a SIGINT
signal (ctrl-c interrupt from the keyboard); however, two SIGINT signals within a
one-second interval will terminate the job, as on other SLURM configurations.</p>
<h3>Environment Variables</h3>
<p>Since SLURM is not directly launching user tasks, the following environment
variables are NOT available with POE:</p>
<ul>
<li>SLURM_CPU_BIND_LIST</li>
<li>SLURM_CPU_BIND_TYPE</li>
<li>SLURM_CPU_BIND_VERBOSE</li>
<li>SLURM_CPUS_ON_NODE</li>
<li>SLURM_GTIDS</li>
<li>SLURM_LAUNCH_NODE_IPADDR</li>
<li>SLURM_LOCALID</li>
<li>SLURM_MEM_BIND_LIST</li>
<li>SLURM_MEM_BIND_TYPE</li>
<li>SLURM_MEM_BIND_VERBOSE</li>
<li>SLURM_NODEID</li>
<li>SLURM_PROCID</li>
<li>SLURM_SRUN_COMM_HOST</li>
<li>SLURM_SRUN_COMM_PORT</li>
<li>SLURM_TASK_PID</li>
<li>SLURM_TASKS_PER_NODE</li>
<li>SLURM_TOPOLOGY_ADDR</li>
<li>SLURM_TOPOLOGY_ADDR_PATTERN</li>
</ul>
<p>Note that POE sets a variety of environment variables that provide similar
information to some of the missing SLURM environment variables. In particular,
note the following environment variables:</p>
<ul>
<li>MP_I_UPMD_HOSTNAME (local hostname)</li>
<li>MP_CHILD (global task ID)</li>
</ul>
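<p>For example, a script or wrapper launched by poe could use MP_CHILD where
it would otherwise use SLURM_PROCID (a minimal sketch; the test performed and
the application name are illustrative):</p>
<pre>
#!/bin/sh
# Launched by poe; MP_CHILD holds the global task ID
if [ "$MP_CHILD" -eq 0 ]; then
    echo "task 0 is running on $(hostname)"
fi
exec ./my_app
</pre>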
<h3>Gang Scheduling</h3>
<p>SLURM can be configured to gang schedule (time slice) parallel jobs by
alternately suspending and resuming them. Depending upon the number of
jobs configured to time slice and the time slice interval (as specified in
the <i>slurm.conf</i> file using the <b>Shared</b> and <b>SchedulerTimeSlice</b>
options), the job may experience communication timeouts. Set the environment
variable <b>MP_TIMEOUT</b> to specify an appropriate communication timeout
value. Note that the default timeout is 150 seconds. See
<a href="gang_scheduling.html">Gang Scheduling</a> for more information.</p>
<pre>
export MP_TIMEOUT=600
</pre>
<h3>Other User Notes</h3>
<p>POE can not support a step ID of zero. In POE installations, a job's
first step ID will be 1 rather than 0.</p>
<p>Since the SLURM step launches the PE PMD process instead of the
user's tasks, the exit code stored in accounting will be that of the PMD
rather than of the user's tasks. The exit code of a job allocation
started with srun will be correct, since srun captures the exit code from
the wrapped poe command.</p>
<h2><a name="admin">System Administration</a></h2>
<p>There are several critical SLURM configuration parameters for use with PE.
These configuration parameters should be set in your <b>slurm.conf</b> file.
<b>LaunchType</b> defines the task launch mechanism to be used and must be set
to <b>launch/poe</b>. This configuration means that poe will be used to launch
all applications.
<b>SwitchType</b> defines the mechanism used to manage the network switch. It
must be set to <b>switch/nrt</b>, which uses IBM's Network Resource Table (NRT)
interface, on ALL nodes in the cluster (only a few of them will actually interact
with IBM's NRT library, but all need to work with the NRT data structures).
Task launch is slower in this environment than with a typical Linux cluster
and the <b>MessageTimeout</b> must be configured to a sufficiently large value
so that large parallel jobs can be launched without SLURM's job step
credentials expiring.
When switch resources are allocated to a job, all processes spawned by that job
must be terminated before the switch resources can be released for use by another
program. This means that reliable tracking of all spawned processes is critical
for switch use. Use of <b>ProctrackType=proctrack/cgroup</b> is strongly
recommended. Use of any other process tracking plugin significantly increases
the likelihood of orphan processes that must be manually identified and killed
in order to release switch resources.
While it is possible to configure distinct <b>NodeName</b> and
<b>NodeHostName</b> parameters for the compute nodes, this is discouraged
for performance reasons (the <b>switch/nrt</b> plugin is not optimized for
such a configuration).</p>
<pre>
# Excerpt of slurm.conf
LaunchType=launch/poe
SwitchType=switch/nrt
MessageTimeout=30
ProctrackType=proctrack/cgroup
</pre>
<p>In order for these plugins to be built, the locations of the POE Resource
Manager header file (permapi.h), the NRT header file (nrt.h) and the NRT library
(libnrt.so) must be identified when SLURM is built.
The header files are needed at build time to get NRT data structures, function
return codes, etc.
The NRT library location is needed to identify where the library is to be
loaded from by Slurm's switch/nrt plugin, but the library itself is only
loaded when needed and only by the slurmd daemon.</p>
<p>SLURM searches for the header files in the /usr/include directory by default.
If the files are not installed there, you can specify a different location using
the <b>--with-nrth=PATH</b> option to the configure program, where "PATH" is
the fully qualified pathname of the parent directory(ies) of the nrt.h and
permapi.h files.
SLURM searches for the libnrt.so file in the /usr/lib and /usr/lib64 directories
by default. If the file is not installed there, you can specify a different
location using the <b>--with-nrtlib=PATH</b> option to the configure program,
where "PATH" is the fully qualified pathname of the parent directory of the
libnrt.so file.
Alternately these values may be specified in your ~/.rpmmacros file.
For example:</p>
<pre>
%_with_nrth "/opt/ibmhpc/pecurrent/base/include"
%_with_libnrt "/opt/ibmhpc/pecurrent/base/intel/lib64"
</pre>
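<p>When building SLURM directly from source rather than from an RPM, the
equivalent configure invocation would look like the sketch below, using the
paths from the example above:</p>
<pre>
$ ./configure --with-nrth=/opt/ibmhpc/pecurrent/base/include \
              --with-nrtlib=/opt/ibmhpc/pecurrent/base/intel/lib64
</pre>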
<p><b>IMPORTANT:</b> The poe command interacts with SLURM by loading a
SLURM library providing a variety of functions for its use. The
library name is <i>"libpermapi.so"</i> and it is installed with the
other SLURM libraries in the subdirectory "lib/slurm". You must
modify the link of /usr/lib64/libpermapi.so to point to the location
of the slurm version of this library.</p>
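<p>A sketch of creating that link, assuming SLURM's libraries were installed
under /usr/lib64/slurm (adjust the paths for your installation):</p>
<pre>
ln -sf /usr/lib64/slurm/libpermapi.so /usr/lib64/libpermapi.so
</pre>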
<p>Modifying the "/etc/poe.limits" file is <b>not</b> enough. The poe
command is loading and using the libpermapi.so library initially
from the /usr/lib64 directory. It later reads the /etc/poe.limits
file and loads the library listed there. In order for poe to work
with SLURM, it needs to use the "libpermapi.so" generated by SLURM
for all of its functions. Until poe is modified to only load the
correct library, it is necessary for /usr/lib64/libpermapi.so to
contain SLURM's library or a link to it.</p>
<p>If you are having problems running on more than 32 nodes, this is
most likely the cause.</p>
<h3>Job Scheduling</h3>
<p>SLURM can be configured to gang schedule (time slice) parallel jobs by
alternately suspending and resuming them. Depending upon the number of
jobs configured to time slice and the time slice interval (as specified in
the <i>slurm.conf</i> file using the <b>Shared</b> and <b>SchedulerTimeSlice</b>
options), the job may experience communication timeouts. Set the environment
variable <b>MP_TIMEOUT</b> to specify an appropriate communication timeout
value. Note that the default timeout is 150 seconds. See
<a href="gang_scheduling.html">Gang Scheduling</a> for more information.</p>
<pre>
export MP_TIMEOUT=600
</pre>
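<p>A minimal slurm.conf sketch of the gang scheduling related parameters
(the partition name, node names and values are illustrative; see the
Gang Scheduling page for the complete set of required parameters):</p>
<pre>
# Excerpt of slurm.conf
SchedulerTimeSlice=120
PartitionName=batch Nodes=tux[9-16] Shared=FORCE:2
</pre>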
<p>SLURM also can support long term job preemption with IBM's Parallel
Environment. Jobs can be explicitly preempted and later resumed using the
<b>scontrol suspend &lt;jobid&gt;</b> and <b>scontrol resume &lt;jobid&gt;</b>
commands. This functionality relies upon NRT functions to suspend/resume
programs and reset MPI timeouts. Note that SLURM supports the preemption only
of whole jobs rather than individual job steps. A suspended job will relinquish
CPU resources, but retain memory and switch window resources. Note that the
long term suspension of jobs with any allocated Collective Acceleration
Units (CAU) is disabled and an error message to that effect will be generated
in response to such a request. In addition, version 1200 or higher of IBM's NRT
API is required to support this functionality.</p>
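<p>For example, to explicitly preempt and later resume a job with ID 1234
(an illustrative job ID):</p>
<pre>
$ scontrol suspend 1234
$ scontrol resume 1234
</pre>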
<h3>Design Notes</h3>
<p>It is necessary for all nodes that can be used for scheduling a single job
to have the same network adapter types and count. For example, if node "tux1"
has two ethernet adapters then the node "tux2" in the same cluster must also
have two ethernet adapters on the same networks or be in a different SLURM
partition so that one job can not be allocated resources on both nodes.
Without this restriction, a job may be allocated adapter resources on one node
and be unable to allocate the corresponding adapter resources on another
node.</p>
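<p>A slurm.conf sketch of the partitioning approach described above (the node
and partition names and adapter counts are illustrative):</p>
<pre>
# Excerpt of slurm.conf
# tux[1-8] have two ethernet adapters, tux[9-16] have one
PartitionName=two_nic Nodes=tux[1-8]
PartitionName=one_nic Nodes=tux[9-16]
</pre>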
<p>It is possible to configure SLURM and LoadLeveler to simultaneously exist
on a cluster, however each scheduler must be configured to manage different
compute nodes (e.g. LoadLeveler can manage compute nodes "tux[1-8]" and SLURM
can manage compute nodes "tux[9-16]" on the same cluster). In addition, the
/etc/poe.limits file on each node must identify the MP_PE_RMLIB appropriate
for that node (e.g. IBM's or SLURM's libpermapi.so).
If Slurm and LoadLeveler are configured to simultaneously manage the same
nodes, you should expect both resource managers to try assigning the same
resources. This will result in job failures.</p>
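<p>A sketch of the corresponding /etc/poe.limits entry on a SLURM-managed node
(the exact syntax and library path should be verified against the IBM PE
documentation for your installation):</p>
<pre>
MP_PE_RMLIB=/usr/lib64/slurm/libpermapi.so
</pre>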
<p>The srun command uses the <b>launch/poe</b> plugin to launch the poe program.
Then poe uses the <b>launch/slurm</b> plugin to launch the "pmd" process on the
compute nodes, so two launch plugins are actually used.</p>
<p>Depending upon job size and network options, allocating and deallocating
switch resources can take multiple seconds per node and the process of launching
applications on multiple nodes is not well parallelized.
This is outside of SLURM's control.</p>
<p>The two figures below show a high-level overview of the switch/nrt and
launch/poe plugins. A typical Slurm installation with IBM PE would make
use of both plugins, but the operation of each is shown independently for
improved clarity. Note that the switch/nrt plugin is needed by not only the
slurmd daemon, but also the slurmctld daemon (for managing switch allocation
data structures) and the srun command (for packing and unpacking switch
allocation information used at task launch time).
In figure 2, note that the libpermapi library issues the job and job step
creation requests. The srun command is an optional front-end for the poe
command and the poe command can be invoked directly by the user if desired.</p>
<img src="ibm_pe_fig1.png" width=600>
<center>
<p>Figure 1: Use of the switch/nrt plugin</p>
</center>
<img src="ibm_pe_fig2.png" width=600>
<center>
<p>Figure 2: Use of the launch/poe plugin</p>
</center>
<h3>Debugging Notes</h3>
<p>It is possible to generate detailed logging of all switch/nrt actions and
data by configuring <b>DebugFlags=switch</b>.</p>
<p>The environment variable <b>MP_INFOLEVEL</b> can be used to enable the
logging of POE debug messages. To enable fairly detailed logging, set
<b>MP_INFOLEVEL=6</b>.</p>
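<p>For example, combining both settings:</p>
<pre>
# In slurm.conf:
DebugFlags=switch
# In the job's environment:
export MP_INFOLEVEL=6
</pre>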
<p>The Protocol Network Services Daemon (PNSD) manages the Network Resource
Table (NRT) information on each node. Its logs are written to the file
<b>/tmp/serverlog</b>, which may be useful to diagnose problems. In order to
execute PNSD in debug mode (for extra debugging information), run the following
commands as user root:</p>
<pre>
stopsrc -s pnsd
startsrc -s pnsd -a -D
</pre>
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 14 March 2013</p></td>
<!--#include virtual="footer.txt"-->