<!--#include virtual="header.txt"-->
<h1>SLURM User and Administrator Guide for Cray systems</h1>
<h2>User Guide</h2>
<p>This document describes the unique features of SLURM on Cray computers.
You should be familiar with SLURM's mode of operation on Linux clusters
before studying the differences in Cray system operation described in this
document.</p>
<p>SLURM version 2.3 is designed to operate as a job scheduler over Cray's
Application Level Placement Scheduler (ALPS).
Use SLURM's <i>sbatch</i> or <i>salloc</i> commands to create a resource
allocation in ALPS.
Then use ALPS' <i>aprun</i> command to launch parallel jobs within the resource
allocation.
The resource allocation is released once the batch script or the
<i>salloc</i> command terminates.
Alternatively, there is an <i>aprun</i> wrapper distributed with SLURM in
<i>contribs/cray/srun</i> which will translate <i>srun</i> options
into the equivalent <i>aprun</i> options. This wrapper will also execute
<i>salloc</i> as needed to create a job allocation in which to run the
<i>aprun</i> command. The <i>srun</i> script adds two new options:
<i>--man</i>, which prints a summary of the options, including notes about which
<i>srun</i> options are not supported, and <i>--alps=</i>, which can be used
to specify <i>aprun</i> options which lack an equivalent within <i>srun</i>.
For example, <i>srun --alps="-a xt" -n 4 a.out</i>.
Since <i>aprun</i> is used to launch tasks (the equivalent of a SLURM
job step), the job steps will not be visible using SLURM commands.
Other than SLURM's <i>srun</i> command being replaced by <i>aprun</i>
and the job steps not being visible, all other SLURM commands will operate
as expected. Note that in order to build and install the aprun wrapper
described above, execute "configure" with the <i>--with-srun2aprun</i>
option or add <i>%_with_srun2aprun 1</i> to your <i>~/.rpmmacros</i> file.</p>
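<p>For example, either of the following enables the wrapper build (a sketch; any
other configure options your site needs are omitted here):</p>
<pre>
# when building from source:
./configure --with-srun2aprun [other options]
# or, when building RPMs:
echo "%_with_srun2aprun 1" >> ~/.rpmmacros
</pre>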
<h3>Node naming and node geometry on Cray XT/XE systems</h3>
<p>SLURM node names will be of the form "nid#####" where "#####" is a five-digit sequence number.
Other information available about a node is its XYZ coordinate, in the node's <i>NodeAddr</i>
field, and its component label, in the <i>NodeHostName</i> field.
The format of the component label is "c#-#c#s#n#" where the "#" fields represent in order:
cabinet, row, cage, blade or slot, and node.
For example, "c0-1c2s5n3" is cabinet 0, row 1, cage 2, slot 5 and node 3.</p>
<p>Cray XT/XE systems come with a 3D torus by default. On smaller systems the cabling in the X dimension is
omitted, resulting in a two-dimensional torus (1 x Y x Z). On Gemini/XE systems, pairs of adjacent nodes
(nodes 0/1 and 2/3 on each blade) share one network interface each. This causes the same Y coordinate to
be assigned to those nodes, so that the number of distinct torus coordinates is half the number of total
nodes.</p>
<p>The SLURM <i>smap</i> and <i>sview</i> tools can visualize node torus positions. Clicking on a particular
node shows its <i>NodeAddr</i> field, which is its (X,Y,Z) torus coordinate base-36 encoded as a 3-character
string. For example, a NodeAddr of '07A' corresponds to the coordinates X = 0, Y = 7, Z = 10.
The <i>NodeAddr</i> of a node can also be shown using 'scontrol show node nid#####'.</p>
<p>Please note that the sbatch/salloc options "<i>--geometry</i>" and "<i>--no-rotate</i>" are BlueGene-specific
and have no impact on Cray systems. Topological node placement depends on what Cray makes available via the
ALPS_NIDORDER configuration option (see below).</p>
<h3>Specifying thread depth</h3>
<p>For threaded applications, use the <i>--cpus-per-task</i>/<i>-c</i> parameter of sbatch/salloc to set
the thread depth per node. This corresponds to mppdepth in PBS and to the aprun -d parameter. Please
note that SLURM does not set the OMP_NUM_THREADS environment variable. Hence, if an application spawns
4 threads, an example script would look like</p>
<pre>
#SBATCH --comment="illustrate the use of thread depth and OMP_NUM_THREADS"
#SBATCH --ntasks=3
#SBATCH -c 4
export OMP_NUM_THREADS=4
aprun -n 3 -d $OMP_NUM_THREADS ./my_exe
</pre>
<h3>Specifying number of tasks per node</h3>
<p>SLURM uses the same default as ALPS, assigning each task to a single core/CPU. In order to
make more resources available per task, you can reduce the number of processing elements
per node (<i>aprun -N</i> parameter, <i>mppnppn</i> in PBS) with the
<i>--ntasks-per-node</i> option of <i>sbatch/salloc</i>.
This is particularly necessary when tasks require more memory than the per-CPU default.</p>
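<p>For example (an illustrative sketch assuming 24-core nodes; adjust the task
counts to your system), the following places only two tasks on each node so that
each task has more cores and memory available:</p>
<pre>
#SBATCH --comment="2 tasks per 24-core node, more resources per task"
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
aprun -n 8 -N 2 ./my_exe
</pre>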
<h3>Specifying per-task memory</h3>
<p>In Cray terminology, a task is also called a "processing element" (PE), hence below we
refer to the per-task memory and "per-PE" memory interchangeably. The per-PE memory
requested through the batch system corresponds to the <i>aprun -m</i> parameter.</p>
<p>Due to the implicit default assumption that 1 task runs per core/CPU, the default memory
available per task is the <i>per-CPU share</i> of node_memory / number_of_cores. For
example, on an XT5 system with 16000MB per 12-core node, the per-CPU share is 1333MB.</p>
<p>If nothing else is specified, the <i>--mem</i> option to sbatch/salloc can only be used to
<i>reduce</i> the per-PE memory below the per-CPU share. This is also the only way that
the <i>--mem-per-cpu</i> option can be applied (note that the <i>--mem-per-cpu</i> option
is ignored if the user does not also set --ntasks/-n).
Thus, the preferred way of specifying memory is the more general <i>--mem</i> option.</p>
<p>To <i>increase</i> the per-PE memory settable via the <i>--mem</i> option, you must make
more per-task resources available using the <i>--ntasks-per-node</i> option to sbatch/salloc.
This allows <i>--mem</i> to request up to node_memory / ntasks_per_node megabytes.</p>
<p>When <i>--ntasks-per-node</i> is 1, the entire node memory may be requested by the application.
Setting <i>--ntasks-per-node</i> to the number of cores per node yields the default per-CPU share
minimum value.</p>
<p>For all cases in between these extremes, set --mem=per_task_memory and</p>
<pre>
--ntasks-per-node=floor(node_memory / per_task_memory)
</pre>
<p>whenever per_task_memory needs to be larger than the per-CPU share.</p>
<p><b>Example:</b> An application with 64 tasks needs 7500MB per task on a cluster with 32000MB and 24 cores
per node. Hence ntasks_per_node = floor(32000/7500) = 4.</p>
<pre>
#SBATCH --comment="requesting 7500MB per task on 32000MB/24-core nodes"
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=4
#SBATCH --mem=7500
</pre>
<p>If you would like to fine-tune the memory limit of your application, you can set the same parameters in
a salloc session and then check directly, using</p>
<pre>
apstat -rvv -R $BASIL_RESERVATION_ID
</pre>
<p>to see how much memory has been requested.</p>
<h3>Using aprun -B</h3>
<p>CLE 3.x allows a nice <i>aprun</i> shortcut via the <i>-B</i> option, which
reuses all the batch system parameters (<i>--ntasks, --ntasks-per-node,
--cpus-per-task, --mem</i>) at application launch, as if the corresponding
(<i>-n, -N, -d, -m</i>) parameters had been set; see the aprun(1) manpage
on CLE 3.x systems for details.</p>
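<p>As an illustrative sketch (reusing the memory example above), the following
launch omits the explicit <i>aprun</i> parameters because <i>-B</i> re-reads
them from the batch allocation:</p>
<pre>
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=4
#SBATCH --mem=7500
aprun -B ./my_exe
# on CLE 3.x, equivalent to: aprun -n 64 -N 4 -m 7500 ./my_exe
</pre>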
<h3>Node ordering options</h3>
<p>SLURM honours the node ordering policy set for Cray's Application Level Placement Scheduler (ALPS). Node
ordering is a configurable system option (ALPS_NIDORDER in /etc/sysconfig/alps). The current
setting is reported by '<i>apstat -svv</i>' (look for the line starting with "nid ordering option") and
cannot be changed at runtime. The resulting effective node ordering is revealed by '<i>apstat -no</i>'
(if no special node ordering has been configured, 'apstat -no' shows the
same order as '<i>apstat -n</i>').</p>
<p>SLURM uses exactly the same order as '<i>apstat -no</i>' when selecting
nodes for a job. With the <i>--contiguous</i> option to <i>sbatch/salloc</i>
you can request a contiguous (relative to the current ALPS nid ordering) set
of nodes. Note that on a busy system there is typically more fragmentation,
hence it may take longer (or even prove impossible) to allocate contiguous
sets of a larger size.</p>
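<p>For example, to inspect the configured ordering and the resulting effective
node order on your system:</p>
<pre>
apstat -svv | grep "nid ordering"
apstat -no
</pre>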
<p>Cray/ALPS node ordering is a topic of ongoing work; some information can be found in the CUG-2010 paper
"<i>ALPS, Topology, and Performance</i>" by Carl Albing and Mark Baker.</p>
<h2>Administrator Guide</h2>
<h3>Install supporting RPMs</h3>
<p>The build requires a few -devel RPMs listed below. You can obtain these from
SuSE/Novell.
<ul>
<li>CLE 2.x uses SuSE SLES 10 packages (RPMs may be on the normal ISOs)</li>
<li>CLE 3.x uses SuSE SLES 11 packages (RPMs are on the SDK ISOs; there
are two SDK ISO files)</li>
</ul></p>
<p>You can check which packages are already installed by logging onto the boot node and running</p>
<pre>
boot: # xtopview
default: # rpm -qa
</pre>
<p>The list of packages that should be installed is:</p>
<ul>
<li>expat-2.0.xxx</li>
<li>libexpat-devel-2.0.xxx</li>
<li>cray-MySQL-devel-enterprise-5.0.64 (this should be on the Cray iso)</li>
</ul>
<p>For example, loading MySQL can be done like this:</p>
<pre>
smw: # mkdir mnt
smw: # mount -o loop,ro xe-sles11sp1-trunk.201107070231a03.iso mnt
smw: # find mnt -name cray-MySQL-devel-enterprise\*
mnt/craydist/xt-packages/cray-MySQL-devel-enterprise-5.0.64.1.0000.2899.19.2.x86_64.rpm
smw: # scp mnt/craydist/xt-packages/cray-MySQL-devel-enterprise-5.0.64.1.0000.2899.19.2.x86_64.rpm root@boot:/rr/current/software
</pre>
<p>Then switch to the boot node and run:</p>
<pre>
boot: # xtopview
default: # rpm -ivh /software/cray-MySQL-devel-enterprise-5.0.64.1.0000.2899.19.2.x86_64.rpm
default: # exit
</pre>
<p>Note that all Cray-specific PrgEnv and compiler modules should be removed, and that root
privileges are required to install these files.</p>
<h3>Create a build root</h3>
<p>The build is done on a normal service node, in a directory of your choice
(e.g. <i>/ufs/slurm/build</i> works fine).
Most scripts check for the environment variable LIBROOT.
You can either edit the scripts or export this variable. The easiest way:</p>
<pre>
login: # export LIBROOT=/ufs/slurm/build
login: # mkdir -vp $LIBROOT
login: # cd $LIBROOT
</pre>
<h3>Install SLURM modulefile</h3>
<p>This file is distributed as part of the SLURM tar-ball in
<i>contribs/cray/opt_modulefiles_slurm</i>. Install it as
<i>/opt/modulefiles/slurm</i> (or anywhere else in your module path).
Loading this modulefile makes Munge usable as soon as it is built.</p>
<pre>
login: # scp ~/slurm/contribs/cray/opt_modulefiles_slurm root@boot:/rr/current/software/
</pre>
<h3>Build and install Munge</h3>
<p>Note the Munge installation process on Cray systems differs
somewhat from that described in the
<a href="http://code.google.com/p/munge/wiki/InstallationGuide">
MUNGE Installation Guide</a>.</p>
<p>Munge is the authentication daemon needed by SLURM. Download
munge-0.5.10.tar.bz2 or newer from
<a href="http://code.google.com/p/munge/downloads/list">
http://code.google.com/p/munge/downloads/list</a>. The following shows how
to build and install it on a login node.</p>
<pre>
login: # cd $LIBROOT
login: # cp ~/slurm/contribs/cray/munge_build_script.sh $LIBROOT
login: # mkdir -p ${LIBROOT}/munge/zip
login: # curl -O http://munge.googlecode.com/files/munge-0.5.10.tar.bz2
login: # cp munge-0.5.10.tar.bz2 ${LIBROOT}/munge/zip
login: # chmod u+x ${LIBROOT}/munge_build_script.sh
login: # ${LIBROOT}/munge_build_script.sh
(generates lots of output and creates a tar-ball called
$LIBROOT/munge_build-YYYY-MM-DD.tar.gz)
login: # scp munge_build-2011-07-12.tar.gz root@boot:/rr/current/software
</pre>
<p>Install the tar-ball on the boot node and create an encryption
key file by executing:</p>
<pre>
boot: # xtopview
default: # tar -zxvf /software/munge_build-*.tar.gz -C /
default: # dd if=/dev/urandom bs=1 count=1024 >/opt/slurm/munge/etc/munge.key
default: # chmod go-rxw /opt/slurm/munge/etc/munge.key
default: # exit
</pre>
<h3>Configure Munge</h3>
<p>The following steps apply to each login node and the sdb, where
<ul>
<li>The <i>slurmd</i> or <i>slurmctld</i> daemon will run and/or</li>
<li>Users will be submitting jobs</li>
</ul></p>
<pre>
login: # mkdir --mode=0711 -vp /var/lib/munge
login: # mkdir --mode=0700 -vp /var/log/munge
login: # mkdir --mode=0755 -vp /var/run/munge
login: # module load slurm
</pre>
<pre>
sdb: # mkdir --mode=0711 -vp /var/lib/munge
sdb: # mkdir --mode=0700 -vp /var/log/munge
sdb: # mkdir --mode=0755 -vp /var/run/munge
</pre>
<p>Start the munge daemon and test it.</p>
<pre>
login: # munged --key-file /opt/slurm/munge/etc/munge.key
login: # munge -n
MUNGE:AwQDAAAEy341MRViY+LacxYlz+mchKk5NUAGrYLqKRUvYkrR+MJzHTgzSm1JALqJcunWGDU6k3vpveoDFLD7fLctee5+OoQ4dCeqyK8slfAFvF9DT5pccPg=:
</pre>
<p>When done, verify network connectivity by executing:
<ul>
<li><i>munge -n | ssh other-login-host /opt/slurm/munge/bin/unmunge</i></li>
</ul></p>
<p>If you decide to keep the installation, you may be interested in automating
the process using an <i>init.d</i> script distributed with Munge. This
should be installed on all nodes running Munge, using e.g. 'xtopview -c login' and
'xtopview -n sdbNodeID':
</p>
<pre>
boot: # xtopview -c login
login: # cp /software/etc_init_d_munge /etc/init.d/munge
login: # chmod u+x /etc/init.d/munge
login: # chkconfig munge on
login: # exit
boot: # xtopview -n 31
node/31: # cp /software/etc_init_d_munge /etc/init.d/munge
node/31: # chmod u+x /etc/init.d/munge
node/31: # chkconfig munge on
node/31: # exit
</pre>
<h3>Enable the Cray job service</h3>
<p>This is a common dependency on Cray systems. ALPS relies on the Cray job service to
generate cluster-unique job container IDs (PAGG IDs). These identifiers are used by
ALPS to track running (aprun) job steps. The default (session IDs) is not unique
across multiple login nodes. This standard procedure is described in chapter 9 of
<a href="http://docs.cray.com/books/S-2393-30/">S-2393</a> and takes only two
steps, both to be done on all 'login' class nodes (xtopview -c login):</p>
<ul>
<li>make sure that the /etc/init.d/job service is enabled (chkconfig) and started</li>
<li>enable the pam_job.so module from /opt/cray/job/default in /etc/pam.d/common-session<br/>
(NB: the default pam_job.so is very verbose; a simpler and quieter variant is provided
in contribs/cray.)</li>
</ul>
<p>The latter step is required only if you would like to run interactive
<i>salloc</i> sessions.</p>
<pre>
boot: # xtopview -c login
login: # chkconfig job on
login: # emacs -nw /etc/pam.d/common-session
(uncomment the pam_job.so line)
session optional /opt/cray/job/default/lib64/security/pam_job.so
login: # exit
boot: # xtopview -n 31
node/31:# chkconfig job on
node/31:# emacs -nw /etc/pam.d/common-session
(uncomment the pam_job.so line as shown above)
</pre>
<h3>Build and Configure SLURM</h3>
<p>SLURM can be built and installed as on any other computer, as described in the
<a href="quickstart_admin.html">Quick Start Administrator Guide</a>.
An example of building and installing SLURM version 2.3.0 is shown below.</p>
<p><b>NOTE:</b> By default neither the <i>salloc</i> command nor the <i>srun</i>
command wrapper can be executed as a background process. This is done for two
reasons:</p>
<ol>
<li>Only one ALPS reservation can be created from each session ID. The
<i>salloc</i> command cannot change its session ID without disconnecting
itself from the terminal and its parent process, meaning the process could not
later be put into the foreground or easily identified.</li>
<li>To better identify every process spawned under the <i>salloc</i> process
using terminal foreground process group IDs.</li>
</ol>
<p>You can optionally enable <i>salloc</i> and <i>srun</i> to execute as
background processes by using the configure option
<i>"--enable-salloc-background"</i>; however, doing so will result in failed
resource allocations
(<i>error: Failed to allocate resources: Requested reservation is in use</i>)
if they are not executed sequentially, and will
increase the likelihood of orphaned processes.</p>
<!-- Example:
Modify srun script or ask user to execute "/usr/bin/setsid"
before salloc or srun command -->
<!-- Example:
salloc spawns zsh, zsh spawns bash, etc.
when salloc terminates, bash becomes a child of init -->
<pre>
login: # mkdir build && cd build
login: # slurm/configure \
--prefix=/opt/slurm/2.3.0 \
--with-munge=/opt/slurm/munge/ \
--with-mysql_config=/opt/cray/MySQL/5.0.64-1.0000.2899.20.2.gem/bin \
--with-srun2aprun
login: # make -j
login: # mkdir install
login: # make DESTDIR=/tmp/slurm/build/install install
login: # make DESTDIR=/tmp/slurm/build/install install-contrib
login: # cd install
login: # tar czf slurm_opt.tar.gz opt
login: # scp slurm_opt.tar.gz boot:/rr/current/software
</pre>
<pre>
boot: # xtopview
default: # tar xzf /software/slurm_opt.tar.gz -C /
default: # cd /opt/slurm/
default: # ln -s 2.3.0 default
</pre>
<p>When building SLURM's <i>slurm.conf</i> configuration file, use the
<i>NodeName</i> parameter to specify all batch nodes to be scheduled.
If nodes are defined in ALPS, but not defined in the <i>slurm.conf</i> file, a
complete list of all batch nodes configured in ALPS will be logged by
the <i>slurmctld</i> daemon when it starts.
One would typically use this information to modify the <i>slurm.conf</i> file
and restart the <i>slurmctld</i> daemon.
Note that the <i>NodeAddr</i> and <i>NodeHostName</i> fields should not be
configured, but will be set by SLURM using data from ALPS.
<i>NodeAddr</i> will be set to the node's XYZ coordinate and will be used by SLURM's
<i>smap</i> and <i>sview</i> commands.
<i>NodeHostName</i> will be set to the node's component label.
The format of the component label is "c#-#c#s#n#" where the "#" fields
represent in order: cabinet, row, cage, blade or slot, and node.
For example, "c0-1c2s5n3" is cabinet 0, row 1, cage 2, slot 5 and node 3.</p>
<p>The <i>slurmd</i> daemons will not execute on the compute nodes, but will
execute on one or more front end nodes.
It is from here that batch scripts will execute <i>aprun</i> commands to
launch tasks.
This is specified in the <i>slurm.conf</i> file by using the
<i>FrontendName</i> and optionally the <i>FrontendAddr</i> fields
as seen in the examples below.</p>
<p>Note that SLURM will by default kill running jobs when a node goes DOWN,
while a DOWN node in ALPS only prevents new jobs from being scheduled on the
node. To help avoid confusion, we recommend that <i>SlurmdTimeout</i> in the
<i>slurm.conf</i> file be set to the same value as the <i>suspectend</i>
parameter in ALPS' <i>nodehealth.conf</i> file.</p>
<p>You need to specify the appropriate resource selection plugin (the
<i>SelectType</i> option in SLURM's <i>slurm.conf</i> configuration file).
Configure <i>SelectType</i> to <i>select/cray</i>. The <i>select/cray</i>
plugin provides an interface to ALPS and issues calls to the
<i>select/linear</i> plugin, which selects resources for jobs using a best-fit
algorithm to allocate whole nodes to jobs (rather than individual sockets,
cores or threads).</p>
<p>Note that the system topology is based upon information gathered from
the ALPS database, which reflects the ALPS_NIDORDER configuration in
<i>/etc/sysconfig/alps</i>. Excerpts of a <i>slurm.conf</i> file for
use on Cray systems follow:</p>
<pre>
#---------------------------------------------------------------------
# SLURM USER
#---------------------------------------------------------------------
# SLURM user on cray systems must be root
# This requirement derives from Cray ALPS:
# - ALPS reservations can only be created by the job owner or root
# (confirmation may be done by other non-privileged users)
# - Freeing a reservation always requires root privileges
SlurmUser=root
#---------------------------------------------------------------------
# PLUGINS
#---------------------------------------------------------------------
# Network topology (handled internally by ALPS)
TopologyPlugin=topology/none
# Scheduling
SchedulerType=sched/backfill
# Node selection: use the special-purpose "select/cray" plugin.
# Internally this uses select/linear, i.e. nodes are always allocated
# in units of whole nodes (other allocation is currently not possible, since
# ALPS does not yet allow more than one executable to run on the same
# node, see aprun(1), section LIMITATIONS).
#
# Add CR_memory as parameter to support --mem/--mem-per-cpu.
SelectType=select/cray
SelectTypeParameters=CR_Memory
# Proctrack plugin: only/default option is proctrack/sgi_job
# ALPS requires cluster-unique job container IDs and thus the /etc/init.d/job
# service needs to be started on all slurmd and login nodes, as described in
# S-2393, chapter 9. Due to this requirement, ProctrackType=proctrack/sgi_job
# is the default on Cray and need not be specified explicitly.
#---------------------------------------------------------------------
# PATHS
#---------------------------------------------------------------------
SlurmdSpoolDir=/ufs/slurm/spool
StateSaveLocation=/ufs/slurm/spool/state
# main logfile
SlurmctldLogFile=/ufs/slurm/log/slurmctld.log
# slurmd logfiles (using %h for hostname)
SlurmdLogFile=/ufs/slurm/log/%h.log
# PIDs
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#---------------------------------------------------------------------
# COMPUTE NODES
#---------------------------------------------------------------------
# Return DOWN nodes to service when e.g. slurmd has been unresponsive
ReturnToService=1
# Configure the suspectend parameter in ALPS' nodehealth.conf file to the same
# value as SlurmdTimeout for consistent behavior (e.g. "suspectend: 600")
SlurmdTimeout=600
# Controls how a node's configuration specifications in slurm.conf are
# used.
# 0 - use hardware configuration (must agree with slurm.conf)
# 1 - use slurm.conf, nodes with fewer resources are marked DOWN
# 2 - use slurm.conf, but do not mark nodes down as in (1)
FastSchedule=2
# Per-node configuration for PALU AMD G34 dual-socket "Magny Cours"
# Compute Nodes. We deviate from slurm's idea of a physical socket
# here, since the Magny Cours hosts two NUMA nodes each, which is
# also visible in the ALPS inventory (4 Segments per node, each
# containing 6 'Processors'/Cores).
NodeName=DEFAULT Sockets=4 CoresPerSocket=6 ThreadsPerCore=1
NodeName=DEFAULT RealMemory=32000 State=UNKNOWN
# List the nodes of the compute partition below (service nodes are not
# allowed to appear)
NodeName=nid00[002-013,018-159,162-173,178-189]
# Frontend nodes: these should not be available to user logins, but
# have all filesystems mounted that are also
# available on a login node (/scratch, /home, ...).
FrontendName=palu[7-9]
#---------------------------------------------------------------------
# ENFORCING LIMITS
#---------------------------------------------------------------------
# Enforce the use of associations: {associations, limits, wckeys}
AccountingStorageEnforce=limits
# Do not propagate any resource limits from the user's environment to
# the slurmd
PropagateResourceLimits=NONE
#---------------------------------------------------------------------
# Resource limits for memory allocation:
# * the Def/Max 'PerCPU' and 'PerNode' variants are mutually exclusive;
# * use the 'PerNode' variant for both default and maximum value, since
# - slurm will automatically adjust this value depending on
# --ntasks-per-node
# - if using a higher per-cpu value than possible, salloc will just
# block.
#--------------------------------------------------------------------
# XXX replace both values below with your values from 'xtprocadmin -A'
DefMemPerNode=32000
MaxMemPerNode=32000
#---------------------------------------------------------------------
# PARTITIONS
#---------------------------------------------------------------------
# defaults common to all partitions
PartitionName=DEFAULT Nodes=nid00[002-013,018-159,162-173,178-189]
PartitionName=DEFAULT MaxNodes=178
PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60
# "User Support" partition with a higher priority
PartitionName=usup Hidden=YES Priority=10 MaxTime=720 AllowGroups=staff
# normal partition available to all users
PartitionName=day Default=YES Priority=1 MaxTime=01:00:00
</pre>
<p>SLURM supports an optional <i>cray.conf</i> file containing Cray-specific
configuration parameters. <b>This file is NOT needed for production systems</b>,
but is provided for advanced configurations. If used, <i>cray.conf</i> must be
located in the same directory as the <i>slurm.conf</i> file. Configuration
parameters supported by <i>cray.conf</i> are listed below.</p>
<p><dl>
<dt><b>apbasil</b></dt>
<dd>Fully qualified pathname to the apbasil command.
The default value is <i>/usr/bin/apbasil</i>.</dd>
<dt><b>apkill</b></dt>
<dd>Fully qualified pathname to the apkill command.
The default value is <i>/usr/bin/apkill</i>.</dd>
<dt><b>SDBdb</b></dt>
<dd>Name of the ALPS database.
The default value is <i>XTAdmin</i>.</dd>
<dt><b>SDBhost</b></dt>
<dd>Hostname of the database server.
The default value is based upon the contents of the 'my.cnf' file used to
store default database access information and that defaults to 'sdb'.</dd>
<dt><b>SDBpass</b></dt>
<dd>Password used to access the ALPS database.
The default value is based upon the contents of the 'my.cnf' file used to
store default database access information and that defaults to user 'basic'.</dd>
<dt><b>SDBport</b></dt>
<dd>Port used to access the ALPS database.
The default value is 0.</dd>
<dt><b>SDBuser</b></dt>
<dd>Name of user used to access the ALPS database.
The default value is based upon the contents of the 'my.cnf' file used to
store default database access information and that defaults to user 'basic'.</dd>
</dl></p>
<pre>
# Example cray.conf file
apbasil=/opt/alps_simulator_40_r6768/apbasil.sh
SDBhost=localhost
SDBuser=alps_user
SDBdb=XT5istanbul
</pre>
<p>One additional configuration script can be used to ensure that the slurmd
daemons execute with the highest resource limits possible, overriding default
limits on SuSE systems. Depending upon what resource limits are propagated
from the user's environment, lower limits may apply to user jobs, but this
script will ensure that higher limits are possible. Copy the file
<i>contribs/cray/etc_sysconfig_slurm</i> into <i>/etc/sysconfig/slurm</i>
for these limits to take effect. This script is executed from
<i>/etc/init.d/slurm</i>, which is typically executed to start the SLURM
daemons. An excerpt of <i>contribs/cray/etc_sysconfig_slurm</i> is shown
below.</p>
<pre>
#
# /etc/sysconfig/slurm for Cray XT/XE systems
#
# Cray is SuSe-based, which means that ulimits from
# /etc/security/limits.conf will get picked up any time SLURM is
# restarted e.g. via pdsh/ssh. Since SLURM respects configured limits,
# this can mean that for instance batch jobs get killed as a result
# of configuring CPU time limits. Set sane start limits here.
#
# Values were taken from pam-1.1.2 Debian package
ulimit -t unlimited # max amount of CPU time in seconds
ulimit -d unlimited # max size of a process's data segment in KB
</pre>
<p>SLURM's <i>init.d</i> script should also be installed to automatically
start SLURM daemons when nodes boot as shown below. Be sure to edit the script
as appropriate to reference the proper file location (modify the variable
<i>PREFIX</i>).</p>
<pre>
login: # scp /home/crayadm/ben/slurm/etc/init.d.slurm boot:/rr/current/software/
</pre>
<p>SLURM will ignore any interactive jobs and any nodes set to interactive mode,
so set all of your nodes to batch mode from any service node. Dropping the
-n option will set all nodes to batch mode.</p>
<pre>
# xtprocadmin -k m batch -n NODEIDS
</pre>
<p>Now create the needed directories for logs and state files then start the
daemons on the sdb and login nodes as shown below.</p>
<pre>
sdb: # mkdir -p /ufs/slurm/log
sdb: # mkdir -p /ufs/slurm/spool
sdb: # /etc/init.d/slurm start
</pre>
<pre>
login: # /etc/init.d/slurm start
</pre>
<h3>Srun wrapper configuration</h3>
<p>The <i>srun</i> wrapper to <i>aprun</i> might require modification to run
as desired. Specifically, the <i>$aprun</i> variable can be set to the
absolute pathname of that executable file. Without that modification, the
<i>aprun</i> command executed will depend upon the user's search path.</p>
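<p>For example, a minimal sketch of such a modification (the exact line in the
wrapper script may differ):</p>
<pre>
# in the srun wrapper script: use an absolute path instead of relying on $PATH
my $aprun = "/usr/bin/aprun";
</pre>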
<p>In order to debug the <i>srun</i> wrapper, uncomment the line</p>
<pre>
print "comment=$command\n"
</pre>
<p>If the <i>srun</i> wrapper is executed from
within an existing SLURM job allocation (i.e. within <i>salloc</i> or an
<i>sbatch</i> script), then it just executes the <i>aprun</i> command with
appropriate options. If executed without an allocation, the wrapper executes
<i>salloc</i>, which then executes the <i>srun</i> wrapper again. This second
execution of the <i>srun</i> wrapper is required in order to process environment
variables that are set by the <i>salloc</i> command based upon the resource
allocation.</p>
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 13 March 2012</p>
<!--#include virtual="footer.txt"-->